• Preventing robot indexing attacks

    From Ivan Shmakov@21:1/5 to All on Sat Jul 15 16:24:44 2017
    XPost: alt.html

    jdallen2000@yahoo.com writes:

    [Cross-posting to news:comp.infosystems.www.misc as I feel that
    this question has more to do with Web than HTML per se.]

    I have a website organized as a large number (> 200,000) of pages.
    It is hosted by a large Internet hosting company.

    Many websites provide much more information than mine by computing
    info on-the-fly with server scripts, but I have, in effect, all the
    query results pre-computed. I waste a few gigabytes on the data,
    but that's almost nothing these days, and I don't waste the server's
    time on scripts.

    My users may click to 10 or 20 pages in a session. But the indexing
    bots want to read all 200,000+ pages! My host has now complained
    that the site is under "bot attack" and has asked me to check my own
    laptop for viruses!

    I'm happy anyway to reduce the bot activity. I don't mind having my
    site indexed, but once or twice a year would be enough!

    I see that there is a way to stop the Google Bot specifically. I'd
    love it if I could do the opposite -- have *only* Google index my
    site.

    JFTR, I personally (as well as many other users who value their
    privacy) refrain from using Google Search and rely on, say,
    https://duckduckgo.com/ instead.

    A technician at the hosting company wrote to me

    As per the above logs and hitting IP addresses, we have blocked the
    46.229.168.* IP range to prevent the further abuse and advice you to
    also check incoming traffic and block such IP's in future.

    We have also blocked the bots by adding the following entry
    in robots.txt:-

    User-agent: AhrefsBot
    Disallow: /
    User-agent: MJ12bot
    Disallow: /
    User-agent: SemrushBot
    Disallow: /
    User-agent: YandexBot
    Disallow: /
    User-agent: Linguee Bot
    Disallow: /

    I believe that solutions like the above will only lead to
    your site fading into obscurity for the majority of Web users --
    by way of being removed from Web search results.

    As long as the troublesome bots honor robots.txt (there're those
    that do not; but then, the above won't work on them, either),
    a saner solution would be to limit the /rate/ at which the bots
    request your pages for indexing, like:

    ### robots.txt

    ### Data:

    ## Request that the bots wait at least 3 seconds between requests.
    User-agent: *
    Crawl-delay: 3

    ### robots.txt ends here

    This way, the bots will still scan all your 2e5 pages, but their
    accesses will be spread over about a week -- which (I hope)
    will be well within "acceptable use limits" of your hosting
    company.
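
    (The arithmetic: 2e5 pages at one request every 3 seconds is about
    6e5 seconds, or a little under 7 days.)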

    [...]

    --
    FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A

  • From Eli the Bearded@21:1/5 to ivan@siamics.net on Sat Jul 15 21:04:08 2017
    XPost: alt.html

    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
    jdallen2000@yahoo.com writes:
    [Cross-posting to news:comp.infosystems.www.misc as I feel that
    this question has more to do with Web than HTML per se.]

    :^)

    I have a website organized as a large number (> 200,000) of pages.
    It is hosted by a large Internet hosting company.
    ...
    My users may click to 10 or 20 pages in a session. But the indexing
    bots want to read all 200,000+ pages! My host has now complained
    that the site is under "bot attack" and has asked me to check my own
    laptop for viruses!

    200k pages isn't that huge, and if they are static files on disk, as
    described in a snipped-out part, they shouldn't be that hard to serve.
    Bandwidth may be an issue, depending on how you are being charged. And
    on a shared system, which I think you might have, your options for
    optimizing for massive amounts of static files might be limited.

    I'm happy anyway to reduce the bot activity. I don't mind having my
    site indexed, but once or twice a year would be enough!

    Some of the better search engines will gladly consult site map files
    that give hints about what needs reindexing. See:

    https://www.sitemaps.org/protocol.html
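
    As a rough illustration (the file name is the standard one, but the
    URLs and dates below are made up), a minimal sitemap for a mostly
    static site could look like this; the <lastmod> values are what let
    a crawler skip pages that haven't changed since its last visit:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- hypothetical /sitemap.xml; URLs and dates are placeholders -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/results/page-000001.html</loc>
        <lastmod>2017-01-10</lastmod>
        <changefreq>yearly</changefreq>
      </url>
      <url>
        <loc>http://www.example.com/results/page-000002.html</loc>
        <lastmod>2016-11-02</lastmod>
        <changefreq>yearly</changefreq>
      </url>
      <!-- ...and so on; a single sitemap is limited to 50,000 URLs, so
           a 200k-page site would use several, tied together by a
           sitemap index file as described on the page above. -->
    </urlset>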

    I see that there is a way to stop the Google Bot specifically. I'd
    love it if I could do the opposite -- have *only* Google index my
    site.
    JFTR, I personally (as well as many other users who value their
    privacy) refrain from using Google Search and rely on, say,
    https://duckduckgo.com/ instead.

    Yeah, Google only is an "all your eggs in one basket" route. I, too,
    have been using DDG almost exclusively for several years.

    A technician at the hosting company wrote to me
    As per the above logs and hitting IP addresses, we have blocked the
    46.229.168.* IP range to prevent the further abuse and advice you to
    also check incoming traffic and block such IP's in future.

    46.229.168.0-46.229.168.255 is:

    netname: ADVANCEDHOSTERS-NET

    Can't say I've heard of them.

    We have also blocked the bots by adding the following entry
    in robots.txt:-
    User-agent: AhrefsBot

    Yes, block them. Not a search engine, but a commercial SEO service.
    https://ahrefs.com/robot

    User-agent: MJ12bot

    Eh, maybe block, maybe not. Seems to be a real search engine.
    http://mj12bot.com/

    User-agent: SemrushBot

    Yes, block them. Not a search engine, but a commercial SEO service.
    https://www.semrush.com/bot/

    User-agent: YandexBot

    Real Russian search engine.
    https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml

    User-agent: Linguee Bot

    Real service, but of dubious value to a webmaster.
    http://www.botreports.com/user-agent/linguee-bot.shtml

    All bots can be impersonated by other bots, so you can't be sure the
    User-Agent: reflects the real identity of the bot. You can spend a lot
    of time researching bots and the characteristics of real bot usage,
    e.g. the hostnames or IP address ranges of legitimate bot servers.
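
    One common check (it is roughly what the Yandex page above describes)
    is a reverse-then-forward DNS lookup: resolve the connecting address
    to a hostname, see whether that hostname belongs to the crawler's
    stated operator, then resolve the hostname back and confirm it maps
    to the same address. A sketch from a shell (the address is a
    placeholder to be filled in from the access log, and the expected
    hostnames are only what a genuine crawler would be likely to show):

    $ host 46.229.168.NN
    # a genuine search-engine crawler usually reverse-resolves to a name
    # under its operator's own domain, e.g. *.googlebot.com, *.yandex.com
    $ host name-returned-by-the-first-lookup
    # the forward lookup should map back to the same address; if the two
    # lookups don't agree, the User-Agent string can't be trusted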

    Given the little I've seen here, I wonder if you have someone at
    Advanced Hosters impersonating bots to suck your site down.

    As long as the troublesome bots honor robots.txt (there're those
    that do not; but then, the above won't work on them, either),
    a more sane solution would be to limit the /rate/ the bots
    request your pages for indexing, like:

    ### robots.txt

    ### Data:

    ## Request that the bots wait at least 3 seconds between requests.
    User-agent: *
    Crawl-delay: 3

    ### robots.txt ends here

    Except for Linguee, I think all of the bots listed above are
    well-behaved and will obey robots.txt, but I don't know if they are
    all advanced enough to know Crawl-delay. Some of them explicitly
    state they do, however.
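
    Putting the two pieces of advice together, a robots.txt that drops
    the SEO-only crawlers outright and merely slows everyone else down
    could look like the sketch below (whether a given bot honors
    Crawl-delay is, as noted, up to the bot):

    ### robots.txt -- block the SEO-only crawlers, rate-limit the rest

    User-agent: AhrefsBot
    Disallow: /

    User-agent: SemrushBot
    Disallow: /

    ## Everyone else may index, but is asked to wait at least
    ## 3 seconds between requests.
    User-agent: *
    Crawl-delay: 3

    ### robots.txt ends here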

    This way, the bots will still scan all your 2e5 pages, but their
    accesses will be spread over about a week -- which (I hope)
    will be well within "acceptable use limits" of your hosting
    company.

    The only bot I've ever had to blacklist was an MSN bot that absolutely
    refused to stop hitting one page over and over again a few years ago.
    I used a server directive to shunt that one bot to 403 Forbidden errors.
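
    For anyone curious what such a directive can look like: with
    Apache 2.4 (an assumption on my part, as are the "msnbot" substring
    and the path below), tagging the offending bot's requests with an
    environment variable and denying them turns its hits into 403s:

    # Tag requests whose User-Agent mentions the misbehaving bot.
    BrowserMatchNoCase "msnbot" bad_bot

    # Refuse tagged requests for the page it kept hammering.
    <Location "/the-page-it-kept-hitting">
        <RequireAll>
            Require all granted
            Require not env bad_bot
        </RequireAll>
    </Location>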

    Elijah
    ------
    stopped worrying about bots a long time ago

  • From Ivan Shmakov@21:1/5 to All on Sun Jul 16 08:00:56 2017
    XPost: alt.html

    Eli the Bearded <*@eli.users.panix.com> writes:
    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
    jdallen2000@yahoo.com writes:

    [...]

    I'm happy anyway to reduce the bot activity. I don't mind having
    my site indexed, but once or twice a year would be enough!

    Some of the better search engines will gladly consult site map files
    that give hints about what needs reindexing. See:

    https://www.sitemaps.org/protocol.html

    ... Learning sitemaps has been on my to-do list for a while now...

    [...]

    A technician at the hosting company wrote to me

    As per the above logs and hitting IP addresses, we have blocked
    the 46.229.168.* IP range to prevent the further abuse and advice
    you to also check incoming traffic and block such IP's in future.

    46.229.168.0-46.229.168.255 is:

    netname: ADVANCEDHOSTERS-NET

    Can't say I've heard of them.

    Same here.

    [...]

    All bots can be impersonated by other bots, so you can't be sure
    the User-Agent: reflects the real identity of the bot.

    True in general, but of little relevance in the context of
    robots.txt. For one thing, a misbehaving robot may very well
    have one string for User-Agent:, yet look for something entirely
    different in robots.txt (if it even decides to honor the file.)

    You can spend a lot of time researching bots and the characteristics
    of real bot usage, e.g. the hostnames or IP address ranges of
    legitimate bot servers.

    [...]

    Except for Linguee, I think all of the bots listed above are
    well-behaved and will obey robots.txt,

    FWIW, Linguee claim "[they] want [their] crawler to be as polite
    as possible." (http://linguee.com/bot.)

    but I don't know if they are all advanced enough to know Crawl-delay.
    Some of them explicitly state they do, however.

    Not that I've watched closely, but I don't recall stumbling upon
    a robot that would honor robots.txt, yet would issue requests in
    quick succession contrary to Crawl-delay:. That might've been
    because of bot-side rate limits, of course.

    This way, the bots will still scan all your 2e5 pages, but their
    accesses will be spread over about a week -- which (I hope) will be
    well within "acceptable use limits" of your hosting company.

    The only bot I've ever had to blacklist was an MSN bot that absolutely
    refused to stop hitting one page over and over again a few years ago.
    I used a server directive to shunt that one bot to 403 Forbidden
    errors.

    There seem to be a few misbehaving robots that frequent my
    servers; most masquerade as browsers -- and of course never
    consider robots.txt.

    For instance, 37.59.55.128, 109.201.142.109, 188.165.233.228,
    and I recall some such activity from Baidu networks. (Alongside
    their "regular", well-behaved crawler.)

    Elijah ------ stopped worrying about bots a long time ago

    How so?

    --
    FSF associate member #7257 np. Into The Dark -- Radiarc 3013 B6A0 230E 334A

  • From Doc O'Leary@21:1/5 to Ivan Shmakov on Sun Jul 16 14:50:46 2017
    XPost: alt.html

    For your reference, records indicate that
    Ivan Shmakov <ivan@siamics.net> wrote:

    jdallen2000@yahoo.com writes:

    [Cross-posting to news:comp.infosystems.www.misc as I feel that
    this question has more to do with Web than HTML per se.]

    Assuming we’re not missing any info as a result . . .

    My users may click to 10 or 20 pages in a session. But the indexing
    bots want to read all 200,000+ pages! My host has now complained
    that the site is under "bot attack" and has asked me to check my own
    laptop for viruses!

    This doesn’t make much sense. The web host sounds incompetent, so I
    don’t know that we can trust what is being reported by them. Getting
    (legitimately) spidered is not an attack. Any attack you *may* be
    under would not be the result of a virus on your own non-server
    computer. I’d find a different hosting provider.

    A technician at the hosting company wrote to me

    As per the above logs and hitting IP addresses, we have blocked the
    46.229.168.* IP range to prevent the further abuse and advice you to
    also check incoming traffic and block such IP's in future.

    There is nothing about the 46.229.160.0/20 range in question that
    indicates it represents a legitimate bot. Do the logs actually
    indicate vanilla spidering, or something more nefarious like looking
    for PHP/WordPress exploits? I see a lot of traffic like that.

    In such cases, editing robots.txt is unlikely to solve the problem.
    Generally, I don’t even bother configuring the web server to deny a
    serious attacker. I’d just drop their whole range into my firewall,
    because odds are good that a dedicated attacker isn’t going to only
    go after port 80.
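
    For the record, dropping a range at the firewall end is a one-liner.
    With iptables on a Linux box (assuming that is what sits in front of
    the server), the /20 in question would be:

    # Silently drop everything from the whole allocation, not just port 80.
    iptables -I INPUT -s 46.229.160.0/20 -j DROP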

    That might be beyond the scope of what a basic web hosting company
    provides but, really, given that a $15/year VPS can handle most
    traffic for even a 200K page static site with ease, I really can’t
    imagine what the real issue is here. More details needed.

    --
    "Also . . . I can kill you with my brain."
    River Tam, Trash, Firefly

  • From Eli the Bearded@21:1/5 to ivan@siamics.net on Sun Jul 16 20:27:40 2017
    XPost: alt.html

    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
    Elijah ------ stopped worrying about bots a long time ago

    A combination of enough capacity to not care about load and no longer
    using web advertising, removing the necessity to audit logs.

    Elijah
    ------
    understands others have other needs and priorities

  • From Ivan Shmakov@21:1/5 to All on Wed Jul 19 12:12:36 2017
    XPost: alt.html

    Doc O'Leary <droleary@2017usenet1.subsume.com> writes:
    For your reference, records indicate that Ivan Shmakov wrote:
    jdallen2000@yahoo.com writes:

    A technician at the hosting company wrote to me

    As per the above logs and hitting IP addresses, we have blocked
    the 46.229.168.* IP range to prevent the further abuse and advice
    you to also check incoming traffic and block such IP's in future.

    There is nothing about the 46.229.160.0/20 range in question that
    indicates it represents a legitimate bot. Do the logs actually
    indicate vanilla spidering, or something more nefarious like looking
    for PHP/WordPress exploits? I see a lot of traffic like that.

    Same here.

    In such cases, editing robots.txt is unlikely to solve the problem.

    Yes. (Although blocking the range is.)

    Generally, I don’t even bother configuring the web server to deny a
    serious attacker. I’d just drop their whole range into my firewall,
    because odds are good that a dedicated attacker isn’t going to only
    go after port 80.

    Personally, I configured my Web server to redirect requests like
    that to localhost:discard, and let the scanners disconnect at
    their own timeouts. (5-15 s, from the looks of it.) Like:

    <IfModule mod_rewrite.c>
        RewriteCond %{REQUEST_URI} \
            ^/(old|sql(ite)?|wp|XXX|[-/])*(admin|manager|YYY) [nocase]
        RewriteRule .* http://ip6-localhost:9/ [P]
    </IfModule>

    (That is, the "tar pit" approach. Alternatively, one may use
    mod_security, but to me that seemed like overkill.)

    When the scanner in question is a simple single-threaded,
    "sequential" program, that may also reduce its impact on the
    rest of the Web.

    OTOH, when I start receiving spam from somewhere, the respective
    range has a good chance of ending up in my firewall rules.

    (There was one dedicated spammer that I reported repeatedly to
    their hosters, only for them to move to some other service. I'm
    afraid I grew lazy when they moved to "Zomro" networks; e. g.,
    178.159.42.0/25. They seem to be staying there for months now.)

    That might be beyond the scope of what a basic web hosting company
    provides but, really, given that a $15/year VPS can handle most
    traffic for even a 200K page static site with ease, I really can’t
    imagine what the real issue is here. More details needed.

    Now, that's interesting. The VPS services I use (or used) are
    generally $5/month or more.

    Virpus VPSes were pretty cheap back when I used them, but
    somewhat less reliable.

    -- "Also . . . I can kill you with my brain." River Tam, Trash,
    Firefly

    ... Also on my to-watch list. (I certainly liked Serenity.)

    --
    FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A

  • From Doc O'Leary@21:1/5 to Ivan Shmakov on Wed Jul 19 14:55:51 2017
    XPost: alt.html

    For your reference, records indicate that
    Ivan Shmakov <ivan@siamics.net> wrote:

    Doc O'Leary <droleary@2017usenet1.subsume.com> writes:
    That might be beyond the scope of what a basic web hosting company
    provides but, really, given that a $15/year VPS can handle most
    traffic for even a 200K page static site with ease, I really can’t
    imagine what the real issue is here. More details needed.

    Now, that's interesting. The VPS services I use (or used) are
    generally $5/month or more.

    Well, I certainly can and do pay more for a VPS when I need more
    resources, but when it comes to hosting a static site like we’re
    talking about here, you really don’t need much to do it. These days,
    the limiting factor is quickly becoming the cost of an IPv4 address.

    There’s really no reason I can think of that a basic virtual web host
    should be balking over the OP’s site. It’s the kind of thing I’d
    host for friends for free because the overhead would seem like a
    rounding error.

    ... Also in my to-watch list. (I've certainly liked Serenity.)

    Personally, I think the series was *much* better than the movie.

    --
    "Also . . . I can kill you with my brain."
    River Tam, Trash, Firefly
