• Preventing robot indexing attacks

    From Ivan Shmakov@21:1/5 to All on Sat Jul 15 16:24:44 2017
    XPost: alt.html

    jdallen2000@yahoo.com writes:

    [Cross-posting to news:comp.infosystems.www.misc as I feel that
    this question has more to do with Web than HTML per se.]

    I have a website organized as a large number (> 200,000) of pages.
    It is hosted by a large Internet hosting company.

    Many websites provide much more information than mine by computing
    info on-the-fly with server scripts, but I have, in effect, all the
    query results pre-computed. I waste a few gigabytes on the data,
    but that's almost nothing these days, and I don't waste the server's
    time on scripts.

    My users may click to 10 or 20 pages in a session. But the indexing
    bots want to read all 200,000+ pages! My host has now complained
    that the site is under "bot attack" and has asked me to check my own
    laptop for viruses!

    I'm happy anyway to reduce the bot activity. I don't mind having my
    site indexed, but once or twice a year would be enough!

    I see that there is a way to stop the Google Bot specifically. I'd
    love it if I could do the opposite -- have *only* Google index my
    site.

    JFTR, I personally (as well as many other users who value their
    privacy) refrain from using Google Search and rely on, say,
    https://duckduckgo.com/ instead.

    A technician at the hosting company wrote to me

    As per the above logs and hitting IP addresses, we have blocked the
    46.229.168.* IP range to prevent the further abuse and advice you to
    also check incoming traffic and block such IP's in future.

    We have also blocked the bots by adding the following entry
    in robots.txt:-

    User-agent: AhrefsBot
    Disallow: /
    User-agent: MJ12bot
    Disallow: /
    User-agent: SemrushBot
    Disallow: /
    User-agent: YandexBot
    Disallow: /
    User-agent: Linguee Bot
    Disallow: /

    I believe that solutions like the above will only lead to
    your site fading into obscurity for the majority of Web users --
    by way of being removed from Web search results.

    As long as the troublesome bots honor robots.txt (there're those
    that do not; but then, the above won't work on them, either),
    a saner solution would be to limit the /rate/ at which the bots
    request your pages for indexing, like:

    ### robots.txt

    ### Data:

    ## Request that the bots wait at least 3 seconds between requests.
    User-agent: *
    Crawl-delay: 3

    ### robots.txt ends here

    This way, the bots will still scan all your 2e5 pages, but their
    accesses will be spread over about a week -- which (I hope)
    will be well within "acceptable use limits" of your hosting
    company.
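
    (The arithmetic: 2e5 pages at one request every 3 seconds is about
    6e5 seconds, or a little under 7 days.)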

    [...]

    --
    FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A

  • From Eli the Bearded@21:1/5 to ivan@siamics.net on Sat Jul 15 21:04:08 2017
    XPost: alt.html

    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
    jdallen2000@yahoo.com writes:
    [Cross-posting to news:comp.infosystems.www.misc as I feel that
    this question has more to do with Web than HTML per se.]

    :^)

    I have a website organized as a large number (> 200,000) of pages.
    It is hosted by a large Internet hosting company.
    ...
    My users may click to 10 or 20 pages in a session. But the indexing
    bots want to read all 200,000+ pages! My host has now complained
    that the site is under "bot attack" and has asked me to check my own
    laptop for viruses!

    200k pages isn't that huge, and if they are static files on disk, as
    described in a snipped-out part, they shouldn't be that hard to serve.
    Bandwidth may be an issue, depending on how you are being charged. And
    on a shared system, which I think you might have, your options for
    optimizing for massive amounts of static files might be limited.

    I'm happy anyway to reduce the bot activity. I don't mind having my
    site indexed, but once or twice a year would be enough!

    Some of the better search engines will gladly consult site map files
    that give hints about what needs reindexing. See:

    https://www.sitemaps.org/protocol.html
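
    As a rough illustration (the file name is the standard one, but the
    URLs and dates below are made up), a minimal sitemap for a mostly
    static site could look like this; the <lastmod> values are what let
    a crawler skip pages that haven't changed since its last visit:

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- hypothetical /sitemap.xml; URLs and dates are placeholders -->
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/results/page-000001.html</loc>
        <lastmod>2017-01-10</lastmod>
        <changefreq>yearly</changefreq>
      </url>
      <url>
        <loc>http://www.example.com/results/page-000002.html</loc>
        <lastmod>2016-11-02</lastmod>
        <changefreq>yearly</changefreq>
      </url>
      <!-- ...and so on; a single sitemap is limited to 50,000 URLs, so
           a 200k-page site would use several, tied together by a
           sitemap index file as described on the page above. -->
    </urlset>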

    I see that there is a way to stop the Google Bot specifically. I'd
    love it if I could do the opposite -- have *only* Google index my
    site.
    JFTR, I personally (as well as many other users who value their
    privacy) refrain from using Google Search and rely on, say,
    https://duckduckgo.com/ instead.

    Yeah, Google only is an "all your eggs in one basket" route. I, too,
    have been using DDG almost exclusively for several years.

    A technician at the hosting company wrote to me
    As per the above logs and hitting IP addresses, we have blocked the
    46.229.168.* IP range to prevent the further abuse and advice you to
    also check incoming traffic and block such IP's in future.

    46.229.168.0-46.229.168.255 is:

    netname: ADVANCEDHOSTERS-NET

    Can't say I've heard of them.

    We have also blocked the bots by adding the following entry
    in robots.txt:-
    User-agent: AhrefsBot

    Yes, block them. Not a search engine, but a commercial SEO service.
    https://ahrefs.com/robot

    User-agent: MJ12bot

    Eh, maybe block, maybe not. Seems to be a real search engine.
    http://mj12bot.com/

    User-agent: SemrushBot

    Yes, block them. Not a search engine, but a commercial SEO service.
    https://www.semrush.com/bot/

    User-agent: YandexBot

    Real Russian search engine.
    https://yandex.com/support/webmaster/robot-workings/check-yandex-robots.xml

    User-agent: Linguee Bot

    Real service, but of dubious value to a webmaster.
    http://www.botreports.com/user-agent/linguee-bot.shtml

    All bots can be impersonated by other bots, so you can't be sure the
    User-Agent: reflects the real identity of the bot. You can spend a lot
    of time researching bots and the characteristics of real bot usage,
    e.g. the hostnames or IP address ranges of legitimate bot servers.
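
    One common check (it is roughly what the Yandex page above describes)
    is a reverse-then-forward DNS lookup: resolve the connecting address
    to a hostname, see whether that hostname belongs to the crawler's
    stated operator, then resolve the hostname back and confirm it maps
    to the same address. A sketch from a shell (the address is a
    placeholder to be filled in from the access log, and the expected
    hostnames are only what a genuine crawler would be likely to show):

    $ host 46.229.168.NN
    # a genuine search-engine crawler usually reverse-resolves to a name
    # under its operator's own domain, e.g. *.googlebot.com, *.yandex.com
    $ host name-returned-by-the-first-lookup
    # the forward lookup should map back to the same address; if the two
    # lookups don't agree, the User-Agent string can't be trusted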

    Given the little I've seen here, I wonder if you have someone at
    Advanced Hosters impersonating bots to suck your site down.

    As long as the troublesome bots honor robots.txt (there're those
    that do not; but then, the above won't work on them, either),
    a more sane solution would be to limit the /rate/ the bots
    request your pages for indexing, like:

    ### robots.txt

    ### Data:

    ## Request that the bots wait at least 3 seconds between requests.
    User-agent: *
    Crawl-delay: 3

    ### robots.txt ends here

    Except for Linguee, I think all of the bots listed above are
    well-behaved and will obey robots.txt, but I don't know if they are
    all advanced enough to know Crawl-delay. Some of them explicitly
    state they do, however.
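
    Putting the two pieces of advice together, a robots.txt that drops
    the SEO-only crawlers outright and merely slows everyone else down
    could look like the sketch below (whether a given bot honors
    Crawl-delay is, as noted, up to the bot):

    ### robots.txt -- block the SEO-only crawlers, rate-limit the rest

    User-agent: AhrefsBot
    Disallow: /

    User-agent: SemrushBot
    Disallow: /

    ## Everyone else may index, but is asked to wait at least
    ## 3 seconds between requests.
    User-agent: *
    Crawl-delay: 3

    ### robots.txt ends here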

    This way, the bots will still scan all your 2e5 pages, but their
    accesses will be spread over about a week -- which (I hope)
    will be well within "acceptable use limits" of your hosting
    company.

    The only bot I've ever had to blacklist was an MSN bot that absolutely
    refused to stop hitting one page over and over again a few years ago.
    I used a server directive to shunt that one bot to 403 Forbidden errors.
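
    For anyone curious what such a directive can look like: with
    Apache 2.4 (an assumption on my part, as are the "msnbot" substring
    and the path below), tagging the offending bot's requests with an
    environment variable and denying them turns its hits into 403s:

    # Tag requests whose User-Agent mentions the misbehaving bot.
    BrowserMatchNoCase "msnbot" bad_bot

    # Refuse tagged requests for the page it kept hammering.
    <Location "/the-page-it-kept-hitting">
        <RequireAll>
            Require all granted
            Require not env bad_bot
        </RequireAll>
    </Location>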

    Elijah
    ------
    stopped worrying about bots a long time ago

  • From Ivan Shmakov@21:1/5 to All on Sun Jul 16 08:00:56 2017
    XPost: alt.html

    Eli the Bearded <*@eli.users.panix.com> writes:
    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
    jdallen2000@yahoo.com writes:

    [...]

    I'm happy anyway to reduce the bot activity. I don't mind having
    my site indexed, but once or twice a year would be enough!

    Some of the better search engines will gladly consult site map files
    that give hints about what needs reindexing. See:

    https://www.sitemaps.org/protocol.html

    ... Learning sitemaps has been on my to-do list for a while now...

    [...]

    A technician at the hosting company wrote to me

    As per the above logs and hitting IP addresses, we have blocked
    the 46.229.168.* IP range to prevent the further abuse and advice
    you to also check incoming traffic and block such IP's in future.

    46.229.168.0-46.229.168.255 is:

    netname: ADVANCEDHOSTERS-NET

    Can't say I've heard of them.

    Same here.

    [...]

    All bots can be impersonated by other bots, so you can't be sure
    the User-Agent: reflects the real identity of the bot.

    True in general, but of little relevance in the context of
    robots.txt. For one thing, a misbehaving robot may very well
    have one string for User-Agent:, yet look for something entirely
    different in robots.txt (if it even decides to honor the file.)

    You can spend a lot of time researching bots and the characteristics
    of real bot usage, e.g. the hostnames or IP address ranges of
    legitimate bot servers.

    [...]

    Except for Linguee, I think all of the bots listed above are
    well-behaved and will obey robots.txt,

    FWIW, Linguee claim "[they] want [their] crawler to be as polite
    as possible." (http://linguee.com/bot.)

    but I don't know if they are all advanced enough to know Crawl-delay.
    Some of them explicitly state they do, however.

    Not that I've watched closely, but I don't recall stumbling upon
    a robot that would honor robots.txt, yet would issue requests in
    quick succession contrary to Crawl-delay:. That might've been
    because of bot-side rate limits, of course.

    This way, the bots will still scan all your 2e5 pages, but their
    accesses will be spread over about a week -- which (I hope) will be
    well within "acceptable use limits" of your hosting company.

    The only bot I've ever had to blacklist was an MSN bot that absolutely
    refused to stop hitting one page over and over again a few years ago.
    I used a server directive to shunt that one bot to 403 Forbidden
    errors.

    There seem to be a few misbehaving robots that frequent my
    servers; most masquerade as browsers -- and of course never
    consider robots.txt.

    For instance, 37.59.55.128, 109.201.142.109, 188.165.233.228,
    and I recall some such activity from Baidu networks. (Alongside
    their "regular", well-behaved crawler.)

    Elijah ------ stopped worrying about bots a long time ago

    How so?

    --
    FSF associate member #7257 np. Into The Dark -- Radiarc 3013 B6A0 230E 334A

  • From Doc O'Leary@21:1/5 to Ivan Shmakov on Sun Jul 16 14:50:46 2017
    XPost: alt.html

    For your reference, records indicate that
    Ivan Shmakov <ivan@siamics.net> wrote:

    jdallen2000@yahoo.com writes:

    [Cross-posting to news:comp.infosystems.www.misc as I feel that
    this question has more to do with Web than HTML per se.]

    Assuming we’re not missing any info as a result . . .

    My users may click to 10 or 20 pages in a session. But the indexing
    bots want to read all 200,000+ pages! My host has now complained
    that the site is under "bot attack" and has asked me to check my own
    laptop for viruses!

    This doesn’t make much sense. The web host sounds incompetent, so I
    don’t know that we can trust what is being reported by them. Getting
    (legitimately) spidered is not an attack. Any attack you *may* be
    under would not be the result of a virus on your own non-server
    computer. I’d find a different hosting provider.

    A technician at the hosting company wrote to me

    As per the above logs and hitting IP addresses, we have blocked the
    46.229.168.* IP range to prevent the further abuse and advice you to
    also check incoming traffic and block such IP's in future.

    There is nothing about the 46.229.160.0/20 range in question that
    indicates it represents a legitimate bot. Do the logs actually
    indicate vanilla spidering, or something more nefarious like looking
    for PHP/WordPress exploits? I see a lot of traffic like that.

    In such cases, editing robots.txt is unlikely to solve the problem.
    Generally, I don’t even bother configuring the web server to deny a
    serious attacker. I’d just drop their whole range into my firewall,
    because odds are good that a dedicated attacker isn’t going to only
    go after port 80.
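
    For the record, dropping a range at the firewall end is a one-liner.
    With iptables on a Linux box (assuming that is what sits in front of
    the server), the /20 in question would be:

    # Silently drop everything from the whole allocation, not just port 80.
    iptables -I INPUT -s 46.229.160.0/20 -j DROP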

    That might be beyond the scope of what a basic web hosting company
    provides but, really, given that a $15/year VPS can handle most
    traffic for even a 200K page static site with ease, I really can’t
    imagine what the real issue is here. More details needed.

    --
    "Also . . . I can kill you with my brain."
    River Tam, Trash, Firefly

  • From Eli the Bearded@21:1/5 to ivan@siamics.net on Sun Jul 16 20:27:40 2017
    XPost: alt.html

    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
    Elijah ------ stopped worrying about bots a long time ago

    A combination of enough capacity to not care about load and no longer
    using web advertising, removing the necessity to audit logs.

    Elijah
    ------
    understands others have other needs and priorities

  • From Ivan Shmakov@21:1/5 to All on Wed Jul 19 12:12:36 2017
    XPost: alt.html

    Doc O'Leary <droleary@2017usenet1.subsume.com> writes:
    For your reference, records indicate that Ivan Shmakov wrote:
    jdallen2000@yahoo.com writes:

    A technician at the hosting company wrote to me

    As per the above logs and hitting IP addresses, we have blocked
    the 46.229.168.* IP range to prevent the further abuse and advice
    you to also check incoming traffic and block such IP's in future.

    There is nothing about the 46.229.160.0/20 range in question that
    indicates it represents a legitimate bot. Do the logs actually
    indicate vanilla spidering, or something more nefarious like looking
    for PHP/WordPress exploits? I see a lot of traffic like that.

    Same here.

    In such cases, editing robots.txt is unlikely to solve the problem.

    Yes. (Although blocking the range is.)

    Generally, I don’t even bother configuring the web server to deny a
    serious attacker. I’d just drop their whole range into my firewall,
    because odds are good that a dedicated attacker isn’t going to only
    go after port 80.

    Personally, I configured my Web server to redirect requests like
    that to localhost:discard, and let the scanners disconnect at
    their own timeouts. (5-15 s, from the looks of it.) Like:

    <IfModule mod_rewrite.c>
        RewriteCond %{REQUEST_URI} \
            ^/(old|sql(ite)?|wp|XXX|[-/])*(admin|manager|YYY) [nocase]
        RewriteRule .* http://ip6-localhost:9/ [P]
    </IfModule>

    (That is, the "tar pit" approach. Alternatively, one may use
    mod_security, but to me that seemed like overkill.)

    When the scanner in question is a simple single-threaded,
    "sequential" program, that may also reduce its impact on the
    rest of the Web.

    OTOH, when I start receiving spam from somewhere, the respective
    range has a good chance of ending up in my firewall rules.

    (There was one dedicated spammer that I reported repeatedly to
    their hosters, only for them to move to some other service. I'm
    afraid I grew lazy when they moved to "Zomro" networks; e. g.,
    178.159.42.0/25. They seem to be staying there for months now.)

    That might be beyond the scope of what a basic web hosting company
    provides but, really, given that a $15/year VPS can handle most
    traffic for even a 200K page static site with ease, I really can’t
    imagine what the real issue is here. More details needed.

    Now, that's interesting. The VPS services I use (or used) are
    generally $5/month or more.

    Virpus VPSes were pretty cheap back when I used them, but
    somewhat less reliable.

    -- "Also . . . I can kill you with my brain." River Tam, Trash,
    Firefly

    ... Also on my to-watch list. (I certainly liked Serenity.)

    --
    FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A

  • From Doc O'Leary@21:1/5 to Ivan Shmakov on Wed Jul 19 14:55:51 2017
    XPost: alt.html

    For your reference, records indicate that
    Ivan Shmakov <ivan@siamics.net> wrote:

    Doc O'Leary <droleary@2017usenet1.subsume.com> writes:
    That might be beyond the scope of what a basic web hosting company
    provides but, really, given that a $15/year VPS can handle most
    traffic for even a 200K page static site with ease, I really can’t
    imagine what the real issue is here. More details needed.

    Now, that's interesting. The VPS services I use (or used) are
    generally $5/month or more.

    Well, I certainly can and do pay more for a VPS when I need more
    resources, but when it comes to hosting a static site like we’re
    talking about here, you really don’t need much to do it. These days,
    the limiting factor is quickly becoming the cost of an IPv4 address.

    There’s really no reason I can think of that a basic virtual web host
    should be balking over the OP’s site. It’s the kind of thing I’d
    host for friends for free because the overhead would seem like a
    rounding error.

    ... Also in my to-watch list. (I've certainly liked Serenity.)

    Personally, I think the series was *much* better than the movie.

    --
    "Also . . . I can kill you with my brain."
    River Tam, Trash, Firefly
