XPost: comp.unix.shell
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
> On 15.10.2016 10:02, Kenny McCormack wrote:
>> Ivan Shmakov <ivan@siamics.net> wrote:
>>> [Cross-posting to news:comp.infosystems.www.misc, as the issue
>>> at hand is arguably more related to WWW than to Unix Shell.]
>>> The remote appears to filter by User-Agent:.
>>> $ lynx --dump --useragent=xnyL -- http://aruljohn.com/mac/000B14
>> And what is 'xnyL' ?
> 'Lynx' backwards. But I'm also interested in the rationale behind it.

The rationale behind filtering by User-Agent:, or how did I find
it out?

Per my observations, sites attempt to filter by User-Agent:
to mitigate certain kinds of "abuse", such as unsanctioned
mirroring, or recursive retrieval in general (which is part of
the operation of, say, email harvesters.) As such, disallowing
"Wget" -- a popular recursive downloading and mirroring tool --
is not uncommon; I've seen it done at such domains as arxiv.org,
classiccmp.org and datasheetcatalog.org. The proper solution
is, of course, to use the /robots.txt control file instead.
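
For illustration, a minimal /robots.txt that asks Wget (and only
Wget) to stay away, while leaving every other agent unrestricted,
could read:

  User-agent: Wget
  Disallow: /

  User-agent: *
  Disallow:

(An empty Disallow: line means "nothing is disallowed.")
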
(Granted, GNU Wget can be configured to ignore one -- but it can
just as well be configured to send an arbitrary User-Agent:
string. For which my long-time preference is, and I'm not
trying to surprise anyone, "tegW".)
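
To make the point concrete, and assuming a hypothetical
http://example.com/ target, both overrides fit on a single
command line:

  $ wget --recursive -e robots=off --user-agent=tegW \
        -- http://example.com/
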
Personally, I consider it far worse an issue when the recursive
retrieval software misidentifies itself as a common Web user
agent. Per my experience, a number of such requests originate
from 202.46.48.0/20. Like, say:

  202.46.54.133 - - 2016-10-15 21:27:23 +0000 "GET / HTTP/1.1"
      200 2546 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64)
      AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93
      Safari/537.36"
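
Requests like these are easy to tally. Assuming the combined-like
log format above, where the User-Agent: value is the last quoted
field of each record (access.log being a stand-in file name), a
short pipeline counts requests per agent:

  $ awk -F'"' '{ print $(NF-1); }' access.log \
        | sort | uniq -c | sort -rn | head
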
Worse still, even those requests from that same network that do
identify as "Baiduspider/2.0" in my logs never seem to request
/robots.txt. As such, I've decided to deny access to certain
sections of my Web sites based on specific User-Agent: and
request source IP combinations.
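
Not that mine is necessarily phrased this way, but as a sketch,
for Apache 2.4 with mod_authz_core and its expression parser
(the /private/ prefix is made up for the example), such a
combined rule could look like:

  <Location "/private/">
      <RequireAll>
          Require all granted
          Require not expr "%{HTTP_USER_AGENT} =~ /Baiduspider/ && -R '202.46.48.0/20'"
      </RequireAll>
  </Location>
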
... Another popular option for ad-hoc crawlers is Perl's LWP
(libwww-perl) library, commonly identified by "libwww-perl" in
the User-Agent: header. Incidentally, Lynx carries the very
same "libwww" substring in its own default User-Agent: value,
leading to what I presume are "false positives."
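
Checking what a given client actually sends is straightforward:
point it at any service that echoes the request headers back,
httpbin.org's /user-agent endpoint being one example:

  $ lynx --source -- http://httpbin.org/user-agent

The exact reply varies with the Lynx build, but the default runs
along the lines of "Lynx/2.8.9 libwww-FM/2.14 ..."; the
"libwww-FM" component is what trips the filters above.
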
Which is one of the reasons why I tend to use somewhat random
User-Agent: strings for my long-running Lynx sessions. Thus,
when I could access the site in question perfectly well from one
such Lynx instance, yet was refused access when running $ lynx
--dump from the command line, User-Agent: filtering was my guess
right away.
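
For the record, one way to produce such a throwaway string,
assuming a Linux-style /dev/urandom is available:

  $ lynx --useragent="$(tr -dc A-Za-z < /dev/urandom \
        | head -c 8)" -- http://aruljohn.com/mac/000B14
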
--
FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A