XPost: comp.unix.shell
Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
> On 15.10.2016 10:02, Kenny McCormack wrote:
>> Ivan Shmakov <ivan@siamics.net> wrote:
>>> [Cross-posting to news:comp.infosystems.www.misc, as the issue
>>> at hand is arguably more related to WWW than to Unix Shell.]
>>> The remote appears to filter by User-Agent:.
>>> $ lynx --dump --useragent=xnyL -- http://aruljohn.com/mac/000B14
>> And what is 'xnyL' ?
> 'Lynx' backwards. But I'm also interested in the rationale behind it.

The rationale behind filtering by User-Agent:, or how did I find
it out?

Per my observations, sites attempt to filter by User-Agent:
to mitigate certain kinds of "abuse", such as unsanctioned
mirroring, or recursive retrieval in general (which is part of
the operation of, say, email harvesters.) As such, disallowing
"Wget" -- a popular recursive downloading and mirroring tool --
is not uncommon; I've seen it done at such domains as arxiv.org,
classiccmp.org and datasheetcatalog.org. The proper solution
is, of course, to use the /robots.txt control file instead.
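
For illustration, a minimal /robots.txt that asks Wget (and only
Wget) to stay away, while leaving every other agent unrestricted,
could read:

  User-agent: Wget
  Disallow: /

  User-agent: *
  Disallow:

(An empty Disallow: line means "nothing is disallowed.")
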
(Granted, GNU Wget can be configured to ignore one -- but it can
just as well be configured to send an arbitrary User-Agent:
string. For which my long-time preference is, and I'm not
trying to surprise anyone, "tegW".)
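
To make the point concrete, and assuming a hypothetical
http://example.com/ target, both overrides fit on a single
command line:

  $ wget --recursive -e robots=off --user-agent=tegW \
        -- http://example.com/
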
Personally, I consider it far worse an issue when the recursive
retrieval software misidentifies itself as a common Web user
agent. Per my experience, a number of such requests originate
from 202.46.48.0/20. Like, say:

  202.46.54.133 - - 2016-10-15 21:27:23 +0000 "GET / HTTP/1.1"
      200 2546 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64)
      AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93
      Safari/537.36"
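
Requests like these are easy to tally. Assuming the combined-like
log format above, where the User-Agent: value is the last quoted
field of each record (access.log being a stand-in file name), a
short pipeline counts requests per agent:

  $ awk -F'"' '{ print $(NF-1); }' access.log \
        | sort | uniq -c | sort -rn | head
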
Worse still, even those requests from that same network that do
identify as "Baiduspider/2.0" in my logs never seem to request
/robots.txt. As such, I've decided to deny access to certain
sections of my Web sites based on specific User-Agent: and
request source IP combinations.
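
Not that mine is necessarily phrased this way, but as a sketch,
for Apache 2.4 with mod_authz_core and its expression parser
(the /private/ prefix is made up for the example), such a
combined rule could look like:

  <Location "/private/">
      <RequireAll>
          Require all granted
          Require not expr "%{HTTP_USER_AGENT} =~ /Baiduspider/ && -R '202.46.48.0/20'"
      </RequireAll>
  </Location>
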
... Another popular option for ad-hoc crawlers is Perl's LWP
(libwww-perl) library, commonly identified by "libwww-perl" in
the User-Agent: header. Incidentally, Lynx carries the very
same "libwww" substring in its own default User-Agent: value,
leading to what I presume are "false positives."
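
Checking what a given client actually sends is straightforward:
point it at any service that echoes the request headers back,
httpbin.org's /user-agent endpoint being one example:

  $ lynx --source -- http://httpbin.org/user-agent

The exact reply varies with the Lynx build, but the default runs
along the lines of "Lynx/2.8.9 libwww-FM/2.14 ..."; the
"libwww-FM" component is what trips the filters above.
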
Which is one of the reasons why I tend to use somewhat random
User-Agent: strings for my long-running Lynx sessions. Thus,
when I could access the site in question perfectly well from one
such Lynx instance, yet was refused access when running $ lynx
--dump from the command line, User-Agent: filtering was my guess
right away.
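
For the record, one way to produce such a throwaway string,
assuming a Linux-style /dev/urandom is available:

  $ lynx --useragent="$(tr -dc A-Za-z < /dev/urandom \
        | head -c 8)" -- http://aruljohn.com/mac/000B14
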
--
FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A