• Command line browser (or browser-like utility) that does java/javas

    From Ivan Shmakov@21:1/5 to All on Sun Oct 16 09:45:42 2016
    XPost: comp.unix.shell

    Janis Papanagnou <janis_papanagnou@hotmail.com> writes:
    > On 15.10.2016 10:02, Kenny McCormack wrote:
    >> Ivan Shmakov <ivan@siamics.net> wrote:

    [Cross-posting to news:comp.infosystems.www.misc, as the issue
    at hand is arguably more related to WWW than to Unix Shell.]

    >>> The remote appears to filter by User-Agent:.

    >>> $ lynx --dump --useragent=xnyL -- http://aruljohn.com/mac/000B14

    >> And what is 'xnyL' ?

    > 'Lynx' backwards.  But I'm also interested in the rationale behind it.

    The rationale behind filtering by User-Agent:, or how did I find
    it out?

    Per my observations, sites filter by User-Agent: to mitigate
    certain kinds of "abuse," such as unsanctioned mirroring, or
    recursive retrieval in general (which is part of the operation
    of, say, email harvesters.)  As such, disallowing "Wget" -- a
    popular recursive downloading and mirroring tool -- is not
    uncommon; I've seen it done at such domains as arxiv.org,
    classiccmp.org and datasheetcatalog.org.  The proper solution
    is, of course, to use the /robots.txt control file instead.
    (Granted, GNU Wget can be configured to ignore that file -- but
    it can just as easily be configured to send an arbitrary
    User-Agent: string.  My long-time preference there, and I'm not
    trying to surprise anyone, is "tegW".)
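
    For concreteness, the /robots.txt way of declining Wget, and the
    trivial countermeasure, look roughly like this (a sketch only;
    example.com stands in for whatever site is involved):

    # /robots.txt -- politely ask Wget (and only Wget) to stay out
    User-agent: Wget
    Disallow: /

    # ... which GNU Wget can be told to disregard outright:
    $ wget --recursive -e robots=off --user-agent=tegW http://example.com/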

    Personally, I consider it a far worse issue when recursive
    retrieval software misidentifies itself as a common Web user
    agent.  In my experience, a number of such requests originate
    from 202.46.48.0/20.  For instance:

    202.46.54.133 - - 2016-10-15 21:27:23 +0000 "GET / HTTP/1.1" 200 2546 "-"
    "Mozilla/5.0 (Windows NT 10.0; WOW64)
    AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.93 Safari/537.36"
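
    (A quick way to pull such requests out of a log -- a sketch
    only, assuming the client address is the first field, as in the
    excerpt above, and that the file is named access.log:

    $ awk '$1 ~ /^202\.46\.(4[89]|5[0-9]|6[0-3])\./ && /Chrome\//' access.log

    The bracketed alternation simply spells out the 202.46.48.0/20
    range, that is, third octets 48 through 63.)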

    Worse still, even those requests from that same network which
    identify themselves as "Baiduspider/2.0" in my logs never seem
    to refer to /robots.txt.  I've therefore decided to deny access
    to certain sections of my Web sites based on specific
    combinations of User-Agent: and request source IP.
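
    (For reference, with Apache httpd 2.4 -- to pick one common
    server; any httpd with comparable access control would do --
    such a combined check can be spelled roughly as follows, the
    directory path being a placeholder:

    <Directory "/srv/www/example/section">
      <If "%{HTTP_USER_AGENT} =~ /Baiduspider/ && -R '202.46.48.0/20'">
        Require all denied
      </If>
    </Directory>

    That is: requests which both claim to be Baiduspider and
    originate from the network above are refused for this section
    only.)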

    ... Another popular option for ad-hoc crawlers is Perl's
    libwww-perl (LWP) library, commonly identified by "libwww-perl"
    in the User-Agent: header.  Incidentally, Lynx carries the very
    same "libwww" substring in its own default User-Agent: value,
    which leads, I presume, to the occasional "false positive."
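
    (The value a particular client actually sends is easy to check
    against a header-echoing service -- httpbin.org being one
    public example:

    $ lynx --dump https://httpbin.org/user-agent

    The exact string varies with the Lynx build, but it does
    contain the "libwww" substring mentioned above, as part of a
    "libwww-FM" component.)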

    Which is one of the reasons I tend to use somewhat random
    User-Agent: strings for my long-running Lynx sessions.  Thus,
    when I could access the site in question perfectly well from
    one such instance, yet was refused access when running
    $ lynx --dump from the command line, User-Agent: filtering was
    my guess right away.
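
    A minimal sketch of how such a session may be started (the
    random-string recipe is just one possibility, and example.org
    is a placeholder):

    $ ua=$(tr -dc 'A-Za-z' < /dev/urandom | head -c 8)
    $ lynx --useragent="$ua" https://example.org/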

    --
    FSF associate member #7257 58F8 0F47 53F5 2EB2 F6A5 8916 3013 B6A0 230E 334A
