• no-https: a plain-HTTP to HTTPS proxy

    From Ivan Shmakov@21:1/5 to All on Sun Sep 16 07:07:35 2018
    XPost: comp.misc

    [Cross-posting to news:comp.misc as the issue of plain-HTTP
    unavailability was recently discussed there.]

    It took me about a day to write a crude but apparently (more or
    less) working HTTP to HTTPS proxy. (That I hope to beat into
    shape and release via news:alt.sources around next Wednesday
    or so. FTR, the code is currently under 600 LoC long, or 431 LoC
    excluding comments and empty lines.) Some design notes are below.


    Basics

    The basic algorithm is as follows (see the condensed sketch
    after the list):

    1. receive a request header from the client; we only allow
    GET and HEAD requests for now, as we do not support request
    /bodies/ as of yet;

    2. decide the server and connect there;

    3. send the header to the server;

    4. receive the response header;

    5. if that's an https: redirect:

    5.1. connect over TLS, alter the request (Host:, "request target")
    accordingly, go to step 3;

    6. strip certain headers (such as Strict-Transport-Security: and
    Upgrade:, but also Set-Cookie:) off the response and send the
    result to the client;

    7. copy up to Content-Length: octets from the server to the
    client -- or all the remaining data if no Content-Length:
    is given; (somewhat surprisingly, this seems to also work with
    the "chunked" coding not otherwise considered in the code);

    8. close the connection to the server and repeat from step 1
    so long as the client connection remains active.
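
    As a condensed sketch in Perl 5 (not the actual no-https.perl
    source; read_header, parse_target, connect_tcp, connect_tls,
    request_string, response_string, strip_headers, send_all and
    relay_body are hypothetical helpers, with headers kept in
    lower-cased hashes), the loop might look like:

      ## Per-client loop, following steps 1 to 8 above; a sketch only.
      sub handle_client {
          my ($client) = @_;
          while (my ($method, $target, $hdr) = read_header ($client)) {
              ## 1. only GET and HEAD for now (a real implementation
              ## would answer 405 rather than just give up)
              last unless ($method =~ m/\A(?:GET|HEAD)\z/);
              ## 2., 3. decide the server, connect, send the header
              my ($host, $port) = parse_target ($target, $hdr);
              my $server = connect_tcp ($host, $port);
              send_all ($server, request_string ($method, $target, $hdr));
              ## 4., 5. on an https: redirect, retry the request over
              ## TLS (a single retry here; the real flow loops to step 3)
              my ($code, $rhdr) = read_header ($server);
              if ($code =~ m/\A30[1278]\z/
                  && $rhdr->{"location"} =~ m{\Ahttps://}) {
                  ($host, $port) = parse_target ($rhdr->{"location"});
                  $server = connect_tls ($host, $port);
                  send_all ($server, request_string ($method,
                                                     $rhdr->{"location"},
                                                     $hdr));
                  ($code, $rhdr) = read_header ($server);
              }
              ## 6. strip STS, Upgrade, Set-Cookie, etc. off the response
              strip_headers ($rhdr, qw (strict-transport-security
                                        upgrade set-cookie));
              send_all ($client, response_string ($code, $rhdr));
              ## 7. copy up to Content-Length: octets (or all the rest)
              relay_body ($server, $client, $rhdr->{"content-length"});
              ## 8. close the server side; keep the client one
              close ($server);
          }
      }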

    The server uses select(2) so that socket reads do not block and
    supports an arbitrary number (up to the system-enforced limits)
    of concurrent connections. For simplicity, socket writes /are/
    allowed to block. (Hopefully not a problem for proxy-to-server
    connections most of the time, and even less so for proxy-to-client
    ones; assuming no malicious intent on the part of either,
    obviously. The latter case may be mitigated by using a "proper"
    HTTP proxy, such as Polipo, in front of this one.)
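
    For illustration, the select(2) read side might be arranged with
    IO::Select along these lines (a minimal sketch, not the actual
    code; handle_input stands in for the per-connection state machine):

      use strict;
      use warnings;
      use IO::Select ();
      use IO::Socket::INET ();

      my $listener
          = IO::Socket::INET->new (LocalPort => 8080, Listen => 16,
                                   ReuseAddr => 1)
          or die ("listen: " . $!);
      my $sel = IO::Select->new ($listener);
      while (my @ready = $sel->can_read ()) {
          foreach my $fh (@ready) {
              if ($fh == $listener) {
                  ## a new client connection; watch it with the rest
                  $sel->add (scalar ($listener->accept ()));
              } else {
                  ## does not block: select(2) said this one is readable
                  my $n = sysread ($fh, my $buf, 65536);
                  if (! $n) { $sel->remove ($fh); close ($fh); next; }
                  handle_input ($fh, $buf);
              }
          }
      }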


    Dealing with the https: references

    There was an idea of transparently replacing https: references
    in HTML and XML attributes with scheme-relative ones (like, e. g.,
    https://example.com/ to //example.com/.) So far, that fails
    more often than it works, for two primary reasons: compression
    (although that can be solved by forcing Accept-Encoding: identity
    in requests) -- and the fact that by the time such filtering can
    take place, we've already sent the Content-Length: (if any) for
    the original (unaltered) body to the client!

    Also, as the code does not currently handle the "chunked" coding,
    references split across chunks will not be handled. (The code
    should handle references split across bufferfuls of data, though.)
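
    One simple way to handle references split across bufferfuls is
    to hold back a short unprocessed tail between reads. A sketch
    (ignoring the Content-Length: problem described above;
    rewrite_chunk and the 32-octet tail are illustrative only):

      ## Rewrite https: references to scheme-relative ones across
      ## buffer boundaries; the held-back tail must be longer than
      ## any reference prefix we could otherwise cut in half.
      my $carry = "";
      sub rewrite_chunk {
          my ($data) = @_;
          my $buf = $carry . $data;
          $buf =~ s{\b(src|href)(\s*=\s*["'])https:(//)}{$1$2$3}gi;
          if (length ($buf) > 32) {
              ## chop the last 32 octets off $buf and carry them over
              $carry = substr ($buf, -32, 32, "");
          } else {
              ($carry, $buf) = ($buf, "");
          }
          ## safe to forward to the client
          return $buf;
      }
      ## at the end of the stream, flush the remaining $carry verbatim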

    Two possible ways to solve that would be to, for desired
    Content-Type: values, either retrieve the whole response in full
    before altering and forwarding to the client, /or/ to implement
    support for "chunked" coding and force its use there (stripping
    Content-Length: off the original response, if any.)

    I suppose both approaches can be implemented, with the first
    used, say, when Content-Length: is below a configured limit,
    although that increases the complexity of the code, which is
    something I'd rather avoid.
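
    For reference, producing the "chunked" coding on the sending side
    is straightforward; a sketch of the framing per RFC 7230,
    section 4.1:

      ## forward one (possibly altered) body part using the "chunked"
      ## transfer coding: size in hex, CRLF, data, CRLF
      sub send_chunk {
          my ($fh, $data) = @_;
          return unless (length ($data));
          printf {$fh} ("%x\r\n", length ($data));
          print  {$fh} ($data, "\r\n");
      }
      ## a zero-sized chunk (with an empty trailer) ends the body
      sub send_last_chunk {
          my ($fh) = @_;
          print {$fh} ("0\r\n\r\n");
      }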

    That said, I don't think the https: references /should/ be an
    issue in practice, as most links ought to be relative in the
    first place, such as:

    <p ><a href="page2.html" >Continue reading this article</a>,
    or <a href="/" >go back to the top page.</a></p>

    However, I suspect that images and such may be a common
    exception in practice, like:

    <img src="https://static.example.com/useless-stock-photo.jpeg" />

    Which of course would work just as well (and require no
    specific action on the part of this proxy) were it written as:

    <img src="//static.example.com/useless-stock-photo.jpeg" />


    Making responses even better

    Other possible response alterations may include removing <link />
    elements and Link: HTTP headers pointing to JavaScript code
    (running arbitrary software from the Web is a bad idea, and
    doing so while forgoing the meager TLS protection isn't making
    it better) /and/ also <script /> elements. The latter, in turn,
    will probably either require rather complex state tracking --
    or getting the server response in full before the alterations
    can take place.


    Thoughts?

    --
    FSF associate member #7257 np. Nine Lives -- Slaygon

  • From Eli the Bearded@21:1/5 to ivan@siamics.net on Sun Sep 16 20:52:00 2018
    XPost: comp.misc

    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
    It took me about a day to write a crude but apparently (more or
    less) working HTTP to HTTPS proxy. (That I hope to beat into
    shape and release via news:alt.sources around next Wednesday
    or so. FTR, the code is currently under 600 LoC long, or 431 LoC
    excluding comments and empty lines.) Some design notes are below.

    What language?

    The basic algorithm is as follows:

    1. receive a request header from the client; we only allow
    GET and HEAD requests for now, as we do not support request
    /bodies/ as of yet;

    No POST requests will stop a lot of forms. HEAD is an easy case, but
    largely unused.

    2. decide the server and connect there;
    3. send the header to the server;
    4. receive the response header;
    5. if that's an https: redirect:
    5.1. connect over TLS, alter the request (Host:, "request target")
    accordingly, go to step 3;
    6. strip certain headers (such as Strict-Transport-Security: and
    Upgrade:, but also Set-Cookie:) off the response and send the
    result to the client;

    That probably covers it. If you change HTTP/1.1 to HTTP/1.0 on the
    requests, then 1% of servers will have issues and 50% fewer servers
    will send chunked responses. (Numbers made up, based on my experience.) You
    can also drop Accept-Encoding: if you want to avoid dealing with
    compressed responses.
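
    In Perl terms, that downgrade might be as small as the following
    fragment ($request_line and %header are hypothetical names, not
    no-https internals):

      ## force HTTP/1.0 and advertise no encoding support
      $request_line =~ s{ HTTP/1\.1\z}{ HTTP/1.0};
      delete ($header{"accept-encoding"});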

    7. copy up to Content-Length: octets from the server to the
    client -- or all the remaining data if no Content-Length:
    is given; (somewhat surprisingly, this seems to also work with
    the "chunked" coding not otherwise considered in the code);

    Yup, that works in my experience, too.

    Dealing with the https: references

    There was an idea of transparently replacing https: references
    in HTML and XML attributes with scheme-relative ones (like, e. g.,
    https://example.com/ to //example.com/.) So far, that fails
    more often than it works, for two primary reasons: compression
    (although that can be solved by forcing Accept-Encoding: identity

    No accept-encoding header == no compression.

    in requests) -- and the fact that by the time such filtering can
    take place, we've already sent the Content-Length: (if any) for
    the original (unaltered) body to the client!

    You can fix that with whitespace padding.

    <img src="https://qaz.wtf/tmp/chree.png" ...>
    <img src="//qaz.wtf/tmp/chree.png" ...>

    Beware of parsing issues. Real-world HTML usually looks like one
    of the first two of these, but may sometimes look like one of the
    second two:

    <img src="https://qaz.wtf/tmp/chree.png" ...>
    <img src='https://qaz.wtf/tmp/chree.png' ...>
    <img src=https://qaz.wtf/tmp/chree.png ...>
    <img src = "https://qaz.wtf/tmp/chree.png" ...>

    (And that's ignoring case.)

    That said, I don't think the https: references /should/ be an
    issue in practice, as most links ought to be relative in the
    first place, such as:

    Hahaha. There are so many different ways it is done in the real world.

    Thoughts?

    Are you going to fix Referer: headers to use the https: version
    when communicating with an https site? I think you probably should.

    Elijah
    ------
    only forces https on his site for the areas that require login

  • From Computer Nerd Kev@21:1/5 to Ivan Shmakov on Sun Sep 16 22:52:54 2018
    XPost: comp.misc

    In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:

    It took me about a day to write a crude but apparently (more or
    less) working HTTP to HTTPS proxy. (That I hope to beat into
    shape and release via news:alt.sources around next Wednesday
    or so. FTR, the code is currently under 600 LoC long, or 431 LoC
    excluding comments and empty lines.) Some design notes are below.

    Sounds like a great start. I'm looking forward to trying it out.

    --
    __ __
    #_ < |\| |< _#

  • From Ivan Shmakov@21:1/5 to All on Tue Sep 18 13:10:44 2018
    XPost: comp.misc

    Eli the Bearded <*@eli.users.panix.com> writes:
    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:

    It took me about a day to write a crude but apparently (more or
    less) working HTTP to HTTPS proxy. (That I hope to beat into shape
    and release via news:alt.sources around next Wednesday or so.
    FTR, the code is currently under 600 LoC long, or 431 LoC excluding
    comments and empty lines.) Some design notes are below.

    What language?

    Perl 5. It appears the most apt for the task of the five general
    purpose languages I'm using regularly these days. (The others
    being Emacs Lisp, Shell, Awk; and C, though that's mostly limited
    to occasional embedded programming.)

    The basic algorithm is as follows:

    1. receive a request header from the client; we only allow GET and
    HEAD requests for now, as we do not support request /bodies/ as of yet;

    No POST requests will stop a lot of forms.

    My intent was to support Web /reading/ over plain HTTP specifically
    -- which is something that shouldn't involve forms IMO. That said,
    I suppose there can be any number of resources that use POST for
    /search/ forms, which is something that may be worth supporting.

    HEAD is an easy case, but largely unused.

    Easy, indeed, and I do use it myself, so the question of whether
    to implement its handling or not wasn't really considered.

    [...]

    6. strip certain headers (such as Strict-Transport-Security: and
    Upgrade:, but also Set-Cookie:) off the response and send the result
    to the client;

    That probably covers it. If you change HTTP/1.1 to HTTP/1.0 on the
    requests, then 1% of servers will have issues and 50% fewer servers
    will send chunked responses. (Numbers made up, based on my experience.)

    The idea was to require the barest minimum of mangling in the
    code, so as to leave as many choices as possible to the user.
    As such, HTTP/1.1 and chunked encoding appear well worth supporting.

    You can also drop Accept-Encoding: if you want to avoid dealing
    with compressed responses.

    Per RFC 7231, Accept-Encoding: identity communicates the client's
    preference for "no encoding." Omitting the header, OTOH, means
    "no preference":

    5.3.4. Accept-Encoding

    [...]

    A request without an Accept-Encoding header field implies that the
    user agent has no preferences regarding content-codings. Although
    this allows the server to use any content-coding in a response, it
    does not imply that the user agent will be able to correctly process
    all encodings.

    That said, I do wish for the user to have the choice of having
    /both/ compression and transformations available. And while I'm
    not constrained much by bandwidth, some of the future users of
    this code may be.

    [...]

    There was an idea of transparently replacing https: references in
    HTML and XML attributes with scheme-relative ones (like, e. g.,
    https://example.com/ to //example.com/.) So far, that fails more
    often than it works, for two primary reasons: compression (although
    that can be solved by forcing Accept-Encoding: identity in requests)
    -- and the fact that by the time such filtering can take place,
    we've already sent the Content-Length: (if any) for the original
    (unaltered) body to the client!

    You can fix that with whitespace padding.

    <img src="https://qaz.wtf/tmp/chree.png" ...>
    <img src="//qaz.wtf/tmp/chree.png" ...>

    Yes, I've tried it (alongside Accept-Encoding: identity); it
    worked, but I don't like it for its lack of generality.

    Beware of parsing issues.

    Other than those shown in the examples below?

    Real-world HTML usually looks like one of the first two of these,
    but may sometimes look like one of the second two:

    <img src="https://qaz.wtf/tmp/chree.png" ...>
    <img src='https://qaz.wtf/tmp/chree.png' ...>
    <img src=https://qaz.wtf/tmp/chree.png ...>
    <img src = "https://qaz.wtf/tmp/chree.png" ...>

    (And that's ignoring case.)

    Indeed; and case and the lack of quotes will require special-casing
    for HTML (I aim to support XML applications as well, which
    fortunately are somewhat simpler in this respect.)

    OTOH, I don't think I've ever seen the " = " form; do the blanks
    around the equals sign even conform to any HTML version?

    [...]

    Thoughts?

    Are you going to fix Referer: headers to use the https: version
    when communicating with an https site? I think you probably should.

    I guess I'll leave it up to the user. Per my experience (with
    copying Web pages using Wget), resources requiring Referer: are
    the exception rather than the rule, but still.

    Elijah ------ only forces https on his site for the areas that
    require login

    And that's a sensible approach.

    --
    FSF associate member #7257 http://am-1.org/~ivan/

  • From Ivan Shmakov@21:1/5 to All on Tue Sep 18 17:05:35 2018
    XPost: comp.misc

    Rich <rich@example.invalid> writes:
    In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:

    [...]

    OTOH, I don't think I've ever seen the " = " form; do the blanks
    around the equals sign even conform to any HTML version?

    The HTML spec does not appear to explicitly exclude use of spaces
    around the equals sign. So unless there is an explicit exclusion
    somewhere that I've missed, it would be legal to add spaces around
    the equals.

    Does it explicitly allow spaces?

    The fact is, even ignoring the spaced-equals item, that HTML is
    "flexible" enough that if you get to the point of wanting to do
    rewriting/editing, you'll have way fewer "pull your hair out"
    issues if you make use of an HTML parser to parse the HTML instead
    of trying to do anything by string or regex search/replace on the
    HTML. Anything string/regex search based on HTML will appear to
    work OK until the day it hits a legal bit of HTML it was not
    designed to handle, and then it will break badly.

    I. e., the "edge conditions" are so numerous that you are better off
    using a parser that has already been designed to handle those edge conditions.

    I tend to agree with the above for the general case: where I'd
    expect the code to /fail/ if it encounters something it does not
    understand.

    In this case, something that the code does not understand
    ought to be left untouched, and I'm unsure if I can readily get
    an HTTP parser that does that.

    --
    FSF associate member #7257 http://am-1.org/~ivan/

  • From Rich@21:1/5 to Ivan Shmakov on Tue Sep 18 16:36:51 2018
    XPost: comp.misc

    In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
    Eli the Bearded <*@eli.users.panix.com> writes:
    Real-world HTML usually looks like one of the first two of these,
    but may sometimes look like one of the second two:

    <img src="https://qaz.wtf/tmp/chree.png" ...>
    <img src='https://qaz.wtf/tmp/chree.png' ...>
    <img src=https://qaz.wtf/tmp/chree.png ...>
    <img src = "https://qaz.wtf/tmp/chree.png" ...>

    (And that's ignoring case.)

    Indeed; and case and the lack of quotes will require special-casing
    for HTML (I aim to support XML applications as well, which
    fortunately are somewhat simpler in this respect.)

    OTOH, I don't think I've ever seen the " = " form; do the
    blanks around the equals sign even conform to any HTML
    version?

    The HTML spec does not appear to explicitly exclude use of spaces
    around the equals sign. So unless there is an explicit exclusion
    somewhere that I've missed, it would be legal to add spaces around the
    equals.

    The fact is, even ignoring the spaced-equals item, that HTML is
    "flexible" enough that if you get to the point of wanting to do
    rewriting/editing, you'll have way fewer "pull your hair out"
    issues if you make use of an HTML parser to parse the HTML instead
    of trying to do anything by string or regex search/replace on the
    HTML. Anything string/regex search based on HTML will appear to
    work OK until the day it hits a legal bit of HTML it was not
    designed to handle, and then it will break badly.

    I.e., the "edge conditions" are so numerous that you are better off
    using a parser that has already been designed to handle those edge
    conditions.

  • From Andy Burns@21:1/5 to Ivan Shmakov on Tue Sep 18 18:32:19 2018
    XPost: comp.misc

    Ivan Shmakov wrote:

    Rich <rich@example.invalid> writes:

    The HTML spec does not appear to explicitly exclude use of spaces
    around the equals sign.

    Does it explicitly allow spaces?

    The W3C validity checker doesn't warn if spaces are included.

  • From Rich@21:1/5 to Ivan Shmakov on Tue Sep 18 18:56:52 2018
    XPost: comp.misc

    In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
    Rich <rich@example.invalid> writes:
    In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:

    [...]

    OTOH, I don't think I've ever seen the " = " form; do the blanks
    around the equals sign even conform to any HTML version?

    The HTML spec does not appear to explicitly exclude use of spaces
    around the equals sign. So unless there is an explicit exclusion
    somewhere that I've missed, it would be legal to add spaces around
    the equals.

    Does it explicitly allow spaces?

    It is fully silent. It shows examples without the spaces, but is
    silent otherwise as to their allowance (or disallowance) around the
    equals. Given the silence, it is very possible that examples with
    spaces may exist in the wild, and possible (although I have not tested)
    that browsers accept HTML with spaces present.

    The fact is, even ignoring the spaced-equals item, that HTML is
    "flexible" enough that if you get to the point of wanting to do
    rewriting/editing, you'll have way fewer "pull your hair out"
    issues if you make use of an HTML parser to parse the HTML instead
    of trying to do anything by string or regex search/replace on the
    HTML. Anything string/regex search based on HTML will appear to
    work OK until the day it hits a legal bit of HTML it was not
    designed to handle, and then it will break badly.

    I. e., the "edge conditions" are so numerous that you are better
    off using a parser that has already been designed to handle those
    edge conditions.

    I tend to agree with the above for the general case: where I'd
    expect the code to /fail/ if it encounters something it does
    not understand.

    In this case, something that the code does not understand
    ought to be left untouched, and I'm unsure if I can readily
    get an HTTP parser that does that.

    That is, of course, always the final 'out' for something so broken that
    the 'content modification' module fails.

    The difference is that you'll significantly reduce the number of
    failure instances by using a parser to handle the parsing of the
    incoming HTML, then passing the parse tree off to the 'content
    modification' module vs. trying to do content modification with string
    matching and/or regex matching (both of which are essentially creating
    weak 'parsers' that only handle a small subset of the full
    possibilities allowed).

    But you can't possibly reduce the potential for failure to zero, no
    matter what you do, because it is always possible to retrieve
    something that claims to be HTML but is so broken that it simply
    can't be handled (or is simply misidentified, i.e., someone sending
    a JPEG image but MIME-typing it in the header as text/html).

  • From Marko Rauhamaa@21:1/5 to All on Tue Sep 18 22:02:22 2018
    XPost: comp.misc

    Rich <rich@example.invalid>:

    The HTML spec does not appear to explicitly exclude use of spaces
    around the equals sign. So unless there is an explicit exclusion
    somewhere that I've missed, it would be legal to add spaces around the equals.

    No need to guess or improvise. The W3 consortium has provided an
    explicit pseudocode implementation of an HTML parser:

    <URL: https://www.w3.org/TR/html52/syntax.html#syntax>

    In fact, I happened to implement the lexical analysis of HTML based on
    this specification just a couple of weeks ago. It was about 3,000 lines
    of code.

    The specification is careful to address the proper behavior of a parser
    when illegal HTML is encountered.


    Marko

  • From Rich@21:1/5 to Marko Rauhamaa on Tue Sep 18 19:08:48 2018
    XPost: comp.misc

    In comp.misc Marko Rauhamaa <marko@pacujo.net> wrote:
    Rich <rich@example.invalid>:

    The HTML spec does not appear to explicitly exclude use of spaces
    around the equals sign. So unless there is an explicit exclusion
    somewhere that I've missed, it would be legal to add spaces around the
    equals.

    No need to guess or improvise. The W3 consortium has provided an
    explicit pseudocode implementation of an HTML parser:

    <URL: https://www.w3.org/TR/html52/syntax.html#syntax>


    Thanks for that reference. Looking through it, one finds this for
    attributes:

    The attribute name, followed by zero or more space characters,
    followed by a single U+003D EQUALS SIGN character, followed by zero
    or more space characters, followed by the attribute value,

    So spaces around the equals sign are actually allowed per that syntax
    page.

  • From Andy Burns@21:1/5 to Ivan Shmakov on Tue Sep 18 20:16:37 2018
    XPost: comp.misc

    Ivan Shmakov wrote:

    I don't think I've ever seen the " = " form; do the blanks
    around the equals sign even conform to any HTML version?

    yes, e.g.

    "The attribute name, followed by zero or more space characters, followed
    by a single U+003D EQUALS SIGN character, followed by zero or more space characters, followed by the attribute value, which, in addition to the requirements given above for attribute values, must not contain any
    literal space characters, any U+0022 QUOTATION MARK characters ("),
    U+0027 APOSTROPHE characters ('), U+003D EQUALS SIGN characters (=),
    U+003C LESS-THAN SIGN characters (<), U+003E GREATER-THAN SIGN
    characters (>), or U+0060 GRAVE ACCENT characters (`), and must not be
    the empty string"

    <https://www.w3.org/TR/html5/syntax.html#attribute-names>

  • From Ivan Shmakov@21:1/5 to All on Wed Sep 19 05:15:57 2018
    XPost: comp.misc

    Rich <rich@example.invalid> writes:
    In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
    Rich <rich@example.invalid> writes:

    [...]

    I. e., the "edge conditions" are so numerous that you are better
    off using a parser that has already been designed to handle those
    edge conditions.

    I tend to agree with the above for the general case: where I'd
    expect the code to /fail/ if it encounters something it does not
    understand.

    In this case, something that the code does not understand ought
    to be left untouched, and I'm unsure if I can readily get an HTTP

    s/HTTP/HTML/, obviously.

    parser that does that.

    That is, of course, always the final 'out' for something so broken
    that the 'content modification' module fails.

    The difference is that you'll significantly reduce the number of
    failure instances by using a parser to handle the parsing of the
    incoming HTML, then passing the parse tree off to the 'content
    modification' module vs. trying to do content modification with
    string matching and/or regex matching (both of which are essentially
    creating weak 'parsers' that only handle a small subset of the full
    possibilities allowed).

    I also consider the possibility of running no-https as a public
    service. As such, considerations like CPU and memory consumption,
    including the ability to run in more or less constant space (per
    connection, with the number of concurrent connections possibly
    also limited) take priority. Creating a full DOM for the
    possibly multi-MiB document, OTOH, is not an option.

    (That said, if there's an HTML parser for Perl that /can/ be used
    for running in constant space, I'd be curious to consider the
    examples.)

    If you want these alterations to take place for every
    possible document supported by your browser -- implement them as
    a browser extension. For instance, user JavaScript run with
    Greasemonkey for Firefox has (AIUI) full access to the DOM and
    can walk that and consistently strip "https:" off attribute
    values, regardless of the HTML document's syntax specifics.

    [...]

    --
    FSF associate member #7257 http://am-1.org/~ivan/

  • From Mike Spencer@21:1/5 to Computer Nerd Kev on Wed Sep 19 17:27:44 2018
    XPost: comp.misc

    not@telling.you.invalid (Computer Nerd Kev) writes:

    In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:

    It took me about a day to write a crude but apparently (more or
    less) working HTTP to HTTPS proxy. (That I hope to beat into
    shape and release via news:alt.sources around next Wednesday
    or so. FTR, the code is currently under 600 LoC long, or 431 LoC
    excluding comments and empty lines.) Some design notes are below.

    Sounds like a great start. I'm looking forward to trying it out.

    Same. As the guy who (possibly) triggered this thread, I'm archiving
    all posts. Weather is heavenly but will soon turn less salubrious; my
    winter's firewood is all under cover, and I'll be spending more time
    hunched over the keyboard, trying to keep up with the evolution of
    the web on my own terms.

    Tnx for discussion; checking alt.sources periodically.
    --
    Mike Spencer Nova Scotia, Canada

    Grumpy old geezer

  • From Ivan Shmakov@21:1/5 to All on Tue Sep 25 18:39:32 2018
    XPost: comp.misc

    Ivan Shmakov <ivan@siamics.net> writes:

    It took me about a day to write a crude but apparently (more or
    less) working HTTP to HTTPS proxy. (That I hope to beat into shape
    and release via news:alt.sources around next Wednesday or so. FTR,
    the code is currently under 600 LoC long, or 431 LoC excluding
    comments and empty lines.) Some design notes are below.

    It took much longer (of course), and the code has by now expanded
    about threefold. The HTTP/1 support is much improved, however;
    for instance, request bodies and chunked coding should now be
    fully supported. Moreover, the relevant code was split off into
    a separate HTTP1::MessageStream push-mode parser module (or about
    a third of the overall code currently), allowing it to be used
    in other applications.

    The no-https.perl code proper still needs some clean-up after
    all the modifications it got.

    The command-line interface is roughly as follows. (Not all the
    options are thoroughly tested as of yet, though.)

    Usage:
    $ no-https
    [-d|--[no-]debug] [--listen=BIND|-l BIND] [--mangle=MANGLE]
    [--connect=COMMAND] [--ssl-connect=COMMAND]
    $ no-https {-h|--help}

    BIND is either [HOST:]PORT or, if it includes a /, a file name
    for a Unix socket to create and listen on. The default is 8080.

    COMMAND will have %%, %h, %p replaced with a literal %, target host
    and TCP port, respectively. Also, %s and %t are replaced respectively
    with a space and a TAB.

    MANGLE can be minimal, header, or a name of an App::NoHTTPS::Mangle::
    package to require and use. If not specified, default is tried
    first, falling back to (internally-implemented) header.

    The --connect= and --ssl-connect= options should make it possible
    to utilize a parent proxy, including a SOCKS one, such as that
    provided by Tor, like: --connect="socat STDIO
    SOCKS4:localhost:%h:%p,socksport=9050". For --ssl-connect=,
    a tsocks(1)-wrapped gnutls-cli(1) may be an option.
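
    For instance, an illustrative invocation (not taken from the
    actual README) could be:

      $ no-https --listen=127.0.0.1:8080 --mangle=header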

    Basics

    The basic algorithm is as follows:

    1. receive a request header from the client; we only allow GET and
    HEAD requests for now, as we do not support request /bodies/ as of yet;

    RFC 7230 section 3.3 actually provides simple criteria for
    determining whether the request has a body:

    The presence of a message body in a request is signaled by a
    Content-Length or Transfer-Encoding header field. Request message
    framing is independent of method semantics, even if the method does
    not define any use for a message body.

    As such, and given that message passing was "symmetrized," any
    request method except CONNECT is now allowed by the code.
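
    In code, the test is correspondingly simple; a sketch (assuming,
    for this example only, a hash of lower-cased header field names):

      ## RFC 7230, section 3.3: a request carries a body if and only
      ## if it has a Content-Length: or Transfer-Encoding: field
      sub request_has_body_p {
          my ($header) = @_;
          return (exists ($header->{"content-length"})
                  || exists ($header->{"transfer-encoding"}));
      }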

    2. decide the server and connect there;

    3. send the header to the server;

    Preceded by the request line, obviously. (It was considered
    a part of the header in the original version of the code.)

    4. receive the response header;

    (Same here, for the status line.)

    We also pass any number of "100 Continue" messages here from
    server to client before the "payload" response.

    5. if that's an https: redirect:

    5.1. connect over TLS, alter the request (Host:, "request target") accordingly, go to step 3;

    A Host: header is prepended to the request header if the
    original has none.

    6. strip certain headers (such as Strict-Transport-Security: and
    Upgrade:, but also Set-Cookie:) off the response and send the result
    to the client;

    Both the decision whether to "eat up" the redirect and how to
    alter the header and body of the messages (requests and responses
    alike) are left to the "mangler" object. The object ought to
    implement the following methods.

    $ma->message_mangler (PARSER, URI)
    Return a new mangler object for the given HTTP1::MessageStream
    parser state (either request or response) and request URI.

    Alternatively, return a URI of the resource to transparently
    request instead of the given one.

    Return undef if this mangler has nothing to do with the
    given parser state and URI.

    $ma->parser ([PARSER]), $ma->uri ([URI]),
    $ma->start_line ([START-LINE]), $ma->header ([HEADER])
    Get or set the HTTP1::MessageStream object, URI, HTTP/1
    start line and HTTP/1 header, respectively, associated with
    the particular request.

    $ma->chunked_p ()
    Return a true value if the body ought to be transmitted
    to the remote using chunked coding. (The associated header
    is set up accordingly.)

    $ma->get_mangled_body_part ()
    Return the next part of the (possibly modified) HTTP/1
    message body. This will typically involve a call to the
    parser object to interpret the portion of the message
    currently in its own buffer.

    There are currently two such classes implemented: "minimal" and
    "header," and I believe that the above interface can be used to
    implement rather arbitrary HTTP message filters.

    The "minimal" class removes Upgrade and Proxy-Connection headers
    from the messages (requests and responses alike) and causes the
    calling code to transparently replace all the https: redirects
    with requested resources.

    The "header" class also filters Strict-Transport-Security and
    Set-Cookie off the responses. (Although the former should have
    no effect anyway.)
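
    Against the interface above, a skeletal mangler might look as
    follows. (A hypothetical example, not the actual "minimal" or
    "header" implementation; get_body_part is an assumed
    HTTP1::MessageStream method.)

      package App::NoHTTPS::Mangle::Example;
      use strict;
      use warnings;

      sub message_mangler {
          my ($self, $parser, $uri) = @_;
          ## accept every message; a real mangler may return undef
          ## (decline) or a replacement URI here instead
          bless ({ parser => $parser, uri => $uri }, ref ($self) || $self);
      }

      ## trivial accessors for the per-request state
      sub parser     { my $s = shift; $s->{parser} = shift if (@_); $s->{parser}; }
      sub uri        { my $s = shift; $s->{uri}    = shift if (@_); $s->{uri}; }
      sub start_line { my $s = shift; $s->{line}   = shift if (@_); $s->{line}; }

      sub header {
          my ($self, $header) = @_;
          if (defined ($header)) {
              ## strip the headers this mangler is responsible for
              delete (@$header{qw (upgrade proxy-connection)});
              $self->{header} = $header;
          }
          $self->{header};
      }

      ## pass the body through with its original framing
      sub chunked_p { 0; }

      sub get_mangled_body_part {
          my ($self) = @_;
          ## no alteration: hand over the parser's next bufferful
          $self->{parser}->get_body_part ();
      }

      1;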

    There's a minor issue with the handling of https: redirects.
    When http://example.com/ redirects to https://example.com/foo/bar,
    for instance, the links in the latter document will become
    relative to the former URI (unless the 'base' URI is explicitly
    given in the document); thus <a href="baz" /> will point to
    /baz -- instead of the intended /foo/baz. A likely solution
    is to only eat up http:SAME to https:SAME redirects; an http:SAME
    to https:OTHER redirect is instead rewritten to point to http:OTHER
    (which will then likely result in a redirect to https:OTHER, in
    turn eaten up by the mangler.)

    7. copy up to Content-Length: octets from the server to the client
    -- or all the remaining data if no Content-Length: is given;
    (somewhat surprisingly, this seems to also work with the "chunked"
    coding not otherwise considered in the code);

    Both the chunked coding and client-to-server body passing should
    now be supported (although POST requests remain untested.)

    8. close the connection to the server and repeat from step 1 so long
    as the client connection remains active.

    [...]

    --
    FSF associate member #7257 http://am-1.org/~ivan/

  • From Eli the Bearded@21:1/5 to ivan@siamics.net on Tue Sep 25 22:29:27 2018
    XPost: comp.misc

    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
    Ivan Shmakov <ivan@siamics.net> writes:
    It took me about a day to write a crude but apparently (more or
    less) working HTTP to HTTPS proxy. (That I hope to beat into shape
    and release via news:alt.sources around next Wednesday or so. FTR,
    the code is currently under 600 LoC long, or 431 LoC excluding
    comments and empty lines.) Some design notes are below.

    It took much longer (of course), and the code has by now expanded
    about threefold. The HTTP/1 support is much improved, however;
    for instance, request bodies and chunked coding should now be
    fully supported. Moreover, the relevant code was split off into
    a separate HTTP1::MessageStream push-mode parser module (or about
    a third of the overall code currently), allowing it to be used
    in other applications.

    Sounds interesting. I don't see it in alt.sources here (nor did you
    include a message ID, as I know you have done in the past for such
    things). When do you expect to have a version someone can try out?

    (Will you be posting the code to CPAN?)

    Elijah
    ------
    recalls Ivan dislikes github

  • From Ivan Shmakov@21:1/5 to All on Wed Sep 26 01:05:15 2018
    XPost: comp.misc

    Eli the Bearded <*@eli.users.panix.com> writes:
    In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:

    [...]

    It took much longer (of course), and the code has by now expanded
    about threefold. The HTTP/1 support is much improved, however;
    for instance, request bodies and chunked coding should now be fully
    supported. Moreover, the relevant code was split off into a
    separate HTTP1::MessageStream push-mode parser module (or about
    a third of the overall code currently), allowing it to be used in
    other applications.

    Sounds interesting. I don't see it in alt.sources here (nor did you
    include a message ID, as I know you have done in the past for such
    things). When do you expect to have a version someone can try out?

    Hopefully within this week; I'm still testing the proxy code
    proper, and yet to write the READMEs. (Though by now you should
    be well aware that my estimates can be overly optimistic.)

    (Will you be posting the code to CPAN?)

    One of the later versions; as a dependency, HTTP1::MessageStream
    will take priority here, but no-https.perl will likely follow.

    Elijah ------ recalls Ivan dislikes github

    I by no means single out GitHub here; rather, I dislike any
    platform that requires the user to run proprietary software,
    such as proprietary JavaScript, to operate. Hence, GitLab or
    Savannah sound like much better choices.

    --
    FSF associate member #7257 http://am-1.org/~ivan/

  • From Ivan Shmakov@21:1/5 to All on Thu Oct 4 20:07:49 2018
    XPost: comp.misc

    While I'm yet to make a proper announcement, I'm glad to inform
    anyone interested that the first public version of no-https.perl
    is available from news:alt.sources: news:87r2h5tr8d.fsf@siamics.net.

    --
    FSF associate member #7257 http://am-1.org/~ivan/

  • From Computer Nerd Kev@21:1/5 to Ivan Shmakov on Fri Oct 5 00:11:40 2018
    XPost: comp.misc

    In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
    While I'm yet to make a proper announcement, I'm glad to inform
    anyone interested that the first public version of no-https.perl
    is available from news:alt.sources: news:87r2h5tr8d.fsf@siamics.net.


    Great! I'm looking forward to trying it out once I get the time.

    --
    __ __
    #_ < |\| |< _#
