It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into
shape and release via news:alt.sources around next Wednesday
or so. FTR, the code is currently under 600 LoC long, or 431 LoC
excluding comments and empty lines.) Some design notes are below.
The basic algorithm is as follows:
1. receive a request header from the client; we only allow
GET and HEAD requests for now, as we do not support request
/bodies/ as of yet;
2. decide the server and connect there;
3. send the header to the server;
4. receive the response header;
5. if that's an https: redirect:
5.1. connect over TLS, alter the request (Host:, "request target")
accordingly, go to step 3;
6. strip certain headers (such as Strict-Transport-Security: and
Upgrade:, but also Set-Cookie:) off the response and send the
result to the client;
7. copy up to Content-Length: octets from the server to the
client -- or all the remaining data if no Content-Length:
is given; (somewhat surprisingly, this seems to also work with
the "chunked" coding not otherwise considered in the code);
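Step 7 above amounts to a small copy loop; a hedged Python sketch (the proxy itself is written in Perl, and `copy_body` is my name, not the proxy's):

```python
import io

def copy_body(src, dst, content_length=None, bufsize=8192):
    """Copy the response body from server (src) to client (dst):
    at most content_length octets, or everything until EOF when
    no Content-Length: field was given."""
    remaining = content_length
    while remaining is None or remaining > 0:
        want = bufsize if remaining is None else min(bufsize, remaining)
        chunk = src.read(want)
        if not chunk:          # server closed the connection
            break
        dst.write(chunk)
        if remaining is not None:
            remaining -= len(chunk)
```

With no Content-Length: given, the loop simply drains the connection, which is why it happens to pass "chunked" bodies through unharmed.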
Dealing with the https: references
There was an idea of transparently replacing https: references
in HTML and XML attributes with scheme-relative ones (like, e. g.,
https://example.com/ to //example.com/.) So far, that fails
more often than it works, for two primary reasons: compression
(although that can be solved by forcing Accept-Encoding: identity
in requests) -- and the fact that by the time such filtering can
take place, we've already sent the Content-Length: (if any) for
the original (unaltered) body to the client!
That said, I don't think the https: references /should/ be an
issue in practice, as most of the links ought to be relative
in the first place.
Thoughts?
Eli the Bearded <*@eli.users.panix.com> writes:
In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into shape
and release via news:alt.sources around next Wednesday or so.
FTR, the code is currently under 600 LoC long, or 431 LoC excluding
comments and empty lines.) Some design notes are below.
What language?
The basic algorithm is as follows:
1. receive a request header from the client; we only allow GET and
HEAD requests for now, as we do not support request /bodies/ as of yet;
No POST requests will stop a lot of forms.
HEAD is an easy case, but largely unused.
6. strip certain headers (such as Strict-Transport-Security: and
Upgrade:, but also Set-Cookie:) off the response and send the result
to the client;
That probably covers it. If you change HTTP/1.1 to HTTP/1.0 on
the requests, then 1% of servers will have issues and 50% fewer
servers will send chunked responses. (Numbers made up, based on
my experiences.)
You can also drop Accept-Encoding: if you want to avoid dealing with compressed responses.
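The stripping in step 6, plus the Accept-Encoding: suggestion above, boils down to a small case-insensitive block-list; a Python sketch (names are mine, not the proxy's):

```python
# Fields removed from responses in step 6 (Strict-Transport-Security:,
# Upgrade:, Set-Cookie:), plus Accept-Encoding:, which would be dropped
# from *requests* to avoid compressed responses.
STRIP = {"strict-transport-security", "upgrade", "set-cookie",
         "accept-encoding"}

def filter_headers(headers):
    """Drop block-listed fields, matching names case-insensitively,
    from a header represented as a list of (name, value) pairs."""
    return [(n, v) for n, v in headers if n.lower() not in STRIP]
```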
There was an idea of transparently replacing https: references in
HTML and XML attributes with scheme-relative ones (like, e. g.,
https://example.com/ to //example.com/.) So far, that fails more
often than it works, for two primary reasons: compression (although
that can be solved by forcing Accept-Encoding: identity in requests)
-- and the fact that by the time such filtering can take place,
we've already sent the Content-Length: (if any) for the original
(unaltered) body to the client!
You can fix that with whitespace padding.
<img src="https://qaz.wtf/tmp/chree.png" ...>
<img src="//qaz.wtf/tmp/chree.png" ...>
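The padding trick can be sketched as follows: shorten `https://` references to scheme-relative ones, then pad the tag with spaces before the closing `>` so the byte count -- and hence the already-sent Content-Length: -- is unchanged. (A regex sketch for the easy double-quoted case only; the function name is hypothetical.)

```python
import re

def pad_rewrite(tag: str) -> str:
    """Rewrite https: references in double-quoted attributes to
    scheme-relative ones, space-padding before the final '>' so the
    tag keeps its original byte length."""
    n = len(tag)
    out = re.sub(r'="https://', '="//', tag)
    # each substitution drops 6 octets; restore them as padding
    return out[:-1] + " " * (n - len(out)) + out[-1]
```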
Beware of parsing issues.
Real world HTML usually looks like one of the first two but may
sometimes look like one of the second two of these:
<img src="https://qaz.wtf/tmp/chree.png" ...>
<img src='https://qaz.wtf/tmp/chree.png' ...>
<img src=https://qaz.wtf/tmp/chree.png ...>
<img src = "https://qaz.wtf/tmp/chree.png" ...>
(And that's ignoring case.)
Thoughts?
Are you going to fix Referer: headers to use the https: version when communicating with an https site? I think you probably should.
Elijah ------ only forces https on his site for the areas that
require login
Rich <rich@example.invalid> writes:
In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
OTOH, I don't think I've ever seen the " = " form; do the blanks
around the equals sign even conform to any HTML version?
The HTML spec does not appear to explicitly exclude use of spaces
around the equals sign. So unless there is an explicit exclusion
somewhere that I've missed, it would be legal to add spaces around
the equals.
The fact is, even ignoring the spaced equals item, that HTML is
"flexible" enough that if you get to the point of wanting to do
rewriting/editing that you'll have way less "pull your hair out"
issues if you make use of an HTML parser to parse the HTML instead
of trying to do anything by string or regex search/replace on the
HTML. Anything string/regex search based on HTML will appear to
work ok until the day it hits a legal bit of HTML it was not
designed to handle, then it will break badly.
I. e., the "edge conditions" are so numerous that you are better
off using a parser that has already been designed to handle those
edge conditions.
Eli the Bearded <*@eli.users.panix.com> writes:
Real world HTML usually looks like one of the first two but may
sometimes look like one of the second two of these:
<img src="https://qaz.wtf/tmp/chree.png" ...>
<img src='https://qaz.wtf/tmp/chree.png' ...>
<img src=https://qaz.wtf/tmp/chree.png ...>
<img src = "https://qaz.wtf/tmp/chree.png" ...>
(And that's ignoring case.)
Indeed; and case and lack of quotes will require special-casing
for HTML (I aim to support XML applications as well, which
fortunately are somewhat simpler in this respect.)
OTOH, I don't think I've ever seen the " = " form; do the
blanks around the equals sign even conform to any HTML
version?
Rich <rich@example.invalid> writes:
The HTML spec does not appear to explicitly exclude use of spaces
around the equals sign.
Does it explicitly allow spaces?
Rich <rich@example.invalid> writes:
In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
[...]
OTOH, I don't think I've ever seen the " = " form; do the blanks
around the equals sign even conform to any HTML version?
The HTML spec does not appear to explicitly exclude use of spaces
around the equals sign. So unless there is an explicit exclusion
somewhere that I've missed, it would be legal to add spaces around
the equals.
Does it explicitly allow spaces?
The fact is, even ignoring the spaced equals item, that HTML is
"flexible" enough that if you get to the point of wanting to do rewriting/editing that you'll have way less "pull your hair out"
issues if you make use of an HTML parser to parse the HTML instead
of trying to do anything by string or regex search/replace on the
HTML. Anything string/regex search based on HTML will appear to
work ok until the day it hits a legal bit of HTML it was not
designed to handle, then it will break badly.
I. e., the "edge conditions" are so numerous that you are better
off using a parser that has already been designed to handle those
edge conditions.
I tend to agree with the above for the general case: where I'd
expect the code to /fail/ if it encounters something it does
not understand.
In this case, something that the code does not understand
ought to be left untouched, and I'm unsure if I can readily
get an HTML parser that does that.
Rich <rich@example.invalid>:
The HTML spec does not appear to explicitly exclude use of spaces
around the equals sign. So unless there is an explicit exclusion
somewhere that I've missed, it would be legal to add spaces around the
equals.
No need to guess or improvise. The W3 consortium has provided an
explicit pseudocode implementation of an HTML parser:
<URL: https://www.w3.org/TR/html52/syntax.html#syntax>
Rich <rich@example.invalid> writes:
In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
Rich <rich@example.invalid> writes:
I. e., the "edge conditions" are so numerous that you are better
off using a parser that has already been designed to handle those
edge conditions.
I tend to agree with the above for the general case: where I'd
expect the code to /fail/ if it encounters something it does not
understand.
In this case, something that the code does not understand ought
to be left untouched, and I'm unsure if I can readily get an HTML
parser that does that.
That is, of course, always the final 'out' for something so broken
that the 'content modification' module fails.
The difference is that you'll significantly reduce the number of
failure instances by using a parser to handle the parsing of the
incoming HTML, then passing the parse tree off to the 'content
modification' module vs. trying to do content modification with
string matching and/or regex matching (both of which are
essentially creating weak 'parsers' that only handle a small
subset of the full possibilities allowed).
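The point can be illustrated with the tokenizer in Python's standard library (a sketch only -- the proxy itself is Perl): the parser absorbs double quotes, single quotes, missing quotes, and spaced equals signs without any special-casing on the caller's part.

```python
from html.parser import HTMLParser

class HttpsRefFinder(HTMLParser):
    """Collect (tag, attribute, value) triples whose value carries an
    https: scheme; the stdlib tokenizer copes with the quoting and
    whitespace variants discussed above."""
    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, already
        # normalized regardless of the quoting style in the source
        for name, value in attrs:
            if value and value.startswith("https://"):
                self.refs.append((tag, name, value))
```

All four forms of the <img> example above come out of handle_starttag() identically.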
In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into
shape and release via news:alt.sources around next Wednesday
or so. FTR, the code is currently under 600 LoC long, or 431 LoC
excluding comments and empty lines.) Some design notes are below.
Sounds like a great start. I'm looking forward to trying it out.
Ivan Shmakov <ivan@siamics.net> writes:
It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into
shape and release via news:alt.sources around next Wednesday or
so. FTR, the code is currently under 600 LoC long, or 431 LoC
excluding comments and empty lines.) Some design notes are below.
Basics
The basic algorithm is as follows:
1. receive a request header from the client; we only allow GET and
HEAD requests for now, as we do not support request /bodies/ as of yet;
2. decide the server and connect there;
3. send the header to the server;
4. receive the response header;
5. if that's an https: redirect:
5.1. connect over TLS, alter the request (Host:, "request target") accordingly, go to step 3;
6. strip certain headers (such as Strict-Transport-Security: and
Upgrade:, but also Set-Cookie:) off the response and send the result
to the client;
7. copy up to Content-Length: octets from the server to the client --
or all the remaining data if no Content-Length: is given; (somewhat
surprisingly, this seems to also work with the "chunked" coding not
otherwise considered in the code);
8. close the connection to the server and repeat from step 1 so long
as the client connection remains active.
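Step 5 reduces to a status-and-Location: check; a hypothetical Python sketch (the proxy itself is Perl, and the function name is mine):

```python
def https_redirect_target(status, headers):
    """If the response is a redirect whose Location: uses the https:
    scheme, return that target (step 5); otherwise return None and
    let the response pass through unchanged."""
    if status in (301, 302, 303, 307, 308):
        fields = {n.lower(): v for n, v in headers}
        loc = fields.get("location", "")
        if loc.startswith("https://"):
            return loc
    return None
```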
Ivan Shmakov <ivan@siamics.net> writes:
It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into
shape and release via news:alt.sources around next Wednesday or
so. FTR, the code is currently under 600 LoC long, or 431 LoC
excluding comments and empty lines.) Some design notes are below.
It took much longer (of course), and the code has by now expanded
about threefold. The HTTP/1 support is much improved, however;
for instance, request bodies and chunked coding should now be
fully supported. Moreover, the relevant code was split off into
a separate HTTP1::MessageStream push-mode parser module (or about
a third of the overall code currently), allowing it to be used
in other applications.
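A push-mode parser accepts octets as they arrive and only yields a result once enough have accumulated; a toy Python analogue of the interface (the real HTTP1::MessageStream is a Perl module and also covers bodies and chunked coding):

```python
class MessageStream:
    """Toy push-mode header parser: feed() octets in arbitrary
    pieces; header() returns (start_line, [(name, value), ...])
    once the empty line ending the header has arrived, and None
    before that."""
    def __init__(self):
        self.buf = b""

    def feed(self, octets):
        self.buf += octets

    def header(self):
        end = self.buf.find(b"\r\n\r\n")
        if end < 0:
            return None    # header not complete yet
        lines = self.buf[:end].split(b"\r\n")
        fields = [tuple(part.strip() for part in line.split(b":", 1))
                  for line in lines[1:]]
        return lines[0], fields
```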
Eli the Bearded <*@eli.users.panix.com> writes:
In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
It took much longer (of course), and the code has by now expanded
about threefold. The HTTP/1 support is much improved, however;
for instance, request bodies and chunked coding should now be fully
supported. Moreover, the relevant code was split off into a
separate HTTP1::MessageStream push-mode parser module (or about
a third of the overall code currently), allowing it to be used in
other applications.
Sounds interesting. I don't see it in alt.sources here (nor did you
include a message ID, as I know you have done in the past for such
things). When do you expect to have a version someone can try out?
(Will you be posting the code to CPAN?)
Elijah ------ recalls Ivan dislikes github
While I'm yet to make a proper announce, I'm glad to inform
anyone interested that the first public version of no-https.perl
is available from news:alt.sources: news:87r2h5tr8d.fsf@siamics.net.