It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into
shape and release via news:alt.sources around next Wednesday
or so. FTR, the code is currently under 600 LoC long, or 431 LoC
excluding comments and empty lines.) Some design notes are below.
The basic algorithm is as follows:
1. receive a request header from the client; we only allow
GET and HEAD requests for now, as we do not support request
/bodies/ as of yet;
2. decide the server and connect there;
3. send the header to the server;
4. receive the response header;
5. if that's an https: redirect:
5.1. connect over TLS, alter the request (Host:, "request target")
accordingly, go to step 3;
6. strip certain headers (such as Strict-Transport-Security: and
Upgrade:, but also Set-Cookie:) off the response and send the
result to the client;
7. copy up to Content-Length: octets from the server to the
client -- or all the remaining data if no Content-Length:
is given; (somewhat surprisingly, this seems to also work with
the "chunked" coding not otherwise considered in the code);
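Step 7 above amounts to a small copy loop; a hedged Python sketch (the proxy itself is written in Perl, and `copy_body` is my name, not the proxy's):

```python
import io

def copy_body(src, dst, content_length=None, bufsize=8192):
    """Copy the response body from server (src) to client (dst):
    at most content_length octets, or everything until EOF when
    no Content-Length: field was given."""
    remaining = content_length
    while remaining is None or remaining > 0:
        want = bufsize if remaining is None else min(bufsize, remaining)
        chunk = src.read(want)
        if not chunk:          # server closed the connection
            break
        dst.write(chunk)
        if remaining is not None:
            remaining -= len(chunk)
```

With no Content-Length: given, the loop simply drains the connection, which is why it happens to pass "chunked" bodies through unharmed.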
Dealing with the https: references
There was an idea of transparently replacing https: references
in HTML and XML attributes with scheme-relative ones (like, e. g.,
https://example.com/ to //example.com/.) So far, that fails
more often than it works, for two primary reasons: compression
(although that can be solved by forcing Accept-Encoding: identity
in requests) -- and the fact that by the time such filtering can
take place, we've already sent the Content-Length: (if any) for
the original (unaltered) body to the client!
That said, I don't think the https: references /should/ be an
issue in practice, as most of the links ought to be relative
in the first place.
Thoughts?
Eli the Bearded <*@eli.users.panix.com> writes:
In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into shape
and release via news:alt.sources around next Wednesday or so.
FTR, the code is currently under 600 LoC long, or 431 LoC excluding
comments and empty lines.) Some design notes are below.
What language?
The basic algorithm is as follows:
1. receive a request header from the client; we only allow GET and
HEAD requests for now, as we do not support request /bodies/ as of yet;
No POST requests will stop a lot of forms.
HEAD is an easy case, but largely unused.
6. strip certain headers (such as Strict-Transport-Security: and
Upgrade:, but also Set-Cookie:) off the response and send the result
to the client;
That probably covers it. If you change HTTP/1.1 to HTTP/1.0 on
the requests, then 1% of servers will have issues and 50% fewer
servers will send chunked responses. (Numbers made up, based on
my experiences.)
You can also drop Accept-Encoding: if you want to avoid dealing with compressed responses.
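The stripping in step 6, plus the Accept-Encoding: suggestion above, boils down to a small case-insensitive block-list; a Python sketch (names are mine, not the proxy's):

```python
# Fields removed from responses in step 6 (Strict-Transport-Security:,
# Upgrade:, Set-Cookie:), plus Accept-Encoding:, which would be dropped
# from *requests* to avoid compressed responses.
STRIP = {"strict-transport-security", "upgrade", "set-cookie",
         "accept-encoding"}

def filter_headers(headers):
    """Drop block-listed fields, matching names case-insensitively,
    from a header represented as a list of (name, value) pairs."""
    return [(n, v) for n, v in headers if n.lower() not in STRIP]
```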
There was an idea of transparently replacing https: references in
HTML and XML attributes with scheme-relative ones (like, e. g.,
https://example.com/ to //example.com/.) So far, that fails more
often than it works, for two primary reasons: compression (although
that can be solved by forcing Accept-Encoding: identity in requests)
-- and the fact that by the time such filtering can take place,
we've already sent the Content-Length: (if any) for the original
(unaltered) body to the client!
You can fix that with whitespace padding.
<img src="https://qaz.wtf/tmp/chree.png" ...>
<img src="//qaz.wtf/tmp/chree.png" ...>
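The padding trick can be sketched as follows: shorten `https://` references to scheme-relative ones, then pad the tag with spaces before the closing `>` so the byte count -- and hence the already-sent Content-Length: -- is unchanged. (A regex sketch for the easy double-quoted case only; the function name is hypothetical.)

```python
import re

def pad_rewrite(tag: str) -> str:
    """Rewrite https: references in double-quoted attributes to
    scheme-relative ones, space-padding before the final '>' so the
    tag keeps its original byte length."""
    n = len(tag)
    out = re.sub(r'="https://', '="//', tag)
    # each substitution drops 6 octets; restore them as padding
    return out[:-1] + " " * (n - len(out)) + out[-1]
```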
Beware of parsing issues.
Real world HTML usually looks like one of the first two but may
sometimes look like one of the second two of these:
<img src="https://qaz.wtf/tmp/chree.png" ...>
<img src='https://qaz.wtf/tmp/chree.png' ...>
<img src=https://qaz.wtf/tmp/chree.png ...>
<img src = "https://qaz.wtf/tmp/chree.png" ...>
(And that's ignoring case.)
Thoughts?
Are you going to fix Referer: headers to use the https: version when communicating with an https site? I think you probably should.
Elijah ------ only forces https on his site for the areas that
require login
Rich <rich@example.invalid> writes:
In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
OTOH, I don't think I've ever seen the " = " form; do the blanks
around the equals sign even conform to any HTML version?
The HTML spec does not appear to explicitly exclude use of spaces
around the equals sign. So unless there is an explicit exclusion
somewhere that I've missed, it would be legal to add spaces around
the equals.
The fact is, even ignoring the spaced equals item, that HTML is
"flexible" enough that if you get to the point of wanting to do
rewriting/editing that you'll have way less "pull your hair out"
issues if you make use of an HTML parser to parse the HTML instead
of trying to do anything by string or regex search/replace on the
HTML. Anything string/regex search based on HTML will appear to
work ok until the day it hits a legal bit of HTML it was not
designed to handle, then it will break badly.
I. e., the "edge conditions" are so numerous that you are better
off using a parser that has already been designed to handle those
edge conditions.
Eli the Bearded <*@eli.users.panix.com> writes:
Real world HTML usually looks like one of the first two but may
sometimes look like one of the second two of these:
<img src="https://qaz.wtf/tmp/chree.png" ...>
<img src='https://qaz.wtf/tmp/chree.png' ...>
<img src=https://qaz.wtf/tmp/chree.png ...>
<img src = "https://qaz.wtf/tmp/chree.png" ...>
(And that's ignoring case.)
Indeed; and case and lack of quotes will require special-casing
for HTML (I aim to support XML applications as well, which
fortunately are somewhat simpler in this respect.)
OTOH, I don't think I've ever seen the " = " form; do the
blanks around the equals sign even conform to any HTML
version?
Rich <rich@example.invalid> writes:
The HTML spec does not appear to explicitly exclude use of spaces
around the equals sign.
Does it explicitly allow spaces?
Rich <rich@example.invalid> writes:
In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
[...]
OTOH, I don't think I've ever seen the " = " form; do the blanks
around the equals sign even conform to any HTML version?
The HTML spec does not appear to explicitly exclude use of spaces
around the equals sign. So unless there is an explicit exclusion
somewhere that I've missed, it would be legal to add spaces around
the equals.
Does it explicitly allow spaces?
The fact is, even ignoring the spaced equals item, that HTML is
"flexible" enough that if you get to the point of wanting to do rewriting/editing that you'll have way less "pull your hair out"
issues if you make use of an HTML parser to parse the HTML instead
of trying to do anything by string or regex search/replace on the
HTML. Anything string/regex search based on HTML will appear to
work ok until the day it hits a legal bit of HTML it was not
designed to handle, then it will break badly.
I. e., the "edge conditions" are so numerous that you are better
off using a parser that has already been designed to handle those
edge conditions.
I tend to agree with the above for the general case: where I'd
expect the code to /fail/ if it encounters something it does
not understand.
In this case, something that the code does not understand
ought to be left untouched, and I'm unsure if I can readily
get an HTML parser that does that.
Rich <rich@example.invalid>:
The HTML spec does not appear to explicitly exclude use of spaces
around the equals sign. So unless there is an explicit exclusion
somewhere that I've missed, it would be legal to add spaces around the
equals.
No need to guess or improvise. The W3 consortium has provided an
explicit pseudocode implementation of an HTML parser:
<URL: https://www.w3.org/TR/html52/syntax.html#syntax>
Rich <rich@example.invalid> writes:
In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
Rich <rich@example.invalid> writes:
I. e., the "edge conditions" are so numerous that you are better
off using a parser that has already been designed to handle those
edge conditions.
I tend to agree with the above for the general case: where I'd
expect the code to /fail/ if it encounters something it does not
understand.
In this case, something that the code does not understand ought
to be left untouched, and I'm unsure if I can readily get an HTML
parser that does that.
That is, of course, always the final 'out' for something so broken
that the 'content modification' module fails.
The difference is that you'll significantly reduce the number of
failure instances by using a parser to handle the parsing of the
incoming HTML, then passing the parse tree off to the 'content
modification' module vs. trying to do content modification with
string matching and/or regex matching (both of which are
essentially creating weak 'parsers' that only handle a small
subset of the full possibilities allowed).
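The point can be illustrated with the tokenizer in Python's standard library (a sketch only -- the proxy itself is Perl): the parser absorbs double quotes, single quotes, missing quotes, and spaced equals signs without any special-casing on the caller's part.

```python
from html.parser import HTMLParser

class HttpsRefFinder(HTMLParser):
    """Collect (tag, attribute, value) triples whose value carries an
    https: scheme; the stdlib tokenizer copes with the quoting and
    whitespace variants discussed above."""
    def __init__(self):
        super().__init__()
        self.refs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs, already
        # normalized regardless of the quoting style in the source
        for name, value in attrs:
            if value and value.startswith("https://"):
                self.refs.append((tag, name, value))
```

All four forms of the <img> example above come out of handle_starttag() identically.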
In comp.misc Ivan Shmakov <ivan@siamics.net> wrote:
It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into
shape and release via news:alt.sources around next Wednesday
or so. FTR, the code is currently under 600 LoC long, or 431 LoC
excluding comments and empty lines.) Some design notes are below.
Sounds like a great start. I'm looking forward to trying it out.
Ivan Shmakov <ivan@siamics.net> writes:
It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into
shape and release via news:alt.sources around next Wednesday or
so. FTR, the code is currently under 600 LoC long, or 431 LoC
excluding comments and empty lines.) Some design notes are below.
Basics
The basic algorithm is as follows:
1. receive a request header from the client; we only allow GET and
HEAD requests for now, as we do not support request /bodies/ as of yet;
2. decide the server and connect there;
3. send the header to the server;
4. receive the response header;
5. if that's an https: redirect:
5.1. connect over TLS, alter the request (Host:, "request target") accordingly, go to step 3;
6. strip certain headers (such as Strict-Transport-Security: and
Upgrade:, but also Set-Cookie:) off the response and send the result
to the client;
7. copy up to Content-Length: octets from the server to the client --
or all the remaining data if no Content-Length: is given; (somewhat
surprisingly, this seems to also work with the "chunked" coding not
otherwise considered in the code);
8. close the connection to the server and repeat from step 1 so long
as the client connection remains active.
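Step 5 reduces to a status-and-Location: check; a hypothetical Python sketch (the proxy itself is Perl, and the function name is mine):

```python
def https_redirect_target(status, headers):
    """If the response is a redirect whose Location: uses the https:
    scheme, return that target (step 5); otherwise return None and
    let the response pass through unchanged."""
    if status in (301, 302, 303, 307, 308):
        fields = {n.lower(): v for n, v in headers}
        loc = fields.get("location", "")
        if loc.startswith("https://"):
            return loc
    return None
```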
Ivan Shmakov <ivan@siamics.net> writes:
It took me about a day to write a crude but apparently (more or
less) working HTTP to HTTPS proxy. (That I hope to beat into
shape and release via news:alt.sources around next Wednesday or
so. FTR, the code is currently under 600 LoC long, or 431 LoC
excluding comments and empty lines.) Some design notes are below.
It took much longer (of course), and the code has by now expanded
about threefold. The HTTP/1 support is much improved, however;
for instance, request bodies and chunked coding should now be
fully supported. Moreover, the relevant code was split off into
a separate HTTP1::MessageStream push-mode parser module (or about
a third of the overall code currently), allowing it to be used
in other applications.
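A push-mode parser accepts octets as they arrive and only yields a result once enough have accumulated; a toy Python analogue of the interface (the real HTTP1::MessageStream is a Perl module and also covers bodies and chunked coding):

```python
class MessageStream:
    """Toy push-mode header parser: feed() octets in arbitrary
    pieces; header() returns (start_line, [(name, value), ...])
    once the empty line ending the header has arrived, and None
    before that."""
    def __init__(self):
        self.buf = b""

    def feed(self, octets):
        self.buf += octets

    def header(self):
        end = self.buf.find(b"\r\n\r\n")
        if end < 0:
            return None    # header not complete yet
        lines = self.buf[:end].split(b"\r\n")
        fields = [tuple(part.strip() for part in line.split(b":", 1))
                  for line in lines[1:]]
        return lines[0], fields
```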
Eli the Bearded <*@eli.users.panix.com> writes:
In comp.infosystems.www.misc, Ivan Shmakov <ivan@siamics.net> wrote:
It took much longer (of course), and the code has by now expanded
about threefold. The HTTP/1 support is much improved, however;
for instance, request bodies and chunked coding should now be fully
supported. Moreover, the relevant code was split off into a
separate HTTP1::MessageStream push-mode parser module (or about
a third of the overall code currently), allowing it to be used in
other applications.
Sounds interesting. I don't see it in alt.sources here (nor did you
include a message ID, as I know you have done in the past for such
things). When do you expect to have a version someone can try out?
(Will you be posting the code to CPAN?)
Elijah ------ recalls Ivan dislikes github
While I'm yet to make a proper announce, I'm glad to inform
anyone interested that the first public version of no-https.perl
is available from news:alt.sources: news:87r2h5tr8d.fsf@siamics.net.