I'm trying, and failing, to write the proper charset in my meta tag. Help, please!
I have this line in the <head> of my Web pages:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But perfectly decent characters like é, ×, ² show up as a question mark in a lozenge. I figured out that that's because my HTML files are all plain text, one byte per character, which is not UTF-8 when I use characters above 127.
So I changed the charset to latin-1, and then to iso-8859-1. With each of them, characters 160-255 display correctly, but the W3C's validator gives this error message:
Bad value “text/html; charset=iso-8859-1” for attribute “content” on element “meta”: “charset=” must be followed by “utf-8”
So what charset should I use to represent a file where every character is 8 bits, and those 8 bits match the iso-8859-1 or latin-1 character set?
To make things even more murky, at https://developer.mozilla.org/en-US/docs/Web/HTML/Element/meta#attr-charset I found this gem: "If the attribute is present, its value must be an ASCII case-insensitive match for the string "utf-8", because UTF-8 is the only valid encoding for HTML5 documents."
If that's true, it sounds very much like I can't generate my web pages unless I code every 160-255 character as a six-byte &#nnn; string, which is not only a pain but makes editing harder.
(I tried looking at character encodings in Vim, and indeed it does have a utf-8 option, but after I do my editing I run all my pages through a very complicated awk script, and it looks like awk can't handle UTF-8, at least not in Windows.)
--
Stan Brown, Tehachapi, California, USA
https://BrownMath.com/
https://OakRoadSystems.com/
HTML 4.01 spec: http://www.w3.org/TR/html401/
validator: http://validator.w3.org/
CSS 2.1 spec: http://www.w3.org/TR/CSS21/
validator: http://jigsaw.w3.org/css-validator/
Why We Won't Help You: http://preview.tinyurl.com/WhyWont
On Thu, 15 Oct 2020, Eli the Bearded wrote:
I can't tell for sure without seeing your page, [...]
Just tell us the URL (the web address) where we can see your page, and we will thus discover
• what charset you are really using
• what the web server says about it
• what the web page tells about it
• what the default charset would be if none of the above applies
and whether these four contradict each other.
In comp.infosystems.www.authoring.html,
Stan Brown <the_stan_brown@fastmail.fm> wrote:
I have this line in the <head> of my Web pages:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But perfectly decent characters like é, ×, ² show up as a question
mark in a lozenge. I figured out that that's because my HTML files
are all plain text, 8 characters per byte, which is not UTF8 when I
use characters above 127.
The only thing that is plain text is US-ASCII, 0 to 127. Beyond that
it's all not plain.
So I changed the charset to latin-1, and then to iso-8859-1. With
each of them, characters 160-255 display correctly, but the W3C's
validator gives this error message:
Bad value “text/html; charset=iso-8859-1” for attribute
“content” on element “meta”: “charset=” must be followed by “utf-8”
So what charset should I use to represent a file where every...
character is 8 bits, and those 8 bits match the iso-8859-1 or latin-1 character set?
I found this gem: "If the attribute is present, its value must be an
ASCII case-insensitive match for the string "utf-8", because UTF-8 is
the only valid encoding for HTML5 documents."
I can't tell for sure without seeing your page, but I think you are running into this: the declared document type specifies an allowed list of "charset"s, and your document must be conformant to that document type. One fix is to declare your document to be a type that allows the charset you feel you need to use, eg some variant of HTML4. Another fix is to find a compatible charset from the allowed list.
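Eli's first fix can be sketched concretely. A minimal, hypothetical HTML 4.01 document that declares iso-8859-1, which the HTML 4.01 DTD permits while an HTML5 validator would reject:

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html401/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Charset example</title>
</head>
<body>
<p>caf&eacute; &times; &sup2;</p>
</body>
</html>
```

(The body uses named entities only to keep this sketch pure ASCII; with the iso-8859-1 declaration the raw bytes would also work.)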
Stan Brown:
I'm trying, and failing, to write the proper charset in my meta tag.
Help, please!
I have this line in the <head> of my Web pages:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But perfectly decent characters like é, ×, ² show up as a question
mark in a lozenge. I figured out that that's because my HTML files
are all plain text, 8 characters per byte, which is not UTF8 when I
use characters above 127.
So I changed the charset to latin-1, and then to iso-8859-1. With
each of them, characters 160-255 display correctly, but the W3C's
validator gives this error message:
Bad value “text/html; charset=iso-8859-1” for attribute
“content” on element “meta”: “charset=” must be followed by “utf-8”
Did you try <meta charset="ISO-8859-1">?
On Fri, 16 Oct 2020 10:06:59 +0200, Arno Welzel wrote:
Stan Brown:
I'm trying, and failing, to write the proper charset in my meta tag.
Help, please!
I have this line in the <head> of my Web pages:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
But perfectly decent characters like é, ×, ² show up as a question
mark in a lozenge. I figured out that that's because my HTML files
are all plain text, 8 characters per byte, which is not UTF8 when I
use characters above 127.
So I changed the charset to latin-1, and then to iso-8859-1. With
each of them, characters 160-255 display correctly, but the W3C's
validator gives this error message:
Bad value “text/html; charset=iso-8859-1” for attribute
“content” on element “meta”: “charset=” must be followed by “utf-8”
Did you try <meta charset="ISO-8859-1">?
Yes. In HTML 4.01 and 5, same problem as in the longer form
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Stan Brown:
On Fri, 16 Oct 2020 10:06:59 +0200, Arno Welzel wrote:
Did you try <meta charset="ISO-8859-1">?
Yes. In HTML 4.01 and 5, same problem as in the longer form
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Indeed - HTML 4 does not know anything about the charset attribute, and for HTML 5 using UTF-8 is a requirement. In fact this is the *only* allowed encoding for HTML 5. So you have to convert your existing documents to UTF-8 before publishing them.
Also see here:
<https://html.spec.whatwg.org/multipage/semantics.html#character-encoding-declaration>
4.2.5.4 Specifying the document's character encoding
The Encoding standard requires use of the UTF-8 character encoding and requires use of the "utf-8" encoding label to identify it.
To enforce the above rules, authoring tools must default to using UTF-8
for newly-created documents.
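Converting existing Latin-1 files, as Arno suggests, can be done mechanically. A minimal sketch in Python (the function name and the in-place re-encoding approach are mine, not from the thread):

```python
from pathlib import Path

def to_utf8(path, src_encoding="iso-8859-1"):
    """Re-encode a text file from src_encoding to UTF-8 in place."""
    p = Path(path)
    text = p.read_text(encoding=src_encoding)   # decode the old bytes
    p.write_text(text, encoding="utf-8")        # write the same characters as UTF-8

# "é" is the single byte 0xE9 in ISO-8859-1 but the two bytes 0xC3 0xA9
# in UTF-8; the visible text is unchanged, only the byte layout differs.
```

The same job can of course be done with iconv or an editor's encoding option; the point is that no hand-editing of the content is needed.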
4.2.5.4 Specifying the document's character encoding
The Encoding standard requires use of the UTF-8 character encoding and requires use of the "utf-8" encoding label to identify it.
To enforce the above rules, authoring tools must default to using UTF-8
for newly-created documents.
Well, heck! It seems unfortunate that they would retroactively change
the HTML 4.01 standard, which I am 100% certain allowed other
charsets for quite a few years.
It seems like my only options are to completely redesign how I produce Web pages, or to declare utf-8, but only use characters 000-127 and use numeric references for everything >=160, which will bloat my documents.
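For what it's worth, the numeric-reference fallback Stan describes can at least be automated; Python's codec machinery does exactly this substitution (a sketch, not from the thread):

```python
def to_ascii_refs(text):
    """Replace every non-ASCII character with a decimal &#NNN; reference,
    leaving a pure-ASCII string."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# to_ascii_refs("café ×") == "caf&#233; &#215;"
```

It does not remove the bloat objection, but it removes the "pain" part: the source can stay in Latin-1 and be filtered on the way out.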
Stan Brown wrote:
I have this line in the <head> of my Web pages:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Just remove it, unless it matches the actual encoding used.
You should notice that
<meta http-equiv="Content-Type" content="text/html; charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. meta_http-equiv is a hint to the web server to declare the content type and encoding via the HTTP protocol. This hint is actually ignored by the web server you use; I see only "Content-Type: text/html" appearing¹).
By many browsers it is also interpreted as if it were a declaration of the encoding used in the document – this is why it works and will probably work as long as HTML4 documents exist and are interpreted by browsers. But strictly speaking, it is not a usage of anything that is well-defined in HTML4.
meta_charset is indeed a declaration of the encoding used in the document, albeit meaningless as there is no choice.
¹) The full answer of the web server to the browser's request for https://brownmath.com/Charsets/charset_utf-8_html4.htm was:
HTTP/1.1 200 OK
Server: nginx
Date: Fri, 16 Oct 2020 16:12:13 GMT
Content-Type: text/html
Content-Length: 798
Connection: keep-alive
Last-Modified: Fri, 16 Oct 2020 13:43:53 GMT
ETag: "31e-5b1c9f48d5840"
alt-svc: quic=":443"; ma=86400; v="43,39"
Host-Header: 5d77dd967d63c3104bced1db0cace49c
X-Proxy-Cache: MISS
Accept-Ranges: bytes
So, you are not in a hurry to change anything, but you should have a plan for the future. You can even validate your non-UTF-8 HTML files:
* Declare them as HTML4, otherwise it will complain that only UTF-8 is allowed.
* Before starting the validator, check "More Options" and fill in the correct encoding.
I tried it out with https://brownmath.com/Charsets/charset_utf-8_html4.htm, and it worked.
I consider the behaviour of the validator extremely user-unfriendly. When people use habits that were not only tolerated but even recommended in the past, it could give a hint that, and why, they are no longer supported, and what to do instead.
It seems like my only options are to completely redesign how I
produce Web pages, or to declare utf-8, but only use characters 000-
127 and use numeric references for everything >=160, which will bloat
my documents.
I am not sure it requires a complete redesign. When I changed to UTF-8, I had only to tell the editor I use that it should encode in UTF-8 instead of ISO-8859-1. Well, I work on a Unix system, and the editor I use is emacs, which has such an option.
On Fri, 16 Oct 2020 23:42:47 +0300, Jukka K. Korpela wrote:
Stan Brown wrote:
I have this line in the <head> of my Web pages:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Just remove it, unless it matches the actual encoding used.
Brilliant! I tried with no <meta .. charset> tag. The characters were displayed correctly in the HTML5 and HTML4.01 versions, and the HTML5
version passed validation. (The W3C validator failed the HTML4.01
version with "obsolete DOCTYPE", which seems a bit harsh.) The
revised examples are at <URL:https://brownmath.com/Charsets/>.
I know that encoding is complicated, but just because the characters
are displayed correctly in my browsers, is it safe to assume they'll
be correct in (the great majority of) other browsers?
I guess in a way I'm asking: what figures out the document encoding
if it's not specified, the Web server or the user-agent? If it's the
server, then the fact that they worked for me says they should work
for anyone. But if it's the browser, maybe not so much.
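On Stan's question, the server only *labels* the encoding, via the charset parameter of the Content-Type header; if that parameter is absent, the user-agent has to guess. The label itself is easy to inspect; a small sketch using the Python stdlib (the helper name is mine):

```python
from email.message import Message

def charset_of(content_type_value):
    """Return the charset parameter of a Content-Type header value,
    lower-cased, or None if no charset parameter is present."""
    msg = Message()
    msg["Content-Type"] = content_type_value
    return msg.get_content_charset()

# charset_of("text/html") is None              -> the browser must guess
# charset_of("text/html; charset=ISO-8859-1")  -> "iso-8859-1"
```

A header without the parameter, like the one Stan's server sends, is exactly the "browser decides" case.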
The only thing that is plain text is US-ASCII, 0 to 127. Beyond that
it's all not plain.
Eli the Bearded:
The only thing that is plain text is US-ASCII, 0 to 127. Beyond that it's all not plain.
What exactly is not "plain" in a text encoded as UTF-8 or Windows-1252?
And why do you define "ASCII = plain"? Even ASCII has its history of
changes and not all 7-bit characters had the same meaning in the past:
<https://www.aivosto.com/articles/charsets-7bit.html>
In comp.infosystems.www.authoring.html,
Arno Welzel <usenet@arnowelzel.de> wrote:
Eli the Bearded:
The only thing that is plain text is US-ASCII, 0 to 127. Beyond that it's all not plain.
What exactly is not "plain" in a text encoded as UTF-8 or Windows-1252?
It is not "plain" in the sense of how documents without content types
should be interpreted according to the RFCs I remember reading. Consider
RFC-2045 - Multipurpose Internet Mail Extensions (MIME) Part One:
5.2. Content-Type Defaults
Default RFC 822 messages without a MIME Content-Type header are taken
by this protocol to be plain text in the US-ASCII character set,
which can be explicitly specified as:
Content-type: text/plain; charset=us-ascii
This default is assumed if no Content-Type header field is specified.
It is also recommend that this default be assumed when a
syntactically invalid Content-Type header field is encountered. In
the presence of a MIME-Version header field and the absence of any
Content-Type header field, a receiving User Agent can also assume
that plain US-ASCII text was the sender's intent. Plain US-ASCII
^^^^^^^^^^^^^^
text may still be assumed in the absence of a MIME-Version or the
^^^^^^^^^^^^^^^^^^^^^^^^^
presence of an syntactically invalid Content-Type header field, but
the sender's intent might have been otherwise.
and
RFC-2046 - Multipurpose Internet Mail Extensions (MIME) Part Two:
4.1.2. Charset Parameter
A critical parameter that may be specified in the Content-Type field
for "text/plain" data is the character set. This is specified with a
"charset" parameter, as in:
Content-type: text/plain; charset=iso-8859-1
Unlike some other parameter values, the values of the charset
parameter are NOT case sensitive. The default character set, which
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
must be assumed in the absence of a charset parameter, is US-ASCII.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
And why do you define "ASCII = plain"? Even ASCII has its history of
changes and not all 7-bit characters had the same meaning in the past:
Agreed that ASCII was not created in its final form.
<https://www.aivosto.com/articles/charsets-7bit.html>
The last change to ASCII there is in 1986. The last change there that involved the characters enumerated by ASCII was in 1977. The list of things that were important for computers in 1977 that are still important today is very small. ASCII, awkward as it is for many purposes, remains a bedrock upon which other, better things, are built. I just don't call UTF-8, eg, "plain text".
Elijah
------
notes that the unicode character table is written in US-ASCII
Helmut Richter:
[...]
You should notice that
<meta http-equiv="Content-Type" content="text/html; charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. meta_http-equiv is a hint to the web server to declare the content type and encoding via the HTTP protocol. This hint is actually ignored by the web server you use, I see only "Content-Type: text/html" appearing¹).[...]
Because it is not for the server but for the *browser*.
In fact this meta element is used *instead* sending a HTTP response
header. That's why it is called "http-equiv" - it should be treated by
the *browser* in the same way as the respective HTTP header for the Content-Type.
What exactly is not "plain" in a text encoded as UTF-8 or Windows-1252?
It is not "plain" in the sense of how documents without content types
should be interpreted according to the RFCs I remember reading.
and start and end tags and entity references included.
On Fri, 16 Oct 2020 23:42:47 +0300, Jukka K. Korpela wrote:
Stan Brown wrote:
I have this line in the <head> of my Web pages:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Just remove it, unless it matches the actual encoding used.
Brilliant! I tried with no <meta .. charset> tag. The characters were displayed correctly in the HTML5 and HTML4.01 versions, and the HTML5
version passed validation.
I know that encoding is complicated, but just because the characters
are displayed correctly in my browsers, is it safe to assume they'll
be correct in (the great majority of) other browsers?
I guess in a way I'm asking: what figures out the document encoding
if it's not specified, the Web server or the user-agent?
Stan Brown wrote:
On Fri, 16 Oct 2020 23:42:47 +0300, Jukka K. Korpela wrote:
What I tried to say is that declaring an encoding that is not the
actual encoding used (or compatible with it) is worse than not
declaring the encoding at all. This gives the user agent a chance
to guess right, as opposite to applying wrong information.
I know that encoding is complicated, but just because the characters
are displayed correctly in my browsers, is it safe to assume they'll
be correct in (the great majority of) other browsers?
The practical way is to use
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
and to ignore what the validator says about it. You can even use
automated ignoring by using the W3C validator tools for hiding messages
by type.
WHATWG and W3C just wish to promote UTF-8 on all pages at any cost. That’s why they specify that only UTF-8 is kosher and make the validator nag about it.
The theoretically most correct way is to make the server send HTTP
headers specifying the encoding. I have no idea how to do that when
using Nginx. You might need access to the server configuration files.
Stan Brown:
On Fri, 16 Oct 2020 23:42:47 +0300, Jukka K. Korpela wrote:
Stan Brown wrote:
I have this line in the <head> of my Web pages:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Just remove it, unless it matches the actual encoding used.
Brilliant! I tried with no <meta .. charset> tag. The characters were displayed correctly in the HTML5 and HTML4.01 versions, and the HTML5 version passed validation. (The W3C validator failed the HTML4.01
version with "obsolete DOCTYPE", which seems a bit harsh.) The
revised examples are at <URL:https://brownmath.com/Charsets/>.
Well this is just by chance correct. In fact your server does not send
any charset at all:
HTTP/2 200 OK
server: nginx
date: Fri, 16 Oct 2020 22:16:16 GMT
content-type: text/html
content-length: 784
last-modified: Fri, 16 Oct 2020 13:44:01 GMT
etag: "310-5b1c9f5076a40"
alt-svc: quic=":443"; ma=86400; v="43,39"
host-header: 5d77dd967d63c3104bced1db0cace49c
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
QUESTION 1: Any reason you suggest that rather than the simpler
<meta charset="windows-1252">
? This page says the two forms are equivalent in HTML5:
QUESTION 2: It would be awfully convenient to type a Windows
apostrophe (8-bit character 146) rather than &#146; or &rsquo;. If
I specify a charset of windows-1252, am I safe to do that, or should
I still stay away from Windows characters 128-159?
QUESTION 3: If I should still stay away from 128-159, even with a windows-1252 declaration, is there any particular reason you suggest windows-1252 rather than iso-8859-1?
I think I can get that access, probably via some override file in my
root directory. In fact, there's already a .htaccess file there with
one AddType, so I think it must be an Apache server or a workalike.
I should be able to add
AddType text/plain;charset=windows-1252
AddType text/html;charset=windows-1252
and have the server emit the desired headers.
But the stackoverflow
article above makes the point that we still want to include a charset
in each file, for the folks who download a file for later reading.
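A caution on the AddType lines above: in Apache, AddType maps file extensions to media types and takes extension arguments, so it would not be used bare like that; the charset parameter is normally attached with AddCharset or AddDefaultCharset instead. A hedged sketch of the .htaccess lines, assuming the server really is Apache or honours Apache-style overrides:

```apache
# Send "Content-Type: ...; charset=windows-1252" for text/html and
# text/plain responses, without listing extensions:
AddDefaultCharset windows-1252

# Or attach the charset per extension:
AddCharset windows-1252 .html .htm .txt
```

Whether these take effect depends on the host allowing the relevant overrides.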
On Sat, 17 Oct 2020, Stan Brown wrote:
QUESTION 2: It would be awfully convenient to type a Windows
apostrophe (8-bit character 146) rather than &#146; or &rsquo;. If
I specify a charset of windows-1252, am I safe to do that, or should
I still stay away from Windows characters 128-159?
There is no reason to stay away from code points that are defined in the code.
(I don’t have the problem, though. If I want a real apostrophe like the one in the preceding sentence, I just type it (on my keyboard AltGr+'), and it lands in the file as the UTF-8 representation of that character.)
QUESTION 3: If I should still stay away from 128-159, even with a
windows-1252 declaration, is there any particular reason you suggest
windows-1252 rather than iso-8859-1? I know they're the same for 32-
127 and 160-255, but in my mind windows-1252 suggests that I'll be
using Windows 128-159, and iso-8859-1 does not.
If you use these code points, you have to specify windows-1252; if not,
the effect is the same for the two code names.
Helmut Richter wrote:
On Sat, 17 Oct 2020, Stan Brown wrote:
QUESTION 2: It would be awfully convenient to type a Windows
apostrophe (8-bit character 146) rather than &#146; or &rsquo;. If
I specify a charset of windows-1252, am I safe to do that, or should
I still stay away from Windows characters 128-159?
There is no reason to stay away from code points that are defined in the code.
Well, apart from some code points not being assigned to any character, or some assigned characters being somewhat questionable. (For example, how often would it make sense to use the florin sign ƒ?) Sorry, today is my nitpicking day.
(I don’t have the problem, though. If I want a real apostrophe like the one in the preceding sentence, I just type it (on my keyboard AltGr+'),
I just press the key labeled with the Ascii apostrophe ('). Well, that’s how I use my personal keyboard layout when typing text (as opposed to code); using the standard Finnish international layout I need to use AltGr+'.
and it lands in the file as the UTF-8 representation of that character.
This depends on the software that processes the typed characters.
QUESTION 3: If I should still stay away from 128-159, even with a windows-1252 declaration, is there any particular reason you suggest windows-1252 rather than iso-8859-1? I know they're the same for 32-
127 and 160-255, but in my mind windows-1252 suggests that I'll be
using Windows 128-159, and iso-8859-1 does not.
If you use these code points, you have to specify windows-1252; if not,
the effect is the same for the two code names.
No, the effect is always the same on all browsers in use nowadays (possibly excluding some you might see in a museum of technology).
Yes, but I hate to write iso-8859-1 when it is a lie, whereas windows-1252 would work exactly the same and would be true.
Stan Brown wrote:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
QUESTION 1: Any reason you suggest that rather than the simpler
<meta charset="windows-1252">
No good reason. I just wrote the original format because I learned it 25 years or so ago.
QUESTION 2: It would be awfully convenient to type a Windows
apostrophe (8-bit character 146) rather than &#146; or &rsquo;. If
I specify a charset of windows-1252, am I safe to do that, or should
I still stay away from Windows characters 128-159?
You’re safe. Twenty years ago it was different.
QUESTION 3: If I should still stay away from 128-159, even with a windows-1252 declaration, is there any particular reason you suggest windows-1252 rather than iso-8859-1?
The reason is that browsers treat iso-8859-1 as windows-1252, and HTML5
made this the rule. In the old times it was different, mainly in the
sense that browsers running on Unix platforms actually treated
iso-8859-1 declared data so that octets 128–159 were control characters
and sometimes had odd effects.
I think I can get that access, probably via some override file in my
root directory. In fact, there's already a .htaccess file there with
one AddType, so I think it must be an Apache server or a workalike.
I should be able to add
AddType text/plain;charset=windows-1252
AddType text/html;charset=windows-1252
and have the server emit the desired headers.
I’m afraid Nginx does not support .htaccess but has other tools.
But the stackoverflow
article above makes the point that we still want to include a charset
in each file, for the folks who download a file for later reading.
That’s a valid point, because browsers probably still haven’t learned to save a web page locally in a proper way. That is, they don’t use the HTTP headers when saving the file. This is understandable, ...
You should notice that
<meta http-equiv="Content-Type" content="text/html; charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. meta_http-equiv is a hint to the web server to declare the content type and encoding via the HTTP protocol. […]
HTML is by definition not plain text.
Eli the Bearded wrote:
The only thing that is plain text is US-ASCII, 0 to 127. Beyond that
it's all not plain.
That’s nonsense. Plain text is just text, as opposed to “rich text”, like MS Word format, or HTML.
Helmut Richter wrote:
You should notice that
<meta http-equiv="Content-Type" content="text/html; charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. meta_http-equiv is a hint to the web server to
declare the content type and encoding via the HTTP protocol. […]
Not at all. How did you get that idea?
It is not the job of a Web server to *interpret* the body of an HTTP message in order to generate a header for that HTTP message. Parsing and interpreting HTML, for example, is solely the domain of an HTML user agent.
Instead, both HTML elements are a *substitute* – an *equivalent* – for the
Content-Type HTTP header field, to be used by the Web _browser_, if that header field is not sent by the Web server.
The various HTML Specifications make that very clear.
Helmut Richter wrote:
You should notice that
<meta http-equiv="Content-Type" content="text/html; charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. meta_http-equiv is a hint to the web server to declare the content type and encoding via the HTTP protocol. […]
Not at all. How did you get that idea? It is not a job of a Web server to *interpret* the body of a HTTP message in order to generate a header for
that HTTP message. Parsing and interpreting HTML, for example, is solely
the domain of a HTML user agent.
Helmut Richter wrote:
You should notice that
<meta http-equiv="Content-Type" content="text/html; charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. meta_http-equiv is a hint to the web server to
declare the content type and encoding via the HTTP protocol. […]
Not at all. How did you get that idea? It is not a job of a Web server to *interpret* the body of a HTTP message in order to generate a header for
that HTTP message. Parsing and interpreting HTML, for example, is solely
the domain of a HTML user agent.
Instead, both HTML elements are a *substitute* – an *equivalent* – for the
Content-Type HTTP header field, to be used by the Web _browser_, if that header field is not sent by the Web server.
The various HTML Specifications make that very clear.
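The precedence described above (HTTP header first, meta element only as a browser-side fallback) can be sketched in a few lines. This is a hypothetical, much-simplified resolver written for illustration; the function name and the windows-1252 fallback are my assumptions, and real browsers do considerably more sniffing than this:

```python
import re

def pick_encoding(content_type_header, html_bytes):
    """Return the charset a browser-like consumer would use.

    content_type_header: value of the HTTP Content-Type field, or None
    html_bytes: the raw response body
    """
    # 1. A charset in the HTTP header always wins.
    if content_type_header:
        m = re.search(r'charset=([\w-]+)', content_type_header, re.I)
        if m:
            return m.group(1).lower()
    # 2. Otherwise fall back to a charset declared in the document
    #    itself, via <meta charset=...> or <meta http-equiv=...>.
    head = html_bytes[:1024].decode('ascii', errors='replace')
    m = re.search(r'<meta\s+charset=["\']?([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    m = re.search(r'charset=([\w-]+)', head, re.I)
    if m:
        return m.group(1).lower()
    return 'windows-1252'  # common legacy default, simplified

# Header wins over the meta element:
print(pick_encoding('text/html; charset=iso-8859-1',
                    b'<meta charset="utf-8">'))       # iso-8859-1
# No header: the meta element is used as a substitute:
print(pick_encoding(None, b'<meta charset="utf-8">'))  # utf-8
```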
Stan Brown:
On Thu, 15 Oct 2020 14:31:10 -0700, I started this thread with:
I'm trying, and failing, to write the proper charset in my meta tag.
Help, please!
A very big thank-you to all those who responded! I have learned quite
a lot in the past few days, and you were a big help in that. Here are
changes completed or in progress:
Thank you for this summary of your findings.
Eli the Bearded wrote:
The only thing that is plain text is US-ASCII, 0 to 127. Beyond that
it's all not plain.
Nonsense. "Plain text" means - literally - content that can be read
by a person as opposed to "binary" data; that is, content where byte
sequences represent characters, in particular digits and letters.
(As an aside, I'm seeing that my stance that US-ASCII is "plain text"
and "plain text" does not necessarily mean "text/plain" is an unpopular
one. I'm tired of arguing the point, but no one has convinced me that
I'm wrong.)
Elijah
------
utf-8 in the sheets, ascii in the style sheets
In comp.infosystems.www.authoring.html,
Thomas 'PointedEars' Lahn <cljs@PointedEars.de> wrote:
Eli the Bearded wrote:
The only thing that is plain text is US-ASCII, 0 to 127. Beyond that
it's all not plain.
Nonsense. "Plain text" means - literally - content that can be read
by a person as opposed to "binary" data; that is, content where byte
sequences represent characters, in particular digits and letters.
So, by that rule, anything in RAM, on magnetic disk, on magnetic tape,
on SSD, on DVD-R or CD-ROM, in transit over ethernet or wifi, all of
those are _not plain text_.
(As an aside, I'm seeing that my stance that US-ASCII is "plain text"
and "plain text" does not necessarily mean "text/plain" is an unpopular
one. I'm tired of arguing the point, but no one has convinced me that
I'm wrong.)
Elijah
------
utf-8 in the sheets, ascii in the style sheets
Is PostScript plain text?
Phillip Helbig (undress to reply):
[...]
Is PostScript plain text?
It can be:
<http://paulbourke.net/dataformats/postscript/>
Arno Welzel wrote:
Phillip Helbig (undress to reply):
[...]
Is PostScript plain text?
It can be:
<http://paulbourke.net/dataformats/postscript/>
That old document seems to say that PostScript is plain text, since you
can create, edit, and read a PostScript file using a text editor. But
that’s not how “plain text” is defined in MIME:
The simplest and most important subtype of "text" is "plain". This
indicates plain text that does not contain any formatting commands or
directives. Plain text is intended to be displayed "as-is", that is,
no interpretation of embedded formatting commands, font attribute
specifications, processing instructions, interpretation directives,
or content markup should be necessary for proper display.
https://tools.ietf.org/html/rfc2046#section-4.1.3
ObHTML: Similarly, HTML is not plain text.
Technically, PostScript isn’t even classified as text; the media type
for it is application/postscript. This does not mean that it would be impossible to write PostScript using a text editor.
ObHTML: For XHTML, the media type application/xhtml+xml is specified.
Arno Welzel wrote:
Phillip Helbig (undress to reply):
Is PostScript plain text?
It can be:
That old document seems to say that PostScript is plain text, since you
can create, edit, and read a PostScript file using a text editor. But
that’s not how “plain text” is defined in MIME:
But even application/xhtml+xml is in fact plain text which is
*interpreted* as XHTML.
The important point is, that the content of a file of that type can be
read as plain text as well.
In comp.infosystems.www.authoring.html,
Thomas 'PointedEars' Lahn <cljs@PointedEars.de> wrote:
Eli the Bearded wrote:
The only thing that is plain text is US-ASCII, 0 to 127. Beyond that
it's all not plain.
Nonsense. "Plain text" means - literally - content that can be read
by a person as opposed to "binary" data; that is, content where byte
sequences represent characters, in particular digits and letters.
So, by that rule, anything in RAM, on magnetic disk, on magnetic tape,
on SSD, on DVD-R or CD-ROM, in transit over ethernet or wifi, all of
those are _not plain text_.
[Ex falso quodlibet]
Arno Welzel wrote:
But even application/xhtml+xml is in fact plain text which is
*interpreted* as XHTML.
The important point is, that the content of a file of that type can be
read as plain text as well.
Please read this as plain
text.
Arno Welzel wrote:
Phillip Helbig (undress to reply):
[...]
Is PostScript plain text?
It can be:
<http://paulbourke.net/dataformats/postscript/>
That old document seems to say that PostScript is plain text, since you
can create, edit, and read a PostScript file using a text editor. But
that’s not how “plain text” is defined in MIME:
https://tools.ietf.org/html/rfc2046#section-4.1.3
Please read this as plain
text.
In comp.infosystems.www.authoring.html,
Jukka K. Korpela <jukkakk@gmail.com> wrote:
Please read
this as plain
text.
Reading it as plain text is trivial.
So, by that rule, anything in RAM, on magnetic disk, on magnetic tape,
on SSD, on DVD-R or CD-ROM, in transit over ethernet or wifi, all of
those are _not plain text_.
No, of course not. Not all code points of US-ASCII or Unicode represent digits and letters. In particular, the first 32 code points do not; they represent non-printable control characters or are left unassigned. That
is, they represent *data*, but not necessarily *text*.
[Ex falso quodlibet]
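The claim above about the first 32 code points is easy to check with the standard library (an illustrative one-off, nothing more): Unicode classifies all of them as controls, not as letters or digits.

```python
# Code points 0-31 are the C0 control characters: Unicode puts every
# one of them in general category "Cc" (control), so none of them is
# a letter or a digit.
import unicodedata

cats = {unicodedata.category(chr(cp)) for cp in range(32)}
print(cats)  # {'Cc'}
```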
Thomas 'PointedEars' Lahn wrote:
Helmut Richter wrote:
You should notice that
<meta http-equiv="Content-Type" content="text/html;
charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. meta_http-equiv is a hint to the web server to
declare the content type and encoding via the HTTP protocol. […]
Not at all. How did you get that idea?
Perhaps from the HTML specifications.
It is not a job of a Web server to *interpret* the body of a HTTP message
in order to generate a header for that HTTP message. Parsing and
interpreting HTML, for example, is solely the domain of a HTML user
agent.
Instead, both HTML elements are a *substitute* – an *equivalent* – for
the Content-Type HTTP header field, to be used by the Web _browser_, if
that header field is not sent by the Web server.
The various HTML Specifications make that very clear.
“HTTP servers may read the content of the document HEAD to generate
header fields corresponding to any elements defining a value for the
attribute HTTP-EQUIV.”
https://www.w3.org/MarkUp/html-spec/html-spec_5.html#SEC5.2.5
Since that’s not how things actually worked, HTML5 specs don’t even mention the possibility of servers using <meta> tags. Neither do they prohibit such things; they don’t really deal with the operation of
servers.
The early HTML5 drafts/specs didn’t even allow <meta
http-equiv=...> and instead used the <meta charset=...> invention,
which was, from the beginning, meant to be handled by user agents.
Arno Welzel wrote:
But even application/xhtml+xml is in fact plain text which is
*interpreted* as XHTML.
The important point is, that the content of a file of that type can be
read as plain text as well.
Please read this as plain
text.
Eli the Bearded wrote:
In comp.infosystems.www.authoring.html,
Jukka K. Korpela <jukkakk@gmail.com> wrote:
Please read
this as plain
text.
Reading it as plain text is trivial.
Didn’t someone quote this from the relevant RFC:
Plain text is intended to be displayed "as-is", that is,
Are you saying that displaying the character sequence “as-is” is proper
display?
On Tue, 20 Oct 2020, Thomas 'PointedEars' Lahn wrote:
Helmut Richter wrote:
You should notice that
<meta http-equiv="Content-Type" content="text/html;
charset=any-code">
(HTML before HTML5 as well)
and
<meta charset="utf-8"> (HTML5 only, only utf-8 allowed)
have different meanings. meta_http-equiv is a hint to the web server to
declare the content type and encoding via the HTTP protocol. […]
Not at all. How did you get that idea? It is not a job of a Web server
to *interpret* the body of a HTTP message in order to generate a header
for that HTTP message. Parsing and interpreting HTML, for example, is
solely the domain of a HTML user agent.
Thank you for repeating <huuk9tF11ncU1@mid.individual.net>. I understood
that one as well, though.
Jukka K. Korpela:
Eli the Bearded wrote:
In comp.infosystems.www.authoring.html,
Jukka K. Korpela <jukkakk@gmail.com> wrote:
Please read
this as plain
text.
Reading it as plain text is trivial.
Didn’t someone quote this from the relevant RFC:
Plain text is intended to be displayed "as-is", that is,
Which is possible:
Ampersand, Hash, Five, Zero, Colon...
In comp.infosystems.www.authoring.html,
Thomas 'PointedEars' Lahn <dciwam@PointedEars.de> wrote:
I note the lack of an attribution there[*].
My writing:
So, by that rule, anything in RAM, on magnetic disk, on magnetic tape,
on SSD, on DVD-R or CD-ROM, in transit over ethernet or wifi, all of
those are _not plain text_.
Thomas's reply:
No, of course not. Not all code points of US-ASCII or Unicode represent
digits and letters. In particular, the first 32 code points do not; they
represent non-printable control characters or are left unassigned. That
is, they represent *data*, but not necessarily *text*.
Control characters between 0 and 31 are either generally not used in
output or have very well defined meanings in output.
[Ex falso quodlibet]
[*] This is not something I wrote, although the >> implies it was in
my article. So perhaps the lack of attribution was deliberate?
Ampersand, Hash, Five, Zero, Colon...
[...]
Are you saying that displaying the character sequence “as-is” is proper
display?
Yes. You did not ask for "interpret what this text means".
A pseudonymous coward and liar trolled:
[*] This is not something I wrote, although the >> implies it was in
my article. So perhaps the lack of attribution was deliberate?
Why are you lying?
[Ex falso quodlibet]
Jukka K. Korpela:
Arno Welzel wrote:
But even application/xhtml+xml is in fact plain text which is
*interpreted* as XHTML.
The important point is, that the content of a file of that type can be
read as plain text as well.
Please read this as plain
text.
Is this the way *you* create your XHTML files?
In comp.infosystems.www.authoring.html,
Thomas 'PointedEars' Lahn <usenet@PointedEars.de> wrote:
A pseudonymous coward and liar trolled:
Ever the classy person there.
[*] This is not something I wrote, although the >> implies it was in
my article. So perhaps the lack of attribution was deliberate?
Why are you lying?
$ lynx -source -dump 'news:<eli$2010201433@qaz.wtf>' | grep quodlibet
$ lynx -source -dump 'news:<2173853.ElGaqSPkdT@PointedEars.de>' | grep quodlibet
[Ex falso quodlibet]
$
Arno Welzel wrote:
Ampersand, Hash, Five, Zero, Colon...
[...]
Are you saying that displaying the character sequence “as-is” is proper
display?
Yes. You did not ask for "interpret what this text means".
For HTML (which is what we are discussing here), “proper display” means
Jukka K. Korpela:
Arno Welzel wrote:
Ampersand, Hash, Five, Zero, Colon...
[...]
Are you saying that displaying the character sequence “as-is” is proper
display?
Yes. You did not ask for "interpret what this text means".
For HTML (which is what we are discussing here), “proper display” means
"proper display" is not required to read something as plain text.
You can even print this on a sheet of paper and give it to someone to
type it in and you get the same file again which can again be
displayed using a web browser.
Try this with a PNG image or an MP3 file.
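The round-trip argument above can be made concrete: decoding text and re-encoding it reproduces the bytes exactly, whereas arbitrary binary data such as a PNG header need not decode at all. A small illustrative sketch (the sample strings are my own):

```python
# Text round-trips: bytes -> str -> bytes is lossless for a valid encoding.
html_bytes = '<p>caf\u00e9 &times; 2</p>'.encode('utf-8')
assert html_bytes.decode('utf-8').encode('utf-8') == html_bytes

# Binary does not: the PNG signature starts with 0x89, which is not
# a valid first byte of any UTF-8 sequence.
png_magic = b'\x89PNG\r\n\x1a\n'
try:
    png_magic.decode('utf-8')
except UnicodeDecodeError:
    print('not decodable as text')
```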
I think the two of you are actually using different terminology. To
Arno, and to me, "plain text" is not something with no codes in it,
it's something where a "text editor" can see all the characters.
I think Jukka is equating "plain text" to type="text/plain". I won't
say that's wrong, but it's not the only interpretation.
For HTML (which is what we are discussing here), “proper display” means
displaying the content as defined in HTML specifications. It would be
inappropriate for a browser to display the tags, the character
references, the comments, etc., as-is. It would mean rendering an HTML
document as plain text (which it is not, by definition), refusing to do
the job of a browser.
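The two readings at issue in this subthread (the literal character sequence versus what an HTML renderer shows) can be demonstrated with a few lines of stdlib Python; the sample string is my own:

```python
from html import unescape

src = 'caf&eacute; &#50; &times; &#50;'
print(src)            # plain-text reading: the literal character sequence
print(unescape(src))  # HTML reading: café 2 × 2
```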