# Human technology: Text files
It is a well-known engineering principle that you should always use the
weakest technology capable of solving your problem--the weakest
technology is likely the cheapest and the easiest to maintain, extend or
replace, and there are no sane arguments for using anything else.
The main problem with this principle is marketing--few people would
sell you a $10 product that can solve your problem forever, when they
can sell you a $1000 product, with a $10 per month maintenance cost, that
will become obsolete after 10 years. If you listen to the "experts",
you will likely end up not with the simplest, but with the most
advanced technology.
And with software the situation is particularly bad, because the
simplest technologies often cost nothing, and so they have no marketing
budget. Since nobody benefits from convincing you to use something
that is free, nobody is actively selling those technologies. In this
post, I will try to fill that gap by reviewing some technologies for web
publishing that are based on plain text and putting forward their
benefits. Read on to understand why and how you should write everything
you write in plain text files and self-publish them on your own website.
## Plain text
The problem of text is one of those problems where the simplest of all
solutions works great--plain text files do the job. I've yet to see a
use-case where considering any other technology is worth it.
The same goes for simple static HTML websites--a simple static page is
better than all publishing platforms that can ever be created.
Anything you write that you want to last should be put in plain text files.
...
From: https://boris-marinov.github.io/text/
You’d have to be NUTS to try to keep your precious data around in any other format. Images and videos, audio, all have common formats but is there a “forever” format for these data which rivals plain text? No. Of course not.
The original article was not talking about multimedia. You don't write
images, video, or audio, though you might write plots, scripts,
screenplays, scores, etc.
A problem is that at this point most users have no concept of what plain
text even is. If they think about it at all they think it means Microsoft Word
or just "Microsoft".
If I ask someone to send me something in plain text format I usually just
get a blank stare. About the best I can usually do to get anyone to send something in an open format is pdf.
Well you can convert PDF to Postscript, and so far as I'm concerned
that's "plain text" in the way that Markdown is. But I don't
consider either to really be plain text.
Well perhaps Markdown is from a reader's perspective, but not for a
writer because they need knowledge of the syntax.
Roger Blake <rogblake@iname.invalid> wrote:
> A problem is that at this point most users have no concept of what
> plain text even is. If they think about it at all they think it
> means Microsoft Word or just "Microsoft".
On the other hand I find HTML quite readable if it's formatted
sensibly...
But neither Markdown, nor HTML, is plain text to me anyway.
Actually I'd go further and say that as an English speaker who
doesn't need extra characters, I prefer ASCII text. UTF-8 includes
things like emoticons which, were they to become widely used in
text documents for conveying important information, would cause me
all sorts of trouble. Thankfully so far they never seem to be used
for anything remotely important.
Computer Nerd Kev <not@telling.you.invalid> writes:
> Roger Blake <rogblake@iname.invalid> wrote:
>> A problem is that at this point most users have no concept of what
>> plain text even is. If they think about it at all they think it
>> means Microsoft Word or just "Microsoft".
A friend on another newsgroup, after decades as a programmer, is
struggling with the challenge of persuading/coercing his (mostly Mac)
software to send 7-bit ASCII mail and news posts. The software wants
to make everything UTF-8 (left & right double & single quotes,
ellipses and some other punctuation are each 3 bytes). It appears
that his solution will be to compose mail/posts on a Raspberry Pi
running Linux over his LAN, then retrieve the result to post via his Mac.
It remains unclear if his Mac apps will do that without "fixing" the
deficient ASCII text.
> On the other hand I find HTML quite readable if it's formatted
> sensibly...
Another e-acquaintance re-posts articles from the web to a mailing
list. It appears that he righteously hits the button in his browser
labeled "Email as plain text" or similar.
The result is:
* HTML is elided but
* Much of the punctuation is 3-byte UTF-8 chars
* All links/anchors in the original HTML are included in-line
  inside <https://miskatonic.edu/using_brokets> brokets.
* A "line" is whatever was rendered as a paragraph in HTML
* Then his mail client (or something) does everything up as
  quoted-printable
The UTF-8 punctuation is actually 9 bytes as QP (=E2=NN=NN) and URLs
are frequently quite long. It's a dog's breakfast. Not totally
UNreadable but "Quite readable" wouldn't be my choice of descriptor.
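The byte counts above are easy to check. A minimal sketch in Python, using only the standard library; the sample characters are picked for illustration and are not taken from the re-posted articles:

```python
import quopri

# Typographic punctuation that mail software commonly substitutes for ASCII.
# (Illustrative samples only.)
SAMPLES = {
    "left double quote": "\u201c",
    "right single quote": "\u2019",
    "ellipsis": "\u2026",
}

def sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, quoted-printable byte count) for one character."""
    raw = ch.encode("utf-8")
    qp = quopri.encodestring(raw)  # each non-ASCII byte becomes "=XX"
    return len(raw), len(qp)

for name, ch in SAMPLES.items():
    utf8_len, qp_len = sizes(ch)
    print(f"{name}: {utf8_len} bytes as UTF-8, {qp_len} bytes as quoted-printable")
```

Each of these characters occupies three bytes in UTF-8, and quoted-printable spends three ASCII characters on every one of those bytes.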
> But neither Markdown, nor HTML, is plain text to me anyway.
> Actually I'd go further and say that as an English speaker who
> doesn't need extra characters, I prefer ASCII text. UTF-8 includes
> things like emoticons which, were they to become widely used in
> text documents for conveying important information, would cause me
> all sorts of trouble. Thankfully so far they never seem to be used
> for anything remotely important.
Many years ago, I and others ridiculed Microsoft's tilt toward dumbing
everything down to the acephalic lowest common denominator with
notions such as:
* Windows Iconic Droolproof Descriptive Language Extension
* Cognitive Reassembler Access Protocol for Windows Applications
  with Rebus Enhancement
* Microsoft Iconic Canonical Reassembler for Ontic Cognitive
  Enhancement of Proactive Heuristic Access to Linguistic
  Youthfulness
only to have reality upstage satire, a decade or so ago, with iConji (q.g.)[1]
[1] q.g.: quod google
That brings up a point I was wondering: does usenet/email support utf-8
yet, or is everything expected to be ASCII? 7-bit?
What happens if I do insert a non-ascii unicode glyph?
Computer Nerd Kev <not@telling.you.invalid> wrote:
> Well you can convert PDF to Postscript, and so far as I'm concerned
> that's "plain text" in the way that Markdown is. But I don't
> consider either to really be plain text.
If you're lucky, you can extract text from a PDF by selecting and copying
it. If it's just an image, though (as it might be if the PDF was produced
from a scan), you'll get back nothing. You might be able to feed the PDF
through an OCR engine and extract the text that way, but the quality of
those results depends largely on the quality of the scan.
> Well perhaps Markdown is from a reader's perspective, but not for a
> writer because they need knowledge of the syntax.
There's not much to it. Markdown seems largely to follow the sorts of
conventions most people have used in text files anyway:
    *this line is emphasized*
    This line is a heading
    ======================
    1. This is the first item of an ordered list.
    2. This is the second line.
    3. etc.
    > This is a quote.
    * This is the first item of an unordered list.
    * etc.
Mike Spencer <mds@bogus.nodomain.nowhere> wrote:
> Computer Nerd Kev <not@telling.you.invalid> writes:
>> Roger Blake <rogblake@iname.invalid> wrote:
>>> A problem is that at this point most users have no concept of what
>>> plain text even is. If they think about it at all they think it
>>> means Microsoft Word or just "Microsoft".
> A friend on another newsgroup, after decades as a programmer, is
> struggling with the challenge of persuading/coercing his (mostly Mac)
> software to send 7-bit ASCII mail and news posts. The software wants
> to make everything UTF-8 (left & right double & single quotes,
Hi, Mike, PMFJI.
In macOS Mail / Edit / Substitutions: turn off Smart Quotes;
and similarly for other substitutions that are not required.
See also Preferences / Composing / Message Format: Plain Text.
Obviously this does not necessarily hold true for third party software.
[relurk]
> ellipses and some other punctuation are each 3 bytes). It appears
> that his solution will be to compose mail/posts on a Raspberry Pi
> running Linux over his LAN, then retrieve the result to post via his Mac.
> It remains unclear if his Mac apps will do that without "fixing" the
> deficient ASCII text.
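Where the mail client cannot be talked out of its substitutions, the text can be folded back to 7-bit ASCII before posting. A rough sketch in Python; the mapping table is an assumption covering the punctuation named in this thread plus dashes, not a complete list:

```python
# Hypothetical mapping: the "smart" punctuation mentioned in the thread,
# plus en and em dashes as an extra assumption.
ASCII_EQUIVALENTS = {
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # ellipsis
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
}

def to_ascii(text: str) -> str:
    """Replace known typographic characters, then mark anything else with '?'."""
    folded = text.translate(str.maketrans(ASCII_EQUIVALENTS))
    # Characters with no mapping become "?" so the loss stays visible.
    return folded.encode("ascii", errors="replace").decode("ascii")

print(to_ascii("\u201cQuite readable\u201d\u2026 or not"))
```

Marking unmapped characters with "?" instead of dropping them is a deliberate choice here: it shows the author where information was lost.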
Computer Nerd Kev <not@telling.you.invalid> wrote:
Well you can convert PDF to Postscript, and so far as I'm concerned
that's "plain text" in the way that Markdown is. But I don't
consider either to really be plain text.
If you're lucky, you can extract text from a PDF by selecting and copying
it. If it's just an image, though (as it might be if the PDF was produced from a scan), you'll get back nothing. You might be able to feed the PDF through an OCR engine and extract the text that way, but the quality of
those results depends largely on the quality of the scan.
> That brings up a point I was wondering: does usenet/email support utf-8
> yet, or is everything expected to be ASCII? 7-bit?
> What happens if I do insert a non-ascii unicode glyph?
I believe so.
On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:
> If you're lucky, you can extract text from a PDF by selecting and
> copying it. If it's just an image, though (as it might be if the
> PDF was produced from a scan), you'll get back nothing. You might
> be able to feed the PDF through an OCR engine and extract the text
> that way, but the quality of those results depends largely on the
> quality of the scan.
I used to be able to extract text directly from Microsoft Word
documents using "antiword" but it only works with the old binary
(.doc) format and of course the default has been the new .docx format
since the 2007 version.
> Hi, Mike, PMFJI.
All help welcome. Most of us need all the help we can get.
> In macOS Mail / Edit / Substitutions: turn off Smart Quotes;
> and similarly for other substitutions that are not required.
> See also Preferences / Composing / Message Format: Plain Text.
And a Mac will interpret "Plain Text" as 7-bit ASCII? I would but
Mac-world is a black box.
> Obviously this does not necessarily hold true for third party software.
> [relurk]
Forwarded to Mac-user party in question.
TYVM.
Lets try it out :
Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω
Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋
Can you read all this ?
> Lets try it out :
> Greek alphabet :
> ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
> αβγδεζηθικλμνξοπρστυφχψω
> Some mathematical symbols :
> ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋
> Can you read all this ?
Works just fine for me! Good to know I won't accidentally break
everything if I include unusual characters.
I agree, I think that we should first try to solve technological problems with the simplest solutions. One of the reasons why I've moved
my blog to gopher is that it's just easier to maintain overall. I don't
have to worry about a database, or whether my CMS is working or not. I
just fire up my text editor, write stuff and 'scp' my files to my remote server.
On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:
> On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:
> I used to be able to extract text directly from Microsoft Word
> documents using "antiword" but it only works with the old binary
> (.doc) format and of course the default has been the new .docx format
> since the 2007 version.
Pandoc does quite a nice job of converting docx to other formats.
On Fri, 07 Oct 2022 11:53:18 +1000, Computer Nerd Kev wrote:
> Well you can convert PDF to Postscript, and so far as I'm concerned
> that's "plain text" in the way that Markdown is.
Doesn't work if the PostScript file is just a load of images.
I usually print, scan and OCR.
I'm hoping you are aware that you don't need a CMS or a database to
publish information over HTTP, but if you aren't then you can quite
happily (and just as easily) publish things to a web server to present
over HTTP using a text editor and scp. This has the benefit of still
being supported by modern browsers.
>> Well you can convert PDF to Postscript, and so far as I'm concerned
>> that's "plain text" in the way that Markdown is.
> I usually print, scan and OCR.
Surely you can OCR without the printing and scanning? Ghostscript can
generate PNG (etc.) bitmap images for each page of a PDF, at a specified
resolution.
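The Ghostscript route could be scripted roughly as follows. This is a sketch under assumptions: it expects `gs` and an OCR engine on the PATH, and uses Tesseract as the engine only because the thread does not name one. The file names and the 300 dpi default are placeholders.

```python
import subprocess
from pathlib import Path

def ghostscript_cmd(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Build a gs command that renders each PDF page to a numbered PNG."""
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",                            # 24-bit colour PNG output
        f"-r{dpi}",                                   # rendering resolution
        f"-sOutputFile={out_dir / 'page-%03d.png'}",  # one file per page
        str(pdf),
    ]

def tesseract_cmd(png: Path) -> list[str]:
    """Build a tesseract command; it writes <output base>.txt next to the PNG."""
    return ["tesseract", str(png), str(png.with_suffix(""))]

def ocr_pdf(pdf: Path, out_dir: Path, dpi: int = 300) -> str:
    """Render the PDF to bitmaps, OCR each page, and return the joined text."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(pdf, out_dir, dpi), check=True)
    for png in sorted(out_dir.glob("page-*.png")):
        subprocess.run(tesseract_cmd(png), check=True)
    pages = sorted(out_dir.glob("page-*.txt"))
    return "\n".join(p.read_text(encoding="utf-8") for p in pages)
```

Calling `ocr_pdf(Path("scan.pdf"), Path("pages"))` would leave the page images and text files in `pages/`. A higher resolution tends to help the OCR step at the cost of larger bitmaps.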
Spiros Bousbouras <spibou@gmail.com> wrote:
> Lets try it out :
> Greek alphabet :
> ????????????????????????
> ????????????????????????
> Some mathematical symbols :
> ? ? ? ? ? ? \ ? ? ? ? ? ? ? ? ? ? ? ?
> Can you read all this ?
Received five-by-five, though the math symbols are a bit small. Pretty sure that's just down to font choice (Lucida Console, 9 pt.).
As you might see from examining the header, I'm using tin.
On Sat, 08 Oct 2022 03:58:04 +0000, Spiros Bousbouras wrote:
> Lets try it out :
> Greek alphabet :
> ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ αβγδεζηθικλμνξοπρστυφχψω
> Some mathematical symbols :
> ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋
> Can you read all this ?
Fine for me. Pan on FreeBSD.
On Tue, 11 Oct 2022 08:14:09 +1000, Computer Nerd Kev wrote:
>> I usually print, scan and OCR.
> Surely you can OCR without the printing and scanning? Ghostscript can
> generate PNG (etc.) bitmap images for each page of a PDF, at a specified
> resolution.
Not in this case. I have a lot of material that is on a CD, in a format
only accessible by a Windows program that won't run on anything later
than XP. It fails when printed to a file!
On 10/11/2022 2:21 AM, Bob Eager wrote:
> On Tue, 11 Oct 2022 08:14:09 +1000, Computer Nerd Kev wrote:
>>> I usually print, scan and OCR.
>> Surely you can OCR without the printing and scanning? Ghostscript can
>> generate PNG (etc.) bitmap images for each page of a PDF, at a
>> specified resolution.
> Not in this case. I have a lot of material that is on a CD, in a format
> only accessible by a Windows program that won't run on anything later
> than XP. It fails when printed to a file!
Can the program that reads the file export it as something else? Out of
curiosity, what is the file format called, and is it by any chance
documented?
MIME-Version: 1.0[...]
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Lets try it out :
Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω
Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋
Can you read all this ?
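Headers like the ones quoted above can be produced with Python's standard `email` package. A small sketch; the subject line and body are made up for illustration, and this only builds the message, it does not post it:

```python
from email.message import EmailMessage

def build_post(body: str) -> EmailMessage:
    """Build a text/plain message declared as UTF-8 with 8bit transfer encoding."""
    msg = EmailMessage()
    msg["Subject"] = "UTF-8 test"  # hypothetical subject
    # charset and cte are set explicitly so the declared headers match the
    # ones shown in the thread instead of being chosen by heuristics.
    msg.set_content(body, charset="utf-8", cte="8bit")
    return msg

msg = build_post("Greek: ΑΒΓ αβγ\nMath: ∅ ∈ ⊂\n")
for name in ("MIME-Version", "Content-Type", "Content-Transfer-Encoding"):
    print(f"{name}: {msg[name]}")
```

Whether a given reader displays the characters still depends on the client honouring the declared charset, which is what the replies in this thread are testing.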
> On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:
>> If you're lucky, you can extract text from a PDF by selecting and copying
>> it. If it's just an image, though (as it might be if the PDF was produced
>> from a scan), you'll get back nothing. You might be able to feed the PDF
>> through an OCR engine and extract the text that way, but the quality of
>> those results depends largely on the quality of the scan.
> I used to be able to extract text directly from Microsoft Word documents
> using "antiword" but it only works with the old binary (.doc) format and
> of course the default has been the new .docx format since the 2007 version.
At least pdf is an open format. The "pdftotext" program can extract any
actual text it finds in a pdf file but sometimes those are just an image
which would require ocr to interpret.
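One way to tell the two cases apart automatically is to run pdftotext and check whether anything other than whitespace comes back. A sketch in Python, assuming the `pdftotext` binary from poppler or xpdf is installed; the helper names are hypothetical:

```python
import subprocess
from typing import Optional

def pdftotext_cmd(pdf_path: str) -> list[str]:
    """Build a pdftotext command; '-' as the output file sends text to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]

def has_text_layer(extracted: str) -> bool:
    """True if the output holds more than whitespace and page-break form feeds."""
    return bool(extracted.replace("\f", "").strip())

def extract_text(pdf_path: str) -> Optional[str]:
    """Return the extracted text, or None if the PDF seems to be images only."""
    result = subprocess.run(
        pdftotext_cmd(pdf_path), capture_output=True, text=True, check=True
    )
    return result.stdout if has_text_layer(result.stdout) else None
```

A `None` result is only a hint that the file needs OCR; a scan with a stray text annotation would still pass this check.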
Retrograde <fungus@amongus.com.invalid> wrote:
> On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:
>> On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:
>> I used to be able to extract text directly from Microsoft Word
>> documents using "antiword" but it only works with the old binary
>> (.doc) format and of course the default has been the new .docx format
>> since the 2007 version.
> Pandoc does quite a nice job of converting docx to other formats.
I just discovered that myself actually. This command seems to work
well to generate an HTML file with any images embedded within it (I
prefer this a little over PDF):
    pandoc -s --embed-resources --ascii -o file.htm file.docx
The other one that I would like to handle is Excel spreadsheets in
xls and xlsx formats. PHPSpreadsheet from the PHPOffice project
seems to handle this, but as it's not designed for command-line use
it's going to take some more work to get equivalent functionality
out of it.
https://github.com/PHPOffice
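To run the pandoc command above over a whole folder of Word files, a wrapper along these lines might do. It is a sketch: it assumes a pandoc version recent enough to know `--embed-resources`, and the function names and folder layout are hypothetical.

```python
import subprocess
from pathlib import Path

def pandoc_cmd(docx: Path) -> list[str]:
    """Build the pandoc call from the post above for one .docx file."""
    return [
        "pandoc", "-s", "--embed-resources", "--ascii",
        "-o", str(docx.with_suffix(".htm")),  # output next to the input
        str(docx),
    ]

def convert_all(folder: Path) -> list[Path]:
    """Convert every .docx in the folder and return the generated .htm paths."""
    outputs = []
    for docx in sorted(folder.glob("*.docx")):
        subprocess.run(pandoc_cmd(docx), check=True)
        outputs.append(docx.with_suffix(".htm"))
    return outputs
```

With `check=True` the loop stops at the first file pandoc cannot convert, which seems preferable to silently skipping documents.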
> At least pdf is an open format. The "pdftotext" program can extract any
> actual text it finds in a pdf file but sometimes those are just an
> image which would require ocr to interpret.
With MuPDF you can select the text with the right click mouse button and
it will be copied into the clipboard.
On 2022-10-10, Computer Nerd Kev <not@telling.you.invalid> wrote:
> The other one that I would like to handle is Excel spreadsheets in
> xls and xlsx formats. PHPSpreadsheet from the PHPOffice project
> seems to handle this, but as it's not designed for command-line use
> it's going to take some more work to get equivalent functionality
> out of it.
> https://github.com/PHPOffice
Get sc-im+gnuplot for xls and xlsx files. It's like LibreOffice Calc
but for the CLI and with vi keys.