Forum: >>> Magnum BBS <<<

in praise of text files

From Ben Collver@21:1/5 to All on Tue Oct 4 16:37:29 2022

# Human technology: Text files

It is a well-known engineering principle that you should always use the weakest technology capable of solving your problem--the weakest
technology is likely the cheapest and the easiest to maintain, extend or
replace, and there are no sane arguments for using anything else.

The main problem with this principle is marketing--few people would
sell you a 10$ product that can solve your problem for ever, when they
can sell you a 1000$ product, with 10$ per month maintenance cost, that
will become obsolete after 10 years. If you listen to the "experts"
you would likely end up not with the simplest, but with the most
advanced technology.

And with software the situation is particularly bad, because the
simplest technologies often cost zero, and so they have zero marketing
budget. And since nobody would be benefiting from convincing you to
use something that does not cost anything, nobody is actively selling
those. In this post, I will try to fill that gap by reviewing some technologies for web publishing that are based on plain text and
putting forward their benefits. Read on to understand why and how
you should write everything you write in plain text files and
self-publish them on your own website.

## Plain text

The problem of text is one of those problems where the simplest of all solutions works great--plain text files do the job. I've yet to see a
use-case where considering any other technology is worth it.

And similar is the case with simple static HTML websites--a simple
static page is better than all publishing platforms that can ever be
created.

Anything you write and that you want to last should be put on plain text
files.

...

From: https://boris-marinov.github.io/text/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bob Eager@21:1/5 to Ben Collver on Tue Oct 4 19:29:57 2022

On Tue, 04 Oct 2022 16:37:29 +0000, Ben Collver wrote:

## Plain text

I see what was done there!

The problem of text is one of those problems where the simplest of all solutions works great--plain text files do the job. I've yet to see a use-case where considering any other technology is worth it.

And similar is the case with simple static HTML websites--a simple
static page is better than all publishing platforms that can ever be
created.

Anything you write and that you want to last should be put on plain text files.

Indeed. Some years ago there was a discussion in some newsgroup (I forget which) about extracting names from several hundred web pages. They were
the names of crews that flew from a British airfield in WWII. The problem
was that the webpages had been created by quite a few different people,
and it seemed that mechanical extraction (several crews per page) was difficult. Various suggestions were made, and in the end I had a go with
a tool that is now 55 years old. After about three iterations, it worked.

It would have been a lot easier with plain text, or even Markdown.

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

From 5GyYap52yQ1UGMWD@21:1/5 to Ben Collver on Wed Oct 5 11:33:54 2022

Ben Collver <bencollver@tilde.pink> writes:

Thanks for that good write up.

I agree, I think that we should first try to solve technological problems with the simplest solutions. One of the reasons why I've moved
my blog to gopher is that it's just easier to maintain overall. I don't
have to worry about a database, or whether my CMS is working or not. I
just fire up my text editor, write stuff and 'scp' my files to my remote server.

--
Pointless meanderings in a bleak and lonely world.

From Oregonian Haruspex@21:1/5 to All on Thu Oct 6 06:40:58 2022

You’d have to be NUTS to try to keep your precious data around in any other format. Images and videos, audio, all have common formats but is there a “forever” format for these data which rivals plain text? No. Of course not.

From Ben Collver@21:1/5 to Oregonian Haruspex on Thu Oct 6 16:59:47 2022

On 2022-10-06, Oregonian Haruspex <no_email@invalid.invalid> wrote:

You’d have to be NUTS to try to keep your precious data around in any other format. Images and videos, audio, all have common formats but is there a “forever” format for these data which rivals plain text? No. Of course not.

"Anything you write and that you want to last should be put on plain
text files."

The original article was not talking about multimedia. You don't write
images, video, nor audio, though you might write plots, scripts,
screenplays, scores, etc.

From Samuel Christie@21:1/5 to Ben Collver on Thu Oct 6 16:06:26 2022

Ben Collver <bencollver@tilde.pink> writes:

The original article was not talking about multimedia. You don't write images, video, nor audio, though you might write plots, scripts,
screenplays, scores, etc.

Soon we /will/ be able to store everything as text descriptions, and
just have ML models generate the images, video, and audio...

From Roger Blake@21:1/5 to Ben Collver on Thu Oct 6 22:28:15 2022

On 2022-10-04, Ben Collver <bencollver@tilde.pink> wrote:

# Human technology: Text files

A problem is that at this point most users have no concept of what plain
text even is. If they think about it at all they think it means Microsoft Word or just "Microsoft".

If I ask someone to send me something in plain text format I usually just
get a blank stare. About the best I can usually do to get anyone to send something in an open format is pdf.

--
------------------------------------------------------------------------------
18 Reasons I won't be vaccinated -- https://tinyurl.com/ebty2dx3
Covid vaccines: experimental biology -- https://tinyurl.com/57mncfm5
The fraud of "Climate Change" -- https://RealClimateScience.com
There is no "climate crisis" -- https://climatedepot.com
Don't talk to cops! -- https://DontTalkToCops.com
------------------------------------------------------------------------------

From Computer Nerd Kev@21:1/5 to Roger Blake on Fri Oct 7 11:53:18 2022

Roger Blake <rogblake@iname.invalid> wrote:

A problem is that at this point most users have no concept of what plain
text even is. If they think about it at all they think it means Microsoft Word
or just "Microsoft".

That doesn't surprise me. However the article doesn't really share
my own definition of plain text either. It goes on to talk about
Markdown, and using static site generators to turn it into HTML for publication.

To me plain text means that there is no standard structure. You
make a layout up that seems appropriate and makes sense as it's
displayed in the editor, therefore you don't have to worry about
any existing standards. If I'm just making notes for myself, then
I don't even have to worry about other people understanding it (and
I do have my own particular patterns for this which just happen to
suit me and possibly aren't obvious to others). That's the freedom
of plain text to me.

On the other hand I find HTML quite readable if it's formatted
sensibly, so if I want to publish something on the web then I'd
rather just write in HTML directly than complicate matters by using
something like Markdown. If I did use some intermediate format then
there's the risk that it would generate the sort of garbled mess
that most modern websites have for their HTML - full of mixed up
line breaks, and styling stuff.

But neither Markdown, nor HTML, is plain text to me anyway.
Actually I'd go further and say that as an English speaker who
doesn't need extra characters, I prefer ASCII text. UTF-8 includes
things like emoticons which, were they to become widely used in
text documents for conveying important information, would cause me
all sorts of trouble. Thankfully so far they never seem to be used
for anything remotely important.

If I ask someone to send me something in plain text format I usually just
get a blank stare. About the best I can usually do to get anyone to send something in an open format is pdf.

Well you can convert PDF to Postscript, and so far as I'm concerned
that's "plain text" in the way that Markdown is. But I don't
consider either to really be plain text.

Well perhaps Markdown is from a reader's perspective, but not for a
writer because they need knowledge of the syntax.

--
__ __
#_ < |\| |< _#

From scott@alfter.diespammersdie.us@21:1/5 to Computer Nerd Kev on Fri Oct 7 17:03:30 2022

Computer Nerd Kev <not@telling.you.invalid> wrote:

Well you can convert PDF to Postscript, and so far as I'm concerned
that's "plain text" in the way that Markdown is. But I don't
consider either to really be plain text.

If you're lucky, you can extract text from a PDF by selecting and copying
it. If it's just an image, though (as it might be if the PDF was produced
from a scan), you'll get back nothing. You might be able to feed the PDF through an OCR engine and extract the text that way, but the quality of
those results depends largely on the quality of the scan.

Well perhaps Markdown is from a reader's perspective, but not for a
writer because they need knowledge of the syntax.

There's not much to it. Markdown seems largely to follow the sorts of conventions most people have used in text files anyway:

*this line is emphasized*

This line is a heading
======================

1. This is the first item of an ordered list.
2. This is the second line.
3. etc.

This is a quote.

* This is the first item of an unordered list.
* etc.

I suppose the elements that don't spring immediately to mind are blocks of code:

```
#include <stdio.h>

int main (void)
{
    /* printf, not print, is the stdio call that writes to stdout */
    printf("Hellorld!\n"); /* https://tinyurl.com/hellorld */
    return 0;
}
```

and [links](https://alfter.us/).

Basically, it's not much of a lift from plain text to Markdown. It's definitely less obtrusive than HTML.

--
_/_
/ v \ Scott Alfter (remove the obvious to send mail)
(IIGS( https://alfter.us/ Top-posting!
\_^_/ >What's the most annoying thing on Usenet?

From Mike Spencer@21:1/5 to Computer Nerd Kev on Fri Oct 7 15:00:10 2022

Computer Nerd Kev <not@telling.you.invalid> writes:

Roger Blake <rogblake@iname.invalid> wrote:

A problem is that at this point most users have no concept of what
plain text even is. If they think about it at all they think it
means Microsoft Word or just "Microsoft".

A friend on another newsgroup, after decades as a programmer, is
struggling with the challenge of persuading/coercing his (mostly Mac)
software to send 7-bit ASCII mail and news posts. The software wants
to make everything UTF-8 (left & right double & single quotes,
ellipses and some other punctuation are each 3 bytes). It appears
that his solution will be to compose mail/posts on a Raspberry Pi
running Linux over his LAN, then retrieve the result to post via his Mac.
It remains unclear if his Mac apps will do that without "fixing" the
deficient ASCII text.

On the other hand I find HTML quite readable if it's formatted
sensibly...

Another e-acquaintance re-posts articles from the web to a mailing
list. It appears that he righteously hits the button in his browser
labeled "Email as plain text" or similar.

The result is:

* HTML is elided but

* Much of the punctuation is 3-byte UTF-8 chars

* All links/anchors in the original HTML are included in-line
inside <https://miskatonic.edu/using_brokets> brokets.

* A "line" is whatever was rendered as a paragraph in HTML

* Then his mail client (or something) does everything up as
quoted-printable

The UTF-8 punctuation is actually 9 bytes as QP (=E2=NN=NN) and urls
are frequently quite long. It's a dog's breakfast. Not totally
UNreadable but "Quite readable" wouldn't be my choice of descriptor.
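
The byte counts above are easy to check. What follows is a minimal Python sketch, not anything the mail client in question actually runs; the helper name `byte_costs` is invented for the illustration. It measures a character as UTF-8 and then as quoted-printable using the standard `quopri` module.

```python
import quopri

def byte_costs(ch):
    """Return (UTF-8 length, quoted-printable bytes) for one character."""
    raw = ch.encode("utf-8")
    return len(raw), quopri.encodestring(raw)

# Left and right double quotes, right single quote, ellipsis, plain apostrophe.
for ch in ("\u201c", "\u201d", "\u2019", "\u2026", "'"):
    size, qp = byte_costs(ch)
    print(f"U+{ord(ch):04X}: {size} byte(s) as UTF-8, "
          f"{len(qp)} as QP: {qp.decode('ascii')}")
```

The typographic characters all encode to three bytes starting with E2, so each becomes nine ASCII characters in quoted-printable, while the plain apostrophe stays at one byte either way.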

But neither Markdown, nor HTML, is plain text to me anyway.
Actually I'd go further and say that as an English speaker who
doesn't need extra characters, I prefer ASCII text. UTF-8 includes
things like emoticons which, were they to become widely used in
text documents for conveying important information, would cause me
all sorts of trouble. Thankfully so far they never seem to be used
for anything remotely important.

Many years ago, I and others ridiculed Microsoft's tilt toward dumbing everything down to the acephalic lowest common denominator with
notions such as:

* Windows Iconic Droolproof Descriptive Language Extension

* Cognitive Reassembler Access Protocol for Windows Applications
with Rebus Enhancement

* Microsoft Iconic Canonical Reassembler for Ontic Cognitive
Enhancement of Proactive Heuristic Access to Linguistic
Youthfulness

only to have reality upstage satire, a decade or so ago, with iConji
(q.g.)[1]

[1] q.g.: quod google

--
Mike Spencer Nova Scotia, Canada

From Sn!pe@21:1/5 to Mike Spencer on Fri Oct 7 20:00:09 2022

Mike Spencer <mds@bogus.nodomain.nowhere> wrote:

Computer Nerd Kev <not@telling.you.invalid> writes:

Roger Blake <rogblake@iname.invalid> wrote:

A problem is that at this point most users have no concept of what
plain text even is. If they think about it at all they think it
means Microsoft Word or just "Microsoft".

A friend on another newsgroup, after decades as a programmer, is
struggling with the challenge of persuading/coercing his (mostly Mac) software to send 7-bit ASCII mail and news posts. The software wants
to make everything UTF-8 (left & right double & single quotes,

Hi, Mike, PMFJI.

In macOS Mail / Edit / Substitutions: turn off Smart Quotes;
and similarly for other substitutions that are not required.
See also Preferences / Composing / Message Format: Plain Text.

Obviously this does not necessarily hold true for third party software.

[relurk]

--
^^ My pet rock Gordon just is.

~ Slava Ukraini ~

From Samuel Christie@21:1/5 to All on Fri Oct 7 15:07:20 2022

That brings up a point I was wondering: does usenet/email support utf-8
yet, or is everything expected to be ASCII? 7-bit?

What happens if I do insert a non-ascii unicode glyph?

From Richard Kettlewell@21:1/5 to Samuel Christie on Fri Oct 7 20:46:49 2022

Samuel Christie <shcv@sdf.org> writes:

That brings up a point I was wondering: does usenet/email support utf-8
yet, or is everything expected to be ASCII? 7-bit?

What happens if I do insert a non-ascii unicode glyph?

Many Usenet clients have supported MIME and UTF-8 for years. There are
still a few hold-outs around though.

--
https://www.greenend.org.uk/rjk/

From Computer Nerd Kev@21:1/5 to scott@alfter.diespammersdie.us on Sat Oct 8 08:29:51 2022

scott@alfter.diespammersdie.us wrote:

Computer Nerd Kev <not@telling.you.invalid> wrote:

Well you can convert PDF to Postscript, and so far as I'm concerned
that's "plain text" in the way that Markdown is. But I don't
consider either to really be plain text.

If you're lucky, you can extract text from a PDF by selecting and copying
it. If it's just an image, though (as it might be if the PDF was produced from a scan), you'll get back nothing.

Well the thing that's handy about Postscript being text (bitmap
embedded images aside) is that in the past I've been able to do
bulk find-and-replace operations to a batch of Postscript files
without needing to use a full-blown interpreter. Unlike PDF, where
the content is compressed, Postscript is text so you just need to
understand the language and then you can do your modifications
using a text editor or Sed.
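
For anyone who would rather script it than reach for Sed, here is a rough Python sketch of the same kind of bulk find-and-replace. It assumes the files are uncompressed Postscript sitting in one directory; the function name `replace_in_ps_files` is made up for the example. It works on raw bytes so that any embedded binary data passes through untouched.

```python
from pathlib import Path

def replace_in_ps_files(directory, old, new):
    """Replace the byte string `old` with `new` in every .ps file
    directly inside `directory`. Returns the names of changed files."""
    changed = []
    for path in sorted(Path(directory).glob("*.ps")):
        data = path.read_bytes()
        if old in data:
            # Only rewrite files that actually contain the search string.
            path.write_bytes(data.replace(old, new))
            changed.append(path.name)
    return changed

# Hypothetical usage:
#   replace_in_ps_files("reports", b"(Draft 2021)", b"(Draft 2022)")
```

Like a blind Sed substitution, this knows nothing about Postscript syntax, so the search string has to be specific enough not to match anything unintended.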

My idea of plain text format is the same, just without the
potentially difficult "understanding the language" part.

Well perhaps Markdown is from a reader's perspective, but not for a
writer because they need knowledge of the syntax.

There's not much to it. Markdown seems largely to follow the sorts of conventions most people have used in text files anyway:

*this line is emphasized*

This line is a heading
======================

1. This is the first item of an ordered list.
2. This is the second line.
3. etc.

This is a quote.

* This is the first item of an unordered list.
* etc.

Yes it's nice and obvious to a reader, but for a writer it's still
many more rules to know and follow than if they were making it up
as they went.

I mean here is a "basic syntax" guide: https://www.markdownguide.org/basic-syntax

Why would I try to remember all that just so that I can follow some
standard that allows plain-text-readable files to be converted to
HTML? Just learn the HTML if you want the styling, and if you don't
then just use my sort of unstandardised plain text. It's excess
information for an unnecessary intermediate step in my opinion. Of
course more-so for me because I learnt about HTML before Markdown.

But each to their own, it's just not my idea of "plain text".

--
__ __
#_ < |\| |< _#

From Roger Blake@21:1/5 to scott@alfter.diespammersdie.us on Fri Oct 7 23:19:40 2022

On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

If you're lucky, you can extract text from a PDF by selecting and copying
it. If it's just an image, though (as it might be if the PDF was produced from a scan), you'll get back nothing. You might be able to feed the PDF through an OCR engine and extract the text that way, but the quality of
those results depends largely on the quality of the scan.

I used to be able to extract text directly from Microsoft Word documents
using "antiword" but it only works with the old binary (.doc) format and
of course the default has been the new .docx format since the 2007 version.

At least pdf is an open format. The "pdftotext" program can extract any
actual text it finds in a pdf file but sometimes those are just an image
which would require ocr to interpret.
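
As a sketch of how that can be scripted, the Python below wraps the `pdftotext` command-line tool and flags files that come back empty, which usually means a scan with no text layer. It assumes `pdftotext` (from Poppler or Xpdf) is installed and on the PATH; the function names are invented for the example.

```python
import shutil
import subprocess

def build_command(pdf_path, layout=False):
    """Build the pdftotext argument list; '-' sends the text to stdout."""
    cmd = ["pdftotext"]
    if layout:
        cmd.append("-layout")  # try to keep the physical page layout
    return cmd + [str(pdf_path), "-"]

def pdf_to_text(pdf_path, layout=False):
    """Return whatever text layer pdftotext finds in the PDF."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(build_command(pdf_path, layout),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")

def looks_scanned(text):
    """True when nothing but whitespace or form feeds came back."""
    return not text.strip()

# Hypothetical usage:
#   text = pdf_to_text("report.pdf")
#   if looks_scanned(text):
#       print("no text layer, this one needs OCR")
```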

--
------------------------------------------------------------------------------
18 Reasons I won't be vaccinated -- https://tinyurl.com/ebty2dx3
Covid vaccines: experimental biology -- https://tinyurl.com/57mncfm5
The fraud of "Climate Change" -- https://RealClimateScience.com
There is no "climate crisis" -- https://climatedepot.com
Don't talk to cops! -- https://DontTalkToCops.com
------------------------------------------------------------------------------

From Mike Spencer@21:1/5 to snipeco.2@gmail.com on Fri Oct 7 23:29:09 2022

snipeco.2@gmail.com (Sn!pe) writes:

Mike Spencer <mds@bogus.nodomain.nowhere> wrote:

Computer Nerd Kev <not@telling.you.invalid> writes:

Roger Blake <rogblake@iname.invalid> wrote:

A problem is that at this point most users have no concept of what
plain text even is. If they think about it at all they think it
means Microsoft Word or just "Microsoft".

A friend on another newsgroup, after decades as a programmer, is
struggling with the challenge of persuading/coercing his (mostly Mac)
software to send 7-bit ASCII mail and news posts. The software wants
to make everything UTF-8 (left & right double & single quotes,

Hi, Mike, PMFJI.

All help welcome. Most of us need all the help we can get.

In macOS Mail / Edit / Substitutions: turn off Smart Quotes;
and similarly for other substitutions that are not required.
See also Preferences / Composing / Message Format: Plain Text.

And a Mac will interpret "Plain Text" as 7-bit ASCII? I would but
Mac-world is a black box.

Obviously this does not necessarily hold true for third party software.

[relurk]

Forwarded to Mac-user party in question.

TYVM.

ellipses and some other punctuation are each 3 bytes). It appears
that his solution will be to compose mail/posts on a Raspberry Pi
running Linux over his LAN, then retrieve the result to post via his Mac.
It remains unclear if his Mac apps will do that without "fixing" the
deficient ASCII text.

[snip]
--
Mike Spencer Nova Scotia, Canada

From Dan Espen@21:1/5 to scott@alfter.diespammersdie.us on Fri Oct 7 23:33:54 2022

scott@alfter.diespammersdie.us writes:

Computer Nerd Kev <not@telling.you.invalid> wrote:

Well you can convert PDF to Postscript, and so far as I'm concerned
that's "plain text" in the way that Markdown is. But I don't
consider either to really be plain text.

If you're lucky, you can extract text from a PDF by selecting and copying
it. If it's just an image, though (as it might be if the PDF was produced from a scan), you'll get back nothing. You might be able to feed the PDF through an OCR engine and extract the text that way, but the quality of
those results depends largely on the quality of the scan.

I've done the OCR the PDF thing. It worked quite well.

For non-image documents, pdftotext does the job.

--
Dan Espen

From Spiros Bousbouras@21:1/5 to Samuel Christie on Sat Oct 8 03:58:04 2022

On Fri, 07 Oct 2022 15:07:20 -0400
Samuel Christie <shcv@sdf.org> wrote:

That brings up a point I was wondering: does usenet/email support utf-8
yet, or is everything expected to be ASCII? 7-bit?

If you mean emails or usenet posts where some of the octets have values > 127 then I've never seen problems and I've sent or read many such emails or usenet posts. Obviously the header must have the correct information. For an example see this post or <87h70fmn6e.fsf@LkoBDZeT.terraraq.uk> .

Octets with value 0 are *not* ok and possibly some other values < 32. If you want such values then the email or post needs to be appropriately encoded, namely BASE64 or quoted-printable. Again, the header must mention this.

I've seen occasions where things worked correctly even when the header did
not have the correct information , the software guessed correctly what was needed. But it's best not to risk it.
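
To illustrate what "correct information in the header" looks like, here is a small Python sketch using the standard `email` package. It is only an illustration of the MIME headers involved, not a description of what any particular newsreader does, and the helper name `make_post` is made up.

```python
from email import policy
from email.message import EmailMessage

def make_post(body, seven_bit=False):
    """Build a text message and let the library pick the MIME headers.

    With seven_bit=True the library is told the transport is 7-bit only,
    so it has to fall back to base64 or quoted-printable.
    """
    pol = policy.default.clone(cte_type="7bit") if seven_bit else policy.default
    msg = EmailMessage(policy=pol)
    msg["Subject"] = "Non-ASCII test"
    msg.set_content(body)
    return msg

for seven_bit in (False, True):
    msg = make_post("αβγδ ∅ ∈ ∉", seven_bit=seven_bit)
    print(msg["Content-Type"], "/", msg["Content-Transfer-Encoding"])
```

In both cases the Content-Type header names the charset, and the Content-Transfer-Encoding header tells the reader whether the body is raw 8-bit octets or has been wrapped in one of the 7-bit-safe encodings.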

What happens if I do insert a non-ascii unicode glyph?

Let's try it out :

Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω

Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

Can you read all this ?

--
vlaho.ninja/prog

From Grant Taylor@21:1/5 to Grant Taylor on Fri Oct 7 16:53:25 2022

On 10/7/22 4:51 PM, Grant Taylor wrote:

I believe so.

§

--
Grant. . . .
unix || die

From Rich@21:1/5 to Roger Blake on Sat Oct 8 04:49:41 2022

Roger Blake <rogblake@iname.invalid> wrote:

On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

If you're lucky, you can extract text from a PDF by selecting and
copying it. If it's just an image, though (as it might be if the
PDF was produced from a scan), you'll get back nothing. You might
be able to feed the PDF through an OCR engine and extract the text
that way, but the quality of those results depends largely on the
quality of the scan.

I used to be able to extract text directly from Microsoft Word
documents using "antiword" but it only works with the old binary
(.doc) format and of course the default has been the new .docx format
since the 2007 version.

Docx files are just zip files containing a bunch of XML files, so with
a small bit of effort, you can extract text directly from docx files as
well.
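
A minimal sketch of that "small bit of effort" in Python, using only the standard library. It assumes the usual layout where the main body lives in `word/document.xml` and the text sits in `w:t` elements inside `w:p` paragraphs; headers, footnotes and other parts are ignored, and the function name is invented for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def docx_to_text(source):
    """Return the body text of a .docx file, one line per paragraph.

    `source` may be a file name or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is split across runs; join the pieces.
        pieces = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)

# Hypothetical usage:
#   print(docx_to_text("letter.docx"))
```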

From Sn!pe@21:1/5 to Mike Spencer on Sat Oct 8 11:40:37 2022

Mike Spencer <mds@bogus.nodomain.nowhere> wrote:

[...]

Hi, Mike, PMFJI.

All help welcome. Most of us need all the help we can get.

In macOS Mail / Edit / Substitutions: turn off Smart Quotes;
and similarly for other substitutions that are not required.
See also Preferences / Composing / Message Format: Plain Text.

And a Mac will interpret "Plain Text" as 7-bit ASCII? I would but
Mac-world is a black box.

I rather think not but I can't say definitively. I imagine it would be UTF-(something) but being only a user I'm not expert in macOS's
underpinnings. My newsreader falls back to the simplest encoding
that will support the required characters; maybe MacOS is similar.
The fellows in comp.sys.mac.* or uk.comp.sys.mac would probably
know.
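
For what it's worth, the "simplest encoding that will support the required characters" idea can be sketched in a few lines of Python. This is only a guess at the general approach, not a description of what my newsreader or macOS Mail really does; the function name and the candidate list are assumptions.

```python
def simplest_charset(text, candidates=("us-ascii", "iso-8859-1", "utf-8")):
    """Return the first charset in `candidates` that can encode `text`."""
    for charset in candidates:
        try:
            text.encode(charset)
        except UnicodeEncodeError:
            continue  # something in the text does not fit, try the next one
        return charset
    raise ValueError("no candidate charset can encode this text")

for sample in ("plain old text", "caf\u00e9", "\u201csmart quotes\u201d"):
    print(repr(sample), "->", simplest_charset(sample))
```

Under that scheme a message only leaves 7-bit ASCII when it has to, which is why turning off Smart Quotes and the other substitutions matters: one curly quote is enough to push the whole message to UTF-8.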

Obviously this does not necessarily hold true for third party software.

[relurk]

Forwarded to Mac-user party in question.

TYVM.

YW.

--
^^ My pet rock Gordon just is.

~ Slava Ukraini ~

From Samuel Christie@21:1/5 to All on Sat Oct 8 15:05:14 2022

Let's try it out :

Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω

Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

Can you read all this ?

Works just fine for me! Good to know I won't accidentally break
everything if I include unusual characters.

From The Real Bev@21:1/5 to Samuel Christie on Sat Oct 8 16:12:51 2022

On 10/8/22 12:05 PM, Samuel Christie wrote:

Let's try it out :

Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω

Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

Can you read all this ?

Works just fine for me! Good to know I won't accidentally break
everything if I include unusual characters.

I see them too.

Good to know that I won't have to buy a set of Typits!

BTW, here's a handy chart. Looks pretty ratty in a proportional
typeface, though.

iso8859-1 cheat sheet
(per http://www.uni-passau.de/~ramsch/iso8859-1.html)

¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿

À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß

à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

dec oct 8 7 HTML    | dec oct 8 7 HTML    | dec oct 8 7 HTML    |
====================|=====================|=====================|
161 241 ¡ ! &iexcl; | 162 242 ¢ " &cent;  | 163 243 £ # &pound; |
164 244 ¤ $ &curren;| 165 245 ¥ % &yen;   | 166 246 ¦ & &brvbar;|
167 247 § ' &sect;  | 168 250 ¨ ( &uml;   | 169 251 © ) &copy;  |
170 252 ª * &ordf;  | 171 253 « + &laquo; | 172 254 ¬ , &not;   |
173 255 ­ - &shy;   | 174 256 ® . &reg;   | 175 257 ¯ / &macr;  |
176 260 ° 0 &deg;   | 177 261 ± 1 &plusmn;| 178 262 ² 2 &sup2;  |
179 263 ³ 3 &sup3;  | 180 264 ´ 4 &acute; | 181 265 µ 5 &micro; |
182 266 ¶ 6 &para;  | 183 267 · 7 &middot;| 184 270 ¸ 8 &cedil; |
185 271 ¹ 9 &sup1;  | 186 272 º : &ordm;  | 187 273 » ; &raquo; |
188 274 ¼ < &frac14;| 189 275 ½ = &frac12;| 190 276 ¾ > &frac34;|
191 277 ¿ ? &iquest;| 192 300 À @ &Agrave;| 193 301 Á A &Aacute;|
194 302 Â B &Acirc; | 195 303 Ã C &Atilde;| 196 304 Ä D &Auml;  |
197 305 Å E &Aring; | 198 306 Æ F &AElig; | 199 307 Ç G &Ccedil;|
200 310 È H &Egrave;| 201 311 É I &Eacute;| 202 312 Ê J &Ecirc; |
203 313 Ë K &Euml;  | 204 314 Ì L &Igrave;| 205 315 Í M &Iacute;|
206 316 Î N &Icirc; | 207 317 Ï O &Iuml;  | 208 320 Ð P &ETH;   |
209 321 Ñ Q &Ntilde;| 210 322 Ò R &Ograve;| 211 323 Ó S &Oacute;|
212 324 Ô T &Ocirc; | 213 325 Õ U &Otilde;| 214 326 Ö V &Ouml;  |
215 327 × W &times; | 216 330 Ø X &Oslash;| 217 331 Ù Y &Ugrave;|
218 332 Ú Z &Uacute;| 219 333 Û [ &Ucirc; | 220 334 Ü \ &Uuml;  |
221 335 Ý ] &Yacute;| 222 336 Þ ^ &THORN; | 223 337 ß _ &szlig; |
224 340 à ` &agrave;| 225 341 á a &aacute;| 226 342 â b &acirc; |
227 343 ã c &atilde;| 228 344 ä d &auml;  | 229 345 å e &aring; |
230 346 æ f &aelig; | 231 347 ç g &ccedil;| 232 350 è h &egrave;|
233 351 é i &eacute;| 234 352 ê j &ecirc; | 235 353 ë k &euml;  |
236 354 ì l &igrave;| 237 355 í m &iacute;| 238 356 î n &icirc; |
239 357 ï o &iuml;  | 240 360 ð p &eth;   | 241 361 ñ q &ntilde;|
242 362 ò r &ograve;| 243 363 ó s &oacute;| 244 364 ô t &ocirc; |
245 365 õ u &otilde;| 246 366 ö v &ouml;  | 247 367 ÷ w &divide;|
248 370 ø x &oslash;| 249 371 ù y &ugrave;| 250 372 ú z &uacute;|
251 373 û { &ucirc; | 252 374 ü | &uuml;  | 253 375 ý } &yacute;|
254 376 þ ~ &thorn; | 255 377 ÿ   &yuml;  |

--
Cheers, Bev
Red ship crashes into blue ship - sailors marooned.

From Matthew Ernisse@21:1/5 to All on Sun Oct 9 01:59:18 2022

On Wed, 05 Oct 2022 11:33:54 +0800, 5GyYap52yQ1UGMWD wrote:

I agree, I think that we should first try to solve technological problems with the simplest solutions. One of the reasons why I've moved
my blog to gopher is that it's just easier to maintain overall. I don't
have to worry about a database, or whether my CMS is working or not. I
just fire up my text editor, write stuff and 'scp' my files to my remote server.

I'm hoping you are aware that you don't need a CMS or a database to
publish information over HTTP, but if you aren't then you can quite
happily (and just as easily) publish things to a web server to present
over HTTP using a text editor and scp. This has the benefit of still
being supported by modern browsers.
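
To illustrate how little is needed, here is a minimal sketch that wraps a plain text file's contents in a static HTML page ready to be copied to a web server. The function name text_to_html and the choice of a <pre> block are assumptions for the example, not anything prescribed above.

```python
import html

def text_to_html(title, body):
    """Wrap plain text in a minimal static HTML page, escaping markup characters."""
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><pre>{html.escape(body)}</pre></body></html>\n"
    )
```

Write the result to a file such as index.html and scp it to the server as usual; no database or CMS is involved at any point.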

--
"The avalanche has started, it is too late for the pebbles to vote."
--Kosh

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From scott@alfter.diespammersdie.us@21:1/5 to Spiros Bousbouras on Mon Oct 10 18:40:36 2022

Spiros Bousbouras <spibou@gmail.com> wrote:

Lets try it out :

Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω

Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

Can you read all this ?

Received five-by-five, though the math symbols are a bit small. Pretty sure that's just down to font choice (Lucida Console, 9 pt.).

As you might see from examining the header, I'm using tin. Previously, I
had used trn, and I'm pretty sure it would've choked on non-ASCII content.

--
_/_
/ v \ Scott Alfter (remove the obvious to send mail)
(IIGS( https://alfter.us/ Top-posting!
\_^_/ >What's the most annoying thing on Usenet?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Retrograde@21:1/5 to Roger Blake on Mon Oct 10 20:37:55 2022

On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:

On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

I used to be able to extract text directly from Microsoft Word
documents using "antiword" but it only works with the old binary
(.doc) format and of course the default has been the new .docx format
since the 2007 version.

Pandoc does quite a nice job of converting docx to other formats.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Computer Nerd Kev@21:1/5 to Retrograde on Tue Oct 11 07:47:24 2022

Retrograde <fungus@amongus.com.invalid> wrote:

On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:

On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

I used to be able to extract text directly from Microsoft Word
documents using "antiword" but it only works with the old binary
(.doc) format and of course the default has been the new .docx format
since the 2007 version.

Pandoc does quite a nice job of converting docx to other formats.

I just discovered that myself actually. This command seems to work
well to generate a HTML file with any images embedded within it (I
prefer this a little over PDF):
pandoc -s --embed-resources --ascii -o file.htm file.docx
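
If Pandoc is not available and only the raw text is wanted, a rough fallback is possible because a .docx file is a ZIP archive with the main text stored in word/document.xml. The following is a minimal sketch using only the Python standard library; the function name docx_to_text is made up here, and it ignores images, tables layout, footnotes and formatting, so it is no substitute for Pandoc.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_to_text(path_or_file):
    """Return the text of a .docx file, one line per paragraph."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph holds runs; the visible text lives in <w:t> elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling docx_to_text("file.docx") returns a string that can be written straight to a .txt file.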

The other one that I would like to handle is Excel spreadsheets in
xls and xlsx formats. PHPSpreadsheet from the PHPOffice project
seems to handle this, but as it's not designed for command-line use
it's going to take some more work to get equivalent functionality
out of it.

https://github.com/PHPOffice

--
__ __
#_ < |\| |< _#

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Computer Nerd Kev@21:1/5 to Bob Eager on Tue Oct 11 08:14:09 2022

Bob Eager <news0009@eager.cx> wrote:

On Fri, 07 Oct 2022 11:53:18 +1000, Computer Nerd Kev wrote:

Well you can convert PDF to Postscript, and so far as I'm concerned
that's "plain text" in the way that Markdown is.

Doesn't work if the PostScript file is just a load of images.

Presuming bitmap images, yes. Markdown apparently allows you to
reference images as well though, so you could just as well have a
Markdown document with only scanned images of text in it.

I usually print, scan and OCR.

Surely you can OCR without the printing and scanning? Ghostscript
can generate PNG (etc.) bitmap images for each page of a PDF, at a
specified resolution.
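
As a rough sketch of that Ghostscript step, the Python below builds a command line that renders each page of a PDF to a PNG at a chosen resolution, and runs it only if gs is installed. The function names, the 300 dpi default and the page-%03d.png naming pattern are assumptions for the example; png16m is Ghostscript's 24-bit colour PNG device.

```python
import shutil
import subprocess

def ghostscript_png_command(pdf_path, dpi=300, pattern="page-%03d.png"):
    """Build a Ghostscript command line that renders each PDF page to a PNG."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",          # 24-bit colour PNG output device
        f"-r{dpi}",                 # resolution in dots per inch
        f"-sOutputFile={pattern}",  # %03d is replaced by the page number
        pdf_path,
    ]

def render_pages(pdf_path, dpi=300):
    """Run Ghostscript if it is installed; return True when it was run."""
    if shutil.which("gs") is None:
        return False
    subprocess.run(ghostscript_png_command(pdf_path, dpi), check=True)
    return True
```

The resulting PNG files could then be handed to whatever OCR engine is preferred, with no paper involved.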

The pdfimages program from Xpdf claims that it "extracts the images
from a PDF file", so it may be better again because there isn't
any recompression or resampling. To be honest I don't do OCR for
anything so I haven't looked into it much. Where I last found that
editing Postscript manually came in handy was actually for
correcting a formatting glitch for printing.

--
__ __
#_ < |\| |< _#

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bob Eager@21:1/5 to Spiros Bousbouras on Mon Oct 10 21:33:57 2022

On Sat, 08 Oct 2022 03:58:04 +0000, Spiros Bousbouras wrote:

Lets try it out :

Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ αβγδεζηθικλμνξοπρστυφχψω

Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

Can you read all this ?

Fine for me. Pan on FreeBSD.

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bob Eager@21:1/5 to Matthew Ernisse on Mon Oct 10 21:35:52 2022

On Sun, 09 Oct 2022 01:59:18 +0000, Matthew Ernisse wrote:

I'm hoping you are aware that you don't need a CMS or a database to
publish information over HTTP, but if you aren't then you can quite
happily (and just as easily) publish things to a web server to present
over HTTP using a text editor and scp. This has the benefit of still
being supported by modern browsers.

I always do it like that (although I use the curl library for REXX for an automated upload).

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bob Eager@21:1/5 to Computer Nerd Kev on Mon Oct 10 21:35:03 2022

On Fri, 07 Oct 2022 11:53:18 +1000, Computer Nerd Kev wrote:

Well you can convert PDF to Postscript, and so far as I'm concerned
that's "plain text" in the way that Markdown is.

Doesn't work if the PostScript file is just a load of images. I usually
print, scan and OCR.

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bob Eager@21:1/5 to Computer Nerd Kev on Tue Oct 11 08:21:10 2022

On Tue, 11 Oct 2022 08:14:09 +1000, Computer Nerd Kev wrote:

I usually print, scan and OCR.

Surely you can OCR without the printing and scanning? Ghostscript can generate PNG (etc.) bitmap images for each page of a PDF, at a specified resolution.

Not in this case. I have a lot of material that is on a CD, in a format
only accessible by a Windows program that won't run on anything later
than XP. It fails when printed to a file!

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Ben Collver@21:1/5 to Spiros Bousbouras on Tue Oct 11 16:19:17 2022

On 2022-10-08, Spiros Bousbouras <spibou@gmail.com> wrote:

Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω

Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

Can you read all this ?

Reads fine for me in slrn and xfce4-terminal.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Computer Nerd Kev@21:1/5 to scott@alfter.diespammersdie.us on Wed Oct 12 06:24:26 2022

scott@alfter.diespammersdie.us wrote:

Spiros Bousbouras <spibou@gmail.com> wrote:

Lets try it out :

Greek alphabet :
????????????????????????
????????????????????????

Some mathematical symbols :
? ? ? ? ? ? \ ? ? ? ? ? ? ? ? ? ? ? ?

Can you read all this ?

Received five-by-five, though the math symbols are a bit small. Pretty sure that's just down to font choice (Lucida Console, 9 pt.).

As you might see from examining the header, I'm using tin.

Tin also supports translating characters into other character sets
if it's set to prefer them, which is handy if you don't use a
unicode-capable terminal or font. But as you can see, it does tend
to go a little heavy on the "I don't know" character at times.

Compile-time options control some of that behaviour.
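
The general effect can be reproduced in a few lines of Python. This is only a sketch of lossy charset conversion, not how tin implements it, and the function name to_charset is made up for illustration. Every character the target charset cannot represent comes out as "?".

```python
def to_charset(text, charset="ascii"):
    """Re-encode text for a terminal limited to charset, marking losses with '?'."""
    # errors="replace" substitutes "?" for each character that cannot be encoded.
    return text.encode(charset, errors="replace").decode(charset)
```

With the default ASCII target, the Greek letters and set symbols all collapse to question marks; with "latin-1" as the target, accented Western European letters survive and only the rest is replaced. Unlike the quoted text above, this sketch does not turn the set-minus sign into a backslash, so tin is evidently doing some transliteration of its own as well.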

--
__ __
#_ < |\| |< _#

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Retrograde@21:1/5 to Bob Eager on Wed Oct 12 00:21:24 2022

On 2022-10-10, Bob Eager <news0009@eager.cx> wrote:

On Sat, 08 Oct 2022 03:58:04 +0000, Spiros Bousbouras wrote:

Lets try it out :

Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ αβγδεζηθικλμνξοπρστυφχψω

Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

Can you read all this ?

Fine for me. Pan on FreeBSD.

It's encoded text/plain; charset=UTF-8 so any UTF-8-aware newsreader in an environment with the right font should work fine. Both claws-mail and slrn (in gnome-term) on Linux Mint show me both your Greek and your math just fine over here. On the Linux console, the Greek comes through but only half the math - I interpret that as my chosen console font having only a partial set of the math glyphs.
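
To make the encoding side concrete, here is a small Python sketch (the function name utf8_widths is made up for illustration) that reports how many bytes UTF-8 spends on each character. ASCII letters take one byte, the Greek letters take two and the set symbols take three, which is why a reader that ignores the charset label shows several junk characters per symbol.

```python
def utf8_widths(text):
    """Map each distinct non-space character to the number of bytes UTF-8 uses for it."""
    return {ch: len(ch.encode("utf-8")) for ch in text if not ch.isspace()}
```

Whether the decoded character then actually appears on screen is a separate question that depends only on the font, as the console example shows.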

I'm nostalgic for lots of early technology, but I wouldn't go back to
the era before UTF-8 for anything.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Louis Krupp@21:1/5 to Bob Eager on Tue Oct 11 18:15:47 2022

On 10/11/2022 2:21 AM, Bob Eager wrote:

On Tue, 11 Oct 2022 08:14:09 +1000, Computer Nerd Kev wrote:

I usually print, scan and OCR.

Surely you can OCR without the printing and scanning? Ghostscript can
generate PNG (etc.) bitmap images for each page of a PDF, at a specified
resolution.

Not in this case. I have a lot of material that is on a CD, in a format
only accessible by a Windows program that won't run on anything later
than XP. It fails when printed to a file!

Can the program that reads the file export it as something else? Out of curiosity, what is the file format called, and is it by any chance
documented?

Louis

(My apologies if this shows up twice.)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bob Eager@21:1/5 to Louis Krupp on Wed Oct 12 08:00:17 2022

On Tue, 11 Oct 2022 18:15:47 -0600, Louis Krupp wrote:

On 10/11/2022 2:21 AM, Bob Eager wrote:

On Tue, 11 Oct 2022 08:14:09 +1000, Computer Nerd Kev wrote:

I usually print, scan and OCR.

Surely you can OCR without the printing and scanning? Ghostscript can
generate PNG (etc.) bitmap images for each page of a PDF, at a
specified resolution.

Not in this case. I have a lot of material that is on a CD, in a format
only accessible by a Windows program that won't run on anything later
than XP. It fails when printed to a file!

Can the program that reads the file export it as something else? Out of curiosity, what is the file format called, and is it by any chance documented?

It's a proprietary format, and the thing that reads it is designed to
ONLY allow documents to be read on screen or printed.

It's not a problem; finally I have completed it and won't have to revisit.

Explanation: it's a CD of back issues of a journal. They want silly money
for PDFs of single articles. I knew a colleague had all the back issues
on paper, but when I asked him he had dumped them three weeks previously!
He had the CD, but it has a 16 bit installer for the reading application.
A VM with XP allowed me to use the application.

I have now thought of another possible way, but I've done it all now. The printing and OCR worked really well.

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Otto J. Makela@21:1/5 to Spiros Bousbouras on Wed Oct 12 13:32:12 2022

Spiros Bousbouras <spibou@gmail.com> wrote:

MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

[...]

Lets try it out :

Greek alphabet :
ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
αβγδεζηθικλμνξοπρστυφχψω

Some mathematical symbols :
∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

Can you read all this ?

UTF-8 encoding works just fine with Gnus v5.13, to the extent that a
text terminal (I'm running this through mosh) can display characters.
--
/* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
/* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
/* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
/* * * Computers Rule 01001111 01001011 * * * * * * */

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anthk@21:1/5 to Roger Blake on Thu Oct 13 00:38:56 2022

On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:

On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

If you're lucky, you can extract text from a PDF by selecting and copying
it. If it's just an image, though (as it might be if the PDF was produced from a scan), you'll get back nothing. You might be able to feed the PDF
through an OCR engine and extract the text that way, but the quality of
those results depends largely on the quality of the scan.

I used to be able to extract text directly from Microsoft Word documents using "antiword" but it only works with the old binary (.doc) format and
of course the default has been the new .docx format since the 2007 version.

At least pdf is an open format. The "pdftotext" program can extract any actual text it finds in a pdf file but sometimes those are just an image which would require ocr to interpret.

With MUPDF you can select the text with the right click mouse button
and it will be copied into the clipboard.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anthk@21:1/5 to Computer Nerd Kev on Thu Oct 13 00:38:57 2022

On 2022-10-10, Computer Nerd Kev <not@telling.you.invalid> wrote:

Retrograde <fungus@amongus.com.invalid> wrote:

On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:

On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

I used to be able to extract text directly from Microsoft Word
documents using "antiword" but it only works with the old binary
(.doc) format and of course the default has been the new .docx format
since the 2007 version.

Pandoc does quite a nice job of converting docx to other formats.

I just discovered that myself actually. This command seems to work
well to generate a HTML file with any images embedded within it (I
prefer this a little over PDF):
pandoc -s --embed-resources --ascii -o file.htm file.docx

The other one that I would like to handle is Excel spreadsheets in
xls and xlsx formats. PHPSpreadsheet from the PHPOffice project
seems to handle this, but as it's not designed for command-line use
it's going to take some more work to get equivalent functionality
out of it.

https://github.com/PHPOffice

Get sc-im+gnuplot for xls and xlsx files. It's like LibreOffice Calc
but for the CLI and with vi keys.
For more operations, install visicalc and the required dependencies.
Also, to dump DOC files, you can use catdoc and antiword.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Bob Eager@21:1/5 to Anthk on Thu Oct 13 07:50:11 2022

On Thu, 13 Oct 2022 00:38:56 +0000, Anthk wrote:

At least pdf is an open format. The "pdftotext" program can extract any
actual text it finds in a pdf file but sometimes those are just an
image which would require ocr to interpret.

With MUPDF you can select the text with the right click mouse button and
it will be copied into the clipboard.

Not if the pages are just scanned images.

--
Using UNIX since v6 (1975)...

Use the BIG mirror service in the UK:
http://www.mirrorservice.org

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Computer Nerd Kev@21:1/5 to Anthk on Fri Oct 14 07:44:33 2022

Anthk <anthk@disroot.org> wrote:

On 2022-10-10, Computer Nerd Kev <not@telling.you.invalid> wrote:

The other one that I would like to handle is Excel spreadsheets in
xls and xlsx formats. PHPSpreadsheet from the PHPOffice project
seems to handle this, but as it's not designed for command-line use
it's going to take some more work to get equivalent functionality
out of it.

https://github.com/PHPOffice

Get sc-im+gnuplot for xls and xlsx files. It's like LibreOffice Calc
but for the CLI and with vi keys.

Thanks! That saved me from trying to figure out how to write a
command-line application in PHP. It still took me a while to find
the right options to get it to work as a Pandoc-style command-line
converter though. This is the magic concoction that generates a TSV
file from an XLSX spreadsheet without a lot of rubbish at the start
of the file:

sc-im --export_tab --nocurses --quit_afterload file.xlsx > file.tsv

Strangely the "--output=" option only wants to create empty files
for me (with version 0.7.0).
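
Once the data is in TSV form it is easy to process further. A minimal Python sketch for reading it back, assuming plain tab-separated fields (the function name read_tsv is made up, and I have not checked how sc-im quotes fields that themselves contain tabs or newlines):

```python
import csv
import io

def read_tsv(stream):
    """Parse tab-separated text into a list of rows (each a list of strings)."""
    return list(csv.reader(stream, delimiter="\t"))

# Small in-memory example standing in for an exported file.
sample = io.StringIO("name\tqty\nwidget\t3\n")
rows = read_tsv(sample)
```

For a real export, open the file with open("file.tsv", newline="", encoding="utf-8") and pass the file object to read_tsv in the same way.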

The terminal-based spreadsheet program itself does look interesting
as well, though I'm pretty sure that I'd miss selecting cells and copying/pasting using the mouse.

--
__ __
#_ < |\| |< _#

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
