• in praise of text files

    From Ben Collver@21:1/5 to All on Tue Oct 4 16:37:29 2022
    # Human technology: Text files

    It is a well-known engineering principle that you should always use
    the weakest technology capable of solving your problem--the weakest
    technology is likely the cheapest, easiest to maintain, extend or
    replace and there are no sane arguments for using anything else.

    The main problem with this principle is marketing--few people would
    sell you a 10$ product that can solve your problem for ever, when they
    can sell you a 1000$ product, with 10$ per month maintenance cost, that
    will become obsolete after 10 years. If you listen to the "experts"
    you would likely end up not with the simplest, but with the most
    advanced technology.

    And with software the situation is particularly bad, because the
    simplest technologies often cost zero, and so they have zero marketing
    budget. And since nobody would be benefiting from convincing you to
    use something that does not cost anything, nobody is actively selling
    those. In this post, I will try to fill that gap by reviewing some
    technologies for web publishing that are based on plain text and
    putting forward their benefits. Read on to understand why and how
    you should write everything you write in plain text files and
    self-publish them on your own website.

    ## Plain text

    The problem of text is one of those problems where the simplest of all
    solutions works great--plain text files do the job. I've yet to see a
    use-case where considering any other technology is worth it.

    And similar is the case with simple static HTML websites--a simple
    static page is better than all publishing platforms that can ever be
    created.

    Anything you write and that you want to last should be put on plain text
    files.

    ...

    From: https://boris-marinov.github.io/text/

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bob Eager@21:1/5 to Ben Collver on Tue Oct 4 19:29:57 2022
    On Tue, 04 Oct 2022 16:37:29 +0000, Ben Collver wrote:

    ## Plain text

    I see what was done there!

    The problem of text is one of those problems where the simplest of all solutions works great--plain text files do the job. I've yet to see a use-case where considering any other technology is worth it.

    And similar is the case with simple static HTML websites--a simple
    static page is better than all publishing platforms that can ever be
    created.

    Anything you write and that you want to last should be put on plain text files.

    Indeed. Some years ago there was a discussion in some newsgroup (I forget
    which) about extracting names from several hundred web pages. They were
    the names of crews that flew from a British airfield in WWII. The problem
    was that the web pages had been created by quite a few different people,
    and it seemed that mechanical extraction (several crews per page) was
    difficult. Various suggestions were made, and in the end I had a go with
    a tool that is now 55 years old. After about three iterations, it worked.

    It would have been a lot easier with plain text, or even Markdown.



    --
    Using UNIX since v6 (1975)...

    Use the BIG mirror service in the UK:
    http://www.mirrorservice.org

  • From 5GyYap52yQ1UGMWD@21:1/5 to Ben Collver on Wed Oct 5 11:33:54 2022
    Ben Collver <bencollver@tilde.pink> writes:

    [snip]

    Thanks for that good write up.

    I agree, I think that we should first try to solve technological
    problems with the simplest solutions. One of the reasons why I've moved
    my blog to gopher is that it's just easier to maintain overall. I don't
    have to worry about a database, or whether my CMS is working or not. I
    just fire up my text editor, write stuff and 'scp' my files to my remote
    server.

    --
    Pointless meanderings in a bleak and lonely world.

  • From Oregonian Haruspex@21:1/5 to All on Thu Oct 6 06:40:58 2022
    You’d have to be NUTS to try to keep your precious data around in any other format. Images and videos, audio, all have common formats but is there a “forever” format for these data which rivals plain text? No. Of course not.

  • From Ben Collver@21:1/5 to Oregonian Haruspex on Thu Oct 6 16:59:47 2022
    On 2022-10-06, Oregonian Haruspex <no_email@invalid.invalid> wrote:
    You’d have to be NUTS to try to keep your precious data around in any other format. Images and videos, audio, all have common formats but is there a “forever” format for these data which rivals plain text? No. Of course not.

    "Anything you write and that you want to last should be put on plain
    text files."

    The original article was not talking about multimedia. You don't write
    images, video, nor audio, though you might write plots, scripts,
    screenplays, scores, etc.

  • From Samuel Christie@21:1/5 to Ben Collver on Thu Oct 6 16:06:26 2022
    Ben Collver <bencollver@tilde.pink> writes:
    The original article was not talking about multimedia. You don't write images, video, nor audio, though you might write plots, scripts,
    screenplays, scores, etc.

    Soon we /will/ be able to store everything as text descriptions, and
    just have ML models generate the images, video, and audio...

  • From Roger Blake@21:1/5 to Ben Collver on Thu Oct 6 22:28:15 2022
    On 2022-10-04, Ben Collver <bencollver@tilde.pink> wrote:
    # Human technology: Text files

    A problem is that at this point most users have no concept of what plain
    text even is. If they think about it at all they think it means Microsoft Word or just "Microsoft".

    If I ask someone to send me something in plain text format I usually just
    get a blank stare. About the best I can usually do to get anyone to send something in an open format is pdf.

    --
    ------------------------------------------------------------------------------
    18 Reasons I won't be vaccinated -- https://tinyurl.com/ebty2dx3
    Covid vaccines: experimental biology -- https://tinyurl.com/57mncfm5
    The fraud of "Climate Change" -- https://RealClimateScience.com
    There is no "climate crisis" -- https://climatedepot.com
    Don't talk to cops! -- https://DontTalkToCops.com
    ------------------------------------------------------------------------------

  • From Computer Nerd Kev@21:1/5 to Roger Blake on Fri Oct 7 11:53:18 2022
    Roger Blake <rogblake@iname.invalid> wrote:
    A problem is that at this point most users have no concept of what plain
    text even is. If they think about it at all they think it means Microsoft Word
    or just "Microsoft".

    That doesn't surprise me. However the article doesn't really share
    my own definition of plain text either. It goes on to talk about
    Markdown, and using static site generators to turn it into HTML for publication.

    To me plain text means that there is no standard structure. You
    make a layout up that seems appropriate and makes sense as it's
    displayed in the editor, therefore you don't have to worry about
    any existing standards. If I'm just making notes for myself, then
    I don't even have to worry about other people understanding it (and
    I do have my own particular patterns for this which just happen to
    suit me and possibly aren't obvious to others). That's the freedom
    of plain text to me.

    On the other hand I find HTML quite readable if it's formatted
    sensibly, so if I want to publish something on the web then I'd
    rather just write in HTML directly than complicate matters by using
    something like Markdown. If I did use some intermediate format then
    there's the risk that it would generate the sort of garbled mess
    that most modern websites have for their HTML - full of mixed up
    line breaks, and styling stuff.

    But neither Markdown, nor HTML, is plain text to me anyway.
    Actually I'd go further and say that as an English speaker who
    doesn't need extra characters, I prefer ASCII text. UTF-8 includes
    things like emoticons which, were they to become widely used in
    text documents for conveying important information, would cause me
    all sorts of trouble. Thankfully so far they never seem to be used
    for anything remotely important.

    If I ask someone to send me something in plain text format I usually just
    get a blank stare. About the best I can usually do to get anyone to send something in an open format is pdf.

    Well you can convert PDF to Postscript, and so far as I'm concerned
    that's "plain text" in the way that Markdown is. But I don't
    consider either to really be plain text.

    Well perhaps Markdown is from a reader's perspective, but not for a
    writer because they need knowledge of the syntax.

    --
    __ __
    #_ < |\| |< _#

  • From scott@alfter.diespammersdie.us@21:1/5 to Computer Nerd Kev on Fri Oct 7 17:03:30 2022
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    Well you can convert PDF to Postscript, and so far as I'm concerned
    that's "plain text" in the way that Markdown is. But I don't
    consider either to really be plain text.

    If you're lucky, you can extract text from a PDF by selecting and copying
    it. If it's just an image, though (as it might be if the PDF was produced
    from a scan), you'll get back nothing. You might be able to feed the PDF through an OCR engine and extract the text that way, but the quality of
    those results depends largely on the quality of the scan.

    Well perhaps Markdown is from a reader's perspective, but not for a
    writer because they need knowledge of the syntax.

    There's not much to it. Markdown seems largely to follow the sorts of conventions most people have used in text files anyway:

    *this line is emphasized*

    This line is a heading
    ======================

    1. This is the first item of an ordered list.
    2. This is the second line.
    3. etc.

    This is a quote.

    * This is the first item of an unordered list.
    * etc.

    I suppose the elements that don't spring immediately to mind are blocks of code:

    ```
    #include <stdio.h>

    int main (void)
    {
        printf("Hellorld!"); /* https://tinyurl.com/hellorld */
        return 0;
    }
    ```

    and [links](https://alfter.us/).

    Basically, it's not much of a lift from plain text to Markdown. It's definitely less obtrusive than HTML.

    --
    _/_
    / v \ Scott Alfter (remove the obvious to send mail)
    (IIGS( https://alfter.us/ Top-posting!
    \_^_/ >What's the most annoying thing on Usenet?

  • From Mike Spencer@21:1/5 to Computer Nerd Kev on Fri Oct 7 15:00:10 2022
    Computer Nerd Kev <not@telling.you.invalid> writes:

    Roger Blake <rogblake@iname.invalid> wrote:

    A problem is that at this point most users have no concept of what
    plain text even is. If they think about it at all they think it
    means Microsoft Word or just "Microsoft".

    A friend on another newsgroup, after decades as a programmer, is
    struggling with the challenge of persuading/coercing his (mostly Mac)
    software to send 7-bit ASCII mail and news posts. The software wants
    to make everything UTF-8 (left & right double & single quotes,
    ellipses and some other punctuation are each 3 bytes). It appears
    that his solution will be to compose mail/posts on a Raspberry Pi
    running Linux over his LAN, then retrieve the result to post via his Mac.
    It remains unclear if his Mac apps will do that without "fixing" the
    deficient ASCII text.
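
A minimal sketch of the other way to get 7-bit ASCII out of software that insists on typographic punctuation: fold the offending characters back to ASCII before sending. The helper name `to_ascii` and the mapping table are hypothetical and deliberately incomplete; they only cover the kinds of characters mentioned above (curly quotes, ellipsis, dashes).

```python
# Hypothetical helper: fold common "smart" punctuation to 7-bit ASCII.
# The mapping is illustrative, not exhaustive.
ASCII_FOLD = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2026": "...",  # horizontal ellipsis
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
})


def to_ascii(text: str) -> str:
    """Replace known typographic characters, then fail loudly on anything else."""
    folded = text.translate(ASCII_FOLD)
    folded.encode("ascii")  # raises UnicodeEncodeError if non-ASCII remains
    return folded


if __name__ == "__main__":
    print(to_ascii("\u201cIt\u2019s plain text\u2026\u201d"))
```

Raising an error on anything outside the table is a design choice: it is safer to notice an unexpected character than to silently drop or mangle it.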

    On the other hand I find HTML quite readable if it's formatted
    sensibly...

    Another e-acquaintance re-posts articles from the web to a mailing
    list. It appears that he righteously hits the button in his browser
    labeled "Email as plain text" or similar.

    The result is:

    * HTML is elided but

    * Much of the punctuation is 3-byte UTF-8 chars

    * All links/anchors in the original HTML are included in-line
    inside <https://miskatonic.edu/using_brokets> brokets.

    * A "line" is whatever was rendered as a paragraph in HTML

    * Then his mail client (or something) does everything up as
    quoted-printable

    The UTF-8 punctuation is actually 9 bytes as QP (=E2=NN=NN) and URLs
    are frequently quite long. It's a dog's breakfast. Not totally
    UNreadable but "Quite readable" wouldn't be my choice of descriptor.
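
The byte counts above can be checked with a few lines of Python. This is only an illustration using the standard `quopri` module; the helper name `qp_cost` is made up for the example.

```python
import quopri


def qp_cost(ch: str) -> tuple[int, bytes]:
    """Return (UTF-8 length in bytes, quoted-printable form) for one character."""
    raw = ch.encode("utf-8")
    return len(raw), quopri.encodestring(raw)


if __name__ == "__main__":
    # Curly quotes and the ellipsis sit in the U+2000 block, so each takes
    # three bytes in UTF-8, and quoted-printable spends three ASCII
    # characters ("=XX") on every one of those bytes.
    for ch in ("\u201c", "\u201d", "\u2019", "\u2026"):
        n, qp = qp_cost(ch)
        print(f"U+{ord(ch):04X}: {n} bytes as UTF-8, {len(qp)} as QP ({qp.decode('ascii')})")
```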


    But neither Markdown, nor HTML, is plain text to me anyway.
    Actually I'd go further and say that as an English speaker who
    doesn't need extra characters, I prefer ASCII text. UTF-8 includes
    things like emoticons which, were they to become widely used in
    text documents for conveying important information, would cause me
    all sorts of trouble. Thankfully so far they never seem to be used
    for anything remotely important.

    Many years ago, I and others ridiculed Microsoft's tilt toward dumbing
    everything down to the acephalic lowest common denominator with
    notions such as:

    * Windows Iconic Droolproof Descriptive Language Extension

    * Cognitive Reassembler Access Protocol for Windows Applications
    with Rebus Enhancement

    * Microsoft Iconic Canonical Reassembler for Ontic Cognitive
    Enhancement of Proactive Heuristic Access to Linguistic
    Youthfulness


    only to have reality upstage satire, a decade or so ago, with iConji
    (q.g.)[1]


    [1] q.g.: quod google


    --
    Mike Spencer Nova Scotia, Canada

  • From Sn!pe@21:1/5 to Mike Spencer on Fri Oct 7 20:00:09 2022
    Mike Spencer <mds@bogus.nodomain.nowhere> wrote:

    Computer Nerd Kev <not@telling.you.invalid> writes:

    Roger Blake <rogblake@iname.invalid> wrote:

    A problem is that at this point most users have no concept of what
    plain text even is. If they think about it at all they think it
    means Microsoft Word or just "Microsoft".

    A friend on another newsgroup, after decades as a programmer, is
    struggling with the challenge of persuading/coercing his (mostly Mac) software to send 7-bit ASCII mail and news posts. The software wants
    to make everything UTF-8 (left & right double & single quotes,


    Hi, Mike, PMFJI.

    In macOS Mail / Edit / Substitutions: turn off Smart Quotes;
    and similarly for other substitutions that are not required.
    See also Preferences / Composing / Message Format: Plain Text.

    Obviously this does not necessarily hold true for third party software.

    [relurk]


    [snip]


    --
    ^^ My pet rock Gordon just is.

    ~ Slava Ukraini ~

  • From Samuel Christie@21:1/5 to All on Fri Oct 7 15:07:20 2022
    That brings up a point I was wondering: does usenet/email support utf-8
    yet, or is everything expected to be ASCII? 7-bit?

    What happens if I do insert a non-ascii unicode glyph?

  • From Richard Kettlewell@21:1/5 to Samuel Christie on Fri Oct 7 20:46:49 2022
    Samuel Christie <shcv@sdf.org> writes:
    That brings up a point I was wondering: does usenet/email support utf-8
    yet, or is everything expected to be ASCII? 7-bit?

    What happens if I do insert a non-ascii unicode glyph?

    Many Usenet clients have supported MIME and UTF-8 for years. There are
    still a few hold-outs around, though.

    --
    https://www.greenend.org.uk/rjk/

  • From Computer Nerd Kev@21:1/5 to scott@alfter.diespammersdie.us on Sat Oct 8 08:29:51 2022
    scott@alfter.diespammersdie.us wrote:
    Computer Nerd Kev <not@telling.you.invalid> wrote:
    Well you can convert PDF to Postscript, and so far as I'm concerned
    that's "plain text" in the way that Markdown is. But I don't
    consider either to really be plain text.

    If you're lucky, you can extract text from a PDF by selecting and copying
    it. If it's just an image, though (as it might be if the PDF was produced from a scan), you'll get back nothing.

    Well the thing that's handy about Postscript being text (bitmap
    embedded images aside) is that in the past I've been able to do
    bulk find-and-replace operations to a batch of Postscript files
    without needing to use a full-blown interpreter. Unlike PDF, where
    the content is compressed, Postscript is text so you just need to
    understand the language and then you can do your modifications
    using a text editor or Sed.
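
As a rough illustration of that kind of bulk edit, here is a small Python stand-in for the Sed approach. The function name and the `*.ps` glob are assumptions made for the example; working on raw bytes keeps any embedded bitmap data untouched.

```python
from pathlib import Path


def replace_in_ps_files(directory: str, old: bytes, new: bytes) -> int:
    """Replace `old` with `new` in every *.ps file directly inside `directory`.

    Returns the number of files that were changed. Files are handled as
    raw bytes so that nothing outside the matched text is altered.
    """
    changed = 0
    for path in sorted(Path(directory).glob("*.ps")):
        data = path.read_bytes()
        if old in data:
            path.write_bytes(data.replace(old, new))
            changed += 1
    return changed
```

This only works when the text to change appears literally in the file, for instance inside a `(Draft) show` string. A PostScript producer that splits or re-encodes strings would defeat a plain search, which is the "understanding the language" part.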

    My idea of plain text format is the same, just without the
    potentially difficult "understanding the language" part.

    Well perhaps Markdown is from a reader's perspective, but not for a
    writer because they need knowledge of the syntax.

    There's not much to it. Markdown seems largely to follow the sorts of conventions most people have used in text files anyway:

    *this line is emphasized*

    This line is a heading
    ======================

    1. This is the first item of an ordered list.
    2. This is the second line.
    3. etc.

    This is a quote.

    * This is the first item of an unordered list.
    * etc.

    Yes it's nice and obvious to a reader, but for a writer it's still
    many more rules to know and follow than if they were making it up
    as they went.

    I mean here is a "basic syntax" guide: https://www.markdownguide.org/basic-syntax

    Why would I try to remember all that just so that I can follow some
    standard that allows plain-text-readable files to be converted to
    HTML? Just learn the HTML if you want the styling, and if you don't
    then just use my sort of unstandardised plain text. It's excess
    information for an unnecessary intermediate step in my opinion. Of
    course more-so for me because I learnt about HTML before Markdown.

    But each to their own, it's just not my idea of "plain text".

    --
    __ __
    #_ < |\| |< _#

  • From Roger Blake@21:1/5 to scott@alfter.diespammersdie.us on Fri Oct 7 23:19:40 2022
    On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:
    If you're lucky, you can extract text from a PDF by selecting and copying
    it. If it's just an image, though (as it might be if the PDF was produced from a scan), you'll get back nothing. You might be able to feed the PDF through an OCR engine and extract the text that way, but the quality of
    those results depends largely on the quality of the scan.

    I used to be able to extract text directly from Microsoft Word documents
    using "antiword" but it only works with the old binary (.doc) format and
    of course the default has been the new .docx format since the 2007 version.

    At least pdf is an open format. The "pdftotext" program can extract any
    actual text it finds in a pdf file but sometimes those are just an image
    which would require ocr to interpret.
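
For scripting this, a thin wrapper around the `pdftotext` command might look like the sketch below. It assumes the program is installed and on the PATH; the function names are hypothetical, and the `-layout` option (keep the physical layout) is optional.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str) -> list[str]:
    """Build the command line; "-" asks pdftotext to write to stdout."""
    return ["pdftotext", "-layout", pdf_path, "-"]


def extract_pdf_text(pdf_path: str) -> str:
    """Return whatever text layer pdftotext finds in the PDF.

    A scanned, image-only PDF typically yields little more than
    whitespace, which is the cue that OCR is needed instead.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")
```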

    --
    ------------------------------------------------------------------------------
    18 Reasons I won't be vaccinated -- https://tinyurl.com/ebty2dx3
    Covid vaccines: experimental biology -- https://tinyurl.com/57mncfm5
    The fraud of "Climate Change" -- https://RealClimateScience.com
    There is no "climate crisis" -- https://climatedepot.com
    Don't talk to cops! -- https://DontTalkToCops.com
    ------------------------------------------------------------------------------

  • From Mike Spencer@21:1/5 to snipeco.2@gmail.com on Fri Oct 7 23:29:09 2022
    snipeco.2@gmail.com (Sn!pe) writes:

    Mike Spencer <mds@bogus.nodomain.nowhere> wrote:

    Computer Nerd Kev <not@telling.you.invalid> writes:

    Roger Blake <rogblake@iname.invalid> wrote:

    A problem is that at this point most users have no concept of what
    plain text even is. If they think about it at all they think it
    means Microsoft Word or just "Microsoft".

    A friend on another newsgroup, after decades as a programmer, is
    struggling with the challenge of persuading/coercing his (mostly Mac)
    software to send 7-bit ASCII mail and news posts. The software wants
    to make everything UTF-8 (left & right double & single quotes,


    Hi, Mike, PMFJI.

    All help welcome. Most of us need all the help we can get.

    In macOS Mail / Edit / Substitutions: turn off Smart Quotes;
    and similarly for other substitutions that are not required.
    See also Preferences / Composing / Message Format: Plain Text.

    And a Mac will interpret "Plain Text" as 7-bit ASCII? I would but
    Mac-world is a black box.

    Obviously this does not necessarily hold true for third party software.

    [relurk]

    Forwarded to Mac-user party in question.

    TYVM.

    ellipses and some other punctuation are each 3 bytes). It appears
    that his solution will be to compose mail/posts on a Raspberry Pi
    running Linux over his LAN, then retrieve the result to post via his Mac.
    It remains unclear if his Mac apps will do that without "fixing" the
    deficient ASCII text.

    [snip]
    --
    Mike Spencer Nova Scotia, Canada

  • From Dan Espen@21:1/5 to scott@alfter.diespammersdie.us on Fri Oct 7 23:33:54 2022
    scott@alfter.diespammersdie.us writes:

    Computer Nerd Kev <not@telling.you.invalid> wrote:
    Well you can convert PDF to Postscript, and so far as I'm concerned
    that's "plain text" in the way that Markdown is. But I don't
    consider either to really be plain text.

    If you're lucky, you can extract text from a PDF by selecting and copying
    it. If it's just an image, though (as it might be if the PDF was produced from a scan), you'll get back nothing. You might be able to feed the PDF through an OCR engine and extract the text that way, but the quality of
    those results depends largely on the quality of the scan.

    I've done the OCR the PDF thing. It worked quite well.

    For non-image documents, pdftotext does the job.


    --
    Dan Espen

  • From Spiros Bousbouras@21:1/5 to Samuel Christie on Sat Oct 8 03:58:04 2022
    On Fri, 07 Oct 2022 15:07:20 -0400
    Samuel Christie <shcv@sdf.org> wrote:
    That brings up a point I was wondering: does usenet/email support utf-8
    yet, or is everything expected to be ASCII? 7-bit?

    If you mean emails or usenet posts where some of the octets have
    values > 127, then I've never seen problems, and I've sent or read many
    such emails or usenet posts. Obviously the header must have the correct
    information. For an example see this post or
    <87h70fmn6e.fsf@LkoBDZeT.terraraq.uk>.

    Octets with value 0 are *not* ok, and possibly some other values < 32.
    If you want such values then the email or post needs to be appropriately
    encoded, namely BASE64 or quoted-printable. Again, the header must
    mention this.

    I've seen occasions where things worked correctly even when the header
    did not have the correct information; the software guessed correctly
    what was needed. But it's best not to risk it.
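
To make "the header must have the correct information" concrete, here is a small sketch using Python's standard `email` package. It is one way to build such a message, not a description of what any particular mail or news client does; the function name and subject line are invented for the example.

```python
from email.message import EmailMessage


def build_message(text: str) -> EmailMessage:
    """Build a text message whose headers declare charset and encoding."""
    msg = EmailMessage()
    msg["Subject"] = "UTF-8 test"
    # Ask for UTF-8 with quoted-printable transfer encoding, so the message
    # on the wire is 7-bit ASCII and the headers say how to decode it.
    msg.set_content(text, charset="utf-8", cte="quoted-printable")
    return msg


if __name__ == "__main__":
    msg = build_message("Greek: αβγ  Maths: ∈ ∉ ⊂\n")
    print(msg["Content-Type"])               # declares the charset
    print(msg["Content-Transfer-Encoding"])  # declares the wire encoding
    print(msg.as_string())                   # body appears as =XX escapes
    print(msg.get_content())                 # decoded back using the headers
```

A reader that honours the two headers can reverse both steps without guessing, which is the case where things "just work".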

    What happens if I do insert a non-ascii unicode glyph?

    Let's try it out:

    Greek alphabet:
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
    αβγδεζηθικλμνξοπρστυφχψω

    Some mathematical symbols:
    ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

    Can you read all this?

    --
    vlaho.ninja/prog

  • From Grant Taylor@21:1/5 to Grant Taylor on Fri Oct 7 16:53:25 2022
    On 10/7/22 4:51 PM, Grant Taylor wrote:
    I believe so.

    §



    --
    Grant. . . .
    unix || die

  • From Rich@21:1/5 to Roger Blake on Sat Oct 8 04:49:41 2022
    Roger Blake <rogblake@iname.invalid> wrote:
    On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:
    If you're lucky, you can extract text from a PDF by selecting and
    copying it. If it's just an image, though (as it might be if the
    PDF was produced from a scan), you'll get back nothing. You might
    be able to feed the PDF through an OCR engine and extract the text
    that way, but the quality of those results depends largely on the
    quality of the scan.

    I used to be able to extract text directly from Microsoft Word
    documents using "antiword" but it only works with the old binary
    (.doc) format and of course the default has been the new .docx format
    since the 2007 version.

    Docx files are just zip files containing a bunch of XML files, so with
    a small bit of effort, you can extract text directly from docx files as
    well.
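
A minimal sketch of that "small bit of effort", using only the Python standard library. It assumes the usual layout where the body lives in `word/document.xml`, and it ignores headers, footnotes and other parts; the function name is made up for the example.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(docx) -> str:
    """Return the body text of a .docx, one line per paragraph.

    `docx` may be a path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W + "p"):
        # A paragraph's text is split across runs; join the <w:t> pieces.
        runs = [t.text or "" for t in para.iter(W + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

Calling `docx_to_text("report.docx")` (a hypothetical file name) would return the paragraphs as plain text, with all styling discarded.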

  • From Sn!pe@21:1/5 to Mike Spencer on Sat Oct 8 11:40:37 2022
    Mike Spencer <mds@bogus.nodomain.nowhere> wrote:

    [...]

    Hi, Mike, PMFJI.

    All help welcome. Most of us need all the help we can get.

    In macOS Mail / Edit / Substitutions: turn off Smart Quotes;
    and similarly for other substitutions that are not required.
    See also Preferences / Composing / Message Format: Plain Text.

    And a Mac will interpret "Plain Text" as 7-bit ASCII? I would but
    Mac-world is a black box.


    I rather think not, but I can't say definitively. I imagine it would be
    UTF-(something), but being only a user I'm not expert in macOS's
    underpinnings. My newsreader falls back to the simplest encoding
    that will support the required characters; maybe macOS is similar.
    The fellows in comp.sys.mac.* or uk.comp.sys.mac would probably
    know.


    Obviously this does not necessarily hold true for third party software.

    [relurk]

    Forwarded to Mac-user party in question.

    TYVM.

    YW.

    --
    ^^ My pet rock Gordon just is.

    ~ Slava Ukraini ~

  • From Samuel Christie@21:1/5 to All on Sat Oct 8 15:05:14 2022
    Lets try it out :

    Greek alphabet :
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
    αβγδεζηθικλμνξοπρστυφχψω

    Some mathematical symbols :
    ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

    Can you read all this ?

    Works just fine for me! Good to know I won't accidentally break
    everything if I include unusual characters.

  • From The Real Bev@21:1/5 to Samuel Christie on Sat Oct 8 16:12:51 2022
    On 10/8/22 12:05 PM, Samuel Christie wrote:
    Lets try it out :

    Greek alphabet :
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
    αβγδεζηθικλμνξοπρστυφχψω

    Some mathematical symbols :
    ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

    Can you read all this ?

    Works just fine for me! Good to know I won't accidentally break
    everything if I include unusual characters.

    I see them too.

    Good to know that I won't have to buy a set of Typits!

    BTW, here's a handy chart. Looks pretty ratty in a proportional
    typeface, though.

    iso8859-1 cheat sheet
    (per http://www.uni-passau.de/~ramsch/iso8859-1.html)


    ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ ­ ® ¯ ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿

    À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß

    à á â ã ä å æ ç è é ê ë ì í î ï ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ



    dec oct 8 7 HTML | dec oct 8 7 HTML | dec oct 8 7 HTML |
    ====================|=====================|=====================|
    161 241 ¡ ! &iexcl; | 162 242 ¢ " &cent; | 163 243 £ # &pound; |
    164 244 ¤ $ &curren;| 165 245 ¥ % &yen; | 166 246 ¦ & &brvbar;|
    167 247 § ' &sect; | 168 250 ¨ ( &uml; | 169 251 © ) &copy; |
    170 252 ª * &ordf; | 171 253 « + &laquo; | 172 254 ¬ , &not; |
    173 255 ­ - &shy; | 174 256 ® . &reg; | 175 257 ¯ / &macr; |
    176 260 ° 0 &deg; | 177 261 ± 1 &plusmn;| 178 262 ² 2 &sup2; |
    179 263 ³ 3 &sup3; | 180 264 ´ 4 &acute; | 181 265 µ 5 &micro; |
    182 266 ¶ 6 &para; | 183 267 · 7 &middot;| 184 270 ¸ 8 &cedil; |
    185 271 ¹ 9 &sup1; | 186 272 º : &ordm; | 187 273 » ; &raquo; |
    188 274 ¼ < &frac14;| 189 275 ½ = &frac12;| 190 276 ¾ > &frac34;|
    191 277 ¿ ? &iquest;| 192 300 À @ &Agrave;| 193 301 Á A &Aacute;|
    194 302 Â B &Acirc; | 195 303 Ã C &Atilde;| 196 304 Ä D &Auml; |
    197 305 Å E &Aring; | 198 306 Æ F &AElig; | 199 307 Ç G &Ccedil;|
    200 310 È H &Egrave;| 201 311 É I &Eacute;| 202 312 Ê J &Ecirc; |
    203 313 Ë K &Euml; | 204 314 Ì L &Igrave;| 205 315 Í M &Iacute;|
    206 316 Î N &Icirc; | 207 317 Ï O &Iuml; | 208 320 Ð P &ETH; |
    209 321 Ñ Q &Ntilde;| 210 322 Ò R &Ograve;| 211 323 Ó S &Oacute;|
    212 324 Ô T &Ocirc; | 213 325 Õ U &Otilde;| 214 326 Ö V &Ouml; |
    215 327 × W &times; | 216 330 Ø X &Oslash;| 217 331 Ù Y &Ugrave;|
    218 332 Ú Z &Uacute;| 219 333 Û [ &Ucirc; | 220 334 Ü \ &Uuml; |
    221 335 Ý ] &Yacute;| 222 336 Þ ^ &THORN; | 223 337 ß _ &szlig; |
    224 340 à ` &agrave;| 225 341 á a &aacute;| 226 342 â b &acirc; |
    227 343 ã c &atilde;| 228 344 ä d &auml; | 229 345 å e &aring; |
    230 346 æ f &aelig; | 231 347 ç g &ccedil;| 232 350 è h &egrave;|
    233 351 é i &eacute;| 234 352 ê j &ecirc; | 235 353 ë k &euml; |
    236 354 ì l &igrave;| 237 355 í m &iacute;| 238 356 î n &icirc; |
    239 357 ï o &iuml; | 240 360 ð p &eth; | 241 361 ñ q &ntilde;|
    242 362 ò r &ograve;| 243 363 ó s &oacute;| 244 364 ô t &ocirc; |
    245 365 õ u &otilde;| 246 366 ö v &ouml; | 247 367 ÷ w &divide;|
    248 370 ø x &oslash;| 249 371 ù y &ugrave;| 250 372 ú z &uacute;|
    251 373 û { &ucirc; | 252 374 ü | &uuml; | 253 375 ý } &yacute;|
    254 376 þ ~ &thorn; | 255 377 ÿ &yuml; |
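
    The rows of the chart can be regenerated mechanically. The sketch
    below assumes the "7" column is the same code with the top bit
    cleared (code minus 128), which matches the rows shown; the function
    name is made up for illustration and the column spacing is not
    reproduced exactly.

```python
from html.entities import codepoint2name


def cheat_sheet_row(code):
    """Format one entry: decimal, octal, 8-bit char, 7-bit char, HTML entity."""
    eight_bit = chr(code)
    seven_bit = chr(code - 128)  # same code with the top bit cleared
    entity = "&" + codepoint2name[code] + ";"
    return f"{code} {code:o} {eight_bit} {seven_bit} {entity}"


for code in (161, 162, 163):
    print(cheat_sheet_row(code))
```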


    --
    Cheers, Bev
    Red ship crashes into blue ship - sailors marooned.

  • From Matthew Ernisse@21:1/5 to All on Sun Oct 9 01:59:18 2022
    On Wed, 05 Oct 2022 11:33:54 +0800, 5GyYap52yQ1UGMWD wrote:
    I agree, I think that we should first try to solve technological problems with the simplest solutions. One of the reasons why I've moved
    my blog to gopher is that it's just easier to maintain overall. I don't
    have to worry about a database, or whether my CMS is working or not. I
    just fire up my text editor, write stuff and 'scp' my files to my remote server.

    I'm hoping you are aware that you don't need a CMS or a database to
    publish information over HTTP, but if you aren't then you can quite
    happily (and just as easily) publish things to a web server to present
    over HTTP using a text editor and scp. This has the benefit of still
    being supported by modern browsers.
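
    As an illustration of publishing over HTTP without a CMS or a
    database, here is a minimal sketch that wraps a plain text note in a
    static HTML page. The function name and the page layout are
    assumptions; the resulting file would then be copied to the server
    with scp, as described above.

```python
import html


def text_to_html(title, body):
    """Wrap plain text in a minimal static HTML page, escaping any markup."""
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body><pre>{html.escape(body)}</pre></body></html>\n"
    )


page = text_to_html("Notes", "1 < 2 & plain text is enough\n")
print(page)
```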

    --
    "The avalanche has started, it is too late for the pebbles to vote."
    --Kosh

  • From scott@alfter.diespammersdie.us@21:1/5 to Spiros Bousbouras on Mon Oct 10 18:40:36 2022
    Spiros Bousbouras <spibou@gmail.com> wrote:
    Lets try it out :

    Greek alphabet :
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
    αβγδεζηθικλμνξοπρστυφχψω

    Some mathematical symbols :
    ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

    Can you read all this ?

    Received five-by-five, though the math symbols are a bit small. Pretty sure that's just down to font choice (Lucida Console, 9 pt.).

    As you might see from examining the header, I'm using tin. Previously, I
    had used trn, and I'm pretty sure it would've choked on non-ASCII content.

    --
    _/_
    / v \ Scott Alfter (remove the obvious to send mail)
    (IIGS( https://alfter.us/ Top-posting!
    \_^_/ >What's the most annoying thing on Usenet?

  • From Retrograde@21:1/5 to Roger Blake on Mon Oct 10 20:37:55 2022
    On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:
    On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

    I used to be able to extract text directly from Microsoft Word
    documents using "antiword" but it only works with the old binary
    (.doc) format and of course the default has been the new .docx format
    since the 2007 version.

    Pandoc does quite a nice job of converting docx to other formats.

  • From Computer Nerd Kev@21:1/5 to Retrograde on Tue Oct 11 07:47:24 2022
    Retrograde <fungus@amongus.com.invalid> wrote:
    On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:
    On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

    I used to be able to extract text directly from Microsoft Word
    documents using "antiword" but it only works with the old binary
    (.doc) format and of course the default has been the new .docx format
    since the 2007 version.

    Pandoc does quite a nice job of converting docx to other formats.

    I just discovered that myself actually. This command seems to work
    well to generate a HTML file with any images embedded within it (I
    prefer this a little over PDF):
    pandoc -s --embed-resources --ascii -o file.htm file.docx
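
    For batch use, the same call can be wrapped in a small script. This
    is only a sketch: the function names are hypothetical, the flags are
    exactly the ones quoted above, and the conversion is skipped when
    pandoc is not installed.

```python
import shutil
import subprocess


def pandoc_html_command(docx_path, html_path):
    """Build the argument list for the pandoc call quoted above."""
    return ["pandoc", "-s", "--embed-resources", "--ascii",
            "-o", html_path, docx_path]


def convert(docx_path, html_path):
    """Run pandoc if it is on PATH; return True when the conversion ran."""
    if shutil.which("pandoc") is None:
        return False
    subprocess.run(pandoc_html_command(docx_path, html_path), check=True)
    return True


print(" ".join(pandoc_html_command("file.docx", "file.htm")))
```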

    The other one that I would like to handle is Excel spreadsheets in
    xls and xlsx formats. PHPSpreadsheet from the PHPOffice project
    seems to handle this, but as it's not designed for command-line use
    it's going to take some more work to get equivalent functionality
    out of it.

    https://github.com/PHPOffice

    --
    __ __
    #_ < |\| |< _#

  • From Computer Nerd Kev@21:1/5 to Bob Eager on Tue Oct 11 08:14:09 2022
    Bob Eager <news0009@eager.cx> wrote:
    On Fri, 07 Oct 2022 11:53:18 +1000, Computer Nerd Kev wrote:

    Well you can convert PDF to Postscript, and so far as I'm concerned
    that's "plain text" in the way that Markdown is.

    Doesn't work if the PostScript file is just a load of images.

    Presuming Bitmap images, yes. Markdown apparently allows you to
    reference images as well though, so you could just as well have a
    Markdown document with only scanned images of text in it.

    I usually print, scan and OCR.

    Surely you can OCR without the printing and scanning? Ghostscript
    can generate PNG (etc.) bitmap images for each page of a PDF, at a
    specified resolution.

    The pdfimages program from Xpdf claims that it "extracts the images
    from a PDF file", so it may be better again because there isn't
    any recompression or resampling. To be honest I don't do OCR for
    anything so I haven't looked into it much. Where I last found that
    editing Postscript manually came in handy was actually for
    correcting a formatting glitch for printing.
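
    A sketch of the Ghostscript route mentioned above: one PNG per PDF
    page at a chosen resolution, ready to feed to an OCR program. The
    png16m device, the 300 dpi default and the output file pattern are
    assumptions picked for illustration; the function only builds the
    command line and does not run it.

```python
def ghostscript_png_command(pdf_path, dpi=300, pattern="page-%03d.png"):
    """Build a gs command that renders each PDF page to its own PNG file."""
    return [
        "gs", "-dSAFER", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=png16m",          # 24-bit colour PNG output
        f"-r{dpi}",                 # rendering resolution in dots per inch
        f"-sOutputFile={pattern}",  # %03d is replaced by the page number
        pdf_path,
    ]


print(" ".join(ghostscript_png_command("scan.pdf")))
```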

    --
    __ __
    #_ < |\| |< _#

  • From Bob Eager@21:1/5 to Spiros Bousbouras on Mon Oct 10 21:33:57 2022
    On Sat, 08 Oct 2022 03:58:04 +0000, Spiros Bousbouras wrote:

    Lets try it out :

    Greek alphabet :
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ αβγδεζηθικλμνξοπρστυφχψω

    Some mathematical symbols :
    ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

    Can you read all this ?

    Fine for me. Pan on FreeBSD.


    --
    Using UNIX since v6 (1975)...

    Use the BIG mirror service in the UK:
    http://www.mirrorservice.org

  • From Bob Eager@21:1/5 to Matthew Ernisse on Mon Oct 10 21:35:52 2022
    On Sun, 09 Oct 2022 01:59:18 +0000, Matthew Ernisse wrote:

    I'm hoping you are aware that you don't need a CMS or a database to
    publish information over HTTP, but if you aren't then you can quite
    happily (and just as easily) publish things to a web server to present
    over HTTP using a text editor and scp. This has the benefit of still
    being supported by modern browsers.

    I always do it like that (although I use the curl library for REXX for an automated upload).

    --
    Using UNIX since v6 (1975)...

    Use the BIG mirror service in the UK:
    http://www.mirrorservice.org

  • From Bob Eager@21:1/5 to Computer Nerd Kev on Mon Oct 10 21:35:03 2022
    On Fri, 07 Oct 2022 11:53:18 +1000, Computer Nerd Kev wrote:

    Well you can convert PDF to Postscript, and so far as I'm concerned
    that's "plain text" in the way that Markdown is.

    Doesn't work if the PostScript file is just a load of images. I usually
    print, scan and OCR.



    --
    Using UNIX since v6 (1975)...

    Use the BIG mirror service in the UK:
    http://www.mirrorservice.org

  • From Bob Eager@21:1/5 to Computer Nerd Kev on Tue Oct 11 08:21:10 2022
    On Tue, 11 Oct 2022 08:14:09 +1000, Computer Nerd Kev wrote:

    I usually print, scan and OCR.

    Surely you can OCR without the printing and scanning? Ghostscript can generate PNG (etc.) bitmap images for each page of a PDF, at a specified resolution.

    Not in this case. I have a lot of material that is on a CD, in a format
    only accessible by a Windows program that won't run on anything later
    than XP. It fails when printed to a file!

    --
    Using UNIX since v6 (1975)...

    Use the BIG mirror service in the UK:
    http://www.mirrorservice.org

  • From Ben Collver@21:1/5 to Spiros Bousbouras on Tue Oct 11 16:19:17 2022
    On 2022-10-08, Spiros Bousbouras <spibou@gmail.com> wrote:
    Greek alphabet :
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
    αβγδεζηθικλμνξοπρστυφχψω

    Some mathematical symbols :
    ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

    Can you read all this ?

    Reads fine for me in slrn and xfce4-terminal.

  • From Computer Nerd Kev@21:1/5 to scott@alfter.diespammersdie.us on Wed Oct 12 06:24:26 2022
    scott@alfter.diespammersdie.us wrote:
    Spiros Bousbouras <spibou@gmail.com> wrote:
    Lets try it out :

    Greek alphabet :
    ????????????????????????
    ????????????????????????

    Some mathematical symbols :
    ? ? ? ? ? ? \ ? ? ? ? ? ? ? ? ? ? ? ?

    Can you read all this ?

    Received five-by-five, though the math symbols are a bit small. Pretty sure that's just down to font choice (Lucida Console, 9 pt.).

    As you might see from examining the header, I'm using tin.

    Tin also supports translating characters into other character sets
    if it's set to prefer them, which is handy if you don't use a
    unicode-capable terminal or font. But as you can see, it does tend
    to go a little heavy on the "I don't know" character at times.

    Compile-time options control some of that behaviour.
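
    The question marks in the quoted text are what a lossy conversion to
    a smaller character set looks like. A minimal sketch of that effect,
    using Python's "replace" error handler rather than anything tin
    actually does; it does not attempt the kind of transliteration that
    turned the set-minus sign into a backslash above.

```python
def fallback_display(text, charset="us-ascii"):
    """Show text as a display limited to `charset` might: unknowns become '?'."""
    return text.encode(charset, errors="replace").decode(charset)


print(fallback_display("αβγ ∈ ∉"))
print(fallback_display("café ∈", "iso-8859-1"))
```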

    --
    __ __
    #_ < |\| |< _#

  • From Retrograde@21:1/5 to Bob Eager on Wed Oct 12 00:21:24 2022
    On 2022-10-10, Bob Eager <news0009@eager.cx> wrote:
    On Sat, 08 Oct 2022 03:58:04 +0000, Spiros Bousbouras wrote:

    Lets try it out :

    Greek alphabet :
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ αβγδεζηθικλμνξοπρστυφχψω

    Some mathematical symbols :
    ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

    Can you read all this ?

    Fine for me. Pan on FreeBSD.

    It's encoded text/plain; charset=UTF-8 so any UTF-8-aware newsreader in
    an environment with the right font should work fine. Both claws-mail and
    slrn (in gnome-term) on Linux Mint show me both your Greek and your math
    just fine over here. On the Linux console, the Greek comes through but
    only half the math - I interpret that as my chosen console font having
    only a partial set of the math glyphs.

    I'm nostalgic for lots of early technology, but I wouldn't go back to
    the era before UTF-8 for anything.
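
    To make the UTF-8 point concrete, a small sketch showing that the
    test characters survive an encode/decode round trip and how many
    bytes each one takes. The sample string and function name are chosen
    for illustration only.

```python
def utf8_lengths(text):
    """Map each non-space character to the number of bytes UTF-8 uses for it."""
    return {ch: len(ch.encode("utf-8")) for ch in text if not ch.isspace()}


sample = "A α ω ∅ ∈"
encoded = sample.encode("utf-8")

# Decoding the bytes gives back exactly the original text.
assert encoded.decode("utf-8") == sample

for ch, size in utf8_lengths(sample).items():
    print(ch, size)
```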

  • From Louis Krupp@21:1/5 to Bob Eager on Tue Oct 11 18:15:47 2022
    On 10/11/2022 2:21 AM, Bob Eager wrote:
    On Tue, 11 Oct 2022 08:14:09 +1000, Computer Nerd Kev wrote:

    I usually print, scan and OCR.
    Surely you can OCR without the printing and scanning? Ghostscript can
    generate PNG (etc.) bitmap images for each page of a PDF, at a specified
    resolution.
    Not in this case. I have a lot of material that is on a CD, in a format
    only accessible by a Windows program that won't run on anything later
    than XP. It fails when printed to a file!

    Can the program that reads the file export it as something else? Out of curiosity, what is the file format called, and is it by any chance
    documented?

    Louis

    (My apologies if this shows up twice.)

  • From Bob Eager@21:1/5 to Louis Krupp on Wed Oct 12 08:00:17 2022
    On Tue, 11 Oct 2022 18:15:47 -0600, Louis Krupp wrote:

    On 10/11/2022 2:21 AM, Bob Eager wrote:
    On Tue, 11 Oct 2022 08:14:09 +1000, Computer Nerd Kev wrote:

    I usually print, scan and OCR.
    Surely you can OCR without the printing and scanning? Ghostscript can
    generate PNG (etc.) bitmap images for each page of a PDF, at a
    specified resolution.
    Not in this case. I have a lot of material that is on a CD, in a format
    only accessible by a Windows program that won't run on anything later
    than XP. It fails when printed to a file!

    Can the program that reads the file export it as something else? Out of curiosity, what is the file format called, and is it by any chance documented?

    It's a proprietary format, and the thing that reads it is designed to
    ONLY allow documents to be read on screen or printed.

    It's not a problem; finally I have completed it and won't have to revisit.

    Explanation: it's a CD of back issues of a journal. They want silly money
    for PDFs of single articles. I knew a colleague had all the back issues
    on paper, but when I asked him he had dumped them three weeks previously!
    He had the CD, but it has a 16 bit installer for the reading application.
    A VM with XP allowed me to use the application.

    I have now thought of another possible way, but I've done it all now. The printing and OCR worked really well.

    --
    Using UNIX since v6 (1975)...

    Use the BIG mirror service in the UK:
    http://www.mirrorservice.org

  • From Otto J. Makela@21:1/5 to Spiros Bousbouras on Wed Oct 12 13:32:12 2022
    Spiros Bousbouras <spibou@gmail.com> wrote:

    MIME-Version: 1.0
    Content-Type: text/plain; charset=UTF-8
    Content-Transfer-Encoding: 8bit
    [...]
    Lets try it out :

    Greek alphabet :
    ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ
    αβγδεζηθικλμνξοπρστυφχψω

    Some mathematical symbols :
    ∅ ∁ ∈ ∉ ∋ ∌ ∖ ∩ ∪ ⊂ ⊃ ⊄ ⊅ ⊆ ⊇ ⊈ ⊉ ⊊ ⊋

    Can you read all this ?

    UTF-8 encoding works just fine with Gnus v5.13, to the extent that a
    text terminal (I'm running this through mosh) can display characters.
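
    The headers quoted above are what tells a newsreader how to decode
    the body. A minimal sketch of reading them with Python's standard
    email package; the message text here is a cut-down stand-in, not the
    full original article.

```python
import email

RAW = (
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=UTF-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "Can you read all this ?\n"
)

msg = email.message_from_string(RAW)

print(msg.get_content_type())            # media type from Content-Type
print(msg.get_content_charset())         # charset parameter, lowercased
print(msg["Content-Transfer-Encoding"])  # how the body bytes are carried
```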
    --
    /* * * Otto J. Makela <om@iki.fi> * * * * * * * * * */
    /* Phone: +358 40 765 5772, ICBM: N 60 10' E 24 55' */
    /* Mail: Mechelininkatu 26 B 27, FI-00100 Helsinki */
    /* * * Computers Rule 01001111 01001011 * * * * * * */

  • From Anthk@21:1/5 to Roger Blake on Thu Oct 13 00:38:56 2022
    On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:
    On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:
    If you're lucky, you can extract text from a PDF by selecting and copying
    it. If it's just an image, though (as it might be if the PDF was produced
    from a scan), you'll get back nothing. You might be able to feed the PDF
    through an OCR engine and extract the text that way, but the quality of
    those results depends largely on the quality of the scan.

    I used to be able to extract text directly from Microsoft Word documents using "antiword" but it only works with the old binary (.doc) format and
    of course the default has been the new .docx format since the 2007 version.

    At least pdf is an open format. The "pdftotext" program can extract any actual text it finds in a pdf file but sometimes those are just an image which would require ocr to interpret.


    With MUPDF you can select the text with the right click mouse button
    and it will be copied into the clipboard.

  • From Anthk@21:1/5 to Computer Nerd Kev on Thu Oct 13 00:38:57 2022
    On 2022-10-10, Computer Nerd Kev <not@telling.you.invalid> wrote:
    Retrograde <fungus@amongus.com.invalid> wrote:
    On 2022-10-07, Roger Blake <rogblake@iname.invalid> wrote:
    On 2022-10-07, scott@alfter.diespammersdie.us <scott@alfter.diespammersdie.us> wrote:

    I used to be able to extract text directly from Microsoft Word
    documents using "antiword" but it only works with the old binary
    (.doc) format and of course the default has been the new .docx format
    since the 2007 version.

    Pandoc does quite a nice job of converting docx to other formats.

    I just discovered that myself actually. This command seems to work
    well to generate a HTML file with any images embedded within it (I
    prefer this a little over PDF):
    pandoc -s --embed-resources --ascii -o file.htm file.docx

    The other one that I would like to handle is Excel spreadsheets in
    xls and xlsx formats. PHPSpreadsheet from the PHPOffice project
    seems to handle this, but as it's not designed for command-line use
    it's going to take some more work to get equivalent functionality
    out of it.

    https://github.com/PHPOffice


    Get sc-im+gnuplot for xls and xlsx files. It's like LibreOffice Calc
    but for the CLI and with vi keys.
    For more operations, install visicalc and the required dependencies.
    Also, to dump DOC files, you can use catdoc and antiword.

  • From Bob Eager@21:1/5 to Anthk on Thu Oct 13 07:50:11 2022
    On Thu, 13 Oct 2022 00:38:56 +0000, Anthk wrote:

    At least pdf is an open format. The "pdftotext" program can extract any
    actual text it finds in a pdf file but sometimes those are just an
    image which would require ocr to interpret.


    With MUPDF you can select the text with the right click mouse button and
    it will be copied into the clipboard.

    Not if the pages are just scanned images.
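
    One way to tell the two cases apart automatically is to look at what
    a text extractor returns (for example the output of
    "pdftotext file.pdf -", which writes to standard output). The sketch
    below is a heuristic only: the function name and the 20-character
    threshold are assumptions, and it inspects a string rather than
    running any PDF tool.

```python
def needs_ocr(extracted_text, min_chars=20):
    """Guess whether a PDF is scanned images from its extracted text.

    If almost no letters or digits came out, the pages are probably
    pictures of text and would need OCR.
    """
    visible = [ch for ch in extracted_text if ch.isalnum()]
    return len(visible) < min_chars


print(needs_ocr("\f\n   \n"))
print(needs_ocr("At least pdf is an open format with real text in it."))
```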

    --
    Using UNIX since v6 (1975)...

    Use the BIG mirror service in the UK:
    http://www.mirrorservice.org

  • From Computer Nerd Kev@21:1/5 to Anthk on Fri Oct 14 07:44:33 2022
    Anthk <anthk@disroot.org> wrote:
    On 2022-10-10, Computer Nerd Kev <not@telling.you.invalid> wrote:
    The other one that I would like to handle is Excel spreadsheets in
    xls and xlsx formats. PHPSpreadsheet from the PHPOffice project
    seems to handle this, but as it's not designed for command-line use
    it's going to take some more work to get equivalent functionality
    out of it.

    https://github.com/PHPOffice


    Get sc-im+gnuplot for xls and xlsx files. It's like LibreOffice Calc
    but for the CLI and with vi keys.

    Thanks! That saved me from trying to figure out how to write a
    command-line application in PHP. It still took me a while to find
    the right options to get it to work as a Pandoc-style command-line
    converter though. This is the magic concoction that generates a TSV
    file from an XLSX spreadsheet without a lot of rubbish at the start
    of the file:

    sc-im --export_tab --nocurses --quit_afterload file.xlsx > file.tsv

    Strangely the "--output=" option only wants to create empty files
    for me (with version 0.7.0).
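
    Once the spreadsheet is a TSV file it can be read with nothing more
    than a standard library. A minimal sketch, assuming a simple
    tab-separated layout; the sample data is invented and stands in for
    a real file.tsv.

```python
import csv
import io


def read_tsv(stream):
    """Return the rows of a tab-separated stream as lists of strings."""
    return list(csv.reader(stream, delimiter="\t"))


# In-memory stand-in for a file opened with open("file.tsv", newline="").
sample = io.StringIO("item\tqty\nwidget\t3\n")
for row in read_tsv(sample):
    print(row)
```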

    The terminal-based spreadsheet program itself does look interesting
    as well, though I'm pretty sure that I'd miss selecting cells and copying/pasting using the mouse.

    --
    __ __
    #_ < |\| |< _#
