• Handling DOS (Windows) text files in Unix (Linux) - a nifty extensi

    From Ed Morton@21:1/5 to Mack The Knife on Sun Aug 1 10:14:35 2021
    On 8/1/2021 9:25 AM, Mack The Knife wrote:
    This is a lot of work to do what

    BEGIN { RS = "\r?\n" }

    would do. Even simpler would be to put

    tr -d '\r'

    as a stage in your pipeline before calling gawk.

    Those would both break files that use `\n` within quoted fields and
    `\r\n` record endings such as you'd get in CSVs exported from Excel when
    there are cells with newlines within the spreadsheet, e.g. (assume `\r`
    and `\n` are literal):

    Right:

    $ printf '"foo\nbar"\r\n' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}'
    1 "foo
    bar"

    Wrong:

    $ printf '"foo\nbar"\r\n' | awk 'BEGIN{RS="\r?\n"; FS=","} {print NR, $1}'
    1 "foo
    2 bar"

    The `tr` would also break files that include `\r`s mid-record:

    Right:

    $ printf '"foo\rbar"\r\n' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}' | cat -Ev
    1 "foo^Mbar"$

    Wrong:

    $ printf '"foo\rbar"\r\n' | tr -d '\r' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}' | cat -Ev
    1 "foobar"$
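
    A possible middle ground, sketched here as an assumption rather than
    something proposed in the thread, is to strip only a CR that sits directly
    before the record-ending LF. The function name `strip_trailing_cr` is
    hypothetical. It keeps mid-record CRs intact, but it still splits records
    on an LF inside a quoted field, so it does not fix the first problem above.

```shell
# Strip only a CR that immediately precedes the record-ending LF;
# CRs elsewhere in the record are left untouched.
strip_trailing_cr() {
    awk '{ sub(/\r$/, "") } 1'
}

# The mid-record CR survives, the trailing one is removed.
printf '"foo\rbar"\r\n' | strip_trailing_cr | cat -v
# prints: "foo^Mbar"
```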

    You can't robustly tell by reading a file whether it uses DOS line endings
    or not. For example, is this:

    $ printf 'foo\nbar\r\n' | cat -Ev
    foo$
    bar^M$

    1 line using DOS line endings where no line contains `\r`:

    $ printf 'foo\nbar\r\n' | awk -v RS='\r\n' '{print NR, $0}' | cat -Ev
    1 foo$
    bar$

    or 2 lines using Unix line endings where the 2nd line ends in `\r`?

    $ printf 'foo\nbar\r\n' | awk -v RS='\n' '{print NR, $0}' | cat -Ev
    1 foo$
    2 bar^M$

    It's impossible to tell from the data, you need to KNOW the format in
    advance to be able to parse it correctly.

    So, just figure out what your record endings are expected to be (`\r\n`
    or `\n`) in advance, based on some criterion that doesn't involve reading
    the file (e.g. wherever you're getting the input from), and then use the
    appropriate RS to parse it robustly.
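
    One way to apply that advice is to make the format an explicit argument
    instead of sniffing it. The sketch below is only an illustration: the
    function name `parse_records` and the `dos`/`unix` labels are hypothetical,
    and it assumes an awk that treats a multi-character RS as a regular
    expression (gawk and mawk do).

```shell
# parse_records FORMAT: number each record read from stdin.
# FORMAT is "dos" (records end in CR LF) or "unix" (records end in LF).
# The caller supplies it from outside knowledge; it is never guessed
# from the data.
parse_records() {
    case $1 in
        dos)  rs='\r\n' ;;
        unix) rs='\n' ;;
        *)    echo "usage: parse_records dos|unix" >&2; return 2 ;;
    esac
    awk -v RS="$rs" '{ print NR, $0 }'
}

# The same bytes give one record or two, depending on the declared format.
printf 'foo\nbar\r\n' | parse_records dos  | cat -v
printf 'foo\nbar\r\n' | parse_records unix | cat -v
```

    The two calls reproduce the one-record and two-record readings shown
    above for the same input.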

    Regards,

    Ed.


    Mack

    In article <se3m1s$7qsa$1@news.xmission.com>,
    Kenny McCormack <gazelle@shell.xmission.com> wrote:
    A frequent task of mine is to write GAWK scripts that process files
    created by Windows users. Naturally, I prefer to work/develop on Linux,
    while they are creating/editing the files on Windows. I access their
    files via a Samba share.

    This all works well, except that I periodically get bit by the fact that
    DOS files have an extra character as the last character of the line (as
    seen by a program running on Linux). One ends up writing AWK code to deal
    with this, but it would be nice to not have to do so.

    To this end, I have written a GAWK extension that removes the CRs from the
    file. Source code is below.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Mack The Knife@21:1/5 to Kenny McCormack on Sun Aug 1 14:25:35 2021
    This is a lot of work to do what

    BEGIN { RS = "\r?\n" }

    would do. Even simpler would be to put

    tr -d '\r'

    as a stage in your pipeline before calling gawk.

    Mack

    In article <se3m1s$7qsa$1@news.xmission.com>,
    Kenny McCormack <gazelle@shell.xmission.com> wrote:
    A frequent task of mine is to write GAWK scripts that process files
    created by Windows users. Naturally, I prefer to work/develop on Linux,
    while they are creating/editing the files on Windows. I access their
    files via a Samba share.

    This all works well, except that I periodically get bit by the fact that
    DOS files have an extra character as the last character of the line (as
    seen by a program running on Linux). One ends up writing AWK code to deal
    with this, but it would be nice to not have to do so.

    To this end, I have written a GAWK extension that removes the CRs from the
    file. Source code is below.

  • From Janis Papanagnou@21:1/5 to Mack The Knife on Sun Aug 1 18:35:19 2021
    On 01.08.2021 16:25, Mack The Knife wrote:
    This is a lot of work to do what

    BEGIN { RS = "\r?\n" }

    would do. [...]

    That's what I'd also do. It's simple and covers the DOS and the
    Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
    line termination character, and all other payload data should
    be plain text (with the exception of the TAB control character).

    The OP's 'C'-code seems to just remove all CRs from anywhere in
    the file - , so I also don't see Ed's (IMO non-text) Excel-export
    case as comparable with the original question in the thread.

    Your RS approach considers even the NL context which the C-code
    does not.

    But to be honest, I may as well just be missing the OP's point.
    In which cases it may make sense to use a separate module with a
    lot of C-code that needs to be (pre-)compiled and some specific
    mechanism to load it isn't obvious to me. (The "global approach"
    argument, beyond being personal taste, isn't very clear either.)

    And the mentioned reverse conversion is as simply done in plain
    Awk as the input conversion is.
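
    As a rough illustration of that reverse direction (Unix to DOS), the
    following sketch re-emits each LF-terminated record with a CR LF
    terminator. The function name `unix_to_dos` is hypothetical, and the
    sketch assumes the input really is LF-terminated: run on a file that
    already ends its lines in CR LF, it would produce CR CR LF.

```shell
# Unix -> DOS: read LF-terminated records, write them CR LF-terminated.
unix_to_dos() {
    awk 'BEGIN { ORS = "\r\n" } { print }'
}

printf 'foo\nbar\n' | unix_to_dos | cat -v
# prints:
# foo^M
# bar^M
```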

    Janis

    N.B.: A case where I had written C-code for a similar task was
    for an environment where several people wrote and extended text
    files from different OSes; the files contained mixtures of every
    CR, LF, or CR LF combination, some had no final line ending and
    all such a mess. The program checked files and/or fixed them for
    any of the three common formats; (old-)Mac, Unix/(new-)Mac, and
    DOS, as desired. - Maybe such a more universal function may be
    useful for the GNU Awk extension library?
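
    A minimal awk sketch of such a normalizer, converting any mix of CR,
    LF, and CR LF terminators to LF and adding a missing final line ending,
    might look like the following. This is not the C program described
    above; the function name `normalize_eol` is hypothetical, and it assumes
    an awk with regular-expression RS (gawk, mawk). It treats every CR or LF
    as a terminator, so it is unsuitable for data where either character can
    legitimately appear inside a record.

```shell
# Normalize CR LF, lone CR, and lone LF terminators to LF.
# "\r\n?" consumes a CR together with an LF that directly follows it,
# so a DOS line ending is not counted twice.
normalize_eol() {
    awk 'BEGIN { RS = "\r\n?|\n" } { print }'
}

# Old-Mac, DOS, and Unix endings mixed, with no final line ending.
printf 'a\rb\r\nc\nd' | normalize_eol | cat -v
# prints:
# a
# b
# c
# d
```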

  • From Ed Morton@21:1/5 to Janis Papanagnou on Sun Aug 1 13:33:13 2021
    On 8/1/2021 11:35 AM, Janis Papanagnou wrote:
    On 01.08.2021 16:25, Mack The Knife wrote:
    This is a lot of work to do what

    BEGIN { RS = "\r?\n" }

    would do. [...]

    That's what I'd also do. It's simple and covers the DOS and the
    Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
    line termination character, and all other payload data should
    be plain text (with the exception of the TAB control character).

    Here's the POSIX definition of a text file:

    https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403
    "Text File = A file that contains characters organized into zero or more
    lines. The lines do not contain NUL characters and none can exceed
    {LINE_MAX} bytes in length, including the <newline> character."

    https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
    "Line = A sequence of zero or more non- <newline> characters plus a
    terminating <newline> character."

    So a file that has CR as the line termination character is not a valid
    text file per POSIX but a file that has NL as the line termination
    character and contains CRs (be they immediately before the NLs or not)
    is a valid text file. Not sure why you mention the tab character as it's
    no different from X or # or any other character.
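
    The consequence of those definitions can be seen with `wc -l`, which
    counts <newline> characters. The sketch below is illustrative only; the
    helper name `count_lines` is hypothetical, and the `tr` merely strips the
    padding that some `wc` implementations print.

```shell
# Count <newline> characters on stdin, i.e. lines in the POSIX sense.
count_lines() {
    wc -l | tr -d ' '
}

printf 'foo\rbar\r'     | count_lines   # CR-terminated: prints 0
printf 'foo\r\nbar\r\n' | count_lines   # LF-terminated, CRs inside the lines: prints 2
```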


    The OP's 'C'-code seems to just remove all CRs from anywhere in
    the file - , so I also don't see Ed's (IMO non-text) Excel-export
    case as comparable with the original question in the thread.

    The examples I posted were just plain text per the POSIX definition
    above. I frequently have to write awk scripts to deal with CSVs exported
    from Excel with lines that end in CRNL and can contain NL in quoted
    fields. I don't know if any of those files have had CRs as I certainly
    haven't assumed they can't be present or otherwise special-cased them.

    Ed.

  • From J Naman@21:1/5 to Ed Morton on Sun Aug 1 13:29:48 2021
    On Sunday, 1 August 2021 at 14:33:16 UTC-4, Ed Morton wrote:
    On 8/1/2021 11:35 AM, Janis Papanagnou wrote:
    On 01.08.2021 16:25, Mack The Knife wrote:
    This is a lot of work to do what

    BEGIN { RS = "\r?\n" }

    would do. [...]

    That's what I'd also do. It's simple and covers the DOS and the
    Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
    line termination character, and all other payload data should
    be plain text (with the exception of the TAB control character).

    Here's the POSIX definition of a text file:

    https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403
    "Text File = A file that contains characters organized into zero or more
    lines. The lines do not contain NUL characters and none can exceed
    {LINE_MAX} bytes in length, including the <newline> character."

    https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
    "Line = A sequence of zero or more non- <newline> characters plus a
    terminating <newline> character."

    So a file that has CR as the line termination character is not a valid
    text file per POSIX but a file that has NL as the line termination
    character and contains CRs (be they immediately before the NLs or not)
    is a valid text file. Not sure why you mention the tab character as it's
    no different from X or # or any other character.

    The OP's 'C'-code seems to just remove all CRs from anywhere in
    the file - , so I also don't see Ed's (IMO non-text) Excel-export
    case as comparable with the original question in the thread.

    The examples I posted were just plain text per the POSIX definition
    above. I frequently have to write awk scripts to deal with CSVs exported
    from Excel with lines that end in CRNL and can contain NL in quoted
    fields. I don't know if any of those files have had CRs as I certainly
    haven't assumed they can't be present or otherwise special-cased them.

    Ed.

    Your RS approach considers even the NL context which the C-code
    does not.

    But to be honest, I may as well just be missing the OP's point.
    In which cases it may make sense to use a separate module with a
    lot of C-code that needs to be (pre-)compiled and some specific
    mechanism to load it isn't obvious to me. (The "global approach"
    argument, beyond being personal taste, isn't very clear either.)

    And the mentioned reverse conversion is as simply done in plain
    Awk as the input conversion is.

    Janis

    N.B.: A case where I had written C-code for a similar task was
    for an environment where several people wrote and extended text
    files from different OSes; the files contained mixtures of every
    CR, LF, or CR LF combination, some had no final line ending and
    all such a mess. The program checked files and/or fixed them for
    any of the three common formats; (old-)Mac, Unix/(new-)Mac, and
    DOS, as desired. - Maybe such a more universal function may be
    useful for the GNU Awk extension library?

    I receive downloaded CSV files (from financial web sites, e.g.
    Fidelity.com) with terminating ASCII form feeds, so I use:

    BEGIN { RS = @/[\r\n\f]/; } # I assume "[\n\f\r]" is portable; "\n|\f|\r" works too

    This doesn't cover Ed's Excel special cases.
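
    One thing to watch with a single-character class as RS: a CR LF pair
    matches twice, which leaves an empty record between the CR and the LF.
    The sketch below counts records to show this. The helper name
    `count_records` is hypothetical, and the `+` variant is an assumption for
    comparison, not something from the post above. It merges consecutive
    terminators, so it also swallows genuinely blank lines. An awk with
    regular-expression RS (gawk, mawk) is assumed.

```shell
# count_records RS_REGEX: print how many records awk sees on stdin.
count_records() {
    awk -v RS="$1" 'END { print NR }'
}

# Two data records ("a" and "b"), CR LF after the first, form feed at the end.
printf 'a\r\nb\f' | count_records '[\r\n\f]'    # prints 3 (empty record inside CR LF)
printf 'a\r\nb\f' | count_records '[\r\n\f]+'   # prints 2
```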

  • From Janis Papanagnou@21:1/5 to Ed Morton on Mon Aug 2 14:42:00 2021
    On 01.08.2021 20:33, Ed Morton wrote:

    [...] In a text file I'd expect CR and/or NL as
    line termination character, and all other payload data should
    be plain text (with the exception of the TAB control character).

    Here's the POSIX definition of a text file:

    https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403

    "Text File = A file that contains characters organized into zero or more
    lines. The lines do not contain NUL characters and none can exceed
    {LINE_MAX} bytes in length, including the <newline> character."

    https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206

    "Line = A sequence of zero or more non- <newline> characters plus a
    terminating <newline> character."

    So a file that has CR as the line termination character is not a valid
    text file per POSIX but a file that has NL as the line termination
    character and contains CRs (be they immediately before the NLs or not)
    is a valid text file. Not sure why you mention the tab character as it's
    no different from X or # or any other character.

    Last question first: I mentioned the tab because, besides the blank, it
    is a common white-space character used to separate text in text files.
    I mentioned it separately because it is a control character (in many
    ways); for one, it is interpreted, whereas an 'X' or '#' is simply
    displayed as a readable character.

    The English Wikipedia (-> "ASCII") says about the control codes:
    "codes originally intended not to represent printable information,
    but rather to control devices".

    I had searched in the past for a general definition of a text file
    but don't recall having found one. The German Wikipedia page lists
    a couple of relevant aspects. It says that a text file contains
    representable (~printable) characters, which can be organized by control
    characters such as those for line and page changes. One characteristic is
    that it is readable without specific tools, using simple text editors.

    Excel's use of NL/CR for line structuring and substructures is no
    different from using ASCII FS, GS, RS, or US; it's a control code that
    needs interpretation, whereas NL/CR/CR-NL is at least uniquely defined
    for the same purpose on the respective platforms.

    A text file definition that considers a file containing STX, ACK, DC1,
    NAK, SYN, ETB, etc. to still constitute a "text file" is not well suited
    to the purposes for which I have dealt with such files during my long
    time in IT. (There may be contexts where it makes sense, or maybe not.)

    Janis
