Forum: >>> Magnum BBS <<<

Handling DOS (Windows) text files in Unix (Linux) - a nifty extensi

From Ed Morton@21:1/5 to Mack The Knife on Sun Aug 1 10:14:35 2021

On 8/1/2021 9:25 AM, Mack The Knife wrote:

> This is a lot of work to do what
>
> BEGIN { RS = "\r?\n" }
>
> would do. Even simpler would be to put
>
> tr -d '\r'
>
> as a stage in your pipeline before calling gawk.

Those would both break files that use `\n` within quoted fields and
`\r\n` record endings such as you'd get in CSVs exported from Excel when
there are cells with newlines within the spreadsheet, e.g. (assume `\r`
and `\n` are literal):

Right:

$ printf '"foo\nbar"\r\n' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}'
1 "foo
bar"

Wrong:

$ printf '"foo\nbar"\r\n' | awk 'BEGIN{RS="\r?\n"; FS=","} {print NR, $1}'
1 "foo
2 bar"

The `tr` would also break files that include `\r`s mid-record:

Right:

$ printf '"foo\rbar"\r\n' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}' | cat -Ev
1 "foo^Mbar"$

Wrong:

$ printf '"foo\rbar"\r\n' | tr -d '\r' | awk 'BEGIN{RS="\r\n"; FS=","} {print NR, $1}' | cat -Ev
1 "foobar"$

You can't robustly tell by reading a file if it uses DOS line endings or
not. For example, is this:

$ printf 'foo\nbar\r\n' | cat -Ev
foo$
bar^M$

1 line using DOS line endings where no line contains `\r`:

$ printf 'foo\nbar\r\n' | awk -v RS='\r\n' '{print NR, $0}' | cat -Ev
1 foo$
bar$

or 2 lines using Unix line endings where the 2nd line ends in `\r`?

$ printf 'foo\nbar\r\n' | awk -v RS='\n' '{print NR, $0}' | cat -Ev
1 foo$
2 bar^M$

It's impossible to tell from the data, you need to KNOW the format in
advance to be able to parse it correctly.

So, just figure out what your record endings are expected to be (`\r\n`
or `\n`) in advance based on some criterion that doesn't involve reading
the file (e.g. wherever you're getting the input from) and then use the
appropriate RS to parse it robustly.
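
One way to act on that, as a minimal sketch only: let the caller state the format and pick RS from it. `count_records` and its `dos`/`unix` labels are made-up names for illustration, and a multi-character RS assumes gawk or another awk that treats it as a regexp (POSIX leaves that case unspecified).

```shell
# count_records FORMAT FILE
# FORMAT ("dos" or "unix") comes from the caller, who knows where the
# file came from; nothing here tries to guess it from the data.
count_records() {
    case $1 in
        dos)  rs='\r\n' ;;   # CRLF ends a record; a bare LF is data
        unix) rs='\n'   ;;   # LF ends a record; a CR is data
        *)    echo "usage: count_records dos|unix file" >&2; return 2 ;;
    esac
    awk -v RS="$rs" 'END { print NR }' "$2"
}

# The ambiguous sample from above, counted both ways.
sample=$(mktemp)
printf 'foo\nbar\r\n' > "$sample"
count_records dos  "$sample"   # 1
count_records unix "$sample"   # 2
rm -f "$sample"
```

The same bytes give two different answers, and only the caller's knowledge of the source decides which one is right.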

Regards,

Ed.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Mack The Knife@21:1/5 to Kenny McCormack on Sun Aug 1 14:25:35 2021

This is a lot of work to do what

BEGIN { RS = "\r?\n" }

would do. Even simpler would be to put

tr -d '\r'

as a stage in your pipeline before calling gawk.

Mack

In article <se3m1s$7qsa$1@news.xmission.com>,
Kenny McCormack <gazelle@shell.xmission.com> wrote:

> A frequent task of mine is to write GAWK scripts that process files created
> by Windows users. Naturally, I prefer to work/develop on Linux, while they
> are creating/editing the files on Windows. I access their files via a
> Samba share.
>
> This all works well, except that I periodically get bit by the fact that
> DOS files have an extra character as the last character of the line (as
> seen by a program running on Linux). One ends up writing AWK code to deal
> with this, but it would be nice to not have to do so.
>
> To this end, I have written a GAWK extension that removes the CRs from the
> file. Source code is below.

From Janis Papanagnou@21:1/5 to Mack The Knife on Sun Aug 1 18:35:19 2021

On 01.08.2021 16:25, Mack The Knife wrote:

> This is a lot of work to do what
>
> BEGIN { RS = "\r?\n" }
>
> would do. [...]

That's what I'd also do. It's simple and covers the DOS and the
Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
line termination character, and all other payload data should
be plain text (with the exception of the TAB control character).

The OP's 'C'-code seems to just remove all CRs from anywhere in
the file, so I also don't see Ed's (IMO non-text) Excel-export
case as comparable with the original question in the thread.

Your RS approach considers even the NL context which the C-code
does not.

But to be honest, I may as well just be missing the OP's point.
In which cases it may make sense to use a separate module with a
lot of C-code that needs to be (pre-)compiled and some specific
mechanism to load it isn't obvious to me. (The "global approach"
argument, beyond being personal taste, isn't very clear either.)

And the mentioned reverse conversion is as simply done in plain
Awk as the input conversion is.
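
For reference, both directions might look like this in plain awk. This is a minimal sketch: the function names are made up for illustration, the input side uses the `\r?\n` RS discussed above (so it assumes an awk with regexp RS, e.g. gawk, and shares the embedded-newline caveat Ed raised), and the output side only sets ORS.

```shell
# DOS -> Unix on input: CRLF or LF ends a record, LF is written back.
dos2unix_awk() { awk -v RS='\r?\n' '{ print }' "$@"; }

# Unix -> DOS on output: LF ends a record, CRLF is written back.
unix2dos_awk() { awk -v ORS='\r\n' '{ print }' "$@"; }

# Round trip, made visible with od.
printf 'a\nb\n' | unix2dos_awk | od -c
printf 'a\nb\n' | unix2dos_awk | dos2unix_awk | od -c
```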

Janis

N.B.: A case where I had written C code for a similar task was
for an environment where several people wrote and extended text
files from different OSes; the files contained mixtures of every
CR, LF, or CR LF combination, some had no final line ending, and
so on; quite a mess. The program checked files and/or fixed them
for any of the three common formats: (old-)Mac, Unix/(new-)Mac,
and DOS, as desired. Maybe such a more universal function would
be useful for the GNU Awk extension library?
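
A rough awk approximation of such a normalizer, limited to the convert-to-Unix direction, could be as short as the sketch below. It is not the C program described above; `to_unix` is a made-up name, and it assumes an awk with regexp RS (gawk, mawk). CR LF, a lone CR, and a lone LF each end a record, and `print` supplies a final line ending when the input lacks one.

```shell
# CR optionally followed by LF, or a lone LF, ends a record;
# every record is written back with a plain LF.
to_unix() { awk -v RS='\r\n?|\n' '{ print }' "$@"; }

# Mixed endings and no final line ending, made visible with od.
printf 'one\rtwo\r\nthree\nfour' | to_unix | od -c
```

As with the other RS-based approaches in this thread, it treats every CR or LF as a terminator, so it is only suitable where none of them can be payload data.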

From Ed Morton@21:1/5 to Janis Papanagnou on Sun Aug 1 13:33:13 2021

On 8/1/2021 11:35 AM, Janis Papanagnou wrote:

> On 01.08.2021 16:25, Mack The Knife wrote:
>
>> This is a lot of work to do what
>>
>> BEGIN { RS = "\r?\n" }
>>
>> would do. [...]
>
> That's what I'd also do. It's simple and covers the DOS and the
> Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
> line termination character, and all other payload data should
> be plain text (with the exception of the TAB control character).

Here's the POSIX definition of a text file:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403
"Text File = A file that contains characters organized into zero or more
lines. The lines do not contain NUL characters and none can exceed
{LINE_MAX} bytes in length, including the <newline> character."

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
"Line = A sequence of zero or more non- <newline> characters plus a
terminating <newline> character."

So a file that has CR as the line termination character is not a valid
text file per POSIX but a file that has NL as the line termination
character and contains CRs (be they immediately before the NLs or not)
is a valid text file. Not sure why you mention the tab character as it's
no different from X or # or any other character.
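
To make that concrete, here is a small sketch that checks only one part of the definition: that a non-empty file ends in a <newline>. `ends_in_newline` is a made-up helper, and the NUL and {LINE_MAX} conditions are deliberately not checked.

```shell
# Succeeds if FILE is empty (zero lines) or its last byte is LF.
ends_in_newline() {
    [ -s "$1" ] || return 0
    [ "$(tail -c 1 "$1" | od -An -tx1 | tr -d ' \n')" = "0a" ]
}

f=$(mktemp)
printf 'foo\r\nbar\r\n' > "$f"
ends_in_newline "$f" && echo "CRLF file: last line is newline-terminated"
printf 'foo\rbar\r' > "$f"
ends_in_newline "$f" || echo "CR-only file: no terminating newline"
rm -f "$f"
```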

> The OP's 'C'-code seems to just remove all CRs from anywhere in
> the file, so I also don't see Ed's (IMO non-text) Excel-export
> case as comparable with the original question in the thread.

The examples I posted were just plain text per the POSIX definition
above. I frequently have to write awk scripts to deal with CSVs exported
from Excel with lines that end in CRNL and can contain NL in quoted
fields. I don't know if any of those files have had CRs as I certainly
haven't assumed they can't be present or otherwise special-cased them.

Ed.

From J Naman@21:1/5 to Ed Morton on Sun Aug 1 13:29:48 2021

I receive downloaded CSV files (from financial web sites, e.g. Fidelity.com) with terminating ASCII form feeds, so I use:
BEGIN { RS = @/[\r\n\f]/; } # I assume "[\n\f\r]" is portable; "\n|\f|\r" works too
Doesn't cover Ed's Excel special cases.
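
One thing to watch with a single-character class, shown as a hedged sketch on a made-up sample (two CRLF rows plus a trailing form feed): CR and LF each end a record, so every CRLF produces an empty record in between. Adding `+` treats a run of terminators as one, at the cost of also swallowing genuinely blank lines. The string form of RS is used here rather than `@/.../`; a multi-character RS is unspecified by POSIX, so this assumes gawk or mawk.

```shell
# Hypothetical sample: two CRLF-terminated rows, then a form feed.
sample() { printf 'a,1\r\nb,2\r\n\f'; }

# Each of CR, LF, FF ends a record: empty records appear between them.
sample | awk -v RS='[\r\n\f]'  '{ printf "%d <%s>\n", NR, $0 }'

# A run of terminators ends a single record.
sample | awk -v RS='[\r\n\f]+' '{ printf "%d <%s>\n", NR, $0 }'
```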

From Janis Papanagnou@21:1/5 to Ed Morton on Mon Aug 2 14:42:00 2021

On 01.08.2021 20:33, Ed Morton wrote:

>> [...] In a text file I'd expect CR and/or NL as
>> line termination character, and all other payload data should
>> be plain text (with the exception of the TAB control character).
>
> Here's the POSIX definition of a text file:
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403
>
> "Text File = A file that contains characters organized into zero or more
> lines. The lines do not contain NUL characters and none can exceed
> {LINE_MAX} bytes in length, including the <newline> character."
>
> https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
>
> "Line = A sequence of zero or more non- <newline> characters plus a
> terminating <newline> character."
>
> So a file that has CR as the line termination character is not a valid
> text file per POSIX but a file that has NL as the line termination
> character and contains CRs (be they immediately before the NLs or not)
> is a valid text file. Not sure why you mention the tab character as it's
> no different from X or # or any other character.

Last question first: I mentioned the Tab because, besides the
Blank, it is a common white-space character used to separate text in
text files. I mentioned it separately because it is a control
character (in many ways); one of them is that it is interpreted,
whereas an 'X' or '#' is just displayed as the readable character it is.

The English Wikipedia (-> "ASCII") says about the control codes:
"codes originally intended not to represent printable information,
but rather to control devices".

I had searched in the past for a general definition of a text file
but don't recall having found one. The German Wikipedia page shows
a couple of relevant aspects. It says that a text file contains
representable (~printable) characters that can be organized by control
characters such as those for a line break and a page break. A
characteristic is that it is readable without specific tools, using
simple text editors.

Excel using NL/CR for line structuring and substructures
is no different from using ASCII FS, GS, RS, or US; each is a control
code that needs interpretation. NL/CR/CR-NL, by contrast, is at least
uniquely defined for the same purpose on the respective platforms.

A text file definition that considers STX, ACK, DC1, NAK, SYN, ETB,
etc. in a file to still constitute a "text file" is not well suited
to the purposes for which I have been talking about such files during
my long time in IT. (There may be contexts where it makes sense, or
maybe not.)

Janis
