This is a lot of work to do what
BEGIN { RS = "\r?\n" }
would do. Even simpler would be to put
tr -d '\r'
as a stage in your pipeline before calling gawk.
Mack
In article <se3m1s$7qsa$1@news.xmission.com>,
Kenny McCormack <gazelle@shell.xmission.com> wrote:
> A frequent task of mine is to write GAWK scripts that process files created
> by Windows users. Naturally, I prefer to work/develop on Linux, while they
> are creating/editing the files on Windows. I access their files via a
> Samba share.
>
> This all works well, except that I periodically get bit by the fact that
> DOS files have an extra character as the last character of the line (as
> seen by a program running on Linux). One ends up writing AWK code to deal
> with this, but it would be nice to not have to do so.
>
> To this end, I have written a GAWK extension that removes the CRs from the
> file. Source code is below.
On 01.08.2021 16:25, Mack The Knife wrote:
> This is a lot of work to do what
> BEGIN { RS = "\r?\n" }
> would do. [...]
That's what I'd also do. It's simple and covers the DOS and the
Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
line termination character, and all other payload data should
be plain text (with the exception of the TAB control character).

The OP's 'C'-code seems to just remove all CRs from anywhere in
the file, so I also don't see Ed's (IMO non-text) Excel-export
case as comparable with the original question in the thread.
Your RS approach considers even the NL context, which the C-code
does not.
But to be honest, I may as well just be missing the OP's point.
In which cases it may make sense to use a separate module with a
lot of C-code that needs to be (pre-)compiled and some specific
mechanism to load it isn't obvious to me. (The "global approach"
argument, beyond being personal taste, isn't very clear either.)
And the mentioned reverse conversion is as simply done in plain
Awk as the input conversion is.
Janis
N.B.: A case where I had written C-code for a similar task was
for an environment where several people wrote and extended text
files from different OSes; the files contained mixtures of every
CR, LF, or CR LF combination, some had no final line ending, and
similar messes. The program checked files and/or fixed them for
any of the three common formats: (old-)Mac, Unix/(new-)Mac, and
DOS, as desired. - Maybe such a more universal function may be
useful for the GNU Awk extension library?
On 8/1/2021 11:35 AM, Janis Papanagnou wrote:
> On 01.08.2021 16:25, Mack The Knife wrote:
>> This is a lot of work to do what
>> BEGIN { RS = "\r?\n" }
>> would do. [...]
>
> That's what I'd also do. It's simple and covers the DOS and the
> Unix/(new-)Mac case. In a text file I'd expect CR and/or NL as
> line termination character, and all other payload data should
> be plain text (with the exception of the TAB control character).

Here's the POSIX definition of a text file:
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_403
"Text File = A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed
{LINE_MAX} bytes in length, including the <newline> character."
https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_206
"Line = A sequence of zero or more non-<newline> characters plus a terminating <newline> character."
So a file that has CR as the line termination character is not a valid
text file per POSIX but a file that has NL as the line termination
character and contains CRs (be they immediately before the NLs or not)
is a valid text file. Not sure why you mention the tab character as it's
no different from X or # or any other character.
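
A quick way to see what "CR is just another character" means in practice, using awk with its default RS. The input strings and the helper names record_length and second_field_is_yes are made up for this sketch.

```shell
# With the default RS (newline), a CR before the NL stays in the record.
record_length() { awk '{ print length($0) }'; }
printf 'abc\r\n' | record_length

# CR is not whitespace for default field splitting, so it sticks to the
# last field and makes an exact string comparison fail.
second_field_is_yes() { awk '{ print (($2 == "yes") ? "match" : "no match") }'; }
printf 'key yes\r\n' | second_field_is_yes
printf 'key yes\n'   | second_field_is_yes
```

The first command counts four characters (a, b, c and the CR); the CR NL line then reports "no match" while the plain NL line reports "match".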
> The OP's 'C'-code seems to just remove all CRs from anywhere in
> the file - , so I also don't see Ed's (IMO non-text) Excel-export
> case as comparable with the original question in the thread.

The examples I posted were just plain text per the POSIX definition
above. I frequently have to write awk scripts to deal with CSVs exported
from Excel with lines that end in CRNL and can contain NL in quoted
fields. I don't know if any of those files have had CRs as I certainly
haven't assumed they can't be present or otherwise special-cased them.
Ed.