A frequent task of mine is to write GAWK scripts that process files created
by Windows users. Naturally, I prefer to work/develop on Linux, while they
are creating/editing the files on Windows. I access their files via a
Samba share.
This all works well, except that I periodically get bitten by the fact
that DOS files have an extra character, a carriage return (CR), as the
last character of each line (as seen by a program running on Linux). One
ends up writing AWK code to deal with this, but it would be nice not to
have to do so.
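For illustration only (this is not the extension's code, which appears
further down): a minimal C sketch of the per-line cleanup in question,
assuming each record arrives as a NUL-terminated string with the newline
already removed. The name strip_trailing_cr is made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: remove a single trailing CR from a record.
 * Assumes rec is a writable, NUL-terminated string whose LF has
 * already been consumed by the line reader.
 * Returns the length of the record after the cleanup. */
static size_t strip_trailing_cr(char *rec)
{
    size_t len;

    assert(rec != NULL);
    len = strlen(rec);

    /* A DOS line "text\r\n" shows up on Linux as "text\r". */
    if (len > 0 && rec[len - 1] == '\r')
        rec[--len] = '\0';

    return len;
}
```

Called on "field1,field2\r" it leaves "field1,field2" in the buffer; a
record without a trailing CR is left untouched.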
To this end, I have written a GAWK extension that removes the CRs from the file. Source code is below. But first a few notes:
1) This was developed on Linux, and is written for version 1 of the GAWK API.
This is the version that the GAWK executables on most of my Linux
systems use. However, you can also compile it with Cygwin to run on a
Windows machine, using the Cygwin GAWK.EXE. Unfortunately, Cygwin
keeps GAWK up to date on its platform, so it is running GAWK 5.1,
which uses API version 3. Luckily, this requires only one source code
change. Here is the diff between the Windows version and the Linux
version of the source code:
77c76
< { NULL, NULL, 0, 0, awk_false, NULL }
---
> { NULL, NULL, 0 }
1a) Also, if compiling on Windows, you should change the output filename
in the compile command from RemoveCRs.so to RemoveCRs.dll. Then you won't
have to specify a filename extension when loading the DLL into GAWK.EXE.
2) I think you could do this more simply just by setting RS to something
that allows for the CRs (for example, RS = "\r?\n"), but I prefer a more
global approach. I dislike mucking up the AWK source with this sort of
thing. I also dislike changing any of the builtin "control" variables
(FS, RS, etc.) if I can avoid it.
3) I believe that frequent c.l.a. poster Kaz has made some modifications to
the Cygwin DLLs to do this same sort of thing. That is, of course, another
option.
4) Finally, note that for completeness, you should do the corresponding
conversion on output, that is, converting LF to CRLF. But my experience is
that this isn't really needed, since most (but not all!) Windows programs
nowadays handle Unix-style line endings just fine. Notepad is, of course,