• Handling DOS (Windows) text files in Unix (Linux) - a nifty extension l

    From Kenny McCormack@21:1/5 to All on Sat Jul 31 14:17:32 2021
    A frequent task of mine is to write GAWK scripts that process files created
    by Windows users. Naturally, I prefer to work/develop on Linux, while they
    are creating/editing the files on Windows. I access their files via a
    Samba share.

    This all works well, except that I periodically get bitten by the fact
    that DOS files have an extra character, a carriage return (CR), as the
    last character of each line (as seen by a program running on Linux). One
    ends up writing AWK code to deal with this, but it would be nice to not
    have to do so.
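
    A minimal sketch of the kind of per-script workaround meant here, not
    taken from the post: the record is stripped of a trailing CR before any
    field is used. The helper name last_field and the sample input are
    hypothetical, chosen only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the last field of each record in brackets.
# sub() removes a trailing CR from $0; assigning to $0 re-splits the fields,
# so $NF no longer carries the CR.
last_field() {
    awk '{ sub(/\r$/, ""); print "[" $NF "]" }' "$@"
}

# Demo with DOS-style (CRLF) input.
printf 'alpha beta\r\ngamma delta\r\n' | last_field
```

    Without the sub() call, the last field would be "beta" followed by a
    CR, so a comparison such as $NF == "beta" would quietly fail.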

    To this end, I have written a GAWK extension that removes the CRs from
    the file. Source code is below. But first a few notes:

    1) This was developed on Linux, and is written for version 1 of the GAWK API.
    This is the version that the GAWK executables on most of my Linux
    systems use. However, you can also compile it with Cygwin to run on a
    Windows machine, using the Cygwin GAWK.EXE. Unfortunately, Cygwin
    keeps GAWK up to date on their platform, so it is running GAWK 5.1,
    which uses API version 3. Luckily, this requires only one source code
    change. Here is the diff between the Windows version and the Linux
    version of the source code:

    77c76
    < { NULL, NULL, 0, 0, awk_false, NULL }
    ---
    > { NULL, NULL, 0 }

    1a) Also, if compiling on Windows, you should change the output filename
    in the compile command from RemoveCRs.so to RemoveCRs.dll. Then you won't
    have to specify a filename extension when loading the DLL into GAWK.EXE.

    2) I think you could do this more simply just by setting RS to something
    that allows for the CRs, but I prefer a more global approach. I dislike
    mucking up the AWK source with this sort of thing. I also dislike changing
    any of the built-in "control" variables (FS, RS, etc.) if I can avoid it.
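
    For comparison, the RS approach mentioned in this note might look like
    the following sketch. It is an illustration, not code from the post: the
    helper name last_field_rs and the sample input are hypothetical, and it
    assumes the awk on the PATH is gawk (or another awk, such as mawk, that
    treats a multi-character RS as a regular expression).

```shell
#!/bin/sh
# Hypothetical helper: accept either LF or CRLF as the record terminator by
# making RS a regular expression. Assumption: the awk in use treats a
# multi-character RS as a regex (gawk does); a strictly POSIX awk may only
# honor the first character of RS.
last_field_rs() {
    awk 'BEGIN { RS = "\r?\n" } { print "[" $NF "]" }' "$@"
}

# Demo with mixed line endings: the first line is CRLF, the second is LF.
printf 'alpha beta\r\ngamma delta\n' | last_field_rs
```

    The cost is the one described above: every script has to carry the
    BEGIN assignment, and RS no longer has its default value.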

    3) I believe that frequent c.l.a. poster Kaz has made some modifications
    to the Cygwin DLLs to do this same sort of thing. That is, of course,
    another option.

    4) Finally, note that for total completeness, you should do the
    corresponding conversion on output - that is, converting LF to CRLF. But
    my experience is that this isn't really needed, since most (but not all!)
    Windows programs nowadays handle Unix-style line endings just fine.
    Notepad is, of course,