• ATTN: GAWK developers. I need help with writing an input filter extensi

    From Kenny McCormack@21:1/5 to All on Wed Jul 28 13:21:06 2021
    First, note that I have already written one. It provides "readline"-like capability to GAWK. It uses a package which is similar to, but different
    from, "readline", so that you have a scrollback buffer when you are
    entering lines at the terminal in GAWK. I wrote it several years ago and
    use it extensively. So far, so good.

    Basically, what that extension does is, when called, it calls the "getline" function in the other package, then copies the line read from the buffer of
    the "getline" function into the buffer provided by GAWK. GAWK then picks
    it up and everything works as expected.

    But here's the thing. I want to write one now that will read the line
    normally and then do something to the line before returning it to GAWK.
    What I don't know how to do is to call GAWK's normal "getline" function
    from my extension library. So, what I am thinking of is something like:

    /* In my extension code; note that "fd" is passed in as a parameter */
    normal_gawk_input(fd,buff);
    /* Now examine (and possibly change) buff */
    ...
    /* And return to GAWK */
    return awk_true;

    Some notes:

    0) My target is Linux. Don't care about any other OS or any other
    "portability" or "standards" considerations.
    1) One of the sample extensions, readfile, looks like it does something
    similar to what I want. But it includes a function called
    read_file_to_buffer(), that looks more than a little above my pay grade.
    It seems like you shouldn't have to do that. I'd rather call
    whatever code GAWK already uses to read the line.
    2) I thought about using the Linux function getline(3). That would
    work, except for one little problem. The problem is that getline
    wants a FILE * object, but GAWK deals in "fd"s. You could use
    fdopen(3) to convert, but that seems messy. It seems wasteful to
    call fdopen() every time the input filter function is called, but I
    don't see any entirely safe way to avoid doing that. It would be
    nice if there was an "fd" version of getline(), but I don't know of
    anything like that. (see footnote below at (*))
    3) Alternatively, if there was some way to have GAWK read the input
    line "normally" and then call my function before continuing (i.e.,
    have the extension function be able to examine the line already
    read), then that'd be good. But I don't think there is any
    capability for that in GAWK, as of the current writing.

    Finally, another question about these "input filter" functions in general.
    The discussion so far has always been in terms of lines - i.e., the usual line-oriented input model. What happens if RS is set to something other
    than the default? Is the input filter function supposed to deal with that itself or does GAWK provide some kind of handling?

    (*) Part of the problem is that it seems clear to me that fdopen(3)
    allocates memory (presumably, using malloc() or similar) under the covers
    for the FILE * object that it creates. There doesn't seem to be any clean
    way to free() that allocated memory.

    --
    The randomly chosen signature file that would have appeared here is more than 4 lines long. As such, it violates one or more Usenet RFCs. In order to remain in compliance with said RFCs, the actual sig can be found at the following URL:
    http://user.xmission.com/~gazelle/Sigs/DanaC

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Spiros Bousbouras@21:1/5 to Kenny McCormack on Wed Jul 28 17:43:27 2021
    On Wed, 28 Jul 2021 13:21:06 -0000 (UTC)
    gazelle@shell.xmission.com (Kenny McCormack) wrote:
    First, note that I have already written one. It provides "readline"-like capability to GAWK. It uses a package which is similar to, but different from, "readline", so that you have a scrollback buffer when you are
    entering lines at the terminal in GAWK. I wrote it several years ago and
    use it extensively. So far, so good.

    Basically, what that extension does is, when called, it calls the "getline" function in the other package, then copies the line read from the buffer of the "getline" function into the buffer provided by GAWK. GAWK then picks
    it up and everything works as expected.

    But here's the thing. I want to write one now that will read the line normally and then do something to the line before returning it to GAWK.
    What I don't know how to do is to call GAWK's normal "getline" function
    from my extension library. So, what I am thinking of is something like:

    /* In my extension code; note that "fd" is passed in as a parameter */
    normal_gawk_input(fd,buff);
    /* Now examine (and possibly change) buff */
    ...
    /* And return to GAWK */
    return awk_true;

    Some notes:

    [...]

    2) I thought about using the Linux function getline(3). That would
    work, except for one little problem. The problem is that getline
    wants a FILE * object, but GAWK deals in "fd"s. You could use
    fdopen(3) to convert, but that seems messy. It seems wasteful to
    call fdopen() every time the input filter function is called, but I
    don't see any entirely safe way to avoid doing that. It would be
    nice if there was an "fd" version of getline(), but I don't know of
    anything like that. (see footnote below at (*))

    Isn't it trivial to write your own getline() with the interface you want?
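
A rough sketch of what such a function might look like, assuming the interface of getline(3) with the FILE * swapped for an int fd. The name fd_getline is hypothetical, and reading one byte per read(2) call is chosen only to keep the example short and free of read-ahead state; it is slow for large inputs.

```c
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: like getline(3), but reads from a file descriptor.
 * Returns the number of bytes stored (including the newline, if present),
 * or -1 on error or when EOF is hit before any byte was read. */
ssize_t fd_getline(char **lineptr, size_t *n, int fd)
{
    size_t len = 0;

    if (lineptr == NULL || n == NULL) {
        errno = EINVAL;
        return -1;
    }
    if (*lineptr == NULL || *n == 0) {
        /* realloc(NULL, size) behaves like malloc(size) */
        char *p = realloc(*lineptr, 128);
        if (p == NULL)
            return -1;
        *lineptr = p;
        *n = 128;
    }
    for (;;) {
        char c;
        ssize_t r = read(fd, &c, 1);

        if (r < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;
        }
        if (r == 0)
            break;                  /* EOF */
        if (len + 2 > *n) {         /* need room for c and the final NUL */
            size_t newsize = *n * 2;
            char *p = realloc(*lineptr, newsize);
            if (p == NULL)
                return -1;
            *lineptr = p;
            *n = newsize;
        }
        (*lineptr)[len++] = c;
        if (c == '\n')
            break;
    }
    if (len == 0)
        return -1;
    (*lineptr)[len] = '\0';
    return (ssize_t)len;
}
```

As with getline(3), the caller starts with a NULL pointer and a zero size, reuses the same buffer across calls, and calls free() on it when done.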

    [...]

    (*) Part of the problem is that it seems clear to me that fdopen(3)
    allocates memory (presumably, using malloc() or similar) under the covers
    for the FILE * object that it creates. There doesn't seem to be any clean way to free() that allocated memory.

    I don't know what your overall set up is and what function calls what when
    but you can specify your own buffer using setvbuf(). This way you can free
    it whenever you want.

    --
    vlaho.ninja/prog

  • From Bruce Horrocks@21:1/5 to Kenny McCormack on Wed Jul 28 23:43:25 2021
    On 28/07/2021 14:21, Kenny McCormack wrote:
    2) I thought about using the Linux function getline(3). That would
    work, except for one little problem. The problem is that getline
    wants a FILE * object, but GAWK deals in "fd"s. You could use
    fdopen(3) to convert, but that seems messy. It seems wasteful to
    call fdopen() every time the input filter function is called, but I
    don't see any entirely safe way to avoid doing that. It would be
    nice if there was an "fd" version of getline(), but I don't know of
    anything like that. (see footnote below at (*))

    You don't need to call fdopen() every time, if I understand this page correctly: <https://www.gnu.org/software/gawk/manual/html_node/Input-Parsers.html>

    I think you need only call it when your XXX_can_take_file() function is
    invoked and save the obtained FILE value in a global static.

    So that's once per file not once per record.

    --
    Bruce Horrocks
    Surrey, England

  • From Kenny McCormack@21:1/5 to 07.013@scorecrow.com on Thu Jul 29 00:53:11 2021
    In article <0da59d24-2344-71d2-ba62-a548e64c0f7c@scorecrow.com>,
    Bruce Horrocks <07.013@scorecrow.com> wrote:
    On 28/07/2021 14:21, Kenny McCormack wrote:
    2) I thought about using the Linux function getline(3). That would
    work, except for one little problem. The problem is that getline
    wants a FILE * object, but GAWK deals in "fd"s. You could use
    fdopen(3) to convert, but that seems messy. It seems wasteful to
    call fdopen() every time the input filter function is called, but I
    don't see any entirely safe way to avoid doing that. It would be
    nice if there was an "fd" version of getline(), but I don't know of
    anything like that. (see footnote below at (*))

    You don't need to call fdopen() every time, if I understand this page correctly: <https://www.gnu.org/software/gawk/manual/html_node/Input-Parsers.html>

    I think you need only call it when your XXX_can_take_file() function is
    invoked and save the obtained FILE value in a global static.

    So that's once per file not once per record.

    Thank you. That makes a lot of sense.

    Now, as it happens, it turns out I made a boo-boo here. My underlying
    assumption about what needed to be implemented was all wrong. Upon
    digging into things a bit deeper, I realized that the function you write as
    an Input Parser is not a replacement for some line-oriented function like
    getline(3), but is, rather, supposed to be a "drop-in" for read(2). Note
    that the default value for iobuf->read_func is "read". This is the thing
    that you change to point to your new function.

    That's why the new function that you are to define is declared as:

    static ssize_t XXX_read(int fd, void *buf, size_t nbytes);

    which is the same signature as read(2).

    Once I realized this, everything became quite clear.
    It also, incidentally, answers my question about RS.
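
The read(2)-shaped hook described above can be sketched as follows. This is only an illustration, not the actual parser mentioned in the post: upcase_read is a hypothetical name, and the upper-casing stands in for whatever the filter is meant to do. Installing it (pointing iobuf->read_func at the function) is left out because that needs the gawk extension headers. Since a function like this deals in raw bytes rather than lines, splitting the data into records is presumably still done by GAWK afterwards, which would explain why RS needs no special handling in the filter.

```c
#include <ctype.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical filter with the same parameters and return type as read(2):
 * fetch raw bytes with read(2), then upper-case them in place. */
ssize_t upcase_read(int fd, void *buf, size_t nbytes)
{
    ssize_t got = read(fd, buf, nbytes);
    unsigned char *p = buf;

    /* When read() fails (got == -1) or hits EOF (got == 0), the loop
     * body never runs and the result is passed through unchanged. */
    for (ssize_t i = 0; i < got; i++)
        p[i] = (unsigned char)toupper(p[i]);

    return got;     /* byte count, 0 at EOF, -1 on error, as with read(2) */
}
```

This version only rewrites bytes in place. A filter that adds or removes bytes would need its own buffering, because it can return at most nbytes per call.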

    Anyway, I was able to quickly write the new Input Parser that I had planned.
    I will be posting a summary of that new functionality soon.

    --
    Politics is show business for ugly people.

    Sports is politics for stupid people.
