• Re: Removing extra Delimiter in a text file

    From Janis Papanagnou@21:1/5 to Sumit Modi on Tue Oct 17 12:10:56 2023
    On 17.10.2023 11:52, Sumit Modi wrote:
    > Hi All,
    >
    > I have a scenario where I need to remove the extra delimiter in a unix file.
    >
    > Input : abc|def|"ghi|123"|mno|"vdv|456"|ghu
    > Output : abc|def|"ghi123" |mno|"vdv456"|ghu

    (I suppose that spurious blank is unintentional.)

    > I tried different Sed and Awk options but could not get to the solution. Could you please help me to resolve this riddle?

    Assuming there's only one pipe symbol possible within "..." you can use

    sed 's/\("[^"]*\)|\([^"]*"\)/\1\2/g'

    Janis

    > Thanks in Advance.


    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
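
A quick way to see the single-pipe assumption at work is to run the sed command from the post above on both kinds of input. The following is a minimal sketch; the wrapper name `strip_one_pipe` is made up for illustration and does not come from the thread.

```shell
# Wrap the sed command from the post above so it can be fed test lines.
# strip_one_pipe is a hypothetical name used only for this illustration.
strip_one_pipe() {
    sed 's/\("[^"]*\)|\([^"]*"\)/\1\2/g'
}

# One pipe per quoted field: both inner pipes are removed.
printf '%s\n' 'abc|def|"ghi|123"|mno|"vdv|456"|ghu' | strip_one_pipe

# Two pipes in one quoted field: a single pass removes only one of them,
# so a pipe is still left inside the first quoted field.
printf '%s\n' 'abc|def|"ghi|123|789"|mno|"vdv|456"|ghu' | strip_one_pipe
```

Simply re-running the same substitution in a loop is not a safe fix: once a field is clean, the pattern can match from a closing quote to the next opening quote and delete a real delimiter.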
  • From Sumit Modi@21:1/5 to All on Tue Oct 17 02:52:39 2023
    Hi All,

    I have a scenario where I need to remove the extra delimiter in a unix file.

    Input : abc|def|"ghi|123"|mno|"vdv|456"|ghu
    Output : abc|def|"ghi123" |mno|"vdv456"|ghu

    I tried different Sed and Awk options but could not get to the solution.
    Could you please help me to resolve this riddle?

    Thanks in Advance.

  • From Spiros Bousbouras@21:1/5 to Sumit Modi on Tue Oct 17 10:12:20 2023
    On Tue, 17 Oct 2023 02:52:39 -0700 (PDT)
    Sumit Modi <sumitmodi1988@gmail.com> wrote:
    > Hi All,
    >
    > I have a scenario where I need to remove the extra delimiter in a unix file.
    >
    > Input : abc|def|"ghi|123"|mno|"vdv|456"|ghu
    > Output : abc|def|"ghi123" |mno|"vdv456"|ghu
    >
    > I tried different Sed and Awk options but could not get to the solution. Could you please help me to resolve this riddle?

    sed 's/\("[^|]*\)|\([^|]*"\)/\1\2 /g'

  • From Sumit Modi@21:1/5 to Spiros Bousbouras on Tue Oct 17 06:26:57 2023
    On Tuesday, October 17, 2023 at 12:12:29 PM UTC+2, Spiros Bousbouras wrote:
    > On Tue, 17 Oct 2023 02:52:39 -0700 (PDT)
    > Sumit Modi <sumitm...@gmail.com> wrote:
    >> Hi All,
    >>
    >> I have a scenario where I need to remove the extra delimiter in a unix file.
    >>
    >> Input : abc|def|"ghi|123"|mno|"vdv|456"|ghu
    >> Output : abc|def|"ghi123" |mno|"vdv456"|ghu
    >>
    >> I tried different Sed and Awk options but could not get to the solution. Could you please help me to resolve this riddle?
    >
    > sed 's/\("[^|]*\)|\([^|]*"\)/\1\2 /g'

    Hi Both,

    Thanks for your message.
    We can have any number of '|' characters inside "...", so the given sed commands won't work in that case.
    Is there any other solution that can help in this scenario?

    Input : abc|def|"ghi|123|789"|mno|"vdv|456"|ghu
    Output : abc|def|"ghi123789" |mno|"vdv456"|ghu

    thanks a lot in Advance!!
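
For the multi-pipe case described above, one possible approach (an editor's sketch, not a command posted in the thread) is a sed loop that only ever deletes a pipe preceded by an odd number of double quotes on the line. It assumes that quotes are balanced, that quoted fields do not span lines, and that fields contain no escaped quotes. The function name `strip_quoted_pipes` is hypothetical.

```shell
# Hypothetical helper: delete every '|' that sits inside a "..." field.
strip_quoted_pipes() {
    # \([^"]*"[^"]*"\)*  skips zero or more complete "..." fields from line start
    # [^"]*"             reaches the next opening quote
    # [^"|]*|            finds the first pipe after that opening quote
    # The pipe is dropped and 'ta' repeats until no substitution succeeds.
    sed -e ':a' -e 's/^\(\([^"]*"[^"]*"\)*[^"]*"[^"|]*\)|/\1/' -e 'ta'
}

printf '%s\n' 'abc|def|"ghi|123|789"|mno|"vdv|456"|ghu' | strip_quoted_pipes
```

Anchoring at the start of the line is what keeps quote pairing correct across loop iterations. The result does not contain the blank that appears before `|mno` in the posted Output line.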

  • From Lew Pitcher@21:1/5 to Sumit Modi on Tue Oct 17 16:11:16 2023
    On Tue, 17 Oct 2023 02:52:39 -0700, Sumit Modi wrote:

    > Hi All,
    >
    > I have a scenario where I need to remove the extra delimiter in a unix
    > file.
    >
    > Input : abc|def|"ghi|123"|mno|"vdv|456"|ghu
    > Output : abc|def|"ghi123"|mno|"vdv456"|ghu
    >
    > I tried different Sed and Awk options but could not get to the solution. Could you please help me to resolve this riddle?

    Janis and Spiros have provided great solutions that fit the subject matter
    of comp.unix.shell.

    But, there are times where the standard shell utilities either cannot
    perform a given task, or will do so, but with a very complex setup.
    For those times, the "programming language" tools can help. I whipped
    up a "simple" lex(1) solution that takes care of your requirements:

    ------ debar.lex ------------

    %{
    #include <stdio.h>
    #include <string.h>
    #include <stdlib.h>
    %}
    %x QSTRING QEND
    %%
    \"  {
            /* a doublequote starts a quoted string.
            ** output the doublequote, and process
            ** following characters as contents of
            ** quoted string.
            */
            putchar(*yytext);
            BEGIN QSTRING;
        }

    .   {
            /* any other character is output directly */
            putchar(*yytext);
        }

    <QSTRING>\" {
            /* a doublequote within a quoted string
            ** terminates the quoted string. remove the
            ** doublequote from the string, clean out any
            ** bars from the quoted string, output the
            ** cleaned-up string, then let QEND take care
            ** of the closing doublequote
            */
            yyless(yyleng-1); /* requeue doublequote */
            BEGIN QEND; /* use QEND to process it */

            char *scratch, *dst,
                 *src = yytext;

            dst = scratch = calloc(yyleng+1,1);
            for (size_t len = yyleng; len > 0; --len)
            {
                if (*src != '|') *dst++ = *src;
                ++src;
            }
            fputs(scratch,stdout);
            free(scratch);
        }

    <QSTRING>. {
            /* any other character in a quoted string just
            ** extends the quoted string.
            */
            yymore();
        }

    <QEND>\" {
            /* terminate the output quoted string with
            ** its doublequote */
            putchar(*yytext);
            BEGIN INITIAL;
        }
    %%
    int main(void) { yylex(); return 0; }
    int yywrap() { return 1; }

    To build:
        lex debar.lex
        cc -o debar lex.yy.c

    To run:
        echo 'abc|def|"ghi|123"|mno|"vdv|456"|ghu' | ./debar


    --
    Lew Pitcher
    "In Skills We Trust"

  • From Kaz Kylheku@21:1/5 to Lew Pitcher on Tue Oct 17 19:53:38 2023
    On 2023-10-17, Lew Pitcher <lew.pitcher@digitalfreehold.ca> wrote:
    > On Tue, 17 Oct 2023 02:52:39 -0700, Sumit Modi wrote:
    >
    >> Hi All,
    >>
    >> I have a scenario where I need to remove the extra delimiter in a unix
    >> file.
    >>
    >> Input : abc|def|"ghi|123"|mno|"vdv|456"|ghu
    >> Output : abc|def|"ghi123"|mno|"vdv456"|ghu
    >>
    >> I tried different Sed and Awk options but could not get to the solution.
    >> Could you please help me to resolve this riddle?
    >
    > Janis and Spiros have provided great solutions that fit the subject matter
    > of comp.unix.shell
    >
    > But, there are times where the standard shell utilities either cannot
    > perform a given task, or will do so, but with a very complex setup.
    > For those times, the "programming language" tools can help. I whipped
    > up a "simple" lex(1) solution that takes care of your requirements:

    I'm afraid to opine that you've perpetrated a useless use of lex.

    - all your token patterns are trivial one-character matches.

    - all the work is done by a state machine made up of Lex start conditions.

    The whole thing can be a switch statement in a getchar() loop.

    > <QSTRING>\" {
    >         /* a doublequote within a quoted string
    >         ** terminates the quoted string. remove the
    >         ** doublequote from the string, clean out any
    >         ** bars from the quoted string, output the
    >         ** cleaned-up string, then let QEND take care
    >         ** of the closing doublequote
    >         */
    >         yyless(yyleng-1); /* requeue doublequote */
    >         BEGIN QEND; /* use QEND to process it */

    When the string ends, we should just putchar the double
    quote and go straight to the INITIAL state; there is
    no need for the QEND to exist.

    I've been using lex for 30 years and don't remember ever
    using yyless. I forgot about its existence.


    >         char *scratch, *dst,
    >              *src = yytext;
    >
    >         dst = scratch = calloc(yyleng+1,1);
    >         for (size_t len = yyleng; len > 0; --len)
    >         {
    >             if (*src != '|') *dst++ = *src;
    >             ++src;
    >         }
    >         fputs(scratch,stdout);
    >         free(scratch);
    >     }

    I don't understand this. The token here is just a single
    character: the double quote we matched; yyleng should be 1; yytext[0] is
    '"'; and yytext[1] is 0.

    We do not have the accumulated string token in any buffer;
    we just shipped it to standard output.

    I think this would just work:

    <QSTRING>\| {
            /* Pipe within quoted string is deleted.
            ** Thus empty action (no state change or putchar).
            */
        }

    How about something like this:

    (Tim Rentsch, please ignore and skip to tail recursive version
    below.)

    #include <stdio.h>

    void yylex(void)
    {
        enum { INITIAL, QSTRING } state = INITIAL;
        int ch;

        while ((ch = getchar()) != EOF) {
            switch (state) {
            case INITIAL:
                /* a double quote opens a quoted string */
                if (ch == '"')
                    state = QSTRING;
                putchar(ch);
                break;
            case QSTRING:
                switch (ch) {
                case '|':
                    /* pipe inside quotes: drop it */
                    break;
                case '"':
                    state = INITIAL;
                    /* fallthrough */
                default:
                    putchar(ch);
                    break;
                }
                break;
            }
        }
    }

    Tail recursive version:

    int yylex(void);

    int yylex_string(void)
    {
        int ch = getchar();
        switch (ch) {
        case EOF:
            return EOF; /* error: unterminated string */
        case '"':
            putchar(ch);
            return yylex();
        default:
            putchar(ch);
            /* fallthrough */
        case '|':
            return yylex_string();
        }
    }

    int yylex(void)
    {
        int ch = getchar();
        switch (ch) {
        case EOF:
            return 0;
        case '"':
            putchar(ch);
            return yylex_string();
        default:
            putchar(ch);
            return yylex();
        }
    }


    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca
    NOTE: If you use Google Groups, I don't see you, unless you're whitelisted.
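
The two-state machine described in this post does not strictly need C. As a rough sketch, the same logic can be written in awk, which fits the newsgroup's tooling. This is not code from the thread; `debar_awk` and `in_quote` are illustrative names. Because `in_quote` persists across records, a quoted field that continues onto the next line is handled too, as it is in the getchar() loop.

```shell
# Hypothetical awk rendering of the INITIAL/QSTRING state machine.
debar_awk() {
    awk '
    {
        out = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            c = substr($0, i, 1)
            # A double quote toggles between the two states.
            if (c == "\"")
                in_quote = !in_quote
            # Copy every character except a pipe seen inside quotes.
            if (!(in_quote && c == "|"))
                out = out c
        }
        print out
    }'
}

printf '%s\n' 'abc|def|"ghi|123|789"|mno|"vdv|456"|ghu' | debar_awk
```

Walking the line one character at a time is slower than a field-based awk script, but it keeps the correspondence with the C version obvious.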

  • From Spiros Bousbouras@21:1/5 to Sumit Modi on Sat Oct 21 11:10:06 2023
    On Tue, 17 Oct 2023 06:26:57 -0700 (PDT)
    Sumit Modi <sumitmodi1988@gmail.com> wrote:
    > On Tuesday, October 17, 2023 at 12:12:29 PM UTC+2, Spiros Bousbouras wrote:
    >> On Tue, 17 Oct 2023 02:52:39 -0700 (PDT)
    >> Sumit Modi <sumitm...@gmail.com> wrote:
    >>> Hi All,
    >>>
    >>> I have a scenario where I need to remove the extra delimiter in a unix file.
    >>>
    >>> Input : abc|def|"ghi|123"|mno|"vdv|456"|ghu
    >>> Output : abc|def|"ghi123" |mno|"vdv456"|ghu
    >>>
    >>> I tried different Sed and Awk options but could not get to the solution. Could you please help me to resolve this riddle?
    >>
    >> sed 's/\("[^|]*\)|\([^|]*"\)/\1\2 /g'
    >
    > Thanks for your message.
    > We can have any number of '|' characters inside "...", so the given sed commands won't work in that case.
    > Is there any other solution that can help in this scenario?
    >
    > Input : abc|def|"ghi|123|789"|mno|"vdv|456"|ghu
    > Output : abc|def|"ghi123789" |mno|"vdv456"|ghu

    ( IFS=\" ; i=1 ; result=
      while read a ; do
        for b in $a ; do
          if (( i & 1 )) ; then
            result="$result$b"\"
          else
            result="$result${b//|}"\"
          fi
          i=$(( $i + 1 ))
        done
      done
      echo "${result%\"}"
    )

    --
    vlaho.ninja/prog

  • From Spiros Bousbouras@21:1/5 to Spiros Bousbouras on Sun Oct 22 09:25:26 2023
    On Sat, 21 Oct 2023 11:10:06 -0000 (UTC)
    Spiros Bousbouras <spibou@gmail.com> wrote:
    > ( IFS=\" ; i=1 ; result=
    >   while read a ; do
    >     for b in $a ; do
    >       if (( i & 1 )) ; then
    >         result="$result$b"\"
    >       else
    >         result="$result${b//|}"\"
    >       fi
    >       i=$(( $i + 1 ))
    >     done
    >   done
    >   echo "${result%\"}"
    > )

    I had only tested the above with single line inputs but it's not correct
    for multiline inputs. Here's a correct version :

    ( IFS=\"
      while read a ; do
        result=
        i=1
        for b in $a ; do
          if (( i & 1 )) ; then
            result="$result$b"\"
          else
            result="$result${b//|}"\"
          fi
          i=$(( $i + 1 ))
        done
        echo "${result%\"}"
      done
    ) < your-file

    If an input line ends with a double quote, this gets removed. If the input
    is such that this may be a problem, the script can be modified easily enough to correct for this.

  • From Ed Morton@21:1/5 to Sumit Modi on Thu Nov 2 05:45:25 2023
    On 10/17/2023 4:52 AM, Sumit Modi wrote:
    > Hi All,
    >
    > I have a scenario where I need to remove the extra delimiter in a unix file.
    >
    > Input : abc|def|"ghi|123"|mno|"vdv|456"|ghu
    > Output : abc|def|"ghi123" |mno|"vdv456"|ghu
    >
    > I tried different Sed and Awk options but could not get to the solution. Could you please help me to resolve this riddle?
    >
    > Thanks in Advance.

    Using any awk for any number of `|` inside the quoted fields:

    $ awk 'BEGIN{FS=OFS="\""} {for (i=2; i<=NF; i+=2) gsub(/[|]/,"",$i)} 1' file
    abc|def|"ghi123"|mno|"vdv456"|ghu

    That assumes you don't have newlines inside the quoted fields. If you do
    then see https://stackoverflow.com/q/45420535/1745001 for how to handle
    that using awk.

    Ed.
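
Why the loop starts at field 2 and steps by 2 becomes clearer when the fields are printed. The sketch below is illustrative only (`show_quote_fields` is a made-up name): it splits a line on the double quote in the same way as the one-liner and lists each field with its index. As long as the quotes are balanced, the even-numbered fields are the quoted ones.

```shell
# Hypothetical helper: print each "-separated field with its index.
show_quote_fields() {
    awk 'BEGIN { FS = "\"" } { for (i = 1; i <= NF; i++) printf "%d:%s\n", i, $i }'
}

printf '%s\n' 'abc|def|"ghi|123"|mno' | show_quote_fields
# 1:abc|def|
# 2:ghi|123
# 3:|mno
```

Only field 2 lies between a pair of quotes here, so only its pipe is removed by the gsub() call.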

  • From Ed Morton@21:1/5 to Spiros Bousbouras on Thu Nov 2 06:01:59 2023
    On 10/22/2023 4:25 AM, Spiros Bousbouras wrote:
    <snip>
    > I had only tested the above with single line inputs but it's not correct
    > for multiline inputs. Here's a correct version :
    >
    > ( IFS=\"
    >   while read a ; do
    >     result=
    >     i=1
    >     for b in $a ; do
    >       if (( i & 1 )) ; then
    >         result="$result$b"\"
    >       else
    >         result="$result${b//|}"\"
    >       fi
    >       i=$(( $i + 1 ))
    >     done
    >     echo "${result%\"}"
    >   done
    > ) < your-file
    >
    > If an input line ends with a double quote , this gets removed. If the input
    > is such that this may be a problem , the script can be modified easily enough
    > to correct for this.

    That would be orders of magnitude slower than an awk script, requires
    far more and more complicated code than an awk script, is less portable
    than an awk script, would strip backslashes so, for example, `\t` would
    become `t`, and exposes the input values to the shell for globbing and
    filename expansion so, for example, `*` would become the list of all
    files in the directory you run it from. Try it with this input to see
    what I mean:

    "foo\tbar"|"*"

    If you were going to do that in shell then you should read the line into
    an array rather than a scalar, loop over "${a[@]}", and add "-r" to read, but
    it'd still be far slower, lengthier, more complicated and less portable than
    an awk script like

    awk '
        BEGIN { FS=OFS="\"" }
        { for (i=2; i<=NF; i+=2) gsub(/[|]/,"",$i) }
        1' your-file

    so there's really no point writing a shell loop to do it.

    See https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice,
    https://mywiki.wooledge.org/BashFAQ/001, and
    https://mywiki.wooledge.org/Quotes for more information.

    Ed.
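
The backslash point is easy to check in isolation. A minimal sketch, with hypothetical helper names `read_plain` and `read_raw`, that feeds the sample line above through `read` with and without `-r`:

```shell
# read without -r treats a backslash as an escape character and drops it.
read_plain() { read a; printf '%s\n' "$a"; }

# IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
read_raw() { IFS= read -r a; printf '%s\n' "$a"; }

printf '%s\n' '"foo\tbar"|"*"' | read_plain   # prints "footbar"|"*"
printf '%s\n' '"foo\tbar"|"*"' | read_raw     # prints "foo\tbar"|"*"
```

Both helpers quote `"$a"` when printing, so the `*` is not expanded here; the globbing described above only appears when the variable is used unquoted, as in `for b in $a`.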
