• serial numbers as RS

    From raj@21:1/5 to All on Tue Jan 17 19:30:39 2023
    Hi,
    I have a file with 7 fields.
    The first field is a serial number.
    In some records the 5th field is missing.
    A few records got joined with the next record on the same line. In the sample file
    I have shown only two records joined, but in some cases three or four records are joined.
    sample file:

    1 651 643786485 107249 5190 M SMITH 1284
    2 963 212018826 103480 M746 R WADHWA 156
    3 232 215036022 105012 M743 SAMBA 337
    4 232 215036023 105012 M743 SAMBA 443
    5 054 215036704 103325 KIYA K 351 ====> 5th field is missing
    6 205 308363068 103402 5537 Mc DON 943
    7 231 343328800 105880 MANO M 6403 8 231 343329128 105880 MANO M 8324 =====> in both the records 5th field is missing
    9 309 361257222 103595 M564 C R SAM 102 10 309 361297561 103595 M564 C R SAM 332
    11 216 308659868 625402 9693 FERNAND 365

    The required output:

    1 651 643786485 107249 5190 M SMITH 1284
    2 963 212018826 103480 M746 R WADHWA 156
    3 232 215036022 105012 M743 SAMBA 337
    4 232 215036023 105012 M743 SAMBA 443
    5 054 215036704 103325 4897 KIYA K 351
    6 205 308363068 103402 5537 Mc DON 943
    7 231 343328800 105880 MANO M 6403
    8 231 343329128 105880 MANO M 8324
    9 309 361257222 103595 M564 C R SAM 102
    10 309 361297561 103595 M564 C R SAM 332

    I tried using the serial number as RS but did not get the desired result:

    awk 'BEGIN{RS="[0-9]+"}{
    print $0 RT
    }' file

    Actually I need only the first four fields (including the serial number) and the last field. It would be more helpful if the output used "," as the delimiter.

    Thank you

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Janis Papanagnou@21:1/5 to raj on Wed Jan 18 06:56:33 2023
    The content of your post is inconsistent...

    On 18.01.2023 04:30, raj wrote:
    Hi
    I have file with 7 fields.

    No. Field numbers vary. A typical value is 8.

    The first field is serial number

    No. There are gaps, or joined subsequent lines.

    In some records 5th field is missing.

    Also other fields in joined lines.

    Few records got truncated with the next record. In the sample file
    I have shown only two records truncation but in some cases even three to four records got truncated.
    sample file:

    1 651 643786485 107249 5190 M SMITH 1284
    2 963 212018826 103480 M746 R WADHWA 156
    3 232 215036022 105012 M743 SAMBA 337
    4 232 215036023 105012 M743 SAMBA 443
    5 054 215036704 103325 KIYA K 351 ====> 5th field is missing
    6 205 308363068 103402 5537 Mc DON 943
    7 231 343328800 105880 MANO M 6403 8 231 343329128 105880 MANO M 8324 =====> in both the records 5th field is missing
    9 309 361257222 103595 M564 C R SAM 102 10 309 361297561 103595 M564 C R SAM 332
    11 216 308659868 625402 9693 FERNAND 365

    The required output:

    1 651 643786485 107249 5190 M SMITH 1284
    2 963 212018826 103480 M746 R WADHWA 156
    3 232 215036022 105012 M743 SAMBA 337
    4 232 215036023 105012 M743 SAMBA 443
    5 054 215036704 103325 4897 KIYA K 351

    And where should that "4897" come from?

    6 205 308363068 103402 5537 Mc DON 943
    7 231 343328800 105880 MANO M 6403
    8 231 343329128 105880 MANO M 8324

    You want records with 7 and 8 fields mixed?

    9 309 361257222 103595 M564 C R SAM 102
    10 309 361297561 103595 M564 C R SAM 332

    I have tried by considering the serial number as RS but did not get the desired result

    awk 'BEGIN{RS="[0-9]+"}{
    print $0 RT
    }' file

    Actually I need first four fields(including serial number) and the last field.

    This does not match the "required output" above.

    If the "," delimiter is given in the output that would be more helpful.

    Thank you


    ...so fix your data sample and requirements first.

    And have a closer look at the definition of lines that have a number
    of fields that may be 14, 15, 16, and how to distinguish that data.

    And speak with the one who created that data trash to fix his process.

    Janis

  • From raj@21:1/5 to Janis Papanagnou on Wed Jan 18 06:57:33 2023
    On Wednesday, 18 January 2023 at 11:26:35 UTC+5:30, Janis Papanagnou wrote:
    [...]

    The data was copied from a PDF file and pasted into a text editor.
    The user does not have any tool or access to convert the PDF to doc or Excel.

    The problem arises when the text is copied directly from the PDF file.
    That is the reason for the inconsistency.

    awk 'BEGIN{RS="[0-9]+"}{
    print $0 RT
    }' file
    The result of the above is that the fields are broken into separate records:

    1
    651
    643786485
    107249
    5190
    M SMITH 1284

    2
    963
    212018826
    103480
    M746
    R WADHWA 156

    3
    232
    215036022
    105012
    M743
    SAMBA 337

    4
    232
    215036023
    105012
    M743
    SAMBA 443

    5
    054
    215036704
    103325
    4897
    KIYA K 351

    ....
    .....
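
One way to split the joined lines without making every number a record separator is to walk the fields of each line and start a new record only where a field equals the current serial number plus one. The following is a minimal sketch, not a definitive solution. It assumes that serial numbers increase by one within a joined line, that the second field of every record has exactly three digits, that every record has at least five fields, and that the "====>" annotations are not in the real file. The function name `split_joined` is made up for illustration. It uses only POSIX awk features, so gawk's RT is not needed.

```shell
# Hypothetical helper: split joined records, print fields 1-4 and the last field.
split_joined() {
  awk '
    BEGIN { OFS = "," }
    NF < 5 { next }          # skip blank or too-short lines
    {
      start  = 1             # index of the first field of the current record
      serial = $1            # serial number of the current record
      for (i = 2; i <= NF + 1; i++) {
        # The current record ends at the end of the line, or just before
        # a field that equals serial + 1 and is followed by a 3-digit field.
        if (i > NF || ($i == serial + 1 && $(i + 1) ~ /^[0-9][0-9][0-9]$/ && i - start >= 5)) {
          print $start, $(start + 1), $(start + 2), $(start + 3), $(i - 1)
          start  = i
          serial = $i        # empty past the end of the line, which is harmless
        }
      }
    }' "$@"
}
```

Called as `split_joined file`, this prints one comma-separated line per record. The heuristic can still misfire if a last field happens to equal the next serial number and is followed by a three-digit value, so the output should be checked against the source.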

  • From Kees Nuyt@21:1/5 to visitnag@gmail.com on Wed Jan 18 15:45:37 2023
    On Tue, 17 Jan 2023 19:30:39 -0800 (PST), raj
    <visitnag@gmail.com> wrote:


    Actually I need first four fields(including serial number) and the last field.

    The "last field" can always be addressed with $NF

    If the "," delimiter is given in the output that would be more helpful.

    Have a look at OFS or printf. Your choice.
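
The two hints above can be combined as in the following minimal sketch. It assumes one record per line with at least five whitespace-separated fields; the function name `first_four_and_last` is made up for illustration.

```shell
# Hypothetical helper: print the first four fields and the last field,
# separated by commas.
first_four_and_last() {
  awk '
    BEGIN { OFS = "," }      # OFS is placed between the arguments of print
    NF >= 5 { print $1, $2, $3, $4, $NF }
  ' "$@"
}
```

Called as `first_four_and_last file`. The printf variant would be `printf "%s,%s,%s,%s,%s\n", $1, $2, $3, $4, $NF` in place of the print statement. Neither variant handles joined records; those have to be split first.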
    --
    Kees Nuyt

  • From Janis Papanagnou@21:1/5 to raj on Wed Jan 18 16:26:54 2023
    On 18.01.2023 15:57, raj wrote:
    [...]

    The data was copy and pasted in a text editor from a pdf file.

    If all you have is a PDF, I suggest using a more sophisticated
    PDF tool to extract the text in a more accurate plain text form,
    or otherwise fixing the worst formatting issues by hand before posting.

    The user is not having any tool/access to convert the pdf to doc or excel.

    The problem is arising when it is directly copied from the pdf file.
    That is the reason for inconsistency.

    And don't forget to answer/clarify the other issues that have been
    pointed out to you.

    Janis


    [snip]
