• Musings about inspecting and processing binary data with shell

    From Janis Papanagnou@21:1/5 to All on Thu Aug 31 18:47:12 2023
    When I saw a recent post where data got extracted from a binary file
    it made me think about what would be the "right way" to do such jobs.

    Shells (as other Unix tools) have problems at least with binary '\0'.
    Kornshell supports binary data with 'typeset -b'; it stores the data
    in a MIME format. I couldn't see, though, how to _process_ the binary
    raw data within Kornshell easily. (If anyone has experiences here I'd
    certainly like to hear!)
    The 'od' tool allows displaying binary data in various formats, but
    it works on a whole data stream (not on individual fields).
    Are there any tools that support a more flexible inspection of binary
    data?

    I was thinking of some data specification and a tool to work with that
    specification and binary data files. My current experimental hack has
    a data specification of a form as shown in this example

    4 X magic (41424300)
    4 S version (31323334)
    2 - skip (55ee)
    2 D reserved (0004)
    0 X variable data (11223344)
    8 D start of header (0000000000000100)
    8 D end of header (00000000000002ff)
    4 D auth size (00000020)
    0 B auth data (...)
    4 X marker (ffffffff)
    4 X label (deadbeef)
    4 - skip (00000000)
    3 S EOT (454f54)
    1 Z illegal format

    basically defining the number of octets ("bytes") of a field, a type
    that indicates the desired interpretation and output format, and an
    informal text that describes the field (the numeric data in brackets
    here is only for my tests).
    Skipping of fields is possible (with type = '-'), and variable-length
    data can be processed (with length = 0), taking its length from a
    preceding length field. Endian'ness could be supported for numeric
    data fields. (An extension might support null-terminated data fields
    and distantly located length fields.) It would create something like

    0x41424300 magic (41424300)
    '1234' version (31323334)
    4 reserved (0004)
    0x11223344 variable data (11223344)
    256 start of header (0000000000000100)
    767 end of header (00000000000002ff)
    32 auth size (00000020)
    <0102040811121418212224284142444881828488000000000101010180808080> auth data (...)
    0xffffffff marker (ffffffff)
    0xdeadbeef label (deadbeef)
    'EOT' EOT (454f54)
    *** Error: unsupported format 'Z'! (Use X, D, B, S, or -)
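
    A minimal sketch of the spec-driven approach described above, written
    in Python rather than shell. The function name decode_fields and the
    byteorder parameter are hypothetical, and the rule that a length of 0
    reuses the value of the most recent D field is an assumption read off
    the example, not the author's actual script.

```python
import io


def decode_fields(spec_lines, data, byteorder="big"):
    """Decode bytes according to '<length> <type> <description>' spec lines.

    Types follow the post: X = hex, D = decimal, S = string, B = raw bytes,
    '-' = skip. A length of 0 means "use the value of the most recent D
    field" (an assumption about the variable-length rule).
    """
    stream = io.BytesIO(data)
    lines = []
    last_number = 0  # value of the most recently decoded D field
    for spec in spec_lines:
        parts = spec.split(None, 2)
        if len(parts) < 2:
            raise ValueError(f"malformed spec line: {spec!r}")
        length_text, ftype = parts[0], parts[1]
        desc = parts[2] if len(parts) > 2 else ""
        # Reject unknown types before consuming any data.
        if ftype not in ("X", "D", "S", "B", "-"):
            raise ValueError(
                f"unsupported format '{ftype}'! (Use X, D, B, S, or -)")
        length = int(length_text) or last_number
        raw = stream.read(length)
        if len(raw) != length:
            raise ValueError(f"unexpected end of data in field '{desc}'")
        if ftype == "X":
            lines.append(f"0x{raw.hex()} {desc}")
        elif ftype == "D":
            last_number = int.from_bytes(raw, byteorder)
            lines.append(f"{last_number} {desc}")
        elif ftype == "S":
            lines.append(f"'{raw.decode('latin-1')}' {desc}")
        elif ftype == "B":
            lines.append(f"<{raw.hex()}> {desc}")
        # ftype == "-": the bytes are consumed but nothing is reported
    return lines


if __name__ == "__main__":
    spec = [
        "4 X magic",
        "4 S version",
        "2 - skip",
        "2 D reserved",
        "0 X variable data",
    ]
    data = bytes.fromhex("41424300 31323334 55ee 0004 11223344")
    print("\n".join(decode_fields(spec, data)))
```

    Reading through io.BytesIO keeps NUL bytes intact, which is the part
    that is awkward in most shells.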

    Before I continue working on my hacked sample script I'd be interested
    to know whether such a tool with similar functionality already exists
    [in the free Linux world]; I would think this is a common task so that
    some usable tool certainly should exist but my own cursory search did
    not lead anywhere. So any hints are welcome.

    Janis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Thu Aug 31 19:04:52 2023
    On 2023-08-31, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > Before I continue working on my hacked sample script I'd be interested
    > to know whether such a tool with similar functionality already exists
    > [in the free Linux world]; I would think this is a common task so that
    > some usable tool certainly should exist but my own cursory search did
    > not lead anywhere. So any hints are welcome.

    The file utility, with /etc/magic and all, has a language for inspecting
    fields in binaries and reporting on them. Usually this is compiled in
    some way nowadays, so you don't find the source in /etc/magic; I've
    not looked into that in depth.

    Scripting languages have pack/unpack languages based on brief, usually
    single-character codes, inspired by Perl.

    FFI capabilities in languages can be used for dealing with binary
    data: instead of a pack notation in a string you declare structs
    with typed and named fields.
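
    To illustrate the two styles side by side, here is a small sketch in
    Python, using its struct module for the pack/unpack notation and ctypes
    for the declared-struct approach. The 8-byte record layout and the
    Header class are made up for the example.

```python
import ctypes
import struct

# Hypothetical big-endian record: 32-bit magic, 16-bit version, 16-bit flags.
record = bytes.fromhex("41424300 0102 0004")

# pack/unpack style: a format string of single-character codes.
# '>' = big-endian, 'I' = unsigned 32-bit, 'H' = unsigned 16-bit.
magic, version, flags = struct.unpack(">IHH", record)


# FFI style: a struct declaration with typed and named fields.
# The fields are naturally aligned, so no padding is inserted.
class Header(ctypes.BigEndianStructure):
    _fields_ = [
        ("magic", ctypes.c_uint32),
        ("version", ctypes.c_uint16),
        ("flags", ctypes.c_uint16),
    ]


hdr = Header.from_buffer_copy(record)

if __name__ == "__main__":
    print(hex(magic), version, flags)
    print(hex(hdr.magic), hdr.version, hdr.flags)
```

    Both variants read the same bytes; the struct declaration is longer
    but gives every field a name.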

    --
    TXR Programming Language: http://nongnu.org/txr
    Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
    Mastodon: @Kazinator@mstdn.ca

  • From Computer Nerd Kev@21:1/5 to Janis Papanagnou on Sat Sep 2 08:53:29 2023
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > The 'od' tool allows displaying binary data in various formats, but
    > it works on a whole data stream (not on individual fields).
    > Are there any tools that support a more flexible inspection of binary
    > data?
    >
    > I was thinking of some data specification and a tool to work with that
    > specification and binary data files. My current experimental hack has
    > a data specification of a form as shown in this example

    If I'm following you, then this sounds like a description of
    something like GNU Poke:

    http://www.jemarch.net/poke

    Not something I've had a use for myself since finding out about it
    recently, but it seems like a comprehensive solution to the
    problem.

    --
    __ __
    #_ < |\| |< _#

  • From Janis Papanagnou@21:1/5 to Computer Nerd Kev on Sat Sep 2 17:26:04 2023
    On 02.09.2023 00:53, Computer Nerd Kev wrote:
    > Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >> The 'od' tool allows displaying binary data in various formats, but
    >> it works on a whole data stream (not on individual fields).
    >> Are there any tools that support a more flexible inspection of binary
    >> data?
    >>
    >> I was thinking of some data specification and a tool to work with that
    >> specification and binary data files. My current experimental hack has
    >> a data specification of a form as shown in this example
    >
    > If I'm following you, then this sounds like a description of
    > something like GNU Poke:
    >
    > http://www.jemarch.net/poke
    >
    > Not something I've had a use for myself since finding out about it
    > recently, but it seems like a comprehensive solution to the
    > problem.

    This is really overwhelming! - Indeed it seems to cover what I was
    looking for, but yet much much more; a complete programming language
    with control constructs and exception handling, just to name one big
    part of the package. So I'm not quite decided that it's what I'd use.
    I certainly don't want to write a program[*] to extract some data,
    for my purpose the advertised declarative approach[**] would be it.
    I'll have to work through the docs to see whether some basic features
    are actually supported (e.g. I'm not sure whether simple fixed length
    strings (without \0 termination) are supported; I suppose they are,
    but some statement I read in the docs made me cautious, so I'll have
    to see). - All in all an interesting tool, so thanks for the link!

    BTW, in the poke docs I saw examples WRT endian'ness, like the spec
        little int a;
        big int b;
        int c;
    In the past I've assumed that endian'ness is a machine characteristic
    and would not change within a protocol element. The example taken from
    the poke docs suggests that there may be different elements. Of course
    we can think about different payload data in a single protocol element,
    but is that usual? - I'm coming from the ITU-T ASN.1/BER perspective,
    where the ASN.1 data spec is agnostic and endian'ness should happen
    in the encoding and decoding process for a specific source and target
    architecture. - The answer would lead either to a data spec (like in
    poke) to specify that property separately with every data element, or
    as a single parameter for the processing.

    Janis

    [*] An example can be found in the poke docs: http://www.jemarch.net/poke-3.3-manual/poke.html#elfextractor

    [**] http://www.jemarch.net/poke-3.3-manual/poke.html#Motivation

  • From Computer Nerd Kev@21:1/5 to Janis Papanagnou on Sun Sep 3 09:22:28 2023
    Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > BTW, in the poke docs I saw examples WRT endian'ness, like the spec
    >     little int a;
    >     big int b;
    >     int c;
    > In the past I've assumed that endian'ness is a machine characteristic
    > and would not change within a protocol element. The example taken from
    > the poke docs suggests that there may be different elements. Of course
    > we can think about different payload data in a single protocol element,
    > but is that usual?

    I can't speak to what's "usual" in a general sense, but one example
    that comes to mind is working on a firmware file that's intended to
    be programmed to a device by another system. It could have
    information for the programming system stored in that system's byte
    order, while the actual data to be written will use the endianness
    of the device (or whatever reads it later).
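
    A sketch of that situation in Python, with an invented layout: a
    big-endian 32-bit payload length meant for the programming system,
    followed by little-endian 16-bit words meant for the device. The
    field sizes and byte orders are assumptions chosen for illustration
    and do not describe any real firmware format.

```python
import struct

# Hypothetical image: big-endian length (4 bytes of payload),
# then two little-endian 16-bit words.
image = bytes.fromhex("00000004 3412 7856")

# Header field in the programming system's byte order (big-endian here).
(payload_len,) = struct.unpack_from(">I", image, 0)

# Payload in the device's byte order (little-endian here).
words = struct.unpack_from(f"<{payload_len // 2}H", image, 4)

if __name__ == "__main__":
    print(payload_len, [hex(w) for w in words])
```

    With such a file the byte order has to be stated per field, since a
    single global setting would misread one of the two parts.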


  • From Kaz Kylheku@21:1/5 to Janis Papanagnou on Sun Sep 3 06:25:02 2023
    On 2023-09-02, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    > BTW, in the poke docs I saw examples WRT endian'ness, like the spec
    >     little int a;
    >     big int b;

    "big endian int" is nonsensical, which detracts from the example.

    Endian specifications only make sense on exact sized types like int16,
    uint32 or int64.

    "int" is a local concept: matching this system's principal compiler's
    "int", which is referenced in the system ABI.

    If we are dealing with external data---which we must be, if we are
    concerned with byte order---that data doesn't care what our local "int"
    is. We don't want the extraction code to break with a different "int".

    Thus, if we're committing to a byte order, we should commit to the number
    of bytes which constitute that order.
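
    The same point as a short Python sketch: the helper below (read_uint
    is a hypothetical name) takes the width and the byte order together,
    so nothing depends on the local platform's idea of "int".

```python
import struct


def read_uint(buf, offset, nbytes, byteorder):
    """Read an unsigned integer of exactly nbytes bytes at offset."""
    field = buf[offset:offset + nbytes]
    if len(field) != nbytes:
        raise ValueError("buffer too short for requested field")
    return int.from_bytes(field, byteorder)


data = bytes.fromhex("00000100")

be32 = read_uint(data, 0, 4, "big")     # bytes 00 00 01 00 read as one value
le32 = read_uint(data, 0, 4, "little")  # same bytes, opposite order
be16 = read_uint(data, 2, 2, "big")     # only the last two bytes

# In Python's struct module an explicit byte-order prefix also selects
# standard sizes, so 'i' is 4 bytes here on every platform.
assert struct.calcsize("<i") == struct.calcsize(">i") == 4

if __name__ == "__main__":
    print(be32, le32, be16)
```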


  • From Janis Papanagnou@21:1/5 to Kaz Kylheku on Sun Sep 3 13:05:39 2023
    On 03.09.2023 08:25, Kaz Kylheku wrote:
    > On 2023-09-02, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
    >> BTW, in the poke docs I saw examples WRT endian'ness, like the spec
    >>     little int a;
    >>     big int b;
    >
    > "big endian int" is nonsensical, which detracts from the example.
    >
    > Endian specifications only make sense on exact sized types like int16,
    > uint32 or int64.

    I cannot speak for the 'poke' package; maybe 'int' is just a shortcut
    for the 'int' type of the concrete machine where it is running.
    Similar to the "int c" declaration (from my quote upthread) that is
    assuming (as far as I recall) some default endian'ness on the machine.

    Janis

    [...]
