• Awk: The Power and Promise of a 40-Year-Old Language

    From Ben Collver@21:1/5 to All on Mon Jan 16 17:54:25 2023
    # Awk: The Power and Promise of a 40-Year-Old Language

    By Andy Oram, 19 May, 2021

    Languages don't enjoy long lives. Very few people still code with
    the legacies of the 1970s: ML, Pascal, Scheme, Smalltalk. (The C
    language is still widely used but in significantly updated versions.)
    Bucking that trend, the 1977 Unix utility Awk can boast of a loyal
    band of users and seems poised to continue far into the future. In
    this article, I’ll explain what makes Awk special and keeps it
    relevant.

    # A Descriptive Language

    Awk runs on inputs and a script. The inputs can be files, but the
    command is often used as part of a pipeline, taking input from the
    previous command's output:

    ```
    ls | awk '/SAMPLES_[1-9][0-9]/ { ++counter }'
    ```

    The long quoted text in the above command is the script, which can be
    included on the command line or read from files. Each script
    comprises a set of conditions and actions. The condition is often a
    regular expression enclosed by slashes. The action appears as one or
    more statements between braces. If the condition matches a part of
    the input, the action is executed. Here is my trivial, one-line
    script:

    ```
    /SAMPLES_[1-9][0-9]/ { ++counter }
    ```

    The script searches for strings like SAMPLES_19 or SAMPLES_20 and
    increments a counter each time a string is found. Of course, a real
    script would use the counter in further calculations.
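
    As a minimal sketch of putting the counter to use, the script can
    gain an END action that prints the total once all input has been
    read. The shell function name count_samples and the sample file
    names are assumptions made for illustration; they are not part of
    the original command.

    ```shell
    # Count input lines that contain SAMPLES_ followed by a nonzero
    # digit and one more digit, then report the total.
    count_samples() {
        # "counter + 0" forces a numeric 0 when nothing matched.
        awk '/SAMPLES_[1-9][0-9]/ { ++counter } END { print counter + 0 }'
    }

    # Hypothetical file names standing in for the output of ls.
    printf '%s\n' SAMPLES_19 SAMPLES_20 notes.txt | count_samples
    ```

    With these three names the function prints 2.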

    This is basically how Awk operates: evaluate a condition, then take
    action when it matches. The script runs in what David Kerns, in an
    email exchange with me, called an implied loop. In his review of
    this article, Arnold Robbins, maintainer of the GNU version of Awk
    (Gawk), calls the programs data-driven.

    I see Awk as more of a declarative language than a procedural one.
    You describe what you want to happen and the conditions under which
    it happens, instead of specifying a series of sequential statements.
    Awk certainly executes statements in sequence and offers control flow
    statements (if, while), so it can serve quite well as a procedural
    language. Nelson H. F. Beebe, in his review of this article,
    mentioned writing a program with 23,981 lines of actions in just 12
    patterns.

    But overall, sequences of statements execute within a framework of
    declaring the conditions under which these things should happen. The
    concept of a declarative language has been around almost since the
    beginning of high-level programming languages and can be found in the
    popular notion of a promise, invented by Mark Burgess.

    http://markburgess.org/promises.html

    Awk documentation usually calls the condition a "pattern" because
    regular expressions are so often used as conditions. Janis
    Papanagnou, in his review of this article, explained that he has
    recommended the word "condition" instead. I realized that this word
    choice matches my own view of Awk at a high level as a descriptive
    language. Aleksey Cheusov, in email, said that Awk programs can be
    viewed as finite state machines, which declare how to move from one
    state to another.
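
    Both points can be pictured with a tiny two-state machine. This is
    only a sketch: the marker words START and STOP and the function name
    between_markers are hypothetical. Note that the last condition is a
    plain variable rather than a regular expression, which is one reason
    "condition" can be the more accurate word.

    ```shell
    # Print only the lines that fall between a START line and a STOP line.
    between_markers() {
        awk '
        /^START/ { inside = 1; next }   # enter the "inside" state
        /^STOP/  { inside = 0; next }   # return to the "outside" state
        inside   { print }              # condition is a variable, not a regex
        '
    }

    printf '%s\n' a START b c STOP d | between_markers
    ```

    For this input the function prints b and c, each on its own line.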

    Neil Ormos, in an email exchange with me, offered an interesting
    perspective on when to use Awk:

    "I'd put Awk in a special category of general-purpose programming
    languages that are especially well adapted for: (1) personal
    computing; and (2) programmer-time-efficient prototype development,
    where the prototype artifact can evolve advantageously into a
    production-worthy tool with a little incremental effort."

    Awk also maintains a delicate balance between being a line-oriented
    utility like grep and a full programming language. Normally, Awk
    just applies your script to each line of input, like grep, acting on
    what matches your condition.

    Furthermore, Awk is focused on lines divided into fields that are
    separated by white space or by any character or regular expression
    you choose. All behavior is subject to customizations—as Ed Morton
    suggested in his review, we should speak more generally of "records"
    instead of "lines"—but traditionally Awk is used on files where each
    line consists of a regular set of fields. It has proven very useful
    for parsing log files, for instance.
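
    A short sketch of field handling follows. The three-column record
    layout (user, method, status) and the function names are assumptions
    chosen for illustration, not a real log format. The first function
    relies on the default white-space splitting; the second uses the
    standard -F option to pick a different separator.

    ```shell
    # Default splitting: fields are separated by runs of white space.
    # Print the first field of every record whose third field is 200.
    ok_users() {
        awk '$3 == 200 { print $1 }'
    }

    # Same test on comma-separated records: -F sets the field separator.
    ok_users_csv() {
        awk -F, '$3 == 200 { print $1 }'
    }

    printf '%s\n' 'alice GET 200' 'bob GET 404' 'carol POST 200' | ok_users
    printf '%s\n' 'alice,GET,200' 'bob,GET,404' | ok_users_csv
    ```

    The first call prints alice and carol; the second prints alice.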

    In 1988, Brian Kernighan, one of Awk's original authors, put a set of
    bug fixes and major new features into a version released under the
    name Nawk (although he wanted it to replace the original Awk), and
    the standard version has not changed much since then.

    # It's Not Just About the Language

    Languages are part of a larger environment that often plays more of a
    role in the choice of language than its actual features. For
    instance, many people use Python because so many important libraries
    have been written for that language. Other people use a language for
    legacy reasons: they have an existing application to maintain or work
    in an organization that has historically depended on a language.

    Many of the people who responded to my outreach for this article
    focused their appreciation of Awk on factors other than language
    features. Besides being deeply embedded in many Unix scripts, Awk's
    presence is guaranteed on every Unix-style system, including
    GNU/Linux, BSD, and macOS. The utility's suitability for widespread
    use is bolstered by its ability to accomplish complex tasks without
    requiring the installation of outside libraries or packages. The
    language's behavior is also guaranteed in a POSIX standard, which
    turns out to be surprisingly important to a lot of users. However,
    many variants have added non-standard features. Gawk and mawk are in
    common use.

    https://pubs.opengroup.org/onlinepubs/009696799/utilities/awk.html

    Among people who use Awk on large projects, it's a critical part of
    their toolkit because it's fast. Michael May and Glaudiston Gomes da
    Silva told me that they had ported some Java data processing programs
    to Awk with more than ten-fold reductions in CPU and RAM consumption.
    One researcher clocked Awk on 25 TB of data with impressive results.
    Another advised Awk’s use for some tasks, along with other classic
    Unix tools, instead of Hadoop. And one of the most active sites in
    data science, Analytics Vidhya, published an article praising Awk.

    <https://livefreeordichotomize.com/2019/06/04/using_awk_and_r_to_parse_25tb/>

    <https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html>

    <https://medium.com/analytics-vidhya/use-awk-to-save-time-and-money-in-data-science-eb4ea0b7523f>

    Cheusov, in correspondence with me, provided more evidence of Awk's
    speed:

    "When I worked in computational linguistics, we often parsed
    gigabytes of text. Programs written in GNU Awk and mawk were much
    faster than equivalent programs written in Ruby, Python and Perl.
    Because AWK is so simple, its interpreter can be optimized much
    more easily than for much more complex languages."

    Awk is fast because it has stayed simple and avoided features that
    are considered necessities in other languages. It concentrates on
    what it can do well. Several correspondents told me that they
    appreciated being able to do what they wanted without downloading
    large modules as they would do for other languages.

    Computer science professor Tim Menzies, in his article "Why Gawk?",
    cited the simplicity and regularity of Awk syntax, which allows it to
    be learned quickly and to ward off overly complex code. Other
    correspondents also cited the GNU Awk debugger as a boon for Awk
    development.

    https://web.archive.org/web/20150929033218/http://awk.info/?whygawk

    https://www.gnu.org/software/gawk/manual/html_node/Debugger.html

    Last but not least, we shouldn't ignore the importance of good
    documentation. Awk documentation is easy to find on the web. The
    manual for Gawk, written by the software's maintainer, Arnold D.
    Robbins, is particularly helpful. For example, the Gawk manual
    carefully distinguishes Gawk extensions from standard features, so
    that you can avoid the extensions if you want to conform to the
    standard. I have noticed that GNU tools in general have good
    manuals, perhaps because Richard M. Stallman and his collaborators
    have always assigned a high value to documentation.

    https://www.gnu.org/software/gawk/

    # Expansion Without Bloat

    The classic Awk, as created by Alfred Aho, Peter J. Weinberger, and
    Brian Kernighan (who drew on their initials to create the name of the
    utility), was informal. It didn't make users declare variables but
    simply assumed the variables' values to be zero or null the first
    time they were used. Data types were implied. This kind of casual
    scripting was common in the 1970s, and anything more formal would
    have undermined the tool's appeal.

    Every language evolves, usually by incorporating popular features
    from other languages. The trick is to avoid throwing in features of
    little value that degrade the language by making it hard to use, slow
    to compile or run, etc. In this regard, Awk has done well. It has
    resisted modernization in the form of data declarations and objects.
    Because Awk is very different from general-purpose languages, it
    doesn't have space for callbacks, polymorphism, and other fads that
    have become central to application design in many languages. But
    some variants of Awk added functionality of real value while
    maintaining Awk's sleek performance and small footprint. Gawk, like
    many GNU utilities, has upgraded aggressively.

    Many dedicated Awk users don't strive for large programs or make use
    of extended features. Some love Awk for one-liners like the one I
    showed earlier. Like most Unix and GNU/Linux users, these casual
    adherents of Awk prefer bigger languages such as Perl (yes, still!)
    and Python for large tasks. Others, however, write large Awk
    programs with the help of its newer additions.

    Here are some of the features postdating the original 1977 release
    that users tell me are most useful. I focus on the features that
    allow Awk programs to grow large and allow programmers to reuse and
    share code.

    Two features were added fairly early to standard Awk:
    multi-dimensional arrays and user-defined functions. Recent
    computing algorithms, especially in data science, depend heavily on
    matrices and higher-dimensional arrays called tensors. So the
    addition of multi-dimensional arrays to Awk prepared it for modern
    data processing. User-defined functions provided Awk with a whole
    new level of reusability. You can call complex code from different
    statements, and share your functions with colleagues.
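
    The two features combine naturally, as in the sketch below. The
    names row_sums, row_sum, grid, and width are hypothetical. Standard
    Awk stores a subscript such as grid[row, col] as a single string key
    joined by the built-in SUBSEP variable, so these arrays are
    associative rather than contiguous matrices.

    ```shell
    # Read a table of numbers and print the sum of each row.
    row_sums() {
        awk '
        # User-defined function: sum columns 1..ncols of one row.
        # The extra parameters (c, sum) act as local variables.
        function row_sum(grid, row, ncols,    c, sum) {
            sum = 0
            for (c = 1; c <= ncols; c++)
                sum += grid[row, c]
            return sum
        }
        # Store every field under a two-part subscript (row, column).
        { for (c = 1; c <= NF; c++) grid[NR, c] = $c; width[NR] = NF }
        END {
            for (r = 1; r <= NR; r++)
                print "row " r ": " row_sum(grid, r, width[r])
        }'
    }

    printf '%s\n' '1 2 3' '4 5 6' | row_sums
    ```

    For the two sample rows this prints "row 1: 6" and "row 2: 15".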

    The other features promoting reuse and large programs are extensions
    in Gawk:

    ## Namespaces

    Once Awk offered user-defined functions, this Gawk extension allowed
    even more sharing and growth. As in C++ or Java, namespaces in Gawk
    prevent clashes between function names or other symbols defined in
    different source files or libraries.

    ## BEGINFILE and ENDFILE

    Awk provides BEGIN and END actions to let you do initial processing
    (before all files are read) and terminal processing (after all files
    are read). Gawk extends this with BEGINFILE and ENDFILE, which let
    you specify actions to take before reading or after processing each file
    in a set of multiple files.
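
    A small sketch of the per-file hooks, assuming Gawk is installed
    under the command name gawk (other Awk implementations may not
    accept these patterns). The function name lines_per_file and the
    file names in the usage comment are hypothetical.

    ```shell
    # Report the number of records in each file named on the command line.
    lines_per_file() {
        gawk '
        BEGINFILE { n = 0 }                  # runs before each file is read
        { n++ }                              # runs for every record
        ENDFILE   { print FILENAME ": " n }  # runs after each file is done
        ' "$@"
    }

    # Usage (hypothetical file names):
    #   lines_per_file access.log error.log
    ```

    Without these hooks, a script would typically compare FNR and
    FILENAME by hand to notice when one file ends and the next begins.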

    ## Two-way pipelines

    These streamline the operation of coprocesses, which allow you to
    delegate operations to a separate program and get results back. This
    form of multiprocessing has been around in other languages for quite
    a while, most notably in Go. The original form of Awk allowed
    coprocesses, but only through the cumbersome use of temporary files.
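
    The following sketch shows the shape of a two-way pipeline using
    Gawk's |& operator. It assumes Gawk is installed as gawk and that a
    sort command is available; the function name coprocess_demo and the
    sample words are hypothetical.

    ```shell
    # Send two words to a sort coprocess and read the sorted result back.
    coprocess_demo() {
        gawk 'BEGIN {
            cmd = "sort"
            print "pear"  |& cmd        # write to the coprocess
            print "apple" |& cmd
            close(cmd, "to")            # close the write end so sort sees EOF
            while ((cmd |& getline line) > 0)
                print line              # read results back from the coprocess
            close(cmd)
        }'
    }
    ```

    Calling coprocess_demo prints apple, then pear. Closing only the
    "to" end matters here: sort emits nothing until its input ends, so
    reading first would leave both programs waiting on each other.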

    ## Network programming

    This capability takes multiprocessing past the local system, using
    classic internet sockets to communicate with programs on remote
    hosts. The remote programs could be coded in any language, not just
    Awk.

    ## Arbitrary-precision arithmetic

    Like multi-dimensional arrays, this feature appeals to scientists who
    need to go beyond the limitations of conventional integers and
    floating-point numbers, constrained by microprocessor design.

    ## Plugins/extensions

    These allow intrepid programmers to extend Gawk without messing
    around in the core code.

    # Recent Examples of Awk in Action

    An article in LWN.net discusses the continued appeal of Awk along
    with some recent large projects that use it. Other projects
    mentioned by people I corresponded with include:

    https://lwn.net/Articles/820829/

    * Validation of a sports schedule, for example, ensuring that a team
    doesn't have two games at the same time, that a coach isn't
    coaching two teams at the same time, that a team isn't playing at a
    time when it isn't available, etc. This program checks about 25
    different constraints per team, on average.
    * Converting SQL data across web sites from one schema to another by
    way of exported/imported CSVs.
    * A literate programming tool. The concept of "literate programming"
    was invented by Donald Knuth, in some ways the grandfather of
    modern programming. Hints of the idea appear in modern commenting
    systems such as Javadoc.
    * An IRC client and bot.
    * Extracting bibliographies from technical journal articles.
    * runawk, a wrapper for Awk.
    * Components of the pkgsrc framework for building packages on
    Unix-like systems, including pkg_summary-utils.

    https://github.com/arnoldrobbins/texiwebjr

    http://literateprogramming.com/knuthweb.pdf

    https://github.com/cheusov/runawk

    http://www.pkgsrc.org/

    https://github.com/cheusov/pkg_summary-utils

    In conclusion, Awk can do much more than the simple line-by-line text
    processing that is usually considered its forte. The discussions and
    examples in this article show that the language still has a place in
    the 21st century.

    From: <https://www.fosslife.org/awk-power-and-promise-40-year-old-language>

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Bob Eager@21:1/5 to Ben Collver on Mon Jan 16 22:12:29 2023
    On Mon, 16 Jan 2023 17:54:25 +0000, Ben Collver wrote:

    > # Awk: The Power and Promise of a 40-Year-Old Language

    I might have used awk more if I hadn't previously learned a macro
    processor (no, m4 hardly counts).

    I have done some complicated things that others had attempted and failed
    - one was the extraction of names of WWII pilots from several hundred
    disparate web pages. I still use the macro processor for writing more
    user friendly firewall rules.

    The specific advantages it has had (not necessarily so true now) are
    arbitrary format input, variable delimiters (words or symbols),
    infinite nesting (given enough memory) and quite a lot of storage and
    decision making.

    If interested, go here:

    https://www.ml1.org.uk

    You probably want to look at the short tutorial on this page (PDF and
    HTML):

    https://www.ml1.org.uk/doc.html

    --
    Using UNIX since v6 (1975)...

    Use the BIG mirror service in the UK:
    http://www.mirrorservice.org
