• Parsing using a Graphics Processing Unit (GPU)?

    From Roger L Costello@21:1/5 to All on Mon Aug 31 10:35:53 2020
    Hi Folks,

    I am reading a book [1] on machine learning and the book says some pretty interesting things:

    "In the search for more speed, machine learning researchers started taking advantage of special hardware found in some computers, originally designed to improve graphics performance. You may have heard these called graphics cards. ... Those graphics cards contain a GPU, or graphics processing unit. Unlike a general purpose CPU, a GPU is designed to perform specific tasks, and do them well. One of those tasks is to carry out arithmetic, including matrix multiplication, in a highly parallel way. ... GPUs have many more [than CPUs] arithmetic cores, thousands are fairly common today. This means a huge
    workload can be split amongst all those cores and the job can be done
    quickly."

    Neat!

    Has the parsing community found a way to take advantage of GPUs?

    From the above excerpt, it appears that GPUs are especially good at
    arithmetic. When I think of parsing, I don't think of lots of arithmetic. Perhaps someone has devised a way to recast the parsing problem into an arithmetic problem?

    Any thoughts you might have on:

    (a) parsing-using-GPUs, and
    (b) recasting-the-parsing-problem-into-an-arithmetic-problem

    would be appreciated. /Roger

    [1] "Make Your First GAN with Pytorch" by Tariq Rashid

    [Parsing is not usually an important factor in compiler performance.
    The slow parts are the lexer, because it has to look at every
    character of the input, and some optimizations that have to analyze
    the entire intermediate form of the program. The first step in lexing
    is to identify what class each character is, e.g., identifier, white
    space, or operator. Perhaps a GPU could do vector lookups to speed
    that up. For optimizations, I can sort of imagine how some analyses
    like reachability might be expressible as matrices. -John]
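    [By way of illustration, here is a minimal C sketch of that first,
    table-lookup step; the class names and table are illustrative. Each byte
    is classified independently, which is exactly the per-element pattern a
    GPU or SIMD vector lookup handles well.

        /* Minimal sketch: table-driven character classification.
           Every byte is classified independently, so the loop is
           embarrassingly parallel (GPU/SIMD friendly). */
        #include <stdio.h>
        #include <string.h>

        enum char_class { C_OTHER, C_IDENT, C_SPACE, C_OP };

        static unsigned char class_of[256];

        static void init_classes(void) {
            for (const char *p = "abcdefghijklmnopqrstuvwxyz"
                                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                                 "0123456789_"; *p; p++)
                class_of[(unsigned char)*p] = C_IDENT;
            for (const char *p = " \t\r\n"; *p; p++)
                class_of[(unsigned char)*p] = C_SPACE;
            for (const char *p = "+-*/=<>!&|^%"; *p; p++)
                class_of[(unsigned char)*p] = C_OP;
        }

        int main(void) {
            init_classes();
            const char *src = "x1 += y*2;";
            for (size_t i = 0; i < strlen(src); i++)
                printf("'%c' -> class %d\n",
                       src[i], class_of[(unsigned char)src[i]]);
            return 0;
        }

    -John]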

  • From Christian Gollwitzer@21:1/5 to All on Tue Sep 1 09:22:11 2020
    On 31.08.20 at 12:35, Roger L Costello wrote:
    Has the parsing community found a way to take advantage of GPUs?

    I don't think that you can speed up parsing a lot using GPUs. GPUs are
    generally good at parallel processing, but a single GPU thread is usually
    slower than a CPU thread, and there is overhead because GPUs are not
    self-contained - i.e. you can usually offload only some part of a
    program, and the data needs to be uploaded to the GPU and the results
    downloaded back. GPUs also have faster memory, *if* you access it either
    in big blocks or as a serial stream, in which case the compiler can
    transform it to block access. For random accesses, the memory is slower.

    I have done some basic GPU programming, and I think that parsing is not
    a parallel task in that sense. The parser reads the input as a stream of tokens; you can't split the C file at some arbitrary point in half and
    parse both parts independently.

    Christian

  • From A. K.@21:1/5 to All on Tue Sep 1 01:25:54 2020
    On Tuesday, September 1, 2020 at 06:44:53 UTC+2, Roger L Costello wrote:
    Has the parsing community found a way to take advantage of GPUs?

    The old IT classic still holds: Early optimization is the root of all evil.

    IOW: first identify your bottlenecks by profiling your parsing routines.
    Repeat it.

    Most probably the bottlenecks will have to do with I/O speed and the
    selected algorithms; memory access speed perhaps, with raw CPU/GPU
    performance far behind.

  • From Hans-Peter Diettrich@21:1/5 to All on Tue Sep 1 19:02:21 2020
    On 01.09.2020 at 09:22, Christian Gollwitzer wrote:
    On 31.08.20 at 12:35, Roger L Costello wrote:
    Has the parsing community found a way to take advantage of GPUs?


    I have done some basic GPU programming, and I think that parsing is not
    a parallel task in that sense. The parser reads the input as a stream of tokens;

    and acts *differently* depending on that input, while a GPU performs
    essentially the *same* operation on all elements of a stream (merge,
    scale...).

    you can't split the C file at some arbitrary point in half and
    parse both parts independently.

    Multiple files can be parsed in parallel with just a multi-core CPU.
    More parallel modules require more symbol tables etc., which limits the
    number of concurrent threads by the available (cached!) memory.
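
    As a minimal sketch of that per-file parallelism (parse_file() here is a
    hypothetical stand-in for a real parser, and each thread owns its private
    symbol table so no locking is needed until the results are merged):

        /* One parser thread per source file; per-thread state avoids
           contention on shared symbol tables. */
        #include <pthread.h>
        #include <stdio.h>

        struct job {
            const char *path;
            /* a real version: per-file symbol table, AST, errors, ... */
        };

        static void *parse_file(void *arg) {
            struct job *j = arg;
            printf("parsing %s in its own thread\n", j->path);
            /* lex + parse j->path into j's private structures */
            return NULL;
        }

        int main(void) {
            struct job jobs[] = { {"a.c"}, {"b.c"}, {"c.c"} };
            pthread_t tid[3];
            for (int i = 0; i < 3; i++)
                pthread_create(&tid[i], NULL, parse_file, &jobs[i]);
            for (int i = 0; i < 3; i++)
                pthread_join(tid[i], NULL);  /* then merge per-file results */
            return 0;
        }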

    DoDi

  • From Christopher F Clark@21:1/5 to All on Wed Sep 2 01:14:06 2020
    Our esteemed moderator wrote:
    [Parsing is not usually an important factor in compiler performance.
    The slow parts are the lexer, because it has to look at every
    character of the input, and some optimizations that have to analyze
    the entire intermediate form of the program. The first step in lexing
    is to identify what class each character is, e.g., identifier, white
    space, or operator. Perhaps a GPU could do vector lookups to speed
    that up. For optimizations, I can sort of imagine how some analyses
    like reachability might be expressible as matrices. -John]

    As to parallelizing lexers, we did some experiments on speeding up
    regular expressions (mostly for Snort) while I was at Intel. Charlie
    Lasswell did experiments on starting parallel scans on different parts
    of a stream or file that showed some promise. The Hyperscan team (aka
    Sensory Networks, Geoff Langdale & co) had some interesting results on
    when you could skip ahead and guess the correct state when using
    NFAs--I believe they called it "and then a miracle happens" (following
    the famous cartoon about a mathematical proof with some missing steps
    on a whiteboard). Michela Becchi worked on predicting when states
    involved in xFA loops would terminate. Geoff had me do a mode they
    called McClellan (the HS folks had some interesting naming conventions;
    one of them was the use of US Civil War generals for major
    variations/modes), which turned NFAs into parallelized DFAs for
    Hyperscan. The regular expression engine we put into the Cave Creek
    chip did a different form of parallel DFAs, spawning them when certain
    conditions were met, and for things like the Aho-Corasick algorithm you
    can show that there tends to be a "bushy" part near the head of the
    automaton, with "skinny" parts being forked off after a small number of
    bytes--that was the key insight that led to our chip.

    So, there are definite possibilities for doing SIMD regular expression
    matching that would be suitable for GPU implementation. Geoff and I
    discussed how to approach it on numerous occasions. It just never
    quite got past the to-consider list.

    Some of this "could" also be applied to parsing, particularly in a GLR
    style parser, where you have forests. But our moderator is correct in
    saying the big boost would be in speeding up lexing.

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From Elijah Stone@21:1/5 to Roger L Costello on Tue Sep 1 20:13:27 2020
    On Mon, 31 Aug 2020, Roger L Costello wrote:

    Any thoughts you might have on:

    (a) parsing-using-GPUs, and
    (b) recasting-the-parsing-problem-into-an-arithmetic-problem

    Look up Aaron Hsu's Ph.D. thesis, _A data parallel compiler hosted on
    the GPU_ (https://scholarworks.iu.edu/dspace/handle/2022/24749). As
    John (and others) mention, I don't think the GPU is an especially
    interesting target for speeding up parsing specifically, but it may be
    a fruitful line of inquiry. If so, then that thesis is, as far as I can
    tell, the only research that has been done so far; maybe you can base
    your own experiments on it.

    --
    time flies like an arrow;
    fruit flies like a banana

  • From Jan Ziak@21:1/5 to Christian Gollwitzer on Wed Sep 2 02:13:04 2020
    On Tuesday, September 1, 2020 at 6:03:27 PM UTC+2, Christian Gollwitzer wrote:
    The parser reads the input as a stream of
    tokens; you can't split the C file at some arbitrary point in half and
    parse both parts independently.

    Of course you can split an asm/C/C++/Go/Python/Rust/etc. file at
    arbitrary points: when the state of the lexical analyzer collapses to a
    single state starting from a random file position with an arbitrary
    starting state.
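
    In other words: start lexing at the split point in every possible lexer
    state at once; once all of those runs agree on a single state, the rest
    of the chunk lexes identically no matter what actually preceded it. A
    toy C sketch of that convergence test (the three-state automaton here
    is a hypothetical stand-in for a real lexer's much larger one):

        /* Run a tiny lexer DFA from a random offset in EVERY possible
           state; report when the set of live states collapses to one.
           Toy rules: '"' toggles strings, '#' starts a line comment. */
        #include <stdio.h>

        enum { CODE, STRING, COMMENT, NSTATES };

        static int step(int s, char c) {
            switch (s) {
            case CODE:    return c == '"' ? STRING
                                          : (c == '#' ? COMMENT : CODE);
            case STRING:  return c == '"' ? CODE : STRING;
            case COMMENT: return c == '\n' ? CODE : COMMENT;
            }
            return s;
        }

        int main(void) {
            const char *text = "x = \"a#b\"\n# note\ny = 2\n";
            int state[NSTATES] = { CODE, STRING, COMMENT };
            for (int i = 0; text[i]; i++) {
                for (int s = 0; s < NSTATES; s++)
                    state[s] = step(state[s], text[i]);
                if (state[0] == state[1] && state[1] == state[2]) {
                    printf("states converge after %d characters\n", i + 1);
                    return 0;
                }
            }
            printf("no convergence within this chunk\n");
            return 0;
        }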

    -atom

  • From Roger L Costello@21:1/5 to All on Wed Sep 2 11:45:20 2020
    Look up Aaron Hsu's Ph.D thesis,

    A data parallel compiler hosted on the GPU

    (https://scholarworks.iu.edu/dspace/handle/2022/24749)

    Abstract:
    This work describes a general, scalable method for building
    data-parallel by construction tree transformations that exhibit
    simplicity, directness of expression, and high performance on both CPU
    and GPU architectures when executed on either interpreted or compiled
    platforms across a wide range of data sizes, as exemplified and
    expounded by the exposition of a complete compiler for a lexically
    scoped, functionally oriented commercial programming language. The
    entire source code to the compiler written in this method requires only
    17 lines of simple code, compared to roughly 1000 lines of equivalent
    code in the domain-specific compiler construction framework Nanopass,
    and requires no domain-specific techniques, libraries, or
    infrastructure support. It requires no sophisticated abstraction
    barriers to retain its concision and simplicity of form. The execution
    performance of the compiler scales along multiple dimensions: it
    consistently outperforms the equivalent traditional compiler by orders
    of magnitude in memory usage and run time at all data sizes, and
    achieves this performance on both interpreted and compiled platforms
    across CPU and GPU hardware using a single source code for both
    architectures and no hardware-specific annotations or code. It does not
    use any novel domain-specific inventions of technique or process, nor
    does it use any sophisticated language or platform support. Indeed, the
    source does not utilize branching, conditionals, if statements, pattern
    matching, ADTs, recursion, explicit looping, or other non-trivial
    control or dispatch, nor any specialized data models.
    ---------------------------------------------------
    Wow!

    Thanks for the reference Elijah!

    /Roger

  • From Aharon Robbins@21:1/5 to A. K. on Wed Sep 2 05:43:16 2020
    In article <20-09-003@comp.compilers>, A. K. <minforth@arcor.de> wrote:
    On Tuesday, September 1, 2020 at 06:44:53 UTC+2, Roger L Costello wrote:
    The old IT classic still holds: Early optimization is the root of all evil.

    Just to give credit where credit is due, it was Donald Knuth who said
    "Premature optimization is the root of all evil."

    IIRC it was in his famous "Structured Programming with go to
    Statements" paper.

    Arnold

  • From Christopher F Clark@21:1/5 to All on Wed Sep 2 17:57:37 2020
    To follow up on the part of the question about using GPUs to do
    arithmetic to implement DFA transitions:

    It can "easily" be done. The base ideas can be found in the Wu-Manber
    string search algorithm. You treat states as (usually one-hot) bits in
    a fixed-length (usually one-word) bit vector. Your transition function
    then simply maps bit vectors to bit vectors. You can do that as a SIMD
    instruction, so it is suitable for a GPU to execute.
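
    For concreteness, here is a minimal C sketch of that bit-vector idea in
    the classic "shift-and" formulation from the agrep line of work (the
    pattern and text are just illustrative): every automaton state is one
    bit of a word, and a single shift/OR/AND advances all states at once.

        /* Bit-parallel (shift-and) matching: state bit i means
           "pat[0..i] matched ending here". One bitwise expression is
           the whole transition function, for every state at once. */
        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        int main(void) {
            const char *pat = "while", *text = "do { } while (1);";
            size_t m = strlen(pat);          /* must be <= 64 here */

            uint64_t mask[256] = {0};  /* mask[c] bit i set iff pat[i]==c */
            for (size_t i = 0; i < m; i++)
                mask[(unsigned char)pat[i]] |= 1ULL << i;

            uint64_t state = 0;
            for (size_t j = 0; text[j]; j++) {
                state = ((state << 1) | 1ULL)
                        & mask[(unsigned char)text[j]];
                if (state & (1ULL << (m - 1)))
                    printf("match ending at offset %zu\n", j);
            }
            return 0;
        }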

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From Hans-Peter Diettrich@21:1/5 to All on Wed Sep 2 22:34:18 2020
    On 02.09.2020 at 11:13, Jan Ziak wrote:
    On Tuesday, September 1, 2020 at 6:03:27 PM UTC+2, Christian Gollwitzer wrote:
    The parser reads the input as a stream of
    tokens; you can't split the C file at some arbitrary point in half and
    parse both parts independently.

    Of course you can split an asm/C/C++/Go/Python/Rust/etc. file at arbitrary points:

    Certainly not for C/C++. The preprocessor and included files modify the
    original source; that's one of the reasons for slow C/C++ compilation.

    when the state of the lexical analyzer collapses to a single
    state starting from a random file position with an arbitrary starting
    state.

    If by "lexical analyzer" you mean the lexer, that's only a front end
    for the parser. A compiler needs more state and context information,
    like symbol tables, which cannot be built from arbitrary parts of a
    source file.

    DoDi
    [One can certainly pipeline the C preprocessor and later phases but I'm guessing that's not what you're talking about here. -John]

  • From Christopher F Clark@21:1/5 to All on Thu Sep 3 17:52:22 2020
    I hate getting overly involved in this, but not only is GPU lexing
    possible, it probably isn't that complicated.

    To make this practical, let's assume we are lexing a language like C
    and we have split our file into arbitrary chunks (say 1000 bytes) and
    are trying to process them in parallel.

    First, note that while the preprocessor can do some pretty complicated
    things to the output token stream, it doesn't have that impact at the
    lexical level, the level of turning things into "raw" tokens. You
    cannot write a macro that introduces a quote character or a comment
    delimiter. And those are your main issues in terms of lexing. Yes,
    you have to do symbol table lookup (and macro expansion) as a
    post-step, but you aren't doing that on a character-by-character
    basis.

    In fact, you basically have only a few different "states" to worry about.

    1) you are lexing normally, not in one of the special cases that follows.
    2) you are in a string, i.e. you have seen a quote character and,
    until you see the matching one, you are not tokenizing at all. There
    are two quote characters, so call them states 2a and 2b.
    3) you have seen a C comment start /* and are looking for */. Again,
    you are not tokenizing.
    4) you have seen a C++ comment start // and are looking for a newline.
    Not tokenizing.
    5) you have seen a # (preceded only with whitespace) and are looking
    for a newline. You are tokenizing, but the line is a preprocessor
    directive.
    6) you are in code that is in a #if and will be skipped for that reason.
    7) you are at the start of your buffer and the previous character might be a \

    So, given those states, assume that your chunk starts in state 1 and
    lex like that, but save the contents as a string in case you were
    actually in state 2. On a quote, keep going but swap the two
    interpretations. The same idea works for comment marks, newlines, and
    hash characters. You only have to deal with state 7 for the first
    character, and if you treat the first character as part of the
    previous chunk (one that you merely get to inspect), you can probably
    finesse that issue.

    For each chunk, you have roughly two interesting sequences of possible
    tokens, whether you were in a string or not. And you can easily patch
    those together because your left context always tells you which state
    you enter a chunk in.

    Now, while I did this hand-waving analysis for C, most languages are
    not significantly more complex at the lexical level. Basically, you
    are tokenizing, or you are collecting a string, or the text will be
    entirely thrown away, and you mostly care about the first two cases.
    Even building a string is easy (although it can use memory), so you
    really only care about the tokenizing case.

    So build a SIMD lexer using techniques based upon Wu-Manber (and
    Glushkov NFAs), break your text into chunks, and let the GPU have a
    field day.
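
    As a much-simplified illustration of the chunk scheme above, here is a
    C sketch that keeps only two of the listed states (normal code vs.
    inside a string); scan_chunk(), the chunk size, and the input are all
    illustrative, and a real version would record tokens, handle the other
    states, and run the per-chunk scans on the GPU rather than in a loop.

        /* Phase 1 (parallelizable): scan every chunk under BOTH
           possible entry states. Phase 2 (cheap, sequential): thread
           the true state through the chunks via the exit states. */
        #include <stdio.h>
        #include <string.h>

        enum { IN_CODE, IN_STRING };

        /* Scan one chunk from an assumed start state; return exit state.
           (Escapes, comments, etc. are ignored in this toy version.) */
        static int scan_chunk(const char *s, int n, int start) {
            int st = start;
            for (int i = 0; i < n; i++)
                if (s[i] == '"')
                    st = (st == IN_CODE) ? IN_STRING : IN_CODE;
            return st;
        }

        int main(void) {
            const char *src = "int x; char *s = \"ab; cd\"; int y;";
            int len = (int)strlen(src), chunk = 8;
            int nchunks = (len + chunk - 1) / chunk;
            int exit_from[2][16];   /* [assumed entry state][chunk] */

            for (int c = 0; c < nchunks; c++) {      /* parallel phase */
                int n = (c + 1) * chunk <= len ? chunk : len - c * chunk;
                exit_from[IN_CODE][c] =
                    scan_chunk(src + c * chunk, n, IN_CODE);
                exit_from[IN_STRING][c] =
                    scan_chunk(src + c * chunk, n, IN_STRING);
            }

            int st = IN_CODE;                        /* stitching phase */
            for (int c = 0; c < nchunks; c++) {
                printf("chunk %d enters in state %d\n", c, st);
                st = exit_from[st][c];
            }
            return 0;
        }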

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From gah4@21:1/5 to Roger L Costello on Wed Sep 9 14:09:20 2020
    On Monday, August 31, 2020 at 9:44:53 PM UTC-7, Roger L Costello wrote:
    Hi Folks,

    I am reading a book [1] on machine learning and the book says some pretty interesting things:

    "In the search for more speed, machine learning researchers started taking advantage of special hardware found in some computers, originally designed to improve graphics performance. You may have heard these called graphics cards.

    The usual way to use a GPU is for single-precision floating point.
    They might also be able to do fixed point of a reasonable size.

    As above, parsing is usually fast enough.

    GPUs are often used for dynamic programming, which is sometimes used
    for optimization and code generation in compilers. It might be that
    those could use a speed boost, particularly for unusual architectures
    where optimal code generation is more important.

  • From Stefan Monnier@21:1/5 to All on Tue Sep 22 18:18:15 2020
    I hate getting overly involved in this, but not only is GPU lexing
    possible, it probably isn't that complicated.

    Indeed. Extra bonus points if you do it generically by having the
    `lex` tool do that parallelization for you. I think in the general
    case the parallelization could be costly, but in practice it should
    be pretty efficient for most programming languages.


    Stefan
