• Parsing using a Graphics Processing Unit (GPU)?

    From Roger L Costello@21:1/5 to All on Mon Aug 31 10:35:53 2020
    Hi Folks,

    I am reading a book [1] on machine learning and the book says some pretty interesting things:

    "In the search for more speed, machine learning researchers started taking advantage of special hardware found in some computers, originally designed to improve graphics performance. You may have heard these called graphics cards. ... Those graphics cards contain a GPU, or graphics processing unit. Unlike a general purpose CPU, a GPU is designed to perform specific tasks, and do them well. One of those tasks is to carry out arithmetic, including matrix multiplication, in a highly parallel way. ... GPUs have many more [than CPUs] arithmetic cores, thousands are fairly common today. This means a huge
    workload can be split amongst all those cores and the job can be done
    quickly."

    Neat!

    Has the parsing community found a way to take advantage of GPUs?

    From the above excerpt, it appears that GPUs are especially good at
    arithmetic. When I think of parsing, I don't think of lots of arithmetic. Perhaps someone has devised a way to recast the parsing problem into an arithmetic problem?

    Any thoughts you might have on:

    (a) parsing-using-GPUs, and
    (b) recasting-the-parsing-problem-into-an-arithmetic-problem

    would be appreciated. /Roger

    [1] "Make Your First GAN with Pytorch" by Tariq Rashid

    [Parsing is not usually an important factor in compiler performance.
    The slow parts are the lexer, because it has to look at every
    character of the input, and some optimizations that have to analyze
    the entire intermediate form of the program. The first step in lexing
    is to identify what class each character is, e.g., identifier, white
    space, or operator. Perhaps a GPU could do vector lookups to speed
    that up. For optimizations, I can sort of imagine how some analyses
    like reachability might be expressible as matrices. -John]
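    [By way of illustration, here is a minimal C sketch of that first,
    table-lookup step; the class names and table are illustrative. Each byte
    is classified independently, which is exactly the per-element pattern a
    GPU or SIMD vector lookup handles well.

        /* Minimal sketch: table-driven character classification.
           Every byte is classified independently, so the loop is
           embarrassingly parallel (GPU/SIMD friendly). */
        #include <stdio.h>
        #include <string.h>

        enum char_class { C_OTHER, C_IDENT, C_SPACE, C_OP };

        static unsigned char class_of[256];

        static void init_classes(void) {
            for (const char *p = "abcdefghijklmnopqrstuvwxyz"
                                 "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                                 "0123456789_"; *p; p++)
                class_of[(unsigned char)*p] = C_IDENT;
            for (const char *p = " \t\r\n"; *p; p++)
                class_of[(unsigned char)*p] = C_SPACE;
            for (const char *p = "+-*/=<>!&|^%"; *p; p++)
                class_of[(unsigned char)*p] = C_OP;
        }

        int main(void) {
            init_classes();
            const char *src = "x1 += y*2;";
            for (size_t i = 0; i < strlen(src); i++)
                printf("'%c' -> class %d\n",
                       src[i], class_of[(unsigned char)src[i]]);
            return 0;
        }

    -John]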

  • From Christian Gollwitzer@21:1/5 to All on Tue Sep 1 09:22:11 2020
    On 31.08.20 at 12:35, Roger L Costello wrote:
    Has the parsing community found a way to take advantage of GPUs?

    I don't think that you can speed up parsing a lot using GPUs. GPUs are
    generally good at parallel processing, but a single GPU thread is usually
    slower than a CPU thread, and there is overhead because GPUs are not
    self-contained - i.e. you can usually offload only some part of a
    program, and the data needs to be uploaded to the GPU and the results
    downloaded back. GPUs also have faster memory, *if* you access it either
    in big blocks or as a serial stream, in which case the compiler can
    transform it to block access. For random accesses, the memory is slower.

    I have done some basic GPU programming, and I think that parsing is not
    a parallel task in that sense. The parser reads the input as a stream of tokens; you can't split the C file at some arbitrary point in half and
    parse both parts independently.

    Christian

  • From A. K.@21:1/5 to All on Tue Sep 1 01:25:54 2020
    On Tuesday, September 1, 2020 at 06:44:53 UTC+2, Roger L Costello wrote:
    Has the parsing community found a way to take advantage of GPUs?

    The old IT classic still holds: Early optimization is the root of all evil.

    IOW: first identify your bottlenecks by profiling your parsing routines.
    Repeat it.

    Most probably the bottlenecks will have to do with I/O speed and the
    selected algorithms; memory access speed perhaps, with raw CPU/GPU
    performance far behind.

  • From Hans-Peter Diettrich@21:1/5 to All on Tue Sep 1 19:02:21 2020
    On 01.09.2020 at 09:22, Christian Gollwitzer wrote:
    On 31.08.20 at 12:35, Roger L Costello wrote:
    Has the parsing community found a way to take advantage of GPUs?


    I have done some basic GPU programming, and I think that parsing is not
    a parallel task in that sense. The parser reads the input as a stream of tokens;

    and acts *differently* depending on that input, while a GPU performs
    essentially the *same* operation on all elements of a stream (merge,
    scale...).

    you can't split the C file at some arbitrary point in half and
    parse both parts independently.

    Multiple files can be parsed in parallel with just a multi-core CPU.
    More parallel modules require more symbol tables etc., which limits the
    number of concurrent threads by the available (cached!) memory.
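
    As a minimal sketch of that per-file parallelism (parse_file() here is a
    hypothetical stand-in for a real parser, and each thread owns its private
    symbol table so no locking is needed until the results are merged):

        /* One parser thread per source file; per-thread state avoids
           contention on shared symbol tables. */
        #include <pthread.h>
        #include <stdio.h>

        struct job {
            const char *path;
            /* a real version: per-file symbol table, AST, errors, ... */
        };

        static void *parse_file(void *arg) {
            struct job *j = arg;
            printf("parsing %s in its own thread\n", j->path);
            /* lex + parse j->path into j's private structures */
            return NULL;
        }

        int main(void) {
            struct job jobs[] = { {"a.c"}, {"b.c"}, {"c.c"} };
            pthread_t tid[3];
            for (int i = 0; i < 3; i++)
                pthread_create(&tid[i], NULL, parse_file, &jobs[i]);
            for (int i = 0; i < 3; i++)
                pthread_join(tid[i], NULL);  /* then merge per-file results */
            return 0;
        }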

    DoDi

  • From Christopher F Clark@21:1/5 to All on Wed Sep 2 01:14:06 2020
    Our esteemed moderator wrote:
    [Parsing is not usually an important factor in compiler performance.
    The slow parts are the lexer, because it has to look at every
    character of the input, and some optimizations that have to analyze
    the entire intermediate form of the program. The first step in lexing
    is to identify what class each character is, e.g., identifier, white
    space, or operator. Perhaps a GPU could do vector lookups to speed
    that up. For optimizations, I can sort of imagine how some analyses
    like reachability might be expressible as matrices. -John]

    As to parallelizing lexers, we did some experiments on speeding up
    regular expressions (mostly for Snort) while I was at Intel. Charlie
    Lasswell did experiments on starting parallel scans on different parts
    of a stream or file that showed some promise. The Hyperscan team (aka
    Sensory Networks, Geoff Langdale & co) had some interesting results on
    when you could skip ahead and guess the correct state when using
    NFAs--I believe they called it "and then a miracle happens" (following
    the famous cartoon about a mathematical proof with some missing steps
    on a whiteboard). Michela Becchi worked on predicting when states
    involved in xFA loops would terminate. Geoff had me do a mode they
    called McClellan (the HS folks had some interesting naming conventions;
    one of them was the use of US Civil War generals for major
    variations/modes), which turned NFAs into parallelized DFAs for
    Hyperscan. The regular expression engine we put into the Cave Creek
    chip did a different form of parallel DFAs, spawning them when certain
    conditions were met, and for things like the Aho-Corasick algorithm you
    can show that there tends to be a "bushy" part near the head of the
    automaton, with "skinny" parts being forked off after a small number of
    bytes--that was the key insight that led to our chip.

    So, there are definite possibilities for doing SIMD regular expression
    matching that would be suitable for GPU implementation. Geoff and I
    discussed how to approach it on numerous occasions. It just never
    quite got past the to-consider list.

    Some of this "could" also be applied to parsing, particularly in a GLR
    style parser, where you have forests. But our moderator is correct in
    saying the big boost would be in speeding up lexing.

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From Elijah Stone@21:1/5 to Roger L Costello on Tue Sep 1 20:13:27 2020
    On Mon, 31 Aug 2020, Roger L Costello wrote:

    Any thoughts you might have on:

    (a) parsing-using-GPUs, and
    (b) recasting-the-parsing-problem-into-an-arithmetic-problem

    Look up Aaron Hsu's Ph.D. thesis, _A data parallel compiler hosted on
    the GPU_ (https://scholarworks.iu.edu/dspace/handle/2022/24749). As
    John (and others) mention, I don't think the GPU is an especially
    interesting target for speeding up parsing specifically, but it may be
    a fruitful line of inquiry. If so, then that thesis is, as far as I can
    tell, the only research that has been done so far; maybe you can base
    your own experiments on it.

    --
    time flies like an arrow;
    fruit flies like a banana

  • From Jan Ziak@21:1/5 to Christian Gollwitzer on Wed Sep 2 02:13:04 2020
    On Tuesday, September 1, 2020 at 6:03:27 PM UTC+2, Christian Gollwitzer wrote:
    The parser reads the input as a stream of
    tokens; you can't split the C file at some arbitrary point in half and
    parse both parts independently.

    Of course you can split an asm/C/C++/Go/Python/Rust/etc. file at
    arbitrary points: when the state of the lexical analyzer collapses to a
    single state starting from a random file position with an arbitrary
    starting state.
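
    In other words: start lexing at the split point in every possible lexer
    state at once; once all of those runs agree on a single state, the rest
    of the chunk lexes identically no matter what actually preceded it. A
    toy C sketch of that convergence test (the three-state automaton here
    is a hypothetical stand-in for a real lexer's much larger one):

        /* Run a tiny lexer DFA from a random offset in EVERY possible
           state; report when the set of live states collapses to one.
           Toy rules: '"' toggles strings, '#' starts a line comment. */
        #include <stdio.h>

        enum { CODE, STRING, COMMENT, NSTATES };

        static int step(int s, char c) {
            switch (s) {
            case CODE:    return c == '"' ? STRING
                                          : (c == '#' ? COMMENT : CODE);
            case STRING:  return c == '"' ? CODE : STRING;
            case COMMENT: return c == '\n' ? CODE : COMMENT;
            }
            return s;
        }

        int main(void) {
            const char *text = "x = \"a#b\"\n# note\ny = 2\n";
            int state[NSTATES] = { CODE, STRING, COMMENT };
            for (int i = 0; text[i]; i++) {
                for (int s = 0; s < NSTATES; s++)
                    state[s] = step(state[s], text[i]);
                if (state[0] == state[1] && state[1] == state[2]) {
                    printf("states converge after %d characters\n", i + 1);
                    return 0;
                }
            }
            printf("no convergence within this chunk\n");
            return 0;
        }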

    -atom

  • From Roger L Costello@21:1/5 to All on Wed Sep 2 11:45:20 2020
    Look up Aaron Hsu's Ph.D thesis,

    A data parallel compiler hosted on the GPU

    (https://scholarworks.iu.edu/dspace/handle/2022/24749)

    Abstract:
    This work describes a general, scalable method for building
    data-parallel by construction tree transformations that exhibit
    simplicity, directness of expression, and high performance on both CPU
    and GPU architectures when executed on either interpreted or compiled
    platforms across a wide range of data sizes, as exemplified and
    expounded by the exposition of a complete compiler for a lexically
    scoped, functionally oriented commercial programming language. The
    entire source code to the compiler written in this method requires only
    17 lines of simple code, compared to roughly 1000 lines of equivalent
    code in the domain-specific compiler construction framework Nanopass,
    and requires no domain-specific techniques, libraries, or
    infrastructure support. It requires no sophisticated abstraction
    barriers to retain its concision and simplicity of form. The execution
    performance of the compiler scales along multiple dimensions: it
    consistently outperforms the equivalent traditional compiler by orders
    of magnitude in memory usage and run time at all data sizes, and
    achieves this performance on both interpreted and compiled platforms
    across CPU and GPU hardware using a single source code for both
    architectures and no hardware-specific annotations or code. It does not
    use any novel domain-specific inventions of technique or process, nor
    does it use any sophisticated language or platform support. Indeed, the
    source does not utilize branching, conditionals, if statements, pattern
    matching, ADTs, recursion, explicit looping, or other non-trivial
    control or dispatch, nor any specialized data models.
    ---------------------------------------------------
    Wow!

    Thanks for the reference Elijah!

    /Roger

  • From Aharon Robbins@21:1/5 to A. K. on Wed Sep 2 05:43:16 2020
    In article <20-09-003@comp.compilers>, A. K. <minforth@arcor.de> wrote:
    On Tuesday, September 1, 2020 at 06:44:53 UTC+2, Roger L Costello wrote:
    The old IT classic still holds: Early optimization is the root of all evil.

    Just to give credit where credit is due, it was Donald Knuth who said
    "Premature optimization is the root of all evil."

    IIRC it was in his famous "Structured Programming with go to
    Statements" paper.

    Arnold

  • From Christopher F Clark@21:1/5 to All on Wed Sep 2 17:57:37 2020
    To follow up on the part of the question about using GPUs to do
    arithmetic to implement DFA transitions:

    It can "easily" be done. The base ideas can be found in the Wu-Manber
    string search algorithm. You treat states as (usually one-hot) bits in
    a fixed-length (usually one-word) bit vector. Your transition function
    then simply maps bit vectors to bit vectors. You can do that as a SIMD
    instruction, so it is suitable for a GPU to execute.
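
    For concreteness, here is a minimal C sketch of that bit-vector idea in
    the classic "shift-and" formulation from the agrep line of work (the
    pattern and text are just illustrative): every automaton state is one
    bit of a word, and a single shift/OR/AND advances all states at once.

        /* Bit-parallel (shift-and) matching: state bit i means
           "pat[0..i] matched ending here". One bitwise expression is
           the whole transition function, for every state at once. */
        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        int main(void) {
            const char *pat = "while", *text = "do { } while (1);";
            size_t m = strlen(pat);          /* must be <= 64 here */

            uint64_t mask[256] = {0};  /* mask[c] bit i set iff pat[i]==c */
            for (size_t i = 0; i < m; i++)
                mask[(unsigned char)pat[i]] |= 1ULL << i;

            uint64_t state = 0;
            for (size_t j = 0; text[j]; j++) {
                state = ((state << 1) | 1ULL)
                        & mask[(unsigned char)text[j]];
                if (state & (1ULL << (m - 1)))
                    printf("match ending at offset %zu\n", j);
            }
            return 0;
        }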

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From Hans-Peter Diettrich@21:1/5 to All on Wed Sep 2 22:34:18 2020
    On 02.09.2020 at 11:13, Jan Ziak wrote:
    On Tuesday, September 1, 2020 at 6:03:27 PM UTC+2, Christian Gollwitzer wrote:
    The parser reads the input as a stream of
    tokens; you can't split the C file at some arbitrary point in half and
    parse both parts independently.

    Of course you can split an asm/C/C++/Go/Python/Rust/etc. file at arbitrary points:

    Certainly not for C/C++. The preprocessor and included files modify the
    original source; that's one of the reasons for slow C/C++ compilation.

    when the state of the lexical analyzer collapses to a single
    state starting from a random file position with an arbitrary starting
    state.

    If by "lexical analyzer" you mean the lexer, that's only a front end
    for the parser. A compiler needs more state and context information,
    like symbol tables, which cannot be built from arbitrary parts of a
    source file.

    DoDi
    [One can certainly pipeline the C preprocessor and later phases but I'm guessing that's not what you're talking about here. -John]

  • From Christopher F Clark@21:1/5 to All on Thu Sep 3 17:52:22 2020
    I hate getting overly involved in this, but not only is GPU lexing
    possible, it probably isn't that complicated.

    To make this practical, let's assume we are lexing a language like C
    and we have split our file into arbitrary chunks (say 1000 bytes) and
    are trying to process them in parallel.

    First, note that while the preprocessor can do some pretty complicated
    things to the output token stream, it doesn't have that impact at the
    lexical level, the level of turning things into "raw" tokens. You
    cannot write a macro that introduces a quote character or a comment
    delimiter. And those are your main issues in terms of lexing. Yes,
    you have to do symbol table lookup (and macro expansion) as a
    post-step, but you aren't doing that on a character-by-character
    basis.

    In fact, you basically have only a few different "states" to worry about.

    1) you are lexing normally, not in one of the special cases that follows.
    2) you are in a string, i.e. you have seen a quote character and,
    until you see the matching one, you are not tokenizing at all. There
    are two quote characters, so call them states 2a and 2b.
    3) you have seen a C comment start /* and are looking for */. Again,
    you are not tokenizing.
    4) you have seen a C++ comment start // and are looking for a newline.
    Not tokenizing.
    5) you have seen a # (preceded only with whitespace) and are looking
    for a newline. You are tokenizing, but the line is a preprocessor
    directive.
    6) you are in code that is in a #if and will be skipped for that reason.
    7) you are at the start of your buffer and the previous character might be a \

    So, given those states, assume that your chunk starts in state 1 and
    lex like that, but save the contents as a string in case you were
    actually in state 2. On a quote, keep going but swap the two
    interpretations. The same idea works for comment marks, newlines, and
    hash characters. You only have to deal with state 7 for the first
    character, and if you treat the first character as part of the
    previous chunk (one that you merely get to inspect), you can probably
    finesse that issue.

    For each chunk, you have roughly two interesting sequences of possible
    tokens, whether you were in a string or not. And you can easily patch
    those together because your left context always tells you which state
    you enter a chunk in.

    Now, while I did this hand-waving analysis for C, most languages are
    not significantly more complex at the lexical level. Basically, you
    are tokenizing, or you are collecting a string, or the text will be
    entirely thrown away, and you mostly care about the first two cases.
    Even building a string is easy (although it can use memory), so you
    really only care about the tokenizing case.

    So build a SIMD lexer using techniques based upon Wu-Manber (and
    Glushkov NFAs), break your text into chunks, and let the GPU have a
    field day.
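
    As a much-simplified illustration of the chunk scheme above, here is a
    C sketch that keeps only two of the listed states (normal code vs.
    inside a string); scan_chunk(), the chunk size, and the input are all
    illustrative, and a real version would record tokens, handle the other
    states, and run the per-chunk scans on the GPU rather than in a loop.

        /* Phase 1 (parallelizable): scan every chunk under BOTH
           possible entry states. Phase 2 (cheap, sequential): thread
           the true state through the chunks via the exit states. */
        #include <stdio.h>
        #include <string.h>

        enum { IN_CODE, IN_STRING };

        /* Scan one chunk from an assumed start state; return exit state.
           (Escapes, comments, etc. are ignored in this toy version.) */
        static int scan_chunk(const char *s, int n, int start) {
            int st = start;
            for (int i = 0; i < n; i++)
                if (s[i] == '"')
                    st = (st == IN_CODE) ? IN_STRING : IN_CODE;
            return st;
        }

        int main(void) {
            const char *src = "int x; char *s = \"ab; cd\"; int y;";
            int len = (int)strlen(src), chunk = 8;
            int nchunks = (len + chunk - 1) / chunk;
            int exit_from[2][16];   /* [assumed entry state][chunk] */

            for (int c = 0; c < nchunks; c++) {      /* parallel phase */
                int n = (c + 1) * chunk <= len ? chunk : len - c * chunk;
                exit_from[IN_CODE][c] =
                    scan_chunk(src + c * chunk, n, IN_CODE);
                exit_from[IN_STRING][c] =
                    scan_chunk(src + c * chunk, n, IN_STRING);
            }

            int st = IN_CODE;                        /* stitching phase */
            for (int c = 0; c < nchunks; c++) {
                printf("chunk %d enters in state %d\n", c, st);
                st = exit_from[st][c];
            }
            return 0;
        }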

    --
    ******************************************************************************
    Chris Clark                email: christopher.f.clark@compiler-resources.com
    Compiler Resources, Inc.   Web Site: http://world.std.com/~compres
    23 Bailey Rd               voice: (508) 435-5016
    Berlin, MA 01503 USA       twitter: @intel_chris
    ------------------------------------------------------------------------------

  • From gah4@21:1/5 to Roger L Costello on Wed Sep 9 14:09:20 2020
    On Monday, August 31, 2020 at 9:44:53 PM UTC-7, Roger L Costello wrote:
    Hi Folks,

    I am reading a book [1] on machine learning and the book says some pretty interesting things:

    "In the search for more speed, machine learning researchers started taking advantage of special hardware found in some computers, originally designed to improve graphics performance. You may have heard these called graphics cards.

    The usual way to use a GPU is for single-precision floating point.
    They might also be able to do fixed point of a reasonable size.

    As above, parsing is usually fast enough.

    GPUs are often used for dynamic programming, which is sometimes used
    for optimization and code generation in compilers. It might be that
    those could use a speed boost, particularly for unusual architectures
    where optimal code generation is more important.

  • From Stefan Monnier@21:1/5 to All on Tue Sep 22 18:18:15 2020
    I hate getting overly involved in this, but not only is GPU lexing
    possible, it probably isn't that complicated.

    Indeed. Extra bonus points if you do it generically by having the
    `lex` tool do that parallelization for you. I think in the general
    case the parallelization could be costly, but in practice it should
    be pretty efficient for most programming languages.


    Stefan
