Summary
We plan to implement a parallel regular expression matcher (a parallel version of GNU grep) in CUDA for GPUs.
Background
Regular expression matching has many important uses across fields. People are often faced with the challenge of finding a small chunk in a huge amount of data -- the needle in the haystack. If that data is text, regular expressions make the search far less of a hassle. For example, with a few special characters you can formulate a pattern that a web crawler might use to identify and scrape links from a web page.
Regular expression matching has inherently parallel components. For example, if the expression is represented as an NFA, multiple transitions out of a particular state can be processed by different tasks to speed matching up. The same regular expression can also run over multiple data sets in parallel. There is SIMD parallelism to be exploited (in CUDA, threads within a block) if you are willing to work hard enough for it: on each state transition, all we care about is the current state, whether the current character matches, and the next state if it does. Even when different threads are looking at different states, they all perform this same function.
Basic high-level algorithm:

  load dataset and regular expression
  compute NFA (in parallel?) from the regular expression
  LineLoop: for each line in dataset:
      for each character in line (stopping at the newline):
          run NFA (in parallel over its transitions) on the character
          if match:
              print line
              continue LineLoop   (move on to the next line)
Challenge
In terms of programming, we have four primary challenges:
- Parallelizing the generation of the NFA (we're not sure how fast the sequential version will be, or whether parallelizing it is feasible or even necessary)
- Parallelizing over NFA state transitions
- SIMD parallelization over text in a single line
- SPMD parallelization over multiple lines
Potential Problems:
A potential problem (of which there are surely many we haven't thought of yet) concerns cycles in an NFA. Our initial thought was to naively parallelize over transitions. For example, in the figure above, at state q3 two threads would be created: one would move on to q4 and the other back to q1 and continue. Both threads might then reach q3 again if the data keeps matching, and multiply again. Hence the number of threads required for a regular expression like this, when matching a specifically tailored input, could grow exponentially.
Also, our workload is a body of text. Since we aim to emulate GNU grep, we do not care about matches that span newlines; every line of the dataset is therefore independent of the others. There is high locality per line (SIMD within a line, SPMD across lines). We do need to generate the NFA before we can start searching any of the text. There is also divergent execution, since one thread may fail to match early in an expression while another almost makes it to the end. In addition, the communication-to-computation ratio is high: the computation is only around O(n) in the size of the input, but if the input fits in GPU memory we can benefit from the large bandwidth of the GDDR5 bus.
Resources
We plan to use Ken Thompson's NFA paper (1) as a guide and an article on the topic by Russ Cox (2) as a reference for this project. We will probably use Russ Cox's NFA C implementation (3) as starter code.
Goals/Deliverables
Plan to Achieve
- Implement a basic sequential regex engine supporting the metacharacters * + ? ( ) | from the starter code and the articles mentioned in the resources section
- Parallelize the regex engine and show close to linear speedup of the regex matching.
- Implement more regex primitives: character classes, positive and negative assertions, etc.
Hope to Achieve
- An application of the parallel regular expression matcher
- Near-complete implementation of Perl's regex syntax (excluding backreferences, which are not regular and cannot be represented efficiently with an NFA)
Demo
- Demonstrate our parallel regular expression matcher to the class on a large body of text. If there's time to implement an application of the matcher, show this as well.
Platform
NVIDIA GPUs (CUDA) in Gates. It makes sense to use this parallel system for implementing grep because we believe the problem contains both SPMD and SIMD parallelism. Additionally, we think this would be a good solution for those who need high-performance regular-expression pattern matching in real-world applications.
Proposed Schedule
Week 1:
- What we plan to do: Look at the existing code by Russ Cox (3) for regular expression matching using NFAs. Learn how it works (understand all the code) and get the serial version up and running. Also, look at other possible approaches to parallelizing regular expressions. Maybe branching on NFAs is not the best parallel solution possible?
- What we actually did: 4/3: nfa.c compiles and runs, begin understanding code
Week 2:
- What we plan to do: We should have the serial version up and running. Work on parallelizing this code in CUDA, initially over the dataset: different thread blocks process different lines of text. Also look at parallelizing NFA generation. Is it worth doing?
- What we actually did:
Week 3:
- What we plan to do: We have parallelized over the input data. Now we should parallelize across transitions in the NFA. Here we might face issues such as work scheduling: in the case that we have an exponential number of transitions for complex regular expressions, how do we distribute work?
- What we actually did:
Week 4:
- What we plan to do: Continue with this approach and/or find more ways to parallelize the code.
- What we actually did:
Week 5:
- What we plan to do: Try to find an application that is bottlenecked by slow string matching (maybe some sort of data mining using Twitter). Possibly work on using our new, faster regular expression matcher to speed it up, just to prove that it's awesome.
- What we actually did:
Week 6:
- What we plan to do: Buffer week. Just in case something takes longer than expected. Also, prepare for presentation of the final project.
- What we actually did:
In case we have more time: maybe package this into a real library that people could use in applications requiring high-performance regular expression matching. In any case, we will probably do this eventually (even if it's after the end of the semester).