
CUDA grep

Manish and Brandon's 15418 final project

Summary

We plan to implement a parallel regular expression matcher (a parallel version of GNU grep) in CUDA for GPUs.

Background

Regular expression matching has many important uses across different fields. People are often faced with the challenge of finding a small chunk in a huge amount of data -- the needle in the haystack. If that data is text, regular expressions make the search much less of a hassle. For example, with a few special characters you can formulate a search string that a web crawler might use to identify and scrape links from a web page.
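To make the link-scraping example concrete, here is a minimal sketch using the POSIX regex API (a stand-in for the matcher we plan to build; the `href="..."` pattern and the `first_href` helper are our own illustrative choices, and real HTML would need a proper parser):

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Copy the first href="..." attribute found in `html` into `out`.
   Returns 1 on match, 0 otherwise. */
static int first_href(const char *html, char *out, size_t outlen) {
    regex_t re;
    /* [^"]* matches everything up to the closing quote. */
    if (regcomp(&re, "href=\"[^\"]*\"", REG_EXTENDED) != 0)
        return 0;
    regmatch_t m;
    int ok = (regexec(&re, html, 1, &m, 0) == 0);
    if (ok) {
        size_t n = (size_t)(m.rm_eo - m.rm_so);
        if (n >= outlen)
            n = outlen - 1;
        memcpy(out, html + m.rm_so, n);
        out[n] = '\0';
    }
    regfree(&re);
    return ok;
}
```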

Regular expression matching has inherently parallel parts. For example, if the expression is represented as an NFA, the multiple transitions out of a particular state can be processed by different tasks to speed it up. The same regular expression can also be run over multiple data sets in parallel. There is SIMD parallelism to be exploited (in CUDA, threads within a block) if you are willing to work hard enough for it: on each transition, all a thread cares about is its current state, whether the input matches, and the next state if it does. Threads looking at different states still perform that same function.
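A minimal CPU sketch of this per-state transition step (the `Machine` layout and one-transition-per-byte simplification are our own assumptions; a real NFA allows several transitions per state, and in CUDA each iteration of the inner loop would map to one thread of a block):

```c
#include <stdbool.h>

#define MAX_STATES 64

/* next[s][c] is the state reached from s on byte c, or -1 for none. */
typedef struct {
    int next[MAX_STATES][256];
    bool accepting[MAX_STATES];
    int nstates;
} Machine;

static void machine_init(Machine *m, int nstates) {
    m->nstates = nstates;
    for (int s = 0; s < MAX_STATES; s++) {
        m->accepting[s] = false;
        for (int c = 0; c < 256; c++)
            m->next[s][c] = -1;
    }
}

/* Advance every active state on byte c. Each state runs the same
   straight-line code, which is what makes this step SIMD-friendly:
   state s would be owned by thread s of a CUDA block. */
static void step(const Machine *m, const bool *active,
                 bool *next_active, unsigned char c) {
    for (int s = 0; s < m->nstates; s++)
        next_active[s] = false;
    for (int s = 0; s < m->nstates; s++) {
        if (!active[s])
            continue;
        int t = m->next[s][c];
        if (t >= 0)
            next_active[t] = true;
    }
}
```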

Basic high-level algorithm:

Load dataset and regular expression
Compute NFA (in parallel?) from regular expression
LineLoop: for each line in dataset:
    for each character in line:
        run NFA (in parallel on splits) on character (stop at newline)
        if match:
            print line
            continue LineLoop 
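The line loop above can be sketched in C as follows (the `match_at_fn` callback is a stand-in of our own for running the NFA from one starting offset; grep reports a line if the pattern matches anywhere in it, and each starting position is a candidate for a CUDA thread):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in matcher: literal "ab" at position pos.
   The real version would run the NFA from this offset. */
typedef bool (*match_at_fn)(const char *text, size_t pos);

static bool match_literal_ab(const char *t, size_t pos) {
    return t[pos] == 'a' && t[pos + 1] == 'b';
}

/* Try the matcher at every offset of the line; stop at the first hit.
   In the CUDA version, the offsets would be tried in parallel. */
static bool line_matches(const char *line, match_at_fn match_at) {
    size_t len = strlen(line);
    for (size_t pos = 0; pos <= len; pos++)
        if (match_at(line, pos))
            return true;
    return false;
}
```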

Challenge

In terms of programming, we have 4 primary challenges

Potential Problems:

A potential problem (one of possibly many we haven’t thought of yet) concerns cycles in an NFA. Our initial thought was to naively parallelize over transitions. For example, in the figure above, at state q3 two threads would be created: one would move on to q4 and the other to q1 and continue. Notice that both threads might once again reach q3 if the data set keeps matching, and multiply again. Hence the number of threads required for a regular expression like this, when matching a specifically tailored input set, might grow exponentially.
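One standard way around this blowup, which Cox’s article describes, is to simulate the NFA over a *set* of current states rather than one thread per path: revisiting a state through a cycle merely re-marks a bit, so the working set is bounded by the number of NFA states. A minimal sketch of such a state set (the `StateSet` layout is our own):

```c
#include <stdbool.h>

#define NSTATES 8

/* Set of active NFA states; work per input byte is O(NSTATES). */
typedef struct {
    bool in[NSTATES];
    int count;
} StateSet;

static void set_clear(StateSet *s) {
    for (int i = 0; i < NSTATES; i++)
        s->in[i] = false;
    s->count = 0;
}

/* Add a state; duplicates are ignored, which is what prevents the
   exponential growth described above (reaching q3 twice costs nothing). */
static void set_add(StateSet *s, int q) {
    if (!s->in[q]) {
        s->in[q] = true;
        s->count++;
    }
}
```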

Also, our workload is a body of text. Since we aim to emulate GNU grep, we do not care about matches that span newlines, so every line of our dataset is independent of the others. There will be high locality per line (SIMD within a line, SPMD across lines). We do need to generate the NFA before we can start searching through any of the text. There is also divergent execution, since one thread may fail to match early in an expression while another almost reaches the end. In addition, there is a high communication-to-computation ratio, since the computation is only about O(n) in the size of the input -- but if our input fits in GPU memory, we can benefit from the large bandwidth of the GDDR5 bus.

Resources

We plan to use Ken Thompson’s NFA paper (1) as a guide and Russ Cox’s article on the topic (2) as a reference for this project. We will probably use Russ Cox’s NFA C implementation (3) as starter code.

Goals/Deliverables

Plan to Achieve

Hope to Achieve

Demo

Platform

Nvidia GPUs (CUDA) in Gates. It makes sense to use this parallel platform for implementing grep because we believe the problem contains both SPMD and SIMD parallelism. Additionally, we think this would be a good solution for those who need high-performance regular-expression pattern matching in real-world applications.

Proposed Schedule

Week 1:

Week 2:

Week 3:

Week 4:

Week 5:

Week 6:

If we have more time, we may package this into a real library that people could use in applications requiring high-performance regular expression matching. In any case, we will probably do this eventually (even if it’s after the end of the semester).