Fork Me on GitHub

CUDA grep

Manish and Brandon's 15418 final project

Download this project as a .zip file Download this project as a tar.gz file

Updated Time Table (what we plan to do)

HalfWeek1:

HalfWeek2:

HalfWeek3:

HalfWeek4:

HalfWeek5:

Work Completed so far

We started off by understanding Russ Cox’s code - he had a sequential regular expression matcher for CPUs. We fixed a few bugs with his code and then created a bunch of helper modules like an NFA visualizer & a test bench to help us with debugging issues later on. After this we implemented a few more features such as support for escape characters, character classes (ranges) and wildcard characters to ensure that we supported most of egrep’s features.

The next step was porting this code over to CUDA. As of right now, we have ported most of the code over. We set up a CUDA kernel that works similarly to our sequential implementation and uses a single thread.

Update as of April 23

Over the weekend we managed to get parallel matching across lines working. The results were a little surprising. When we launched a kernel with 1 block and 1 thread the time taken for matching was 0.11s. Launching it with 256 threads (SIMD) in 1 block brought that time down to 0.026s. This was surprising considering that we thought that SIMD might slow things down because of high divergence. We also observed that 0.026s was a barrier that couldn’t be breached even with a larger number of thread blocks. A little more research showed that this was the amount of time it took for the CUDA kernel to get set up as launching an empty kernel performed similarly as well. Therefore the time it takes for matching a simple NFA like the one we experimented with is negligible and is bounded by the time it takes to set up a CUDA kernel.

As a result our goal moving forward might change just a little bit. Maybe a more complicated regular expression won’t be bounded by the CUDA kernel set up time. If it is still bounded, then we might need to launch a single kernel and match multiple regular expressions within that kernel to effectively hide the overhead involved in launching a CUDA kernel.

Updates on Goals

Old Plan to Achieve

Hope to Achieve

Goals for May 10th (not including those completed)

Plan to show

Preliminary Results

(times taken with ./nfa -t -f romeojuliet.txt 'ROMEO')

CUDA Parallel

Parallel CopyNFAToDevice 0.0531 seconds
Parallel CopyStringsToDevice 0.8060 seconds
Parallel pMatch 0.1426 seconds
Parallel Total 1.0038 seconds

CUDA Sequential (1 thread, 1 threadblock)

Sequential CopyNFAToDevice = 0.1689 seconds
Sequential CopyStringsToDevice = 0.8648 seconds
Sequential pMatch = 0.1472 seconds
Sequential Total = 1.1831 seconds

Sequential CPU

Sequential Total = 0.0097 seconds

Issues and Unknowns