Saturday, June 26, 2010

"Blackhole" interpreter

Hi all,

Here are a few words about the JIT's "great speedup in compiling time" advertized on the PyPy 1.3 release (see the previous blog post). The exact meaning behind these words needs a fair bit of explanation, so here it is in case you are interested.

If you download a version of PyPy 1.3 that includes a JIT compiler, you get an executable that could be qualified as rather fat: it actually contains three interpreters. You have on the one hand the regular Python interpreter. It is here because it's not possible to JIT-compile every single piece of Python code you try to run; only the most executed loops are JIT-compiled. They are JIT-compiled with a tracing interpreter that operates one level down. This is the second interpreter. This tracing step is quite slow, but it's all right because it's only invoked on the most executed loops (on the order of 100 to 1000 times in total in a run of a Python script that takes anyway seconds or minutes to run).

So apart from the JIT compilation itself, we have two worlds in which the execution proceeds: either by regular interpretation, or by the execution of assembler code generated by the JIT compiler. And of course, we need to be able to switch from one world to the other quickly: during regular interpretation we have to detect if we already have generated assembler for this piece of code and if so, jump to it; and during execution of the assembler, when a "guard" fails, i.e. when we meet a path of execution for which we did not produce assembler, then we need to switch back to regular interpretation (or occasionally invoke the JIT compiler again).

Let us consider the cost of switching from one world to another. During regular interpretation, if we detect that we already have assembler corresponding to this Python loop, then we just jump to it instead of interpreting the Python loop. This is fairly cheap, as it involves just one fast extra check per Python loop. The reverse is harder because "guard" failures can occur at any point in time: it is possible that the bit of assembler that we already executed so far corresponds to running the first 4 Python opcodes of the loop and a half. The guard that failed just now is somewhere in the middle of interpreting that opcode -- say, multiplying these two Python objects.

It's almost impossible to just "jump" at the right place in the code of the regular interpreter -- how do you jump inside a regular function compiled in C, itself in a call chain, resuming execution of the function from somewhere in the middle?

So here is the important new bit in PyPy 1.3. Previously, what we would do is invoke the JIT compiler again in order to follow what needs to happen between the guard failure and the real end of the Python opcode. We would then throw away the trace generated, as the only purpose was to finish running the current opcode. We call this "blackhole interpretation". After the end of the Python opcode, we can jump to the regular interpreter easily.

Doing so was straightforward, but slow, in case it needs to be done very often (as in the case in some examples, but not all). In PyPy 1.3, this blackhole interpretation step has been redesigned as a time-critical component, and that's where the third interpreter comes from. It is an interpreter that works like the JIT compiler, but without the overhead of tracing (e.g. it does not need to box all values). It was designed from the ground up for the sole purpose of finishing the execution of the current Python opcode. The bytecode format that it interprets is also new, designed for that purpose, and the JIT compiler itself (the second interpreter) was adapted to it. The old bytecode format in PyPy 1.2 is gone (it was more suited for the JIT compiler, but less for blackhole interpretation).

In summary, it was a lot of changes in the most front-end-ish parts of the JIT compiler, even though it was mostly hidden changes. I hope that this longish blog post helped bring it a bit more to the light :-)

PyPy 1.3 released

Hello.

We're please to announce the release of PyPy 1.3. This release has two major improvements. First of all, we stabilized the JIT compiler since 1.2 release, answered user issues, fixed bugs, and generally improved speed.

We're also pleased to announce alpha support for loading CPython extension modules written in C. While the main purpose of this release is increased stability, this feature is in alpha stage and it is not yet suited for production environments.

Highlights of this release

  • We introduced support for CPython extension modules written in C. As of now, this support is in alpha, and it's very unlikely unaltered C extensions will work out of the box, due to missing functions or refcounting details. The support is disabled by default, so you have to do:

    import cpyext
    

    before trying to import any .so file. Also, libraries are source-compatible and not binary-compatible. That means you need to recompile binaries, using for example:

    pypy setup.py build
    

    Details may vary, depending on your build system. Make sure you include the above line at the beginning of setup.py or put it in your PYTHONSTARTUP.

    This is alpha feature. It'll likely segfault. You have been warned!

  • JIT bugfixes. A lot of bugs reported for the JIT have been fixed, and its stability greatly improved since 1.2 release.

  • Various small improvements have been added to the JIT code, as well as a great speedup of compiling time.

Cheers,
Maciej Fijalkowski, Armin Rigo, Alex Gaynor, Amaury Forgeot d'Arc and the PyPy team

Update:The correct command to build extension is "pypy setup.py build", not "python setup.py build" as it was stated before.

Tuesday, June 8, 2010

A JIT for Regular Expression Matching

This is part 2 of a series, see Part 1 for an introduction. In this post I want to describe how the JIT generator of the PyPy project can be used to turn the elegant but not particularly fast regular expression matcher from the first part into a rather fast implementation. In addition, I will show some speed measurements against various regular expression implementations.

Again, note the disclaimer: This technology could not easily be used to implement Python's re-module.

Example Expression and First Numbers

The regular expression I will use as an example in the rest of this paper is the expression (a|b)*a(a|b){20}a(a|b)*. It matches all strings that have two a with exactly 20 characters between them. This regular expression has the property that the corresponding DFA needs 2**(n+1) different states. As an input string, we use a random string (of varying lengths) that does not match the regular expression. I will give all results as number of chars matched per second. While this is not a particularly typical regular expression, it should still be possible to get some ballpark numbers for the speeds of various implementations – as we will see, the differences between implementations are huge anyway.

All the benchmarks were performed on my laptop, which has an Intel Core2 Duo P8400 processor with 2.26 GHz and 3072 KB of cache on a machine with 3GB RAM running Ubuntu Linux 10.04.

To get a feeling for the orders of magnitude involved, the CPython re module (which is implemented in C and quite optimized) can match 2'500'000 chars/s. Google's new re2 implementation still matches 550'000 chars/s. Google's implementation is slower, but their algorithm gives complexity and space guarantees similar to our implementation in the last blog post.

On the other end of the performance scale is the pure-Python code from the last blog post running on CPython. It can match only 12'200 chars/s and is thus 200 times slower than the re module.

Translating the Matcher

The code described in the last blog post is not only normal Python code, but also perfectly valid RPython code. Nothing particularly dynamic is going on in the code, thus it can be translated with PyPy's translation toolchain to C code. The resulting binary is considerably faster and can match 720'000 chars/s, 60 times faster than the untranslated version.

Another approach is to write equivalent versions of the algorithms in lower level languages. This has been done for C++ by Sebastian Fischer and for Java by Baltasar Trancón y Widemann. The algorithm is object-oriented enough to be mapped very closely to the respective languages. The C++ version is a little bit faster than the RPython version translated to C, at 750'000 chars/s. That's not very surprising, given their similarity. The Java version is more than twice as fast, with 1'920'000 chars/s. Apparently the Java JIT compiler is a lot better at optimizing the method calls in the algorithm or does some other optimizations. One reason for this could be that the Java JIT can assume that the classes it sees are all there are (and it will invalidate the generated machine code if more classes are loaded), whereas the C++ compiler needs to generate code that works even in the presence of more regular expression classes.

Generating a JIT

To get even more performance out of the RPython code, it is possible to generate a JIT for it with the help of the PyPy translation toolchain. To do this, the matching code needs to be extended somewhat by some hints that tell PyPy's JIT generator how this is to be done. The JIT generator can automatically produce a JIT compiler from an RPython interpreter of the source language. In our case, we view the regular expression matcher as an interpreter for regular expressions. Then the match function corresponds to the dispatch loop of a traditional interpreter.

Our regular expression matcher is a very peculiar interpreter. The matcher works by running exactly one loop (the one in match) as many times as the input string is long, irrespective of the "program", i.e. the particular regular expressions. In addition, within the loop there are no conditions (e.g. if statements) at all, it is just linear code. This makes it almost perfectly suited to the JIT generator, which produces a tracing JIT. A tracing JIT compiles the hot loops of a program (i.e. regular expression) and has to do extra work if there are conditions in the loop. In our case, there is exactly one loop per regular expression, without any condition.

JIT Hints

The hints that are needed for the match function of the last blog post can be seen here (the function is slightly rewritten, e.g. the JIT does only properly support a while loop as the main dispatch loop):

jitdriver = jit.JitDriver(reds=["i", "result", "s"], greens=["re"])

def match(re, s):
    if not s:
        return re.empty
    # shift a mark in from the left
    result = re.shift(s[0], 1)
    i = 1
    while i < len(s):
        jitdriver.can_enter_jit(i=i, result=result, s=s, re=re)
        jitdriver.jit_merge_point(i=i, result=result, s=s, re=re)
        # shift the internal marks around
        result = re.shift(s[i], 0)
        i += 1
    re.reset()
    return result

The jitdriver is an instance describing the data of the interpreter we are dealing with. The arguments to the constructor need to list all local variables of the dispatch loop. The local variables are classified into two classes, red ones and green ones. The green ones hold the objects that make up the program that the interpreter currently runs and which position in the program is currently being executed. In a typical bytecode interpreter, the bytecode object and the program counter would be green. In our case, the regular expression is the program, so it is green. The rest of the variables are red.

The green variables are treated specially by the JIT generator. At runtime, for a given value of the green variables, one piece of machine code will be generated. This piece of machine code can therefore assume that the value of the green variable is constant.

There are two additional hints, which are method calls on the jitdriver instance. The jit_merge_point method marks the beginning of the main interpreter loop. The can_enter_jit function marks the point where a loop in the user program can be closed, which in our case is trivial, it's just at the end of the interpreter loop (for technical reasons it is put at the beginning, because nothing must happen between the can_enter_jit and jit_merge_point invocations).

Those are the hints that the JIT generator needs to function at all. We added some additional hints, that give the JIT generator more information to work with. Those hints are immutability information, which means that certain instance fields can not be changed after the object has been constructed. Apart from the marked field, none of the fields of any of the Regex subclasses can change. For example for the Char class this is expressed in the following way:

class Char(Regex):
    _immutable_fields_ = ["c"]
    def __init__(self, c):
        ...

These hints allow the generated JIT to constant-fold reads out of the immutable fields in some situations.

Adaptions to the Original Code

In the introduction above I wrote that the code within the loop in match uses no conditions. It is indeed true that none of the _shift methods have an if statement or similar. However, there are some hidden conditions due to the fact that the and and or boolean operators are used, which are short-circuiting. Therefore the JIT-version of the code needs to be adapted to use the non-short-circuiting operators & and |.

JIT Example

To get an impression of how the generated machine code looks like, consider the regular expression (a|b)*. As regular expression objects this would be Repetition(Alternative(Char('a'), Char('b'))). The machine code in its intermediate, machine-independent form looks as follows (I have slightly cleaned it up and added comments for clarity):

# arguments of the loop
# i0 is i in the match function
# result0 is result in the match function
# s0 is s in the match function
[i0, result0, s0] # those are the arguments to the machine code
char = s0[i0] # read the character
# read the current mark:
i5 = ConstPtr(ptr_repetition).marked
i7 = char == 'a' # is the character equal to 'a'
i8 = i5 & i7
i10 = char == 'b' # is the character equal to 'b'
i11 = i5 & i10
# write new mark
ConstPtr(ptr_chara).marked = i8
i13 = i8 | i11
# write new mark
ConstPtr(ptr_charb).marked = i11
# write new mark
ConstPtr(ptr_alternative).marked = i13
# increment the index
i17 = i0 + 1
i18 = len(s0)
# write new mark
ConstPtr(ptr_repetition).marked = i13
# check that index is smaller than the length of the string
i19 = i17 < i18
if not i19:
    go back to normally running match
jump(i17, i13, s0) # start from the top again

The various ConstPtr(ptr_*) denote constant addresses of parts of the regular expression tree:

  • ptr_repetition is the Repetition
  • ptr_chara is Char('a')
  • ptr_charb is Char('b')
  • ptr_alternative is the Alternative

Essentially the machine code reads the next char out of the string, the current mark out of the Repetition and then performs some boolean operations on those, writing back the new marks. Note in particular how the generated machine code does not need to do any method calls to shift and _shift and that most field reads out of the regular expression classes have been optimized away, because the fields are immutable. Therefore the machine code does not need to deconstruct the tree of regular expression objects at all, it just knows where in memory the various parts of it are, and encodes that directly into the code.

Performance Results With JIT

With the regular expression matcher translated to C and with a generated JIT, the regular expression performance increases significantly. Our running example can match 16'500'000 chars/s, which is more than six times faster than the re module. This is not an entirely fair comparison, because the re module can give more information than just "matches" or "doesn't match", but it's still interesting to see. A more relevant comparison is that between the program with and without a JIT: Generating a JIT speeds the matcher up by more than 20 times.

Conclusion

So, what have we actually won? We translated the relatively simple and very slow regular expression matching algorithm from the last post to C and were thus able to speed it up significantly. The real win is gained by also generating a JIT for the matcher, which can be regarded as a simple interpreter. The resulting matcher is rather fast.

The lesson from these posts is not that you can or should write a practical and general regular expression module in this way – indeed, enhancing the algorithm to support more features of the re module would be a lot of work and it is also unclear what the speed results for more realistic regular expressions would be. However, it makes for a great case study of the JIT generator. It was relatively straightforward to generate a JIT for the regex matcher, and the speed results were great (Admittedly I know rather a lot about PyPy's JIT though). This approach is generalizable to many programs that are sufficiently "interpreter-like" (whatever that exactly means).

All the results that appeared at various points in this blog post can be seen here:

Implementation chars/s speedup over pure Python
Pure Python code 12'200 1
Python re module 2'500'000 205
Google's re2 implementation 550'000 45
RPython implementation translated to C 720'000 59
C++ implementation 750'000 61
Java implementation 1'920'000 157
RPython implementation with JIT 16'500'000 1352

Sources

All the source code can be found in my Subversion user directory on Codespeak.

Edit:

Armin is right (see first comment). I fixed the problem.