Chris Vest

Examining the performance of XorShift

2014-05-03 { java performance jmh }

George Marsaglia's XorShift algorithm has been my favourite pseudo-random number generator for a while now: The paper is a short and easy read, the generated entropy is good enough for my uses, it's easy to implement, and it's fast.

Here is an example implementation:

public class XorShift128 {
  // XorShift PRNG with a 2^128-1 period.
  private int x = System.identityHashCode(this);
  private int y = -938745813;
  private int z = 452465366;
  private int w = 1343246171;

  public final int nextInt() {
      int t = x^(x<<15);
      x = y; y = z; z = w;
      return w = (w^(w>>>21))^(t^(t>>>4));
  }
}

Pseudo-random number generators are like giant rings of numbers; go around the ring, and the sequence starts over. The implementation above has 2^128-1 numbers in its ring. The "2^128" is because the internal state is held in four 32 bit integers, and the "-1" is because the internal state can never be exactly zero - if it was, there would be no bits for the algorithm to shift around and XOR. Such is the nature of this particular algorithm.

Measurement

So how fast is this thing? The most obvious thing to do, is to compare it to java.util.Random. It is only logical that those in pursuit of absolute performance would confine the state of any PRNG to the thread that is going to use it, and then have a generator per thread. However, j.u.Random is implemented in a thread-safe fashion, by way of a compareAndSet() operation on the AtomicLong that holds the seed. Since Java 7, we have also had access to the ThreadLocalRandom class, which in its implementation can make the assumption that no synchronization is necessary.

This gives us a list of at least four things that we should benchmark:

Code that care a lot about the performance of their entropy sources, tend to do so because it needs a lot of numbers, and often more than one at a time. Therefore, it is also interesting to have a corresponding set of benchmarks that measure the effect of getting random numbers in batches.

The code for the benchmark is here: https://github.com/chrisvest/entropy-benchmark

I also added a variant of j.u.Random where biased locking has been disabled. I suspect it would have made a difference on Java 5, but in the end I did not bother to run the benchmark on a Java 5 JVM. I also added the j.s.SecureRandom for comparison, and a 32-bit variant of XorShift.

Benchmark results

Here are the results on Java 6:

shipilev:entropy-benchmark[master]$ java -jar target/microbenchmarks.jar -f 10 -wi 10 -i 10
...
Benchmark                                  Mode   Samples         Mean   Mean error    Units
CounterBenchmark.batch                    thrpt       100      104.332        0.321   ops/us
CounterBenchmark.single                   thrpt       100      104.335        0.241   ops/us
JavaUtilRandomBenchmark.batch             thrpt       100       92.618        0.261   ops/us
JavaUtilRandomBenchmark.single            thrpt       100       92.370        0.258   ops/us
JavaUtilRandomUnbiasedBenchmark.batch     thrpt       100       92.441        0.276   ops/us
JavaUtilRandomUnbiasedBenchmark.single    thrpt       100       92.129        0.372   ops/us
SecureRandomBenchmark.batch               thrpt       100        1.880        0.007   ops/us
SecureRandomBenchmark.single              thrpt       100        1.869        0.008   ops/us
XorShift128Benchmark.batch                thrpt       100      887.078        2.728   ops/us
XorShift128Benchmark.single               thrpt       100      362.377        0.915   ops/us
XorShift32Benchmark.batch                 thrpt       100      753.979        1.795   ops/us
XorShift32Benchmark.single                thrpt       100      370.324        0.984   ops/us
shipilev:entropy-benchmark[master]$

Java 6 does not have j.u.c.ThreadLocalRandom, so we don't get numbers for that implementation. This is one of the perks of the abstract benchmark pattern, as I like to call it: even though some benchmarks cannot even be initialised, say because they rely on a class that is not there, the other benchmarks can still run.

Here are the results on Java 7:

shipilev:entropy-benchmark[master]$ java -jar target/microbenchmarks.jar -f 10 -wi 10 -i 10
...
Benchmark                                  Mode   Samples         Mean   Mean error    Units
CounterBenchmark.batch                    thrpt       100      104.203        0.313   ops/us
CounterBenchmark.single                   thrpt       100      104.424        0.375   ops/us
JavaUtilRandomBenchmark.batch             thrpt       100       92.117        0.273   ops/us
JavaUtilRandomBenchmark.single            thrpt       100       89.992        0.446   ops/us
JavaUtilRandomUnbiasedBenchmark.batch     thrpt       100       92.254        0.243   ops/us
JavaUtilRandomUnbiasedBenchmark.single    thrpt       100       92.603        0.233   ops/us
SecureRandomBenchmark.batch               thrpt       100        1.887        0.007   ops/us
SecureRandomBenchmark.single              thrpt       100        1.883        0.007   ops/us
ThreadLocalRandomBenchmark.batch          thrpt       100      636.366        1.569   ops/us
ThreadLocalRandomBenchmark.single         thrpt       100      333.633        1.784   ops/us
XorShift128Benchmark.batch                thrpt       100      890.128        2.295   ops/us
XorShift128Benchmark.single               thrpt       100      355.688        1.273   ops/us
XorShift32Benchmark.batch                 thrpt       100      755.840        1.886   ops/us
XorShift32Benchmark.single                thrpt       100      369.797        1.155   ops/us
shipilev:entropy-benchmark[master]$

And finally Java 8:

shipilev:entropy-benchmark[master]$ java -jar target/microbenchmarks.jar -f 10 -wi 10 -i 10
...
Benchmark                                  Mode   Samples         Mean   Mean error    Units
CounterBenchmark.batch                    thrpt       100      165.406        0.527   ops/us
CounterBenchmark.single                   thrpt       100      175.406        0.603   ops/us
JavaUtilRandomBenchmark.batch             thrpt       100       92.492        0.276   ops/us
JavaUtilRandomBenchmark.single            thrpt       100       92.539        0.226   ops/us
JavaUtilRandomUnbiasedBenchmark.batch     thrpt       100       92.604        0.262   ops/us
JavaUtilRandomUnbiasedBenchmark.single    thrpt       100       92.289        0.282   ops/us
SecureRandomBenchmark.batch               thrpt       100        1.856        0.019   ops/us
SecureRandomBenchmark.single              thrpt       100        1.853        0.009   ops/us
ThreadLocalRandomBenchmark.batch          thrpt       100      536.396        2.066   ops/us
ThreadLocalRandomBenchmark.single         thrpt       100      442.882        1.485   ops/us
XorShift128Benchmark.batch                thrpt       100      889.241        5.456   ops/us
XorShift128Benchmark.single               thrpt       100      360.662        2.551   ops/us
XorShift32Benchmark.batch                 thrpt       100      753.075        4.669   ops/us
XorShift32Benchmark.single                thrpt       100      367.155        2.992   ops/us
shipilev:entropy-benchmark[master]$

So the ThreadLocalRandom in the non-batch case is only marginally slower than XorShift on Java 7, and beats it by a good margin on Java 8, while the (arguably less important) batch measurements of TLR remain within 60 to 70% of the performance of XorShift.

Based on this data, I can conclude that my habit of writing up a quick XorShift implementation whenever I need a fast source of entropy, has been outdated since the introduction of ThreadLocalRandom. Especially so if I target Java 8.

I guess I could stop here, but that would be boring. The astute reader will notice that the conclusion is fine, there are still curiosities in the data and the benchmark that have yet to be explained. For instance, why is the counter baseline faster in Java 8 while the performance of the j.u.Random stayed the same? Why, exactly, is the batch performance higher for some of the implementations? Why do I have these additions in the measurement loop? Let's take a look at each of these questions in turn.

Why is the counter so much faster than j.u.Random in Java 8?

Answering this first question is easy. The counter benchmark calls AtomicInteger.incrementAndGet() while j.u.Random is implemented with compareAndSet(). It used to be, since Java 5 through Java 7, that incrementAndGet() used compareAndSet() in underneath, but in Java 8 it was intrinsified so incrementAndGet() now compiles to a LOCK:XADD instruction on x86, which is faster than LOCK:CMPXCHG.

Why is batching faster?

Because the individual nextInt() calls all inline into the measurement loop and gets optimised together, basically. The XorShift128 implementation, for instance, does 4 field reads and 4 field writes in every nextInt() call. When the method is inlined, those field accesses only happen once for the entire batch, which then keeps them in registers. The XorShift128 implementation also does a rearranging of the field values. HotSpot is clever enough to completely optimise that away, and instead duplicates the code for the following nextInt() call in the batch such that operates on the registers as if they had been rearranged.

Why is the XorShift128 batching faster than the XorShift32 batching?

The rearranging of the code, rather than the registers, in the machine code HotSpot produces for the XorShift128 implementation, ends up enabling even more cleverness. The XorShift128 iteration starts by initialising a temporary variable: int t = x^(x<<15);, then does the field rearranging that is optimised away, and then ends with a computation and a field write: w = (w^(w>>>21))^(t^(t>>>4)); – the important thing to notice here, is that this final computation does not touch the x field! Because the nextInt() calls are inlined into the same scope and optimised together, and because the rearranging makes a nextInt() call operate on different registers than the previous call, we end up having no data dependency between the final computation of one iteration, and the initialisation of the temporary variable in the next iteration. Therefore the CPU is free to flex its Instruction Level Parallelism muscles, and begin executing the next call to nextInt() before the first call has finished. The XorShift32 implementation does allow this because there is only a single field holding the complete state of the PRNG, and therefore a call to nextInt() must complete in its entirety before the next call can begin.

On the other hand, the XorShift32 implementation only has one field to read and write, and that is why it is marginally faster on the single-invocation benchmark.

Why those additions in the benchmark?

Because Java 8 is too clever. Take a look at what happens if I remove them:

Benchmark                                        Mode   Samples         Mean   Mean error    Units
ThreadLocalRandomBenchmark.batch_wrong          thrpt        10     9504.551      113.507   ops/us
ThreadLocalRandomBenchmark.single_unadjusted    thrpt        10      474.207        9.360   ops/us

The performance of the single-invocation benchmark for j.u.c.ThreadLocalRandom improves by about 7%, but the batch performance "improves" by a factor 18! If we were to believe those numbers, they would imply that we are able to generate almost 10 random numbers every nanosecond! It is worth noting that the batch benchmark uses a batch size of 20, and an examination of the machine code (using JitWatch) that HotSpot produces for the measurement loop, reveals that it is able to deduce that the other 19 calls have no effect, and can be eliminated as dead code. Clever girl.

raptor.jpg

What about hardware based entropy sources?

Bonus question!

Intel x86_64 CPUs have since Ivy Bridge had an RDRAND instruction that provides unhindered (no kernel mode needed, for instance) access to true (as opposed to pseudo) random numbers, generated in hardware. It is very useful for when you need cryptographically secure random numbers, such as what the SecureRandom class provides. Java does not currently make use of this instruction, but it appears to be on the radar. RDRAND, however, does not make sense when you don't need the cryptographic properties it provides, and the reason is quite simple: RDRAND takes about 150 processor clock cycles to execute, and the software based PRNGs benchmarked above (other than SecureRandom) are all faster than that.