by Annanay Agarwal
My project was to incorporate polyhedral compilation into TensorFlow, the deep learning framework by Google.
The goal was to integrate Polly (the polyhedral loop optimizer in LLVM) into the LLVM pipeline used by TensorFlow's JIT compiler, XLA, which uses LLVM as a backend.
Polly's passes can be enabled by passing the flag
TF_FLAGS="xla_cpu_llvm_cl_opts"
along with the Python command.
TensorFlow relies on an external library, Eigen, for fast implementations of convolution and other operators. These Eigen implementations had to be disabled in order to get TensorFlow to hand these operators to the JIT compiler.
Another issue was with the comparisons emitted by the code generator: for a loop bound such as 0 < i < 64, it would generate code that checked (unsigned)i < 64. Polly did not have a mechanism in place to handle unsigned comparisons, so support for them was added in Polly; the relevant PR is here.
JIT compilation can be requested for a specific portion of a TensorFlow program by wrapping it in a jit_scope:
jit_scope = tf.contrib.compiler.jit.experimental_jit_scope
with jit_scope():
    # Code that should be JIT'ted
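A minimal, self-contained sketch of using such a scope (the placeholder shape, weights, and ops below are my own example against the TF 1.x API, not the project's code):
import tensorflow as tf

jit_scope = tf.contrib.compiler.jit.experimental_jit_scope
img = tf.placeholder(tf.float32, shape=[None, 28, 28, 1])
weights = tf.Variable(tf.random_normal([5, 5, 1, 32]))
with jit_scope():
    # ops created inside the scope are marked for XLA JIT compilation
    conv = tf.nn.conv2d(img, weights, strides=[1, 1, 1, 1], padding='SAME')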
Polly has successfully been integrated into TensorFlow, and the official upstreaming is underway. The PR can be found here. Work is underway to make Polly an optional pass in the TensorFlow pipeline, as seen in the discussions. I am also planning to work on another convolution optimization: convolution can essentially be reduced to matrix multiplication, the difference being that in a conventional matmul both matrices are of comparable dimensions, whereas here one matrix (the filter) is usually much smaller than the other (the image). Hence, the filter matrix can always be kept in main memory, which is not the case for a conventional matmul.
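To make the reduction concrete, here is a minimal NumPy sketch of the im2col idea (the shapes, stride-1 valid convolution, and function name are my own assumptions for illustration):
import numpy as np

def conv2d_as_matmul(image, filters):
    # image: (H, W, C), filters: (KH, KW, C, F) - shapes assumed for this sketch
    KH, KW, C, F = filters.shape
    H, W, _ = image.shape
    OH, OW = H - KH + 1, W - KW + 1
    # im2col: copy every KH x KW x C patch of the image into one row of a big matrix
    patches = np.empty((OH * OW, KH * KW * C))
    for y in range(OH):
        for x in range(OW):
            patches[y * OW + x] = image[y:y+KH, x:x+KW, :].reshape(-1)
    # the reshaped filter matrix is small and can stay resident,
    # while the patch matrix plays the role of the large operand
    weights = filters.reshape(KH * KW * C, F)
    return (patches @ weights).reshape(OH, OW, F)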
Tests were run on a machine with the following stats:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 44
Stepping: 2
CPU MHz: 1600.000
BogoMIPS: 6131.56
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
The convolution operator was applied to a number of images, as in the following pseudo-code:
# a 5x5 weight - 32 such filters
filters = tf.Variable(tf.random_normal([5, 5, 1, 32]))
# Do the process for N images
for i in range(N):
    # pick a random batch of images from the MNIST training set
    img, _ = mnist.train.next_batch(batch_size)
    img = tf.reshape(img, [batch_size, 28, 28, 1])
    # apply the convolution operation (strides and padding chosen for illustration)
    conv = tf.nn.conv2d(img, filters, strides=[1, 1, 1, 1], padding='SAME')
The three numbers show the non-JIT version (Eigen implementations), JIT without Polly, and JIT with Polly.
Due to some last-minute issues, this case could not be benchmarked.
With an exact dependence analysis and a good scheduling algorithm, Polly can optimize the loops in a program better than the regular LLVM optimization passes. This is the result of its polyhedral analysis passes, which model the iteration space as the set of integer points of polyhedra in N-dimensional space (where N is the depth of the loop nest). Dependence analysis and transformations are then performed with the help of the thread-safe Integer Set Library (isl), and an optimized schedule for running the program is derived.
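As a toy illustration of this modelling (my own example, not code from Polly), the iteration space of a triangular two-deep loop nest is exactly the set of integer points inside a 2-dimensional polyhedron:
# The iteration space of the nest
#   for i in range(N):
#       for j in range(i, M):
#           ...
# is the set of integer points with 0 <= i < N and i <= j < M; isl would
# describe it roughly as [N, M] -> { S[i, j] : 0 <= i < N and i <= j < M }.
N, M = 4, 6
domain = sorted((i, j) for i in range(N) for j in range(i, M))
print(domain)  # the integer points of the polyhedron for N=4, M=6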
Polly also has a specific optimization for the matrix multiplication operation. It enables special tiling and other data-locality optimizations that improve the performance of matrix multiplication manyfold.
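A much-simplified sketch of the tiling idea follows (the tile size and NumPy formulation are my own; Polly's actual matmul optimization additionally packs the operands and derives tile sizes from the target's cache hierarchy):
import numpy as np

def tiled_matmul(A, B, tile=64):
    # C = A @ B computed block by block, so each small block of A and B
    # is reused while it is still hot in cache
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for ii in range(0, n, tile):
        for kk in range(0, k, tile):
            for jj in range(0, m, tile):
                C[ii:ii+tile, jj:jj+tile] += (
                    A[ii:ii+tile, kk:kk+tile] @ B[kk:kk+tile, jj:jj+tile]
                )
    return C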
Polly as an optimization pass in TensorFlow is a work in progress. When a complete pass of Convolution + Bias Addition + Softmax is sent through the JIT compiler, Polly is unable to fuse some operations into one loop, even though this is theoretically possible (a toy sketch of what such fusion looks like follows this paragraph). Fixing this requires refactoring on both the Polly side and the TensorFlow side. On the TensorFlow side, XLA (and hence Polly) should be given a broader view and be allowed to handle entire graphs instead of the subgraphs that are currently passed to it for optimization; Polly's scheduling algorithm works better when given one large piece of code to optimize rather than running the (costly) scheduling algorithm multiple times on small chunks of the program. On the Polly side, there are scalability issues: Polly bails out while handling large SCoPs, which is exactly what a complete pass, or a complex neural network such as an RNN, produces during SCoP detection. Compilation time also goes up significantly when Polly's passes are added to the pipeline, which may be a concern for applications where compilation time matters. However, the improvements in execution time outweigh the increase in compilation time.
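The sketch below shows what loop fusion means on a toy bias-add + scale pair (my own NumPy example; the real operations and the IR Polly sees are different, but the locality argument is the same):
import numpy as np

# Unfused: two full traversals of the data.
def bias_add_then_scale(x, b, s):
    y = x + b        # loop 1 over every element
    return y * s     # loop 2 over every element

# Fused: one traversal, each element is touched once while it is in cache.
# Assumes b has the same shape as x and s is a scalar.
def bias_add_then_scale_fused(x, b, s):
    out = np.empty_like(x)
    for i in range(x.size):
        out.flat[i] = (x.flat[i] + b.flat[i]) * s
    return out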
Polly's passes for SCoP detection are also expensive in terms of memory usage at compile time, since loading the information of a large SCoP, such as that of an RNN, is costly. Below is an image of the memory usage on my local machine. It is quite clear that these numbers could blow up if a complete pass were added to the neural network pipeline.
All this aside, the potential of Polly to bring speedups to training programs is now clear. Training runs for neural networks can last days, and Polly's transformations can have a huge impact on such programs, reducing their execution time to a fraction.
The TensorFlow runtime relies on a cost model to decide which parts of the program should be JIT compiled. This cost modelling should be left to Polly, which is also capable of generating GPGPU code. It would be really useful to have Polly as a backend in TensorFlow: this would take a lot of the scheduling responsibility away from the JIT compiler (XLA), since the mechanism and framework are already implemented in Polly.