by Sahil Yerawar
My project was to analyze why the Polly loop optimizer fails to analyze Chapel's loops and, if successful, to also extend code generation to GPU architectures in addition to CPU codegen.
The plan: analyze the cause of Polly's failure on the Chapel framework using some simple examples, and once Polly recognizes Chapel's loops, check for proper code generation on both architectures.
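For concreteness, here are minimal Chapel sketches of the two kernels discussed below; the array names and sizes are illustrative, not taken from the actual test programs:

```chapel
// 2-D array initialization followed by a naive matrix multiplication:
// simple, affine loop nests of the kind Polly should be able to model
config const n = 1000;
var A, B, C: [1..n, 1..n] real;

for (i, j) in {1..n, 1..n} {
  A[i, j] = i + j;   // 2-D initialization
  B[i, j] = i - j;
}

for i in 1..n do
  for j in 1..n do
    for k in 1..n do
      C[i, j] += A[i, k] * B[k, j];   // matrix multiplication
```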
We now move on to the 2-D array initialization and matrix multiplication problems. Here, Polly failed to recognize the SCoPs (Static Control Parts) of Chapel's loops. With the help of -polly-invariant-load-hoisting, however, SCoPs were generated properly, and -polly-codegen also generated Polly-specific code. These SCoPs were discovered only under -polly-process-unprofitable, i.e. Polly itself judged them unprofitable to optimize; in the absence of this flag, no SCoPs were detected.
-polly-detect was emitting error remarks related to aliasing and non-affine accesses. We tried to analyze the cause of this failure and stumbled across a rather interesting bug in Polly. Suppose there is a set of instructions chained to each other in the form of alternating GEPs and load instructions. If the starting instruction is loop-invariant, then all the instructions should be collectively invariant. For example:

```llvm
; <label>:51:                                    ; preds = %50, %51
  %.02 = phi i64 [ 1, %50 ], [ %84, %51 ], !dbg !14
  ; %52/%53 load a field of the (loop-invariant) array descriptor
  %52 = getelementptr inbounds %_array_DefaultRectangularArr_2_int64_t_F__real64_int64_t, %_array_DefaultRectangularArr_2_int64_t_F__real64_int64_t* %2, i64 0, i32 1, !dbg !15
  %53 = load %chpl_DefaultRectangularArr_2_int64_t_F__real64_int64_t_object*, %chpl_DefaultRectangularArr_2_int64_t_F__real64_int64_t_object** %52, align 8, !dbg !15, !tbaa !16
  %54 = getelementptr inbounds %chpl_DefaultRectangularArr_2_int64_t_F__real64_int64_t_object, %chpl_DefaultRectangularArr_2_int64_t_F__real64_int64_t_object* %53, i64 0, i32 3, i64 0, !dbg !15
  %55 = load i64, i64* %54, align 8, !dbg !15, !tbaa !28
  %56 = mul nsw i64 %55, %.0, !dbg !15
  %57 = add nsw i64 %56, %.01, !dbg !15
  ; %58/%59 fetch the data pointer through the same invariant descriptor
  %58 = getelementptr inbounds %chpl_DefaultRectangularArr_2_int64_t_F__real64_int64_t_object, %chpl_DefaultRectangularArr_2_int64_t_F__real64_int64_t_object* %53, i64 0, i32 8, !dbg !15
  %59 = load double*, double** %58, align 8, !dbg !15, !tbaa !29
  %60 = getelementptr inbounds double, double* %59, i64 %57, !dbg !15
  %61 = load double, double* %60, align 8, !dbg !15, !tbaa !42
```
In this loop block, %52 is considered loop-invariant, but %53, %54, %55, %58, and %59 should also be loop-invariant, as each relies on a previous instruction that is inductively loop-invariant. This type of invariance detection was not initially present in Polly. This Phabricator review (link) introduces detection of such collectively invariant loads.
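For reference, a hypothetical minimal Chapel reproducer of this pattern; each access A[i, j] is compiled into exactly such a chain of GEPs and loads through the DefaultRectangularArr descriptor:

```chapel
// the descriptor for A (%2 above) is loop-invariant, so the whole chain
// of GEPs and loads that fetches its block size and data pointer should
// be hoisted collectively out of the loop
var A: [0..9, 0..9] real;
for i in 0..9 do
  for j in 0..9 do
    A[i, j] = 1.0;
```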
The collective-invariant-loads approach, however, could not handle arrays with custom domains (for eg: Arr[1..4][2..6]). It also didn't handle cases where the iteration domains were not 0-based. Thus alternative methods were discussed which could handle these generalizations effectively. Initially there was a consensus to pass information like array dimension sizes and index subscripts through metadata to Polly, but the idea was fragile in that metadata is prone to being dropped or changed by certain optimization passes in LLVM. Hence we resorted to developing an intrinsic-like function responsible for computing the final index used for an array access. The main purpose of this indexing function is that Polly can pick the array dimension sizes and index subscripts straight from the function's arguments and use this information to generate code for both CPUs and GPGPUs. This idea was supported because similar work already existed, albeit for the Fortran language.
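As a sketch of what such an indexing function encapsulates (the name and argument order here are illustrative assumptions, not the final signature): for a row-major 2-D array it computes the linearized element index from the dimension sizes and subscripts, so Polly can read both straight off the call's arguments:

```chapel
// hypothetical stand-in for the intrinsic-like indexing function:
// given the size of the second dimension and the two subscripts,
// it returns the linearized index into the array's data buffer
proc arrayIndex2D(sizeDim2: int, i: int, j: int): int {
  return i * sizeDim2 + j;
}
```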
Initially, some backporting issues had to be fixed which were causing irregular assertion failures in both Polly and isl code. After fixing those nitpicks, some changes related to getDataIndex() in DefaultRectangular.chpl had to be made in order to simplify the generated SCoP. At this point, Polly was properly detecting Chapel's loops as SCoPs.
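For context, a rough sketch (not the actual Chapel source) of the shape of the computation getDataIndex() performs for a 2-D DefaultRectangular array; the field and parameter names are simplified assumptions:

```chapel
// linearize a 2-D index as a sum of per-dimension block sizes
// (strides) times subscripts, then factor the precomputed offset out
proc getDataIndexSketch(ind: 2*int, blk: 2*int, factoredOffs: int): int {
  var sum = 0;
  for d in 0..1 do
    sum += ind[d] * blk[d];
  return sum - factoredOffs;
}
```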
Moving on to the codegen phase, we aimed for both -polly-codegen and -polly-codegen-ppcg (using the NVPTX backend). I had to update the normal polly-codegen patch from the Phabricator review, which resolved normal code generation. For PPCG codegen, we first had to teach Polly that polly_array_index is one of the functions that may appear inside a SCoP without invalidating it (calls to unknown functions are normally disallowed during SCoP detection).
Next, the kernel's parameters were not being realigned correctly. While this could be patched in realignParams() as a temporary measure, the root cause was that another function, getFortranArrayIds(), which was populating the parameters, considered only the outermost dimension rather than all the dimensions. With that change, PPCG code was getting generated. The definition of polly_array_index() also had to be emitted from Polly so that the kernel has access to it while running on the GPU. There were some array allocation issues in terms of size: we were passing pre-computed coefficients of the indices in the index expressions rather than individual array sizes, which Polly misinterpreted. So, as a temporary measure, I also passed the individual dimension sizes to Polly through the arguments of the indexing function. We then eliminated the block values (the pre-computed coefficients of the indices) from the arguments of polly_array_index() to minimize the number of parameters passed, making Polly compute these block values from the obtained dimension sizes instead. On the Chapel front, the necessary changes were implemented in the polly_array_index() definition; this also involved introducing a sizesPerDim field in the DefaultRectangularArr class (the PR for polly_array_index is Here). The next hurdle was arrays with custom domains, where Chapel folds the per-dimension offsets into a pre-computed factoredOffs value (and may shift the data pointer early, under the -searlyShiftData config parameter). The general strategy was to discard the usage of factoredOffs and instead subtract the offset values from their corresponding indices. However, with this method we stumbled upon an unknown anomaly in Chapel: integer tuples passed as arguments were behaving as call-by-reference when they should behave as call-by-value. The relevant PR to resolve this is Here. Ultimately, arrays with custom domains are also detected by Polly, and proper Polly-optimized CPU and GPU code is generated.
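To make the offset strategy concrete, a small sketch assuming a custom domain with per-dimension offsets (the numbers are illustrative):

```chapel
// Arr has lower bounds 1 and 2, so instead of folding the offsets into
// a precomputed factoredOffs value, the access Arr[i, j] linearizes as
// (i - 1) * 5 + (j - 2), subtracting each offset from its own index
var Arr: [1..4, 2..6] real;   // 4 x 5 elements
for (i, j) in {1..4, 2..6} do
  Arr[i, j] = i * j;
```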
The Phabricator review which handles the indexing-function method can be found Here. Until now, the CPU and GPU code was generated under the assumption that the run-time checks (RTCs) always evaluate to true. The problem with the RTC checks lay in improper evaluation of the invalid context of the SCoP. After further debugging, it was found that the constants used in the invalid context for comparison required 64 bits. Polly currently considers any integer constant with a bit width strictly greater than 63 bits to be a LargeInt, yet the commit which introduced this behaviour didn't insist on 64 bits being considered too large. A PR (link is Here) has been created to treat integers of 64-bit width as within the limits.
Integrating the Pipeline
To integrate the pipeline, polly_array_index() was enabled to append the number of dimensions of the accessed array. The PR can be found Here. Right now, we assume that the loop to be optimized is present in a function called test_polly(), which Polly selectively optimizes, generating Polly-optimized code for it.
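A minimal sketch of that convention; only the function name test_polly comes from the text, and the loop body is purely illustrative:

```chapel
// Polly is pointed only at test_polly(); the rest of the program
// is compiled as usual
proc test_polly(C: [] real, A: [] real, B: [] real, n: int) {
  for i in 1..n do
    for j in 1..n do
      for k in 1..n do
        C[i, j] += A[i, k] * B[k, j];
}
```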
On the Chapel front, the indexing function has been successfully implemented in the Chapel compiler; the official upstreaming is in progress. On the Polly side, the feature of analyzing the polly_array_index() call works in a local version of Polly, but it can be optimized further, so this approach needs more work. The integration of Chapel and Polly is also in progress: with the help of a set of commands, we can already compile Chapel code using its LLVM+Polly copy, and this will be fixed soon.
As of now, Polly can detect Chapel loops which can have custom domains for every dimension (for eg: Arr[1..5][6..8][25..89]).
Tests were run on a machine with the following specifications:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping: 1
CPU MHz: 1200.000
CPU max MHz: 2600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5195.92
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 35840K
GPU Model: Tesla P100-PCIE-12GB
IRQ: 154
GPU UUID: GPU-cca49a44-59ee-4412-1284-f4741454c98b
Video BIOS: 86.00.3a.00.02
Bus Type: PCIe
DMA Size: 47 bits
DMA Mask: 0x7fffffffffff
Bus Location: 0000:0b:00.0
Device Minor: 0
The preliminary results for standardized matrix multiplication (all matrices of size 1000 x 1000) using the collective-invariant-loads patch were gathered using perf stat -r 10.
Binary | Time (in seconds)
---|---
Plain Chapel code | 7.54
Polly-optimized code | 1.18
This huge performance gain encourages us to further pursue PPCGCodegen for Chapel.
We have checked the indexing-call approach using two examples: 2-D array initialization and matrix multiplication. We compared three methods: plain compilation without Polly, Polly CPU codegen, and Polly GPU codegen. The results obtained are as follows:
Binary | Time (in seconds)
---|---
Without Polly | 2.625
Polly (CPU) | 2.612
Polly (GPU) | 3.324
Size: 1000
Binary | Time (in seconds)
---|---
Without Polly | 27.5273
Polly (CPU) | 7.13
Polly (GPU) | 0.87
Size: 5000
Binary | Time (in seconds)
---|---
Without Polly | 8102.7239
Polly (CPU) | 4334.082
Polly (GPU) | 17.07
Polly models each detected SCoP as a polyhedron over an N-dimensional space, where N is the depth of the loop nest; for example, the iterations of a triply nested matrix-multiplication loop over 1..n form the set of integer points { (i, j, k) : 1 <= i, j, k <= n }. Polly performs exact dependence analysis and uses an efficient scheduling algorithm, both made possible by the Integer Set Library (ISL). Furthermore, Polly has a well-defined strategy to optimize matrix-multiplication-like kernels whenever one is detected within a program. All of these features make Polly better at optimizing such loops than the regular LLVM loop optimization passes.
Integrating Polly within Chapel leaves a lot of scope for improvement. Currently, the SCoP detected by Polly is sub-optimal in many ways; one of the main reasons is that there are a lot of parameters involved in the context, which can significantly increase the compile time spent in Polly. In Chapel, the actual integration, in terms of adding the required Polly passes and adding libGPURuntime.so to the list of link files, is also yet to be done. The Polly codegen can still be simplified further to provide better performance on both architectures.
In terms of the nature of the loops, the work can also be extended to detect strided Chapel arrays.
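For illustration, the kind of strided Chapel array this extension would cover (a hypothetical example; strided domains are not yet detected):

```chapel
// every second element of 1..10; the stride shows up as a non-unit
// coefficient in the access function Polly would have to model
var S: [1..10 by 2] real;
for i in 1..10 by 2 do
  S[i] = i;
```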
The current results are based on only two simple yet common examples. There are two ways to further extend the testing phase: we are working on understanding the exact impact of this indexing approach by improving code generation, SCoP detection, and the optimal transfer of the parameters involved in indexing; in parallel, we are also testing whether the one-off polly_array_index method could be coherently introduced within LLVM. We are working on an RFC (Link) to document this purpose.
It is now clearer that introducing Polly and its set of optimizations to Chapel benefits the compiler both in terms of performance and by providing an extra option to target a different architecture (GPGPUs), thereby reaping the benefits of both. This can have a huge impact on Chapel programs, attaining significant speedups.
Right now, only simple Chapel for loops are modelled using Polly. It would be exciting to extend this support further to the data-parallel forall loops, which are more commonly used in distributed scenarios.
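For comparison, the serial form that is modelled today and the data-parallel form that future work would target; a minimal sketch with illustrative names:

```chapel
config const n = 1000;
var A, B: [1..n, 1..n] real;

// modelled today: a simple serial loop nest
for (i, j) in {1..n, 1..n} do
  A[i, j] = B[i, j] + 1.0;

// future work: the data-parallel equivalent, common in distributed code
forall (i, j) in {1..n, 1..n} do
  A[i, j] = B[i, j] + 1.0;
```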
I would like to thank my mentors, Siddharth Bhat and Phillip Pfaffe, for always being available and patient enough to resolve my doubts in a precise manner, and Michael Ferguson for supporting me on the Chapel front, helping me whenever I faced any trouble, and providing appropriate resources to alleviate the problems. I would like to extend my special thanks to Michael Kruse for always being available and providing suggestions and inputs which helped in solving the issues.