We had an outstanding social event on September 13 themed Compilers meet Theorem Provers, with more than 45 attendees and three outstanding speakers.
The GSoC 2018 coding phase officially ended yesterday. We congratulate all students who particpiated!
Google requires all students to preparte a web page summarizing the work they did for their project. Here are the reports for the projects supported by Polly Labs:
Pankaj Kukreja: Collect Polyhedral Benchmarks [Project proposal]
Mentored by: Brian Homerding, Michael Kruse
Sahil Yerawar: Compiling Chapel with Polly/LLVM [Project proposal]
Mentors: Michael Ferguson, Siddharth Bhat, Philip Pfaffe
Andrei Lascu: Metamorphic Testing for Math Libraries [Project proposal]
Mentors: Tobias Grosser, Sven Verdoolaege
We thank all students for their contributions and all mentors for their support. We hope that we could get you interested in developing for LLVM, Clang, Polly and/or isl and maybe that you continue contributing to these projects.
[GSoC Logo source: https://developers.google.com/open-source/gsoc/resources/media]
PPCG 0.08 was released containing minor fixes.
barvinok 0.41 was released containing only a single minor change:
Remove support applying codegen
to a set in iscc
The codegen operation prints an AST generated from a schedule, given either as a schedule tree or a (union) map. As a convenience, applying codegen to a set would result in codegen being applied to the identity mapping on the set. This seems to be causing confusion about the nature of the codegen operation. Drop the magic such that users are aware that a schedule is needed to perform AST generation. They can still explicitly call the “identity” operation on a set if they really want the old behavior.
pet 0.11 was released containing a couple of changes:
Return statements in summary functions
Return statements are now allowed at the end of a summary function. The accesses that appear in the argument of the return statement are treated like any other read. Return statements in other positions are currently not allowed as that would require special treatment for turning perceived must-writes that may get short-cut by the return into may-writes.
For example, the following input is now accepted.
int select(int n, int a[n], int i)
{
return a[i];
}
void foo(int n, int a[n])
{
#pragma scop
for (int i = 0; i + 1 < n; ++i) {
a[i] = select(n, a, i) + select(n, a, i + 1);
}
#pragma endscop
}
Return statements in inlined functions
Return statements are now allowed at the end of an inlined function. This enables support for calls to inlined functions in nested expressions. The return statement in the inlined function is simply replaced by a write to a freshly allocated variable.
For example, the following input is now accepted.
inline int select(int n, int a[n], int i)
{
return a[i];
}
void foo(int n, int a[n])
{
#pragma scop
for (int i = 0; i + 1 < n; ++i) {
a[i] = select(n, a, i) + select(n, a, i + 1);
}
#pragma endscop
}
isl 0.19 was released containing various improvements:
Improve coalescing
In particular, the following set descriptions are now also coalesced
{ [a, b, c] : (b = -1 + a and 0 < a <= 3 and
9*floor((-4a + 2c)/9) <= -3 - 4a + 2c) or
(a = 4 and b = 3 and 9*floor((-16 + 2c)/9) <= -19 + 2c) };
{ [x, y] : 2y = -x and x <= 0 or x <= -1 and 2y <= -x - 1 and 2y >= x - 1 };
{ [a] : (a <= 0 and 3*floor((a)/3) = a) or (a < 0 and 3*floor((a)/3) < a) };
{ [a, b] : a <= 1024 and b >= 0 and
((-31 - a + b <= 32*floor((-1 - a)/32) <= -33 + b and
32*floor((-1 - a)/32) <= -16 + b + 16*floor((-1 - a)/16))
or (2 <= a <= 15 and b < a)) };
{ [a] : a > 0 and ((16*floor((a)/16) < a and
32*floor((a)/32) < a) or a <= 15) };
{ [a, b, c, d] : (-a + d) mod 64 = 0 and a <= 8 and b <= 1 and
10 - a <= c <= 3 and d >= 5 and 9 - 64b <= d <= 70;
[a, b = 1, c, d] : (-a + d) mod 64 = 0 and a <= 8 and c >= 4 and
10 - a <= c <= 5 and 5 <= d <= 73 - c };
[n, m] -> { S_0[i] : (-n + i) mod 3 = 0 and m >= 3 + n and
i >= n and 3*floor((2 + n + 2m)/3) <= n + 3m - i;
S_0[n] : n <= m <= 2 + n };
{ [a, b] : exists (e0: 0 <= a <= 1 and b >= 0 and
2e0 >= -5 + a + 2b and 2e0 >= -1 + a + b and
2e0 <= a + b);
[a, b] : exists (e0: 0 <= a <= 1 and 2e0 >= -5 + a + 2b and
2e0 >= -1 - a + b and 2e0 <= -a + b and
2e0 < -a + 2b) };
{ [i, j, i - 8j] : 8 <= i <= 63 and -7 + i <= 8j <= i;
[i, 0, i] : 0 <= i <= 7 };
{ [a, b] : a >= 0 and 0 <= b <= 1 - a; [1, 1] };
{ [a, b] : a, b >= 0 and a + 2b <= 2; [1, 1] };
Improve parametric integer programming
If a tableau used for parametric integer programming contains a purely parametric row with an indeterminate sign, then this means that a solution only exists in the part of the context where this constraint holds. Such constraints are now transferred to the context first such that they can be used to more accurately determine the possible signs of other rows and to possible avoid further splits.
In particular, calling isl_set_dim_max
on the set
[P, N] -> { [a] : a < N and a >= 0 and N > P and a <= P and
N > 0 and P >= 0 }
now results in an affine expression defined over a single part of the domain.
Try harder to avoid large coefficients in scheduler
For the Pluto-like scheduler, the main cause of large coefficients was loop coalescing and this was already resolved in isl 0.17. For the Feautrier scheduler, which is used as a backup for the Pluto-like scheduler, loop coalescing was also handled, but for this scheduler, loop coalescing is not the only main source of large coefficients. In particular, if the solution to the LP problem constructed by the Feautrier scheduler is not integral, then the numerators with respect to a common denominator are used as schedule coefficients. These numerators can be arbitrarily large even if the rational values themselves are small. The solution is to continue the search until an integral solution is found. Second, the Feautrier scheduler tries to carry as many dependences as possible (as instructed), which tends to result in schedules with coefficients proportional to the number of statements. This second problem is largely resolved by first trying to only carry self-dependences. This second change also removes some spurious shifts.
Handle coscheduled must-sources in dependence analysis
The behavior in case of coscheduled must-sources was undefined. Define the behavior to kill all earlier sources, but not any coscheduled may-sources or must-sources. Also, a dependence derived from a must-source that is coscheduled with any other source is never considered a must-dependence.
Support explicit kills in dependence analysis
In fact, make kills the primary interface, with must-sources considered to be both may-sources and kills, since the effect of kills is actually easier to explain than that of must-sources.
Drop deprecated isl_int
, band forests and additional functions
We are happy to announce that Google selected Polly Labs as one of the mentoring organizations for Summer of Code 2018. Our page at Google as participating organization is here. Like 2017 we will assist students to enter open source development for polyhedral optimization and compilation.
If you are currently enrolled at a university or college and you are interested in contributing in the field of polyhedral compilation, we invite you to apply with a project proposal at Google. The first step is to contact us on one of our projects’ mailing list. See the 2018 ideas and information page for more detailed instructions and ideas for projects. Hints for writing a good proposal are found on our How-to-apply page.
Also see the post for GSoC 2017.
[GSoC Logo source: https://developers.google.com/open-source/gsoc/resources/media]
Sven Verdoolaege presented the Extending Pluto-style Polyhedral Scheduling with Consecutivity paper at IMPACT 2018.
Abstract:
The Pluto scheduler is a successful polyhedral scheduler that is used in one form or another in several research and production compilers. The core scheduler is focused on parallelism and temporal locality and does not directly target spatial locality. Such spatial locality is known to bring performance benefits and has been considered in various forms outside and inside polyhedral compilation. For example, the Pluto compiler has some support for spatial locality, but it is limited to a post-processing step involving only loop interchange. Consecutivity is a special case of spatial locality that aims for stride-1 accesses, which can be useful for constructing burst accesses and for vectorization. Stride-1 accesses have been targeted by an approach based on one-shot scheduling, but it is fairly approximative and not directly transferable to a Pluto-style scheduler. This paper describes an approach for consecutivity that is integrated in a Pluto-style polyhedral scheduler, as implemented in isl. Both intra-statement and inter-statement consecutivity is considered, taking into account multiple references per statement and the integration into a component based incremental scheduler.
The report Consecutivity in the isl Polyhedral Scheduler, describing an extension of the isl scheduler for targeting consecutive accesses, has been made available.
The GSoC 2017 coding phase officially ended today. We congratulate all students who participated!
Students prepared a page for their final report summarizing the work they did for their project. These are the reports for the students that were supported by Polly Labs:
Annanay Agarwal: Enable Polyhedral Optimizations in XLA through LLVM/Polly [Project proposal]
Mentored by: Brian Retford, Michael Kruse, Chandan Reddy
Nicolas Bonfante: Maximal static expansion for efficient loop parallelization on GPU [Project proposal]
Mentors: Andreas Simbürger, Michael Kruse
Malhar Thakkar: ISL Memory Management Using Clang Static Analyzer [Project proposal]
Mentors: Devin Coughlin, Sven Verdoolaege, Alexandre Isoard
Sanjay Srivallabh Singapuram Enabling Julia to target GPGPUs using Polly [Project proposal]
Mentors: Tobias Grosser
We thank all students for their contributions and all mentors for their support. We hope that we could get you interested in developing for LLVM, Clang and/or Polly and maybe that you continue contributing to these projects.
[GSoC Logo source: https://developers.google.com/open-source/gsoc/resources/media]
The report Scheduling for PPCG, describing the isl scheduler as used by PPCG, has been made available.
NEWS: Google selected Polly
Labs as mentoring organization for
Google Summer of Code 2017! You are an undergraduate student or doing your
doctorate? Join us to work on polyhedral compilation this summer! Here the link
directly to our Polly Labs GSoC Ideas and Information
Page.
Polyhedral compilation is so much more than just the LLVM Polly project and with Polly Labs as a Google Summer of Code organization we now have a new tool that allows us to host Summer of Code students to work on polyhedral modeling projects that go even beyond the scope of LLVM and gcc. LLVM and gcc have been and are in the future great places to host core compiler projects that use polyhedral compilation, but our community is interested in a lot more. Work on the PPCG polyhedral GPU code generator, the isl math library, as well the PRL run-time library, but also many other research projects (Apollo, Polly-JIT, …) benefits the greater polyhedral community, but is certainly outside of the scope of our traditional host organizations. The Polly Labs GSoC effort provides a home for such kind of projects and complements the traditional GSoC work we do in the context of gcc, LLVM, or Julia.
Google Summer of Code has a long history in supporting polyhedral compilation. Polly itself started as a GSoC project and already Polly’s older sibling, GCC/Graphite, took part in the very first GSoC years, with Tobias Grosser (today Polly Labs and ETH Zurich) being the first student in 2008 (mentored by Sebastian Pop). Only one year later, Hongbin Zheng (today lead of Xilinx High-Level Synthesis Team) worked as GSoC student on the first implementation of the Polly frontend and put together with Tobias Grosser the first end-to-end version of Polly. Over the years, many GSoC students contributed directly or indirectly to advances in polyhedral compilation or infrastructure we build on today. LLVM’s original PTX backend was developed as part of Google Summer of Code 2011 by Justin Holewinski (today NVIDIA), and today drives the GPU generation of Polly-ACC, for which again foundations were laid during GSoC 2012 and 2014 by Yabin Hu. Over the years, many students worked on Google Summer of Code projects and contributed significant pieces to polyhedral compilation in general and Polly specifically and we hope that you might be the next one. Riyadh Baghdadi (now MIT, before ENS Paris), another former GSoC student working on GCC/graphite, wrote down some great advices on “how to write a great GSoC proposal”. Check them out to follow in the footsteps of many successful GSoC students.
We already have an interesting set of polyhedral experts, who kindly agreed to be mentors in projects of their expertise and we are hopeful to convince more. As a student, we invite you to apply with your own project ideas or to expand one of the ideas we suggested on our Google Summer of Code page. This pages also contains information about possible mentors, how to apply as a student, and generally how to get started with the Google Summer of Code process. We are looking forward to work with you on exciting Google Summer of Code projects!
[GSoC Logo source: https://developers.google.com/open-source/gsoc/resources/media]
barvinok 0.40 was released mainly fixing some minor bugs. It will also serve as a basis for more invasive changes in the next release.
PPCG 0.07 was released containing one main change:
Support hybrid tiling
Basic hybrid tiling
can now be enabled using the --hybrid
option.
It currently only applies to stencils with a single statement.
For example, hybrid tiling can be applied to the input (say heat-2d.c
)
#define N 3072
#define T 512
void heat(float A[2][N][N])
{
#pragma scop
for (long t = 0; t < T; t++) {
for (long i = 1; i < N - 1; i++)
for (long j = 1; j < N - 1; j++)
A[(t+1)%2][i][j] = (1/9.0f) *
(A[t%2][i][j] +
A[t%2][i-1][j-1] + A[t%2][i-1][j] + A[t%2][i-1][j+1] +
A[t%2][i][j-1] + A[t%2][i][j+1] +
A[t%2][i+1][j-1] + A[t%2][i+1][j] + A[t%2][i+1][j+1]);
}
#pragma endscop
}
by invoking ppcg
as follows
ppcg --hybrid --sizes="{kernel[i]->tile[2,2,128];kernel[i]->block[1,128]}" \
--isolate-full-tiles --unroll-copy-shared --unroll-gpu-tile heat-2d.c
Polly gained a new ARM64 Buildbot sponsored by Qualcomm Innovation Center, which adds ARM64 platform coverage to Polly’s continious integration testing:
With this buildbot Polly increases its buildbot farm to nine builders. Polly runs the following builders:
Seven LLVM LNT builds:
Besides the standard -O3 -mllvm -polly configuration, Polly regularly tests its ability to generate parallel code, and also runs in an “extended coverage mode” called -mllvm -polly-unprofitable which forces Polly to transform each trivial loop it encounteres to provide higher test coverage on Polly’s transformation engine by transforming even code where we do not expect any performance gains. Finally, with our Polly Performance bot, we regularly report performance results to the LLVM performance tracking infrastructure hosted at http://llvm.org/perf/.
Besides the above set of buildbots, we also host a set of “Before-Vectorizer” build bots, which keep track of our efforts to move Polly into the LLVM pass pipeline with the goal to remove the need for additional canonicalization passes when running Polly. Such passes always pre-transform the IR that is fed to the LLVM standard pass pipeline and can consequently result in random difficult to avoid performance changes. After moving Polly into the Pass pipeline, we expect Polly to only affect performance in cases where our performance model predicts a clear benefit. You can find more information about this topic in the Polly Documentation.
Polly is also regularly tested externally. Publicly accessible are the FFmpeg FATE tests. More tests are run in-house by different users.
pet 0.10 was released containing a couple of changes:
Detect and report unbalanced pairs of scop/endscop pragmas
For example, in the following code fragment,
the #pragma scop
appears inside the for
-loop,
while the #pragma endscop
appears outside of the loop.
Previous versions would silently ignore such regions.
Now, a warning is produced.
void foo(int a[10])
{
for (int i = 0; i < 10; ++i)
#pragma scop
a[i] = i;
#pragma endscop
}
Take into account assumptions during extraction
Consider the program fragment below.
int f(int k)
{
int a;
#pragma scop
__builtin_assume(k >= 0);
a = k % 16;
#pragma endscop
return a;
}
Without the __builtin_assume
, pet would have to assume
that k
may be either positive or negative, resulting
in a piecewise expression for a
.
The extra assumption allows the expression to be simplified.
Before, pet would only perform such simplifications at the very end,
now they are also performed during the extraction process,
potentially speeding up this extraction process in the presence
of multiple modulo or integer division expressions.
isl 0.18 was released containing various improvements:
Improve elimination of redundant existentially quantified variables
In particular, finally implement the
Omega test in isl.
A more powerful (and more expensive) version of the Omega test
was supposed to have been implemented already, but it missed some cases.
Those cases are now also handled.
For example, the set description
{ [m, w] : exists a : w - 2m - 5 <= 3a <= m - 2w }
is now simplified to { [m, w] : w <= 1 + m }
.
Improve coalescing
In particular, the following set descriptions are now also coalesced
{ [a, b] : 0 <= a <= 2 and b >= 0 and
((0 < b <= 13) or (2*floor((a + b)/2) >= -5 + a + 2b)) }
{ [a] : (2 <= a <= 5) or (a mod 2 = 1 and 1 <= a <= 5) }
{ [y, x] : (x - y) mod 3 = 2 and 2 <= y <= 200 and 0 <= x <= 2; [1, 0] }
{ [x, y] : (x - y) mod 3 = 2 and 2 <= y <= 200 and 0 <= x <= 2; [0, 1] }
{ [1, y] : -1 <= y <= 1; [x, -x] : 0 <= x <= 1 }
{ [1, y] : 0 <= y <= 1; [x, -x] : 0 <= x <= 1 }
{ [x, y] : 0 <= x <= 10 and x - 4*floor(x/4) <= 1 and y <= 0;
[x, y] : 0 <= x <= 10 and x - 4*floor(x/4) > 1 and y <= 0;
[x, y] : 0 <= x <= 10 and x - 5*floor(x/5) <= 1 and 0 < y;
[x, y] : 0 <= x <= 10 and x - 5*floor(x/5) > 1 and 0 < y }
{ [x, 0] : 0 <= x <= 10 and x mod 2 = 0;
[x, 0] : 0 <= x <= 10 and x mod 2 = 1;
[x, y] : 0 <= x <= 10 and 1 <= y <= 10 }
Improve parametric integer programming
Based on analysis of a test case reported by Wenlei Bao, a major cause of the long run-time turned out to be the symmetry detection performed internally. The detected symmetry is exploited by replacing several upper bounds with a single additional parameter. This helps to speed up lexicographic optimization of some inputs with lots of symmetry, but causes a significant slow-down on this particular input because the parameters that appear in the replaced constraints may also appear in other constraints and through the replacement, the connection is lost. The symmetry detection is now only applied in cases where this problem does not occur.
A further analysis showed that quite some time was spent in exploring parts of the domain where the optimum is undefined. In a pure lexicographic optimization operation, the user is not interested in this part of the domain. The procedure was changed to start from the actual domain of the input rather than the corresponding universe such that the complement no longer needs to be explored at all. A possible down-side is that this actual domain may involve existentially quantified variables, which increases the dimension of the context and which may also appear in the final result. Special care therefore needs to be taking for the case where lexicographic optimization is being used for its side-effect of eliminating existentially quantified variables.
The upshot is that for the input
{ S_2[i, k, j] -> [o0, o1, o2, o3] : exists (e0 = floor((o1)/8):
8e0 = o1 and k <= 255 and o0 <= 255 and o2 <= 255 and
o3 <= 255 - j and 7o3 >= 255 - j and 7o3 <= 7 - i and
240o3 <= 239 + o0 and 247o3 <= 247 + k - j and
247o3 <= 247 + k - o1 and 247o3 <= 247 + i and
248o3 >= 248 - o1 and 248o3 <= o2 and 254o3 >= i - o0 + o1
and 254o3 >= -o0 + o1 and 255o3 >= -i + o0 - o1 and
1792o3 >= -63736 + 257o1) }
the computation time went down from
real 6m36.786s
user 6m36.032s
sys 0m0.128s
to
real 0m0.040s
user 0m0.032s
sys 0m0.000s
preserve isolate
option in isl_schedule_node_band_split
This is needed to simplify the implementation of full/partial tile separation for hybrid tiling in PPCG.
Print AST nodes in YAML format
Minor improvements to Python bindings
IMPACT 2017, the 7th International Workshop on Polyhedral Compilation, will be taking place January 23 - 25, 2017 in Stockholm colocated with the HiPEAC conference. IMPACT is probably the venue to meet the full polyhedral compilation community and a great place to discuss your latest ideas on polyhedral compilation.
With Tobias as co-chair, Polly Labs contributes to the organization of IMPACT 2017.
The following is the official call for papers:
This is a CALL FOR PAPERS for IMPACT 2017 7th International Workshop on Polyhedral Compilation Techniques http://impact.gforge.inria.fr/impact2017/ held in conjunction with HiPEAC 2017 (Jan 23-25, 2017) Stockholm, Sweden. ====================================== IMPORTANT DATES: Abstract submission: October 21, 2016 (AoE) Paper submission: October 28, 2016 (AoE) Author notification: December 02, 2016 Final version due: December 16, 2016 Workshop date: TBD ====================================== OVERVIEW: Polyhedral techniques have gained attention due to the rise of multi-core processors and other architectures that require parallelism and more complex schedules for data locality. At the heart of the polyhedral model is an abstraction of programs that enables powerful analysis and scheduling of applications. Recent developments in the polyhedral model research area includes automatic parallelization targeting various platforms, program verification, and hardware synthesis. IMPACT is a unique workshop aimed to bring together researchers and practitioners interested in polyhedral techniques to exchange ideas. This year's IMPACT will be held in conjunction with HiPEAC as a one-day workshop including technical paper presentations, panel discussions, and possibly a keynote. We welcome both theoretical and experimental papers on all aspects of polyhedral compilation and optimization. We also welcome submissions describing preliminary results, crazy new ideas, position papers, experience reports, and available tools, with an aim to stimulate discussions, collaborations, and advances in the field. The following illustrate potential IMPACT papers: - Discussion of a preliminary idea with an attempt to place it in context but no experimental results. - Experimental results comparing two or more existing ideas. - Presentation of an existing idea in a different way including illustrations of how the idea applies in current codes. Attribution should be done as well as possible. Topics of interest include, but are not limited to: - program optimization (automatic parallelization, tiling, etc.) - code generation - data/communication management on GPUs, accelerators and distributed systems - hardware/high-level synthesis - static analysis - program verification - model checking - theoretical foundations of the polyhedral model - extensions of the polyhedral model - scalability and robustness of polyhedral compilation techniques - autotuning - application case studies - tool demonstration SUBMISSION: Submissions should not exceed 8 pages (recommended 6 pages), excluding references, formatted as per ACM proceedings format. Please use the following template when preparing your manuscript: http://www.acm.org/sigs/publications/proceedings-templates Submissions should be in PDF format and printable on US Letter or A4 sized paper. Please submit through EasyChair at: https://easychair.org/conferences/?conf=impact17 Proceedings will be posted online. If the final version of an accepted paper does not sufficiently address the comments of the reviewers, then it may be accompanied by a note from the program committee. Publication at IMPACT will not prevent later publication in conferences or journals of the presented work. However, simultaneous submission to IMPACT and other workshop, conference, or journal is often prohibited by the policy of other venues. For instance, a paper with significant overlap with IMPACT submission cannot be sent to PLDI 2017 or any other overlapping SIGPLAN event. We will also continue the poster teasers we started last year. Authors of the rejected papers that still plan to attend HiPEAC will have an opportunity to present their submission in the HiPEAC poster session. We encourage poster presentations by providing a short (3 min.) slot in the workshop to advertise the posters. If possible, posters related to IMPACT will be gathered at the poster session close to each other. COMMITTEES: Organizers and Program Chairs: Mary Hall, University of Utah, USA Tobias Grosser, ETH Zurich, Switzerland Program Committee: Uday Bondhugula (IISc Bangalore, India) Cedric Bastoul (University of Strasbourg, France) Samuel Bayliss (Xilinx, USA) Albert Cohen (INRIA, France) Philippe Clauss (University of Strasbourg, France) Alain Darte (CNRS, France) Paul Feautrier (ENS Lyon / INRIA, France) Armin Groesslinger (University of Passau, Germany) Alexandra Jimborean (Uppsala University, Sweden) Paul Kelly (Imperial College London, UK) Andreas Kloeckner (University of Illinois at Urbana-Champaign, USA) Martin Kong (Rice University, USA) Sanyam Mehta (Cray Inc., USA) J. Ramanujam (Louisiana State University, USA) P. Sadayappan (Ohio State University, USA) Jun Shirako (Rice University, USA) Michelle Strout (University of Arizona, USA) Ramakrishna Upadrasta (IIT Hyderabad, USA) Sven Verdoolaege (Polly Labs, Belgium) David Wonnacott (Haverford College, USA) Tomofumi Yuki (INRIA, France) Hongbin Zheng (Xilinx Inc., USA) Reply to AllReply to SenderForward
Tobias Grosser presented Polyhedral AST generation is more than scanning polyhedra at PLDI 2016, which was held this year on June 13-17 on the beautiful shores of Santa Barbara.
In the ACM TOPLAS paper presented, we provide detailed insights into state-of-the-art polyhedral AST generation techniques used at the heart of Polly, GCC/Graphite, PPCG, PolyMage, as well as others, and discuss concepts – outside of classical AST generation – used to enable GPU tiling techniques such as hybrid-hexagonal tiling.
Abstract mathematical representations such as integer polyhedra have shown to be useful to precisely analyze computational kernels and to express complex loop transformations. Such transformations rely on Abstract Syntax Tree (AST) generators to convert the mathematical representation back to an imperative program. Such generic AST generators avoid the need to resort to transformation-specific code generators, which may be very costly or technically difficult to develop as transformations become more complex. Existing AST generators have proven their effectiveness, but they hit limitations in more complex scenarios. Specifically, (1) they do not support or may fail to generate control flow for complex transformations using piecewise schedules or mappings involving modulo arithmetic; (2) they offer limited support for the specialization of the generated code exposing compact, straightline, vectorizable kernels with high arithmetic intensity necessary to exploit the peak performance of modern hardware; (3) they offer no support for memory layout transformations; (4) they provide insufficient control over the AST generation strategy, preventing their application to complex domain-specific optimizations.
We present a new AST generation approach that extends classical polyhedral scanning to the full generality of Presburger arithmetic, including existentially quantified variables and piecewise schedules, and introduce new optimizations for the detection of components and shifted strides. Not limiting ourselves to control flow generation, we expose functionality to generate AST expressions from arbitrary piecewise quasi-affine expressions which enables the use of our AST generator for data-layout transformations. We complement this with support for specialization by polyhedral unrolling, user-directed versioning, and specialization of AST expressions according to the location they are generated at, and complete this work with fine-grained user control over the AST generation strategies used. Using this generalized idea of AST generation, we present how to implement complex domain-specific transformations without the need to write specialized code generators, but instead relying on a generic AST generator parametrized to a specific problem domain.
Try it out in the Polly Labs playground.
Sven Verdoolaege presented the lecture PPCG and Pencil compiler design at the Polyhedral Code Optimizations and Numerical Simulation spring school.
Abstract:
This course presents an overview of the inner workings of the polyhedral parallelizing compiler PPCG, which takes PENCIL code as input and produces either CUDA or OpenCL code. PENCIL is a subset of C99 with some extra builtins and pragmas that allow the user to express additional information that can be used by PPCG to produce better code. Some of the most prominent PENCIL features are described, along with their effect on PPCG. All the major steps of PPCG are covered, with a special focus on the choice of the core scheduling algorithm and its potential pitfalls.
Michael Kruse presented Using PPCG in Practice by Means of the pencilcc Wrapper
Abstract:
pencilcc is a compiler for PENCIL source code (similar to nvcc or mpicc) that uses PPCG under the hood, but that simplifies the user experience. PENCIL is a subset of C99 designed to facilitate automatic parallelization. The pencilcc compiler also comes with a pencil.h header and a runtime library called PRL. In this session we will learn how to compile PENCIL sources using pencilcc to OpenCL/CUDA and how to profile our code. We will also see the influence of PENCIL-specific annotations on the generated code as well as the positive effect on performance of using the PRL library.
PPCG 0.06 was released containing various changes:
Use PPCG specific macro names in generated code
The names for macros inside isl AST expression now have a ppcg_
prefix to avoid conflicts with other symbols.
Complete transition to schedule trees
In particular, the last remaining use of nested AST generation has been removed. AST expressions for affine expressions are now also built within the context where they will appear, allowing them to exploit this context for simplification purposes.
Maximize coincidence by default
In particular, turn on --isl-schedule-maximize-coincidence
by default.
Map arrays with constant index expressions to private memory
This allows some code to be mapped to an accelerator that could not be mapped before because of excessive scheduling constraints.
Optionally group chains of statements
Statements that are executed in sequence in the input code and where the second completely depends on the first can now be grouped together. In particular, pairs of consecutive statements where the first writes to a scalar and the second reads from that scalar are grouped together. This reduces the number of statements, reducing the computation time of the scheduler, while not giving up too much scheduling freedom.
pet 0.09 was released containing various changes:
Push affine conditions into index expressions
If the code contains an expression of the form
c ? A[f] : A[g]
with c an affine condition, then replace it by
A[c ? f : g]
This reduces the number of accesses and, more importantly, includes the information of when the access is performed in the index expression. This allows dependence analysis to take into account this information, reducing the risk of introducing spurious dependences.
Properly support undeclared loop iterators
When the loop iterator of a for
loop is not declared inside
the for
loop itself, the value of this loop iterator persists
after the execution of the loop and could in principle be used
by subsequent statements. Introduce assignments to the corresponding
variable that ensure the variable contains the correct value
after the execution of the loop.
isl 0.17 was released containing various improvements:
Optionally combine SCCs incrementally in the scheduler
Within each component of the dependence graph, the isl scheduler
tries to schedule all statements together by default.
By turning off the schedule-whole-component
option,
the isl scheduler will first consider each strongly connected
component separately and then incrementally combine them.
This should reduce the scheduling time and allows for more
finegrained control over which SCCs should be combined.
In particular, the coincidence maximization below depends
on the incremental scheduler.
Optionally maximize coincidence in the scheduler
If the schedule-whole-component
option is turned off,
then turning on the schedule-maximize-coincidence
option
will make isl refuse to combine SCCs if this reduces the number of
coincident band members in the schedules for any of the SCCs.
Optionally avoid loop coalescing in the scheduler
By default, isl will now try and prevent or recover from schedules that represent loop coalescing. Loop coalescing is undesirable because it results in a skewed view of the schedule dimension, leading to single instance band members, and because the large coefficients can cause difficulties in later step, including complicated generated code.
For example, a schedule of the form
domain: "{ C[i0, i1] : 2 <= i0 <= 3999 and 0 <= i1 < i0 }"
child:
schedule: "[{ C[i0, i1] -> [(3998i0 + i1)] }]"
child:
schedule: "[{ C[i0, i1] -> [(i0)] }]"
permutable: 1
coincident: [ 1 ]
will be rejected in favor of the schedule
domain: "{ C[i0, i1] : 2 <= i0 <= 3999 and 0 <= i1 < i0 }"
child:
schedule: "[{ C[i0, i1] -> [(i0)] }]"
child:
schedule: "[{ C[i0, i1] -> [(i1)] }]"
Fix handling of nested integer divisions
Nested integer divisions were not handled correctly in some operations, leading to incorrectly generated code in corner cases.
Optionally detect min/max expressions during AST generation
An isl_pw_aff
with several pieces often represents a
minimum of maximum expression. Detect some of these cases inside
isl_ast_build_expr_from_pw_aff
and construct an explicit minimum
or maximum expression on success.
For example, construct max(0, M)
instead of M <= 0 ? 0 : M
.
Minor AST generator improvements
In particular, reduce the number of pieces during the combination of pairs of piecewise affine expressions and move the most complicated piece last such that it does not need to be AST generated. When combining a pair of expressions of m and n pieces, the original implementation could produce a result with 2 * (m + 1) * (n + 1) pieces. The new implementation creates a result with at most m + n pieces.
Simplify stride constraints
isl_map_gist
would only simplify away stride constraints
if they could be removed completely and if they happened
to be formulated in exactly the same way in the context.
Stride constraints are now simplified more generally.
For example,
[n] -> { [x] : x = n && x mod 32 = 0 } % [n] -> { [x] : x mod 32 = 0 }
now produces [n] -> { [x = n] }
, and
{ [x] : x mod 6 = 0 } % { [x] : x mod 3 = 0 }
now produces { [x] : x mod 2 = 0 }
.
Improve support for expansions in schedule trees
These changes are needed to support statement grouping in PPCG.
At this years Euro LLVM and CC conference in Barcelona, a variety of Polly Labs related presentations have been given.
The Polly Loop Optimizer is a framework for the analysis and optimization of (possibly imperfectly) nested loops. It provides various transformations such as loop fusion, loop distribution, loop tiling as well as outer loop vectorization. In this tutorial we introduce the audience to the Polly loop optimizer and show how Polly can be used to analyze and improve the performance of their code. We start off with basic questions such as "Did Polly understand my loops?", "What information did Polly gather?", "How does the optimized loop nest look like?", "Can I provide more information to enable better optimizations?", and "How can I utilize Polly's analysis for other purposes?". Starting from these foundations we continue with a deeper look in more advanced uses of Polly: This includes the analysis and optimization of some larger benchmarks, the programming interfaces to Polly as well as the connection between Polly and other LLVM-IR passes. At the end of this tutorial we expect the audience to not only be able to optimize their codes with Polly, but also to have a first understanding of how to use it as a framework to implement their own loop transformations.
Presenters: Tobias Grosser and Johannes Doerfert
Feedback:
Tobias Grosser (the lead developer of Polly) and Johannes Doerfert gave a fantastic tutorial on Polly. ‘Polyhedral loop modelling’ is an intimidating term when you first encounter it, but their explanation was clear and effective. They also showed how to derive simple loop optimisations like loop interchange in this system and the more sophisticated optimisations they’ve developed on top.
http://www.wilfred.me.uk/blog/2016/03/22/llvm-developer-meeting-2016/
The Polly Loop Optimization infrastructure has seen active development throughout 2015 with contributions from a larger group of developers located at various places around the globe. With three successful Polly sessions at the US developers meeting and larger interest at the recent HiPEAC conference in Prag, we expect various Polly developers to be able to attend EuroLLVM. To facilitate in-persona collaboration between the core developers and to reach out to the wider loop optimization community, we propose a BoF session on Polly and the LLVM loop optimization infrastructure. Current hot topics are the usability of Polly in an '-O3' compiler pass sequence, profile driven optimizations as well as the definition of future development milestones. The Polly developers community will present ideas on these topics, but very much invites input from interested attendees.
Presenters: Tobias Grosser, Johannes Doerfert, Zino Benaissa
Motivated by modern day physics which in addition to experiments also tries to verify and deduce laws of nature by simulating the state-of-the-art physical models using large computers, we explore means of accelerating such simulations by improving the simulation programs they run. The primary focus is Lattice Quantum Chromodynamics (QCD), a branch of quantum field theory, running on IBM newest supercomputer, the Blue Gene/Q. Molly is an LLVM compiler extension, complementary to Polly, which optimizes the distribution of data and work between the nodes of a cluster machine such as Blue Gene/Q. Molly represents arrays using integer polyhedra and uses another already existing compiler extension Polly which represents statements and loops using polyhedra. When Molly knows how data is distributed among the nodes and where statements are executed, it adds code that manages the data flow between the nodes. Molly can also permute the order of data in memory. Molly's main task is to cluster data into sets that are sent to the same target into the same buffer because single transfers involve a massive overhead. We present an algorithm that minimizes the number of transfers for unparametrized loops using anti-chains of data flows. In addition, we implement a heuristic that takes into account how the programmer wrote the code. Asynchronous communication primitives are inserted right after the data is available respectively just before it is used. A runtime library implements these primitives using MPI. Molly manages to distribute any code that is representable in the polyhedral model, but does so best for stencils codes such as Lattice QCD. Compiled using Molly, the Lattice QCD stencil reaches 2.5% of the theoretical peak performance. The performance gap is mostly because all the other optimizations are missing, such as vectorization. Future versions of Molly may also effectively handle non-stencil codes and use make use of all the optimizations that make the manually optimized Lattice QCD stencil fast.
Presenter: Michael Kruse
Polly, as a polyhedral loop optimizer for LLVM, is not only a sophisticated tool for data locality optimizations, but also has precise information about loop behavior that can be used to automatically generate accelerator code. In this presentation we present a set of new Polly features that have been introduced throughout the last two years (and as part of two GSoC projects) that enable the use of Polly in the context of compilation for heterogeneous systems. As part of this presentation we discuss how we use Polly to derive the precise memory footprints of compute regions for both flat arrays as well as multi-dimensional arrays of parametric size. We then present a new, high-level interface that allows for the automatic remapping of memory access functions to new locations or data-layouts and show how this functionality can be used to target software managed caches. Finally, we present our latest results in terms of automatic PTX/CUDA code generation using Polly as a core component.
Presenter: Tobias Grosser
The performance of OpenCL programs suffers from memory and control flow divergence. Therefore, OpenCL compilers employ static analyses to identify non-divergent control flow and memory accesses in order to produce faster code. However, divergence is often input-dependent, hence can be observed for some, but not all inputs. In these cases, vectorizing compilers have to generate slow code because divergence can occur at run time. In this paper, we use a polyhedral abstraction to partition the input space of an OpenCL kernel. For each partition, divergence analysis produces more precise results i.e., it can classify more code parts as non-divergent. Consequently, specializing the kernel for the input space partitions allows for generating better SIMD code because of less divergence. We implemented our technique in an OpenCL driver for the AVX instruction set and evaluate it on a range of OpenCL benchmarks. We observe speed ups of up to 9x for irregular kernels over a state-of-the-art vectorizing OpenCL driver.
Presenter: Simon Moll
Authors: Simon Moll, Johannes Doerfert, Sebastian Hack
A spring school on Polyhedral Code Optimizations and Numerical Simulation was announced. Sven Verdoolaege from Polly Labs will be one of the speakers at this event. More details are available in this blog post .
Sven Verdoolaege and Tobias Grosser gave today the Applied Polyhedral Compilation tutorial at HiPEAC 2016.
Content:
Polyhedral compilation is widely used in high-level synthesis tools and in production compilers such as gcc, LLVM and IBM/XL and is still being actively developed by several research groups across the globe, resulting in highly attended IMPACT workshops and a recent polyhedral school. It is based on the polyhedral model, a powerful abstraction for analyzing and transforming (parts of) programs that are "sufficiently regular". The key feature of this model is that it is instance based, allowing for a representation and treatment of individual dynamic executions of a statement inside a loop nest and/or individual array elements. In this tutorial, we explain how to apply the basic concepts of polyhedral compilation in different setting. The tutorial consists of two parts, one dealing with the use of polyhedral compilation inside LLVM/Polly, giving a detailed view into the core of Polly and enabling the attendees to write their own Polly based analysis and optimizations, and the other dealing with applications of set cardinality, focusing on problem modeling. In particular, we will address the following topics: - Recapitulation of the basic polyhedral concepts - Overview of LLVM's classical loop optimizers and vectorizers - Identification/analysis of compute kernels, memory access patterns, and data dependences in Polly - Code transformations in Polly - Overview of the basic counting operations - How to model problems involving counting and related operations This tutorial continues from the Polyhedral Compilation without Polyhedra tutorial presented at HiPEAC 2015. Some of the material will be drawn from the first half of a lecture presented at the polyhedral school.
Sven Verdoolaege presented the Live-Range Reordering paper at IMPACT 2016.
Abstract:
False dependences are caused by the reuse of memory to store different data. These false dependences severely constrain the schedule of statement instances, effectively serializing successive accesses to the same memory location. Several array expansion techniques have been proposed to eliminate some or all of these false dependences, enabling more reorderings of statement instances during loop nest transformations. However, array expansion is only relevant when complemented with a storage mapping optimization step, typically taking advantage of the fixed schedule set in earlier phases of the compilation, folding successive values into a compact set of contracted arrays. Furthermore, array expansion can result in memory footprint and locality damages that may not be recoverable through storage mapping optimization when intermediate transformation steps have abused the freedom offered by the removal of false dependences. Array expansion and storage mapping optimization are also complex procedures not found in most compilers, and the latter is moreover performed using suboptimal heuristics (particularly in the multi-array case). Finally, array expansion may not remove all false dependences when considering data-dependent control and access patterns. For all these reasons, it is desirable to explore alternatives to array expansion as a means to avoid the spurious serialization effect of false dependences. This serialization is unnecessary in general, as semantics preservation in presence of memory reuse only requires the absence of interference among live-ranges, an unordered constraint compatible with the their commutation. We present a technique to deal with memory reuse without serializing successive uses of memory, but also without increasing memory requirements or preventing important loop transformations such as loop distribution. The technique is generic, fine-grained (instancewise) and extends two recently proposed, more restrictive approaches. It has been systematically tested in PPCG and shown to be essential to the parallelizing compilation of a variety of loop nests, including large pencil programs with many scalar variables.
PPCG 0.05 was released containing various changes:
Fix live-out computation
Due to a bug, the live-out computation would consider all elements written by a write to have been killed as soon as one of those elements had effectively been killed. This could in rare cases result in some statement instances mistakenly getting removed during dead code elimination.
Optionally compute schedule for C target
The original support for the C target only added the basic infrastructure and did not even allow the input code to be rescheduled. Now, this backend will also (optionally) reschedule the code, but support for this target is still fairly basic.
Create single kernel for non-permutable subtree
If some parts of the code that is mapped to the device do not have any parallelism, then a kernel would be launched for each instance of that part of the code, even if that would result in loop on the host that only launches such kernels. Now, a single kernel is invoked for the entire non-parallel part.
barvinok 0.39 was released containing various changes:
Support more recent versions of NTL
In particular, assume NTL has been compiled in ISO mode, which has been the default ever since NTL 5.4, and handle a change in internal representation of NTL.
Support more operations in iscc
These additional operations are needed for some of the examples in the Presburger Formulas and Polyhedral Compilation tutorial.
An initial version of the Presburger Formulas and Polyhedral Compilation tutorial has been made available. The current version focuses mainly on Presburger formulas and on operations that can be performed on sets and binary relations defined by such formulas, but also includes some information on polyhedral compilation, especially on dependence analysis.
pet 0.08 was released containing various changes:
Initialize compiler builtins
Michael Kruse contributed a patch to initialize builtins.
Without this initialization, builtins are treated as implicitly
defined functions and are therefore assumed to have return type
int
, resulting in compiler warnings or even errors when
those builtins are used.
Rename pet_scop
accessors
The original names where too focused on how the accessors were implemented.
Support variable renaming
This allows the input to have several locally declared variables with the same name.
Support inlining of outermost call expressions
Add preliminary Python bindings
The python bindings are needed for some of the examples in the Presburger Formulas and Polyhedral Compilation tutorial.
isl 0.16 was released containing various improvements:
Add 32 bit integer optimization for IMath
Until isl 0.15 the IMath backend, which serves as MIT licensed alternative to the default libgmp backend, was not very well optimized and its result could result in isl being 2-4x slower. Michael Kruse contributed a new small-integer-optimization to isl which improved the performance of the IMath backend to be now around 30% faster than libgmp.
Improved AST generation quality
Various smaller changes result in generally improved AST generation quality.
Add isl_union_flow_get_full_may_dependence
and
isl_union_flow_get_full_must_dependence
These functions include the accessed data elements that are responsible for the dependence in their results.
Improvements to Python bindings
The isl provided python bindings have been extended to support the full set of polyhedral compilation examples that is discussed in our new Presburger Formulas and Polyhedral Compilation tutorial.
Improved printing of isl sets and maps
Until isl 0.15 the constraints in an isl set and map have been printed in the (rather random) order they have been added to the data structure. isl now makes an effort to print constraints in a human readable way. This results in both shorter constraints, but also more predictable output in case implementation details change.
Before:
{ S[i0, i1] : i0 <= 97 and i0 >= 0 and i1 <= 99 and i1 >= 1 }
After:
{ S[i0, i1] : 0 <= i0 <= 97 and 0 < i1 <= 99 }
Together with isl, we also released PPCG 0.05, pet 0.08, and barvinok 0.39.
We are proud to announce that Polly Labs won the HiPEAC Tech Transfer Award 2015.
The HiPEAC Tech Transfer Award has been awarded to Polly Labs for the continous development and deployment of the Polly Loop Optimizer and the isl Integer Set Library. Both software artifacts are not only widely used in research, but are today part of the Snapdragon LLVM Compiler for Android shipped by Qualcomm Inc. and are investigated by other Polly Labs industry partners.
This achievement was only possible due to the regular contributions of the growing Polly open source community. We want to thank especially Johannes Doerfert, who was a leading community contributor in 2015 as well as all the other contributers we got over time: Alexandre Isoard, Armin Groesslinger, Chris Jenneisch, Christian Bielert, Dmitry N. Mikushin, Jack Howarth, Karthik Senthil, Ajith Pandel, Lawrence Hu, Matthew Simpson, Peter Conn, Pratik Bhatu, Richard Membarth, Roal Jordans, Sam Novak, Star Tan, Yabin Hu, Adrian Prantl, Andreas Simbuerger, Andy Gibbs, Anton Korobeynikov, Benjamin Kramer, Bill Wendling, Chandler Carruth, Craig Topper, Daniel Dunbar, Daniel Jasper, Zino Benaissa, David Blaikie, David Peixotto, Duncan P. N. Exon Smith, Hongbin Zheng, Logan Chien, Matt Arsenault, Micah Villmow, NAKAMURA Takumi, Patrik Hägglund, Rafael Espindola, Raghesh Aloor, Roman Gareev, Saleem Abdulrasool, Sameer Sahasrabuddhe, Sebastian Pop, Sumanth Gundapaneni, Sunil Srivastava, Sylvestre Ledru, Tanya Lattner, Wei Mi and various others we would have very much deserved to be listed here.
As part of Polly Labs, Albert Cohen, Sven Verdoolage and Michael Kruse from the INRIA Parkas team as well as Torsten Hoefler and Tobias Grosser from the Scalable Parallel Computing Lab at ETH Zurich we are very happy to have received this award and would like to thank again ARM for their great support.