Current state of schedule tree feature
This is a quick overview of the current state of the schedule tree feature. The feature requires two things:
- transpiling OIR to schedule trees
- transpiling schedule trees to SDFGs
The first one is a GT4Py issue, the other a missing piece in DaCe.
Workflow
Working branches
- GT4Py: romanc/oir-to-stree on Roman's fork
- DaCe: romanc/stree-to-sdfg on Roman's fork
The DaCe branch branches off from v1/maintenance and includes Tal's work from the branch stree-to-sdfg. For a quick overview of the changes, look at
https://github.com/spcl/dace/compare/v1/maintenance...romanc:romanc/stree-to-sdfg
OIR to schedule tree
OIR to schedule tree goes via a "Tree IR". The tree IR is just here to facilitate building the schedule tree. For now, we don't do any transformation on the tree IR.
flowchart LR
oir["
OIR
(GT4Py)
"]
treeir["Tree IR"]
stree["Schedule tree"]
oir --> treeir --> stree
OIR to tree IR conversion has two visitors in separate files:
- oir_to_treeir transpiles control flow
- oir_to_tasklet transpiles computations (i.e. bodies of control flow elements) into tasklets
While this incurs a bit of code duplication (e.g. for resolving indices), it allows for a separation of concerns. Everything that is related to the schedule is handled in oir_to_treeir. Note, for example, that we keep the distinction between horizontal masks and general if statements. This distinction is kept because horizontal regions might influence scheduling decisions.
The conversion from tree IR to schedule tree is then a straightforward lowering.
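To illustrate the split between the two visitors, here is a minimal sketch. All class names below (Assign, HorizontalMask, IfStmt, Tasklet, TreeScope and the two Toy visitors) are hypothetical stand-ins invented for this example; they are not the actual GT4Py node or visitor classes. The control-flow visitor builds scopes and hands runs of assignments to the tasklet visitor, and it keeps horizontal masks distinct from general if statements.

```python
from dataclasses import dataclass, field


# Toy stand-ins for OIR nodes (hypothetical, not GT4Py's classes).
@dataclass
class Assign:
    target: str
    expr: str


@dataclass
class HorizontalMask:  # restricts execution to part of the horizontal domain
    region: str
    body: list


@dataclass
class IfStmt:  # general, data-dependent condition
    condition: str
    body: list


# Toy stand-ins for tree IR nodes (hypothetical).
@dataclass
class Tasklet:
    code: str


@dataclass
class TreeScope:
    kind: str  # "horizontal_mask" or "if"
    header: str
    children: list = field(default_factory=list)


class ToyOirToTasklet:
    """Computation only: turns a run of assignments into tasklet code."""

    def visit(self, stmts):
        return Tasklet(code="\n".join(f"{s.target} = {s.expr}" for s in stmts))


class ToyOirToTreeIR:
    """Control flow only: builds scopes, delegates bodies to the tasklet visitor."""

    def __init__(self):
        self.tasklets = ToyOirToTasklet()

    def visit(self, stmts):
        out, pending = [], []

        def flush():
            # Emit one tasklet for all assignments collected so far.
            if pending:
                out.append(self.tasklets.visit(list(pending)))
                pending.clear()

        for s in stmts:
            if isinstance(s, Assign):
                pending.append(s)
                continue
            flush()
            if isinstance(s, HorizontalMask):
                out.append(TreeScope("horizontal_mask", s.region, self.visit(s.body)))
            elif isinstance(s, IfStmt):
                out.append(TreeScope("if", s.condition, self.visit(s.body)))
            else:
                raise NotImplementedError(type(s).__name__)
        flush()
        return out


tree = ToyOirToTreeIR().visit(
    [
        Assign("a", "b + 1"),
        HorizontalMask("i == 0", [Assign("a", "0")]),
        IfStmt("a > 0", [Assign("c", "a")]),
    ]
)
```

Because the scope kind survives in the toy tree IR, a later scheduling pass could treat the horizontal mask differently from the general if statement.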
Schedule tree to SDFG
Broadly, schedule tree to SDFG conversion has the following steps:
- Set up a new SDFG and initialize its descriptor repository from the schedule tree.
- Insert (artificial) state boundary nodes in the schedule tree.
- Run a visitor on the schedule tree, translating every node into the new SDFG, see class StreeToSDFG.
- Propagate memlets through the newly created SDFG.
- Run simplify() on the newly created SDFG (optional).
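The state boundary step can be pictured with a small, self-contained sketch. The node classes and the rule used here are assumptions made for illustration only: the toy rule inserts a boundary before any node that touches data written earlier in the current state. The actual criteria used on the DaCe branch may differ.

```python
from dataclasses import dataclass


@dataclass
class ToyTasklet:  # hypothetical schedule tree leaf
    name: str
    reads: frozenset
    writes: frozenset


@dataclass
class ToyStateBoundary:  # hypothetical artificial boundary node
    pass


def insert_state_boundaries(nodes):
    """Return a new node list with boundaries inserted (toy rule, an assumption).

    A boundary is placed before any node that reads or writes a container
    that was already written since the last boundary.
    """
    result = []
    written = set()
    for node in nodes:
        if (node.reads | node.writes) & written:
            result.append(ToyStateBoundary())
            written = set()  # a new state starts with nothing written yet
        result.append(node)
        written |= node.writes
    return result


nodes = [
    ToyTasklet("t0", frozenset({"a"}), frozenset({"b"})),
    ToyTasklet("t1", frozenset({"a"}), frozenset({"c"})),
    ToyTasklet("t2", frozenset({"b"}), frozenset({"d"})),  # reads what t0 wrote
]
with_boundaries = insert_state_boundaries(nodes)
kinds = [type(n).__name__ for n in with_boundaries]
```

In this toy run, t0 and t1 share a state, while t2 is pushed behind a boundary because it reads a container written by t0.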
Hacks and shortcuts
- StreeToSDFG has many visitors raising a NotImplementedError. I've implemented these visitors on an as-needed basis.
- I've added additional state boundaries around nested SDFGs (needed for state changes, e.g. IfScope, inside MapNodes) to force correct execution order.
- I've added additional state boundaries after inter-state assigns to ensure the symbols are defined before they are accessed. As far as I understand, that shouldn't be necessary. However, I've had SDFGs (todo: which ones?) with unused assigns at the end of the main visitor.
- I've written tests for some things as a way of developing the main visitor. For simple schedule trees, I've already added checks on the resulting SDFG, but I quickly ended up validating by "looking at the resulting SDFG".
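As a rough picture of the "implement visitors as needed" approach and of the kind of structural check a test can make, consider the following sketch. Everything in it is hypothetical: the node classes and ToyStreeToStates only mimic the shape of a visitor with a NotImplementedError fallback, and a plain list of states stands in for an SDFG.

```python
class TaskletNode:  # hypothetical schedule tree node
    def __init__(self, name):
        self.name = name


class StateBoundaryNode:  # hypothetical artificial boundary node
    pass


class WhileScope:  # hypothetical node type that has no visitor yet
    pass


class ToyStreeToStates:
    """Toy visitor: collects tasklet names into a list of states."""

    def __init__(self):
        self.states = [[]]

    def visit(self, node):
        # Dispatch on the node's class name; unsupported nodes fail loudly.
        method = getattr(self, f"visit_{type(node).__name__}", None)
        if method is None:
            raise NotImplementedError(f"No visitor for {type(node).__name__} yet")
        return method(node)

    def visit_TaskletNode(self, node):
        self.states[-1].append(node.name)

    def visit_StateBoundaryNode(self, node):
        self.states.append([])  # everything after the boundary lands in a new state


visitor = ToyStreeToStates()
for n in [TaskletNode("t0"), StateBoundaryNode(), TaskletNode("t1")]:
    visitor.visit(n)

try:
    ToyStreeToStates().visit(WhileScope())
    unsupported_raised = False
except NotImplementedError:
    unsupported_raised = True
```

A test on a simple tree can then assert on the number and content of the resulting states instead of relying on visual inspection.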
Optimization laundry list
Things we want to do for optimization (and things we have to re-build from the old bridge).
- Different loops per target hardware: like previously, but less confusing
- Tiling: like previously, but hardware dependent. More details in this file.
- CPU: temps are allocated & de-allocated on the fly
- Axis-split merge
- Over-computation merge
- Local caching
- Inline thread-local transients: like previously.
- Optimize OpenMP pragmas: Check which of the previous optimizations still make sense.
- We ran a special version of TrivialMapElimination with additional conditions on when it applies.
- Special cases for stencils without effect? They were treated separately in the previous bridge.
- In the previous bridge, we'd merge a horizontal region with the loop bounds in case the horizontal region was the only thing inside that loop.
- In the previous bridge, we'd split horizontal execution regions. This was also used for orchestration in NDSL. To be re-evaluated.
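The item above about merging a horizontal region with the loop bounds can be sketched as follows. This is only an illustration under stated assumptions: ToyLoop and ToyRegion are hypothetical classes, both use half-open integer ranges along a single axis, and the previous bridge's actual implementation is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class ToyRegion:  # hypothetical horizontal region restricted to [start, stop)
    start: int
    stop: int
    body: list


@dataclass
class ToyLoop:  # hypothetical loop iterating over [start, stop)
    start: int
    stop: int
    body: list


def merge_region_into_loop(loop):
    """Fold a region into the loop bounds if it is the loop's only child."""
    if len(loop.body) != 1 or not isinstance(loop.body[0], ToyRegion):
        return loop  # not applicable, leave the loop untouched
    region = loop.body[0]
    start = max(loop.start, region.start)
    stop = max(start, min(loop.stop, region.stop))  # empty range if no overlap
    return ToyLoop(start, stop, region.body)


merged = merge_region_into_loop(ToyLoop(0, 10, [ToyRegion(0, 1, ["stmt"])]))
untouched = merge_region_into_loop(
    ToyLoop(0, 10, [ToyRegion(0, 1, ["stmt"]), "other_stmt"])
)
```

After the merge, the loop iterates only over the region's range and no mask is evaluated per iteration; a loop with additional children is left as it is.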