<span id="page-0-0"></span>

# **SSA transformation for GHC's native code generator**

### **BACHELORARBEIT**

zur Erlangung des akademischen Grades

### **Bachelor of Science**

im Rahmen des Studiums

#### **Software & Information Engineering**

eingereicht von

#### **Benjamin Maurer**

Matrikelnummer 00826765

an der Fakultät für Informatik

der Technischen Universität Wien

Betreuung: Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Andreas Krall

Wien, 1. März 2021

Benjamin Maurer **Andreas Krall** 



# **SSA transformation for GHC's native code generator**

### BACHELOR'S THESIS

submitted in partial fulfillment of the requirements for the degree of

### **Bachelor of Science**

in

#### **Software & Information Engineering**

by

### **Benjamin Maurer**

Registration Number 00826765

to the Faculty of Informatics

at the TU Wien

Advisor: Ao.Univ.Prof. Dipl.-Ing. Dr.techn. Andreas Krall

Vienna, 1st March, 2021

Benjamin Maurer **Andreas Krall** 

## **Erklärung zur Verfassung der Arbeit**

Benjamin Maurer

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

Wien, 1. März 2021

Benjamin Maurer

## **Abstract**

<span id="page-6-1"></span><span id="page-6-0"></span>[Static Single Assignment](#page-34-0) [\(SSA\)](#page-34-0) form is a widely-used and effective intermediate representation in optimizing compilers, which explicitly encodes def-use chains and enables a wide range of optimizations. However, transforming native machine instructions to [SSA](#page-34-0) poses additional challenges, due to register constraints, instructions modifying input operands and other Instruction Set Architecture peculiarities. The aim of this practical work is to implement SSA construction and destruction in the [Native Code](#page-34-1) [Generator backend](#page-34-1) [\(NCG\)](#page-34-1) of the [Glasgow Haskell Compiler](#page-34-2) [\(GHC\)](#page-34-2). Beyond enabling future optimizations, this will discover and rename webs for improved register allocation. This leads to a reduction in inserted spill instructions of up to 27% for the graph coloring register allocator.

## **Contents**

<span id="page-8-0"></span>

#### **[Bibliography](#page-34-4) 25**

# **CHAPTER**

## **Introduction**

#### <span id="page-10-4"></span><span id="page-10-1"></span><span id="page-10-0"></span>**1.1 Haskell and GHC**

 $H$ askell<sup>[1](#page-10-2)</sup> is a purely functional programming language with lazy evaluation semantics. To achieve pure functions, Haskell has pioneered the use of monads to model sideeffects[\[JW93\]](#page-36-0). It features a rich, static type system, which can be used to model program constraints at compile time.

The main compiler for the language - the [Glasgow Haskell Compiler](#page-34-2) [\(GHC\)](#page-34-2) - was first released in 1992 (see [\[HHJW07\]](#page-36-1)) and differs in its internal architecture from more mainstream compilers for imperative languages.

<span id="page-10-3"></span>

Figure 1: [GHC](#page-34-2) compilation pipeline

<span id="page-10-2"></span><sup>1</sup>https://www.haskell.org/

<span id="page-11-6"></span>Figure [1](#page-10-3) gives a high-level overview of used intermediate representations in [GHC.](#page-34-2) After parsing, [GHC](#page-34-2) translates the program into an [intermediate representation](#page-34-5) [\(IR\)](#page-34-5) called "Core", a simple, typed functional language based on *SystemF<sup>C</sup>* [\[SCJD07\]](#page-37-0). Many optimizations are performed in a "Core to Core" pass. Next, Core is lowered to STG, a minimalist functional language for an abstract machine called the "Spineless Tagless G-machine" [\[Jon92\]](#page-36-2), which aims to facilitate mapping of a non-strict functional language to actual hardware. STG is then lowered to Cmm, GHC's implementation of the C- language [\[JRR99\]](#page-36-3), a low level imperative language with an explicit stack.

[GHC](#page-34-2) currently offers three backends. The C-backend<sup>[2](#page-11-1)</sup> is the oldest one and considered deprecated except for bootstrapping on new platforms. A backend utilizing the  $LLVM<sup>3</sup>$  $LLVM<sup>3</sup>$  $LLVM<sup>3</sup>$ compiler infrastructure ([\[TC10\]](#page-37-1)) is the most recent of the three, but suffers from prohibitively long compile times. [GHC'](#page-34-2)s default backend is the [Native Code Generator](#page-34-1) [backend](#page-34-1) [\(NCG\)](#page-34-1), which strives to balance compile times and speed of generated code. [NCG](#page-34-1) is relatively simple and performs instruction selection, [CFG-](#page-34-6)based basic block layout optimization, liveness analysis with dead code elimination and register allocation. The vast majority of optimizations is performed in earlier phases.

At this stage, no intermediate representation is used, but a stream of machine instructions with either physical registers or virtual registers (before register allocation) and some metainstructions (e.g., SPILL, RELOAD), which are substituted in the course of compilation. Two register allocators are available, the default linear scan allocator and a Chaitin-Briggs-style graph coloring allocator.

#### <span id="page-11-0"></span>**1.2 Motivation**

While Haskell is known to be at the cutting edge in terms of type system features, like adding generalized algebraic datatypes (GADTs), type families, kind polymorphism and recently linear types  $[BBN<sup>+</sup>18]$  $[BBN<sup>+</sup>18]$ , code generation has received less attention.

The graph coloring register allocator has been removed from the standard optimization flags  $(-O2)$  in  $2013<sup>4</sup>$  $2013<sup>4</sup>$  $2013<sup>4</sup>$ , as it suffered a regression in performance of generated code and the situation has not been remedied since.

Analyzing the short routine in Listing [1](#page-12-1) submitted to [GHC'](#page-34-2)s issue tracker<sup>[5](#page-11-4)</sup>, which exhibits excessive spilling, reveals one fundamental problem.

The routine simply adds three records with nine Int fields each together in some arbitrary permutation. Note the use of the BangPatterns language extension<sup>[6](#page-11-5)</sup>, which forces

<span id="page-11-1"></span><sup>2</sup>[https://downloads.haskell.org/~ghc/9.0.1/docs/html/users\\_guide/codegens.](https://downloads.haskell.org/~ghc/9.0.1/docs/html/users_guide/codegens.html#c-code-generator-fvia-c) [html#c-code-generator-fvia-c](https://downloads.haskell.org/~ghc/9.0.1/docs/html/users_guide/codegens.html#c-code-generator-fvia-c) – Accessed 5.3.2021

<span id="page-11-2"></span> $3$ <https://www.llvm.org/> – Accessed  $5.3.2021$ 

<span id="page-11-3"></span><sup>4</sup><https://gitlab.haskell.org/ghc/ghc/-/issues/7679> – Accessed 5.3.2021

<span id="page-11-5"></span><span id="page-11-4"></span> $5$ <https://gitlab.haskell.org/ghc/ghc/-/issues/8048> - Accessed 5.3.2021

 $^6$ [https://downloads.haskell.org/~ghc/9.0.1/docs/html/users\\_guide/exts/](https://downloads.haskell.org/~ghc/9.0.1/docs/html/users_guide/exts/strict.html)

[strict.html](https://downloads.haskell.org/~ghc/9.0.1/docs/html/users_guide/exts/strict.html) – Accessed 5.3.2021

```
{-#} LANGUAGE BangPatterns _{+}-}
module Spill where
import GHC.Exts
data S = S !Int !Int !Int !Int !Int !Int !Int !Int !Int
         deriving Show
spill :: S -> S -> S -> S
spill (S !a !b !c !d !e !f !g !h !i)
      (S !j !k !l !m !n !o !p !q !r)
      (S !s !t !u !v !w !x !y !z _)
 = S (a + j + s) (b + c) (k + r)
     (a + b + c + d + e + f + q + h + i)(j + k + 1 + m + n + o + p + q + r)(s + t + u + v + w + x + y + z)(a + b + c) (j + k + 1) (s + t + u)
```
Listing 1: Example program triggering spill code insertion

strict evaluation of matched patterns. When compiled using the graph coloring register allocator[7](#page-0-0) , it will insert 8 store and 9 load instructions for spilling.

Inspecting one spilled [virtual register](#page-34-7) [\(vreg\)](#page-34-7),  $\forall v \in R$  nKI, we find that it is live in two basic blocks, as can bee seen in Figure [2.](#page-13-0)

Looking at the assembly in Listing [2,](#page-14-1) we can see that [vreg](#page-34-7)  $\forall v \perp nK$  gets redefined several times, explicitly stored on the stack between the two basic blocks and even dies and gets redefined within the same block. Using the same [virtual register](#page-34-7) name constrains the graph coloring allocator to assign these disjoint live ranges to the same physical register, increasing register pressure. These longer live ranges also may have more conflicts with other live ranges, increasing likelihood of spilling.

#### <span id="page-12-0"></span>**1.3 Renumbering**

Generally, a Chaitin-Briggs-style allocator would have a *renumbering* phase, which discovers intersecting def-use chains (webs) and assigns them unique [virtual register](#page-34-7) names [\[BCT94\]](#page-35-1).

This is both important before register allocation can begin, in order to identify life ranges, as well as during register allocation to identify newly created live ranges (e.g., via

 $7$ Using GHC 8.10.3, with ghc  $-02$  -fregs-graph -ddump-asm-stats

<span id="page-13-1"></span><span id="page-13-0"></span>

Figure 2: Control-flow graph of Listing [1,](#page-12-1) nodes with virtual register occurrence highlighted



Figure 3: Phases of a Chaitin-Briggs-style allocator

spilling or live range splitting). No such phase exists in [NCG.](#page-34-1) [vreg'](#page-34-7)s are created during instruction selection in the Cmm-to-ASM pass. Further optimizations may disjoint live ranges without renaming them. [NCG'](#page-34-1)s graph coloring allocator only renames live ranges during spilling, where it can locally generate a new name as it emits SPILL and RELOAD meta-instructions around the original instruction.

One approach to *renumbering* uses classical iterative data-flow analysis. Def-use chains are discovered by performing *reaching definitions* analysis, then pair-wise intersection tests are performed to find common uses and union these intersecting def-use chains together into a web. Representing def-use chains as lists, not only requires *m* ∗ *n* space, for *m* definitions and *n* uses, but is also computationally expensive.

Briggs' seminal paper on *optimistic coloring* [\[BCT94\]](#page-35-1) mentions that Chaitin's original graph coloring register allocator  $[CAC^+81]$  $[CAC^+81]$  used this approach, whereas they utilized [Static Single Assignment](#page-34-0) [\(SSA\)](#page-34-0) form.

```
cHO
    movq $block_cHT_info,-64(%rbp)
    movq 7(%rbx),%vI_nKI # Definition of nKI
    \# \lceil \ldots \rceilmovq %vI_nKI,80(%rbp) # nKI dies
    addq $-64, $rbp
    testb $7,%bl
    jne _blk_cHT
    jmp _blk_cHU
cJF
    # \lceil ... \rceilmovq 144(%rbp),%vI_nKI # Definition of nKI
    movq %vI_nKI,%vI_nJY # nKI dies
    # \lceil ... \rceilmovq %vI_nKC,%vI_nKi # Definition of nKI
    \# [\ldots]
```
Listing 2: Excerpt of generated assembly of spilling example

Transforming the program into [SSA](#page-34-0) form will create a new name for each definition of a value and webs can then be simply found by applying the union-find algorithm. The big advantage though of [SSA](#page-34-0) form is, that it enables a wide range of further optimizations.

#### <span id="page-14-0"></span>**1.4 Static Single Assignment Form**

[SSA](#page-34-0) form is widely used in modern compilers, as it facilitates many optimizations and analysis in an efficient manner [\[Sin18\]](#page-37-2). Both data-flow and control-flow is explicit in [SSA.](#page-34-0)

When transforming a program into [SSA,](#page-34-0) each (re-)definition of a variable introduces a new name. Generally, an index is added to to the original variable name, incremented with each definition. For example, the following side-by-side comparison before and after renaming:



As such, each name is defined exactly once and each definition dominates all of its uses.

At points where control-flow merges with conflicting definitions of a variable, *φ*-functions (also known as  $\phi$ -nodes) are inserted. A  $\phi$ -function is a function of the form  $x_n :=$  $\phi(x_0, x_1, \ldots, x_{n-1})$ , where  $x_n$  is a new name for the value and  $\phi$  has as many arguments

<span id="page-15-0"></span>as the basic block has predecessors. Conceptually, a  $\phi$ -function will "select" the right version of x, depending on the actual path taken.  $\phi$ -functions represent parallel copies, such that all  $\phi$ -functions at the beginning of a block are "executed" in parallel and their "results" are in *LiveIn* for the respective basic block [\[SJGS99\]](#page-37-3).

Several different flavors of [SSA](#page-34-0) exist, namely:

- **Minimal SSA**, described in [\[CFR](#page-35-3)<sup>+</sup>91], inserts the minimal number of  $\phi$ -functions for each join-point of the [CFG](#page-34-6) where more than one [SSA-](#page-34-0)name per original name merges.
- **Pruned SSA**, which performs liveness checks to only insert *φ*-functions for variables that are live in a given block. This can drastically reduce the number of  $\phi$ -functions, but comes at the cost of having to perform global data-flow analysis for liveness, plus the liveness checks on insertion [\[CCF91\]](#page-35-4).
- **Semi-pruned SSA** will insert fewer (dead) *φ*-functions, by performing a linear scan over the code to weed out all local live ranges, that do not cross basic block boundaries [\[BCHS98\]](#page-35-5).

#### **1.4.1 SSA in the code generator**

Some issues arise when choosing a [SSA](#page-34-0) form code representation in a code generator. Machine instructions of a real-world [Instruction Set Architecture](#page-34-8) [\(ISA\)](#page-34-8) will often not conform to [SSA](#page-34-0) constraints. For example, such an instruction may modify its operands directly, making it impossible to assign a new name to the new value, e.g.: inc  $*\text{max}$  which is equivalent to  $x_1 := x_0 + 1$ , or add \$4,  $\epsilon$ rax which is  $x_1 := 4 + x_0$ . Instructions may also have implicit operands, like status registers, or some [ISAs](#page-34-8) feature conditional instructions, like this example of ARM assembly: cmp r0, #3; addlt r0, r1, r2 or if  $( r0 < 3 )$  then  $r0 := r1 + r2$  in pseudo-code.

Different approaches to handle these challenges will be addressed In chapters [2](#page-16-0) and [3.](#page-17-0)

# **CHAPTER**

## **Related Work**

<span id="page-16-0"></span>First we are looking at some older papers for approaches to live range discovery without [SSA.](#page-34-0) Chow and Hennessy [\[CH90\]](#page-35-6) describe using data-flow analysis [\[Hec77\]](#page-36-4) to solve separately for the live and reaching definition attributes. Disjoint live ranges are not renamed though, as they rely on their later splitting phase. Chaitin et al.  $[CAC<sup>+</sup>81]$  $[CAC<sup>+</sup>81]$ call this step "getting the right number of names" and use global data-flow analysis to achieve this. In his thesis, Briggs [\[Bri14\]](#page-35-7) describes in more detail how they use union-find on pruned [SSA](#page-34-0) form to this end, noting the efficiency gained having to only union the definitions reaching a *φ*-node, instead of having to union all definitions reaching each use.

 $[CFR+91]$  $[CFR+91]$ , the seminal paper on [SSA,](#page-34-0) describes properties and an algorithm to compute minimal [SSA.](#page-34-0) [\[CCF91\]](#page-35-4) introduces pruned [SSA,](#page-34-0) which only inserts live *φ*-functions and Briggs et al. [\[BCHS98\]](#page-35-5) describe some previously overlooked problems in [SSA](#page-34-0) destruction, as well as semi-pruned [SSA,](#page-34-0) which is cheaper to compute, yet almost as small. Braun et al. [\[BBH](#page-34-9)+13] present a simple and efficient lazy backwards algorithm to construction [SSA.](#page-34-0) Sreedhar et al. [\[SJGS99\]](#page-37-3) describe an approach for [SSA](#page-34-0) destruction,  $[BDR^+09]$  $[BDR^+09]$ suggest a more efficient algorithm and notes on correctness issues for [SSA](#page-34-0) destruction.

Leung et al. [\[LG99\]](#page-36-5) discuss the problems regarding naming constraints for [SSA](#page-34-0) form on machine instructions and represent a scheme to convert between [SSA](#page-34-0) and native code. They propose a three phase approach, which first collects all naming constraints, then marks nodes where these constraints are violated and finally a reconstruct phase to repair the code, which is based on Briggs's et al. [SSA](#page-34-0) destruction in [\[BCHS98\]](#page-35-5).

Dinechin [\[dD14\]](#page-35-9) looks at the challenges [SSA](#page-34-0) form in the code generator poses, like ISA and ABI naming constraints, non-kill target operands and conditional execution in general. They discuss different representations of instruction semantics and which compilation phases benefit from [SSA,](#page-34-0) i.e., phases before pre-pass scheduling, remarking that there is still debate on the benefits for register allocation. The paper also features

#### 2. Related Work

an extensive review on techniques for if-conversion, an optimization technique to replace control flow with straight-line code of predicated or conditional instructions.

In [\[HGG06\]](#page-36-6), Hack et al. present a graph coloring register allocator on [SSA-](#page-34-0)form. They show that the interference graphs of programs in [SSA](#page-34-0) form are *chordal* and that an optimal coloring for such graphs can be found in polynomial time. Furthermore, spilling, coloring and coalescing can be decoupled in this framework. However, coalescing becomes  $\mathcal{NP}$ -complete.

Linear Scan register allocation has also been extended to work on [SSA-](#page-34-0)form by Wimmer et al [\[WF10\]](#page-37-4). Their work uses properties of [SSA-](#page-34-0)form and a block order, in which earlier blocks always dominate later blocks, to significantly simplify the construction of lifetime intervals. Most of the expensive lifetime interval interferences checks can be omitted with the guarantees provided by [SSA-](#page-34-0)form, however the authors note that already split life intervals are no longer in [SSA-](#page-34-0)form and require these checks. The final *resolution* phase also performs [SSA](#page-34-0) destruction. As the core register allocation algorithm stays the same, the quality of generated code remained the same. The authors reported a simplification of the compiler code and measured reduction both in compile times and memory use.

Many code generation and optimization steps, like [SSA-](#page-34-0)destruction, may introduce many register-to-register copies. The task of register coalescing is to eliminate as many unproductive copies as possible and is typically performed as part of register allocation. While it is generally beneficial to remove extraneous copies, it may have a big impact on colorability of the graph, as two coalesced live ranges may have more conflicts. Park and Moon [\[PM04\]](#page-37-5) provide an overview of coalescing algorithms, as they present their *optimistic coalescing*. *Aggressive coalescing*, as used in Chaitin's [\[CAC](#page-35-2)+81] allocator, will coalesce any non-interfering, copy-related nodes in the interference graph. While this removes many unproductive copies, it may also worsen colorability. Park and Moon note however, that it may also *positively* impact colorability, if both coalesced nodes *a* and *b* had an interference with node *n* prior to coalescing, as this lowers the degree of *n* by one. *Conservative coalescing*, by Briggs et al. [\[BCT94\]](#page-35-1), only coalesces two nodes if this is guaranteed not to worsen colorability. Since *conservative coalescing* potentially misses many opportunities for safe coalescing, George and Appel's *iterative coalescing* [\[GA96\]](#page-36-7) interleaves conservative coalescing with the simplification phase of coloring, therefore creating repeated opportunities for coalescing. *Optimistic coalescing* performs *aggressive coalescing*, but splits live ranges again, when they have to be spilled.

<span id="page-17-0"></span>Many improved techniques for [SSA](#page-34-0) destruction also seek to coalesce, or avoid inserting, copies. Dinechin [\[dD14\]](#page-35-9) gives an overview of [SSA](#page-34-0) destruction algorithms in section 4, discussing their different coalescing approaches.

## **Methodology**

<span id="page-18-2"></span>In the first implementation phase, [SSA](#page-34-0) construction and destruction will be performed back to back, in order to discover webs for register allocation. Because of this, we compute pruned[-SSA,](#page-34-0) as we are only interested in live variables. The necessary liveness information is gathered in the existing liveness-pass, via fixed-point data-flow analysis, which also eliminates dead code.

#### <span id="page-18-0"></span>**3.1 Machine Instruction Constraints**

The [SSA](#page-34-0) construction routine receives a linked list of basic blocks containing instructions annotated with register usage and liveness information as input. The instructions are pre-colored and contain references to physical registers for special purpose registers and [ABI](#page-34-10) constraints. These are not promoted to [SSA](#page-34-0) variables. While this simple approach enables live range discovery, it won't be enough for other analysis and optimizations, e.g., moving liveness analysis after [SSA](#page-34-0) construction. One way to remedy this, would be to promote references to physical registers to [SSA](#page-34-0) variables, for which a mapping to the original register is kept. On [SSA](#page-34-0) destruction, these variables are renamed to their original register name, resolving any interferences introduced by optimizations. This scheme was, however, not implemented.

<span id="page-18-1"></span>Modifying instructions, e.g., single argument increment/decrement, 2-Address instructions, and conditional instructions are treated as non-kill definitions that won't introduce a new [SSA](#page-34-0) name.

```
movq $1, %vI_x_0
inc %vI_x_1
addq vI_y0, vI_x2
```
Listing 3: Wrong renaming of modified operands

Listings [3](#page-18-1) and [4](#page-19-1) show what would happen, if non-kill definitions were to be renamed naively. In the case of source-destination operands (modified operands), they are constraint to have the same name. If one were to treat them as normal [SSA](#page-34-0) variable, this constraint

```
movq $1, %vI_x_0 # definition
addq vI_x_0, vI_y # use
cmov $0, %vI_x_1 # Conditional defintion
addq \text{{\tt svI_x1, svI_z}} # Error
```
Listing 4: Wrong renaming of conditional definition

<span id="page-19-2"></span>would have to be modelled, e.g., by a three address instruction with explicit naming constraint. On [SSA](#page-34-0) destruction, an additional copy to the newly named variable would be necessary (Listing [5\)](#page-19-2).

```
movq $1, %vI_x_0
# \lceil ... \rceilmovq vI_x 0, vI_x 1 # Copy to new name
addq %vI_y_0, %vI_x_1
```
Listing 5: Renaming with additional copy

Similarly, in Listing [4,](#page-19-1) the conditional move may not be executed, so using a new name afterwards would not be correct. Since there are possibly two versions of  $\sqrt{\gamma v} I_x$  available, something similar to a *φ*-function would be necessary. To this end, [\[SdF01\]](#page-37-6) introduces *ψ*-functions to model predicated execution.

By not renaming non-kill definitions, the destruction scheme described in section [3.3](#page-20-0) produces correct code and achieves the goal of web renaming. Any future optimization which moves code, must maintain the relative position of these not renamed instructions.

#### <span id="page-19-0"></span>**3.2 SSA construction**

[SSA](#page-34-0) constructions follows the algorithm by Cytron et al.  $[CFR<sup>+</sup>91]$  $[CFR<sup>+</sup>91]$  with added liveness checks.

A *φ*-function is needed at every control flow join point, where multiple definitions of a variable are live. Placing one for each variable in the program at each join point would obviously create huge numbers of useless  $\phi$ -functions. To find out where to place  $\phi$ -nodes, we need to look at the dominance property.

In a directed graph, a node *n* is said to dominate a node *m*, or  $m \in Dom(n)$ , if all paths from the start node to *m* lead through *n*. *n* is said to *strictly dominate m* if  $m \in Dom(n)$ and  $n \neq m$ .

A variable definition in *n* clearly does not require a  $\phi$ -function in *m*, as by the dominance property, every path to *m* must lead through *n*. Instead, *φ*-functions are required at join points just outside the region dominated by *n*. This *dominance frontier* of *n*,  $DF(n)$ , <span id="page-20-2"></span>is defined as  $m \in DF(n), \forall m, s.t. \ q \in Pred(m), \ q \in Dom(n)$  and *n* does not strictly dominate *m*.

We can then place a  $\phi$ -function for each definition in *n* at the beginning of every block in  $DF(n)$ . Each  $\phi$ -function is itself a definition and may induce further  $\phi$ -functions.

The next step is *variable renaming*, in which we introduce new names for each definition. This new name consists of a *base name*, which corresponds to the original variable name and a subscripted index. Compiler generated temporary names need to be assigned a new base name as well.

For the renaming step, the *Dominator Tree* of the [CFG](#page-34-6) has to be computed. The node *m* which *strictly dominates* and is *closest* to *n*, is called the *immediate dominator* of *n* or *IDom*(*n*). The start node does not have an immediate dominator. Each node in the [CFG](#page-34-6) appears in the dominator tree and an edge connects *m* with *n* if *m* is in  $IDom(n)$ .

The renaming algorithm walks the dominator tree in preorder. First, the *φ*-function definitions are renamed, then it iterates over each instruction in the block, replacing uses with the current name on the name stack and creates new names for definitions, pushing them on the name stack. Next, the *φ*-function arguments are filled in for the *successors* of the current block (in the [CFG\)](#page-34-6) with the current names. Then the procedure recurs into the block's children in the dominator tree. After the recursive step, the name stacks are restored to their initial state while entering this block.

#### <span id="page-20-0"></span>**3.3 SSA destruction**

[SSA](#page-34-0) destruction is the process of transforming from [SSA](#page-34-0) form into legal machine instructions. This includes the correct elimination of  $\phi$ -functions. Many improvements and corrections to the original destruction procedure have been proposed ([\[BCHS98\]](#page-35-5), [\[BDR](#page-35-8)+09]). Transforming out of [SSA](#page-34-0) after optimizations have been performed requires much care, as copies have to be placed carefully for correctness. Critical edges<sup>[8](#page-20-1)</sup> in the [CFG](#page-34-6) have to be split for some of these approaches to work and algorithms have been carefully designed to avoid this costly process.

Sreedhar et al. [\[SJGS99\]](#page-37-3) describe the difference between conventional SSA (CSSA) and transformed SSA (TSSA). Directly after [SSA](#page-34-0) construction, before applying optimizations like copy propagation or code motion, the code is in CSSA. Therefore, no interferences have been introduced between [SSA](#page-34-0) variables. That paper also presents an algorithm to transform TSSA back into CSSA.

Since [SSA](#page-34-0) destruction is performed immediately after [SSA](#page-34-0) construction in this implementation and no optimizations have been performed on the [SSA](#page-34-0) form, it is still in CSSA. This allows for a straight forward and simple destruction algorithm. No copies need to be inserted and complicated control flow, such as critical edges, can be ignored.

<span id="page-20-1"></span><sup>&</sup>lt;sup>8</sup>A critical node is an edge from a node with multiple successors to a node with multiple predecessors

#### 4. Implementation

<span id="page-21-3"></span><span id="page-21-0"></span>To convert out of [SSA](#page-34-0) in this simple case, [vregs](#page-34-7) have to be renamed in a way, such that they form coherent webs. By performing a union-find over all *φ*-functions in the code, these webs can be built. All the sets associated with the  $\phi$  argument names are unioned with the defined name. Purely local [vregs](#page-34-7) are not included and keep their name. Each disjoint set produced this way, represents a web and all members of the set are renamed to a common name.

# **CHAPTER**

## **Implementation**

#### <span id="page-21-1"></span>**4.1 Unique Names**

[GHC](#page-34-2) uses so called Uniques as identifiers internally, which allow for easy creation of unique values by passing a unique source around. They are designed for fast comparisons and are represented by an integer. Since virtual registers are identified by Uniques, using the subscript based naming scheme common in [SSA](#page-34-0) is not easily applicable. This implementation simply generates completely new Uniques, with the drawback that debugging becomes more difficult, as the connection between different Uniques is not obvious.

#### <span id="page-21-2"></span>**4.2 Code Representation**

#### **4.2.1 Liveness**

Each instruction is either a machine instruction or a meta-instruction (SPILL, RELOAD). Register usage consists of a list of registers read and written per instruction. These contain both virtual and physical registers. The liveness information contains three sets of registers "born" (written to for the first time), dying because they were read for the last time and because they were written to for the last time. *LiveIn* sets are also provided for each block.

#### <span id="page-22-2"></span>**4.2.2 Phi-functions**

For the [SSA-](#page-34-0)form, LiveBasicBlocks are wrapped with SsaBasicBlocks, which also contain a list of *φ*-functions. Keeping them separate, instead of as pseudo-instructions in the instruction list, makes it easier to identify and process them. This also avoids introducing yet another intermediate representation, as adding *φ*-functions as metainstructions to the current instruction type would mean, that they would have to be handled even after [SSA](#page-34-0) destruction.

```
type LiveCmmDecl statics instr
    = GenCmmDecl
            statics
            LiveInfo
            [SCC (LiveBasicBlock instr)]
type LiveBasicBlock instr
    = GenBasicBlock (LiveInstr instr)
```
Listing 6: GHC's Live Code Representation

```
data SSABasicBlock instr
    = SSABB [PhiFun] (LiveBasicBlock instr)
data SccBits a
    = AcyclicBit a | CyclicBit Int a
type BlockLookupTable instr
    = UniqDFM BlockId (SccBits (SSABasicBlock instr))
```
Listing 7: SSA Code Representation

#### **4.2.3 Flattening SCCs**

The [SSA](#page-34-0) transformation for [GHC](#page-34-2) was designed to be minimally invasive and integrate into the existing backend passes. Therefore, it has to deal with the current representation for liveness annotated procedures. These don't simply contain lists of basic blocks, but instead lists of [Strongly Connected Components](#page-34-11) [\(SCCs\)](#page-34-11), where an acyclic [SCC](#page-34-12) only contains one basic block and a cyclic [SCC](#page-34-12) (representing a top-level loop) contains a list of basic blocks. Listing [6](#page-22-0) shows a simplified version of the data structures, where LiveInfo contains the *LiveIn* sets, among other things and GenBasicBlock contains a BlockId and a list of live instructions.

<span id="page-23-2"></span>The [SCCs](#page-34-11) are flattened into a list and each cyclic [SCC](#page-34-12) is numbered, so that consecutive SccBits can later be mapped to their corresponding acyclic [SCC.](#page-34-12) This flattened list is then stored in a map of *basic block IDs* to SccBits wrapped basic blocks, to enable random access of blocks (Listing [7\)](#page-22-1). The employed UniqDFM type is a deterministic finite map of Uniques to elements and guarantees maintaining insertion order for deterministic iteration, which is important for restoring the original structure. [GHC'](#page-34-2)s Unique Map and Set types are very efficient, as they are wrappers around *Data.IntMap* which is an implementation of big-endian patricia trees [\[OG98\]](#page-37-7). Most operations have a worst-case time complexity of  $O(min(n, W))$ , where *W* is the word size in bits.

#### <span id="page-23-0"></span>**4.3 Pruned SSA**

Since the main focus lies on improving register allocation, it is important to remove any unproductive  $\phi$ -function. Otherwise bogus interferences may be added to the conflict graph.

Before inserting  $x_n = \phi(x_0, \ldots, x_{n-1})$  in block *b*, a check is performed whether  $x \in$  $Live_{In}(b)$ , i.e., this variable was live in the original program at that point. This condition is what yields *pruned-SSA* form.

An optimization was added, described in [\[CT11\]](#page-35-10): Each block should only be visited once per variable. To this end, a visited set is set to the empty set for a new variable and each block is added upon visit.

#### <span id="page-23-1"></span>**4.4 Efficient Renaming**

One of the routines measured to be most resource intensive in the [SSA](#page-34-0) transformation, is rename variables. As the rename routine iterates over the instructions in a block, it renames encountered uses and definitions place new names on a stack. This global name stack is needed, as the algorithm traverses the [CFG](#page-34-6) depth-first, where "deeper lying" blocks may redefine a basename which is used in a direct successor of the current block.

This solution keeps one stack for each name and a list of change-sets for the current block nesting depth. As a new name for a [vreg](#page-34-7) is generated, it's original name is added to the local rename set. If it was already an element of that set, the current top of the name stack is replaced, otherwise the new name pushed onto the stack. This makes sure, that the name stack size does not grow with multiple redefinitions within blocks, but is bounded by the depth of the dominator tree. After visiting all successors, the change-set for the current block contains all the redefined basenames, whose top stack elements must be removed, as the recursive function returns.

#### <span id="page-24-4"></span><span id="page-24-0"></span>**4.5 Updating Liveness Information**

To avoid having to recompute liveness information after [SSA](#page-34-0) destruction, it is kept up-to-date during construction and destruction.

As [vregs](#page-34-7) are being renamed, so are their occurrences in the liveness information of the current instruction and the *LiveIn* sets. To get correct *LiveIn* sets, it is necessary to first process a blocks  $\phi$ -functions, as their definitions are considered to be in  $Live_{In}$ . Then all names within the current block's *Live<sub>In</sub>* set are replaced with their new names from the top of the rename stacks.

<span id="page-24-1"></span>During [SSA](#page-34-0) destruction, *LiveIn* sets are updated while renaming webs. Each entry of the set is replaced with its corresponding set-name in the union-find data-structure.

# CHAPTER 5

## **Results**

The implementation was evaluated using Haskell's Nofib benchmark suite<sup>[9](#page-24-3)</sup>, which contains 154 programs organized in 7 groups. All benchmarks were compiled with the same GHC 9.1 development version containing the [SSA](#page-34-0) transformation patches, activatable by a command-line switch and general optimizations  $(-02)$ . The hardware used is a x86\_64 machine with 16GB of RAM and an AMD Ryzen 7 4700U CPU with 8 cores. Benchmarks were run five times each and average results for CPU cycles are shown.

The tables printed here focus on the *real* group, which contains larger benchmarks, aimed at being more realistic.

#### <span id="page-24-2"></span>**5.1 vreg names**

The data confirms, that most programs contain disjoint live ranges with the same [vreg](#page-34-7) name, as can be seen in Table [1.](#page-25-1) All but eight programs showed an increase in unique [vreg](#page-34-7) after [SSA](#page-34-0) destruction. The median increase for the *real* group is 4.89% and 5.44% overall, with the maximum increase being 21.04%.

<span id="page-24-3"></span> $^{9}$ https://gitlab.haskell.org/ghc/nofib/ – Accessed 23.3.2021

<span id="page-25-2"></span><span id="page-25-1"></span>

|                                               | Before SSA        | After SSA         |
|-----------------------------------------------|-------------------|-------------------|
| real/anna                                     | 19409             | 20403             |
| $\overline{\text{real}/\text{ben}}$ -raytrace | 6431              | 6937              |
| $\overline{\text{real}}/\text{bspt}$          | 4019              | 4287              |
| real/cacheprof                                | 6325              | 6633              |
| real/compress                                 | 798               | 839               |
| real/compress2                                | $\overline{2514}$ | 2876              |
| $\overline{\text{real/eff/CS}}$               | 87                | 87                |
| real/eff/CSD                                  | $\overline{43}$   | $\overline{43}$   |
| real/eff/FS                                   | $\overline{74}$   | $\overline{75}$   |
| real/eff/S                                    | 10                | 10                |
| $\overline{\text{real}/\text{eff}/\text{VS}}$ | $\overline{93}$   | 96                |
| real/eff/VSD                                  | $\overline{32}$   | $\overline{32}$   |
| real/eff/VSM                                  | $\overline{220}$  | $\overline{220}$  |
| real/fem                                      | 3375              | 3665              |
| real/fluid                                    | 8740              | 9437              |
| real/fulsom                                   | 4512              | 4812              |
| real/gamteb                                   | 1571              | 1771              |
| $\overline{\text{real/gg}}$                   | 3719              | 3858              |
| $\overline{\text{real}}/\text{prep}$          | 1387              | 1469              |
| real/hidden                                   | 2441              | 2572              |
| real/hpg                                      | 3297              | 3390              |
| real/infer                                    | 2383              | $\overline{2418}$ |
| real/lift                                     | 1883              | 1933              |
| real/linear                                   | 3018              | $\overline{3172}$ |
| real/maillist                                 | 187               | 198               |
| $real/mk$ hprog                               | 564               | 593               |
| $\overline{\text{real}/\text{parser}}$        | $48\overline{75}$ | 5281              |
| real/pic                                      | 1416              | 1515              |
| real/prolog                                   | 1208              | 1243              |
| real/reptile                                  | 3486              | 3631              |
| real/rsa                                      | 157               | 168               |
| real/sec                                      | 4342              | 4810              |
| real/smallpt                                  | 1871              | $\overline{2090}$ |
| real/symalg                                   | 2143              | $\overline{2320}$ |
| real/vertas                                   | 18766             | 19667             |

Table 1: Unique vregs before SSA construction and after SSA destruction

#### <span id="page-25-0"></span>**5.2 Spill Code**

[GHC'](#page-34-2)s graph coloring register allocator uses a very simplistic spill heuristic, namely it will always spill the longest live range. This worked unexpectedly well so far, possibly because the disjoint live ranges will be longer on average. Therefore, a modified version was tested, using the classic cost function by Chaitin,  $cost(n)/deg(n)$ , where  $deg(n)$  is <span id="page-26-1"></span>the degree of node *n* in the conflict graph and

$$
cost(n) = \sum_{\substack{I \in Instruments\\c \text{ number of defs and uses of n in I}}} c * freq(I)
$$

with *freq*(*I*) being the expected execution frequency of instruction *I*.

Table [2](#page-27-0) shows spill numbers for the graph coloring register allocator. *graph* is the baseline, *graph-ssa* with [SSA](#page-34-0) transformation and *chaitin-ssa* with [SSA](#page-34-0) transformation and the Chaitin spill heuristic. *Spills* includes STORE and LOAD instructions inserted, *reg-reg* includes all remaining register-register moves. The median reduction of spills with [SSA](#page-34-0) transformation for *real* is -7.09%, of reg-reg moves -4.82%. Using the Chaitin spill heuristic, median reduction of spills is zero and of reg-reg moves -5.42\%. This seems to be biased by the *real/eff* benchmarks, which didn't spill any registers to begin with. Excluding all programs in *real*, which never spilled across all test configurations, the median change of spills for *graph-ssa* becomes -21.67% and -27.50% for *chaitin-ssa*.

The median of changes in spills - excluding never spilling programs - for the whole benchmark suite is -20% for *graph-ssa* and -24.54% for *chaitin-ssa*. reg-reg moves over all programs changed by -5.4% for *graph-ssa* and -5.67% for *chaitin-ssa*.

Changes for the linear scan register allocator, which has been tuned and optimized more, are less pronounced and can be seen in Table [3.](#page-28-0) The mean change in spills is actually zero and for reg-reg moves a negligible increase of 0.34%. This does not change when excluding never spilling programs.

For the entire benchmark suite, median change in both spills and reg-reg moves is zero. When excluding never spilling programs, the median change in spills becomes -1.09%.

#### <span id="page-26-0"></span>**5.3 Runtime**

To evaluate runtime performance, CPU cycles as measured by the hardware performance counters were collected. The programs in the *real/eff* subgroup were excluded, as they did not see any changes in unique names or spills. Due to their small size (e.g., real/eff/VSD has less than 30 lines) and short runtime (e.g., 12ms), any changes in runtime are most likely just noise.

Runtime performance with the graph coloring allocator shows good improvements for some benchmarks, like *real/ben-raytrace* (-8.06%), *real/hpg* (-3.35%) and *real/linear* (-3.63%). Of these programs, only *real/ben-raytrace* also exhibits a big reduction in spills, whereas *real/linear* even sees a slight *increase* in spills (but reduction in reg-reg moves of  $-6.62\%$ ).

Several benchmarks see increases in runtime, e.g., *real/grep* (+3.09%), *real/infer* (+6.58%) and *real/reptile* (+5.44%). A possible explanation for *real/grep* is the observed increase of 3.66% in cache misses. This benchmark even shows an improved runtime by -6.02% when using the Chaitin spill heuristic (-1.22% cache misses). *real/infer* shows performance

#### 5. Results

<span id="page-27-0"></span>

Table 2: Inserted spills and remaining reg-reg moves for graph coloring register allocator

regressions for both heuristics, for *graph-ssa* possibly because of an increase in cache misses by 1.04%, whereas *chaitin-ssa* inserted two additional spills.

The average change in runtime is small. To minimize the impact of changes to very short running programs (less than a second), versus very long running programs (e.g., 58 seconds), we weight each value by the fraction of total runtime of *real* it contributes. The weighted average change in runtime for *graph-ssa* is -1.90% and -2.25% for *chaitin-ssa*.

<span id="page-28-0"></span>

|                                                | linear            | linear-ssa        |                   |                   |
|------------------------------------------------|-------------------|-------------------|-------------------|-------------------|
|                                                | $s$ pills         | reg-reg           | spills            | reg-reg           |
| real/anna                                      | $\overline{113}$  | 9106              | $\overline{115}$  | 9146              |
| $real/ben-ray trace$                           | $\frac{558}{558}$ | 2495              | $\frac{587}{587}$ | 2444              |
| $\overline{\text{real}/\text{b} \text{spt}}$   | $\overline{52}$   | 1803              | $\overline{78}$   | 1812              |
| real/cacheprof                                 | $\overline{72}$   | 3683              | $\overline{69}$   | 3697              |
| $\overline{\text{real}/\text{compress}}$       | $\overline{24}$   | $\overline{342}$  | $\overline{24}$   | $\overline{343}$  |
| $\overline{\text{real}/\text{compress2}}$      | $\overline{254}$  | $\overline{818}$  | $\overline{290}$  | 790               |
| $\overline{\text{real/eff/CS}}$                | $\overline{0}$    | $\overline{83}$   | $\overline{0}$    | $\overline{83}$   |
| $\overline{\text{real/eff/CSD}}$               | $\overline{0}$    | $\overline{30}$   | $\overline{0}$    | $\overline{30}$   |
| $\overline{\text{real}/\text{eff}/\text{FS}}$  | $\overline{0}$    | $\overline{33}$   | $\overline{0}$    | $\overline{33}$   |
| $\overline{\text{real/eff/S}}$                 | $\overline{0}$    | $\overline{7}$    | $\overline{0}$    | 7                 |
| real/eff/VS                                    | $\overline{0}$    | $\overline{54}$   | $\theta$          | $\overline{54}$   |
| $\overline{\text{real}/\text{eff}/\text{VSD}}$ | $\overline{0}$    | 14                | $\theta$          | 14                |
| $\overline{\text{real}/\text{eff}/\text{VSM}}$ | $\overline{0}$    | $\overline{64}$   | $\overline{0}$    | $\overline{64}$   |
| $real/$ fem                                    | $\overline{92}$   | 1332              | $\overline{92}$   | 1329              |
| real/fluid                                     | $\overline{353}$  | 2960              | 398               | 2939              |
| real/fulsom                                    | $\overline{0}$    | 1245              | $\theta$          | 1250              |
| real/gamteb                                    | 151               | $\overline{583}$  | 147               | 585               |
| real/gg                                        | $\overline{18}$   | 1799              | $\overline{18}$   | 1797              |
| real/grep                                      | $\overline{12}$   | $\overline{715}$  | 12                | 718               |
| real/hidden                                    | $\overline{40}$   | 977               | $\overline{40}$   | 980               |
| real/hpg                                       | $\overline{4}$    | 1238              | $\overline{4}$    | 1247              |
| real/inter                                     | 12                | 949               | 12                | 952               |
| real/lift                                      | $\overline{0}$    | 629               | $\overline{0}$    | 634               |
| real/linear                                    | $\overline{40}$   | 1337              | $\overline{27}$   | 1317              |
| real/maillist                                  | $\overline{12}$   | $\overline{125}$  | $\overline{12}$   | 130               |
| $\overline{\text{real/mkhprog}}$               | $\overline{0}$    | $\overline{305}$  | $\overline{0}$    | $\overline{305}$  |
| $\overline{\text{real}/\text{parser}}$         | $\overline{72}$   | $\overline{2435}$ | $\overline{72}$   | 2449              |
| real/pic                                       | $\overline{428}$  | 594               | 396               | $\overline{619}$  |
| $\overline{\text{real}/\text{prolog}}$         | $\overline{14}$   | 580               | $\overline{17}$   | $\overline{582}$  |
| $\overline{\text{real/reptil}}$                | $\overline{32}$   | $\overline{2051}$ | $\overline{36}$   | $\overline{2050}$ |
| real/rsa                                       | $\overline{12}$   | $\overline{110}$  | $\overline{12}$   | $\overline{115}$  |
| $\overline{\text{real}/\text{scs}}$            | 1264              | $\overline{2261}$ | 1117              | $\overline{2243}$ |
| $\overline{\text{real}/\text{smallpt}}$        | $\overline{110}$  | 754               | 108               | $\overline{820}$  |
| $\overline{\text{real/symalg}}$                | $\overline{9}$    | 1498              | $\overline{11}$   | 1515              |
| $real/$ veritas                                | $\overline{76}$   | $\frac{1}{8490}$  | $\overline{76}$   | 8464              |

Table 3: Inserted spills and remaining reg-reg moves for linear scan register allocator

Chaitin's heuristic performs generally better than the simplistic live range length based one. A notable outlier is *real/reptile*. While we see a performance degradation for both heuristics, Chaitin's actually leads to a a reduction of 60% in spills. Spill placement may be at fault though, with a 3% increase in cache misses. *real/veritas*'s increased runtime can be explained by the 8.5% increase in spills.

Table [5](#page-31-0) shows the runtime benchmarks using the linear scan allocator. There are some



Table 4: CPU cycles for *real* with graph coloring register allocator

<span id="page-30-1"></span>improvements to *real/hidden*, *real/infer* and also to *real/hpg*, despite only small changes to number of spills. Of these, only *real/hpg* saw a drop of -4% in cache misses. The result of *real/infer* appears to be spurious though, given the high standard error for baseline runtime and the fact, that earlier runs did not show such an increase for this program.

*real/cacheprof*, *real/linear* and *real/mkhprog* showed significant increases in runtime. Of these, *real/cacheprof* did not see big changes in spill numbers, yet this result was reproducible. Spill-code placement may be to blame, as an increase of 1.53% in cache misses can be seen. The case is similar for *real/linear*, despite the reduction in spills. *real/mkhprog*'s result may be bogus, as this program did not spill, saw no changes to reg-reg moves, an actual *decrease* in cache misses and an elevated standard error for the *linear-ssa* measurement.

The average runtime has not changed significantly, both for *real* and the whole benchmark suite.

#### <span id="page-30-0"></span>**5.4 Compile Time**

To measure the impact on compile times, the library version of Haskell's package manager *Cabal* was compiled with maximum optimizations  $(-02)$  and again with the addition of [SSA](#page-34-0) transformation. Containing around 100,000 lines of Haskell code, it takes sufficiently long to compile to notice any changes in compile times. The observed increase in compile times is quite considerable, at almost 30%. It's important to note though, that the implementation has not been particularly optimized.

<span id="page-31-0"></span>

|                             | linear  | std. err.             | linear-ssa (rel) | std. err. |
|-----------------------------|---------|-----------------------|------------------|-----------|
| real/anna                   | 5.01E9  | $0.20\%$              | $-0.66\%$        | $0.30\%$  |
| real/ben-raytrace           | 5.71E11 | $2.0e-2%$             | $-1.34\%$        | $1.9e-2%$ |
| real/bspt                   | 3.52E11 | $0.20\%$              | 0.21%            | $0.20\%$  |
| real/cacheprof              | 3.08E12 | $0.30\%$              | 3.85%            | $0.40\%$  |
| real/compress               | 2.41E13 | $0.20\%$              | $0.50\%$         | $0.20\%$  |
| real/compress2              | 2.86E14 | $5.5e-2%$             | $-2.09\%$        | $0.20\%$  |
| $real/\overline{fem}$       | 3.29E15 | $7.9e-2%$             | $-1.78%$         | $0.40\%$  |
| real/fluid                  | 2.74E16 | $0.30\%$              | $-1.19\%$        | $0.10\%$  |
| real/fulsom                 | 1.77E17 | $6.\overline{1e-2\%}$ | 2.03%            | 0.10%     |
| real/gamteb                 | 4.76E18 | $0.30\%$              | 1.96%            | $6.7e-2%$ |
| $\overline{\text{real/gg}}$ | 3.38E19 | $0.10\%$              | $-0.03%$         | $0.30\%$  |
| $real/$ grep                | 3.27E20 | $0.10\%$              | $1.38\%$         | $0.20\%$  |
| real/hidden                 | 4.53E21 | $4.1e-2%$             | $-4.48\%$        | $0.20\%$  |
| real/hpg                    | 3.10E22 | $0.40\%$              | $-2.58\%$        | $6.3e-2%$ |
| real/infer                  | 3.93E23 | $3.00\%$              | $-3.21\%$        | $0.20\%$  |
| real/lift                   | 3.05E24 | $8.8e-2%$             | $1.44\%$         | $0.10\%$  |
| real/linear                 | 5.50E25 | $5.5e-2%$             | $4.10\%$         | $0.10\%$  |
| real/maillist               | 2.82E26 | $0.50\%$              | 1.11%            | $0.40\%$  |
| real/mkhprog                | 1.64E25 | $0.70\%$              | $3.29\%$         | $1.30\%$  |
| real/ensure                 | 2.76E28 | $0.20\%$              | $0.60\%$         | $5.8e-2%$ |
| real/pic                    | 1.94E29 | $0.50\%$              | $-1.09\%$        | $0.40\%$  |
| real/prog                   | 3.56E30 | $0.20\%$              | $-2.23\%$        | $0.20\%$  |
| real/reptile                | 2.59E31 | $0.30\%$              | $0.12\%$         | $0.10\%$  |
| real/rsa                    | 2.77E32 | $7.7e-2%$             | $0.62\%$         | $6.4e-2%$ |
| real/sec                    | 2.74E33 | $0.10\%$              | $-0.34\%$        | $0.40\%$  |
| real/mallpt                 | 1.81E36 | $0.10\%$              | $-0.57%$         | $0.20\%$  |
| real/symalg                 | 3.16E35 | $7.0e-2%$             | $0.08\%$         | $1.7e-2%$ |
| $real/$ veritas             | 2.69E36 | $0.10\%$              | 1.97%            | $0.30\%$  |
| Geo. Mean                   |         |                       | $0.04\%$         |           |
| Median                      |         |                       | $0.10\%$         |           |
| Weighted avg.               |         |                       | $-0.56\%$        |           |

Table 5: CPU cycles for *real* with linear scan register allocator

# CHAPTER  $6$

## <span id="page-32-1"></span><span id="page-32-0"></span>**Conclusions and Future Work**

[Static Single Assignment](#page-34-0) form intermediate representations are a fundamental technique in modern compiler construction, providing a sparse representation of data-flow information and enabling many optimizations in an efficient manner. Even at a late state in the compilation pipeline, when dealing with native code, [SSA](#page-34-0) can be very beneficial.

Adding [SSA](#page-34-0) transformation to [GHC'](#page-34-2)s native code generator effectively performs renumbering of disjoint live ranges, facilitating register allocation and leading to reduced spill code insertion. This is especially pronounced for the graph coloring register allocator. Furthermore it enables future optimizations, such as sparse conditional constant propagation or live range splitting in the register allocator.

However, more work needs to be done to increase efficiency of the implementation. Using the same base algorithm by Cytron et al., semi-pruned [SSA](#page-34-0) form could be computed more cheaply, followed by liveness analysis on [SSA](#page-34-0) form. The approach by Braun et al. [\[BBH](#page-34-9)+13] also promises to be more efficient, not requiring prior liveness analysis or computation of dominance frontiers. Another advantage is the fact, that it produces pruned [SSA](#page-34-0) directly.

To enable further optimizations, it may also become necessary to model machine instruction constraints.

## **Acronyms**

<span id="page-34-10"></span><span id="page-34-3"></span>**ABI** Application Binary Interface. [9](#page-18-2)

<span id="page-34-6"></span>**CFG** Control-flow graph. [2,](#page-11-6) [6,](#page-15-0) [11,](#page-20-2) [14](#page-23-2)

<span id="page-34-2"></span>**GHC** Glasgow Haskell Compiler. [vii,](#page-6-1) [1,](#page-10-4) [2,](#page-11-6) [12–](#page-21-3)[14,](#page-23-2) [16,](#page-25-2) [23](#page-32-1)

<span id="page-34-5"></span>**IR** intermediate representation. [2](#page-11-6)

<span id="page-34-8"></span>**ISA** Instruction Set Architecture. [6](#page-15-0)

<span id="page-34-1"></span>**NCG** Native Code Generator backend. [vii,](#page-6-1) [2,](#page-11-6) [4](#page-13-1)

<span id="page-34-12"></span>**SCC** Strongly Connected Component. [13,](#page-22-2) [14](#page-23-2)

<span id="page-34-11"></span>**SCCs** Strongly Connected Components. [13,](#page-22-2) [14](#page-23-2)

<span id="page-34-0"></span>**SSA** Static Single Assignment. [vii,](#page-6-1) [4](#page-13-1)[–15,](#page-24-4) [17,](#page-26-1) [21,](#page-30-1) [23](#page-32-1)

<span id="page-34-7"></span><span id="page-34-4"></span>**vreg** virtual register. [3,](#page-12-2) [4,](#page-13-1) [12,](#page-21-3) [14,](#page-23-2) [15](#page-24-4)

## **Bibliography**

<span id="page-34-9"></span>[BBH+13] Matthias Braun, Sebastian Buchwald, Sebastian Hack, Roland Leißa, Christoph Mallon, and Andreas Zwinkau. Simple and efficient construction of static single assignment form. In Ranjit Jhala and Koen De Bosschere, editors, *Compiler Construction*, pages 102–122, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.

- <span id="page-35-0"></span>[BBN+18] Jean-Philippe Bernardy, Mathieu Boespflug, Ryan R. Newton, Simon Peyton Jones, and Arnaud Spiwack. Linear Haskell: practical linearity in a higherorder polymorphic language. *Proc. ACM Program. Lang.*, 2(POPL):5:1–5:29, 2018.
- <span id="page-35-5"></span>[BCHS98] Preston Briggs, Keith D. Cooper, Timothy J. Harvey, and L. Taylor Simpson. Practical improvements to the construction and destruction of static single assignment form. *Softw. Pract. Exp.*, 28(8):859–881, 1998.
- <span id="page-35-1"></span>[BCT94] Preston Briggs, Keith D. Cooper, and Linda Torczon. Improvements to graph coloring register allocation. *ACM Trans. Program. Lang. Syst.*, 16(3):428–455, 1994.
- <span id="page-35-8"></span>[BDR+09] Benoit Boissinot, Alain Darte, Fabrice Rastello, Benoît Dupont de Dinechin, and Christophe Guillon. Revisiting Out-of-SSA translation for correctness, code quality and efficiency. In *Proceedings of the CGO 2009, The Seventh International Symposium on Code Generation and Optimization, Seattle, Washington, USA, March 22-25, 2009*, pages 114–125. IEEE Computer Society, 2009.
- <span id="page-35-7"></span>[Bri14] Preston Briggs. *Register Allocation via Graph Coloring*. PhD thesis, Rice University, 2014.
- <span id="page-35-2"></span>[CAC+81] Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins, and Peter W. Markstein. Register allocation via coloring. *Comput. Lang.*, 6(1):47–57, 1981.
- <span id="page-35-4"></span>[CCF91] Jong-Deok Choi, Ron Cytron, and Jeanne Ferrante. Automatic construction of sparse data flow evaluation graphs. In David S. Wise, editor, *Conference Record of the Eighteenth Annual ACM Symposium on Principles of Programming Languages, Orlando, Florida, USA, January 21-23, 1991*, pages 55–66. ACM Press, 1991.
- <span id="page-35-3"></span>[CFR+91] Ron Cytron, Jeanne Ferrante, Barry K. Rosen, Mark N. Wegman, and F. Kenneth Zadeck. Efficiently computing static single assignment form and the control dependence graph. *ACM Trans. Program. Lang. Syst.*, 13(4):451– 490, 1991.
- <span id="page-35-6"></span>[CH90] Fred C. Chow and John L. Hennessy. The priority-based coloring approach to register allocation. *ACM Trans. Program. Lang. Syst.*, 12(4):501–536, 1990.
- <span id="page-35-10"></span>[CT11] Keith D. Cooper and Linda Torczon. *Engineering a Compiler (Second Edition)*. Morgan Kaufmann, 2011.
- <span id="page-35-9"></span>[dD14] Benoît Dupont de Dinechin. Using the SSA-Form in a code generator. In Albert Cohen, editor, *Compiler Construction - 23rd International Conference, CC 2014, Held as Part of the European Joint Conferences on Theory and*

*Practice of Software, ETAPS 2014, Grenoble, France, April 5-13, 2014. Proceedings*, volume 8409 of *Lecture Notes in Computer Science*, pages 1–17. Springer, 2014.

- <span id="page-36-7"></span>[GA96] Lal George and Andrew W. Appel. Iterated register coalescing. *ACM Trans. Program. Lang. Syst.*, 18(3):300–324, 1996.
- <span id="page-36-4"></span>[Hec77] Matthew S Hecht. *Flow analysis of computer programs*. Elsevier Science Inc., 1977.
- <span id="page-36-6"></span>[HGG06] Sebastian Hack, Daniel Grund, and Gerhard Goos. Register allocation for programs in SSA-form. In Alan Mycroft and Andreas Zeller, editors, *Compiler Construction, 15th International Conference, CC 2006, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2006, Vienna, Austria, March 30-31, 2006, Proceedings*, volume 3923 of *Lecture Notes in Computer Science*, pages 247–262. Springer, 2006.
- <span id="page-36-1"></span>[HHJW07] Paul Hudak, John Hughes, Simon L. Peyton Jones, and Philip Wadler. A history of Haskell: being lazy with class. In Barbara G. Ryder and Brent Hailpern, editors, *Proceedings of the Third ACM SIGPLAN History of Programming Languages Conference (HOPL-III), San Diego, California, USA, 9-10 June 2007*, pages 1–55. ACM, 2007.
- <span id="page-36-2"></span>[Jon92] Simon L. Peyton Jones. Implementing lazy functional languages on stock hardware: The Spineless Tagless G-Machine. *J. Funct. Program.*, 2(2):127– 202, 1992.
- <span id="page-36-3"></span>[JRR99] Simon L. Peyton Jones, Norman Ramsey, and Fermin Reig. C–: A portable assembly language that supports garbage collection. In Gopalan Nadathur, editor, *Principles and Practice of Declarative Programming, International Conference PPDP'99, Paris, France, September 29 - October 1, 1999, Proceedings*, volume 1702 of *Lecture Notes in Computer Science*, pages 1–28. Springer, 1999.
- <span id="page-36-0"></span>[JW93] Simon L. Peyton Jones and Philip Wadler. Imperative functional programming. In Mary S. Van Deusen and Bernard Lang, editors, *Conference Record of the Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Charleston, South Carolina, USA, January 1993*, pages 71–84. ACM Press, 1993.
- <span id="page-36-5"></span>[LG99] Allen Leung and Lal George. Static single assignment form for machine code. In Barbara G. Ryder and Benjamin G. Zorn, editors, *Proceedings of the 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Atlanta, Georgia, USA, May 1-4, 1999*, pages 204–214. ACM, 1999.
- <span id="page-37-7"></span>[OG98] Chris Okasaki and Andrew Gill. Fast mergeable integer maps. In *In Workshop on ML*, pages 77–86, 1998.
- <span id="page-37-5"></span>[PM04] Jinpyo Park and Soo-Mook Moon. Optimistic register coalescing. *ACM Trans. Program. Lang. Syst.*, 26(4):735–765, 2004.
- <span id="page-37-0"></span>[SCJD07] Martin Sulzmann, Manuel M. T. Chakravarty, Simon L. Peyton Jones, and Kevin Donnelly. System F with type equality coercions. In François Pottier and George C. Necula, editors, *Proceedings of TLDI'07: 2007 ACM SIGPLAN International Workshop on Types in Languages Design and Implementation, Nice, France, January 16, 2007*, pages 53–66. ACM, 2007.
- <span id="page-37-6"></span>[SdF01] Artour Stoutchinin and François de Ferrière. Efficient static single assignment form for predication. In Yale N. Patt, Josh Fisher, Paolo Faraboschi, and Kevin Skadron, editors, *Proceedings of the 34th Annual International Symposium on Microarchitecture, Austin, Texas, USA, December 1-5, 2001*, pages 172–181. ACM/IEEE Computer Society, 2001.
- <span id="page-37-2"></span>[Sin18] Jeremy Singer. Introduction. In *Single Static Assignment Book*, chapter 1. 2018. INRIA.
- <span id="page-37-3"></span>[SJGS99] Vugranam C. Sreedhar, Roy Dz-Ching Ju, David M. Gillies, and Vatsa Santhanam. Translating out of static single assignment form. In Agostino Cortesi and Gilberto Filé, editors, *Static Analysis, 6th International Symposium, SAS '99, Venice, Italy, September 22-24, 1999, Proceedings*, volume 1694 of *Lecture Notes in Computer Science*, pages 194–210. Springer, 1999.
- <span id="page-37-1"></span>[TC10] David A. Terei and Manuel M. T. Chakravarty. An LLVM backend for GHC. In Jeremy Gibbons, editor, *Proceedings of the 3rd ACM SIGPLAN Symposium on Haskell, Haskell 2010, Baltimore, MD, USA, 30 September 2010*, pages 109–120. ACM, 2010.
- <span id="page-37-4"></span>[WF10] Christian Wimmer and Michael Franz. Linear scan register allocation on SSA form. In Andreas Moshovos, J. Gregory Steffan, Kim M. Hazelwood, and David R. Kaeli, editors, *Proceedings of the CGO 2010, The 8th International Symposium on Code Generation and Optimization, Toronto, Ontario, Canada, April 24-28, 2010*, pages 170–179. ACM, 2010.