
introduction

  • a language can be natural language, computer language, language of mathematics

  • processing general-purpose programming languages:

    • interpreter: execute instructions while traversing the program (Python)
    • compiler: traverse program, generate executable code to run later (Rust, C)
  • general compiler organization

    • source code: written by programmers; higher level, for abstraction and productivity
    • target code: efficiently run on hardware, low level
  • front end & back end in compiler

    front end ----------------------------------------> back end

    lexical analyzer -> parser -> name analyzer -> type checker -> intermediate code generator -> JIT compiler or platform-specific back end

  • the program is first parsed into a tree structure, the Abstract Syntax Tree (AST)

    • nodes represent arithmetic operations, statements, blocks
    • leaves represent constants, variables, methods

theory of formal languages

word

definition of a word

Let A be an alphabet {a, b, c, …}

define the words of length n, A^n^, as follows:

  • $A^0 = \set{\epsilon}$, the only word of length 0 is the empty word
  • for n > 0, $A^n = \set{aw \mid a \in A,\ w \in A^{n-1}}$

set of all words: $A^* = \cup_{n \ge 0}A^n$

equality of words

Let $u,v \in A^*$, then u=v if and only if

  1. $u=\epsilon \text{ and } v = \epsilon$; or
  2. $u=au^\prime \text{ and } v = av^\prime \text{ where } u^\prime = v^\prime$

Theorem (Structural induction of words)

if $P(\epsilon)$ holds, and $P(u)$ implies $P(au)$ for every letter $a$ and word $u$, then $P(w)$ holds for every word $w$

word concatenation

$u \cdot v = \begin{cases} v & \text{if } u=\epsilon \\ a(u^\prime \cdot v) & \text{if } u = au^\prime \end{cases}$

for brevity, we write uv for the concatenation $u \cdot v$
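
as a sanity check, the recursive definition transcribes directly into Scala (a small sketch of my own, modelling words as List[Char]):

def concat(u: List[Char], v: List[Char]): List[Char] = u match
    case Nil     => v                   // ε · v = v
    case a :: u1 => a :: concat(u1, v)  // (a u') · v = a (u' · v)

concat("ab".toList, "ba".toList) // List('a', 'b', 'b', 'a')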

free monoid of words

$(A^*, \cdot, \epsilon)$ is an algebraic structure; the word monoid satisfies two additional properties:

  • left cancellation law: if wu = wv, then u = v
  • right cancellation law: if uw = vw, then u = v

reversal

prefix, suffix, and slice

these are straightforward; the definitions are what the names suggest

language

definition of a language

a language over alphabet A is a set $L \subseteq A^*$; for example, take A = {0, 1}

  • a finite language like L = {0, 1, 01, 010}, or the empty language $\emptyset$
  • infinite but very difficult to describe
  • infinite but having some nice structure, a pattern allowing a precise description;
    for example, $L_2=\set{01, 0101, 010101, …} = \set{(01)^n \mid n\ge1}$

a language is a set, so we have union, intersection, and the other set operations

language operations

monoid of language

two conditions for monoid

  • a neutral element: $L \cdot \set{\epsilon}=L$ and $\set{\epsilon} \cdot L=L$, so $\set{\epsilon}$ is the neutral element
  • associative law: we need $L_1 \cdot (L_2 \cdot L_3)=(L_1 \cdot L_2) \cdot L_3$

note: no cancellation law! $L_1 \cdot \emptyset = \emptyset = L_2 \cdot \emptyset$ but not necessarily $L_1=L_2$

represent programming language

in general, some formal languages are not recursively enumerable sets

a language $L \subseteq A^*$ is given by its characteristic function $f_L : A^* \to \set{0,1}$, defined by f~L~(w) = 1 for w in L, and 0 for w not in L

for example, for $L_2=\set{01, 0101, 010101, …} = \set{(01)^n \mid n\ge1}$

def f(w: List[Int]): Boolean = w match
    case 0 :: 1 :: Nil => true
    case 0 :: 1 :: wRest => f(wRest)
    case _ => false

val L2 = Lang(f) // Lang wraps a characteristic function (course-specific helper)
L2.contains(0 :: 1 :: 0 :: 1 :: Nil) // true

Kleene star, repetition of a language: $L^* = \bigcup_{n \ge 0} L^n$

regular expression

one way to denote (often infinite) languages

definition

a regular expression e is built from:

  • $\emptyset$, corresponding to the empty language
  • $\epsilon$, corresponding to $\set{\epsilon}$
  • a, b, etc., corresponding to $\set{a}, \set{b},…$
  • e1 | e2, corresponding to $L_{e1} \cup L_{e2}$
  • e1e2, corresponding to $L_{e1} \cdot L_{e2}$
  • e*, corresponding to $L_{e}^{*}$

regular expression operators

  • [a…z] = a | b | … | z
  • e^?^ = e | $\epsilon$
  • e^+^ = e e^*^
  • e^k..^ = e^k^ e^*^ and e^p..q^ = e^p^ (e^?^)^q-p^
  • !e, the complement; there is no obvious translation from the base operators
  • e1 & e2 = !(!e1 | !e2), denoting $L_{e1} \cap L_{e2}$

properties of regular expression

emptiness, inclusion, disjointness

lexical analysis

input: character streams. res = 14 + arg * 3

output: token streams. “res”, “=”, “14”, “+”, “arg”, “*”, “3”

key ideas:

  • small memory usage
  • not difficult to construct manually
  • use longest match rule

a lexer can be implemented by:

  • conversion to finite-state automata
  • usage of regular expression derivation

example of a simple lexer:

num = 13; 
while (num > 1) { 
  println("num = ", num); 
  if (num % 2 == 0) { 
    num = num / 2; 
  } else { 
    num = 3 * num + 1; 
  }
}

tokens:

Ident ::=
	letter (letter | digit)*
integerConst ::=
	digit digit*
keywords
	if  else  while  println
special symbols
	(  )  &&  <  ==  +  -  *  /  %  !  {  }  ;  ,
letter ::= a | b | c | … | z | A | B | C | … | Z
digit ::= 0 | 1 | … | 8 | 9

a small hand-written lexer:

enum Token:
    case ID(content: String) // id3
    case IntConst(value: Int) // 10
    case AssignEQ // =
    case CompareEQ // ==
    case MUL // *
    case PLUS // +
    case LEQ // <=
    case OPAREN
    case CPAREN
    case IF
    case WHILE
    case EOF // End Of File

class CharStream(fileName: String):
    // assumes java.io.{BufferedReader, FileReader} are imported
    val file = new BufferedReader(new FileReader(fileName))
    var current: Char = '\u0000'
    var eof: Boolean = false
    def next: Unit =
        if (eof)
            throw EndOfInput("reading " + fileName) // course-specific exception
        val c = file.read()
        eof = (c == -1)
        current = c.toChar
    next // initialize first char

class Lexer(ch: CharStream):
    var current: Token = EOF
    // isLetter, isDigit, keywords and toDigit are helpers left out here
    def next: Unit =
        // lexer code goes here
        if (isLetter) {
            val b = new StringBuilder
            while (isLetter || isDigit) {
                b.append(ch.current)
                ch.next
            }
            keywords.lookup(b.toString) match {
                case None => current = ID(b.toString)
                case Some(kw) => current = kw
            }
        }
        if (isDigit) {
            var k = 0
            while (isDigit) {
                k = 10 * k + toDigit(ch.current)
                ch.next
            }
            current = IntConst(k)
        }

deciding which token is coming

problem: how do we know whether the upcoming token is an identifier or an integer constant?

first

use first(e) - symbols with which e can start

$first(L) = \set{a \in A \mid \exists v \in A^*.~ av\in L }$

example: L = {a, ab}, first(L) = {a}

first($\emptyset$) = $\emptyset$

first($\epsilon$) = $\emptyset$

first(a) = {a}

first(e1 | e2) = first(e1) $\cup$ first(e2)

first(e^*^) = first(e)

first(e1e2) = $\text{if } nullable(e_1) \text{ then } first(e_1) \cup first(e_2) \text{ else } first(e_1)$

nullable

nullable($\emptyset$) = false

nullable($\epsilon$) = true

nullable(a) = false

nullable(e1 | e2) = nullable(e1) $\lor$ nullable(e2)

nullable(e^*^) = true

nullable(e1e2) = nullable(e1) $\land$ nullable(e2)

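these rules transcribe directly into Scala; here is a small sketch over a minimal regex ADT of my own (the fuller RegExp enum appears later in these notes):

enum Re:
    case Empty                 // ∅
    case Eps                   // ε
    case Chr(a: Char)          // single letter
    case Alt(l: Re, r: Re)     // e1 | e2
    case Cat(l: Re, r: Re)     // e1 e2
    case Star(e: Re)           // e*
import Re.*

def nullable(e: Re): Boolean = e match
    case Empty | Chr(_) => false
    case Eps | Star(_)  => true
    case Alt(l, r)      => nullable(l) || nullable(r)
    case Cat(l, r)      => nullable(l) && nullable(r)

def first(e: Re): Set[Char] = e match
    case Empty | Eps => Set.empty
    case Chr(a)      => Set(a)
    case Alt(l, r)   => first(l) ++ first(r)
    case Star(inner) => first(inner)
    case Cat(l, r)   => if (nullable(l)) first(l) ++ first(r) else first(l)
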
ch.current match {
    case '(' => {current = OPAREN; ch.next; return}
    case ')' => {current = CPAREN; ch.next; return}
    case '+' => {current = PLUS; ch.next; return}
    case '/' => {current = DIV; ch.next; return}
    case '*' => {current = MUL; ch.next; return}
    case '=' => { // more tricky because there can be =, ==
        ch.next
        if (ch.current == '=') {ch.next; current = CompareEQ; return}
        else {current = AssignEQ; return}
    }
    case '<' => { // more tricky because there can be <, <=
        ch.next
        if (ch.current == '=') {ch.next; current = LEQ; return}
        else {current = LESS; return}
    }
}

longest match rule

token priority

if a word matches both an identifier and a keyword, which token should we emit? set a priority so that the keyword matches first

general approach to automatic lexing

traditional approach
  1. convert to nondeterministic finite-state automaton
  2. perform determinization (can be expensive)
  3. run the resulting automaton on input (linear in the input size)

Brzozowski's derivatives

accepts: (e, w) -> {true, false}, where e is a regular expression and w an input word

accepts(e, epsilon) = nullable(e)

accepts(e, cu) =
	if e = ∅, false
	if e = epsilon, false
	if e = c, if (u = epsilon) true else false
	if e = d where d ≠ c, false
	if e = e1 | e2, accepts(e1, cu) || accepts(e2, cu)
	if e = e1e2, ??? // need to try all splits of cu?
	if e = e1*, ???

after consuming a given letter, what is the rest of the regex?

definition:

The derivative of a regex e with respect to letter c, written as $\delta^c(e)$, is defined as:

​ $L(\delta^c(e)) = \set{w \mid cw \in L(e)}$

the derivative of a regex is still a regex

some examples:

$\delta^a(ab \mid ac \mid da)=b \mid c$

$\delta^a((ab)^*)=b(ab)^*$

$\delta^a((ab \mid c)^*ad)=b(ab \mid c)^*ad \mid d$

matching regex by derivation

accepts: (e, w) -> {true, false} e: regular expression, w: input word

accepts(e, epsilon) = nullable(e)

accepts(e, cw) = accepts($\delta^c(e)$, w)

Important: need to cache each intermediate result to avoid duplicate calculation

regular expression in Scala
enum RegExp:
    // empty language ∅
    case Failure
    // empty word ϵ
    case EmptyStr
    // character a such that predicate(a)
    case CharWhere(predicate: Char => Boolean)
    // union left | right
    case Union(left: RegExp, right: RegExp)
    // concatenation first · second
    case Concat(first: RegExp, second: RegExp)
    // Kleene star underlying*
    case Star(underlying: RegExp)

    // is this regexp nullable? (same rules as nullable above)
    def acceptsEmpty: Boolean = this match
        case Failure | CharWhere(_) => false
        case EmptyStr | Star(_) => true
        case Union(l, r) => l.acceptsEmpty || r.acceptsEmpty
        case Concat(l, r) => l.acceptsEmpty && r.acceptsEmpty

    // can this regexp possibly accept some words?
    def isProductive: Boolean = this match
        case Failure => false
        case EmptyStr | CharWhere(_) | Star(_) => true // approx.
        case Union(l, r) => l.isProductive || r.isProductive
        case Concat(l, r) => l.isProductive && r.isProductive

extension (expr: RegExp)
    def ~ (that: RegExp): RegExp = Concat(expr, that)
    def | (that: RegExp): RegExp = Union(expr, that)
    def * : RegExp = Star(expr)
    def ? : RegExp = expr | EmptyStr
    def + : RegExp = expr ~ expr.*
    def times(n: Int): RegExp =
        if (n <= 0) EmptyStr else expr ~ expr.times(n - 1)

// example
// e1 ~ e2 ~ e3 | e4
// e1.* | e2.+

def elem(pred: Char => Boolean): RegExp = CharWhere(pred)
def elem(char: Char): RegExp = CharWhere(_ == char)
def elem(chars: Iterable[Char]): RegExp =
	chars.map(elem).foldLeft[RegExp](Failure)(_ | _)
def word(chars: Iterable[Char]): RegExp =
	chars.map(elem).foldLeft[RegExp](EmptyStr)(_ ~ _)
def inRange(low: Char, high: Char): RegExp =
	elem(c => c >= low && c <= high)

// Example:
elem(_.isLetter) ~ (elem(_.isLetter) | elem(_.isDigit)).*


// derive is a method on RegExp (it uses `this`)
def derive(char: Char): RegExp =
    def work(expr: RegExp): RegExp = expr match
        case Failure | EmptyStr => Failure
        case CharWhere(pred) =>
            if (pred(char)) EmptyStr else Failure
        case Union(left, right) => work(left) | work(right)
        case Concat(left, right) =>
            val w = work(left) ~ right
            if (left.acceptsEmpty) w | work(right) else w
        case Star(inner) => work(inner) ~ expr
    work(this)

naïve derivation is inefficient

solution: on-the-fly normalization

  • associate all concatenations to the right: (e1e2)e3 => e1(e2e3)
  • avoid repetitions in unions: (e1 | e2) | (e2 | e3) => e1 | e2 | e3

after these normalizations, only a bounded number of distinct regexes can arise during derivation (see the theory below)

code:

def deriveNorm(char: Char): RegExp =
    // assumes an Ordering[RegExp] is available so SortedSet can deduplicate
    val disjuncted = collection.mutable.SortedSet[RegExp]()
    def work(expr: RegExp, rest: RegExp): Unit = expr match
        case CharWhere(pred) => if (pred(char)) disjuncted += rest
        case Union(left, right) => work(left, rest); work(right, rest)
        case Concat(left, right) =>
            work(left, right ~ rest)
            if (left.acceptsEmpty) work(right, rest)
        case Star(inner) => work(inner, expr ~ rest)
        case Failure | EmptyStr => ()
    work(this, EmptyStr) // register unions into `disjuncted`
    disjuncted.foldLeft[RegExp](Failure)(_ | _) // rebuild regexp

the pumping lemma

regular languages: though they may be infinite, their words contain a simple repeating pattern

theory of normalizing derivation

this is the part I don't fully understand.

thus, the number of distinct regexes generated by normalizing derivation of any w is bounded!

more specifically, for any e, we have:

$\left|\set{\underline\delta^w(e) \mid w \in A^*}\right| \le 1+2^{\max(e, \epsilon)}$

this allows regex matching in constant space and linear time w.r.t. size of words

algorithm for regex matching

  • start with the empty mapping M := $\emptyset$ and with regex e
  • for each character c~i~ in w:
    • if (e, c~i~) is not in domain(M), set M(e, c~i~) := $\underline\delta^{c_i}(e)$
    • set e := M(e, c~i~)
  • test whether nullable(e)
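
a small sketch of this memoized matcher, assuming the deriveNorm and acceptsEmpty methods on RegExp defined above:

def matches(e0: RegExp, w: String): Boolean =
    // the cache plays the role of M: one entry per (regex, character) pair
    val cache = collection.mutable.Map.empty[(RegExp, Char), RegExp]
    var e = e0
    for (c <- w)
        e = cache.getOrElseUpdate((e, c), e.deriveNorm(c))
    e.acceptsEmpty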

expressiveness limitation of regex

grammars

regular grammar

an equivalent way of defining regular languages

start -> letter (letter | digit)*

letter -> [a..z]

digit -> 0 | 1 | … | 9

regularity requirement: no recursion!

definitions form a directed acyclic graph (DAG)

context free grammar

$S \rightarrow \epsilon \mid a~S~b$

semantics given by rewriting derivations

S -> aSb -> aaSbb -> aaaSbbb -> aaa(epsilon)bbb = aaabbb

definition of a Context-Free Grammar (CFG)

a tuple G = (A, N, S, R)

  • A - terminals (usually tokens, the endpoints)
  • N - non-terminals (symbols with recursive definitions)
  • S - the starting symbol; derivations start from S
  • R - grammar rules, pairs n -> v, where n is a non-terminal and $v \in (A \cup N)^*$

parse trees

definition

a tree t is a parse tree of G = (A, N, S, R) iff t is a node-labelled tree with ordered children that satisfies:

  • root is labeled by S

  • leaves are labelled by elements of A

  • each non-leaf node is labelled by an element of N

  • for each non-leaf node labelled by n,

    ​ whose children left to right are p~1~, … p~k~

    ​ there is a rule (n -> p~1~ … p~k~) belongs to R

  • the yield of parse tree t: a word obtained by the leaves of t

  • the language of grammar G: defined as L(G) = {yield(t) | t is a parse tree of G}

example

syntax tree

Difference between parse trees and abstract syntax trees

Node children in parse trees correspond precisely to RHS of grammar rules

  • Definition of parse trees is fixed given the grammar. Often, compilers never actually build parse trees in memory
  • uniquely specify how an input was recognized by the grammar
  • contains all information needed to reconstruct the input

Nodes in abstract syntax tree (AST) contain only useful information

  • We can choose our own syntax trees, to facilitate both construction and processing in later stages of compiler

compilers often directly build ASTs

ambiguous grammar

how to parse “x * 42 + y” ?

some token sequences have multiple parse trees => the grammar is ambiguous

solution: change the grammar to a layered form

	expr -> expr + expr | multi
	multi -> intLiteral | ident | multi * multi | '(' expr ')'

but how to parse “x + 42 + y” ?

we need to define associativity: for '+', we want it to be left-associative, x+42+y = (x+42)+y

solution:

	expr -> expr + multi | multi
	multi -> multi * factor | factor
	factor -> intLiteral | ident | '(' expr ')'

such a grammar is left recursive, since expr -> expr + …

generalities on grammar

Chomsky's Classification of Grammars

  • type 0, unrestricted: arbitrary string rewrite rules
    • equivalent to Turing machines!
    • eXb -> eXeX -> Y
  • type 1, context sensitive: RHS always larger
    • O(n)-space Turing machines
    • aXb -> acXb
  • type 2, context free: one LHS nonterminal
    • X -> acXb
  • type 3, regular: no recursion, just Kleene star
    • X -> acY*b

parsing

recursive descent LL(1) parsing

  • Can be easily implemented manually based on the grammar
  • Efficient – linear in the size of the token sequence
  • Direct correspondence between grammar and code

a simple example:

statmt ::= 
    println ( stringConst , ident )
    | ident = expr
    | if ( expr ) statmt (else statmt)?
    | while ( expr ) statmt
    | { statmt* }

a simple parser could be:

def skip(t : Token) = if (lexer.token == t) lexer.next
   else error("Expected " + t)

def statmt = {
   if (lexer.token == Println) { lexer.next;
      skip(openParen); skip(stringConst); skip(comma);
      skip(identifier); skip(closedParen)
   } else if (lexer.token == Ident) { lexer.next;
      skip(equality); expr
   } else if (lexer.token == ifKeyword) { lexer.next;
      skip(openParen); expr; skip(closedParen); statmt;
      if (lexer.token == elseKeyword) { lexer.next; statmt }
   } else if (lexer.token == whileKeyword) { lexer.next;
      skip(openParen); expr; skip(closedParen); statmt
   } else if (lexer.token == openBrace) { lexer.next;
      while (isFirstOfStatmt) { statmt }
      skip(closedBrace)
   } else { error("Unknown statement, found token " + lexer.token) }
}

because each alternative starts with a distinct terminal, deciding which rule to parse is easy.

now look at another example:

%% operator priorities are similar to Java
expr ::= expr ( +|-|*|/ ) expr
		| name
		| '(' expr ')'
name ::= ident

we can transform it into a layered grammar:

expr ::= term termList
termList ::= + term termList
			| - term termList
			| epsilon
term ::= factor factorList
factorList ::= * factor factorList
			| / factor factorList
            | epsilon
factor ::= name | ( expr )
name ::= ident

corresponding code:

def expr = { term; termList }
def termList =
  if (token==PLUS) {
     skip(PLUS); term; termList
  } else if (token==MINUS) {
     skip(MINUS); term; termList
  }
def term = { factor; factorList }
...
def factor =
  if (token==IDENT) name
  else if (token==OPAR) {
    skip(OPAR); expr; skip(CPAR)
  } else error("expected ident or (")

after parsing with the layered grammar, we need to rebuild the AST so that operators are left-associative; the foldLeft calls in the concrete parser below do exactly this.

algorithm for first and null

for a rule $A ::= B_1…B_p \mid C_1…C_q \mid D_1…D_r$:

$first(A) = first(B_1…B_p) \cup first(C_1…C_q) \cup first(D_1…D_r)$

$first(B_1…B_p) = \begin{cases} first(B_1) & \text{if not } nullable(B_1) \\ first(B_1) \cup … \cup first(B_k) & \text{if } nullable(B_1),…, nullable(B_{k-1}) \text{ and not } nullable(B_k) \end{cases}$

thus, given any grammar, we can:

  • for each non-terminal X, whether nullable(X)
  • using this, get the set of first(X) for each non-terminal

pseudo code for getting the constraints:

nullable = {}
changed = true
while (changed) {
  changed = false
  for each non-terminal X
    if (X is not nullable)
       and (grammar contains a rule X ::= ε | ...
            or a rule X ::= Y1 ... Yn | ... where {Y1,...,Yn} ⊆ nullable)
    then {
        nullable = nullable U {X}
        changed = true
    }
}

for each nonterminal X: first(X) = {}
for each terminal t:  first(t) = {t}
loop
  for each grammar rule X ::= Y(1) ... Y(k)
    for i = 1 to k
      if i=1 or {Y(1),...,Y(i-1)} ⊆ nullable
      then
        first(X) = first(X) U first(Y(i))
until none of first(…) changed in the last iteration
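
a compact Scala sketch of both fixpoints, under a hypothetical grammar representation of my own (each non-terminal maps to its alternatives, each alternative being a list of symbols):

type Sym = String
type Grammar = Map[Sym, List[List[Sym]]] // non-terminal -> alternatives

def nullables(g: Grammar): Set[Sym] =
    var nullable = Set.empty[Sym]
    var changed = true
    while (changed) {
        changed = false
        for ((x, alts) <- g if !nullable(x))
            // ε is represented as the empty alternative Nil
            if (alts.exists(_.forall(nullable))) { nullable += x; changed = true }
    }
    nullable

def firsts(g: Grammar, nullable: Set[Sym]): Map[Sym, Set[Sym]] =
    val isTerminal = (s: Sym) => !g.contains(s)
    var first = Map.empty[Sym, Set[Sym]].withDefaultValue(Set.empty)
    var changed = true
    while (changed) {
        changed = false
        for ((x, alts) <- g; alt <- alts) {
            // scan Y(1) ... Y(k) while the prefix before Y(i) is nullable
            var i = 0
            while (i < alt.length && (i == 0 || nullable(alt(i - 1)))) {
                val y = alt(i)
                val fy = if (isTerminal(y)) Set(y) else first(y)
                if (!fy.subsetOf(first(x))) {
                    first = first.updated(x, first(x) ++ fy); changed = true
                }
                i += 1
            }
        }
    }
    first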

problem with nullable non-terminal

there are still cases that cannot be resolved using the first sets alone:

stmtList ::= ε | stmt  stmtList 
stmt ::= assign | block 
assign ::= ID  =  ID
block ::= beginof ID stmtList ID ends

def stmtList = 
  if (???) 			// what should the condition be?
  else { stmt; stmtList }
def stmt =
  if (lex.token == ID) assign
  else if (lex.token == beginof) block
  else error("Syntax error: expected ID or beginof")
…
def block =
  { skip(beginof); skip(ID); stmtList; skip(ID); skip(ends) }

we cannot determine whether stmtList should stop (derive ε) or continue parsing:

For nullable non-terminals, we must also compute what follows them

when parsing the block, beginof ID stmtList ID ends: after we consume beginof and ID, the next token we see is also an ID. What is this ID for? A new assignment, or the closing ID before ends? In an LL(1) parser, looking only at the next token cannot decide this!

LL(1) grammar

  • Grammar is LL(1) if for each nonterminal X
    • first sets of different alternatives of X are disjoint
    • if nullable(X), first(X) must be disjoint from follow(X), and only one alternative of X may be nullable
  • For each LL(1) grammar we can build recursive-descent parser
  • Each LL(1) grammar is unambiguous
  • If a grammar is not LL(1), we can sometimes transform it into equivalent LL(1) grammar

So for the above stmtList example, it is not LL(1) because:

  • nullable(stmtList)
  • first(stmtList) = {ID, beginof}
  • follow(stmtList) = {ID}
  • $first(stmt) \cap follow(stmtList) = \set{ID}$

algorithm for follow
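
the notes stop here, so for completeness, the standard constraints: for each rule X ::= Y~1~ … Y~p~ and each i,

$follow(Y_i) \supseteq first(Y_{i+1} \ldots Y_p)$

$follow(Y_i) \supseteq follow(X), \text{ if } nullable(Y_{i+1} \ldots Y_p)$

as with nullable and first, these constraints are solved by iterating until nothing changes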

LL(1) parse table

concrete parser implementation

enum Token:
    case Ident(name: String)
    case OpenParen
    case CloseParen
    case Plus
    case Times
// "A + B * C" => Ident("A"),Plus,Ident("B"),Times,Ident("C")

enum Expr:
    case Var(name: String)
    case Add(lhs: Expr, rhs: Expr)
    case Mult(lhs: Expr, rhs: Expr)
// "A + B * C" => Add( Var("A") , Mult(Var("B"), Var("C")) )

class Parser(ite: Iterator[Token]):
	// Parser state manipulation:
	var cur: Option[Token] = ite.nextOption
	def consume: Unit =
		cur = ite.nextOption
	// define parser here:
	def expr = ...

object Parser:
	def parse(ts: Iterable[Token]): Expr =
		val p = Parser(ts.iterator)
		val res = p.expr // entry point
		if (p.cur.nonEmpty)
			fail("input not fully consumed")
		res

// Helper method:
def skip(tk: Token): Unit =
	if (cur != Some(tk))
		fail("expected " + tk + ", found " + cur)
	consume

// Unambiguous "atomic" expressions:
def atom: Expr = 
	cur match
		case Some(Ident(nme)) => 
			consume
			Var(nme)
		case Some(OpenParen) =>
			consume
			val e = expr
			skip(CloseParen)
			e
		case _ => fail("expected atomic expression, found " + cur)

def expr: Expr =
	val p = product
	val ps = addedProducts
	ps.foldLeft(p)((l, r) => Add(l, r))

def addedProducts: List[Expr] =
	cur match
		case Some(Plus) =>
			consume
			product :: addedProducts
		case _ => Nil

def product: Expr =
	val a = atom
	val as = multipliedAtoms
	as.foldLeft(a)((l, r) => Mult(l, r))

def multipliedAtoms: List[Expr] = cur match
	case Some(Times) =>
		consume
		atom :: multipliedAtoms
	case _ => Nil

Pratt Parsing

how to avoid manually transforming grammars?

how to support user-defined operators and parse them correctly?

===> we need to specify operator precedence / associativity separately

idea

Simplest way of describing precedence and associativity:

​ operators have distinct left and right precedences

​ ‘+’ has (3,4) and ‘*’ has (5,6)

implementation

def opPrec(opStr: String): (Int, Int) = opStr match
    case "*" => (50, 51)
    case "+" => (30, 31)
    case "=>" => (21, 20)
    case ...

enum Token:
	case OpenParen
	case CloseParen
	case Ident(name: String)
	case Oper(name: String)

enum Expr:
	case Var(name: String)
	case Infix(lhs: Expr, op: String, rhs: Expr)

def expr(prec: Int): Expr = 
	cur match
		case Some(Ident(nme)) =>
			consume; exprCont(Var(nme), prec)
		case Some(OpenParen) =>
			consume; val res = expr(0); skip(CloseParen)
			exprCont(res, prec)
		case _ => fail("expected expression, found " + cur)

// Having parsed acc, what to do next at this precedence?
def exprCont(acc: Expr, prec: Int): Expr = cur match
	case Some(Oper(opStr)) if opPrec(opStr)._1 > prec =>
		consume
		val rhs = expr(opPrec(opStr)._2)
		exprCont(Infix(acc, opStr, rhs), prec)
	case _ => acc
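
a worked trace of my own for A + B * C, starting from expr(0): expr parses Var(A) and calls exprCont(Var(A), 0); + has left precedence 30 > 0, so it is consumed and the right-hand side is parsed with expr(31); inside that call, * has left precedence 50 > 31, so B * C groups first, and the overall result is Infix(Var(A), "+", Infix(Var(B), "*", Var(C))).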

user defined operators

character precedence tables

parsing-expression grammars (PEG)

a more recent alternative to context-free grammars (CFGs). Parsing-expression grammars remove ambiguity through biased choice: instead of X|Y, use X/Y, which tries to parse Y only if parsing X fails!

name analysis

  • an identifier is used but not declared: def p(amount: Int) { total = total + ammount }
  • multiple method arguments have the same name: def p(x: Int, y: Int, x: Int) { ... }
  • multiple functions with the same name: object Program { def m(x: Int) = { x + 1 }; def m(x: Int) = { x + 3 } }
  • an ill-formed type definition (e.g. circular): class List extends Expr; class Cons extends List; class Expr extends Cons

symbol table

maintain a map from identifiers to declaration information (symbol) at each point in the tree: symbol table

symbol tables can be computed every time, cached, or integrated partly or fully into trees as symbol reference

it provides efficient access to information of identifiers

  • declaration of a value or variable, its type and initial value
  • variable inside a pattern matching
  • a function, its signature, and its body
  • an algebraic data type (case class), its alternatives and fields

it is a map data structure
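
a minimal sketch of such a table with nested scopes (my own simplification; real compilers store richer symbol information):

// hypothetical symbol info; real compilers store types, positions, etc.
case class Symbol(name: String, info: String)

// a scope chain: lookups walk outward, definitions go into the innermost scope
class SymbolTable(parent: Option[SymbolTable] = None):
    private val defs = collection.mutable.Map.empty[String, Symbol]
    def define(sym: Symbol): Unit = defs(sym.name) = sym
    def lookup(name: String): Option[Symbol] =
        defs.get(name).orElse(parent.flatMap(_.lookup(name)))
    def enterScope: SymbolTable = SymbolTable(Some(this))

val global = SymbolTable()
global.define(Symbol("total", "var: Int"))
val fnScope = global.enterScope
fnScope.define(Symbol("amount", "param: Int"))
fnScope.lookup("total")   // Some(...), found in the enclosing scope
fnScope.lookup("ammount") // None: exactly the undeclared-identifier error above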

scope and scoping rules

static (lexical) scoping

local variables are only visible inside the function or block where they are introduced

type system

why types? to prevent errors, ensure memory safety, document the program, support refactoring, and enable compiler optimizations

An unsound (broken) type system

background: inductively defined relations and sets

proof that r = {(x, y) | x <= y}:

  • if there is a derivation, then x <= y

    • induction on the derivation, going through each rule
  • if x <= y, there exists a derivation

    • given x, y, construct a derivation tree:

      start from (0, 0), then derive (0, y-x) in y-x steps of "increase right"

      then, if x > 0, "increase both" x times (if x < 0, "decrease both" |x| times)

context-free grammar as inductively defined relations

we define a rule for r as being of the form \(\frac{t_1(\bar x) \in r, ... , t_n(\bar x) \in r}{t(\bar x) \in r}\), where the t~i~($\bar x$) ∈ r are the assumptions and t($\bar x$) ∈ r is the conclusion

when n=0 (no assumptions), the rule is called an axiom

example: Amyli language

Amyli is a tiny language only works on integers and booleans

(initial) program is a pair (e~top~, t~top~) where

  • e~top~ is the top-level environment, mapping function names to function definitions
  • t~top~ is the top-level term (expression) that starts execution

Function definition for a given function name is a tuple of:

  • parameter list $\bar x$
  • parameter type $\bar \tau$
  • expression representing function body t
  • result type $\tau_0$

expressions are formed from primitive functions (+, -, *, /), calls of defined functions, or if expressions

there are no local val definitions and no match, so e remains fixed

thus, we can write t as: \(t := true \mid false \mid c \mid f(t_1, t_2, ..., t_n) \mid if (t) ~t_1 ~else ~t_2 \\ \text{c is an integer, f denotes a user-defined function or a primitive operator}\) so we can write a simple factorial program as shown below.
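
a sketch of my own in Amyli-like syntax; I assume a primitive comparison operator <= alongside +, -, *, /:

def fact(n: Int): Int =
    if (n <= 1) 1
    else n * fact(n - 1)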

operational semantics

so how can we process a program to infer its type? we need to define a set of rules so that the relation can be defined inductively:

having these rules, we can do the induction:

now consider another example: the expression if (5) 3 else 7

5 cannot evaluate further and is an integer constant, but if only accepts true or false; thus evaluation gets stuck

stuck terms indicate errors

Type checking is a way to prevent stuck terms statically, without trying to evaluate the program to see if it gets stuck

typing rules

typing context: given the initial program (e, t), define \(\Gamma_0 = \set{(f, \tau_1 \times ... \times \tau_n \to \tau_0) \mid (f, \bar x, (\tau_1,...,\tau_n), t_f, \tau_0) \in e}\)

now we can state the typing rules for the Amyli language:

soundness theorem

if program type checks, its evaluation does not get stuck

proof using two lemmas (common approach):

  • progress

    if a program type checks, it is not stuck.

    if $\Gamma \vdash t:\tau$, then either t is a constant or there exists a t' such that $t \rightsquigarrow t'$

  • preservation

    if a program type checks and makes one “$ \rightsquigarrow$” step, then the result again type checks.

    i.e. if $\Gamma \vdash t:\tau$ and $t\rightsquigarrow t'$ then $\Gamma \vdash t':\tau$

example: showing progress and preservation for if

don't ask me why, I can't follow this tedious proof.

an example of derivation tree:

type inference

motivation:

writing type annotations is tedious; we need a way to get the types automatically

then how do we do type inference? consider one small example:

strategy for type inference

  1. use type variable (e.g. $\alpha_{verbose}, \alpha_{s}$) to denote unknown types
  2. use type checking rules to derive constraints among type variables (e.g., arguments have expected types)
  3. use a unification algorithm to solve the constraints

demo using a simple language

types

  1. primitive types: Int, Bool, String, Unit
  2. type constructors: Pair[A, B] or (A, B), Function[A,B] or A=>B

Abstract syntax of types:

\(T ::= Int \mid Bool \mid String \mid Unit \mid (T_1, T_2) \mid (T_1 \Rightarrow T_2)\)

Terms include pairs and anonymous functions (x denotes variables, c literals):

\(t ::= x \mid c \mid f(t_1,...,t_n) \mid if (t) ~ t_1 ~ else ~ t_2 \mid (t_1,t_2) \mid (x \Rightarrow t)\)

if t = (x, y), then t._1 = x and t._2 = y

type rules

now let’s see an example:

def translatorFactory(dx, dy) = {
	p => (p._1 + dx, p._2 + dy) // returns an anonymous function
}
def upTranslator = translatorFactory(0, 100)
def test = upTranslator((3, 5)) // computes (3, 105)

// with the actual types written out, this would be:
def translatorFactory(dx: Int, dy: Int): ((Int, Int)) => (Int, Int) = {
	p => (p._1 + dx, p._2 + dy)
}
def upTranslator: ((Int, Int)) => (Int, Int) = translatorFactory(0, 100)
def test: (Int, Int) = upTranslator((3, 5))

So, when we don't have the types, how can we infer them, say for dx?

def translatorFactory(dx, dy) = { p => (p._1 + dx, p._2 + dy) }

so, we can generate constraints from the expressions given:

then, we need a unification algorithm to solve the constraints:

  • apply the following rules as long as the current set of equations changes:

    • Orient: replace T = α with α = T when T is not a type variable

    • Delete useless: remove T = T (both sides syntactically equal)

    • Eliminate: given α = T where α does not occur in T, replace α with T in all remaining equations

    • Occurs check: given α = T where α occurs in T, report an error (no solutions)

    • Decompose pairs: replace (T1, T2) = (T′1, T′2) with T1 = T′1 and T2 = T′2

    • Decompose functions: replace (T1 ⇒ T2) = (T′1 ⇒ T′2) with T1 = T′1 and T2 = T′2

    • Decomposition clash (remaining cases): given T1 = T2 where T1 and T2 have different constructors, report a clash (no solution)

      Examples: (T1, T2) = (T′1 ⇒ T′2); Int = (T1, T2); Int = Bool; (T1 ⇒ T2) = String

  • properties of the unification algorithm:

    • it always terminates
    • if an error is reported, the equations have no solution
    • otherwise, there is one or more solutions
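
the rules translate almost directly into code; below is a small sketch using a type representation of my own (not the course's implementation):

// hypothetical type representation for the unification sketch
enum Type:
    case IntT, BoolT, StringT, UnitT
    case TVar(id: Int)              // type variable α
    case PairT(a: Type, b: Type)    // (T1, T2)
    case FunT(a: Type, b: Type)     // T1 => T2
import Type.*

def occurs(v: Int, t: Type): Boolean = t match
    case TVar(i)     => i == v
    case PairT(a, b) => occurs(v, a) || occurs(v, b)
    case FunT(a, b)  => occurs(v, a) || occurs(v, b)
    case _           => false

def subst(v: Int, by: Type)(t: Type): Type = t match
    case TVar(i) if i == v => by
    case PairT(a, b)       => PairT(subst(v, by)(a), subst(v, by)(b))
    case FunT(a, b)        => FunT(subst(v, by)(a), subst(v, by)(b))
    case other             => other

def applySubst(m: Map[Int, Type])(t: Type): Type = t match
    case TVar(i)     => m.getOrElse(i, t)
    case PairT(a, b) => PairT(applySubst(m)(a), applySubst(m)(b))
    case FunT(a, b)  => FunT(applySubst(m)(a), applySubst(m)(b))
    case other       => other

// returns a substitution for the type variables, or None if there is no solution
def unify(eqs: List[(Type, Type)]): Option[Map[Int, Type]] = eqs match
    case Nil => Some(Map.empty)
    case (s, t) :: rest => (s, t) match
        case _ if s == t =>                     // delete useless
            unify(rest)
        case (TVar(v), t1) if !occurs(v, t1) => // eliminate
            unify(rest.map((a, b) => (subst(v, t1)(a), subst(v, t1)(b))))
                .map(m => m + (v -> applySubst(m)(t1)))
        case (TVar(_), _) =>                    // occurs check
            None
        case (t1, TVar(v)) =>                   // orient
            unify((TVar(v), t1) :: rest)
        case (PairT(a1, b1), PairT(a2, b2)) =>  // decompose pairs
            unify((a1, a2) :: (b1, b2) :: rest)
        case (FunT(a1, b1), FunT(a2, b2)) =>    // decompose functions
            unify((a1, a2) :: (b1, b2) :: rest)
        case _ =>                               // decomposition clash
            None

// example: α0 = (Int, α1) and (Int, Bool) = α0
unify(List((TVar(0), PairT(IntT, TVar(1))), (PairT(IntT, BoolT), TVar(0))))
// Some(Map(1 -> BoolT, 0 -> PairT(IntT, BoolT)))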

code generation

the specification for WebAssembly is available online

a stack machine simulator could be:

var code : Array[Instruction]
var pc : Int // program counter
var local : Array[Int]  // for local variables
var operand : Array[Int]  // operand stack
var top : Int

def step = code(pc) match {
    case Iadd() =>
        operand(top - 1) = operand(top - 1) + operand(top)
        top = top - 1 // two consumed, one produced
    case Imul() =>
        operand(top - 1) = operand(top - 1) * operand(top)
        top = top - 1 // two consumed, one produced
    case Iconst(c) =>
        operand(top + 1) = c // put given constant 'c' onto stack
        top = top + 1
    case Igetlocal(n) =>
        operand(top + 1) = local(n) // from memory onto stack
        top = top + 1
    case Isetlocal(n) =>
        local(n) = operand(top) // from stack into memory
        top = top - 1 // consumed
}

while (true) {
    step
    if (notJump(code(pc)))
        pc = pc + 1 // by default go to the next instruction
}

prefix, postfix, infix notation

let f be a binary operation, e~1~, e~2~ two expressions

  • in prefix, f e~1~ e~2~, Polish notation
  • in infix, e~1~ f e~2~
  • in postfix, e~1~ e~2~ f Reverse Polish notation (RPN)

for nested expressions like a * b + c, infix needs parentheses; prefix and postfix need no brackets

| arg. list | +(x,y) | +(*(x,y),z) | +(x,*(y,z)) | *(x,+(y,z)) |
|-----------|--------|-------------|-------------|-------------|
| prefix    | + x y  | + * x y z   | + x * y z   | * x + y z   |
| infix     | x+y    | x*y + z     | x + y*z     | x*(y+z)     |
| postfix   | x y +  | x y * z +   | x y z * +   | x y z + *   |

| notation | first expression  | second expression |
|----------|-------------------|-------------------|
| prefix   | + + + x y z u     | + x + y + z u     |
| infix    | ((x + y) + z) + u | x + (y + (z + u)) |
| postfix  | x y + z + u +     | x y z u + + +     |

first draw the AST, and then traverse according to the order

below is an example to transform a tree into a prefix, postfix and infix notation:

enum Expr:
    case Var(varID: String)
    case Plus(lhs: Expr, rhs: Expr)
    case Times(lhs: Expr, rhs: Expr)

enum Token:
    case ID(str : String)
    case Add
    case Mul
    case O // Opening paren '('
    case C // Closing paren ')'

def prefix(e: Expr): List[Token] = e match
    case Var(id) => List(ID(id))
    case Plus(e1,e2) => List(Add) ::: prefix(e1) ::: prefix(e2)
    case Times(e1,e2) => List(Mul) ::: prefix(e1) ::: prefix(e2)

def infix(e: Expr): List[Token] = e match // needs to emit parentheses
    case Var(id) => List(ID(id))
    case Plus(e1,e2) =>
        List(O) ::: infix(e1) ::: List(Add) ::: infix(e2) ::: List(C)
    case Times(e1,e2) =>
        List(O) ::: infix(e1) ::: List(Mul) ::: infix(e2) ::: List(C)

def postfix(e: Expr): List[Token] = e match
    case Var(id) => List(ID(id))
    case Plus(e1,e2) => postfix(e1) ::: postfix(e2) ::: List(Add)
    case Times(e1,e2) => postfix(e1) ::: postfix(e2) ::: List(Mul)

Lisp, an example of prefix notation

(defun factorial (n)
    (if (<= n 1)
		1
		(* n (factorial (- n 1)))))

PostScript, an example of a postfix language

/inch {72 mul} def
/wedge
    { newpath
        0 0 moveto
        1 0 translate
        15 rotate
        0 15 sin translate
        0 0 15 sin - 90 90 arc
        closepath
    } def
gsave
    3.75 inch 7.25 inch translate
    1 inch 1 inch scale
    wedge 0.02 setlinewidth stroke
grestore
gsave

compiling expression

why postfix? we can evaluate it using a stack

  • Evaluating postfix expressions is like running a stack based virtual machine on compiled code
  • Compiling expressions for stack machine is like translating expressions into postfix form

to evaluate e1 * e2, interpret it as:

  • evaluate e1
  • evaluate e2
  • combine the results using *

def compile(e : Expr) : List[Bytecode] = e match { // ~ postfix printer
    case Var(id) => List(Igetlocal(slotFor(id)))
    case Plus(e1,e2)  => compile(e1) ::: compile(e2) ::: List(Iadd())
    case Times(e1,e2) => compile(e1) ::: compile(e2) ::: List(Imul())   
}

to include local variables:

  • assigning indices (called slots) to local variables using a function slotFor : VarSymbol -> {0,1,2,3,…}
  • how do we compute the indices?
  • assign them in the order in which they appear in the tree (see the sketch after the code below)
def compile(e : Expr) : List[Bytecode] = e match {
	case Var(id) => List(Igetlocal(slotFor(id)))
	...	
}
def compileStmt(s : Statmt) : List[Bytecode] = s match {
	// id=e
	case Assign(id,e) => compile(e) ::: List(Isetlocal(slotFor(id)))
    ...
}
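
a minimal sketch of the slot assignment (a helper of my own, handing out indices in order of first appearance):

val slots = collection.mutable.Map.empty[String, Int]
def slotFor(id: String): Int =
    slots.getOrElseUpdate(id, slots.size) // new variables get the next free slot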

compiling control flow:

in WebAssembly, all comparison operators yield 32-bit integer results, with 1 representing true and 0 representing false.

  • block: the beginning of a block construct, a sequence of instructions with a label at the end
  • loop: a block with a label at the beginning which may be used to form loops
  • br: branch to a given label in an enclosing construct
  • br_if: conditionally branch to a given label in an enclosing construct
  • return: return zero or more values from this function
  • end: an instruction that marks the end of a block, loop, if, or function
block $label1
  block $label0
    (negated condition code)
    br_if $label0 // to else branch
    (true case code)
    br $label1 // done with if
  end $label0 // else branch
  (false case code)
end $label1 // end of if

alternatively, WebAssembly provides a built-in if ... else ... end construct as a shortcut

special tricks for [e1] && [e2]:

we only need to evaluate e2 when e1 == true; similarly, for [e1] || [e2], e2 is evaluated only when e1 == false

[if (econd) ethen else eelse] :=
	block nAfter
    	block nElse
    		block nThen
    			branch(econd, nThen, nElse)
    		end //nThen:
    		[ethen]
    		br nAfter
    	end //nElse:
    	[eelse]
    end //nAfter:
    [erest]

branch(!e,nThen,nElse) :=
	branch(e,nElse,nThen)

branch(e1 && e2,nThen,nElse) :=
	block nLong
		branch(e1,nLong,nElse)
	end //nLong:
	branch(e2,nThen,nElse)

branch(e1 || e2,nThen,nElse) :=
	block nLong
		branch(e1,nThen,nLong)
	end //nLong:
	branch(e2,nThen,nElse)
	
branch(true,nThen,nElse) :=
	br nThen

branch(false,nThen,nElse) :=
	br nElse

branch(b,nThen,nElse) :=	(where b is a local var)
	get_local #b
	br_if nThen
	br nElse

we can introduce a new compilation command: [e] followed by two target labels.

then a switch statement could be compiled as:

[sswitch] nAfter nBreak :=
block nDefault
	block nCasen
		...
		block nCase1
			block nTest
				[e_scrutinee] nTest nBreak
			end //nTest:
			tee_local #s (where s is some fresh local of type i32)
			i32.const c1
			i32.eq
			br_if nCase1
			get_local #s
			i32.const c2
			i32.eq
			br_if nCase2
			...
			br nDefault
		end //nCase1:
		[e1] nCase2 nAfter
		...
	end //nCasen:
	[en] nDefault nAfter
end //nDefault:
[edefault] nAfter nAfter

register optimization

basic instructions of register machines:

  • Ri ← Mem[Rj] (load)
  • Mem[Rj] ← Ri (store)
  • Ri ← Rj * Rk (compute, for an operation *)

efficient register machine code uses as few loads and stores as possible.

WebAssembly: i32.mul

register machine:

R1 ← Mem[SP]
SP = SP + 4
R2 ← Mem[SP]
R2 ← R1 * R2
Mem[SP] ← R2
