A.2 Cassiopea Overview
This section covers the abstract syntax for Cassiopea.
Cassiopea is a register-transfer-list style language: it models instructions as non-Turing-complete procedures that update a machine state. Its executable semantics are covered in Section
A.4.
We model the machine from the assembly-language programmer’s perspective. In particular, we do not treat memory as a huge block of address space but instead treat it in small chunks passed in from somewhere else. We model both control registers and general-purpose registers as well as other machine state such as whether interrupts are enabled.
Furthermore, we must allow assembler labels, which have addresses, but those addresses are not resolved until after programs are compiled and linked and must be treated as abstract.
Notation. We use the following metavariables:
(Other constructs are referred to with longer names.)
A number of constructs are listed (with a null case and a cons case)—these are written out in longhand so that typing and semantic judgments can be applied explicitly to each case, in order to, for example, propagate environment updates correctly.
Identifiers and Variables. Identifiers are divided syntactically into seven categories:
—
\(\mathit {x_{mem}}\) are identifiers bound to memory regions, which are second-class.
—
\(\mathit {x_{func}}\) are identifiers bound to functions, which are second-class.
—
\(\mathit {x_{proc}}\) are identifiers bound to procedures, which are second-class.
—
\(\mathit {x_{op}}\) are identifiers bound to instructions (“operations”), which are akin to procedures but distinguished from them.
—
\(\mathit {x_{\tau }}\) are identifiers for type aliases, which are bound to base types in declarations.
—
\(\mathit {{x}_{module}}\) are the names of “modules”, which are used to select among many possible groups of lowering elements.
—
Other identifiers \(\mathit {x}\) are used for other things, and should be assumed to not range over the above elements.
Note that all identifiers live in the same namespace, and rebinding or shadowing them is not allowed. All these identifiers can be thought of as variables, in the sense that they are names that stand for other things. All of them are immutable once defined, including the ordinary variables \(\mathit {x}\) that contain plain values.
Types. Types are divided syntactically into base types (integers, booleans, strings, bitvectors, etc.) and others (memory regions and functions). Functions may handle only base types. Furthermore, memory regions and functions are second-class for reasons discussed below and are excluded in various places in the syntax and the typing rules.
We use index typing to capture the bit width of values.
Registers. Registers are represented in the specification with the metavariable
r, which stands for the underlying abstract identity of a register. Declaring a register, e.g., with
\(\mathtt {let} \mathtt {state}\ \mathit {x}\ \mathit {:}\ \mathit {C}\ \mathtt {reg}\) as shown in Figure
A.2, allocates a fresh register
r and binds the variable
\(\mathit {x}\) to it. A subsequent declaration of the form
\(\mathtt {let}\ \mathit {y}\ \mathit {:}\ \mathit {C}\ \mathtt {reg} = \mathit {x}\) creates another variable
\(\mathit {y}\) that refers to the same underlying register. One might think of registers as numbered internally. We use the form
\(\mathtt {let} \mathtt {state}\ \mathtt {control}\ \mathit {x}\ \mathit {:}\ \mathit {C}\ \mathtt {reg}\) to declare specific
control registers, which are treated differently by the framing rules. The additional keyword
dontgate inhibits state gating for the register; this should be used for flags registers and anything similar that is implicitly used by all ordinary code.
Some registers have an associated text form, which is declared separately and is the form an assembler expects to parse. The synthesis engine uses this to extract an assembly program from Cassiopea’s internal representation. It is referred to by attaching the suffix \(\mathit {.} \mathtt {txt}\) to the register variable. As some registers are not directly addressable by the assembler (e.g., they might be subfields of some larger addressable unit or non-addressable internal state), not all registers have a text form. This is not readily checked statically without additional information, so invalid \(\mathit {.} \mathtt {txt}\) references fail at assembly code extraction time.
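As an illustration, a machine description might declare registers along these lines. The keyword spellings and especially the syntax for attaching a text form are illustrative sketches, not the definitive concrete syntax; comments use ML-style delimiters, matching the ML flavor of the concrete syntax described later.

```
(* Hypothetical fragment for a 32-bit machine. *)
letstate r0 : 32 reg
letstate r1 : 32 reg
let counter : 32 reg = r1           (* a second name for the same underlying register *)
letstate control sp : 32 reg        (* a control register, treated specially by framing *)
letstate dontgate flags : 32 reg    (* never state-gated; implicitly used by all code *)

(* Text forms, declared separately (syntax here is a guess): *)
r0.txt = "r0"
r1.txt = "r1"
(* flags has no text form: referencing flags.txt fails at extraction time. *)
```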
The type of a register is \(\mathit {C}\ \mathtt {reg}\) , which is a register that holds a \(\mathit {C}\) -bit bitvector. The bitvector value in question can be updated by assigning a new value; this is a statement ( \(\mathit {e} _1\thinspace \mathit {:} = \mathit {e} _2\) ) and can only happen in places where statements are allowed. The construct \(\:^*\mathit {e}\) reads a register.
The reader will note that the semantics rules for machines and declarations do not provide initial values for registers. Instead, executions are defined in terms of some initial register state (and also some memory state), which is required to have the right registers to match the machine definition. This allows reasoning about the execution of programs and program fragments in terms of many or all possible initial states. These issues are discussed further below.
Memory. A memory region has the type
\(\mathit {C} _1\ \mathtt {bit}\ \mathit {C} _2\ \mathtt {len}\ \mathit {C} _3\ \texttt {ref}\) , shown in Figure
A.1. This refers to a memory region that has
\(\mathit {C} _2\) cells, each of which stores a bitvector of width
\(\mathit {C} _1\) . This memory region is addressed with pointers of width
\(\mathit {C} _3\) . Note that we assume byte-addressed machines, and for the purposes of both this specification and our implementation, we assume bytes are 8 bits wide. (This restriction could be relaxed if we wanted to model various historic machines.) Thus a memory region of type
\(32\ \mathtt {bit}\ 4\ \mathtt {len}\ 32\ \texttt {ref}\) holds four 32-bit values, which can be addressed at byte offsets 0, 4, 8, and 12. Memory regions and registers are mutable state.
Memory regions are named with identifiers. These names, and memory regions themselves, are not first class; variables are not allowed to range over them. Also, note that memory regions are a property of programs (and thus are declared in specifications) and not a property of the machine as a whole.
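For instance, a specification might declare a small region as follows (illustrative syntax; the declaration-keyword spelling is an assumption):

```
(* Four 32-bit cells, addressed by 32-bit pointers;
   cells sit at byte offsets 0, 4, 8, and 12. *)
letstate buf : 32 bit 4 len 32 ref
```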
Pointers. A pointer literal has the form
\(\mathit {(}\mathit {x_{mem}} \mathit {,}\ \mathit {C} \mathit {)}\) , in which
\(\mathit {x_{mem}}\) is the name of a memory region and
\(\mathit {C}\) is the offset, shown in Figure
A.1. Because memory regions are second-class,
\(\mathit {x_{mem}}\) must be one of the available declared memory regions. Pointer literals exist in the abstract syntax, but are not allowed in the concrete syntax except in specifications. The only way to get a pointer value is to look up a label (discussed immediately below) or have it provided in a register as part of the initial machine state.
A pointer literal is treated as a bitvector of the width appropriate for pointers to the memory region, so one can appear in a register or in memory. However, we enforce a restriction (not captured in the semantic judgments so far) that no value in the initial machine state, whether in a register or in memory, is a pointer unless required to be so by the precondition part of the specification. All other values are restricted to plain bitvector values.
Addition and subtraction are allowed on pointers and they change the offset, but other bitvector operations (e.g., multiply) are disallowed and fail. Similarly, attempting to fetch from or store to a plain bitvector that is not a pointer fails. Note however that we do not statically distinguish pointers and plain bitvectors. (We could have used flow-sensitive typing to reason about when registers and memory cells contain pointers and when they do not; but this adds substantial complexity and for our problem domain does not provide significant value.) Instead, we step to failure at runtime. This can be seen in the semantic judgments.
Fetching from a pointer takes the form \(\mathtt {fetch}\) ( \(\mathit {e}\) , \(\mathit {C}\) ). Storing to a pointer takes the form \(\mathtt {store} (\mathit {e} _1, C) \leftarrow \mathit {e} _2\) . The extra constant \(\mathit {C}\) specifies the width of the cell pointed to. (This is not an offset.) Because we do not check pointers statically, we do not know the memory region being pointed to and cannot look up its cell size; thus we need the width explicitly for typing. It is checked at runtime.
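Putting the pointer rules together, here is a sketch assuming a region buf declared as \(32\ \mathtt {bit}\ 4\ \mathtt {len}\ 32\ \texttt {ref}\) and a register r0 that initially holds a pointer into it; the statement forms follow those above, while the assignment spelling is illustrative.

```
r1 := fetch(*r0, 32);        (* read the 32-bit cell r0 points at *)
r0 := *r0 + 0x00000004;      (* pointer + offset: allowed, advances to the next cell *)
store(*r0, 32) <- *r1;       (* copy the value into the next cell *)

(* r0 := *r0 * 0x00000002    would step to failure: multiply on a pointer *)
(* fetch(*r1, 32)            fails at runtime if *r1 is a plain bitvector *)
(* fetch(*r0, 16)            fails: width annotation does not match the cell size *)
```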
Labels. As mentioned above, the term “label” means an assembler label or linker symbol. These are constant addresses that are not known at assembly time, so we must model them abstractly.
When one declares a memory region, one may attach a label to it, which is an additional identifier. This identifier is created as a variable of type
\(\mathit {C}\ \mathtt {label}\) in Figure
A.1. The value is a pointer to the first entry in the region, and a single type subsumption rule allows this value to be accessed and placed in an ordinary register or variable of suitable bitvector type. The intended mechanism is that for each machine the preferred instruction on that machine for loading assembler symbols into a register can be defined to take an operand of type
\(\mathit {C}\ \mathtt {label}\) , and then its value can be assigned to the destination register. This type of restriction on the operand is sufficient to synthesize programs that use labels correctly.
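For example, one might attach a label to a region and define the machine's preferred address-loading instruction to take a label operand. Both the label-attachment syntax and the defop structure here are illustrative guesses at the concrete syntax; the instruction name and assembly mnemonic are hypothetical.

```
letstate table : 32 bit 8 len 32 ref with table_start : 32 label

defop LA dst:32 reg, l:32 label {
  txt = format("la $1, $2", dst.txt, lbl(l)),
  sem = [ dst := l ]    (* the label's pointer value lands in dst *)
}
```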
Code Positions and Branching. Cassiopea code blocks may contain branches, either forward within the block (branching backwards is forbidden) or to a single external assembler label outside the block. This model is sufficient for our block-oriented synthesis approach; more complex control flow can be handled with multiple blocks.
Consequently, a branch may either go to the single external assembler label (which terminates execution of the current block) or skip zero or more instructions. We model branch targets as an 8-bit skip count. In Figure
A.2, the statement
\(\texttt {BRANCH}(\mathtt {0b}\mathit {C} {})\) skips
\(\mathtt {0b}\mathit {C}\) instructions;
\(\texttt {BRANCH}(\texttt {0xff})\) jumps to the external assembler label. This statement may be used to define both conditional and unconditional branch instructions. Such instructions should be defined to take an operand of type
\(8\ \mathtt {bit}\) to choose the branch destination. This magic number should then be printed to the output assembly text using the built-in function
textlabel, which replaces it with a valid assembler label, either the selected external label string or a scratch label attached to the proper destination instruction.
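A conditional branch instruction might thus be defined along these lines. This is an illustrative sketch: the flags register, the bit tested, and the defop/txt/sem structure are assumptions, but it shows the intended pattern of an 8-bit operand passed to both BRANCH and textlabel.

```
defop BNZ dest:8 bit {
  txt = format("bnz $1", textlabel(dest)),
  sem = [ if (*flags)[0] == 0b1 then BRANCH(dest) else skip ]
  (* dest = 0xff jumps to the external label; otherwise it skips dest instructions *)
}
```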
Specifications do not need to be directly concerned with internal branches, which occur or not as needed. However, external branches are part of a block’s specification; typically the purpose of a block with an external branch is to test some condition and then either branch away or fall through to the next block. It is thus necessary to be able both to name the external label to use and to specify the conditions when it should be reached. For this purpose, a predicate branchto is provided. It may appear in the postcondition and governs the exit path from the block: if forced to true, the block branches to the external assembler label, and if false, the block falls through from its last instruction. The concrete syntax is branchto(dest) which also sets the assembler label used to dest. It is not valid to name more than one such assembler label.
Note that the assembler labels used in branching are, though also assembler labels, a separate mechanism unrelated to the labels attached to memory regions; they are code labels rather than data labels and inherently work differently.
Register Sets. Register sets are second-class elements intended to exist only as literals and only as the result of lowering machine-independent specifications that cannot directly talk about specific registers. Currently, they do not exist in the implementation. Register sets are not allowed to be operands to instructions to avoid state explosions when synthesizing. For simplicity, this restriction is not shown in the abstract syntax or typing rules.
Functions and Procedures. Functions, defined with
\(\mathtt {def}\) in Figure
A.2, are pure functions whose bodies are expressions. They produce values. They can read registers and memory, and can fail, but cannot update anything. Procedures, defined with
\(\mathtt {proc}\) , are impure and their bodies are statements. They do not produce values, but they may update the machine state. They are otherwise similar, and are intended to be used to abstract out common chunks of functionality shared among multiple instructions in machine descriptions. Functions can also be used for state hiding in specifications.
Functions and procedures are second-class; they may be called only by their own name and may not be bound to variables or passed around. Furthermore, they are only allowed to handle base types: higher-order functions are explicitly not supported.
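For example (illustrative syntax; the return-type annotation and procedure-body bracketing are guesses): a function body is an expression, a procedure body is a statement.

```
(* A pure predicate over a 32-bit value; may read state, cannot update it. *)
def is_zero (v : 32 bit) -> bool = v == 0x00000000

(* An impure helper that clears a register; usable from instruction bodies. *)
proc clear (x : 32 reg) = [ x := 0x00000000 ]
```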
Operations. Operations (defined with
\(\mathtt {defop}\) in Figure
A.2) are machine instructions or short sequences of machine instructions. An operation takes zero or more operands and transforms the machine state as defined by one or more statements. Operands are currently defined as expressions, but are restricted as follows:
—
They may be values, but not string values, and not \(\mathtt {fail}\) .
—
They may be variables of register type.
—
They may be variables of label type.
This restriction affects what the synthesizer tries to generate; a broader set of expressions may be accepted for verification or concrete execution and simply evaluated in place.
In general, we refer to “instructions” and “operations” interchangeably. However, there is an important distinction between them: operations do not necessarily need to be single machine instructions. The text output to the assembler is arbitrary and if desired can be computed on the fly based on the operand values. On some platforms, the assembler defines “synthetic instructions” that potentially assemble to multiple machine instructions, or to different sequences based on the constants or registers used. This facility takes that a step further by allowing the writer of the machine description to define their own synthetics. Operations are the units in which Cassiopea reasons about machine operations and the units in which Cassiopea generates programs and code fragments.
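A complete operation definition might look as follows (illustrative syntax). The txt expression computes the assembly text on the fly from the operands, which is what lets a single defop act as a user-defined synthetic.

```
defop ADDI rd:32 reg, imm:32 bit {
  txt = format("add $1, $1, $2", rd.txt, imm.dec),
  sem = [ rd := *rd + imm ]
}
```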
Other Constructs. \(\mathit {e} \mathit {[}\mathit {C} \mathit {]}\) and
\(\mathit {e} \mathit {[}\mathit {C} _1\mathit {,}\ \mathit {C} _2\mathit {]}\) in Figure
A.1 extract a single bit and a slice, respectively, from a bitvector. The offsets are constants; if variable offsets are needed, the value can be shifted first. The width of the slice must be constant for static typing.
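For instance, assuming v holds a 32-bit value (and reading the two slice constants as start and width; the precise convention is fixed by Figure A.1):

```
let b   : 1 bit = v[31]      (* extract a single bit *)
let low : 8 bit = v[0, 8]    (* an 8-bit slice; the width must be constant *)
```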
Machines, Lowering, Specifications, and Programs. A
\(\mathit {machine}\) is the full description of a machine architecture; it includes declarations (including types, constants, registers, functions, and procedures) and also instructions, as shown in Figure
A.3. This is typically a standalone file but may be a set of files referenced via
\(\mathtt {include}\) .
A (single) lowering is a collection of declarations used to instantiate elements in Alewife translations, shown in Figure
A.3. These are placed into a
\(\mathit {module}\) , with multiple modules per file, so that the lowerings associated with multiple related Alewife specifications can be kept together. The
\(\mathtt {import}\) \(\mathit {{x}_{module}}\) directive enables sharing common elements in one module
\(\mathit {{x}_{module}}\) across multiple specifications. The modules used to lower a specification are selected using the
\(\mathtt {lower\text{-}with}\) declaration in Alewife.
A
\(\mathit {spec}\) (specification) is a precondition and postcondition, which are boolean expressions, along with optional permission to modify additional registers (the
\(\mathit {frame}\) ), shown in Figure
A.3. Cassiopea specifications are produced by compiling Alewife specifications. Note that a module can also contain frame declarations; they are added to any frame conditions provided in the Alewife specification. A code block is permitted to modify any register that is either explicitly listed in the frame declarations or mentioned in the postcondition, while it may read any register mentioned in the precondition and any control register. This restriction is currently not adequately captured in the semantics rules.
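A small specification might look like this (illustrative syntax; the frame keyword spelling is a guess). It asks for a block that loads the first cell of a region buf through a pointer held in r0, permitting r1 as scratch; r2 may be modified because it is mentioned in the postcondition. Pointer literals such as (buf, 0) are permitted here because this is a specification.

```
frame: modify r1
pre:  *r0 == (buf, 0)
post: *r2 == fetch((buf, 0), 32)
```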
A
\(\mathit {program}\) is a sequence of instruction invocations, shown in Figure
A.3.
Built-in Functions. Here is a partial list of the built-in functions in Cassiopea along with their types.
—
empty : \(\mathtt {int} \rightarrow \mathit {C}\ \mathtt {reg}\ \mathtt {set}\) produces an empty register set of bit size \(\mathit {C}\) , where \(\mathit {C}\) is the value of the first argument. Note that this built-in function is dependently typed and treated specially during type checking. The first argument must be a constant.
—
hex : \(\mathit {C}\ \mathtt {bit} \rightarrow \mathtt {string}\) prints numbers in hexadecimal.
—
bin : \(\mathit {C}\ \mathtt {bit} \rightarrow \mathtt {string}\) prints numbers in binary.
—
dec : \(\mathit {C}\ \mathtt {bit} \rightarrow \mathtt {string}\) prints numbers in decimal.
—
lbl : \(\mathit {C}\ \mathtt {label} \rightarrow \mathtt {string}\) prints labels (it returns the label identifier as a string). This is for data labels attached to memory locations.
—
textlabel : \(8\ \mathtt {bit} \rightarrow \mathtt {string}\) prints branch offsets as assembler labels. This is for code labels, as described above.
—
format : \(\mathtt {string} \rightarrow \mathtt {string} \rightarrow \cdots \rightarrow \mathtt {string}\) formats strings. The first argument is a format string; the remainder of the arguments are substituted into the format string where a dollar sign appears followed by the argument number (1-based). (A literal dollar sign can be inserted by using $$.) The number of additional arguments expected is deduced from the contents of the format string.
—
bv_to_len : \(\mathtt {int} \rightarrow \mathit {C} _2\ \mathtt {bit} \rightarrow \mathit {C} _1\ \mathtt {bit}\) returns a new bitvector of size \(\mathit {C} _1\) (where \(\mathit {C} _1\) is the value of the first argument) with the same value as the second argument, up to the ability of the new size to represent that value. Note that this built-in function is dependently typed and treated specially during type checking. The first argument must be a constant.
—
bv_to_uint : \(\mathit {C} _1\ \mathtt {bit} \rightarrow \mathtt {int}\) converts a bitvector to unsigned int.
—
uint_to_bv_l : \(\mathtt {int}\ \rightarrow \mathtt {int} \rightarrow \mathit {C} _1\ \mathtt {bit}\) converts an unsigned int (second argument) into a bitvector of size \(\mathit {C} _1\) , where \(\mathit {C} _1\) is the value of the first argument. Note that this built-in function is dependently typed and treated specially during type checking. The first argument must be a constant.
—
isptr : \(\mathit {C}\ \mathtt {bit} \rightarrow \mathtt {bool}\) tests at runtime if a bitvector-typed value is a pointer.
Note that some of these functions have their own typing rules, some of which are polymorphic in bitvector size. We have not complicated the typing rules presented by including all of these as special cases.
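A few usage sketches, with the behavior we would expect from the descriptions above noted in comments (the surrounding names rd and imm are hypothetical):

```
format("mov $1, $2", rd.txt, hex(imm))   (* substitutes $1 and $2; $$ yields a literal $ *)
bv_to_len(8, 0x000000ff)                 (* a 32-bit value narrowed to an 8-bit one *)
uint_to_bv_l(32, 255)                    (* the unsigned int 255 as a 32-bit bitvector *)
bv_to_uint(0xff)                         (* back to an unsigned int *)
```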
Concrete Syntax. We do not describe the concrete syntax in detail here; however, it does not stray very far from the abstract syntax. The operator precedence and most of the operator spellings are taken from C, but most of the rest of the concrete syntax is ML-style. There are also a few things desugared in the parser and not shown in the abstract syntax. As already mentioned, bitvector constants whose size is a multiple of 4 can also be written in the form \(\mathtt {0x}\mathit {C}\) . Syntax of the form \(e\mathit {.} \mathtt {hex}\) , \(e\mathit {.} \mathtt {bin}\) , and \(e\mathit {.} \mathtt {dec}\) is converted to the built-in functions \(\mathtt {hex}\) , \(\mathtt {bin}\) , and \(\mathtt {dec}\) , respectively. These print either integers or bitvectors as strings in hexadecimal, binary, or decimal respectively. The syntax \(\mathit {x} \mathit {.} \mathtt {lbl}\) is similarly converted to the built-in function \(\mathtt {lbl}\) . This produces the label (that is, the identifier naming the label) as a string. Furthermore, the concrete syntax supports include files via an \(\mathtt {include}\) directive, which is useful for sharing common elements between related descriptions.
A.3 Cassiopea Static Typing
This section describes the Cassiopea type system.
Environments. The type system uses two environments: \(\Delta\) maps type alias names to their expansions, and \(\Gamma\) maps variable names to types. Recall from the syntax that type alias names may only expand to base types; thus type alias names can be treated as base types.
Well-Formedness. Since types syntactically include type alias names, we need to check the validity of those names. We also insist that the widths of bitvectors be greater than 0. The judgment for this has the form
\(\Delta \vdash _{\texttt {wf}}\tau\) , shown in Figure
A.4. There is an intended invariant that only well-formed types may be entered into the variable typing environment
\(\Gamma\) , so that types taken out of it do not need to be checked for well-formedness again.
In a typing environment comprised of \(\Delta\) mapping user-defined type names (type aliases) to types and \(\Gamma\) mapping variables to types, we say that a type is well formed when all type names exist and refer to well-formed types, and all indices are of type \(\mathtt {int}\) and positive.
Expressions. Expressions produce values that have types. Because types appear explicitly in some expressions (e.g.,
let), we need both environments, so the form of an expression typing judgment is
\(\Delta , \Gamma \vdash \mathit {e}\ \mathit {:}\ \tau\) , shown in Figure
A.5. This means that we conclude
\(\mathit {e}\) has type
\(\tau\) . Note that the
\(\mathit {.} \mathtt {txt}\) form is restricted to registers; it is specifically for extracting the assembly text form of a register. We have not written out a separate rule for each unary and binary operator. The types of operators are as shown in Figure
A.6. Note that the bitvector operators are polymorphic in bit size.
Arguably the right-hand argument of the shift operators should be allowed to have a different width. There is one rule for pointer literals that covers both the expression and the value form. There is no rule (either in the typing or in the semantics) that allows taking a subrange of a memory region as a new smaller region. We have not needed this for our use cases, so for simplicity, we do not support it.
Statements. Statements do not produce values. We still need both environments, though, so the form of a typing judgment for a statement is
\(\Delta , \Gamma \vdash {\mathit {S}}\) , shown in Figure
A.7. This means that
\(\mathit {S}\) is well-typed.
Declarations. Declarations update the environment. The form of a typing judgment for a declaration is
\(\Delta , \Gamma \vdash \mathit {decl} \rhd \Delta ^{\prime }, \Gamma ^{\prime }\) , and a judgment for a list of declarations has the same form, shown in Figure
A.8. This means that the declaration (or list) is well typed and produces the new environment on the right.
We impose an additional syntactic restriction on declarations found in a machine description (as opposed to the additional declarations that may appear in a specification): they may not use the expression forms that refer to machine state (registers or memory), because when defining the machine, there is no specific machine state to refer to; any references would need to be quantified. (That in turn is not allowed; while many SMT solvers now support quantified expressions, they generally do not perform well.)
Machines. A machine is some declarations followed by some defops, so the typing rule is just sequencing, shown in Figure
A.9, but there is a wrinkle: the initial environment for the machine is not an input.
\(\Delta _\mathit {builtin}\) is the (fixed) environment describing the built-in type aliases. (Currently, there are none.)
\(\Gamma _\mathit {builtin}\) is the environment describing the types of built-in variables. This notionally includes the built-in functions. (But as mentioned earlier some of them actually have their own typing rules.) The form of a typing judgment for a machine is
\(\vdash \mathit {machine} \rhd \Delta , \Gamma\) . This means that the machine description is well-typed and provides the environment on the right for use of other constructs that depend on the machine. (Specs and programs are only valid relative to a given machine.)
Specifications. For specifications we need two helper rules, shown in Figure
A.9: one that applies an additional list of declarations to a machine, which has the same form as the judgment on a machine; and one that says that a frame (modifies list) is well-typed, which has the form
\(\Delta , \Gamma \vdash \mathit {frames}\) . This lets us write the real rule, which has the form
\(\mathit {machine} \vdash \mathit {spec}\) and means that the specification is well-typed under the machine description.
Programs. A program is a sequence of calls to instructions. We need judgments of the form
\(\Delta , \Gamma \vdash \mathit {inst}\) for a single instruction and also
\(\Delta , \Gamma \vdash \mathit {insts}\) for the sequence, shown in Figure
A.9. There are two cases for a single instruction because of a minor glitch in formulation: since the overbar notation means “one or more”, there are two cases in the syntax for instructions, one for zero operands and one for one or more operands; we need typing rules for both cases. Meanwhile, the type entered into
\(\Gamma\) for a zero-operand instruction is unit to unit, not
\(\epsilon\) to unit, to avoid needing an additional form for types just for this case. (Notice that a one-operand instruction may not have type unit to unit because unit is not allowed as an instruction operand, so the type is not ambiguous.) These rules let us write a judgment for a program, which has the form
\(\mathit {machine} \vdash \mathit {program}\) and means that the program is well-typed relative to the machine.
Soundness. Our type system is sound: we include the necessary checks and failure states in the semantics so that evaluation does not get stuck, even though some properties are not statically checked.
A.4 Cassiopea Semantics
This section defines the semantics of Cassiopea.
Environment. The execution environment \(\Lambda\) maps Cassiopea variables \(\mathit {x}\) to values \(\mathit {v}\) . For labels on memory regions, each label maps to a pointer that points to the base (offset 0) of the region associated with the label. However, we take advantage of the polymorphism and dynamic typing of our semantic rules to also store the following in the same environment:
—
\(\mathit {x_{func}}\) (function names) map to pairs \(\lbrace \overline{\mathit {x} _i}, {\mathit {e}}\rbrace\) , which give the list of argument names and the body for functions.
—
\(\mathit {x_{proc}}\) (procedure names) map to pairs \(\lbrace \overline{\mathit {x} _i}, {\mathit {S}}\rbrace\) , which give the list of argument names and the body for procedures.
—
\(\mathit {x_{op}}\) (operation/instruction names) map to triples \(\lbrace \overline{\mathit {x} _i}, \mathit {e}, {\mathit {S}}\rbrace\) , which give the list of argument names, the expression for the text form, and the body for operations.
—
\(\mathit {r} \mathit {.} \mathtt {txt}\) (the form for the text version of a register) maps to a value.
—
The word EXTBRANCH maps to a branch state, which must be either ext or \(\cdot\) . This reports whether, after executing a program, it branched to the external label or not.
Since identifiers are not allowed to overlap in well-typed programs, and register identities are not strings at all, and EXTBRANCH is reserved, this usage creates no conflicts.
Note that \(\mathit {x_{mem}}\) , \(\mathit {x_{\tau }}\) , and \(\mathit {{x}_{module}}\) do not appear in \(\Lambda\) as these require no translation/lookup at runtime.
Machine State. In addition to the execution environment, we also need a representation of the machine state. We define two stores, one for registers and one for memory. The register store \(\rho\) maps registers \(\mathit {r}\) to values \(\mathit {v}\) . The memory store \(\sigma\) is more complicated: it maps pairs \(\mathit {(}\mathit {x_{mem}} \mathit {,}\ \mathit {C} \mathit {)}\) (that is, pointer literals) to pairs \((\mathit {v}, \mathit {C} _l)\) , where \(\mathit {v}\) is the value stored at that location and \(\mathit {C} _l\) is the bit width. The bit widths of memory regions are invariant, both across the region when they are declared and also over time. They are used to check the access widths appearing in fetch and store operations. Also note that new entries cannot be created in either the register store or the memory store, as real hardware does not permit such actions. The values stored in registers and memory regions are restricted by the typing rules to bitvectors (whether constants or pointers) of the appropriate width.
Notice that stepping through the declarations does not initialize the machine state. We want to reason about executions over ranges of possible starting machine states; so instead we provide a judgment that uses the typing environments to restrict the stores to forms consistent with the declarations. This is discussed further below.
Expressions. We describe expressions with a large-step operational semantics, shown in Figure
A.10. The form of an expression semantic judgment is:
\(\Lambda \vdash (\rho , \sigma ,\mathit {e})\Downarrow \mathit {v} {}\) , which means that given the environment
\(\Lambda\) and the machine state
\(\rho , \sigma\) , the expression
\(\mathit {e}\) evaluates to the value
\(\mathit {v}\) . Expressions may read the machine state, but not modify it. Expressions can fail; in addition to the explicit failure cases seen, some of the operators and built-in functions can fail. For example, as mentioned earlier, attempting bitvector arithmetic other than addition and subtraction on pointers will fail. Furthermore, division by 0 fails.
Note that we currently do not statically check (in the typing rules) that the \(\mathit {.} \mathtt {txt}\) form is present for every register or that it is defined for registers on which it is used. Thus we have an explicit failure rule for when no matching declaration has been seen. We also have failure rules for bad fetch operations: if the length annotation is wrong, if the pointer is not in the machine state (this covers both unaligned accesses and out-of-bounds accesses), or if the value used is not a pointer. Similarly, we have failure rules for bit-indexing or slicing a pointer. We do not, conversely, need explicit failure checks or rules for the bit indexes in the bit extraction/slicing constructs as they are statically checked.
Also, note that we include in the semantics the obvious failure propagation rules for when subexpressions fail. We do not show these explicitly as they are not particularly interesting or informative; however, note that the \(\&\&\) and \(||\) logical operators short-circuit left to right.
Statements. Unlike expressions, statements can change machine state. Thus, the form of a machine state semantics judgment (also large step) is
\(\Lambda \vdash (\rho , \sigma ,\mathit {S})\Downarrow (\rho ^{\prime }, \sigma ^{\prime },\mathit {S}^{\prime },\xi)\) , shown in Figure
A.11. This means that the statement
\(\mathit {S}\) evaluates to the irreducible statement
\(\mathit {S}^{\prime }\) (which must be either
\(\mathtt {skip}\) or
\(\mathtt {crash}\)) and in the course of doing so changes the machine state from
\(\rho , \sigma\) to
\(\rho ^{\prime }, \sigma ^{\prime }\) , and produces a branching state
\(\xi\) , which can be either an 8-bit bitvector, the reserved value
ext, or a dot (
\(\cdot\) ). As with expressions, statements can fail. Explicit failure rules are shown for bad stores (corresponding to the cases for bad fetches) and for failed assertions. We also similarly include, but do not show, the obvious failure propagation rules for cases where sub-statements, or expressions within statements, fail.
Declarations. The semantics for declarations have judgments of the form
\(\Lambda , (\rho , \sigma) \vdash \mathit {decl} \rhd \Lambda ^{\prime }\) , shown in Figure
A.12. This means that the given declaration updates
\(\Lambda\) as shown. As stated above, we do not
initialize the machine state while handling declarations; this instead allows us to work with arbitrary (or universally quantified) machine states afterward. However, because the let-binding declaration evaluates an expression, it potentially needs
access to a machine state. Consequently, we write the rules so they accept a machine state as input, but do not update it. In the case of machine descriptions, where there is no machine state, we pass empty environments; let-binding declarations in machine descriptions are not allowed to reference machine state. In the case of the additional declarations that accompany a specification, we pass in the initial machine state; this allows values from the initial machine state to be globally bound so they can be referred to in the postcondition.
We give first the rules for a list of declarations, then the rules for the various declarations, then the rules for a list of operation definitions, and a rule for a single operation definition. Note that several of the declarations do not update \(\Lambda\) , and nothing is placed in \(\Lambda\) for memory regions. For registers, only the mapping of the identifier to its underlying register \(\mathit {r}\) is entered; nothing for \(\mathit {r}\) is inserted.
Machines. As with the typing rules, the semantics rule for a whole machine description integrates the initial environment and gives a judgment of the form
\(\vdash \mathit {machine} \rhd \Lambda ^{\prime }\) , shown in Figure
A.13. We also include a comparable form that includes additional declarations, as it will be used below.
Programs. Instructions (or more precisely, Cassiopea operations) update the machine state, and we chose to represent programs as lists of instructions rather than having dummy instruction forms for skip and sequence. Consequently, the form of the judgments is slightly different, and there are several of them, as shown in Figure
A.14.
First, we can execute an individual instruction using the form \(\Lambda \vdash (\rho , \sigma ,\mathit {inst})\rightarrow (\rho ^{\prime }, \sigma ^{\prime }, \xi)\) , meaning that the instruction executes and updates the machine state \(\rho , \sigma\) to \(\rho ^{\prime }, \sigma ^{\prime }\) , producing the branching state \(\xi\) . Then, a list of instructions executes using the form \(\Lambda \vdash (\rho , \sigma ,\mathit {insts}, \xi)\rightarrow (\rho ^{\prime }, \sigma ^{\prime }, \mathit {insts}^{\prime }, \xi ^{\prime })\) , which means that the list steps to a new list and updates both the machine state and the branching state. When the instruction list runs out, these reduce to a shorter form via a judgment of the form \(\Lambda \vdash (\rho , \sigma ,\mathit {insts}, \xi)\rightarrow (\rho ^{\prime }, \sigma ^{\prime })\) , which discards the instructions and branching state and produces an output machine state. That in turn allows us to draw conclusions of the form \(\mathit {machine} \vdash (\rho , \sigma ,\mathit {program})\rightarrow (\rho ^{\prime }, \sigma ^{\prime })\) , which means that a machine with the initial state \(\rho , \sigma\) executes the program to produce the new machine state \((\rho ^{\prime }, \sigma ^{\prime })\) .
Note that there are two versions of the judgment for instructions, one specialized for no arguments/operands. Instructions with no operands are declared as taking unit, but invoked with empty operands (not with unit) to correspond to the way assembly languages normally work.
We include a final judgment of the form \(\mathit {machine} \vdash (\rho , \sigma ,\mathit {program})\rightarrow (\rho ^{\prime }, \sigma ^{\prime })\) that puts the machine on the left-hand side of the turnstile. It means that under a given machine the program maps \(\rho , \sigma\) to \(\rho ^{\prime }, \sigma ^{\prime }\). There is a limitation in the way we have formulated programs and the rules for programs: there is no easy way to represent failure. (Failure here might represent triggering an exception and stopping execution, which we do not model, or invoking “unpredictable” or “undefined” behavior in the processor and transitioning to an arbitrary unknown machine state.)
The intended behavior is that a program that fails during execution (that is, the body of one of its instructions steps to \(\mathtt {crash}\) ) enters a state where no postcondition can evaluate to \(\mathtt {true}\) . We have decided for the moment that working this explicitly into the formalism would result in a lot of complication and obscuration without providing any useful information.
Specifications. For specifications, we need three judgments, shown in Figure
A.15: the first two state what the
\(\mathtt {reg\text{-}modify}\) and
\(\mathtt {mem\text{-}modify}\) clauses mean, respectively (they are properties on initial and final register and memory states), and the last one says what it means for a program to satisfy a specification. Note that the
\(\mathtt {reg\text{-}modify}\) and
\(\mathtt {mem\text{-}modify}\) rules as shown are slightly misleading because the register and pointer list provided in the input specification is implicitly augmented with all registers and pointers mentioned in the postcondition before it gets to this point.
Machine State Validity. As discussed above, we do not initialize the machine state while processing declarations. Instead, we treat the starting machine state as an input (e.g., in the final judgment about programs) or quantify it universally as in the specification judgment. This requires that we have a predicate to reject machine states that do not match the machine description. The validity judgment has the form
\(\Delta , \Gamma , \Lambda \vdash \rho\) , shown in Figure
A.16, and correspondingly for
\(\sigma\) (except without
\(\Lambda\) ) and then for
\(\rho , \sigma\) (both stores at once). This means that the given stores match the given environments.
We use this with the typing environments that come from both the machine description and the additional declarations arising from a specification. In the case of registers, we need access to \(\Lambda\) to handle the names of registers. We do not use the \(\Lambda\) generated from the additional declarations in a specification; this avoids circularity. This is acceptable because specifications are not allowed to define new registers. For memory regions, we need to enumerate the valid offsets for the region (note the literal 8 that hardwires 8-bit bytes) and check the cell width.
Branching. To handle branching, we allow statements to produce a branching state, which indicates the number of instructions to skip over. Normally this is \(\cdot\) , which means none and has no effect; however, when an instruction body produces something else we use it to branch. A nonzero bitvector results in skipping the indicated number of instructions; the out-of-band additional value ext causes a branch to the external label. The magic number used to select the external label appears only in one of the statement rules; beyond that point we use ext explicitly.
The program rule in Figure
A.15 inserts the branch state produced by the program into
\(\Lambda\) , where it provides a value for the
branchto predicate found in the postcondition. This allows specifications to enforce the external branching behavior.
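The program-level behavior described above can be sketched as a loop over the instruction list that consumes branching states. Here each instruction body is represented as a Python function returning updated stores, a skip/crash result, and a branching state; all names and shapes are invented for illustration.

```python
# Hypothetical sketch of program execution
# machine |- (rho, sigma, program) -> (rho', sigma').  xi is None
# ("." : fall through), an int (skip that many instructions), or
# "ext" (branch to the external label).

def run_program(rho, sigma, insts):
    """Returns (rho', sigma', outcome), where outcome is "done",
    "ext" (branched to the external label), or "crash"."""
    i = 0
    while i < len(insts):
        rho, sigma, result, xi = insts[i](rho, sigma)
        if result == "crash":
            return rho, sigma, "crash"
        if xi == "ext":                 # branch to the external label
            return rho, sigma, "ext"
        if xi is None:                  # ".": fall through
            i += 1
        else:                           # skip xi instructions forward
            i += 1 + xi
    return rho, sigma, "done"
```

A nonzero branching state thus skips over the indicated number of instructions, while `ext` leaves the program entirely.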
A.5 Alewife Overview
This section describes Alewife, our specification language for writing abstracted machine-independent specifications of low-level code.
Alewife specifications are abstractions of machine-level Cassiopea specifications; we say that Cassiopea constructs are lifted into Alewife and Alewife constructs are lowered into Cassiopea. Alewife is only for specifications, so there are no statements, no updates, and no notion of instructions or programs. We refer to the single synthesis problem in one Alewife specification as a block.
Notation. We use the following metavariables:
(Other constructs are referred to with longer names.)
As noted previously, Alewife types and expressions should be considered distinct from Cassiopea ones (even where they correspond directly). We use the same letters in the hopes that this will cause less confusion (even in the definition of the translation) than picking an entirely different set of letters for Alewife.
Identifiers and Variables. In Alewife there are five syntactic categories of identifiers. As in Cassiopea, \(\mathit {x_{mem}}\) name memory regions, \(\mathit {x_{func}}\) name functions, and \(\mathit {x_{\tau }}\) are type aliases. \(\mathit {{x}_{module}}\) name Cassiopea lowering modules, which are used to instantiate the abstract elements and conduct the Alewife-to-Cassiopea translation for synthesis. Other identifiers \(\mathit {x}\) are ordinary variables that range over other things, and may be presumed not to range over the above reserved categories. All variables are immutable, in the sense that their values do not change once bound.
Symbolic Constants. In Alewife symbolic constants \(\mathit {N}\) are permitted to occur in some places where only integer constants are allowed in the corresponding Cassiopea constructions. In particular, the bit sizes associated with types (and the lengths of memory regions, which are functionally similar) may be given as symbolic values \(\mathit {x}\) instead of integer constants. These must be bound to integer constants either directly in the Alewife spec, in the Cassiopea lowering, or by the Cassiopea machine description. This allows the concrete sizes of bitvectors to vary depending on the machine architecture.
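The following is a minimal sketch of this resolution step, assuming types are represented as simple (constructor, size) pairs; the representation and function names are invented for illustration.

```python
# Hypothetical sketch of symbolic-size resolution.  A type such as
# ("vec", "wordsize") carries a symbolic width that must be bound to
# an integer constant by the Alewife spec, the Cassiopea lowering, or
# the machine description.

def resolve_size(n, bindings):
    """Resolve a size N: either already an int, or a symbolic name."""
    if isinstance(n, int):
        return n
    if n in bindings and isinstance(bindings[n], int):
        return bindings[n]
    raise KeyError(f"symbolic constant {n!r} has no integer binding")

def resolve_type(ty, bindings):
    """Rewrite a symbolic width in a (constructor, size) type to an int."""
    ctor, size = ty
    return (ctor, resolve_size(size, bindings))
```

With a binding such as `{"wordsize": 32}` from a 32-bit machine description, the same abstract type concretizes differently per architecture.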
Types. As in Cassiopea, Alewife types are divided syntactically into base types and others, shown in Figure
A.17. The chief difference from Cassiopea is that bit widths (and the lengths of memory regions) can be symbolic constants. A further difference is that pointers (
\(\mathtt {ptr}\)) are distinguished from plain bitvectors (
\(\mathtt {vec}\)). This is feasible in Alewife because it reasons only about the pre- and post-block machine states, not about the progression of values through machine registers. Strings and unit are also absent, as they are not needed for specifications.
Values and Expressions. The values in Alewife correspond directly to the values in Cassiopea as do operators and most expressions, shown in Figure
A.17. Note that the width argument of fetch can be a symbolic size.
Declarations and Frames. Alewife declarations come in two forms:
\(\mathtt {require}\) and
\(\mathtt {provide}\) , shown in Figure
A.18. The second form declares elements in an ordinary way, while the first form declares an element that must be provided by the Cassiopea lowerings or the Cassiopea machine description. In this case, the type is given, but not the value. This functions as a form of import, and allows an Alewife file to be checked on its own separately from any particular machine description or Cassiopea lowerings. However, we do not currently define or implement such a check. Note that it is possible to
\(\mathtt {require}\) functions that implicitly depend on machine state or that depend on machine state on some machines and not others. Such functions can also depend on constants or other elements that are not visible in the Alewife specification at all. The
\(\mathtt {lower\text{-}with}\) declarations specify all lowering modules that are used to compile this Alewife specification into a Cassiopea specification. The module name
\(\mathit {{x}_{module}}\) is used to look up the Cassiopea lowering module to apply. The
\(\mathtt {region}\) declarations declare memory regions, like the memory-typed
\(\mathtt {letstate}\) declarations in Cassiopea. (These are implicitly always
\(\mathtt {provide}\) , because, for memory regions, the corresponding
\(\mathtt {require}\) declaration would be entirely equivalent, requiring duplication in the Cassiopea lowering.) Note that the parameters of the region can be symbolic constants if abstraction is needed.
Frame declarations in Alewife, annotated with \(\mathtt {reg\text{-}modify}\) and \(\mathtt {mem\text{-}modify}\) , are exactly the same as in Cassiopea. Because Alewife files are machine-independent, the registers mentioned must be abstract and concretized via the Cassiopea lowerings.
Block-lets. While Alewife expressions include let-bindings, the scope of those let-bindings is conventional: it lasts until the end of the expression. To refer to values taken from the initial state (that is, the machine state of which the precondition must be true), we need a way to bind these values so their scope extends to the postcondition. The block-lets serve this purpose in Alewife, shown in Figure
A.18, much like the additional declarations seen in Cassiopea specs. These are found within a block (because a block corresponds to a synthesis problem, it is meaningful to associate pre- and post-block machine states with it), and their scope is the entire block.
Specifications. A full specification starts with a preamble of declarations, shown in Figure
A.18. It also includes block-lets and the pre- and postconditions for the block. Common declarations can be shared with
\(\mathtt {include}\) .
A.6 Alewife Typing and Semantics
We do not provide (or implement) a full typechecking pass for Alewife. Instead, when we lower to Cassiopea, we allow the Cassiopea typechecker to reject invalid results, which might be caused by invalid Alewife input or by bad/mismatched Cassiopea lowering definitions. The rules provided here are for doing scans over the declarations sufficient to make the translation to Cassiopea work and no more.
Environments. We retain the Cassiopea typing environments, \(\Delta , \Gamma\) . We add an additional environment \(\Sigma\) , which maps identifiers to integer constants. This is a projection of the Cassiopea execution environment \(\Lambda\) : it holds mappings only for variables defined as integer constants and excludes everything else. We include a separate set of rules for extracting these integer constants without doing a full Cassiopea execution. (Among other things, this avoids involving machine state or the machine state stores.)
Translation. The translation (lowering) from Alewife to Cassiopea, defined in the next section, appears cross-recursively in the rules in this section. Because \(\Delta , \Gamma\) are Cassiopea environments, they map identifiers to Cassiopea types, not Alewife ones. This means Alewife types must be lowered on the fly to update them correctly.
Integer Constant Extraction. The integer constant extraction rules do a simple pass over Cassiopea declarations to extract the variables defined as integer constants, shown in Figure
A.19. These populate a substitution environment
\(\Sigma\) that we use for lowering Alewife types containing symbolic constants. These rules are judgments of the form
\(\Sigma \vdash \mathit {decl} \rhd \Sigma ^{\prime }\) or
\(\Sigma \vdash \mathit {decls} \rhd \Sigma ^{\prime }\) , plus one of the form
\(\vdash \mathit {machine} \rhd \Sigma\) for a whole machine description.
Typing. The declaration typing rules are intended to accumulate types for all the declarations in an Alewife specification. They are applied concurrently with the Cassiopea declaration rules to the Alewife specification, the Cassiopea machine description, and the Cassiopea lowering. The declaration typing rules have judgments of the form
\(\Delta , \Gamma , \Sigma \vdash \mathit {decl} \rhd \Delta ^{\prime }, \Gamma ^{\prime }, \Sigma ^{\prime }\) and
\(\Delta , \Gamma , \Sigma \vdash \mathit {decls} \rhd \Delta ^{\prime }, \Gamma ^{\prime }, \Sigma ^{\prime }\) , shown in Figure
A.20. These mean that the declaration or declarations update the type environment (and integer constant environment) as shown. Note that there is a special-case rule for
\(\mathtt {provide}\) \(\mathtt {value}\) for when the value is an integer constant; this enters the constant into
\(\Sigma\) . The integer constants are in turn used when lowering the types of memory regions, which can be seen in the last two rules.
Block-lets. The rules for block-lets are effectively the same as the rules for declarations, shown in Figure
A.21. The ways in which block-lets are special mostly do not apply here. Note however that even though we pass through
\(\Sigma\) (for consistency of the form of the rules) there is no rule for loading integer constants into
\(\Sigma\) from block-lets. Integer constants used in types and defined in the Alewife specification should be defined with
\(\mathtt {provide}\) \(\mathtt {value}\) ; block-lets are intended to provide access to machine state.
Specifications. The rule for the semantics of an entire specification is large and complex. The conclusion is that a given machine, lowering module, and Alewife specification produce a final translation output \(\Omega\) . The rules work by non-deterministically taking fixpoints over all the material included. \(\mathit {decls}\) is the combination of all declarations found both in the initial specification and all the included lowering modules, and \(\mathit {frames}\) is the combination of all frame information (part of the declarations in Alewife; separated in Cassiopea). Similarly, the final set of environments \(\Gamma , \Delta , \Sigma\) represent fixpoints produced by processing all the declarations.
In Figure
A.22, the first premise expands the Alewife specification, as we will need to work with its components. The next two premises generate initial environments: the Cassiopea typing environments induced by the machine description, and its integer constants; we then require that these are included in the final environments. In the fifth and sixth premises, we require that the result of processing the declarations from the specification appears in the final environments, and that the translation of these to Cassiopea is included in the final declarations and frame rules. Then, for every lowering module requested by the specification, we require that it be provided in the input modules list and that its components appear in the final declarations and frame rules. This is followed by two more premises to ensure that these results are represented in the final environments. Finally, we include the block-let material in the final environments, include its lowered form in the final declaration list (block-lets lower to declarations), bind the lowerings of the pre- and postconditions, and define the output.
The fixpoint-based evaluation strategy for declarations is required, because the Alewife declarations rely on the Cassiopea lowering file (most notably for resolving symbolic constants), but the Cassiopea lowering file is in turn also specifically allowed to refer to objects declared by the Alewife specification, such as memory regions. In the implementation, this circularity is resolved by lifting both the Cassiopea and Alewife declarations (and block-lets) into a common representation and topologically sorting them based on identifier references. (Genuinely circular references among identifiers are prohibited.) From this point, they can be handled in order in a more conventional way.
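The implementation strategy described in the last paragraph can be sketched as an ordinary depth-first topological sort with cycle detection; the representation of declarations as a name-to-references map is an invented simplification.

```python
# Hypothetical sketch of the declaration-ordering strategy: lift the
# Cassiopea and Alewife declarations into one common list, topologically
# sort by identifier references, and reject genuinely circular
# references among identifiers.

def order_decls(decls):
    """decls: dict mapping each declared name to the set of names its
    definition references.  Returns the names in dependency order."""
    ordered, done, visiting = [], set(), set()

    def visit(name):
        if name in done:
            return
        if name in visiting:
            raise ValueError(f"circular reference involving {name!r}")
        visiting.add(name)
        for dep in sorted(decls.get(name, ())):
            if dep in decls:            # ignore references to builtins
                visit(dep)
        visiting.discard(name)
        done.add(name)
        ordered.append(name)

    for name in sorted(decls):
        visit(name)
    return ordered
```

Once sorted, the merged declarations can be processed front to back in the conventional way, with no fixpoint computation needed.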
Complete Output. Note that the output includes the declarations from the Cassiopea lowering modules (each \(\mathit {decls} _{\mathit {lower}}\) ). Apart from symbolic constants, we do not substitute the definitions of the lowering elements, as that would greatly complicate things, especially with functions; instead, we include the definitions and let the translation refer to them. In fact, because of the declaration ordering issues, in the implementation the lowering declarations and translated Alewife declarations can be arbitrarily interleaved in the output.
Note furthermore that it would not be sufficient to include only the lowering declarations explicitly imported with \(\mathtt {require}\) declarations, as those may refer freely to other things declared in the lowering module that the Alewife specification itself may have no cognizance of whatsoever.