Object Format

This article describes the layout in memory of all ML data objects in MLWorks.

Boxed values

Every boxed value is 32 bits. The C type mlval is a boxed value that might refer to a heap object, depending on its tag.

Every object is 8-byte aligned, giving us 3 "primary tag" bits at the bottom of every pointer. Boxed integers are 30 bits, with 2 zero tag bits (accounting for primary tag values 0 and 4). Here are the tag values (see rts\src\values.h):

Bits	Tag name	Description
0 0 0	INTEGER0	Even integer.
0 0 1	PAIRPTR	Pair.
0 1 0	HEADER	Header word.
0 1 1	REFPTR	Reference pointer.
1 0 0	INTEGER1	Odd integer.
1 0 1	POINTER	Any other pointer.
1 1 0	PRIMARY6	Unused (but even so not a pointer).
1 1 1	PRIMARY7	Unused (but odd, so a pointer).

Pairs (2-tuples) do not have a header word. Every other object has a header word. Note that POINTER = PAIRPTR+4, so loading from tuples is the same for header words as for other values. The ML type system means that the compiler always knows what the tag bits are when generating code to dereference a pointer, so can apply a fixed offset (which is free on most architectures).

There are a lot of useful macros for runtime C code to produce and manipulate these boxed objects:

Macro	Meaning
`PRIMARY(x)`	the primary tag value of `x`.
`MLINT(cint)`	the boxed ML integer with the value `cint`.
`CINT(mlint)`	the value of the ML integer `mlint`.
`MLPTR(t,p)`	the pointer `p` with primary tag `t`.
`OBJECT(x)`	a C pointer to the header of the boxed object `x`.
`MLVALISPTR(x)`	true iff `x` is a pointer.
`ISORDPTR(x)`	true iff `x` is a POINTER or PAIRPTR.
`ISREFPTR(x)`	true iff `x` is a REFPTR (or a PRIMARY7 value).

Header words

The bottom 6 bits of every header word is a "secondary tag", which identifies the runtime type of the object, and the rest of the header word is the length field. Here are the secondary tag values:

Tag bits	Tag name	Description
0 0 0 0 1 0	RECORD	Tuple with `length` slots: may contain pointers; simple equality.
0 0 1 0 1 0	STRING	String with `length` bytes: contains no pointers; simple equality.
0 1 0 0 1 0	ARRAY	Updatable tuple with `length` slots;, pointers; ref equality.
0 1 1 0 1 0	BYTEARRAY	Updatable tuple with `length` bytes; no pointers, ref equality.
1 0 0 0 1 0	BACKPTR	"Back pointer" (see below).
1 0 1 0 1 0	CODE	Code object of `length` words; no pointers, ref equality.
1 1 0 0 1 0	HEADER50	Unused.
1 1 1 0 1 0	WEAKARRAY	Weak array with `length` slots: special pointers, ref equality.

Note that particular secondary tag bits are used to convey some information. Bit 3 indicates to the garbage collector that the object contains no fixable pointers. Bit 4 indicates that polymorphic equality on the object is determined by its address (rather than its contents). Bit 5 indicates that the object is "unusual".

The only boxed pointers permitted into the interior of an object are "shared closures" and "back pointers". A "shared closure" object is a tuple with code pointers in even-numbered slots (0, 2, 4, ...) and zero words in odd-numbered slots (1, 3, 5, ...). A pointer may point to any even-numbered slot, with primary tag POINTER. Thus it will have a zero word where it would expect a header word.

A "back pointer" has primary tag POINTER and a header word with secondary tag BACKPTR. The length field of the header word is the offset (in bytes) of the back pointer header word from the object's header word. Back pointers are used, specifically, for pointers to code objects which form part of a "code vector" (of which more below).

Again the runtime has a number of useful macros to produce and manipulate secondary tags:

Macro	Meaning
`SECONDARY(x)`	the secondary tag of a header word `x`.
`LENGTH(x)`	the length field of a header wqord `x`.
`EQUALITY(x)`	true iff the object with header `x` has simple equality.
`POINTERS(x)`	true iff the object with header `x` contains pointers.
`FIXABLE(x)`	true if `x` is a header word and contains pointers.
`ISNORMAL(x)`	true iff the object with header `x` is ordinary.
`MAKEHEAD(t,l)`	a header word with secondary tag `t` and length `l`.
`DEREF(x)`	the value of the first word in an array.
`FOLLOWBACK(x)`	a boxed pointer to the object into which `x` is an internal pointer.
`TRUE_HEAD(x)`	a pointer to the first word of the object to which `x` points.
`OBJECT_SIZE(s,l)`	the size in bytes of an object with secondary tag `s` and length `l`.
`DATA_SIZE(s,l)`	the size in bytes of user data in such an object.

Floating-point Numbers

The only floating-point numbers supported by MLWorks, of the SML type real, are 64-bit (double-precision) IEEE-754 floats. A single such value is implemented as a POINTER to a BYTEARRAY containing 4 bytes of padding followed by the 8 byte float value itself (making 16 bytes including the header). The padding is required to align the float on an 8-byte boundary, which is necessary on some processor architectures.

MLWorks also supports arrays of real (as used, for instance, in the SML Basis Library structure RealArray). These are implemented as larger BYTEARRAY values, with 4 bytes of padding and a total size of 8N+4+4 bytes.

Code Objects

Every ML function is represented by a closure, which is a tuple. The zeroth slot in that tuple is a code item pointer, which actually points to the machine code generated for the function. When a function is called, its boxed closure is passed in a register, and the function code accesses all values referred to in the body of the function (aside from arguments) via slots in the closure.

For shared closures (see above), there may be other code item pointers in subsequent even-numbered slots (separated by zeroes in the odd-numbered slots). Thus a pointer into the interior of this same tuple (to one of the code item pointer slots) can be used as the boxed closure for each function.

All the code generated by the compiler is allocated into "code vectors" (secondary tag CODE). A code vector has a single boxed slot at the beginning (immediately after the header word) pointing to the "ancillary record", and the a number of "code items". In this way, the code for several distinct functions which share other closure values can be generated into a single code vector, with a shared closure.

Each code item has a BACKPTR header, and a single word of metadata (the "ancillary word", which is a terrible name because it's easy to confuse with the "ancillary record"), followed by the machine code itself. A pointer to code (for example, in a closure) points to one of these code items, so the first instruction of the machine code is at an offset of 3 from the code pointer (-5 for the POINTER tag, +4 for the header word, +4 for the ancillary word).

The ancillary word has several fields. From the least-significant bits up, this table shows field sizes on each architecture:

Name	SPARC	MIPS	x86	Description
number	10	10	10	The index of this item in the code vector.
intercept	5	6	8	An offset to code to replace to intercept this function.
leaf	1	1	1	A single bit flag denoting a leaf function.
saves	0	5	2	The number of callee-save registers.
args	0	0	3	The number of arguments passed on the stack.
non_gcs	16	10	8	The number of non-GCable spill slots in the stack frame.

There is a lot of bit pressure in this word, especially on x86 (where shortage of register pressure means that the values in the 'saves', 'args', and 'non_gcs' slots are often all non-zero, and variable-length instructions means that the 'intercept' slot needs to be larger). The sizes of the different fields have been carefully tweaked.

The intercept slot is an offset into a block of NOP instructions in the code, when compiled for interception. These instructions can be replaced to intercept function calls. This is used for tracing, debugging, space profiling, call-counting, etc. The units in which the intercept offset is counted are bytes on x86 and 32-bit words on SPARC and MIPS.

The ancillary vector has between one and three slots, depending on the compilation switches. Each slot is a boxed table indexed by code item number. The tables are as follows, in order:

Table	Description
names	An ML tuple of ML strings.
profile records	An ML tuple of pointers to C structures used when profiling: these are not allocated on the ML heap so the GC can treat them as non-pointers.
intercept fns	An ML array of ML functions. Note that this gets updated, so it does actually have to be an array.

There are C runtime macros for extracting and manipulating all this:

Macro	Meaning
`CCODEANCILL(p)`	the ancillary field of the code item.
`CCODENUMBER(p)`	the code index of the code item.
`CCODENONGC(p)`	the number of non-gc values.
`CCODEARGS(p)`	the number of stack parameters.
`CCODESAVES(p)`	the number of callee saves.
`CCODELEAF(p)`	the leaf flag.
`CCODEINTERCEPT(p)`	the intercept offset
`CCODE_CAN_INTERCEPT(p)`	is the function interceptable?
`CCODENAME(p)`	the code name (an ML string)
`CCODEPROFILE(p)`	the profile record
`CCODE_SET_PROFILE(p,v)`	set the profile record
`CCODEINTERFN(p)`	the intercept function
`CCODE_SET_INTERFN(p,f)`	set the intercept function
`CCODESTART(p)`	the address of the first instruction.

Arrays

An array (primary tag REFPTR, secondary tag ARRAY) has two words after the header word which are used by ML code and the garbage collector to manage "remembered sets". The words are slots (forward and back) for a doubly-linked list.

Whenever an array is created or updated, it's put on the "modified list", a singly-linked list linked through the back slots. Each generation has an "entry list": a doubly-linked list of arrays which should be scanned for pointers to this or any older generation.

Note that ML references are single-element arrays, and therefore occupy 4 words: header, forward and back slots, and reference.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Object Format

Boxed values

Header words

Floating-point Numbers

Code Objects

Arrays

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally