Contribution Guide

Source Code Compilation

The Nature compiler is written entirely in C, adhering to the C11 standard, and requires GCC version 11 or higher. The Linux runtime is statically linked against musl libc. CMake is used as the build tool, requiring version 3.22 or higher.

Build Environment Used for Current Releases:

Apple clang version 16.0.0 (clang-1600.0.26.6)

gcc version 14.2.0 (Ubuntu 14.2.0-4ubuntu2)

Compile the Runtime

# darwin/amd64 architecture: building is only supported on Darwin hosts.
rm -rf build-runtime && cmake -B build-runtime -S runtime -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=$(pwd)/cmake/darwin-amd64-toolchain.cmake && cmake --build build-runtime --target runtime
 
# darwin/arm64 architecture: building is only supported on Darwin hosts.
rm -rf build-runtime && cmake -B build-runtime -S runtime -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=$(pwd)/cmake/darwin-arm64-toolchain.cmake && cmake --build build-runtime --target runtime
 
# linux/amd64 architecture: Requires the x86_64-linux-musl-gcc compiler installed on the build machine.
rm -rf build-runtime && cmake -B build-runtime -S runtime -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=$(pwd)/cmake/linux-amd64-toolchain.cmake && cmake --build build-runtime --target runtime
 
# linux/arm64 architecture: Requires the aarch64-linux-musl-gcc compiler installed on the build machine.
rm -rf build-runtime && cmake -B build-runtime -S runtime -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=$(pwd)/cmake/linux-arm64-toolchain.cmake && cmake --build build-runtime --target runtime

Compiling the runtime generates libruntime.a, which is automatically placed in the corresponding architecture-specific subdirectory under lib, for example: nature/lib/linux_arm64/libruntime.a.

Compile the Source Code

# Use 'build-release' as the build directory. If cross-compilers are installed, specify the target architecture using TOOLCHAIN_FILE.
# cmake -B build-release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_TOOLCHAIN_FILE=$(pwd)/cmake/linux-amd64-toolchain.cmake -DCMAKE_BUILD_TYPE=Release -DCPACK_OUTPUT_FILE_PREFIX=$(pwd)/release
cmake -B build-release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_BUILD_TYPE=Release -DCPACK_OUTPUT_FILE_PREFIX=$(pwd)/release
 
# Compile
cmake --build build-release
 
# Test
cmake --build build-release --target test
 
# Install (Defaults to /usr/local/nature, may require sudo privileges)
cmake --build build-release --target install
 
# Package (Creates an archive in the 'release' directory)
cmake --build build-release --target package

Coding Standards

File Naming

  • Source files: Use lowercase letters and end with .c.
  • Header files: Use lowercase letters and end with .h.
  • Filenames should be concise and clearly indicate the module's purpose.

Function Naming

  • Use snake_case (lowercase letters and underscores).
  • Prefix functions with the module name to avoid naming collisions.
  • Functions not intended for external use (internal linkage) should be declared with the static keyword within the .c file (see the sketch below).
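
A short hypothetical example combining these rules (the scanner_* names are illustrative, not actual Nature APIs):

#include <stdbool.h>

typedef struct scanner_t scanner_t;

// Public API: functions carry the module prefix to avoid collisions.
void scanner_init(scanner_t *s, char *source);
int scanner_next_token(scanner_t *s);

// Internal helper: declared static in the .c file (internal linkage),
// so it cannot collide with helpers in other modules.
static bool is_ident_char(char c) {
    return (c >= 'a' && c <= 'z') || (c >= '0' && c <= '9') || c == '_';
}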

Type Naming

  • Suffix structure type names with _t.
  • Define types using typedef.
typedef struct module_t { ... } module_t;
typedef struct coroutine_t { ... } coroutine_t;

Variable Naming

  • Use snake_case.
  • Variable names should clearly convey their purpose.
  • Short names are acceptable for temporary variables with limited scope.

Code Formatting

Code should be formatted according to the .clang-format file located in the project's root directory.

Commenting Guidelines

Comments are optional for simple modules/functions but required for complex ones. Comments should be clear and explanatory. English is preferred for all comments. Existing Chinese comments in the codebase will eventually be translated to English.

Testing Guidelines

All test files reside in the tests directory. Source code tests are located in tests/features. Testing is implemented using CMake's CTest framework.

tests/features uses three types of source code tests. Refer to the existing examples within the tests directory:

  • Directory-based tests
  • Single file tests
  • testar archive tests

Git Branch Management

All Nature-related repositories follow the GitHub Flow workflow.

This means the master branch always contains the latest, thoroughly tested code. Feature development and bug fixes should be done on separate branches checked out from master. Once development is complete, submit a Pull Request (PR) to merge into master. After merging, the feature/bugfix branch should be deleted.

How to Contribute Code?

The Nature programming language uses GitHub as its primary code repository. Currently, there are no committees or similar organizations. Development decisions, PR merges, reviews, and releases are handled by the project initiator, weiwenhao.

For code contributions involving Features, Refactoring, Standard Library additions, or complex Bug Fixes, you must first propose your idea and discuss it in a GitHub Issue or via email. Development should only begin after the discussion concludes with approval. Once development is complete, submit a Pull Request (PR). Simple Bug Fixes can be submitted directly as PRs without prior discussion.

After developing Features, Standard Library additions, or Bug Fixes, test cases must be added. Once the functionality passes its specific tests, you must run the full CMake CTest regression suite locally before submitting the PR.

Directory Structure

├── CMakeLists.txt          # Main project build configuration file
├── LICENSE-APACHE          # Apache License
├── LICENSE-MIT             # MIT License
├── README.md               # English documentation
├── README_CN.md            # Chinese documentation
├── VERSION                 # Version information file
├── cmake                   # CMake toolchains and build configuration directory
├── cmd                     # Command-line tool related code
├── config                  # Compiler configuration file directory
├── include                 # Third-party library header file directory
│   ├── mach                # MacOS system call related header files
│   ├── mach-o              # MacOS executable format related header files
│   ├── uv                  # libuv asynchronous IO library header files
│   ├── uv.h
│   └── yaml.h              # yaml parsing library header file
├── lib                     # Build outputs and dependency library directory
│   ├── darwin_amd64        # MacOS x86_64 architecture library files
│   ├── darwin_arm64        # MacOS ARM architecture library files
│   ├── linux_amd64         # Linux x86_64 architecture library files
│   └── linux_arm64         # Linux ARM architecture library files
├── main.c                  # Program entry point file
├── nls                     # Internationalization resource file directory
├── package                 # Packaging related configurations and scripts
├── runtime
│   ├── CMakeLists.txt      # Runtime library build configuration
│   ├── nutils              # Helper functions depended on by user code (e.g., vec/map/set)
│   └── runtime.h           # Runtime entry point
├── src
│   ├── binary              # Assembler and linker source code
│   ├── build               # Compiler build entry point
│   ├── lower               # IR lowering and optimization related code
│   ├── native              # Target platform code generation
│   ├── register            # Register allocation related code
│   ├── semantic            # Semantic analysis related code
│   ├── symbol              # Symbol table related code
│   ├── syntax              # Syntax analysis related code
│   ├── ast.h               # Abstract Syntax Tree definition
│   ├── cfg.h               # Control Flow Graph definition
│   ├── debug               # Debug information related code
│   ├── error.h             # Error handling definitions
│   ├── linear.h            # AST to LIR compilation logic
│   ├── lir.h               # Low-level IR definition
│   ├── module.h            # Nature compiles module by module; module-related type definitions
│   ├── package.h           # Package management definitions
│   ├── ssa.h               # SSA form definitions
│   └── types.h             # Basic type definitions
├── std                     # Standard library source code
├── tests                   # Test related files
│   ├── CMakeLists.txt      # Test build configuration
│   ├── features            # Feature test cases
│   ├── test.h              # Test framework header file
│   ├── test_arm64_encoding.c # ARM64 encoding test
│   ├── test_bitmap.c       # Bitmap test
│   └── test_hello.c        # Hello World example test
└── utils                   # Common utility library
    ├── autobuf.h           # Auto-expanding buffer
    ├── bitmap.h            # Bitmap operations
    ├── ct_list.h           # Dynamic array
    ├── custom_links.h      # Custom linked data, depended on by runtime/compiler
    ├── error.h             # Error handling
    ├── exec.h              # Test process execution
    ├── helper.h            # Common helper functions
    ├── linked.h            # Linked list operations
    ├── log.h               # Compiler logging
    ├── mutex.h             # Mutex lock
    ├── sc_map.h            # Hash map implementation
    ├── slice.h             # Dynamic pointer array
    ├── stack.h             # Stack data structure
    ├── string_view.h       # String view
    ├── table.h             # Hash table (deprecated, use sc_map instead)
    ├── toml.h              # TOML configuration parsing
    ├── type.h              # Nature type definitions and reflection implementation shared by runtime/compiler
    └── ymal                # YAML parsing related

Compilation Flow

Nature's LSP (Language Server Protocol) is implemented in Rust and shares the same frontend logic as the Nature compiler. If you are more familiar with Rust, you can refer directly to the Nature LSP source code.

Compiler Frontend

Compiler Backend

  • main.build -> src/build/build.c#build (Entry Point)
  • Build modules -> src/build/build.c#build_modules
  • Semantic analyzer -> src/semantic/analyzer.c#analyzer
  • Type system infer -> src/semantic/infer.c#pre_infer and infer
  • AST to LIR -> src/linear.c#linear
  • LIR to CFG -> src/cfg.c#cfg
  • SSA on CFG -> src/ssa.c#ssa
  • Cross Arch Lower -> src/lower
  • Register Alloc -> src/register/linearscan.c#reg_alloc
  • Cross Native -> src/native
  • Assembler file -> src/build/build.c#build_assembler
  • ELF Linker -> src/build/build.c#build_elf_exe -> src/binary/elf/linker.c#elf_exe_file_format
  • Mach-O LD -> src/build/build.c#build_mach_exe

Runtime Flow

  • runtime_main -> runtime/runtime.c#runtime_main
  • sched_init -> runtime/processor.c#sched_init
  • processor_run -> runtime/processor.c#processor_run
  • wait_sysmon -> runtime/sysmon.c#wait_sysmon
  • runtime_gc -> runtime/collector.c#runtime_gc

Lexer Scanner

src/syntax/scanner.c

This is the lexical analyzer (Scanner), primarily responsible for converting source code text into a sequence of tokens. The Lexer Scanner is the first stage of the compiler frontend, providing the foundation for subsequent syntax analysis.

Syntax Parser

src/syntax/parser.c

This file implements the Syntax Parser, which is mainly responsible for converting the token sequence generated by the Lexer Scanner into an Abstract Syntax Tree (AST). The parser uses a Recursive Descent approach and is a core component of the compiler frontend.

Expression parsing employs the Operator Precedence Climbing algorithm, implemented via parser_expr() and parser_precedence_expr(). parser_expr handles several special expressions, such as go, match, fndef, and struct_new.
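
For orientation, here is a textbook sketch of the precedence-climbing loop. Every name in it is illustrative (the assumed helpers peek/advance/prec_of do not exist under these names); the real implementation is parser_precedence_expr in src/syntax/parser.c:

// Parse expressions whose binary operators bind at least as tightly as min_prec.
static ast_expr_t *parse_precedence(parser_t *p, int min_prec) {
    ast_expr_t *left = parse_unary(p);               // operand or prefix expression
    while (is_binary_op(peek(p)) && prec_of(peek(p)) >= min_prec) {
        token_t op = advance(p);
        // min_prec + 1 makes left-associative operators group as (a - b) - c
        ast_expr_t *right = parse_precedence(p, prec_of(op) + 1);
        left = new_binary_expr(op, left, right);
    }
    return left;
}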

Error checking and reporting use the PARSER_ASSERTF macro. Currently, the compiler does not implement error recovery mechanisms. The LSP has implemented relevant error recovery, allowing it to continue parsing beyond errors.

Ambiguous symbols are resolved using lookahead parsing to determine their semantics.

Semantic Analyzer

src/semantic/analyzer.c

This implements the Semantic Analyzer within the compiler, primarily responsible for:

  1. Symbol resolution and scope analysis.
  2. Generating unique identifiers for variables and functions.
  3. Handling closure variable capture.
  4. Module import and symbol visibility analysis.
  5. Type alias resolution.
  6. Global variable initialization.
  7. Using the ANALYZER_ASSERTF macro for error checking and reporting.
  8. Transforming the module-level AST into function-level units, as subsequent stages operate at the function level.

Type System Infer

src/semantic/infer.c

This is the Type Inference module in the compiler, primarily responsible for:

  1. Type inference and checking:
    • Type inference for variable declarations.
    • Expression type inference and checking.
    • Function call argument type checking.
    • Generic type specialization and instantiation.
  2. Type Reduction:
    • Reducing Type Aliases to their concrete types.
    • Handling type reduction for composite types (arrays, pointers, structs, etc.).
    • Type reduction for generic parameters.
  3. Main functions:
    • pre_infer: Pre-processes global variable and function declarations, mainly handling function signatures without processing bodies.
    • infer: Entry function, handles type inference for the entire module.
    • infer_expr: Expression type inference.
    • reduction_type: Type reduction and type parameter handling.
    • infer_fndef: Type inference for function definitions.
    • generics_special_fn: Generic function handling.

It uses the INFER_ASSERTF macro for type error checking and reporting. This module is the core of the compiler's type system, ensuring correct type usage in the program and providing type information support for subsequent code generation. It executes after the semantic analysis phase and before code generation, playing a crucial role in ensuring type safety.

Linear to LIR

src/linear.c

This is the compiler's Linear Intermediate Representation (Linear IR) generation module, responsible for converting the AST into a linear intermediate code representation. The generated LIR currently contains numerous runtime call invocations; future versions should inline these runtime calls into pure LIR implementations.

Closure compilation uses a JIT-based approach, simplifying the compilation process, and the caller does not need to be concerned with the closure's captured environment. See the diagram below:

Nature requires variables to be assigned a value upon initialization. However, it is more lenient with struct initialization, assigning default values to most types:

  • The default value for int is 0.
  • The default value for string is the empty string ''.
  • The default values for vec/map/set are the corresponding empty containers.
  • tup/struct recursively assign default values to their members.

Certain types must be manually assigned because a reasonable default value cannot be set, such as function types (fn) and pointer types (ptr).

LIR to CFG

src/cfg.c

This module builds the Control Flow Graph (CFG) for the compiler.

  1. Basic Block Construction:
    • Implemented via cfg_build.
    • Splits linear IR code into basic blocks based on labels and jump instructions.
    • Establishes sequential and jump relationships between basic blocks.
  2. Critical Edge Handling:
    • Implemented via broken_critical_edges.
    • Handles critical edges in the control flow graph.
    • Inserts empty intermediate basic blocks to optimize data flow analysis. The subsequent resolve_data_flow in register allocation does not allow critical edges.
  3. Unreachable Code Elimination:
    • Implemented via cfg_pruning.
    • Removes basic blocks unreachable from the entry block.
    • Maintains predecessor and successor relationships between basic blocks.
  4. Return/Break Check:
    • Return statement check: Ensures all execution paths have a return statement.
    • Break statement check: Ensures correct break statements within match/catch structures.

SSA

src/ssa.c

This file implements the transformation into SSA (Static Single Assignment) form, an important step in compiler optimization. Main functionalities include:

  1. Dominance Analysis:
    • ssa_domers: Computes the dominator set for each basic block.
    • ssa_imm_domer: Computes the immediate dominator for each basic block.
    • ssa_df: Computes dominance frontiers.
  2. Variable Liveness Analysis:
    • ssa_use_def: Analyzes variable usage and definitions within each basic block.
    • ssa_live: Performs liveness analysis, computing live-in and live-out variables for each basic block.
  3. SSA Transformation:
    • ssa_add_phi: Inserts φ (phi) functions where necessary.
    • ssa_rename: Renames variables to ensure each variable is assigned only once.

The primary characteristic of SSA form is that each variable is assigned only once. This allows for:

  1. Simplified data flow analysis.
  2. Facilitation of subsequent code optimizations.
  3. Clearer definition-use relationships for variables.

Currently, the Nature compiler has not yet begun implementing optimizations based on SSA form.

SSA Example 1:

int foo = 1
foo = foo + 2
if (true) {
    int car = 4
    foo = 2
} else {
    int car = 5
    foo = 3
}
println(foo)
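
After ssa_rename, every assignment receives a fresh version and a φ function merges the branch definitions. A hand-worked, illustrative SSA form of Example 1 (the subscripts and φ placement are a sketch, not the compiler's actual output):

foo_1 = 1
foo_2 = foo_1 + 2
if (true) {
    car_1 = 4
    foo_3 = 2
} else {
    car_2 = 5
    foo_4 = 3
}
foo_5 = φ(foo_3, foo_4)   // merges the definitions from the two branches
println(foo_5)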

SSA Example 2:

int foo = 0
for (int i = 0; i <= 100; i+=1) {
 if ((i % 2) == 0) {
  foo += i
 }
}
 
println(foo)
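
Example 2 has a back edge, so φ functions sit at the loop header (and at the join after the if) to merge the value from the previous iteration with the incoming one. Again a hand-worked, illustrative sketch:

foo_1 = 0
i_1 = 0
header:
    foo_2 = φ(foo_1, foo_4)   // entry value vs. value from the back edge
    i_2 = φ(i_1, i_3)
    if (i_2 > 100) goto exit
    if ((i_2 % 2) == 0) {
        foo_3 = foo_2 + i_2
    }
    foo_4 = φ(foo_3, foo_2)   // join of the taken/not-taken paths in the body
    i_3 = i_2 + 1
    goto header
exit:
    println(foo_2)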

Cross Lower

src/lower

The lower stage is primarily responsible for converting LIR into architecture-specific instruction forms for AMD64/ARM64. For instance, on the AMD64 architecture, ADD is a two-operand instruction whose destination register also serves as one of the source operands. Another example is idiv, which uses the rdx:rax register pair as the dividend.

It also implements architecture-specific Calling Conventions. Taking the AMD64 architecture as an example:

  • When calling a function, arguments are passed in specific registers in order: the first argument uses rdi, the second uses rsi, and so on.
  • The function return value is uniformly placed in the rax register.

These architecture-related physical register (Fixed Register) constraints must be handled before register allocation. This allows the subsequent register allocation phase to accurately calculate the live range of each variable (including virtual and physical registers), leading to better allocation decisions.

Special Handling for Array Parameter Passing:

In C, when an array is passed as a function parameter, it automatically decays into a pointer, effectively passing the array's base address. In the Nature programming language, however, the intention is for array parameters to be passed by value (i.e., copying the entire array content). To achieve this, during the processing of function call arguments (Lower Call Args):

  1. Allocate space on the stack equal to the size of the original array.
  2. Copy the entire content of the original array into this allocated stack space.
  3. Pass the address of this stack space as the argument to the function.

This ensures that modifications to the array within the function do not affect the original array, achieving true value-passing semantics. If C-like array reference passing is desired, simply using arr as anyptr will bypass this copying behavior and pass the data by reference.
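
A conceptual C sketch of this lowering (the real code emits LIR during Lower Call Args rather than C; the names and the fixed size are illustrative):

#include <stdint.h>
#include <string.h>

#define ARR_SIZE 32               // size of the original array, known at compile time

void callee(uint8_t *arg);

void lower_call_example(uint8_t *arr) {
    uint8_t tmp[ARR_SIZE];        // 1. allocate stack space equal to the array size
    memcpy(tmp, arr, ARR_SIZE);   // 2. copy the entire array contents
    callee(tmp);                  // 3. pass the address of the copy as the argument
}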

Register Alloc

src/register

Register allocation is the process of mapping program variables to physical registers. Since registers are limited in number but fast to access, effective register allocation significantly impacts program performance.

The Nature compiler employs a linear scan algorithm based on SSA-form LIR for register allocation. It primarily references two papers:

  1. Linear Scan Register Allocation for the Java HotSpot Client Compiler. — Wimmer, C.
  2. Linear Scan Register Allocation on SSA Form. — Christian Wimmer, Michael Franz

Main components:

  • interval.c: Builds live intervals for variables.
  • allocate.c: Implements the core allocation algorithm.
  • register.c: Manages physical registers.
  • amd64.c/arm64.c: Register definitions for different architectures.

Workflow:

  1. interval_block_order: Sort basic blocks.
  2. linear_prehandle: Add instruction sequence numbers.
  3. interval_build: Build live intervals for virtual variables and physical registers.
  4. allocate_walk: Traverse live intervals, allocate physical registers, and handle register spilling (moving variables to memory when registers are full).
  5. resolve_data_flow: Handle data flow across basic block boundaries and eliminate PHI operations.
  6. replace_virtual_register: Replace virtual registers with allocated physical registers.

After register allocation, the LIR no longer contains virtual variables (VAR), making it easier to convert to assembly code. In the current version, the allocation results are relatively poor: the numerous runtime calls cause frequent register spills.
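
For orientation, a textbook-style shape of a live interval; this struct is purely illustrative, and the actual structure is defined in src/register/interval.c:

typedef struct {
    int var_id;       // virtual register the interval belongs to
    int from, to;     // first/last instruction numbers where the value is live
    int assigned;     // allocated physical register, or -1 when spilled
    int stack_slot;   // spill slot offset, meaningful only when assigned == -1
} live_interval_t;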

Cross Native

src/native

The native stage converts the LIR, after register allocation is complete, into assembly code. This assembly code conforms to the syntax specifications and instruction sets of the AMD64/ARM64 architectures. In practice, it's actually converted into a C language array representation rather than directly into textual assembly.

Example: list_push(operations, AMD64_ASM("mov", dst_operand, src_operand));

Assembler

src/binary/encoding/amd64/opcode.c — AMD64 assembler
src/binary/encoding/arm64/opcode.c — ARM64 assembler

The main task of the assembler is to convert the assembly syntax arrays generated by the Native stage into the binary machine code required by the corresponding CPU architecture.

For the AMD64 architecture, the amd64_asm_inst_encoding function encodes assembly instructions into machine code. This process generates a trie (prefix tree) to select the appropriate AMD64 instruction. After instruction selection, various parts like prefixes, opcode, ModR/M byte, SIB byte, and potentially displacement and immediate values are filled in, ultimately generating variable-length machine code. Example of the trie constructed from the AMD64 instruction set:

On the ARM64 architecture, the arm64_asm_inst_encoding function uses a simpler implementation. It matches the assembly instruction to the corresponding opcode handler function and then generates fixed 4-byte length machine code.

These assembler functions are called during the generation of operating system-specific object files. The relevant code is located in:

src/build/build.c#elf_assembler_module -> elf_amd64_operation_encodings/elf_arm64_operation_encodings
src/build/build.c#mach_assembler_module -> mach_amd64_operation_encodings/mach_arm64_operation_encodings

The encodings functions are primarily responsible for encoding assembly instructions into machine code and handling symbol relocation issues. They work in two passes:

  • The first pass handles the encoding of assembly instructions and relocation of known symbols.
  • The second pass handles symbol references that could not be resolved in the first pass.

Finally, all encoded instruction data is written into the code section, and the instruction length of generated function labels (fn label) is recorded, paving the way for the subsequent reflection mechanism.

Linker

src/binary/elf/linker.c#elf_exe_file_format — ELF linker implementation

The linker is responsible for merging multiple object/archive files into a single executable file. Nature's linker is a simple implementation, primarily focusing on static linking. Its workflow involves:

  • elf_resolve_common_symbols: Handles COMMON symbols, allocating them to the .bss section.
  • elf_build_got_entries: Builds GOT/PLT entries, handling relocations related to dynamic linking (though dynamic linking itself is not fully supported).
  • set_section_sizes: Calculates the sizes of various sections.
  • alloc_section_names: Allocates the section name string table.
  • layout_sections: Lays out all sections, calculating their virtual addresses and file offsets.
  • elf_relocate_symbols: Relocates symbols, calculating their final addresses.
  • elf_relocate_sections: Processes all relocation sections.
  • elf_fill_got: Fills the Global Offset Table (GOT).
  • elf_output: Generates the final ELF executable file.

It does not support dynamic linking, version control, or debug information processing.

Runtime Sched

runtime/processor.c#sched_init

Nature's scheduler (sched) is a simple implementation resembling Go's GMP scheduling model, built around a Processor + Coroutine model.

Core Components

  • n_processor_t: Processor struct, representing an execution unit (like Go's P).
  • coroutine_t: Coroutine struct, representing an executable task (like Go's G).
  • Shared processor list (share_processor_list) and solo processor list (solo_processor_list).

Main Functions

  • Coroutine scheduling: Creating, running, switching, and destroying coroutines.
  • Processor management: Creating and managing multiple processor instances.
  • Non-preemptive scheduling: Yielding occurs through safepoints.
  • GC related: Supports safepoints for Stop-The-World (STW) garbage collection.

Key Functions

  • sched_init(): Scheduler initialization.
  • processor_run(): Processor main loop.
  • coroutine_resume(): Resume coroutine execution.
  • rt_coroutine_dispatch(): Coroutine scheduling dispatch.

Tips

  1. Scheduling is cooperative based on safepoints. Safepoints are inserted at function ends and before function calls. Therefore, long-running for loops without function calls can prevent other coroutines from being scheduled and block GC.
  2. Nature currently has a very basic coroutine scheduler. Once created, coroutines are generally balanced and fixed to a Processor.
  3. Nature uses shared stack coroutines. This allows for more lightweight coroutine stacks, faster creation, and faster switching. Shared stack diagram:
    • Before a yield, valid_sz and SP are not fixed, making it impossible to calculate the exact location of the foo variable on the save_stack!
    • If valid_sz's position were fixed, the offset align_retpt - 0xn would equal save_stack.ptr + valid_sz - 0xn.
  4. A significant issue with shared stacks is that the stack pointer becomes invalid for other coroutines upon yielding. Therefore, obtaining &q in Nature results in something like rawptr<T>, an unverified, unsafe pointer. This might be solvable through escape analysis or stack switching, but Nature currently implements neither. If you can guarantee no coroutine yield will occur, you can use rawptr. Otherwise, you must explicitly allocate data on the heap.

Runtime GC

runtime/collector.c#runtime_gc

GC (Garbage Collection) is an automatic memory management mechanism that automatically identifies and reclaims memory space no longer in use by the program, thereby preventing memory leaks and simplifying memory management. GC typically consists of three main components:

Mutator, Allocator, and Collector

  • Mutator: Also known as the user program. During its operation, it performs three types of operations related to heap memory.
    var a = malloc() // New: Allocate a new piece of memory (obj) from the heap for local variable a
    var b = a.foo // Read: Read the foo field of obj a and assign it to local variable b
    a.foo = null // Write: Clear a.foo
    b.bar = a.bar // Write: Modify the pointer of b.bar
  • Allocator: The memory allocator, responsible for requesting free memory from the heap. The allocated memory unit is typically called an object (obj), which consists of one or more fields.
  • Collector: The garbage collector, responsible for identifying and cleaning up objects (obj) that are no longer used by the user program.

Nature uses Tri-color Marking GC, as illustrated below:

https://upload.wikimedia.org/wikipedia/commons/1/1d/Animation_of_tri-color_garbage_collection.gif

Incremental and Concurrent GC

Incremental garbage collection typically refers to a single-threaded model where the collector's work is divided into smaller sub-tasks, allowing the mutator (white blocks) and collector (grey/black blocks) to execute alternately. Incremental GC reduces the duration of Stop-The-World (STW) pauses perceived by the user program.

Extending incremental GC techniques to multi-core, multi-threaded environments allows for a form of incremental GC that can run concurrently most of the time. This is the GC approach adopted by both Go and Nature.

Although each thread might block its mutator during GC work, from a multi-threaded perspective, there will always be some user program threads making progress. Therefore, in a multi-threaded context, the definition of STW (Stop-The-World) needs refinement: only when the collector blocks all threads simultaneously is it considered a true STW. For example, the diagram above shows two STW phases.

The primary goal of both incremental and concurrent collectors is to minimize the pause times perceptible to the mutator. Ideally, a concurrent collector could run entirely in parallel with the mutator (i.e., different threads perform their own collection work without interference, although this is challenging in a coroutine scheduling model), potentially shortening the overall execution time.

Concurrent collection inevitably requires some form of communication and synchronization between the mutator and the collector, which manifests as mutator barrier techniques.

Barrier Techniques

As mentioned in the incremental GC section, simply interleaving collector sub-tasks with the mutator is not straightforward. The main problem is that the mutator can interfere with the collector's completed work, leading to GC correctness issues.

GC correctness issues are severe, as they can lead to the erroneous collection of memory still in use.

Barrier techniques intercept the mutator's operations on objects (obj) to ensure GC correctness. We previously discussed three types of mutator operations related to heap memory. Barrier techniques primarily intercept the Read and Write operations.

// var b = a.foo // Read: Read the foo field of obj a and assign it to local variable b
var b = read_barrier(a, foo) // Intercepting Read operations is called a read barrier
 
// a.foo = null // Write: Clear a.foo
write_barrier(a, foo, null) // Intercepting Write operations is called a write barrier
 
// b.bar = a.bar // Write: Modify the pointer of b.bar
write_barrier(b, bar, read_barrier(a, bar)) // Combines write and read barriers

Statistics from Lisp programs show that Read operations occur far more frequently than Write operations, making read barriers very expensive. Therefore, mark-sweep garbage collectors typically use write barriers.

Tri-color Marking Garbage Collection

Tri-color marking GC is a type of incremental mark-sweep garbage collection algorithm. It categorizes objects (obj) into three colors during GC:

  • White: Objects not yet scanned by the collector.
  • Grey: Objects scanned by the collector (added to the worklist), but whose fields have not yet been fully scanned.
  • Black: Objects scanned by the collector, and all of whose fields have also been scanned.

The mention of a worklist implies a breadth-first traversal, which is essentially what tri-color marking does. Starting from the mutator roots, it traverses all reachable objects. Objects that are not reached are considered garbage and need to be cleaned up.
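
A textbook sketch of that worklist traversal in C (the types and helper names are illustrative, not Nature's collector; worklist_pop/worklist_push are assumed helpers over a grey-object queue):

typedef enum { WHITE, GREY, BLACK } color_t;

typedef struct obj_t {
    color_t color;
    struct obj_t **fields;    // pointer fields of this object
    int field_count;
} obj_t;

typedef struct worklist_t worklist_t;
obj_t *worklist_pop(worklist_t *wl);
void worklist_push(worklist_t *wl, obj_t *o);

void mark(worklist_t *wl) {
    obj_t *o;
    while ((o = worklist_pop(wl)) != NULL) {          // o is grey
        for (int i = 0; i < o->field_count; i++) {
            obj_t *child = o->fields[i];
            if (child != NULL && child->color == WHITE) {
                child->color = GREY;                  // reached, fields pending
                worklist_push(wl, child);
            }
        }
        o->color = BLACK;                             // all fields scanned
    }
}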

The "incremental" aspect of tri-color marking generally refers to the marking phase. During marking, the current thread can yield control to the mutator at almost any time.

Consider the scenario in the diagram: control is yielded to the mutator. The mutator makes a black object point to a white object. Then, control returns to the collector. What happens? Since black objects are considered fully processed, they won't be scanned again. Consequently, the white object now pointed to only by the black object will be mistakenly collected as garbage.

Write Barriers and Tri-color Consistency

Tri-color consistency is a property that, if maintained, guarantees that no live object is mistakenly collected. There are two main forms:

  • Strong Tri-color Consistency: No black object points to a white object.
  • Weak Tri-color Consistency: A black object may point to a white object, provided that white object remains reachable from some grey object (possibly through a chain of white objects).

Grey objects are on the worklist and will be scanned, ensuring the white object they point to is eventually reached and not lost.

Barrier techniques are used to maintain tri-color consistency. Besides classifying barriers by read/write, they can also be categorized based on the initial color (state) of the mutator.

In tri-color marking, when discussing object colors: White objects are unscanned, Grey objects are scanned but their fields aren't, and Black objects are fully scanned (including fields).

If we consider the mutator roots (stack, globals) collectively as an "object," we can classify the mutator itself as White, Grey, or Black. This concept is key to the theory of concurrent GC.

Grey Mutator Barriers

The diagram shows a grey mutator where a field points to a white object. In this state, the mutator roots must be rescanned after the initial marking phase to ensure the white object isn't mistakenly collected.

When the mutator is grey, it means there's a root field pointing to a white object. Thus, performance-impacting read barriers aren't strictly necessary to maintain consistency. Instead, write barriers are used to prevent black objects from pointing to white objects, thereby satisfying strong tri-color consistency.

For example, var b = a.foo executes without interception; if a read barrier were inserted, it would become var b = read_barrier(a, foo).

Barrier techniques proposed by Steele, Boehm, and Dijkstra fall under the grey mutator category. Consider Dijkstra's write barrier (pseudo-code):

atomic write_barrier(obj, field, newobj):
    // Perform the write first or last depending on the specific barrier
    obj->field = newobj
 
    // Ensure strong consistency: prevent black -> white pointers
    if is_black(obj)
        shade(newobj) // Mark newobj grey if obj is black

The logic is simple: if the object obj being modified is black, then the newly pointed-to object newobj must be marked grey. This maintains strong tri-color consistency.

However, a consequence is that without read barriers, in an operation like var b = a.foo, if a.foo points to a white object, it's not immediately handled. Therefore, after the initial marking phase completes, a STW pause is required to rescan the mutator roots.

Black Mutator Barriers

The diagram shows a black mutator; all its fields have been processed, and the collector does not need to rescan the roots.

A black mutator ensures all its root fields point to either black or grey objects. The mutator itself inherently adheres to tri-color consistency. Therefore, no STW rescan is needed after marking, but a STW pause is required at the beginning of the GC cycle to scan the mutator roots and turn the mutator black.

Let's focus on Yuasa's deletion write barrier technique, based on a snapshot-at-the-beginning approach, which is a type of black mutator barrier.

Once the mutator roots are scanned (turning the mutator black), the write barrier is enabled on all threads. The pseudo-code is:

atomic write_barrier(obj, field, newobj):
    // Snapshot the object being overwritten
    oldobj = obj->field
    // If the object being overwritten (oldobj) is white,
    // mark it grey *before* performing the update.
    // Grey means mark and add to worklist for the collector.
    if is_white(oldobj)
        shade(oldobj)
 
    // Perform the actual write
    obj->field = newobj

During the obj->field = newobj assignment, if obj is black, this might create a black object pointing to a white newobj. The Yuasa barrier doesn't directly address this. Instead, it acts only when the pointer obj->field (pointing to oldobj) is about to be overwritten. If the oldobj being overwritten is white, the barrier marks oldobj grey, ensuring the collector eventually processes it.

In Yuasa's snapshot algorithm, oldobj = obj->field represents the snapshot.

Yuasa's algorithm requires no read barrier and no stack rescan, and it permits the mutator to point to white objects while still maintaining weak tri-color consistency.

This is because during the mark phase, if newobj is referenced by a mutator root, it must have been reachable by some other path before the current write. A newobj cannot appear out of thin air and become referenced by a root during the mark phase unless it was already reachable before GC started.

var a = allocate() // A brand new object; allocation after GC starts marks it, so no issue.
var b = a.foo

Even if the object pointed to by a.foo is white, it's safe because the object a itself must be grey or black.

Hybrid Write Barrier

Go version 1.8 implemented a hybrid write barrier based on Dijkstra + Yuasa, and Nature adopts a similar approach. This allows for concurrent GC with very short STW pauses.

The hybrid barrier inherits the best properties of both: it allows concurrent scanning of roots starting from the mark phase and ensures roots are black by the end of marking.

// Version 1
atomic write_barrier(obj, field, newobj):
    // Yuasa part: Shade the object being overwritten if it's white
    if is_white(obj->field)
        shade(obj->field) // grey
 
    // Dijkstra part (active only while stack scanning is incomplete):
    // If the current goroutine/thread's stack is still grey (not fully scanned)
    if stack is grey:
        shade(newobj) // grey
 
    // Actual write
    obj->field = newobj
 
// Version 2 (conceptually identical)
atomic write_barrier(slot, ptr):
    // Shade the value being overwritten (*slot) - Yuasa
    shade(*slot)
 
    // If the current stack is grey (scan incomplete) - Dijkstra
    if current stack is grey:
        // Shade the new value (ptr) being written
        shade(ptr)
 
    // Perform the actual write
    *slot = ptr

Before the GC mark phase begins, a brief STW is needed to enable the write barrier globally. Once enabled, root scanning can proceed concurrently across multiple threads. According to the code above, if a goroutine/thread hasn't been suspended and scanned yet (its stack is "grey"), any pointer write operations within that goroutine will cause the new object being pointed to (newobj or ptr) to be marked grey (Dijkstra part). Once the current goroutine is suspended and its stack scanned (stack becomes "black"), the write barrier effectively degrades to just the deletion barrier (Yuasa part). Crucially, once the write barrier is enabled, all newly allocated objects are marked black directly.

This ensures weak tri-color consistency even when goroutines of different "colors" (scan states) operate on the same object.

STW Phases

Different barrier techniques require STW pauses at different stages:

  • Grey Mutator: Requires an STW before marking starts to enable write barriers on all threads simultaneously. Otherwise, unsynchronized threads could violate consistency (e.g., create a black->white pointer undetected). It also requires another STW before marking ends to rescan the roots.

  • Black Mutator: Requires an STW before marking starts to scan roots and enable write barriers. No STW rescan is needed at the end.

  • Hybrid Write Barrier: Only requires a brief STW before marking starts to enable the write barrier. Subsequent marking and root scanning happen concurrently.

Although Nature uses a hybrid write barrier, it still requires two STW phases, but these pauses are designed to be very short.

Runtime Allocator

runtime/allocator.c#rti_gc_malloc

The allocator component is responsible for requesting memory from the operating system. Nature's memory allocator is directly inspired by Go's implementation, with some simplifications. The structure is shown below:

At the lowest level, Nature uses the mmap function (or equivalent) to request large chunks of memory from the OS. After the collector finishes, the madvise function (or equivalent) is used to return unused memory back to the operating system.
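
A minimal sketch of those two OS-level calls (standard POSIX APIs; the chunk size and flags are illustrative, not Nature's actual allocator parameters):

#include <sys/mman.h>

#define CHUNK_SIZE (64UL * 1024 * 1024)

// Reserve a large anonymous chunk of memory from the OS for the heap.
void *reserve_chunk(void) {
    return mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}

// After collection, tell the OS the pages' contents are no longer needed.
void release_chunk(void *chunk) {
    madvise(chunk, CHUNK_SIZE, MADV_DONTNEED);
}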

Runtime Libuv

libuv is a cross-platform asynchronous I/O library. Nature uses libuv to handle network I/O.

In the runtime implementation, an event loop is created for each shared processor (typically matching the CPU count) to handle asynchronous I/O events. However, libuv is not thread-safe. This conflicts with the coroutine model where coroutines can migrate between processors (threads). Therefore, care must be taken when using libuv: avoid using a libuv event loop across different shared processors.

Consider an HTTP server example (details in runtime/nutils/http.c):

  • I/O events (like TCP connections, data reads/writes) are monitored and triggered by a main libuv loop (likely associated with a dedicated I/O thread or the initial thread).
  • When an I/O event is ready, a new coroutine is created to handle the business logic.
  • After the business logic is processed, any resulting I/O operations must be performed by sending the relevant data back to the original libuv loop (using mechanisms like uv_async) for execution, rather than attempting the I/O operation directly within the event loop of the current coroutine's potentially different shared processor (see the sketch below).
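
A minimal hand-back sketch using libuv's thread-safe async handle; everything except the uv_* APIs is illustrative:

#include <uv.h>

typedef struct {
    uv_async_t async;   // must only be used on its owning loop's thread
    /* request payload, e.g. the buffer to write */
} io_request_t;

// Runs on the owning loop's thread; safe to start uv_write/uv_read here.
static void on_wakeup(uv_async_t *handle) {
    io_request_t *req = (io_request_t *)handle->data;
    (void)req; // issue the actual I/O on this loop
}

// Setup, executed on the loop's own thread.
void io_request_init(uv_loop_t *loop, io_request_t *req) {
    uv_async_init(loop, &req->async, on_wakeup);
    req->async.data = req;
}

// uv_async_send is one of the few libuv calls that is safe from any thread.
void io_request_submit(io_request_t *req) {
    uv_async_send(&req->async);
}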

Runtime Reflect

utils/custom_links.h

The runtime relies on data generated at compile-time to implement features like error tracing, type reflection, and GC stack scanning. utils/custom_links.h and utils/type.h define the data structures shared with the runtime.

  • symdef_t: Defines the structure for symbols, including base address, size, GC marking information, etc.
  • fndef_t: Defines the structure for functions, including base address, size, runtime information, stack layout info, etc.
  • caller_t: Defines the structure for call sites, recording file, line, and column number information.
  • rtype_t: Defined in utils/type.h, defines reflection information for types, including size, hash, GC bitmap (gc_bits), kind, etc.

During compilation, this data is collected and serialized into the object file, and symbols referencing it are registered in the ELF/Mach-O symbol table.

src/build/build.c#elf_custom_links
src/build/build.c#mach_custom_links

Subsequently, the runtime only needs to declare these symbols as extern to read the relevant data.
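
Conceptually, the runtime side looks like this (the symbol names below are illustrative, not the actual identifiers emitted by the compiler):

#include <stdint.h>
#include "utils/custom_links.h"   // real definitions of symdef_t, fndef_t, etc.

// The compiler serializes the tables into the object file and registers
// symbols for them; the runtime just declares the symbols extern and reads them.
extern symdef_t rt_symdef_data[];
extern uint64_t rt_symdef_count;

uint64_t runtime_symbol_count(void) {
    return rt_symdef_count;
}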