Sprint 6 — Syscall Interface & Userspace Entry
Cross the Ring 0 / Ring 3 boundary.
✅ Complete
Overview
Sprint 6 ties the kernel subsystems into a usable system by implementing the SYSCALL/SYSRET fast transition mechanism, an ELF loader, and the transition into Ring 3 (user mode). After this sprint, the kernel can load and run a userspace program.
SYSCALL/SYSRET
What is SYSCALL/SYSRET?
The SYSCALL instruction is the fast path for entering the kernel from userspace on x86_64. Unlike software interrupts (int 0x80), SYSCALL doesn't push to the stack or read the IDT — it uses pre-configured MSRs (Model-Specific Registers) for maximum speed.
MSR Configuration
| MSR | Name | Purpose |
|---|---|---|
| STAR | Segment Selectors | Bits 47:32 = kernel CS, Bits 63:48 = user CS base |
| LSTAR | Syscall Entry | 64-bit address of the syscall handler entry point |
| SFMASK | RFLAGS Mask | Flags to clear on syscall entry (disable interrupts) |
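
As a rough illustration of the table above, the sketch below packs a STAR value and derives the selectors SYSRET would load from it. The MSR numbers are the architectural ones; the kernel CS of 0x08 and user base of 0x10 in the usage note are assumptions, chosen only because they reproduce the 0x23/0x1B user selectors quoted in the Ring 3 Entry section. The helper names are hypothetical and the actual kernel code may be organized differently.

```rust
/// Architectural MSR numbers for the SYSCALL/SYSRET mechanism.
pub const MSR_EFER: u32 = 0xC000_0080;
pub const MSR_STAR: u32 = 0xC000_0081;
pub const MSR_LSTAR: u32 = 0xC000_0082;
pub const MSR_SFMASK: u32 = 0xC000_0084;

/// EFER.SCE (bit 0) must be set, otherwise SYSCALL raises #UD.
pub const EFER_SCE: u64 = 1 << 0;
/// RFLAGS.IF (bit 9). Setting this bit in SFMASK clears IF on entry.
pub const RFLAGS_IF: u64 = 1 << 9;

/// Pack STAR: bits 47:32 hold the kernel CS, bits 63:48 the user base.
/// (Hypothetical helper, not taken from the kernel source.)
pub const fn star_value(kernel_cs: u16, user_base: u16) -> u64 {
    ((user_base as u64) << 48) | ((kernel_cs as u64) << 32)
}

/// Selectors a 64-bit SYSRET derives from the user base, as (CS, SS):
/// CS = base + 16 and SS = base + 8, both with RPL forced to 3.
pub const fn sysret_selectors(user_base: u16) -> (u16, u16) {
    ((user_base + 16) | 3, (user_base + 8) | 3)
}
```

With the assumed layout, `star_value(0x08, 0x10)` yields `0x0010_0008_0000_0000` and `sysret_selectors(0x10)` yields `(0x23, 0x1B)`. The resulting values would then be written with WRMSR, which is omitted here because it only runs in Ring 0.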
Syscall Entry Point
When userspace executes SYSCALL:
- CPU saves RIP in RCX, RFLAGS in R11
- CPU loads CS/SS from STAR MSR → kernel mode
- CPU masks RFLAGS with SFMASK → interrupts disabled
- CPU jumps to LSTAR → our entry point
Our handler then:
- Swap to the kernel stack (SYSCALL does not change RSP, so the handler loads it from TSS RSP0)
- Save all user registers to the thread's save area
- Dispatch based on RAX (syscall number)
- Execute the syscall handler
- Restore user registers
- SYSRET back to userspace
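
The hardware half of this sequence can be modelled in a few lines. The sketch below is a simplified software model, not kernel code: `CpuState` and both functions are hypothetical names, segment loading is reduced to a privilege-level field, and `rip` is assumed to already point at the instruction after SYSCALL.

```rust
/// Minimal model of the register state that SYSCALL and SYSRET touch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CpuState {
    pub rip: u64,
    pub rflags: u64,
    pub rcx: u64,
    pub r11: u64,
    pub cpl: u8,
}

/// Architectural effect of SYSCALL in 64-bit mode (simplified).
pub fn syscall_entry(cpu: &mut CpuState, lstar: u64, sfmask: u64) {
    cpu.rcx = cpu.rip; // return address saved in RCX
    cpu.r11 = cpu.rflags; // original RFLAGS saved in R11
    cpu.rflags &= !sfmask; // every bit set in SFMASK is cleared
    cpu.rip = lstar; // jump to the handler
    cpu.cpl = 0; // CS/SS now come from STAR: kernel mode
    // RSP is left untouched: the handler must switch stacks itself.
}

/// Architectural effect of SYSRET (simplified; real hardware also
/// sanitizes a few RFLAGS bits when reloading from R11).
pub fn sysret_exit(cpu: &mut CpuState) {
    cpu.rip = cpu.rcx;
    cpu.rflags = cpu.r11;
    cpu.cpl = 3;
}
```

The model makes one design constraint visible: because RCX and R11 carry the return state, the handler has to preserve them (or restore them from its save area) before executing SYSRET.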
Register Convention
| Register | Role |
|---|---|
| RAX | Syscall number (in) / return value (out) |
| RDI | Argument 1 |
| RSI | Argument 2 |
| RDX | Argument 3 |
| R10 | Argument 4 (RCX is clobbered by SYSCALL) |
| R8 | Argument 5 |
| R9 | Argument 6 |
| RCX | Saved RIP (by CPU) |
| R11 | Saved RFLAGS (by CPU) |
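
To make the convention concrete, here is a minimal sketch of a saved-register frame with accessors that follow the table. Only the `rax` field is implied by the text (`frame.rax`); the struct name, the remaining fields, their order, and the accessor names are assumptions for illustration.

```rust
/// Hypothetical frame of user registers saved on syscall entry.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
#[repr(C)]
pub struct SyscallFrame {
    pub rax: u64,
    pub rdi: u64,
    pub rsi: u64,
    pub rdx: u64,
    pub r10: u64,
    pub r8: u64,
    pub r9: u64,
    pub rcx: u64,
    pub r11: u64,
}

impl SyscallFrame {
    /// Syscall number, passed in RAX.
    pub fn number(&self) -> u64 {
        self.rax
    }

    /// Arguments 1..=6 in convention order. R10 stands in for RCX,
    /// which the CPU overwrites with the return address.
    pub fn args(&self) -> [u64; 6] {
        [self.rdi, self.rsi, self.rdx, self.r10, self.r8, self.r9]
    }

    /// User RIP and RFLAGS as saved by the CPU in RCX and R11.
    pub fn return_state(&self) -> (u64, u64) {
        (self.rcx, self.r11)
    }

    /// Store the result where userspace reads it after SYSRET.
    pub fn set_return(&mut self, value: u64) {
        self.rax = value;
    }
}
```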
Syscall Dispatch Table
The kernel dispatches syscalls with a match on frame.rax in syscall_dispatch(). SYS_EXIT (RAX=0) is handled inline by calling thread_exit().
| RAX | Name | Arguments | Description |
|---|---|---|---|
| 0 | SYS_EXIT | — | Terminate calling thread (thread → Dead, schedule away) |
| 1 | SYS_SEND | slot, label, data0, data1 | IPC send on endpoint capability |
| 2 | SYS_RECV | slot | IPC receive — blocks until message arrives |
| 3 | SYS_PORT_OUT | slot, port, value, width | Write to I/O port via IoPort capability. R10 width: 0/1=byte, 4=dword |
| 4 | SYS_PORT_IN | slot, port, width | Read from I/O port via IoPort capability. R10 width: 0/1=byte (RDI=u8), 4=dword (RDI=u32) |
| 5 | SYS_WAIT_IRQ | slot | Block until hardware IRQ fires on IrqLine capability |
| 6 | SYS_SPAWN_PROCESS | — | Create empty child process, returns CNode slot of Process cap |
| 7 | SYS_ALLOC_MEMORY | alloc_slot, target_slot | Allocate physical frame via PmmAllocator, store MemoryFrame cap in target_slot |
| 8 | SYS_MAP_MEMORY | proc_slot, frame_slot, vaddr, flags | Map MemoryFrame into process VA. Flags: bit 0 = WRITABLE, bit 1 = EXECUTABLE |
| 9 | SYS_DELEGATE | proc_slot, src_slot, dst_slot | Copy capability from caller's CNode to child process's CNode |
| 10 | SYS_SPAWN_THREAD | proc_slot, user_rip, user_rsp | Create Ring 3 thread in target process, returns TID |
| 11 | SYS_DROP_CAP | slot | Remove capability from caller's CNode slot (frees for reuse) |
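
The numbering in the table could be captured in an enum along the following lines. This is a sketch rather than the kernel's actual `syscall_dispatch()`: the enum, `from_rax`, and `arg_count` are hypothetical, and what the kernel returns for an unknown number is not stated in this document, so the sketch simply reports `None`.

```rust
/// Syscall numbers from the dispatch table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u64)]
pub enum Syscall {
    Exit = 0,
    Send = 1,
    Recv = 2,
    PortOut = 3,
    PortIn = 4,
    WaitIrq = 5,
    SpawnProcess = 6,
    AllocMemory = 7,
    MapMemory = 8,
    Delegate = 9,
    SpawnThread = 10,
    DropCap = 11,
}

impl Syscall {
    /// Decode the value found in RAX; unknown numbers yield `None`.
    pub fn from_rax(rax: u64) -> Option<Self> {
        use Syscall::*;
        Some(match rax {
            0 => Exit,
            1 => Send,
            2 => Recv,
            3 => PortOut,
            4 => PortIn,
            5 => WaitIrq,
            6 => SpawnProcess,
            7 => AllocMemory,
            8 => MapMemory,
            9 => Delegate,
            10 => SpawnThread,
            11 => DropCap,
            _ => return None,
        })
    }

    /// Number of register arguments listed for each call in the table.
    pub fn arg_count(self) -> usize {
        use Syscall::*;
        match self {
            Exit | SpawnProcess => 0,
            Recv | WaitIrq | DropCap => 1,
            AllocMemory => 2,
            PortIn | Delegate | SpawnThread => 3,
            Send | PortOut | MapMemory => 4,
        }
    }
}
```

Decoding into an enum before dispatching keeps the `match` exhaustive, so adding a new syscall number forces every dispatch site to handle it.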
Error Convention
All syscalls return u64 in RAX. Success = 0 (or a positive value like slot/TID). Errors are sentinel values near u64::MAX:
| Value | Meaning |
|---|---|
| u64::MAX | Invalid slot index |
| u64::MAX - 1 | Insufficient rights |
| u64::MAX - 2 | Wrong capability type |
| u64::MAX - 3 | Endpoint/Process not found |
| u64::MAX - 4 | PMM out-of-memory / alignment error |
| u64::MAX - 5 | Process not found / already has waiter |
| u64::MAX - 6 | map_page failure |
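
On the userspace side, the sentinel scheme might be decoded as sketched below. The variant names and the `decode` helper are hypothetical; only the numeric values come from the table. The sketch also assumes that no successful return value ever falls in the top seven values of the u64 range.

```rust
/// Error sentinels from the table, indexed by their distance from u64::MAX.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyscallError {
    InvalidSlot = 0,
    InsufficientRights = 1,
    WrongCapType = 2,
    NotFound = 3,
    PmmFailure = 4,
    ProcessOrWaiter = 5,
    MapFailed = 6,
}

impl SyscallError {
    /// Raw value the kernel would place in RAX for this error.
    pub const fn to_raw(self) -> u64 {
        u64::MAX - (self as u64)
    }

    /// Recognize a sentinel; any other value is not an error.
    pub fn from_raw(raw: u64) -> Option<Self> {
        use SyscallError::*;
        Some(match u64::MAX - raw {
            0 => InvalidSlot,
            1 => InsufficientRights,
            2 => WrongCapType,
            3 => NotFound,
            4 => PmmFailure,
            5 => ProcessOrWaiter,
            6 => MapFailed,
            _ => return None,
        })
    }
}

/// Split a raw RAX value into success payload or error.
pub fn decode(raw: u64) -> Result<u64, SyscallError> {
    match SyscallError::from_raw(raw) {
        Some(err) => Err(err),
        None => Ok(raw),
    }
}
```

A wrapper like `decode` lets callers use `?` on syscall results instead of comparing against magic numbers at every call site.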
ELF Loader
What is ELF?
ELF (Executable and Linkable Format) is the standard binary format for executables on Linux and bare-metal systems. The kernel must parse ELF files to load userspace programs.
Loading Process
- Read ELF header: verify magic bytes, architecture (x86_64), type (executable)
- Parse program headers: each `PT_LOAD` segment describes a chunk to map:
  - Virtual address, file offset, file size, memory size
  - Permissions (Read, Write, Execute)
- Allocate pages: use PMM to allocate physical frames for each segment
- Map pages: use VMM to create mappings in the process's address space with correct permissions
- Copy data: copy segment contents from the ELF file into the mapped pages
- Zero BSS: if memory size > file size, zero the remaining bytes
- Set up user stack: allocate and map pages at the top of userspace (e.g., `0x7FFFFFFFE000`)
- Return entry point: the ELF header contains the address where execution begins
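
The parsing half of these steps (header checks, `PT_LOAD` collection, BSS size) can be sketched as follows. Field offsets follow the ELF-64 layout; all type and function names are hypothetical, and `Vec` is used only for brevity, so a kernel without a heap at this stage would need a fixed-size buffer or an iterator instead. Allocation, mapping, and copying are left out because they depend on the PMM and VMM interfaces.

```rust
pub const PT_LOAD: u32 = 1; // program header type: loadable segment
pub const ET_EXEC: u16 = 2; // e_type: executable file
pub const EM_X86_64: u16 = 62; // e_machine: x86-64

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct LoadSegment {
    pub vaddr: u64,
    pub offset: u64,
    pub filesz: u64,
    pub memsz: u64,
    pub flags: u32, // PF_X = 1, PF_W = 2, PF_R = 4
}

impl LoadSegment {
    /// Bytes past the file-backed data that must be zeroed (BSS tail).
    pub fn bss_len(&self) -> u64 {
        self.memsz.saturating_sub(self.filesz)
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ElfError {
    Truncated,
    BadMagic,
    Unsupported,
}

/// Bounds-checked read of N bytes at `off`.
fn read<const N: usize>(b: &[u8], off: usize) -> Result<[u8; N], ElfError> {
    let end = off.checked_add(N).ok_or(ElfError::Truncated)?;
    let mut out = [0u8; N];
    out.copy_from_slice(b.get(off..end).ok_or(ElfError::Truncated)?);
    Ok(out)
}

fn u16_at(b: &[u8], off: usize) -> Result<u16, ElfError> {
    read::<2>(b, off).map(u16::from_le_bytes)
}
fn u32_at(b: &[u8], off: usize) -> Result<u32, ElfError> {
    read::<4>(b, off).map(u32::from_le_bytes)
}
fn u64_at(b: &[u8], off: usize) -> Result<u64, ElfError> {
    read::<8>(b, off).map(u64::from_le_bytes)
}

/// Validate the header and return (entry point, PT_LOAD segments).
pub fn parse_elf(image: &[u8]) -> Result<(u64, Vec<LoadSegment>), ElfError> {
    let ident: [u8; 16] = read(image, 0)?;
    if &ident[..4] != b"\x7FELF" {
        return Err(ElfError::BadMagic);
    }
    // ident[4] = 2 means 64-bit, ident[5] = 1 means little-endian.
    if ident[4] != 2 || ident[5] != 1 {
        return Err(ElfError::Unsupported);
    }
    if u16_at(image, 16)? != ET_EXEC || u16_at(image, 18)? != EM_X86_64 {
        return Err(ElfError::Unsupported);
    }
    let entry = u64_at(image, 24)?;
    let phoff = u64_at(image, 32)? as usize;
    let phentsize = u16_at(image, 54)? as usize;
    let phnum = u16_at(image, 56)? as usize;

    let mut segments = Vec::new();
    for i in 0..phnum {
        let ph = phoff.checked_add(i * phentsize).ok_or(ElfError::Truncated)?;
        if u32_at(image, ph)? != PT_LOAD {
            continue;
        }
        segments.push(LoadSegment {
            flags: u32_at(image, ph + 4)?,
            offset: u64_at(image, ph + 8)?,
            vaddr: u64_at(image, ph + 16)?,
            filesz: u64_at(image, ph + 32)?,
            memsz: u64_at(image, ph + 40)?,
        });
    }
    Ok((entry, segments))
}
```

Every read is bounds-checked because the image is untrusted input; a malformed header should produce an error rather than a kernel fault.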
Address Space Layout (Userspace)
block-beta
columns 1
block:stack["0x00007FFFFFFFFFFF"]
A["User Stack (grows ↓)"]
end
block:guard["0x00007FFFFFFFE000"]
B["Guard Page"]
end
block:heap[" "]
C["Heap (grows ↑)"]
end
block:bss[" "]
D[".bss R+W"]
end
block:data[" "]
E[".data R+W"]
end
block:rodata[" "]
F[".rodata R"]
end
block:text["0x0000000000400000 ← ELF base"]
G[".text R+X"]
end
Ring 3 Entry
Steps to Enter Userspace (Actual Implementation)
- Create process: allocate PML4, CNode (64 slots), kernel thread
- Load ELF: parse PT_LOAD segments, map with W^X permissions, zero BSS
- Set up user stack: map 4 pages at `0x7FFFFFFFE000` with User + Writable + NX
- Prepare initial capabilities: Init (PID 1) receives:
  - Slot 1: PmmAllocator, the right to allocate physical frames
  - Slot 2: IoPort { base: 0x3F8, size: 8 }, COM1 serial
  - Slot 3: Process { pid: 1 }, a self-reference for memory mapping
  - Slot 4: IoPort { base: 0xC000, size: 128 }, the Virtio-Blk I/O BAR (dynamically discovered via PCI)
- Switch to user page tables: schedule() swaps CR3 to process PML4
- IRETQ: `jump_to_ring3()` performs `swapgs` then `iretq` with user CS/SS (GDT selectors 0x23/0x1B), RFLAGS=0x202 (IF set), entry at `0x400000`
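
The final step pushes a five-quadword frame that `iretq` pops. A minimal sketch of that frame, using the selector and RFLAGS values quoted above, is shown below; `IretqFrame` and `user_entry_frame` are hypothetical names, and the stack pointer is left as a parameter because the exact initial user RSP is not stated here.

```rust
pub const USER_CS: u64 = 0x23; // user code selector, RPL 3
pub const USER_SS: u64 = 0x1B; // user data selector, RPL 3
pub const RFLAGS_USER: u64 = 0x202; // bit 1 (always set) + IF (bit 9)

/// The five quadwords IRETQ pops, lowest address first.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(C)]
pub struct IretqFrame {
    pub rip: u64,
    pub cs: u64,
    pub rflags: u64,
    pub rsp: u64,
    pub ss: u64,
}

/// Build the frame for the first entry into a user thread.
pub fn user_entry_frame(entry: u64, stack_top: u64) -> IretqFrame {
    IretqFrame {
        rip: entry,
        cs: USER_CS,
        rflags: RFLAGS_USER,
        rsp: stack_top,
        ss: USER_SS,
    }
}
```

The first entry has to use `iretq` rather than `sysretq` only in the sense that there is no prior SYSCALL to return from; either instruction can drop to Ring 3, but `iretq` takes every piece of state explicitly from the stack frame.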
Verification
Test that the syscall round-trip works correctly:
- Userspace calls `SYSCALL` → enters kernel
- Kernel processes the request
- Kernel returns via `SYSRET` → back in userspace
- Verify registers are preserved, return value is correct
Security Considerations
- swapgs on entry/exit: GS base swapped between user and kernel (CpuLocal) on every syscall transition
- Kernel stack isolation: Each thread has its own 16 KiB kernel stack, TSS RSP0 updated on every context switch
- Capability validation: Every syscall validates slot index, capability type, and rights bitmask before operation
- SFMASK clears IF: Interrupts disabled on syscall entry, re-enabled only after kernel stack is set up
- W^X enforcement: ELF segments mapped with strict permissions — no page is both writable and executable
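- User pages marked USER: only pages with the USER bit set are reachable from Ring 3, so userspace cannot read or write kernel mappings. Preventing accidental kernel access to user memory is a separate mechanism (SMAP) that relies on the same bit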
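
The W^X and USER rules above can be expressed as a small translation from ELF segment flags to page-table flags. The sketch below uses the x86_64 page-table bit positions and the standard ELF `PF_*` values; the function name and the choice to reject writable-and-executable segments outright (rather than, say, dropping one permission) are assumptions, not a description of the kernel's actual policy.

```rust
// ELF program header permission bits.
pub const PF_X: u32 = 1;
pub const PF_W: u32 = 2;
pub const PF_R: u32 = 4;

// x86_64 page-table entry bits.
pub const PRESENT: u64 = 1 << 0;
pub const WRITABLE: u64 = 1 << 1;
pub const USER: u64 = 1 << 2;
pub const NO_EXECUTE: u64 = 1 << 63;

/// Translate segment flags into user page flags, enforcing W^X.
/// Returns `None` for a segment that asks for write and execute together.
pub fn user_page_flags(p_flags: u32) -> Option<u64> {
    let writable = p_flags & PF_W != 0;
    let executable = p_flags & PF_X != 0;
    if writable && executable {
        return None; // would violate W^X
    }
    let mut flags = PRESENT | USER;
    if writable {
        flags |= WRITABLE;
    }
    if !executable {
        flags |= NO_EXECUTE; // data and stack pages are never executable
    }
    Some(flags)
}
```

For example, a `.text` segment with `PF_R | PF_X` maps as `PRESENT | USER`, while `.data` with `PF_R | PF_W` maps as `PRESENT | USER | WRITABLE | NO_EXECUTE`.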
Dependencies
- Requires: Sprint 4 (threads, scheduler), Sprint 5 (CNode, capability types)
- Enables: Sprint 7–9 (init process, delegation, memory management from Ring 3), Sprint 10 (Ring 3 allocator)