A Complete Tutorial for Kernel Developers

x86-64 Kernel Internals: Stacks, Interrupts, and Context Switches

Building a complete mental model from memory fundamentals to thread scheduling — everything you need to write a kernel in Rust.

1. Memory & Pointers Fundamentals

Before we can understand stacks or interrupts, we need a rock-solid model of what memory actually is at the hardware level. Everything else builds on this.

Memory as an Array of Bytes

Physical memory (RAM) is conceptually a giant array of bytes. Each byte has an address — a number that identifies its position in this array. On x86-64, addresses are 64-bit numbers, though typically only the lower 48 bits are used.

Memory as a linear array:

    Address (hex)   Contents (1 byte each)
    ──────────────────────────────────────
    0x0000          [0x00]
    0x0001          [0x00]
    0x0002          [0x4B]
    0x0003          [0xFF]
    ...             ...
    0x1000          [0x48]   ← Some byte at address 0x1000
    0x1001          [0x89]
    ...             ...
Address

A number (typically written in hexadecimal) that identifies a specific byte location in memory. On x86-64, addresses are 64 bits wide.

Pointer

A value that holds an address. When we say "a pointer to X," we mean a value that contains the address where X is stored. The pointer itself is just a number.
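
The "a pointer is just a number" idea can be seen directly from Rust (a minimal sketch; the actual address printed varies from run to run):

```rust
fn main() {
    let x: u64 = 42;
    let p: *const u64 = &x;    // p holds the address of x
    let addr = p as usize;     // ...and that address is just an integer
    println!("x lives at address {:#x}", addr);

    // Dereferencing the pointer reads the bytes stored at that address:
    assert_eq!(unsafe { *p }, 42);
    // A u64 is 8-byte aligned on x86-64, so the address is a multiple of 8:
    assert_eq!(addr % 8, 0);
}
```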

Bytes vs Bits, and Multi-Byte Values

A byte is 8 bits. One byte can represent values 0–255 (unsigned) or -128 to +127 (signed). But most useful values need more than one byte:

Name                  Size                Range (unsigned)
──────────────────────────────────────────────────────────
Byte                  8 bits (1 byte)     0 to 255
Word                  16 bits (2 bytes)   0 to 65,535
Double word (dword)   32 bits (4 bytes)   0 to ~4 billion
Quad word (qword)     64 bits (8 bytes)   0 to ~18 quintillion

Little-Endian Byte Order

x86-64 is little-endian: the least significant byte is stored at the lowest address.

Worked Example: Storing a 64-bit Value

Let's store the value 0x123456789ABCDEF0 at address 0x1000:

In little-endian, the least significant byte goes to the lowest address:

    Address   Byte (hex)   Which part of the value
    ───────────────────────────────────────────────
    0x1000    0xF0         ← Least significant byte
    0x1001    0xDE
    0x1002    0xBC
    0x1003    0x9A
    0x1004    0x78
    0x1005    0x56
    0x1006    0x34
    0x1007    0x12         ← Most significant byte
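
You can verify this byte layout in Rust: `to_le_bytes` produces exactly the bytes the worked example places in memory, lowest address first.

```rust
fn main() {
    let value: u64 = 0x123456789ABCDEF0;
    let bytes = value.to_le_bytes(); // bytes[0] would sit at the lowest address

    assert_eq!(bytes[0], 0xF0); // least significant byte first
    assert_eq!(bytes[7], 0x12); // most significant byte last
    assert_eq!(bytes, [0xF0, 0xDE, 0xBC, 0x9A, 0x78, 0x56, 0x34, 0x12]);

    // Reassembling the bytes little-endian recovers the original value:
    assert_eq!(u64::from_le_bytes(bytes), value);
}
```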

Registers vs Memory

Registers are tiny storage locations inside the CPU itself. They are not part of RAM. The CPU can access registers in a single clock cycle — hundreds of times faster than RAM.

x86-64 General-Purpose Registers (64-bit):

    RAX ─ Accumulator, often holds return values
    RBX ─ Base register (callee-saved)
    RCX ─ Counter, 4th argument
    RDX ─ Data, 3rd argument
    RSI ─ Source Index, 2nd argument
    RDI ─ Destination Index, 1st argument
    RBP ─ Base Pointer (frame pointer)
    RSP ─ Stack Pointer  ← Critical for this tutorial!
    R8–R15 ─ Additional GP registers

Special registers:

    RIP    ─ Instruction Pointer (program counter)
    RFLAGS ─ Status flags (zero, carry, overflow, etc.)

Common Misconceptions

❌ "Registers are like variables in memory with special names."

✓ Registers are physically separate from RAM. They're built into the CPU silicon and have dedicated circuitry.

❌ "Little-endian means the bytes are stored backwards."

✓ It means the least significant byte is at the lowest address. The value itself isn't "backwards."

Sanity Check

  1. If I have a 32-bit value 0xDEADBEEF stored at address 0x2000 in little-endian, what byte is at address 0x2000? At 0x2003?
  2. Can the CPU add two values stored in RAM directly, or must they first be loaded into registers?
  3. What's the difference between "address 0x1000" and "the value 0x1000 stored somewhere in memory"?

2. Code Execution Model

Now that we understand memory and registers, let's see how the CPU actually executes your code.

Where Machine Code Lives: The Text Segment

Typical memory layout of a process:

    High addresses
    ┌─────────────────────┐
    │        Stack        │ ← Grows downward
    │          ↓          │
    ├─────────────────────┤
    │      (unmapped)     │
    ├─────────────────────┤
    │          ↑          │
    │        Heap         │ ← Grows upward
    ├─────────────────────┤
    │   BSS (zero-init)   │
    ├─────────────────────┤
    │   Data (globals)    │
    ├─────────────────────┤
    │     Text (code)     │ ← Your machine instructions live here
    └─────────────────────┘
    Low addresses

The Instruction Pointer: RIP

RIP (Instruction Pointer)

A 64-bit register containing the memory address of the next instruction the CPU will fetch and execute. You cannot directly write to RIP with mov — you change it with jump/call/return instructions.

The Fetch-Decode-Execute Cycle

  1. Fetch: Read the bytes at address RIP from memory.
  2. Decode: Figure out what instruction this is and what operands it uses.
  3. Execute: Perform the operation.
  4. Advance RIP: Move RIP forward by the instruction's length.
  5. Repeat.

CALL vs JMP: What CALL Does Extra

Instruction    What it does
─────────────────────────────────────────────────────
jmp target     RIP ← target address. That's it.
call target    1. Push the return address onto the stack
               2. RIP ← target address
ret            Pop a value from the stack into RIP

Worked Example: CALL and RET Mechanics

Before CALL:
    RIP = 0x401000   (about to execute: call 0x402000)
    RSP = 0x7FFF0100 (stack pointer)
    This CALL instruction is 5 bytes long, so return address = 0x401005

CALL executes:
    1. RSP ← RSP - 8 = 0x7FFF00F8
    2. Memory[0x7FFF00F8] ← 0x401005   (push return address)
    3. RIP ← 0x402000

Later, RET executes:
    1. RIP ← Memory[RSP] = 0x401005
    2. RSP ← RSP + 8 = 0x7FFF0100

Execution continues at 0x401005 — right after the CALL!
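
The CALL/RET arithmetic above can be sketched in a few lines of Rust (the addresses are the hypothetical ones from the example; "memory" is a map from address to qword):

```rust
use std::collections::HashMap;

fn main() {
    let mut memory: HashMap<u64, u64> = HashMap::new();
    let mut rip: u64 = 0x401000; // about to execute: call 0x402000
    let mut rsp: u64 = 0x7FFF0100;
    const CALL_LEN: u64 = 5; // this particular CALL encoding is 5 bytes

    // CALL: push the return address, then jump.
    let return_addr = rip + CALL_LEN; // 0x401005
    rsp -= 8;                         // make room on the stack
    memory.insert(rsp, return_addr);  // push
    rip = 0x402000;                   // jump to target

    // ...the callee runs, then RET:
    rip = memory[&rsp]; // pop the return address into RIP
    rsp += 8;

    assert_eq!(rip, 0x401005);   // back right after the CALL
    assert_eq!(rsp, 0x7FFF0100); // stack balanced
}
```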

Sanity Check

  1. If RIP is 0x500000 and the current instruction is 4 bytes long (with no jumps), what will RIP be after?
  2. What's the key difference between jmp and call?
  3. After call foo executes, where is the return address stored?

3. Stack Basics

The stack is just a region of memory. There's nothing magical about it — it's bytes like everything else. What makes it special is how we use it.

RSP: The Stack Pointer

Stack Pointer (RSP)

A register containing the address of the most recently pushed value on the stack. On x86-64, RSP points to the current top of stack (the last item pushed), not the next free slot.

Why the Stack "Grows Down"

On x86-64, the stack grows toward lower addresses. When you push, RSP decreases. When you pop, RSP increases.

Stack growing downward:

    High addresses
    ┌──────────────┐
    │   Old data   │
    ├──────────────┤
    │  Prev value  │
    ├──────────────┤
    │ Top of stack │ ← RSP: current "top" (most recent push)
    ├──────────────┤
    │   (unused)   │ ← Next push goes here
    └──────────────┘
    Low addresses

PUSH and POP: The Exact Operations

push rax is equivalent to:
    sub rsp, 8        ; RSP ← RSP - 8 (make room)
    mov [rsp], rax    ; Store RAX at address RSP

pop rbx is equivalent to:
    mov rbx, [rsp]    ; Load value at RSP into RBX
    add rsp, 8        ; RSP ← RSP + 8 (reclaim space)

Worked Example: Push and Pop with Concrete Addresses

Initial state:
    RSP = 0x7FFF0100
    RAX = 0xDEADBEEFCAFEBABE

Execute: push rax
    Step 1: RSP ← RSP - 8 = 0x7FFF00F8
    Step 2: Memory[0x7FFF00F8] ← 0xDEADBEEFCAFEBABE

Execute: pop rbx
    Step 1: RBX ← Memory[0x7FFF00F8] = 0xDEADBEEFCAFEBABE
    Step 2: RSP ← RSP + 8 = 0x7FFF0100

Note: The data at 0x7FFF00F8 is STILL THERE! Not erased!
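
A tiny simulation makes the "not erased" point concrete. Here a four-slot array plays the role of stack memory and an index plays the role of RSP (all values are made up for illustration):

```rust
fn main() {
    let mut memory: [u64; 4] = [0; 4]; // 4 qword slots of pretend stack
    let mut rsp: usize = 4;            // one past the end; stack grows down
    let rax: u64 = 0xDEADBEEFCAFEBABE;

    // push rax  ==  sub rsp, 8 ; mov [rsp], rax
    rsp -= 1;
    memory[rsp] = rax;

    // pop rbx   ==  mov rbx, [rsp] ; add rsp, 8
    let rbx = memory[rsp];
    rsp += 1;

    assert_eq!(rbx, rax);
    assert_eq!(rsp, 4);         // RSP is back where it started...
    assert_eq!(memory[3], rax); // ...but the value is still in memory!
}
```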

Key Insight

"Freeing" stack memory means moving RSP past it. The bytes remain in RAM with their old values until something else overwrites them.

Common Misconceptions

❌ "The stack is a separate hardware structure."

✓ The stack is just a region of regular memory. RSP is just a register.

❌ "Pop erases data from memory."

✓ Pop copies the value and adjusts RSP. The memory still contains the old value.

Sanity Check

  1. If RSP is 0x1000 and you execute push rax, what is RSP afterward?
  2. After popping a value, could you theoretically read it back from memory if you knew the address?
  3. Why does the stack grow downward on x86-64?

4. Stack Frames

Stack Frame

The contiguous region of stack memory allocated for one function call. It typically includes: the return address (pushed by call), saved base pointer, saved callee-saved registers, and local variables.

Function Prologue and Epilogue

Prologue (function entry):
    push rbp          ; Save caller's base pointer
    mov rbp, rsp      ; Set up our own base pointer
    sub rsp, N        ; Allocate N bytes for local variables

Epilogue (function exit):
    mov rsp, rbp      ; Deallocate locals (RSP = RBP)
    pop rbp           ; Restore caller's base pointer
    ret               ; Return (pop return address into RIP)
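
Tracing RSP and RBP through that prologue and epilogue is pure arithmetic. A sketch, using a hypothetical starting RSP and 24 bytes of locals:

```rust
fn main() {
    // RSP right after CALL pushed the return address:
    let mut rsp: u64 = 0x7FFF0100;
    let caller_rbp: u64 = 0x7FFF0200; // caller's frame pointer (made up)
    let mut rbp: u64 = caller_rbp;

    // Prologue:
    rsp -= 8; // push rbp (caller's RBP now saved at [rsp])
    rbp = rsp; // mov rbp, rsp
    rsp -= 24; // sub rsp, 24 (room for three 8-byte locals)

    // With the frame set up, the classic offsets hold:
    assert_eq!(rbp + 8, 0x7FFF0100); // [rbp + 8] is the return address slot
    assert_eq!(rbp - 24, rsp);       // locals live between RSP and RBP

    // Epilogue:
    rsp = rbp;        // mov rsp, rbp (deallocate locals)
    rbp = caller_rbp; // pop rbp (restores the saved value at [rsp]...)
    rsp += 8;         // ...and bumps RSP past it
    assert_eq!(rsp, 0x7FFF0100); // ret will now pop the return address
}
```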

System V AMD64 ABI: Argument Passing

Argument #            Register
──────────────────────────────────
1st integer/pointer   RDI
2nd                   RSI
3rd                   RDX
4th                   RCX
5th                   R8
6th                   R9
7th+                  Pushed on the stack

Return values go in RAX.

Stack Frame Layout Diagram

Complete stack frame layout (System V AMD64):

    High addresses
    ┌───────────────────────────────┐
    │     8th+ argument (if any)    │ [rbp + 24]
    ├───────────────────────────────┤
    │     7th argument (if any)     │ [rbp + 16]
    ├───────────────────────────────┤
    │         Return Address        │ [rbp + 8]   ← Pushed by CALL
    ├───────────────────────────────┤
    │      Saved RBP (caller's)     │ [rbp + 0]   ← RBP points here
    ├───────────────────────────────┤
    │ Saved callee-saved registers  │ [rbp - 8], [rbp - 16], ...
    ├───────────────────────────────┤
    │        Local variables        │ [rbp - N]
    ├───────────────────────────────┤
    │ (alignment padding if needed) │ ← RSP
    └───────────────────────────────┘
    Low addresses

Sanity Check

  1. After push rbp; mov rbp, rsp, what is the relationship between RBP and the return address?
  2. If a function has 3 local 64-bit variables, how many bytes does sub rsp, N subtract?
  3. In System V ABI, if I call foo(10, 20, 30), which registers hold 10, 20, and 30?

5. Calling Conventions & Register Saving

Caller-Saved vs Callee-Saved Registers

Callee-saved (the callee must preserve these):
    RBX, RBP, R12, R13, R14, R15
    If a function wants to use these, it must save them first (push)
    and restore them before returning (pop).

Caller-saved (the callee may freely clobber):
    RAX, RCX, RDX, RSI, RDI, R8, R9, R10, R11
    If a caller has important data in these, it must save them before
    the call (if it needs them afterward).

Stack Alignment: The 16-Byte Rule

The System V ABI requires RSP to be 16-byte aligned at the point of a call instruction.

Why Alignment Matters

SSE/AVX instructions require 16-byte aligned memory operands. If the stack is misaligned, instructions like movaps will fault with a #GP (General Protection) exception.
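
The alignment rule has a well-known consequence: CALL pushes an 8-byte return address, so at function entry RSP % 16 is 8, and the `push rbp` in the prologue restores 16-byte alignment. A quick arithmetic check (addresses hypothetical):

```rust
fn main() {
    let rsp_at_call: u64 = 0x7FFF0100;
    assert_eq!(rsp_at_call % 16, 0); // properly aligned call site

    let rsp_at_entry = rsp_at_call - 8; // CALL pushed the return address
    assert_eq!(rsp_at_entry % 16, 8);   // the familiar entry-time skew

    // push rbp subtracts 8 more, making RSP 16-byte aligned again:
    assert_eq!((rsp_at_entry - 8) % 16, 0);
}
```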

Sanity Check

  1. If a function uses R12, what must it do before returning?
  2. Why is RDI caller-saved rather than callee-saved?
  3. If RSP = 0x7FFF0108 before a call, is this properly aligned?

6. Interrupts & CPU Exceptions

Interrupt

An asynchronous event from hardware (keyboard, disk, timer) that signals the CPU to stop what it's doing and run a handler.

Exception

A synchronous event caused by the currently executing instruction: division by zero, page fault, invalid opcode, etc.

The Interrupt Descriptor Table (IDT)

IDT structure (simplified):

    Vector   Exception/Interrupt
    ─────────────────────────────────────────────────────────
    0        Divide Error (#DE)
    1        Debug (#DB)
    2        NMI (Non-Maskable Interrupt)
    3        Breakpoint (#BP)
    6        Invalid Opcode (#UD)
    8        Double Fault (#DF)
    13       General Protection (#GP)
    14       Page Fault (#PF)
    32–255   User-defined (hardware interrupts, syscalls, etc.)

What the CPU Pushes on Entry

Stack after interrupt/exception entry:

    ├──────────────────────┤
    │  SS (padded to 64b)  │ ← Only if privilege change
    ├──────────────────────┤
    │         RSP          │ ← Only if privilege change
    ├──────────────────────┤
    │        RFLAGS        │
    ├──────────────────────┤
    │  CS (padded to 64b)  │
    ├──────────────────────┤
    │         RIP          │ ← Return address
    ├──────────────────────┤
    │      Error Code      │ ← Only for some exceptions
    └──────────────────────┘

RSP points at RIP — or at the error code, for the exceptions that push one, since the error code is pushed last.

Critical Kernel Rule

The CPU does NOT save general-purpose registers. Your interrupt handler must save any registers it uses before modifying them, and restore them before returning.

IRETQ vs RET

IRETQ operation:
    1. Pop RIP
    2. Pop CS
    3. Pop RFLAGS
    4. Pop RSP  (if returning to a different privilege level)
    5. Pop SS   (if returning to a different privilege level)

It reverses exactly what the CPU pushed on entry. (For exceptions that push an error code, the handler must pop the error code off first — IRETQ does not.)

Common Misconceptions

❌ "The CPU saves all registers on interrupt."

✓ The CPU only saves RIP, CS, RFLAGS, and (on privilege change) RSP/SS. General-purpose registers are the handler's responsibility.

❌ "I can return from an interrupt handler with RET."

✓ You must use IRETQ. RET only pops RIP, leaving CS, RFLAGS, etc. on the stack.

Sanity Check

  1. What's the difference between an interrupt and an exception?
  2. If a page fault handler modifies RAX without saving it first, what happens when the handler returns?
  3. Does the CPU push an error code for all exceptions?

7. Stack Switching on Interrupts

Privilege Rings: Ring 3 vs Ring 0

Ring     Name          Who runs here
─────────────────────────────────────
Ring 0   Kernel mode   OS kernel, drivers
Ring 3   User mode     Applications

The TSS: Task State Segment

TSS (Task State Segment)

A CPU structure containing stack pointers for privilege level transitions. The key field is RSP0 — the stack pointer to use when entering ring 0.

IST: Interrupt Stack Table

The IST provides up to 7 dedicated stacks for critical exceptions. Each IDT entry can specify an IST index (1-7). If non-zero, the CPU loads RSP from that IST entry, regardless of current privilege level.

Double Fault and IST

If an exception occurs while trying to invoke an exception handler, a Double Fault (#DF) fires. If the double fault handler also fails, the CPU triple-faults and resets. Always put your double fault handler on a dedicated IST stack!

Sanity Check

  1. Why can't the kernel use the user's stack for interrupt handling?
  2. What field in the TSS provides the kernel stack pointer?
  3. Why does double fault need an IST entry?

8. Context Switches

Execution Context

All the state required to resume a thread's execution: GPRs, RSP, RIP, RFLAGS, and (for processes) the address space (page tables).

Thread Switch vs Process Switch

Thread Switch                   Process Switch
──────────────────────────────────────────────────────
Same address space              Different address space
Save/restore registers + RSP    All that + switch CR3
Relatively fast                 Slow (TLB flush)

CR3 and Page Tables

CR3

A control register containing the physical address of the top-level page table (PML4 on x86-64). Changing CR3 switches the entire address space mapping and flushes the TLB.

Why Context Switches Are "Slow"

Common Misconceptions

❌ "Context switches are slow because saving registers is slow."

✓ Saving ~15 registers is fast. The slowdown is TLB flush, cache misses, and branch predictor retraining.

❌ "Threads in the same process share everything."

✓ They share address space (code, heap) but each has its own stack and register state.

Sanity Check

  1. What register holds the page table base address?
  2. Why is a process switch slower than a thread switch?
  3. What is the TLB and why does flushing it hurt performance?

9. Kernel-in-Rust Practical Guidance

The x86-interrupt Calling Convention

Rust's extern "x86-interrupt" is special. Unlike extern "C", it saves and restores every register the handler touches, takes the CPU-pushed interrupt stack frame (plus the error code, where one exists) as arguments, and compiles the return into iretq instead of ret:

use x86_64::structures::idt::InterruptStackFrame;

extern "x86-interrupt" fn page_fault_handler(
    stack_frame: InterruptStackFrame,
    error_code: u64,
) {
    // Handle page fault...
    // Compiler generates iretq, not ret
}

Common Pitfalls

Pitfall 1: Misaligned Stack

If RSP isn't 16-byte aligned when you call a function, SSE instructions will fault.

Pitfall 2: Forgetting to Save Registers

If you write a handler in pure assembly or use extern "C", you must save/restore all registers yourself.

Pitfall 3: Wrong IDT Entry Type

Interrupt gates automatically clear IF (interrupt flag). Trap gates don't. Use interrupt gates for most handlers.

Pitfall 4: No IST for Double Fault

If your double fault handler uses the regular kernel stack, a kernel stack overflow causes triple fault → reboot.

Setting Up TSS and IST in Rust

use lazy_static::lazy_static;
use x86_64::structures::tss::TaskStateSegment;
use x86_64::VirtAddr;

pub const DOUBLE_FAULT_IST_INDEX: u16 = 0;

lazy_static! {
    static ref TSS: TaskStateSegment = {
        let mut tss = TaskStateSegment::new();

        tss.interrupt_stack_table[DOUBLE_FAULT_IST_INDEX as usize] = {
            const STACK_SIZE: usize = 4096 * 5;
            static mut STACK: [u8; STACK_SIZE] = [0; STACK_SIZE];
            // Note: newer Rust warns on `&STACK` to a `static mut`;
            // `&raw const STACK` is the modern spelling.
            let stack_start = VirtAddr::from_ptr(unsafe { &STACK });
            stack_start + STACK_SIZE // Stack end (the stack grows down!)
        };

        tss
    };
}

Sanity Check

  1. What does extern "x86-interrupt" do differently from extern "C"?
  2. Why might a kernel use assembly stubs instead of pure extern "x86-interrupt"?
  3. What happens if your double fault handler doesn't use an IST entry and the kernel stack overflows?

Summary

If you can explain these 10 bullets, you understand the whole picture

  1. Memory is an array of bytes, each with an address. Registers are separate, fast storage inside the CPU. Little-endian means least significant byte at lowest address.
  2. RIP holds the address of the next instruction. The fetch-decode-execute loop runs continuously. call differs from jmp by pushing a return address.
  3. The stack is regular memory used in a LIFO pattern. RSP points to the top. push = decrement RSP, store. pop = load, increment RSP.
  4. A stack frame contains return address, saved RBP, saved callee-saved registers, and local variables.
  5. Calling conventions (ABI) specify argument registers, return register, and which registers must be preserved. 16-byte stack alignment at call.
  6. Interrupts and exceptions invoke handlers via the IDT. The CPU pushes RIP, CS, RFLAGS. Handlers must save GPRs themselves.
  7. The TSS provides RSP0 for kernel stack on privilege transitions. The IST provides dedicated stacks for critical exceptions.
  8. IRETQ returns from interrupts, popping everything the CPU pushed. Using ret instead would crash.
  9. Context switches save/restore entire execution state. Process switches also change CR3, flushing the TLB.
  10. In Rust kernels, extern "x86-interrupt" handles the weird calling convention. Use IST for double fault. Ensure stack alignment.

You now have a complete mental model. Go write that kernel! 🦀