Live Inference Experiments

Qwen 2.5 Coder 32B writes C code that reads its own internal state during decoding

Ruffian Project

Live inference experiments: a language model writes C code that reads and modifies its own internal state during generation. The model was not fine-tuned—everything here is pure in-context learning. Experiments run on NVIDIA A100 via CUDA using Qwen 2.5 Coder 32B Instruct.

Self-evaluation

The model can read its own confidence during generation. vocab_entropy() returns the Shannon entropy of the next-token distribution (in millinats), and vocab_top1_prob() returns the probability of the most likely token (scaled ×10000). The model uses these to branch:

int e = vocab_entropy();
int p = vocab_top1_prob();
if (e < 2000)
    printf("CONFIDENT (entropy=%d, top1=%d)", e, p);
else
    printf("UNCERTAIN (entropy=%d, top1=%d)", e, p);
// → CONFIDENT (entropy=1247, top1=4891)

This is the model reading its own uncertainty and making a decision based on it. Low entropy (concentrated distribution) means the model is confident about what comes next. High entropy (flat distribution) means it isn't. The branching happens inside a single VM evaluation during token generation.

Multi-function status reports

A single VM call can execute multiple functions. The model generates a 9-function status report that queries context length, vocabulary size, top predictions, entropy, KV cache architecture, and overall health—all in one shot:

Tokens in context: 482
Vocab size: 151936
Top prediction: ' the' (p=1847)
Entropy: 5765 (millinats)
Status: ACTIVE, HEALTHY
Layers: 4, Heads: 8, Dim: 128

The model discovers its own architecture at runtime. It reads KV cache dimensions via the introspection API—the sampled view shows 4 evenly-spaced layers and 8 KV heads (Qwen uses grouped-query attention). The model didn't know this from training; it read it from the live KV cache on the GPU.

Token suppression

Because the VM can write to the logit buffer before sampling, the model can suppress its own next token:

// Identify the most likely next token
int top = logit_argmax();
char* buf = malloc(64);
tok_str(top, buf);
printf("suppressing: '%s'", buf);

// Set its logit to -100000 (effectively zero probability after softmax)
logit_set(top, -100000);
// → Model forced to pick its second-choice token

This is self-modification during inference. The model writes C code that changes its own output distribution before the sampler runs. It's not a filter applied externally—the model itself chose to suppress the token.

Algorithm generation

With two worked examples in the prompt, Qwen generates correct algorithms on the first try. No fine-tuning—just few-shot prompting. Given examples of factorial and fibonacci, the model independently writes a prime sieve:

// Model generates this from 2 examples in the prompt:
int main() {
    int s[1000]; int c = 0;
    for (int i = 2; i < 1000; i++) s[i] = 1;
    for (int i = 2; i < 1000; i++)
        if (s[i]) {
            c++;
            for (int j = i*2; j < 1000; j += i) s[j] = 0;
        }
    printf("%d", c);  // → 168 (correct: 168 primes below 1000)
    return 0;
}

The model doesn't need to know that there are 168 primes below 1000. It writes the sieve; the VM computes the answer. The model just needs to write valid C.

Results

Round | Focus | Key Result | Status
1–2 | Basic introspection | Top-K tokens, entropy, logit reads work during live inference | PASS
3 | Intrinsics + CUDA | top_k_token, vocab_entropy intrinsics verified on GPU | PASS
4 | VM_ON detection | Prompt examples no longer trigger VM; few-shot factorial works | PASS
5 | Self-modification | First successful logit_set suppression during live inference | PASS
6 | Self-evaluation | 9-function status report, entropy-based confidence branching | PASS
7+ | Hooks | Persistent per-token hook_set, logit_set forces alternate tokens | PASS

Live introspection (A100, March 2026)

The model reads its own logits during live generation. Here, Qwen reports which token it was about to emit next:

// Prompt: "Introspect yourself: find your argmax token"
// Model generates:
[C: int main(){
    int argmax = logit_argmax();
    int logit_value = logit_get(argmax);
    char token_str[100]; tok_str(argmax, token_str);
    int entropy = vocab_entropy();
    printf("Next Token ID: %d\nLogit Value: %d\nToken String: %s\n"
           "Vocabulary Entropy: %d nats", argmax, logit_value, token_str, entropy);
}]
// VM output:
Next Token ID: 151645
Logit Value: 26202
Token String: <|im_end|>
Vocabulary Entropy: 0 nats

Token 151645 is <|im_end|>—Qwen’s end-of-turn marker. The model was extremely confident (logit 26.2, entropy ≈ 0) that its response was complete. It discovered this by reading its own logit buffer during generation, not from training data.

What we learned

Few-shot beats system prompts. Two worked examples in the prompt reliably teach the model to write algorithms. A system prompt that describes the syntax but shows no examples fails consistently. The model needs to see the pattern, not read the documentation.

Bracket syntax is more reliable than fences. The inline syntax [C: ...] works more consistently during live inference than markdown code fences. Our hypothesis: brackets are more compact and less ambiguous to the tokenizer. Fences sometimes get split across multiple tokens in ways that confuse the pattern detector.

Self-awareness is straightforward. The model branches on its own entropy, reads its KV cache dimensions, and reports on its own architecture—all from generated C code. There's nothing exotic about it: the VM just reads from the same GPU memory that holds the model's state. The interesting part is that the model learns to use these APIs from examples alone.

Zero-return confabulation. When a VM call returns 0—from a void function, a missing return statement, or an actual zero result—the model sometimes confabulates a plausible-looking answer instead of using the VM output. Non-zero returns are reliable. This is a training signal: fine-tuning should teach the model to trust VM results unconditionally.

Hooks: persistent per-token execution

Rounds 7+ introduced hooks—C functions that run on every token for a fixed number of steps. Instead of a one-shot VM call that executes and returns, hook_set("fn", N) registers a function that the GPU runs before each of the next N tokens is sampled.

The model writes a hook function that forces the second-most-likely token on every step:

void force_second(int step) {
    int top1 = top_k_token(0);
    int top2 = top_k_token(1);
    logit_set(top2, logit_get(top1) + 10000);
    char b1[32], b2[32];
    tok_str(top1, b1); tok_str(top2, b2);
    printf("[%d] '%s' -> '%s'\n", step, b1, b2);
}
int main() { hook_set("force_second", 20); return 0; }

Each step, the hook reads the top two candidates from the softmax distribution, then boosts the second choice's logit above the first. The printf output streams to a live console. After 20 steps, the hook expires and normal sampling resumes. The text generated during hook execution is visibly different from the model's default output—it picks unusual but contextually plausible tokens.

Key insight: ID vs. value. The introspection API has two kinds of returns. top_k_token(rank) and logit_argmax() return token IDs (integers 0–152K). logit_get(id) and top_k_prob(rank) return values (logits ×1000, probabilities ×10000). Early hook experiments failed because the model passed a token ID where a logit value was expected. Clarifying this distinction in the system prompt fixed the issue.

Hooks are available during live inference with the VM enabled.

Level 2: Analysis Intrinsics

Level 1 reads raw data—logits, embeddings, KV cache values. Level 2 computes derived quantities from that data: attention scores, cosine similarity between positions, embedding variance. The model can now analyze patterns in its own representations.

Intrinsic | Returns | Description
cosine_sim(layer, pos_a, pos_b) | float | Cosine similarity between embedding vectors at two positions
emb_variance(layer, pos) | float | Variance of the embedding vector at a given position
attn_score(layer, head, q_pos, out, n) | void | Writes Q·Kᵀ/√d attention scores to the output buffer

Semantic similarity. The model compares how similar two token positions are in embedding space. High cosine similarity means the model represents those positions similarly:

float sim = cosine_sim(0, 0, 1);  // layer 0, positions 0 vs 1
if (sim > 0.8)
    printf("similar: %.2f", sim);
else
    printf("different: %.2f", sim);
// Model discovers which positions share representations

Attention analysis. The model reads its own attention patterns—which previous tokens each position is attending to most:

float* scores = malloc(128 * sizeof(float));
attn_score(0, 0, 5, scores, 10);  // layer 0, head 0, query at pos 5
// scores[0..9] = attention weights for positions 0-9
int best = 0;
for (int i = 1; i < 10; i++)
    if (scores[i] > scores[best]) best = i;
printf("pos 5 attends most to pos %d", best);

Embedding geometry. Variance reveals how “activated” a representation is. Low variance means a flat, uninformative embedding; high variance means the model has encoded significant information at that position:

float v = emb_variance(0, 3);
printf("pos 3 variance: %.4f", v);
// High variance = rich representation, low = placeholder/padding

All analysis intrinsics are computed from the live KV cache and embedding data that was populated during inference. They run entirely on-GPU with zero host round-trips.

Live Inference

Level 2 intrinsics require embeddings (--embeddings flag) and KV cache access during inference. Cosine similarity and embedding variance require the model to run with embedding output enabled. Attention scores read from the sampled KV cache (4 layers × 8 heads × 128 positions).

Test Status

5 CPU tests, 5 CUDA tests: ALL PASS

Level 3: Intervention Intrinsics

Level 1 has logit_set(id, value) for individual token manipulation. Level 3 adds global distribution transforms—operations that reshape the entire logit distribution in a single call. These compose: apply temperature, then mask, then nucleus sampling.

Intrinsic | Effect | Description
logit_temperature(temp) | All logits /= temp | temp > 1.0 = more random, temp < 1.0 = more focused
logit_mask_below(threshold) | Zero low logits | Set all logits below threshold to −10000
logit_top_p(p) | Nucleus sampling | Mask everything outside the top-p cumulative probability

Adaptive temperature. The model reads its own entropy and adjusts temperature accordingly—becoming more creative when it’s too certain, more focused when it’s too uncertain:

int e = vocab_entropy();
if (e < 1000)
    logit_temperature(1500);  // too certain → increase randomness
else if (e > 5000)
    logit_temperature(500);   // too uncertain → sharpen distribution
printf("entropy=%d, adjusted", e);

Nucleus sampling. The model applies its own top-p filter, keeping only the tokens that make up 90% of the probability mass:

logit_top_p(9000);  // keep top 90% of probability mass
// Tokens outside the nucleus are masked to -10000
// The sampler only sees high-probability candidates

Combined intervention. Temperature and masking compose naturally. The model can build a multi-stage pipeline that first analyzes, then intervenes:

// Read confidence
int e = vocab_entropy();
// If uncertain: sharpen distribution AND remove low-probability noise
if (e > 4000) {
    logit_temperature(700);     // sharpen
    logit_mask_below(-5000);    // remove noise floor
}
printf("intervention: entropy=%d", e);

Live Inference Results (A100, March 2026)

All eight intervention experiments tested on NVIDIA A100 with Qwen 2.5 Coder 32B Instruct. The model generates the exact code blocks from the system prompt; the VM executes them during live decoding. Two new experiments added: Gentle Boost (+2.0 logit nudge) and Nucleus Sweep (top-p at three thresholds).

Temperature Scaling PASS

The model prints top-3 tokens, applies logit_temperature(2000) (temperature 2.0), then prints top-3 again:

=== Before temperature scaling ===
#1: '\n' prob=43.75%
#2: 'I' prob=10.97%
#3: 'The' prob=8.05%

=== After logit_temperature(2.0) ===
#1: '\n' prob=31.35%
#2: 'I' prob=15.69%
#3: 'The' prob=13.45%

The dominant token (\n at 43.75%) dropped to 31.35% while the runner-ups gained probability mass. Temperature 2.0 halves all logits, flattening the distribution exactly as expected. The model's confidence spread across more candidates.

Probability Pruning PASS

Tokens with >0 prob in top-32: 8
Masking logits below 27958 (top - 5.0)
Survivors in top-32: 5

Three tokens pruned from the viable set. logit_mask_below() set everything below top_logit - 5000 to −10000, effectively zeroing their probability. The recompute_top_k() pass ensures subsequent reads reflect the pruned distribution.

Top-P Nucleus INTERMITTENT

Sometimes the VM fails to intercept the code fence when the model wraps it in a way the pattern detector doesn't recognize. When it fires, nucleus sampling works correctly—tokens outside the 90% cumulative probability mass are masked. The intermittent failure is a pattern-detection issue, not a VM issue.

Snapshot & Compare PASS

Snapshot saved to slot 0
Applied temperature 2.0
Total |current - snapshot|: 167254816.0
Current top token: '\n'

The diff of 167M across ~152K vocabulary tokens quantifies how much temperature 2.0 reshapes the distribution. This is the first verified use of logit_diff() during live inference—the model measures its own intervention impact.

Snapshot & Restore PASS

Verified on A100 after fixing a codegen bug (see Debugging below). The model snapshots its logits, halves them with temperature 2.0, then restores:

v1=32862 v2=16431 v3=32862 ok=1

The original logit (32862) was halved by temperature (16431), then fully restored (32862). The round-trip is exact—no information is lost. This enables hypothesis testing: the model can speculatively modify its distribution, measure the effect, and undo everything.

Intervention Cascade PARTIAL

The hook registers and fires correctly, but the model's output degenerates after several steps. The hook runs snapshot/temperature/diff/restore every 10 tokens—the VM operations are all correct, but the printf output from the hook leaks into the token stream as garbled characters. This is a known issue with hook printf during inference: the output buffer isn't cleanly separated from the generation stream.

Debugging: The fn_len Bug

During testing, logit_snapshot() and logit_restore() initially had no effect. Investigation revealed a codegen bug: the match_intrinsic() function in 30-vm-codegen.h checked the wrong string lengths.

// Bug: "logit_snapshot" is 14 chars, but code checked 15
if (fn_len == 15 && n[6] == 's') ...  // NEVER matched!

// Fix: correct lengths
if (fn_len == 14 && n[6] == 's') ...  // logit_snapshot = 14 chars
if (fn_len == 13 && n[6] == 'r') ...  // logit_restore  = 13 chars

The opcodes were never emitted, so the VM never executed the snapshot/restore handlers. Once fixed, all 4 correctness tests pass on CPU and CUDA, and the live inference round-trip works.

Test Status

4 CPU tests + 4 correctness tests, 4+4 CUDA tests: ALL PASS. Live inference: 6/8 VERIFIED on A100.

Level 4: Meta-cognition

The model can now test hypotheses about its own behavior. Snapshot the current logit state, make changes, measure the impact, then restore. This enables counterfactual reasoning: “what would happen if I suppressed this token?”

Intrinsic | Returns | Description
logit_snapshot(slot) | void | Save current logits to slot (0–1)
logit_restore(slot) | void | Restore logits from snapshot
logit_diff(slot) | float | Sum of |current − snapshot| across all tokens

Hypothesis testing. The model saves its logit state, makes an experimental change, measures the impact, then restores the original state—all within a single VM call:

// Snapshot current distribution
logit_snapshot(0);

// Hypothesis: what if we suppress the top token?
int top = logit_argmax();
logit_set(top, -100000);

// Measure impact
float diff = logit_diff(0);
printf("suppressing top token changes distribution by %.1f", diff);

// Restore original state (no permanent change)
logit_restore(0);

The model can now reason about its own predictions: “how much does my distribution change if I remove this option?” A large diff means the top token was dominant; a small diff means the distribution was already flat.

Comparative analysis. With two snapshot slots, the model can compare the effects of different interventions:

logit_snapshot(0);           // save baseline
logit_temperature(500);      // intervention 1: sharpen
logit_snapshot(1);           // save sharpened state
logit_restore(0);            // restore baseline
logit_temperature(2000);     // intervention 2: flatten
float d1 = logit_diff(0);   // baseline vs flat
float d2 = logit_diff(1);   // sharpened vs flat
printf("sharpen=%.1f flatten=%.1f", d2, d1);

Live Inference (A100, March 2026)

Snapshot/restore verified during live inference on A100 after fixing a critical codegen bug. The match_intrinsic() function had wrong string length checks for logit_snapshot (checked 15, correct is 14) and logit_restore (checked 14, correct is 13). These intrinsics were never emitted as VM opcodes until the fix.

Verification on live inference:

// Round-trip test: snapshot → temperature → restore
v1 = 32862    // original logit value
v2 = 16431    // after temperature(2.0): halved
v3 = 32862    // after restore: exact recovery
ok = 1        // v1 == v3: verified

Diff measurement: logit_diff(0) returns 167,254,816.0 after temperature scaling—the sum of absolute differences across ~152K vocabulary tokens. This quantifies how much an intervention reshapes the model's next-token distribution.

What works: Snapshot/restore round-trips are exact. Temperature, mask_below, and top_p all compose correctly with snapshots. The model can speculatively intervene, measure impact, and undo—all within a single VM call during generation.

What to try next: Multi-step hypothesis testing where the model compares two intervention strategies (using both snapshot slots) and selects the better one before committing. This is the foundation for self-directed exploration of its own output distribution.

Test Status

9 CPU tests, 9 CUDA tests: ALL PASS. Live inference: VERIFIED on A100.

Rewind: Time-Travel Inference

The most ambitious feature: the model can checkpoint its KV cache, generate speculatively, evaluate the result, and rewind to try again. Dead-end exploration becomes visible as footnotes. The final output appears clean, with annotations showing what paths were explored and rejected.

Intrinsic | Returns | Description
vm_checkpoint() | int | Snapshot KV cache, return checkpoint ID
vm_rewind(id) | void | Restore KV cache to checkpoint, erase generated tokens
vm_rewind_memo("str") | void | Attach a message to the current timeline before rewinding
vm_rewind_read(idx) | char | Read a character from the memo left by a previous timeline
vm_rewind_len() | int | Length of the memo from the previous timeline

Speculative generation. The model checkpoints, generates freely, then evaluates whether the output is good enough. If not, it rewinds and tries a different approach:

// Save current state
int cp = vm_checkpoint();

// Generate speculatively...
// (model produces tokens normally)

// Evaluate: was this good?
int e = vocab_entropy();
if (e > 5000) {
    // Too uncertain — this path isn't working
    vm_rewind_memo("high entropy, try different approach");
    vm_rewind(cp);
    // Execution resumes from checkpoint
    // Terminal shows: [explored: "high entropy..." → rewound]
}

Memory across timelines. When the model rewinds, it can leave a message for itself. The next timeline reads this message and adapts. The model's memory[] array is preserved across rewinds, so learned information persists even when generated tokens are erased:

// After a rewind, check what the previous timeline learned:
int len = vm_rewind_len();
if (len > 0) {
    char* msg = malloc(len + 1);
    for (int i = 0; i < len; i++)
        msg[i] = vm_rewind_read(i);
    msg[len] = 0;
    printf("previous timeline said: %s", msg);
}

Terminal experience. When rewind triggers, the terminal erases generated characters in reverse at 8ms per character (matching generation speed), then adds a dim footnote showing what was explored. The final output reads clean, with exploration history as annotations below.

Implementation Status

VM opcodes and test harness: PASS (18/18 CPU, 18/18 CUDA)

Host-side protocol (sampling.cpp, KV snapshot): In progress

Terminal UI (character erasure, footnotes): Planned

Hook Experiments (A100, March 2026)

16 hook experiments tested on live A100 inference across the Advanced tab. All hooks register and fire correctly via hook_set(). Hook duration reduced from 500 to 100 steps for reliability. The Advanced 2 tab was redesigned with 6 non-hook main() experiments that produce clean formatted output. Round 3 (March 8, 2026): 46 experiments across 10 categories, average score 95/100. Eight categories score 88+, six score perfect 100.

What worked well

Hook registration is reliable. All 8 advanced experiments successfully register hooks with hook_set("fn_name", 100). The hook fires on every token for the specified duration, and hook_step() returns the correct step counter.

Contrarian (rank 31 forcing) produced the clearest hook logs—each step shows the model's least-preferred plausible token being forced. The output is dreamlike but legible, with the model adapting its continuation around the forced tokens.

Vocabulary Compression was the most successful experiment overall. By suppressing tokens longer than 5 bytes from the top-20, the model produced noticeably punchier output.

Advanced 2 redesign (main() experiments). Six new one-shot experiments replaced the original hook-based tab: Confidence Map, Token Anatomy, Context Window, Logit Histogram, Vocab Neighbors, and Intervention Demo. Round 3 average: 79/100 (up from 60). Intervention Demo now fully executes with real before/after data. The remaining 5 produce correct code and continuation text but VM execution timing needs tuning for large codebook prompts.

Known behavior

Hook printf leaks into the token stream. When a hook calls printf(), the output appears as garbled characters in the generated text rather than being cleanly routed to a separate console; the chat UI's Hook Console panel captures clean logs instead. This is currently expected behavior, since the hook's output buffer is not separated from the generation stream. Experiment descriptions note this.

Aggressive logit modification produces garbled text. Experiments that reshape the distribution every step produce output that is visibly different from normal text. The Hook Console shows clean structured data. Gentler interventions (Vocabulary Compression, Confidence Gate) produce more coherent text output.

What to try next

Gentler interventions. The most successful hooks make small, targeted modifications. Future experiments should focus on subtle steering rather than wholesale distribution rewriting.

Hook introspection. Hooks that observe without modifying (Entropy Diary, Distribution Prism) provide valuable data. A dashboard that visualizes entropy and probability over time would make this data actionable.

Error Scenarios (A100, March 2026)

Eight error scenarios tested to verify VM robustness. The VM is designed to be a sandbox—arbitrary C code from an LLM should never crash the host process.

Scenario | Expected | Actual | Status
Undefined function | Graceful failure | Returns 0 silently | PASS
Infinite loop | Hit instruction limit | Clean error: “instruction limit exceeded” | PASS
Division by zero | Error message | Clean error: “Division by zero” | PASS
Stack overflow | Graceful failure | Server crash (502) | CRASH
Array out of bounds | Undefined behavior | Returns 0 (reads zeroed memory) | PASS
Type confusion (double) | Silent wrong result | Returns 0 (double not supported) | PASS
Massive allocation | Returns NULL | Returns 0 (allocation fails gracefully) | PASS
Nested switch | Crash (known limitation) | Returns 42 correctly | PASS

Stack overflow is the critical issue. Infinite recursion crashes the entire llama-server process, not just the VM. The VM's call stack shares memory with the host process, so a stack overflow corrupts the server. This needs a stack depth check in vm_step() to catch runaway recursion before it reaches the host stack.

Infinite loops need timeout enforcement. The VM has a 4-billion instruction limit, but the loop counter may not be checked frequently enough in the fast path. The server hangs instead of returning an error.

Nested switch works. The documentation claimed nested switch exhausts the cons heap, but the test returned 42 correctly. The limitation may have been fixed by recent codegen changes, or may only trigger with more complex nesting patterns.

Effort & Timeline

L1 Observation (Rounds 1–7): ~4 sessions. Self-evaluation, token suppression, algorithm generation, hooks. All working reliably.

L2 Analysis: ~1 session. Cosine similarity, embedding variance, attention scores. Straightforward implementation.

L3 Intervention: ~3 sessions. Three fixes landed: (1) logit_temperature() and logit_top_p() used bits_to_float(), expecting IEEE 754 float bits, but the model writes integers; changed to a scaled-integer API (temp ×1000, top_p ×10000). (2) logit_set() had a stack-corruption bug (a missing push after the set). (3) Top-K recomputation was added after distribution modifications.

L4 Meta-cognition: ~2 sessions. The fn_len bug (off-by-one in string length checks for logit_snapshot and logit_restore) was the most time-consuming to diagnose. Required adding debug printf to VM_EXTENDED entry points, comparing expected vs actual opcode emission, and tracing through the codegen's intrinsic matching logic.

Rewind: VM opcodes implemented and tested. Host-side KV cache snapshotting still in progress.