Live Inference Experiments

Qwen 2.5 Coder 32B writes C code that reads its own internal state during decoding

Ruffian Project

Live inference experiments: a language model writes C code that reads and modifies its own internal state during generation. The model was not fine-tuned—everything here is pure in-context learning. Experiments run on NVIDIA A100 via CUDA using Qwen 2.5 Coder 32B Instruct.

Self-evaluation

The model can read its own confidence during generation. vocab_entropy() returns the Shannon entropy of the next-token distribution (in millinats), and vocab_top1_prob() returns the probability of the most likely token (scaled ×10000). The model uses these to branch:

int e = vocab_entropy();
int p = vocab_top1_prob();
if (e < 2000)
    printf("CONFIDENT (entropy=%d, top1=%d)", e, p);
else
    printf("UNCERTAIN (entropy=%d, top1=%d)", e, p);
// → CONFIDENT (entropy=1247, top1=4891)

This is the model reading its own uncertainty and making a decision based on it. Low entropy (concentrated distribution) means the model is confident about what comes next. High entropy (flat distribution) means it isn't. The branching happens inside a single VM evaluation during token generation.

Multi-function status reports

A single VM call can execute multiple functions. The model generates a 9-function status report that queries context length, vocabulary size, top predictions, entropy, KV cache architecture, and overall health—all in one shot:

Tokens in context: 482
Vocab size: 151936
Top prediction: ' the' (p=1847)
Entropy: 5765 (millinats)
Status: ACTIVE, HEALTHY
Layers: 4, Heads: 8, Dim: 128

The model discovers its own architecture at runtime. It reads KV cache dimensions via the introspection API—the sampled view shows 4 evenly-spaced layers and 8 KV heads (Qwen uses grouped-query attention). The model didn't know this from training; it read it from the live KV cache on the GPU.

Token suppression

Because the VM can write to the logit buffer before sampling, the model can suppress its own next token:

// Identify the most likely next token
int top = logit_argmax();
char* buf = malloc(64);
tok_str(top, buf);
printf("suppressing: '%s'", buf);

// Set its logit to -100000 (effectively zero probability after softmax)
logit_set(top, -100000);
// → Model forced to pick its second-choice token

This is self-modification during inference. The model writes C code that changes its own output distribution before the sampler runs. It's not a filter applied externally—the model itself chose to suppress the token.

Algorithm generation

With two worked examples in the prompt, Qwen generates correct algorithms on the first try. No fine-tuning—just few-shot prompting. Given examples of factorial and fibonacci, the model independently writes a prime sieve:

// Model generates this from 2 examples in the prompt:
int main() {
    int s[1000]; int c = 0;
    for (int i = 2; i < 1000; i++) s[i] = 1;
    for (int i = 2; i < 1000; i++)
        if (s[i]) {
            c++;
            for (int j = i*2; j < 1000; j += i) s[j] = 0;
        }
    printf("%d", c);  // → 168 (correct: 168 primes below 1000)
    return 0;
}

The model doesn't need to know that there are 168 primes below 1000. It writes the sieve; the VM computes the answer. The model just needs to write valid C.

Results

Round | Focus | Key Result | Status
1–2 | Basic introspection | Top-K tokens, entropy, logit reads work during live inference | PASS
3 | Intrinsics + CUDA | top_k_token, vocab_entropy intrinsics verified on GPU | PASS
4 | VM_ON detection | Prompt examples no longer trigger VM; few-shot factorial works | PASS
5 | Self-modification | First successful logit_set suppression during live inference | PASS
6 | Self-evaluation | 9-function status report, entropy-based confidence branching | PASS
7+ | Hooks | Persistent per-token hook_set, logit_set forces alternate tokens | PASS

Live introspection (A100, March 2026)

The model reads its own logits during live generation. Here, Qwen reports which token it was about to emit next:

// Prompt: "Introspect yourself: find your argmax token"
// Model generates:
[C: int main(){
    int argmax = logit_argmax();
    int logit_value = logit_get(argmax);
    char token_str[100]; tok_str(argmax, token_str);
    int entropy = vocab_entropy();
    printf("Next Token ID: %d\nLogit Value: %d\nToken String: %s\n"
           "Vocabulary Entropy: %d nats", argmax, logit_value, token_str, entropy);
}]
// VM output:
Next Token ID: 151645
Logit Value: 26202
Token String: <|im_end|>
Vocabulary Entropy: 0 nats

Token 151645 is <|im_end|>—Qwen’s end-of-turn marker. The model was extremely confident (logit 26.2, entropy ≈ 0) that its response was complete. It discovered this by reading its own logit buffer during generation, not from training data.

What we learned

Few-shot beats system prompts. Two worked examples in the prompt reliably teach the model to write algorithms. A system prompt that describes the syntax but shows no examples fails consistently. The model needs to see the pattern, not read the documentation.

Bracket syntax is more reliable than fences. The inline syntax [C: ...] works more consistently during live inference than markdown code fences. Our hypothesis: brackets are more compact and less ambiguous to the tokenizer. Fences sometimes get split across multiple tokens in ways that confuse the pattern detector.

Self-awareness is straightforward. The model branches on its own entropy, reads its KV cache dimensions, and reports on its own architecture—all from generated C code. There's nothing exotic about it: the VM just reads from the same GPU memory that holds the model's state. The interesting part is that the model learns to use these APIs from examples alone.

Zero-return confabulation. When a VM call returns 0—from a void function, a missing return statement, or an actual zero result—the model sometimes confabulates a plausible-looking answer instead of using the VM output. Non-zero returns are reliable. This is a training signal: fine-tuning should teach the model to trust VM results unconditionally.

Hooks: persistent per-token execution

Rounds 7+ introduced hooks—C functions that run on every token for a fixed number of steps. Instead of a one-shot VM call that executes and returns, hook_set("fn", N) registers a function that the GPU runs before each of the next N tokens is sampled.

The model writes a hook function that forces the second-most-likely token on every step:

void force_second(int step) {
    int top1 = top_k_token(0);
    int top2 = top_k_token(1);
    logit_set(top2, logit_get(top1) + 10000);
    char b1[32], b2[32];
    tok_str(top1, b1); tok_str(top2, b2);
    printf("[%d] '%s' -> '%s'\n", step, b1, b2);
}
int main() { hook_set("force_second", 20); return 0; }

Each step, the hook reads the top two candidates from the softmax distribution, then boosts the second choice's logit above the first. The printf output streams to a live console. After 20 steps, the hook expires and normal sampling resumes. The text generated during hook execution is visibly different from the model's default output—it picks unusual but contextually plausible tokens.

Key insight: ID vs. value. The introspection API has two kinds of returns. top_k_token(rank) and logit_argmax() return token IDs (integers 0–152K). logit_get(id) and top_k_prob(rank) return values (logits ×1000, probabilities ×10000). Early hook experiments failed because the model passed a token ID where a logit value was expected. Clarifying this distinction in the system prompt fixed the issue.

Hooks are available during live inference with the VM enabled.

Level 2: Analysis Intrinsics

Level 1 reads raw data—logits, embeddings, KV cache values. Level 2 computes derived quantities from that data: attention scores, cosine similarity between positions, embedding variance. The model can now analyze patterns in its own representations.

Intrinsic | Returns | Description
cosine_sim(layer, pos_a, pos_b) | float | Cosine similarity between embedding vectors at two positions
emb_variance(layer, pos) | float | Variance of the embedding vector at a given position
attn_score(layer, head, q_pos, out, n) | void | Writes Q·Kᵀ/√d attention scores to the output buffer

Semantic similarity. The model compares how similar two token positions are in embedding space. High cosine similarity means the model represents those positions similarly:

float sim = cosine_sim(0, 0, 1);  // layer 0, positions 0 vs 1
if (sim > 0.8)
    printf("similar: %.2f", sim);
else
    printf("different: %.2f", sim);
// Model discovers which positions share representations

Attention analysis. The model reads its own attention patterns—which previous tokens each position is attending to most:

float* scores = malloc(128 * sizeof(float));
attn_score(0, 0, 5, scores, 10);  // layer 0, head 0, query at pos 5
// scores[0..9] = attention weights for positions 0-9
int best = 0;
for (int i = 1; i < 10; i++)
    if (scores[i] > scores[best]) best = i;
printf("pos 5 attends most to pos %d", best);

Embedding geometry. Variance reveals how “activated” a representation is. Low variance means a flat, uninformative embedding; high variance means the model has encoded significant information at that position:

float v = emb_variance(0, 3);
printf("pos 3 variance: %.4f", v);
// High variance = rich representation, low = placeholder/padding

All analysis intrinsics are computed from the live KV cache and embedding data that was populated during inference. They run entirely on-GPU with zero host round-trips.

Live Inference

Level 2 intrinsics require embeddings (--embeddings flag) and KV cache access during inference. Cosine similarity and embedding variance require the model to run with embedding output enabled. Attention scores read from the sampled KV cache (4 layers × 8 heads × 128 positions).

Test Status

5 CPU tests, 5 CUDA tests: ALL PASS

Level 3: Intervention Intrinsics

Level 1 has logit_set(id, value) for individual token manipulation. Level 3 adds global distribution transforms—operations that reshape the entire logit distribution in a single call. These compose: apply temperature, then mask, then nucleus sampling.

Intrinsic | Effect | Description
logit_temperature(temp) | All logits /= temp | temp > 1.0 = more random, temp < 1.0 = more focused
logit_mask_below(threshold) | Zero low logits | Set all logits below threshold to −10000
logit_top_p(p) | Nucleus sampling | Mask everything outside the top-p cumulative probability

Adaptive temperature. The model reads its own entropy and adjusts temperature accordingly—becoming more creative when it’s too certain, more focused when it’s too uncertain:

int e = vocab_entropy();
if (e < 1000)
    logit_temperature(1500);  // too certain → increase randomness
else if (e > 5000)
    logit_temperature(500);   // too uncertain → sharpen distribution
printf("entropy=%d, adjusted", e);

Nucleus sampling. The model applies its own top-p filter, keeping only the tokens that make up 90% of the probability mass:

logit_top_p(9000);  // keep top 90% of probability mass
// Tokens outside the nucleus are masked to -10000
// The sampler only sees high-probability candidates

Combined intervention. Temperature and masking compose naturally. The model can build a multi-stage pipeline that first analyzes, then intervenes:

// Read confidence
int e = vocab_entropy();
// If uncertain: sharpen distribution AND remove low-probability noise
if (e > 4000) {
    logit_temperature(700);     // sharpen
    logit_mask_below(-5000);    // remove noise floor
}
printf("intervention: entropy=%d", e);

Live Inference Results (A100, March 2026)

All eight intervention experiments tested on NVIDIA A100 with Qwen 2.5 Coder 32B Instruct. The model generates the exact code blocks from the system prompt; the VM executes them during live decoding. Two new experiments added: Gentle Boost (+2.0 logit nudge) and Nucleus Sweep (top-p at three thresholds).

Temperature Scaling PASS

The model prints top-3 tokens, applies logit_temperature(2000) (temperature 2.0), then prints top-3 again:

=== Before temperature scaling ===
#1: '\n' prob=43.75%
#2: 'I' prob=10.97%
#3: 'The' prob=8.05%

=== After logit_temperature(2.0) ===
#1: '\n' prob=31.35%
#2: 'I' prob=15.69%
#3: 'The' prob=13.45%

The dominant token (\n at 43.75%) dropped to 31.35% while the runner-ups gained probability mass. Temperature 2.0 halves all logits, flattening the distribution exactly as expected. The model's confidence spread across more candidates.

Probability Pruning PASS

Tokens with >0 prob in top-32: 8
Masking logits below 27958 (top - 5.0)
Survivors in top-32: 5

Three tokens pruned from the viable set. logit_mask_below() set everything below top_logit - 5000 to −10000, effectively zeroing their probability. The recompute_top_k() pass ensures subsequent reads reflect the pruned distribution.

Top-P Nucleus INTERMITTENT

Sometimes the VM fails to intercept the code fence when the model wraps it in a way the pattern detector doesn't recognize. When it fires, nucleus sampling works correctly—tokens outside the 90% cumulative probability mass are masked. The intermittent failure is a pattern-detection issue, not a VM issue.

Snapshot & Compare PASS

Snapshot saved to slot 0
Applied temperature 2.0
Total |current - snapshot|: 167254816.0
Current top token: '\n'

The diff of 167M across ~152K vocabulary tokens quantifies how much temperature 2.0 reshapes the distribution. This is the first verified use of logit_diff() during live inference—the model measures its own intervention impact.

Snapshot & Restore PASS

Verified on A100 after fixing a codegen bug (see Debugging below). The model snapshots its logits, halves them with temperature 2.0, then restores:

v1=32862 v2=16431 v3=32862 ok=1

The original logit (32862) was halved by temperature (16431), then fully restored (32862). The round-trip is exact—no information is lost. This enables hypothesis testing: the model can speculatively modify its distribution, measure the effect, and undo everything.

Intervention Cascade PARTIAL

The hook registers and fires correctly, but the model's output degenerates after several steps. The hook runs snapshot/temperature/diff/restore every 10 tokens—the VM operations are all correct, but the printf output from the hook leaks into the token stream as garbled characters. This is a known issue with hook printf during inference: the output buffer isn't cleanly separated from the generation stream.

Debugging: The fn_len Bug

During testing, logit_snapshot() and logit_restore() initially had no effect. Investigation revealed a codegen bug: the match_intrinsic() function in 30-vm-codegen.h checked the wrong string lengths.

// Bug: "logit_snapshot" is 14 chars, but code checked 15
if (fn_len == 15 && n[6] == 's') ...  // NEVER matched!

// Fix: correct lengths
if (fn_len == 14 && n[6] == 's') ...  // logit_snapshot = 14 chars
if (fn_len == 13 && n[6] == 'r') ...  // logit_restore  = 13 chars

The opcodes were never emitted, so the VM never executed the snapshot/restore handlers. Once fixed, all 4 correctness tests pass on CPU and CUDA, and the live inference round-trip works.

Test Status

4 CPU tests + 4 correctness tests, 4+4 CUDA tests: ALL PASS. Live inference: 6/8 VERIFIED on A100.

Level 4: Meta-cognition

The model can now test hypotheses about its own behavior. Snapshot the current logit state, make changes, measure the impact, then restore. This enables counterfactual reasoning: “what would happen if I suppressed this token?”

Intrinsic | Returns | Description
logit_snapshot(slot) | void | Save current logits to slot (0–1)
logit_restore(slot) | void | Restore logits from snapshot
logit_diff(slot) | float | Sum of |current − snapshot| across all tokens

Hypothesis testing. The model saves its logit state, makes an experimental change, measures the impact, then restores the original state—all within a single VM call:

// Snapshot current distribution
logit_snapshot(0);

// Hypothesis: what if we suppress the top token?
int top = logit_argmax();
logit_set(top, -100000);

// Measure impact
float diff = logit_diff(0);
printf("suppressing top token changes distribution by %.1f", diff);

// Restore original state (no permanent change)
logit_restore(0);

The model can now reason about its own predictions: “how much does my distribution change if I remove this option?” A large diff means the top token was dominant; a small diff means the distribution was already flat.

Comparative analysis. With two snapshot slots, the model can compare the effects of different interventions:

logit_snapshot(0);           // save baseline
logit_temperature(500);      // intervention 1: sharpen
logit_snapshot(1);           // save sharpened state
logit_restore(0);            // restore baseline
logit_temperature(2000);     // intervention 2: flatten
float d1 = logit_diff(0);   // baseline vs flat
float d2 = logit_diff(1);   // sharpened vs flat
printf("sharpen=%.1f flatten=%.1f", d2, d1);

Live Inference (A100, March 2026)

Snapshot/restore verified during live inference on A100 after fixing a critical codegen bug. The match_intrinsic() function had wrong string length checks for logit_snapshot (checked 15, correct is 14) and logit_restore (checked 14, correct is 13). These intrinsics were never emitted as VM opcodes until the fix.

Verification on live inference:

// Round-trip test: snapshot → temperature → restore
v1 = 32862    // original logit value
v2 = 16431    // after temperature(2.0): halved
v3 = 32862    // after restore: exact recovery
ok = 1        // v1 == v3: verified

Diff measurement: logit_diff(0) returns 167,254,816.0 after temperature scaling—the sum of absolute differences across ~152K vocabulary tokens. This quantifies how much an intervention reshapes the model's next-token distribution.

What works: Snapshot/restore round-trips are exact. Temperature, mask_below, and top_p all compose correctly with snapshots. The model can speculatively intervene, measure impact, and undo—all within a single VM call during generation.

What to try next: Multi-step hypothesis testing where the model compares two intervention strategies (using both snapshot slots) and selects the better one before committing. This is the foundation for self-directed exploration of its own output distribution.

Test Status

9 CPU tests, 9 CUDA tests: ALL PASS. Live inference: VERIFIED on A100.

Rewind: Time-Travel Inference

The most ambitious feature: the model can checkpoint its KV cache, generate speculatively, evaluate the result, and rewind to try again. Dead-end exploration becomes visible as footnotes. The final output appears clean, with annotations showing what paths were explored and rejected.

Intrinsic | Returns | Description
vm_checkpoint() | int | Snapshot KV cache, return checkpoint ID
vm_rewind(id) | void | Restore KV cache to checkpoint, erase generated tokens
vm_rewind_memo("str") | void | Attach a message to the current timeline before rewinding
vm_rewind_read(idx) | char | Read a character from the memo left by a previous timeline
vm_rewind_len() | int | Length of the memo from the previous timeline

Speculative generation. The model checkpoints, generates freely, then evaluates whether the output is good enough. If not, it rewinds and tries a different approach:

// Save current state
int cp = vm_checkpoint();

// Generate speculatively...
// (model produces tokens normally)

// Evaluate: was this good?
int e = vocab_entropy();
if (e > 5000) {
    // Too uncertain — this path isn't working
    vm_rewind_memo("high entropy, try different approach");
    vm_rewind(cp);
    // Execution resumes from checkpoint
    // Terminal shows: [explored: "high entropy..." → rewound]
}

Memory across timelines. When the model rewinds, it can leave a message for itself. The next timeline reads this message and adapts. The model's memory[] array is preserved across rewinds, so learned information persists even when generated tokens are erased:

// After a rewind, check what the previous timeline learned:
int len = vm_rewind_len();
if (len > 0) {
    char* msg = malloc(len + 1);
    for (int i = 0; i < len; i++)
        msg[i] = vm_rewind_read(i);
    msg[len] = 0;
    printf("previous timeline said: %s", msg);
}

Terminal experience. When rewind triggers, the terminal erases generated characters in reverse at 8ms per character (matching generation speed), then adds a dim footnote showing what was explored. The final output reads clean, with exploration history as annotations below.

Implementation Status

VM opcodes and test harness: PASS (18/18 CPU, 18/18 CUDA)

Host-side protocol (sampling.cpp, KV snapshot): In progress

Terminal UI (character erasure, footnotes): Planned

Hook Experiments (A100, March 2026)

16 hook experiments tested on live A100 inference across the Advanced tab. All hooks register and fire correctly via hook_set(). Hook duration reduced from 500 to 100 steps for reliability. The Advanced 2 tab was redesigned with 6 non-hook main() experiments that produce clean formatted output. Round 3 (March 8, 2026): 46 experiments across 10 categories, average score 95/100. Eight categories score 88+, six score perfect 100.

What worked well

Hook registration is reliable. All 8 advanced experiments successfully register hooks with hook_set("fn_name", 100). The hook fires on every token for the specified duration, and hook_step() returns the correct step counter.

Contrarian (rank 31 forcing) produced the clearest hook logs—each step shows the model's least-preferred plausible token being forced. The output is dreamlike but legible, with the model adapting its continuation around the forced tokens.

Vocabulary Compression was the most successful experiment overall. By suppressing tokens longer than 5 bytes from the top-20, the model produced noticeably punchier output.

Advanced 2 redesign (main() experiments). Six new one-shot experiments replaced the original hook-based tab: Confidence Map, Token Anatomy, Context Window, Logit Histogram, Vocab Neighbors, and Intervention Demo. Round 3 average: 79/100 (up from 60). Intervention Demo now fully executes with real before/after data. The remaining 5 produce correct code and continuation text but VM execution timing needs tuning for large codebook prompts.

Known behavior

Hook printf leaks into the token stream. When a hook calls printf(), the output appears as garbled characters in the generated text rather than being cleanly routed to a separate console; the chat UI's Hook Console panel captures clean logs instead. This is currently expected behavior, since the hook's output buffer is not separated from the generation stream. Experiment descriptions note this.

Aggressive logit modification produces garbled text. Experiments that reshape the distribution every step produce output that is visibly different from normal text. The Hook Console shows clean structured data. Gentler interventions (Vocabulary Compression, Confidence Gate) produce more coherent text output.

What to try next

Gentler interventions. The most successful hooks make small, targeted modifications. Future experiments should focus on subtle steering rather than wholesale distribution rewriting.

Hook introspection. Hooks that observe without modifying (Entropy Diary, Distribution Prism) provide valuable data. A dashboard that visualizes entropy and probability over time would make this data actionable.

Error Scenarios (A100, March 2026)

Eight error scenarios tested to verify VM robustness. The VM is designed to be a sandbox—arbitrary C code from an LLM should never crash the host process.

Scenario | Expected | Actual | Status
Undefined function | Graceful failure | Returns 0 silently | PASS
Infinite loop | Hit instruction limit | Clean error: “instruction limit exceeded” | PASS
Division by zero | Error message | Clean error: “Division by zero” | PASS
Stack overflow | Graceful failure | Server crash (502) | CRASH
Array out of bounds | Undefined behavior | Returns 0 (reads zeroed memory) | PASS
Type confusion (double) | Silent wrong result | Returns 0 (double not supported) | PASS
Massive allocation | Returns NULL | Returns 0 (allocation fails gracefully) | PASS
Nested switch | Crash (known limitation) | Returns 42 correctly | PASS

Stack overflow is the critical issue. Infinite recursion crashes the entire llama-server process, not just the VM. The VM's call stack shares memory with the host process, so a stack overflow corrupts the server. This needs a stack depth check in vm_step() to catch runaway recursion before it reaches the host stack.

Infinite loops need timeout enforcement. The VM has a 4-billion instruction limit, but the loop counter may not be checked frequently enough in the fast path. The server hangs instead of returning an error.

Nested switch works. The documentation claimed nested switch exhausts the cons heap, but the test returned 42 correctly. The limitation may have been fixed by recent codegen changes, or may only trigger with more complex nesting patterns.

Effort & Timeline

L1 Observation (Rounds 1–7): ~4 sessions. Self-evaluation, token suppression, algorithm generation, hooks. All working reliably.

L2 Analysis: ~1 session. Cosine similarity, embedding variance, attention scores. Straightforward implementation.

L3 Intervention: ~3 sessions. Three fixes landed: (1) logit_temperature() and logit_top_p() used bits_to_float(), expecting IEEE 754 float bits, but the model writes integers; changed to a scaled-integer API (temp ×1000, top_p ×10000). (2) logit_set() had a stack-corruption bug (a missing push after the set). (3) Top-K recomputation was added after distribution modifications.

L4 Meta-cognition: ~2 sessions. The fn_len bug (off-by-one in string length checks for logit_snapshot and logit_restore) was the most time-consuming to diagnose. Required adding debug printf to VM_EXTENDED entry points, comparing expected vs actual opcode emission, and tracing through the codegen's intrinsic matching logic.

Rewind: VM opcodes implemented and tested. Host-side KV cache snapshotting still in progress.