A language model learns to program a GPU virtual machine through examples alone
Qwen 2.5 Coder is a family of code-focused language models. Ruffian is a GPU virtual machine that compiles C to CUDA shaders. We want to see if Qwen can learn to write Ruffian code through pure in-context learning: no fine-tuning, just examples in the prompt.
It works. With 10 examples, Qwen writes working GPU code 85% of the time.
During inference, when the model outputs C code in a marked region, the VM intercepts it, compiles and executes on-GPU, and injects the result back into the token stream:
LLM outputs: "847 x 293 = [C: return 847*293;]"
↓
VM compiles & executes on GPU → injects result
↓
Token stream continues: "847 x 293 = [C: return 847*293;] = VM[248171]"
The model doesn't need to know the answer. It just writes C code; the VM guarantees correctness. No network round trips, no JSON tool protocol—execution happens on the same GPU that runs inference.
Our question is whether Qwen can learn, from in-context examples alone, to emit [C:...] blocks that the VM will accept and execute.
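The interception step can be sketched in plain C. This is an illustrative stand-in, not Ruffian's actual implementation: `vm_execute` is a hypothetical stub for the compile-and-run entry point, and the real VM works on token buffers rather than strings.

```c
#include <stdio.h>
#include <string.h>

/* Illustrative stub: the real Ruffian VM compiles the code to CUDA
   and runs it on-device. Here it just returns a fixed result. */
static int vm_execute(const char *code) {
    (void)code;
    return 248171; /* e.g. the result of "return 847*293;" */
}

/* Rewrites "...[C: code]" into "...[C: code] = VM[result]".
   Returns 1 if a complete marker was found and spliced, 0 otherwise. */
int inject_result(const char *stream, char *out, size_t outlen) {
    const char *open = strstr(stream, "[C:");
    if (!open) return 0;                 /* no marker: pass through */
    const char *close = strchr(open, ']');
    if (!close) return 0;                /* block still being generated */

    char code[256];
    size_t n = (size_t)(close - open - 3);
    if (n >= sizeof code) return 0;
    memcpy(code, open + 3, n);
    code[n] = '\0';

    int result = vm_execute(code);
    snprintf(out, outlen, "%.*s = VM[%d]",
             (int)(close - stream + 1), stream, result);
    return 1;
}
```

With the stub above, `inject_result("847 x 293 = [C: return 847*293;]", buf, sizeof buf)` fills `buf` with `847 x 293 = [C: return 847*293;] = VM[248171]`, mirroring the token-stream example.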
Ruffian supports most C constructs. 2,116 tests pass across CPU and CUDA GPU:
| Feature | Status | Example |
|---|---|---|
| While/for loops | PASS | while(i < 10) i++; |
| ++/-- operators | PASS | i++; --j; |
| Compound assignment | PASS | sum += i; |
| All comparisons | PASS | == != < > <= >= |
| Function calls | PASS | a = func(a); |
| Arrays (up to 2M elements) | PASS | int arr[2000000]; |
| Nested loops | PASS | while(a){ while(b){ } } |
| Recursion | PASS | return fib(n-1) + fib(n-2); |
| Bitwise ops | PASS | & \| ^ << >> |
| Ternary operator | PASS | a > b ? a : b |
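As an illustration of the subset in practice, here is a plain-C routine that combines several rows of the table (a while loop, compound assignment, bitwise ops, and the ternary operator). It is a sketch, not taken from the test suite; inside a [C:...] block the VM would execute it unchanged:

```c
/* Counts set bits using only constructs the feature table marks
   as supported: a while loop, compound assignment, bitwise AND
   and shift, and the ternary operator. */
int popcount(unsigned int x) {
    int count = 0;
    while (x != 0) {
        count += (x & 1) ? 1 : 0;  /* ternary + compound assignment */
        x >>= 1;                   /* bitwise shift */
    }
    return count;
}
```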
We verify 46 algorithms covering sorting, searching, number theory, and more:
| Category | Algorithms | Tests |
|---|---|---|
| Sorting | Bubble, Selection, Insertion | 3 |
| Data Structures | Stack, Queue operations | 4 |
| Number Theory | GCD, Prime check, Fibonacci, Factorial, Divisors, Collatz | 15 |
| Array Operations | Min, Max, Sum, Reverse, Search, Count | 9 |
| Bitwise | Popcount, Power of 2, Bit position | 5 |
| Project Euler | PE001, PE002, PE003, PE005, PE007, PE009, PE010 | 7 |
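For a concrete instance, the Collatz step count from the number-theory row fits entirely in the supported subset. A plain-C sketch (not the verbatim test-suite version):

```c
/* Number of Collatz steps to reach 1: halve even values,
   map odd values to 3n+1. Uses long to leave headroom for
   the intermediate peaks the sequence can reach. */
int collatz_steps(long n) {
    int steps = 0;
    while (n != 1) {
        n = (n % 2 == 0) ? n / 2 : 3 * n + 1;
        steps++;
    }
    return steps;
}
```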
Qwen solves Project Euler problems correctly when given the VM:
| Problem | Description | Answer |
|---|---|---|
| PE001 | Sum of multiples of 3 or 5 below 1000 | 233168 |
| PE002 | Even Fibonacci sum ≤ 4M | 4613732 |
| PE003 | Largest prime factor of 13195 | 29 |
| PE005 | LCM of 1-10 | 2520 |
| PE006 | Sum square difference (1-100) | 25164150 |
| PE007 | 6th prime | 13 |
| PE010 | Sum of primes below 100 | 1060 |
Here's PE001 (sum of multiples of 3 or 5 below 1000). During inference, the model would output this inside a [C:...] block and the VM would return 233168:
int pe001(int n){
int s=0; int i=1;
while(i<n){if(i%3==0||i%5==0){s=s+i;}i=i+1;}
return s;
} return pe001(1000);
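PE002 follows the same pattern. A plain-C sketch of the kind of block that yields the VM[4613732] answer listed in the table above:

```c
/* Sum of even-valued Fibonacci terms not exceeding limit.
   For limit = 4000000 this matches the PE002 answer above. */
int pe002(int limit) {
    int a = 1, b = 2, sum = 0;
    while (b <= limit) {
        if (b % 2 == 0) sum += b;
        int t = a + b;
        a = b;
        b = t;
    }
    return sum;
}
```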
> Qwen generates correct algorithms on the first try. The code is textbook-correct.
We teach Qwen the [C:...] syntax with a few examples. The model writes C code; the VM executes it. The prompt achieves 85%+ success:
You write C code for the Ruffian GPU VM.
When you need to compute something, output it inside [C: ... ] markers.
The VM will execute your code and return the result.
EXAMPLES:
User: What is 5 factorial?
Assistant: 5! = [C: int f(int n){int r=1;while(n>1){r=r*n;n=n-1;}return r;} return f(5);] = VM[120]
User: What is the 10th Fibonacci number?
Assistant: fib(10) = [C: int fib(int n){int a=0,b=1;for(int i=0;i<n;i++){int t=a+b;a=b;b=t;}return a;} return fib(10);] = VM[55]
User: What is gcd(48, 18)?
Assistant: gcd(48,18) = [C: int gcd(int a,int b){while(b!=0){int t=b;b=a%b;a=t;}return a;} return gcd(48,18);] = VM[6]
[YOUR TASK]
Small models write valid C instantly. Qwen picks up the [C:...] syntax and C subset from a few examples. It needs more examples to learn algorithm patterns, but syntactic compliance is immediate.
Explicit rules beat implicit patterns. Telling Qwen constraints explicitly works better than hoping it infers them from examples.
AI finds edge cases. When Qwen's correct-looking code fails, we find compiler bugs. The model becomes an accidental fuzzer—it generates valid C that exposes VM issues humans miss.
The model never computes. The model writes [C: return 847*293;] and the VM computes the result. The model only needs to write correct C—it never guesses the answer.
The whole system runs on a single GPU: no API calls, no cloud dependencies, no rate limits.