Teaching Qwen to Write GPU Code

A language model learns to program a GPU virtual machine through examples alone

Ruffian Project

Qwen 2.5 Coder is a family of code-focused language models. Ruffian is a GPU virtual machine that compiles C to CUDA shaders. We want to see if Qwen can learn to write Ruffian code through pure in-context learning: no fine-tuning, just examples in the prompt.

It works. With 10 examples, Qwen writes working GPU code 85% of the time.

How Ruffian Works

During inference, when the model outputs C code in a marked region, the VM intercepts it, compiles and executes on-GPU, and injects the result back into the token stream:

LLM outputs: "847 x 293 = [C: return 847*293;]"
                                     ↓
VM compiles & executes on GPU → injects result
                                     ↓
Token stream continues: "847 x 293 = [C: return 847*293;] = VM[248171]"

The model doesn't need to know the answer. It just writes C code; the VM guarantees correctness. No network round trips, no JSON tool protocol—execution happens on the same GPU that runs inference.

The Experiment

We test whether Qwen can learn to write [C:...] blocks that the VM will execute. The model's job is to write syntactically valid C—it never needs to know or guess the answer. The VM computes the result and injects it back into the token stream.

What Works

Ruffian supports most C constructs. 2,116 tests pass across the CPU and CUDA GPU backends:

Feature             | Status | Example
While/for loops     | PASS   | while(i < 10) i++;
++/-- operators     | PASS   | i++; --j;
Compound assignment | PASS   | sum += i;
All comparisons     | PASS   | != > < >= <=
Function calls      | PASS   | a = func(a);
Arrays (up to 2M)   | PASS   | int arr[2000000];
Nested loops        | PASS   | while { while { } }
Recursion           | PASS   | return fib(n-1) + fib(n-2);
Bitwise ops         | PASS   | & | ^ << >>
Ternary operator    | PASS   | a > b ? a : b

Algorithm Test Suite

We verify 46 algorithms covering sorting, searching, number theory, and more:

Category         | Algorithms                                                | Tests
Sorting          | Bubble, Selection, Insertion                              | 3
Data Structures  | Stack, Queue operations                                   | 4
Number Theory    | GCD, Prime check, Fibonacci, Factorial, Divisors, Collatz | 15
Array Operations | Min, Max, Sum, Reverse, Search, Count                     | 9
Bitwise          | Popcount, Power of 2, Bit position                        | 5
Project Euler    | PE001, PE002, PE003, PE005, PE007, PE009, PE010           | 7

Project Euler Results

Qwen solves Project Euler problems correctly when given the VM:

Problem | Description                           | Answer
PE001   | Sum of multiples of 3 or 5 below 1000 | 233168
PE002   | Even Fibonacci sum ≤ 4M               | 4613732
PE003   | Largest prime factor of 13195         | 29
PE005   | LCM of 1-10                           | 2520
PE006   | Sum square difference (1-100)         | 25164150
PE007   | 6th prime                             | 13
PE010   | Sum of primes below 100               | 1060

Here's PE001 (sum of multiples of 3 or 5 below 1000). During inference, the model would output this inside a [C:...] block and the VM would return 233168:

int pe001(int n){
  int s=0; int i=1;
  while(i<n){ if(i%3==0 || i%5==0) s+=i; i++; }
  return s;
}
return pe001(1000);

"Qwen generates correct algorithms on the first try. The code is textbook-correct."

The Prompt

We teach Qwen the [C:...] syntax with a few examples. The model writes C code; the VM executes it. The prompt achieves 85%+ success:

You write C code for the Ruffian GPU VM.
When you need to compute something, output it inside [C: ... ] markers.
The VM will execute your code and return the result.

EXAMPLES:
User: What is 5 factorial?
Assistant: 5! = [C: int f(int n){int r=1;while(n>1){r=r*n;n=n-1;}return r;} return f(5);] = VM[120]

User: What is the 10th Fibonacci number?
Assistant: fib(10) = [C: int fib(int n){int a=0,b=1;for(int i=0;i<n;i++){int t=a+b;a=b;b=t;}return a;} return fib(10);] = VM[55]

User: What is gcd(48, 18)?
Assistant: gcd(48,18) = [C: int gcd(int a,int b){while(b!=0){int t=b;b=a%b;a=t;}return a;} return gcd(48,18);] = VM[6]

[YOUR TASK]

What We Learn

Small models write valid C instantly. Qwen picks up the [C:...] syntax and C subset from a few examples. It needs more examples to learn algorithm patterns, but syntactic compliance is immediate.

Explicit rules beat implicit patterns. Telling Qwen constraints explicitly works better than hoping it infers them from examples.

AI finds edge cases. When Qwen's correct-looking code fails, we find compiler bugs. The model becomes an accidental fuzzer—it generates valid C that exposes VM issues humans miss.

The model never computes. The model writes [C: return 847*293;] and the VM computes the result. The model only needs to write correct C—it never guesses the answer.

Setup

Requirements:

The model runs entirely on-GPU. No API calls, no cloud dependencies, no rate limits.