ENGINEERING | MAR 24, 2026

110,227 Tests to Break Our Own Compiler — 0 Failures

The adversarial corpus: 110,227 tests across 7 levels — from individual monomers to real execution. Full monomer catalog, all backends, all languages. Zero failures.

The Question We Asked Ourselves

When you build a compiler that transpiles code between 10 input languages and 14 output targets, that certifies itself mathematically, and that compiles itself — there is an inevitable question:

How do you know it works?

The industry standard answer is "we tested it pretty well." We decided that "pretty well" was not enough.

What Are Abyssal Tests?

We call them abyssal tests because they go all the way down to the bottom of the system. These are not superficial integration tests that verify "the button works." These are tests that verify every atomic operation, in every value combination, across every backend, with every control flow pattern.

110,227 Tests Across 7 Categories

The tests span 7 categories: individual monomer operations, multi-family compositions, cross-target consistency, determinism verification, real execution with verified I/O, security and abuse resistance, and regression coverage. Every test verifies a concrete property. These are not randomly generated tests — each one exists because it covers a specific execution path that could fail.

What We Tried to Break

Level 1: Individual Operations

Every monomer in the full catalog was tested with boundary values: 0, 1, 127, 128, 255, and combinations of them. ADD8(255, 1) must wrap around. DIV8(x, 0) must produce a controlled error. SHL(1, 7) must produce 128. No exceptions, no "it depends."
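To make the boundary checks concrete, here is a minimal Python sketch. It is not the BRIK-64 implementation: the functions add8, div8, and shl8 and the ControlledError type are hypothetical stand-ins, and we assume that 8-bit wrap-around means arithmetic modulo 256.

```python
class ControlledError(Exception):
    """Hypothetical stand-in for the compiler's controlled error."""


def add8(a: int, b: int) -> int:
    # Assumption: 8-bit addition wraps modulo 256.
    return (a + b) & 0xFF


def div8(a: int, b: int) -> int:
    # Division by zero must surface as a controlled error, never a crash.
    if b == 0:
        raise ControlledError("division by zero")
    return a // b


def shl8(a: int, n: int) -> int:
    # Shift left, keeping only the low 8 bits.
    return (a << n) & 0xFF


BOUNDARY = [0, 1, 127, 128, 255]


def check_boundaries() -> int:
    """Run every pair of boundary values; return the number of pairs checked."""
    checked = 0
    for a in BOUNDARY:
        for b in BOUNDARY:
            assert 0 <= add8(a, b) <= 255
            if b == 0:
                try:
                    div8(a, b)
                except ControlledError:
                    pass
                else:
                    raise AssertionError("DIV8 by zero did not raise")
            else:
                assert div8(a, b) == a // b
            checked += 1
    return checked
```

With five boundary values, the pairwise sweep covers 25 input combinations per binary operation.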

Level 2: Compositions

An individual monomer can work perfectly and fail when composed with another. We generated chains of 2, 3, 4, 5, and 6 operations mixing families: arithmetic with logic, logic with strings, strings with float, float with trigonometry. If ADD8 works and SIN works, does SIN(ADD8(1,2)) work?

Yes. In every case.
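The chain generation can be sketched as follows. This is an illustration under stated assumptions, not the actual generator: the four operations (INC8, NOT8, SHL1, XOR55) are hypothetical 8-bit stand-ins, and the sketch covers only arithmetic and logic rather than the string, float, and trigonometry families.

```python
from itertools import product

# Hypothetical 8-bit operations standing in for real monomers.
OPS = {
    "INC8": lambda x: (x + 1) & 0xFF,   # arithmetic
    "NOT8": lambda x: ~x & 0xFF,        # logic
    "SHL1": lambda x: (x << 1) & 0xFF,  # logic
    "XOR55": lambda x: x ^ 0x55,        # logic
}
BOUNDARY = [0, 1, 127, 128, 255]


def run_chain(names, value):
    """Apply the named operations left to right."""
    for name in names:
        value = OPS[name](value)
    return value


def check_all_chains(min_len=2, max_len=6):
    """Enumerate every chain of min_len..max_len operations and check
    that each boundary input yields a valid 8-bit output."""
    chains = 0
    for length in range(min_len, max_len + 1):
        for names in product(OPS, repeat=length):
            for start in BOUNDARY:
                out = run_chain(names, start)
                assert 0 <= out <= 255, (names, start, out)
            chains += 1
    return chains
```

Even with only four operations, chains of length 2 through 6 give 16 + 64 + 256 + 1,024 + 4,096 = 5,456 distinct compositions, which is why the composition level accounts for so many tests.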

Level 3: Cross-Target

The same PCD program must produce correct code in JavaScript, Python, Rust, Go, C, C++, PHP, and Java. Each monomer generates idiomatic code in the target language — with native semantics appropriate to that language. And all 8 backends must produce the same result for the same input.

2,864 tests verify this for monomer combinations alone.
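A cross-target check reduces to running the same program on each backend and comparing the outputs. The helper below is a hypothetical sketch: it assumes each backend's output has already been captured as a string, and it does not show how the generated code is built or run.

```python
from collections import defaultdict


def find_disagreements(results: dict[str, str]) -> dict[str, list[str]]:
    """Group backends by the output they produced.

    Returns an empty dict when every backend agrees, otherwise a mapping
    from each distinct output to the backends that produced it.
    """
    groups: dict[str, list[str]] = defaultdict(list)
    for backend, output in results.items():
        groups[output].append(backend)
    return {} if len(groups) <= 1 else dict(groups)
```

Grouping by output, rather than comparing against a single reference backend, shows which targets diverge without assuming any one of them is correct.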

Level 4: Determinism

The most important property of BRIK-64: the same input produces the same output, always. No garbage collection pausing between two runs. No JIT optimizing differently the second time. No scheduler reordering operations.

Every program is compiled twice. Hashes are compared. If they differ, the test fails. 600 determinism tests, zero failures.
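The compile-twice-and-compare check can be sketched in a few lines. The choice of SHA-256 and the toy_compile function are assumptions made for illustration. The source does not say which hash is used, and toy_compile is only a stand-in for a real compiler invocation.

```python
import hashlib
from typing import Callable


def is_deterministic(compile_fn: Callable[[str], bytes], source: str) -> bool:
    """Compile the same source twice and compare the output hashes."""
    first = hashlib.sha256(compile_fn(source)).hexdigest()
    second = hashlib.sha256(compile_fn(source)).hexdigest()
    return first == second


def toy_compile(source: str) -> bytes:
    # Hypothetical stand-in: a pure function of its input, so it is
    # deterministic by construction.
    lines = (line.strip() for line in source.splitlines())
    return "\n".join(lines).encode("utf-8")
```

Any hidden input to the compiler, such as a timestamp, a random seed, or hash-map iteration order, would make the two digests differ and fail the test.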

Level 5: Real Execution

The first 100,000 or so tests verify code generation: that the compiler produces valid code. The last 10,000 or so verify real execution: that the generated code, when run, produces the correct values.

ADD8(1, 2) must not only generate code that compiles — it must produce 3 when executed. SIN(0) must produce 0.0. A loop that accumulates 10 times must produce exactly 10.

These tests execute the BIR (BRIK Intermediate Representation) with known input values and verify that the output is exactly what is expected.

Level 6: Security and Abuse

What happens if someone puts SQL injection in a PCD variable name? XSS in a string literal? Path traversal in a filesystem argument? Unicode homoglyphs to confuse the parser?

484 regression and security tests verify that the system rejects or correctly handles every malicious case.
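One way to handle hostile variable names is to accept only a strict ASCII identifier grammar and reject everything else. The sketch below assumes such a grammar for illustration. The actual PCD identifier rules are not described here, and is_safe_identifier is a hypothetical helper.

```python
import re

# Assumed grammar: an ASCII letter or underscore, followed by ASCII
# letters, digits, or underscores. An explicit character class is used
# instead of \w because \w also matches non-ASCII letters in Python.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_safe_identifier(name: str) -> bool:
    """Return True only for names that match the assumed ASCII grammar."""
    return _IDENTIFIER.fullmatch(name) is not None
```

An allow-list like this rejects SQL fragments, path separators, and Unicode homoglyphs with one rule, so it does not need to enumerate known attacks.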

Level 7: Regression

Every bug we found and fixed during development became a permanent test. The array overflow that caused a segfault in ELF. The variable scoping in if blocks that didn't propagate to the outer scope. The ENV function that didn't exist as a monomer and returned garbage.

These bugs can never come back. Their tests are there forever.

What We Did NOT Find

This is the most relevant part. After 110,227 attempts to break the system:

0 failures in core operations (all certified monomers, Φ_c = 1). The mathematical certification holds.

0 determinism failures. Same input, same output. Always.

0 uncontrolled crashes in the compilation pipeline.

0 cross-target inconsistencies. All 8 backends produce equivalent code.

Why This Is Possible

The secret is not that we are better testers. It's that the operation space is finite.

A conventional program has a virtually infinite state space: any combination of calls to any function with any argument. Exhaustively verifying a 1,000-line Python program is computationally impossible.

A PCD program is composed of exactly 128 atomic operations. Each one has a known signature, a known domain, and a known range. You can verify every combination because the space is finite.

It's the same reason you can formally verify a digital circuit with 128 gates but you cannot formally verify a modern processor with a billion transistors. The finiteness of the component space makes exhaustive verification viable.
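The finiteness argument is easy to demonstrate on one operation. An 8-bit binary operation has 256 × 256 = 65,536 possible input pairs, few enough to check every one. The sketch below tests a hypothetical bit-level adder (add8_ripple, not BRIK-64 code) against modular arithmetic as the reference.

```python
def add8_ripple(a: int, b: int) -> int:
    """8-bit ripple-carry adder built from single-bit logic."""
    result, carry = 0, 0
    for bit in range(8):
        x = (a >> bit) & 1
        y = (b >> bit) & 1
        s = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        result |= s << bit
    # The carry out of bit 7 is dropped, which gives wrap-around.
    return result


def verify_exhaustively() -> int:
    """Check all 65,536 input pairs against (a + b) mod 256."""
    checked = 0
    for a in range(256):
        for b in range(256):
            assert add8_ripple(a, b) == (a + b) % 256, (a, b)
            checked += 1
    return checked
```

This is verification by enumeration rather than sampling: once the loop completes, no untested input remains for this operation.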

The Result

110,227 tests. 0 failures. This is not a marketing claim — it is a verifiable fact. Every test is in the repository. Every one runs on every commit. Every one produces the same result today that it produced yesterday and will produce tomorrow.

Because that is what "deterministic by construction" means.

Run the Corpus

git clone https://github.com/brik64/brik64-demos.git
cd brik64-demos
./run_demo.sh adversarial-corpus

The abyssal tests cover: the full monomer catalog, 14 backends, 10 input languages, control flow, multi-family compositions, determinism, real execution, security, and regression. The code and the tests are part of the same verifiable artifact.