Intro

Can you factor RSA-512 in under a second, even when the primes are big enough? You may want to reach for RsaCtfTool, but it won’t help you with these primes. Give it a go if you want. We’ll be working with this modulus:

B3EFAF2E3A1FDF4C496EC1FEFEFC8C93603004CB36E0F24AEBC4E8A2E37D6F65407378FAA5288E67BB8567E530BA7DC58A9739D8B9700BA0965B736AB8E029B1

This is the story of how I factored this modulus by exploiting a weakness in how its primes were generated.

The vulnerability isn’t in RSA itself. It’s in the random number generator used during key generation.

I will first explain why this algorithm is vulnerable and how we will attack it, then walk from my original naive implementation to the latest AI-generated fast one.

Disclaimer: I only wrote the initial multi.cpp implementation; the others were all generated with the help of AI. You can download all the source code from my repository.

Vulnerable Algorithm

While analyzing a real-world software application, I found that the RSA modulus n was generated using the following process:

seed (32-bit)
   ↓
PRNG → 32 bytes → p_candidate → nextprime → p
   ↓
PRNG → 32 bytes → q_candidate → nextprime → q
   ↓
n = p × q

The PRNG was Borland’s default linear congruential generator (LCG), which was basically the following:

#include <stdint.h>

int MakeRandom(char *out, int len, int rseed) {
    int seed = rseed;
    for (int j = 0; j < len; j++) {
        seed = seed * 0x8088405 + 1;               // Borland LCG step
        uint64_t lval = (uint32_t)seed * 0xDFULL;  // scale the state into [0, 0xDE]
        out[j] = (char)((lval >> 32) + 0x20);      // emit a byte in [0x20, 0xFE]
    }
    return seed;                                   // chained into the next call
}
  • The entire key generation depends on a single 32-bit seed.
  • The LCG is deterministic.
  • Each run produces the same (p, q) for the same seed.

The whole space of possible (p, q) pairs fits in 2^32 ≈ 4.3 billion seeds. That is trivially brute-forceable if you have a fast way to test each seed. Basically, if you can guess rseed, you can recover the primes.
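The chaining in the diagram above can be sketched as follows. This is my own self-contained reimplementation for illustration (function and buffer names are mine, and I use uint32_t to keep the wraparound well-defined); the key point is that the LCG state returned after the p bytes feeds directly into the q bytes, so one 32-bit seed fixes both prime candidates.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Reimplementation of the LCG byte generator shown above.
static uint32_t make_random(unsigned char *out, int len, uint32_t seed) {
    for (int j = 0; j < len; j++) {
        seed = seed * 0x8088405u + 1u;           // Borland LCG step
        uint64_t lval = (uint64_t)seed * 0xDFu;  // scale state into [0, 0xDE]
        out[j] = (unsigned char)((lval >> 32) + 0x20);
    }
    return seed;
}

// One 32-bit seed deterministically fixes both prime candidates:
// the state after the p bytes is carried straight into the q bytes.
static void candidates_from_seed(uint32_t rseed, unsigned char pbuf[32],
                                 unsigned char qbuf[32]) {
    uint32_t s = make_random(pbuf, 32, rseed);  // 32 bytes -> p candidate
    make_random(qbuf, 32, s);                   // 32 more  -> q candidate
}
```

Running this twice with the same seed yields identical (p, q) candidate bytes, which is exactly the weakness the attack exploits.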

The attack

When I first approached this problem, I thought it would be easy to factor: after all, we are only dealing with a 32-bit seed. However, calling nextprime for every seed is far too slow. We need to eliminate the nextprime call from the hot path with a few optimizations.

For each candidate seed:

  1. Regenerate the candidate p and q via the LCG.
  2. Compute fake = p * q.
  3. If 0 ≤ n - fake ≤ gap, this seed is worth checking. (gap is an assumed upper bound on how far nextprime() shifted the real primes away from their raw LCG output, so the real n should be within gap of fake.) This gap is not a special number; it was just a guess after testing the MakeRandom function several times.
  4. On a hit, run nextprime(p) and test whether the resulting prime divides n. If it does, we have the factorization, and the private key is recovered.
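Why a fixed gap works at all: writing p = p_cand + dp and q = q_cand + dq, we get n - fake = dp·q_cand + dq·p_cand + dp·dq, which stays small because prime gaps dp and dq are tiny relative to 256-bit numbers. A toy-scale sketch of this bound (all numbers here are made up for illustration, not from the real key):

```cpp
#include <cassert>
#include <cstdint>

// If nextprime shifts each candidate by at most D, then
//   n - fake = dp*q_cand + dq*p_cand + dp*dq <= D*(p_cand + q_cand) + D*D,
// so any fixed gap at least this large never rejects the true seed.
static uint64_t gap_bound(uint64_t p_cand, uint64_t q_cand, uint64_t D) {
    return D * (p_cand + q_cand) + D * D;
}
```

At real scale the same algebra holds with 256-bit candidates; the author’s gap was simply an empirical guess that comfortably covers the observed shifts.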

Baseline: GMP everywhere (multi.cpp)

The initial implementation uses GMP for every arithmetic step, spread across all CPU threads:

for (int seed = start; seed <= end && !found; seed++) {
    int nextSeed = MakeRandom(buf, KEYLEN, seed);  // p bytes; returns chained state
    mpz_import(p, ...);
    MakeRandom(buf, KEYLEN, nextSeed);             // q bytes from the same chain
    mpz_import(q, ...);
    mpz_mul(fake, q, p);
    mpz_sub(sub, n, fake);
    if (mpz_sgn(sub) < 0) continue;
    if (mpz_cmp(sub, gap) > 0) continue;
    mpz_nextprime(p, p);
    if (mpz_divisible_p(n, p)) { /* found */ }
}

This runs in about 9.8 seconds on an Apple M5 Max (18 CPU cores).

Since we have AI and a powerful machine, can we do any better?

Iteration 1: drop GMP from the hot loop (fast.cpp)

GMP is general-purpose: arbitrary precision, dynamic allocation, branchy code to handle every operand shape. Our operands, on the other hand, are always exactly 256 × 256 bits → 512 bits. That is a fixed shape, and we can open-code it much more tightly than a library ever will.

The work per iteration breaks down as:

  • mpz_import × 2 - bytes to GMP limbs, with size bookkeeping.
  • mpz_mul - 256×256 → 512 bit multiply through GMP’s generic dispatcher.
  • mpz_sub / mpz_sgn / mpz_cmp - a 512-bit subtraction and two comparisons, each a library call.

Each of those crosses a function boundary, touches allocator state, and runs branchy code paths designed to also handle 8000-bit operands. For our fixed shape we can replace all of it with 4-limb / 8-limb arithmetic on uint64_t, using __uint128_t to hold the 64×64 multiply carry:

static inline void mul_256x256(const uint64_t a[4], const uint64_t b[4],
                               uint64_t r[8]) {
    for (int i = 0; i < 8; i++) r[i] = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < 4; j++) {
            __uint128_t prod = (__uint128_t)a[i] * b[j] + r[i+j] + carry;
            r[i+j] = (uint64_t)prod;
            carry = (uint64_t)(prod >> 64);
        }
        r[i+4] = carry;
    }
}
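The subtract-and-compare half of the filter is not shown above. Here is a sketch of how it might look with the same limb layout (my code, not the repository’s): an 8-limb subtraction with borrow, a sign check, and a high-limb-first comparison against gap, replacing mpz_sub / mpz_sgn / mpz_cmp.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when 0 <= n - prod <= gap over 8 little-endian 64-bit
// limbs, i.e. the seed is worth handing to the GMP cold path.
static bool filter_512(const uint64_t n[8], const uint64_t prod[8],
                       const uint64_t gap[8]) {
    uint64_t sub[8];
    uint64_t borrow = 0;
    for (int i = 0; i < 8; i++) {
        __uint128_t d = (__uint128_t)n[i] - prod[i] - borrow;
        sub[i] = (uint64_t)d;
        borrow = (uint64_t)(d >> 64) & 1;   // 1 iff the subtraction wrapped
    }
    if (borrow) return false;               // n < prod: difference negative
    for (int i = 7; i >= 0; i--) {          // compare sub vs gap, high limb first
        if (sub[i] != gap[i]) return sub[i] < gap[i];
    }
    return true;                            // sub == gap
}
```

Like mul_256x256, this is straight-line code with no allocation and no library calls, which is the whole point of dropping GMP from the hot loop.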

Crucially, GMP stays in the cold path: when the native filter says “candidate”, we fall back to mpz_nextprime and mpz_divisible_p to verify it for real. That path runs maybe once per entire program, so there is no point porting it.

Result: 7.1 seconds, 121 core-seconds. The speedup is 1.4× - respectable but not spectacular.

Iteration 2: vectorize the LCG (neon.cpp)

Back-of-envelope on the 7.1s run:

  • 121 core-seconds / 4.29B seeds = ~28 ns per seed.
  • The LCG runs 64 multiplies per seed (32 for p bytes, 32 for q).
  • At roughly 1 cycle per multiply on a 3.3 GHz core, the LCG alone is ~20 ns.

That leaves only ~8 ns for everything else - the 16-multiply schoolbook, the 8-limb subtract, the comparison. The LCG is the bottleneck.

We cannot shorten the LCG chain within one seed without a jump-ahead formula (which exists but introduces complexity). But consecutive seeds are independent. Four seeds’ LCGs can run side by side in a single NEON 32×4 vector register:

const uint32x4_t mulc   = vdupq_n_u32(0x8088405);   // LCG multiplier A
const uint32x4_t addone = vdupq_n_u32(1);           // LCG increment
const uint32x4_t add20  = vdupq_n_u32(0x20);        // byte offset
uint32x4_t state = { seed, seed+1, seed+2, seed+3 };
for (int j = 0; j < KEYLEN; j++) {
    state = vmlaq_u32(addone, state, mulc);                 // state = state*A + 1
    uint64x2_t lo = vmull_n_u32(vget_low_u32(state),  0xDF);
    uint64x2_t hi = vmull_n_u32(vget_high_u32(state), 0xDF);
    uint32x4_t high  = vcombine_u32(vshrn_n_u64(lo, 32),
                                    vshrn_n_u64(hi, 32));
    uint32x4_t bytes = vaddq_u32(high, add20);
    p[0][j] = (uint8_t)vgetq_lane_u32(bytes, 0);
    p[1][j] = (uint8_t)vgetq_lane_u32(bytes, 1);
    p[2][j] = (uint8_t)vgetq_lane_u32(bytes, 2);
    p[3][j] = (uint8_t)vgetq_lane_u32(bytes, 3);
}

One vector step produces four bytes, one per seed. Sixty-four steps (32 for p, then 32 more for q, continuing the same LCG chain on each lane) emit four complete (p, q) pairs. The multiply / subtract / compare stays scalar and simply runs four times per outer batch.

Result: 5.4 seconds, 78 core-seconds. About 18 ns per seed - essentially at the LCG serial-chain floor for a single core. To go faster, we need to stop doing this on a CPU.
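As an aside, the jump-ahead formula mentioned above is less scary than it sounds, though we did not use it: k steps of s → s·A + 1 (mod 2^32) compose into one affine map s → A^k·s + (A^(k-1) + … + A + 1). A sketch using square-and-multiply on affine maps (my code, not from the repository):

```cpp
#include <cassert>
#include <cstdint>

// Compute (mulA, addC) such that applying the Borland LCG k times to s
// equals mulA*s + addC (mod 2^32).
static void lcg_jump(uint32_t k, uint32_t &mulA, uint32_t &addC) {
    uint32_t a = 1, c = 0;                // accumulated map: identity
    uint32_t ak = 0x8088405u, ck = 1u;    // current power-of-two step
    while (k) {
        if (k & 1) { c = ak * c + ck; a *= ak; }  // compose step into result
        ck = ak * ck + ck;                        // double the step
        ak *= ak;
        k >>= 1;
    }
    mulA = a; addC = c;
}
```

With this, one could leap 32 LCG steps in a couple of multiplies instead of 32, but the NEON approach of running four independent seeds per vector was simpler and fast enough.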

Iteration 3: hand it to the GPU (gpu.mm)

Per-seed work is independent and tiny - a couple hundred integer ops, no shared state, no data dependencies across seeds. That is the textbook shape for a GPU: an Apple Silicon GPU has thousands of ALUs, not eighteen.

The plan: one GPU thread per seed, 16 batches of 256M threads each (= 4.29B seeds total), early-exit as soon as any thread finds the hit.

The kernel is almost a direct port of the native version, but using 32-bit limbs instead of 64-bit. GPUs have a fast native 32×32 → 64 multiply, and MSL does not have __uint128_t anyway - so 8 × uint32 limbs for p and q, 16 × uint32 limbs for n, gap, and prod, with ulong intermediates for carry:

kernel void search(
    device atomic_uint *found        [[buffer(0)]],
    constant uint      *n_limbs      [[buffer(1)]],
    constant uint      *gap_limbs    [[buffer(2)]],
    constant uint      &base_seed    [[buffer(3)]],
    uint                tid          [[thread_position_in_grid]])
{
    uint seed = base_seed + tid;
    // 32 LCG steps -> p-bytes, 32 more -> q-bytes
    // pack bytes -> 8 x uint limbs each
    // mul_256x256 with 32-bit limbs and ulong intermediates
    // sub = n - prod, 32-bit limbs with borrow
    // compare sub vs gap -> bool pass
    if (pass)
        atomic_store_explicit(found, seed + 1u, memory_order_relaxed);
}
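The 32-bit-limb multiply the kernel comments refer to can be exercised on the CPU. Here is a plain C++ sketch mirroring what the MSL version would do (my code, not the actual kernel): 8 × 8 limbs → 16 limbs, with uint64_t standing in for MSL’s ulong as the carry type.

```cpp
#include <cassert>
#include <cstdint>

// 256x256 -> 512-bit schoolbook multiply over 32-bit little-endian
// limbs; the 32x32 product and carry fit in a 64-bit intermediate.
static void mul_256x256_u32(const uint32_t a[8], const uint32_t b[8],
                            uint32_t r[16]) {
    for (int i = 0; i < 16; i++) r[i] = 0;
    for (int i = 0; i < 8; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < 8; j++) {
            uint64_t prod = (uint64_t)a[i] * b[j] + r[i + j] + carry;
            r[i + j] = (uint32_t)prod;
            carry = (uint32_t)(prod >> 32);
        }
        r[i + 8] = carry;
    }
}
```

This is structurally identical to the 64-bit mul_256x256 earlier, just with twice the limbs, which is why the port to Metal was nearly mechanical.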

The CPU side compiles the kernel at runtime via newLibraryWithSource:, allocates buffers for n, gap, and a single-uint found slot, then loops over batches with a CPU-side check between dispatches so we can early-exit:

for (uint64_t base = 0; base < TOTAL; base += BATCH) {
    // ... encode dispatch with base_seed = base ...
    [cmd commit];
    [cmd waitUntilCompleted];
    if (*((uint32_t*)found_buf.contents)) break;
}

When a thread finds the hit, it writes seed + 1 into the shared found slot (0 is reserved for “not yet found”). Nothing fancy for the atomic - there is at most one true hit in the whole keyspace, so contention is a non-issue.

The GMP cold path (nextprime + divisibility + RSA key output) stays on the CPU. No reason to move it; it runs once.

Result: 0.87 seconds wall time, 0.02 seconds CPU. The GPU actually processed only 12 of 16 batches - the hit lives at seed 0xBEC0FFEE, about 75% through the keyspace, and we early-exited.

Numbers

Version                     Wall    CPU (core-sec)   vs baseline   Change
multi (GMP)                 9.8s    165s             1.00×         -
fast (native 64-bit limbs)  7.1s    121s             1.38×         GMP overhead out of hot loop
neon (4-wide NEON LCG)      5.4s    78s              1.82×         LCG vectorized
gpu (Metal)                 0.87s   0.02s            11.3×         thousands of parallel ALUs

Conclusion

Years ago, I remember spending a long time solving this problem. When I first got a working version running on a single thread and it finished in 5–6 hours, I was so happy about it!

Over the years, as CPUs got faster and with some manual optimizations, I managed to bring that down to around 10 seconds. But seeing it reach 0.87 seconds with the help of AI feels like a completely different level.

There’s a lot of debate around AI, but from my perspective, for people who understand what they’re doing, it’s a tool that can open doors we couldn’t even imagine before.