Hand-rolling BitNet b1.58 in Rust: from autograd to tensor cores
I wanted to genuinely understand BitNet b1.58 - ternary weights with INT8 activations, trained through a straight-through estimator. So I wrote it from scratch in Rust with no third-party ML dependencies. Then I tried to make it run on Ada tensor cores. Then I found out the GPU was slower than the CPU at my batch size, and learnt why.