For over a decade, Ruby developers who wanted to do anything with numerical computing or machine learning faced the same choice: switch to Python, or cobble together half-maintained gems that wrapped NumPy at arm’s length. Neither option was great. The first meant abandoning a language you love. The second meant living with sparse documentation, broken builds, and APIs that felt foreign in Ruby.

I finally sat down and ported MLX — Apple’s array framework designed for machine learning on Apple silicon — to Ruby. The result is mlx-ruby: a native C++ extension that gives Ruby lazy-evaluated arrays, automatic differentiation, a full neural network library, and GPU acceleration. Not a wrapper around Python. Not a transpiler. A direct binding to the MLX C++ runtime, with a Ruby API designed to feel like Ruby.

The Gap

Python dominates ML for well-known reasons: NumPy shipped early, SciPy built on it, and then PyTorch and TensorFlow cemented the ecosystem. But Python didn’t win because of its syntax. It won because the libraries existed.

Ruby, meanwhile, has had scattered attempts. NMatrix came and went. Numo::NArray works for basic numerics but has no autodiff, no GPU support, no neural network abstractions. Torch.rb binds LibTorch, which is viable but means inheriting PyTorch’s design decisions wholesale — a C++ API shaped by Python conventions, adapted back into Ruby through a layer of bindings. Every layer of indirection costs you something in ergonomics.

The core problem was never that Ruby couldn’t do math. It’s that nobody built the full stack: arrays, differentiation, compilation, neural network modules, optimizers, and serialization — all designed together, with Ruby’s strengths in mind.

Why Ruby Is Actually Good at This

Ruby gets dismissed for ML work as “too slow” or “not serious enough.” But the actual model-definition work in any ML framework isn’t about raw loop speed — it’s about expressing architecture. You’re composing layers, defining forward passes, specifying loss functions. The heavy computation happens in C++/Metal/CUDA kernels regardless of which language you write your model definition in.

And for the expressive part — the part where you actually design things — Ruby is exceptional. Consider what a trainable model looks like:

class LinearRegressor < MLX::NN::Module
  def initialize
    super()
    self.linear = MLX::NN::Linear.new(3, 1)
  end
 
  def call(x)
    linear.call(x)
  end
end

That’s it. No decorators. No type annotations fighting the runtime. No __init__ / super().__init__() ceremony. Ruby’s self.name = value pattern registers parameters automatically for tracking and optimization. The module base class handles the rest.
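To make the registration idea concrete, here is a toy pure-Ruby sketch of how a `self.name = value` assignment can double as registration — this is illustrative only, not mlx-ruby’s actual internals; `ToyModule` and its `registry` hash are invented for the example. Unknown setters are intercepted with method_missing, the value is recorded, and a reader is defined on the fly:

```ruby
# Toy sketch (NOT mlx-ruby's real implementation): intercept unknown setter
# calls, record the assigned value in a registry, and define a reader for it.
class ToyModule
  attr_reader :registry

  def initialize
    @registry = {}
  end

  def method_missing(name, *args)
    if name.to_s.end_with?("=") && args.length == 1
      key = name.to_s.chomp("=").to_sym
      @registry[key] = args.first                        # register the child
      define_singleton_method(key) { @registry[key] }    # reader for later use
    else
      super
    end
  end

  def respond_to_missing?(name, include_private = false)
    name.to_s.end_with?("=") || super
  end
end

m = ToyModule.new
m.linear = :a_layer   # looks like plain assignment, but also registers
m.registry            # => { linear: :a_layer }
m.linear              # => :a_layer
```

The payoff is that model authors write ordinary-looking assignments while the base class quietly accumulates everything it needs for parameter tracking.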

Ruby’s blocks make functional transforms read naturally:

loss_fn = ->(w) do
  mx.mean(mx.square(mx.matmul(x_train, w) - y_train)) * 0.5
end
 
grad_fn = mx.grad(loss_fn)

A training loop is just a loop:

200.times do
  grad = grad_fn.call(w)
  w = w - grad * lr
  mx.eval(w)
end

No session management, no tape context, no with torch.no_grad():. MLX’s lazy evaluation model means you build a computation graph implicitly and materialize it when you call mx.eval. Ruby’s clean syntax makes this feel invisible.

Why MLX

I chose MLX as the foundation for a few reasons.

Clean C++ core. MLX was designed from scratch by Apple’s ML research team. The C++ API is modern (C++20), well-factored, and doesn’t carry decades of backward-compatibility debt. Binding to it directly is tractable — the native extension is about 8,000 lines of C++, which is manageable for one person.

Lazy evaluation. MLX arrays aren’t computed until you need them. This means you can build arbitrarily large computation graphs without materializing intermediate results. It also makes automatic differentiation natural — the graph is already there.
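The lazy-graph idea fits in a few lines of plain Ruby. The sketch below is a deliberately tiny illustration of the concept — not MLX’s implementation, and the `Lazy` class is invented for this example: each operator returns a graph node, and nothing is computed until `eval` forces the graph, caching each node’s result:

```ruby
# Minimal lazy-evaluation sketch (illustrative only): operators build graph
# nodes; eval walks the graph, computes each node once, and caches the result.
class Lazy
  def initialize(value = nil, op: nil, inputs: [])
    @value, @op, @inputs = value, op, inputs
  end

  def +(other)
    Lazy.new(op: :+, inputs: [self, other])
  end

  def *(other)
    Lazy.new(op: :*, inputs: [self, other])
  end

  def eval
    # Leaves already hold a value; interior nodes compute once, then cache.
    @value ||= @inputs.map(&:eval).reduce(@op)
  end
end

a = Lazy.new(2)
b = Lazy.new(3)
c = (a + b) * a   # builds a three-node graph; no arithmetic has run yet
c.eval            # => 10
```

Scale that picture up to arrays and kernels and you have the shape of MLX’s model: construction is cheap, and the runtime sees the whole graph before it schedules any work.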

Unified memory. On Apple silicon, MLX operates on unified CPU/GPU memory. No explicit transfers, no tensor.to("cuda"). You write your code and it runs where it should. On Linux, the CPU backend works the same way.

Composable transforms. MLX treats grad, vmap, jvp, vjp, and compile as function transforms — higher-order functions that take a function and return a new function. This maps perfectly to Ruby’s lambda/proc model.
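To show why lambdas map so well onto function transforms, here is a toy `grad` in plain Ruby using central finite differences — an assumed stand-in for illustration, not MLX’s autodiff. A transform is just a higher-order function, so transforms compose by nesting:

```ruby
# Toy numerical gradient (central finite differences), standing in for an
# autodiff transform: takes a function, returns its derivative as a function.
def grad(f, eps: 1e-6)
  ->(x) { (f.call(x + eps) - f.call(x - eps)) / (2 * eps) }
end

square   = ->(x) { x * x }
dsquare  = grad(square)      # a new function: the derivative of square
d2square = grad(dsquare)     # transforms compose: derivative of the derivative

dsquare.call(3.0)   # ≈ 6.0
```

The real `mx.grad` returns exact gradients from the computation graph, but the calling convention is the same: function in, function out.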

Metal, Unified Memory, and the GPU

The performance story of mlx-ruby is really the performance story of MLX itself — and that story starts with Apple silicon’s unified memory architecture.

No More tensor.to("cuda")

In PyTorch, moving data between CPU and GPU is explicit and error-prone. You allocate a tensor, call .to("cuda") to copy it to the GPU, do your work, then copy results back. Forget a transfer and you get a cryptic device mismatch error. On Apple silicon, MLX doesn’t need any of this — CPU and GPU share the same physical memory. An array created on one device is immediately accessible to the other:

a = mx.random_uniform([1024, 1024])
 
# Same data, no copies
mx.stream(mx.cpu) { mx.add(a, a) }   # runs on CPU
mx.stream(mx.gpu) { mx.add(a, a) }   # runs on GPU

There’s no serialization, no PCIe bottleneck, no cudaMemcpy. The array lives in unified memory and both processors see it directly. MLX handles cross-device dependencies automatically — if a GPU operation depends on a CPU result, MLX ensures the CPU work finishes first without you writing synchronization code.

Device Selection

Picking where your code runs is straightforward:

# Environment variable
# MLX_DEFAULT_DEVICE=gpu ruby train.rb
 
# Or in code
mx.set_default_device(mx.gpu)
 
# Or scope it to a block
mx.stream(mx.gpu) do
  result = mx.matmul(a, b)  # runs on GPU
end
# back to default device here

You can query what’s available at runtime:

mx.metal_is_available  # => true on Apple silicon
mx.device_info(mx.gpu) # => {"name" => "Apple M1 Max", ...}

Mixed CPU/GPU Execution

Because there’s no transfer cost, you can mix devices within a single computation and actually gain performance. Put compute-heavy operations on the GPU and small operations on the CPU, and they overlap:

def mixed_compute(a, b)
  x = mx.stream(mx.gpu) { mx.matmul(a, b) }  # GPU: big matmul
  20.times do
    b = mx.stream(mx.cpu) { mx.exp(b) }       # CPU: small element-wise
  end
  [x, b]
end

On an M1 Max, this mixed approach runs in about 1.4ms compared to 2.8ms for GPU-only execution — a 2x speedup from letting both processors work simultaneously. MLX’s lazy evaluation and automatic dependency tracking make this safe without manual synchronization.

Lazy Evaluation and Async Execution

Nothing computes until you say so. This isn’t just an implementation detail — it’s a performance strategy. You build an arbitrarily large computation graph, and MLX optimizes and executes it as a unit:

# None of this allocates or computes anything yet
x = mx.random_uniform([1000, 1000])
y = mx.matmul(x, x)
z = mx.sum(mx.exp(y))
 
# Now it all runs, fused and optimized
mx.eval(z)

For training loops where you don’t need to block on every step, there’s async evaluation:

mx.async_eval(model.parameters)  # returns immediately
# Ruby continues while Metal churns in the background

Critically, both eval and async_eval release Ruby’s Global VM Lock (GVL) during computation. This means other Ruby threads — serving HTTP requests, processing IO, running background jobs — continue unblocked while the GPU works. Your ML pipeline doesn’t freeze your application.

Custom Metal Kernels

When the built-in operations aren’t enough, you can write custom Metal shaders and call them directly from Ruby:

source = <<~METAL
  uint elem = thread_position_in_grid.x;
  T tmp = inp[elem];
  out[elem] = metal::exp(tmp);
METAL
 
kernel = mx.metal_kernel("myexp", ["inp"], ["out"], source)
 
outputs = kernel.call(
  inputs: [x],
  output_shapes: [[4]],
  output_dtypes: [mx.float32],
  grid: [4, 1, 1],
  threadgroup: [4, 1, 1],
  template: [["T", mx.float32]]
)

This is the same API that the upstream MLX project uses for fused kernels. The performance gains are real — a custom grid-sample kernel on an M1 Max runs 8.3x faster on the forward pass and 40.5x faster on the backward pass compared to the equivalent composition of standard MLX primitives. Having this escape hatch available from Ruby means you’re never stuck at a performance ceiling.

JIT Compilation

For pure-Ruby function graphs, mx.compile JIT-compiles the computation:

def expensive(x, y)
  mx.exp(mx.negative(x)) + y
end
 
fast = mx.compile(method(:expensive))
 
# First call compiles, subsequent calls use the cached version
fast.call(x, y)

The compiled function recompiles only when input shapes, dtypes, or argument count change. For inner-loop operations that get called thousands of times with the same shapes, this eliminates graph-construction overhead entirely.
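The caching behavior is easy to picture with a toy sketch — invented for illustration, not `mx.compile` itself: key a cache on the input signature, pay the “compile” cost once per signature, and reuse the artifact on every matching call:

```ruby
# Toy shape-keyed cache (illustrative only): the expensive build step runs
# once per input signature; same-shaped calls reuse the cached artifact.
class ShapeCachedFn
  attr_reader :compilations

  def initialize(&build)     # build simulates the expensive compile step
    @build = build
    @cache = {}
    @compilations = 0
  end

  def call(*arrays)
    signature = arrays.map { |a| a.length }   # stand-in for shape + dtype
    compiled = @cache[signature] ||= begin
      @compilations += 1
      @build.call                             # "compile" once per signature
    end
    compiled.call(*arrays)
  end
end

fast = ShapeCachedFn.new { ->(a, b) { a.zip(b).map { |x, y| x + y } } }
fast.call([1, 2], [3, 4])        # first call: compiles
fast.call([5, 6], [7, 8])        # same shapes: cache hit
fast.call([1, 2, 3], [4, 5, 6])  # new shape: recompiles
fast.compilations                # => 2
```

The real implementation traces and fuses the computation graph rather than caching a Ruby lambda, but the cost model a user should keep in mind is the same: vary your shapes and you pay for compilation again.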

What’s in the Box

mlx-ruby ships as a gem (gem install mlx) with a native C++ extension that compiles against the upstream MLX runtime. Here’s what you get:

Core Array Operations

Everything you’d expect from a NumPy-like library: creation (zeros, ones, arange, eye), arithmetic, comparisons, reductions (sum, mean, max), reshaping, slicing, concatenation, linear algebra (matmul, transpose), trigonometric and transcendental functions, FFT, einsum, and more.

mx = MLX::Core
x = mx.array([1.0, 2.0, 3.0], mx.float32)
y = mx.sqrt(x + 1.0)
mx.eval(y)
p y.to_a  # => [1.414..., 1.732..., 2.0]

Function Transforms

Automatic differentiation, vectorized mapping, JIT compilation, and checkpointing — all as composable function transforms:

grad_fn    = mx.grad(loss_fn)
mapped_fn  = mx.vmap(fn, in_axes: 0)
compiled   = mx.compile(fn)
loss, grad = mx.value_and_grad(fn).call(params)

Neural Network Modules

Over 30 layer types organized under MLX::NN:

  • Linear layers: Linear, Bilinear, Identity
  • Convolutions: Conv1d, Conv2d, Conv3d and their transposed variants
  • Recurrent: RNN, GRU, LSTM
  • Normalization: LayerNorm, RMSNorm, GroupNorm, BatchNorm, InstanceNorm
  • Attention: MultiHeadAttention, TransformerEncoderLayer
  • Pooling: MaxPool1d/2d, AvgPool1d/2d, AdaptiveAvgPool1d/2d
  • Activations: ReLU, GELU, SiLU, Mish, Sigmoid, Tanh, and more
  • Positional encoding: RoPE, ALiBi
  • Embedding: Embedding, QuantizedEmbedding
  • Quantization: QuantizedLinear, QQLinear

Optimizers and Schedulers

Eleven optimizers — SGD, Adam, AdamW, RMSprop, Adagrad, AdaDelta, Adamax, Lion, Adafactor, and Muon among them — with learning rate schedulers (exponential decay, cosine decay, step decay, linear schedules, joined schedules).

Loss Functions

cross_entropy, binary_cross_entropy, l1_loss, mse_loss, smooth_l1_loss, kl_div_loss — all with configurable reduction.
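The reduction convention is the usual mean/sum/none one. Here is a plain-Ruby sketch of the idea for a scalar MSE — assumed semantics for illustration, operating on plain arrays rather than MLX arrays:

```ruby
# Sketch of configurable reduction (assumed mean/sum/none semantics):
# compute per-element losses, then reduce according to the argument.
def mse_loss(predictions, targets, reduction: "mean")
  per_element = predictions.zip(targets).map { |p, t| (p - t)**2 }
  case reduction
  when "mean" then per_element.sum.to_f / per_element.length
  when "sum"  then per_element.sum
  when "none" then per_element
  else raise ArgumentError, "unknown reduction: #{reduction}"
  end
end

mse_loss([1.0, 2.0], [0.0, 4.0])                     # => 2.5
mse_loss([1.0, 2.0], [0.0, 4.0], reduction: "sum")   # => 5.0
mse_loss([1.0, 2.0], [0.0, 4.0], reduction: "none")  # => [1.0, 4.0]
```

“none” is the one to reach for when you want per-sample losses, e.g. for weighting or hard-example mining, before reducing yourself.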

Serialization

Load and save models in NPZ and SafeTensors formats. Weight loading with strict mode for catching mismatches:

model.save_weights("model.safetensors")
model.load_weights("model.safetensors", strict: true)

A Real Example: Nano GPT in Ruby

Here’s a Karpathy-style GPT defined in Ruby. This isn’t pseudocode — it runs:

class NanoGpt < MLX::NN::Module
  def initialize(vocab_size:, seq_len:, dims:, heads:, layers:)
    super()
    self.token_embedding = MLX::NN::Embedding.new(vocab_size, dims)
    self.pos_embedding   = MLX::NN::Embedding.new(seq_len, dims)
    self.blocks = Array.new(layers) do
      MLX::NN::TransformerEncoderLayer.new(
        dims, heads,
        mlp_dims: dims * 4,
        dropout: 0.0,
        norm_first: true
      )
    end
    self.norm = MLX::NN::LayerNorm.new(dims)
    self.head = MLX::NN::Linear.new(dims, vocab_size)
    @causal_mask = MLX::NN::MultiHeadAttention
                     .create_additive_causal_mask(seq_len)
  end
 
  def call(input_ids)
    positions = mx.arange(0, input_ids.shape[1], 1, mx.int32)
    hidden = mx.add(
      token_embedding.call(input_ids),
      pos_embedding.call(positions)
    )
    blocks.each { |block| hidden = block.call(hidden, @causal_mask) }
    head.call(norm.call(hidden))
  end
end

Training it:

batch_size, seq_len, vocab_size = 16, 32, 65  # batch_size assumed; used in the loss below
model     = NanoGpt.new(vocab_size: vocab_size, seq_len: seq_len,
                        dims: 128, heads: 4, layers: 2)
optimizer = MLX::Optimizers::AdamW.new(learning_rate: 1e-3)
 
loss_and_grad = MLX::NN.value_and_grad(
  model,
  lambda do |ids, labels|
    logits = model.call(ids)
    logits2d = mx.reshape(logits, [batch_size * seq_len, vocab_size])
    labels1d = mx.reshape(labels, [batch_size * seq_len])
    MLX::NN.cross_entropy(logits2d, labels1d, reduction: "mean")
  end
)
 
loss, grads = loss_and_grad.call(input_ids, target_ids)
optimizer.update(model, grads)
mx.eval(loss, model.parameters, optimizer.state)

Read that code and tell me Ruby isn’t a natural fit for this kind of work. The model definition is a class. The forward pass is a method. The training loop uses standard Ruby iteration. Parameters are tracked automatically. Gradients flow through lambdas.

Where It Stands

This is a v1.0 release. The parity test suite runs 300+ tests comparing Ruby output against the upstream Python MLX implementation. Benchmarks cover transformers, CNNs, MLPs, RNNs, and a full Karpathy GPT-2 training loop — all runnable against both Ruby and Python to verify there’s no meaningful performance gap from the language binding.

Ruby developers have built extraordinary things: Rails changed how the world builds web applications. The language’s emphasis on developer happiness and expressive power isn’t a liability for serious computing — it’s an asset.

The reason Ruby hasn’t been used for ML isn’t that it can’t express these ideas well. It’s that nobody built the infrastructure. MLX Ruby is that infrastructure: a complete, native, GPU-accelerated machine learning framework that treats Ruby as a first-class citizen.

If you’ve ever wanted to train a model, experiment with neural architectures, or run LLM inference without leaving Ruby — now you can.

gem install mlx

GitHub | Documentation | RubyGems