Backend Dispatch Guide¶
This guide explains how NuMojo should route math operations internally while keeping the public API simple and stable.
Summary¶
NuMojo should expose clean user-facing functions like:
- `nm.sin(x)`
- `nm.add(x, y)`
- `nm.greater(x, y)`
Backend selection (CPU SIMD, CPU scalar fallback, future GPU kernels) should happen internally in one place, not in user-facing signatures.
Goals¶
- Keep public API minimal and ergonomic.
- Centralize execution policy and backend routing.
- Avoid duplicating loop logic across modules.
- Make future GPU integration additive (no API breakage).
Design Principles¶
1) Public API never exposes backend parameters¶
Do this:
`fn sin[dtype: DType](x: NDArray[dtype]) -> NDArray[dtype]`
Do not do this in public API:
`fn sin[dtype, backend](...)`
Backend choices are internal implementation details.
2) Use a shared execution engine¶
Implement generic executors once:
- unary elementwise apply
- binary elementwise apply
- compare apply
- reduction apply (later)
Operation wrappers should call these executors rather than re-implement loops.
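The shared-executor idea can be pictured with the self-contained sketch below. It is only an illustration: a plain `List[Float64]` stands in for `NDArray` so the example needs nothing beyond the standard library, and `apply_binary_list`, `add_op`, `mul_op`, `add`, and `mul` are hypothetical names rather than NuMojo internals. The elementwise loop is written once, and each op supplies only its scalar kernel.

```mojo
# Hypothetical stand-in: List[Float64] replaces NDArray to keep the sketch standalone.

fn add_op(a: Float64, b: Float64) -> Float64:
    return a + b


fn mul_op(a: Float64, b: Float64) -> Float64:
    return a * b


# The single place that owns the elementwise loop.
fn apply_binary_list[
    f: fn (Float64, Float64) -> Float64
](x: List[Float64], y: List[Float64]) raises -> List[Float64]:
    if len(x) != len(y):
        raise Error("apply_binary_list: length mismatch")
    var out = List[Float64](capacity=len(x))
    for i in range(len(x)):
        out.append(f(x[i], y[i]))
    return out^


# Op wrappers delegate to the executor instead of re-implementing the loop.
fn add(x: List[Float64], y: List[Float64]) raises -> List[Float64]:
    return apply_binary_list[add_op](x, y)


fn mul(x: List[Float64], y: List[Float64]) raises -> List[Float64]:
    return apply_binary_list[mul_op](x, y)
```

Adding another binary op under this scheme would mean writing one more scalar kernel and a one-line wrapper; the loop and its length check stay untouched.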
3) Dispatch policy lives in one internal layer¶
A central engine chooses execution path based on:
- dtype support
- contiguity/layout
- array size thresholds
- target device (future)
This keeps behavior consistent across all ops.
Recommended Layering¶
- `numojo/api/*`: user-facing functions (`sin`, `add`, etc.)
- `numojo/ops/*`: internal op wrappers and shared engines
- `numojo/ops/backend/*`: backend-specific implementations (CPU vectorized/scalar, future GPU)
Suggested flow:
- `api.math.sin(x)` calls
- `ops.elementwise.sin(x)`, which calls
- `ExecutionEngine.apply_unary(math.sin, x)`
- the engine dispatches to a backend implementation
Minimal Pattern (Sin Example)¶
Public API¶
```mojo
import math
from numojo.core.ndarray import NDArray
from numojo.ops.execution.engine import ExecutionEngine

# Public entry point: no backend parameter; routing is delegated to the engine.
fn sin[dtype: DType](x: NDArray[dtype]) raises -> NDArray[dtype]:
    return ExecutionEngine.apply_unary[dtype, math.sin](x)
```
Execution Engine¶
```mojo
from numojo.core.ndarray import NDArray
from numojo.ops.backend.cpu.vectorized import CpuVectorized
from numojo.ops.backend.cpu.scalar import CpuScalar

struct ExecutionEngine:
    @staticmethod
    fn apply_unary[
        dtype: DType,
        f: fn[type: DType, w: Int](SIMD[type, w]) -> SIMD[type, w],
    ](x: NDArray[dtype]) raises -> NDArray[dtype]:
        # Fast path: contiguous, large enough, floating-point input.
        if x.is_c_contiguous() and x.size >= 128 and dtype.is_floating_point():
            return CpuVectorized.apply_unary[dtype, f](x)
        # Everything else takes the correctness-first scalar fallback.
        return CpuScalar.apply_unary[dtype, f](x)
```
CPU Vectorized Backend¶
```mojo
from algorithm.functional import vectorize
from sys import simd_width_of
from numojo.core.ndarray import NDArray

struct CpuVectorized:
    @staticmethod
    fn apply_unary[
        dtype: DType,
        f: fn[type: DType, w: Int](SIMD[type, w]) -> SIMD[type, w],
    ](x: NDArray[dtype]) raises -> NDArray[dtype]:
        # Allocate the output (assumed here to take the input's shape).
        var out = NDArray[dtype](x.shape)
        comptime width = simd_width_of[dtype]()

        # Apply f to one SIMD chunk of width w starting at flat index i.
        @parameter
        fn body[w: Int](i: Int) unified {mut out, read x}:
            out._buf.ptr.store(i, f[dtype, w](x._buf.ptr.load[width=w](i)))

        vectorize[width](x.size, body)
        return out^
```
Why this is better than per-function backend wiring¶
- no backend noise in public signatures
- one place to tune thresholds and dispatch policy
- easier testing (engine behavior tested once)
- easier to add new ops (`sin`/`cos`/`exp`/... reuse the same unary path)
CPU Paths to Support Initially¶
- CPU vectorized path
    - contiguous arrays
    - supported dtypes
    - medium/large arrays
- CPU scalar fallback
    - unsupported dtype for SIMD path
    - non-contiguous or small arrays
    - correctness-first fallback
This gives robust behavior now and a stable architecture for future acceleration.
Future GPU Integration¶
When GPU kernels become available:
- Add backend modules:
    - `ops/backend/gpu/cuda.mojo`
    - `ops/backend/gpu/mps.mojo` (as needed)
- Extend engine policy:
    - if array is on GPU and op supported -> GPU kernel
    - else -> CPU path (copy or explicit error depending on policy)
- Public API stays unchanged.
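The extended policy could look roughly like the sketch below. It models only the decision, not any real kernel launch: `choose_device_path` and its three flags are hypothetical names, and the `allow_host_copy` switch is an assumption standing in for the "copy or explicit error" choice, which this guide leaves open.

```mojo
# Hypothetical device-level policy; returns a label instead of launching anything.
fn choose_device_path(
    on_gpu: Bool,
    gpu_kernel_available: Bool,
    allow_host_copy: Bool,
) raises -> String:
    # Arrays already on the host keep using the existing CPU routing.
    if not on_gpu:
        return String("cpu")
    # Array is on the GPU and the op has a kernel: stay on the device.
    if gpu_kernel_available:
        return String("gpu")
    # No kernel: either copy to the host or fail explicitly, per policy.
    if allow_host_copy:
        return String("cpu_after_copy")
    raise Error("op has no GPU kernel and host copy is disabled")
```

Because the decision lives behind the engine, a function such as `nm.sin(x)` would keep its signature whichever branch is taken.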
Dispatch Policy Recommendations¶
Keep policy deterministic and documented.
Suggested checks in order:
1. Validate dtype/op support.
2. Normalize layout assumptions (contiguous fast path vs fallback).
3. Evaluate the size threshold for vectorization.
4. Choose the backend implementation.
5. Fall back to scalar if the fast path is unavailable.
Use clear errors for unsupported op/dtype combinations.
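The ordered checks above can be sketched as a small pure function. This is an illustrative model under stated assumptions, not the engine's actual code: `choose_cpu_path` and its Boolean inputs are hypothetical, and the default threshold of 128 simply mirrors the value used in the engine example earlier in this guide.

```mojo
# Hypothetical policy function: inputs are precomputed facts about the array and op.
fn choose_cpu_path(
    op_supports_dtype: Bool,
    simd_supports_dtype: Bool,
    is_contiguous: Bool,
    size: Int,
    simd_threshold: Int = 128,
) raises -> String:
    # 1. Validate dtype/op support and fail with a clear error.
    if not op_supports_dtype:
        raise Error("unsupported op/dtype combination")
    # 2-3. Layout and size decide whether the fast path applies.
    if simd_supports_dtype and is_contiguous and size >= simd_threshold:
        # 4. Fast path chosen.
        return String("vectorized")
    # 5. Otherwise fall back to the scalar path.
    return String("scalar")
```

Keeping the policy free of side effects makes it deterministic and lets dispatch tests assert on the chosen path without running any kernel.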
Testing Strategy for Dispatch¶
What to test¶
- Correctness parity
    - vectorized path result == scalar path result
    - compare with NumPy reference where applicable
- Path coverage
    - contiguous arrays trigger vectorized path
    - strided/small arrays trigger fallback path
- Edge cases
    - empty arrays
    - singleton dimensions
    - non-finite values (`nan`, `inf`)
- Performance smoke tests
    - verify no major regressions for common sizes
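A parity check of the kind listed above might look like the following sketch. It is deliberately independent of NuMojo: a single `SIMD[DType.float64, 4]` value stands in for an array, `scalar_path` and `vector_path` are hypothetical helpers that model the two backends, and the standard library's `testing` and `utils.numerics` modules are assumed to be available in the Mojo version in use.

```mojo
from math import sin
from testing import assert_almost_equal
from utils.numerics import nan

# Models the scalar fallback: one lane at a time.
fn scalar_path(v: SIMD[DType.float64, 4]) -> SIMD[DType.float64, 4]:
    var out = SIMD[DType.float64, 4](0)
    for i in range(4):
        out[i] = sin(v[i])
    return out


# Models the vectorized backend: all lanes in one call.
fn vector_path(v: SIMD[DType.float64, 4]) -> SIMD[DType.float64, 4]:
    return sin(v)


fn test_sin_parity() raises:
    # Includes a NaN lane to cover the non-finite edge case.
    var v = SIMD[DType.float64, 4](0.0, 0.5, -2.0, nan[DType.float64]())
    var expected = scalar_path(v)
    var actual = vector_path(v)
    for i in range(4):
        # equal_nan=True treats NaN == NaN as a match for this comparison.
        assert_almost_equal(actual[i], expected[i], equal_nan=True)


fn main() raises:
    test_sin_parity()
```

The assertion compares results only, which matches the advice to test behavior rather than implementation details.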
Suggested test organization¶
- operation correctness tests stay under routine/module tests
- dispatch behavior tests in a focused internal test file
- avoid testing private details too tightly; assert behavior, not implementation quirks
Migration Plan (Incremental)¶
1. Introduce `ExecutionEngine.apply_unary` and `apply_binary`.
2. Migrate a small set of ops (`sin`, `cos`, `add`, `mul`) to validate the pattern.
3. Remove backend parameters from those public APIs.
4. Expand migration by module (`math`, then `logic`, then `bitwise`).
5. Remove obsolete backend plumbing once all callers are migrated.
This avoids a risky all-at-once refactor.
Common Pitfalls¶
- Reintroducing backend generics in public modules.
- Copy-pasting loop code into each op instead of reusing engine.
- Hiding unsupported dtype failures with silent behavior changes.
- Mixing dispatch policy across many files.
Recommended Naming¶
- `ExecutionEngine` for centralized routing/apply methods.
- `CpuVectorized`, `CpuScalar` for backend implementations.
- `apply_unary`, `apply_binary`, `apply_compare` for shared executors.
Keep names explicit and consistent.
Related Docs¶
- `docs/developer-guide/architecture.md`
- `docs/developer-guide/adding-functions.md`
- `docs/developer-guide/testing.md`
- `docs/developer-guide/style-guide.md`