Optimizing Python Math for Speed
Pure Python is slow at number crunching because every arithmetic operation passes through the interpreter. Yet AI models train with billions of operations per second. The trick is vectorization: offloading loops to NumPy's pre-compiled C code. Writing AI code that performs well requires knowing when Python is the bottleneck and how to eliminate it.
Replace Loops with Vectorized Operations
Every Python for loop carries interpreter overhead on each iteration: dynamic type checks, object lookups, and reference counting. A NumPy operation on an entire array pays that overhead once, then runs the loop in compiled C over a contiguous block of memory, where the CPU can often apply SIMD (single instruction, multiple data) instructions.
The Speed Gap
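A minimal sketch of how the gap could be measured, assuming a sum-of-squares workload; the function names and the array size are illustrative choices, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np


def python_sum_of_squares(values):
    """Sum of squares with an explicit Python loop."""
    total = 0.0
    for v in values:
        total += v * v
    return total


def numpy_sum_of_squares(arr):
    """Same computation as one vectorized call into compiled code."""
    return float(np.dot(arr, arr))


n = 200_000  # illustrative size, large enough for the gap to show
arr = np.arange(n, dtype=np.float64)
values = arr.tolist()  # plain Python floats for the loop version

# Run each version a few times and compare total wall-clock time.
loop_time = timeit.timeit(lambda: python_sum_of_squares(values), number=5)
vec_time = timeit.timeit(lambda: numpy_sum_of_squares(arr), number=5)

print(f"loop: {loop_time:.4f}s  vectorized: {vec_time:.4f}s")
```

Both functions return the same value; only the place where the loop runs changes. The ratio between the two timings depends on hardware, array size, and the NumPy build, so treat any single measurement as indicative rather than definitive.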
Memory Layout: C-order vs. F-order
NumPy arrays can store data in row-major (C-order, the default) or column-major (Fortran-order) layout. When you traverse an array against its layout, consecutive accesses land far apart in memory, so each cache line the CPU loads holds mostly data you don't need yet. The resulting cache misses can cause slowdowns of up to around 10x on large arrays. Always process data along the axis that matches its memory layout.
Checking and Using Memory Order
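One way to inspect an array's layout is through its flags and strides attributes. The sketch below assumes a small 3x4 float64 array; the variable names are illustrative.

```python
import numpy as np

# NumPy creates C-order (row-major) arrays by default.
c_arr = np.zeros((3, 4), dtype=np.float64)
# asfortranarray returns a column-major copy of the same data.
f_arr = np.asfortranarray(c_arr)

print(c_arr.flags["C_CONTIGUOUS"], c_arr.flags["F_CONTIGUOUS"])  # True False
print(f_arr.flags["C_CONTIGUOUS"], f_arr.flags["F_CONTIGUOUS"])  # False True

# Strides are the bytes to step per axis: (row step, column step).
print(c_arr.strides)  # (32, 8): elements within a row are adjacent
print(f_arr.strides)  # (8, 24): elements within a column are adjacent

# In a C-order array a row is one contiguous block; a column is not.
row = c_arr[0, :]
col = c_arr[:, 0]
print(row.flags["C_CONTIGUOUS"], col.flags["C_CONTIGUOUS"])  # True False

# Reducing along the last axis walks memory sequentially in C-order.
row_sums = c_arr.sum(axis=1)

# If an array arrives in the wrong layout for the access pattern,
# convert it once up front rather than paying on every pass.
fixed = np.ascontiguousarray(f_arr)
```

Converting with np.ascontiguousarray copies the data when the layout differs, so it tends to pay off only when the array is traversed many times afterwards.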
Profiling to Find Real Bottlenecks
Don't guess where your code is slow; measure it. IPython's %timeit magic (in Jupyter) times individual snippets, and the standard library's cProfile (in scripts) shows which functions a program spends its time in. Optimize only the hot path; the rest isn't worth the complexity.
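As a rough script-side sketch, cProfile can be combined with pstats to rank functions by cumulative time, and the timeit module can stand in for %timeit outside a notebook. The functions slow_normalize, fast_normalize, and pipeline are hypothetical examples, not part of any library.

```python
import cProfile
import io
import pstats
import timeit

import numpy as np


def slow_normalize(values):
    """Subtract the mean using pure Python (hypothetical hot path)."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def fast_normalize(arr):
    """Same result with vectorized NumPy operations."""
    return arr - arr.mean()


def pipeline(arr):
    """Hypothetical workload that mixes a slow and a fast step."""
    slow = slow_normalize(arr.tolist())
    fast = fast_normalize(arr)
    return slow, fast


data = np.arange(100_000, dtype=np.float64)

# Profile one run of the pipeline.
profiler = cProfile.Profile()
profiler.enable()
slow_result, fast_result = pipeline(data)
profiler.disable()

# Write the five most expensive entries (by cumulative time) to a string.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)

# Script equivalent of %timeit for a single function.
per_call = timeit.timeit(lambda: fast_normalize(data), number=100) / 100
print(f"fast_normalize: {per_call * 1e6:.1f} us per call")
```

In a report like this, the pure-Python step would be expected to account for most of the cumulative time, which marks it as the part worth vectorizing.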