Optimizing Python Math for Speed

Pure Python is slow for numeric loops because every operation passes through the interpreter. Yet AI models train at billions of operations per second. The trick is vectorization: moving loops out of Python and into NumPy's pre-compiled C code. Writing AI code that performs well requires knowing when Python is the bottleneck and how to eliminate it.

Replace Loops with Vectorized Operations

Every iteration of a Python for loop carries interpreter overhead: dynamic type checks, object lookups, and reference counting. A NumPy operation on an entire array pays that overhead once per call rather than once per element; the loop itself runs in compiled C, which can often use the CPU's SIMD (single instruction, multiple data) instructions.

The Speed Gap

<pre><code class="language-python">import time

import numpy as np

arr = np.arange(1_000_000, dtype=float)

# Loop version (slow)
start = time.perf_counter()
result = [x ** 2 for x in arr]
print(f"Loop: {time.perf_counter() - start:.3f}s")  # ~0.25s

# Vectorized version (fast)
start = time.perf_counter()
result = arr ** 2
print(f"NumPy: {time.perf_counter() - start:.3f}s")  # ~0.002s

# ~100x faster (exact timings vary by machine)
</code></pre>
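A single timed run is noisy, so the exact figures will differ from machine to machine. As a rough sketch, the standard-library `timeit` module can repeat each version and keep the best run. The helper names `loop_square` and `vectorized_square` are illustrative, and the array is smaller than the one above only to keep the run short.

```python
import timeit

import numpy as np

arr = np.arange(100_000, dtype=float)


def loop_square(values):
    """Square each element one at a time in the Python interpreter."""
    return [x ** 2 for x in values]


def vectorized_square(values):
    """Square the whole array in one call to NumPy's compiled code."""
    return values ** 2


# Both versions should produce the same numbers.
assert np.allclose(loop_square(arr), vectorized_square(arr))

# Run each version 5 times per measurement and keep the best of 3 measurements.
loop_time = min(timeit.repeat(lambda: loop_square(arr), number=5, repeat=3))
numpy_time = min(timeit.repeat(lambda: vectorized_square(arr), number=5, repeat=3))

print(f"Loop:  {loop_time:.4f}s")
print(f"NumPy: {numpy_time:.4f}s")
print(f"Speedup: {loop_time / numpy_time:.0f}x")
```

Taking the minimum of several repeats filters out runs that were slowed by other activity on the machine, which makes the comparison more stable than a single measurement.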

Memory Layout: C-order vs. F-order

NumPy arrays can store data in row-major (C-order) or column-major (Fortran-order) layout. When you traverse an array against its layout, each cache line the CPU loads holds mostly elements you won't use yet, so cache misses pile up and the code can run several times slower (up to around 10x, depending on array size and hardware). Where possible, process data along the axis that is contiguous in memory.

Checking and Using Memory Order

<pre><code class="language-python">import numpy as np

# Default is C-order (row-major)
A = np.ones((1000, 1000))  # C-contiguous
print(A.flags['C_CONTIGUOUS'])  # True

# Sum along rows (axis=1) reads contiguous memory in C-order
# Sum along columns (axis=0) accesses non-contiguous memory

# When you transpose, you get a view that is F-contiguous
B = A.T
print(B.flags['F_CONTIGUOUS'])  # True

# To force a fresh copy in C-order after transpose:
B_fast = np.ascontiguousarray(A.T)
</code></pre>
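One way to see the layout directly is to inspect `strides`, the number of bytes NumPy steps to reach the next element along each axis. The sketch below assumes the default `float64` dtype (8 bytes per element) and uses a hypothetical helper, `sum_by_slices`, to compare walking a C-order array by rows and by columns. How large the timing gap is depends on hardware and array size.

```python
import timeit

import numpy as np

A = np.ones((1000, 1000))  # C-order, float64

# In C-order the last axis is contiguous: the next column is 8 bytes away,
# the next row is 1000 * 8 bytes away.
print(A.strides)    # (8000, 8)
print(A.T.strides)  # (8, 8000): the transpose is a view with swapped strides


def sum_by_slices(matrix, by_rows):
    """Add up the matrix one row slice (or one column slice) at a time."""
    count = matrix.shape[0] if by_rows else matrix.shape[1]
    total = 0.0
    for i in range(count):
        chunk = matrix[i, :] if by_rows else matrix[:, i]
        total += chunk.sum()
    return total


row_time = min(timeit.repeat(lambda: sum_by_slices(A, True), number=5, repeat=3))
col_time = min(timeit.repeat(lambda: sum_by_slices(A, False), number=5, repeat=3))

print(f"Row slices:    {row_time:.4f}s")
print(f"Column slices: {col_time:.4f}s")
```

Each row slice of a C-order array is one contiguous run of memory, while each column slice picks one element out of every row, so the column version touches far more cache lines for the same amount of arithmetic.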

Profiling to Find Real Bottlenecks

Don't guess where your code is slow; measure it. IPython's %timeit magic (in Jupyter) and Python's cProfile module (in scripts) show where the time actually goes. Optimise only the hot path; the rest isn't worth the complexity.

Quick Profiling Techniques

<pre><code class="language-python">import cProfile

import numpy as np


def my_function():
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    return a @ b


# Profile the function
cProfile.run('my_function()', sort='cumulative')

# In Jupyter, use the magic command:
# %timeit my_function()
</code></pre>
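In a script, `cProfile.run` prints every function it recorded, which can be a lot of output. A minimal sketch using the standard-library `pstats` module to keep only the most expensive entries follows. It reuses `my_function` from above, and the limit of five rows is an arbitrary choice.

```python
import cProfile
import io
import pstats

import numpy as np


def my_function():
    a = np.random.rand(1000, 1000)
    b = np.random.rand(1000, 1000)
    return a @ b


# Record profiling data only while my_function runs.
profiler = cProfile.Profile()
profiler.enable()
result = my_function()
profiler.disable()

# Write the report to a string buffer, sorted by cumulative time,
# and keep only the top five entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Sorting by cumulative time puts the functions that account for the most total time, including the calls they make, at the top of the report, which is usually where the hot path starts.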