RFM Feature Engineering
Recency, Frequency, and Monetary (RFM) features are a long-established basis for customer segmentation. Recency is the number of days since a customer's last purchase, Frequency is the number of purchases, and Monetary is the total spend.
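In practice these three values are usually aggregated from a transaction log. The following is a minimal sketch of that aggregation, assuming a hypothetical table with `customer_id`, `order_date`, and `amount` columns; the column names, the toy rows, and the choice of snapshot date are illustrative assumptions, not part of the original material.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase (column names are assumptions)
transactions = pd.DataFrame({
    'customer_id': [1, 1, 2, 3, 3, 3],
    'order_date': pd.to_datetime([
        '2024-01-05', '2024-03-01', '2024-02-10',
        '2024-01-20', '2024-02-15', '2024-03-10',
    ]),
    'amount': [50.0, 20.0, 200.0, 30.0, 45.0, 25.0],
})

# Reference date for recency: here, one day after the latest purchase in the log
snapshot = transactions['order_date'].max() + pd.Timedelta(days=1)

# One row per customer with the three RFM values
rfm = transactions.groupby('customer_id').agg(
    recency=('order_date', lambda d: (snapshot - d.max()).days),  # days since last purchase
    frequency=('order_date', 'count'),                            # number of purchases
    monetary=('amount', 'sum'),                                   # total spend
)
print(rfm)
```

The snapshot date matters: recency is only comparable across customers when all of them are measured against the same reference date.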
Building an RFM Dataset
<pre><code class="language-python">import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
# Simulate customer RFM data
np.random.seed(42)
n = 500
df = pd.DataFrame({
    'recency': np.random.exponential(30, n).astype(int) + 1,
    'frequency': np.random.poisson(5, n) + 1,
    'monetary': np.random.lognormal(5, 1, n)
})
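# Scale the three RFM features (selected by name so columns added later are excluded)
scaler = StandardScaler()
X_rfm = scaler.fit_transform(df[['recency', 'frequency', 'monetary']])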
print(df.describe().round(2))</pre>
Segmenting with K-Means
Apply K-Means to standardized RFM features, then profile each segment by interpreting cluster centroids in the original feature space.
Clustering and Profiling
<pre><code class="language-python">from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Find the K (2 to 8) with the highest silhouette score
ks = range(2, 9)
scores = [
    silhouette_score(
        X_rfm,
        KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_rfm),
    )
    for k in ks
]
best_k = ks[scores.index(max(scores))]
print(f"Best K: {best_k}")
km = KMeans(n_clusters=best_k, n_init=10, random_state=42)
df['segment'] = km.fit_predict(X_rfm)
# Profile segments in original scale
profile = df.groupby('segment')[['recency','frequency','monetary']].mean().round(1)
print(profile)</pre>
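Another way to read the segments in the original feature space is to map the fitted centroids back through the scaler. The sketch below re-creates simulated data so it runs on its own; the variable names and the choice of four clusters are assumptions made for illustration, not tuned values. Because standardization is an affine transform, the back-transformed centroids should closely match the per-segment means when K-Means has fully converged.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Re-create simulated RFM data so the sketch runs on its own
rng = np.random.default_rng(42)
n = 500
rfm = pd.DataFrame({
    'recency': rng.exponential(30, n).astype(int) + 1,
    'frequency': rng.poisson(5, n) + 1,
    'monetary': rng.lognormal(5, 1, n),
})

scaler = StandardScaler()
X = scaler.fit_transform(rfm)

# n_clusters=4 is an arbitrary choice for illustration, not a tuned value
km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# Undo the standardization so centroids read in days, purchases, and currency
centroids = pd.DataFrame(
    scaler.inverse_transform(km.cluster_centers_),
    columns=rfm.columns,
).round(1)

# Segment sizes help judge how much weight each centroid deserves
centroids['size'] = np.bincount(km.labels_, minlength=4)
print(centroids)
```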
Segment Interpretation
After profiling, give each segment a meaningful label. For example, low recency (a recent purchase) combined with high frequency and high monetary value suggests Champions, while high recency, low frequency, and low monetary value suggests At Risk customers. These labels guide targeted marketing actions for each group.
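The labeling rule described above can be sketched as a small function applied to the segment profile. Everything here is illustrative: the profile values, the reference values, the `label_segment` helper, and the `Other` fallback label are hypothetical, and real labels would normally be set together with domain experts.

```python
import pandas as pd

# Hypothetical segment profile (mean RFM per segment); values are made up
profile = pd.DataFrame(
    {
        'recency': [8.0, 75.0, 30.0],
        'frequency': [9.0, 2.0, 5.0],
        'monetary': [420.0, 60.0, 150.0],
    },
    index=pd.Index([0, 1, 2], name='segment'),
)

# Hypothetical reference point, e.g. averages over all customers (assumed values)
ref = pd.Series({'recency': 31.0, 'frequency': 6.0, 'monetary': 240.0})

def label_segment(row, ref):
    """Label a segment by comparing its means to the reference (assumed rule)."""
    recent = row['recency'] < ref['recency']        # low recency = bought recently
    frequent = row['frequency'] > ref['frequency']
    high_value = row['monetary'] > ref['monetary']
    if recent and frequent and high_value:
        return 'Champions'
    if not recent and not frequent and not high_value:
        return 'At Risk'
    return 'Other'  # mixed signals; needs a closer look

profile['label'] = profile.apply(lambda row: label_segment(row, ref), axis=1)
print(profile)
```

With these made-up numbers, segment 0 is labeled Champions, segment 1 At Risk, and segment 2 falls into the fallback group because its signals are mixed.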
Visualizing Segments
PCA or t-SNE can project the standardized customer features to 2D for visual inspection of the clusters.
2D PCA Visualization
<pre><code class="language-python">from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
pca = PCA(n_components=2, random_state=42)
X_2d = pca.fit_transform(X_rfm)
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=df['segment'], cmap='tab10', alpha=0.6)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%})')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%})')
plt.title('Customer Segments (PCA Projection)')
plt.colorbar(label='Segment')
plt.show()</pre>
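For the t-SNE alternative mentioned earlier, a comparable projection can be sketched with scikit-learn's `TSNE`. The stand-in data and the parameter values are assumptions chosen for illustration. Unlike PCA, t-SNE axes carry no explained-variance interpretation, and distances between well-separated groups in the embedding should be read with caution.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Stand-in for standardized RFM features so the sketch runs on its own
rng = np.random.default_rng(42)
raw = np.column_stack([
    rng.exponential(30, 300),   # recency-like values
    rng.poisson(5, 300) + 1,    # frequency-like values
    rng.lognormal(5, 1, 300),   # monetary-like values
])
X = StandardScaler().fit_transform(raw)

# perplexity=30 is the scikit-learn default; it must be smaller than the sample count
tsne = TSNE(n_components=2, perplexity=30, init='pca', random_state=42)
X_tsne = tsne.fit_transform(X)
print(X_tsne.shape)
```

The resulting two columns can be passed to `plt.scatter` in the same way as the PCA coordinates, colored by segment.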