Bivariate Analysis: Scatter Plots and Hexbins
Bivariate analysis examines the relationship between two variables: are they correlated, do they form clusters, are there non-linear patterns? Scatter plots reveal these patterns in small datasets; hexbin plots handle large datasets where overlapping points obscure the structure.
Scatter Plots
A scatter plot places each observation as a point in 2D space defined by two feature values. Adding a third dimension via color (hue) or size turns it into a trivariate display.
Scatter Plots with seaborn
<pre><code class="language-python">import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the data; the plots below use the "age", "income", and "churn" columns
df = pd.read_csv("data.csv")

# Basic scatter; partial transparency makes overlapping points read as darker regions
sns.scatterplot(data=df, x="age", y="income", alpha=0.4)
plt.title("Age vs Income")
plt.show()

# Colored by class
sns.scatterplot(data=df, x="age", y="income",
                hue="churn", palette="Set1", alpha=0.5)
plt.show()</code></pre>
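The size channel mentioned above can be combined with hue in the same call. The following is a minimal sketch on synthetic data rather than the original `data.csv`; the `demo` frame, the `tenure` column, and the output filename are assumptions made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for data.csv; "tenure" is a hypothetical extra column
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "tenure": rng.integers(1, 120, size=n),
    "churn": rng.choice(["yes", "no"], size=n),
})
demo["income"] = 20_000 + 1_000 * demo["age"] + rng.normal(0, 8_000, size=n)

fig, ax = plt.subplots(figsize=(7, 5))
# hue maps a column to color, size maps a column to marker area;
# sizes=(min, max) sets the range of marker sizes used
sns.scatterplot(data=demo, x="age", y="income",
                hue="churn", size="tenure", sizes=(20, 200),
                alpha=0.5, ax=ax)
ax.set_title("Age vs Income, colored by churn, sized by tenure")
fig.savefig("scatter_hue_size.png")
```

Marker size is harder to read precisely than position or color, so it tends to work best for a variable where only rough magnitude matters.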
Hexbin Plots for Large Datasets
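When a dataset has many thousands of points, overplotting makes a scatter plot unreadable: markers pile on top of each other and dense regions merge into a solid blob. A hexbin plot divides the 2D plane into hexagonal bins and colors each bin by the number of points it contains, so the density structure stays visible.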
Hexbin with matplotlib and seaborn
<pre><code class="language-python"># matplotlib hexbin: gridsize sets the number of hexagons along the x-axis
plt.hexbin(df["age"], df["income"], gridsize=30, cmap="YlOrRd")
plt.colorbar(label="Count")
plt.xlabel("Age")
plt.ylabel("Income")
plt.show()

# seaborn jointplot with hexbin: adds marginal histograms for each variable
sns.jointplot(data=df, x="age", y="income",
              kind="hex", cmap="Blues")
plt.show()</code></pre>
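To see what the bin colors encode, it can help to inspect the counts that `hexbin` computes. This is a sketch on synthetic data (the distributions, sample size, and output filename are assumptions, not taken from the original dataset). It relies on `hexbin` returning a collection whose `get_array()` holds one value per drawn hexagon.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data, large enough that a plain scatter plot would overplot
rng = np.random.default_rng(42)
n = 50_000
age = rng.normal(40, 12, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 15_000, size=n)

fig, ax = plt.subplots(figsize=(7, 5))
# mincnt=1 leaves empty hexagons blank instead of coloring them as zero
hb = ax.hexbin(age, income, gridsize=30, cmap="YlOrRd", mincnt=1)
fig.colorbar(hb, ax=ax, label="Count")
ax.set_xlabel("Age")
ax.set_ylabel("Income")

# The returned collection stores one count per drawn hexagon
counts = hb.get_array()
print(f"hexagons drawn: {counts.size}, fullest bin: {int(counts.max())}")
print(f"points binned: {int(counts.sum())} of {n}")
fig.savefig("hexbin_counts.png")
```

A larger `gridsize` gives finer bins but noisier counts; when a few bins dominate the color scale, passing `bins="log"` to `hexbin` is one option for compressing the range.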
Adding a Regression Line
<pre><code class="language-python"># Scatter with linear regression overlay
sns.regplot(data=df, x="age", y="income",
            scatter_kws={"alpha": 0.3}, line_kws={"color": "red"})
plt.title("Age vs Income with Regression Line")
plt.show()</code></pre>
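`sns.regplot` draws the fitted line but returns only the Axes, not the fitted coefficients. If the slope and intercept are needed as numbers, one option is to fit the same kind of straight line separately. The sketch below uses `numpy.polyfit` on synthetic data; the data-generating values are assumptions chosen for illustration, and this is a swapped-in way to get the numbers, not a description of how seaborn computes its fit internally.

```python
import numpy as np

# Synthetic data with a known linear trend plus noise
rng = np.random.default_rng(7)
n = 500
age = rng.uniform(18, 70, size=n)
income = 20_000 + 1_000 * age + rng.normal(0, 5_000, size=n)

# A degree-1 polynomial fit is a least-squares straight line;
# polyfit returns the coefficients highest power first: [slope, intercept]
slope, intercept = np.polyfit(age, income, deg=1)

# Pearson correlation coefficient as a summary of linear association
r = np.corrcoef(age, income)[0, 1]
print(f"income = {slope:.0f} * age + {intercept:.0f}  (r = {r:.2f})")
```

A straight line only summarizes a linear trend, so it is worth checking the scatter or hexbin plot for curvature or clusters before relying on the slope.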