Identifying and Dropping Missing Data

Missing values are nearly universal in real-world datasets. Left unhandled, they cause many ML algorithms to raise errors, or they silently corrupt model training. Identifying and strategically handling them is one of the first steps in any preprocessing pipeline.


Detecting Missing Values

Pandas marks missing values with NaN in float columns, None in object columns, and NaT in datetime columns; the nullable extension dtypes use pd.NA. The isna() method and its alias isnull() return boolean masks that make it easy to quantify missingness across columns.
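As a minimal sketch of what those masks look like, the toy frame below (hypothetical data, not taken from any real dataset) mixes several missing-value markers. It also includes placeholder strings such as "" and "n/a", which pandas does not treat as missing unless they are converted first.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) mixing the common missing-value markers.
df_demo = pd.DataFrame({
    "age": [34.0, np.nan, 51.0],                                   # float NaN
    "city": ["Oslo", None, "Lima"],                                # None in a text column
    "signup": pd.to_datetime(["2021-01-05", None, "2021-03-09"]),  # None becomes NaT
    "note": ["ok", "", "n/a"],                                     # placeholders, NOT missing to pandas
})

# Boolean DataFrame with the same shape as df_demo: True where a value is missing.
mask = df_demo.isna()
print(mask)

# isnull() is an alias of isna(), so the two masks are identical.
same = mask.equals(df_demo.isnull())
print(same)
```

If a dataset encodes gaps with placeholder strings like these, they need to be mapped to real missing values (for example via the na_values argument of pd.read_csv) before the masks will count them.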

Quantifying Missingness

<pre><code class="language-python">import pandas as pd

df = pd.read_csv("data.csv")

# Count missing values per column
print(df.isnull().sum())

# Percentage of missing values per column, worst first
missing_pct = df.isnull().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Names of columns with ANY missing values
cols_with_missing = df.columns[df.isnull().any()].tolist()
print(cols_with_missing)</code></pre>

Dropping Missing Data

Dropping rows or columns with missing values is reasonable when missingness is rare (a common rule of thumb is under about 5% of rows) or when a column is so incomplete that it carries little usable signal. Dropping carelessly, however, discards information and can introduce bias.
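Per-column rates can understate the cost of dropping rows, because gaps in different columns often fall on different rows. The sketch below uses made-up data (the column names and values are illustrative assumptions) to compare the per-column missing fraction with the fraction of rows that a plain dropna() would remove.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each column is only 10-20% missing.
df_demo = pd.DataFrame({
    "age":    [25, np.nan, 47, 33, 52, 29, 41, 38, 60, 45],
    "income": [50, 62, np.nan, 48, np.nan, 55, 70, 66, 58, 61],
    "score":  [0.7, 0.4, 0.9, np.nan, 0.6, 0.8, 0.5, 0.3, 0.9, 0.2],
})

# Fraction of missing values in each column.
col_missing = df_demo.isna().mean()

# Fraction of rows with at least one missing value,
# i.e. the share of rows that df_demo.dropna() would discard.
rows_lost = df_demo.isna().any(axis=1).mean()

print(col_missing)
print(f"Rows lost by dropna(): {rows_lost:.0%}")
```

Checking the row-level figure before dropping is a cheap way to see whether a "rare" per-column rate still translates into a large loss of training data.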

Using dropna Effectively

<pre><code class="language-python">import math

# Drop rows with ANY missing value
df_rows = df.dropna()
print(f"Rows before: {len(df)}, after: {len(df_rows)}")

# Drop rows where SPECIFIC columns are null
df_subset = df.dropna(subset=["age", "income"])

# Drop columns missing more than 40% of values.
# thresh is the minimum number of non-null values a column needs to be kept.
min_non_null = math.ceil(len(df) * 0.6)
df_cols = df.dropna(axis=1, thresh=min_non_null)
print(f"Columns before: {df.shape[1]}, after: {df_cols.shape[1]}")</code></pre>

When NOT to Drop

If values are missing not at random (MNAR), meaning the chance that a value is missing depends on the value itself (e.g., high-income respondents skip the income field), dropping those rows biases the training data. In such cases, imputation or missingness indicator features are preferable. Always analyze why data is missing before deciding how to handle it.
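One way to act on this advice is sketched below, under stated assumptions: the survey data is invented, and median imputation is just one simple choice rather than a recommended default. The sketch first compares another variable across the missing and non-missing groups, then records missingness as an indicator feature before filling the gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; some respondents skipped the "income" field.
df_demo = pd.DataFrame({
    "age":    [23, 35, 58, 41, 62, 30],
    "income": [32_000, np.nan, 54_000, 61_000, np.nan, 45_000],
})

# 1. Look for a pattern: does another variable differ between rows
#    where income is missing and rows where it is present?
income_missing = df_demo["income"].isna()
print(df_demo.groupby(income_missing)["age"].mean())

# 2. Preserve the missingness information as an indicator feature.
df_demo["income_missing"] = income_missing.astype(int)

# 3. Impute the gaps. The median is used here only as a simple example;
#    in a real pipeline it should be computed on the training split only.
median_income = df_demo["income"].median()
df_demo["income"] = df_demo["income"].fillna(median_income)

print(df_demo)
```

A clear difference between the two groups in step 1 suggests the data is not missing completely at random, although it cannot by itself prove that missingness depends on the unobserved income values.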