Introduction to Pandas DataFrames
The pandas DataFrame is the workhorse data structure for ML preprocessing in Python: a labeled, two-dimensional table that supports indexing, filtering, and aggregation. Fluency with it is a prerequisite for most practical data work.
Creating and Inspecting DataFrames
DataFrames can be created from dictionaries, CSV files, databases, or NumPy arrays. The first step with any new dataset is always inspection — shape, dtypes, and a few sample rows.
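As a minimal sketch of two of these construction routes, the example below builds one small frame from a dictionary and another from a NumPy array, then runs the basic inspection calls. The variable names, column names, and values are made up for illustration.

```python
import numpy as np
import pandas as pd

# From a dictionary: keys become column names (illustrative data)
people = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe"],
    "age": [28, 35, 42],
    "income": [48000.0, 61000.0, 75000.0],
})

# From a NumPy array: column labels are supplied explicitly
arr = np.arange(6).reshape(3, 2)
grid = pd.DataFrame(arr, columns=["x", "y"])

print(people.shape)   # (rows, cols)
print(people.dtypes)  # one dtype per column
print(grid.head(2))   # first 2 rows
```

Passing `columns=` for the array case is optional; without it pandas labels the columns with integers starting at 0.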
Loading and Inspecting Data
<pre><code class="language-python">import pandas as pd
df = pd.read_csv("data.csv")
print(df.shape) # (rows, cols)
print(df.dtypes) # column types
print(df.head()) # first 5 rows
df.info() # prints non-null counts and types itself (returns None)
print(df.describe()) # summary statistics</code></pre>
Selecting and Filtering
<pre><code class="language-python"># Column selection
df["age"] # single column (Series)
df[["age", "income"]] # multiple columns
# Row filtering
df[df["age"] > 30]
df.loc[df["income"] > 50000, ["name", "income"]]
df.iloc[0:5] # first 5 rows by position</code></pre>
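The snippets above assume `df` is already loaded. A self-contained sketch on a small made-up frame shows what each form returns, and adds one case the snippets do not cover: combining conditions with `&`, where each condition needs its own parentheses. The data and variable names here are illustrative only.

```python
import pandas as pd

# Illustrative data; the columns mirror those used in the snippets above
df = pd.DataFrame({
    "name": ["Ana", "Ben", "Chloe", "Dev"],
    "age": [28, 35, 42, 31],
    "income": [48000, 61000, 75000, 45000],
})

ages = df["age"]                  # single column -> Series
subset = df[["age", "income"]]    # list of columns -> DataFrame

over_30 = df[df["age"] > 30]      # boolean mask keeps matching rows
high_earners = df.loc[df["income"] > 50000, ["name", "income"]]

# Combine masks with & (and) or | (or); parenthesize each condition
both = df[(df["age"] > 30) & (df["income"] > 50000)]

first_two = df.iloc[0:2]          # positional slice, end-exclusive
```

Note that `.iloc` slices exclude the end position, while label-based `.loc` slices include it.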
Modifying DataFrames
DataFrames support in-place and out-of-place transformations. Understanding when pandas returns a view versus a copy prevents the common SettingWithCopyWarning.
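One way to stay clear of the warning is sketched below, assuming a small made-up frame: assign through a single `.loc` call instead of chaining two indexing steps, and call `.copy()` when a subset is meant to be independent. The exact warning behavior for the chained form varies across pandas versions, so that line is left as a comment rather than executed.

```python
import pandas as pd

# Illustrative data
df = pd.DataFrame({
    "age": [28, 35, 42],
    "income": [48000, 61000, 75000],
})

# Risky: chained indexing assigns into a temporary object, so the
# original frame may be left unchanged (and pandas warns about it):
#   df[df["age"] > 40]["senior"] = True

# Safer: one .loc call selects rows and column in a single step
df["senior"] = False
df.loc[df["age"] > 40, "senior"] = True

# When a subset should be independent, copy it explicitly
over_30 = df[df["age"] > 30].copy()
over_30["income"] = over_30["income"] * 2  # df is not affected
```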
Adding, Renaming, and Dropping Columns
<pre><code class="language-python">import numpy as np
# Add a derived column
df["age_squared"] = df["age"] ** 2
# Rename columns (returns a new DataFrame)
df = df.rename(columns={"old_name": "new_name"})
# Drop columns (returns a new DataFrame)
df = df.drop(columns=["unnecessary_col"])
# Apply a function element-wise: natural log (assumes income > 0)
df["log_income"] = df["income"].apply(np.log)</code></pre>