Comprehensive Guide to Exploratory Data Analysis and Data Preprocessing

October 9, 2025 · Jonesh Shrestha

📌TL;DR

This comprehensive tutorial covers exploratory data analysis and preprocessing techniques using the Adult Census dataset (32K+ records). Key topics include: handling missing values with imputation and deletion strategies, creating visualizations (histograms, bar plots, box plots, scatter plots), analyzing feature distributions and relationships, detecting outliers using IQR method, and applying statistical tests (chi-square, t-test, ANOVA). The workflow demonstrates data type conversions, correlation analysis, and preparing clean data for machine learning models. Perfect for understanding essential data science preprocessing steps before model building.

Introduction

Before building any machine learning model, you must understand your data. Exploratory Data Analysis (EDA) and preprocessing are critical steps that often determine project success more than algorithm choice. In this tutorial, I'll walk you through a comprehensive EDA and preprocessing workflow using the Adult Census dataset, demonstrating techniques for handling missing values, visualizing distributions, analyzing relationships, and preparing data for modeling.

Understanding the Dataset

The Adult Census dataset contains demographic and employment information for 32,561 individuals with 12 attributes:

  • age: Individual's age
  • workclass: Employment type (Private, Public, Self-employed)
  • education: Education level
  • education-num: Numeric encoding of education
  • marital-status: Marital status
  • occupation: Job type
  • relationship: Relationship status
  • race: Racial category
  • sex: Gender
  • hours-per-week: Weekly work hours
  • native-country: Country of origin
  • income: Income level (≤50K or >50K)

Initial Data Exploration

import numpy as np
import pandas as pd

adult_census_df = pd.read_csv("adult-modified-09-13-2025.csv")
adult_census_df.head(10)
adult_census_df.info()

Using info() provides crucial initial insights:

  • Data types of each column
  • Non-null counts (identifying potential missing values)
  • Memory usage

Our dataset showed no null values initially, but further investigation revealed that missing values were encoded as '?' rather than standard null indicators.

Discovering Hidden Missing Values

for col in adult_census_df.columns:
    print(col)
    print(adult_census_df[col].unique())

By examining unique values for each column, I discovered '?' symbols representing missing data. This is a common real-world scenario: missing values aren't always null; they might be encoded as special strings, -999, or other placeholders.

(adult_census_df == "?").sum()

This revealed:

  • age: 725 missing
  • workclass: 1,836 missing
  • occupation: 1,843 missing
  • native-country: 583 missing

Handling Missing Values

Converting Placeholders to Actual Missing Values

# Replace the '?' placeholder with a proper missing-value marker
adult_census_df.replace("?", np.nan, inplace=True)

This standardizes missing value representation, allowing pandas' built-in methods to handle them properly.
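
To confirm the conversion worked, you can re-count the missing values with pandas' built-in null check; the counts should now match the earlier '?' counts (a quick sanity check, not shown in the original code):

adult_census_df.isnull().sum()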

Strategy: Mean Imputation for Numeric Features

adult_census_df["age"] = adult_census_df["age"].astype(float)
age_mean = adult_census_df["age"].mean()
adult_census_df.fillna({"age": age_mean}, axis=0, inplace=True)

Why mean imputation for age?

  • Preserves the distribution's center
  • Doesn't create impossible values
  • Simple and interpretable

However, note that this reduces variance and can bias relationships. For critical applications, consider more sophisticated imputation (median, KNN imputation, or predictive models).
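
If you preferred one of those alternatives, here is a minimal sketch using scikit-learn's imputers; it assumes scikit-learn is installed and is illustrative rather than part of the original workflow:

from sklearn.impute import SimpleImputer, KNNImputer

# Median imputation: more robust to the skew and outliers seen in age
median_imputer = SimpleImputer(strategy="median")
adult_census_df[["age"]] = median_imputer.fit_transform(adult_census_df[["age"]])

# KNN imputation estimates each missing value from similar rows.
# It requires an all-numeric frame (e.g. the one-hot encoded frame built later):
# knn_imputer = KNNImputer(n_neighbors=5)
# imputed_array = knn_imputer.fit_transform(numeric_only_df)  # numeric_only_df is hypothetical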

Strategy: Drop Rows for Categorical Features

adult_census_df.drop(adult_census_df[adult_census_df.workclass.isnull()].index, axis=0, inplace=True)
adult_census_df.drop(adult_census_df[adult_census_df.occupation.isnull()].index, axis=0, inplace=True)

For categorical features like workclass and occupation, imputation is tricky: there's no meaningful "average occupation." Dropping rows is often safer than introducing bias through arbitrary imputation.

Trade-off: We lose some data but maintain data quality. We went from 32,561 to approximately 30,000 samples, which is still plenty for analysis.

Feature Selection and Removal

adult_census_df.drop(columns=["education", "native-country"], inplace=True)

I removed:

  • education: Redundant because education-num provides the same information numerically
  • native-country: High cardinality (many unique countries) with most being rare, making it less useful for modeling

This demonstrates an important principle: more features aren't always better. Redundant or high-cardinality categorical features can hurt model performance.

Numeric Feature Analysis

Statistical Summary

adult_census_df.describe()

For numeric attributes:

  • age: Mean ~37 years, std ~13 years, range 17-90
  • education-num: Mean ~10 (approximately HS graduate level)
  • hours-per-week: Mean ~40 hours (standard full-time)

Box Plots for Distribution and Outliers

import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 3, figsize=(15, 5))

ax[0].boxplot(adult_census_df["age"])
ax[0].set_title("boxplot of age")

ax[1].boxplot(adult_census_df["hours-per-week"])
ax[1].set_title("boxplot of hours-per-week")

ax[2].boxplot(adult_census_df["education-num"])
ax[2].set_title("boxplot of education-num")

Box plots reveal:

  • Central tendency: Median shown as horizontal line
  • Dispersion: Box shows interquartile range (IQR, middle 50% of data)
  • Skewness: Asymmetric whiskers indicate skew
  • Outliers: Points beyond whiskers

Age analysis: Median ~37, positive skew (longer upper whisker), outliers above 75. This tells us the population skews younger with some elderly individuals.

Hours-per-week analysis: Median ~40, minimal skew, outliers on both ends. Most people work standard hours, but some work very few or very many hours.

Education-num analysis: Median ~10, slight negative skew, few low-education outliers. Most people completed at least some high school.
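
The TL;DR mentions IQR-based outlier detection; as a minimal sketch (not necessarily the exact code from the notebook), the 1.5 × IQR rule that the box plot whiskers use can be computed directly:

# Interquartile range for age
q1 = adult_census_df["age"].quantile(0.25)
q3 = adult_census_df["age"].quantile(0.75)
iqr = q3 - q1

# Points beyond these bounds are the same outliers the box plot flags
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

age_outliers = adult_census_df[
    (adult_census_df["age"] < lower_bound) | (adult_census_df["age"] > upper_bound)
]
print(len(age_outliers))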

Histograms for Detailed Distribution

fig, ax = plt.subplots(1, 3, figsize=(15, 5))

ax[0].hist(adult_census_df["age"], bins=15)
ax[1].hist(adult_census_df["hours-per-week"], bins=15)
ax[2].hist(adult_census_df["education-num"], bins=15)

Histograms complement box plots by showing:

  • Shape of distribution: Normal, uniform, multimodal?
  • Concentration: Where most values cluster
  • Gaps: Discontinuities in the data

The age histogram confirms positive skew with concentration in 25-45 range. Hours-per-week shows strong peak at 40 (standard work week). Education-num shows discrete levels with peaks at specific education milestones.

Scatter Plots for Relationships

plt.scatter(adult_census_df["age"], adult_census_df["education-num"])
plt.xlabel("age")
plt.ylabel("education-num")

This scatter plot revealed no strong linear relationship between age and education level, suggesting that in this population, education is not strongly correlated with age (people of all ages have various education levels).
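
To put a number on that visual impression, the Pearson correlation between the two columns can be computed directly (a quick check, not from the original notebook):

# Values near 0 indicate little linear relationship
adult_census_df["age"].corr(adult_census_df["education-num"])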

Categorical Feature Analysis

Frequency Analysis with Bar Charts

workclass_freq = adult_census_df["workclass"].value_counts()
sex_freq = adult_census_df["sex"].value_counts()

fig, ax = plt.subplots(7, 1, figsize=(8, 35), constrained_layout=True)

ax[0].bar(workclass_freq.index, workclass_freq.values)
ax[0].set_title("Workclass")

ax[1].bar(sex_freq.index, sex_freq.values)
ax[1].set_title("Sex")
# ... the remaining subplots repeat the same pattern for marital-status, occupation, relationship, race, and income

Bar charts for categorical features show:

  • Dominant categories: Private workclass dominates
  • Class imbalance: Many more males than females in dataset
  • Rare categories: Some categories have very few samples

Key insights:

  • Private sector dominates employment
  • More married individuals than unmarried
  • Prof-specialty and craft-repair are the most common occupations
  • Racial distribution heavily skewed toward White
  • Significant gender imbalance (more males)
  • Most people earn ≤50K

This information is crucial for modeling: class imbalance requires special handling to avoid biased predictions.
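
One quick way to quantify that imbalance is to look at class proportions directly (a minimal check using pandas):

# Proportion of each income class; a strong skew toward <=50K signals imbalance
adult_census_df["income"].value_counts(normalize=True)

# Same idea for sex
adult_census_df["sex"].value_counts(normalize=True)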

Cross-Tabulation Analysis

Examining Relationships Between Categorical Variables

cross_tab_workclass_income = pd.crosstab(
    adult_census_df["workclass"],
    adult_census_df["income"]
)

cross_tab_workclass_income.plot(kind="bar")

Cross-tabulation reveals how categories relate:

  • Workclass vs. Income: Shows whether certain employment types correlate with higher income
  • Race vs. Income: Reveals income disparities across racial groups

Percentage Analysis

cross_tab_race_income = pd.crosstab(adult_census_df["race"], adult_census_df["income"])

race_income_percent = (
    cross_tab_race_income["<=50K"] / cross_tab_race_income.sum(axis=1)
) * 100

Converting to percentages makes comparisons fair across groups of different sizes. This showed that certain racial categories have higher percentages of low-income individuals, revealing potential societal inequalities in the data.
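
To check whether the relationship between race and income is statistically significant rather than a sampling artifact, the chi-square test of independence mentioned in the TL;DR can be applied to the contingency table. This is a sketch assuming SciPy is available:

from scipy.stats import chi2_contingency

# Chi-square test of independence on the race vs. income cross-tab
chi2, p_value, dof, expected = chi2_contingency(cross_tab_race_income)
print(f"chi2 = {chi2:.2f}, p-value = {p_value:.4f}")

# A small p-value (e.g. < 0.05) suggests income distribution differs across racial groups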

Subset Analysis

Creating Targeted Subsets

private_bachelor_df = adult_census_df[
    (adult_census_df["workclass"] == "Private")
    & (adult_census_df["education-num"] < 13)
]

Analyzing specific subgroups provides deeper insights:

  • Average age ~36 years
  • Predominantly married, male, White
  • Most earn ≤50K
  • Work ~39 hours/week

This analysis helps understand specific populations, enabling targeted policies or marketing.

Comparing Low vs. High Income Groups

low_income_df = adult_census_df[adult_census_df["income"] == "<=50K"]
high_income_df = adult_census_df[adult_census_df["income"] == ">50K"]

Comparing these groups revealed:

Low income characteristics:

  • Younger (~36 years)
  • Lower education (HS grad or some college)
  • Never-married or married
  • Adm-clerical, craft-repair occupations
  • 30-40 work hours

High income characteristics:

  • Older (~44 years)
  • Higher education (Associate degree or higher)
  • Predominantly married
  • Exec-managerial, prof-specialty occupations
  • 40-50 work hours

These insights suggest education, age, and occupation strongly correlate with income, which is valuable both for understanding society and for building predictive models.
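
The TL;DR also mentions t-tests and ANOVA; as a hedged sketch of how these group differences could be tested formally (assuming SciPy is available and using the two subsets defined above):

from scipy.stats import ttest_ind, f_oneway

# Two-sample t-test: is the mean age of high earners different from low earners?
t_stat, p_value = ttest_ind(high_income_df["age"], low_income_df["age"], equal_var=False)
print(f"t = {t_stat:.2f}, p-value = {p_value:.4f}")

# One-way ANOVA: do mean hours-per-week differ across workclass categories?
groups = [g["hours-per-week"].values for _, g in adult_census_df.groupby("workclass")]
f_stat, p_value = f_oneway(*groups)
print(f"F = {f_stat:.2f}, p-value = {p_value:.4f}")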

Data Transformation for Machine Learning

One-Hot Encoding with get_dummies()

std_format_df = pd.get_dummies(adult_census_df, dtype=int)

Machine learning algorithms need numeric inputs. get_dummies() converts each categorical value into a binary column:

  • sex: Becomes sex_Male and sex_Female
  • workclass: Becomes workclass_Private, workclass_Public, etc.

This creates a "standard spreadsheet format" where every column is numeric, enabling machine learning.
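
To see the effect, you can peek at the new binary columns (a quick check; the exact column names depend on the category labels in the file):

# Inspect the one-hot columns created from the sex feature
std_format_df.filter(like="sex_").head()

# The frame is now entirely numeric
std_format_df.dtypes.value_counts()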

Correlation Analysis

corr_matrix = std_format_df.corr()
corr_matrix["education-num"].sort_values(ascending=False)

Correlation reveals feature relationships:

  • Strong positive correlation: Features that increase together
  • Strong negative correlation: Features that move oppositely
  • Weak correlation: Little relationship

Education-num correlations:

  • Positive: Prof-specialty occupation, income >50K
  • Negative: Other-service occupation, income ≤50K

This tells us higher education strongly predicts professional occupations and higher income: intuitive, but now quantitatively confirmed.

Feature Discretization

Creating Age Bins

dis_age_df = std_format_df.copy()
age_bins = pd.qcut(std_format_df.age, 3, labels=["young", "mid-age", "old"])
dis_age_df["age_bins"] = age_bins

pd.qcut() creates equal-sized bins:

  • Divides data into 3 equal groups
  • Labels them young, mid-age, old
  • Preserves distribution information while reducing granularity

This is useful when relationships are non-linear or when you want interpretable categories for analysis or visualization.
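
A quick way to confirm the equal-sized split (a minimal check):

# Each bin should hold roughly a third of the rows
dis_age_df["age_bins"].value_counts()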

Feature Normalization and Standardization

Min-Max Normalization

max_edu_num = std_format_df["education-num"].max()
min_edu_num = std_format_df["education-num"].min()
edu_min_max_norm = (std_format_df["education-num"] - min_edu_num) / (max_edu_num - min_edu_num)

Min-max normalization scales features to [0, 1]:

  • Preserves relationships
  • Removes units
  • Useful for algorithms sensitive to feature scales (neural networks, KNN)

Formula: (x - min) / (max - min)

Z-Score Standardization

z_score_std = std_format_df["education-num"].copy()
z_score_std = (z_score_std - z_score_std.mean()) / z_score_std.std()

Z-score standardization creates mean=0, std=1:

  • Centers data at zero
  • Expresses values in standard deviations from mean
  • Preferred for algorithms assuming normally distributed features (logistic regression, SVM)

Formula: (x - mean) / std

When to use each:

  • Min-Max: When you need bounded range [0,1], for neural networks
  • Z-Score: When you need standard deviations, for linear models

Both ensure features contribute equally regardless of original scale.

Key Takeaways

  1. Missing Values Aren't Always Null: Always examine unique values. Missing data might be encoded as '?', -999, 'NA', or other placeholders. Standardize to np.nan for consistent handling.

  2. Multiple Visualizations Tell Complete Stories: Box plots show distribution and outliers, histograms show detailed shape, scatter plots show relationships. Use all three for comprehensive understanding.

  3. Cross-Tabulation Reveals Relationships: Don't just analyze features in isolation. Cross-tabs show how categorical features interact, revealing insights you'd miss otherwise.

  4. Subset Analysis Provides Depth: Analyzing specific populations (like private sector workers without degrees) reveals nuances that aggregate analysis misses.

  5. Normalization is Algorithm-Dependent: Choose normalization strategy based on your algorithm's assumptions and your data's characteristics.

  6. Feature Engineering Before Modeling: Proper encoding (one-hot), normalization, and handling missing values often matters more than algorithm choice.

Common Pitfalls

  1. Imputing Categorical Missing Values: Don't blindly impute with the mode; this can introduce bias. It's often better to drop rows or create an explicit "missing" category.

  2. Ignoring Correlation: High correlation between features causes multicollinearity, hurting some models. Remove redundant features.

  3. Normalizing After Train/Test Split: Fit normalization on training data only, then apply it to the test data; otherwise you leak information (see the sketch after this list).

  4. Forgetting to Save Preprocessing Parameters: For deployment, you need the same min/max values, mean/std, or encoding mappings used during training.
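
As a sketch of pitfalls 3 and 4, here is one way to fit the scaler on the training split only and persist it for later reuse. It assumes scikit-learn and joblib are available; the target column names are hypothetical and not from the original notebook:

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import joblib

# Hypothetical target columns produced by get_dummies on income
X = std_format_df.drop(columns=["income_<=50K", "income_>50K"])
y = std_format_df["income_>50K"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on training data only, then apply the same parameters to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Persist the fitted scaler so the identical transformation can be reused at deployment time
joblib.dump(scaler, "scaler.joblib")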

Conclusion

Exploratory Data Analysis and preprocessing aren't preliminary steps to rush through; they're fundamental to machine learning success. By thoroughly understanding your data through visualization and statistical analysis, handling missing values thoughtfully, and preparing features appropriately, you build a solid foundation for modeling.

The Adult Census dataset analysis demonstrated the complete workflow: discovering hidden missing values, examining distributions, analyzing relationships, creating subsets for deeper insights, and transforming data for machine learning. Each step revealed insights that inform both understanding of the domain (societal patterns in employment and income) and modeling decisions (which features to keep, how to encode them, how to handle imbalance).

Remember: garbage in, garbage out. Invest time in EDA and preprocessing-it pays dividends in model performance and insight generation. A simple model on well-prepared data almost always outperforms a complex model on poorly prepared data.


📓 Jupyter Notebook

Want to explore the complete code and run it yourself? Access the full Jupyter notebook with detailed implementations and visualizations:

→ View Notebook on GitHub

You can also run it interactively: