Data Cleaning and Exploratory Data Analysis (EDA) with Python

Rajeev Bagra 2026-04-10

Last Updated on December 29, 2024 by Rajeev Bagra

Data cleaning and exploratory data analysis (EDA) are critical steps in any data-driven project. They ensure that the data is accurate, consistent, and ready for analysis. In this blog post, we will walk through data cleaning and EDA in Python using the pandas and matplotlib libraries. We’ll also cover key statistical concepts such as the mean, median, mode, and quartiles, along with histograms, boxplots, and outlier detection.


Data Cleaning

What is Data Cleaning?

Data cleaning is the process of detecting, correcting, or removing errors, inconsistencies, and inaccuracies in data. It prepares raw data for analysis, ensuring its quality and usability.

Steps in Data Cleaning

  1. Loading the Data

    Use pandas to load data from CSV, Excel, or databases.

    import pandas as pd

    df = pd.read_csv('data.csv')
  2. Inspecting the Data

    • Preview the data:
      print(df.head())
      df.info()  # info() prints its summary itself and returns None
    • Check for missing values:
      print(df.isnull().sum()) 
  3. Handling Missing Values

    • Drop missing values:
      df = df.dropna() 
    • Fill missing values:
      df['column_name'] = df['column_name'].fillna(df['column_name'].mean()) 
  4. Removing Duplicates

    df = df.drop_duplicates() 
  5. Fixing Data Types

    df['date_column'] = pd.to_datetime(df['date_column'])
    df['numeric_column'] = pd.to_numeric(df['numeric_column'])
  6. Standardizing Data

    • Convert text to lowercase:
      df['text_column'] = df['text_column'].str.lower() 
  7. Outlier Detection and Removal

    Use statistical methods or visualization tools (discussed in the EDA section) to detect and handle outliers.
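
As a rough illustration of steps 3 to 6, the sketch below runs them on a small in-memory DataFrame instead of a CSV file. The column names (`name`, `age`, `joined`) and all sample values are made up for this example, and the order of the steps is one reasonable choice rather than a fixed rule.

```python
import pandas as pd

# Hypothetical sample data: one duplicated row, one missing age,
# numbers stored as text, and inconsistent capitalization.
raw = pd.DataFrame({
    "name": ["Alice", "BOB", "BOB", "carol"],
    "age": ["30", "25", "25", None],
    "joined": ["2024-01-05", "2024-02-10", "2024-02-10", "2024-03-15"],
})

# Remove exact duplicate rows first so they do not skew the mean.
cleaned = raw.drop_duplicates().copy()

# Fix data types: text to numbers and text to datetimes.
cleaned["age"] = pd.to_numeric(cleaned["age"], errors="coerce")
cleaned["joined"] = pd.to_datetime(cleaned["joined"])

# Fill the remaining missing age with the column mean.
cleaned["age"] = cleaned["age"].fillna(cleaned["age"].mean())

# Standardize text to lowercase.
cleaned["name"] = cleaned["name"].str.lower()

print(cleaned)
print(cleaned.dtypes)
```

Dropping duplicates before filling missing values is a deliberate choice here: otherwise the repeated row would count twice when the mean is computed.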


Exploratory Data Analysis (EDA)

EDA involves analyzing and summarizing data sets to understand their main characteristics. It’s often the first step in any data analysis project.

Key Concepts in EDA

  1. Mean

    • The average of a dataset.
    • Formula: $\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$
    • In Python:
      mean_value = df['column_name'].mean() 
  2. Median

    • The middle value in a sorted dataset.
    • In Python:
      median_value = df['column_name'].median() 
  3. Mode

    • The most frequent value in a dataset.
    • In Python:
      mode_value = df['column_name'].mode()[0] 
  4. Quartiles and Quartile Deviation

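    • Quartiles: The three cut points (Q1, Q2, Q3) that divide sorted data into four equal parts.
    • Interquartile Range (IQR): The range between Q3 (75th percentile) and Q1 (25th percentile).
    • Quartile Deviation: Half of the IQR, also called the semi-interquartile range.
      Q1 = df['column_name'].quantile(0.25)
      Q3 = df['column_name'].quantile(0.75)
      IQR = Q3 - Q1
      quartile_deviation = IQR / 2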
  5. Outliers

    • Values that fall outside the range $[Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]$ are treated as outliers.
    • Detecting outliers:
      outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR))]
  6. Describe Method

    • Summarizes the central tendency, dispersion, and shape of a dataset’s distribution.
    • Example:
      print(df.describe()) 
    • Output includes statistics such as count, mean, standard deviation, minimum, maximum, and quartiles for numerical columns.
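
To make these definitions concrete, here is a small worked sketch on nine made-up values, one of which is deliberately extreme. The numbers are hypothetical and chosen only so the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical values; 95 is deliberately far from the rest.
values = pd.Series([10, 12, 12, 13, 14, 15, 16, 18, 95])

mean_value = values.mean()      # pulled upward by the extreme value
median_value = values.median()  # middle value of the sorted data
mode_value = values.mode()[0]   # mode() returns a Series because ties are possible

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
quartile_deviation = iqr / 2

# The 1.5 * IQR rule: anything outside [lower, upper] is flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(f"mean={mean_value:.2f}, median={median_value}, mode={mode_value}")
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, quartile deviation={quartile_deviation}")
print(f"bounds=[{lower}, {upper}], outliers={outliers.tolist()}")
```

The mean (about 22.78) sits well above the median (14) because of the single large value, which is why the median and IQR are often preferred when outliers are present.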

Visualizations in EDA

  1. Histogram

    • Shows the frequency distribution of a dataset.
    • Example:
      import matplotlib.pyplot as plt

      df['column_name'].plot(kind='hist', bins=20)
      plt.title('Histogram of Column')
      plt.xlabel('Values')
      plt.ylabel('Frequency')
      plt.show()
  2. Boxplot

    • Displays data distribution and highlights outliers.
    • Example:
      df.boxplot(column='column_name')
      plt.title('Boxplot of Column')
      plt.show()
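
The two plots are often easier to compare side by side. The sketch below draws both on one figure using made-up data. The `score` column, its values, and the output filename `eda_overview.png` are assumptions for illustration only.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data with one unusually large value.
df = pd.DataFrame({"score": [10, 12, 12, 13, 14, 15, 16, 18, 95]})

# One row, two columns: histogram on the left, boxplot on the right.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))

df["score"].plot(kind="hist", bins=10, ax=axes[0])
axes[0].set_title("Histogram of score")
axes[0].set_xlabel("Values")
axes[0].set_ylabel("Frequency")

df.boxplot(column="score", ax=axes[1])
axes[1].set_title("Boxplot of score")

fig.tight_layout()
fig.savefig("eda_overview.png")  # in a notebook, plt.show() works instead
```

By default matplotlib extends boxplot whiskers to the furthest points within 1.5 × IQR of the quartiles, so the points drawn beyond the whiskers should correspond to the outliers flagged by the IQR rule.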

Practical Example

Data Cleaning and EDA Workflow

    # Importing libraries
    import pandas as pd
    import matplotlib.pyplot as plt

    # Loading data
    df = pd.read_csv('data.csv')

    # Data Cleaning
    print(df.isnull().sum())
    df['Age'] = df['Age'].fillna(df['Age'].mean())
    df = df.drop_duplicates()

    # Outlier Detection (the two conditions are combined with |, meaning "or")
    Q1 = df['Salary'].quantile(0.25)
    Q3 = df['Salary'].quantile(0.75)
    IQR = Q3 - Q1
    outliers = df[(df['Salary'] < (Q1 - 1.5 * IQR)) | (df['Salary'] > (Q3 + 1.5 * IQR))]
    print(outliers)

    # Summary Statistics
    print(df.describe())

    # Visualizations
    plt.figure(figsize=(10, 5))
    df['Salary'].plot(kind='hist', bins=20, color='blue', alpha=0.7)
    plt.title('Salary Distribution')
    plt.xlabel('Salary')
    plt.ylabel('Frequency')
    plt.show()

    df.boxplot(column='Salary')
    plt.title('Boxplot of Salary')
    plt.show()
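
The workflow above detects outliers but leaves them in the data. Two common ways to handle them are dropping the affected rows or capping the values at the IQR bounds. The sketch below shows both on made-up salary figures. The data is hypothetical, and which option is appropriate depends on whether the extreme values are errors or genuine observations.

```python
import pandas as pd

# Hypothetical salaries; the last value is deliberately extreme.
df = pd.DataFrame(
    {"Salary": [42000, 45000, 47000, 50000, 52000, 55000, 58000, 400000]}
)

q1 = df["Salary"].quantile(0.25)
q3 = df["Salary"].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Option 1: drop rows whose salary falls outside the bounds.
trimmed = df[df["Salary"].between(lower, upper)]

# Option 2: keep every row but cap values at the bounds.
capped = df.assign(Salary=df["Salary"].clip(lower=lower, upper=upper))

print(f"bounds: [{lower}, {upper}]")
print(f"rows before: {len(df)}, after dropping: {len(trimmed)}")
print(f"max after capping: {capped['Salary'].max()}")
```

Dropping shrinks the dataset, while capping keeps the row count but changes the values, so it is worth recording which approach was used before moving on to modeling.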

Conclusion

Data cleaning and EDA are essential to ensure that your data is ready for further analysis or modeling. Using Python libraries like pandas and matplotlib, you can efficiently clean your data and gain meaningful insights. By understanding statistical concepts and visualizations, you’re better equipped to make data-driven decisions.
