FutureMind Academy

Last Updated on December 29, 2024 by Rajeev Bagra

Data cleaning and exploratory data analysis (EDA) are critical steps in any data-driven project. They ensure that the data is accurate, consistent, and ready for analysis. In this blog post, we will explore the processes of data cleaning and EDA using Python, leveraging libraries like pandas and matplotlib. We’ll also delve into key statistical concepts such as mean, median, mode, quartile deviations, histograms, and boxplots, including handling outliers.

Data Cleaning

What is Data Cleaning?

Data cleaning is the process of detecting, correcting, or removing errors, inconsistencies, and inaccuracies in data. It prepares raw data for analysis, ensuring its quality and usability.

Steps in Data Cleaning

Loading the Data

Use pandas to load data from CSV files, Excel workbooks, or databases.
```
import pandas as pd

df = pd.read_csv('data.csv')
```

Inspecting the Data

Preview the data:
```
print(df.head())
df.info()  # info() prints its own summary and returns None, so no print() is needed
```
Check for missing values:
```
print(df.isnull().sum()) 
```
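The raw counts are easier to judge next to the share of rows they represent. Here is a minimal sketch on a small made-up DataFrame; the column names and values are assumptions for illustration, not part of any real dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column names are made up for illustration.
df = pd.DataFrame({
    "age": [25, np.nan, 31, 47],
    "city": ["Delhi", "Mumbai", None, "Pune"],
    "salary": [50000, 62000, 58000, 71000],
})

missing_count = df.isnull().sum()          # number of missing cells per column
missing_share = df.isnull().mean() * 100   # percentage of missing cells per column

# Put both views side by side for a quick overview.
summary = pd.DataFrame({"missing": missing_count, "percent": missing_share})
print(summary)
```

A column with only a small share of missing values is often a candidate for filling, while a column that is mostly empty may be better dropped.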

Handling Missing Values

Drop missing values:
```
df = df.dropna() 
```

Fill missing values:

```
df['column_name'] = df['column_name'].fillna(df['column_name'].mean())
```
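To see how the two options differ, the sketch below applies both to the same small made-up DataFrame. The data and the choice of the most frequent value for the text column are assumptions for illustration; the right fill strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data for illustration only.
df = pd.DataFrame({
    "age": [20.0, np.nan, 30.0, 40.0],
    "city": ["Delhi", "Pune", None, "Delhi"],
})

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill instead of dropping, so no rows are lost.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())        # numeric: column mean
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # text: most frequent value

print(len(dropped), len(filled))
```

Dropping is simple but discards the valid values in those rows; filling keeps every row at the cost of introducing estimated values.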

Removing Duplicates
```
df = df.drop_duplicates() 
```
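Before dropping anything, it can help to count the duplicates first, and sometimes only a key column should decide what counts as a duplicate. A small sketch, assuming a made-up table where `customer_id` is the key:

```python
import pandas as pd

# Hypothetical sample data; customer_id is assumed to be the identifying column.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "email": ["a@example.com", "b@example.com", "b@example.com", "c@example.com"],
    "visits": [5, 3, 3, 7],
})

# Count rows that are exact copies of an earlier row.
n_dupes = df.duplicated().sum()

# Drop exact copies, keeping the first occurrence (the default).
deduped = df.drop_duplicates()

# Treat rows as duplicates when the key matches, keeping the last occurrence.
by_id = df.drop_duplicates(subset=["customer_id"], keep="last")

print(n_dupes, len(deduped), len(by_id))
```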

Fixing Data Types

```
df['date_column'] = pd.to_datetime(df['date_column'])
df['numeric_column'] = pd.to_numeric(df['numeric_column'])
```
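Real-world columns often contain entries that cannot be converted, and by default both functions raise an error on them. One possible approach, sketched here on made-up values, is to pass `errors="coerce"` so unparseable entries become missing values that can be handled like any others:

```python
import pandas as pd

# Hypothetical raw data with one bad entry in each column.
raw = pd.DataFrame({
    "date_column": ["2024-01-15", "not a date", "2024-03-01"],
    "numeric_column": ["10", "n/a", "12.5"],
})

clean = raw.copy()
# errors="coerce" turns unparseable entries into NaT / NaN instead of raising.
clean["date_column"] = pd.to_datetime(clean["date_column"], errors="coerce")
clean["numeric_column"] = pd.to_numeric(clean["numeric_column"], errors="coerce")

print(clean.dtypes)
print(clean.isnull().sum())
```

Checking `isnull().sum()` after the conversion shows how many entries failed to parse.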

Standardizing Data

Convert text to lowercase:

```
df['text_column'] = df['text_column'].str.lower()
```
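Lowercasing alone does not catch stray spaces. As a rough sketch on made-up values, trimming and collapsing whitespace before lowercasing lets different spellings of the same entry collapse into one; the extra `text_clean` column is an assumption added for illustration:

```python
import pandas as pd

# Hypothetical sample data: the first three entries are the same city written differently.
df = pd.DataFrame({"text_column": ["  New Delhi", "new delhi ", "NEW  DELHI", "Mumbai"]})

df["text_clean"] = (
    df["text_column"]
    .str.strip()                            # remove leading and trailing spaces
    .str.replace(r"\s+", " ", regex=True)   # collapse repeated whitespace
    .str.lower()                            # convert to lowercase
)

n_before = df["text_column"].nunique()
n_after = df["text_clean"].nunique()
print(n_before, n_after)
```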

Outlier Detection and Removal

Use statistical methods or visualization tools (discussed in the EDA section) to detect and handle outliers.

Exploratory Data Analysis (EDA)

EDA involves analyzing and summarizing data sets to understand their main characteristics. It’s often the first step in any data analysis project.

Key Concepts in EDA

Mean
- The average of a dataset.
- Formula: $\text{Mean} = \frac{\text{Sum of all values}}{\text{Number of values}}$
- In Python:
```
mean_value = df['column_name'].mean() 
```
Median
- The middle value in a sorted dataset.
- In Python:
```
median_value = df['column_name'].median() 
```
Mode
- The most frequent value in a dataset.
- In Python:
```
mode_value = df['column_name'].mode()[0] 
```
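The three measures can disagree sharply when the data contains an extreme value. A small illustration with made-up numbers (salaries in thousands, with one very large entry):

```python
import pandas as pd

# Hypothetical salaries in thousands; 500 is an extreme value.
salaries = pd.Series([30, 35, 35, 40, 45, 500])

mean_value = salaries.mean()      # pulled upward by the extreme value
median_value = salaries.median()  # middle of the sorted values, barely affected
mode_value = salaries.mode()[0]   # most frequent value

print(mean_value, median_value, mode_value)
```

Here the mean is well above every typical salary, while the median stays close to them, which is why the median is often preferred for skewed data.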
Quartiles and Quartile Deviation
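- Quartiles: Three cut points (Q1, Q2, Q3) that divide the sorted data into four equal parts; Q2 is the median.
- Interquartile Range (IQR): The difference between Q3 (75th percentile) and Q1 (25th percentile).
- Quartile Deviation: Half of the IQR, that is, (Q3 - Q1) / 2.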
```
Q1 = df['column_name'].quantile(0.25)
Q3 = df['column_name'].quantile(0.75)
IQR = Q3 - Q1
```
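As a quick worked example, here are the same calculations on nine made-up values, including the quartile deviation (half of the IQR). The results rely on pandas' default linear interpolation for `quantile`:

```python
import pandas as pd

# Hypothetical values, already sorted for readability.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

Q1 = values.quantile(0.25)  # 25th percentile
Q2 = values.quantile(0.50)  # 50th percentile, the median
Q3 = values.quantile(0.75)  # 75th percentile

IQR = Q3 - Q1                 # spread of the middle half of the data
quartile_deviation = IQR / 2  # also called the semi-interquartile range

print(Q1, Q2, Q3, IQR, quartile_deviation)
```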
Outliers
- Values that fall outside the range $[Q1 - 1.5 \times IQR,\ Q3 + 1.5 \times IQR]$ are commonly treated as outliers.
- Detecting outliers:
```
outliers = df[(df['column_name'] < (Q1 - 1.5 * IQR)) | (df['column_name'] > (Q3 + 1.5 * IQR))]
```
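Once outliers are detected, two common ways to handle them are removing the rows or capping the values at the bounds. The sketch below shows both on made-up data; the helper name `iqr_bounds` and its `k` parameter are hypothetical, introduced only for this example:

```python
import pandas as pd

def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) bounds of the IQR rule for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sample data; 95 is far from the other values.
df = pd.DataFrame({"column_name": [10, 12, 11, 13, 12, 95]})
lower, upper = iqr_bounds(df["column_name"])

# Boolean mask: True where the value lies outside the bounds.
is_outlier = (df["column_name"] < lower) | (df["column_name"] > upper)

outliers = df[is_outlier]   # the flagged rows
removed = df[~is_outlier]   # option 1: drop the flagged rows
capped = df["column_name"].clip(lower=lower, upper=upper)  # option 2: cap at the bounds

print(lower, upper)
print(outliers)
```

Removal shrinks the dataset, while capping keeps every row but changes the extreme values, so the choice depends on whether the outliers are errors or genuine observations.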
Describe Method
- Summarizes the central tendency, dispersion, and shape of a dataset’s distribution.
- Example:
```
print(df.describe()) 
```
- Output includes statistics such as count, mean, standard deviation, minimum, maximum, and quartiles for numerical columns.
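By default `describe()` covers only the numerical columns of a mixed DataFrame. A short sketch on made-up data, which also uses `include="all"` to bring text columns into the summary:

```python
import pandas as pd

# Hypothetical sample data with one numeric and one text column.
df = pd.DataFrame({
    "age": [22, 25, 31, 40],
    "city": ["Delhi", "Pune", "Delhi", "Mumbai"],
})

numeric_summary = df.describe()            # numeric columns only
full_summary = df.describe(include="all")  # adds unique, top, freq for text columns

# Individual statistics can be read by row label and column name.
print(numeric_summary.loc["mean", "age"])
print(full_summary.loc["top", "city"])
```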

Visualizations in EDA

Histogram

Shows the frequency distribution of a dataset.

Example:

```
import matplotlib.pyplot as plt

df['column_name'].plot(kind='hist', bins=20)
plt.title('Histogram of Column')
plt.xlabel('Values')
plt.ylabel('Frequency')
plt.show()
```
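A histogram is just a bar chart of how many values fall into each bin. To make that concrete, this sketch computes the bin counts directly with NumPy on made-up values; `bins=4` is chosen only to keep the output small:

```python
import numpy as np
import pandas as pd

# Hypothetical sample values.
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 9])

# Split the range from min to max into 4 equal-width bins and count values in each.
counts, edges = np.histogram(values.to_numpy(), bins=4)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

Each count corresponds to the height of one bar; the gap between the crowded low bins and the lone value in the last bin is what a long tail looks like in a histogram.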

Boxplot

Displays data distribution and highlights outliers.

Example:

```
import matplotlib.pyplot as plt

df.boxplot(column='column_name')
plt.title('Boxplot of Column')
plt.show()
```
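The points a boxplot draws beyond its whiskers follow the same 1.5 × IQR rule described earlier (matplotlib's default). As a rough sketch on made-up data, the objects returned by matplotlib's `boxplot` can be inspected to read those points back; the `Agg` backend is an assumption so the example runs without opening a window:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data; 100 is far from the other values.
df = pd.DataFrame({"column_name": [1, 2, 3, 4, 5, 100]})

fig, ax = plt.subplots()
parts = ax.boxplot(df["column_name"].to_numpy())  # returns a dict of the drawn lines
ax.set_title("Boxplot of Column")

flier_values = parts["fliers"][0].get_ydata()     # points drawn beyond the whiskers
median_line = parts["medians"][0].get_ydata()[0]  # height of the median line

print(flier_values, median_line)
plt.close(fig)
```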

Practical Example

Data Cleaning and EDA Workflow

```
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt

# Loading data
df = pd.read_csv('data.csv')

# Data Cleaning
print(df.isnull().sum())
df['Age'] = df['Age'].fillna(df['Age'].mean())
df = df.drop_duplicates()

# Outlier Detection
Q1 = df['Salary'].quantile(0.25)
Q3 = df['Salary'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[(df['Salary'] < (Q1 - 1.5 * IQR)) | (df['Salary'] > (Q3 + 1.5 * IQR))]
print(outliers)

# Summary Statistics
print(df.describe())

# Visualizations
plt.figure(figsize=(10, 5))
df['Salary'].plot(kind='hist', bins=20, color='blue', alpha=0.7)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()

df.boxplot(column='Salary')
plt.title('Boxplot of Salary')
plt.show()
```
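The workflow above needs a `data.csv` file with `Age` and `Salary` columns. To try the cleaning and outlier steps without one, here is a minimal sketch that wraps them in a function and runs it on a tiny made-up DataFrame. The function name `clean_and_flag`, the `salary_outlier` column, and the sample values are all assumptions for illustration:

```python
import numpy as np
import pandas as pd

def clean_and_flag(df):
    """Fill missing ages, drop duplicate rows, and flag salary outliers (IQR rule)."""
    out = df.copy()
    out["Age"] = out["Age"].fillna(out["Age"].mean())
    out = out.drop_duplicates().reset_index(drop=True)

    q1 = out["Salary"].quantile(0.25)
    q3 = out["Salary"].quantile(0.75)
    iqr = q3 - q1
    out["salary_outlier"] = (
        (out["Salary"] < q1 - 1.5 * iqr) | (out["Salary"] > q3 + 1.5 * iqr)
    )
    return out

# Hypothetical sample data: one missing age, one duplicate row, one extreme salary.
sample = pd.DataFrame({
    "Age": [25, np.nan, 35, 35, 45, 30],
    "Salary": [40000, 42000, 45000, 45000, 47000, 400000],
})

result = clean_and_flag(sample)
print(result)
```

Flagging outliers in a separate column, rather than deleting them right away, keeps the decision about what to do with them open until after the summary statistics and plots have been reviewed.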

Conclusion

Data cleaning and EDA are essential to ensure that your data is ready for further analysis or modeling. Using Python libraries like pandas and matplotlib, you can efficiently clean your data and gain meaningful insights. By understanding statistical concepts and visualizations, you’re better equipped to make data-driven decisions.
