Missing data is one of the most common issues in real-world datasets. In Python, especially when working with pandas, missing data is typically represented by special markers such as NaN (Not a Number), None, or NaT (for datetime). Managing this missing information effectively is critical for ensuring the accuracy of data analysis and machine learning pipelines.
In this guide, we cover how to identify, analyze, and handle missing data using pandas. We also discuss strategies such as removal, imputation, and interpolation, along with the trade-offs of each approach.
import pandas as pd
import numpy as np
data = {
    'Name': ['Alice', 'Bob', np.nan, 'David', 'Eva'],
    'Age': [25, np.nan, 30, 45, None],
    'Salary': [50000, 60000, None, 80000, 75000],
    'Department': ['HR', 'Finance', 'IT', None, 'Finance']
}
df = pd.DataFrame(data)
print(df)
The methods isnull() and notnull() (also available as isna() and notna()) return a Boolean DataFrame of the same shape as the original, marking each element as missing or present.
print(df.isnull()) # True for missing, False for present
print(df.notnull()) # Inverse of isnull()
print(df.isnull().any()) # Check if any column has missing values
print(df.isnull().all()) # Check if all values in a column are missing
print(df.isnull().sum()) # Total count of missing values per column
print(df.isnull().sum().sum()) # Total missing values in DataFrame
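Raw counts are easier to interpret next to the share of each column that is missing. The following is a minimal sketch, assuming the same sample data as above; the helper name missing_summary is hypothetical and not part of pandas.

```python
import numpy as np
import pandas as pd

# Same sample data as the DataFrame built earlier in this guide
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'David', 'Eva'],
    'Age': [25, np.nan, 30, 45, None],
    'Salary': [50000, 60000, None, 80000, 75000],
    'Department': ['HR', 'Finance', 'IT', None, 'Finance'],
})

def missing_summary(frame):
    """Hypothetical helper: count and percentage of missing values per column."""
    counts = frame.isnull().sum()
    # The mean of a Boolean column is the fraction of True values
    percent = frame.isnull().mean() * 100
    summary = pd.DataFrame({'missing': counts, 'percent': percent})
    # Put the columns with the most missing data first
    return summary.sort_values('percent', ascending=False)

summary = missing_summary(df)
print(summary)
```

Sorting by percentage puts the columns that need the most attention at the top, which can help when deciding between dropping and imputing.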
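dropna() removes rows (the default) or columns that contain missing values. It returns a new DataFrame and leaves the original unchanged unless inplace=True is passed.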
# Drop rows with any missing values
df_drop_rows = df.dropna()
print(df_drop_rows)
# Drop rows where all columns are NaN
df_drop_all = df.dropna(how='all')
print(df_drop_all)
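# Drop columns with any missing values
# (every column in this sample has at least one, so no columns remain)
df_drop_col = df.dropna(axis=1)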
print(df_drop_col)
# Keep rows with at least 3 non-null values
df_thresh = df.dropna(thresh=3)
print(df_thresh)
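Often only a few columns are essential, and a row should be dropped only when one of those is missing. A short sketch using the subset parameter of dropna(), assuming the same sample data as above:

```python
import numpy as np
import pandas as pd

# Same sample data as the DataFrame built earlier in this guide
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'David', 'Eva'],
    'Age': [25, np.nan, 30, 45, None],
    'Salary': [50000, 60000, None, 80000, 75000],
    'Department': ['HR', 'Finance', 'IT', None, 'Finance'],
})

# Drop rows only when 'Salary' is missing; gaps in other columns are tolerated
df_salary_known = df.dropna(subset=['Salary'])
print(df_salary_known)

# Drop rows only when both 'Age' and 'Salary' are missing at the same time
df_both = df.dropna(subset=['Age', 'Salary'], how='all')
print(df_both)
```

In this sample no row is missing both Age and Salary, so the second call keeps every row.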
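fillna() replaces missing values with a constant or with a computed value such as a column mean, median, or mode. Forward fill and backward fill instead propagate the nearest valid observation.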
df_filled = df.fillna(0) # Replace all NaN values with 0
print(df_filled)
df_ffill = df.ffill() # Fill from previous row (fillna(method='ffill') is deprecated since pandas 2.1)
print(df_ffill)
df_bfill = df.bfill() # Fill from next row (replaces the deprecated fillna(method='bfill'))
print(df_bfill)
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].median())
print(df)
df['Department'] = df['Department'].fillna(df['Department'].mode()[0])
print(df)
df['Department'] = df['Department'].fillna('Unknown')
print(df)
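The column-by-column fills above can also be expressed in a single call by passing a dictionary to fillna(). This is a sketch on a fresh copy of the sample data; the choice of mean for Age, median for Salary, and 'Unknown' for Department simply mirrors the examples above and is not a recommendation.

```python
import numpy as np
import pandas as pd

# Fresh copy of the sample data used in this guide
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', np.nan, 'David', 'Eva'],
    'Age': [25, np.nan, 30, 45, None],
    'Salary': [50000, 60000, None, 80000, 75000],
    'Department': ['HR', 'Finance', 'IT', None, 'Finance'],
})

# One fill value per column; columns not listed (here 'Name') are left untouched
fill_values = {
    'Age': df['Age'].mean(),
    'Salary': df['Salary'].median(),
    'Department': 'Unknown',
}
df_dict_filled = df.fillna(value=fill_values)
print(df_dict_filled)
```

Because fillna() returns a new DataFrame, the original df keeps its missing values and can still be used to compare strategies.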
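Interpolation estimates missing values from neighboring data points, for example along a straight line between the nearest valid values (method='linear') or in proportion to elapsed time on a datetime index (method='time').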
df['Age'] = df['Age'].interpolate(method='linear')
print(df)
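One detail worth knowing is how linear interpolation treats gaps at the end of a column. The sketch below uses a standalone Series with the same values as the original Age column and compares the default behavior with limit_area='inside', which restricts filling to gaps surrounded by valid values.

```python
import numpy as np
import pandas as pd

# Same values as the original 'Age' column, including a trailing gap
age = pd.Series([25, np.nan, 30, 45, np.nan], name='Age')

# Default: interior gaps are interpolated, and the trailing gap
# is filled by carrying the last valid value forward
default = age.interpolate(method='linear')
print(default)

# limit_area='inside' fills only gaps that have valid values on both sides,
# so the trailing gap stays missing
inside = age.interpolate(method='linear', limit_area='inside')
print(inside)
```

Leaving edge gaps unfilled can be the safer choice when there is no later observation to support the estimate.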
from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = imp.fit_transform(df[['Age', 'Salary']])
print(df)
df['Salary'] = df.groupby('Department')['Salary'].transform(lambda x: x.fillna(x.mean()))
print(df)
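Group-based imputation has an edge case: if every value in a group is missing, the group mean is itself NaN and nothing gets filled. The following sketch uses a small hypothetical dataset (the 'Ops' department and the column name Salary_filled are made up for illustration) and falls back to the overall mean in that case.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the 'Ops' group has no known salary at all
df_groups = pd.DataFrame({
    'Department': ['HR', 'HR', 'IT', 'IT', 'Ops'],
    'Salary': [50000, np.nan, 70000, 90000, np.nan],
})

# Per-row group mean, aligned to the original index
group_mean = df_groups.groupby('Department')['Salary'].transform('mean')

# First try the group mean, then fall back to the overall mean
# for groups where no salary is known
overall_mean = df_groups['Salary'].mean()
df_groups['Salary_filled'] = (
    df_groups['Salary'].fillna(group_mean).fillna(overall_mean)
)
print(df_groups)
```

Writing the result to a new column keeps the original values available for comparison.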
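# Conditional fill: set Age to 35 only where Name is 'Charlie' and Age is missing
# (the sample df has no 'Charlie', so this statement leaves it unchanged)
df.loc[(df['Name'] == 'Charlie') & (df['Age'].isnull()), 'Age'] = 35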
print(df)
import seaborn as sns
import matplotlib.pyplot as plt
sns.heatmap(df.isnull(), cbar=False, cmap='viridis')
plt.title('Missing Data Heatmap')
plt.show()
import missingno as msno
msno.matrix(df)
msno.heatmap(df)
msno.bar(df)
employee_data = {
    'EmployeeID': [101, 102, 103, 104, 105],
    'Name': ['John', 'Anna', None, 'Mike', 'Sara'],
    'Experience': [5, None, 7, 10, None],
    'Department': ['HR', 'Finance', 'IT', None, 'Finance']
}
emp_df = pd.DataFrame(employee_data)
# Step 1: Fill missing Experience with mean
emp_df['Experience'] = emp_df['Experience'].fillna(emp_df['Experience'].mean())
# Step 2: Fill missing Name and Department
emp_df['Name'] = emp_df['Name'].fillna('Unknown')
emp_df['Department'] = emp_df['Department'].fillna('Unassigned')
print(emp_df)
ts = pd.Series([np.nan, 2, np.nan, 4, np.nan, np.nan, 7],
               index=pd.date_range('2024-01-01', periods=7))
# Forward-fill at most one consecutive missing value
# (ts.ffill replaces the deprecated fillna(method='ffill'))
ts_filled = ts.ffill(limit=1)
print(ts_filled)
ts_interpolated = ts.interpolate(method='time')
print(ts_interpolated)
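On the evenly spaced daily index above, method='time' and method='linear' give the same result. The difference appears when timestamps are irregular, as in this small sketch with hypothetical readings:

```python
import numpy as np
import pandas as pd

# Hypothetical readings with uneven gaps: 1 day, then 3 days
idx = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
readings = pd.Series([10.0, np.nan, 50.0], index=idx)

# 'linear' ignores the index and treats rows as equally spaced
linear = readings.interpolate(method='linear')
print(linear)

# 'time' weights the estimate by elapsed time between timestamps
by_time = readings.interpolate(method='time')
print(by_time)
```

The missing reading sits one day into a four-day span, so the time-based estimate is a quarter of the way from 10 to 50, while the linear estimate is the midpoint.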
print(df.groupby('Department')['Salary'].apply(lambda x: x.isnull().sum()))
msno.heatmap(df)
df.to_csv('cleaned_data.csv', index=False)
df_cleaned = pd.read_csv('cleaned_data.csv')
print(df_cleaned.head())
Handling missing data is a critical step in every data science project. Whether you choose to drop, impute, or fill values, the approach should be informed by the data context, size, type of analysis, and impact on downstream tasks. Python and pandas provide robust tools to detect, visualize, and manage missing data effectively.
By using techniques like fillna, dropna, interpolate, and group-based imputation, you can create reliable, consistent datasets ready for analytics and modeling. Visual tools like seaborn and missingno also enhance the understanding of data completeness and guide better preprocessing strategies.
Always test the impact of your missing data handling methods and validate them through EDA and model performance checks.
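One simple check is to compare summary statistics before and after imputation. The sketch below, using the Age values from this guide, illustrates a well-known side effect of mean imputation: the mean is preserved but the spread shrinks.

```python
import numpy as np
import pandas as pd

# Same values as the original 'Age' column
age = pd.Series([25, np.nan, 30, 45, np.nan], name='Age')

# Mean imputation
age_imputed = age.fillna(age.mean())

# describe() skips missing values, so 'before' is based on 3 observations
comparison = pd.DataFrame({
    'before': age.describe(),
    'after_mean_fill': age_imputed.describe(),
})
print(comparison)
```

A noticeably smaller standard deviation after filling suggests the imputation is understating the variability in the data, which can matter for downstream models.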
Copyrights © 2024 letsupdateskills All rights reserved