Our Goal: To create a clean, ready-to-use dataset for a machine learning model.
We are going to create a small example dataset of 100 applicants that will have:
applicant_id: Unique applicant ID
age: Applicant's age
income: Monthly income (in USD)
loan_amount: Requested loan amount (in USD)
credit_score: Creditworthiness score (300-850)
gender: Male, Female, or Non-Binary
loan_status: Approved (1) or Rejected (0)
Our target is loan_status: we want to build a model that uses all the other features as input and decides whether to approve or reject a loan for an applicant.
# import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
We will create a dataset with 100 rows and then deliberately introduce null values and outliers, so that we have something to clean.
num_rows = 100 # We will create a dataset with 100 rows
# set the seed for generating random numbers
np.random.seed(42)
df = {
    "applicant_id": range(1, num_rows + 1),  # Applicant id from 1 to 100
    "age": np.random.randint(18, 65, num_rows).astype(int),  # Random ages from 18 to 65
    "income": np.random.randint(3000, 20000, num_rows).astype(float),  # Monthly income from 3000 to 20000
    "loan_amount": np.random.randint(10000, 100000, num_rows).astype(float),  # Loan request from 10000 to 100000
    "credit_score": np.random.randint(300, 850, num_rows).astype(int),  # Credit score from 300 to 850
    "gender": np.random.choice(["Male", "Female", "Non-Binary"], num_rows),  # Random genders
    "loan_status": np.random.choice([0, 1], num_rows)  # Approved (1) or Rejected (0)
}
We have to convert our dictionary into a DataFrame so that we can run operations on it and feed it into a machine learning model later.
# convert dictionary object to dataframe
df = pd.DataFrame(df)
df.head()
The dataset that we created above is clean, so we are going to mess it up on purpose: we will add null values, duplicate rows, and outliers.
First, we add null values to 10% of the entries in the "age", "income", "loan_amount", and "credit_score" columns.
# Introduce Null Values Randomly
for col in ["age", "income", "loan_amount", "credit_score"]:
    df.loc[df.sample(frac=0.1).index, col] = np.nan  # 10% missing values in each column
Next, we select 5 random rows, duplicate them, and append them to our dataset.
# Introduce Duplicate Rows
duplicate_df = df.sample(n=5, random_state=42) # Select 5 random rows to duplicate
df = pd.concat([df, duplicate_df], ignore_index=True) # Add duplicates
Finally, we pick 5 random rows and overwrite their "age", "income", and "loan_amount" values with unrealistic ones.
# Introduce Outliers in Age, Income, Loan Amount
outlier_indices = np.random.choice(df.index, size=5, replace=False) # Select 5 rows randomly
df.loc[outlier_indices, "age"] = np.random.randint(100, 120, 5) # Unrealistically high ages
df.loc[outlier_indices, "income"] = np.random.randint(100000, 500000, 5) # High income outliers
df.loc[outlier_indices, "loan_amount"] = np.random.randint(200000, 1000000, 5) # High loan amounts
Now, if our code is right, we should have null values in our dataset.
But how do we check whether a dataset has null values when we are not sure there are any?
We use the code below, which detects and sums the null values in each column.
# checking for null values
df.isna().sum()
The output above shows how many missing values each column contains. We will fill them with the mean of each numeric column.
# Handling missing values
df.fillna(df.mean(numeric_only=True), inplace=True)
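One thing to keep in mind: the mean is sensitive to outliers, and at this point we have not capped them yet. A minimal alternative sketch, filling with the median instead (not what we do in this article), would look like this:
# Alternative: fill missing values with the column median, which is more robust to outliers
df.fillna(df.median(numeric_only=True), inplace=True)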
When cleaning a dataset, it is also important to check whether it contains duplicated rows.
This can easily be done with the code below, which shows exactly which rows are duplicates.
If there were no duplicates in our dataset, the outcome would be an empty DataFrame.
df[df.duplicated(keep=False)]
Now, let's delete them from the dataset.
# Removing duplicate entries
df.drop_duplicates(inplace=True)
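To double-check, we can count the duplicate rows that remain; after drop_duplicates this should be 0:
# Verify that no duplicate rows remain
print(df.duplicated().sum())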
There are many ways to detect whether a dataset has outliers. The most popular is to plot each numerical column, for example with a boxplot or a distribution plot.
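For instance, a quick per-column boxplot makes extreme values easy to spot. A minimal sketch (it assumes matplotlib is installed; the rest of this article does not use it):
import matplotlib.pyplot as plt
# Boxplots: points far outside the whiskers are potential outliers
df[["age", "income", "loan_amount"]].plot(kind="box", subplots=True, layout=(1, 3), figsize=(12, 4))
plt.show()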
In our example, we will instead use the code below, which helps us quickly identify potential outliers from the summary statistics.
# Detect outliers
df.describe()[["age", "income", "loan_amount"]]
max vs. 75% (upper outliers):
max(age) = 117 → Very high! Most ages are below 50, so 117 is an outlier.
max(income) = 414997 → Much higher than the 75th percentile income of 17555, so likely an outlier.
max(loan_amount) = 876926 → Unusually high, likely an outlier.
min vs. 25% (lower outliers):
If min were much lower than the 25th percentile, it would be a lower outlier (not the case here).
Now that we know there are outliers, let's handle them. We are going to cap them using the IQR (interquartile range) method, with the function below.
# Creating a function that detects and caps outliers in a column using the IQR method.
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)  # First quartile (25th percentile)
    Q3 = df[column].quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Capping the outliers
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    return df
Now, let's call the function for the numerical columns in which we want to handle outliers.
# Applying outlier handling to numerical columns
for col in ["age", "income", "loan_amount", "credit_score"]:
    df = handle_outliers(df, col)
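If we re-run the summary statistics, the maximum values of these columns should now sit at the IQR upper bounds instead of the extreme values we injected earlier:
# Check that the extreme values have been capped
df[["age", "income", "loan_amount"]].describe()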
Of course, some real-world datasets are much messier than this one.
The cleaning process above covers the most common problems, but you may face additional issues depending on the dataset you are working with.
Class imbalance occurs when one class in a classification problem significantly outweighs the other class. It's common in many machine learning problems.
Let's say that in our dataset 90% of the loans are approved and only 10% are not.
Problem: A machine learning model trained on this data will be biased towards predicting loan approvals, because the majority class dominates.
That's why we should check whether one class is much more frequent than the other, which would push our model to favor that class.
# Checking Class Imbalance in loan_status
loan_approval_rate = df['loan_status'].value_counts(normalize=True)
print("Loan Approval Rate:\n", loan_approval_rate)
We can see that our target is balanced, so we do not have that problem.
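If loan_status had been imbalanced (say the 90%/10% split described above), one simple remedy is to randomly oversample the minority class. A minimal sketch with sklearn.utils.resample, assuming Approved (1) is the majority class as in that example:
from sklearn.utils import resample
# Split the rows into majority and minority classes
majority = df[df["loan_status"] == 1]
minority = df[df["loan_status"] == 0]
# Randomly duplicate minority rows until both classes have the same size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
# Recombine into a balanced dataset
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["loan_status"].value_counts())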
Balance in the features themselves is NOT required, but in some cases it is still worth checking:
# Checking Class Imbalance in gender
gender_rate = df['gender'].value_counts(normalize=True)
print("Gender Rate:\n", gender_rate)
Female applicants make up the smallest share of the dataset (23.76%), so the gender groups are not evenly represented.
This suggests that gender might end up influencing loan approvals unfairly.
In this case, a gender imbalance is present because one group (Female) is noticeably underrepresented compared to the others.
How to Confirm Gender Bias?
We can perform a statistical test to check if gender significantly affects loan_status:
We will use the Chi-Square test for gender bias.
The Chi-Square test checks whether loan approval rates are independent of gender.
If the p-value is below 0.05, loan approval rates differ significantly across genders, which suggests bias.
import scipy.stats as stats
import pandas as pd
# Create a contingency table (counts of approvals/rejections for each gender)
contingency_table = pd.crosstab(df["gender"], df["loan_status"])
# Perform the chi-square test
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
# Print the results
print(f"Chi-Square Statistic: {chi2}")
print(f"P-Value: {p}")
# Interpret the results
if p < 0.05:
    print("🚨 Loan approval rates are significantly different across genders! Bias may be present.")
else:
    print("✅ No significant gender bias detected.")
How to Fix Gender Imbalance?
If we confirm gender bias, there are a few common solutions, such as excluding the biased feature, re-sampling the data so the groups are better represented, or re-weighting the samples during training.
We will not exclude the "gender" column here, because it is the only categorical variable in our dataset and we want to demonstrate how to encode categorical variables for our ML model.
But if you work with a real-world dataset, consider these methods for handling imbalance.
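For completeness, the simplest of these fixes, dropping the column entirely, would just be (again, not what we do here):
# Drop the gender column so the model cannot use it
df_no_gender = df.drop(columns=["gender"])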
Machine learning algorithms can't handle categorical variables directly; they can't understand that "Female" is different from "Male".
Encoding translates these categories into a language that machines can understand and work with.
That's what we are going to do now.
df['gender'].unique()
# Encoding categorical variables
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df['gender'].unique()
Now, instead of categories in our column we have numbers that represent them (LabelEncoder assigns the integers in alphabetical order of the categories):
0 -> "Female"
1 -> "Male"
2 -> "Non-Binary"
Feature scaling is a fundamental preprocessing step in machine learning aimed at ensuring that numerical features have a similar scale.
This is important because many ML algorithms perform better or converge faster when the input numerical features are on a similar scale.
There are two popular ways to do this; we are going to use the min-max scaler, known in statistics as normalization.
This method scales each feature so that all values are within the range of 0 and 1. It achieves this by subtracting the minimum value of the feature and dividing by the range (difference between maximum and minimum values).
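In other words, for each value x the scaler computes (x - min) / (max - min). Done by hand for a single column, just to illustrate what MinMaxScaler does for us below:
# Manual min-max scaling of one column (equivalent to MinMaxScaler on that column)
income_scaled = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())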
Another popular method is standardization; you can choose whichever you prefer.
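If you prefer standardization (zero mean, unit variance), the code looks almost identical, just with StandardScaler; use either this or the MinMaxScaler below, not both:
from sklearn.preprocessing import StandardScaler
# Standardization: subtract the mean and divide by the standard deviation of each column
std_scaler = StandardScaler()
df[['income', 'loan_amount', 'credit_score']] = std_scaler.fit_transform(df[['income', 'loan_amount', 'credit_score']])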
# Scaling numerical features
scaler = MinMaxScaler()
df[['income', 'loan_amount', 'credit_score']] = scaler.fit_transform(df[['income', 'loan_amount', 'credit_score']])
Now that our dataset is ready, we have to separate the features we want to use for our ML prediction from the target variable.
# Selecting relevant features
features = ['age', 'income', 'loan_amount', 'credit_score', 'gender']
X = df[features]
y = df['loan_status']
The last step before we choose an ML algorithm is to split our data.
This fundamental technique divides a dataset into two parts to evaluate a model’s performance on unseen data.
I have written an article about this method and how to proceed with creating and evaluating your model. You can check it out here:
https://medium.com/@ritaaggelou/train-test-split-in-python-a-step-by-step-guide-with-example-for-accurate-model-evaluation-53741204ff7d
# Splitting the Data into Training and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Display the first few rows after preprocessing
X_train.head()