Our Goal: To create a clean, ready-to-use dataset for a machine learning model.
We are going to create a small example dataset of 100 applicants that will have:
applicant_id: Unique applicant ID
age: Applicant's age
income: Monthly income (in USD)
loan_amount: Requested loan amount (in USD)
credit_score: Creditworthiness score (300-850)
gender: Male, Female, or Non-Binary
loan_status: Approved (1) or Rejected (0)
Our target is loan_status: we want to build a model that uses all the other features as input and decides whether to approve or reject a loan for an applicant.
# import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
We will create a dataset with 100 rows and then deliberately introduce null values and outliers, so that we have something to clean.
num_rows = 100 # We will create a dataset with 100 rows
# set the seed for generating random numbers
np.random.seed(42)
df = {
    "applicant_id": range(1, num_rows + 1),  # Applicant id from 1 to 100
    "age": np.random.randint(18, 65, num_rows).astype(int),  # Random ages from 18 to 65
    "income": np.random.randint(3000, 20000, num_rows).astype(float),  # Monthly income from 3000 to 20000
    "loan_amount": np.random.randint(10000, 100000, num_rows).astype(float),  # Loan request from 10000 to 100000
    "credit_score": np.random.randint(300, 850, num_rows).astype(int),  # Credit score from 300 to 850
    "gender": np.random.choice(["Male", "Female", "Non-Binary"], num_rows),  # Random genders
    "loan_status": np.random.choice([0, 1], num_rows)  # Approved (1) or Rejected (0)
}
We have to convert our dictionary into a DataFrame so that we can run operations on it and feed it into a machine learning model later.
# convert dictionary object to dataframe
df = pd.DataFrame(df)
df.head()
The dataset that we created above is clean, so we are going to mess it up on purpose: we will add null values, duplicate rows, and outliers.
First, we add null values to 10% of the entries in the "age", "income", "loan_amount", and "credit_score" columns.
# Introduce Null Values Randomly
for col in ["age", "income", "loan_amount", "credit_score"]:
    df.loc[df.sample(frac=0.1).index, col] = np.nan  # 10% missing values in each column
Next, we select 5 random rows, duplicate them, and append them to our dataset.
# Introduce Duplicate Rows
duplicate_df = df.sample(n=5, random_state=42) # Select 5 random rows to duplicate
df = pd.concat([df, duplicate_df], ignore_index=True) # Add duplicates
Finally, we pick 5 random rows and overwrite their "age", "income", and "loan_amount" values with unrealistic ones.
# Introduce Outliers in Age, Income, Loan Amount
outlier_indices = np.random.choice(df.index, size=5, replace=False) # Select 5 rows randomly
df.loc[outlier_indices, "age"] = np.random.randint(100, 120, 5) # Unrealistically high ages
df.loc[outlier_indices, "income"] = np.random.randint(100000, 500000, 5) # High income outliers
df.loc[outlier_indices, "loan_amount"] = np.random.randint(200000, 1000000, 5) # High loan amounts
Now, if our code is right, we should have null values in our dataset.
But how do we check whether a dataset has null values when we are not sure there are any?
We use the code below, which detects and sums the null values in each column.
# checking for null values
df.isna().sum()
The output above shows how many missing values each column contains. We will fill them with the mean of each numeric column.
# Handling missing values
df.fillna(df.mean(numeric_only=True), inplace=True)
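One thing to keep in mind: the mean is sensitive to outliers, and at this point we have not capped them yet. A minimal alternative sketch, filling with the median instead (not what we do in this article), would look like this:
# Alternative: fill missing values with the column median, which is more robust to outliers
df.fillna(df.median(numeric_only=True), inplace=True)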
When cleaning a dataset, it is also important to check whether it contains duplicated rows.
This can easily be done with the code below, which shows exactly which rows are duplicates.
If there were no duplicates in our dataset, the outcome would be an empty DataFrame.
df[df.duplicated(keep=False)]
Now, let's delete them from the dataset.
# Removing duplicate entries
df.drop_duplicates(inplace=True)
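To double-check, we can count the duplicate rows that remain; after drop_duplicates this should be 0:
# Verify that no duplicate rows remain
print(df.duplicated().sum())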
There are many ways to detect whether a dataset has outliers. The most popular is to plot each numerical column, for example with a boxplot or a distribution plot.
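For instance, a quick per-column boxplot makes extreme values easy to spot. A minimal sketch (it assumes matplotlib is installed; the rest of this article does not use it):
import matplotlib.pyplot as plt
# Boxplots: points far outside the whiskers are potential outliers
df[["age", "income", "loan_amount"]].plot(kind="box", subplots=True, layout=(1, 3), figsize=(12, 4))
plt.show()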
In our example, we will instead use the code below, which helps us quickly identify potential outliers from the summary statistics.
# Detect outliers
df.describe()[["age", "income", "loan_amount"]]
max vs. 75% (upper outliers):
max(age) = 117 → Very high! Most ages are below 50, so 117 is an outlier.
max(income) = 414997 → Much higher than the 75th percentile income of 17555, so likely an outlier.
max(loan_amount) = 876926 → Unusually high, likely an outlier.
min vs. 25% (lower outliers):
If min were much lower than the 25th percentile, it would be a lower outlier (not the case here).
Now that we know there are outliers, let's handle them. We are going to cap them using the IQR (interquartile range) method, with the function below.
# Creating a function that detects and caps outliers in a column using the IQR method.
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)  # First quartile (25th percentile)
    Q3 = df[column].quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Capping the outliers
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])
    return df
Now, let's call the function for the numerical columns in which we want to handle outliers.
# Applying outlier handling to numerical columns
for col in ["age", "income", "loan_amount", "credit_score"]:
    df = handle_outliers(df, col)
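If we re-run the summary statistics, the maximum values of these columns should now sit at the IQR upper bounds instead of the extreme values we injected earlier:
# Check that the extreme values have been capped
df[["age", "income", "loan_amount"]].describe()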
Of course, some real-world datasets are much messier than this one.
The cleaning process above covers the most common problems, but you may face additional issues depending on the dataset you are working with.
Class imbalance occurs when one class in a classification problem significantly outweighs the other class. It's common in many machine learning problems.
Let's say that in our dataset 90% of the loans are approved and only 10% are not.
Problem: A machine learning model trained on this data will be biased towards predicting loan approvals, because the majority class dominates.
That's why we should check whether one class is much more frequent than the other, which would push our model to favor that class.
# Checking Class Imbalance in loan_status
loan_approval_rate = df['loan_status'].value_counts(normalize=True)
print("Loan Approval Rate:\n", loan_approval_rate)
We can see that our target is balanced, so we do not have that problem.
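If loan_status had been imbalanced (say the 90%/10% split described above), one simple remedy is to randomly oversample the minority class. A minimal sketch with sklearn.utils.resample, assuming Approved (1) is the majority class as in that example:
from sklearn.utils import resample
# Split the rows into majority and minority classes
majority = df[df["loan_status"] == 1]
minority = df[df["loan_status"] == 0]
# Randomly duplicate minority rows until both classes have the same size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)
# Recombine into a balanced dataset
df_balanced = pd.concat([majority, minority_upsampled])
print(df_balanced["loan_status"].value_counts())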
Balance in the features themselves is NOT required, but in some cases it is still worth checking:
# Checking Class Imbalance in gender
gender_rate = df['gender'].value_counts(normalize=True)
print("Gender Rate:\n", gender_rate)
Female applicants make up the smallest share of the dataset (23.76%), so the gender groups are not evenly represented.
This suggests that gender might end up influencing loan approvals unfairly.
In this case, a gender imbalance is present because one group (Female) is noticeably underrepresented compared to the others.
How to Confirm Gender Bias?
We can perform a statistical test to check if gender significantly affects loan_status:
We will use the Chi-Square test for gender bias.
The Chi-Square test checks whether loan approval rates are independent of gender.
If the p-value is below 0.05, loan approval rates differ significantly across genders, which suggests bias.
import scipy.stats as stats
import pandas as pd
# Create a contingency table (counts of approvals/rejections for each gender)
contingency_table = pd.crosstab(df["gender"], df["loan_status"])
# Perform the chi-square test
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
# Print the results
print(f"Chi-Square Statistic: {chi2}")
print(f"P-Value: {p}")
# Interpret the results
if p < 0.05:
    print("🚨 Loan approval rates are significantly different across genders! Bias may be present.")
else:
    print("✅ No significant gender bias detected.")
How to Fix Gender Imbalance?
If we confirm gender bias, there are a few common solutions, such as excluding the biased feature, re-sampling the data so the groups are better represented, or re-weighting the samples during training.
We will not exclude the "gender" column here, because it is the only categorical variable in our dataset and we want to demonstrate how to encode categorical variables for our ML model.
But if you work with a real-world dataset, consider these methods for handling imbalance.
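For completeness, the simplest of these fixes, dropping the column entirely, would just be (again, not what we do here):
# Drop the gender column so the model cannot use it
df_no_gender = df.drop(columns=["gender"])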
Machine learning algorithms can't handle categorical variables directly; they can't understand that "Female" is different from "Male".
Encoding translates these categories into a language that machines can understand and work with.
That's what we are going to do now.
df['gender'].unique()
# Encoding categorical variables
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
df['gender'].unique()
Now, instead of categories in our column we have numbers that represent them (LabelEncoder assigns the integers in alphabetical order of the categories):
0 -> "Female"
1 -> "Male"
2 -> "Non-Binary"
Feature scaling is a fundamental preprocessing step in machine learning aimed at ensuring that numerical features have a similar scale.
This is important because many ML algorithms perform better or converge faster when the input numerical features are on a similar scale.
There are two popular ways to do this; we are going to use the min-max scaler, known in statistics as normalization.
This method scales each feature so that all values are within the range of 0 and 1. It achieves this by subtracting the minimum value of the feature and dividing by the range (difference between maximum and minimum values).
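In other words, for each value x the scaler computes (x - min) / (max - min). Done by hand for a single column, just to illustrate what MinMaxScaler does for us below:
# Manual min-max scaling of one column (equivalent to MinMaxScaler on that column)
income_scaled = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())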
Another popular method is standardization; you can choose whichever you prefer.
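If you prefer standardization (zero mean, unit variance), the code looks almost identical, just with StandardScaler; use either this or the MinMaxScaler below, not both:
from sklearn.preprocessing import StandardScaler
# Standardization: subtract the mean and divide by the standard deviation of each column
std_scaler = StandardScaler()
df[['income', 'loan_amount', 'credit_score']] = std_scaler.fit_transform(df[['income', 'loan_amount', 'credit_score']])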
# Scaling numerical features
scaler = MinMaxScaler()
df[['income', 'loan_amount', 'credit_score']] = scaler.fit_transform(df[['income', 'loan_amount', 'credit_score']])
Now that our dataset is ready, we have to separate the features we want to use for our ML prediction from the target variable.
# Selecting relevant features
features = ['age', 'income', 'loan_amount', 'credit_score', 'gender']
X = df[features]
y = df['loan_status']
The last step before we choose an ML algorithm is to split our data.
This fundamental technique divides a dataset into two parts to evaluate a model’s performance on unseen data.
I have written an article about this method and how to proceed with creating and evaluating your model. You can check it out here:
https://medium.com/@ritaaggelou/train-test-split-in-python-a-step-by-step-guide-with-example-for-accurate-model-evaluation-53741204ff7d
# Splitting the Data into Training and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Display the first few rows after preprocessing
X_train.head()