Data Preparation for a Loan Approval Model

Our Goal: To create a clean, ready-to-use dataset for a machine learning model.

Creating our dataset

We are going to create a small example dataset of 100 applicants with the following columns:
applicant_id: Unique applicant ID
age: Applicant's age
income: Monthly income (in USD)
loan_amount: Requested loan amount (in USD)
credit_score: Creditworthiness score (300-850)
gender: Male, Female, or Non-Binary
loan_status: Approved (1) or Rejected (0)

Our target is loan_status: the model will use all the other features as information to decide whether to approve or reject a loan for an applicant.

In [1]:
# import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

1. Data Collection (Simulating Raw Data)

1.1 Creating the Dataset Using a Dictionary

We will create a dataset with 100 rows and then deliberately introduce null values, duplicates, and outliers so that we have something to clean.

  • Set our random seed to 42 to always produce the same random dataset, making results reproducible.
  • Create a dictionary with the features that we described before.
In [2]:
num_rows = 100  # We will create a dataset with 100 rows

# set the seed for generating random numbers
np.random.seed(42)

df = {
    "applicant_id": range(1, num_rows + 1), # Applicant id from 1 to 100 
    "age": np.random.randint(18, 65, num_rows).astype(int),  # Random ages from 18 to 65
    "income": np.random.randint(3000, 20000, num_rows).astype(float),  # Monthly income from 3000 to 20000
    "loan_amount": np.random.randint(10000, 100000, num_rows).astype(float),  # Loan request from 10000 to 100000
    "credit_score": np.random.randint(300, 850, num_rows).astype(int),  # Credit score from 300 to 850
    "gender": np.random.choice(["Male", "Female", "Non-Binary"], num_rows),  # Random genders
    "loan_status": np.random.choice([0, 1], num_rows)  # Approved (1) or Rejected (0)
}

1.2. Converting to a DataFrame

We have to convert our dictionary to a DataFrame so that we can run pandas operations on it and use it later in our machine learning model.

In [3]:
# convert dictionary object to dataframe
df = pd.DataFrame(df)
df.head()
Out[3]:
applicant_id age income loan_amount credit_score gender loan_status
0 1 56 5695.0 11802.0 460 Male 1
1 2 46 18422.0 18155.0 555 Non-Binary 0
2 3 32 8258.0 83656.0 622 Male 1
3 4 60 9736.0 49384.0 427 Male 0
4 5 25 3391.0 57254.0 317 Female 1

The dataset that we created above is clean. We are going to add:

  • null values
  • duplicated rows
  • outliers

    so that we can see how these key steps of the cleaning process work.

1.3 Adding Null Values

We are going to replace 10% of the values in the "age", "income", "loan_amount" and "credit_score" columns with nulls.

In [4]:
# Introduce Null Values Randomly
for col in ["age", "income", "loan_amount", "credit_score"]:
    df.loc[df.sample(frac=0.1).index, col] = np.nan  # 10% missing values in each column

1.4. Adding Duplicated Rows

We are going to select 5 random rows, duplicate them, and append them to our dataset.

In [5]:
# Introduce Duplicate Rows
duplicate_df = df.sample(n=5, random_state=42)  # Select 5 random rows to duplicate
df = pd.concat([df, duplicate_df], ignore_index=True)  # Add duplicates

1.5 Adding Outliers

We are going to pick 5 random rows and give them unrealistically high values in the "age", "income" and "loan_amount" columns.

In [6]:
# Introduce Outliers in Age, Income, Loan Amount
outlier_indices = np.random.choice(df.index, size=5, replace=False)  # Select 5 rows randomly
df.loc[outlier_indices, "age"] = np.random.randint(100, 120, 5)  # Unrealistically high ages
df.loc[outlier_indices, "income"] = np.random.randint(100000, 500000, 5)  # High income outliers
df.loc[outlier_indices, "loan_amount"] = np.random.randint(200000, 1000000, 5)  # High loan amounts

2. Handling Missing Values

Now, if our code is right, we should have null values in our dataset.

But how do we check whether a dataset has null values when we are not sure there are any?
We use the code below, which counts the number of null values in each column.

In [7]:
# checking for null values
df.isna().sum()
Out[7]:
applicant_id     0
age              9
income          10
loan_amount     10
credit_score    10
gender           0
loan_status      0
dtype: int64

The above outcome means that we have:

  • 9 null values in the "age" column and
  • 10 null values in the "income", "loan_amount" and "credit_score" columns.

We will handle them with mean imputation: each missing value is replaced by the mean of its column, so no rows are dropped.
In [8]:
# Handling missing values
df.fillna(df.mean(numeric_only=True), inplace=True)
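To double-check the imputation (an extra step, not part of the original run), re-running the null count should now return zero for every column:

# every column should now report 0 missing values
print(df.isna().sum())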

3. Handling Duplicated Rows

To clean a dataset, it is important to check whether it contains duplicated rows.
This can easily be done with the code below, which shows exactly which rows are duplicates.
If we did not have duplicates in our dataset, the outcome would be an empty DataFrame.

In [9]:
df[df.duplicated(keep=False)]
Out[9]:
applicant_id age income loan_amount credit_score gender loan_status
44 45 42.0 19157.0 12869.0 346.0 Male 1
53 54 61.0 8530.0 56214.0 426.0 Male 0
70 71 51.0 16949.0 40746.0 519.0 Non-Binary 1
83 84 35.0 18087.0 53484.0 322.0 Non-Binary 0
100 84 35.0 18087.0 53484.0 322.0 Non-Binary 0
101 54 61.0 8530.0 56214.0 426.0 Male 0
102 71 51.0 16949.0 40746.0 519.0 Non-Binary 1
104 45 42.0 19157.0 12869.0 346.0 Male 1

Now, let's delete them from the dataset.

In [10]:
# Removing duplicate entries
df.drop_duplicates(inplace=True)
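As a quick check (an extra step, not shown in the original run), the duplicate count should now be zero:

# 0 means every remaining row is unique
print(df.duplicated().sum())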

4. Handling Outliers Using IQR

There are many ways to detect whether a dataset has outliers. The most popular is to plot each column, for example with a boxplot or a distribution plot (a quick sketch of this is shown right below).
In our example we will rely on df.describe(), which lets us quickly identify potential outliers from the summary statistics.
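If you prefer the visual check, a minimal sketch using matplotlib (an extra dependency, not imported above) could look like this:

import matplotlib.pyplot as plt

# boxplots make extreme values stand out as points beyond the whiskers
df[["age", "income", "loan_amount"]].plot(kind="box", subplots=True, layout=(1, 3), figsize=(12, 4))
plt.show()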

max vs. 75% (Upper Outliers)

max(age) = 117 → Very high! Most ages are below 50, so 117 is an outlier.
max(income) = 414997 → Much higher than 75% income = 17555, so likely an outlier.
max(loan_amount) = 876926 → Unusually high, likely an outlier.

min vs. 25% (Lower Outliers)

If min is much lower than 25%, it might be a lower outlier (not seen in this case).

In [11]:
# Detect outliers
df.describe()[["age", "income", "loan_amount"]]
Out[11]:
age income loan_amount
count 101.000000 101.000000 101.000000
mean 43.827970 24603.175612 84249.281918
std 20.313057 58533.046676 140804.492034
min 18.000000 3197.000000 12049.000000
25% 31.000000 8056.000000 39299.000000
50% 42.000000 12914.000000 57202.000000
75% 53.000000 17555.000000 82595.147368
max 117.000000 414997.000000 876926.000000

Now, that we know there are outliers let's handle them. We are going to:

  • create a function that detects outliers using the Interquartile Range (IQR) method and
    (The IQR method is a statistical technique for detecting outliers: any value more than 1.5 × IQR below the first quartile or above the third quartile is flagged as significantly lower or higher than the majority of the data.)
  • use the capping technique on the outliers.
    (We set a lower and an upper limit; anything below the lower bound is raised to it, and anything above the upper bound is lowered to it.
    For example, if the upper bound for age works out to 86, any age above 86 is set to 86.)
In [12]:
# Creating a function that detects and caps outliers in a column using the IQR method.
def handle_outliers(df, column):
    Q1 = df[column].quantile(0.25)  # First quartile (25th percentile)
    Q3 = df[column].quantile(0.75)  # Third quartile (75th percentile)
    IQR = Q3 - Q1  # Interquartile range
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Capping the outliers
    df[column] = np.where(df[column] < lower_bound, lower_bound, df[column])
    df[column] = np.where(df[column] > upper_bound, upper_bound, df[column])

    return df

Now, let's call the function on the numerical columns we want to check for outliers.

In [13]:
# Applying outlier handling to numerical columns
for col in ["age", "income", "loan_amount", "credit_score"]:
    df = handle_outliers(df, col)
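As a sanity check (an extra step, not part of the original run), re-running the summary statistics should show that the maximum values are now capped at Q3 + 1.5 × IQR instead of the extreme values we saw before:

# the min/max rows should no longer contain the injected outliers
print(df[["age", "income", "loan_amount"]].describe().loc[["min", "max"]])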

Of course, real-world datasets can be much messier.
The cleaning process above covers the most common issues, but you may face additional ones depending on the dataset you are working with.

5. Bias Detection

Class imbalance occurs when one class in a classification problem significantly outweighs the other class. It's common in many machine learning problems.

5.1. Checking Class Imbalance in our Target

Let's say that in our dataset 90% of the loans were approved and only 10% were rejected.

Problem: a machine learning model trained on this data will be biased towards predicting loan approvals, because the majority class dominates.
That's why we should check whether one class is much more frequent than the other, which would push our model to favor one output.

In [14]:
# Checking Class Imbalance in loan_status
loan_approval_rate = df['loan_status'].value_counts(normalize=True)
print("Loan Approval Rate:\n", loan_approval_rate)
Loan Approval Rate:
 loan_status
0    0.534653
1    0.465347
Name: proportion, dtype: float64

We can see that our target is balanced, so we do not have that problem.

5.2. Checking Class Imbalance in a Feature

Feature balance is NOT necessary, but in some cases, we should check:

  • If a feature is extremely imbalanced, it may indicate bias.
  • If a feature directly impacts fairness (e.g., gender, race, or geography), balance may be necessary.

    In our case it is important to check the "gender" column for any imbalanced class.
    If one gender is underrepresented, the model may not learn well for that group.
In [15]:
# Checking Class Imbalance in gender
gender_rate = df['gender'].value_counts(normalize=True)
print("Gender Rate:\n", gender_rate)
Gender Rate:
 gender
Male          0.425743
Non-Binary    0.336634
Female        0.237624
Name: proportion, dtype: float64

Female applicants are the least represented group (23.76% of applicants).
This alone does not prove unfair treatment, but it is a warning sign worth investigating.

In this case, gender imbalance is present because:

  • Males make up almost twice the share of applicants that females do (42.57% vs. 23.76%), so the model sees far fewer examples of female applicants.
  • If this imbalance reflects bias in real-world data, our ML model may learn gender discrimination!
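To look at actual approval rates per gender (rather than how often each gender appears), a minimal sketch, as an extra step not shown in the original run:

# mean of loan_status per gender = share of approved loans in each group
print(df.groupby("gender")["loan_status"].mean())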

How to Confirm Gender Bias?
We can perform a statistical test to check if gender significantly affects loan_status:

We will use the Chi-Square test for gender bias.
The Chi-Square test checks whether loan approval rates are independent of gender.

If the p-value < 0.05, approval rates differ significantly across genders, which suggests bias.

In [16]:
import scipy.stats as stats
import pandas as pd

# Create a contingency table (counts of approvals/rejections for each gender)
contingency_table = pd.crosstab(df["gender"], df["loan_status"])

# Perform the chi-square test
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)

# Print the results
print(f"Chi-Square Statistic: {chi2}")
print(f"P-Value: {p}")

# Interpret the results
if p < 0.05:
    print("🚨 Loan approval rates are significantly different across genders! Bias may be present.")
else:
    print("✅ No significant gender bias detected.")
Chi-Square Statistic: 6.1852368396182875
P-Value: 0.045382967033281485
🚨 Loan approval rates are significantly different across genders! Bias may be present.

How to Fix Gender Imbalance?
If we confirm gender bias, we have three main solutions:

  1. Rebalance the Dataset (Oversample/Undersample)
    If fewer female applicants exist, duplicate their data to balance representation.
  2. Remove Gender from Model Features
    If gender is causing bias, remove it from the training dataset.
  3. Use Bias-Mitigating ML Models
    Libraries like Fairlearn can help reduce bias in AI models.

We will not exclude the "gender" column here, because it is the only categorical variable in our dataset and we want to demonstrate how categorical variables are encoded for an ML model.
But if you work with a real-world dataset, consider the methods above for handling imbalance.
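For reference, here is a minimal sketch of option 1 (random oversampling of the underrepresented group); the "Female" group and the target size are chosen only for illustration:

# randomly duplicate rows of the underrepresented group until it matches the largest group
target_size = df["gender"].value_counts().max()
minority = df[df["gender"] == "Female"]
extra = minority.sample(n=target_size - len(minority), replace=True, random_state=42)
balanced_df = pd.concat([df, extra], ignore_index=True)  # we keep df itself unchanged
print(balanced_df["gender"].value_counts())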

6. Data Transformation

6.1. Encode Categorical Variables

Most machine learning algorithms cannot handle categorical variables directly; they can't understand on their own that "Female" is different from "Male".

Encoding translates these categories into numbers, a language that machines can understand and work with.
That's what we are going to do now.

In [17]:
df['gender'].unique()
Out[17]:
array(['Male', 'Non-Binary', 'Female'], dtype=object)
In [18]:
# Encoding categorical variables
le = LabelEncoder()
df['gender'] = le.fit_transform(df['gender'])
In [19]:
df['gender'].unique()
Out[19]:
array([1, 2, 0])

Now, instead of categories in our column we have numbers that represent them. LabelEncoder assigns the codes in alphabetical order:
0 -> "Female"
1 -> "Male"
2 -> "Non-Binary"

6.2. Scale Numerical Features

Feature scaling is a fundamental preprocessing step in machine learning aimed at ensuring that numerical features have a similar scale.
This is important because many ml algorithms perform better or converge faster when the input numerical features are on a similar scale.
There are two popular ways to do this; we are going to use the min-max scaler, known in statistics as normalization.

This method scales each feature so that all values are within the range of 0 and 1. It achieves this by subtracting the minimum value of the feature and dividing by the range (difference between maximum and minimum values).


Another popular method is standardization (z-score scaling); you can use whichever you prefer.
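For reference, here is the same transformation written out by hand for a single column (a minimal sketch; the MinMaxScaler call below does exactly this for every selected column):

# min-max scaling by hand: x_scaled = (x - min) / (max - min)
income_manual = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
print(income_manual.head())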

In [20]:
# Scaling numerical features
scaler = MinMaxScaler()
df[['income', 'loan_amount', 'credit_score']] = scaler.fit_transform(df[['income', 'loan_amount', 'credit_score']])

7. Feature Engineering

Now that our dataset is ready, we have to separate the features we want to use for prediction from our target variable.

In [21]:
# Selecting relevant features
features = ['age', 'income', 'loan_amount', 'credit_score', 'gender']
X = df[features]
y = df['loan_status']

8. Splitting the Data

The last part before we choose an ML algorithm is to split our data.
This fundamental technique divides a dataset into two parts so that we can evaluate a model's performance on unseen data.

I have written an article about this method and about how to proceed with creating and evaluating your model. You can check it out here:
https://medium.com/@ritaaggelou/train-test-split-in-python-a-step-by-step-guide-with-example-for-accurate-model-evaluation-53741204ff7d

In [22]:
# Splitting the Data into Training and Test Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
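One optional variation, an assumption on our part rather than part of the run above: for classification targets you can pass stratify=y so that both splits keep a similar approval/rejection ratio (using it would change the exact rows shown in the next output):

# stratified split keeps the class proportions of loan_status in train and test
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)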
In [23]:
# Display the first few rows after preprocessing
X_train.head()
Out[23]:
age income loan_amount credit_score gender
89 86.0 1.000000 1.000000 0.592593 0
26 39.0 0.550120 0.520673 0.465361 2
42 35.0 0.480101 0.520673 0.353704 2
70 51.0 0.480730 0.211801 0.398148 2
15 20.0 0.483876 0.520673 0.465361 2