Online Shopping EDA for Beginners

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

pd.set_option("display.max_columns", 40)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
/kaggle/input/online-shopping-dataset/file.csv
/kaggle/input/online-shopping-dataset/file.xlsx

The dataset we are going to use in this notebook contains information about customer purchase orders of an online shop. Our goal is to clean the data, make a few basic visualizations and create a dashboard of customer behavior.

So, first let's explain what data our dataset contains.

Description of the columns:

  • CustomerID: Unique identifier for each customer.
  • Gender: Gender of the customer (e.g., Male, Female).
  • Location: Location or address information of the customer.
  • Tenure_Months: Number of months the customer has been associated with the platform.
  • Transaction_ID: Unique identifier for each transaction.
  • Transaction_Date: Date of the transaction.
  • Product_SKU: Stock Keeping Unit (SKU) identifier for the product.
  • Product_Description: Description of the product.
  • Product_Category: Category to which the product belongs.
  • Quantity: Quantity of the product purchased in the transaction.
  • Avg_Price: Average price of the product.
  • Delivery_Charges: Charges associated with the delivery of the product.
  • Coupon_Status: Status of the coupon associated with the transaction.
  • GST: Goods and Services Tax associated with the transaction.
  • Date: Date of the transaction (potentially redundant with Transaction_Date).
  • Offline_Spend: Amount spent offline by the customer.
  • Online_Spend: Amount spent online by the customer.
  • Month: Month of the transaction.
  • Coupon_Code: Code associated with a coupon, if applicable.
  • Discount_pct: Percentage of discount applied to the transaction.
In [2]:
df = pd.read_csv('/kaggle/input/online-shopping-dataset/file.csv')
df.head()
Out[2]:
Unnamed: 0 CustomerID Gender Location Tenure_Months Transaction_ID Transaction_Date Product_SKU Product_Description Product_Category Quantity Avg_Price Delivery_Charges Coupon_Status GST Date Offline_Spend Online_Spend Month Coupon_Code Discount_pct
0 0 17850.0 M Chicago 12.0 16679.0 2019-01-01 GGOENEBJ079499 Nest Learning Thermostat 3rd Gen-USA - Stainle... Nest-USA 1.0 153.71 6.5 Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0
1 1 17850.0 M Chicago 12.0 16680.0 2019-01-01 GGOENEBJ079499 Nest Learning Thermostat 3rd Gen-USA - Stainle... Nest-USA 1.0 153.71 6.5 Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0
2 2 17850.0 M Chicago 12.0 16696.0 2019-01-01 GGOENEBQ078999 Nest Cam Outdoor Security Camera - USA Nest-USA 2.0 122.77 6.5 Not Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0
3 3 17850.0 M Chicago 12.0 16699.0 2019-01-01 GGOENEBQ079099 Nest Protect Smoke + CO White Battery Alarm-USA Nest-USA 1.0 81.50 6.5 Clicked 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0
4 4 17850.0 M Chicago 12.0 16700.0 2019-01-01 GGOENEBJ079499 Nest Learning Thermostat 3rd Gen-USA - Stainle... Nest-USA 1.0 153.71 6.5 Clicked 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0

DATA CLEANING

Null Values

Let's check if our dataset has null values.

In [3]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52955 entries, 0 to 52954
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   Unnamed: 0           52955 non-null  int64  
 1   CustomerID           52924 non-null  float64
 2   Gender               52924 non-null  object 
 3   Location             52924 non-null  object 
 4   Tenure_Months        52924 non-null  float64
 5   Transaction_ID       52924 non-null  float64
 6   Transaction_Date     52924 non-null  object 
 7   Product_SKU          52924 non-null  object 
 8   Product_Description  52924 non-null  object 
 9   Product_Category     52955 non-null  object 
 10  Quantity             52924 non-null  float64
 11  Avg_Price            52924 non-null  float64
 12  Delivery_Charges     52924 non-null  float64
 13  Coupon_Status        52924 non-null  object 
 14  GST                  52924 non-null  float64
 15  Date                 52924 non-null  object 
 16  Offline_Spend        52924 non-null  float64
 17  Online_Spend         52924 non-null  float64
 18  Month                52955 non-null  int64  
 19  Coupon_Code          52555 non-null  object 
 20  Discount_pct         52555 non-null  float64
dtypes: float64(10), int64(2), object(9)
memory usage: 8.5+ MB

There are a few columns with null values. We are going to drop all rows that contain missing values, along with some columns that we are not going to use in our analysis.

In [4]:
df.dropna(inplace=True)
In [5]:
df = df.drop(columns=['Unnamed: 0','Product_SKU', 'Product_Description'])
df.head()
Out[5]:
CustomerID Gender Location Tenure_Months Transaction_ID Transaction_Date Product_Category Quantity Avg_Price Delivery_Charges Coupon_Status GST Date Offline_Spend Online_Spend Month Coupon_Code Discount_pct
0 17850.0 M Chicago 12.0 16679.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0
1 17850.0 M Chicago 12.0 16680.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0
2 17850.0 M Chicago 12.0 16696.0 2019-01-01 Nest-USA 2.0 122.77 6.5 Not Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0
3 17850.0 M Chicago 12.0 16699.0 2019-01-01 Nest-USA 1.0 81.50 6.5 Clicked 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0
4 17850.0 M Chicago 12.0 16700.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Clicked 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0

Check for Duplicates

Let's see if there are any duplicates or misspellings in our data.

In [6]:
df2 = df[['Gender', 'Location', 'Product_Category', 'Coupon_Status', 'Coupon_Code', 'Tenure_Months']]
for col in df2:
    print(col, df2[col].unique())
Gender ['M' 'F']
Location ['Chicago' 'California' 'New York' 'New Jersey' 'Washington DC']
Product_Category ['Nest-USA' 'Office' 'Apparel' 'Bags' 'Drinkware' 'Lifestyle' 'Waze'
 'Headgear' 'Notebooks & Journals' 'Nest-Canada' 'Bottles' 'Gift Cards'
 'Housewares' 'Android' 'Nest' 'Accessories']
Coupon_Status ['Used' 'Not Used' 'Clicked']
Coupon_Code ['ELEC10' 'OFF10' 'SALE10' 'AIO10' 'EXTRA10' 'WEMP10' 'HGEAR10' 'NJ10'
 'NCA10' 'BT10' 'GC10' 'HOU10' 'AND10' 'ELEC30' 'OFF30' 'SALE30' 'AIO30'
 'EXTRA30' 'WEMP30' 'HGEAR30' 'NJ30' 'NE30' 'ACC30' 'ELEC20' 'OFF20'
 'SALE20' 'AIO20' 'EXTRA20' 'WEMP20' 'HGEAR20' 'NJ20' 'NCA20' 'HOU20'
 'BT20' 'ACC20' 'NCA30' 'HOU30' 'BT30' 'GC30' 'NE20' 'AND20' 'GC20' 'NE10'
 'ACC10' 'AND30']
Tenure_Months [12. 43. 33. 49. 46. 14. 25. 50. 39. 21. 30. 29. 26. 41. 24. 40. 28. 15.
 18. 32. 23.  7.  4.  8. 27. 13. 38. 22.  6. 34. 42. 45.  3. 16. 17. 20.
  9. 10. 19.  5. 37. 31. 44. 36. 48. 47.  2. 35. 11.]
In [7]:
df.nunique()
Out[7]:
CustomerID           1468
Gender                  2
Location                5
Tenure_Months          49
Transaction_ID      25000
Transaction_Date      365
Product_Category       16
Quantity              151
Avg_Price             542
Delivery_Charges      267
Coupon_Status           3
GST                     4
Date                  365
Offline_Spend          11
Online_Spend          365
Month                  12
Coupon_Code            45
Discount_pct            3
dtype: int64
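The cell above inspects unique values, but we have not yet checked for fully duplicated rows. A minimal sketch of how that check could look, using a small hypothetical frame in place of the notebook's `df`:

```python
import pandas as pd

# Toy frame with one exact duplicate row (hypothetical data,
# standing in for the notebook's df)
toy = pd.DataFrame({
    "CustomerID": [17850.0, 17850.0, 12583.0],
    "Transaction_ID": [16679.0, 16679.0, 16680.0],
    "Quantity": [1.0, 1.0, 2.0],
})

# Count rows that are exact copies of an earlier row
n_dupes = int(toy.duplicated().sum())
print(n_dupes)  # 1 duplicate in this toy example

# Dropping them keeps the first occurrence of each row
deduped = toy.drop_duplicates()
print(len(deduped))  # 2 rows remain
```

On the real data the same two lines (`df.duplicated().sum()` and `df.drop_duplicates()`) would confirm whether any transactions were recorded twice.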
In [8]:
pd.isnull(df).sum()
Out[8]:
CustomerID          0
Gender              0
Location            0
Tenure_Months       0
Transaction_ID      0
Transaction_Date    0
Product_Category    0
Quantity            0
Avg_Price           0
Delivery_Charges    0
Coupon_Status       0
GST                 0
Date                0
Offline_Spend       0
Online_Spend        0
Month               0
Coupon_Code         0
Discount_pct        0
dtype: int64

Make new features from the old ones

We can add two columns that will help us better visualize our results.

In [9]:
df['Total_Price']=df['Avg_Price']*df['Quantity']+df['Delivery_Charges']
df['Total_Spend']=df['Offline_Spend']+df['Online_Spend']
df.head()
Out[9]:
CustomerID Gender Location Tenure_Months Transaction_ID Transaction_Date Product_Category Quantity Avg_Price Delivery_Charges Coupon_Status GST Date Offline_Spend Online_Spend Month Coupon_Code Discount_pct Total_Price Total_Spend
0 17850.0 M Chicago 12.0 16679.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0 160.21 6924.5
1 17850.0 M Chicago 12.0 16680.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0 160.21 6924.5
2 17850.0 M Chicago 12.0 16696.0 2019-01-01 Nest-USA 2.0 122.77 6.5 Not Used 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0 252.04 6924.5
3 17850.0 M Chicago 12.0 16699.0 2019-01-01 Nest-USA 1.0 81.50 6.5 Clicked 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0 88.00 6924.5
4 17850.0 M Chicago 12.0 16700.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Clicked 0.1 1/1/2019 4500.0 2424.5 1 ELEC10 10.0 160.21 6924.5

Change column types

The Date column is stored as object (string), so we are going to convert it to datetime.

In [10]:
df["Date"] = pd.to_datetime(df["Date"])
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 52524 entries, 0 to 52923
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   CustomerID        52524 non-null  float64       
 1   Gender            52524 non-null  object        
 2   Location          52524 non-null  object        
 3   Tenure_Months     52524 non-null  float64       
 4   Transaction_ID    52524 non-null  float64       
 5   Transaction_Date  52524 non-null  object        
 6   Product_Category  52524 non-null  object        
 7   Quantity          52524 non-null  float64       
 8   Avg_Price         52524 non-null  float64       
 9   Delivery_Charges  52524 non-null  float64       
 10  Coupon_Status     52524 non-null  object        
 11  GST               52524 non-null  float64       
 12  Date              52524 non-null  datetime64[ns]
 13  Offline_Spend     52524 non-null  float64       
 14  Online_Spend      52524 non-null  float64       
 15  Month             52524 non-null  int64         
 16  Coupon_Code       52524 non-null  object        
 17  Discount_pct      52524 non-null  float64       
 18  Total_Price       52524 non-null  float64       
 19  Total_Spend       52524 non-null  float64       
dtypes: datetime64[ns](1), float64(12), int64(1), object(6)
memory usage: 8.4+ MB
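The conversion above lets pandas infer the date format. Since the head output shows dates like `1/1/2019`, passing an explicit format is a safer variant that avoids day-first/month-first ambiguity. A small sketch on a hypothetical series:

```python
import pandas as pd

# Hypothetical date strings in the same style as the Date column
dates = pd.Series(["1/1/2019", "12/31/2019"])

# An explicit format stops pandas from guessing between
# day-first and month-first for strings like "1/2/2019"
parsed = pd.to_datetime(dates, format="%m/%d/%Y")
print(parsed.dt.month.tolist())  # [1, 12]
```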
In [11]:
df['Day'] = df['Date'].dt.day
df['Month'] = df['Date'].dt.month
df['Year'] = df['Date'].dt.year
df.head()
Out[11]:
CustomerID Gender Location Tenure_Months Transaction_ID Transaction_Date Product_Category Quantity Avg_Price Delivery_Charges Coupon_Status GST Date Offline_Spend Online_Spend Month Coupon_Code Discount_pct Total_Price Total_Spend Day Year
0 17850.0 M Chicago 12.0 16679.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Used 0.1 2019-01-01 4500.0 2424.5 1 ELEC10 10.0 160.21 6924.5 1 2019
1 17850.0 M Chicago 12.0 16680.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Used 0.1 2019-01-01 4500.0 2424.5 1 ELEC10 10.0 160.21 6924.5 1 2019
2 17850.0 M Chicago 12.0 16696.0 2019-01-01 Nest-USA 2.0 122.77 6.5 Not Used 0.1 2019-01-01 4500.0 2424.5 1 ELEC10 10.0 252.04 6924.5 1 2019
3 17850.0 M Chicago 12.0 16699.0 2019-01-01 Nest-USA 1.0 81.50 6.5 Clicked 0.1 2019-01-01 4500.0 2424.5 1 ELEC10 10.0 88.00 6924.5 1 2019
4 17850.0 M Chicago 12.0 16700.0 2019-01-01 Nest-USA 1.0 153.71 6.5 Clicked 0.1 2019-01-01 4500.0 2424.5 1 ELEC10 10.0 160.21 6924.5 1 2019
In [12]:
df.to_csv('shopping_data.csv', index=False)

DATA VISUALIZATION

Gender Feature

There are more female customers than male ones. But, as we can see from the second pie chart below, male customers spend more money than female ones.

In [13]:
import seaborn as sb
import matplotlib.pyplot as plt
val=df.Gender.value_counts()
plt.pie(val,labels=val.index.map({'F':'Female','M':'Male'}),autopct="%1.1f%%",shadow=True,explode=(0.1,0))
plt.axis('equal')
plt.title('Number of Male - Female Customer')
sb.set(style='white')
In [14]:
male=df[['Total_Price','Gender']].groupby('Gender').get_group('M')['Total_Price'].sum()
female=df[['Total_Price','Gender']].groupby('Gender').get_group('F')['Total_Price'].sum()
val=[male,female]
plt.pie(val,labels=['Male','Female'],autopct="%1.1f%%",shadow=True,explode=(0.1,0))
plt.axis('equal')
plt.title('Price Spent by Male / Female Customer')
sb.set(style='white')
In [15]:
df.Coupon_Code.value_counts()
Out[15]:
Coupon_Code
SALE20     6373
SALE30     5915
SALE10     5838
ELEC10     4826
ELEC30     4647
ELEC20     4540
EXTRA10    2317
OFF10      2250
EXTRA20    2211
OFF20      2202
OFF30      2061
EXTRA30    2047
NE30       1003
NE20        740
AIO10       654
AIO20       618
AIO30       610
NE10        455
NJ20        299
NJ10        280
HGEAR10     267
HGEAR20     267
HGEAR30     237
WEMP20      205
WEMP30      185
NJ30        170
WEMP10      164
NCA10       116
NCA30       110
BT10         98
ACC20        97
NCA20        91
GC10         89
BT30         87
BT20         83
ACC30        77
ACC10        60
HOU20        50
HOU10        42
GC20         41
HOU30        30
GC30         29
AND30        16
AND10        14
AND20        13
Name: count, dtype: int64

Coupon Feature

We have many coupons with different discounts (10%, 20% and 30%), but how many of them were actually used? We can see the big picture for all of them in the bar plot below. Only about 34% of them were used.

In [16]:
fig, axis = plt.subplots()
sb.countplot(data=df, x='Coupon_Status', ax=axis)

# label each bar with its count
for bar in axis.patches:
    axis.annotate(int(bar.get_height()), (bar.get_x() + bar.get_width() / 2, bar.get_height()), va='bottom', ha='center')
plt.ylabel('Frequency')
plt.title('Coupon_status')
Out[16]:
Text(0.5, 1.0, 'Coupon_status')
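The usage share quoted above can also be computed directly with `value_counts(normalize=True)` instead of being read off the bars. A toy example with a hypothetical status column standing in for `df["Coupon_Status"]`:

```python
import pandas as pd

# Hypothetical coupon statuses; the real column is df["Coupon_Status"]
status = pd.Series(["Used", "Not Used", "Clicked", "Used", "Clicked", "Clicked"])

# normalize=True returns fractions instead of raw counts
shares = status.value_counts(normalize=True)
print(round(shares["Used"], 2))  # 0.33 in this toy sample
```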

We can see a more detailed view of how the coupons were used below.

In [17]:
discounts = ['30', '20', '10']
for pct in discounts:
    filtered_df = df[df.Coupon_Code.str.contains(pct)]
    # set the figure size before plotting, not after
    plt.figure(figsize=(12, 10))
    p = sb.countplot(data=filtered_df, x='Coupon_Code', hue='Coupon_Status')
    p.legend(title='Coupon_Status', bbox_to_anchor=(1, 1), loc='upper left')
    plt.title('Coupon Status for ' + pct + '% Discount')
    # rotate x-labels for readability
    plt.xticks(rotation=45, ha='right')
    plt.show()

Price Feature

Let's see how the average price behaves across the different months (max, min, mean and outliers).

  • January, February, October and November have the highest maximum average prices.
  • August has the lowest maximum average price.
  • December has the highest minimum average price.
  • The highest mean average price occurs in November.
  • The lowest mean average price occurs in August.
In [18]:
#boxplot chart
df.boxplot(by ='Month', column =['Avg_Price'], grid = False)
Out[18]:
<Axes: title={'center': 'Avg_Price'}, xlabel='Month'>
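The bullet points above are read off the boxplot, but the same monthly min/mean/max can be tabulated with a groupby aggregation. A sketch on a hypothetical miniature of the data:

```python
import pandas as pd

# Toy frame with a few months of hypothetical prices,
# standing in for the notebook's df[["Month", "Avg_Price"]]
toy = pd.DataFrame({
    "Month": [1, 1, 8, 8, 11, 11],
    "Avg_Price": [153.71, 81.50, 10.0, 20.0, 200.0, 300.0],
})

# One row per month with its min, mean and max price
stats = toy.groupby("Month")["Avg_Price"].agg(["min", "mean", "max"])
print(stats.loc[11, "mean"])  # 250.0
```

Running the same `df.groupby("Month")["Avg_Price"].agg(...)` on the real data gives the exact figures behind each box.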

We can observe the same pattern in the line graph below, which shows the monthly mean of Total_Price.

  • The highest mean value for prices is in November.
  • The lowest mean value for prices is in August.
In [19]:
#line chart
df3 = df.groupby(pd.PeriodIndex(df['Date'], freq="M"))['Total_Price'].mean()
df3.plot.line(title='Mean Value of Total Prices per Month')
Out[19]:
<Axes: title={'center': 'Mean Value of Total Prices per Month'}, xlabel='Date'>

Gender Feature vs Products Feature

It would also be interesting to see how many female and male customers chose each kind of product.

  • Apparel is the most preferred category. (11355 males and 6771 females)
  • Nest-USA comes second. (8642 males and 5371 females)
In [20]:
df2 = df.groupby(['Product_Category', 'Gender'])['Product_Category'].count().unstack('Gender').fillna(0)
#stacked bar chart
ax=df2[['F','M']].plot(kind='bar', stacked=True, title='Number of Products purchased by Gender')
#add annotations
for container in ax.containers:
    ax.bar_label(container, size=8)
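The groupby/unstack in the cell above can also be written with `pd.crosstab`, which builds the same category-by-gender count table in one call. A toy example with hypothetical rows standing in for `df`:

```python
import pandas as pd

# Hypothetical purchases, standing in for the notebook's df
toy = pd.DataFrame({
    "Product_Category": ["Apparel", "Apparel", "Nest-USA", "Apparel"],
    "Gender": ["M", "F", "M", "M"],
})

# Rows are categories, columns are genders, cells are counts
counts = pd.crosstab(toy["Product_Category"], toy["Gender"])
print(counts.loc["Apparel", "M"])  # 2
```

On the real data, `pd.crosstab(df["Product_Category"], df["Gender"])` would reproduce the `df2` table fed to the stacked bar chart.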

Now, it would be interesting to see how the most purchased products sell over different months.

In [21]:
df_plot= df.groupby(['Month','Product_Category'])['Product_Category'].count().unstack('Product_Category').fillna(0) 
df_plot.head()
Out[21]:
Product_Category Accessories Android Apparel Bags Bottles Drinkware Gift Cards Headgear Housewares Lifestyle Nest Nest-Canada Nest-USA Notebooks & Journals Office Waze
Month
1 1.0 4.0 960.0 155.0 16.0 295.0 4.0 54.0 10.0 200.0 0.0 39.0 1563.0 54.0 607.0 61.0
2 0.0 1.0 905.0 139.0 19.0 284.0 6.0 66.0 5.0 199.0 0.0 22.0 1055.0 51.0 468.0 25.0
3 1.0 7.0 1437.0 195.0 24.0 401.0 6.0 66.0 17.0 245.0 0.0 28.0 1099.0 57.0 658.0 49.0
4 4.0 2.0 1582.0 145.0 20.0 267.0 7.0 72.0 10.0 252.0 0.0 27.0 996.0 73.0 606.0 35.0
5 6.0 5.0 1876.0 163.0 26.0 287.0 25.0 65.0 12.0 262.0 0.0 21.0 970.0 57.0 685.0 56.0
In [22]:
areaplot = df_plot[['Apparel','Nest-USA','Office']].plot.area( figsize = (10,8), title='Number of top 3 Products purchased by Month', stacked= False)

Location Feature

We have several locations in the USA (Chicago, California, New York, New Jersey, Washington DC), but where do most of the purchases come from?

  • The most purchases were made from Chicago.
  • The fewest purchases were made from Washington DC.
In [23]:
sb.histplot(df.Location)
plt.ylabel('Frequency')
plt.xlabel('Cities')
plt.xticks(rotation=45)
plt.title('Location Frequencies')
Out[23]:
Text(0.5, 1.0, 'Location Frequencies')