Data-Driven Approach to Predict the Success of Bank Marketing

Nipun Agrawal
20 min read · Apr 11, 2021

INTRODUCTION: Traditional marketing options have contributed little to growing the business of banks. Due to internal competition and the financial crisis, European banks were under pressure to increase their financial assets. They offered long-term deposits with good interest rates using a direct marketing strategy, but contacting many people takes a lot of time and the success rate is low. The banks therefore want help from technology to come up with a solution that increases efficiency: making fewer calls while improving the success rate. A Portuguese banking institution has provided the data related to marketing campaigns that took place over phone calls.

TABLE OF CONTENTS

i. Business Problem

ii. Machine Learning Problem

iii. Data set Overview

iv. Performance Metric

v. Exploratory Data Analysis

vi. Pre-Processing

vii. Data Modelling

viii. Model Comparison

ix. Model Deployment

x. Further Improvements

xi. References

BUSINESS PROBLEM: Finding out the characteristics that help the bank get customers to successfully subscribe to deposits, which helps in increasing campaign efficiency and selecting high-value customers.

MACHINE LEARNING PROBLEM: The goal is to build a machine learning model that learns the unknown patterns and maps the input features to a prediction of whether a client will subscribe to a term deposit or not.

DATA SET OVERVIEW:

The data is related to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

The data set is taken from the UCI Machine Learning Repository:

1) bank-additional-full.csv with all examples (41188),
2) bank-additional.csv with 10% of the examples (4119).

Attribute Information:

Input variables:
# bank client data:
1 — age (numeric)
2 — job : type of job (categorical: ‘admin.’,’blue-collar’,’entrepreneur’,’housemaid’,’management’,’retired’,’self-employed’,’services’,’student’,’technician’,’unemployed’,’unknown’)
3 — marital : marital status (categorical: ‘divorced’, ‘married’, ‘single’, ‘unknown’; note: ‘divorced’ means divorced or widowed)
4 — education (categorical: ‘basic.4y’,’basic.6y’,’basic.9y’,’high.school’,’illiterate’,’professional.course’,’university.degree’,’unknown’)
5 — default: has credit in default? (categorical: ‘no’, ‘yes’, ‘unknown’)
6 — housing: has housing loan? (categorical: ‘no’, ‘yes’ ,’unknown’)
7 — loan: has personal loan? (categorical: ‘no’, ‘yes’, ‘unknown’)
# related with the last contact of the current campaign:
8 — contact: contact communication type (categorical: ‘cellular’, ‘telephone’)
9 — month: last contact month of year (categorical: ‘jan’, ‘feb’, ‘mar’, …, ‘nov’, ‘dec’)
10 — day_of_week: last contact day of the week (categorical: ‘mon’,’tue’,’wed’,’thu’,’fri’)
11 — duration: last contact duration, in seconds (numeric).

Important note: this attribute highly affects the output target (e.g., if duration=0 then y=’no’). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

# other attributes:
12 — campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 — pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 — previous: number of contacts performed before this campaign and for this client (numeric)
15 — poutcome: outcome of the previous marketing campaign (categorical: ‘failure’,’nonexistent’,’success’)

# social and economic context attributes
16 — emp.var.rate: employment variation rate — quarterly indicator (numeric)
17 — cons.price.idx: consumer price index — monthly indicator (numeric)
18 — cons.conf.idx: consumer confidence index — monthly indicator (numeric)
19 — euribor3m: euribor 3 month rate — daily indicator (numeric)
20 — nr.employed: number of employees — quarterly indicator (numeric)

Output variable (desired target):
21 — y — has the client subscribed a term deposit? (binary: ‘yes’, ‘no’)

PERFORMANCE METRICS:

Different performance metrics can be used to evaluate a machine learning model, and we choose them based on the task. Our task is binary classification: predicting whether a client will or will not subscribe to a deposit.

PRIMARY PERFORMANCE METRICS

Here we will be using the AUC-ROC curve.

What is ROC?

ROC, also known as the Receiver Operating Characteristic curve, shows the performance of a binary classifier across the range of all possible thresholds by plotting the true positive rate against the false positive rate.

What is AUC-ROC?

AUC measures the likelihood that, for two randomly drawn points, one positive and one negative, the classifier will rank the positive point above the negative one. AUC-ROC is a popular classification metric with the advantage of not depending on any particular classification threshold.

Ideal AUC-ROC score is 1. AUC-ROC score for random classifier is 0.5.

SECONDARY PERFORMANCE METRICS

Macro-F1 Score: The F1 score is the harmonic mean of precision and recall. The macro F1 score averages the per-class F1 scores, which tells us how the model performs across the whole dataset.

Confusion Matrix: This matrix gives the counts of true negative, true positive, false positive and false negative data points.

EDA:

Before going further, let's understand what EDA is.

EDA is a way of interpreting, summarising and visualising the information in a dataset. It can help us find patterns and relationships that may not otherwise be understood or visible. It is one of the most important steps in the data science life cycle.

Let’s check for missing values in the dataset.

As we can see, there are no missing values present in the dataset, so we don't need to impute anything.

Let’s see how our target variable i.e. y distribution looks like.

From the above plot we can observe that our dataset is highly imbalanced: the majority of the data points belong to the 'no' class, and the ratio of 'no' to 'yes' is about 8:1.

Let’s start doing EDA on rest of the columns of the datapoints.

1. Job

From the above plot we can observe that people with admin jobs have been contacted the most by the bank, while people with unknown jobs are very few. Let's check which jobs have subscribed to the deposits.

From the above plot we can observe that people with admin jobs have subscribed to the deposits more than people in any other profession.

2. Marital

People who are married have been contacted the most by the bank. The marital status of about 0.2% of people is unknown to the bank.

People who are married have subscribed to deposits more than people with any other marital status. They are also the ones who have most often turned down the deposits offered by the bank.

3. Contact

Most people have been contacted on cellular rather than telephone.

More of the people contacted on cellular have subscribed to the deposits offered by the bank than of those contacted on telephone.

4. Education

People with a university degree have been contacted by the bank more than people with any other educational qualification. The bank has not contacted illiterate people.

Let's check which educational qualifications have the most people who did and did not subscribe to the deposits.

People with a university degree are the largest group that has subscribed to the deposits. They are also the largest group that has not subscribed.

5. Default

We can clearly see that people with default status 'no' are the ones who have been contacted the most by the bank for the deposits. People with default status 'yes' have not been contacted by the bank at all, while very few people with unknown default status have been contacted.

People with default status 'no' are the largest group both among those who have subscribed and among those who have not subscribed to the bank deposits.

6. Housing

People with a housing loan have been contacted the most by the bank, followed by people with no housing loan. Very few people have an unknown housing loan status.

People with a housing loan are the largest group that has subscribed to the deposits; they are also the largest group that has not subscribed. Very few people with an unknown housing loan status have subscribed to the deposits offered by the bank.

7. Personal Loan

People with no personal loan are the ones who have been contacted the most by the bank for the deposits. Very few people with a personal loan have been contacted.

People with no personal loan are the largest group that has not subscribed to the deposits offered by the bank, and also the largest group that has subscribed. Very few people with an unknown personal loan status have subscribed.

8. Month

People have been contacted the most in the month of May, followed by July, August and June. Very few people have been contacted in December, and no one appears to have been contacted in January or February. Let's verify this.

People have not been contacted in the months of January and February.

People contacted in May have higher chances of subscribing to long-term deposits, but also higher chances of not subscribing. Very few people were contacted in December, March, September and October, and they have almost equal chances of subscribing or not.

9. Day of Week

People have not been contacted on Saturday or Sunday. On the remaining days, the count of people contacted by the bank is almost the same.

On all these days people have roughly equal chances of subscribing or not subscribing to the term deposits, so day_of_week may not be very helpful in predicting whether a customer will subscribe to a long-term deposit.

10.Outcome of Previous Marketing Strategy(Poutcome)

From the above plot it is evident that the majority of previous-campaign outcomes are 'nonexistent'. Very few people from the previous marketing campaign have subscribed to the deposits.

From the above plot, people whose previous outcome is 'nonexistent' have actually subscribed more than any other previous-outcome group. Among the people with previous outcome 'success', more have subscribed to the deposits than have not.

For the numerical data we will be plotting a kdeplot, a violin plot with class hue, and a kdeplot with class hue.
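Here is a small sketch of the three plots used for each numerical column below (it assumes seaborn 0.11+ and the df defined earlier; the helper name plot_numeric is mine):

```python
import matplotlib.pyplot as plt
import seaborn as sns

def plot_numeric(df, col):
    """Overall KDE, violin plot per class, and class-wise KDE for one numeric column."""
    fig, axes = plt.subplots(1, 3, figsize=(18, 4))
    sns.kdeplot(data=df, x=col, ax=axes[0])            # overall distribution
    sns.violinplot(data=df, x="y", y=col, ax=axes[1])  # distribution split by class
    sns.kdeplot(data=df, x=col, hue="y", ax=axes[2])   # class-wise densities overlaid
    plt.show()

plot_numeric(df, "age")
plot_numeric(df, "duration")
```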

11. Age

From the above plot the distribution is right-skewed, and after the age of 60 there might be outliers present.

From the above it is clearly visible that there are outliers present for both classes. In the 'no' class, outliers are present above age 70, and for the 'yes' class above age 75. The median for the 'no' class is around 40, which is the same as for the 'yes' class. The IQR ranges also overlap almost completely, so age might not be very helpful in predicting the class label.

Plotting a kdeplot with the classes as hue makes it even clearer that age might not be very helpful in predicting the class labels because there is so much overlap. Only above the age of 60, which might be our outliers, is there not that much overlap.

12. Duration

From the above plot it is clearly evident that outliers are present from a duration of around 1500 onward, and the data is right-skewed.

Call durations above about 1000 are considered outliers for the 'no' class, while for the 'yes' class durations above about 1500 would be considered outliers.

From the above 3 plots it is evident that the duration feature would be very helpful in predicting the class labels. This is also mentioned in the research paper, but to build a realistic predictive model we should not use this column. Most of the call durations for people who have not subscribed to the long-term deposits are between 0 and 1000 seconds, and for people who have subscribed between 0 and 2000 seconds.

13. Campaign

From the above plot, outliers might be present when the number of campaign contacts exceeds 10.

Outliers are present when the number of campaign contacts is more than 10, irrespective of the class label.

From the above plot, although outliers are present, there is a lot of overlap between the classes.

14. Days Passed after The Client has been Last Contacted by the Bank for Previous Campaign (Pdays)

From the above plot most of the pdays values are 999, which means most of the clients have not been previously contacted by the bank.

From the above plot it is visible that, irrespective of the class label, most people have not been previously contacted by the bank. Very few people have been contacted, and for them the number of days passed since the previous campaign is between 0 and 100. This means we either have to impute pdays or drop it, depending on the percentage of such values.

Pdays can only be used for predicting class labels when the values are between 0 and 200 or equal to 999.

15. Previous(Number contacts performed before this campaign and for the particular client)

From the above plot, there might be outliers present above the value 4. Let's check the violin plot.

From the above violin plot, for either class label, previous values greater than 4 are outliers.

From the above plot, previous equal to zero might be helpful in predicting whether a client will or will not subscribe to the deposits.

16. Emp.Var.Rate

From the above plots it is visible that there are no outliers present. Let's confirm with a violin plot.

There are no outliers present.

As we can see, emp.var.rate would be very helpful in predicting the class labels.

17. Cons.Price.Idx

From the above plot we can see that there are no outliers present. Let's draw a violin plot.

Cons.price.idx would be helpful in predicting the class labels, and there are no outliers present.

18. Euribor3m

From the above plot it is visible that Euribor3m would be helpful in predicting the class labels, and there are no outliers present.

19. Nr.Employed

From the above plot, there are no outliers present, and this feature would be helpful in predicting the class labels.

20. Cons.Conf.Idx

From the above plot there might be outliers. Let's check with a violin plot.

For the 'no' class, cons.conf.idx values above -30 are possible outliers.

From the above plot it is visible that this feature would be helpful in predicting the class labels.

Let’s Check for correlation of features between the Numerical Features

The emp.var.rate, euribor3m, nr.employed and cons.price.idx features have very high correlations. Euribor3m with nr.employed and emp.var.rate with nr.employed have the highest correlations, with values above 0.9.

Feature Engineering and Their EDA:

Feature engineering is the process of converting data into features that improve the prediction performance of a model on unseen data.

1. Converting Age to Age-Group

Here I have converted age, which is numeric, into age_group (categorical data).

I have created 9 groups from the minimum age of 10 to the maximum age of 100. After creating them, I inserted the age_group column into the data frame and deleted the age column.
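A minimal sketch of this step using pd.cut (the exact bin edges and label names here are my own illustration):

```python
import pandas as pd

# Nine age groups covering ages 10-100; the labels are illustrative.
bins = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
labels = ["10-19", "20-29", "30-39", "40-49", "50-59",
          "60-69", "70-79", "80-89", "90-99"]

df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)
df = df.drop(columns=["age"])  # the raw age column is no longer needed
```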

Let's plot the graph.

The bank has contacted the most people in the 30-39 age group, followed by 40-49.

The 30-39 age group contains the most people who have not subscribed to the deposits; it also contains the most who have subscribed.

For other EDA related to age_group, please go through the notebook.

2. Creating I_loan

This feature indicates whether a person has a personal loan, a housing loan, or neither.

Step 1: If the person has either a housing loan or a personal loan -> I_loan = Yes.

If the person has neither a personal loan nor a housing loan -> I_loan = No.

If the person's housing loan and personal loan status are both unknown -> I_loan = Unknown.

Step 2: Insert I_loan into the data frame.

Step 3: Drop the housing and loan columns from the data frame. A minimal sketch of these steps is shown below.
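How rows that mix 'no' and 'unknown' are handled here is my assumption, and the helper name make_i_loan is mine:

```python
def make_i_loan(row):
    """Combine the housing and loan columns into a single I_loan flag."""
    if row["housing"] == "yes" or row["loan"] == "yes":
        return "yes"        # has at least one loan
    if row["housing"] == "unknown" and row["loan"] == "unknown":
        return "unknown"    # loan status entirely unknown
    return "no"             # has neither loan

df["I_loan"] = df.apply(make_i_loan, axis=1)
df = df.drop(columns=["housing", "loan"])
```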

From the above plot we can see that people having a loan are the most numerous, followed by people whose loan status is no. Very few people have loan status unknown.

People who have a loan are the majority among those who have subscribed to the deposits. They are also the majority among those who have not subscribed.

For other EDA related to I_loan, please go through the notebook.

Pre-Processing:

Let’s check for any duplicate values present in the data.

There are 27 rows that are duplicates. We pass keep='last' because we want to keep the most recent record in our data; the older duplicates should be removed.

Removing Duplicate Data
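A minimal sketch of the de-duplication (same df as before):

```python
# How many exact duplicate rows are present?
print(df.duplicated().sum())

# Drop the older copies and keep the most recent occurrence of each duplicate.
df = df.drop_duplicates(keep="last")
print(df.shape)
```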

After removing the duplicate rows, the data has shape (41161, 20).

Now we will map our target values: 'yes' to 1 and 'no' to 0.

Let's separate the target variable from the data frame and drop it from the features.

Splitting Dataset into Train, Cross-Validation and Test
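A minimal sketch of the target mapping and the stratified train / cross-validation / test split (the 60/20/20 ratio and the random seed are my assumptions):

```python
from sklearn.model_selection import train_test_split

# Map the target to integers and separate it from the features.
y = df["y"].map({"yes": 1, "no": 0})
X = df.drop(columns=["y"])

# 60/20/20 split, stratified on the target to preserve the class ratio.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_cv, X_test, y_cv, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)
```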

Feature Scaling of Numerical Features

The reason for the log transformation is that most of our numerical features are right-skewed. For the log transformation we need to add a constant to avoid getting NaN values. Here I have added 100, because with constants smaller than 100 the Euribor3m feature was producing NaN values.
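A minimal sketch of the shifted log transform (the list of numerical columns is illustrative):

```python
import numpy as np

num_cols = ["campaign", "previous", "emp.var.rate", "cons.price.idx",
            "cons.conf.idx", "euribor3m", "nr.employed"]
SHIFT = 100  # added before the log so no value is zero or negative

for col in num_cols:
    X_train[col] = np.log(X_train[col] + SHIFT)
    X_cv[col] = np.log(X_cv[col] + SHIFT)
    X_test[col] = np.log(X_test[col] + SHIFT)
```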

Encoding of Categorical Features

1. One Hot Encoding: One-hot encoding creates a column for each category and checks whether the category is present in that row. If the category is present it is marked as 1, else 0 (see the sketch after the response-encoding steps below).

2. Response Encoding: The steps involved in response encoding of the features are listed below (a sketch follows the list).

1. We fit on the training dataset by passing the training features and the target labels.

2. For each column of the training dataset we collect its categories.

3. For each category we compute the probability of each target label, and we return a dictionary that holds the probability score of each category.

4. To transform the cross-validation and test datasets, we pass this dictionary of class-label probabilities.

5. We check whether each category is present in the dictionary. If it is, we use the value stored in the dictionary; if not, we use the value 0.5. We then return the transformed dataset.

6. The values in the data frame are mapped based on the values in the dictionary.

7. For every original column we end up with 2 columns, one per class label: the column for class label 1 holds the probability of that category for label 1, and the column for class label 0 holds the probability for label 0.
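Below is a minimal sketch of both encodings. The one-hot part uses pandas get_dummies; the response-encoding helpers follow the steps above, producing two probability columns per categorical column and falling back to 0.5 for unseen categories. The column list and helper names are my own:

```python
import pandas as pd

cat_cols = ["job", "marital", "education", "default", "contact",
            "month", "day_of_week", "poutcome", "age_group", "I_loan"]

# ---- One-hot encoding ---------------------------------------------------
X_train_ohe = pd.get_dummies(X_train, columns=cat_cols)
# Align CV/test to the training columns so unseen categories become all-zero.
X_cv_ohe = pd.get_dummies(X_cv, columns=cat_cols).reindex(
    columns=X_train_ohe.columns, fill_value=0)
X_test_ohe = pd.get_dummies(X_test, columns=cat_cols).reindex(
    columns=X_train_ohe.columns, fill_value=0)

# ---- Response encoding --------------------------------------------------
def fit_response_encoding(X, y, cols):
    """For each column, store P(y=1 | category) and P(y=0 | category)."""
    mapping = {}
    for col in cols:
        p1 = y.groupby(X[col]).mean()  # fraction of label 1 per category
        mapping[col] = {"p1": p1.to_dict(), "p0": (1 - p1).to_dict()}
    return mapping

def transform_response_encoding(X, mapping):
    """Replace each categorical column with two probability columns."""
    X = X.copy()
    for col, probs in mapping.items():
        X[col + "_p1"] = X[col].map(probs["p1"]).fillna(0.5)  # unseen -> 0.5
        X[col + "_p0"] = X[col].map(probs["p0"]).fillna(0.5)
        X = X.drop(columns=[col])
    return X

encoding = fit_response_encoding(X_train, y_train, cat_cols)
X_train_re = transform_response_encoding(X_train, encoding)
X_cv_re = transform_response_encoding(X_cv, encoding)
X_test_re = transform_response_encoding(X_test, encoding)
```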

DATA MODELLING:

Here we will use different machine learning models and check which ones perform better. Let me explain the steps.

1. As pdays has approximately 70% of its values corrupted and marked as 999, we could either impute these values or drop the column. I have dropped the pdays column and checked how the models perform.

2. The duration feature will only be used for the baseline model; to keep the models realistic, we drop this column for the other machine learning models.

3. Hyperparameter tuning of the models has been done using a calibrated classifier, where the cross-validation dataset is used to find the best hyperparameters.

4. After getting the best hyperparameters, the train and test datasets are used for evaluating model performance.

5. Feature scaling of the numerical data is done only for the linear models. Tree-based models do not require feature scaling and are also robust to outliers.

6. For each encoding of the categorical data, we have trained the models to compare which encoding works better.

7. Since our dataset is highly imbalanced, I have used the class_weight='balanced' parameter so that the class weights are balanced internally.

Need for a Calibrated Classifier

Non-linear models (such as KNN and tree-based models) predict uncalibrated probabilities. Although these models output probabilities, they might not match the probabilities observed in training, so they require adjustment, which is what calibration does.
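A minimal sketch of calibrating a model on the held-out cross-validation split (the base model, its parameters and the response-encoded feature names are illustrative; sigmoid/Platt scaling is one of the options CalibratedClassifierCV offers):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# Fit the base model on the train split.
base = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                              random_state=42)
base.fit(X_train_re, y_train)

# Calibrate its probabilities on the cross-validation split.
calibrated = CalibratedClassifierCV(base, method="sigmoid", cv="prefit")
calibrated.fit(X_cv_re, y_cv)

test_proba = calibrated.predict_proba(X_test_re)[:, 1]
```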

Baseline Model with Duration.

For the baseline model we will be using DummyClassifier.

  1. One Hot Encoding.
The 'stratified' strategy has the best AUC score among the strategies.

2. Response Encoding

Our other machine learning models, irrespective of encoding, should perform better than these baseline models.
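A minimal sketch of the baseline comparison with DummyClassifier (the list of strategies shown is illustrative):

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

for strategy in ["most_frequent", "stratified", "uniform"]:
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train_ohe, y_train)
    auc = roc_auc_score(y_test, dummy.predict_proba(X_test_ohe)[:, 1])
    print(f"{strategy}: test AUC = {auc:.3f}")
```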

Model Evaluation

All the models are hyperparameter-tuned using random search cross-validation.
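As a sketch, the random search for one of the models could look like this (XGBoost shown; the parameter ranges, n_iter and feature names are illustrative):

```python
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

param_dist = {
    "n_estimators": randint(100, 500),
    "max_depth": randint(3, 10),
    "learning_rate": uniform(0.01, 0.3),
}

search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions=param_dist,
    n_iter=20,
    scoring="roc_auc",
    cv=3,
    random_state=42,
)
search.fit(X_train_re, y_train)
print(search.best_params_, search.best_score_)
```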

  1. One Hot Encoding.

i. KNN:

ii. Logistic Regression:

iii. SGD with Logloss

iv. SGD with Hinge-Loss

v. Random Forest

vi. XGBoost

vii. Adaboost

2. Response Encoding

i. KNN

ii. Logistic Regression

iii. SGD with Log Loss

iv. SGD with Hinge Loss

v. Random Forest

vi. XGBoost

vii. Adaboost

Now we will build custom stacking with response encoding.

Custom Stacking.

Steps Involved in Building Custom Stacking.

1. Split the dataset into train and test in the ratio 80:20.

2. Split the train dataset into dataset1 and dataset2 in the ratio 50:50.

3. Dataset1 is used to train our base learners. Here the base learner is a decision tree. For n base learners, n sampled datasets are created using sampling with replacement.

4. The trained base learner objects are stored in a list.

5. Dataset2 is then used to obtain the n predictions.

6. The n predictions from the base learners are stacked and used to train our meta-classifier.

7. Our meta-classifier is a hyperparameter-tuned logistic regression.

Finally, the test dataset is used to get the final predictions from the meta-classifier.

Layout of Custom Stacking:

image source: https://www.geeksforgeeks.org/stacking-in-machine-learning-2/

Here I have used 200 decision trees as base learners to get the predictions; a minimal sketch is shown below.
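This sketch follows the steps above; the tree depth, the use of hard predictions as meta-features, and the variable names (e.g. X_train_re) are my assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 80:20 train/test split, then a 50:50 split of train into dataset1 and dataset2.
X_tr, X_te, y_tr, y_te = train_test_split(X_train_re, y_train, test_size=0.2,
                                           stratify=y_train, random_state=42)
X_d1, X_d2, y_d1, y_d2 = train_test_split(X_tr, y_tr, test_size=0.5,
                                          stratify=y_tr, random_state=42)

# Train 200 decision trees, each on a bootstrap sample of dataset1.
base_learners = []
for i in range(200):
    idx = np.random.RandomState(i).choice(len(X_d1), size=len(X_d1), replace=True)
    tree = DecisionTreeClassifier(max_depth=5, random_state=i)
    tree.fit(X_d1.iloc[idx], y_d1.iloc[idx])
    base_learners.append(tree)

def stack_predictions(models, X):
    """Column-stack each base learner's predictions into meta-features."""
    return np.column_stack([m.predict(X) for m in models])

# Train the logistic-regression meta-classifier on dataset2's stacked predictions.
meta = LogisticRegression(max_iter=1000, class_weight="balanced")
meta.fit(stack_predictions(base_learners, X_d2), y_d2)

# Final predictions on the held-out test split.
test_proba = meta.predict_proba(stack_predictions(base_learners, X_te))[:, 1]
print("Stacking test AUC:", roc_auc_score(y_te, test_proba))
```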

Model Comparison

Let’s compare our Models and check which model should be used for deployment.

From the above we can see that model with XGBoost with response encoding has best AUC among other models. We would be saving xgboost using joblib library to predict the data.
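A minimal sketch of persisting the chosen model (the estimator comes from the random-search sketch earlier; the file name is mine):

```python
import joblib

# Save the tuned XGBoost model to disk so the deployment code can reuse it.
joblib.dump(search.best_estimator_, "xgb_response_encoding.pkl")

# Later, at prediction time:
model = joblib.load("xgb_response_encoding.pkl")
```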

Model Deployment

This function takes a data point as input and returns the prediction. I have deployed my model on my local system.
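Here is a minimal sketch of how such a prediction function can be exposed locally with Flask (this is my illustration, not the exact deployment code; it assumes the incoming JSON already contains the same pre-processed and encoded feature columns the model was trained on):

```python
import joblib
import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("xgb_response_encoding.pkl")  # file name from the sketch above

@app.route("/predict", methods=["POST"])
def predict():
    # One data point as a JSON object of feature-name -> value pairs.
    datapoint = pd.DataFrame([request.get_json()])
    prediction = int(model.predict(datapoint)[0])
    return jsonify({"subscribed": "yes" if prediction == 1 else "no"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```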

Here is the YouTube video of my model deployment.

Further Improvements

1. We can impute pdays, which has around 70% of its values set to 999, using different imputation methods.

2. For balancing the dataset we used class_weight='balanced'. We could instead use different sampling techniques to balance the dataset, which might further improve our AUC score.

3. We have trained our models on a fairly limited dataset. To improve our AUC score we could use deep learning techniques, which require more data for training.

References:

  1. Converting age into categorical age: https://www.absentdata.com/pandas/pandas-cut-continuous-to-categorical/
  2. Hyper tuning of Models: https://towardsdatascience.com/machine-learning-case-study-a-data-driven-approach-to-predict-the-success-of-bank-telemarketing-20e37d46c31c
  3. Model Deployment: https://www.appliedaicourse.com/lecture/11/applied-machine-learning-online-course/4148/hands-on-live-session-deploy-an-ml-model-using-flask-apis-on-aws/5/module-5-feature-engineering-productionization-and-deployment-of-ml-models

Github Repository: https://github.com/Nipun-1997/Bank-Marketing

Hope you like this blog. For any details contact me via LinkedIn

Linkedin Profile: https://www.linkedin.com/in/nipun-agrawal-200597110
