Making an Insurance Claims Prediction model with CatBoost in R

Harish Nagpal
Published in Analytics Vidhya · 6 min read · Nov 29, 2020


Photo by Cristofer Jeschke on Unsplash

R is a beautiful language. Many times I have tried to shift to Python, but the simplicity of R pulls me back.

As per the CatBoost website, CatBoost is a machine learning algorithm that uses gradient boosting on decision trees. It is available as an open-source library.

It supports many languages, including Python and R.

The wonderful thing about this package is that it gives very high accuracy, in my experience often higher than XGBoost or Random Forest.

Some time ago I worked on a practical problem at Zindi. This site hosts very good data science competitions, and you can learn a lot by taking part in them. I was looking for data science competitions in insurance. I found one at Kaggle, the Prudential Life Insurance Assessment, which was a risk assessment/underwriting problem.

On Zindi, I found two insurance data science problems. One was a competition on insurance recommendation, which I attempted. I didn't get a very high ranking, but it was a very good learning experience.

The other competition was on insurance claims, i.e. whether a building will have a claim or not in a certain period. Below is the description of the problem statement:

Description of the challenge:

Recently, there has been an increase in the number of building collapses in Lagos and major cities in Nigeria. Olusola Insurance Company offers a building insurance policy that protects buildings against damage that could be caused by fire, vandalism, flood or storm.

You have been appointed as the Lead Data Analyst to build a predictive model to determine if a building will have an insurance claim during a certain period or not. You will have to predict the probability of having at least one claim over the insured period of the building.

The model will be based on the building characteristics. The target variable, Claim, is:

  • 1 if the building has at least one claim over the insured period.

  • 0 if the building doesn't have a claim over the insured period.

Installation of CatBoost in R

Before using the CatBoost package, you need to install it. Here is the code to install it on a Windows system; change the version number if required. For more details, please click here.

install.packages('devtools')
devtools::install_url('https://github.com/catboost/catboost/releases/download/v0.20/catboost-R-Windows-0.20.tgz', INSTALL_opts = c("--no-multiarch"))

Let us start building this model in R. Please download the datasets from here.

First, set the working directory and load the required packages. Not all of these packages may be used; they may be needed at different stages of feature engineering or EDA, so you can comment/uncomment them as per your usage.

setwd("D:/Harish/Zindi/Nigeria_Insurance_Prediction")
getwd()
# load packages
library(dplyr)
library(readr)
library(stringr)
library(caret)
#library(data.table)
library(mltools)
library(plyr) # for rbind
#library(randomForest)
library(lubridate) # for dates
library(tidyr)
#library(tibble)
#library(purrr)
library(Matrix)
library(sqldf) # for running sql queries
library(catboost)

Load the train and test datasets:

train1 = read.csv("train_data.csv")
test1 = read.csv("test_data.csv")
head(train1, n = 5)

Claim is the response variable which we need to predict; it has a value of 0 or 1.

A brief look at the datasets:
summary(train1)
summary(test1)

NAs are mainly present in Garden, Building Dimension and Date of Occupancy. NumberOfWindows takes various values, one of which is a dot '.', and that value accounts for the largest number of rows.

# NumberOfWindows: the '.' value accounts for around 50% of rows
# Building.Dimension: 106 NAs, replace with mean
# Garden: 7 NAs, replace with 'V'
# Geo_Code: 102 blanks, replace with mode
# Date_of_Occupancy: 508 blanks, replace with round(mode)

There are 7160 rows in the train set and 3069 in the test set.

nrow(train1) #7160
nrow(test1) # 3069

For feature engineering, I combined both datasets, adding a flag to differentiate them. It is always good practice to combine both datasets for EDA and feature engineering, as it keeps your data transformations consistent.

train1$flag <- 'traindata'
test1$flag <- 'testdata'
#combine both data sets
combined <- rbind.fill(train1, test1)
summary(combined)
nrow(combined) #10229
str(combined)

I did feature engineering on many variables: filled the NAs with mode and mean, created a new variable dur to hold the difference between YearOfObservation and Date_of_Occupancy, and converted a few variables to factors.
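
The post doesn't include the code for these steps, so here is a minimal sketch of how they might look, following the notes above; the getmode helper is my own addition, not part of the original code:

# Helper to find the most frequent value (my own addition)
getmode <- function(v) {
  v <- v[!is.na(v) & v != ""]
  names(sort(table(v), decreasing = TRUE))[1]
}
# Building.Dimension: replace NAs with the mean
combined$Building.Dimension[is.na(combined$Building.Dimension)] <-
  mean(combined$Building.Dimension, na.rm = TRUE)
# Garden: replace NAs with 'V'
combined$Garden[is.na(combined$Garden)] <- 'V'
# Geo_Code: replace blanks with the mode
combined$Geo_Code[combined$Geo_Code == ''] <- getmode(combined$Geo_Code)
# Date_of_Occupancy: replace NAs with the rounded mode
combined$Date_of_Occupancy[is.na(combined$Date_of_Occupancy)] <-
  round(as.numeric(getmode(combined$Date_of_Occupancy)))
# New duration variable
combined$dur <- combined$YearOfObservation - combined$Date_of_Occupancy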

The Insured_Period column had various decimal values, so I converted it into a new variable, Frequency.

# rnd presumably holds Insured_Period rounded to one decimal place,
# e.g. rnd <- round(combined$Insured_Period, 1)
unique(rnd)
# Map each value to a frequency code. Replace 1.0 first; otherwise the
# 1s produced by the 0.0 mapping would also match rnd == 1.0 below.
rnd <- replace(rnd, rnd == 1.0, 12)
rnd <- replace(rnd, rnd == 0.0, 1)
rnd <- replace(rnd, rnd == 0.1, 2)
rnd <- replace(rnd, rnd == 0.2, 3)
rnd <- replace(rnd, rnd == 0.3, 4)
rnd <- replace(rnd, rnd == 0.4, 5)
rnd <- replace(rnd, rnd == 0.5, 6)
rnd <- replace(rnd, rnd == 0.6, 7)
rnd <- replace(rnd, rnd == 0.7, 8)
rnd <- replace(rnd, rnd == 0.8, 9)
rnd <- replace(rnd, rnd == 0.9, 10)
combined$Frequency <- rnd
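
Since the rounded values run from 0.0 to 1.0 in steps of 0.1, the same mapping can be written in one line. This is a sketch that assumes rnd takes only those values:

# 0.0 -> 1, 0.1 -> 2, ..., 0.9 -> 10, and 1.0 -> 12 as a special case
combined$Frequency <- ifelse(rnd == 1.0, 12, round(rnd * 10) + 1)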

I converted Garden, Building_Painted, Building_Fenced and Settlement to 0/1 indicators. I also scaled a few numerical variables to normalize them.
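
This code isn't shown in the post either; a minimal sketch could look like the following. The level labels 'N', 'V' and 'R' are my assumptions about the raw data, not values confirmed by the original code:

# Binary indicators (assumed level labels)
combined$Building_Painted <- ifelse(combined$Building_Painted == 'N', 1, 0)
combined$Building_Fenced <- ifelse(combined$Building_Fenced == 'N', 1, 0)
combined$Garden <- ifelse(combined$Garden == 'V', 1, 0)
combined$Settlement <- ifelse(combined$Settlement == 'R', 1, 0)
# Scale a numeric variable to zero mean and unit variance
combined$Building.Dimension <- as.numeric(scale(combined$Building.Dimension))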

I removed the variables below, as I had already used them to derive other features.

combined$YearOfObservation <- NULL
combined$Date_of_Occupancy <- NULL
combined$Insured_Period <- NULL

Finally, I separated the train and test datasets again:

Otrain <- subset(combined, flag == 'traindata')
Otest <- subset(combined, flag == 'testdata')
Otrain$flag <- NULL
Otest$flag <- NULL

When my train and test datasets were ready, I divided the train dataset with a 70/30 split and found the best model using CatBoost.
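
The split code isn't shown in the post; a minimal sketch using caret (loaded earlier) might look like this:

set.seed(7)
# 70% of rows for training, stratified on the Claim variable
idx <- createDataPartition(Otrain$Claim, p = 0.7, list = FALSE)
tr <- Otrain[idx, ]   # 70% for training
val <- Otrain[-idx, ] # 30% for validation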

Once I was confident that my model would work on any test set, I ran it on the full train dataset. Then I computed variable importance to find the top variables affecting accuracy.

I also tried many permutations and combinations, submitting test-set predictions on the Zindi site to check whether my score was improving.

Here is the final code for CatBoost:

Otrain$Customer.Id <- NULL # Remove customer id
TstCusId <- Otest$Customer.Id # we will use this value at the time of submission
Otest$Customer.Id <- NULL # Remove customer id

###### CatBoost model
# Using the fields that variable importance showed had the maximum impact on model accuracy.
Otrain <- select(Otrain, Claim, Residential, Building.Dimension, Building_Type,
                 dur, Frequency, Geo_Code, Settlement, NumberOfWindows)
Otest <- select(Otest, Claim, Residential, Building.Dimension, Building_Type,
                dur, Frequency, Geo_Code, Settlement, NumberOfWindows)
set.seed(7)
y_train <- unlist(Otrain[c('Claim')])
X_train <- Otrain %>% select(-Claim)
y_valid <- unlist(Otest[c('Claim')])
X_valid <- Otest %>% select(-Claim)
train_pool <- catboost.load_pool(data = X_train, label = y_train)
test_pool <- catboost.load_pool(data = X_valid, label = y_valid)
params <- list(iterations = 500,
               learning_rate = 0.01,
               depth = 10,
               loss_function = 'RMSE', # RMSE on a 0/1 target; Logloss is the more usual choice for binary classification
               eval_metric = 'AUC',
               random_seed = 55,
               od_type = 'Iter',
               metric_period = 50,
               od_wait = 20,
               bootstrap_type = 'Bernoulli',
               use_best_model = TRUE)
model <- catboost.train(learn_pool = train_pool, params = params)
# With RMSE loss, y_pred holds raw scores rather than calibrated probabilities;
# AUC depends only on their ranking, which is what the leaderboard measures.
y_pred <- catboost.predict(model, test_pool)

You can use any of the bootstrap_type values below. I found the 'Bernoulli' bootstrap type gave the best accuracy.

  • Bayesian
  • Bernoulli
  • MVS
  • Poisson (supported for GPU only)
  • No

You can find all training parameters here.

Use the code below for submission:

#TstCusId <- as.data.frame(TstCusId)
#test_predictions_3 <- cbind(TstCusId, y_pred)
#write.csv(test_predictions_3, file = 'final_pred_Claim_CTBST_0110_BER_2.csv', row.names = T)

Use the code below for ensembling and variable importance:

# ensembling
#ber1 = read.csv("final_pred_Claim_CTBST_0110_BER.csv")
#ber2 = read.csv("final_pred_Claim_CTBST_0110_BER_2.csv")
#names(ber1)
#names(ber2)
#ensm <- sqldf("select ber1.TstCusId, ber1.y_pred, ber2.y_pred_2 from ber1, ber2 where ber1.TstCusId = ber2.TstCusId")
# Weighted average of the two sets of predictions
#ensm$y_pred_3 <- ensm$y_pred*.70 + ensm$y_pred_2*.30
#ensm$y_pred <- NULL
#ensm$y_pred_2 <- NULL
#write.csv(ensm, file = 'final_pred_Claim_CTBST_0110_ENSM_3.csv', row.names = T)
#summary(ensm)

# feature importance
#catboost.get_feature_importance(model,
#                                pool = NULL,
#                                type = 'FeatureImportance',
#                                thread_count = -1)
##### end of catboost
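
To read the importances alongside feature names, one possible follow-up is the sketch below (my addition; it assumes the call returns one importance value per feature, in column order):

imp <- catboost.get_feature_importance(model,
                                       pool = NULL,
                                       type = 'FeatureImportance')
# Pair importances with the training feature names and rank them
sort(setNames(as.numeric(imp), colnames(X_train)), decreasing = TRUE)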

With the CatBoost model, I ranked in the top 100.

This code is not final. With CatBoost plus more EDA and feature scaling, you could push further up the leaderboard, perhaps into the top 50.

You can find the full R code and dataset on my GitHub page.

Connect with me on LinkedIn here. Mail me here for any query.

Have a look at a few of my other articles:
