 The last modifications of this post were around 3 years ago, some information may be outdated!

## Introduction

In this challenge, we answer the question "What sorts of people were more likely to survive?" using passenger data. Two datasets are used: `train.csv` (for training and evaluating the model) and `test.csv` (for the submission). The variables are:

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

## TL;DR

• Get an overview of the dataset.
• Use `.describe()` for numerical and categorical features.
• Find the percentage of missing data for each feature.
• Look at survival based on some categorical features.
• Visualize survival based on `Age`.
• Check whether survival depends on the titles found in `Name`.
• Preprocess the data:
  • Drop unnecessary features (columns) (`Name`, `Ticket`, `Cabin`) using `df.drop()`.
  • Convert categorical variables to dummy variables using `pd.get_dummies()`.
  • Impute missing continuous values using `sklearn.impute.SimpleImputer`.
  • Optionally turn `Age` into a categorical feature and convert it to dummy variables as well.
• Use `GridSearchCV` to find the optimal hyperparameters for an algorithm, e.g. Random Forest.
• Export the result to an output file.

## Preliminaries

```python
import numpy as np
import matplotlib.pyplot as plt  # plotting
import pandas as pd  # working with datasets
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier  # used in the training section
from sklearn.impute import SimpleImputer  # impute missing data
from sklearn.model_selection import GridSearchCV, cross_val_score
```

## Overview datasets

```python
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")
```

Take a look:

```python
train.head(10)
train.info()
train.describe()  # for numerical features
train.describe(include=['O'])  # for categorical features
```

Find the percentage of missing data on each feature,

```python
total = train.isnull().sum().sort_values(ascending=False)
percent = (round(train.isnull().sum() / train.isnull().count() * 100, 1)).sort_values(ascending=False)
pd.concat([total, percent], axis=1, keys=['Total', '% of missing'])
```
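The percentage above can also be computed in one step, because the column-wise mean of a boolean mask is the share of `True` values. A minimal sketch on a made-up frame (the `toy` data and the name `missing_pct` are illustrative assumptions, not part of the Titanic files):

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() gives a boolean mask; its mean per column is the fraction of missing values
missing_pct = (toy.isnull().mean() * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

Here `Age` is missing in half of the rows and `Fare` in a quarter of them, so they sort to the top.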

Survival based on some categorical features,

```python
train.pivot_table(index="Sex", values="Survived")
train.pivot_table(index="Pclass", values="Survived")
train.pivot_table(index="SibSp", values="Survived")
train.pivot_table(index="Parch", values="Survived")
```
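`pivot_table` aggregates with the mean by default, so each value returned by the calls above is the survival rate of that group. A small sketch on made-up rows (the `toy` frame is an assumption for illustration, not real passenger data):

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0],
})

# The default aggfunc is the mean, i.e. the fraction of survivors per group
rates = toy.pivot_table(index="Sex", values="Survived")
print(rates)
```

In this toy frame two of the three women survived and neither of the men did, which is what the `Survived` column of `rates` reports.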

Visualize survival based on `Age` (numerical),

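```python
# Use two different colors so the overlaid histograms can be told apart
train[train["Survived"] == 1]['Age'].plot.hist(alpha=0.5, color='blue', bins=50, label='Survived')
train[train["Survived"] == 0]['Age'].plot.hist(alpha=0.5, color='red', bins=50, label='Died')
plt.legend()
```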

List of titles (Mr., Mrs., Dr.,...) from `Name`,

```python
# Raw string so that `\.` is passed to the regex engine unchanged
train.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)
```
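Once extracted, the titles can be counted and compared against survival. The sketch below uses invented names in the same "Surname, Title. Given names" layout; the `Title` column and the `toy` frame are assumptions for illustration:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "Name": [
        "Doe, Mr. John",
        "Roe, Mrs. Jane (Mary Major)",
        "Poe, Miss. Anna",
        "Doe, Master. Tim",
        "Roe, Mr. Richard",
    ],
    "Survived": [0, 1, 1, 1, 0],
})

# Capture the word that follows a space and ends with a period
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

title_counts = toy["Title"].value_counts()            # how often each title occurs
title_rates = toy.groupby("Title")["Survived"].mean()  # survival rate per title
print(title_counts)
print(title_rates)
```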

## Preprocessing data

In this step, you have to apply the same transformations to both the `train` and `test` sets!

### Drop unnecessary features

Drop some unnecessary features (columns),

```python
train.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
test.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
```

### Convert to dummy

Convert categorical features to dummy variables,

```python
def create_dummies(df, column_name):
    # One-hot encode column_name and drop the first category
    # to prevent perfect collinearity
    dummies = pd.get_dummies(df[column_name], prefix=column_name, drop_first=True)
    # Replace the original column with its dummies so that no
    # non-numeric column (e.g. 'Sex') reaches the model later
    df = pd.concat([df.drop(columns=column_name), dummies], axis=1)
    return df

# Sex
train = create_dummies(train, 'Sex')
test = create_dummies(test, 'Sex')

# Embarked
train = create_dummies(train, 'Embarked')
test = create_dummies(test, 'Embarked')

# Social class
train = create_dummies(train, 'Pclass')
test = create_dummies(test, 'Pclass')
```
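Encoding `train` and `test` separately only yields matching columns when both contain the same categories. If a category were missing from one of the files, the dummy columns would differ. One possible safeguard, sketched on made-up data (the `toy_train` and `toy_test` frames are assumptions), is to encode the test set without `drop_first` and then reindex it to the training columns:

```python
import pandas as pd

# Toy data, made up for illustration only
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})
toy_test = pd.DataFrame({"Embarked": ["S", "S"]})  # 'C' and 'Q' never occur here

# drop_first=True drops the first category in sorted order ('C'),
# so a row of all zeros stands for 'C'
train_d = pd.get_dummies(toy_train["Embarked"], prefix="Embarked", drop_first=True).astype(int)

# Encode the test set fully, then keep exactly the training columns;
# columns that never occurred in the test set are filled with 0
test_d = pd.get_dummies(toy_test["Embarked"], prefix="Embarked").astype(int)
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(train_d)
print(test_d)
```

Calling `get_dummies(..., drop_first=True)` on `toy_test` alone would drop its only category and return no columns at all, which is why the sketch reindexes instead.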

### Impute Missing Values

For continuous variables, we fill missing values with the mean computed on the training data.

```python
def impute_data(df_train, df_test, column_name):
    # The `verbose` argument was removed from SimpleImputer in recent scikit-learn versions
    imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
    # Fit the imputer on the training data only
    # (reshape the single column into a 2D array of shape (n, 1))
    imputer.fit(df_train[column_name].values.reshape(-1, 1))
    # Apply the fitted imputer to both df_train and df_test
    df_train[column_name] = imputer.transform(df_train[column_name].values.reshape(-1, 1)).ravel()
    df_test[column_name] = imputer.transform(df_test[column_name].values.reshape(-1, 1)).ravel()
    return df_train, df_test

# Age
train, test = impute_data(train, test, 'Age')

# Fare
train, test = impute_data(train, test, 'Fare')
```
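The important detail is that the mean is learned from the training set and then reused for the test set. A minimal sketch with made-up ages (the `toy_train` and `toy_test` frames are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data, made up for illustration only
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train[["Age"]])  # learns the training mean: (20 + 30 + 40) / 3 = 30

train_filled = imputer.transform(toy_train[["Age"]]).ravel()
test_filled = imputer.transform(toy_test[["Age"]]).ravel()  # also filled with 30, not 50

print(train_filled)
print(test_filled)
```

Fitting on the test set instead would fill its gap with 50, so information from the test data would leak into preprocessing.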

### Continuous to categorical

You may also want to convert a numerical feature such as `Age` into a categorical one by splitting it into ranges. Note that the `Missing` range below is only populated if `Age` has not already been imputed with the mean.

```python
def process_age(df, cut_points, label_names):
    # Send missing ages to -0.5 so that they fall into the first bin
    df["Age"] = df["Age"].fillna(-0.5)
    df["Age_categories"] = pd.cut(df["Age"], cut_points, labels=label_names)
    return df

cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", 'Infant', "Child", 'Teenager', "Young_Adult", 'Adult', 'Senior']

train = process_age(train, cut_points, label_names)
test = process_age(test, cut_points, label_names)
```
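To see how the cut points map to labels, here is a small sketch on made-up ages (the `ages` series is an assumption for illustration). `pd.cut` uses right-closed intervals by default, so an age of exactly 18 would still count as `Teenager`:

```python
import numpy as np
import pandas as pd

cut_points = [-1, 0, 5, 12, 18, 35, 60, 100]
label_names = ["Missing", "Infant", "Child", "Teenager", "Young_Adult", "Adult", "Senior"]

# Toy ages, made up for illustration only; one value is missing
ages = pd.Series([np.nan, 0.5, 8, 16, 25, 50, 70])

# The missing value becomes -0.5 and lands in the (-1, 0] bin
binned = pd.cut(ages.fillna(-0.5), cut_points, labels=label_names)
print(binned.tolist())
```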

Convert to a dummy variable,

```python
train = create_dummies(train, 'Age_categories')
test = create_dummies(test, 'Age_categories')
```

## Training with Random Forest

We will use Grid Search to test with different parameters and then choose the best ones.

```python
# Candidate features: every column except the target
features = train.columns.difference(['Survived'])

# Create a dictionary containing all the candidate values of the parameters
parameter_grid = dict(n_estimators=list(range(1, 5001, 1000)),
                      criterion=['gini', 'entropy'],
                      max_features=list(range(1, len(features), 2)),
                      max_depth=[None] + list(range(5, 25, 1)))

# Create a random forest object
random_forest = RandomForestClassifier(random_state=0, n_jobs=-1)

# Create a grid search object with 5-fold cross-validation that uses all cores (n_jobs=-1)
clf = GridSearchCV(estimator=random_forest, param_grid=parameter_grid, cv=5, verbose=1, n_jobs=-1)
```

Split into `X_train`, `y_train`:

```python
X_train = train[train.columns.difference(['Survived'])]
y_train = train['Survived']

# Nest the GridSearchCV in a 3-fold CV for model evaluation
# (cv is passed explicitly because the default number of folds depends on the scikit-learn version)
cv_scores = cross_val_score(clf, X_train, y_train, cv=3)

# Print results
print('Accuracy scores:', cv_scores)
print('Mean of score:', np.mean(cv_scores))
print('Variance of scores:', np.var(cv_scores))
```
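Passing a `GridSearchCV` object to `cross_val_score` gives nested cross-validation: the inner loop picks hyperparameters on each training fold, and the outer loop scores the tuned model on data the search never saw. A quick sketch on synthetic data with a deliberately tiny grid (the data from `make_classification` and the names `small_grid`, `inner_search` and `outer_scores` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic data standing in for the prepared Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# A small grid keeps the sketch fast; it is not a tuned choice
small_grid = {"n_estimators": [10, 50], "max_depth": [None, 3]}

# Inner loop: 3-fold grid search over the hyperparameters
inner_search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid, cv=3)

# Outer loop: 3-fold CV that evaluates the whole search procedure
outer_scores = cross_val_score(inner_search, X, y, cv=3)
print(outer_scores, outer_scores.mean())
```

The mean of `outer_scores` estimates how well the tuned model generalizes, while the scores reported by the grid search itself tend to be optimistic because they were used to pick the parameters.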

Retrain the random forest with the optimal parameters:

```python
# Retrain the model on the whole dataset
clf.fit(X_train, y_train)

# Predict who survived in the test dataset,
# using the same columns in the same order as X_train
predictions = clf.predict(test[X_train.columns])
```
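With the default `refit=True`, `GridSearchCV.fit` retrains the best parameter combination on all the data passed to it, and `predict` then delegates to that refitted model. The sketch below shows how the chosen parameters can be inspected, again on synthetic data with a tiny grid (the data and the name `search` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the prepared Titanic features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [10, 50], "max_depth": [None, 3]},
    cv=3,
)
search.fit(X, y)  # runs the search, then refits the best model on all of X, y

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated accuracy

# search.predict uses the refitted best estimator
preds = search.predict(X)
```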

## Create an output file

```python
final_ids = test["PassengerId"]
submission_df = {"PassengerId": final_ids,
                 "Survived": predictions}
submission = pd.DataFrame(submission_df)
submission.to_csv('titanic_submission.csv', index=False)
```

For another way, check the last section of this post.

## Other approaches

• Use the number of family members aboard: a combination of `SibSp` and `Parch`.
• Check whether the passenger was traveling alone.
• Consider the title from `Name`.
• Use a Decision Tree with K-fold cross-validation.
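The first two ideas in the list above could be sketched as follows. The column names `FamilySize` and `IsAlone` are hypothetical, and the rows are made up for illustration:

```python
import pandas as pd

# Toy data, made up for illustration only
toy = pd.DataFrame({
    "SibSp": [1, 0, 3, 0],
    "Parch": [0, 0, 2, 1],
})

# Siblings/spouses plus parents/children plus the passenger themselves
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# 1 if nobody else from the family is aboard, else 0
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(toy)
```

Both columns are numeric, so they can be added to the feature set without further encoding.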