Titanic survival predictions

I’ve put together a small analysis to demonstrate how we can create, tune, and evaluate models. I looked at the Titanic passenger dataset and attempted to predict survival outcomes from information on the passenger list. I used Logistic Regression, kNN, Decision Tree, and Random Forest models. The hyperparameters for each were tuned over a grid small enough that the full search completes in under 5 minutes.

I extract some information from the Name column, which is included in the fit. The next step is to determine which of those extracted features are important and start trimming the rest. The full roster of last names from the training set is currently used in the model, but I expect some names matter more than others. Limiting the features output by my CountVectorizer should make the model more efficient (a rough sketch of this follows).
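As an illustration of that trimming step (a sketch only; the cutoffs below are placeholder values, not tuned choices), the vocabulary can be capped directly when constructing the CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Placeholder cutoffs: keep only last names that appear at least twice,
# capped at the 50 most frequent, instead of the full roster.
limited_name_vectorizer = CountVectorizer(binary=True, min_df=2, max_features=50)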

I also impute some ages, but they are not included in the model because I did not implement the fit in a way that prevents information leaking from the test set into the training set (a sketch of a leak-free approach follows).
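For reference, a minimal sketch of a leak-free version: fit the age model on the training split only, then apply it to both splits. The helper impute_age below is hypothetical and not part of this notebook.

# Hypothetical helper: fit the age regression on training rows only,
# then fill missing ages in both splits with its predictions.
def impute_age(age_model, X_train, X_test, feature_cols):
    known = X_train["Age"].notnull()
    age_model.fit(X_train.loc[known, feature_cols], X_train.loc[known, "Age"])
    for part in (X_train, X_test):
        missing = part["Age"].isnull()
        part.loc[missing, "Age"] = age_model.predict(part.loc[missing, feature_cols])
    return X_train, X_test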

I used a train/test split with 20% of the data held back to validate and compare the models. When tuning the hyperparameters for each model, I used stratified 5-fold cross-validation.

Comparing the models, I see that the Logistic Regression model fared best, with precision (P) and recall (R) both at 0.85. The Random Forest was very close with P = R = 0.84. Both of these models fare well, with an area under the ROC curve (AUC) of 0.90. The Logistic Regression model again differentiated itself with a PR-curve AUC of 0.84 (beating the 0.79 for the Random Forest). The ROC (blue) and PR (orange) curves for the Random Forest model are shown below.

The kNN (P=R=0.80) and Decision Tree (P=R=0.83) models were close, but not quite as good as the others. They also fell behind on ROC AUC (kNN = 0.86, DT = 0.83); the PR-curve AUC was 0.84 for both.

image.png

The upshot of this analysis is that we can predict Titanic survival with over 80% accuracy using any of the four models.

More information on the dataset and its columns can be found on Kaggle.

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("titanic_data.csv")
df.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
df[df.Age.isnull()].head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
for i in df.columns:
    print df[i].value_counts().head(10)
891    1
293    1
304    1
303    1
302    1
301    1
300    1
299    1
298    1
297    1
Name: PassengerId, dtype: int64
0    549
1    342
Name: Survived, dtype: int64
3    491
1    216
2    184
Name: Pclass, dtype: int64
Graham, Mr. George Edward                              1
Elias, Mr. Tannous                                     1
Madill, Miss. Georgette Alexandra                      1
Cumings, Mrs. John Bradley (Florence Briggs Thayer)    1
Beane, Mrs. Edward (Ethel Clarke)                      1
Roebling, Mr. Washington Augustus II                   1
Moran, Mr. James                                       1
Padro y Manent, Mr. Julian                             1
Scanlan, Mr. James                                     1
Ali, Mr. William                                       1
Name: Name, dtype: int64
male      577
female    314
Name: Sex, dtype: int64
24.0    30
22.0    27
18.0    26
19.0    25
30.0    25
28.0    25
21.0    24
25.0    23
36.0    22
29.0    20
Name: Age, dtype: int64
0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64
0    678
1    118
2     80
5      5
3      5
4      4
6      1
Name: Parch, dtype: int64
CA. 2343        7
347082          7
1601            7
347088          6
CA 2144         6
3101295         6
382652          5
S.O.C. 14879    5
PC 17757        4
4133            4
Name: Ticket, dtype: int64
8.0500     43
13.0000    42
7.8958     38
7.7500     34
26.0000    31
10.5000    24
7.9250     18
7.7750     16
26.5500    15
0.0000     15
Name: Fare, dtype: int64
C23 C25 C27        4
G6                 4
B96 B98            4
D                  3
C22 C26            3
E101               3
F2                 3
F33                3
B57 B59 B63 B66    2
C68                2
Name: Cabin, dtype: int64
S    644
C    168
Q     77
Name: Embarked, dtype: int64

Null value analysis

It seems most columns have no null values. The df.info() summary shows several columns already hold numerical data (for which the non-null count is accurate). Furthermore, the .value_counts() output for each series shows none of the typical placeholder null entries (e.g. *, -, None, NA, 999, -1, ""). The upshot is that only three columns are missing any data: Age, Cabin, and Embarked.
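As a quick cross-check (a sketch, not part of the original analysis), the same scan for placeholder values can be done in one line:

# Count occurrences of common placeholder "null" values in every column.
df.isin(["*", "-", "None", "NA", "", 999, -1]).sum()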

The Cabin column has relatively few values, and seems too specific to be much help, so I chose to drop it. The Embarked column is only missing 2 values, so there is not much payoff for the effort to re-integrate them. The Age column is missing 177 values which need to be imputed, plus 15 more that are 0.0000 (though after research I see these are employees and VIPs such as the ship designer). I will use a linear regression model on Pclass, Sex, Fare, Embarked, etc. to impute the Age values that are missing.

df[df["Fare"] < 0.001].sort_values("Ticket")
# researching several of these individuals shows they were employees
# and much more likely to have died
# add a feature that is 1 for free_fare
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
806 807 0 1 Andrews, Mr. Thomas Jr male 39.0 0 0 112050 0.0 A36 S
633 634 0 1 Parr, Mr. William Henry Marsh male NaN 0 0 112052 0.0 NaN S
815 816 0 1 Fry, Mr. Richard male NaN 0 0 112058 0.0 B102 S
263 264 0 1 Harrison, Mr. William male 40.0 0 0 112059 0.0 B94 S
822 823 0 1 Reuchlin, Jonkheer. John George male 38.0 0 0 19972 0.0 NaN S
277 278 0 2 Parkes, Mr. Francis "Frank" male NaN 0 0 239853 0.0 NaN S
413 414 0 2 Cunningham, Mr. Alfred Fleming male NaN 0 0 239853 0.0 NaN S
466 467 0 2 Campbell, Mr. William male NaN 0 0 239853 0.0 NaN S
481 482 0 2 Frost, Mr. Anthony Wood "Archie" male NaN 0 0 239854 0.0 NaN S
732 733 0 2 Knight, Mr. Robert J male NaN 0 0 239855 0.0 NaN S
674 675 0 2 Watson, Mr. Ennis Hastings male NaN 0 0 239856 0.0 NaN S
179 180 0 3 Leonard, Mr. Lionel male 36.0 0 0 LINE 0.0 NaN S
271 272 1 3 Tornquist, Mr. William Henry male 25.0 0 0 LINE 0.0 NaN S
302 303 0 3 Johnson, Mr. William Cahoone Jr male 19.0 0 0 LINE 0.0 NaN S
597 598 0 3 Johnson, Mr. Alfred male 49.0 0 0 LINE 0.0 NaN S
df = df.drop("Cabin", axis=1)
df = df.drop("Ticket", axis=1)
df = df.drop("PassengerId", axis=1)
print df[df["Embarked"].isnull()]
df.Embarked.fillna("S",inplace=True)
# https://www.encyclopedia-titanica.org/titanic-survivor/amelia-icard.html
# research shows these should both be set to S for Southampton
     Survived  Pclass                                       Name     Sex  \
61          1       1                        Icard, Miss. Amelie  female   
829         1       1  Stone, Mrs. George Nelson (Martha Evelyn)  female   

      Age  SibSp  Parch  Fare Embarked  
61   38.0      0      0  80.0      NaN  
829  62.0      0      0  80.0      NaN  
df = pd.get_dummies(df,columns=["Pclass","Sex","Embarked"],drop_first=True)
df['free_fare'] = df.Fare.map(lambda x: 1 if x < 0.001 else 0)

The Age of a person is related to some keywords in the Name column

master = df[["Master." in x for x in df["Name"]]]["Age"].dropna()
rev = df[["Rev." in x for x in df["Name"]]]["Age"].dropna()
mr = df[["Mr." in x for x in df["Name"]]]["Age"].dropna()
miss = df[["Miss." in x for x in df["Name"]]]["Age"].dropna()
mrs = df[["Mrs." in x for x in df["Name"]]]["Age"].dropna()
quot = df[['"' in x for x in df["Name"]]]["Age"].dropna()
paren = df[["(" in x for x in df["Name"]]]["Age"].dropna()
both = df[["(" in x and '"' in x for x in df["Name"]]]["Age"].dropna()
plt.boxplot([master.values,mr.values,miss.values, \
             mrs.values,quot.values,paren.values, \
             both.values])
plt.title("Ages vs name keywords")
plt.show()

png

df["has_master"] = df.Name.map(lambda x: 1 if 'Master.' in x else 0)
df["has_rev"] = df.Name.map(lambda x: 1 if 'Rev.' in x else 0)
df["has_mr"] = df.Name.map(lambda x: 1 if 'Mr.' in x else 0)
df["has_miss"] = df.Name.map(lambda x: 1 if 'Miss.' in x else 0)
df["has_mrs"] = df.Name.map(lambda x: 1 if 'Mrs.' in x else 0)
df["has_quote"] = df.Name.map(lambda x: 1 if '"' in x else 0)
df["has_parens"] = df.Name.map(lambda x: 1 if '(' in x else 0)
df["last_name"] = df.Name.map(lambda x: x.replace(",","").split()[0] )

Note that those name columns also have a bearing on the fare.

master = df[["Master." in x for x in df["Name"]]]["Fare"].dropna()
rev = df[["Rev." in x for x in df["Name"]]]["Fare"].dropna()
mr = df[["Mr." in x for x in df["Name"]]]["Fare"].dropna()
miss = df[["Miss." in x for x in df["Name"]]]["Fare"].dropna()
mrs = df[["Mrs." in x for x in df["Name"]]]["Fare"].dropna()
quot = df[['"' in x for x in df["Name"]]]["Fare"].dropna()
paren = df[["(" in x for x in df["Name"]]]["Fare"].dropna()
both = df[["(" in x and '"' in x for x in df["Name"]]]["Fare"].dropna()
plt.boxplot([master.values,mr.values,miss.values, \
             mrs.values,quot.values,paren.values,both.values])
plt.title("Fares vs name keywords")
plt.ylim(0,150)
plt.show()
# master

png

Impute the age….

Dropping for now; I will finish when I have more time.

X = df.iloc[:,2:]
y = df.iloc[:,0]
from sklearn.preprocessing import StandardScaler, MaxAbsScaler, MinMaxScaler, Imputer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, LabelBinarizer
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.pipeline import make_pipeline, Pipeline, FeatureUnion
from sklearn.linear_model import LogisticRegressionCV, ElasticNetCV
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree
X.head()
Age SibSp Parch Fare Pclass_2 Pclass_3 Sex_male Embarked_Q Embarked_S free_fare has_quote has_parens last_name has_master has_rev has_mr has_miss has_mrs
0 22.0 1 0 7.2500 0 1 1 0 1 0 0 0 Braund 0 0 1 0 0
1 38.0 1 0 71.2833 0 0 0 0 0 0 0 1 Cumings 0 0 0 0 1
2 26.0 0 0 7.9250 0 1 0 0 1 0 0 0 Heikkinen 0 0 0 1 0
3 35.0 1 0 53.1000 0 0 0 0 1 0 0 1 Futrelle 0 0 0 0 1
4 35.0 0 0 8.0500 0 1 1 0 1 0 0 0 Allen 0 0 1 0 0
cols = [x for x in X.columns if x !='last_name' and x != 'Age']
u = StandardScaler()
v = ElasticNetCV()
g = GridSearchCV(Pipeline([('scale',u),('fit',v)]),{'fit__l1_ratio':(0.00001,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1)})
g.fit(X[X.Age.notnull()].ix[:,cols],X.Age[X.Age.notnull()])
GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('fit', ElasticNetCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
       l1_ratio=0.5, max_iter=1000, n_alphas=100, n_jobs=1,
       normalize=False, positive=False, precompute='auto',
       random_state=None, selection='cyclic', tol=0.0001, verbose=0))]),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'fit__l1_ratio': (1e-05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
g.best_estimator_
Pipeline(steps=[('scale', StandardScaler(copy=True, with_mean=True, with_std=True)), ('fit', ElasticNetCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
       l1_ratio=0.2, max_iter=1000, n_alphas=100, n_jobs=1,
       normalize=False, positive=False, precompute='auto',
       random_state=None, selection='cyclic', tol=0.0001, verbose=0))])
pd.DataFrame(g.cv_results_).sort_values("rank_test_score")
mean_fit_time mean_score_time mean_test_score mean_train_score param_fit__l1_ratio params rank_test_score split0_test_score split0_train_score split1_test_score split1_train_score split2_test_score split2_train_score std_fit_time std_score_time std_test_score std_train_score
2 0.046002 0.000550 0.391722 0.429859 0.2 {u'fit__l1_ratio': 0.2} 1 0.413091 0.422948 0.366310 0.446579 0.395766 0.420050 0.002945 4.856305e-06 0.019311 0.011882
3 0.042412 0.000610 0.391685 0.430925 0.3 {u'fit__l1_ratio': 0.3} 2 0.412379 0.424478 0.366585 0.448172 0.396090 0.420125 0.000749 7.684356e-05 0.018953 0.012324
4 0.050115 0.000556 0.391633 0.431215 0.4 {u'fit__l1_ratio': 0.4} 3 0.412232 0.424710 0.366492 0.449001 0.396175 0.419933 0.005820 1.209189e-05 0.018948 0.012727
10 0.052624 0.000580 0.391569 0.430489 1 {u'fit__l1_ratio': 1} 4 0.412814 0.423898 0.365593 0.450084 0.396301 0.417486 0.004806 2.035405e-05 0.019566 0.014100
6 0.044164 0.000552 0.391485 0.431206 0.6 {u'fit__l1_ratio': 0.6} 5 0.412507 0.424419 0.365893 0.449813 0.396054 0.419386 0.000968 4.112672e-06 0.019303 0.013316
5 0.043362 0.000547 0.391462 0.431158 0.5 {u'fit__l1_ratio': 0.5} 6 0.412284 0.424680 0.366234 0.449498 0.395867 0.419297 0.000841 4.052337e-07 0.019056 0.013153
7 0.045165 0.000662 0.391385 0.430981 0.7 {u'fit__l1_ratio': 0.7} 7 0.412651 0.424263 0.365791 0.449899 0.395712 0.418780 0.000889 1.501678e-04 0.019373 0.013563
9 0.050791 0.000570 0.391225 0.430514 0.9 {u'fit__l1_ratio': 0.9} 8 0.412857 0.423943 0.365807 0.449975 0.395013 0.417623 0.004066 2.837103e-05 0.019394 0.014001
8 0.046530 0.000582 0.391185 0.430643 0.8 {u'fit__l1_ratio': 0.8} 9 0.412813 0.424076 0.365768 0.449945 0.394972 0.417907 0.001666 4.249577e-05 0.019392 0.013879
1 0.040549 0.000563 0.391117 0.426731 0.1 {u'fit__l1_ratio': 0.1} 10 0.413325 0.418232 0.364558 0.441993 0.395468 0.419967 0.001385 2.537861e-05 0.020145 0.010815
0 0.054007 0.000892 -0.002510 0.002103 1e-05 {u'fit__l1_ratio': 1e-05} 11 -0.002335 0.001980 -0.006706 0.002114 0.001509 0.002216 0.012009 3.612715e-04 0.003356 0.000096
g.predict(X[X.Age.notnull()].ix[:,cols]).shape
(714,)
X.Age[X.Age.notnull()].shape
(714,)
g.score(X[X.Age.notnull()].ix[:,cols],X.Age[X.Age.notnull()])
0.42502999520596807
plt.scatter(X.Age[X.Age.notnull()],g.predict(X[X.Age.notnull()].ix[:,cols]))
plt.plot(X.Age[X.Age.notnull()],X.Age[X.Age.notnull()])
[<matplotlib.lines.Line2D at 0x141d71a90>]

png

g.predict(X[X.Age.isnull()].ix[:,cols])
array([ 33.52360534,  33.89086473,  31.06711941,  28.61787182,
        21.53509477,  29.89157263,  40.0343745 ,  23.17295561,
        28.6178183 ,  28.60932366,  29.88960763,  31.36828054,
        23.17295561,  24.30249704,  42.3477308 ,  41.164614  ,
         5.36742999,  29.89157263,  29.88960763,  23.17247774,
        29.88960763,  29.88960763,  28.25535822,  29.89311202,
        20.8983758 ,  29.88960763,  33.53263137,  15.64632392,
        30.2580401 ,  29.89900576,  29.88180239,  -8.8549017 ,
        44.19505112,  42.46974728,   2.38825012,   1.51462849,
        32.58249212,  42.16295389,  32.18131372,  33.53263137,
        21.5367412 ,  11.87430425,  31.46704062,  29.89157263,
        12.75778031,  19.53630348,  16.10048191,  19.37239037,
        29.89980221,  42.95785004,  33.53263137,  23.17295561,
        42.40507536,  21.5367412 ,  32.42031237,  42.46879154,
        41.164614  ,  42.41144698,  21.5367412 ,  29.20392972,
        27.17867284,  28.25339321,  29.74517895,  11.87430425,
        18.84425395,  41.48063911,  29.89157263,  30.17068142,
        42.35410242,  28.61787182,  23.17130919,  21.53509477,
        31.36828054,  31.06706589,  23.17295561,  40.85440168,
        29.89157263,  33.53289643,  12.75778031,  29.89157263,
        33.54399452,  34.05652678,  32.33885522,  28.60932366,
        29.89980221,  33.53263137,  30.17068142,  29.88881117,
        27.67215956,  29.88960763,  42.52287645,  33.53263137,
        29.88960763,  34.05652678,  33.53294995,  29.89980221,
        42.13746742,  32.42031237,  12.75778031,  27.67215956,
        28.52569618,  29.79976782,  23.17449499,  40.82556834,
        29.88960763,  33.32364232,  28.61787182,  28.6178183 ,
        39.97393115,  28.6178183 ,  30.16740384,  29.80741376,
        32.59762471,  33.53162211,  38.6184621 ,  33.53263137,
        29.88960763,  19.52993187,  28.6178183 ,  23.17295561,
        28.90935301,  28.59891626,  29.88960763,  22.46608724,
        23.27632426,  28.61787182,  29.89157263,  42.25980248,
        29.90235086,  21.00860478,  33.53263137,  33.53284418,
        42.80011565,  27.72143383,  29.2722514 ,  29.89597924,
        29.89157263,  21.53573193,  29.89157263,  29.89311202,
        42.52112426,  34.05652678,  23.16801761,  29.2722514 ,
        21.53695401,   3.73121558,  42.46178276,  33.4338713 ,
        23.1731149 ,  34.05652678,  29.89157263,  29.89157263,
        42.41781859,  29.80741376,  41.01323456,  31.25805156,
        28.61787182,  33.53263137,  33.53279066,  26.92090268,
        31.89641696,   1.51462849,  41.12670288,  42.80011565,
        33.54282596,  29.2722514 ,  33.53263137,  28.6178183 ,
        29.88960763,  41.13939259,  11.87430425,  40.76604774,
        28.6178183 ,  -0.12158593,  29.87112993,  29.89157263,  14.9250125 ])

Aborted imputation
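Had the imputation been kept, writing the predictions back would look roughly like this (a sketch only; the leakage issue noted above would still need to be addressed first):

# Sketch: fill the missing ages with the fitted model's predictions.
missing = X.Age.isnull()
X.loc[missing, "Age"] = g.predict(X.loc[missing, cols])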

Drop columns not required

df = df.drop("Age",axis=1)
df = df.drop("Name",axis=1)
df.head()
Survived SibSp Parch Fare Pclass_2 Pclass_3 Sex_male Embarked_Q Embarked_S free_fare has_master has_rev has_mr has_miss has_mrs has_quote has_parens last_name
0 0 1 0 7.2500 0 1 1 0 1 0 0 0 1 0 0 0 0 Braund
1 1 1 0 71.2833 0 0 0 0 0 0 0 0 0 0 1 0 1 Cumings
2 1 0 0 7.9250 0 1 0 0 1 0 0 0 0 1 0 0 0 Heikkinen
3 1 1 0 53.1000 0 0 0 0 1 0 0 0 0 0 1 0 1 Futrelle
4 0 0 0 8.0500 0 1 1 0 1 0 0 0 1 0 0 0 0 Allen
X = df.iloc[:,1:]
y = df.iloc[:,0]

Make a Train/test split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
class ModelTransformer(BaseEstimator,TransformerMixin):

    def __init__(self, model=None):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        return self.model.transform(X)
    
class SampleExtractor(BaseEstimator, TransformerMixin):
    """Takes in varaible names as a **list**"""

    def __init__(self, vars):
        self.vars = vars  # e.g. pass in a column names to extract

    def transform(self, X, y=None):
        if len(self.vars) > 1:
            return pd.DataFrame(X[self.vars]) # where the actual feature extraction happens
        else:
            return pd.Series(X[self.vars[0]])

    def fit(self, X, y=None):
        return self  # generally does nothing
    
    
class DenseTransformer(BaseEstimator,TransformerMixin):

    def transform(self, X, y=None, **fit_params):
#         print X.todense()
        return X.todense()

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y, **fit_params)
        return self.transform(X)

    def fit(self, X, y=None, **fit_params):
        return self

Logistic Regression

kf_shuffle = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)

cols = [x for x in X.columns if x !='last_name']

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('names', Pipeline([
                      ('text',SampleExtractor(['last_name'])),
                      ('dummify', CountVectorizer(binary=True)),
                      ('densify', DenseTransformer()),
                     ])),
        ('cont_features', Pipeline([
                      ('continuous', SampleExtractor(cols)),
                      ])),
        ])),
        ('scale', ModelTransformer()),
        ('fit', LogisticRegressionCV(solver='liblinear')),
])


parameters = {
    'scale__model': (StandardScaler(),MinMaxScaler()),
    'fit__penalty': ('l1','l2'),
    'fit__class_weight':('balanced',None),
    'fit__Cs': (5,10,15),
}

logreg_gs = GridSearchCV(pipeline, parameters, verbose=False, cv=kf_shuffle, n_jobs=-1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)


logreg_gs.fit(X_train, y_train)

print("Best score: %0.3f" % logreg_gs.best_score_)
print("Best parameters set:")
best_parameters = logreg_gs.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Performing grid search...
('pipeline:', ['features', 'scale', 'fit'])
parameters:
{'fit__class_weight': ('balanced', None), 'scale__model': (StandardScaler(copy=True, with_mean=True, with_std=True), MinMaxScaler(copy=True, feature_range=(0, 1))), 'fit__Cs': (5, 10, 15), 'fit__penalty': ('l1', 'l2')}
Best score: 0.837
Best parameters set:
	fit__Cs: 10
	fit__class_weight: None
	fit__penalty: 'l1'
	scale__model: StandardScaler(copy=True, with_mean=True, with_std=True)
cv_pred = pd.Series(logreg_gs.predict(X_test))
pd.DataFrame(zip(logreg_gs.cv_results_['mean_test_score'],\
                 logreg_gs.cv_results_['std_test_score']\
                )).sort_values(0,ascending=False).head(10)

0 1
12 0.837079 0.030212
4 0.835674 0.035058
23 0.835674 0.034204
15 0.834270 0.031169
20 0.834270 0.033009
21 0.832865 0.037548
17 0.832865 0.033389
13 0.831461 0.035749
16 0.830056 0.034090
3 0.830056 0.038983
# logreg_gs.best_estimator_
confusion_matrix(y_test,cv_pred)
array([[94, 11],
       [16, 58]])
print classification_report(y_test,cv_pred)
             precision    recall  f1-score   support

          0       0.85      0.90      0.87       105
          1       0.84      0.78      0.81        74

avg / total       0.85      0.85      0.85       179
from sklearn.metrics import roc_curve, auc, precision_recall_curve
plt.style.use('seaborn-white')

# Y_score = logreg_gs.best_estimator_.decision_function(X_test)
Y_score = logreg_gs.best_estimator_.predict_proba(X_test)[:,1]

# For class 1, find the area under the curve
FPR, TPR, _ = roc_curve(y_test, Y_score)
ROC_AUC = auc(FPR, TPR)
PREC, REC, _ = precision_recall_curve(y_test, Y_score)
PR_AUC = auc(REC, PREC)

# Plot ROC and PR curves for class 1 (survived)
plt.figure(figsize=[11,9])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth=4)
plt.plot(REC, PREC, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Logistic Regression for Titanic Survivors', fontsize=18)
plt.legend(loc="lower right")
plt.show()

png

plt.scatter(y_test,cv_pred,color='r')
plt.plot(y_test,y_test,color='k')
plt.xlabel("True value")
plt.ylabel("Predicted Value")
plt.show()

png


kNN

kf_shuffle = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)

cols = [x for x in X.columns if x !='last_name']

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('names', Pipeline([
                      ('text',SampleExtractor(['last_name'])),
                      ('dummify', CountVectorizer(binary=True)),
                      ('densify', DenseTransformer()),
                     ])),
        ('cont_features', Pipeline([
                      ('continuous', SampleExtractor(cols)),
                      ])),
        ])),
        ('scale', ModelTransformer()),
        ('fit', KNeighborsClassifier()),
])


parameters = {
    'scale__model': (StandardScaler(),MinMaxScaler()),
    'fit__n_neighbors': (2,3,5,7,9,11,16,20),
    'fit__weights': ('uniform','distance'),
}

knn_gs = GridSearchCV(pipeline, parameters, verbose=False, cv=kf_shuffle, n_jobs=-1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)


knn_gs.fit(X_train, y_train)

print("Best score: %0.3f" % knn_gs.best_score_)
print("Best parameters set:")
best_parameters = knn_gs.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Performing grid search...
('pipeline:', ['features', 'scale', 'fit'])
parameters:
{'fit__n_neighbors': (2, 3, 5, 7, 9, 11, 16, 20), 'scale__model': (StandardScaler(copy=True, with_mean=True, with_std=True), MinMaxScaler(copy=True, feature_range=(0, 1))), 'fit__weights': ('uniform', 'distance')}
Best score: 0.829
Best parameters set:
	fit__n_neighbors: 5
	fit__weights: 'uniform'
	scale__model: MinMaxScaler(copy=True, feature_range=(0, 1))
cv_pred = pd.Series(knn_gs.predict(X_test))
pd.DataFrame(zip(knn_gs.cv_results_['mean_test_score'],\
                 knn_gs.cv_results_['std_test_score']\
                )).sort_values(0,ascending=False).head(10)

0 1
9 0.828652 0.027271
11 0.827247 0.024505
5 0.821629 0.023673
25 0.818820 0.035553
7 0.818820 0.021308
31 0.816011 0.032000
27 0.814607 0.032273
29 0.811798 0.033521
19 0.810393 0.031341
17 0.810393 0.033816
# knn_gs.best_estimator_
confusion_matrix(y_test,cv_pred)
array([[91, 14],
       [21, 53]])
print classification_report(y_test,cv_pred)
             precision    recall  f1-score   support

          0       0.81      0.87      0.84       105
          1       0.79      0.72      0.75        74

avg / total       0.80      0.80      0.80       179
from sklearn.metrics import roc_curve, auc, precision_recall_curve
plt.style.use('seaborn-white')

# Y_score = knn_gs.best_estimator_.decision_function(X_test)
Y_score = knn_gs.best_estimator_.predict_proba(X_test)[:,1]


# For class 1, find the area under the curve
FPR, TPR, _ = roc_curve(y_test, Y_score)
ROC_AUC = auc(FPR, TPR)

PREC, REC, _ = precision_recall_curve(y_test, Y_score)
PR_AUC = auc(REC, PREC)

# Plot ROC and PR curves for class 1 (survived)
plt.figure(figsize=[11,9])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth=4)
plt.plot(REC, PREC, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('kNN for Titanic Survivors', fontsize=18)
plt.legend(loc="lower right")
plt.show()

png

Decision Tree

kf_shuffle = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)

cols = [x for x in X.columns if x !='last_name']

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('names', Pipeline([
                      ('text',SampleExtractor(['last_name'])),
                      ('dummify', CountVectorizer(binary=True)),
                      ('densify', DenseTransformer()),
                     ])),
        ('cont_features', Pipeline([
                      ('continuous', SampleExtractor(cols)),
                      ])),
        ])),
#         ('scale', ModelTransformer()),
        ('fit', tree.DecisionTreeClassifier()),
])


parameters = {
#     'scale__model': (StandardScaler(),MinMaxScaler()),
    'fit__max_depth': (2,3,4,None),
    'fit__min_samples_split': (2,3,4,5),
    'fit__class_weight':('balanced',None),
}

dt_gs = GridSearchCV(pipeline, parameters, verbose=False, cv=kf_shuffle, n_jobs=-1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)


dt_gs.fit(X_train, y_train)

print("Best score: %0.3f" % dt_gs.best_score_)
print("Best parameters set:")
best_parameters = dt_gs.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Performing grid search...
('pipeline:', ['features', 'fit'])
parameters:
{'fit__class_weight': ('balanced', None), 'fit__min_samples_split': (2, 3, 4, 5), 'fit__max_depth': (2, 3, 4, None)}
Best score: 0.833
Best parameters set:
	fit__class_weight: 'balanced'
	fit__max_depth: None
	fit__min_samples_split: 5
cv_pred = pd.Series(dt_gs.predict(X_test))
pd.DataFrame(zip(dt_gs.cv_results_['mean_test_score'],\
                 dt_gs.cv_results_['std_test_score']\
                )).sort_values(0,ascending=False).head(10)

0 1
15 0.832865 0.025600
12 0.831461 0.027212
30 0.831461 0.028052
13 0.830056 0.032148
31 0.830056 0.033806
5 0.828652 0.025481
4 0.828652 0.025481
7 0.828652 0.025481
6 0.828652 0.025481
29 0.827247 0.034274
# dt_gs.best_estimator_
confusion_matrix(y_test,cv_pred)
array([[92, 13],
       [18, 56]])
print classification_report(y_test,cv_pred)
             precision    recall  f1-score   support

          0       0.84      0.88      0.86       105
          1       0.81      0.76      0.78        74

avg / total       0.83      0.83      0.83       179
from sklearn.metrics import roc_curve, auc, precision_recall_curve
plt.style.use('seaborn-white')

# Y_score = dt_gs.best_estimator_.decision_function(X_test)
Y_score = dt_gs.best_estimator_.predict_proba(X_test)[:,1]


# For class 1, find the area under the curve
FPR, TPR, _ = roc_curve(y_test, Y_score)
ROC_AUC = auc(FPR, TPR)

PREC, REC, _ = precision_recall_curve(y_test, Y_score)
PR_AUC = auc(REC, PREC)

# Plot ROC and PR curves for class 1 (survived)
plt.figure(figsize=[11,9])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth=4)
plt.plot(REC, PREC, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=18)
plt.ylabel('True Positive Rate', fontsize=18)
plt.title('Decision Tree for Titanic Survivors', fontsize=18)
plt.legend(loc="lower right")
plt.show()

png

Random Forest

from sklearn.ensemble import RandomForestClassifier
kf_shuffle = StratifiedKFold(n_splits=5,shuffle=True,random_state=777)

cols = [x for x in X.columns if x !='last_name']

pipeline = Pipeline([
    ('features', FeatureUnion([
        ('names', Pipeline([
                      ('text',SampleExtractor(['last_name'])),
                      ('dummify', CountVectorizer(binary=True)),
                      ('densify', DenseTransformer()),
                     ])),
        ('cont_features', Pipeline([
                      ('continuous', SampleExtractor(cols)),
                      ])),
        ])),
#         ('scale', ModelTransformer()),
        ('fit', RandomForestClassifier()),
])


parameters = {
#     'scale__model': (StandardScaler(),MinMaxScaler()),
    'fit__max_depth': (4,7,10),
    'fit__n_estimators': (25,50,100),
    'fit__class_weight':('balanced',None),
    'fit__max_features': ('auto',0.3,0.5),
}

rf_gs = GridSearchCV(pipeline, parameters, verbose=False, cv=kf_shuffle, n_jobs=-1)

print("Performing grid search...")
print("pipeline:", [name for name, _ in pipeline.steps])
print("parameters:")
print(parameters)


rf_gs.fit(X_train, y_train)

print("Best score: %0.3f" % rf_gs.best_score_)
print("Best parameters set:")
best_parameters = rf_gs.best_estimator_.get_params()
for param_name in sorted(parameters.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
Performing grid search...
('pipeline:', ['features', 'fit'])
parameters:
{'fit__class_weight': ('balanced', None), 'fit__n_estimators': (25, 50, 100), 'fit__max_features': ('auto', 0.3, 0.5), 'fit__max_depth': (4, 7, 10)}
Best score: 0.841
Best parameters set:
	fit__class_weight: None
	fit__max_depth: 10
	fit__max_features: 0.5
	fit__n_estimators: 25
cv_pred = pd.Series(rf_gs.predict(X_test))
pd.DataFrame(zip(rf_gs.cv_results_['mean_test_score'],\
                 rf_gs.cv_results_['std_test_score']\
                )).sort_values(0,ascending=False).head(10)

0 1
51 0.841292 0.029846
52 0.838483 0.027874
53 0.837079 0.031107
48 0.837079 0.030145
43 0.837079 0.032100
41 0.837079 0.028741
26 0.837079 0.023149
50 0.837079 0.029748
44 0.835674 0.030564
21 0.835674 0.029450
# rf_gs.best_estimator_
confusion_matrix(y_test,cv_pred)
array([[94, 11],
       [18, 56]])
print classification_report(y_test,cv_pred)
             precision    recall  f1-score   support

          0       0.84      0.90      0.87       105
          1       0.84      0.76      0.79        74

avg / total       0.84      0.84      0.84       179
from sklearn.metrics import roc_curve, auc, precision_recall_curve, average_precision_score
plt.style.use('seaborn-white')

Y_score = rf_gs.best_estimator_.predict_proba(X_test)[:,1]


# For class 1, find the area under the curve
FPR, TPR, _ = roc_curve(y_test, Y_score)
ROC_AUC = auc(FPR, TPR)

PREC, REC, _ = precision_recall_curve(y_test, Y_score)
PR_AUC = auc(REC, PREC)

# Plot ROC and PR curves for class 1 (survived)
plt.figure(figsize=[11,9])
plt.plot(FPR, TPR, label='ROC curve (area = %0.2f)' % ROC_AUC, linewidth=4)
plt.plot(REC, PREC, label='PR curve (area = %0.2f)' % PR_AUC, linewidth=4)
plt.plot([0, 1], [0, 1], 'k--', linewidth=4)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate or Recall', fontsize=18)
plt.ylabel('True Positive Rate or Precision', fontsize=18)
plt.title('Random Forest for Titanic Survivors', fontsize=18)
plt.legend(loc="lower right")
plt.show()

png

rf_gs.best_estimator_.steps[1][1].feature_importances_
array([  2.43993321e-04,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.88729037e-03,   0.00000000e+00,
         1.82609102e-03,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.01864924e-02,   0.00000000e+00,
         1.01726821e-03,   3.36357246e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   5.97271125e-03,   0.00000000e+00,
         1.85980568e-03,   0.00000000e+00,   0.00000000e+00,
         5.33125037e-04,   1.42663025e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.72686233e-03,   1.37342352e-03,   2.10729310e-03,
         0.00000000e+00,   5.08389260e-04,   0.00000000e+00,
         3.22826323e-03,   0.00000000e+00,   0.00000000e+00,
         1.55068098e-03,   1.69654764e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   2.17833461e-03,   1.96345241e-03,
         0.00000000e+00,   2.02352363e-03,   2.60110856e-04,
         1.21028903e-03,   0.00000000e+00,   0.00000000e+00,
         2.58686550e-03,   3.21356995e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   3.54672991e-03,
         0.00000000e+00,   2.23678031e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   2.33725320e-03,   8.46304365e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         2.02266361e-04,   0.00000000e+00,   0.00000000e+00,
         2.32004715e-03,   0.00000000e+00,   0.00000000e+00,
         3.45731891e-03,   0.00000000e+00,   0.00000000e+00,
         1.03254617e-03,   5.37900073e-04,   0.00000000e+00,
         4.76422917e-05,   4.60906573e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         4.49155430e-04,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.28634188e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         9.62072946e-04,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   2.33940464e-04,   0.00000000e+00,
         6.63859910e-03,   3.18824190e-03,   7.81486187e-04,
         0.00000000e+00,   0.00000000e+00,   2.08962446e-04,
         0.00000000e+00,   3.15301426e-04,   5.11570521e-04,
         3.07277924e-04,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   2.79901900e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.71925602e-03,   0.00000000e+00,
         1.10877611e-03,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.72660193e-04,   4.74944188e-04,   0.00000000e+00,
         9.64868341e-04,   6.69893974e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.62353220e-04,   2.47982606e-04,
         1.31923089e-03,   4.15051604e-04,   2.27338506e-03,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   4.44213810e-04,   0.00000000e+00,
         5.20062743e-04,   0.00000000e+00,   7.37495583e-05,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   1.26319552e-03,   1.27752923e-03,
         0.00000000e+00,   3.83681113e-03,   0.00000000e+00,
         0.00000000e+00,   1.76380590e-03,   0.00000000e+00,
         1.63580385e-03,   0.00000000e+00,   0.00000000e+00,
         2.44040326e-03,   1.22518289e-03,   1.02030148e-03,
         5.41470510e-04,   6.26082572e-04,   0.00000000e+00,
         3.14088604e-03,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   1.28351669e-03,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         6.47527781e-04,   0.00000000e+00,   1.26876251e-03,
         2.93623032e-03,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   2.44024274e-04,
         6.35089403e-03,   0.00000000e+00,   0.00000000e+00,
         2.76114576e-03,   3.79949319e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.12744779e-03,   9.73588122e-04,   0.00000000e+00,
         3.93905950e-03,   3.50839012e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.19802495e-03,   3.43686305e-04,   1.91844659e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   4.40313049e-04,   0.00000000e+00,
         9.50613760e-04,   0.00000000e+00,   0.00000000e+00,
         1.18392117e-03,   0.00000000e+00,   0.00000000e+00,
         1.02522244e-03,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   1.23522044e-03,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         3.63012657e-04,   8.59882869e-04,   2.50121557e-03,
         2.21267024e-03,   4.34446451e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   6.18592444e-03,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         3.59646694e-03,   6.32630134e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         3.04611805e-03,   2.42299984e-04,   1.07791089e-03,
         0.00000000e+00,   0.00000000e+00,   7.86811562e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   4.07194611e-03,   3.60803113e-03,
         0.00000000e+00,   1.82830752e-04,   3.14950622e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         7.61392034e-04,   1.42917053e-03,   0.00000000e+00,
         4.33085245e-03,   0.00000000e+00,   3.38076209e-04,
         5.05424906e-04,   0.00000000e+00,   8.78314863e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   1.99574730e-03,
         0.00000000e+00,   0.00000000e+00,   1.89344093e-03,
         2.34191645e-04,   0.00000000e+00,   2.31475782e-03,
         0.00000000e+00,   0.00000000e+00,   4.68326596e-06,
         0.00000000e+00,   2.04414938e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   1.82021434e-03,
         0.00000000e+00,   1.56066514e-04,   8.47376255e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.83121808e-04,   1.57363828e-03,   0.00000000e+00,
         2.30668709e-03,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   1.29701315e-03,
         0.00000000e+00,   3.51598123e-04,   0.00000000e+00,
         9.62648256e-04,   0.00000000e+00,   1.48094076e-03,
         3.55935774e-03,   4.30743922e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   5.96215305e-03,
         0.00000000e+00,   2.29533030e-03,   2.12665936e-04,
         6.92093327e-04,   3.89455246e-04,   8.37933139e-04,
         2.06012034e-04,   2.72230355e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   3.40648274e-04,
         3.12133028e-04,   0.00000000e+00,   1.71356611e-03,
         0.00000000e+00,   0.00000000e+00,   7.43399551e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         4.80438512e-04,   0.00000000e+00,   0.00000000e+00,
         6.14276736e-04,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   4.07242107e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.28637336e-03,   4.11836733e-04,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   2.80063313e-04,   0.00000000e+00,
         1.90569167e-04,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         8.59025637e-04,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   7.10615123e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         3.27564978e-04,   4.31046166e-03,   3.12050711e-03,
         2.34858746e-03,   9.88011208e-04,   1.63284651e-03,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         1.99728023e-04,   1.00929394e-03,   0.00000000e+00,
         4.06965913e-04,   0.00000000e+00,   2.17943251e-04,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   2.22063252e-03,
         0.00000000e+00,   7.11121821e-05,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         9.92222423e-04,   2.77645396e-04,   5.35579534e-03,
         1.80536592e-03,   0.00000000e+00,   0.00000000e+00,
         3.40726324e-03,   1.33144258e-03,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   0.00000000e+00,
         2.43885155e-03,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   4.90491182e-04,   0.00000000e+00,
         0.00000000e+00,   2.34099771e-04,   0.00000000e+00,
         0.00000000e+00,   6.00299361e-04,   0.00000000e+00,
         0.00000000e+00,   1.98031361e-03,   1.56216365e-04,
         6.96190607e-04,   0.00000000e+00,   0.00000000e+00,
         0.00000000e+00,   0.00000000e+00,   7.96366323e-03,
         4.15514654e-03,   4.70998488e-02,   2.22963649e-02,
         1.22423889e-01,   7.26190979e-03,   7.08305928e-02,
         1.38032194e-01,   6.55195016e-03,   1.21600510e-02,
         0.00000000e+00,   1.01664273e-02,   1.16605028e-02,
         1.91686244e-01,   1.99839408e-02,   2.43815100e-02,
         3.02540060e-03,   2.77851205e-02])
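
The raw importance array is hard to read because most entries correspond to individual last names from the CountVectorizer. Below is a minimal sketch of pairing importances with feature names, assuming the FeatureUnion ordering above (vectorized last names first, then the continuous columns) and the old-style get_feature_names() API.

# Sketch: recover feature names in the order the FeatureUnion emits them.
best = rf_gs.best_estimator_
vectorizer = best.named_steps['features'].transformer_list[0][1].named_steps['dummify']
feature_names = list(vectorizer.get_feature_names()) + cols
importances = best.named_steps['fit'].feature_importances_
pd.Series(importances, index=feature_names).sort_values(ascending=False).head(20)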

Written on August 8, 2017