Using bivariate kernel density estimation for plotting multi-task classification results

One common technique for interpreting the outputs of a single classification model is kernel density estimation (KDE). Similar to a histogram, a KDE plot allows us to estimate the underlying probability density of our model's predicted probabilities.

This is particularly useful for visualising the impact of selecting different classification thresholds (i.e. deciding at what point to round a given probability to 1 or 0).
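
As a quick illustration of the single-task case, here is a minimal sketch. The array y_scores is hypothetical (think of it as the positive-class column of predict_proba from any fitted binary classifier); the Seaborn calls are covered properly later in this post:

%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt

# y_scores is a hypothetical 1D array of predicted positive-class probabilities,
# e.g. model.predict_proba(X_test)[:, 1] from a fitted binary classifier
ax = sns.kdeplot(y_scores, shade=True)
ax.set(xlabel='P(positive class)')

# mark a candidate classification threshold
plt.axvline(x=0.5, color='grey', linewidth=1, linestyle='dotted')
plt.xlim(0, 1);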

You can apply this visualisation technique to multi-task classification too, using bivariate KDE (which also generalises to multivariate KDE). Unfortunately we're constrained to two tasks, given the limitation of having only two axes on a 2D plot.

To achieve this, we're going to create a suitable test dataset based on the Digits classification data, train a Random Forest Classifier using two labels, and output a bivariate KDE plot using the Seaborn visualisation library.

Preparing data for multi-task learning

To simulate multi-task learning, we're going to load three classes of the Digits data (i.e. digits 0, 1 and 2), and break this into labels for two binary tasks:

  1. the digit is one
  2. the digit is two

(In the case that the digit is zero, both tasks will have a False label.)

Before training our model, we'll also consolidate the two label sets into a single N x 2 set.

In [1]:
from sklearn.datasets import load_digits
import numpy as np

X, y = load_digits(return_X_y=True, n_class=3)

y_task1 = y == 1
y_task2 = y == 2

y_multitask = np.stack((y_task1,y_task2), axis=-1)

Training a multi-task Random Forest classifier

After splitting our data into train and test sets, we'll train the classifier and split the predicted probabilities into two sets:

  1. probability that the digit is one
  2. probability that the digit is two
In [14]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_multitask, test_size=0.4, random_state=1234)

model = RandomForestClassifier(random_state=1234)
model.fit(X_train, y_train)

# for multi-output classification, predict_proba returns one array of probabilities per task
y_hat = model.predict_proba(X_test)
y_task1_hat = y_hat[0][:,1]  # probability that the digit is one
y_task2_hat = y_hat[1][:,1]  # probability that the digit is two

Visualising bivariate KDE with Seaborn

Seaborn (a set of extensions over Matplotlib) comes to the rescue with its built-in KDE plot function. The function accepts two data sets, one for the X-axis (in this case, task 1) and one for the Y-axis (task 2).

We've also constrained the axes to the range (0, 1). Without this, the full range of the KDE would be plotted, extending below 0% and above 100%, which doesn't make sense in the context of our problem (or any binary classification task).

In [13]:
%matplotlib inline
import seaborn as sns
from matplotlib import pyplot as plt
kdeplot = sns.kdeplot(y_task1_hat, y_task2_hat, shade=True, legend=True)
plt.axvline(x=0.5, color='grey', linewidth=1, linestyle='dotted')
plt.axhline(y=0.5, color='grey', linewidth=1, linestyle='dotted')
kdeplot.set(xlabel='P(Task 1)', ylabel='P(Task 2)')
plt.xlim(0,1)
plt.ylim(0,1);

The bivariate KDE plot makes the probability densities for our two learning tasks clear, and as you would expect, demonstrates that a given case is unlikely to have a high probability for both tasks simultaneously.
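
To close the loop with the dotted guide lines at 0.5, here is a minimal sketch of turning the predicted probabilities back into per-task decisions by applying a threshold to each task (the 0.5 thresholds are illustrative, not tuned):

import numpy as np

# apply an (illustrative) 0.5 threshold to each task
task1_pred = y_task1_hat >= 0.5
task2_pred = y_task2_hat >= 0.5

# combine back into an N x 2 array of decisions, matching the shape of y_test
y_pred_multitask = np.stack((task1_pred, task2_pred), axis=-1)

# per-task accuracy against the held-out labels
print('task 1 accuracy:', (task1_pred == y_test[:, 0]).mean())
print('task 2 accuracy:', (task2_pred == y_test[:, 1]).mean())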

Automating error analysis with RuleFit models

When building machine learning models, the goal is generally to improve the performance of a model based on some performance metric. One of the simplest metrics is error. Error is simply the complement of model accuracy - so if a model had 95% accuracy, this would correspond to 5% error.

There are many ways to improve the performance of a model and subsequently decrease the model error. These include adding more training observations (rows), enriching training observations with more features (columns), modifying the model algorithm or optimising the algorithm parameters.

Reducing error by introducing new features

In the post linked above we looked at how to optimise the parameters of a given algorithm, so for now we're interested in what we can do with the data itself.

While there are many ways to create more training observations, this is often infeasible due to consideration of cost (imagine the cost of high-end medical studies) and time (such as waiting for enough events to occur).

The next option we have is to introduce new features to the observations already in our model. There are often many different approaches here too, including:

  • engineering new features based on existing features
  • creating new features from available data not already used
  • making more data available (such as from external providers)

The goal of this post is to explore a method not for assessing which approach to take, but for identifying where the gaps are to help you assess all of the options available to you.

Modelling error analysis

To get started, let's install an implementation of RuleFit from GitHub using pip:

pip install git+https://github.com/christophM/rulefit

Now we're going to load up a sample data set to work on, partitioning it into data for training our initial model, and data for testing its performance. Note that feature names will be important for this exercise.

In [1]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# load our data - we also care about feature names
data = load_breast_cancer()
data.feature_names = list(map(lambda s: s.replace(' ' , '_'), data.feature_names))
X, y = data.data, data.target

# split data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.75, 
                                                    test_size=0.25, 
                                                    random_state=1234) # more reproducibility

With our data ready, let's build a quick logistic regression model on the training data. We're also going to generate predictions for our test data (as positive probabilities, or the likelihood of the class label being True).

In [2]:
from sklearn.linear_model import LogisticRegression
import numpy as np

# define our model
model = LogisticRegression(random_state=1234)

# fit our model
model.fit(X_train, y_train)

# generate some predictions
y_hat = model.predict_proba(X_test)[:,1]

At the start of this post we discussed model error, so let's now calculate this for our model to see how much room for improvement there is.

In [3]:
# calculate error on each observation in the test set
y_error = np.absolute(y_test - y_hat)

# is there much room for improvement?
print('model error:', y_error.mean())
model error: 0.0705335674785

It looks like there is almost 7% mean absolute error. Maybe we can find some good leads for improving on this?

To do so, we're going to create a new model using the RuleFit class, but instead of targeting the original class label y, we're going to target the absolute error of each observation.

The absolute error is the difference between the discrete actual value of y (0 or 1) and the continuous positive probability we predicted (0.0 to 1.0).

In [4]:
from rulefit import RuleFit
from sklearn.ensemble import GradientBoostingRegressor

# define and fit our shiny new RuleFit model
generator_params = {
    'max_depth': 5,       # control the complexity of our rules
    'n_estimators': 1000,  
    'learning_rate': 0.003,
    'random_state': 1234
}
generator = GradientBoostingRegressor(**generator_params)

rf = RuleFit(generator)
rf.fit(X_test, y_error, feature_names=data.feature_names)
Out[4]:
RuleFit(tree_generator=GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
             learning_rate=0.003, loss='ls', max_depth=5,
             max_features=None, max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=1000,
             presort='auto', random_state=1234, subsample=1.0, verbose=0,
             warm_start=False))

With the RuleFit model fitted to our errors, we can generate a set of rules that might help us to isolate areas of our data that need enriching with new features.

RuleFit also generates rules for areas where the data has a positive impact on the model (those with negative coefficients), but we can ignore these for error analysis by filtering to coef > 0.

If we multiply the coefficient and support values calculated by RuleFit, we can use that as a rough estimate for how much error is due to that subset of the data.

By summing these estimates, we get an approximate amount of error explained by these rules. This will differ from the figure above simply because our rules may not perfectly fit our errors (that is, our error model has its own error).

In [5]:
import pandas as pd

# get the outputs
rules = rf.get_rules()

# remove the rules we're not interested in. if the coefficient isn't above 0,
# the rule is not a good indicator of an area for improvement
rules = rules[(rules.coef > 0.) & (rules.type != 'linear')]

# we can estimate an effect for each rule on the error score from above by 
# multiplying the coefficient and support values
rules['effect'] = rules['coef'] * rules['support']

print('modelled error:', rules['effect'].sum())
print('unexplained error:', np.maximum(y_error.mean() - rules['effect'].sum(), 0))
modelled error: 0.046811390404179753
unexplained error: 0.0237221770743

Let's take a look at the top 10 rules:

In [6]:
# display the top 10 rules by effect
pd.set_option('display.max_colwidth', -1)
rules.nlargest(10,'effect')
Out[6]:
rule type coef support effect
883 worst_area <= 976.25 & worst_area > 553.299987793 & radius_error > 0.275350004435 rule 0.056166 0.216783 0.012176
1176 compactness_error > 0.0203649997711 & worst_fractal_dimension <= 0.113150000572 & area_error > 23.2399997711 & radius_error <= 0.339100003242 rule 0.265549 0.041958 0.011142
769 worst_area > 548.650024414 & area_error > 23.2350006104 & worst_area <= 976.25 rule 0.028937 0.223776 0.006475
458 worst_radius > 15.345000267 & worst_concavity > 0.207249999046 & worst_symmetry > 0.203749999404 & worst_symmetry <= 0.560700058937 & worst_radius <= 16.8100013733 rule 0.101535 0.041958 0.004260
1548 worst_symmetry > 0.203749999404 & area_error <= 33.375 & worst_symmetry <= 0.560700058937 & worst_perimeter > 100.555000305 rule 0.024403 0.153846 0.003754
5361 radius_error > 0.344399988651 & symmetry_error <= 0.017725000158 & area_error > 23.2399997711 & worst_perimeter <= 105.199996948 rule 0.133092 0.027972 0.003723
357 worst_radius > 15.5699996948 & worst_symmetry > 0.203749999404 & mean_texture > 19.1049995422 & worst_symmetry <= 0.560700058937 & radius_error <= 0.418200016022 rule 0.032694 0.069930 0.002286
57 worst_area > 548.650024414 & worst_area <= 1086.0 rule 0.002326 0.405594 0.000943
1433 worst_area > 548.650024414 & mean_perimeter > 79.2050018311 & mean_texture > 21.1399993896 & area_error <= 36.4049987793 rule 0.012087 0.069930 0.000845
5547 worst_area > 744.0 & worst_symmetry > 0.203749999404 & mean_texture > 19.1049995422 & worst_symmetry <= 0.560700058937 & radius_error <= 0.418200016022 rule 0.004637 0.069930 0.000324

To wrap up, let's produce a report of the top 3 rules, including up to 10 examples from the data to which the rules apply.

This report can be used in conjunction with subject-matter expertise on the data to isolate areas for feature enrichment, to improve your model!

In [7]:
from IPython.core.display import display
import pandas as pd

# prepare a dataframe for use below (we really care about the `query` function)
df = pd.DataFrame(X_test, columns=data.feature_names)
df['y_error'] = y_error
df['y_sq_error'] = y_error**2

for index, rule in rules.nlargest(3, 'effect').iterrows():
    print('rule:', rule['rule'])
    print('support:', rule['support'])
    print('coef:', rule['coef'])
    print('estimated error effect (support x coef):', rule['effect'])
    
    # it might be useful to compare the local error to the estimated model effect
    print('rule MAE:', df.query(rule['rule'])['y_error'].mean())
    print('rule RMSE:', df.query(rule['rule'])['y_sq_error'].mean()**(1/2))
    
    # we can use the rule to filter the data
    display(df.query(rule['rule']).nlargest(10, 'y_error'))
rule: worst_area <= 976.25 & worst_area > 553.299987793 & radius_error > 0.275350004435
support: 0.21678321678321677
coef: 0.05616597023836435
estimated error effect (support x coef): 0.012175839702023041
rule MAE: 0.2658110370900938
rule RMSE: 0.4155759177730745
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension y_error y_sq_error
91 11.76 18.14 75.00 431.1 0.09968 0.05914 0.02685 0.03515 0.1619 0.06287 ... 85.10 553.6 0.1137 0.07974 0.0612 0.07160 0.1978 0.06915 0.974325 0.949308
83 15.37 22.76 100.20 728.2 0.09200 0.10360 0.11220 0.07483 0.1717 0.06097 ... 107.50 830.9 0.1257 0.19970 0.2846 0.14760 0.2556 0.06828 0.929329 0.863652
89 14.60 23.29 93.97 664.7 0.08682 0.06636 0.08390 0.05271 0.1627 0.05416 ... 102.20 758.2 0.1312 0.15810 0.2675 0.13590 0.2477 0.06836 0.856296 0.733242
47 13.80 15.79 90.43 584.1 0.10070 0.12800 0.07789 0.05069 0.1662 0.06566 ... 110.30 812.4 0.1411 0.35420 0.2779 0.13830 0.2589 0.10300 0.832785 0.693531
137 14.22 27.85 92.55 623.9 0.08223 0.10390 0.11030 0.04408 0.1342 0.06129 ... 102.50 764.0 0.1081 0.24260 0.3064 0.08219 0.1890 0.07796 0.707592 0.500686
73 11.80 16.58 78.99 432.0 0.10910 0.17000 0.16590 0.07415 0.2678 0.07371 ... 91.93 591.7 0.1385 0.40920 0.4504 0.18650 0.5774 0.10300 0.703641 0.495111
130 14.99 22.11 97.53 693.7 0.08515 0.10250 0.06859 0.03876 0.1944 0.05913 ... 110.20 867.1 0.1077 0.33450 0.3114 0.13080 0.3163 0.09251 0.646361 0.417782
111 13.27 14.76 84.74 551.7 0.07355 0.05055 0.03261 0.02648 0.1386 0.05318 ... 104.50 830.6 0.1006 0.12380 0.1350 0.10010 0.2027 0.06206 0.471311 0.222134
119 16.25 19.51 109.80 815.8 0.10260 0.18930 0.22360 0.09194 0.2151 0.06578 ... 122.10 939.7 0.1377 0.44620 0.5897 0.17750 0.3318 0.09136 0.467160 0.218238
25 13.90 19.24 88.73 602.9 0.07991 0.05326 0.02995 0.02070 0.1579 0.05594 ... 104.40 830.5 0.1064 0.14150 0.1673 0.08150 0.2356 0.07603 0.344350 0.118577

10 rows × 32 columns

rule: compactness_error > 0.0203649997711 & worst_fractal_dimension <= 0.113150000572 & area_error > 23.2399997711 & radius_error <= 0.339100003242
support: 0.04195804195804196
coef: 0.26554881554597404
estimated error effect (support x coef): 0.011141908344586324
rule MAE: 0.7144778689742627
rule RMSE: 0.7290404836131931
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension y_error y_sq_error
83 15.37 22.76 100.20 728.2 0.09200 0.1036 0.11220 0.07483 0.1717 0.06097 ... 107.50 830.9 0.1257 0.1997 0.2846 0.14760 0.2556 0.06828 0.929329 0.863652
47 13.80 15.79 90.43 584.1 0.10070 0.1280 0.07789 0.05069 0.1662 0.06566 ... 110.30 812.4 0.1411 0.3542 0.2779 0.13830 0.2589 0.10300 0.832785 0.693531
137 14.22 27.85 92.55 623.9 0.08223 0.1039 0.11030 0.04408 0.1342 0.06129 ... 102.50 764.0 0.1081 0.2426 0.3064 0.08219 0.1890 0.07796 0.707592 0.500686
73 11.80 16.58 78.99 432.0 0.10910 0.1700 0.16590 0.07415 0.2678 0.07371 ... 91.93 591.7 0.1385 0.4092 0.4504 0.18650 0.5774 0.10300 0.703641 0.495111
130 14.99 22.11 97.53 693.7 0.08515 0.1025 0.06859 0.03876 0.1944 0.05913 ... 110.20 867.1 0.1077 0.3345 0.3114 0.13080 0.3163 0.09251 0.646361 0.417782
119 16.25 19.51 109.80 815.8 0.10260 0.1893 0.22360 0.09194 0.2151 0.06578 ... 122.10 939.7 0.1377 0.4462 0.5897 0.17750 0.3318 0.09136 0.467160 0.218238

6 rows × 32 columns

rule: worst_area > 548.650024414 & area_error > 23.2350006104 & worst_area <= 976.25
support: 0.22377622377622378
coef: 0.02893728555291459
estimated error effect (support x coef): 0.006475476487365502
rule MAE: 0.2578116453233828
rule RMSE: 0.40903437023906347
mean_radius mean_texture mean_perimeter mean_area mean_smoothness mean_compactness mean_concavity mean_concave_points mean_symmetry mean_fractal_dimension ... worst_perimeter worst_area worst_smoothness worst_compactness worst_concavity worst_concave_points worst_symmetry worst_fractal_dimension y_error y_sq_error
91 11.76 18.14 75.00 431.1 0.09968 0.05914 0.02685 0.03515 0.1619 0.06287 ... 85.10 553.6 0.1137 0.07974 0.0612 0.07160 0.1978 0.06915 0.974325 0.949308
83 15.37 22.76 100.20 728.2 0.09200 0.10360 0.11220 0.07483 0.1717 0.06097 ... 107.50 830.9 0.1257 0.19970 0.2846 0.14760 0.2556 0.06828 0.929329 0.863652
89 14.60 23.29 93.97 664.7 0.08682 0.06636 0.08390 0.05271 0.1627 0.05416 ... 102.20 758.2 0.1312 0.15810 0.2675 0.13590 0.2477 0.06836 0.856296 0.733242
47 13.80 15.79 90.43 584.1 0.10070 0.12800 0.07789 0.05069 0.1662 0.06566 ... 110.30 812.4 0.1411 0.35420 0.2779 0.13830 0.2589 0.10300 0.832785 0.693531
137 14.22 27.85 92.55 623.9 0.08223 0.10390 0.11030 0.04408 0.1342 0.06129 ... 102.50 764.0 0.1081 0.24260 0.3064 0.08219 0.1890 0.07796 0.707592 0.500686
73 11.80 16.58 78.99 432.0 0.10910 0.17000 0.16590 0.07415 0.2678 0.07371 ... 91.93 591.7 0.1385 0.40920 0.4504 0.18650 0.5774 0.10300 0.703641 0.495111
130 14.99 22.11 97.53 693.7 0.08515 0.10250 0.06859 0.03876 0.1944 0.05913 ... 110.20 867.1 0.1077 0.33450 0.3114 0.13080 0.3163 0.09251 0.646361 0.417782
111 13.27 14.76 84.74 551.7 0.07355 0.05055 0.03261 0.02648 0.1386 0.05318 ... 104.50 830.6 0.1006 0.12380 0.1350 0.10010 0.2027 0.06206 0.471311 0.222134
119 16.25 19.51 109.80 815.8 0.10260 0.18930 0.22360 0.09194 0.2151 0.06578 ... 122.10 939.7 0.1377 0.44620 0.5897 0.17750 0.3318 0.09136 0.467160 0.218238
25 13.90 19.24 88.73 602.9 0.07991 0.05326 0.02995 0.02070 0.1579 0.05594 ... 104.40 830.5 0.1064 0.14150 0.1673 0.08150 0.2356 0.07603 0.344350 0.118577

10 rows × 32 columns

Introduction to Classification using Logistic Regression with Scikit-Learn

This post, including the source .ipynb notebook file, will be used as a basis for other topics. You can obtain a copy of the source by clicking the Source link at the top of this post.

To keep things simple, we're going to utilise one of the many toy datasets built into Scikit-Learn! (And yes, it is a real dataset.)

We're also not going to explain how Scikit-Learn's LogisticRegression is implemented in this post.

To structure our code, we will define our model in two parts:

  • The code we need to fit our model
  • The code we need to use our fitted model to generate predictions

When it comes to model building, these are the two main functional components - so, for reasons which will be explained in other posts, we're going to build a Python class called CustomModel, with a function for each of these components:

In [1]:
from sklearn.linear_model import LogisticRegression

class CustomModel(object):
    
    def fit(self, X, y):

        # LogisticRegression implements a number of parameters, you can read about them here:
        # http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
        #
        # With the exception of `random_state`, each of these are the defaults.
        
        model_params = {
            'penalty': 'l2',
            'dual': False,
            'tol': 0.0001,
            'C': 1.0,
            'fit_intercept': True,
            'intercept_scaling': 1,
            'class_weight': None,
            'random_state': 1234,    # Fixed to 1234 for reproducibility
            'solver': 'liblinear',
            'max_iter': 100,
            'multi_class': 'ovr',
            'verbose': 0,
            'warm_start': False,
            'n_jobs': 1
        }
    
        self.clf = LogisticRegression(**model_params)
        self.clf.fit(X, y)
        
        return self # fun fact: returning self enables method chaining i.e. .fit().predict()

    def predict(self, X):
        
        # We only want to output the positive case (the second column returned by `predict_proba`):
    
        return self.clf.predict_proba(X)[:,1]

Now we're ready to use our model!

In the next section we're going to load the sample data discussed above, and divide it into two portions:

  • 75% for model fitting
  • 25% for predictions
In [2]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.75, 
                                                    test_size=0.25, 
                                                    random_state=1234) # more reproducibility

Now that we have everything we need, let's load up our model, fit it with the training data, and generate some predictions:

In [4]:
# load our model
model = CustomModel()

# fit our model
model.fit(X_train, y_train)

# generate some predictions
model.predict(X_test)
Out[4]:
array([  9.25168417e-01,   9.99922130e-01,   9.53635418e-01,
         9.88416588e-01,   9.97542577e-01,   9.95232506e-01,
         4.60659258e-02,   9.98390194e-01,   6.59002902e-10,
         2.76899836e-06,   8.30718694e-10,   9.63993586e-01,
         9.94157890e-01,   9.50980576e-01,   9.96974859e-01,
         6.97038792e-10,   9.99809391e-01,   9.96431765e-01,
         9.99363563e-01,   8.43800531e-06,   9.95502414e-01,
         7.77576547e-03,   1.12727716e-09,   3.40904102e-17,
         3.68627970e-09,   6.55649762e-01,   3.51723839e-03,
         9.97326888e-01,   9.98785233e-01,   9.97552026e-01,
         9.86350517e-01,   9.98844211e-01,   5.70842717e-04,
         9.87742427e-01,   9.19814189e-01,   9.78443649e-01,
         9.92882821e-01,   1.14676290e-02,   1.48817234e-01,
         9.98733024e-01,   4.13813658e-05,   9.93177003e-01,
         1.72319657e-10,   8.54534408e-01,   8.81187668e-01,
         9.97568264e-01,   9.98086681e-01,   8.32784885e-01,
         4.49929586e-11,   8.89087737e-01,   9.28259947e-01,
         9.91244116e-01,   9.94876558e-01,   1.51106510e-08,
         2.60668778e-01,   9.99597520e-01,   9.98940073e-01,
         9.99968817e-01,   9.91318570e-01,   8.29369844e-03,
         9.93238377e-01,   9.92431535e-01,   9.29775117e-01,
         9.99271713e-01,   9.96474598e-01,   2.41572863e-04,
         1.51376226e-11,   9.97330558e-01,   9.98831771e-01,
         4.79400697e-01,   9.99798779e-01,   3.57307727e-07,
         9.99656809e-01,   7.03641088e-01,   9.98247027e-01,
         9.96093354e-01,   9.99588791e-01,   2.58369708e-08,
         9.98136922e-01,   7.97865310e-03,   9.99065333e-01,
         9.98470351e-01,   9.94581260e-01,   9.29328694e-01,
         1.41996390e-02,   1.43214384e-04,   3.71155631e-05,
         4.45838811e-06,   9.13207438e-01,   8.56295696e-01,
         9.99467328e-01,   9.74324559e-01,   9.99328632e-01,
         2.91312374e-12,   1.00998256e-01,   9.86992421e-01,
         9.97149193e-01,   9.13815924e-01,   9.98807818e-01,
         9.84005486e-01,   3.17865443e-08,   2.30937811e-11,
         9.98036358e-01,   9.99532884e-01,   1.24075526e-03,
         9.98819765e-01,   9.99752279e-01,   8.53677349e-04,
         1.53192255e-01,   9.30832406e-01,   1.49723823e-05,
         5.28688983e-01,   1.48786146e-03,   9.92804571e-51,
         8.86447353e-01,   9.95516043e-01,   9.98554149e-01,
         1.75078944e-03,   9.99922978e-01,   4.67159833e-01,
         9.99825913e-01,   9.57716419e-01,   9.95069689e-01,
         9.98728887e-01,   7.49375338e-14,   9.92513330e-01,
         1.49918676e-02,   1.63977226e-02,   9.95785292e-01,
         9.56124754e-01,   3.53639065e-01,   9.96011137e-01,
         7.27728677e-33,   9.97779030e-01,   7.77872222e-02,
         9.90058068e-01,   9.80367925e-01,   2.92408222e-01,
         9.98164180e-01,   1.67926421e-01,   9.99996297e-01,
         6.35631576e-10,   1.06440027e-01])
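
As an aside, because fit returns self (see the comment in the class above), the two steps can also be chained into a single expression - a minimal sketch using the same training and test data:

# fit and predict in one chained call
y_hat = CustomModel().fit(X_train, y_train).predict(X_test)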

Building a utility function wrapper for Scikit-Learn models

In a previous post we learned how to access a notebook programmatically using the ipynb package.

This is very powerful as it allows a data scientist to focus on implementing a model which is re-usable, specifying a fit and predict method to provide some structure to their code.

In this post, we're going to build a utility wrapper which takes the previous code and adds the following functionality:

  • Serialization, so we don't have to re-fit models if we don't need to
  • Scoring, so we can determine how well our model is performing
  • Feature importance, so we can determine the predictive power of individual features - and provide insight into feature selection

Building the wrapper

Here is the code:

In [1]:
from sklearn.externals import joblib
from sklearn.metrics import roc_auc_score
from sklearn.exceptions import NotFittedError
import os.path

class ModelUtils(object):
    
    def __init__(self, model, serialize_path=None):
        """
            If serialize_path is specified and valid, load the model from disk.
        """
        self.serialize_path = serialize_path
        
        if self.serialize_path is not None and os.path.exists(self.serialize_path):
            print('Loaded from', self.serialize_path)
            self.clf = joblib.load(self.serialize_path)
            self.is_fitted = True
            return
        
        self.clf = model
        self.is_fitted = False

    def fit(self, X, y):
        """
            Fit our model, saving the model to disk if serialize_path is specified.
        """
        # fit our model
        self.clf.fit(X, y)
        self.is_fitted = True
        
        # serialise to path
        if self.serialize_path is not None:
            joblib.dump(self.clf, self.serialize_path)
            print('Saved to', self.serialize_path)
            
        return self
    
    def predict(self, X):
        if not self.is_fitted:
            raise NotFittedError
        return self.clf.predict(X)
    
    def score(self, X, y_true):
        """
            Generates a score for the model based on predicting on X and comparing 
            to y_true.
        """
        y_pred = self.predict(X)
        return roc_auc_score(y_true, y_pred)
    
    def feature_importance(self, X, y, normalize=True):
        """
            To calculate feature importance, we iterate through each feature i, 
            generating a model score with all other features zeroed.
            
            If normalize is True, divide the results by the minimum score, such that
            each score represents "N times better than the worst feature".
        """
        scores = [self.score(self.__zero_except(X, i), y) for i in range(X.shape[1])]
        
        if normalize:
            return scores / min(scores)
        return scores
    
    def __zero_except(self, X, i):
        """
            A helper function to replace all but the ith column with zeroes, and 
            return the result. (There is probably a cleaner way to do this.)
        """
        X_copy = X.copy()
        X_i = X[:,i]
        X_copy.fill(0)
        X_copy[:,i] = X_i
        return X_copy

Using the wrapper

Now that we have our ModelUtils wrapper class, let's import CustomModel as before and put it to work.

As we instantiate the wrapper, we're specifying test.pkl in the current directory as the location to serialize the model.

If serialize_path is configured and valid, the pre-fitted model will be loaded from there, and the predict function will be immediately available. If configured but the file does not exist, ModelUtils will serialize to this location after fitting the model.

In [2]:
from ipynb.fs.defs.model import CustomModel

model = ModelUtils(CustomModel(), serialize_path='test.pkl')

Let's load up the sample data again, and fit our model and then use it to create some predictions. Note the Saved to test.pkl output.

In [3]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.75, 
                                                    test_size=0.25, 
                                                    random_state=1234) # more reproducibility

# fit our model (as before)
model.fit(X_train, y_train)

# generate some predictions (as before)
model.predict(X_test);  # fun fact: the ; character suppresses notebook output
Saved to test.pkl

Model and feature performance

Finally, here are our two new functions.

First, let's score the performance of our model. This is using a metric called ROC AUC - we won't explain what that is in this post in any detail, but essentially it is a measure of how well the model can separate each class in y.

Then we will calculate relative feature importance for each of the 30 features in the sample dataset. Based on individual scoring performance, what this means is that the first feature is ~2.12x more powerful than the lowest performing feature, and the best feature is ~35.86x more powerful.

In [4]:
# score our model
auc = model.score(X_test, y_test)
print('AUC:', auc)

# calculate relative feature importance
importance = model.feature_importance(X_test, y_test)

print('Top feature relative performance:', max(importance))
print(model.feature_importance(X_test, y_test))
AUC: 0.989669421488
Top feature relative performance: 35.855513308
[  2.121673     8.6730038   34.85551331   2.06844106  25.14068441
  29.74904943  33.68060837  34.68441065  24.88973384  15.51711027
   4.66920152  18.64638783   4.85931559  34.7148289   16.36121673
  24.82129278  27.2851711   27.96958175  15.74904943  16.24334601   1.
  29.3269962   35.85551331  35.83269962  26.1634981   30.85931559
  33.77186312  35.12927757  27.24714829  23.27376426]

👏

Accessing Jupyter notebooks programmatically

In a previous post we created a simple classifier using Scikit-Learn's LogisticRegression.

As we pieced together our model, we structured the code into a class called CustomModel, with two functions: fit and predict.

To start working programmatically with the notebook created in that post, you will first need to install the ipynb package:

pip install git+https://github.com/blairhudson/ipynb

(Note: This is actually a fork of an IPython repo. Unfortunately the master has a bug with parsing tuple-based assignments (e.g. X, y = ...). A pull request has been submitted.)

Now you're ready to go!

Using the ipynb package

To simplify things considerably, make sure that you have a copy of the source in the current working directory, and rename it to model.ipynb.

Now, thanks to the ipynb package you can access the CustomModel class like this:

In [5]:
from ipynb.fs.defs.model import CustomModel

To prove it, let's generate predictions on the same sample data:

In [6]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    train_size=0.75, 
                                                    test_size=0.25, 
                                                    random_state=1234) # more reproducibility

# load our model
model = CustomModel()

# fit our model
model.fit(X_train, y_train)

# generate some predictions
model.predict(X_test)
Out[6]:
array([  9.25168417e-01,   9.99922130e-01,   9.53635418e-01,
         9.88416588e-01,   9.97542577e-01,   9.95232506e-01,
         4.60659258e-02,   9.98390194e-01,   6.59002902e-10,
         2.76899836e-06,   8.30718694e-10,   9.63993586e-01,
         9.94157890e-01,   9.50980576e-01,   9.96974859e-01,
         6.97038792e-10,   9.99809391e-01,   9.96431765e-01,
         9.99363563e-01,   8.43800531e-06,   9.95502414e-01,
         7.77576547e-03,   1.12727716e-09,   3.40904102e-17,
         3.68627970e-09,   6.55649762e-01,   3.51723839e-03,
         9.97326888e-01,   9.98785233e-01,   9.97552026e-01,
         9.86350517e-01,   9.98844211e-01,   5.70842717e-04,
         9.87742427e-01,   9.19814189e-01,   9.78443649e-01,
         9.92882821e-01,   1.14676290e-02,   1.48817234e-01,
         9.98733024e-01,   4.13813658e-05,   9.93177003e-01,
         1.72319657e-10,   8.54534408e-01,   8.81187668e-01,
         9.97568264e-01,   9.98086681e-01,   8.32784885e-01,
         4.49929586e-11,   8.89087737e-01,   9.28259947e-01,
         9.91244116e-01,   9.94876558e-01,   1.51106510e-08,
         2.60668778e-01,   9.99597520e-01,   9.98940073e-01,
         9.99968817e-01,   9.91318570e-01,   8.29369844e-03,
         9.93238377e-01,   9.92431535e-01,   9.29775117e-01,
         9.99271713e-01,   9.96474598e-01,   2.41572863e-04,
         1.51376226e-11,   9.97330558e-01,   9.98831771e-01,
         4.79400697e-01,   9.99798779e-01,   3.57307727e-07,
         9.99656809e-01,   7.03641088e-01,   9.98247027e-01,
         9.96093354e-01,   9.99588791e-01,   2.58369708e-08,
         9.98136922e-01,   7.97865310e-03,   9.99065333e-01,
         9.98470351e-01,   9.94581260e-01,   9.29328694e-01,
         1.41996390e-02,   1.43214384e-04,   3.71155631e-05,
         4.45838811e-06,   9.13207438e-01,   8.56295696e-01,
         9.99467328e-01,   9.74324559e-01,   9.99328632e-01,
         2.91312374e-12,   1.00998256e-01,   9.86992421e-01,
         9.97149193e-01,   9.13815924e-01,   9.98807818e-01,
         9.84005486e-01,   3.17865443e-08,   2.30937811e-11,
         9.98036358e-01,   9.99532884e-01,   1.24075526e-03,
         9.98819765e-01,   9.99752279e-01,   8.53677349e-04,
         1.53192255e-01,   9.30832406e-01,   1.49723823e-05,
         5.28688983e-01,   1.48786146e-03,   9.92804571e-51,
         8.86447353e-01,   9.95516043e-01,   9.98554149e-01,
         1.75078944e-03,   9.99922978e-01,   4.67159833e-01,
         9.99825913e-01,   9.57716419e-01,   9.95069689e-01,
         9.98728887e-01,   7.49375338e-14,   9.92513330e-01,
         1.49918676e-02,   1.63977226e-02,   9.95785292e-01,
         9.56124754e-01,   3.53639065e-01,   9.96011137e-01,
         7.27728677e-33,   9.97779030e-01,   7.77872222e-02,
         9.90058068e-01,   9.80367925e-01,   2.92408222e-01,
         9.98164180e-01,   1.67926421e-01,   9.99996297e-01,
         6.35631576e-10,   1.06440027e-01])

Magic ✨

Using Jupyter notebooks with Anaconda

Jupyter is a popular data science environment, and Jupyter notebooks (such as the notebook this post was written with) are a great way to create and share data science work with inline documentation (using Markdown syntax).

Jupyter is capable of running kernels in many different programming languages, but in this post we're focussed just on Python.

Installing Anaconda

  1. Download Anaconda for your operating system from the Anaconda website. For best compatibility with modern data science packages, I suggest Python 3.6 version or newer.
  2. Run the downloaded installer and follow the prompts.

Launching Jupyter

  1. Run the following command to launch the Jupyter environment in your current directory: jupyter notebook
  2. By default, this will open the web interface in your default web browser at http://localhost:8888/.
  3. Now you can select an existing .ipynb file from the file navigator to open it, or create a new notebook.
In [1]:
print("🚀")
🚀

Optimising hyper-parameters efficiently with Scikit-Optimize

One of the most well-known techniques for experimenting with various model configurations is Grid Search.

With grid search, you specify a discrete search space (a parameter grid) of all of the parameter values you would like to test. The search works through the grid, testing combinations until all are exhausted. Based on a specified performance metric (e.g. error), you can select the best parameter combination for your model.

What's wrong with this?

If you have a large parameter grid, this doesn't work too well:

In [1]:
import numpy as np

param_grid = {
    'param_a': [0.01, 0.03, 0.1],
    'param_b': [0, 1, 2],
    'param_c': [1, 50, 100]
}

def num_searches(param_grid):
    return np.prod([len(p) for p in param_grid.values()])
    
num_searches(param_grid)
Out[1]:
27

And maybe we want to search over four possible values for param_a instead, and add two new parameters:

In [2]:
param_grid = {
    'param_a': [0.01, 0.03, 0.1, 0.3],
    'param_b': [0, 1, 2],
    'param_c': [1, 50, 100],
    'param_d': ["a", "b"],
    'param_e': [0, 1, 2]
}

num_searches(param_grid)
Out[2]:
216

As you can see from the first grid, there are already 27 combinations to try. This jumps to 216 for our larger grid. Depending on the complexity of the model and the amount of data to process, this can very easily become infeasible.

There are a few approaches to solving this, including:

  • breaking down the search into multiple smaller steps (such as searching param_a and param_b first, with defaults for the others, then using the best values to search the remaining parameters - this can be tricky in practice)
  • searching the parameter space at random (which has an additional benefit of discovering better parameter values when random samples are drawn from a continuous range - see the sketch below)
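
For reference, the random-search option is already available in Scikit-Learn as RandomizedSearchCV. Here's a minimal sketch - the parameter names and ranges are illustrative only:

from scipy.stats import uniform
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

# distributions (or lists) to sample from - these ranges are illustrative only
param_distributions = {
    'C': uniform(loc=0.01, scale=10),  # sampled from a continuous range
    'fit_intercept': [True, False]     # sampled uniformly from a list
}

# n_iter fixes the search budget, regardless of how large the space is
search = RandomizedSearchCV(LogisticRegression(random_state=1234),
                            param_distributions,
                            n_iter=20,
                            random_state=1234)
search.fit(X, y)
print(search.best_params_)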

Beyond these, Scikit-Learn doesn't provide many more options - but some clever people have developed a drop-in replacement for Scikit-Learn's GridSearchCV and RandomizedSearchCV called BayesSearchCV, in a package called Scikit-Optimize.

Let's install Scikit-Optimize and implement BayesSearchCV with a simple example!

Installing Scikit-Optimize

Assuming you have already installed Anaconda and Jupyter, you will need to do the following:

  • pip install scikit-optimize

If you have trouble installing, you may first need to run the following to install one of Scikit-Optimize's dependencies:

  • pip install scikit-garden

Implementing BayesSearchCV

Here's an example implementation using a sample dataset and Logistic Regression.

In [3]:
import warnings
warnings.filterwarnings('ignore')

from skopt import BayesSearchCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# prep some sample data
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=1234)

# we're using a logistic regression model
clf = LogisticRegression(random_state=1234, verbose=0)

# this is our parameter grid
param_grid = {
    'solver': ['liblinear', 'saga'],  
    'penalty': ['l1','l2'],
    'tol': (1e-5, 1e-3, 'log-uniform'),
    'C': (1e-5, 100, 'log-uniform'),
    'fit_intercept': [True, False]
}

# set up our optimiser to find the best params in 30 searches
opt = BayesSearchCV(
    clf,
    param_grid,
    n_iter=30,
    random_state=1234,
    verbose=0
)

opt.fit(X_train, y_train)
In [4]:
print('Best params achieve a test score of', opt.score(X_test, y_test), ':')

opt.best_params_
Best params achieve a test score of 0.958041958042 :
Out[4]:
{'C': 100.0,
 'fit_intercept': True,
 'penalty': 'l1',
 'solver': 'liblinear',
 'tol': 0.00094035472283658726}

By increasing the value of n_iter, you can continue the search to find better parameter combinations. You can also use the optimiser for prediction, by calling .predict() or .predict_proba() for probabilities, or extract and use the best one standalone:

In [5]:
opt.best_estimator_
Out[5]:
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=1234, solver='liblinear',
          tol=0.00094035472283658726, verbose=0, warm_start=False)
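
For completeness, here is a minimal sketch of using the fitted optimiser directly for prediction, as mentioned above (it delegates to the best estimator found during the search):

# class predictions from the fitted optimiser
y_pred = opt.predict(X_test)

# positive-class probabilities
y_prob = opt.predict_proba(X_test)[:, 1]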

You may also find it useful to re-use the best parameters programmatically to define an equivalent model:

In [6]:
LogisticRegression(**opt.best_params_)
Out[6]:
LogisticRegression(C=100.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear',
          tol=0.00094035472283658726, verbose=0, warm_start=False)

Creating a blog with Jupyter notebooks

Assuming you have already installed Jupyter notebook, you will need to do the following:

Installing and configuring a Nikola blog

  1. First you'll need to create a directory structure as follows:

     - /blog
     -- /posts
     -- /output
    • /blog is the root directory for everything you'll be doing with your blog
    • /posts is where you'll store your Jupyter notebooks
    • /output will contain the code generated for your blog
  2. Run the following command to install Nikola (the static website generator which will do most of the heavy lifting)[1]:

    pip install --upgrade "Nikola[extras]"

  3. Change directory to your blog root:

    cd blog

  4. Start up Nikola, following the prompts to configure your new blog:

    nikola init .

  5. Open /blog/conf.py and change the POSTS and PAGES sections to include the lines as follows. This will allow Nikola to treat .ipynb files as blog posts.

     POSTS = (
         ("posts/*.rst", "posts", "post.tmpl"),
         ("posts/*.md", "posts", "post.tmpl"),
         ("posts/*.txt", "posts", "post.tmpl"),
         ("posts/*.html", "posts", "post.tmpl"),
         ("posts/*.ipynb", "posts", "post.tmpl"),
     )
     PAGES = (
         ("pages/*.rst", "pages", "page.tmpl"),
         ("pages/*.md", "pages", "page.tmpl"),
         ("pages/*.txt", "pages", "page.tmpl"),
         ("pages/*.html", "pages", "page.tmpl"),
         ("pages/*.ipynb", "pages", "page.tmpl"),
     )
  6. Write your blog post in Jupyter, saving the .ipynb file to /posts.

  7. You will need to explicitly add the following metadata to your notebook (in the Jupyter menu, select Edit > Edit Notebook Metadata). Change the metadata to match your post.[2]

     "nikola": {
         "title": "Creating a blog with Jupyter notebooks",
         "slug": "creating-a-blog-with-jupyter-notebooks",
         "date": "2017-09-09 21:09:01 UTC+10:00"
     }
  8. Run nikola build each time you update your /posts, which will generate your site and store it in /output!

  9. If you're going to be publishing your blog on Github (like me), you can push the content of /output to your website repo (example).

[1]Problems installing Nikola?

I ran into some issues installing Nikola on OS X with Anaconda. Specifically, gcc in Anaconda was the culprit. Resolution:

  • conda remove gcc to uninstall gcc provided by Anaconda

This will default to the system gcc, which you can check by running which gcc (which should output /usr/bin/gcc).

If this still doesn't resolve the issue, you may need to install a more up-to-date gcc:

  1. Install Homebrew
  2. brew install gcc (you may be prompted to install Developer Tools)
  3. brew unlink gcc
  4. brew link --overwrite gcc

which gcc should now show /usr/local/Cellar/gcc/7.2.0. 👍

[2]Inferring Nikola post metadata

Like me, you probably want as little as possible to come between your latest notebook hack and your awesome new blog.

Nikola parses Jupyter notebooks with a plugin, which with some modification we can have infer all of the Nikola post metadata automatically. For me, the plugin file was here (though it may differ for you):

~/anaconda/lib/python3.5/site-packages/nikola/plugins/compile/ipynb.py

To automagically infer the required metadata, you can replace the read_metadata() function in the file above with the following code:

In [5]:
def read_metadata(self, post, file_metadata_regexp=None, unslugify_titles=False, lang=None):
    """Read metadata directly from ipynb file.

    As ipynb file support arbitrary metadata as json, the metadata used by Nikola
    will be assume to be in the 'nikola' subfield.
    """
    self._req_missing_ipynb()
    if lang is None:
        lang = LocaleBorg().current_lang
    source = post.translated_source_path(lang)
    with io.open(source, "r", encoding="utf8") as in_file:
        nb_json = nbformat.read(in_file, current_nbformat)
    # Metadata might not exist in two-file posts or in hand-crafted
    # .ipynb files.

    # infer metadata
    title = os.path.splitext(os.path.basename(source))[0]
    slug = title.lower().replace(' ', '-')
    from datetime import datetime
    date = datetime.fromtimestamp(os.path.getctime(source)).strftime('%Y-%m-%d %k:%M:%S')

    implicit = {'title':title, 'slug': slug, 'date':date}
    explicit = nb_json.get('metadata', {}).get('nikola', {})
    
    # replace inference with explicit if available
    metadata = {**implicit, **explicit}

    return metadata

With this small modification, we instruct Nikola to infer the title and slug values based on the filename, and the date value based on the filesystem. Magical! ✨

Update: The makers of Nikola have suggested some official methods for achieving this that are built right into the existing workflow:

In [9]:
%%html
<blockquote class="twitter-tweet" data-conversation="none" data-lang="en"><p lang="en" dir="ltr">Titles and slugs can be done via FILE_METADATA_REGEXP, and auto dates are prone to issues.<br>Better: import files with `nikola new_post -i`</p>&mdash; Nikola Generator (@GetNikola) <a href="https://twitter.com/GetNikola/status/907570254611484672">September 12, 2017</a></blockquote> <script async src="//platform.twitter.com/widgets.js" charset="utf-8"></script>