Module contents

cobra.preprocessing module

class cobra.preprocessing.KBinsDiscretizer(n_bins: int = 10, strategy: str = 'quantile', closed: str = 'right', auto_adapt_bins: bool = False, starting_precision: int = 0, label_format: str = '{} - {}', change_endpoint_format: bool = False)[source]

Bases: sklearn.base.BaseEstimator

Bin continuous data into intervals of predefined size. It provides a way to partition continuous data into discrete values, i.e. transform continuous data into nominal data. This can make a linear model more expressive as it introduces nonlinearity to the model, while maintaining the interpretability of the model afterwards.

This module is a rework of https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/_discretization.py, though it is written purely in pandas instead of numpy, as pandas is more intuitive to work with. It also includes some custom modifications to align it with the Python Predictions methodology. See the README of the GitHub repository for more background information.

auto_adapt_bins

Reduces the number of bins (starting from n_bins) as a function of the number of missing values.

Type

bool

change_endpoint_format

Whether or not to change the format of the lower and upper bins into <= x and > y resp.

Type

bool

closed

Whether to close the bins (intervals) from the left or right

Type

str

label_format

Format string to display the bin labels e.g. min - max, (min, max], …

Type

str

n_bins

Number of bins to produce. Raises ValueError if n_bins < 2. A warning is issued when a variable can only produce a lower number of bins than asked for.

Type

int

starting_precision

Initial precision for the bin edges to start from; can also be negative. Given a list of bin edges, the class will automatically choose the minimal precision required to have proper bins, e.g. [5.5555, 5.5744, ...] will be rounded to [5.56, 5.57, ...]. In case of a negative number, the bin edges will be rounded to the nearest ten, hundred, ... e.g. 5.55 -> 10, 146 -> 100, …

Type

int

strategy

Binning strategy. Currently only uniform and quantile (i.e. equifrequency binning) are supported.

Type

str

valid_strategies = ('uniform', 'quantile')
valid_keys = ['n_bins', 'strategy', 'closed', 'auto_adapt_bins', 'starting_precision', 'label_format', 'change_endpoint_format']
attributes_to_dict() dict[source]

Return the attributes of KBinsDiscretizer in a dictionary

Returns

Contains the attributes of KBinsDiscretizer instance with the names as keys

Return type

dict

set_attributes_from_dict(params: dict)[source]

Set instance attributes from a dictionary of values with key the name of the attribute.

Parameters

params (dict) – Contains the attributes of KBinsDiscretizer with their names as key.

Raises

ValueError – In case _bins_by_column is not of type dict

fit(data: pandas.core.frame.DataFrame, column_names: list)[source]

Fits the estimator on the given data.

Parameters
  • data (pd.DataFrame) – Data to be discretized

  • column_names (list) – Names of the columns of the DataFrame to discretize

transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame[source]

Discretizes the data in the given list of columns by mapping each number to the appropriate bin computed by the fit method

Parameters
  • data (pd.DataFrame) – Data to be discretized

  • column_names (list) – Names of the columns of the DataFrame to discretize

Returns

data with additional discretized variables

Return type

pd.DataFrame

fit_transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame[source]

Fits the data, then transforms it.

Parameters
  • data (pd.DataFrame) – Data to be discretized

  • column_names (list) – Names of the columns of the DataFrame to discretize

Returns

data with additional discretized variables

Return type

pd.DataFrame
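
A minimal usage sketch (the DataFrame and the column name "age" are hypothetical toy data; the exact name of the added discretized column, e.g. an "_bin" suffix, is an assumption based on cobra.utils.clean_predictor_name):

    import pandas as pd
    from cobra.preprocessing import KBinsDiscretizer

    # Hypothetical continuous data
    df = pd.DataFrame({"age": [18, 25, 31, 40, 52, 60, 67, 73, 80, 95]})

    # Bin "age" into 5 equifrequency (quantile) bins
    discretizer = KBinsDiscretizer(n_bins=5, strategy="quantile", closed="right")
    binned = discretizer.fit_transform(df, column_names=["age"])

    # The result keeps the original column and adds a discretized variable
    print(binned.head())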

class cobra.preprocessing.TargetEncoder(weight: float = 0.0, imputation_strategy: str = 'mean')[source]

Bases: sklearn.base.BaseEstimator

Target encoding for categorical features, inspired by http://contrib.scikit-learn.org/category_encoders/targetencoder.html.

Replace each value of the categorical feature with the average of the target values (in case of a binary target, this is the incidence of the group). This encoding scheme is also called Mean encoding.

Note that, when applying this target encoding, values of the categorical feature that have not been seen during fit will be imputed according to the configured imputation strategy (replacement with the mean, minimum or maximum value of the categorical variable).

The main problem with Target encoding is overfitting; the fact that we are encoding the feature based on target classes may lead to data leakage, rendering the feature biased. This can be solved using some type of regularization. A popular way to handle this is to use cross-validation and compute the means on each out-of-fold sample. However, the approach implemented here makes use of additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing).

In summary:

  • with a binary classification target, a value of a categorical variable is replaced with:

    [count(variable=value) * P(target=1|variable=value) + weight * P(target=1)] / [count(variable=value) + weight]

  • with a regression target, a value of a categorical variable is replaced with:

    [count(variable=value) * E(target|variable=value) + weight * E(target)] / [count(variable=value) + weight]
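
As a small worked illustration of the binary-classification formula above (all numbers are made up): a category seen 20 times with 8 positive targets, an overall incidence P(target=1) of 0.25 and weight = 10 is encoded as (20 * 0.4 + 10 * 0.25) / (20 + 10) = 0.35, i.e. the raw incidence of 0.4 is pulled towards the prior of 0.25.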

imputation_strategy

In case there is a particular column which contains new categories, the encoding will lead to NULL values which should be imputed. Valid strategies then are to replace the NULL values with the global mean of the train set or the min (resp. max) incidence of the categories of that particular variable.

Type

str

weight

Smoothing parameter (non-negative). The higher the value of the parameter, the bigger the contribution of the overall mean of targets learnt from all training data (prior) and the smaller the contribution of the mean target learnt from data with the current categorical value (posterior), so the bigger the smoothing (regularization) effect. When set to zero, there is no smoothing (i.e. the mean target of the current categorical value is used).

Type

float

valid_imputation_strategies = ('mean', 'min', 'max')
attributes_to_dict() dict[source]

Return the attributes of TargetEncoder in a dictionary.

Returns

Contains the attributes of TargetEncoder instance with the names as keys.

Return type

dict

set_attributes_from_dict(params: dict)[source]

Set instance attributes from a dictionary of values with key the name of the attribute.

Parameters

params (dict) – Contains the attributes of TargetEncoder with their names as key.

fit(data: pandas.core.frame.DataFrame, column_names: list, target_column: str)[source]

Fit the TargetEncoder to the data.

Parameters
  • data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.

  • column_names (list) – Columns of data to be encoded.

  • target_column (str) – Column name of the target.

transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame[source]

Replace (i.e. encode) values of each categorical column with a new value (reflecting the corresponding average target value, optionally smoothed by a regularization weight), which was computed when the fit method was called.

Parameters
  • data (pd.DataFrame) – Data to encode.

  • column_names (list) – Name of the categorical columns in the data to be encoded.

Returns

The resulting transformed data.

Return type

pd.DataFrame

Raises

NotFittedError – Exception when TargetEncoder was not fitted before calling this method.

fit_transform(data: pandas.core.frame.DataFrame, column_names: list, target_column: str) pandas.core.frame.DataFrame[source]

Fit the encoder and transform the data.

Parameters
  • data (pd.DataFrame) – Data to be encoded.

  • column_names (list) – Columns of data to be encoded.

  • target_column (str) – Column name of the target.

Returns

Data with additional columns, holding the target-encoded variables.

Return type

pd.DataFrame
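
A minimal usage sketch with hypothetical toy data (the column names and the exact name of the added encoded column, e.g. an "_enc" suffix, are assumptions):

    import pandas as pd
    from cobra.preprocessing import TargetEncoder

    # Hypothetical training data with a binary target
    df = pd.DataFrame({
        "color": ["red", "red", "blue", "blue", "blue", "green"],
        "target": [1, 0, 1, 1, 0, 0],
    })

    # weight > 0 adds additive smoothing towards the global incidence
    encoder = TargetEncoder(weight=5.0, imputation_strategy="mean")
    encoded = encoder.fit_transform(df, column_names=["color"],
                                    target_column="target")
    print(encoded.head())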

class cobra.preprocessing.CategoricalDataProcessor(model_type: str = 'classification', regroup: bool = True, regroup_name: str = 'Other', keep_missing: bool = True, category_size_threshold: int = 5, p_value_threshold: float = 0.001, scale_contingency_table: bool = True, forced_categories: dict = {})[source]

Bases: sklearn.base.BaseEstimator

Regroups the categories of categorical variables based on significance with target variable.

This class implements Python Predictions' way of dealing with categorical data preprocessing. There are three steps involved:

  • An optional regrouping of the different categories based on category size and significance of the category w.r.t. the target.

    • For a given categorical variable, all categories below the (weighted) category size threshold are put into a rest category (by default Other)

    • The remaining categories are subject to a statistical test: if there is sufficient dependence with the target variable compared to all other categories, the category is kept as-is; otherwise it is also put into the rest category

    • Beware: one can force categories to be kept, and if no single category passes the statistical test, the categorical variable is left unprocessed altogether

  • Missing value replacement with the additional category Missing.

  • Change of dtype to category (could potentially lead to memory optimization).

See the README of the GitHub repository for more methodological background information.

category_size_threshold

All categories with a size (corrected for incidence if applicable) in the training set above this threshold are kept as a separate category, if statistical significance w.r.t. target is detected. Remaining categories are converted into Other (or else, cf. regroup_name).

Type

int

forced_categories

Map to prevent certain categories from being grouped into Other for each column - dict of the form {col:[forced vars]}.

Type

dict

keep_missing

Whether or not to keep missing as a separate category.

Type

bool

model_type

Model type (classification or regression).

Type

str

p_value_threshold

Significance threshold for regrouping.

Type

float

regroup

Whether or not to regroup categories.

Type

bool

regroup_name

New name of the non-significant regrouped variables

Type

str

scale_contingency_table

Whether the contingency table should be scaled before applying the chi-squared test.

Type

bool

valid_keys = ['model_type', 'regroup', 'regroup_name', 'keep_missing', 'category_size_threshold', 'p_value_threshold', 'scale_contingency_table', 'forced_categories']
attributes_to_dict() dict[source]

Return the attributes of CategoricalDataProcessor as a dictionary.

Returns

Contains the attributes of CategoricalDataProcessor instance with the attribute name as key.

Return type

dict

set_attributes_from_dict(params: dict)[source]

Set instance attributes from a dictionary of values with key the name of the attribute.

Parameters

params (dict) – Contains the attributes of CategoricalDataProcessor with their names as key.

Raises

ValueError – In case _cleaned_categories_by_column is not of type dict.

fit(data: pandas.core.frame.DataFrame, column_names: list, target_column: str)[source]

Fit the CategoricalDataProcessor.

Parameters
  • data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.

  • column_names (list) – Columns of data to be processed.

  • target_column (str) – Column name of the target.

transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame[source]

Transform the data.

Parameters
  • data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.

  • column_names (list) – Columns of data to be processed.

Returns

Data with additional transformed variables.

Return type

pd.DataFrame

fit_transform(data: pandas.core.frame.DataFrame, column_names: list, target_column: str) pandas.core.frame.DataFrame[source]

Fits the data, then transforms it.

Parameters
  • data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.

  • column_names (list) – Columns of data to be processed.

  • target_column (str) – Column name of the target.

Returns

Data with additional transformed variables.

Return type

pd.DataFrame
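
A minimal usage sketch with hypothetical toy data; the small threshold values are chosen only to make the toy example meaningful:

    import pandas as pd
    from cobra.preprocessing import CategoricalDataProcessor

    # Hypothetical basetable with a categorical variable and a binary target
    df = pd.DataFrame({
        "region": ["north", "north", "south", "south", "east", "west", None, "north"],
        "target": [1, 0, 1, 1, 0, 0, 1, 0],
    })

    processor = CategoricalDataProcessor(
        model_type="classification",
        category_size_threshold=2,
        p_value_threshold=0.05,
    )
    processed = processor.fit_transform(df, column_names=["region"],
                                        target_column="target")
    print(processed.head())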

class cobra.preprocessing.PreProcessor(categorical_data_processor: cobra.preprocessing.categorical_data_processor.CategoricalDataProcessor, discretizer: cobra.preprocessing.kbins_discretizer.KBinsDiscretizer, target_encoder: cobra.preprocessing.target_encoder.TargetEncoder, is_fitted: bool = False)[source]

Bases: sklearn.base.BaseEstimator

This class implements a so-called facade pattern to define a higher-level interface to work with the CategoricalDataProcessor, KBinsDiscretizer and TargetEncoder classes, so that their fit and transform methods are called in the correct order.

Additionally, it provides methods such as (de)serialization to/from JSON so that preprocessing pipelines can be stored and reloaded, for example for scoring.

We refer to the README of the GitHub repository for more background information on the preprocessing methodology.

categorical_data_processor

Instance of CategoricalDataProcessor to do the preprocessing of categorical variables.

Type

CategoricalDataProcessor

discretizer

Instance of KBinsDiscretizer to do the preprocessing of continuous variables by means of discretization.

Type

KBinsDiscretizer

target_encoder

Instance of TargetEncoder to do the incidence replacement.

Type

TargetEncoder

is_fitted

Whether or not object is yet fit.

Type

bool

model_type

The model_type variable as specified in CategoricalDataProcessor (classification or regression).

Type

str

classmethod from_params(model_type: str = 'classification', n_bins: int = 10, strategy: str = 'quantile', closed: str = 'right', auto_adapt_bins: bool = False, starting_precision: int = 0, label_format: str = '{} - {}', change_endpoint_format: bool = False, regroup: bool = True, regroup_name: str = 'Other', keep_missing: bool = True, category_size_threshold: int = 5, p_value_threshold: float = 0.001, scale_contingency_table: bool = True, forced_categories: dict = {}, weight: float = 0.0, imputation_strategy: str = 'mean')[source]

Constructor to instantiate PreProcessor from all the parameters that can be set in its required (attribute) classes, along with sensible default values.

Parameters
  • model_type (str) – Model type (classification or regression).

  • n_bins (int, optional) – Number of bins to produce. Raises ValueError if n_bins < 2.

  • strategy (str, optional) – Binning strategy. Currently only uniform and quantile (i.e. equifrequency binning) are supported.

  • closed (str, optional) – Whether to close the bins (intervals) from the left or right.

  • auto_adapt_bins (bool, optional) – Reduces the number of bins (starting from n_bins) as a function of the number of missing values.

  • starting_precision (int, optional) – Initial precision for the bin edges to start from; can also be negative. Given a list of bin edges, the class will automatically choose the minimal precision required to have proper bins, e.g. [5.5555, 5.5744, ...] will be rounded to [5.56, 5.57, ...]. In case of a negative number, the bin edges will be rounded to the nearest ten, hundred, ... e.g. 5.55 -> 10, 146 -> 100, …

  • label_format (str, optional) – Format string to display the bin labels e.g. min - max, (min, max], …

  • change_endpoint_format (bool, optional) – Whether or not to change the format of the lower and upper bins into < x and > y resp.

  • regroup (bool) – Whether or not to regroup categories.

  • regroup_name (str) – New name of the non-significant regrouped variables.

  • keep_missing (bool) – Whether or not to keep missing as a separate category.

  • category_size_threshold (int) – All categories with a size (corrected for incidence if applicable) in the training set above this threshold are kept as a separate category, if statistical significance w.r.t. target is detected. Remaining categories are converted into Other (or else, cf. regroup_name).

  • p_value_threshold (float) – Significance threshold for regrouping.

  • forced_categories (dict) – Map to prevent certain categories from being grouped into Other for each column - dict of the form {col:[forced vars]}.

  • scale_contingency_table (bool) – Whether the contingency table should be scaled before applying the chi-squared test.

  • weight (float, optional) – Smoothing parameter (non-negative). The higher the value of the parameter, the bigger the contribution of the overall mean. When set to zero, there is no smoothing (i.e. the pure target incidence is used).

  • imputation_strategy (str, optional) – In case there is a particular column which contains new categories, the encoding will lead to NULL values which should be imputed. Valid strategies are to replace with the global mean of the train set or the min (resp. max) incidence of the categories of that particular variable.

Returns

Class encapsulating CategoricalDataProcessor, KBinsDiscretizer, and TargetEncoder instances.

Return type

PreProcessor

classmethod from_pipeline(pipeline: dict)[source]

Constructor to instantiate PreProcessor from a (fitted) pipeline which was stored as a JSON file and passed to this function as a dict.

Parameters

pipeline (dict) – The (fitted) pipeline as a dictionary.

Returns

Instance of PreProcessor instantiated from a stored pipeline.

Return type

PreProcessor

Raises

ValueError – If the loaded pipeline does not contain exactly the required parameters (none missing, none extra).

fit(train_data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list, target_column_name: str)[source]

Fit the preprocessing pipeline to the data.

Parameters
  • train_data (pd.DataFrame) – Data to be preprocessed.

  • continuous_vars (list) – List of continuous variables.

  • discrete_vars (list) – List of discrete variables.

  • target_column_name (str) – Column name of the target.

transform(data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list) pandas.core.frame.DataFrame[source]

Transform the data by applying the preprocessing pipeline.

Parameters
  • data (pd.DataFrame) – Data to be preprocessed.

  • continuous_vars (list) – List of continuous variables.

  • discrete_vars (list) – List of discrete variables.

Returns

Transformed (preprocessed) data.

Return type

pd.DataFrame

Raises

NotFittedError – In case PreProcessor was not fitted first.

fit_transform(train_data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list, target_column_name: str) pandas.core.frame.DataFrame[source]

Fit preprocessing pipeline and transform the data.

Parameters
  • train_data (pd.DataFrame) – Data to be preprocessed

  • continuous_vars (list) – List of continuous variables.

  • discrete_vars (list) – List of discrete variables.

  • target_column_name (str) – Column name of the target.

Returns

Transformed (preprocessed) data.

Return type

pd.DataFrame
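
A minimal end-to-end sketch using from_params and fit_transform; the basetable, its column names and the small parameter values are hypothetical toy choices:

    import pandas as pd
    from cobra.preprocessing import PreProcessor

    # Hypothetical basetable: "age" is continuous, "region" is categorical
    basetable = pd.DataFrame({
        "age": [23, 35, 47, 51, 62, 29, 44, 58],
        "region": ["north", "south", "south", "east",
                   "west", "north", "east", "south"],
        "target": [0, 1, 1, 0, 1, 0, 1, 0],
    })

    preprocessor = PreProcessor.from_params(model_type="classification", n_bins=3)
    preprocessed = preprocessor.fit_transform(
        train_data=basetable,
        continuous_vars=["age"],
        discrete_vars=["region"],
        target_column_name="target",
    )
    print(preprocessed.columns)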

static train_selection_validation_split(data: pandas.core.frame.DataFrame, train_prop: float = 0.6, selection_prop: float = 0.2, validation_prop: float = 0.2) pandas.core.frame.DataFrame[source]

Adds a split column with train/selection/validation values to the dataset.

  • Train set = data on which the model is trained and on which the encoding is based.

  • Selection set = data used for univariate and forward feature selection. Often called the validation set.

  • Validation set = data that generates the final performance metrics. Often called the test set.

Parameters
  • data (pd.DataFrame) – Input dataset to split into train-selection and validation sets.

  • train_prop (float, optional) – Proportion of the data to put in the train set.

  • selection_prop (float, optional) – Proportion of the data to put in the selection set.

  • validation_prop (float, optional) – Proportion of the data to put in the validation set.

Returns

DataFrame with additional split column.

Return type

pd.DataFrame
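
A short sketch, continuing with the hypothetical basetable from the PreProcessor example above (the name of the added column, split, follows from the description above):

    from cobra.preprocessing import PreProcessor

    # Add a split column with train/selection/validation values (60/20/20 here)
    basetable = PreProcessor.train_selection_validation_split(
        basetable, train_prop=0.6, selection_prop=0.2, validation_prop=0.2)
    print(basetable["split"].value_counts())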

serialize_pipeline() dict[source]

Serialize the preprocessing pipeline by writing all its required parameters to a dictionary to later store it as a JSON file.

Returns

Return the pipeline as a dictionary.

Return type

dict
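
A short round-trip sketch, assuming preprocessor is the fitted PreProcessor from the example above and that the pipeline is stored as a plain JSON file (the file name is arbitrary):

    import json
    from cobra.preprocessing import PreProcessor

    # Store the fitted preprocessing pipeline as JSON ...
    with open("pipeline.json", "w") as f:
        json.dump(preprocessor.serialize_pipeline(), f)

    # ... and reload it later, e.g. in a scoring environment
    with open("pipeline.json", "r") as f:
        preprocessor = PreProcessor.from_pipeline(json.load(f))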

cobra.model_building module

cobra.model_building.compute_univariate_preselection(target_enc_train_data: pandas.core.frame.DataFrame, target_enc_selection_data: pandas.core.frame.DataFrame, predictors: list, target_column: str, model_type: str = 'classification', preselect_auc_threshold: float = 0.053, preselect_rmse_threshold: float = 5, preselect_overtrain_threshold: float = 0.05) pandas.core.frame.DataFrame[source]

Perform a preselection of predictors based on an AUC (in case of classification) or a RMSE (in case of regression) threshold of a univariate model on a train and selection dataset and return a DataFrame containing for each variable the train and selection AUC or RMSE along with a boolean “preselection” column.

As the AUC only measures the quality of a ranking, all monotonic transformations of a given ranking (i.e. transformations that do not alter the ranking itself) lead to the same AUC. Hence, pushing a categorical variable (incl. a binned continuous variable) through a logistic regression produces exactly the same ranking as pushing it through incidence replacement (i.e. target encoding), since both yield a ranking of the categories on the training set. Therefore, no univariate model is trained here; the target-encoded train and selection data must be used as inputs for this function, and these encoded values are used as predicted scores to compute the AUC against the target.

Parameters
  • model_type (str) – Model type (“classification” or “regression”).

  • target_enc_train_data (pd.DataFrame) – Train data.

  • target_enc_selection_data (pd.DataFrame) – Selection data.

  • predictors (list) – List of predictors (e.g. column names in the train set and selection data sets).

  • target_column (str) – Name of the target column.

  • preselect_auc_threshold (float, optional) – Threshold on min. AUC to select predictor. Ignored if model_type is “regression”.

  • preselect_rmse_threshold (float, optional) – Threshold on max. RMSE to select predictor. Ignored if model_type is “classification”. It is important to note that the threshold depends heavily on the scale of the target variable, and should be modified accordingly.

  • preselect_overtrain_threshold (float, optional) – Threshold on the difference between train and selection AUC or RMSE (in case of the latter, as a proportion).

Returns

DataFrame containing for each variable the train AUC or RMSE and selection AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.

Return type

pd.DataFrame

cobra.model_building.get_preselected_predictors(df_metric: pandas.core.frame.DataFrame) list[source]

Wrapper function to extract a list of predictors from df_metric.

Parameters

df_metric (pd.DataFrame) – DataFrame containing for each variable the train AUC or RMSE and test AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.

Returns

List of preselected predictors.

Return type

list
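
A hedged sketch of the preselection step; it assumes a basetable that has been preprocessed and split as in the cobra.preprocessing examples above, and the target-encoded column names ("age_enc", "region_enc") are assumptions for the sake of the example:

    from cobra.model_building import (compute_univariate_preselection,
                                      get_preselected_predictors)

    # Hypothetical target-encoded train and selection sets
    train_data = basetable[basetable["split"] == "train"]
    selection_data = basetable[basetable["split"] == "selection"]

    df_metric = compute_univariate_preselection(
        target_enc_train_data=train_data,
        target_enc_selection_data=selection_data,
        predictors=["age_enc", "region_enc"],
        target_column="target",
        model_type="classification",
        preselect_auc_threshold=0.053,
        preselect_overtrain_threshold=0.05,
    )
    preselected = get_preselected_predictors(df_metric)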

cobra.model_building.compute_correlations(target_enc_train_data: pandas.core.frame.DataFrame, predictors: list) pandas.core.frame.DataFrame[source]

Given a DataFrame and a list of predictors, compute the correlations amongst the predictors in the DataFrame.

Parameters
  • target_enc_train_data (pd.DataFrame) – Data to compute correlation.

  • predictors (list) – List of column names of the DataFrame between which to compute the correlation matrix.

Returns

The correlation matrix of the training set.

Return type

pd.DataFrame
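
A one-line sketch, reusing the hypothetical train_data and predictor names from the example above; the resulting matrix can be passed to cobra.evaluation.plot_correlation_matrix (documented further below):

    from cobra.model_building import compute_correlations

    df_corr = compute_correlations(train_data, predictors=["age_enc", "region_enc"])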

class cobra.model_building.LogisticRegressionModel[source]

Bases: object

Wrapper around the LogisticRegression class, with additional methods implemented such as evaluation (using AUC), getting a list of coefficients, a dictionary of coefficients per predictor, … for convenience.

logit

scikit-learn logistic regression model.

Type

LogisticRegression

predictors

List of predictors used in the model.

Type

list

serialize() dict[source]

Serialize model as JSON.

Returns

Dictionary containing the serialized JSON.

Return type

dict

deserialize(model_dict: dict)[source]

Deserialize a model previously stored as JSON.

Parameters

model_dict (dict) – Serialized JSON file as a dict.

Raises

ValueError – In case JSON file is no valid serialized model.

get_coef() numpy.array[source]

Returns the model coefficients.

Returns

Array of model coefficients.

Return type

np.array

get_intercept() float[source]

Returns the intercept of the model.

Returns

Intercept of the model.

Return type

float

get_coef_by_predictor() dict[source]

Returns a dictionary mapping predictor (key) to coefficient (value).

Returns

A map {predictor: coefficient}.

Return type

dict

fit(X_train: pandas.core.frame.DataFrame, y_train: pandas.core.series.Series)[source]

Fit the model.

Parameters
  • X_train (pd.DataFrame) – Predictors of train data.

  • y_train (pd.Series) – Target of train data.

score_model(X: pandas.core.frame.DataFrame) numpy.ndarray[source]

Score a model on a (new) dataset.

Parameters

X (pd.DataFrame) – Dataset of predictors to score the model.

Returns

Score (i.e. predicted probabilities) of the model for each observation.

Return type

np.ndarray

evaluate(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, split: Optional[str] = None, metric: Optional[Callable] = None) float[source]

Evaluate the model on a given dataset (X, y). The optional split parameter indicates which split the dataset belongs to (train, selection, or validation), so that computations on these sets can be cached.

Parameters
  • X (pd.DataFrame) – Dataset containing the predictor values for each observation.

  • y (pd.Series) – Dataset containing the target of each observation.

  • split (str, optional) – Split name of the dataset (e.g. “train”, “selection”, or “validation”).

  • metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (AUC). The function should require y_true and y_pred (binary output) arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

Returns

The performance score of the model (AUC by default).

Return type

float

compute_variable_importance(data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]

Compute the importance of each predictor in the model and return it as a DataFrame.

Parameters

data (pd.DataFrame) – Data to score the model.

Returns

DataFrame containing columns predictor and importance.

Return type

pd.DataFrame
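
A hedged usage sketch, reusing the hypothetical target-encoded train_data and selection_data from the preselection example above:

    from cobra.model_building import LogisticRegressionModel

    predictors = ["age_enc", "region_enc"]

    model = LogisticRegressionModel()
    model.fit(X_train=train_data[predictors], y_train=train_data["target"])

    # Predicted probabilities on the selection set
    scores = model.score_model(selection_data[predictors])

    # AUC on the selection set (cached per split name)
    auc = model.evaluate(selection_data[predictors], selection_data["target"],
                         split="selection")
    print(model.get_coef_by_predictor(), auc)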

class cobra.model_building.LinearRegressionModel[source]

Bases: object

Wrapper around the LinearRegression class, with additional methods implemented such as evaluation (using RMSE), getting a list of coefficients, a dictionary of coefficients per predictor, … for convenience.

linear

scikit-learn linear regression model.

Type

LinearRegression

predictors

List of predictors used in the model.

Type

list

serialize() dict[source]

Serialize model as JSON.

Returns

Dictionary containing the serialized JSON.

Return type

dict

deserialize(model_dict: dict)[source]

Deserialize a model previously stored as JSON.

Parameters

model_dict (dict) – Serialized JSON file as a dict.

Raises

ValueError – In case JSON file is no valid serialized model.

get_coef() numpy.array[source]

Returns the model coefficients.

Returns

Array of model coefficients.

Return type

np.array

get_intercept() float[source]

Returns the intercept of the model.

Returns

Intercept of the model.

Return type

float

get_coef_by_predictor() dict[source]

Returns a dictionary mapping predictor (key) to coefficient (value).

Returns

A map {predictor: coefficient}.

Return type

dict

fit(X_train: pandas.core.frame.DataFrame, y_train: pandas.core.series.Series)[source]

Fit the model.

Parameters
  • X_train (pd.DataFrame) – Predictors of train data.

  • y_train (pd.Series) – Target of train data.

score_model(X: pandas.core.frame.DataFrame) numpy.ndarray[source]

Score a model on a (new) dataset.

Parameters

X (pd.DataFrame) – Dataset of predictors to score the model.

Returns

Score of the model for each observation.

Return type

np.ndarray

evaluate(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, split: Optional[str] = None, metric: Optional[Callable] = None) float[source]

Evaluate the model on a given dataset (X, y). The optional split parameter indicates which split the dataset belongs to (train, selection, or validation), so that computations on these sets can be cached.

Parameters
  • X (pd.DataFrame) – Dataset containing the predictor values for each observation.

  • y (pd.Series) – Dataset containing the target of each observation.

  • split (str, optional) – Split name of the dataset (e.g. “train”, “selection”, or “validation”).

  • metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (RMSE). The function should require y_true and y_pred arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

Returns

The performance score of the model (RMSE by default).

Return type

float

compute_variable_importance(data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]

Compute the importance of each predictor in the model and return it as a DataFrame.

Parameters

data (pd.DataFrame) – Data to score the model.

Returns

DataFrame containing columns predictor and importance.

Return type

pd.DataFrame

class cobra.model_building.ForwardFeatureSelection(model_type: str = 'classification', max_predictors: int = 50, pos_only: bool = True)[source]

Bases: object

Perform forward feature selection for a given dataset using a given algorithm.

Predictors are sequentially added to the model, starting with the one that has the highest univariate predictive power, and then proceeding with those that jointly lead to the best fit, optimizing for selection AUC or RMSE. Interaction effects are not explicitly modeled, yet they are implicitly present given the feature selection and the underlying feature correlation structure.

model_type

Model type (classification or regression).

Type

str

MLModel

LogisticRegressionModel or LinearRegressionModel.

Type

Cobra model

max_predictors

Maximum number of predictors allowed in any model. This roughly corresponds to the maximum number of steps in the forward feature selection.

Type

int

pos_only

Whether or not the model coefficients should all be positive (no sign flips).

Type

bool

_fitted_models

List of fitted models.

Type

list

get_model_from_step(step: int)[source]

Get fitted model from a particular step.

Parameters

step (int) – Particular step in the forward selection.

Returns

Fitted model from the given step.

Return type

self.MLModel

Raises

ValueError – In case step is larger than the number of available models.

compute_model_performances(data: pandas.core.frame.DataFrame, target_column_name: str, splits: list = ['train', 'selection', 'validation'], metric: Optional[Callable] = None) pandas.core.frame.DataFrame[source]

Compute for each model the performance for different sets (e.g. train-selection-validation) and return them along with a list of predictors used in the model. Note that the computation of the performance for each split is cached inside the model itself, so it is inexpensive to perform it multiple times.

Parameters
  • data (pd.DataFrame) – Dataset for which to compute performance of each model.

  • target_column_name (str) – Name of the target column.

  • splits (list, optional) – List of splits to compute performance on.

  • metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (AUC for classification, RMSE for regression). The function should require y_true and y_pred arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

Returns

Contains for each model the performance for train, selection and validation sets as well as the set of predictors used in this model.

Return type

pd.DataFrame

fit(train_data: pandas.core.frame.DataFrame, target_column_name: str, predictors: list, forced_predictors: list = [], excluded_predictors: list = [])[source]

Fit the forward feature selection estimator.

Parameters
  • train_data (pd.DataFrame) – Data on which to fit the model. Should include a “train” and “selection” split for correct model selection! The “train” split is used to train a model, the “selection” split is used to evaluate which model to include in the actual forward feature selection.

  • target_column_name (str) – Name of the target column.

  • predictors (list) – List of predictors on which to train the estimator.

  • forced_predictors (list, optional) – List of predictors to force in the estimator.

  • excluded_predictors (list, optional) – List of predictors to exclude from the estimator.

Raises

ValueError – In case the number of forced predictors is larger than the maximum number of allowed predictors in the model.
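
A hedged sketch of a forward feature selection run; it assumes a basetable containing target-encoded predictors, a binary target and a split column with train/selection/validation values, as in the earlier examples:

    from cobra.model_building import ForwardFeatureSelection

    forward_selection = ForwardFeatureSelection(model_type="classification",
                                                max_predictors=10,
                                                pos_only=True)
    forward_selection.fit(train_data=basetable,
                          target_column_name="target",
                          predictors=["age_enc", "region_enc"])

    # Performance of every candidate model on each split
    performances = forward_selection.compute_model_performances(basetable, "target")

    # Retrieve the fitted model from one particular step of the selection
    best_model = forward_selection.get_model_from_step(1)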

cobra.evaluation module

cobra.evaluation.generate_pig_tables(basetable: pandas.core.frame.DataFrame, id_column_name: str, target_column_name: str, preprocessed_predictors: list) pandas.core.frame.DataFrame[source]

Compute PIG tables for all predictors in preprocessed_predictors.

The output is a DataFrame with columns variable, label, pop_size, global_avg_target and avg_target.

Parameters
  • basetable (pd.DataFrame) – Basetable to compute PIG tables from.

  • id_column_name (str) – Name of the basetable column containing the IDs of the basetable rows (e.g. customernumber).

  • target_column_name (str) – Name of the basetable column containing the target values to predict.

  • preprocessed_predictors (list) – List of basetable column names containing preprocessed predictors.

Returns

DataFrame containing a PIG table for all predictors.

Return type

pd.DataFrame

cobra.evaluation.compute_pig_table(basetable: pandas.core.frame.DataFrame, predictor_column_name: str, target_column_name: str, id_column_name: str) pandas.core.frame.DataFrame[source]

Compute the PIG table of a given predictor for a given target.

Parameters
  • basetable (pd.DataFrame) – Input data from which to compute the pig table.

  • predictor_column_name (str) – Predictor name of which to compute the pig table.

  • target_column_name (str) – Name of the target variable.

  • id_column_name (str) – Name of the id column (used to count population size).

Returns

PIG table as a DataFrame

Return type

pd.DataFrame

cobra.evaluation.plot_incidence(pig_tables: pandas.core.frame.DataFrame, variable: str, model_type: str, column_order: Optional[list] = None, dim: tuple = (12, 8))[source]

Plots a Predictor Insights Graph (PIG), a graph in which the mean target value is plotted for a number of bins constructed from a predictor variable. When the target is a binary classification target, the plotted mean target value is a true incidence rate.

Bins are ordered in descending order of mean target value unless specified otherwise with the column_order list.

Parameters
  • pig_tables (pd.DataFrame) – Dataframe with cleaned, binned, partitioned and prepared data, as created by generate_pig_tables() from this module.

  • variable (str) – Name of the predictor variable for which the PIG will be plotted.

  • model_type (str) – Type of model (either “classification” or “regression”).

  • column_order (list, default=None) – Explicit order of the value bins of the predictor variable to be used on the PIG.

  • dim (tuple, default=(12, 8)) – Optional tuple to configure the width and length of the plot.
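
A hedged sketch combining generate_pig_tables and plot_incidence; the id column and the names of the preprocessed (binned/regrouped) predictor columns are assumptions for the sake of the example:

    from cobra.evaluation import generate_pig_tables, plot_incidence

    pig_tables = generate_pig_tables(
        basetable=basetable,
        id_column_name="customer_id",
        target_column_name="target",
        preprocessed_predictors=["age_bin", "region_processed"],
    )
    plot_incidence(pig_tables, variable="age_bin", model_type="classification")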

cobra.evaluation.plot_performance_curves(model_performance: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None, colors: dict = {'selection': '#ff9500', 'train': '#0099bf', 'validation': '#8064a2'}, metric_name: Optional[str] = None)[source]

Plot performance curves generated by the forward feature selection for the train-selection-validation sets.

Parameters
  • model_performance (pd.DataFrame) – Contains train-selection-validation performance for each model trained in the forward feature selection.

  • dim (tuple, optional) – Width and length of the plot.

  • path (str, optional) – Path to store the figure.

  • colors (dict, optional) – Map with colors for train-selection-validation curves.

  • metric_name (str, optional) – Name to indicate the metric used in model_performance. Defaults to RMSE in case of regression and AUC in case of classification.

cobra.evaluation.plot_variable_importance(df_variable_importance: pandas.core.frame.DataFrame, title: Optional[str] = None, dim: tuple = (12, 8), path: Optional[str] = None)[source]

Plot variable importance of a given model.

Parameters
  • df_variable_importance (pd.DataFrame) – DataFrame containing columns predictor and importance.

  • title (str, optional) – Title of the plot.

  • dim (tuple, optional) – Width and length of the plot.

  • path (str, optional) – Path to store the figure.

cobra.evaluation.plot_univariate_predictor_quality(df_metric: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None)[source]

Plot univariate quality of the predictors.

Parameters
  • df_metric (pd.DataFrame) – DataFrame containing for each variable the train AUC or RMSE and test AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.

  • dim (tuple, optional) – Width and length of the plot.

  • path (str, optional) – Path to store the figure.

cobra.evaluation.plot_correlation_matrix(df_corr: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None)[source]

Plot correlation matrix amongst the predictors.

Parameters
  • df_corr (pd.DataFrame) – Correlation matrix.

  • dim (tuple, optional) – Width and length of the plot.

  • path (str, optional) – Path to store the figure.
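
A short sketch tying these plotting helpers to the objects produced in the cobra.model_building examples above (df_metric, df_corr, performances and best_model are the hypothetical results from those sketches):

    from cobra.evaluation import (plot_univariate_predictor_quality,
                                  plot_correlation_matrix,
                                  plot_performance_curves,
                                  plot_variable_importance)

    plot_univariate_predictor_quality(df_metric)
    plot_correlation_matrix(df_corr)
    plot_performance_curves(performances, metric_name="AUC")

    importance = best_model.compute_variable_importance(basetable)
    plot_variable_importance(importance, title="Variable importance")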

class cobra.evaluation.ClassificationEvaluator(probability_cutoff: Optional[float] = None, lift_at: float = 0.05, n_bins: int = 10)[source]

Bases: object

Evaluator class encapsulating classification model metrics and plotting functionality.

y_true

True binary target data labels.

Type

np.ndarray

y_pred

Target scores of the model.

Type

np.ndarray

confusion_matrix

Confusion matrix computed for a particular cut-off.

Type

np.ndarray

cumulative_gains

Data for plotting cumulative gains curve.

Type

tuple

evaluation_metrics

Map containing various scalar evaluation metrics (precision, recall, accuracy, AUC, F1, etc.).

Type

dict

lift_at

Parameter to determine the top percentage of the population at which the lift of the model should be computed.

Type

float

lift_curve

Data for plotting lift curve(s).

Type

tuple

probability_cutoff

Probability cut off to convert probability scores to a binary score.

Type

float

roc_curve

Map containing the true positive rate and false positive rate at various thresholds (the thresholds themselves are also included).

Type

dict

n_bins

Defines the number of bins used to calculate the lift curve (by default 10, i.e. deciles).

Type

int, optional

fit(y_true: numpy.ndarray, y_pred: numpy.ndarray)[source]

Fit the evaluator by computing the relevant evaluation metrics on the inputs.

Parameters
  • y_true (np.ndarray) – True labels.

  • y_pred (np.ndarray) – Model scores (as probability).

plot_roc_curve(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot ROC curve of the model.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

plot_confusion_matrix(path: Optional[str] = None, dim: tuple = (12, 8), labels: list = ['0', '1'])[source]

Plot the confusion matrix.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

  • labels (list, optional) – Optional list of labels, default “0” and “1”.

plot_cumulative_response_curve(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot cumulative response curve.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

plot_lift_curve(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot lift per decile.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

plot_cumulative_gains(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot cumulative gains per decile.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.
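
A minimal sketch on synthetic labels and scores (the random data is purely illustrative):

    import numpy as np
    from cobra.evaluation import ClassificationEvaluator

    rng = np.random.default_rng(42)
    y_true = rng.integers(0, 2, size=500)
    # Noisy probability scores that correlate with the true labels
    y_pred = np.clip(0.5 * y_true + 0.5 * rng.random(500), 0.0, 1.0)

    evaluator = ClassificationEvaluator(lift_at=0.05, n_bins=10)
    evaluator.fit(y_true, y_pred)

    print(evaluator.evaluation_metrics)   # precision, recall, accuracy, AUC, F1, ...
    evaluator.plot_roc_curve()
    evaluator.plot_lift_curve()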

class cobra.evaluation.RegressionEvaluator[source]

Bases: object

Evaluator class encapsulating regression model metrics and plotting functionality.

y_true

True target values.

Type

np.ndarray

y_pred

Predicted target values of the model.

Type

np.ndarray

scalar_metrics

Map containing various scalar evaluation metrics (R-squared, MAE, MSE, RMSE)

Type

dict

qq

Theoretical quantiles and associated actual residuals.

Type

pd.Series

fit(y_true: numpy.ndarray, y_pred: numpy.ndarray)[source]

Fit the evaluator by computing the relevant evaluation metrics on the inputs.

Parameters
  • y_true (np.ndarray) – True labels.

  • y_pred (np.ndarray) – Model scores.

plot_predictions(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot predictions from the model against actual values.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

plot_qq(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Display a Q-Q plot from the standardized prediction residuals.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.
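
A minimal sketch on synthetic values (purely illustrative):

    import numpy as np
    from cobra.evaluation import RegressionEvaluator

    rng = np.random.default_rng(0)
    y_true = rng.normal(loc=100, scale=20, size=500)
    y_pred = y_true + rng.normal(scale=10, size=500)   # hypothetical model output

    evaluator = RegressionEvaluator()
    evaluator.fit(y_true, y_pred)

    print(evaluator.scalar_metrics)   # R-squared, MAE, MSE, RMSE
    evaluator.plot_predictions()
    evaluator.plot_qq()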

cobra.utils module

cobra.utils.clean_predictor_name(predictor_name: str) str[source]

Strip the redundant suffix (e.g. “_enc” or “_bin”) from the end of the predictor name and return a clean version of the predictor name.
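
For example (expected outputs shown as comments):

    from cobra.utils import clean_predictor_name

    print(clean_predictor_name("age_enc"))     # expected: "age"
    print(clean_predictor_name("region_bin"))  # expected: "region"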