Module contents

cobra.preprocessing module

class cobra.preprocessing.KBinsDiscretizer(n_bins: int = 10, strategy: str = 'quantile', closed: str = 'right', auto_adapt_bins: bool = False, starting_precision: int = 0, label_format: str = '{} - {}', change_endpoint_format: bool = False)[source]

Bases: sklearn.base.BaseEstimator

Bin continuous data into intervals of predefined size. It provides a way to partition continuous data into discrete values, i.e. transform continuous data into nominal data. This can make a linear model more expressive as it introduces nonlinearity to the model, while maintaining the interpretability of the model afterwards.

This module is a rework of https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/_discretization.py, though it is written purely in pandas instead of numpy, as pandas is more intuitive to work with. It also includes some custom modifications to align it with the Python Predictions methodology. See the README of the GitHub repository for more background information.

auto_adapt_bins

Reduces the number of bins (starting from n_bins) as a function of the number of missing values.

Type

bool

change_endpoint_format

Whether or not to change the format of the lower and upper bins into <= x and > y resp.

Type

bool

closed

Whether to close the bins (intervals) from the left or right

Type

str

label_format

Format string to display the bin labels e.g. min - max, (min, max], …

Type

str

n_bins

Number of bins to produce. Raises ValueError if n_bins < 2. A warning is issued when a variable can only produce a lower number of bins than asked for.

Type

int

starting_precision

Initial precision for the bin edges to start from; can also be negative. Given a list of bin edges, the class will automatically choose the minimal precision required to have proper bins, e.g. [5.5555, 5.5744, ...] will be rounded to [5.56, 5.57, ...]. In case of a negative number, the bin edges will be rounded to the nearest ten, hundred, ... e.g. 5.55 -> 10, 146 -> 100, …

Type

int

strategy

Binning strategy. Currently only uniform and quantile (i.e. equifrequency binning) are supported.

Type

str

valid_strategies = ('uniform', 'quantile')
valid_keys = ['n_bins', 'strategy', 'closed', 'auto_adapt_bins', 'starting_precision', 'label_format', 'change_endpoint_format']
attributes_to_dict() dict[source]

Return the attributes of KBinsDiscretizer in a dictionary

Returns

Contains the attributes of KBinsDiscretizer instance with the names as keys

Return type

dict

set_attributes_from_dict(params: dict)[source]

Set instance attributes from a dictionary of values with key the name of the attribute.

Parameters

params (dict) – Contains the attributes of KBinsDiscretizer with their names as key.

Raises

ValueError – In case _bins_by_column is not of type dict

fit(data: pandas.core.frame.DataFrame, column_names: list)[source]

Fits the estimator on the given data.

Parameters
  • data (pd.DataFrame) – Data to be discretized

  • column_names (list) – Names of the columns of the DataFrame to discretize

transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame[source]

Discretizes the data in the given list of columns by mapping each number to the appropriate bin computed by the fit method

Parameters
  • data (pd.DataFrame) – Data to be discretized

  • column_names (list) – Names of the columns of the DataFrame to discretize

Returns

data with additional discretized variables

Return type

pd.DataFrame

fit_transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame[source]

Fits the data, then transforms it.

Parameters
  • data (pd.DataFrame) – Data to be discretized

  • column_names (list) – Names of the columns of the DataFrame to discretize

Returns

data with additional discretized variables

Return type

pd.DataFrame
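
A minimal usage sketch (the DataFrame and the column name "age" are hypothetical toy data; the exact name of the added discretized column, e.g. an "_bin" suffix, is an assumption based on cobra.utils.clean_predictor_name):

    import pandas as pd
    from cobra.preprocessing import KBinsDiscretizer

    # Hypothetical continuous data
    df = pd.DataFrame({"age": [18, 25, 31, 40, 52, 60, 67, 73, 80, 95]})

    # Bin "age" into 5 equifrequency (quantile) bins
    discretizer = KBinsDiscretizer(n_bins=5, strategy="quantile", closed="right")
    binned = discretizer.fit_transform(df, column_names=["age"])

    # The result keeps the original column and adds a discretized variable
    print(binned.head())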

class cobra.preprocessing.TargetEncoder(weight: float = 0.0, imputation_strategy: str = 'mean')[source]

Bases: sklearn.base.BaseEstimator

Target encoding for categorical features, inspired by http://contrib.scikit-learn.org/category_encoders/targetencoder.html.

Replace each value of the categorical feature with the average of the target values (in case of a binary target, this is the incidence of the group). This encoding scheme is also called Mean encoding.

Note that, when applying this target encoding, values of the categorical feature that have not been seen during fit will be imputed according to the configured imputation strategy (replacement with the mean, minimum or maximum value of the categorical variable).

The main problem with Target encoding is overfitting; the fact that we are encoding the feature based on target classes may lead to data leakage, rendering the feature biased. This can be solved using some type of regularization. A popular way to handle this is to use cross-validation and compute the means on each out-of-fold sample. However, the approach implemented here makes use of additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing).

In summary:

  • with a binary classification target, a value of a categorical variable is replaced with:

    [count(variable=value) * P(target=1|variable=value) + weight * P(target=1)] / [count(variable=value) + weight]

  • with a regression target, a value of a categorical variable is replaced with:

    [count(variable=value) * E(target|variable=value) + weight * E(target)] / [count(variable=value) + weight]
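
As a small worked illustration of the binary-classification formula above (all numbers are made up): a category seen 20 times with 8 positive targets, an overall incidence P(target=1) of 0.25 and weight = 10 is encoded as (20 * 0.4 + 10 * 0.25) / (20 + 10) = 0.35, i.e. the raw incidence of 0.4 is pulled towards the prior of 0.25.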

imputation_strategy

In case there is a particular column which contains new categories, the encoding will lead to NULL values which should be imputed. Valid strategies then are to replace the NULL values with the global mean of the train set or the min (resp. max) incidence of the categories of that particular variable.

Type

str

weight

Smoothing parameter (non-negative). The higher the value of the parameter, the bigger the contribution of the overall mean of targets learnt from all training data (prior) and the smaller the contribution of the mean target learnt from data with the current categorical value (posterior), so the bigger the smoothing (regularization) effect. When set to zero, there is no smoothing (i.e. the mean target of the current categorical value is used).

Type

float

valid_imputation_strategies = ('mean', 'min', 'max')
attributes_to_dict() dict[source]

Return the attributes of TargetEncoder in a dictionary.

Returns

Contains the attributes of TargetEncoder instance with the names as keys.

Return type

dict

set_attributes_from_dict(params: dict)[source]

Set instance attributes from a dictionary of values with key the name of the attribute.

Parameters

params (dict) – Contains the attributes of TargetEncoder with their names as key.

fit(data: pandas.core.frame.DataFrame, column_names: list, target_column: str)[source]

Fit the TargetEncoder to the data.

Parameters
  • data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.

  • column_names (list) – Columns of data to be encoded.

  • target_column (str) – Column name of the target.

transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame[source]

Replace (i.e. encode) values of each categorical column with a new value (reflecting the corresponding average target value, optionally smoothed by a regularization weight), which was computed when the fit method was called.

Parameters
  • data (pd.DataFrame) – Data to encode.

  • column_names (list) – Name of the categorical columns in the data to be encoded.

Returns

The resulting transformed data.

Return type

pd.DataFrame

Raises

NotFittedError – Exception when TargetEncoder was not fitted before calling this method.

fit_transform(data: pandas.core.frame.DataFrame, column_names: list, target_column: str) pandas.core.frame.DataFrame[source]

Fit the encoder and transform the data.

Parameters
  • data (pd.DataFrame) – Data to be encoded.

  • column_names (list) – Columns of data to be encoded.

  • target_column (str) – Column name of the target.

Returns

Data with additional columns, holding the target-encoded variables.

Return type

pd.DataFrame
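
A minimal usage sketch with hypothetical toy data (the column names and the exact name of the added encoded column, e.g. an "_enc" suffix, are assumptions):

    import pandas as pd
    from cobra.preprocessing import TargetEncoder

    # Hypothetical training data with a binary target
    df = pd.DataFrame({
        "color": ["red", "red", "blue", "blue", "blue", "green"],
        "target": [1, 0, 1, 1, 0, 0],
    })

    # weight > 0 adds additive smoothing towards the global incidence
    encoder = TargetEncoder(weight=5.0, imputation_strategy="mean")
    encoded = encoder.fit_transform(df, column_names=["color"],
                                    target_column="target")
    print(encoded.head())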

class cobra.preprocessing.CategoricalDataProcessor(model_type: str = 'classification', regroup: bool = True, regroup_name: str = 'Other', keep_missing: bool = True, category_size_threshold: int = 5, p_value_threshold: float = 0.001, scale_contingency_table: bool = True, forced_categories: dict = {})[source]

Bases: sklearn.base.BaseEstimator

Regroups the categories of categorical variables based on significance with target variable.

This class implements Python Predictions' way of dealing with categorical data preprocessing. There are three steps involved:

  • An optional regrouping of the different categories based on category size and significance of the category w.r.t. the target.

    • For a given categorical variable, all categories below the (weighted) category size threshold are put into a rest category (by default Other)

    • The remaining categories are subject to a statistical test: if there is sufficient dependence with the target variable compared to all other categories, the category is kept as-is; otherwise it is also put into the rest category

    • Beware: one can force categories to be kept, and if no single category passes the statistical test, the categorical variable is left unprocessed altogether

  • Missing value replacement with the additional category Missing.

  • Change of dtype to category (could potentially lead to memory optimization).

See the README of the GitHub repository for more methodological background information.

category_size_threshold

All categories with a size (corrected for incidence if applicable) in the training set above this threshold are kept as a separate category, if statistical significance w.r.t. target is detected. Remaining categories are converted into Other (or else, cf. regroup_name).

Type

int

forced_categories

Map to prevent certain categories from being grouped into Other for each column - dict of the form {col:[forced vars]}.

Type

dict

keep_missing

Whether or not to keep missing as a separate category.

Type

bool

model_type

Model type (classification or regression).

Type

str

p_value_threshold

Significance threshold for regrouping.

Type

float

regroup

Whether or not to regroup categories.

Type

bool

regroup_name

New name of the non-significant regrouped variables

Type

str

scale_contingency_table

Whether the contingency table should be scaled before applying the chi-squared test.

Type

bool

valid_keys = ['model_type', 'regroup', 'regroup_name', 'keep_missing', 'category_size_threshold', 'p_value_threshold', 'scale_contingency_table', 'forced_categories']
attributes_to_dict() dict[source]

Return the attributes of CategoricalDataProcessor as a dictionary.

Returns

Contains the attributes of CategoricalDataProcessor instance with the attribute name as key.

Return type

dict

set_attributes_from_dict(params: dict)[source]

Set instance attributes from a dictionary of values with key the name of the attribute.

Parameters

params (dict) – Contains the attributes of CategoricalDataProcessor with their names as key.

Raises

ValueError – In case _cleaned_categories_by_column is not of type dict.

fit(data: pandas.core.frame.DataFrame, column_names: list, target_column: str)[source]

Fit the CategoricalDataProcessor.

Parameters
  • data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.

  • column_names (list) – Columns of data to be processed.

  • target_column (str) – Column name of the target.

transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame[source]

Transform the data.

Parameters
  • data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.

  • column_names (list) – Columns of data to be processed.

Returns

Data with additional transformed variables.

Return type

pd.DataFrame

fit_transform(data: pandas.core.frame.DataFrame, column_names: list, target_column: str) pandas.core.frame.DataFrame[source]

Fits the data, then transforms it.

Parameters
  • data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.

  • column_names (list) – Columns of data to be processed.

  • target_column (str) – Column name of the target.

Returns

Data with additional transformed variables.

Return type

pd.DataFrame
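
A minimal usage sketch with hypothetical toy data; the small threshold values are chosen only to make the toy example meaningful:

    import pandas as pd
    from cobra.preprocessing import CategoricalDataProcessor

    # Hypothetical basetable with a categorical variable and a binary target
    df = pd.DataFrame({
        "region": ["north", "north", "south", "south", "east", "west", None, "north"],
        "target": [1, 0, 1, 1, 0, 0, 1, 0],
    })

    processor = CategoricalDataProcessor(
        model_type="classification",
        category_size_threshold=2,
        p_value_threshold=0.05,
    )
    processed = processor.fit_transform(df, column_names=["region"],
                                        target_column="target")
    print(processed.head())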

class cobra.preprocessing.PreProcessor(categorical_data_processor: cobra.preprocessing.categorical_data_processor.CategoricalDataProcessor, discretizer: cobra.preprocessing.kbins_discretizer.KBinsDiscretizer, target_encoder: cobra.preprocessing.target_encoder.TargetEncoder, is_fitted: bool = False)[source]

Bases: sklearn.base.BaseEstimator

This class implements a so-called facade pattern to define a higher-level interface to work with the CategoricalDataProcessor, KBinsDiscretizer and TargetEncoder classes, so that their fit and transform methods are called in the correct order.

Additionally, it provides methods such as (de)serialization to/from JSON so that preprocessing pipelines can be stored and reloaded, for example for scoring.

We refer to the README of the GitHub repository for more background information on the preprocessing methodology.

categorical_data_processor

Instance of CategoricalDataProcessor to do the preprocessing of categorical variables.

Type

CategoricalDataProcessor

discretizer

Instance of KBinsDiscretizer to do the preprocessing of continuous variables by means of discretization.

Type

KBinsDiscretizer

target_encoder

Instance of TargetEncoder to do the incidence replacement.

Type

TargetEncoder

is_fitted

Whether or not object is yet fit.

Type

bool

model_type

The model_type variable as specified in CategoricalDataProcessor (classification or regression).

Type

str

classmethod from_params(model_type: str = 'classification', n_bins: int = 10, strategy: str = 'quantile', closed: str = 'right', auto_adapt_bins: bool = False, starting_precision: int = 0, label_format: str = '{} - {}', change_endpoint_format: bool = False, regroup: bool = True, regroup_name: str = 'Other', keep_missing: bool = True, category_size_threshold: int = 5, p_value_threshold: float = 0.001, scale_contingency_table: bool = True, forced_categories: dict = {}, weight: float = 0.0, imputation_strategy: str = 'mean')[source]

Constructor to instantiate PreProcessor from all the parameters that can be set in its required (attribute) classes, along with sensible default values.

Parameters
  • model_type (str) – Model type (classification or regression).

  • n_bins (int, optional) – Number of bins to produce. Raises ValueError if n_bins < 2.

  • strategy (str, optional) – Binning strategy. Currently only uniform and quantile (i.e. equifrequency binning) are supported.

  • closed (str, optional) – Whether to close the bins (intervals) from the left or right.

  • auto_adapt_bins (bool, optional) – Reduces the number of bins (starting from n_bins) as a function of the number of missing values.

  • starting_precision (int, optional) – Initial precision for the bin edges to start from; can also be negative. Given a list of bin edges, the class will automatically choose the minimal precision required to have proper bins, e.g. [5.5555, 5.5744, ...] will be rounded to [5.56, 5.57, ...]. In case of a negative number, the bin edges will be rounded to the nearest ten, hundred, ... e.g. 5.55 -> 10, 146 -> 100, …

  • label_format (str, optional) – Format string to display the bin labels e.g. min - max, (min, max], …

  • change_endpoint_format (bool, optional) – Whether or not to change the format of the lower and upper bins into < x and > y resp.

  • regroup (bool) – Whether or not to regroup categories.

  • regroup_name (str) – New name of the non-significant regrouped variables.

  • keep_missing (bool) – Whether or not to keep missing as a separate category.

  • category_size_threshold (int) – All categories with a size (corrected for incidence if applicable) in the training set above this threshold are kept as a separate category, if statistical significance w.r.t. target is detected. Remaining categories are converted into Other (or else, cf. regroup_name).

  • p_value_threshold (float) – Significance threshold for regrouping.

  • forced_categories (dict) – Map to prevent certain categories from being grouped into Other for each column - dict of the form {col:[forced vars]}.

  • scale_contingency_table (bool) – Whether the contingency table should be scaled before applying the chi-squared test.

  • weight (float, optional) – Smoothing parameter (non-negative). The higher the value of the parameter, the bigger the contribution of the overall mean. When set to zero, there is no smoothing (i.e. the pure target incidence is used).

  • imputation_strategy (str, optional) – In case there is a particular column which contains new categories, the encoding will lead to NULL values which should be imputed. Valid strategies are to replace with the global mean of the train set or the min (resp. max) incidence of the categories of that particular variable.

Returns

Class encapsulating CategoricalDataProcessor, KBinsDiscretizer, and TargetEncoder instances.

Return type

PreProcessor

classmethod from_pipeline(pipeline: dict)[source]

Constructor to instantiate PreProcessor from a (fitted) pipeline which was stored as a JSON file and passed to this function as a dict.

Parameters

pipeline (dict) – The (fitted) pipeline as a dictionary.

Returns

Instance of PreProcessor instantiated from a stored pipeline.

Return type

PreProcessor

Raises

ValueError – If the loaded pipeline does not contain exactly the required parameters (none missing, none extra).

fit(train_data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list, target_column_name: str)[source]

Fit the preprocessing pipeline to the data.

Parameters
  • train_data (pd.DataFrame) – Data to be preprocessed.

  • continuous_vars (list) – List of continuous variables.

  • discrete_vars (list) – List of discrete variables.

  • target_column_name (str) – Column name of the target.

transform(data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list) pandas.core.frame.DataFrame[source]

Transform the data by applying the preprocessing pipeline.

Parameters
  • data (pd.DataFrame) – Data to be preprocessed.

  • continuous_vars (list) – List of continuous variables.

  • discrete_vars (list) – List of discrete variables.

Returns

Transformed (preprocessed) data.

Return type

pd.DataFrame

Raises

NotFittedError – In case PreProcessor was not fitted first.

fit_transform(train_data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list, target_column_name: str) pandas.core.frame.DataFrame[source]

Fit preprocessing pipeline and transform the data.

Parameters
  • train_data (pd.DataFrame) – Data to be preprocessed

  • continuous_vars (list) – List of continuous variables.

  • discrete_vars (list) – List of discrete variables.

  • target_column_name (str) – Column name of the target.

Returns

Transformed (preprocessed) data.

Return type

pd.DataFrame
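
A minimal end-to-end sketch using from_params and fit_transform; the basetable, its column names and the small parameter values are hypothetical toy choices:

    import pandas as pd
    from cobra.preprocessing import PreProcessor

    # Hypothetical basetable: "age" is continuous, "region" is categorical
    basetable = pd.DataFrame({
        "age": [23, 35, 47, 51, 62, 29, 44, 58],
        "region": ["north", "south", "south", "east",
                   "west", "north", "east", "south"],
        "target": [0, 1, 1, 0, 1, 0, 1, 0],
    })

    preprocessor = PreProcessor.from_params(model_type="classification", n_bins=3)
    preprocessed = preprocessor.fit_transform(
        train_data=basetable,
        continuous_vars=["age"],
        discrete_vars=["region"],
        target_column_name="target",
    )
    print(preprocessed.columns)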

static train_selection_validation_split(data: pandas.core.frame.DataFrame, train_prop: float = 0.6, selection_prop: float = 0.2, validation_prop: float = 0.2) pandas.core.frame.DataFrame[source]

Adds a split column with train/selection/validation values to the dataset.

  • Train set = data on which the model is trained and on which the encoding is based.

  • Selection set = data used for univariate and forward feature selection. Often called the validation set.

  • Validation set = data that generates the final performance metrics. Often called the test set.

Parameters
  • data (pd.DataFrame) – Input dataset to split into train-selection and validation sets.

  • train_prop (float, optional) – Proportion of the data to put in the train set.

  • selection_prop (float, optional) – Proportion of the data to put in the selection set.

  • validation_prop (float, optional) – Proportion of the data to put in the validation set.

Returns

DataFrame with additional split column.

Return type

pd.DataFrame
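
A short sketch, continuing with the hypothetical basetable from the PreProcessor example above (the name of the added column, split, follows from the description above):

    from cobra.preprocessing import PreProcessor

    # Add a split column with train/selection/validation values (60/20/20 here)
    basetable = PreProcessor.train_selection_validation_split(
        basetable, train_prop=0.6, selection_prop=0.2, validation_prop=0.2)
    print(basetable["split"].value_counts())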

serialize_pipeline() dict[source]

Serialize the preprocessing pipeline by writing all its required parameters to a dictionary to later store it as a JSON file.

Returns

Return the pipeline as a dictionary.

Return type

dict
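
A short round-trip sketch, assuming preprocessor is the fitted PreProcessor from the example above and that the pipeline is stored as a plain JSON file (the file name is arbitrary):

    import json
    from cobra.preprocessing import PreProcessor

    # Store the fitted preprocessing pipeline as JSON ...
    with open("pipeline.json", "w") as f:
        json.dump(preprocessor.serialize_pipeline(), f)

    # ... and reload it later, e.g. in a scoring environment
    with open("pipeline.json", "r") as f:
        preprocessor = PreProcessor.from_pipeline(json.load(f))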

cobra.model_building module

cobra.model_building.compute_univariate_preselection(target_enc_train_data: pandas.core.frame.DataFrame, target_enc_selection_data: pandas.core.frame.DataFrame, predictors: list, target_column: str, model_type: str = 'classification', preselect_auc_threshold: float = 0.053, preselect_rmse_threshold: float = 5, preselect_overtrain_threshold: float = 0.05) pandas.core.frame.DataFrame[source]

Perform a preselection of predictors based on an AUC (in case of classification) or a RMSE (in case of regression) threshold of a univariate model on a train and selection dataset and return a DataFrame containing for each variable the train and selection AUC or RMSE along with a boolean “preselection” column.

As the AUC only measures the quality of a ranking, all monotonic transformations of a given ranking (i.e. transformations that do not alter the ranking itself) lead to the same AUC. Hence, pushing a categorical variable (incl. a binned continuous variable) through a logistic regression produces exactly the same ranking as pushing it through incidence replacement (i.e. target encoding), since both yield a ranking of the categories on the training set. Therefore, no univariate model is trained here; the target-encoded train and selection data must be used as inputs for this function, and these encoded values are used as predicted scores to compute the AUC against the target.

Parameters
  • model_type (str) – Model type (“classification” or “regression”).

  • target_enc_train_data (pd.DataFrame) – Train data.

  • target_enc_selection_data (pd.DataFrame) – Selection data.

  • predictors (list) – List of predictors (e.g. column names in the train set and selection data sets).

  • target_column (str) – Name of the target column.

  • preselect_auc_threshold (float, optional) – Threshold on min. AUC to select predictor. Ignored if model_type is “regression”.

  • preselect_rmse_threshold (float, optional) – Threshold on max. RMSE to select predictor. Ignored if model_type is “classification”. It is important to note that the threshold depends heavily on the scale of the target variable, and should be modified accordingly.

  • preselect_overtrain_threshold (float, optional) – Threshold on the difference between train and selection AUC or RMSE (in case of the latter, as a proportion).

Returns

DataFrame containing for each variable the train AUC or RMSE and selection AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.

Return type

pd.DataFrame

cobra.model_building.get_preselected_predictors(df_metric: pandas.core.frame.DataFrame) list[source]

Wrapper function to extract a list of predictors from df_metric.

Parameters

df_metric (pd.DataFrame) – DataFrame containing for each variable the train AUC or RMSE and test AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.

Returns

List of preselected predictors.

Return type

list
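
A hedged sketch of the preselection step; it assumes a basetable that has been preprocessed and split as in the cobra.preprocessing examples above, and the target-encoded column names ("age_enc", "region_enc") are assumptions for the sake of the example:

    from cobra.model_building import (compute_univariate_preselection,
                                      get_preselected_predictors)

    # Hypothetical target-encoded train and selection sets
    train_data = basetable[basetable["split"] == "train"]
    selection_data = basetable[basetable["split"] == "selection"]

    df_metric = compute_univariate_preselection(
        target_enc_train_data=train_data,
        target_enc_selection_data=selection_data,
        predictors=["age_enc", "region_enc"],
        target_column="target",
        model_type="classification",
        preselect_auc_threshold=0.053,
        preselect_overtrain_threshold=0.05,
    )
    preselected = get_preselected_predictors(df_metric)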

cobra.model_building.compute_correlations(target_enc_train_data: pandas.core.frame.DataFrame, predictors: list) pandas.core.frame.DataFrame[source]

Given a DataFrame and a list of predictors, compute the correlations amongst the predictors in the DataFrame.

Parameters
  • target_enc_train_data (pd.DataFrame) – Data to compute correlation.

  • predictors (list) – List of column names of the DataFrame between which to compute the correlation matrix.

Returns

The correlation matrix of the training set.

Return type

pd.DataFrame
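
A one-line sketch, reusing the hypothetical train_data and predictor names from the example above; the resulting matrix can be passed to cobra.evaluation.plot_correlation_matrix (documented further below):

    from cobra.model_building import compute_correlations

    df_corr = compute_correlations(train_data, predictors=["age_enc", "region_enc"])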

class cobra.model_building.LogisticRegressionModel[source]

Bases: object

Wrapper around the LogisticRegression class, with additional methods implemented such as evaluation (using AUC), getting a list of coefficients, a dictionary of coefficients per predictor, … for convenience.

logit

scikit-learn logistic regression model.

Type

LogisticRegression

predictors

List of predictors used in the model.

Type

list

serialize() dict[source]

Serialize model as JSON.

Returns

Dictionary containing the serialized JSON.

Return type

dict

deserialize(model_dict: dict)[source]

Deserialize a model previously stored as JSON.

Parameters

model_dict (dict) – Serialized JSON file as a dict.

Raises

ValueError – In case JSON file is no valid serialized model.

get_coef() numpy.array[source]

Returns the model coefficients.

Returns

Array of model coefficients.

Return type

np.array

get_intercept() float[source]

Returns the intercept of the model.

Returns

Intercept of the model.

Return type

float

get_coef_by_predictor() dict[source]

Returns a dictionary mapping predictor (key) to coefficient (value).

Returns

A map {predictor: coefficient}.

Return type

dict

fit(X_train: pandas.core.frame.DataFrame, y_train: pandas.core.series.Series)[source]

Fit the model.

Parameters
  • X_train (pd.DataFrame) – Predictors of train data.

  • y_train (pd.Series) – Target of train data.

score_model(X: pandas.core.frame.DataFrame) numpy.ndarray[source]

Score a model on a (new) dataset.

Parameters

X (pd.DataFrame) – Dataset of predictors to score the model.

Returns

Score (i.e. predicted probabilities) of the model for each observation.

Return type

np.ndarray

evaluate(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, split: Optional[str] = None, metric: Optional[Callable] = None) float[source]

Evaluate the model on a given dataset (X, y). The optional split parameter indicates which split the dataset belongs to (train, selection, or validation), so that computations on these sets can be cached.

Parameters
  • X (pd.DataFrame) – Dataset containing the predictor values for each observation.

  • y (pd.Series) – Dataset containing the target of each observation.

  • split (str, optional) – Split name of the dataset (e.g. “train”, “selection”, or “validation”).

  • metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (AUC). The function should require y_true and y_pred (binary output) arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

Returns

The performance score of the model (AUC by default).

Return type

float

compute_variable_importance(data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]

Compute the importance of each predictor in the model and return it as a DataFrame.

Parameters

data (pd.DataFrame) – Data to score the model.

Returns

DataFrame containing columns predictor and importance.

Return type

pd.DataFrame
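
A hedged usage sketch, reusing the hypothetical target-encoded train_data and selection_data from the preselection example above:

    from cobra.model_building import LogisticRegressionModel

    predictors = ["age_enc", "region_enc"]

    model = LogisticRegressionModel()
    model.fit(X_train=train_data[predictors], y_train=train_data["target"])

    # Predicted probabilities on the selection set
    scores = model.score_model(selection_data[predictors])

    # AUC on the selection set (cached per split name)
    auc = model.evaluate(selection_data[predictors], selection_data["target"],
                         split="selection")
    print(model.get_coef_by_predictor(), auc)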

class cobra.model_building.LinearRegressionModel[source]

Bases: object

Wrapper around the LinearRegression class, with additional methods implemented such as evaluation (using RMSE), getting a list of coefficients, a dictionary of coefficients per predictor, … for convenience.

linear

scikit-learn linear regression model.

Type

LinearRegression

predictors

List of predictors used in the model.

Type

list

serialize() dict[source]

Serialize model as JSON.

Returns

Dictionary containing the serialized JSON.

Return type

dict

deserialize(model_dict: dict)[source]

Deserialize a model previously stored as JSON.

Parameters

model_dict (dict) – Serialized JSON file as a dict.

Raises

ValueError – In case JSON file is no valid serialized model.

get_coef() numpy.array[source]

Returns the model coefficients.

Returns

Array of model coefficients.

Return type

np.array

get_intercept() float[source]

Returns the intercept of the model.

Returns

Intercept of the model.

Return type

float

get_coef_by_predictor() dict[source]

Returns a dictionary mapping predictor (key) to coefficient (value).

Returns

A map {predictor: coefficient}.

Return type

dict

fit(X_train: pandas.core.frame.DataFrame, y_train: pandas.core.series.Series)[source]

Fit the model.

Parameters
  • X_train (pd.DataFrame) – Predictors of train data.

  • y_train (pd.Series) – Target of train data.

score_model(X: pandas.core.frame.DataFrame) numpy.ndarray[source]

Score a model on a (new) dataset.

Parameters

X (pd.DataFrame) – Dataset of predictors to score the model.

Returns

Score of the model for each observation.

Return type

np.ndarray

evaluate(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, split: Optional[str] = None, metric: Optional[Callable] = None) float[source]

Evaluate the model on a given dataset (X, y). The optional split parameter indicates which split the dataset belongs to (train, selection, or validation), so that computations on these sets can be cached.

Parameters
  • X (pd.DataFrame) – Dataset containing the predictor values for each observation.

  • y (pd.Series) – Dataset containing the target of each observation.

  • split (str, optional) – Split name of the dataset (e.g. “train”, “selection”, or “validation”).

  • metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (RMSE). The function should require y_true and y_pred arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

Returns

The performance score of the model (RMSE by default).

Return type

float

compute_variable_importance(data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame[source]

Compute the importance of each predictor in the model and return it as a DataFrame.

Parameters

data (pd.DataFrame) – Data to score the model.

Returns

DataFrame containing columns predictor and importance.

Return type

pd.DataFrame

class cobra.model_building.ForwardFeatureSelection(model_type: str = 'classification', max_predictors: int = 50, pos_only: bool = True)[source]

Bases: object

Perform forward feature selection for a given dataset using a given algorithm.

Predictors are sequentially added to the model, starting with the one that has the highest univariate predictive power, and then proceeding with those that jointly lead to the best fit, optimizing for selection AUC or RMSE. Interaction effects are not explicitly modeled, yet they are implicitly present given the feature selection and the underlying feature correlation structure.

model_type

Model type (classification or regression).

Type

str

MLModel

LogisticRegressionModel or LinearRegressionModel.

Type

Cobra model

max_predictors

Maximum number of predictors allowed in any model. This roughly corresponds to the maximum number of steps in the forward feature selection.

Type

int

pos_only

Whether or not the model coefficients should all be positive (no sign flips).

Type

bool

_fitted_models

List of fitted models.

Type

list

get_model_from_step(step: int)[source]

Get fitted model from a particular step.

Parameters

step (int) – Particular step in the forward selection.

Returns

Fitted model from the given step.

Return type

self.MLModel

Raises

ValueError – In case step is larger than the number of available models.

compute_model_performances(data: pandas.core.frame.DataFrame, target_column_name: str, splits: list = ['train', 'selection', 'validation'], metric: Optional[Callable] = None) pandas.core.frame.DataFrame[source]

Compute for each model the performance for different sets (e.g. train-selection-validation) and return them along with a list of predictors used in the model. Note that the computation of the performance for each split is cached inside the model itself, so it is inexpensive to perform it multiple times.

Parameters
  • data (pd.DataFrame) – Dataset for which to compute performance of each model.

  • target_column_name (str) – Name of the target column.

  • splits (list, optional) – List of splits to compute performance on.

  • metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (AUC for classification, RMSE for regression). The function should require y_true and y_pred arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

Returns

Contains for each model the performance for train, selection and validation sets as well as the set of predictors used in this model.

Return type

pd.DataFrame

fit(train_data: pandas.core.frame.DataFrame, target_column_name: str, predictors: list, forced_predictors: list = [], excluded_predictors: list = [])[source]

Fit the forward feature selection estimator.

Parameters
  • train_data (pd.DataFrame) – Data on which to fit the model. Should include a “train” and “selection” split for correct model selection! The “train” split is used to train a model, the “selection” split is used to evaluate which model to include in the actual forward feature selection.

  • target_column_name (str) – Name of the target column.

  • predictors (list) – List of predictors on which to train the estimator.

  • forced_predictors (list, optional) – List of predictors to force in the estimator.

  • excluded_predictors (list, optional) – List of predictors to exclude from the estimator.

Raises

ValueError – In case the number of forced predictors is larger than the maximum number of allowed predictors in the model.
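
A hedged sketch of a forward feature selection run; it assumes a basetable containing target-encoded predictors, a binary target and a split column with train/selection/validation values, as in the earlier examples:

    from cobra.model_building import ForwardFeatureSelection

    forward_selection = ForwardFeatureSelection(model_type="classification",
                                                max_predictors=10,
                                                pos_only=True)
    forward_selection.fit(train_data=basetable,
                          target_column_name="target",
                          predictors=["age_enc", "region_enc"])

    # Performance of every candidate model on each split
    performances = forward_selection.compute_model_performances(basetable, "target")

    # Retrieve the fitted model from one particular step of the selection
    best_model = forward_selection.get_model_from_step(1)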

cobra.evaluation module

cobra.evaluation.generate_pig_tables(basetable: pandas.core.frame.DataFrame, id_column_name: str, target_column_name: str, preprocessed_predictors: list) pandas.core.frame.DataFrame[source]

Compute PIG tables for all predictors in preprocessed_predictors.

The output is a DataFrame with columns variable, label, pop_size, global_avg_target and avg_target.

Parameters
  • basetable (pd.DataFrame) – Basetable to compute PIG tables from.

  • id_column_name (str) – Name of the basetable column containing the IDs of the basetable rows (e.g. customernumber).

  • target_column_name (str) – Name of the basetable column containing the target values to predict.

  • preprocessed_predictors (list) – List of basetable column names containing preprocessed predictors.

Returns

DataFrame containing a PIG table for all predictors.

Return type

pd.DataFrame

cobra.evaluation.compute_pig_table(basetable: pandas.core.frame.DataFrame, predictor_column_name: str, target_column_name: str, id_column_name: str) pandas.core.frame.DataFrame[source]

Compute the PIG table of a given predictor for a given target.

Parameters
  • basetable (pd.DataFrame) – Input data from which to compute the pig table.

  • predictor_column_name (str) – Predictor name of which to compute the pig table.

  • target_column_name (str) – Name of the target variable.

  • id_column_name (str) – Name of the id column (used to count population size).

Returns

PIG table as a DataFrame

Return type

pd.DataFrame

cobra.evaluation.plot_incidence(pig_tables: pandas.core.frame.DataFrame, variable: str, model_type: str, column_order: Optional[list] = None, dim: tuple = (12, 8))[source]

Plots a Predictor Insights Graph (PIG), a graph in which the mean target value is plotted for a number of bins constructed from a predictor variable. When the target is a binary classification target, the plotted mean target value is a true incidence rate.

Bins are ordered in descending order of mean target value unless specified otherwise with the column_order list.

Parameters
  • pig_tables (pd.DataFrame) – Dataframe with cleaned, binned, partitioned and prepared data, as created by generate_pig_tables() from this module.

  • variable (str) – Name of the predictor variable for which the PIG will be plotted.

  • model_type (str) – Type of model (either “classification” or “regression”).

  • column_order (list, default=None) – Explicit order of the value bins of the predictor variable to be used on the PIG.

  • dim (tuple, default=(12, 8)) – Optional tuple to configure the width and length of the plot.
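
A hedged sketch combining generate_pig_tables and plot_incidence; the id column and the names of the preprocessed (binned/regrouped) predictor columns are assumptions for the sake of the example:

    from cobra.evaluation import generate_pig_tables, plot_incidence

    pig_tables = generate_pig_tables(
        basetable=basetable,
        id_column_name="customer_id",
        target_column_name="target",
        preprocessed_predictors=["age_bin", "region_processed"],
    )
    plot_incidence(pig_tables, variable="age_bin", model_type="classification")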

cobra.evaluation.plot_performance_curves(model_performance: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None, colors: dict = {'selection': '#ff9500', 'train': '#0099bf', 'validation': '#8064a2'}, metric_name: Optional[str] = None)[source]

Plot performance curves generated by the forward feature selection for the train-selection-validation sets.

Parameters
  • model_performance (pd.DataFrame) – Contains train-selection-validation performance for each model trained in the forward feature selection.

  • dim (tuple, optional) – Width and length of the plot.

  • path (str, optional) – Path to store the figure.

  • colors (dict, optional) – Map with colors for train-selection-validation curves.

  • metric_name (str, optional) – Name to indicate the metric used in model_performance. Defaults to RMSE in case of regression and AUC in case of classification.

cobra.evaluation.plot_variable_importance(df_variable_importance: pandas.core.frame.DataFrame, title: Optional[str] = None, dim: tuple = (12, 8), path: Optional[str] = None)[source]

Plot variable importance of a given model.

Parameters
  • df_variable_importance (pd.DataFrame) – DataFrame containing columns predictor and importance.

  • title (str, optional) – Title of the plot.

  • dim (tuple, optional) – Width and length of the plot.

  • path (str, optional) – Path to store the figure.

cobra.evaluation.plot_univariate_predictor_quality(df_metric: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None)[source]

Plot univariate quality of the predictors.

Parameters
  • df_metric (pd.DataFrame) – DataFrame containing for each variable the train AUC or RMSE and test AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.

  • dim (tuple, optional) – Width and length of the plot.

  • path (str, optional) – Path to store the figure.

cobra.evaluation.plot_correlation_matrix(df_corr: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None)[source]

Plot correlation matrix amongst the predictors.

Parameters
  • df_corr (pd.DataFrame) – Correlation matrix.

  • dim (tuple, optional) – Width and length of the plot.

  • path (str, optional) – Path to store the figure.
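
A short sketch tying these plotting helpers to the objects produced in the cobra.model_building examples above (df_metric, df_corr, performances and best_model are the hypothetical results from those sketches):

    from cobra.evaluation import (plot_univariate_predictor_quality,
                                  plot_correlation_matrix,
                                  plot_performance_curves,
                                  plot_variable_importance)

    plot_univariate_predictor_quality(df_metric)
    plot_correlation_matrix(df_corr)
    plot_performance_curves(performances, metric_name="AUC")

    importance = best_model.compute_variable_importance(basetable)
    plot_variable_importance(importance, title="Variable importance")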

class cobra.evaluation.ClassificationEvaluator(probability_cutoff: Optional[float] = None, lift_at: float = 0.05, n_bins: int = 10)[source]

Bases: object

Evaluator class encapsulating classification model metrics and plotting functionality.

y_true

True binary target data labels.

Type

np.ndarray

y_pred

Target scores of the model.

Type

np.ndarray

confusion_matrix

Confusion matrix computed for a particular cut-off.

Type

np.ndarray

cumulative_gains

Data for plotting cumulative gains curve.

Type

tuple

evaluation_metrics

Map containing various scalar evaluation metrics (precision, recall, accuracy, AUC, F1, etc.).

Type

dict

lift_at

Parameter to determine the top percentage of the population at which the lift of the model should be computed.

Type

float

lift_curve

Data for plotting lift curve(s).

Type

tuple

probability_cutoff

Probability cut off to convert probability scores to a binary score.

Type

float

roc_curve

Map containing the true positive rate and false positive rate at various thresholds (the thresholds themselves are also included).

Type

dict

n_bins

Defines the number of bins used to calculate the lift curve (by default 10, i.e. deciles).

Type

int, optional

fit(y_true: numpy.ndarray, y_pred: numpy.ndarray)[source]

Fit the evaluator by computing the relevant evaluation metrics on the inputs.

Parameters
  • y_true (np.ndarray) – True labels.

  • y_pred (np.ndarray) – Model scores (as probability).

plot_roc_curve(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot ROC curve of the model.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

plot_confusion_matrix(path: Optional[str] = None, dim: tuple = (12, 8), labels: list = ['0', '1'])[source]

Plot the confusion matrix.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

  • labels (list, optional) – Optional list of labels, default “0” and “1”.

plot_cumulative_response_curve(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot cumulative response curve.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

plot_lift_curve(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot lift per decile.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

plot_cumulative_gains(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot cumulative gains per decile.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.
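
A minimal sketch on synthetic labels and scores (the random data is purely illustrative):

    import numpy as np
    from cobra.evaluation import ClassificationEvaluator

    rng = np.random.default_rng(42)
    y_true = rng.integers(0, 2, size=500)
    # Noisy probability scores that correlate with the true labels
    y_pred = np.clip(0.5 * y_true + 0.5 * rng.random(500), 0.0, 1.0)

    evaluator = ClassificationEvaluator(lift_at=0.05, n_bins=10)
    evaluator.fit(y_true, y_pred)

    print(evaluator.evaluation_metrics)   # precision, recall, accuracy, AUC, F1, ...
    evaluator.plot_roc_curve()
    evaluator.plot_lift_curve()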

class cobra.evaluation.RegressionEvaluator[source]

Bases: object

Evaluator class encapsulating regression model metrics and plotting functionality.

y_true

True target values.

Type

np.ndarray

y_pred

Predicted target values of the model.

Type

np.ndarray

scalar_metrics

Map containing various scalar evaluation metrics (R-squared, MAE, MSE, RMSE)

Type

dict

qq

Theoretical quantiles and associated actual residuals.

Type

pd.Series

fit(y_true: numpy.ndarray, y_pred: numpy.ndarray)[source]

Fit the evaluator by computing the relevant evaluation metrics on the inputs.

Parameters
  • y_true (np.ndarray) – True labels.

  • y_pred (np.ndarray) – Model scores.

plot_predictions(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Plot predictions from the model against actual values.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.

plot_qq(path: Optional[str] = None, dim: tuple = (12, 8))[source]

Display a Q-Q plot from the standardized prediction residuals.

Parameters
  • path (str, optional) – Path to store the figure.

  • dim (tuple, optional) – Tuple with width and length of the plot.
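
A minimal sketch on synthetic values (purely illustrative):

    import numpy as np
    from cobra.evaluation import RegressionEvaluator

    rng = np.random.default_rng(0)
    y_true = rng.normal(loc=100, scale=20, size=500)
    y_pred = y_true + rng.normal(scale=10, size=500)   # hypothetical model output

    evaluator = RegressionEvaluator()
    evaluator.fit(y_true, y_pred)

    print(evaluator.scalar_metrics)   # R-squared, MAE, MSE, RMSE
    evaluator.plot_predictions()
    evaluator.plot_qq()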

cobra.utils module

cobra.utils.clean_predictor_name(predictor_name: str) str[source]

Strip the redundant suffix (e.g. “_enc” or “_bin”) from the end of the predictor name and return a clean version of the predictor name.
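
For example (expected outputs shown as comments):

    from cobra.utils import clean_predictor_name

    print(clean_predictor_name("age_enc"))     # expected: "age"
    print(clean_predictor_name("region_bin"))  # expected: "region"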