Module contents
cobra.preprocessing module
- class cobra.preprocessing.KBinsDiscretizer(n_bins: int = 10, strategy: str = 'quantile', closed: str = 'right', auto_adapt_bins: bool = False, starting_precision: int = 0, label_format: str = '{} - {}', change_endpoint_format: bool = False)[source]
Bases:
sklearn.base.BaseEstimator
Bin continuous data into intervals of predefined size. It provides a way to partition continuous data into discrete values, i.e. transform continuous data into nominal data. This can make a linear model more expressive as it introduces nonlinearity to the model, while maintaining the interpretability of the model afterwards.
This module is a rework of https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/_discretization.py, though it is purely written in pandas instead of numpy because it is more intuitive. It also includes some custom modifications to align it with the Python Predictions methodology. See the README of the GitHub repository for more background information.
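A minimal usage sketch on toy data (the name of the generated output column is an assumption and may differ per cobra version):

```python
import pandas as pd

from cobra.preprocessing import KBinsDiscretizer

df = pd.DataFrame({"age": [18, 23, 30, 35, 42, 51, 60, 74]})

# Bin "age" into 4 equifrequency (quantile) bins, closed on the right.
discretizer = KBinsDiscretizer(n_bins=4, strategy="quantile", closed="right")
discretizer.fit(df, column_names=["age"])

# transform() returns the data with an additional discretized column,
# assumed here to be named "age_bin".
binned = discretizer.transform(df, column_names=["age"])
print(binned.head())
```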
- auto_adapt_bins
Reduces the number of bins (starting from n_bins) as a function of the number of missings.
- Type
bool
- change_endpoint_format
Whether or not to change the format of the lower and upper bins into <= x and > y respectively.
- Type
bool
- closed
Whether to close the bins (intervals) from the left or right
- Type
str
- label_format
Format string to display the bin labels, e.g. min - max, (min, max], …
- Type
str
- n_bins
Number of bins to produce. Raises ValueError if n_bins < 2. A warning is issued when a variable can only produce a lower number of bins than asked for.
- Type
int
- starting_precision
Initial precision for the bin edges to start from; can also be negative. Given a list of bin edges, the class will automatically choose the minimal precision required to have proper bins, e.g. [5.5555, 5.5744, ...] will be rounded to [5.56, 5.57, ...]. In case of a negative number, an attempt will be made to round up the numbers of the bin edges, e.g. 5.55 -> 10, 146 -> 100, …
- Type
int
- strategy
Binning strategy. Currently only uniform and quantile (i.e. equifrequency) are supported.
- Type
str
- valid_strategies = ('uniform', 'quantile')
- valid_keys = ['n_bins', 'strategy', 'closed', 'auto_adapt_bins', 'starting_precision', 'label_format', 'change_endpoint_format']
- attributes_to_dict() dict [source]
Return the attributes of KBinsDiscretizer in a dictionary
- Returns
Contains the attributes of KBinsDiscretizer instance with the names as keys
- Return type
dict
- set_attributes_from_dict(params: dict)[source]
Set instance attributes from a dictionary of values with key the name of the attribute.
- Parameters
params (dict) – Contains the attributes of KBinsDiscretizer with their names as key.
- Raises
ValueError – In case _bins_by_column is not of type dict
- fit(data: pandas.core.frame.DataFrame, column_names: list)[source]
Fits the estimator
- Parameters
data (pd.DataFrame) – Data to be discretized
column_names (list) – Names of the columns of the DataFrame to discretize
- transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame [source]
Discretizes the data in the given list of columns by mapping each number to the appropriate bin computed by the fit method
- Parameters
data (pd.DataFrame) – Data to be discretized
column_names (list) – Names of the columns of the DataFrame to discretize
- Returns
data with additional discretized variables
- Return type
pd.DataFrame
- fit_transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame [source]
Fit to data, then transform it.
- Parameters
data (pd.DataFrame) – Data to be discretized
column_names (list) – Names of the columns of the DataFrame to discretize
- Returns
data with additional discretized variables
- Return type
pd.DataFrame
- class cobra.preprocessing.TargetEncoder(weight: float = 0.0, imputation_strategy: str = 'mean')[source]
Bases:
sklearn.base.BaseEstimator
Target encoding for categorical features, inspired by http://contrib.scikit-learn.org/category_encoders/targetencoder.html.
Replace each value of the categorical feature with the average of the target values (in case of a binary target, this is the incidence of the group). This encoding scheme is also called Mean encoding.
Note that, when applying this target encoding, values of the categorical feature that have not been seen during fit will be imputed according to the configured imputation strategy (replacement with the mean, minimum or maximum value of the categorical variable).
The main problem with Target encoding is overfitting; the fact that we are encoding the feature based on target classes may lead to data leakage, rendering the feature biased. This can be solved using some type of regularization. A popular way to handle this is to use cross-validation and compute the means in each out-of-fold. However, the approach implemented here makes use of additive smoothing (https://en.wikipedia.org/wiki/Additive_smoothing).
In summary:
- With a binary classification target, a value of a categorical variable is replaced with:
[count(variable=value) * P(target=1|variable=value) + weight * P(target=1)] / [count(variable=value) + weight]
- With a regression target, a value of a categorical variable is replaced with:
[count(variable=value) * E(target|variable=value) + weight * E(target)] / [count(variable=value) + weight]
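As an illustration, the classification formula can be computed by hand on toy data. This mirrors the formula above, not necessarily cobra's internal implementation:

```python
import pandas as pd

df = pd.DataFrame({
    "colour": ["red", "red", "red", "blue", "blue", "green"],
    "target": [1, 1, 0, 0, 1, 1],
})
weight = 5.0

prior = df["target"].mean()  # P(target=1) over the whole training set
grouped = df.groupby("colour")["target"].agg(["count", "mean"])

# [count(value) * P(target=1|value) + weight * P(target=1)] / [count(value) + weight]
encoding = (grouped["count"] * grouped["mean"] + weight * prior) / (grouped["count"] + weight)
print(encoding)  # smoothed replacement value per category
```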
- imputation_strategy
In case a particular column contains new categories (not seen during fit), the encoding will lead to NULL values which should be imputed. Valid strategies are to replace the NULL values with the global mean of the train set or with the min (resp. max) incidence of the categories of that particular variable.
- Type
str
- weight
Smoothing parameter (non-negative). The higher the value of the parameter, the bigger the contribution of the overall mean of targets learnt from all training data (prior) and the smaller the contribution of the mean target learnt from data with the current categorical value (posterior), so the bigger the smoothing (regularization) effect. When set to zero, there is no smoothing (i.e. the mean target of the current categorical value is used).
- Type
float
- valid_imputation_strategies = ('mean', 'min', 'max')
- attributes_to_dict() dict [source]
Return the attributes of TargetEncoder in a dictionary.
- Returns
Contains the attributes of TargetEncoder instance with the names as keys.
- Return type
dict
- set_attributes_from_dict(params: dict)[source]
Set instance attributes from a dictionary of values with key the name of the attribute.
- Parameters
params (dict) – Contains the attributes of TargetEncoder with their names as key.
- fit(data: pandas.core.frame.DataFrame, column_names: list, target_column: str)[source]
Fit the TargetEncoder to the data.
- Parameters
data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.
column_names (list) – Columns of data to be encoded.
target_column (str) – Column name of the target.
- transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame [source]
Replace (e.g. encode) values of each categorical column with a new value (reflecting the corresponding average target value, optionally smoothed by a regularization weight), which was computed when the fit method was called.
- Parameters
data (pd.DataFrame) – Data to encode.
column_names (list) – Name of the categorical columns in the data to be encoded.
- Returns
The resulting transformed data.
- Return type
pd.DataFrame
- Raises
NotFittedError – Exception when TargetEncoder was not fitted before calling this method.
- fit_transform(data: pandas.core.frame.DataFrame, column_names: list, target_column: str) pandas.core.frame.DataFrame [source]
Fit the encoder and transform the data.
- Parameters
data (pd.DataFrame) – Data to be encoded.
column_names (list) – Columns of data to be encoded.
target_column (str) – Column name of the target.
- Returns
Data with additional columns, holding the target-encoded variables.
- Return type
pd.DataFrame
- class cobra.preprocessing.CategoricalDataProcessor(model_type: str = 'classification', regroup: bool = True, regroup_name: str = 'Other', keep_missing: bool = True, category_size_threshold: int = 5, p_value_threshold: float = 0.001, scale_contingency_table: bool = True, forced_categories: dict = {})[source]
Bases:
sklearn.base.BaseEstimator
Regroups the categories of categorical variables based on their significance w.r.t. the target variable.
This class implements the Python Predictions way of dealing with categorical data preprocessing. There are three steps involved:
1. An optional regrouping of the different categories, based on category size and significance of the category w.r.t. the target:
- For a given categorical variable, all categories below the (weighted) category size threshold are put into a rest category (by default Other).
- The remaining categories are subject to a statistical test; if there is sufficient dependence with the target variable compared to all other categories, the category is kept as-is, otherwise it is also put into the rest category.
- Beware: one can force categories to be kept, and if no single category passes the statistical test, the categorical variable is left unprocessed altogether.
2. Missing value replacement with the additional category Missing.
3. Change of dtype to category (which can lead to memory optimization).
See the README of the GitHub repository for more methodological background information.
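A minimal usage sketch on toy data (the naming of the output column is an assumption):

```python
import pandas as pd

from cobra.preprocessing import CategoricalDataProcessor

df = pd.DataFrame({
    "colour": ["red"] * 40 + ["blue"] * 40 + ["green"] * 15 + [None] * 5,
    "target": [1] * 30 + [0] * 50 + [1] * 15 + [0] * 5,
})

processor = CategoricalDataProcessor(
    model_type="classification",
    regroup=True,
    keep_missing=True,
    category_size_threshold=5,
)
processor.fit(df, column_names=["colour"], target_column="target")

# Small or non-significant categories end up in "Other", missings in
# "Missing"; the processed column is assumed to carry a suffix
# (e.g. "colour_processed").
processed = processor.transform(df, column_names=["colour"])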
- category_size_threshold
All categories with a size (corrected for incidence if applicable) in the training set above this threshold are kept as a separate category, if statistical significance w.r.t. the target is detected. Remaining categories are converted into Other (or else, cf. regroup_name).
- Type
int
- forced_categories
Map to prevent certain categories from being grouped into Other for each column; dict of the form {col: [forced vars]}.
- Type
dict
- keep_missing
Whether or not to keep missing as a separate category.
- Type
bool
- model_type
Model type (classification or regression).
- Type
str
- p_value_threshold
Significance threshold for regrouping.
- Type
float
- regroup
Whether or not to regroup categories.
- Type
bool
- regroup_name
New name of the non-significant regrouped variables
- Type
str
- scale_contingency_table
Whether contingency table should be scaled before chi^2.
- Type
bool
- valid_keys = ['model_type', 'regroup', 'regroup_name', 'keep_missing', 'category_size_threshold', 'p_value_threshold', 'scale_contingency_table', 'forced_categories']
- attributes_to_dict() dict [source]
Return the attributes of CategoricalDataProcessor as a dictionary.
- Returns
Contains the attributes of CategoricalDataProcessor instance with the attribute name as key.
- Return type
dict
- set_attributes_from_dict(params: dict)[source]
Set instance attributes from a dictionary of values with key the name of the attribute.
- Parameters
params (dict) – Contains the attributes of CategoricalDataProcessor with their names as key.
- Raises
ValueError – In case _cleaned_categories_by_column is not of type dict.
- fit(data: pandas.core.frame.DataFrame, column_names: list, target_column: str)[source]
Fit the CategoricalDataProcessor.
- Parameters
data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.
column_names (list) – Columns of data to be processed.
target_column (str) – Column name of the target.
- transform(data: pandas.core.frame.DataFrame, column_names: list) pandas.core.frame.DataFrame [source]
Transform the data.
- Parameters
data (pd.DataFrame) – Data to be processed.
column_names (list) – Columns of data to be processed.
- Returns
Data with additional transformed variables.
- Return type
pd.DataFrame
- fit_transform(data: pandas.core.frame.DataFrame, column_names: list, target_column: str) pandas.core.frame.DataFrame [source]
Fits the data, then transforms it.
- Parameters
data (pd.DataFrame) – Data used to compute the mapping to encode the categorical variables with.
column_names (list) – Columns of data to be processed.
target_column (str) – Column name of the target.
- Returns
Data with additional transformed variables.
- Return type
pd.DataFrame
- class cobra.preprocessing.PreProcessor(categorical_data_processor: cobra.preprocessing.categorical_data_processor.CategoricalDataProcessor, discretizer: cobra.preprocessing.kbins_discretizer.KBinsDiscretizer, target_encoder: cobra.preprocessing.target_encoder.TargetEncoder, is_fitted: bool = False)[source]
Bases:
sklearn.base.BaseEstimator
This class implements a so-called facade pattern to define a higher-level interface to work with the CategoricalDataProcessor, KBinsDiscretizer and TargetEncoder classes, so that their fit and transform methods are called in the correct order.
Additionally, it provides methods for (de)serialization to/from JSON so that preprocessing pipelines can be stored and reloaded, for example for scoring.
We refer to the README of the GitHub repository for more background information on the preprocessing methodology.
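A sketch of the typical end-to-end flow on synthetic data (column names are hypothetical; the name of the split column is assumed to be "split", as produced by train_selection_validation_split below):

```python
import numpy as np
import pandas as pd

from cobra.preprocessing import PreProcessor

rng = np.random.default_rng(0)
n = 200
basetable = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "income": rng.normal(3000, 800, size=n),
    "colour": rng.choice(["red", "blue", "green"], size=n),
    "target": rng.integers(0, 2, size=n),
})

# Add a split column and build a preprocessor with default settings.
basetable = PreProcessor.train_selection_validation_split(basetable)
preprocessor = PreProcessor.from_params(model_type="classification")

continuous_vars = ["age", "income"]
discrete_vars = ["colour"]

# Fit on the train split only, then transform the full basetable.
train_data = basetable[basetable["split"] == "train"]
preprocessor.fit(train_data, continuous_vars, discrete_vars,
                 target_column_name="target")
basetable = preprocessor.transform(basetable, continuous_vars, discrete_vars)
```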
- categorical_data_processor
Instance of CategoricalDataProcessor to do the preprocessing of categorical variables.
- Type
CategoricalDataProcessor
- discretizer
Instance of KBinsDiscretizer to do the preprocessing of continuous variables by means of discretization.
- Type
KBinsDiscretizer
- target_encoder
Instance of TargetEncoder to do the incidence replacement.
- Type
TargetEncoder
- is_fitted
Whether or not the object is fitted yet.
- Type
bool
- model_type
The model_type variable as specified in CategoricalDataProcessor (classification or regression).
- Type
str
- classmethod from_params(model_type: str = 'classification', n_bins: int = 10, strategy: str = 'quantile', closed: str = 'right', auto_adapt_bins: bool = False, starting_precision: int = 0, label_format: str = '{} - {}', change_endpoint_format: bool = False, regroup: bool = True, regroup_name: str = 'Other', keep_missing: bool = True, category_size_threshold: int = 5, p_value_threshold: float = 0.001, scale_contingency_table: bool = True, forced_categories: dict = {}, weight: float = 0.0, imputation_strategy: str = 'mean')[source]
Constructor to instantiate PreProcessor from all the parameters that can be set in all its required (attribute) classes along with good default values.
- Parameters
model_type (str) – Model type (classification or regression).
n_bins (int, optional) – Number of bins to produce. Raises ValueError if n_bins < 2.
strategy (str, optional) – Binning strategy. Currently only uniform and quantile (i.e. equifrequency) are supported.
closed (str, optional) – Whether to close the bins (intervals) from the left or right.
auto_adapt_bins (bool, optional) – Reduces the number of bins (starting from n_bins) as a function of the number of missings.
starting_precision (int, optional) – Initial precision for the bin edges to start from; can also be negative. Given a list of bin edges, the class will automatically choose the minimal precision required to have proper bins, e.g. [5.5555, 5.5744, ...] will be rounded to [5.56, 5.57, ...]. In case of a negative number, an attempt will be made to round up the numbers of the bin edges, e.g. 5.55 -> 10, 146 -> 100, …
label_format (str, optional) – Format string to display the bin labels, e.g. min - max, (min, max], …
change_endpoint_format (bool, optional) – Whether or not to change the format of the lower and upper bins into <= x and > y respectively.
regroup (bool) – Whether or not to regroup categories.
regroup_name (str) – New name of the non-significant regrouped variables.
keep_missing (bool) – Whether or not to keep missing as a separate category.
category_size_threshold (int) – All categories with a size (corrected for incidence if applicable) in the training set above this threshold are kept as a separate category, if statistical significance w.r.t. the target is detected. Remaining categories are converted into Other (or else, cf. regroup_name).
p_value_threshold (float) – Significance threshold for regrouping.
forced_categories (dict) – Map to prevent certain categories from being grouped into Other for each column; dict of the form {col: [forced vars]}.
scale_contingency_table (bool) – Whether contingency table should be scaled before chi^2.
weight (float, optional) – Smoothing parameter (non-negative). The higher the value of the parameter, the bigger the contribution of the overall mean. When set to zero, there is no smoothing (i.e. the pure target incidence is used).
imputation_strategy (str, optional) – In case a particular column contains new categories, the encoding will lead to NULL values which should be imputed. Valid strategies are to replace with the global mean of the train set or the min (resp. max) incidence of the categories of that particular variable.
- Returns
Class encapsulating CategoricalDataProcessor, KBinsDiscretizer, and TargetEncoder instances.
- Return type
PreProcessor
- classmethod from_pipeline(pipeline: dict)[source]
Constructor to instantiate PreProcessor from a (fitted) pipeline which was stored as a JSON file and passed to this function as a dict.
- Parameters
pipeline (dict) – The (fitted) pipeline as a dictionary.
- Returns
Instance of PreProcessor instantiated from a stored pipeline.
- Return type
PreProcessor
- Raises
ValueError – If the loaded pipeline does not contain exactly the required parameters (no missing or extra ones).
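A sketch of reloading a stored pipeline (the file path is hypothetical; the JSON must hold the parameters of a previously fitted PreProcessor):

```python
import json

from cobra.preprocessing import PreProcessor

with open("pipeline.json") as f:  # hypothetical stored pipeline file
    pipeline = json.load(f)

preprocessor = PreProcessor.from_pipeline(pipeline)
# The reloaded preprocessor can now transform new data for scoring.
```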
- fit(train_data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list, target_column_name: str)[source]
Fit the data to the preprocessing pipeline.
- Parameters
train_data (pd.DataFrame) – Data to be preprocessed.
continuous_vars (list) – List of continuous variables.
discrete_vars (list) – List of discrete variables.
target_column_name (str) – Column name of the target.
- transform(data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list) pandas.core.frame.DataFrame [source]
Transform the data by applying the preprocessing pipeline.
- Parameters
data (pd.DataFrame) – Data to be preprocessed.
continuous_vars (list) – List of continuous variables.
discrete_vars (list) – List of discrete variables.
- Returns
Transformed (preprocessed) data.
- Return type
pd.DataFrame
- Raises
NotFittedError – In case PreProcessor was not fitted first.
- fit_transform(train_data: pandas.core.frame.DataFrame, continuous_vars: list, discrete_vars: list, target_column_name: str) pandas.core.frame.DataFrame [source]
Fit preprocessing pipeline and transform the data.
- Parameters
train_data (pd.DataFrame) – Data to be preprocessed
continuous_vars (list) – List of continuous variables.
discrete_vars (list) – List of discrete variables.
target_column_name (str) – Column name of the target.
- Returns
Transformed (preprocessed) data.
- Return type
pd.DataFrame
- static train_selection_validation_split(data: pandas.core.frame.DataFrame, train_prop: float = 0.6, selection_prop: float = 0.2, validation_prop: float = 0.2) pandas.core.frame.DataFrame [source]
Adds split column with train/selection/validation values to the dataset.
- Train set: data on which the model is trained and on which the encoding is based.
- Selection set: data used for univariate and forward feature selection; often called the validation set.
- Validation set: data that generates the final performance metrics; often called the test set.
- Parameters
data (pd.DataFrame) – Input dataset to split into train-selection and validation sets.
train_prop (float, optional) – Proportion of data to put in the train set.
selection_prop (float, optional) – Proportion of data to put in the selection set.
validation_prop (float, optional) – Proportion of data to put in the validation set.
- Returns
DataFrame with additional split column.
- Return type
pd.DataFrame
cobra.model_building module
- cobra.model_building.compute_univariate_preselection(target_enc_train_data: pandas.core.frame.DataFrame, target_enc_selection_data: pandas.core.frame.DataFrame, predictors: list, target_column: str, model_type: str = 'classification', preselect_auc_threshold: float = 0.053, preselect_rmse_threshold: float = 5, preselect_overtrain_threshold: float = 0.05) pandas.core.frame.DataFrame [source]
Perform a preselection of predictors based on an AUC (in case of classification) or RMSE (in case of regression) threshold of a univariate model on a train and selection dataset. Returns a DataFrame containing, for each variable, the train and selection AUC or RMSE along with a boolean "preselection" column.
As the AUC just measures the quality of a ranking, all monotonic transformations of a given ranking (i.e. transformations that do not alter the ranking itself) lead to the same AUC. Hence, pushing a categorical variable (incl. a binned continuous variable) through a logistic regression produces exactly the same ranking as pushing it through incidence replacement (i.e. target encoding): a ranking of the categories on the training set. Therefore, no univariate model is trained here; the target-encoded train and selection data must be used as inputs for this function and are used directly as predicted scores to compute the AUC against the target.
- Parameters
model_type (str) – Model type (“classification” or “regression”).
target_enc_train_data (pd.DataFrame) – Train data.
target_enc_selection_data (pd.DataFrame) – Selection data.
predictors (list) – List of predictors (i.e. column names in the train and selection data sets).
target_column (str) – Name of the target column.
preselect_auc_threshold (float, optional) – Threshold on min. AUC to select predictor. Ignored if model_type is “regression”.
preselect_rmse_threshold (float, optional) – Threshold on max. RMSE to select predictor. Ignored if model_type is “classification”. It is important to note that the threshold depends heavily on the scale of the target variable, and should be modified accordingly.
preselect_overtrain_threshold (float, optional) – Threshold on the difference between train and selection AUC or RMSE (in case of the latter, as a proportion).
- Returns
DataFrame containing for each variable the train AUC or RMSE and selection AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.
- Return type
pd.DataFrame
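To illustrate the ranking argument above (a univariate logistic regression cannot change the AUC obtained from a target-encoded predictor, since the sigmoid is a monotonic transformation), here is a small self-contained sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
encoded = rng.random(500)                    # a target-encoded predictor
y = (rng.random(500) < encoded).astype(int)  # binary target correlated with it

auc_direct = roc_auc_score(y, encoded)       # encoded values used as scores
logit = LogisticRegression().fit(encoded.reshape(-1, 1), y)
auc_model = roc_auc_score(y, logit.predict_proba(encoded.reshape(-1, 1))[:, 1])

assert np.isclose(auc_direct, auc_model)     # same ranking, same AUC
```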
- cobra.model_building.get_preselected_predictors(df_metric: pandas.core.frame.DataFrame) list [source]
Wrapper function to extract a list of predictors from df_metric.
- Parameters
df_metric (pd.DataFrame) – DataFrame containing for each variable the train AUC or RMSE and test AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.
- Returns
List of preselected predictors.
- Return type
list
- cobra.model_building.compute_correlations(target_enc_train_data: pandas.core.frame.DataFrame, predictors: list) pandas.core.frame.DataFrame [source]
Given a DataFrame and a list of predictors, compute the correlations amongst the predictors in the DataFrame.
- Parameters
target_enc_train_data (pd.DataFrame) – Data to compute correlation.
predictors (list) – List of column names of the DataFrame between which to compute the correlation matrix.
- Returns
The correlation matrix of the training set.
- Return type
pd.DataFrame
- class cobra.model_building.LogisticRegressionModel[source]
Bases:
object
Wrapper around the LogisticRegression class, with additional methods implemented such as evaluation (using AUC), getting a list of coefficients, a dictionary of coefficients per predictor, … for convenience.
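A minimal usage sketch on toy target-encoded data:

```python
import pandas as pd

from cobra.model_building import LogisticRegressionModel

X_train = pd.DataFrame({"age_enc": [0.2, 0.8, 0.4, 0.9, 0.1, 0.7],
                        "income_enc": [0.3, 0.6, 0.5, 0.8, 0.2, 0.9]})
y_train = pd.Series([0, 1, 0, 1, 0, 1])

model = LogisticRegressionModel()
model.fit(X_train, y_train)

print(model.get_coef_by_predictor())    # {predictor: coefficient}
scores = model.score_model(X_train)     # predicted probabilities
auc = model.evaluate(X_train, y_train)  # AUC by default
```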
- logit
scikit-learn logistic regression model.
- Type
LogisticRegression
- predictors
List of predictors used in the model.
- Type
list
- serialize() dict [source]
Serialize model as JSON.
- Returns
Dictionary containing the serialized JSON.
- Return type
dict
- deserialize(model_dict: dict)[source]
Deserialize a model previously stored as JSON.
- Parameters
model_dict (dict) – Serialized JSON file as a dict.
- Raises
ValueError – In case the JSON file is not a valid serialized model.
- get_coef() numpy.array [source]
Returns the model coefficients.
- Returns
Array of model coefficients.
- Return type
np.array
- get_intercept() float [source]
Returns the intercept of the model.
- Returns
Intercept of the model.
- Return type
float
- get_coef_by_predictor() dict [source]
Returns a dictionary mapping predictor (key) to coefficient (value).
- Returns
A map
{predictor: coefficient}
.- Return type
dict
- fit(X_train: pandas.core.frame.DataFrame, y_train: pandas.core.series.Series)[source]
Fit the model.
- Parameters
X_train (pd.DataFrame) – Predictors of train data.
y_train (pd.Series) – Target of train data.
- score_model(X: pandas.core.frame.DataFrame) numpy.ndarray [source]
Score a model on a (new) dataset.
- Parameters
X (pd.DataFrame) – Dataset of predictors to score the model.
- Returns
Score (i.e. predicted probabilities) of the model for each observation.
- Return type
np.ndarray
- evaluate(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, split: Optional[str] = None, metric: Optional[Callable] = None) float [source]
Evaluate the model on a given dataset (X, y). The optional split parameter indicates which split the dataset belongs to (train, selection, or validation), so that the computation on these sets can be cached.
- Parameters
X (pd.DataFrame) – Dataset containing the predictor values for each observation.
y (pd.Series) – Dataset containing the target of each observation.
split (str, optional) – Split name of the dataset (e.g. “train”, “selection”, or “validation”).
metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (AUC). The function should require y_true and y_pred (binary output) arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
- Returns
The performance score of the model (AUC by default).
- Return type
float
- compute_variable_importance(data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]
Compute the importance of each predictor in the model and return it as a DataFrame.
- Parameters
data (pd.DataFrame) – Data to score the model.
- Returns
DataFrame containing columns predictor and importance.
- Return type
pd.DataFrame
- class cobra.model_building.LinearRegressionModel[source]
Bases:
object
Wrapper around the LinearRegression class, with additional methods implemented such as evaluation (using RMSE), getting a list of coefficients, a dictionary of coefficients per predictor, … for convenience.
- linear
scikit-learn linear regression model.
- Type
LinearRegression
- predictors
List of predictors used in the model.
- Type
list
- serialize() dict [source]
Serialize model as JSON.
- Returns
Dictionary containing the serialized JSON.
- Return type
dict
- deserialize(model_dict: dict)[source]
Deserialize a model previously stored as JSON.
- Parameters
model_dict (dict) – Serialized JSON file as a dict.
- Raises
ValueError – In case the JSON file is not a valid serialized model.
- get_coef() numpy.array [source]
Returns the model coefficients.
- Returns
Array of model coefficients.
- Return type
np.array
- get_intercept() float [source]
Returns the intercept of the model.
- Returns
Intercept of the model.
- Return type
float
- get_coef_by_predictor() dict [source]
Returns a dictionary mapping predictor (key) to coefficient (value).
- Returns
A map
{predictor: coefficient}
.- Return type
dict
- fit(X_train: pandas.core.frame.DataFrame, y_train: pandas.core.series.Series)[source]
Fit the model.
- Parameters
X_train (pd.DataFrame) – Predictors of train data.
y_train (pd.Series) – Target of train data.
- score_model(X: pandas.core.frame.DataFrame) numpy.ndarray [source]
Score a model on a (new) dataset.
- Parameters
X (pd.DataFrame) – Dataset of predictors to score the model.
- Returns
Score of the model for each observation.
- Return type
np.ndarray
- evaluate(X: pandas.core.frame.DataFrame, y: pandas.core.series.Series, split: Optional[str] = None, metric: Optional[Callable] = None) float [source]
Evaluate the model on a given dataset (X, y). The optional split parameter indicates which split the dataset belongs to (train, selection, or validation), so that the computation on these sets can be cached.
- Parameters
X (pd.DataFrame) – Dataset containing the predictor values for each observation.
y (pd.Series) – Dataset containing the target of each observation.
split (str, optional) – Split name of the dataset (e.g. “train”, “selection”, or “validation”).
metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (RMSE). The function should require y_true and y_pred arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
- Returns
The performance score of the model (RMSE by default).
- Return type
float
- compute_variable_importance(data: pandas.core.frame.DataFrame) pandas.core.frame.DataFrame [source]
Compute the importance of each predictor in the model and return it as a DataFrame.
- Parameters
data (pd.DataFrame) – Data to score the model.
- Returns
DataFrame containing columns predictor and importance.
- Return type
pd.DataFrame
- class cobra.model_building.ForwardFeatureSelection(model_type: str = 'classification', max_predictors: int = 50, pos_only: bool = True)[source]
Bases:
object
Perform forward feature selection for a given dataset using a given algorithm.
Predictors are sequentially added to the model, starting with the one that has the highest univariate predictive power, and then proceeding with those that jointly lead to the best fit, optimizing for selection AUC or RMSE. Interaction effects are not explicitly modeled, yet they are implicitly present given the feature selection and the underlying feature correlation structure.
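A minimal sketch of the selection flow on synthetic data (fit expects the data to contain a split column, assumed here to be named "split", as produced by PreProcessor.train_selection_validation_split):

```python
import numpy as np
import pandas as pd

from cobra.model_building import ForwardFeatureSelection

rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "age_enc": rng.random(n),
    "income_enc": rng.random(n),
    "split": rng.choice(["train", "selection", "validation"], size=n),
})
data["target"] = (rng.random(n) < data["age_enc"]).astype(int)

selection = ForwardFeatureSelection(model_type="classification", max_predictors=2)
selection.fit(data, "target", predictors=["age_enc", "income_enc"])

performances = selection.compute_model_performances(data, "target")
model = selection.get_model_from_step(0)  # model from the first step (0-indexed, assumed)
```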
- model_type
Model type (
classification
orregression
).- Type
str
- MLModel
LogisticRegressionModel or LinearRegressionModel.
- Type
Cobra model
- max_predictors
Maximum number of predictors allowed in any model. This corresponds roughly to the maximum number of steps in the forward feature selection.
- Type
int
- pos_only
Whether or not the model coefficients should all be positive (no sign flips).
- Type
bool
- _fitted_models
List of fitted models.
- Type
list
- get_model_from_step(step: int)[source]
Get fitted model from a particular step.
- Parameters
step (int) – Particular step in the forward selection.
- Returns
Fitted model from the given step.
- Return type
self.MLModel
- Raises
ValueError – In case step is larger than the number of available models.
- compute_model_performances(data: pandas.core.frame.DataFrame, target_column_name: str, splits: list = ['train', 'selection', 'validation'], metric: Optional[Callable] = None) pandas.core.frame.DataFrame [source]
Compute for each model the performance for different sets (e.g. train-selection-validation) and return them along with a list of predictors used in the model. Note that the computation of the performance for each split is cached inside the model itself, so it is inexpensive to perform it multiple times.
- Parameters
data (pd.DataFrame) – Dataset for which to compute performance of each model.
target_column_name (str) – Name of the target column.
splits (list, optional) – List of splits to compute performance on.
metric (Callable (function), optional) – Function that computes an evaluation metric to evaluate the model’s performances, instead of the default metric (AUC for classification, RMSE for regression). The function should require y_true and y_pred arguments. Metric functions from sklearn can be used, for example, see https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.
- Returns
Contains for each model the performance for train, selection and validation sets as well as the set of predictors used in this model.
- Return type
pd.DataFrame
- fit(train_data: pandas.core.frame.DataFrame, target_column_name: str, predictors: list, forced_predictors: list = [], excluded_predictors: list = [])[source]
Fit the forward feature selection estimator.
- Parameters
train_data (pd.DataFrame) – Data on which to fit the model. Should include a “train” and “selection” split for correct model selection! The “train” split is used to train a model, the “selection” split is used to evaluate which model to include in the actual forward feature selection.
target_column_name (str) – Name of the target column.
predictors (list) – List of predictors on which to train the estimator.
forced_predictors (list, optional) – List of predictors to force in the estimator.
excluded_predictors (list, optional) – List of predictors to exclude from the estimator.
- Raises
ValueError – In case the number of forced predictors is larger than the maximum number of allowed predictors in the model.
cobra.evaluation module
- cobra.evaluation.generate_pig_tables(basetable: pandas.core.frame.DataFrame, id_column_name: str, target_column_name: str, preprocessed_predictors: list) pandas.core.frame.DataFrame [source]
Compute PIG tables for all predictors in preprocessed_predictors.
The output is a DataFrame with columns variable, label, pop_size, global_avg_target and avg_target.
- Parameters
basetable (pd.DataFrame) – Basetable to compute PIG tables from.
id_column_name (str) – Name of the basetable column containing the IDs of the basetable rows (e.g. customernumber).
target_column_name (str) – Name of the basetable column containing the target values to predict.
preprocessed_predictors (list) – List of basetable column names containing preprocessed predictors.
- Returns
DataFrame containing a PIG table for all predictors.
- Return type
pd.DataFrame
- cobra.evaluation.compute_pig_table(basetable: pandas.core.frame.DataFrame, predictor_column_name: str, target_column_name: str, id_column_name: str) pandas.core.frame.DataFrame [source]
Compute the PIG table of a given predictor for a given target.
- Parameters
basetable (pd.DataFrame) – Input data from which to compute the PIG table.
predictor_column_name (str) – Name of the predictor for which to compute the PIG table.
target_column_name (str) – Name of the target variable.
id_column_name (str) – Name of the id column (used to count population size).
- Returns
PIG table as a DataFrame
- Return type
pd.DataFrame
- cobra.evaluation.plot_incidence(pig_tables: pandas.core.frame.DataFrame, variable: str, model_type: str, column_order: Optional[list] = None, dim: tuple = (12, 8))[source]
Plots a Predictor Insights Graph (PIG), a graph in which the mean target value is plotted for a number of bins constructed from a predictor variable. When the target is a binary classification target, the plotted mean target value is a true incidence rate.
Bins are ordered in descending order of mean target value unless specified otherwise with the column_order list.
- Parameters
pig_tables (pd.DataFrame) – Dataframe with cleaned, binned, partitioned and prepared data, as created by generate_pig_tables() from this module.
variable (str) – Name of the predictor variable for which the PIG will be plotted.
model_type (str) – Type of model (either “classification” or “regression”).
column_order (list, default=None) – Explicit order of the value bins of the predictor variable to be used on the PIG.
dim (tuple, default=(12, 8)) – Optional tuple to configure the width and length of the plot.
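A minimal sketch chaining generate_pig_tables() and plot_incidence() on toy, already-binned data (column names are hypothetical):

```python
import pandas as pd

from cobra.evaluation import generate_pig_tables, plot_incidence

basetable = pd.DataFrame({
    "customer_id": range(8),
    "target": [1, 0, 1, 0, 0, 1, 0, 0],
    "age_bin": ["18-30", "18-30", "30-45", "30-45",
                "45-60", "45-60", "60+", "60+"],
})

pig_tables = generate_pig_tables(
    basetable,
    id_column_name="customer_id",
    target_column_name="target",
    preprocessed_predictors=["age_bin"],
)
plot_incidence(pig_tables, variable="age_bin", model_type="classification")
```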
- cobra.evaluation.plot_performance_curves(model_performance: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None, colors: dict = {'selection': '#ff9500', 'train': '#0099bf', 'validation': '#8064a2'}, metric_name: Optional[str] = None)[source]
Plot performance curves generated by the forward feature selection for the train-selection-validation sets.
- Parameters
model_performance (pd.DataFrame) – Contains train-selection-validation performance for each model trained in the forward feature selection.
dim (tuple, optional) – Width and length of the plot.
path (str, optional) – Path to store the figure.
colors (dict, optional) – Map with colors for train-selection-validation curves.
metric_name (str, optional) – Name to indicate the metric used in model_performance. Defaults to RMSE in case of regression and AUC in case of classification.
- cobra.evaluation.plot_variable_importance(df_variable_importance: pandas.core.frame.DataFrame, title: Optional[str] = None, dim: tuple = (12, 8), path: Optional[str] = None)[source]
Plot variable importance of a given model.
- Parameters
df_variable_importance (pd.DataFrame) – DataFrame containing columns predictor and importance.
title (str, optional) – Title of the plot.
dim (tuple, optional) – Width and length of the plot.
path (str, optional) – Path to store the figure.
- cobra.evaluation.plot_univariate_predictor_quality(df_metric: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None)[source]
Plot univariate quality of the predictors.
- Parameters
df_metric (pd.DatFrame) – DataFrame containing for each variable the train AUC or RMSE and test AUC or RMSE along with a boolean indicating whether or not it is selected based on the criteria.
dim (tuple, optional) – Width and length of the plot.
path (str, optional) – Path to store the figure.
- cobra.evaluation.plot_correlation_matrix(df_corr: pandas.core.frame.DataFrame, dim: tuple = (12, 8), path: Optional[str] = None)[source]
Plot correlation matrix amongst the predictors.
- Parameters
df_corr (pd.DataFrame) – Correlation matrix.
dim (tuple, optional) – Width and length of the plot.
path (str, optional) – Path to store the figure.
- class cobra.evaluation.ClassificationEvaluator(probability_cutoff: Optional[float] = None, lift_at: float = 0.05, n_bins: int = 10)[source]
Bases:
object
Evaluator class encapsulating classification model metrics and plotting functionality.
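A minimal usage sketch on synthetic scores:

```python
import numpy as np

from cobra.evaluation import ClassificationEvaluator

rng = np.random.default_rng(1)
y_pred = rng.random(1000)                    # model scores (probabilities)
y_true = (rng.random(1000) < y_pred).astype(int)

evaluator = ClassificationEvaluator(lift_at=0.05, n_bins=10)
evaluator.fit(y_true, y_pred)

print(evaluator.evaluation_metrics)  # precision, recall, accuracy, AUC, F1, ...
evaluator.plot_roc_curve()
evaluator.plot_confusion_matrix()
```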
- y_true
True binary target data labels.
- Type
np.ndarray
- y_pred
Target scores of the model.
- Type
np.ndarray
- confusion_matrix
Confusion matrix computed for a particular cut-off.
- Type
np.ndarray
- cumulative_gains
Data for plotting cumulative gains curve.
- Type
tuple
- evaluation_metrics
Map containing various scalar evaluation metrics (precision, recall, accuracy, AUC, F1, etc.).
- Type
dict
- lift_at
Parameter to determine at which top level percentage the lift of the model should be computed.
- Type
float
- lift_curve
Data for plotting lift curve(s).
- Type
tuple
- probability_cutoff
Probability cut off to convert probability scores to a binary score.
- Type
float
- roc_curve
Map containing the true-positive rates and false-positive rates at various thresholds (the thresholds are included as well).
- Type
dict
- n_bins
Defines the number of bins used to calculate the lift curve (by default 10, i.e. deciles).
- Type
int, optional
- fit(y_true: numpy.ndarray, y_pred: numpy.ndarray)[source]
Fit the evaluator by computing the relevant evaluation metrics on the inputs.
- Parameters
y_true (np.ndarray) – True labels.
y_pred (np.ndarray) – Model scores (as probability).
- plot_roc_curve(path: Optional[str] = None, dim: tuple = (12, 8))[source]
Plot ROC curve of the model.
- Parameters
path (str, optional) – Path to store the figure.
dim (tuple, optional) – Tuple with width and length of the plot.
- plot_confusion_matrix(path: Optional[str] = None, dim: tuple = (12, 8), labels: list = ['0', '1'])[source]
Plot the confusion matrix.
- Parameters
path (str, optional) – Path to store the figure.
dim (tuple, optional) – Tuple with width and length of the plot.
labels (list, optional) – Optional list of labels, default “0” and “1”.
- plot_cumulative_response_curve(path: Optional[str] = None, dim: tuple = (12, 8))[source]
Plot cumulative response curve.
- Parameters
path (str, optional) – Path to store the figure.
dim (tuple, optional) – Tuple with width and length of the plot.
- class cobra.evaluation.RegressionEvaluator[source]
Bases:
object
Evaluator class encapsulating regression model metrics and plotting functionality.
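A minimal usage sketch:

```python
import numpy as np

from cobra.evaluation import RegressionEvaluator

y_true = np.array([10.0, 12.5, 9.0, 14.2, 11.1, 13.3])
y_pred = np.array([9.5, 12.0, 10.1, 13.8, 11.4, 12.9])  # model predictions

evaluator = RegressionEvaluator()
evaluator.fit(y_true, y_pred)
print(evaluator.scalar_metrics)  # R-squared, MAE, MSE, RMSE
```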
- y_true
True target values.
- Type
np.ndarray
- y_pred
Target scores of the model.
- Type
np.ndarray
- scalar_metrics
Map containing various scalar evaluation metrics (R-squared, MAE, MSE, RMSE)
- Type
dict
- qq
Theoretical quantiles and associated actual residuals.
- Type
pd.Series
- fit(y_true: numpy.ndarray, y_pred: numpy.ndarray)[source]
Fit the evaluator by computing the relevant evaluation metrics on the inputs.
- Parameters
y_true (np.ndarray) – True labels.
y_pred (np.ndarray) – Model scores.