First, we will discuss Ridge Regression. Before that, let’s briefly revisit Linear Regression. Recall that the cost function for Linear Regression is:
\[ \min \sum_{i=1}^{n} (\hat{y} - y)^2 \]
The loss function for Ridge Regression is:
\[ \min \sum_{i=1}^{n} (\hat{y} - y)^2 + \lambda \sum_{i=1}^{m} w_i^2 \]
The added regularization term is the squared \(l2\)-norm of the weights, which acts as a penalty on Linear Regression; \(\lambda\) controls how strong that penalty is. In other words, Ridge Regression introduces a small bias that makes the fit to the training data slightly worse. Why? In return for the bias, we get reduced variance, which helps the model generalize to new data. We call this the bias-variance trade-off. If the model has high variance, it will probably not generalize well to unseen future data points.
This is why Ridge Regression is a regularized linear model.
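To make the penalty concrete, here is a minimal sketch that assumes a penalty on the squared weights and no intercept. Under those assumptions the ridge weights have the closed form \(w = (X^T X + \lambda I)^{-1} X^T y\), which should agree with scikit-learn's `Ridge` when `fit_intercept=False`. The names `X_demo`, `y_demo`, `lam`, and `w_closed` are made up for this illustration and are not part of the walkthrough that follows.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Toy data, assumed for illustration only
X_demo, y_demo = make_regression(n_samples=50, n_features=3, noise=10, random_state=0)
lam = 1.0  # regularization strength (lambda in the formula, alpha in scikit-learn)

# Closed-form ridge solution without an intercept:
# w = (X^T X + lam * I)^{-1} X^T y
n_features = X_demo.shape[1]
w_closed = np.linalg.solve(
    X_demo.T @ X_demo + lam * np.eye(n_features),
    X_demo.T @ y_demo,
)

# scikit-learn's Ridge minimizes the same objective when no intercept is fitted
w_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X_demo, y_demo).coef_

print("closed form :", w_closed)
print("scikit-learn:", w_sklearn)
```

Setting `lam = 0` in the closed form recovers ordinary least squares, which is one way to see that the penalty is the only difference between the two models.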
Let’s see what this all means in action:
## Python 3.9.6
## Import packages
import pandas as pd
from sklearn.datasets import make_regression
from matplotlib import pyplot as plt
from sklearn.linear_model import Ridge, Lasso, ElasticNet
## Generate data from a linear model with one feature; coef is the true underlying coefficient
X, y, coef = make_regression(
    n_samples=20,
    n_features=1,
    n_informative=1,  # number of useful features
    n_targets=1,
    noise=20,
    coef=True,
    random_state=1
)
coef
lambda_value = 1
rr = Ridge(lambda_value)
rr.fit(X, y)
Ridge(alpha=1)
plt.scatter(X, y)
# Draw the fitted Ridge line (slope and intercept) using the model's predictions
plt.plot(X, rr.predict(X), c='red')
As you can see, the Ridge Regression fit (red line) is very close to Linear Regression when lambda_value = 1.
If we increase lambda_value to 10:
lambda_value = 10
rr = Ridge(lambda_value)
rr.fit(X, y)
Ridge(alpha=10)
plt.scatter(X, y)
# Draw the fitted Ridge line (slope and intercept) using the model's predictions
plt.plot(X, rr.predict(X), c='red')
Now, we see a slightly worse fit (higher bias) but we expect to have a lower variance for new data points.
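The shrinking effect can also be checked numerically instead of by eye. The sketch below refits `Ridge` for a few penalty strengths and prints the slope each time; the data settings mirror the ones above, but `X_demo`, `y_demo`, `alphas`, and `slopes` are names introduced only for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# One noisy feature, generated with settings similar to the example above
X_demo, y_demo = make_regression(n_samples=20, n_features=1, noise=20, random_state=1)

alphas = [0.1, 1, 10, 100]
slopes = []
for alpha in alphas:
    model = Ridge(alpha=alpha).fit(X_demo, y_demo)
    slopes.append(abs(model.coef_[0]))
    # A larger alpha pulls the slope toward zero, flattening the fitted line
    print(f"alpha={alpha:>5}: slope={model.coef_[0]:.2f}, intercept={model.intercept_:.2f}")
```

With a single feature, the magnitude of the slope decreases steadily as `alpha` grows, which is the higher-bias, lower-variance behavior described above.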
Next, we have Lasso Regression. Like Ridge Regression, it is another regularized linear model to prevent the model from overfitting. The only difference is in the cost function:
\[ \min \sum_{i=1}^{n} (\hat{y} - y)^2 + \lambda \sum_{i=1}^{m} |w_i| \]
The new regularization term is the \(l1\)-norm of the weights. The difference becomes clear by checking visually:
For both Ridge and Lasso Regression, the coefficients are determined by the first point where the elliptical contours of the loss touch the constraint region. The Ridge constraint region is circular, while the Lasso region has corners that lie on the axes. If the contours first touch the region at a corner, the coefficients of the other feature(s) become exactly zero, so those features drop out of the model. Thus, Lasso can perform variable selection, hence the name Least Absolute Shrinkage and Selection Operator.
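The variable-selection effect is easier to see with more than one feature. The following is a small sketch on separate toy data (ten features, of which only three are informative); the data, the penalty strength of 5.0, and names such as `n_zero_lasso` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Ten features, only three of which actually influence the target
X_demo, y_demo = make_regression(
    n_samples=100, n_features=10, n_informative=3, noise=10, random_state=0
)

lasso = Lasso(alpha=5.0).fit(X_demo, y_demo)
ridge = Ridge(alpha=5.0).fit(X_demo, y_demo)

# Count the coefficients that are exactly zero in each model
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))

print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Ridge coefficients:", np.round(ridge.coef_, 2))
print("exact zeros -> Lasso:", n_zero_lasso, "Ridge:", n_zero_ridge)
```

Ridge shrinks the uninformative coefficients toward zero but leaves them nonzero, whereas Lasso typically sets several of them to exactly zero.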
If we fit using the same X and y data:
las = Lasso(lambda_value)
las.fit(X, y)
Lasso(alpha=10)
Fitted attributes: coef_ = [62.51], intercept_ = -5.953
# Plot the data and the fitted Lasso line (slope and intercept)
plt.scatter(X, y)
plt.plot(X, las.predict(X), c='red')
Lasso returns a fit fairly close to that of Ridge Regression.
Lastly, Elastic Net is a middle ground between Ridge Regression and Lasso Regression. The regularization term is a mix of both where \(r\) controls the mix ratio.
\[ \min \sum_{i=1}^{n} (\hat{y} - y)^2 + r\lambda \sum_{i=1}^{m}|w_i| + \frac{(1-r)}{2} \lambda \sum_{i=1}^{m} w_i^2 \]
## Elastic Net
# Fit
enet = ElasticNet(alpha= lambda_value, l1_ratio= .5 )
enet.fit(X, y)
ElasticNet(alpha=10)
Fitted attributes: coef_ = [12.98], intercept_ = -12.56
# Plot the data and the fitted Elastic Net line (slope and intercept)
plt.scatter(X, y)
plt.plot(X, enet.predict(X), c='red')
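The role of the mix ratio \(r\) (`l1_ratio` in scikit-learn) can be sketched by varying it while holding the penalty strength fixed. With `l1_ratio=1.0` the penalty is purely \(l1\), so the result should coincide with Lasso. Note that scikit-learn scales the squared-error term by \(1/(2n)\), so its `alpha` is not numerically identical to the \(\lambda\) in the formula above. The toy data and variable names here are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Toy data, assumed for illustration only
X_demo, y_demo = make_regression(n_samples=50, n_features=5, noise=10, random_state=0)

lasso_coef = Lasso(alpha=1.0).fit(X_demo, y_demo).coef_

# Lower l1_ratio puts more weight on the l2 part of the penalty
for r in [1.0, 0.5, 0.1]:
    enet_coef = ElasticNet(alpha=1.0, l1_ratio=r).fit(X_demo, y_demo).coef_
    print(f"l1_ratio={r}: {np.round(enet_coef, 2)}")

# With a pure l1 penalty, Elastic Net reduces to Lasso
enet_as_lasso = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X_demo, y_demo).coef_
print("matches Lasso:", np.allclose(enet_as_lasso, lasso_coef))
```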
So which one should you use, including plain Linear Regression without any regularization? As a rule of thumb, avoid plain Linear Regression; a little regularization is almost always preferable. Ridge is a good default, but if you suspect that only a few features are useful, prefer Lasso or Elastic Net, since they tend to shrink the weights of unimportant features to zero. Elastic Net is normally preferred over Lasso because Lasso can behave erratically when the number of features is greater than the number of training instances or when several features are strongly correlated.
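In practice, one way to check the rule of thumb on a given dataset is to compare the candidates with cross-validation. The sketch below does this on assumed toy data with many features relative to the number of samples; the penalty strengths are arbitrary picks rather than tuned values, and the ranking will differ from dataset to dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

# Few samples, many features, only some of them informative (assumed toy setup)
X_demo, y_demo = make_regression(
    n_samples=60, n_features=40, n_informative=5, noise=20, random_state=0
)

models = {
    "LinearRegression": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),
    "Lasso": Lasso(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),
}

# Mean R^2 over 5 folds; higher means better generalization on held-out folds
scores = {
    name: float(np.mean(cross_val_score(model, X_demo, y_demo, cv=5, scoring="r2")))
    for name, model in models.items()
}

for name, score in scores.items():
    print(f"{name:>16}: mean R^2 = {score:.3f}")
```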
Before we end this chapter, I’ll talk a little more about how to deal with overfitting and underfitting. Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. The solutions are:
Simplify the model by choosing one with fewer parameters (for example, a linear model rather than a high-degree polynomial model) or by regularizing it, as we learned in this chapter
Obtain more data
Reduce noise in the data, for example by fixing simple data entry errors and/or removing outliers
For underfitting models, we could:
Select a more complex model with more parameters
Feed better features to the model
Loosen the model constraints, for example by reducing the regularization strength
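The remedies listed above can be sketched on a small synthetic problem: a straight line is too simple for a curved signal, a high-degree polynomial has enough freedom to chase noise, and adding a Ridge penalty constrains that polynomial. The sine-shaped data, the degree of 12, and the penalty of 0.1 are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

# Small noisy training set and a larger set of new points from the same process
X_demo = rng.uniform(-1, 1, size=(30, 1))
y_demo = np.sin(3 * X_demo[:, 0]) + rng.normal(scale=0.2, size=30)
X_new = rng.uniform(-1, 1, size=(200, 1))
y_new = np.sin(3 * X_new[:, 0]) + rng.normal(scale=0.2, size=200)

candidates = {
    "degree_1": make_pipeline(PolynomialFeatures(1), LinearRegression()),
    "degree_12": make_pipeline(PolynomialFeatures(12), LinearRegression()),
    "degree_12_ridge": make_pipeline(PolynomialFeatures(12), Ridge(alpha=0.1)),
}

results = {}
for name, model in candidates.items():
    model.fit(X_demo, y_demo)
    # Store R^2 on the training data and on the new data
    results[name] = (model.score(X_demo, y_demo), model.score(X_new, y_new))
    print(f"{name:>16}: train R^2 = {results[name][0]:.3f}, new-data R^2 = {results[name][1]:.3f}")
```

A large gap between the training score and the new-data score suggests overfitting, while low scores on both suggest underfitting.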
From Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow