The 5 Most Popular Regression Techniques

Regression is a widely used method for modeling relationships among variables. Different regression techniques suit different datasets depending on the problem. This article highlights the 5 most popular regression techniques and the types of datasets where each is most effective.

Introduction

Although we won't go into the details of the assumptions for each technique in this article, it's important to understand them before you apply any method. While not every assumption will apply in every situation, they can give you a good sense of how reliable the model's relationships are and how well it might predict future outcomes.


  1. Ordinary Linear Regression
  2. Polynomial Regression
  3. Stepwise Regression
  4. Ridge Regression
  5. Lasso Regression

Ordinary Linear Regression (OLS)

Assumptions

Below is a quick summary of the assumptions:

  1. Linearity: The relationship between X and the mean of Y is linear.
  2. Homoscedasticity: The variance of the residuals is the same for any value of X.
  3. Independence: Observations are independent of each other.
  4. Normality: For any fixed value of X, Y is normally distributed.

Details

Regression analysis is commonly used for modeling the relationship between a single dependent variable Y and one or more predictors. When we have one predictor, we call this "simple" linear regression:

E[Y] = β0 + β1X

That is, the expected value of Y is a straight-line function of X. The betas are chosen so that the fitted line minimizes the squared distance between each observed Y value and the line; in other words, they minimize this expression:

∑ (yi − (β0 + β1Xi))^2

Implementation example
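Below is a minimal sketch of simple linear regression with scikit-learn's LinearRegression; the synthetic data, the true coefficients (2.5 and 1.3), and the random seed are assumptions made purely for illustration.

```python
# Simple linear regression on synthetic data (illustrative values only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))                 # single predictor
y = 2.5 + 1.3 * X[:, 0] + rng.normal(0, 1, size=100)  # E[Y] = β0 + β1·X plus noise

model = LinearRegression().fit(X, y)
print("Estimated intercept (β0):", model.intercept_)
print("Estimated slope (β1):", model.coef_[0])
print("R^2 on the training data:", model.score(X, y))
```

LinearRegression fits the coefficients by minimizing exactly the sum of squared residuals shown above.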

Polynomial Regression

Assumptions

For polynomial regression models we assume that:

  1. The relationship between the dependent variable y and any independent variable xi is linear or curvilinear (specifically polynomial).
  2. The independent variables xi are independent of each other.
  3. The errors are independent, normally distributed with mean zero and a constant variance (as in OLS).

Clearly there is significant overlap with the assumptions of OLS mentioned above.

Details

Understanding Polynomial Regression

Polynomial Representation

A polynomial can be expressed as:

y = b0 + b1*x + b2*(x**2) + … + bn*(x**n)

In this formula, the exponents of the variable x are constants.

Example Visualization:

A parabola approximating the flight path of a ball illustrates this kind of polynomial relationship.

Understanding Polynomial Models

Polynomial models can be described in two ways:

a. They are linear with respect to the coefficients bi, since these parameters have an exponent of 1.
b. They are non-linear with respect to the variables because of terms like x**2, x**3, etc.

Generating Polynomial Terms

Polynomial terms are created by raising the variable values to a specific power. However, this comes with multicollinearity implications. Consider the quadratic model:

y = b0 + b1*x + b2*(x**2)

In this equation, the variable x appears twice:

- Once as b1*x
- Once as b2*(x**2)

Since x and x**2 are related, this can lead to multicollinearity issues, which can affect the modelā€™s reliability.

Implementation example
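Below is a minimal sketch of degree-2 polynomial regression using scikit-learn's PolynomialFeatures together with LinearRegression; the synthetic quadratic data, the chosen degree, and the coefficient values are assumptions for illustration.

```python
# Polynomial (degree-2) regression: generate the x and x**2 terms, then fit OLS.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=(200, 1))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.3, size=200)  # quadratic signal plus noise

# x and x**2 are strongly correlated here, which is the multicollinearity issue noted above.
print("corr(x, x**2):", np.corrcoef(x[:, 0], x[:, 0] ** 2)[0, 1])

# The model stays linear in the coefficients b0, b1, b2 even though it is non-linear in x.
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), LinearRegression())
model.fit(x, y)

linreg = model.named_steps["linearregression"]
print("Estimated b1, b2:", linreg.coef_)
print("Estimated b0:", linreg.intercept_)
```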

Stepwise Regression

This section covers stepwise regression, a method where we build the regression model by gradually adding and removing predictor variables until no further changes improve it. The aim is to create a useful model that excludes variables which do not improve the fit. However, if we don't include all the relevant variables that affect the response, we risk ending up with a model that misses important insights and is misleading. So it's essential to ensure our list of candidate predictors includes everything that truly influences the outcome.

Assumptions and Limitations

The same assumptions and qualifications apply here as applied to OLS. Note that outliers can have a large impact on these stepping procedures, so you must make some attempt to remove outliers from consideration before applying these methods to your data.

The greatest limitation of these procedures is sample size. A good rule of thumb is to have at least five observations for each variable in the candidate pool; if you have 50 candidate variables, you should have at least 250 observations. With fewer observations per variable, these search procedures may fit the randomness inherent in most datasets, and spurious models will be obtained.

Details

Below are the steps followed:

  1. Initialize: Start with an empty model and define a significance level (e.g., Ī± = 0.05).

  2. Candidate Selection: Identify all candidate predictor variables.

  3. Iterate Until Convergence:

Adding Step:

  • For each candidate not in the model, fit the model with that predictor and check the p-value.
  • Add the predictor with the smallest p-value if itā€™s less than the significance level.

Removing Step:

  • For each predictor in the model, fit the model without it and check the p-value.
  • Remove the predictor with the largest p-value if it exceeds the significance level.

  4. Final Model:

  • Stop when no predictors are added or removed. The resulting model includes only significant predictors.

Implementation example
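p-value-based stepwise selection is not available as a single scikit-learn routine, so below is a minimal sketch built on statsmodels' OLS; the helper function stepwise_select, the synthetic data, the column names, and the 0.05 threshold are all assumptions for illustration.

```python
# Forward-backward stepwise selection driven by p-values (illustrative sketch).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def stepwise_select(X, y, alpha=0.05):
    selected = []
    while True:
        changed = False
        # Adding step: fit the model with each candidate not yet included and check its p-value.
        candidates = [c for c in X.columns if c not in selected]
        pvals = pd.Series(dtype=float)
        for c in candidates:
            model = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = model.pvalues[c]
        if not pvals.empty and pvals.min() < alpha:
            selected.append(pvals.idxmin())
            changed = True
        # Removing step: drop the worst predictor if it is no longer significant.
        if selected:
            model = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = model.pvalues.drop("const")
            if worst.max() > alpha:
                selected.remove(worst.idxmax())
                changed = True
        if not changed:
            return selected

# Synthetic example: only x1 and x2 truly affect y.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=["x1", "x2", "x3", "x4"])
y = 3 * X["x1"] - 2 * X["x2"] + rng.normal(0, 1, size=200)
print(stepwise_select(X, y))  # expected to return ["x1", "x2"]
```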

Ridge Regression

Ridge regression is a statistical technique used to estimate the coefficients of multiple regression models, particularly when the independent variables exhibit high multicollinearity.

Assumptions

The assumptions of ridge regression are the same as those of linear regression: linearity, constant variance, and independence. However, because ridge regression does not provide confidence limits, normality of the errors need not be assumed.

Details

Multicollinearity Issues:

  • Causes high variances in the coefficient estimates, leading to poor predictions, even though the least squares estimates remain unbiased.

Key Concepts:

L2 Regularization: Adds a penalty based on the square of the coefficients to the cost function.

Cost Function: Minimize ||Y−Xθ||**2 + λ*||θ||**2, where Y: actual values; X: independent variables; θ: coefficients; λ: penalty term.

First Term (Residual Sum of Squares): This part measures the difference between the actual values and the predicted values. It captures how well the model fits the training data.

Penalty Term: This term penalizes large coefficients by adding the sum of the squares of the coefficients to the cost function.

Role of λ:

Higher λ: Greater penalty and smaller coefficients; Lower λ: Lesser penalty, coefficients approach those from ordinary least squares.

Benefits

  • Coefficient Shrinkage: Reduces the impact of multicollinearity.
  • Simplified Model: Easier interpretation and improved predictive performance.

Implementation example
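Below is a minimal sketch with scikit-learn's Ridge, compared against plain OLS on two deliberately near-collinear predictors; the alpha argument corresponds to the penalty term λ above, and the value 10.0 is an arbitrary choice for illustration.

```python
# Ridge vs. OLS on synthetic, nearly collinear predictors (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(0, 0.05, size=200)      # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 2 * x2 + rng.normal(0, 1, size=200)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)          # alpha plays the role of λ

print("OLS coefficients:  ", ols.coef_)      # can be unstable under multicollinearity
print("Ridge coefficients:", ridge.coef_)    # shrunk, more stable estimates
```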

Lasso Regression

Lasso regression, also known as L1 regularization, is a technique used in linear regression models. It shares similarities with ridge regression but has distinct features. By applying Lasso regression, we can improve a model's generalizability through a penalty term that discourages complexity. This approach is summarized by the following formula:

Cost Function: Minimize ||Y−Xθ||**2 + λ*||θ||1, where ||θ||1 is the L1 norm, i.e., the sum of the absolute values of the coefficients.

Assumptions

In addition to the OLS assumptions, Lasso makes the assumptions below:

Multicollinearity:

Lasso can handle multicollinearity more effectively than OLS because it includes a penalty term that can shrink coefficients of correlated predictors. This makes it suitable for high-dimensional data.

Sparsity:

Lasso inherently assumes that many of the coefficients are zero, promoting sparsity in the model. This is a key difference from OLS, which does not assume sparsity.

Details

Regularization in LASSO Regression

  1. Penalty Term:
    Defined as:

L1 = λ × (|β1| + |β2| + … + |βp|)

This term imposes a cost for larger coefficients.

  2. Regularization Parameter (λ):
    Controls the strength of the penalty: Higher λ increases the penalty, shrinking more coefficients to zero and simplifying the model; Lower λ reduces the penalty, allowing more flexibility but risking overfitting.

  3. Coefficients (β1, β2, …, βp):
    Represent the influence of each predictor on the response. LASSO can set some coefficients to zero, effectively performing variable selection.

  4. Cost Function:
    Combines the residual sum of squares with the L1 penalty:

J(θ) = ||Y−Xθ||**2 + λ × (|β1| + |β2| + …)

Balances data fit and model complexity.
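Implementation example

Below is a minimal sketch with scikit-learn's Lasso, showing the L1 penalty setting most coefficients exactly to zero; the synthetic data (only the first two of ten predictors matter) and alpha = 0.1 (the λ above) are assumptions for illustration.

```python
# Lasso on synthetic data: the L1 penalty performs variable selection.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 10))                          # 10 candidate predictors
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(0, 1, size=200)  # only the first two matter

lasso = Lasso(alpha=0.1).fit(X, y)                      # alpha plays the role of λ
print("Coefficients:", np.round(lasso.coef_, 3))        # most entries are exactly 0
```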

Future Work & Conclusion

This article just scratches the surface of some of the regression techniques. Many more exist, and it is also worthwhile to look into the nuances of each technique when implementing it for real-life problems. We will delve deeper into each topic in upcoming articles. Please comment on which one I should start with.

If you liked the explanation, follow me for more! Feel free to leave your comments if you have any queries or suggestions.

You can also check out my other articles on data science and computing on Medium. If you like my work and want to contribute to my journey, you can always buy me a coffee :)

