Regression is the supervised learning task of predicting a continuous numerical value from input features. Given a person’s age and weight, predict their blood pressure. Given a year, predict the inflation rate. Given a house’s square footage and number of bedrooms, predict its sale price. The output is a real number, not a category.

The simplest regression model is linear regression: assume the output is a linear function of the inputs. For a single input feature $x$:

$$\hat{y} = \theta_0 + \theta_1 x$$

where $\theta_0$ is the intercept and $\theta_1$ is the slope. The vector $\theta = (\theta_0, \theta_1)$ contains everything the model has learned. Training the model means finding good values for $\theta_0$ and $\theta_1$.
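As a minimal sketch of the single-feature model, the function below evaluates the line for a given parameter pair. The name `predict` and the parameter values (an intercept of 50,000 and a slope of 200 per square foot) are hypothetical, chosen only to illustrate the house-price example.

```python
def predict(theta, x):
    """Linear model: y_hat = theta_0 + theta_1 * x."""
    theta_0, theta_1 = theta  # intercept, slope
    return theta_0 + theta_1 * x


# Hypothetical parameters, not learned from any real data.
theta = (50_000.0, 200.0)

# Predicted price for a 1,500 square foot house:
# 50_000 + 200 * 1_500 = 350_000
price = predict(theta, 1_500.0)
```

Everything the model "knows" is in the two numbers in `theta`; changing them changes the line, and training is the search for the pair that best matches the data.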

Linear models are limited: they can only capture linear relationships. If the data curves, a straight line is a poor fit. The natural extension is polynomial regression:

$$\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d$$

Here $d$ is the degree of the polynomial. With $d = 1$ we recover linear regression. Higher $d$ fits more complex shapes (quadratics, cubics, and beyond) at the cost of more parameters and more data needed to estimate them well.
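The polynomial model can be sketched the same way: a list of coefficients, one per power of the input. The function name `predict_poly` and the coefficient values are illustrative assumptions; the evaluation uses Horner's rule, which is just one convenient way to compute the sum.

```python
def predict_poly(theta, x):
    """Polynomial model: theta[0] + theta[1]*x + ... + theta[d]*x**d."""
    y_hat = 0.0
    # Horner's rule: start from the highest-degree coefficient and work down.
    for coef in reversed(theta):
        y_hat = y_hat * x + coef
    return y_hat


# Degree 1 (two coefficients) is ordinary linear regression: 1 + 2*3 = 7
linear = predict_poly([1.0, 2.0], 3.0)

# Degree 2 adds a curvature term: 1 + 2*3 + 0.5*3**2 = 11.5
quadratic = predict_poly([1.0, 2.0, 0.5], 3.0)
```

A degree-`d` model has `d + 1` coefficients to estimate, which is where the extra data requirement comes from.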

Training proceeds by:

  1. Picking a loss function that measures how far predictions are from the labels. The standard choice for regression is mean squared error.
  2. Finding parameters that minimize the loss. For linear regression with mean squared error this has a closed-form solution; for more complex models we use gradient descent.
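The two steps above can be sketched for the single-feature case in plain Python. The function names, the toy data, the learning rate, and the step count are all assumptions made for illustration; the closed-form fit uses the standard simple least-squares formulas (slope = covariance / variance), and the gradient descent loop is a basic full-batch version rather than a tuned implementation.

```python
def mse(theta, xs, ys):
    """Mean squared error of the line theta_0 + theta_1 * x on the data."""
    theta_0, theta_1 = theta
    return sum((theta_0 + theta_1 * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_closed_form(xs, ys):
    """Least-squares line via the closed-form solution for one feature."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    theta_1 = s_xy / s_xx
    theta_0 = y_mean - theta_1 * x_mean
    return theta_0, theta_1


def fit_gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Minimize MSE by repeatedly stepping against its gradient."""
    n = len(xs)
    theta_0, theta_1 = 0.0, 0.0
    for _ in range(steps):
        errors = [theta_0 + theta_1 * x - y for x, y in zip(xs, ys)]
        grad_0 = 2.0 / n * sum(errors)                             # d(MSE)/d(theta_0)
        grad_1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # d(MSE)/d(theta_1)
        theta_0 -= lr * grad_0
        theta_1 -= lr * grad_1
    return theta_0, theta_1


# Toy data, roughly following y = 1 + 2x with some noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

theta_cf = fit_closed_form(xs, ys)       # approximately (1.10, 1.96)
theta_gd = fit_gradient_descent(xs, ys)  # converges to nearly the same values
```

On this data both routes arrive at essentially the same line. The closed form gets there in one pass, while gradient descent needs a suitable learning rate and enough steps, but it carries over to models that have no closed-form solution.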

The complementary supervised task is classification, where the output is a discrete category instead of a continuous value. The two are closely related: logistic regression, for instance, is a classifier that passes the output of a linear model through a sigmoid, turning it into a probability between 0 and 1.
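To make the link with logistic regression concrete, here is a small sketch of its prediction step for one feature. The function names, the parameter values, and the 0.5 decision threshold are illustrative assumptions, and the sketch covers prediction only, not how the parameters are learned.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(theta, x):
    """Probability of the positive class: sigmoid of a linear function of x."""
    theta_0, theta_1 = theta
    return sigmoid(theta_0 + theta_1 * x)


def predict_class(theta, x, threshold=0.5):
    """Turn the probability into a discrete label (0 or 1)."""
    return 1 if predict_proba(theta, x) >= threshold else 0


# Hypothetical parameters: the linear part is zero at x = 2,
# so that is where the predicted class flips.
theta = (-4.0, 2.0)

p_boundary = predict_proba(theta, 2.0)  # linear part is 0, so probability is 0.5
label_low = predict_class(theta, 1.0)   # linear part is -2, so class 0
label_high = predict_class(theta, 3.0)  # linear part is +2, so class 1
```

The linear part is the same expression used for regression; only the final squashing and thresholding turn the continuous output into a category.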

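In scikit-learn, sklearn.linear_model.LinearRegression fits a linear regression by ordinary least squares using a direct (non-iterative) solver; sklearn.linear_model.SGDRegressor fits a linear model by stochastic gradient descent, which scales to very large datasets. Note that SGDRegressor applies a small L2 penalty by default, so its solution is close to, but not exactly, the least-squares one.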
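A short usage sketch of the two estimators, assuming NumPy and scikit-learn are installed. The toy data and the `max_iter`, `tol`, and `random_state` settings are arbitrary choices for illustration, not recommended defaults.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# scikit-learn expects a 2-D feature matrix: one row per sample, one column per feature.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

# Ordinary least squares with a direct solver.
ols = LinearRegression().fit(X, y)
intercept, slope = ols.intercept_, ols.coef_[0]  # approximately 1.10 and 1.96

# Stochastic gradient descent; results depend on the seed and settings,
# and include a small L2 penalty by default.
sgd = SGDRegressor(max_iter=10_000, tol=1e-6, random_state=0).fit(X, y)

# Both estimators share the same predict interface.
ols_pred = ols.predict(np.array([[5.0]]))
sgd_pred = sgd.predict(np.array([[5.0]]))
```

On a dataset this small there is no reason to prefer SGDRegressor; it is shown only to illustrate that the two share an interface. In practice it also tends to work better when the features are standardized first.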