Regression is the supervised learning task of predicting a continuous numerical value from input features. Given a person’s age and weight, predict their blood pressure. Given a year, predict the inflation rate. Given a house’s square footage and number of bedrooms, predict its sale price. The output is a real number, not a category.
The simplest regression model is linear regression: assume the output is a linear function of the inputs. For a single input feature $x$:
$$\hat{y} = \theta_0 + \theta_1 x$$
where $\theta_0$ is the intercept and $\theta_1$ is the slope. The vector $\theta = (\theta_0, \theta_1)$ contains everything the model has learned. Training the model means finding good values for $\theta_0$ and $\theta_1$.
Linear models are limited: they can only capture linear relationships. If the data curves, a straight line is a poor fit. The natural extension is polynomial regression:
$$\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_d x^d$$
With degree $d = 1$ we recover linear regression. Higher $d$ fits more complex shapes (quadratics, cubics, and beyond) at the cost of more parameters and more data needed to estimate them well.
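To make the effect of the degree concrete, here is a minimal sketch using NumPy. The helper names `fit_polynomial` and `predict_polynomial`, the coefficient name `theta`, and the noise-free synthetic quadratic are all assumptions chosen for illustration; the fit itself is an ordinary least-squares solve on a design matrix whose columns are the powers of the input.

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit of y ~ theta_0 + theta_1 x + ... + theta_d x^d."""
    # Columns of the design matrix are x^0, x^1, ..., x^degree.
    X = np.vander(x, degree + 1, increasing=True)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def predict_polynomial(theta, x):
    """Evaluate the fitted polynomial at the points in x."""
    X = np.vander(x, len(theta), increasing=True)
    return X @ theta

# Synthetic, noise-free data from a known quadratic: y = 1 + 2x + 3x^2.
x = np.linspace(-2.0, 2.0, 21)
y = 1.0 + 2.0 * x + 3.0 * x**2

theta_line = fit_polynomial(x, y, degree=1)  # degree 1 is plain linear regression
theta_quad = fit_polynomial(x, y, degree=2)  # degree 2 can represent the curve

mse_line = np.mean((predict_polynomial(theta_line, x) - y) ** 2)
mse_quad = np.mean((predict_polynomial(theta_quad, x) - y) ** 2)
print("degree 1 MSE:", mse_line)
print("degree 2 MSE:", mse_quad)
```

On this data the straight line leaves a large error because it cannot bend, while the degree-2 fit recovers the generating coefficients up to floating-point precision.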
Training proceeds by:
- Picking a loss function that measures how far predictions are from the labels. The standard choice for regression is mean squared error.
- Finding parameters that minimize the loss. For linear regression with mean squared error this has a closed-form solution; for more complex models we use gradient descent.
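The two steps above can be sketched for a single-feature linear model. This is an illustrative sketch, not a reference implementation: the function names, the learning rate, the step count, and the synthetic data are assumptions, and the intercept and slope are stored as `theta[0]` and `theta[1]`.

```python
import numpy as np

def mse(theta, x, y):
    """Mean squared error of the line theta[0] + theta[1] * x."""
    residual = theta[0] + theta[1] * x - y
    return np.mean(residual ** 2)

def fit_closed_form(x, y):
    """Solve the least-squares problem directly."""
    X = np.column_stack([np.ones_like(x), x])  # columns: 1, x
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def fit_gradient_descent(x, y, lr=0.1, steps=2000):
    """Minimize MSE by repeatedly stepping against its gradient."""
    theta = np.zeros(2)
    for _ in range(steps):
        residual = theta[0] + theta[1] * x - y
        # Gradient of mean(residual^2) with respect to (theta_0, theta_1).
        grad = 2.0 * np.array([residual.mean(), (residual * x).mean()])
        theta -= lr * grad
    return theta

# Synthetic data: y = 0.5 + 2x plus a little Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 0.5 + 2.0 * x + rng.normal(scale=0.1, size=200)

theta_exact = fit_closed_form(x, y)
theta_gd = fit_gradient_descent(x, y)
print("closed form:     ", theta_exact)
print("gradient descent:", theta_gd)
```

Both routes minimize the same loss, so they should agree closely here. The direct solve is exact in one step, whereas gradient descent only needs the gradient and therefore carries over to models with no closed-form solution.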
The complementary supervised task is classification, where the output is a discrete category instead of a continuous value. The two are closely related: logistic regression, for instance, is a classifier that passes the output of a linear model through a sigmoid, turning a real-valued score into a probability.
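As a small illustration of that relationship, the sketch below applies a sigmoid to a linear score. The parameter values are hand-picked and hypothetical (nothing is fitted), and the 0.5 decision threshold is an assumption rather than a requirement of the method.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear_score(x, intercept, slope):
    """The same linear function used in linear regression."""
    return intercept + slope * x

def predict_probability(x, intercept, slope):
    """Probability of the positive class: sigmoid of the linear score."""
    return sigmoid(linear_score(x, intercept, slope))

def predict_class(x, intercept, slope, threshold=0.5):
    """Turn the probability into a discrete label (0 or 1)."""
    return int(predict_probability(x, intercept, slope) >= threshold)

# Hypothetical, hand-picked parameters (not fitted to any data).
intercept, slope = -3.0, 1.5
p_low = predict_probability(0.0, intercept, slope)   # linear score is -3.0
p_mid = predict_probability(2.0, intercept, slope)   # linear score is 0.0
p_high = predict_probability(4.0, intercept, slope)  # linear score is 3.0
print(p_low, p_mid, p_high)
```

A linear score of zero maps to a probability of exactly 0.5, so the decision boundary of this classifier sits where the underlying linear function crosses zero.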
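In scikit-learn, sklearn.linear_model.LinearRegression() fits a linear regression by ordinary least squares, solved directly; sklearn.linear_model.SGDRegressor() fits the same kind of linear model with stochastic gradient descent, which scales to very large datasets. SGDRegressor applies a small L2 penalty by default, so its coefficients can differ slightly from the least-squares solution.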
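A short usage sketch of both estimators follows. The synthetic data and its coefficients are made up for illustration. Wrapping SGDRegressor in a pipeline with StandardScaler is an addition here, not something the text above prescribes: it is included because stochastic gradient descent tends to be sensitive to feature scale.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: y = 3 + 1.5 * x1 - 2 * x2 plus a little Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(500, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=500)

# Direct least-squares fit.
direct = LinearRegression().fit(X, y)

# Stochastic gradient descent on standardized features (scaling is an assumption).
sgd = make_pipeline(StandardScaler(), SGDRegressor(random_state=0))
sgd.fit(X, y)

X_new = np.array([[2.0, 1.0], [5.0, 5.0]])
pred_direct = direct.predict(X_new)
pred_sgd = sgd.predict(X_new)
print("LinearRegression:", pred_direct)
print("SGDRegressor:    ", pred_sgd)
```

The two sets of predictions should be close but not identical, since SGDRegressor is an iterative, regularized approximation of the same fit.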