Correlation and Regression
Correlation and regression are the two most commonly used techniques for investigating the relationship between quantitative variables. Here, regression refers to linear regression. Correlation measures the strength and direction of the relationship between the variables, whereas linear regression expresses that relationship as an equation.
Both techniques describe an association between quantitative variables that are assumed to be linearly related. In this article, we will learn more about these topics and the difference between correlation and regression, and work through some associated examples.
What are Correlation and Regression?
Correlation and regression are statistical techniques that are used to describe the relationship between two variables. For example, if a person drives an expensive car, we tend to assume that the person is financially well off. Correlation and regression are used to quantify such a relationship numerically.
Correlation Definition
Correlation can be defined as a measurement that is used to quantify the relationship between variables. If an increase (or decrease) in one variable is accompanied by a corresponding increase (or decrease) in the other, then the two variables are said to be directly correlated. Similarly, if an increase in one is accompanied by a decrease in the other, or vice versa, then the variables are said to be inversely correlated. If changes in one variable show no linear association with changes in the other, then they are uncorrelated. Thus, correlation can be positive (direct correlation), negative (inverse correlation), or zero. This relationship is given by the correlation coefficient. Note that correlation by itself does not show that one variable causes the other.
Regression Definition
Regression can be defined as a technique that is used to quantify how a change in one variable is associated with a change in another variable. It models a dependent variable as a function of an independent variable; on its own, it does not prove cause and effect. Linear regression is the most commonly used type of regression because it is easier to analyze than the other types. Linear regression finds the line of best fit that describes the relationship between the variables.
Correlation and Regression Analysis
Both correlation and regression analysis describe the relationship between two variables using numbers. Graphically, both can be visualized using scatter plots.
Correlation analysis is done to determine whether there is a relationship between the variables that are being tested. Furthermore, a correlation coefficient such as Pearson's correlation coefficient gives a signed numeric value that depicts the strength as well as the direction of the correlation. A scatter plot shows the relationship between two variables x and y by plotting each data point individually.
Regression analysis is used to determine the relationship between two variables such that the value of the unknown variable can be estimated from the value of the known variable. The goal of linear regression is to find the best-fitted line through the data points. For two variables, x and y, the regression analysis can be visualized as a straight line drawn through the scatter plot of the data points.
Correlation and Regression Formula
The most common way to conduct correlation and regression analysis is to use Pearson's correlation coefficient and the method of least squares, respectively. The correlation and regression formulas are given below:
Pearson's Correlation Coefficient: \(r_{xy}=\frac{\sum_{1}^{n}\left ( x_{i} -\overline{x}\right )\left ( y_{i} -\overline{y}\right )}{\sqrt{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )^{2}\sum_{1}^{n}\left ( y_{i}-\overline{y} \right )^{2}}}\)
Ordinary Least Squares (OLS) Linear Regression:
The straight line equation is given as y = \(\alpha\) + \(\beta x\)
\(\beta = \frac{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )\left ( y_{i}-\overline{y} \right )}{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )^{2}}\)
Equivalently, \(\beta = r_{xy}\frac{\sigma_{y}}{\sigma_{x}}\)
\(\alpha = \overline{y}-\beta \overline{x}\)
Here, \(\overline{x}\) is the mean, and \(\sigma_{x}\) is the standard deviation of the first data set where each data point is represented by \(x_{i}\). Similarly, \(\overline{y}\) is the mean, and \(\sigma_{y}\) is the standard deviation of the second data set. n is the number of data points in the datasets.
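The formulas above translate almost line by line into code. The following is a minimal sketch in plain Python, not a reference implementation; the function names `pearson_r` and `ols_line` and the small data set are illustrative assumptions that do not come from the article.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's correlation coefficient r_xy for paired samples."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator: sum of products of deviations from the means
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

def ols_line(xs, ys):
    """Return (alpha, beta) of the least-squares line y = alpha + beta * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx
    alpha = y_bar - beta * x_bar
    return alpha, beta

# Illustrative data that lie exactly on the line y = 1 + 2x
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

r = pearson_r(xs, ys)
alpha, beta = ols_line(xs, ys)
print(f"r = {r:.2f}, alpha = {alpha:.2f}, beta = {beta:.2f}")
```

Because the sample points fall exactly on a straight line with positive slope, the sketch should report a correlation of 1 and recover the intercept and slope of that line.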
Difference between Correlation and Regression
Correlation and regression are both used as statistical measurements to get a good understanding of the relationship between variables. If the correlation coefficient is negative (or positive) then the slope of the regression line will also be negative (or positive). The table given below highlights the key difference between correlation and regression.
Correlation | Regression |
---|---|
Correlation is used to determine whether variables are related or not. | Regression is used to numerically describe how a dependent variable changes with a change in an independent variable. |
Correlation measures the strength of a linear relationship between variables. | It finds the best-fitted regression line to estimate an unknown variable on the basis of the known variable. |
The variables can be used interchangeably. | The variables cannot be interchanged. |
Correlation uses a signed numerical value to estimate the strength of the relationship between the variables. | Regression is used to show the impact of a unit change in the independent variable on the dependent variable. |
Pearson's coefficient is the most commonly used measure of linear correlation. | The least-squares method is the most commonly used technique to determine the regression line. |
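The row about interchanging variables can be checked numerically. The sketch below, which uses the hand and height data from Example 1 of this article, computes the correlation both ways round and the slope of each of the two possible regression lines; the helper names `centered_sums`, `pearson_r`, and `slope` are assumptions made for this illustration.

```python
from math import sqrt

def centered_sums(xs, ys):
    """Return (S_xy, S_xx, S_yy): sums of centered products and squares."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy, s_xx, s_yy

def pearson_r(xs, ys):
    s_xy, s_xx, s_yy = centered_sums(xs, ys)
    return s_xy / sqrt(s_xx * s_yy)

def slope(xs, ys):
    """Least-squares slope for regressing ys on xs."""
    s_xy, s_xx, _ = centered_sums(xs, ys)
    return s_xy / s_xx

hand = [17, 15, 19, 17, 21]
height = [150, 154, 169, 172, 175]

# Correlation does not care which variable comes first
r_hand_height = pearson_r(hand, height)
r_height_hand = pearson_r(height, hand)

# Regression does: each direction gives a different line
slope_height_on_hand = slope(hand, height)  # predicts height from hand
slope_hand_on_height = slope(height, hand)  # predicts hand from height

print(f"r(hand, height) = {r_hand_height:.4f}")
print(f"r(height, hand) = {r_height_hand:.4f}")
print(f"slope of height on hand = {slope_height_on_hand:.4f}")
print(f"slope of hand on height = {slope_hand_on_height:.4f}")
```

The two correlation values agree, while the second slope is not simply the reciprocal of the first, so swapping the roles of the variables yields a different regression line.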
Important Notes on Correlation and Regression
- Correlation and regression are statistical techniques that are used to describe the linear relationship between two variables.
- Correlation determines whether two variables are linearly related and how strongly, while regression describes how the dependent variable changes with the independent variable.
- Pearson's correlation coefficient and ordinary least squares method are used to perform correlation and regression analysis.
Examples on Correlation and Regression
- Example 1: Calculate the correlation coefficient for the given data
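Person | Hand | Height |
---|---|---|
A | 17 | 150 |
B | 15 | 154 |
C | 19 | 169 |
D | 17 | 172 |
E | 21 | 175 |

Solution:

Person | Hand | Height | \(x_{i} - \overline{x}\) | \(y_{i} - \overline{y}\) | \((x_{i} - \overline{x})(y_{i} - \overline{y})\) | \((x_{i} - \overline{x})^{2}\) | \((y_{i} - \overline{y})^{2}\) |
---|---|---|---|---|---|---|---|
A | 17 | 150 | -0.8 | -14.0 | 11.2 | 0.6 | 196.0 |
B | 15 | 154 | -2.8 | -10.0 | 28.0 | 7.8 | 100.0 |
C | 19 | 169 | 1.2 | 5.0 | 6.0 | 1.4 | 25.0 |
D | 17 | 172 | -0.8 | 8.0 | -6.4 | 0.6 | 64.0 |
E | 21 | 175 | 3.2 | 11.0 | 35.2 | 10.2 | 121.0 |
Average | 17.8 | 164 | | | | | |
Total | | | | | 74.0 | 20.8 | 506.0 |

Using the formula,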
\(r_{xy}=\frac{\sum_{1}^{n}\left ( x_{i} -\overline{x}\right )\left ( y_{i} -\overline{y}\right )}{\sqrt{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )^{2}\sum_{1}^{n}\left ( y_{i}-\overline{y} \right )^{2}}}\)
= \(\frac{74.0}{\sqrt{20.8 \times 506.0}}\)
= 0.72
Answer: The data has a high positive correlation
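As a cross-check of Example 1, the table totals and the final coefficient can be recomputed in a few lines. This is only a verification sketch in plain Python; the variable names are chosen here for illustration.

```python
from math import sqrt

hand = [17, 15, 19, 17, 21]
height = [150, 154, 169, 172, 175]

x_bar = sum(hand) / len(hand)
y_bar = sum(height) / len(height)

# One (x_i - x_bar, y_i - y_bar) pair per person, as in the solution table
deviations = [(x - x_bar, y - y_bar) for x, y in zip(hand, height)]

s_xy = sum(dx * dy for dx, dy in deviations)  # total of the product column
s_xx = sum(dx ** 2 for dx, _ in deviations)   # total of the squared x deviations
s_yy = sum(dy ** 2 for _, dy in deviations)   # total of the squared y deviations

r = s_xy / sqrt(s_xx * s_yy)

print(f"means: {x_bar:.1f}, {y_bar:.1f}")
print(f"totals: {s_xy:.1f}, {s_xx:.1f}, {s_yy:.1f}")
print(f"r = {r:.2f}")
```

The printed means, column totals, and coefficient should agree with the values in the worked solution.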
- Example 2: Find the equation of the regression line for the following data, which has a correlation coefficient equal to 0.61.
Person | Weight | Blood Pressure |
---|---|---|
A | 150 | 125 |
B | 169 | 130 |
C | 175 | 160 |
D | 180 | 169 |
E | 200 | 150 |

Solution:
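Person | Weight | Blood Pressure | \(x_{i} - \overline{x}\) | \(y_{i} - \overline{y}\) | \((x_{i} - \overline{x})^{2}\) | \((y_{i} - \overline{y})^{2}\) |
---|---|---|---|---|---|---|
A | 150 | 125 | -24.8 | -21.8 | 615.0 | 475.2 |
B | 169 | 130 | -5.8 | -16.8 | 33.6 | 282.2 |
C | 175 | 160 | 0.2 | 13.2 | 0.0 | 174.2 |
D | 180 | 169 | 5.2 | 22.2 | 27.0 | 492.8 |
E | 200 | 150 | 25.2 | 3.2 | 635.0 | 10.2 |
Average | 174.8 | 146.8 | | | | |
Total | | | | | 1310.8 | 1434.8 |

\(\sigma_{x} = \sqrt{ \frac{\sum\left ( x_{i}-\overline{x} \right )^{2}}{n-1}}\) = \(\sqrt{\frac{1310.8}{4}}\) = 18.10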
\(\sigma_{y} = \sqrt{ \frac{\sum\left ( y_{i}-\overline{y} \right )^{2}}{n-1}}\) = 18.94
\(r_{xy}\) = 0.61
\(\beta = r_{xy}\frac{\sigma_{y}}{\sigma_{x}}\) = 0.64 (0.6384 before rounding)
\(\alpha = \overline{y}-\beta \overline{x}\) = 146.8 - (174.8)(0.6384) = 35.21
Equation of regression line: y = 35.21 + 0.64x
Answer: y = 35.21 + 0.64x
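Example 2 can be verified the same way. The sketch below computes the sample standard deviations with the n - 1 divisor used in the solution, obtains the slope by both formulas for \(\beta\), and then the intercept. It relies only on Python's standard `math` and `statistics` modules; the variable names are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, stdev

weight = [150, 169, 175, 180, 200]
blood_pressure = [125, 130, 160, 169, 150]

x_bar = mean(weight)
y_bar = mean(blood_pressure)

# statistics.stdev uses the n - 1 divisor, matching the solution above
sigma_x = stdev(weight)
sigma_y = stdev(blood_pressure)

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(weight, blood_pressure))
s_xx = sum((x - x_bar) ** 2 for x in weight)
s_yy = sum((y - y_bar) ** 2 for y in blood_pressure)

r = s_xy / sqrt(s_xx * s_yy)

# Two equivalent routes to the slope
beta_from_sums = s_xy / s_xx
beta_from_r = r * sigma_y / sigma_x

alpha = y_bar - beta_from_sums * x_bar

print(f"sigma_x = {sigma_x:.2f}, sigma_y = {sigma_y:.2f}, r = {r:.2f}")
print(f"beta = {beta_from_sums:.4f} (via r: {beta_from_r:.4f})")
print(f"alpha = {alpha:.2f}")
```

Keeping the slope unrounded until the last step is what reproduces the intercept 35.21; rounding the slope to 0.64 first would shift the intercept slightly.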
- Example 3: Interpret the correlation coefficient of the following data.
X | Y |
---|---|
12 | 1002 |
14 | 760 |
21 | 580 |
34 | 400 |
56 | 350 |
78 | 120 |

Solution:
X | Y | \(x_{i} - \overline{x}\) | \(y_{i} - \overline{y}\) | \((x_{i} - \overline{x})(y_{i} - \overline{y})\) | \((x_{i} - \overline{x})^{2}\) | \((y_{i} - \overline{y})^{2}\) |
---|---|---|---|---|---|---|
12 | 1002 | -23.833 | 466.67 | -11122.22 | 568.028 | 217777.78 |
14 | 760 | -21.833 | 224.67 | -4905.22 | 476.694 | 50475.11 |
21 | 580 | -14.833 | 44.67 | -662.56 | 220.028 | 1995.11 |
34 | 400 | -1.833 | -135.33 | 248.11 | 3.361 | 18315.11 |
56 | 350 | 20.167 | -185.33 | -3737.56 | 406.694 | 34348.44 |
78 | 120 | 42.167 | -415.33 | -17513.22 | 1778.028 | 172501.78 |
Total | | | | -37692.667 | 3452.833 | 495413.33 |

\(\overline{x}\) = 35.83, \(\overline{y}\) = 535.33

Using the formula,
\(r_{xy}=\frac{\sum_{1}^{n}\left ( x_{i} -\overline{x}\right )\left ( y_{i} -\overline{y}\right )}{\sqrt{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )^{2}\sum_{1}^{n}\left ( y_{i}-\overline{y} \right )^{2}}}\)
= \(\frac{-37692.667}{\sqrt{3452.833 \times 495413.33}}\)
= -0.9114
Answer: There is a strong negative correlation between the two variables
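The data of Example 3 also illustrate the point made earlier that a negative correlation coefficient goes with a negative regression slope. A brief sketch in plain Python, with variable names chosen for this illustration:

```python
from math import sqrt

xs = [12, 14, 21, 34, 56, 78]
ys = [1002, 760, 580, 400, 350, 120]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / sqrt(s_xx * s_yy)  # correlation coefficient
beta = s_xy / s_xx            # slope of the least-squares line of y on x

print(f"y_bar = {y_bar:.2f}")
print(f"r = {r:.4f}")
print(f"beta = {beta:.4f}")
```

Both r and the slope share the sign of the same numerator, the sum of products of deviations, so both come out negative for these data.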
FAQs on Correlation and Regression
What are Correlation and Regression in Statistics?
In statistics, correlation and regression are techniques that help to describe and quantify the relationship between two variables. Correlation summarizes the relationship as a signed number, while regression expresses it as an equation.
What is the Definition of Correlation and Regression?
Correlation can be defined as a numeric value that indicates whether variables are linearly related and how strong that relationship is. Regression gives an equation that describes how a change in one variable is associated with a change in another variable.
What is the Formula for Correlation and Regression?
The formula for correlation and regression is given as follows
- Correlation: \(r_{xy}=\frac{\sum_{1}^{n}\left ( x_{i} -\overline{x}\right )\left ( y_{i} -\overline{y}\right )}{\sqrt{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )^{2}\sum_{1}^{n}\left ( y_{i}-\overline{y} \right )^{2}}}\)
- Regression line equation: y = \(\alpha\) + \(\beta x\), where \(\beta = r_{xy}\frac{\sigma_{y}}{\sigma_{x}}\) and \(\alpha = \overline{y}-\beta \overline{x}\)
What is the Similarity Between Correlation and Regression?
The similarity between correlation and regression is that if the correlation coefficient is positive (or negative) then the slope of the regression line will also be positive (or negative).
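The reason can be read off the formulas for \(r_{xy}\) and \(\beta\) given earlier in the article. A short derivation using the same symbols:

\(\beta = \frac{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )\left ( y_{i}-\overline{y} \right )}{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )^{2}} = r_{xy}\sqrt{\frac{\sum_{1}^{n}\left ( y_{i}-\overline{y} \right )^{2}}{\sum_{1}^{n}\left ( x_{i}-\overline{x} \right )^{2}}} = r_{xy}\frac{\sigma_{y}}{\sigma_{x}}\)

Since \(\sigma_{x}\) and \(\sigma_{y}\) are both positive whenever neither variable is constant, \(\beta\) is \(r_{xy}\) multiplied by a positive number, so the two always have the same sign.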
What is the Difference Between Correlation and Regression?
The main difference between correlation and regression is that correlation measures whether, and how strongly, the given variables follow a linear relationship. Regression describes how a dependent variable changes with an independent variable by determining the equation of the best-fitted line.
How to Graphically Represent Correlation and Regression?
A scatter plot or scatter chart is used to represent correlation and regression graphically. The data points of the variables are plotted on the graph to check the correlation and the best-fitted line represents the regression equation.
What is the Best Way to Find Correlation and Regression Between Two Variables?
The most common way to find the correlation and regression between two variables is to use Pearson's correlation coefficient and the ordinary least squares method, respectively.