Task: Calculating the Correlation Coefficient Using Python
We know that the correlation coefficient is calculated using the formula:
r = (nΣxy - ΣxΣy) / √((nΣx^2 - (Σx)^2) * (nΣy^2 - (Σy)^2))
In the above formula, n is the total number of values present in each set
of numbers (the sets have to be of equal length). The two sets of numbers
are denoted by x and y (it doesn’t matter which one you denote as which).
The other terms are described as follows:
Σxy: Sum of the products of the individual elements of the two sets
of numbers, x and y
Σx: Sum of the numbers in set x
Σy: Sum of the numbers in set y
Σx^2: Sum of the squares of the numbers in set x
Σy^2: Sum of the squares of the numbers in set y
(Σx)^2: Square of the sum of the numbers in set x
(Σy)^2: Square of the sum of the numbers in set y
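The terms Σx^2 and (Σx)^2 are easy to confuse, so here is a minimal sketch that computes each term of the formula for two small sets. The values in x and y are made up purely for illustration and do not come from the task data.

```python
# Hypothetical three-element sets, chosen only for illustration
x = [1, 2, 3]
y = [2, 4, 7]

n = len(x)                                      # 3
sum_xy = sum(xi * yi for xi, yi in zip(x, y))   # Σxy = 1*2 + 2*4 + 3*7 = 31
sum_x = sum(x)                                  # Σx = 6
sum_y = sum(y)                                  # Σy = 13
sum_x_sq = sum(xi ** 2 for xi in x)             # Σx^2 = 1 + 4 + 9 = 14
sum_y_sq = sum(yi ** 2 for yi in y)             # Σy^2 = 4 + 16 + 49 = 69
sq_sum_x = sum_x ** 2                           # (Σx)^2 = 6 * 6 = 36
sq_sum_y = sum_y ** 2                           # (Σy)^2 = 13 * 13 = 169

print(sum_x_sq, sq_sum_x)   # 14 36
print(sum_y_sq, sq_sum_y)   # 69 169
```

Note that the sum of the squares (14) and the square of the sum (36) are different quantities, and the formula needs both.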
Let us now write a Python program that calculates the correlation coefficient for us. We will be using the following two built-in functions in the program:
- sum(x): When called on a list of numbers, x, it returns the sum of the numbers in the list.
- zip(x, y): Returns pairs of corresponding numbers from the lists x and y, which you can then use in a loop to perform other operations.
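As a quick illustration of these two functions, here is a short sketch using two made-up lists. In Python 3, zip() returns an iterator rather than a list, so it is wrapped in list() here only to display the pairs.

```python
# Made-up lists, used only to show what sum() and zip() return
x = [1, 2, 3]
y = [10, 20, 30]

total = sum(x)             # 1 + 2 + 3
pairs = list(zip(x, y))    # corresponding elements paired up

# zip() is most often used directly in a loop
products = []
for xi, yi in zip(x, y):
    products.append(xi * yi)

print(total)      # 6
print(pairs)      # [(1, 10), (2, 20), (3, 30)]
print(products)   # [10, 40, 90]
```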
# A program to calculate the correlation coefficient
def find_corr_x_y(x, y):
    n = len(x)
    # Find the sum of the products
    prod = []
    for xi, yi in zip(x, y):
        prod.append(xi * yi)
    sum_prod_x_y = sum(prod)
    sum_x = sum(x)
    sum_y = sum(y)
    # Squares of the sums: (Σx)^2 and (Σy)^2
    squared_sum_x = sum_x ** 2
    squared_sum_y = sum_y ** 2
    # Sums of the squares: Σx^2 and Σy^2
    x_square = []
    for xi in x:
        x_square.append(xi ** 2)
    x_square_sum = sum(x_square)
    y_square = []
    for yi in y:
        y_square.append(yi ** 2)
    y_square_sum = sum(y_square)
    # Apply the formula
    numerator = n * sum_prod_x_y - sum_x * sum_y
    dterm1 = n * x_square_sum - squared_sum_x
    dterm2 = n * y_square_sum - squared_sum_y
    denm = (dterm1 * dterm2) ** 0.5
    corr = numerator / denm
    return corr

crr = 0
X1 = [5.1, 3.2, 3, 1.4, 3.8, 1.0, 2.8, -0.3, 6.9, 2.5, 6.2, 4.6]
Y = [30, 29, 30, 35, 36, 36, 34, 48, 24, 27, 21, 30]
# Calculate the coefficient only if both lists have the same length
if len(X1) == len(Y):
    crr = find_corr_x_y(X1, Y)
    print("Pearson product-moment Correlation Coefficient = {0}".format(crr))
    if crr >= 0.8:
        print("Strong Positive Correlation")
    elif crr <= -0.8:
        print("Strong Negative Correlation")
else:
    print("Sorry, the data set lengths are not equal")
The find_corr_x_y() function accepts two arguments, x and y, which are the two sets of numbers we want to calculate the correlation for. Inside this function, all the terms used for calculating the correlation coefficient are obtained. Note that the function itself does not compare the lengths of the lists: the program checks that the two lists are equal in length before calling it, and prints a message instead if they are not.
OUTPUT
>>>
Pearson product-moment Correlation Coefficient = -0.823545657378
Strong Negative Correlation
>>>
Students, try writing this program on your computer and see how it runs with both equal and unequal lists of numbers.
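When experimenting with unequal lists, it may help to move the length check into the function itself. The following is a minimal sketch of one possible variant, not part of the program above: the name corr_checked and the choice to raise ValueError are assumptions made for illustration. It also guards against a zero denominator, which occurs when all the numbers in one set are identical.

```python
def corr_checked(x, y):
    """Hypothetical variant of find_corr_x_y() that validates its inputs."""
    if len(x) != len(y):
        raise ValueError("the data set lengths are not equal")
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x_sq = sum(xi ** 2 for xi in x)    # Σx^2
    sum_y_sq = sum(yi ** 2 for yi in y)    # Σy^2
    numerator = n * sum_xy - sum_x * sum_y
    denominator = ((n * sum_x_sq - sum_x ** 2) * (n * sum_y_sq - sum_y ** 2)) ** 0.5
    if denominator == 0:
        # A set with no variation makes the coefficient undefined
        raise ValueError("correlation is undefined when a set has no variation")
    return numerator / denominator

# Perfectly linear, increasing data
print(corr_checked([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0

# Unequal lists now produce an error instead of a wrong answer
try:
    corr_checked([1, 2, 3], [1, 2])
except ValueError as err:
    print("Error:", err)
```

Raising an exception keeps the check next to the calculation, so the function cannot be called by mistake on lists of different lengths.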