Linear regression is a widely used statistical method for modeling the relationship between a dependent variable and one or more independent variables. It relies on several key assumptions:
1. **Linearity**: The relationship between the independent and dependent variables should be linear. You can validate this assumption by creating scatter plots.
2. **Independence**: The residuals (errors) should be independent of one another. This can be checked using the Durbin-Watson test; a sketch follows the main example below.
3. **Homoscedasticity**: The residuals should have constant variance at all levels of the independent variables. A residual plot can help check for this.
4. **Normality**: The residuals should be normally distributed. This can be assessed using a Q-Q plot or the Shapiro-Wilk test (sketched after the main example).
5. **No Multicollinearity**: The independent variables should not be strongly correlated with one another. The Variance Inflation Factor (VIF) can be used to detect multicollinearity (also sketched below).
Here’s an example that checks the first two assumptions, linearity and homoscedasticity, using Python and the `statsmodels` library; sketches for the remaining checks follow it:
```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Sample data: a noisy linear relationship (seeded for reproducibility)
np.random.seed(42)
X = np.random.rand(100)
Y = 2 * X + np.random.normal(0, 0.1, 100)

# Fit the linear regression model
X = sm.add_constant(X)  # adds the intercept (constant) term to the predictor
model = sm.OLS(Y, X).fit()

# 1. Check linearity: the fitted line should track the scatter of the data
order = np.argsort(X[:, 1])  # sort so the fitted line is drawn left to right
plt.scatter(X[:, 1], Y)
plt.plot(X[order, 1], model.predict(X)[order], color='red')
plt.title('Linearity Check')
plt.xlabel('Independent Variable')
plt.ylabel('Dependent Variable')
plt.show()

# 2. Residuals vs. fitted plot for homoscedasticity: the points should form
# a band of roughly constant width around the zero line
plt.scatter(model.predict(X), model.resid)
plt.axhline(0, linestyle='--', color='red')
plt.title('Residuals vs Fitted')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
```
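For the independence check (assumption 2), `statsmodels` exposes the Durbin-Watson statistic directly. A minimal sketch, reusing the fitted `model` from the example above; a value near 2 suggests little autocorrelation in the residuals, while values toward 0 or 4 suggest positive or negative autocorrelation:

```python
from statsmodels.stats.stattools import durbin_watson

# 3. Durbin-Watson test for independence of residuals
# (~2 = no autocorrelation; toward 0 = positive; toward 4 = negative)
dw = durbin_watson(model.resid)
print(f'Durbin-Watson statistic: {dw:.3f}')
```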
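For the normality check (assumption 4), a Q-Q plot can be drawn with `sm.qqplot`, and the Shapiro-Wilk test is available in `scipy.stats`. A sketch, again reusing `model` and the imports above:

```python
from scipy import stats

# 4a. Q-Q plot: points should fall close to the 45-degree reference line
sm.qqplot(model.resid, line='45', fit=True)
plt.title('Normal Q-Q Plot of Residuals')
plt.show()

# 4b. Shapiro-Wilk test: a p-value above 0.05 means we fail to reject
# the null hypothesis that the residuals are normally distributed
stat, p_value = stats.shapiro(model.resid)
print(f'Shapiro-Wilk: statistic={stat:.3f}, p-value={p_value:.3f}')
```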
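Multicollinearity (assumption 5) cannot be shown with a single predictor, so this sketch fabricates two correlated predictors purely for illustration; `x1` and `x2` are made-up data, not part of the example above:

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# 5. VIF for multicollinearity (requires two or more predictors)
x1 = np.random.rand(100)
x2 = x1 + np.random.normal(0, 0.1, 100)  # deliberately correlated with x1
X_multi = sm.add_constant(np.column_stack([x1, x2]))

# A VIF above roughly 5-10 is a common rule of thumb for problematic
# multicollinearity; column 0 (the constant) is skipped
for i, name in enumerate(['x1', 'x2'], start=1):
    vif = variance_inflation_factor(X_multi, i)
    print(f'VIF for {name}: {vif:.2f}')
```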
**Conclusion**: Validating the assumptions of linear regression is crucial for the model’s reliability. By ensuring these assumptions hold, you can make more accurate predictions and draw meaningful conclusions.