Tag: Linear Regression, Assumptions, Statistics, Data Analysis

  • What Are the Assumptions of Linear Regression, and How Would You Validate Them in a Model?

    Linear regression is a widely used statistical method for modeling the relationship between a dependent variable and one or more independent variables. It relies on several key assumptions:

    1. **Linearity**: The relationship between the independent and dependent variables should be linear. You can validate this assumption by creating scatter plots.

    2. **Independence**: The residuals (errors) should be independent. This can be checked using the Durbin-Watson test.

    3. **Homoscedasticity**: The residuals should have constant variance at all levels of the independent variables. A residual plot can help check for this.

    4. **Normality**: The residuals should be normally distributed. This can be assessed using a Q-Q plot or the Shapiro-Wilk test.

    5. **No Multicollinearity**: Independent variables should not be too highly correlated. Variance Inflation Factor (VIF) can be used to check for multicollinearity.

    Here’s an example of validating assumptions using Python and the `statsmodels` library:


    import statsmodels.api as sm
    import matplotlib.pyplot as plt
    import numpy as np

    # Sample data
    X = np.random.rand(100)
    Y = 2 * X + np.random.normal(0, 0.1, 100)

    # Fit linear regression model
    X = sm.add_constant(X) # Adds a constant term to the predictor
    model = sm.OLS(Y, X).fit()

    # 1. Check linearity with scatter plot
    plt.scatter(X[:, 1], Y)
    plt.plot(X[:, 1], model.predict(X), color='red')
    plt.title('Linearity Check')
    plt.xlabel('Independent Variable')
    plt.ylabel('Dependent Variable')
    plt.show()

    # 2. Residuals vs Fitted plot for homoscedasticity
    plt.scatter(model.predict(X), model.resid)
    plt.axhline(0, linestyle='--', color='red')
    plt.title('Residuals vs Fitted')
    plt.xlabel('Fitted Values')
    plt.ylabel('Residuals')
    plt.show()

    6. **Conclusion**: Validating the assumptions of linear regression is crucial for the model’s reliability. By ensuring these assumptions hold, you can make more accurate predictions and draw meaningful conclusions.