Versions of ANCOVA (Analysis of Covariance) with Python

ANCOVA (Analysis of Covariance) compares group means on a continuous dependent variable while adjusting for one or more continuous covariates. To run it on a dataset with multiple types of variables, the dependent variable must be continuous; categorical variables enter the model as factors and continuous variables as covariates. Below is an example using the statsmodels library in Python:

Mock Dataset

Let’s create a dataset with a mix of variable types:

Python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generating a larger sample dataset
np.random.seed(0)
data = pd.DataFrame({
    'continuous_var': np.random.uniform(1, 10, 100),
    'categorical_var': np.random.choice(['A', 'B', 'C'], 100),
    'binary_var': np.random.choice([0, 1], 100),
    'discrete_var': np.random.choice([1, 2, 3], 100),
    'ordinal_var': np.random.choice(['low', 'medium', 'high'], 100),
    'interval_var': np.random.uniform(10, 30, 100),
    'ratio_var': np.random.uniform(100, 300, 100),
    'dependent_var': np.random.uniform(5, 20, 100)
})

# Convert categorical variables to category dtype
data['categorical_var'] = data['categorical_var'].astype('category')
data['ordinal_var'] = pd.Categorical(data['ordinal_var'], categories=['low', 'medium', 'high'], ordered=True)
data.head()

Output:

   continuous_var categorical_var  binary_var  discrete_var ordinal_var  interval_var   ratio_var  dependent_var
0        5.939322               B           0             1        high     12.577211  111.671272      11.154523
1        7.436704               C           1             1      medium     17.853514  246.141820      14.349420
2        6.424870               C           1             2      medium     29.128114  276.344042      18.304412
3        5.903949               C           0             3      medium     13.742618  154.487379      14.282393
4        4.812893               B           1             2        high     28.079679  175.811379       7.001922

Performing ANCOVA with multiple categorical variables

We’ll perform ANCOVA with dependent_var as the dependent variable, continuous_var and ratio_var as covariates, and categorical_var as a factor:

Python
data['categorical_var'] = data['categorical_var'].astype('category')
data['ordinal_var'] = pd.Categorical(data['ordinal_var'], categories=['low', 'medium', 'high'], ordered=True)

# Defining the formula for ANCOVA
formula = 'dependent_var ~ C(categorical_var) + continuous_var + ratio_var'

# Fitting the model
model = ols(formula, data=data).fit()

# Performing ANCOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Output:

                         sum_sq    df         F    PR(>F)
C(categorical_var)    56.096275   2.0  1.360451  0.261497
continuous_var        38.399673   1.0  1.862543  0.175555
ratio_var              0.351111   1.0  0.017030  0.896446
Residual            1958.595968  95.0       NaN       NaN

The anova_table will show the ANCOVA results, including the sum of squares, degrees of freedom, F-statistic, and p-values for each factor and covariate in the model.

Interpretation: None of the terms (categorical_var, continuous_var, ratio_var) has a statistically significant effect on the dependent variable, as every p-value is greater than 0.05. This is expected here, because dependent_var was generated at random, independently of the other columns.

Explanation:

  • Formula: The formula 'dependent_var ~ C(categorical_var) + continuous_var + ratio_var' specifies that dependent_var is the dependent variable, categorical_var is a categorical factor (marked with C()), and continuous_var and ratio_var are covariates.
  • ols: This function builds an ordinary least squares model from the formula and the data; calling fit() estimates it.
  • anova_lm: This function performs the ANCOVA on the fitted model and returns the results in an ANOVA table format. The argument typ=2 requests Type II sums of squares.

You can add more variables to the formula as needed, ensuring that you correctly specify categorical variables with C().

The following examples apply the same approach to datasets with different variable types.

Example 1: Education Dataset

Dataset

  • Dependent Variable: Test Scores
  • Covariates: Hours of Study, Previous Grades
  • Factors: Teaching Method, Gender
Python
import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example dataset
data = pd.DataFrame({
    'test_scores': [80, 85, 88, 90, 85, 80, 95, 90, 88, 85],
    'hours_of_study': [10, 12, 14, 16, 12, 10, 18, 14, 15, 13],
    'previous_grades': [75, 80, 85, 90, 80, 75, 95, 85, 88, 83],
    'teaching_method': ['Traditional', 'Traditional', 'Online', 'Online', 'Traditional', 'Traditional', 'Online', 'Online', 'Traditional', 'Online'],
    'gender': ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M']
})

# Convert categorical variables to category dtype
data['teaching_method'] = data['teaching_method'].astype('category')
data['gender'] = data['gender'].astype('category')

# Defining the formula for ANCOVA
formula = 'test_scores ~ C(teaching_method) + C(gender) + hours_of_study + previous_grades'

# Fitting the model
model = ols(formula, data=data).fit()

# Performing ANCOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Data:

test_scores  hours_of_study  previous_grades  teaching_method  gender
         80              10               75      Traditional       M
         85              12               80      Traditional       F
         88              14               85           Online       M
         90              16               90           Online       F
         85              12               80      Traditional       F

Output:

                      sum_sq   df         F    PR(>F)
C(teaching_method)  1.184733  1.0  0.878398  0.391668
C(gender)           1.663813  1.0  1.233603  0.317244
hours_of_study      4.168781  1.0  3.090866  0.139066
previous_grades     2.757506  1.0  2.044502  0.212149
Residual            6.743711  5.0       NaN       NaN

Interpretation:

None of the variables (teaching_method, gender, hours_of_study, previous_grades) has a statistically significant effect on test scores after adjusting for the others, as all p-values are greater than 0.05. With only 10 observations and 5 residual degrees of freedom, the test has little power, so this result does not show that the effects are absent, only that this dataset cannot detect them.
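The F column in the Example 1 table can be reproduced by hand: each F value is the term's mean square (sum_sq divided by df) divided by the residual mean square. The sketch below uses the numbers copied from the table; the variable names are illustrative only.

```python
# Values copied from the Example 1 ANOVA table (each term has df = 1)
sum_sq = {
    'C(teaching_method)': 1.184733,
    'C(gender)': 1.663813,
    'hours_of_study': 4.168781,
    'previous_grades': 2.757506,
}
df_term = 1.0

# Residual row of the same table
ss_resid = 6.743711
df_resid = 5.0
ms_resid = ss_resid / df_resid  # residual mean square

# F = (term mean square) / (residual mean square)
f_values = {term: (ss / df_term) / ms_resid for term, ss in sum_sq.items()}

for term, f_value in f_values.items():
    print(f"{term}: F = {f_value:.4f}")
```

The printed values agree with the F column of the table up to rounding of the inputs.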

Example 2: Health Dataset

Dataset

  • Dependent Variable: Blood Pressure
  • Covariates: Age, Exercise Hours per Week
  • Factors: Diet Type, Smoking Status
Python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example dataset
data = pd.DataFrame({
    'blood_pressure': [120, 130, 125, 135, 128, 122, 138, 140, 126, 129],
    'age': [25, 35, 45, 55, 30, 40, 50, 60, 28, 33],
    'exercise_hours': [5, 3, 4, 2, 6, 4, 2, 1, 5, 3],
    'diet_type': ['Vegetarian', 'Non-Vegetarian', 'Vegetarian', 'Non-Vegetarian', 'Vegetarian', 'Non-Vegetarian', 'Vegetarian', 'Non-Vegetarian', 'Vegetarian', 'Non-Vegetarian'],
    'smoking_status': ['Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker']
})
data.head()  # preview the first rows

# Convert categorical variables to category dtype
data['diet_type'] = data['diet_type'].astype('category')
data['smoking_status'] = data['smoking_status'].astype('category')

# Defining the formula for ANCOVA
formula = 'blood_pressure ~ C(diet_type) + C(smoking_status) + age + exercise_hours'

# Fitting the model
model = ols(formula, data=data).fit()

# Performing ANCOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Data:

   blood_pressure  age  exercise_hours       diet_type  smoking_status
0             120   25               5      Vegetarian      Non-Smoker
1             130   35               3  Non-Vegetarian          Smoker
2             125   45               4      Vegetarian      Non-Smoker
3             135   55               2  Non-Vegetarian          Smoker
4             128   30               6      Vegetarian      Non-Smoker

Output:

                        sum_sq   df          F    PR(>F)
C(diet_type)       1056.560220  1.0  58.712976  0.000258
C(smoking_status)  1502.675832  1.0  83.503589  0.000097
age                   5.002934  1.0   0.278013  0.616925
exercise_hours       47.269277  1.0   2.626750  0.156204
Residual            107.972066  6.0        NaN       NaN

Interpretation:

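Diet Type (C(diet_type)): Significant effect with F=58.71, PR(>F)=0.000258. In this model, diet type is strongly associated with blood pressure.

Smoking Status (C(smoking_status)): Significant effect with F=83.50, PR(>F)=0.000097. Smoking status is also significantly associated with blood pressure.

Age: Not significant with F=0.28, PR(>F)=0.616925. Age does not have a significant effect.

Exercise Hours: Not significant with F=2.63, PR(>F)=0.156204. Exercise hours do not significantly affect blood pressure.

Residuals: The residual sum of squares is 107.97, which is the variation in blood pressure that the model leaves unexplained.

Note that in this small dataset every vegetarian is a non-smoker and every non-vegetarian is a smoker, so the two factors are perfectly confounded. The model cannot separate their contributions, and the rows for diet type and smoking status should not be interpreted as two independent effects.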
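Before interpreting two factors separately, it can help to check whether the data allow them to be told apart. A cross-tabulation is one simple way to do this. The sketch below rebuilds only the two factor columns of the Example 2 dataset; the confounded flag is an illustrative helper, not a statsmodels feature.

```python
import pandas as pd

# The two factor columns from Example 2 (they alternate in the same pattern)
data = pd.DataFrame({
    'diet_type': ['Vegetarian', 'Non-Vegetarian'] * 5,
    'smoking_status': ['Non-Smoker', 'Smoker'] * 5,
})

# Count how often each combination of levels occurs
counts = pd.crosstab(data['diet_type'], data['smoking_status'])
print(counts)

# If every diet type appears with exactly one smoking status,
# the two factors carry the same information
confounded = bool(((counts > 0).sum(axis=1) == 1).all())
print('Perfectly confounded:', confounded)
```

In this dataset the off-diagonal cells are empty, so diet type and smoking status cannot be separated. A dataset in which every combination of levels occurs would be needed to estimate both effects.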

Example 3: Marketing Dataset

Dataset

  • Dependent Variable: Sales
  • Covariates: Advertising Spend
  • Factors: Region, Campaign Type
Python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example dataset
data = pd.DataFrame({
    'sales': [200, 220, 210, 230, 225, 215, 240, 235, 220, 210],
    'advertising_spend': [5000, 6000, 5500, 6500, 6200, 5800, 7000, 6800, 6200, 5900],
    'region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
    'campaign_type': ['Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline']
})

# Convert categorical variables to category dtype
data['region'] = data['region'].astype('category')
data['campaign_type'] = data['campaign_type'].astype('category')

# Defining the formula for ANCOVA
formula = 'sales ~ C(region) + C(campaign_type) + advertising_spend'

# Fitting the model
model = ols(formula, data=data).fit()

# Performing ANCOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Data:

sales  advertising_spend  region  campaign_type
  200               5000   North         Online
  220               6000   South        Offline
  210               5500   North         Online
  230               6500   South        Offline
  225               6200   North         Online

Output:

                      sum_sq   df          F    PR(>F)
C(region)         600.690176  1.0  61.988192  0.000222
C(campaign_type)  813.522037  1.0  83.951365  0.000095
advertising_spend 528.588389  1.0  54.547652  0.000316
Residual           58.142380  6.0        NaN       NaN

Interpretation:

Region (C(region)): Significant effect with F=61.99, PR(>F)=0.000222. In this model, region is strongly associated with sales.

Campaign Type (C(campaign_type)): Significant effect with F=83.95, PR(>F)=0.000095. Campaign type is also significantly associated with sales.

Advertising Spend: Significant effect with F=54.55, PR(>F)=0.000316. Sales vary significantly with advertising spend after accounting for the factors.

Residuals: The residual sum of squares is 58.14, which is the variation in sales that the model leaves unexplained.

Note that in this small dataset every North row uses the Online campaign and every South row uses the Offline campaign, so the two factors are perfectly confounded. The model cannot separate their contributions, and the rows for region and campaign type should not be interpreted as two independent effects.
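The PR(>F) column in the Example 3 table is the upper-tail probability of the F distribution with the term's df and the residual df. The sketch below recomputes it with scipy.stats.f.sf from the F values copied from the table; scipy is an assumption here (it is not imported elsewhere in this post, although statsmodels depends on it).

```python
from scipy import stats

# F values copied from the Example 3 ANOVA table (each term has df = 1)
f_table = {
    'C(region)': 61.988192,
    'C(campaign_type)': 83.951365,
    'advertising_spend': 54.547652,
}
df_num = 1    # degrees of freedom of each term
df_resid = 6  # residual degrees of freedom from the table

# Survival function = P(F_dist > observed F), i.e. the p-value
p_values = {
    term: stats.f.sf(f_value, df_num, df_resid)
    for term, f_value in f_table.items()
}

for term, p_value in p_values.items():
    print(f"{term}: PR(>F) = {p_value:.6f}")
```

A larger F with the same degrees of freedom always gives a smaller p-value, which is why campaign type has the smallest PR(>F) of the three terms.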

Conclusion:

These examples demonstrate the flexibility of ANCOVA in handling datasets with a mix of covariates and factors. By correctly specifying the model formula and using the appropriate statistical methods, researchers can gain valuable insights into how different types of variables influence the dependent variable. Whether in education, health, or marketing, ANCOVA provides a robust framework for adjusting for covariates while examining the effects of categorical factors, leading to more precise and meaningful analyses.
