Versions of ANCOVA (Analysis of Covariance) with Python

ANCOVA (Analysis of Covariance) requires a continuous dependent variable. Categorical variables enter the model as factors, and continuous variables enter as covariates. Below is an example using the statsmodels library in Python:

Mock Dataset

Let’s create a dataset with a mix of variable types:

Python

import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Generating a larger sample dataset
np.random.seed(0)
data = pd.DataFrame({
    'continuous_var': np.random.uniform(1, 10, 100),
    'categorical_var': np.random.choice(['A', 'B', 'C'], 100),
    'binary_var': np.random.choice([0, 1], 100),
    'discrete_var': np.random.choice([1, 2, 3], 100),
    'ordinal_var': np.random.choice(['low', 'medium', 'high'], 100),
    'interval_var': np.random.uniform(10, 30, 100),
    'ratio_var': np.random.uniform(100, 300, 100),
    'dependent_var': np.random.uniform(5, 20, 100)
})

# Convert categorical variables to category dtype
data['categorical_var'] = data['categorical_var'].astype('category')
data['ordinal_var'] = pd.Categorical(data['ordinal_var'], categories=['low', 'medium', 'high'], ordered=True)
data.head()

	continuous_var	categorical_var	binary_var	discrete_var	ordinal_var	interval_var	ratio_var	dependent_var
0	5.939322	B	0	1	high	12.577211	111.671272	11.154523
1	7.436704	C	1	1	medium	17.853514	246.141820	14.349420
2	6.424870	C	1	2	medium	29.128114	276.344042	18.304412
3	5.903949	C	0	3	medium	13.742618	154.487379	14.282393
4	4.812893	B	1	2	high	28.079679	175.811379	7.001922

Performing ANCOVA with a categorical factor and two covariates

We’ll perform ANCOVA with dependent_var as the dependent variable, continuous_var and ratio_var as covariates, and categorical_var as a factor:

Python

# Defining the formula for ANCOVA
formula = 'dependent_var ~ C(categorical_var) + continuous_var + ratio_var'

# Fitting the model
model = ols(formula, data=data).fit()

# Performing ANCOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Output:

	sum_sq	df	F	PR(>F)
C(categorical_var)	56.096275	2.0	1.360451	0.261497
continuous_var	38.399673	1.0	1.862543	0.175555
ratio_var	0.351111	1.0	0.017030	0.896446
Residual	1958.595968	95.0	NaN	NaN

The anova_table will show the ANCOVA results, including the sum of squares, degrees of freedom, F-statistic, and p-values for each factor and covariate in the model.

Interpretation: None of the variables (categorical_var, continuous_var, ratio_var) has a statistically significant effect on the dependent variable, since every p-value is greater than 0.05. This is expected here, because all columns in the mock dataset were generated independently at random, so the predictors carry no real information about dependent_var.

Explanation:

Formula: 'dependent_var ~ C(categorical_var) + continuous_var + ratio_var' specifies that dependent_var is the dependent variable, categorical_var is a categorical factor (marked with C()), and continuous_var and ratio_var are covariates.
ols: Fits an ordinary least squares model to the data using that formula.
anova_lm: Computes the ANOVA table for the fitted model. With typ=2 it reports Type II sums of squares.

You can add more variables to the formula as needed, making sure that categorical variables are wrapped in C().

The following examples apply the same approach to datasets from different domains.

Example 1: Education Dataset

Dataset

Dependent Variable: Test Scores
Covariates: Hours of Study, Previous Grades
Factors: Teaching Method, Gender

Variable	Role
Test Scores	Dependent variable
Hours of Study	Covariate
Previous Grades	Covariate
Teaching Method	Factor (categorical)
Gender	Factor (categorical)

Python

import pandas as pd
import numpy as np
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example dataset
data = pd.DataFrame({
    'test_scores': [80, 85, 88, 90, 85, 80, 95, 90, 88, 85],
    'hours_of_study': [10, 12, 14, 16, 12, 10, 18, 14, 15, 13],
    'previous_grades': [75, 80, 85, 90, 80, 75, 95, 85, 88, 83],
    'teaching_method': ['Traditional', 'Traditional', 'Online', 'Online', 'Traditional', 'Traditional', 'Online', 'Online', 'Traditional', 'Online'],
    'gender': ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M']
})

# Convert categorical variables to category dtype
data['teaching_method'] = data['teaching_method'].astype('category')
data['gender'] = data['gender'].astype('category')

# Defining the formula for ANCOVA
formula = 'test_scores ~ C(teaching_method) + C(gender) + hours_of_study + previous_grades'

# Fitting the model
model = ols(formula, data=data).fit()

# Performing ANCOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Data:

test_scores	hours_of_study	previous_grades	teaching_method	gender
80	10	75	Traditional	M
85	12	80	Traditional	F
88	14	85	Online	M
90	16	90	Online	F
85	12	80	Traditional	F

Output:

	sum_sq	df	F	PR(>F)
C(teaching_method)	1.184733	1.0	0.878398	0.391668
C(gender)	1.663813	1.0	1.233603	0.317244
hours_of_study	4.168781	1.0	3.090866	0.139066
previous_grades	2.757506	1.0	2.044502	0.212149
Residual	6.743711	5.0	NaN	NaN

Interpretation:

None of the variables (teaching_method, gender, hours_of_study, previous_grades) has a statistically significant effect on test scores, as all p-values are greater than 0.05. With only 10 observations and 5 residual degrees of freedom, the tests have little power, so this result does not show that these factors have no effect. It only shows that this dataset provides no significant evidence of one.
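The ANOVA table reports only F-tests. To see the size and direction of each covariate-adjusted effect, the fitted coefficients can be inspected as well. This is a minimal sketch that reuses the education data from this example; the names `edu`, `edu_model`, and `adjusted_diff` are introduced here for illustration.

```python
import pandas as pd
from statsmodels.formula.api import ols

edu = pd.DataFrame({
    'test_scores': [80, 85, 88, 90, 85, 80, 95, 90, 88, 85],
    'hours_of_study': [10, 12, 14, 16, 12, 10, 18, 14, 15, 13],
    'previous_grades': [75, 80, 85, 90, 80, 75, 95, 85, 88, 83],
    'teaching_method': ['Traditional', 'Traditional', 'Online', 'Online',
                        'Traditional', 'Traditional', 'Online', 'Online',
                        'Traditional', 'Online'],
    'gender': ['M', 'F', 'M', 'F', 'F', 'M', 'M', 'F', 'F', 'M'],
})

edu_model = ols(
    'test_scores ~ C(teaching_method) + C(gender) + hours_of_study + previous_grades',
    data=edu,
).fit()

# Factor levels are sorted alphabetically, so 'Online' is the reference level.
# This coefficient is the Traditional-minus-Online difference in test scores,
# holding gender, hours_of_study and previous_grades fixed.
adjusted_diff = edu_model.params['C(teaching_method)[T.Traditional]']

print(edu_model.params)
print('Adjusted difference (Traditional - Online):', adjusted_diff)
```

Reading the coefficient alongside its p-value keeps the focus on what ANCOVA is meant to estimate: a group difference after adjusting for the covariates.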

Example 2: Health Dataset

Dataset

Dependent Variable: Blood Pressure
Covariates: Age, Exercise Hours per Week
Factors: Diet Type, Smoking Status

Variable	Role
Blood Pressure	Dependent variable
Age	Covariate
Exercise Hours per Week	Covariate
Diet Type	Factor (categorical)
Smoking Status	Factor (categorical)

Python

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example dataset
data = pd.DataFrame({
    'blood_pressure': [120, 130, 125, 135, 128, 122, 138, 140, 126, 129],
    'age': [25, 35, 45, 55, 30, 40, 50, 60, 28, 33],
    'exercise_hours': [5, 3, 4, 2, 6, 4, 2, 1, 5, 3],
    'diet_type': ['Vegetarian', 'Non-Vegetarian', 'Vegetarian', 'Non-Vegetarian', 'Vegetarian', 'Non-Vegetarian', 'Vegetarian', 'Non-Vegetarian', 'Vegetarian', 'Non-Vegetarian'],
    'smoking_status': ['Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker', 'Non-Smoker', 'Smoker']
})
print(data.head())

# Convert categorical variables to category dtype
data['diet_type'] = data['diet_type'].astype('category')
data['smoking_status'] = data['smoking_status'].astype('category')

# Defining the formula for ANCOVA
formula = 'blood_pressure ~ C(diet_type) + C(smoking_status) + age + exercise_hours'

# Fitting the model
model = ols(formula, data=data).fit()

# Performing ANCOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Data:

	blood_pressure	age	exercise_hours	diet_type	smoking_status
0	120	25	5	Vegetarian	Non-Smoker
1	130	35	3	Non-Vegetarian	Smoker
2	125	45	4	Vegetarian	Non-Smoker
3	135	55	2	Non-Vegetarian	Smoker
4	128	30	6	Vegetarian	Non-Smoker

Output:

	sum_sq	df	F	PR(>F)
C(diet_type)	1056.560220	1.0	58.712976	0.000258
C(smoking_status)	1502.675832	1.0	83.503589	0.000097
age	5.002934	1.0	0.278013	0.616925
exercise_hours	47.269277	1.0	2.626750	0.156204
Residual	107.972066	6.0	NaN	NaN

Interpretation:

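Diet Type (C(diet_type)): Significant effect with F=58.71, PR(>F)=0.000258. Diet type is strongly associated with blood pressure.

Smoking Status (C(smoking_status)): Significant effect with F=83.50, PR(>F)=0.000097. Smoking status is also significantly associated with blood pressure.

Age: Not significant with F=0.28, PR(>F)=0.616925. Age does not have a significant effect.

Exercise Hours: Not significant with F=2.63, PR(>F)=0.156. Exercise hours do not significantly affect blood pressure.

Residuals: The residual sum of squares is 107.97, which is the variation in blood pressure that the model leaves unexplained.

Note: In this toy dataset every vegetarian is a non-smoker and every non-vegetarian is a smoker, so the two factors are perfectly confounded. Their separate F-tests should not be read as independent evidence for each factor, because the data cannot distinguish a diet effect from a smoking effect.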
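Before trusting separate tests for two factors, it helps to confirm that the factors actually vary independently of each other. The sketch below cross-tabulates the diet_type and smoking_status columns from this example; the names `health`, `counts`, and `confounded` are introduced here for illustration.

```python
import pandas as pd

# The two factor columns from the health dataset above
health = pd.DataFrame({
    'diet_type': ['Vegetarian', 'Non-Vegetarian'] * 5,
    'smoking_status': ['Non-Smoker', 'Smoker'] * 5,
})

# Count how often each combination of levels occurs
counts = pd.crosstab(health['diet_type'], health['smoking_status'])
print(counts)

# If every row has exactly one non-zero cell, each diet type maps to a single
# smoking status, and the two factors cannot be separated by the model
confounded = bool(((counts > 0).sum(axis=1) == 1).all())
print('Perfectly confounded:', confounded)
```

Here the cross-tabulation has empty off-diagonal cells, so a model containing both factors cannot attribute an effect to one rather than the other.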

Example 3: Marketing Dataset

Dataset

Dependent Variable: Sales
Covariates: Advertising Spend
Factors: Region, Campaign Type

Variable	Role
Sales	Dependent variable
Advertising Spend	Covariate
Region	Factor (categorical)
Campaign Type	Factor (categorical)

Python

import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Example dataset
data = pd.DataFrame({
    'sales': [200, 220, 210, 230, 225, 215, 240, 235, 220, 210],
    'advertising_spend': [5000, 6000, 5500, 6500, 6200, 5800, 7000, 6800, 6200, 5900],
    'region': ['North', 'South', 'North', 'South', 'North', 'South', 'North', 'South', 'North', 'South'],
    'campaign_type': ['Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline', 'Online', 'Offline']
})

# Convert categorical variables to category dtype
data['region'] = data['region'].astype('category')
data['campaign_type'] = data['campaign_type'].astype('category')

# Defining the formula for ANCOVA
formula = 'sales ~ C(region) + C(campaign_type) + advertising_spend'

# Fitting the model
model = ols(formula, data=data).fit()

# Performing ANCOVA
anova_table = sm.stats.anova_lm(model, typ=2)
print(anova_table)

Data:

sales	advertising_spend	region	campaign_type
200	5000	North	Online
220	6000	South	Offline
210	5500	North	Online
230	6500	South	Offline
225	6200	North	Online

Output:

	sum_sq	df	F	PR(>F)
C(region)	600.690176	1.0	61.988192	0.000222
C(campaign_type)	813.522037	1.0	83.951365	0.000095
advertising_spend	528.588389	1.0	54.547652	0.000316
Residual	58.142380	6.0	NaN	NaN

Interpretation:

Region (C(region)): Significant effect with F=61.99, PR(>F)=0.000222. Region is strongly associated with sales.

Campaign Type (C(campaign_type)): Significant effect with F=83.95, PR(>F)=0.000095. Campaign type is also significantly associated with sales.

Advertising Spend: Significant with F=54.55, PR(>F)=0.000316. Higher or lower advertising spend is significantly related to sales after accounting for the factors.

Residuals: The residual sum of squares is 58.14, which is the variation in sales that the model leaves unexplained.

Note: In this toy dataset every North row uses an Online campaign and every South row uses an Offline campaign, so region and campaign type are perfectly confounded. The data cannot distinguish a region effect from a campaign effect, and the separate F-tests for the two factors should be read with that in mind.

Conclusion:

These examples demonstrate the flexibility of ANCOVA in handling datasets with a mix of covariates and factors. By correctly specifying the model formula and using the appropriate statistical methods, researchers can gain valuable insights into how different types of variables influence the dependent variable. Whether in education, health, or marketing, ANCOVA provides a robust framework for adjusting for covariates while examining the effects of categorical factors, leading to more precise and meaningful analyses.