ANCOVA: Analysis of Covariance with Python

ANCOVA

ANCOVA is an extension of ANOVA (Analysis of Variance) that combines elements of regression analysis and ANOVA.
It allows us to assess the impact of an independent variable (like a treatment or group) on a dependent variable (such as an outcome or response), while controlling for the influence of one or more continuous covariates (confounding variables).

e.g. Think of ANCOVA as a way to level the playing field. It adjusts for other factors (covariates) so we can clearly see the impact of our main variable on the outcome.
e.g. Think of ANCOVA as a way to compare groups while keeping other factors constant. It’s like comparing the performance of students from different schools while accounting for their study hours.

Step 1: Why is there a need for a hypothesis?

In scientific research, a hypothesis is crucial. It guides the investigation and helps focus on a specific research question. A hypothesis gives the study a clear direction, making sure that data collection and analysis are purposeful and meaningful. Without a hypothesis, research may lack direction and results may be hard to interpret.

e.g. Imagine you’re guessing the number of candies in a jar. Hypothesis testing is like checking if your guess is close enough to the actual number or if it’s way off.

Step 2: Why is there a need for ANOVA, MANOVA, and ANCOVA?

ANOVA (Analysis of Variance), MANOVA (Multivariate Analysis of Variance), and ANCOVA (Analysis of Covariance) are statistical techniques used to analyze the relationship between variables. ANOVA compares means between groups, MANOVA extends this to multiple dependent variables, and ANCOVA accounts for the effect of additional variables (covariates) on the relationship. These techniques help researchers understand the significance of differences between groups and the impact of covariates on the outcomes.

Step 3: Which type of real-life problems need ANCOVA implementation?

ANCOVA is particularly useful in situations where:

There are multiple groups to compare
There are additional variables (covariates) that may influence the outcome
The goal is to control for the effect of covariates on the relationship between variables

Examples of real-life problems that may require ANCOVA implementation include:

Case 1:

Analyzing the effect of a new marketing campaign on sales while controlling for advertising spend.

Sales: Dependent Variable

New Campaign: Group Variable

Budget (advertising spend): Covariate

Case 2:

A medical study is investigating the effectiveness of three different treatments for reducing cholesterol levels. The researchers also want to account for the potential influence of patients’ age and baseline cholesterol levels.

Imagine you’re a doctor studying the effect of a new drug. ANCOVA helps you see the drug’s impact while considering patients’ ages. Similarly, in education, it can show how different teaching methods work while accounting for students’ prior knowledge.

`Cholesterol Reduction` ~ `Treatment Type` + `Age` + `Baseline Cholesterol`

Cholesterol Reduction: Dependent Variable

Treatment Type: Group Variable

Age: Covariate

Baseline Cholesterol: Covariate

The previous examples show how to study the connection between a main result and various influencing factors while considering the effects of other related variables.
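As a minimal sketch of the cholesterol model, the formula can be fitted with statsmodels on made-up data. The column names (`treatment`, `age`, `baseline`, `reduction`), the sample size, and the simulated effect sizes are all assumptions chosen for illustration; they do not come from a real study.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data for illustration only (not from a real study)
rng = np.random.default_rng(0)
n = 90
df = pd.DataFrame({
    "treatment": np.repeat(["A", "B", "C"], n // 3),
    "age": rng.integers(30, 70, size=n),
    "baseline": rng.normal(240, 20, size=n),
})

# Assumed true effects: B and C reduce cholesterol by 5 and 10 units more than A
effect = df["treatment"].map({"A": 0.0, "B": 5.0, "C": 10.0})
df["reduction"] = (
    effect + 0.1 * df["age"] + 0.2 * df["baseline"] + rng.normal(0, 3, size=n)
)

# C(...) marks treatment as categorical; age and baseline enter as covariates
model = ols("reduction ~ C(treatment) + age + baseline", data=df).fit()
print(model.params)
```

The coefficients labelled `C(treatment)[T.B]` and `C(treatment)[T.C]` estimate how treatments B and C differ from the reference treatment A after adjusting for age and baseline cholesterol.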

Step 4: What are real-life implementations of ANCOVA so far?

ANCOVA has been widely used in various fields, including:

Medicine

To compare the effectiveness of different treatments while controlling for patient characteristics

Education

To examine the impact of different teaching methods on student outcomes while accounting for student IQ
Here’s a simple example: If you have data on students’ test scores, study hours, and teaching methods, you can use Python’s statsmodels library to see how teaching methods affect scores, considering study hours.

Marketing

To analyze the relationship between customer behaviour and marketing strategies while controlling for demographic variables
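The education example can be sketched with simulated data to show why the covariate matters. Everything below is hypothetical: the method labels (`lecture`, `tutorial`), the column names, and the assumed effects (the tutorial method adds 3 points, each study hour adds 4 points, and tutorial students happen to study about 2 hours more).

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Simulated students; all numbers are assumptions for illustration
rng = np.random.default_rng(1)
n = 200
method = rng.choice(["lecture", "tutorial"], size=n)

# Tutorial students study more, so study hours are tangled up with method
hours = rng.normal(5, 1, size=n) + np.where(method == "tutorial", 2.0, 0.0)
score = 50 + 3.0 * (method == "tutorial") + 4.0 * hours + rng.normal(0, 5, size=n)
students = pd.DataFrame({"method": method, "hours": hours, "score": score})

term = "C(method)[T.tutorial]"  # tutorial vs. the reference level, lecture

# ANOVA-style model: ignores study hours
unadjusted = ols("score ~ C(method)", data=students).fit().params[term]

# ANCOVA model: compares methods at equal study hours
adjusted = ols("score ~ C(method) + hours", data=students).fit().params[term]

print(f"Unadjusted method effect: {unadjusted:.2f}")
print(f"Adjusted method effect:   {adjusted:.2f}")
```

Without the covariate, the method effect absorbs the extra study hours and looks much larger than the simulated 3 points. Once `hours` is in the model, the estimate moves back towards the assumed value.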

Step 5: Python code for ANCOVA implementation

This code snippet imports the necessary libraries, reads the CSV file ‘teengamb.csv’ into a pandas DataFrame, and prints the first few rows of the dataset.

Python

import pandas as pd
from statsmodels.formula.api import ols
import statsmodels.api as sm

teengamb = pd.read_csv('teengamb.csv')

# View the first few rows of the dataset
print(teengamb.head())

index	sex	status	income	verbal	gamble
0	1	51	2.0	8	0.0
1	1	28	2.5	8	0.0
2	1	37	2.0	6	0.0
3	1	28	7.0	4	7.3
4	1	65	2.0	8	19.6

Explore Dataset (`Teenage Gambling`)

The teengamb dataset from R (in the faraway package) contains information related to teenage gambling in Britain. Here are the columns typically found in this dataset:

gamble: Represents expenditure on gambling in pounds per year. For example, the fourth row shows £7.3 per year.
income: Refers to the income level of the teenagers’ households, in pounds per week (£1 ≈ ₹108.70 at the time of writing).
sex: Indicates the gender of the teenagers, coded as 0 = male and 1 = female.
status: Represents the socioeconomic status (SES) of the teenagers’ families, on a scale from 0 (lowest) to 100 (highest).
verbal: A verbal reasoning score, used here as a verbal IQ measure, ranging from 1 (lowest) to 10 (highest).

Each column provides specific insights into factors that may influence teenage gambling behaviour, facilitating various statistical analyses and research studies.

Step 6: Python code for ANCOVA

This code specifies and fits an ANCOVA (Analysis of Covariance) model using the ols function from statsmodels.formula.api. It examines the relationship between the dependent variable ‘gamble’ (teenage gambling expenditure) and several independent variables ('income', 'sex', 'status', 'verbal'). The .fit() method fits the model to the data, and .summary() provides a detailed summary of the model statistics, including coefficients, standard errors, t-statistics, p-values, and confidence intervals.

The formula:

`Dependent Variable ~ Group Variable + Covariate`

describes the structure of an Analysis of Covariance (ANCOVA) model. Here’s a detailed explanation:

Dependent Variable: This is the outcome variable you are interested in analyzing. It is continuous and represents the primary measurement you want to understand or predict.
Group Variable: This is the categorical independent variable. It divides the data into different groups or levels. In ANCOVA, it is often a factor representing different treatment groups, experimental conditions, or classifications.
Covariate: This is a continuous variable that you suspect may influence the dependent variable. The covariate is included in the model to control for its effect, allowing a clearer understanding of the relationship between the group variable and the dependent variable.

Building the ANCOVA Model

The purpose of ANCOVA is to adjust the dependent variable for the influence of the covariate, isolating the effect of the group variable. Here’s how it works:

Fit a Linear Regression: Conceptually, the dependent variable is first regressed on the covariate. This step adjusts the dependent variable for the covariate.
Include the Group Variable: The group variable is added to the model to see how the adjusted dependent variable differs across groups. In practice, both terms are estimated together in a single model fit.
Interpret Results: The model provides estimates of the group means adjusted for the covariate, allowing comparison between groups while controlling for the covariate.
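The steps above can be sketched as follows. statsmodels estimates the group and covariate terms in one fit, and covariate-adjusted group means can then be obtained by predicting each group's outcome at the overall covariate mean. The data and the column names (`group`, `covariate`, `outcome`) are invented for illustration.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Invented data: three groups with assumed shifts of 0, 2 and 4 units
rng = np.random.default_rng(2)
n = 120
data = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=n),
    "covariate": rng.normal(10, 2, size=n),
})
shift = data["group"].map({"a": 0.0, "b": 2.0, "c": 4.0})
data["outcome"] = 1.5 * data["covariate"] + shift + rng.normal(0, 1, size=n)

# One fit estimates the covariate slope and the group effects together
model = ols("outcome ~ C(group) + covariate", data=data).fit()

# Adjusted means: predicted outcome per group at the overall covariate mean
grid = pd.DataFrame({
    "group": ["a", "b", "c"],
    "covariate": data["covariate"].mean(),
})
grid["adjusted_mean"] = model.predict(grid)
print(grid)
```

Because every group is evaluated at the same covariate value, differences between the adjusted means equal the group coefficients of the fitted model.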

‘Dependent Variable’ ~ ‘Group Variable’ + ‘Covariates’

‘Dependent Variable’ = gamble

‘Group Variable’ = sex

‘Covariates’ = income, status, and verbal

Python

# Specify and fit the ANCOVA model
model = ols('gamble ~ income + sex + status + verbal', data=teengamb).fit()

# Summarise the fitted model
print(model.summary())

Dep. Variable:	gamble	R-squared:	0.527
Model:	OLS	Adj. R-squared:	0.482
Method:	Least Squares	F-statistic:	11.69
Date:	Mon, 15 Jul 2024	Prob (F-statistic):	1.81e-06
Time:	14:52:21	Log-Likelihood:	-210.78
No. Observations:	47	AIC:	431.6
Df Residuals:	42	BIC:	440.8
Df Model:	4
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
Intercept	22.5557	17.197	1.312	0.197	-12.149	57.260
income	4.9620	1.025	4.839	0.000	2.893	7.031
sex	-22.1183	8.211	-2.694	0.010	-38.689	-5.548
status	0.0522	0.281	0.186	0.853	-0.515	0.620
verbal	-2.9595	2.172	-1.362	0.180	-7.343	1.424

Omnibus:	31.143	Durbin-Watson:	2.214
Prob(Omnibus):	0.000	Jarque-Bera (JB):	101.046
Skew:	1.604	Prob(JB):	1.14e-22
Kurtosis:	9.427	Cond. No.	264.

Step 7: Output metrics of ANCOVA

‘Dependent Variable’ ~ ‘Group Variable’ + ‘Covariate’

The output of ANCOVA:

	sum_sq	df	F	PR(>F)
sex	3735.790512	1.0	7.256053	0.010112
income	12056.238564	1.0	23.416920	0.000018
status	17.775781	1.0	0.034526	0.853487
verbal	955.734110	1.0	1.856329	0.180311
Residual	21623.767055	42.0	NaN	NaN

Sum of Squares (sum_sq)
Degrees of Freedom (df)
F-statistic (F)
p-value (PR(>F))

Step 8: How to properly interpret the output

Think of the output metrics like a report card. The ‘sum of squares’ shows how much variation each factor accounts for, ‘degrees of freedom’ indicate how many independent pieces of information each factor uses, the ‘F-value’ tells you how strong the effect is relative to the noise, and the ‘p-value’ shows whether the result is statistically significant, like a passing grade.

sum_sq (Sum of Squares): This column shows the total variation in the dependent variable that can be attributed to each factor or variable.
- Higher values indicate more variability explained by the factor.
df (Degrees of Freedom): This column represents the number of independent values or quantities that can vary for each factor.
- For a categorical factor, df = number of levels – 1; a continuous covariate uses 1 df.
F (F-value): This is the test statistic calculated by dividing the mean square of each factor by the mean square of the residuals (error). It indicates the ratio of systematic variance to unsystematic variance.
- Higher F-values indicate stronger evidence that the factor affects the dependent variable.
PR(>F) (p-value): This column shows the p-value of the F-test: the probability of obtaining an F-value at least as large as the one observed if the null hypothesis (no effect) were true.
- p-values less than 0.05 are conventionally treated as statistically significant.
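To make the F and p-value definitions concrete, the sketch below recomputes both from the sums of squares and degrees of freedom reported in the Step 7 table. Using `scipy.stats.f.sf` for the upper-tail probability is an assumption about tooling; the article itself does not show this calculation.

```python
from scipy import stats

# Values copied from the ANCOVA table in Step 7
sum_sq = {
    "sex": 3735.790512,
    "income": 12056.238564,
    "status": 17.775781,
    "verbal": 955.734110,
}
resid_ss, resid_df = 21623.767055, 42

# Mean square of the residuals = residual sum of squares / residual df
ms_resid = resid_ss / resid_df

results = {}
for term, ss in sum_sq.items():
    term_df = 1                                # every term in this table has df = 1
    f_value = (ss / term_df) / ms_resid        # mean square of term / mean square error
    p_value = stats.f.sf(f_value, term_df, resid_df)  # P(F >= f_value) under the null
    results[term] = (f_value, p_value)
    print(f"{term}: F = {f_value:.4f}, p = {p_value:.6f}")
```

The recomputed values agree with the F and PR(>F) columns of the table; for sex, for example, F ≈ 7.256 and p ≈ 0.0101.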

Interpretation of the Specific Values:

Sex:
- sum_sq: 3735.790512
- df: 1.0
- F: 7.256053
- PR(>F): 0.010112
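- Interpretation: Gender has a significant effect on gambling expenditure (p < 0.05). The negative coefficient in the model summary (about -22.1, with 1 = female) indicates that females spend less than males on average, holding the other variables fixed.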
Income:
- sum_sq: 12056.238564
- df: 1.0
- F: 23.416920
- PR(>F): 0.000018
- Interpretation: Income has a highly significant effect on gambling expenditure (p < 0.001).
Status:
- sum_sq: 17.775781
- df: 1.0
- F: 0.034526
- PR(>F): 0.853487
- Interpretation: Status does not have a significant effect on gambling expenditure (p > 0.05).

Verbal:
- sum_sq: 955.734110
- df: 1.0
- F: 1.856329
- PR(>F): 0.180311
- Interpretation: Verbal IQ score does not have a significant effect on gambling expenditure (p > 0.05).
Residual:
- sum_sq: 21623.767055
- df: 42.0
- Interpretation: The residual sum of squares represents the variation not explained by the factors in the model.

Conclusion:

The ANCOVA analysis of the teengamb dataset reveals that both gender and income significantly influence gambling expenditure among teenagers in Britain, while socioeconomic status and verbal IQ do not show significant effects.