I think I have some fundamental misunderstandings about the assumptions behind ANOVA. Let's suppose there is a function that maps x to y by the equation y = 3x + 2. Now suppose we don't know that this relationship holds and set about determining the relation between y and x, as well as between y and some irrelevant categorical variable, say eye colour, which has two values, A and B (we don't know that it is irrelevant). To simulate this data, run the following code:
from numpy.random import seed
from numpy.random import randn
from matplotlib import pyplot
import pandas as pd
import scipy.stats as stats

seed(1)
err = randn(50) * .5   # random normal error for category A
err2 = randn(50) * .5  # random normal error for category B
x = list(range(0, 50))  # independent variable

# Generate results for both categories from the same y = 3x + 2 relation,
# adding a random error, because reality is imperfect.
A = [(i * 3 + 2 + e) for i, e in zip(x, err)]
B = [(i * 3 + 2 + e) for i, e in zip(x, err2)]

data_tuples = list(zip(x, A, B))
df = pd.DataFrame(data_tuples, columns=['X', 'A', 'B'])
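Just to eyeball the simulated data (a quick sanity check, nothing rigorous), a scatter plot shows A and B sitting right on top of each other:

# Visual sanity check: both categories follow the same line
pyplot.scatter(df['X'], df['A'], label='A', alpha=0.6)
pyplot.scatter(df['X'], df['B'], label='B', alpha=0.6)
pyplot.legend()
pyplot.show()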
Okay, so now we have a nice dataframe with this information. Now someone says, "I think the value is correlated with the categorical variable." We say, "Don't be absurd." They say, "Convince me." So we crank out an ANOVA test. I know that a t-test is typically used instead of ANOVA when the categorical variable has only two levels, but I have my reasons, and with two groups one-way ANOVA boils down to the t-test anyway (the F statistic is just t squared), so it shouldn't matter; a quick check of that equivalence follows the ANOVA call below. We run the following code:
fvalue, pvalue = stats.f_oneway(df['A'], df['B'])
pvalue
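To back up the "boils down to a t-test" claim, here is a minimal sketch (the variable names are my own) comparing the two tests on the same data:

# For two groups, one-way ANOVA's F equals the (equal-variance)
# two-sample t statistic squared, and the p-values coincide
tvalue, t_pvalue = stats.ttest_ind(df['A'], df['B'])
print(fvalue, tvalue ** 2)  # should match
print(pvalue, t_pvalue)     # should match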
So we see that the p-value is certainly not significant: ANOVA finds no evidence that the category matters. But before declaring victory, we should check ANOVA's assumption that the residuals are normally distributed, so we fit a model and run a Shapiro-Wilk test on its residuals:
from statsmodels.formula.api import ols

# Reshape the frame into LONG format: one row per (X, category, value)
d_melt = pd.melt(df.reset_index(), id_vars=['X'], value_vars=['A', 'B'])
# Run ordinary least squares with the category as the only predictor
model = ols('value ~ C(variable)', data=d_melt).fit()
# Run a Shapiro-Wilk test on the residuals
w, pvalue = stats.shapiro(model.resid)
pvalue
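For what it's worth, with only C(variable) in the model, model.resid is nothing more exotic than each value minus its group mean, which can be verified by hand (a sketch, with names of my own choosing):

# With only the categorical predictor, OLS residuals are just
# deviations from each group's mean
group_means = d_melt.groupby('variable')['value'].transform('mean')
resid_manual = d_melt['value'] - group_means
print((model.resid - resid_manual).abs().max())  # ~0 up to float error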
Now we see that the Shapiro p-value is tiny: the test emphatically rejects normality of the residuals. Plotting them makes the problem visible:
# Histogram of the residuals: clearly not bell-shaped
pyplot.hist(model.resid)
pyplot.show()
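The histogram looks roughly uniform rather than normal, and plotting the residuals against X (my addition, as a diagnostic sketch) shows why: the unmodeled linear trend in X ends up entirely inside the residuals.

# The residuals are dominated by the 3*X trend the model never saw
pyplot.scatter(d_melt['X'], model.resid)
pyplot.xlabel('X')
pyplot.ylabel('residual')
pyplot.show()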
So what am I doing wrong? I suspect it is the ols that is wrong, but how else do you get the residuals in this case?
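In case it clarifies what I am asking, my best guess at a fix is to put X into the model as well, something like the sketch below, though whether this is the right way to check the ANOVA assumption is exactly my question:

# Hypothetical fix: let the model absorb the X trend, then test
# normality of what is left over
model2 = ols('value ~ X + C(variable)', data=d_melt).fit()
w2, pvalue2 = stats.shapiro(model2.resid)
pvalue2  # presumably non-significant, since the simulated errors really are normal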