I have a dataset with three binary attributes:

1. Gender [Male, Female]

2. Retention in Major [Yes, No]

3. Internship Experience [Yes, No]

When I performed a chi-square test of independence, I found that internship experience and retention in major appear independent (p-value >> 0.05). However, when I disaggregate the data by gender, building a cross tab with 4 rows (male-interned, female-interned, male-not interned, female-not interned) against 2 columns (switched major, did not switch major), the chi-square test gives a statistically significant p-value. Is it ethical to disaggregate the dataset this way, and if so, what does it mean for reporting the results? Is this a form of data dredging?

The sample size is ~100. After disaggregating, I see that for male and female students who interned, the observed counts differ from the expected counts, while for those who never interned, the observed counts are close to the expected counts. In other words, when I write up this result, would I say something like "Among students who interned, retention is associated with gender, while for those who never interned, we can't say anything"?

Further, doesn't this contradict my first finding, "Retention is independent of internship experience," which came from analysing the data across only 2 categorical variables?
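
To make the situation concrete, here is a minimal sketch of the two tests in plain Python. The counts are made up (they are not my actual data) and were chosen only to mimic the pattern described above with n = 100; the `chi_square` helper and the variable names are likewise just for illustration. It computes the Pearson chi-square statistic without a continuity correction and compares it with the standard 5% critical values.

```python
# Hypothetical counts (NOT the real data), chosen only to mimic the pattern
# described in the question. Each row is [retained, switched].
stratified = {
    "male, interned": [22, 3],
    "female, interned": [12, 13],
    "male, not interned": [17, 8],
    "female, not interned": [17, 8],
}


def chi_square(table):
    """Return the Pearson chi-square statistic (no continuity correction)
    and the degrees of freedom for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


def add_rows(a, b):
    """Collapse two rows of counts into one by adding them cell by cell."""
    return [x + y for x, y in zip(a, b)]


# 2x2 table: internship vs retention, pooled over gender.
pooled = [
    add_rows(stratified["male, interned"], stratified["female, interned"]),
    add_rows(stratified["male, not interned"], stratified["female, not interned"]),
]

pooled_stat, pooled_dof = chi_square(pooled)
strat_stat, strat_dof = chi_square(list(stratified.values()))

# Standard chi-square critical values at alpha = 0.05.
CRITICAL_05 = {1: 3.841, 3: 7.815}

for name, stat, dof in [("pooled 2x2", pooled_stat, pooled_dof),
                        ("stratified 4x2", strat_stat, strat_dof)]:
    verdict = "significant" if stat > CRITICAL_05[dof] else "not significant"
    print(f"{name}: chi2 = {stat:.3f}, df = {dof}, {verdict} at 0.05")
```

With these made-up counts the pooled table shows no association at all, while the 4x2 table is significant, because the two tests address different hypotheses: the first asks whether retention depends on internship, the second whether retention depends on the gender-by-internship combination.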
