Statistical Inference with R: Checking relationship between categorical variables using Chi Square Method.

image source: https://penncil.med.upenn.edu/research/statistical-inference/

Statistical inference is very essential in making decisions based on available data. Many organizations these days now rely on data driven decision making to increase their productivity and service provision. In this post, we look at how we can use R to make inferences on data.

The data used for this work is an extract from the General social survey (GSS) 1972–2012 data. The data extract is part of the final project used by Duke University’s Inferential Statistics course on Coursera.

We begin our work by loading in the required libraries.

library(ggplot2)
library(dplyr)
library(statsr)
library(tidyr)

In performing our inference, we will start by asking a question that tries to find correlation between different variables in the data and exploring the data to see if there is any validity in the stated question for those variables.

Exploratory Data Analysis

Views on pre-marital sex are often shaped by religious teachings and beliefs. This is mostly the case because ancient religious texts forbid it. It is believed that people who are actively engaged or who practices religion are less likely to engage in or condone pre-marital sex. We would like to find out from our data if this general assumption or belief is true. We will look at the variables teensex, premarsx,relig and attend We look at the structure of these variables

gss %>% select(teensex, premarsx, relig,attend) %>% str()## 'data.frame':    57061 obs. of  4 variables:
## $ teensex : Factor w/ 5 levels "Always Wrong",..: NA NA NA NA NA NA NA NA NA NA ...
## $ premarsx: Factor w/ 5 levels "Always Wrong",..: 4 1 1 1 3 3 4 3 4 1 ...
## $ relig : Factor w/ 13 levels "Protestant","Catholic",..: 3 2 1 5 1 1 2 3 1 1 ...
## $ attend : Factor w/ 9 levels "Never","Lt Once A Year",..: 3 8 5 NA NA 3 8 NA 4 9 ...

We then check their various factor levels

gss %>% group_by(teensex) %>% summarise(count=n())## `summarise()` ungrouping output (override with `.groups` argument)## # A tibble: 5 x 2
## teensex count
## <fct> <int>
## 1 Always Wrong 15165
## 2 Almst Always Wrg 3540
## 3 Sometimes Wrong 2106
## 4 Not Wrong At All 891
## 5 <NA> 35359
gss %>% group_by(premarsx) %>% summarise(count=n())## `summarise()` ungrouping output (override with `.groups` argument)## # A tibble: 5 x 2
## premarsx count
## <fct> <int>
## 1 Always Wrong 9244
## 2 Almst Always Wrg 3200
## 3 Sometimes Wrong 7044
## 4 Not Wrong At All 14060
## 5 <NA> 23513
gss %>% group_by(relig) %>% summarise(count=n())## `summarise()` ungrouping output (override with `.groups` argument)## # A tibble: 14 x 2
## relig count
## <fct> <int>
## 1 Protestant 33472
## 2 Catholic 13926
## 3 Jewish 1155
## 4 None 6113
## 5 Other 998
## 6 Buddhism 130
## 7 Hinduism 63
## 8 Other Eastern 31
## 9 Moslem/Islam 108
## 10 Orthodox-Christian 96
## 11 Christian 588
## 12 Native American 24
## 13 Inter-Nondenominational 124
## 14 <NA> 233
gss %>% group_by(attend) %>% summarise(count=n())## `summarise()` ungrouping output (override with `.groups` argument)## # A tibble: 9 x 2
## attend count
## <fct> <int>
## 1 Lt Once A Year 4351
## 2 Once A Year 7476
## 3 Sevrl Times A Yr 7202
## 4 Once A Month 4067
## 5 2-3X A Month 5060
## 6 Nrly Every Week 3183
## 7 Every Week 11383
## 8 More Thn Once Wk 4370
## 9 <NA> 9969

Now that we have seen how the variables look like and their various levels, we will extract those variables from the GSS data and create a mini data frame for our EDA. We will add another column relig_mindset that will show if a respondent is very religious, less religious, or just religious. Here we write a code that assigns the various levels in relig_mindset to a respondent depending on how many times he/she attend religious activities that is the attend variable. For this work if respondents attend Every Week, more than once a week they are Very Religious if they attend Nearly Every Week, 2-3X A Month and Once a Month they are Religious any other rate of attendance is Less Religious

eda_df <- select(gss, relig, attend, teensex, premarsx)
eda_df <- eda_df %>% mutate(relig_mindset=if_else(attend=='Nrly Every Week','Religious',if_else(attend=='Every Week','Very Religious',if_else(attend=='More Thn Once Wk','Very Religious',if_else(attend=='2-3X A Month','Religious',if_else(attend=='Once A Month','Religious','Less Religious'))))))
str(eda_df)
## 'data.frame': 57061 obs. of 5 variables:
## $ relig : Factor w/ 13 levels "Protestant","Catholic",..: 3 2 1 5 1 1 2 3 1 1 ...
## $ attend : Factor w/ 9 levels "Never","Lt Once A Year",..: 3 8 5 NA NA 3 8 NA 4 9 ...
## $ teensex : Factor w/ 5 levels "Always Wrong",..: NA NA NA NA NA NA NA NA NA NA ...
## $ premarsx : Factor w/ 5 levels "Always Wrong",..: 4 1 1 1 3 3 4 3 4 1 ...
## $ relig_mindset: chr "Less Religious" "Very Religious" "Religious" NA ...

The column we just created is shown as a character. Let's change it to a factor so that we can use it in our analysis.

eda_df$relig_mindset <- factor(eda_df$relig_mindset)
str(eda_df)
## 'data.frame': 57061 obs. of 5 variables:
## $ relig : Factor w/ 13 levels "Protestant","Catholic",..: 3 2 1 5 1 1 2 3 1 1 ...
## $ attend : Factor w/ 9 levels "Never","Lt Once A Year",..: 3 8 5 NA NA 3 8 NA 4 9 ...
## $ teensex : Factor w/ 5 levels "Always Wrong",..: NA NA NA NA NA NA NA NA NA NA ...
## $ premarsx : Factor w/ 5 levels "Always Wrong",..: 4 1 1 1 3 3 4 3 4 1 ...
## $ relig_mindset: Factor w/ 3 levels "Less Religious",..: 1 3 2 NA NA 1 3 NA 1 3 ...

Now for the exploratory analysis we extract the column we created relig_mindset and premarsx to plot a graph that shows a visual representation of how the two variables are related

r_quest1 <- eda_df %>% drop_na() %>% group_by(relig_mindset, premarsx) %>% summarise (count=n())## `summarise()` regrouping output by 'relig_mindset' (override with `.groups` argument)r_quest1## # A tibble: 12 x 3
## # Groups: relig_mindset [3]
## relig_mindset premarsx count
## <fct> <fct> <int>
## 1 Less Religious Always Wrong 833
## 2 Less Religious Almst Always Wrg 480
## 3 Less Religious Sometimes Wrong 1577
## 4 Less Religious Not Wrong At All 3867
## 5 Religious Always Wrong 1142
## 6 Religious Almst Always Wrg 503
## 7 Religious Sometimes Wrong 1086
## 8 Religious Not Wrong At All 1797
## 9 Very Religious Always Wrong 3002
## 10 Very Religious Almst Always Wrg 690
## 11 Very Religious Sometimes Wrong 896
## 12 Very Religious Not Wrong At All 1054

plot graph

ggplot(r_quest1, aes(fill=premarsx, x=relig_mindset, y=count)) + geom_bar(position = 'dodge', stat = 'identity') + xlab('Religious Seriousness') + ylab('Respondents') + ggtitle('Religion and Pre-marital sex') + scale_fill_discrete(name='Premarital sex')

From the Graph we see clearly that there is a connection between moral views on premarital sex and religion. About four thousand responds who are Less Religious see no wrong in having premarital sex and less than one thousand see it as always wrong. On the other hand, three thousand Respondents who are Very religious see premarital sex as always wrong

Inference

For the inference, we will do a hypothesis test to find out if there is a relationship between religious mindset (relig_mindset) and moral views on premarital sex (premarsx). Because this is a test between two categorical variables. We perform a chi-square test.

We start by first looking at how the observed proportions look like. but we will remove all Na’s since they are undefined for chi-square test of independence

eda_df <- eda_df %>% filter(!is.na(relig), !is.na(attend), !is.na(teensex))

eda_df %>% group_by(relig,premarsx,teensex,relig_mindset) %>% summarise(count=n())
## `summarise()` regrouping output by 'relig', 'premarsx', 'teensex' (override with `.groups` argument)## # A tibble: 350 x 5
## # Groups: relig, premarsx, teensex [156]
## relig premarsx teensex relig_mindset count
## <fct> <fct> <fct> <fct> <int>
## 1 Protestant Always Wrong Always Wrong Less Religious 591
## 2 Protestant Always Wrong Always Wrong Religious 870
## 3 Protestant Always Wrong Always Wrong Very Religious 2285
## 4 Protestant Always Wrong Almst Always Wrg Less Religious 12
## 5 Protestant Always Wrong Almst Always Wrg Religious 13
## 6 Protestant Always Wrong Almst Always Wrg Very Religious 18
## 7 Protestant Always Wrong Sometimes Wrong Less Religious 3
## 8 Protestant Always Wrong Sometimes Wrong Religious 3
## 9 Protestant Always Wrong Sometimes Wrong Very Religious 4
## 10 Protestant Always Wrong Not Wrong At All Religious 3
## # ... with 340 more rows

We create the frequency table for observed values

observed=table(eda_df %>% select(relig_mindset, premarsx))
observed
## premarsx
## relig_mindset Always Wrong Almst Always Wrg Sometimes Wrong Not Wrong At All
## Less Religious 833 480 1577 3867
## Religious 1142 503 1086 1797
## Very Religious 3002 690 896 1054
## premarsx
## relig_mindset Other
## Less Religious 0
## Religious 0
## Very Religious 0

We check conditions first:

  • Independence: Since the sample was collected using simple random sampling the independence condition is met.
  • Sample size: each cell has at least five cases (we remove the column ‘Other’ to meet this condition)
observed <- observed [,-5]

observed
## premarsx
## relig_mindset Always Wrong Almst Always Wrg Sometimes Wrong Not Wrong At All
## Less Religious 833 480 1577 3867
## Religious 1142 503 1086 1797
## Very Religious 3002 690 896 1054

We now set our Hypothesis:

Ho: No relationship between premarital sex and religious mindset

Ha: There is a relationship between premarital sex and religious mindset

W test the hypothesis with the Inference function in the statsr package.

inference(y=premarsx, x=relig_mindset,data=eda_df, type='ht',method='theoretical',  statistic = 'proportion', null = 0, alternative = 'greater')## Response variable: categorical (5 levels) 
## Explanatory variable: categorical (3 levels)
## Observed:
## y
## x Always Wrong Almst Always Wrg Sometimes Wrong Not Wrong At All
## Less Religious 833 480 1577 3867
## Religious 1142 503 1086 1797
## Very Religious 3002 690 896 1054
##
## Expected:
## y
## x Always Wrong Almst Always Wrg Sometimes Wrong Not Wrong At All
## Less Religious 1986.742 667.8361 1420.6985 2681.723
## Religious 1331.356 447.5302 952.0383 1797.076
## Very Religious 1658.902 557.6337 1186.2632 2239.201
##
## H0: relig_mindset and premarsx are independent
## HA: relig_mindset and premarsx are dependent
## chi_sq = 3133.7379, df = 6, p_value = 0

The p-value indicates that we should reject the null hypothesis which states that there is no difference between premarital sex and religious mindset.

As seen earlier in the exploratory data analysis with the bar plot, it is safe to say that: the data provides evidence that there is association between premarital sex and religious mindset

References

GSS DATA: Smith, Tom W., Michael Hout, and Peter V. Marsden. General Social Survey, 1972–2012 [Cumulative File]. ICPSR34802-v1. Storrs, CT: Roper Center for Public Opinion Research, University of Connecticut /Ann Arbor, MI: Inter-university Consortium for Political and Social Research [distributors], 2013–09–11. doi:10.3886/ICPSR34802.v1

Persistent URL: http://doi.org/10.3886/ICPSR34802.v1

DUKE UNIVERSITY: Inferential Statistics course on coursera. https://www.coursera.org/learn/inferential-statistics-intro

A graduate student who is interested and research and data analysis in environmental engineering.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store