Statistical Inference with R: Checking relationship between categorical variables using Chi Square Method.

image source: https://penncil.med.upenn.edu/research/statistical-inference/
library(ggplot2)
library(dplyr)
library(statsr)
library(tidyr)

Question: Is there a relationship between religious preference and moral views on premarital sex?

Exploratory Data Analysis

gss %>% select(teensex, premarsx, relig,attend) %>% str()## 'data.frame':    57061 obs. of  4 variables:
## $ teensex : Factor w/ 5 levels "Always Wrong",..: NA NA NA NA NA NA NA NA NA NA ...
## $ premarsx: Factor w/ 5 levels "Always Wrong",..: 4 1 1 1 3 3 4 3 4 1 ...
## $ relig : Factor w/ 13 levels "Protestant","Catholic",..: 3 2 1 5 1 1 2 3 1 1 ...
## $ attend : Factor w/ 9 levels "Never","Lt Once A Year",..: 3 8 5 NA NA 3 8 NA 4 9 ...
gss %>% group_by(teensex) %>% summarise(count=n())## `summarise()` ungrouping output (override with `.groups` argument)## # A tibble: 5 x 2
## teensex count
## <fct> <int>
## 1 Always Wrong 15165
## 2 Almst Always Wrg 3540
## 3 Sometimes Wrong 2106
## 4 Not Wrong At All 891
## 5 <NA> 35359
gss %>% group_by(premarsx) %>% summarise(count=n())## `summarise()` ungrouping output (override with `.groups` argument)## # A tibble: 5 x 2
## premarsx count
## <fct> <int>
## 1 Always Wrong 9244
## 2 Almst Always Wrg 3200
## 3 Sometimes Wrong 7044
## 4 Not Wrong At All 14060
## 5 <NA> 23513
gss %>% group_by(relig) %>% summarise(count=n())## `summarise()` ungrouping output (override with `.groups` argument)## # A tibble: 14 x 2
## relig count
## <fct> <int>
## 1 Protestant 33472
## 2 Catholic 13926
## 3 Jewish 1155
## 4 None 6113
## 5 Other 998
## 6 Buddhism 130
## 7 Hinduism 63
## 8 Other Eastern 31
## 9 Moslem/Islam 108
## 10 Orthodox-Christian 96
## 11 Christian 588
## 12 Native American 24
## 13 Inter-Nondenominational 124
## 14 <NA> 233
gss %>% group_by(attend) %>% summarise(count=n())## `summarise()` ungrouping output (override with `.groups` argument)## # A tibble: 9 x 2
## attend count
## <fct> <int>
## 1 Lt Once A Year 4351
## 2 Once A Year 7476
## 3 Sevrl Times A Yr 7202
## 4 Once A Month 4067
## 5 2-3X A Month 5060
## 6 Nrly Every Week 3183
## 7 Every Week 11383
## 8 More Thn Once Wk 4370
## 9 <NA> 9969
eda_df <- select(gss, relig, attend, teensex, premarsx)
eda_df <- eda_df %>% mutate(relig_mindset=if_else(attend=='Nrly Every Week','Religious',if_else(attend=='Every Week','Very Religious',if_else(attend=='More Thn Once Wk','Very Religious',if_else(attend=='2-3X A Month','Religious',if_else(attend=='Once A Month','Religious','Less Religious'))))))
str(eda_df)
## 'data.frame': 57061 obs. of 5 variables:
## $ relig : Factor w/ 13 levels "Protestant","Catholic",..: 3 2 1 5 1 1 2 3 1 1 ...
## $ attend : Factor w/ 9 levels "Never","Lt Once A Year",..: 3 8 5 NA NA 3 8 NA 4 9 ...
## $ teensex : Factor w/ 5 levels "Always Wrong",..: NA NA NA NA NA NA NA NA NA NA ...
## $ premarsx : Factor w/ 5 levels "Always Wrong",..: 4 1 1 1 3 3 4 3 4 1 ...
## $ relig_mindset: chr "Less Religious" "Very Religious" "Religious" NA ...
eda_df$relig_mindset <- factor(eda_df$relig_mindset)
str(eda_df)
## 'data.frame': 57061 obs. of 5 variables:
## $ relig : Factor w/ 13 levels "Protestant","Catholic",..: 3 2 1 5 1 1 2 3 1 1 ...
## $ attend : Factor w/ 9 levels "Never","Lt Once A Year",..: 3 8 5 NA NA 3 8 NA 4 9 ...
## $ teensex : Factor w/ 5 levels "Always Wrong",..: NA NA NA NA NA NA NA NA NA NA ...
## $ premarsx : Factor w/ 5 levels "Always Wrong",..: 4 1 1 1 3 3 4 3 4 1 ...
## $ relig_mindset: Factor w/ 3 levels "Less Religious",..: 1 3 2 NA NA 1 3 NA 1 3 ...
r_quest1 <- eda_df %>% drop_na() %>% group_by(relig_mindset, premarsx) %>% summarise (count=n())## `summarise()` regrouping output by 'relig_mindset' (override with `.groups` argument)r_quest1## # A tibble: 12 x 3
## # Groups: relig_mindset [3]
## relig_mindset premarsx count
## <fct> <fct> <int>
## 1 Less Religious Always Wrong 833
## 2 Less Religious Almst Always Wrg 480
## 3 Less Religious Sometimes Wrong 1577
## 4 Less Religious Not Wrong At All 3867
## 5 Religious Always Wrong 1142
## 6 Religious Almst Always Wrg 503
## 7 Religious Sometimes Wrong 1086
## 8 Religious Not Wrong At All 1797
## 9 Very Religious Always Wrong 3002
## 10 Very Religious Almst Always Wrg 690
## 11 Very Religious Sometimes Wrong 896
## 12 Very Religious Not Wrong At All 1054
ggplot(r_quest1, aes(fill=premarsx, x=relig_mindset, y=count)) + geom_bar(position = 'dodge', stat = 'identity') + xlab('Religious Seriousness') + ylab('Respondents') + ggtitle('Religion and Pre-marital sex') + scale_fill_discrete(name='Premarital sex')

Inference

eda_df <- eda_df %>% filter(!is.na(relig), !is.na(attend), !is.na(teensex))

eda_df %>% group_by(relig,premarsx,teensex,relig_mindset) %>% summarise(count=n())
## `summarise()` regrouping output by 'relig', 'premarsx', 'teensex' (override with `.groups` argument)## # A tibble: 350 x 5
## # Groups: relig, premarsx, teensex [156]
## relig premarsx teensex relig_mindset count
## <fct> <fct> <fct> <fct> <int>
## 1 Protestant Always Wrong Always Wrong Less Religious 591
## 2 Protestant Always Wrong Always Wrong Religious 870
## 3 Protestant Always Wrong Always Wrong Very Religious 2285
## 4 Protestant Always Wrong Almst Always Wrg Less Religious 12
## 5 Protestant Always Wrong Almst Always Wrg Religious 13
## 6 Protestant Always Wrong Almst Always Wrg Very Religious 18
## 7 Protestant Always Wrong Sometimes Wrong Less Religious 3
## 8 Protestant Always Wrong Sometimes Wrong Religious 3
## 9 Protestant Always Wrong Sometimes Wrong Very Religious 4
## 10 Protestant Always Wrong Not Wrong At All Religious 3
## # ... with 340 more rows
observed=table(eda_df %>% select(relig_mindset, premarsx))
observed
## premarsx
## relig_mindset Always Wrong Almst Always Wrg Sometimes Wrong Not Wrong At All
## Less Religious 833 480 1577 3867
## Religious 1142 503 1086 1797
## Very Religious 3002 690 896 1054
## premarsx
## relig_mindset Other
## Less Religious 0
## Religious 0
## Very Religious 0
  • Independence: Since the sample was collected using simple random sampling the independence condition is met.
  • Sample size: each cell has at least five cases (we remove the column ‘Other’ to meet this condition)
observed <- observed [,-5]

observed
## premarsx
## relig_mindset Always Wrong Almst Always Wrg Sometimes Wrong Not Wrong At All
## Less Religious 833 480 1577 3867
## Religious 1142 503 1086 1797
## Very Religious 3002 690 896 1054
inference(y=premarsx, x=relig_mindset,data=eda_df, type='ht',method='theoretical',  statistic = 'proportion', null = 0, alternative = 'greater')## Response variable: categorical (5 levels) 
## Explanatory variable: categorical (3 levels)
## Observed:
## y
## x Always Wrong Almst Always Wrg Sometimes Wrong Not Wrong At All
## Less Religious 833 480 1577 3867
## Religious 1142 503 1086 1797
## Very Religious 3002 690 896 1054
##
## Expected:
## y
## x Always Wrong Almst Always Wrg Sometimes Wrong Not Wrong At All
## Less Religious 1986.742 667.8361 1420.6985 2681.723
## Religious 1331.356 447.5302 952.0383 1797.076
## Very Religious 1658.902 557.6337 1186.2632 2239.201
##
## H0: relig_mindset and premarsx are independent
## HA: relig_mindset and premarsx are dependent
## chi_sq = 3133.7379, df = 6, p_value = 0

References

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
George K. Agyen

George K. Agyen

A graduate student who is interested and research and data analysis in environmental engineering.