-
Notifications
You must be signed in to change notification settings - Fork 0
/
SGAM-paper.qmd
563 lines (431 loc) · 29.1 KB
/
SGAM-paper.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
---
title: "Social Genetic Assortative Mating"
author: "David Hugh-Jones, Oana Borcan, Pierre Chiappori, and Abdel Abdellaoui"
format: pdf
editor: visual
knitr:
opts_chunk:
dev: "cairo_pdf"
execute:
echo: false
bibliography: bibliography.bib
---
<!--# TODO How about we take the regressions and plug them into the model -->
<!--# Assume a reasonable value for theta; back out a from the observed correlation between education and each spousal PGI; -->
<!--# calculate the long-run correlation; compare it to the observed sigma -->
<!--# we may need to take account of error in measured PGIs -->
## Introduction
*Possibly extend model to include fertility and evolution.*
Humans mate assortatively on both genetics and phenotype. This has important consequences for social structure and genetic architecture. Assortative mating on income, wealth or education can increase inequality [@fernandez2001sorting; @fernandez2005love; @eika2019educational; @chiappori2018marriage]. Genetic assortative mating (GAM) violates the assumptions of standard models in statistical genetics (XXX which?). Cross-trait assortative mating, where people with one trait match to partners with a different trait, can lead the traits to become associated in the population [@Beauchamp_2010; @Sundet_2005; @border2022cross].
@abch2022 introduce Social Genetic Assortative Mating (SGAM). Under SGAM, partners with a socially defined trait, such as socio-economic status (SES), match partners with certain genetic variants. This could happen if SES is valued in marriage markets, and if other valued features, like intelligence or physical attractiveness, are partly under genetic control. SGAM has important consequences for statistical genetics, social structure, and genetic architecture. When SES is inherited, SGAM induces a correlation between SES and genetics -- the so-called "genes-SES gradient". Under SGAM, environmental shocks to a person's SES are reflected, via partner choice, in the genetics of their descendants; also, the size of the genes-SES gradient depends upon social institutions. SGAM can be a mechanism for the long-run transmission of social inequality.
This paper tests for SGAM for education, an important measure of SES, against 33 genetic traits captured in polygenic indices (PGIs). We show that more educated people match with partners who are systematically different on many PGIs. Using birth order as an environmental shock to SES, we provide evidence that this is not purely due to genetic assortative mating. Lastly we discuss the implications of SGAM in detail.
# Theory
Population members have two characteristics: $x=( x_{1},x_{2})$, drawn from a normal distribution
$$
\mathcal{N}
\left(
\begin{array}{c}
0 \\
0%
\end{array}%
,%
\begin{array}{cc}
s^{2} & \sigma \\
\sigma & S^{2}%
\end{array}%
\right).
$$
where $x_1$ is a genetic measure, and $x_2$ is a measure of SES. They mate assortatively on attractiveness, given by $i(x) =ax_{1} + (1-a) x_{2}$ where $a \in [0, 1]$ reflects the relative importance of genetics to SES. Individuals match with a partner randomly selected from among the same quantile of the distribution of $i(x)$ as themselves. All couples have the same number of children. Children's characteristics are given by:
```{=tex}
\begin{eqnarray}
x_{1}^{\prime } &=&\frac{\tau }{2}\left( x_{1}+y_{1}\right) +\varepsilon
\label{Chil} \\
x_{2}^{\prime } &=&\frac{\theta }{2}\left( x_{2}+y_{2}\right) +\eta
\nonumber
\end{eqnarray}
```
where $x$ and $y$ are the child's parents, and $\varepsilon$ and $\eta$ are independent normal shocks with mean $0$ and variance $1$. Parameter $\tau \approx 1$ reflects genetic inheritance. We expect $\tau < 1$ due to stabilizing selection [@schmalhausen1949factors; @sanjak2018evidence]. Parameter $\theta \in [0, 1]$ reflects social inheritance of SES. Unlike $\tau$, it may vary between societies. $\theta$ is high when there is high intergenerational transmission of SES. Thus, $\theta$ captures social and economic institutions that affect this intergenerational transmission, such as inheritance taxation, public education or hereditary titles. We assume for now that a person's SES does not depend on their genetic endowment.
***Claim**: under SGAM, if* $x_1$ *and* $x_2$ *are uncorrelated in the parents' generation, then they are positively correlated in the children's generation so long as* $0 < a < 1$*. The correlation is increasing in* $\theta$*.*
***Proof**: @abch2022 Claim 4.*
We now consider the asymptotic distribution of $x_1$ and $x_2$ when the matching process is repeated over many generations.
***Proposition**: For* $\theta < 1$ *and* $\tau <1$*, the asymptotic distribution converges to a normal distribution with mean zero and covariance matrix*
$$
\mathbb{C}\left(\begin{array}{cc}
x_1 \\
x_2
\end{array}
\right)
=
\left(
\begin{array}{cc}
\bar{s}^{2} & \bar{\sigma} \\
\bar{\sigma} & \bar{S}^{2}%
\end{array}%
\right)
$$
*The asymptotic correlation between characteristics,* $corr = \bar{\sigma}/\bar{s}\bar{S}$*, is non-negative, positive for* $0 < a < 1$*, increasing in* $\theta$ *and increasing then decreasing in* $a$*.*
***Proof**: @abch2022 Proposition 3.*
The model can be extended to allow different $a$ parameters for men and women, direct dependency of SES on genetics ("meritocracy"), non-normal distributions of $x_1$ and $x_2$, and a nonlinear attractiveness function $i(\cdot)$.
# Data
```{r}
#| label: setup
#| warning: false
library(dplyr)
library(ggplot2)
library(purrr)
library(forcats)
requireNamespace("Formula", quietly = TRUE)
requireNamespace("fixest", quietly = TRUE)
requireNamespace("drake", quietly = TRUE)
requireNamespace("patchwork", quietly = TRUE)
requireNamespace("scales", quietly = TRUE)
set.seed(27101975)
pretty <- function (n, digits = 2, ...) {
formatC(n, digits = digits, big.mark = ",", ...)
}
pct <- scales::label_percent(0.1)
regression_subset <- function (data) {
data %>%
filter(
! is.na(birth_order.x),
! is.na(height.x),
! is.na(fluid_iq.x),
! is.na(university.x),
! is.na(bmi.x),
! is.na(sr_health.x)
)
}
calc_prop_shared_children <- function (mf_pairs) {
drake::loadd(parent_child)
# all pairs where at least 1 parent has a genetic child in the sample
mf_w_parent <- mf_pairs %>%
left_join(parent_child, by = c("ID.m" = "parent_id")) %>%
left_join(parent_child, by = c("ID.f" = "parent_id"),
suffix = c(".m", ".f")) %>%
filter(! is.na(child_id.m) | ! is.na(child_id.f))
n_one_has_kid <- nrow(mf_w_parent)
n_both_same_kid <- mf_w_parent %>%
filter(child_id.m == child_id.f) %>% # NAs excluded
nrow()
return(list(one = n_one_has_kid, both = n_both_same_kid))
}
pretty_names <- function (names) {
pretty <- c(
ADHD_2017 = "ADHD",
age_at_menarche = "Age at menarche",
age_at_menopauze = "Age at menopause",
agreeableness = "Agreeableness",
ai_substance_use = "Age at smoking initiation",
alcohol_schumann = "Alcohol use",
alzheimer = "Alzheimer",
autism_2017 = "Autism",
bipolar = "Bipolar",
bmi_combined = "BMI",
body_fat = "Body Fat",
caffeine = "Caffeine",
cannabis = "Cannabis (ever vs. never)",
cognitive_ability = "Cognitive Ability",
conscientiousness = "Conscientiousness",
coronary_artery_disease = "Coronary Artery Disease",
cpd_substance_use = "Cigarettes per day",
diagram_T2D = "Type 2 Diabetes",
dpw_substance_use = "Drinks per week",
EA2_noUKB = "Educ. attainment 2 (no UKBB)",
EA3_excl_23andMe_UK = "Educ. attainment 3 (no UK)",
eating_disorder = "Eating disorder",
extraversion = "Extraversion",
height_combined = "Height",
hip_combined = "Hip circumference",
MDD_PGC2_noUKB = "Major Depressive Disorder",
neuroticism = "Neuroticism",
openness = "Openness",
sc_substance_use = "Smoking cessation",
SCZ2 = "Schizophrenia",
si_substance_use = "Smoking initiation",
wc_combined = "Waist circumference",
whr_combined = "Waist-hip ratio"
)
pretty[names]
}
plot_coefs <- function (mods, xlab) {
mods |>
mutate(
dep.var = forcats::fct_reorder(dep.var, estimate)
) |>
ggplot(aes(y = dep.var, x = estimate)) +
geom_pointrange(aes(xmin = conf.low, xmax = conf.high),
color = "steelblue4", size = 0.3) +
geom_vline(xintercept = 0, linetype = 2) +
labs(y = "Spouse PGI", x = xlab)
}
knitr::opts_chunk$set(echo = FALSE)
knitr::knit_hooks$set(
inline = function (x) {
if (is.numeric(x)) x <- as.character(round(x, getOption("digits")))
x <- gsub("-", "\u2212", x)
paste(as.character(x), collapse = ", ")
}
)
options(huxtable.long_minus = TRUE)
options(digits = 3)
theme_set(theme_minimal())
theme_update(
text = element_text(family = "Abadi MT Condensed Light")
)
drake::loadd(mf_pairs)
drake::loadd(mf_pairs_twice)
drake::loadd(score_names)
names(score_names) <- pretty_names(score_names)
drake::loadd(van_alten_weights)
mf_pairs_twice <- mf_pairs_twice |> left_join(van_alten_weights,
by = c("eid.x" = "f.eid"))
# this helps when running regressions, to avoid coefficients named
# "university.xTRUE":
mf_pairs_twice$university.x <- as.numeric(mf_pairs_twice$university.x)
mf_pairs_reg <- regression_subset(mf_pairs_twice)
mf_pairs_bo <- mf_pairs_reg |>
dplyr::filter(
n_sibs.x >= 2,
n_sibs.x <= 6
)
# drake::loadd(famhist)
# drake::loadd(resid_scores)
#
# famhist %<>% left_join(resid_scores, by = "f.eid")
# rm(resid_scores)
# famhist$EA3 <- famhist$EA3_excl_23andMe_UK_resid
```
```{r calc-validate-pairs}
prop_shared <- calc_prop_shared_children(mf_pairs)
n_one_has_kid <- prop_shared$one
n_both_same_kid <- prop_shared$both
```
We use respondents from UK Biobank [@bycroft2018uk]. UK Biobank does not include explicit information on partner identity. We classify people as partners if they:
- had the same home postcode on at least one occasion;[^1]
- both reported the same homeownership/renting status, length of time at the address, and number of children;
- attended the same UK Biobank assessment center on the same day;
- both reported living with their spouse ("husband, wife or partner");
- consisted of one male and one female.
[^1]: A typical UK postcode contains about 15 properties.
We also eliminate all pairs where either spouse appeared in more than one pair. This leaves a total of `r pretty(nrow(mf_pairs))` pairs. We validate our pairs by using genetic children who also happen to be UK Biobank respondents. Among our spouse pairs, `r n_one_has_kid` have a genetic child of at least one partner in the sample. `r pct(n_both_same_kid/n_one_has_kid)` of these are the genetic child of *both* partners. This is a lower bound, since some of the remaining pairs may also have a shared genetic child not in the sample. For comparison, 11% of families with dependent children included a stepchild in England and Wales in 2011 [@ons2011stepfamilies].
For our analyses we use pairs twice, once for each focal member. We cluster standard errors by pair. For analyses on birth order, we use focal members with between 1 and 5 siblings. All regressions are weighted using weights from @vanalten2022reweighting, which aim to adjust for non-response bias and match the UK Biobank sampling frame.
# Results
```{r}
#| label: calc-mods-educ
regress_educ <- function (score) {
formula <- Formula::as.Formula(
sprintf("%s_resid.y ~ university.x + %s_resid.x | YOB.x",
score, score)
)
mod <- fixest::feols(formula, data = mf_pairs_reg, notes = FALSE,
weights = ~ weights,
cluster = ~ couple_id)
broom::tidy(mod, conf.int = TRUE) |>
dplyr::filter(term == "university.x")
}
mods_educ <- map_dfr(score_names, regress_educ, .id = "dep.var")
regress_own_educ <- function (score) {
formula <- Formula::as.Formula(
sprintf("%s_resid.x ~ university.x | YOB.x", score)
)
mod <- fixest::feols(formula, data = mf_pairs_reg, notes = FALSE,
weights = ~ weights,
cluster = ~ couple_id)
# conf.int = TRUE leads to empty dataset
broom::tidy(mod) |>
dplyr::filter(term == "university.x")
}
mods_own_educ <- map_dfr(score_names, regress_own_educ, .id = "dep.var")
mods_both <- left_join(mods_educ, mods_own_educ, by = "dep.var",
suffix = c(".spouse", ".own"))
cor_spouse_own <- cor(mods_both$estimate.own, mods_both$estimate.spouse)
beta_spouse_own <- coef(lm(mods_both$estimate.spouse ~ mods_both$estimate.own))[2]
```
@fig-mods-educ plots coefficients of university attendance in regressions of 33 spousal PGIs, controlling for own corresponding PGI and own year born. PGIs are normalized, so that effect sizes are in standard deviations. There is a very large positive correlation between these coefficients, and coefficients of university attendance on the subject's *own* corresponding PGI ($\rho$ = `r cor_spouse_own`). So, assortative mating between education and spousal PGIs very accurately predicts the genes-SES gradient. Effect sizes are also quite similar: in a regression of spouse effect size on own effect size across the PGIs, the slope is `r beta_spouse_own`.
```{r}
#| label: fig-mods-educ
#| fig-cap: "Coefficients of university attendance in regressions of spouse PGIs. Controls: dummy variables for own year born, own PGI. Lines show 95% confidence intervals."
plot_coefs(mods_educ, xlab = "Effect of university attendance")
```
```{r}
#| label: calc-mods-own-bo
regress_own_bo <- function (score) {
formula <- Formula::as.Formula(
sprintf("%s_resid.x ~ birth_order.x | YOB.x + n_sibs.x", score)
)
mod <- fixest::feols(formula, data = mf_pairs_bo, notes = FALSE,
weights = ~ weights,
cluster = ~ couple_id)
broom::tidy(mod) |>
dplyr::filter(term == "birth_order.x")
}
mods_own_bo <- map_dfr(score_names, regress_own_bo, .id = "dep.var")
stopifnot(all(mods_own_bo$p.value > 0.1/33))
stopifnot(sum(mods_own_bo$p.value < 0.05) == 2)
stopifnot(all(abs(mods_own_bo$estimate) < 0.03))
```
These effects could be explained by pure genetic assortative mating (GAM). Partners might match on phenotypes which predict PGIs, e.g. intelligence for educational attainment PGIs, with socio-economic status playing no independent role. To demonstrate SGAM, we use birth order as an environmental shock to SES. Birth order is known to predict life outcomes including education and income [@black2011older; @booth2009birth; @Lindahl_2008]. At the same time, among sibling pairs, birth order is independent of genetic variation, due to the "lottery of meiosis". In the UK Biobank sample, a respondent's birth order does not predict any of their own 33 PGIs at $p$ \< 0.10/33.[^2]
[^2]: Two scores are significant at *p* \< 0.05. No effect sizes are greater than 0.03 standard deviations.
```{r}
#| label: calc-mods-bo
regress_bo <- function (score, data = mf_pairs_bo) {
formula <- Formula::as.Formula(
sprintf("%s_resid.y ~ birth_order.x + par_age_birth.x + %s_resid.x |
YOB.x + n_sibs.x", score, score)
)
mod <- fixest::feols(formula, data = data, notes = FALSE,
weights = ~ weights,
cluster = ~ couple_id)
broom::tidy(mod, conf.int = TRUE) |>
dplyr::filter(term == "birth_order.x")
}
mods_bo <- map_dfr(score_names, regress_bo, .id = "dep.var")
mods_bo_educ <- left_join(mods_bo, mods_educ, by = "dep.var")
cor_bo_educ <- cor(mods_bo_educ$estimate.x, mods_bo_educ$estimate.y)
mods_bo_own_educ <- left_join(mods_bo, mods_own_educ, by = "dep.var")
cor_bo_own_educ <- cor(mods_bo_own_educ$estimate.x,
mods_bo_own_educ$estimate.y)
```
```{r}
#| label: calc-boot-cor
#| cache: true
# Romano-Wolf stepwise testing. Controls FWER
# (prob of falsely rejecting at least one null)
# This takes account of dependence between the hypotheses.
# However, it still fails to reject any of the 33 nulls.
#
# Copyright (C) 2011-2015 Gray Calhoun; MIT license
# From https://github.com/grayclhn/oosanalysis-R-library
#
# stepm <- function(teststatistics, bootmatrix, lefttail, righttail) {
# nstatistics <- length(teststatistics)
# rejected <- rep(FALSE, nstatistics)
# nrejected.endofloop <- 0
# repeat {
# nrejected.topofloop <- nrejected.endofloop
# bootmax <- apply(bootmatrix[!rejected,, drop = FALSE], 2, max)
# bootmin <- apply(bootmatrix[!rejected,, drop = FALSE], 2, min)
# rightcrit <- if (is.na(righttail)) Inf else
# quantile(bootmax, 1 - righttail)
# leftcrit <- if (is.na(lefttail)) -Inf else
# quantile(bootmin[bootmax <= rightcrit],
# lefttail / (1 - righttail))
# rejected[teststatistics < leftcrit | teststatistics > rightcrit] <- TRUE
# nrejected.endofloop <- sum(rejected)
# if (nrejected.endofloop == nrejected.topofloop ||
# nrejected.endofloop >= nstatistics) break
# }
# return(list(leftcrit = leftcrit, rightcrit = rightcrit,
# rejected = rejected))
# }
R <- 100
boot_cor <- replicate(R, {
boot_sample <- mf_pairs_bo |> slice_sample(prop = 1, replace = TRUE)
boot_mods_bo <- map_dfr(score_names, regress_bo, data = boot_sample,
.id = "dep.var")
cor(boot_mods_bo$estimate, mods_educ$estimate)
})
boot_cor_ci <- quantile(boot_cor, c(0.025, 0.975))
```
```{r}
#| label: calc-mods-med
regress_med <- function (score, controls) {
controls_string <- paste(controls, collapse = " + ")
formula <- Formula::as.Formula(
sprintf("%s_resid.y ~ birth_order.x + %s + par_age_birth.x + %s_resid.x | YOB.x + n_sibs.x",
score, controls_string, score)
)
mod <- fixest::feols(formula, data = mf_pairs_bo, notes = FALSE,
weights = ~ weights,
cluster = ~ couple_id)
broom::tidy(mod, conf.int = TRUE) |>
dplyr::filter(term %in% c("birth_order.x", controls))
}
controls <- c("university.x", "fluid_iq.x", "height.x", "bmi.x",
"sr_health.x")
names(controls) <- controls
mods_med <- map_dfr(score_names, regress_med, controls = controls,
.id = "dep.var")
regress_first_stage <- function (dep.var) {
formula <- Formula::as.Formula(
sprintf("%s ~ birth_order.x + par_age_birth.x | YOB.x + n_sibs.x",
dep.var)
)
mod <- fixest::feols(formula, data = mf_pairs_bo, notes = FALSE,
weights = ~ weights,
cluster = ~ couple_id)
broom::tidy(mod, conf.int = TRUE) |>
dplyr::filter(term == "birth_order.x")
}
mods_first_stage <- map_dfr(controls, regress_first_stage, .id = "dep.var")
# for each control: mediated effect is control in first stage
mods_first_med <- inner_join(mods_first_stage, mods_med,
by = c("dep.var" = "term"),
suffix = c(".1st", ".2nd"))
# effect of birth order on mediator * effect of mediator on spouse PGI
mods_first_med$mediated.effect <- mods_first_med$estimate.1st *
mods_first_med$estimate.2nd
```
```{r}
#| label: fig-mods-bo
#| fig-cap: "Effects of birth order on spouse PGIs. (a) Coefficients of birth order in regressions of spouse PGIs. Controls: number of siblings (dummies), birth year dummies, parental age at birth, own PGI. Lines show uncorrected 95% confidence intervals. (b) Percent of birth order effect mediated by university attendance, for PGIs where effect of university is significant at $p$ < 0.05/33."
#| fig-subcap: true
#| layout-nrow: 2
plot_coefs(mods_bo, xlab = "Effect of birth order")
mods_first_med |>
filter(
dep.var == "university.x",
p.value.2nd < 0.05/33
) |>
left_join(mods_bo[c("dep.var", "estimate")], by =
c("dep.var.2nd" = "dep.var")) |>
mutate(
`Percent mediated` = mediated.effect/estimate,
`Spouse PGI` = fct_reorder(dep.var.2nd, `Percent mediated`)
) |>
ggplot(aes(y = `Spouse PGI`, x = `Percent mediated`)) +
geom_col(fill = "steelblue4") +
coord_cartesian(xlim = c(0, 1)) +
scale_x_continuous(labels = scales::label_percent())
```
@fig-mods-bo (a) shows coefficients of birth order in regressions of spouse PGIs. We control for number of siblings, since couples who have more children may genetically differ. We also control for parents' age at birth, which correlates with birth order but typically has an opposite-signed effect. Birth order is significant at the 5% level for PGIs for waist circumference, waist-hip ratio and educational attainment (EA3). However, no coefficients are significant at Bonferroni-corrected $p$ = 0.05/33, probably because the effect of birth order is relatively small. The Bonferroni correction is likely to be conservative here since p values are positively correlated. Nevertheless, across PGIs, coefficients on birth order are strongly negatively correlated with coefficients on university attendance from the previous regression ($\rho$ =`r cor_bo_educ`, 95% confidence interval from 100 bootstraps `r boot_cor_ci[1]` to `r boot_cor_ci[2]` ). This fits the hypothesis that siblings later in the birth order have characteristics which are less attractive in marriage markets, while university attendance is an attractive characteristic. If the true birth order coefficients were zero, we would expect their estimated values to be uncorrelated with the education coefficients.
Next, we add mediators to the birth order regression. These include university attendance, which is a measure of socio-economic status, and four non-SES mediators (IQ, height, self-reported health and body mass index). All these mediators are significantly predicted by birth order.The proportion of the birth order effect that is mediated by university attendance can be calculated as
$$
\frac{\textrm{Coef. of birth order on university}\times\textrm{coef. of university on spouse PGI}}{\textrm{Coef. of birth order on spouse PGI}}
$$
where the coefficient of university on PGI is estimated controlling for birth order and the other mediators. @fig-mods-bo (b) shows this proportion for PGIs where university attendance predicts spouse PGI at $p$ \< 0.05/33. In particular, university attendance is a significant mediator of the birth order effect on spousal PGIs for education.
# Discussion
SGAM has important implications. For *statistical genetics*, SGAM leads to gene-environment correlation (rGE). Children of couples who mate according to SGAM will inherit both genetic variants biologically, and SES socially. This is like how cross-trait genetic assortative mating leads to genetic variants for different traits being correlated in the population. rGE from SGAM can arise in a single generation, meaning that it probably won't be captured by principal components of genetic data (which are commonly used to control for population stratification). As is well known, rGE is a potential confound for genetic analyses which fail to control for the environment, and equally for analyses of environmental effects which don't control for genetics.
```{r}
#| label: calc-mods-own-income
drake::loadd(famhist) # already includes weights
drake::loadd(resid_scores)
famhist <- left_join(famhist, resid_scores, by = "f.eid")
regress_income <- function (score) {
formula <- Formula::as.Formula(
sprintf("%s_resid ~ factor(income_cat)| YOB", score)
)
mod <- fixest::feols(formula, data = famhist,
weights = ~ weights, notes = FALSE)
stats <- fixest::fitstat(mod, c("wf.p", "wr2"))
data.frame(
p.value = stats$wf.p,
r2 = stats$wr2,
estimate = coef(mod)[["factor(income_cat)5"]]
)
}
mods_own_income <- map_dfr(score_names, regress_income, .id = "dep.var")
n_sig_income <- sum(mods_own_income$p.value < 0.05/33)
big <- abs(mods_own_income$estimate) > 0.1
scores_large_income <- mods_own_income$dep.var[big]
```
Specifically, SGAM leads to a genes-SES gradient, in which different genetic variants are associated with SES. Among the entire UK Biobank sample, `r n_sig_income` own PGIs are significantly associated with income at $p$ \< 0.05/33. This includes PGIs for educational attainment, but also PGIs for health outcomes, body phenotypes and behaviours. `r length(scores_large_income)` scores show effect sizes of more than 0.1 standard deviations for the difference between the highest and lowest income category.[^3]
[^3]: Controlling for year of birth. Scores: `r paste(scores_large_income, collapse = ", ")` .
The leading explanation for the genes-SES gradient relies on labour markets. People with variants which predict high earnings will achieve higher SES, and will pass both genes and their SES to children. In this mechanism, causality goes from genetics to SES. It also relies on some degree of "meritocracy" or openness in labour markets. Indeed the heritability of SES increased after the fall of communism in Estonia, probably because SES became more determined by factors under genetic control [@Rimfeld_2018].
By contrast, SGAM works via assortative mating in marriage markets. Widespread meritocracy is historically recent: in most historical societies, status was determined at birth and was hard to alter. Assortative mating, on the other hand, is a historically widespread and may be a human universal. So, unlike the "labour market" explanation, SGAM predicts that genes-SES gradients should exist in pre-modern societies. This remains to be tested using ancient DNA data. Note that the phenotypes considered advantageous in marriage markets are likely to vary across societies, time periods and cultures. As a result, we may see different genetic variants being associated with SES in different times and places.
Under SGAM, there is two-way causality between genes and SES. Environmental shocks to a person's SES may be reflected, via the identity of their partner, in the genetics of their descendants. This provides a new mechanism underlying *persistent social inequality*. Over the long run, measures of SES like income, wealth and education are highly persistent -- more persistent than single-generation correlations between parents and children would lead us to expect [@clark2015intergenerational; @clark2021bell; @barone2021intergenerational; @de2020lineages]. One explanation, put forward by Clark and others, is that SES is a noisy measure of an underlying genetic factor preserved by inheritance and strong genetic assortative mating. SGAM could also increase persistence of SES, but with a difference: under SGAM, genetic variation is not exogenous to SES. Instead, it can be a mediator linking initial SES to genetic differences in subsequent generations (and *vice versa*, from genetic differences to subsequent SES).
Also, under SGAM, the strength of the genes-SES association, as well as the population variance of SES, varies according to social institutions. In our model, the correlation between genes and SES is stronger in societies where SES is more (socially) heritable, i.e. where $\theta$ is higher. Genetic explanations of social inequality are often thought of fatalistically. But according to SGAM, policies like redistributive taxation or mass public education may affect the distribution not only of SES variables, but even of genetic variants themselves. The catch is that this is a long-run process, playing out over multiple generations. Clean comparisons of genes-SES gradients across societies (i.e. tests of "rGE x E") are difficult, but are an important direction for future research.
SGAM could also drive *gene-culture coevolution*. If high-status individuals have greater fitness, as is presumed true in most historical societies [@hopcroft2006sex], then genes associated with high SES via SGAM could spread within a society. If standards of attractiveness vary across societies, this might lead to genetic differences. In this context, it is interesting that regions of the genome associated with visible physical differences vary more between ethnic groups than other regions (XXX trying to find this cite). We speculate that SGAM could play a role in explaining this finding.
The broadest message of SGAM is that genetic variation is a social outcome. DNA affects humans by biological mechanisms: from the manufacture of RNA and then of proteins, to the effect of proteins on the human body, to resulting changes in physiology, psychology and behaviour. But if we step back and ask where DNA comes from, the answer is always a particular set of social institutions. One can even think of genetic variation as a resource, traded and competed for in marriage markets by individuals and lineages. That gives a good reason to be sceptical of mapping genes on to a dichotomy between nature and nurture, or nature and society.[^4] It also provides more impetus to pursue the integration of genetics and the social sciences, as an equal partnership between both sides.
[^4]: This is in addition to other reasons, like the existence of gene-environment interactions.
## References