Last week I traveled back to my scientific roots (Wageningen University) to participate in a course on Structural Equation Modeling (SEM) given by Bill Shipley (he is particularly well known for his book ‘Cause and Correlation in Biology’). Structural equation models (also called path models) can be used to evaluate a hypothesized sequence of variables affecting each other, and whether the underlying data support such a sequence. For example, ecosystem functions (e.g. productivity and decomposition) can be affected by the biomass of the vegetation, and biomass in turn can be affected by the age of the plot (e.g. during succession) (Lohbeck et al. 2015).
As an evolutionary ecologist I was a bit of a misfit in the group. The group was dominated by Dutch PhD students and professors working in ecology (e.g. functional ecology, community assembly, soil science). They often collect data from plots, and such data fit perfectly well in a structural equation model. My data did not, for several reasons. My ‘plots’ are fossil assemblages (species richness = count data, problem 1), collected during the Cenozoic (different time scales, problem 2), and the variables we have (e.g. CO2 concentration, temperature, latitude) are often not assemblage-specific but biased by time, and not normally distributed. On the positive side, I have a large sample size (N=666), which is necessary to have enough power to run these SEMs. So how can I test which factors directly and indirectly affect biodiversity (species richness)?
The solution. There is a solution. If your data are spatially or phylogenetically biased, if your variables are not normally distributed, if you deal with binary/categorical/count data, if you have a nested design… The solution is the d-separation test. (d-sep cannot deal with ‘latent’ variables, i.e. unmeasured variables which may be important for the model.)
d-separation in 6 steps:
- draw your hypothesized model as a DAG (“Directed Acyclic Graph”: avoid feedback loops in the model!) (for simplicity: A <- B <- C)
- write down each pair of variables not directly connected by an arrow (in our example only the pair A and C)
- for each pair, list the causal parents of both variables (the causal parent of A is B, the causal parent of B is C, and C has no parent. For our pair A and C there is just one causal parent: B)
- for each pair, run a suitable linear model/generalized linear model/PGLS/mixed model in which you test the effect of one variable of the pair on the other, conditioned on the parent variables; in our example, the effect of C on A conditioned on B (A ~ B + C)
- take the natural logarithm of the probability (p-value) of the tested slope coefficient in each regression, and sum these logarithms (in this case only one regression model was run, and we assess the coefficient of C and its p-value)
- calculate the C-statistic: -2 * (the sum calculated in step 5) and compare this to a Chi-square distribution. The degrees of freedom are 2 * the number of regressions run (in our case 2 degrees of freedom). If p>0.05, you cannot reject your hypothesized model. If p<0.05 your data do not support the model.
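The bookkeeping in steps 1 to 3 and the arithmetic in steps 5 and 6 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the course: the function names `basis_set` and `fisher_c` and the example p-value of 0.30 are hypothetical, the pairing rule is the simple one described above (condition on the parents of both variables), and the regression of step 4 is left out because it depends on which model suits your data.

```python
import math
from itertools import combinations


def basis_set(parents):
    """Steps 1-3: list every non-adjacent pair with its conditioning set.

    `parents` maps each variable to the set of its direct causes
    (hypothetical input format chosen for this sketch).
    """
    claims = []
    for x, y in combinations(sorted(parents), 2):
        if x in parents[y] or y in parents[x]:
            continue  # the pair is connected by an arrow, so skip it
        conditioning = sorted((parents[x] | parents[y]) - {x, y})
        claims.append((x, y, conditioning))
    return claims


def fisher_c(p_values):
    """Steps 5-6: C = -2 * sum(ln p), compared to chi-square with 2k df."""
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    k = len(p_values)
    c = -2.0 * sum(math.log(p) for p in p_values)
    df = 2 * k
    # For an even number of degrees of freedom (2k) the upper tail of the
    # chi-square distribution has a closed form, so no stats library is needed.
    half = c / 2.0
    p_model = math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k))
    return c, df, p_model


# The example DAG from the text: A <- B <- C
dag = {"A": {"B"}, "B": {"C"}, "C": set()}
for x, y, given in basis_set(dag):
    print(f"test {x} against {y}, conditioned on {given}")

# Hypothetical p-value for the coefficient of C in the regression A ~ B + C
c_stat, df, p_model = fisher_c([0.30])
print(f"C = {c_stat:.3f}, df = {df}, p = {p_model:.3f}")
```

With a single regression, as in our example, the p-value of the whole model equals the p-value of that one coefficient; the C-statistic only starts to do real work once the model implies several independence claims.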
Thanks to the d-separation test we evolutionary biologists can still test for causal relationships in our data, even if these data are far from ‘perfect’ or complete. It provides great potential for the field of phylogenetic comparative methods. But how exactly, I’m not sure yet…
Madelon Lohbeck, Lourens Poorter, Miguel Martínez-Ramos, and Frans Bongers 2015. Biomass is the main driver of changes in ecosystem process rates during tropical forest succession. Ecology 96:1242–1252. http://dx.doi.org/10.1890/14-0472.1