# Using simphony to evaluate rhythm detection

#### 2022-12-01

Source:`vignettes/introduction.Rmd`

`introduction.Rmd`

`simphony`

is a framework for simulating rhythmic data,
especially gene expression data. Here we show an example of using it to
benchmark a method for detecting rhythmicity.

## Load the packages we’ll use

Internally, `simphony`

uses the `data.table`

package, which provides an enhanced version of the standard R
`data.frame`

. We’ll use `data.table`

for this
example as well.

## Simulate the data

Here we create a `data.table`

called
`featureGroups`

that specifies the desired properties of the
simulated genes. We want 75% of simulated genes to be non-rhythmic and
25% to have a rhythm amplitude of 1.1. Properties not specified in
`featureGroups`

will be given their default values.

Our simulated experiment will have 200 genes. Expression values will be sampled from the negative binomial family, which models read counts from next-generation sequencing data. The interval between time points will be 2 (default period of 24), with one replicate per time point. We also use the default time range of our simulated data points of between 0 and 48 hours.

```
set.seed(44)
featureGroups = data.table(fracFeatures = c(0.75, 0.25), amp = c(0, 0.3))
simData = simphony(featureGroups, nFeatures = 200, interval = 2, nReps = 1, family = 'negbinom')
```

The output of `simphony`

has three components:
`abundData`

, `sampleMetadata`

, and
`featureMetadata`

. `abundData`

is a matrix that
contains the simulated expression values. Each row of corresponds to a
gene, each column corresponds to a sample. Since we sampled from the
negative binomial family, all expression values are integers.

`kable(simData$abundData[1:3, 1:3])`

sample_01 | sample_02 | sample_03 | |
---|---|---|---|

feature_001 | 282 | 154 | 181 |

feature_002 | 213 | 310 | 200 |

feature_003 | 304 | 294 | 273 |

`sampleMetadata`

is a `data.table`

that
contains the condition (`cond`

) and time for each sample.
Here we simulated one condition, so `cond`

is 1 for all
samples.

`kable(simData$sampleMetadata[1:3])`

sample | cond | time |
---|---|---|

sample_01 | cond_1 | 0 |

sample_02 | cond_1 | 2 |

sample_03 | cond_1 | 4 |

`featureMetadata`

is a `data.table`

that
contains the properties of each simulated gene in each condition. The
`group`

column corresponds to the row in
`featureGroups`

to which the gene belongs.

```
kable(simData$featureMetadata[149:151, !'dispFunc']) %>%
kable_styling(font_size = 12)
```

cond | group | feature | fracFeatures | amp | amp0 | phase | period | rhyFunc | base | base0 |
---|---|---|---|---|---|---|---|---|---|---|

cond_1 | 1 | feature_149 | 0.75 | function (m) , x | 0.0 | 0 | 24 | .Primitive(“sin”) | function (x) , defaultValue | 8 |

cond_1 | 1 | feature_150 | 0.75 | function (m) , x | 0.0 | 0 | 24 | .Primitive(“sin”) | function (x) , defaultValue | 8 |

cond_1 | 2 | feature_151 | 0.25 | function (m) , x | 0.3 | 0 | 24 | .Primitive(“sin”) | function (x) , defaultValue | 8 |

## Plot the simulated time-course for selected genes

Here we plot the simulated time-course for a non-rhythmic gene and a
rhythmic gene. We use the `mergeSimData`

function to merge
the expression values, the sample metadata, and the gene metadata.

```
fmExample = simData$featureMetadata[feature %in% c('feature_150', 'feature_151')]
dExample = mergeSimData(simData, fmExample$feature)
```

We also want to compare the simulated expression values with their
underlying distributions over time, for which we can use the
`getExpectedAbund`

function. Since we sampled from the
negative binomial family, the resulting `mu`

column
corresponds to the expected log_{2} counts.

`dExpect = getExpectedAbund(fmExample, 24, times = seq(0, 48, 0.25))`

Then it all comes together with `ggplot`

.

```
dExample[, featureLabel := paste(feature, ifelse(amp0 == 0, '(non-rhythmic)', '(rhythmic)'))]
dExpect[, featureLabel := paste(feature, ifelse(amp0 == 0, '(non-rhythmic)', '(rhythmic)'))]
ggplot(dExample) +
facet_wrap(~ featureLabel, nrow = 1) +
geom_line(aes(x = time, y = log2(2^mu + 1)), size = 0.25, data = dExpect) +
geom_point(aes(x = time, y = log2(abund + 1)), shape = 21, size = 2.5) +
labs(x = 'Time (h)', y = expression(log[2](counts + 1))) +
scale_x_continuous(limits = c(0, 48), breaks = seq(0, 48, 8))
#> Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
#> ℹ Please use `linewidth` instead.
```

## Detect rhythmic genes

We can use the `limma`

package to detect rhythmic genes
based on a linear model that corresponds to cosinor regression.

```
sampleMetadata = copy(simData$sampleMetadata)
sampleMetadata[, timeCos := cos(time * 2 * pi / 24)]
sampleMetadata[, timeSin := sin(time * 2 * pi / 24)]
design = model.matrix(~ timeCos + timeSin, data = sampleMetadata)
```

Here we follow the typical `limma`

workflow: fit the
linear model for each gene, run empirical Bayes, and extract the
relevant summary statistics. We pass `lmFit`

the
log_{2} transformed counts.

## Evaluate accuracy of rhythmic gene detection

First we merge the results from `limma`

with the known
amplitudes from `featureMetadata`

.

```
rhyLimma$feature = rownames(rhyLimma)
rhyLimma = merge(data.table(rhyLimma), simData$featureMetadata[, .(feature, amp0)], by = 'feature')
```

We can plot the distributions of p-values of rhythmicity for non-rhythmic and rhythmic genes. P-values for non-rhythmic genes are uniformly distributed between 0 and 1, as they should be under the null hypothesis. P-values for rhythmic genes, on the other hand, tend to be closer to 0.

```
ggplot(rhyLimma) +
geom_jitter(aes(x = factor(amp0), y = P.Value), shape = 21, width = 0.2) +
labs(x = expression('Rhythm amplitude ' * (log[2] ~ counts)), y = 'P-value of rhythmicity')
```

Finally, we can summarize the ability to distinguish non-rhythmic and
rhythmic genes using a receiver operating characteristic (ROC) curve
(here we use the `precrec`

package).