Here we present summary statistics that can be used, e.g. in an Approximate Bayesian Computation method (Csillery, Francois, and Blum 2012), to estimate parameters of the methylation dynamics model of MethEvolSIM. The statistics include mean frequencies, standard deviations, neighbor correlations of methylation states in island and non-island regions. Additionally, for analyses comparing different samples (comparison of tree tips), some of the summary statistics are based on the mean per-site frequency of methylation changes and the minimum number of changes per islands leading to different global methylation states, calculated with the Fitch algorithm (Fitch 1971).
The provided set of functions allows the computation of summary statistics for methylation data in a genomic region with structural categorization into two types of structures. In this vignette, we will refer to these structures as islands and non-island structures.
The data
argument must be in one of the following
formats, depending on the analysis:
data[[structure]]
):# Example: a single sample with 3 genomic structures
# (1) island with 10 partially-methylated sites
# (2) non-island with 5 methylated sites
# (3) island with 15 unmethylated sites
data <- list(rep(0.5, 10), # Partially methylated
rep(1,5), # Methylated
rep(0,15)) # Unmethylated
data
#> [[1]]
#> [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#>
#> [[2]]
#> [1] 1 1 1 1 1
#>
#> [[3]]
#> [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
data[[tip]][[structure]]
).# Example: data from 3 tips of a tree,
# each with 3 genomic structures
data <- list(
# Tip 1
list(c(rep(0.5,5), rep(0,5)), # 5 partially methylated, 5 unmethylated
c(rep(1,4), 0.5), # 4 methylated, 1 unmethylated
c(rep(0,7), rep(0.5,8))), # 7 unmethylated, 8 partially methylated
# Tip 2
list(c(rep(0.5,9), 1), # 9 partially methylated, 1 methylated
c(rep(0.5,4), 1), # 4 partially methylated, 1 methylated
c(rep(0,8), rep(0.5,7))), # 8 unmethylated, 7 partially-methylated
# Tip 3
list(c(1, rep(0,8), 1), # first and last methylated, rest unmethylated
c(rep(0.5,3), rep(1,2)), # 3 methylated, 1 unmethylated
c(rep(0.5,15)))) # all partially methylated
In this case, data
must be pre-filtered to include only
sites present in all tips, ensuring a valid comparison between species
or samples.
Regardless of the format, methylation values should be represented as follows:
0
for unmethylated sites,
0.5
for partially methylated sites,
1
for methylated sites.
Intermediate values (e.g., obtained from pooled empirical data)
should be categorized before analysis with
categorize_siteMethSt
.
non_categorized_data <- list(
# Tip 1
list(c(0.1, 0.7, 0.9), # First structure
c(0.3, 0.5, 0.9)), # Second structure
# Tip 2
list(c(0.2, 0.8, 0.6), # First structure
c(0.9, 0.4, 0.7)) # Second structure
)
# Transform the data with custom thresholds
categorized_data <- categorize_siteMethSt(data, u_threshold = 0.15, m_threshold = 0.85)
categorized_data
#> [[1]]
#> [[1]][[1]]
#> [1] 0.5 0.5 0.5 0.5 0.5 0.0 0.0 0.0 0.0 0.0
#>
#> [[1]][[2]]
#> [1] 1.0 1.0 1.0 1.0 0.5
#>
#> [[1]][[3]]
#> [1] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#>
#>
#> [[2]]
#> [[2]][[1]]
#> [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 1.0
#>
#> [[2]][[2]]
#> [1] 0.5 0.5 0.5 0.5 1.0
#>
#> [[2]][[3]]
#> [1] 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 0.5 0.5 0.5
#>
#>
#> [[3]]
#> [[3]][[1]]
#> [1] 1 0 0 0 0 0 0 0 0 1
#>
#> [[3]][[2]]
#> [1] 0.5 0.5 0.5 1.0 1.0
#>
#> [[3]][[3]]
#> [1] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5
Finally, when phylogenetic data is provided, the order of tips in
data
must match the tip order in the Newick tree format
(left to right) or the order of tree$tip.label()
when the
tree is provided as a phylo
object from the
ape
package.
\[ \overline{x}=\frac1I\sum_{i=1}^{I}\frac1T\sum_{t=1}^{T}x_{i,t} \\ \overline{y}=\frac1J\sum_{j=1}^{J}\frac1T\sum_{t=1}^{T}y_{j,t} \]
where \(x_{i,t}\) and \(y_{j,t}\) represent the respective frequencies at each tip.
\[ \overline{w}=\frac1I\sum_{i=1}^{I}\frac1T\sum_{t=1}^{T}w_{i,t} \\ \overline{z}=\frac1J\sum_{j=1}^{J}\frac1T\sum_{t=1}^{T}z_{j,t} \]
These values can be computed using the following functions:
# 1 tip / sample / replicate
sample_n <- 1
index_islands <- c(1, 3)
index_nonislands <- c(2, 4)
data <- list(c(.5, .5, 0, 0, 0, .5), c(.5, 0, 0, .5), c(.5, .5, 0), c(0, 1, .5)) # tip 1
get_islandMeanFreqP(index_islands, data, categorized_data = T, sample_n)
#> [1] 0.5833333
get_nonislandMeanFreqP(index_nonislands, data, categorized_data = T, sample_n)
#> [1] 0.4166667
get_islandMeanFreqM(index_islands, data, categorized_data = T, sample_n)
#> [1] 0
get_nonislandMeanFreqM(index_nonislands, data, categorized_data = T, sample_n)
#> [1] 0.1666667
# 2 tip / sample / replicate
sample_n <- 2
index_islands <- c(1, 3)
index_nonislands <- c(2, 4)
data <- list(
list(c(.5, .5, 0, 0, 0, .5), c(.5, 0, 0, .5), c(.5, .5, 0), c(0, 0, .5)), # tip 1
list(c(0, .5, .5, 1, 1, .5), c(1, .5, 1, .5), c(0, .5, .5), c(1, .5, 1)) # tip 2
)
get_islandMeanFreqP(index_islands, data, categorized_data = T, sample_n)
#> [1] 0.5833333
get_nonislandMeanFreqP(index_nonislands, data, categorized_data = T, sample_n)
#> [1] 0.4166667
get_islandMeanFreqM(index_islands, data, categorized_data = T, sample_n)
#> [1] 0.08333333
get_nonislandMeanFreqM(index_nonislands, data, categorized_data = T, sample_n)
#> [1] 0.2916667
Note that in the calculation of the means the relative frequencies \(x_{i, t}\), \(y_{j, t}\), \(w_{i, t}\) and \(z_{j, t}\) are not weighted by the lengths of island \(i\) or non-island \(j\).
\[ \sigma_{t}(x)=\sqrt{\frac{1}{I-1}\sum_{i=1}^{I}(x_{i,t}-\overline{x_{.,t}})^2} \\ \sigma_{t}(y)=\sqrt{\frac{1}{J-1}\sum_{j=1}^{J}(y_{j,t}-\overline{y_{.,t}})^2} \]
where \(\overline{x_{.,t}}\) and \(\overline{y_{.,t}}\) represent mean frequencies at tip \(t\). The mean standard deviation across tips is:
\[ \hat{\sigma}(x)=\frac{1}{T}\sum_{t=1}^{T}\sigma_t(x) \\ \hat{\sigma}(y)=\frac{1}{T}\sum_{t=1}^{T}\sigma_t(y) \]
\[ \hat{\sigma}(w)=\frac{1}{T}\sum_{t=1}^{T}\sigma_t(w) \\ \hat{\sigma}(z)=\frac{1}{T}\sum_{t=1}^{T}\sigma_t(z) \]
These values can be computed using:
# 1 tip / sample / replicate
sample_n <- 1
index_islands <- c(1, 3)
index_nonislands <- c(2, 4)
data <- list(c(.5, .5, 0, 1, 1, .5), c(.5, 0, 1, .5), c(.5, .5, 0), c(0, 0, .5))
get_islandSDFreqP(index_islands, data, categorized_data = T, sample_n)
#> [1] 0.1178511
get_nonislandSDFreqP(index_nonislands, data, categorized_data = T, sample_n)
#> [1] 0.1178511
get_islandSDFreqM(index_islands, data, categorized_data = T, sample_n)
#> [1] 0.2357023
get_nonislandSDFreqM(index_nonislands, data, categorized_data = T, sample_n)
#> [1] 0.1767767
# 2 tip / sample / replicate
sample_n <- 2
index_islands <- c(1, 3)
index_nonislands <- c(2, 4)
data <- list(
list(c(.5, .5, 0, 0, 0, 1), c(.5, 0, 0, .5), c(1, .5, 0), c(0, 0, .5)), # tip 1
list(c(0, .5, .5, 1, 1, .5), c(1, .5, 1, .5), c(0, .5, .5), c(1, .5, 1)) # tip 2
)
get_islandSDFreqP(index_islands, data, categorized_data = T, sample_n)
#> [1] 0.05892557
get_nonislandSDFreqP(index_nonislands, data, categorized_data = T, sample_n)
#> [1] 0.1178511
get_islandSDFreqM(index_islands, data, categorized_data = T, sample_n)
#> [1] 0.1767767
get_nonislandSDFreqM(index_nonislands, data, categorized_data = T, sample_n)
#> [1] 0.05892557
Note that we average across tips as there can be correlations between the mean standard deviation of tips (e.g. under the case of an event affecting the states at two tips).
Let \(s\) be the length of two adjacent segments of central positions to consider. Let \(l\) be the number of positions excluded at the start and end of each structure. Define:
\[ \overline{Cor(x)_{s,l}}=\frac1I\frac1T\sum_{i=1}^{I}\sum_{t=1}^{T}Cor(x_{i,t,[p:p'-1]}, x_{i,t,[p+1:p']}) \\ \overline{Cor(y)_{s,l}}=\frac1J\frac1T\sum_{j=1}^{J}\sum_{t=1}^{T}Cor(y_{j,t,[p:p'-1]}, y_{j,t,[p+1:p']}) \]
Correlations are computed only for segment pairs where standard deviation is non-zero:
\[ \sigma(x_{i,t,[p:p'-1]}) \neq 0 \text{ and } \sigma(x_{i,t,[p+1:p']})\neq 0 \\ \sigma(y_{j,t,[p:p'-1]}) \neq 0 \text{ and } \sigma(y_{j,t,[p+1:p']})\neq 0 \]
In this formulation:
Only structures with a minimum length of \(n = s + 2 \cdot l\) are considered.
If \(l = 0\), all the positions are considered.
These values can be computed with the functions
compute_meanCor_i
for island structures and
compute_meanCor_ni
for non-island structures as in the
following examples:
# 1 tip / sample / replicate
sample_n <- 1
index_islands <- c(1, 3)
index_nonislands <- c(2, 4)
data <- list(c(.5, 0, 0, 0, .5, .5, .5, .5, .5, 1, .5, 0, 0, 0, .5, .5, .5, .5,
.5, 1, .5, 0, 0, 0, .5, .5, .5, .5, .5, 1), # 30 sites
c(.5, 1, 1, 1, .5, .5, 1, 1, 1, .5, .5, 1, 1, 1, .5, .5, 1, 1, 1,
.5, .5, 1, 1, 1, .5), # 25 sites
c(.5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0,
.5, 1, .5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5,
0, 0, .5, 1), # 40 sites
c(1, 1, 1, .5, .5, .5, 0, 0, 0, .5, 1, 1, 1, .5, .5, .5, 0, 0, 0,
.5, 1, 1, 1, .5, .5, .5, 0, 0, 0, .5,
.5, 0, 0, 0, .5)) # 35 sites
compute_meanCor_i(index_islands, minN_CpG = 10,
shore_length = 5, data, sample_n = 1, categorized_data = T)
#> [1] 0.3626687
compute_meanCor_ni(index_nonislands, minN_CpG = 10,
shore_length = 5, data, sample_n = 1, categorized_data = T)
#> [1] 0.3623932
# 2 tip / sample / replicate
sample_n <- 2
index_islands <- c(1, 3)
index_nonislands <- c(2, 4)
data <- list(
# tip 1
list(c(.5, 0, 0, 0, .5, .5, .5, .5, .5, 1, .5, 0, 0, 0, .5, .5, .5, .5, .5,
1, .5, 0, 0, 0, .5, .5, .5, .5, .5, 1), # 30 sites
c(.5, 1, 1, 1, .5, .5, 1, 1, 1, .5, .5, 1, 1, 1, .5, .5, 1, 1, 1, .5,
.5, 1, 1, 1, .5), # 25 sites
c(.5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0, .5, 1,
.5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0,
.5, 1), # 40 sites
c(1, 1, 1, .5, .5, .5, 0, 0, 0, .5, 1, 1, 1, .5, .5, .5, 0, 0, 0, .5,
1, 1, 1, .5, .5, .5, 0, 0, 0, .5, .5, 0, 0, 0, .5)), # 35 sites
# tip 2
list(c(.5, 0, 0, .5, .5, .5, 0, 0, .5, 1, .5, 0, 0, 0, 0, .5, .5, 1, 1, 1,
.5, 0, 0, 0, .5, .5, 1, 1, 1, 1), # 30 sites
c(.5, .5, 1, 1, .5, .5, 1, 1, 1, .5, .5, 0, 0, 0, .5, .5, 1, 1, 1, .5,
.5, 1, 1, 1, .5), # 25 sites
c(.5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0, .5, .5, .5, 0, 0, .5, 1,
1, 1, 1, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0, .5, 1, .5, 0, 0,
.5, 1), # 40 sites
c(1, 1, 1, .5, .5, .5, 0, 0, 0, .5, 1, 1, 1, 1, .5, .5, 0, 0, 0, .5, 1,
1, 1, .5, .5, .5, .5, .5, 0, .5, .5, .5, .5, 0, .5)) # 35 sites
)
compute_meanCor_i(index_islands, minN_CpG = 10,
shore_length = 5, data, sample_n = 2, categorized_data = T)
#> [1] 0.4371918
compute_meanCor_ni(index_nonislands, minN_CpG = 10,
shore_length = 5, data, sample_n = 2, categorized_data = T)
#> [1] 0.5021673
Note that correlation measures how much the state of each site says about the next. So, when there is no variation (e.g. in middle segment state u of a site is followed by state u of the next), there is no correlation.
Let \(S_k\) be the set of sites in the genomic structure indexed by \(k\). For each site \(s \in S_k\), let the methylation state at a given tip \(t\) be denoted as \(m_{k,s,t}\), where \(m_{k,s,t} \in \{0, 0.5, 1\}\) represents the unmethylated, partially-methylated, or methylated state, respectively.
For a given cherry \((t_1, t_2)\), that is two tips \(t_1\) and \(t_2\) that are direct offspring of the same internal node, we compute the proportion of sites with differing methylation states between the two tips.
Define the indicator variable: \[ \delta_{k,s,t_1,t_2} = \begin{cases} 1, & \text{if } m_{k,s,t_1} \neq m_{k,s,t_2}, \\ 0, & \text{if } m_{k,s,t_1} = m_{k,s,t_2}. \end{cases} \]
Then, the proportion of sites with different methylation states in structure \(k\) for the cherry \((t_1, t_2)\) is given by:
\[ F_k(t_1, t_2) = \frac{1}{|S_k|} \sum_{s \in S_k} \delta_{k,s,t_1,t_2} \]
where \(|S_k|\) is the total number of sites in structure \(k\), and the sum counts the number of sites where the methylation state differs between tips \(t_1\) and \(t_2\).
This statistic quantifies the proportion of differing methylation states for a given genomic structure in a cherry.
To compute the weighted mean frequency of methylation changes across all structures in a cherry, we separate island and non-island structures.
For a given cherry \((t_1, t_2)\), let \(\mathcal{I}\) be the set of indices for island structures and \(\mathcal{N}\) be the set of indices for non-island structures. Each structure \(k\) has a total number of sites \(|S_k|\), which we use as weights.
The weighted mean frequency of methylation changes for island structures is:
\[ \bar{F}_{\text{island}}(t_1, t_2) = \frac{\sum_{k \in \mathcal{I}} |S_k| F_k(t_1, t_2)}{\sum_{k \in \mathcal{I}} |S_k|} \]
Similarly, the weighted mean frequency for non-island structures is:
\[ \bar{F}_{\text{non-island}}(t_1, t_2) = \frac{\sum_{k \in \mathcal{N}} |S_k| F_k(t_1, t_2)}{\sum_{k \in \mathcal{N}} |S_k|} \]
These expressions compute the mean frequency of methylation state differences across all island and non-island structures, weighted by the number of sites in each structure.
Functions:
get_cherryDist(tree)
to get the distances between
the tips of each cherry. It identifies each cherry by the tip names and
the tip indices. The tip indices correspond to (a) the index from left
to right on the newick string, (b) the order of the tip label in the
phylo_object$tip.label
, and (c) the index in the
methylation data list data[[tip]][[structure]]
as obtained
with the function simulate_evolData()
when the given tree
has several tips.
countSites_cherryMethDiff(cherryDist, data)
to get
for each cherry the counts of sites with half-methylation and
full-methylation changes per genomic structure (island or
non-island).
freqSites_cherryMethDiff(tree,data)
. The function
first validates the tree structure and extracts pairwise distances
between cherry tips with get_cherryDist
. It then validates
the input data and counts half and full methylation differences for each
cherry at each structure using countSites_cherryMethDiff
.
Finally, it normalizes these (half and full) counts by the number of
sites per structure to compute frequencies.
get_siteFChange_cherry(tree,data)
uses
freqSites_cherryMethDiff(tree,data)
and then computes the
frequency of sites with any change of methylation state (full or half)
for each cherry at each genomic structure.
MeanSiteFChange_cherry(tree, data, index_islands, index_nonislands)
.
The function computes the per-cherry frequency of sites with different
methylation states at each structure (islands and non-islands) using
get_siteFChange_cherry
. Then it calculates the weighted
mean site frequency of methylation changes for each cherry separately
for islands and non-islands.
# Set example tree and methylation data
tree <- "((a:1.5,b:1.5):2,(c:2,d:2):1.5);"
data <- list(
list(rep(1,10), rep(0,5), rep(1,8)), # tip a
list(rep(1,10), rep(0.5,5), rep(0,8)), # tip b
list(rep(1,10), rep(0.5,5), rep(0,8)), # tip c
list(c(rep(0,5), rep(0.5, 5)), c(0, 0, 1, 1, 1), c(0.5, 1, rep(0, 6)))) # d
# Set the index for islands and non-island structures
index_islands <- c(2)
index_nonislands <- c(1, 3)
MeanSiteFChange_cherry(data = data,
categorized_data = T,
tree = tree,
index_islands = index_islands,
index_nonislands = index_nonislands)
#> tip_names tip_indices dist nonisland_meanFChange island_meanFChange
#> 1 a-b 1-2 3 0.4444444 1
#> 2 c-d 3-4 4 0.6666667 1
Let \(I\) be the set of CpG islands
in a given genomic region For each island \(i
\in I\), the mean methylation level across sites is computed with
get_meanMeth_islands(index_islands, data)
and categorized
into one of three global methylation states—unmethylated (\(u\)), partially methylated (\(p\)), or methylated (\(m\))—based on user-defined thresholds \(u_{\text{thresh}}\) and \(m_{\text{thresh}}\). This categorization is
given by the function:
\[ \operatorname{categorize\_islandGlbSt}(\mu_i, u_{\text{thresh}}, m_{\text{thresh}}) = \begin{cases} u, & \text{if } \mu_i \leq u_{\text{thresh}}, \\ m, & \text{if } \mu_i \geq m_{\text{thresh}}, \\ p, & \text{otherwise}. \end{cases} \]
Given a phylogenetic tree \(T\) with
tips corresponding to sampled species or individuals, we aim to estimate
the minimum number of state changes required to explain the observed
distribution of island methylation states across the tips. This is
achieved using the Fitch algorithm (Fitch
1971) as implemented in
compute_fitch(meth, tree)
.
The function computeFitch_islandGlbSt(T, M)
, where \(M\) is the matrix of categorized
methylation states across tips, returns a vector \(C\) where each entry \(C_i\) represents the minimum number of
changes required for island \(i\) under
the Fitch parsimony criterion.
# Example with data from a single island structure
# and three tips
tree <- "((bla:1,bah:1):2,booh:2);"
data <- list(
#Tip 1
list(c(rep(1,9), rep(0,1))), # m
#Tip 2
list(c(rep(0.5,10))), # p
#Tip 3
list(c(rep(0.5,9), rep(0.5,1)))) # p
index_islands <- c(1)
computeFitch_islandGlbSt(index_islands, data, tree,
u_threshold = 0.1, m_threshold = 0.9)
#> [1] 1
# Example: data from a genomic region consisting on 3 structures with 10 sites each
# one island, one non-island, one island
# and a tree with 8 tips
tree <- "(((a:1,b:1):1,(c:1,d:1):1):1,((e:1,f:1):1,(g:1,h:1):1):1);"
data <- list(
#Tip 1
list(c(rep(1,5), rep(0,5)), # p
c(rep(0,9), 1),
c(rep(1,8), rep(0.5,2))), # m
#Tip 2
list(c(rep(0.5,9), rep(0.5,1)), # p
c(rep(0.5,9), 1),
c(rep(0,8), rep(0.5,2))), # u
#Tip 3
list(c(rep(1,9), rep(0.5,1)), # m
c(rep(0.5,9), 1),
c(rep(0.5,8), rep(0.5,2))), # p
#Tip 4
list(c(rep(1,9), rep(0.5,1)), # m
c(rep(1,9), 0),
c(rep(0.5,8), rep(0.5,2))), # p
#Tip 5
list(c(rep(0,5), rep(0,5)), # u
c(rep(0,9), 1),
c(rep(0.5,8), rep(0.5,2))), # p
#Tip 6
list(c(rep(0,9), rep(0.5,1)), # u
c(rep(0.5,9), 1),
c(rep(1,8), rep(0.5,2))), # m
#Tip 7
list(c(rep(0,9), rep(0.5,1)), # u
c(rep(0.5,9), 1),
c(rep(0,8), rep(0.5,2))), # u
#Tip 8
list(c(rep(0,9), rep(0.5,1)), # u
c(rep(1,9), 0),
c(rep(0,9), rep(0.5,1)))) # u
index_islands <- c(1,3)
computeFitch_islandGlbSt(index_islands, data, tree,
u_threshold = 0.1, m_threshold = 0.9)
#> [1] 2 4
For genomic regions containing multiple CpG islands, we can summarize the per-island estimates using the mean to obtain an overall measure of the minimum state changes.
A cherry \(c\) is defined as a pair of direct offspring (tips \(t_1\) and \(t_2\)) of the same internal node in the phylogenetic tree. For a given cherry \(c\), let \(\mathcal{I}\) be the set of indices for island structures. For each island structure \(i \in \mathcal{I}\) at each cherry tip \(t_c \in \{t_{c,1}, t_{c,2}\}\) in each phylogenetic tree, the number of sites in each methylation state (unmethylated \(0\), partially-methylated \(0.5\), methylated \(1\)) is counted. This count is represented as:
\[ \text{count}_{\text{UPM}}(c,t_c, i) = \left( n_0, n_{0.5}, n_1 \right) \]
where \(n_0\), \(n_{0.5}\), and \(n_1\) are the counts of unmethylated, partially-methylated, and methylated sites at tip \(t_c\), respectively.
The distribution of methylation states at each island between the two tips \(t_1\) and \(t_2\) of a given cherry is compared using a chi-squared test. The null hypothesis of the test is that the frequency distributions of states in the two islands follow the same multinomial distribution. Note that not only IWEs can lead to deviations from this null hypothesis but also neighbor-dependent SSEs can lead to slight deviations from the assumption of multinomial distributions. The test statistic is calculated using the contingency table \(T_{c,i}\) for tips \(t_{c,1}\) and \(t_{c,2}\):
\[ T_{c,i} = \begin{pmatrix} n_0^{t_{c,1}} & n_{0.5}^{t_{c,1}} & n_1^{t_{c,1}} \\ n_0^{t_{c,2}} & n_{0.5}^{t_{c,2}} & n_1^{t_{c,2}} \end{pmatrix} \] For each cherry \(c\) and each island \(i\), the methylation frequencies at each cherry tip are compared using the chi-squared test. The p-value for the chi-squared test is calculated via Monte Carlo simulation to improve reliability, even when the expected frequencies do not meet the assumptions of the chi-squared approximation (i.e., expected counts of at least 5 in each category). The obtained p-values are stored in a dataframe for each cherry and for each island.
The significance of the methylation frequency changes is determined based on a user-defined threshold \(p_{\text{threshold}}\). If the p-value is smaller than the threshold, the change is considered significant. This is done for each cherry and each island \(i \in \mathcal{I}\) by comparing each p-value to the threshold:
\[ \text{Significant Change}_{c,i} = \begin{cases} 1, & \text{if } p_{\text{value}}(c,i) \lt p_{\text{threshold}} \\ 0, & \text{if } p_{\text{value}}(c,i) \ge p_{\text{threshold}} \end{cases} \]
For each cherry , the mean number of significant changes across all islands \(i \in \mathcal{I}\) is computed as the average of the significant changes for each island:
\[ \text{Mean Number of Significant Changes per Island}_{c} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \text{Significant Change}_{c,i} \]
where \(|\mathcal{I}|\) is the total number of islands.
Functions:
compare_CherryFreqs(tip1, tip2)
to perform a
chi-squared test to compare the distribution of methylation states
between two cherry tips.
pValue_CherryFreqsChange_i(data, index_islands, tree)
uses compare_CherryFreqs(tip1, tip2)
to get for each cherry
and each island the chi-square p-value of the comparison of the
distribution of methylation states between two cherry tips.
mean_CherryFreqsChange_i(data, categorized_data = FALSE, index_islands, tree, pValue_threshold)
.
The function uses
pValue_CherryFreqsChange_i(data, index_islands, tree)
and
counts the mean number of significant changes per island at each
cherry.
# Set example tree and methylation data
tree <- "((a:1,b:1):2,c:2);"
data <- list(
#Tip a
list(c(rep(1,9), rep(0,1)), # Structure 1: island
c(rep(0,9), 1), # Structure 2: non-island
c(rep(0,9), rep(0.5,1))), # Structure 3: island
#Tip b
list(c(rep(0,9), rep(0.5,1)), # Structure 1: island
c(rep(0.5,9), 1), # Structure 2: non-island
c(rep(0,9), rep(0,1))), # Structure 3: island
#Tip c
list(c(rep(1,9), rep(0.5,1)), # Structure 1: island
c(rep(0.5,9), 1), # Structure 2: non-island
c(rep(0,9), rep(0.5,1)))) # Structure 3: island
index_islands <- c(1,3)
mean_CherryFreqsChange_i(data, categorized_data = T,
index_islands, tree,
pValue_threshold = 0.05)
#> first_tip_name second_tip_name first_tip_index second_tip_index dist island_1
#> 1 a b 1 2 2 TRUE
#> island_3 FreqsChange
#> 1 FALSE 0.5
For a given phylogenetic tree with a number of tips \(N\), let \(t_n \in \{t_1, t_2, \cdots, t_N\) represent the tree tips. For a given genomic region, let \(\mathcal{I}\) be the set of indices for island structures. For each island structure \(i \in \mathcal{I}\) at each tip \(t_n\) in the phylogenetic tree, the number of sites in each methylation state (unmethylated \(0\), partially-methylated \(0.5\), methylated \(1\)) is counted. This count is represented as:
\[ \text{count}_{\text{UPM}}(t_n, i) = \left( n_0, n_{0.5}, n_1 \right) \]
where \(n_0\), \(n_{0.5}\), and \(n_1\) are the counts of unmethylated, partially-methylated, and methylated sites at tip \(t_n\), respectively.
The distribution of methylation states at each island across all tips is compared using a chi-squared test. The null hypothesis of the test is that the frequency distributions of states in the two islands follow the same multinomial distribution. Note that not only IWEs can lead to deviations from this null hypothesis but also neighbor-dependent SSEs can lead to slight deviations from the assumption of multinomial distributions. The test statistic is calculated using the contingency table \(T_{i}\) for all tree tips:
\[ T_{i} = \begin{pmatrix} n_0^{t_{1}} & n_{0.5}^{t_{1}} & n_1^{t_{1}} \\ n_0^{t_{2}} & n_{0.5}^{t_{2}} & n_1^{t_{2}} \\ \cdots & \cdots & \cdots \\ n_0^{t_{N}} & n_{0.5}^{t_{N}} & n_1^{t_{N}} \end{pmatrix} \]
For each island \(i\), the methylation frequencies are compared across tips using the chi-squared test. The p-value for the chi-squared test is calculated via Monte Carlo simulation to improve reliability, even when the expected frequencies do not meet the assumptions of the chi-squared approximation (i.e., expected counts of at least 5 in each category).
The significance of the methylation frequency changes for each island \(i \in \mathcal{I}\) across tips is determined based on a user-defined threshold \(p_{\text{threshold}}\). If the p-value is smaller than the threshold, the change is considered significant.
\[ \text{Significant Change}_{i} = \begin{cases} 1, & \text{if } p_{\text{value}}(i) < p_{\text{threshold}} \\ 0, & \text{if } p_{\text{value}}(i) \ge p_{\text{threshold}} \end{cases} \]
For the given tree, the mean number of significant changes per island across all tips is computed as:
\[ \text{Mean Number of Significant Changes per Island} = \frac{1}{|\mathcal{I}|} \sum_{i \in \mathcal{I}} \text{Significant Change}_{i} \]
where \(|\mathcal{I}|\) is the total number of islands.
Function:
mean_TreeFreqsChange_i(tree, data, categorized_data = FALSE, index_islands, pValue_threshold)
.# Set example tree and methylation data
tree <- "((a:1,b:1):2,(c:2,d:2):1);"
data <- list(
#Tip a
list(c(rep(1,9), rep(0,1)), # Structure 1: island
c(rep(0,9), 1), # Structure 2: non-island
c(rep(0,9), rep(0,1))), # Structure 3: island
#Tip b
list(c(rep(0,9), rep(0.5,1)), # Structure 1: island
c(rep(0.5,9), 1), # Structure 2: non-island
c(rep(0,9), rep(0,1))),# Structure 3: island
#Tip c
list(c(rep(0,9), rep(0.5,1)), # Structure 1: island
c(rep(0.5,9), 1), # Structure 2: non-island
c(rep(1,9), rep(0,1))),# Structure 3: island
#Tip d
list(c(rep(0,9), rep(0.5,1)), # Structure 1: island
c(rep(0.5,9), 1), # Structure 2: non-island
c(rep(1,8), rep(0.5,2)))) # Structure 3: island
index_islands <- c(1,3)
mean_TreeFreqsChange_i(tree, data, categorized_data = T,
index_islands,
pValue_threshold = 0.05)
#> [1] 1