Title: | Tensor Composition Analysis |
---|---|
Description: | Tensor Composition Analysis (TCA) allows the deconvolution of two-dimensional data (features by observations) coming from a mixture of heterogeneous sources into a three-dimensional matrix of signals (features by observations by sources). The TCA framework further allows to test the features in the data for different statistical relations with an outcome of interest while modeling source-specific effects; particularly, it allows to look for statistical relations between source-specific signals and an outcome. For example, TCA can deconvolve bulk tissue-level DNA methylation data (methylation sites by individuals) into a three-dimensional tensor of cell-type-specific methylation levels for each individual (i.e. methylation sites by individuals by cell types) and it allows to detect cell-type-specific statistical relations (associations) with phenotypes. For more details see Rahmani et al. (2019) <DOI:10.1038/s41467-019-11052-9>. |
Authors: | Elior Rahmani [aut, cre], Brandon Jew [ctb] |
Maintainer: | Elior Rahmani <[email protected]> |
License: | GPL-3 |
Version: | 1.2.1 |
Built: | 2024-10-26 03:25:08 UTC |
Source: | https://github.com/cozygene/tca |
Performs unsupervised feature selection followed by principal component analysis (PCA) under a row-sparse model using the ReFACTor algorithm. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), refactor
allows to capture the variation in cell-type composition, which was shown to be a dominant sparse signal in methylation data.
refactor( X, k, sparsity = 500, C = NULL, C.remove = FALSE, sd_threshold = 0.02, num_comp = NULL, rand_svd = FALSE, log_file = "TCA.log", debug = FALSE, verbose = TRUE )
refactor( X, k, sparsity = 500, C = NULL, C.remove = FALSE, sd_threshold = 0.02, num_comp = NULL, rand_svd = FALSE, log_file = "TCA.log", debug = FALSE, verbose = TRUE )
X |
An |
k |
A numeric value indicating the dimension of the signal in |
sparsity |
A numeric value indicating the sparsity of the signal in |
C |
An |
C.remove |
A logical value indicating whether the covariates in X should be accounted for not only in the feature selection step, but also in the final calculation of the principal components (i.e. if |
sd_threshold |
A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in |
num_comp |
A numeric value indicating the number of ReFACTor components to return. |
rand_svd |
A logical value indicating whether to use random svd for estimating the low-rank structure of the data in the first step of the algorithm; random svd can result in a substantial speedup for large data. |
log_file |
A path to an output log file. Note that if the file |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
ReFACTor is a two-step algorithm for sparse principal component analysis (PCA) under a row-sparse model. The algorithm performs an unsupervised feature selection by ranking the features based on their correlation with their values under a low-rank representation of the data, followed by a calculation of principal components using the top ranking features (ReFACTor components).
Note that ReFACTor is tuned towards capturing sparse signals of the dominant sources of variation in the data. Therefore, in the presence of other potentially dominant factors in the data (i.e. beyond the variation of interest), these factors should be accounted for by including them as covariates (see argument C
). In cases where the ReFACTor components are designated to be used as covariates in a downstream analysis alongside the covariates in C
(e.g., in a standard regression analysis or in a TCA regression), it is advised to set the argument C.remove
to be TRUE
. This will adjust the selected features for the information in C
prior to the calculation of the ReFACTor components, which will therefore capture only signals that is not present in C
(and as a result may benefit the downstream analysis by potentially capturing more signals beyond the information in C
).
A list with the estimated components of the ReFACTor model.
scores |
An |
coeffs |
A |
ranked_list |
A vector with the features in |
For very large input matrices it is advised to use random svd for speeding up the feature selection step (see argument rand_svd
).
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nature Methods 2017.
data <- test_data(100, 200, 3, 0, 0, 0.01) ref <- refactor(data$X, k = 3, sparsity = 50)
data <- test_data(100, 200, 3, 0, 0, 0.01) ref <- refactor(data$X, k = 3, sparsity = 50)
Fits the TCA model for an input matrix of features by observations that are coming from a mixture of k
sources, under the assumption that each observation is a mixture of unique (unobserved) source-specific values (in each feature in the data). This function further allows to statistically test the effect of covariates on source-specific values. For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tca
allows to model the methylation of each individual as a mixture of cell-type-specific methylation levels that are unique to the individual. In addition, it allows to statistically test the effects of covariates and phenotypes on methylation at the cell-type level.
tca( X, W, C1 = NULL, C1.map = NULL, C2 = NULL, refit_W = FALSE, refit_W.features = NULL, refit_W.sparsity = 500, refit_W.sd_threshold = 0.02, tau = NULL, vars.mle = FALSE, constrain_mu = FALSE, parallel = FALSE, num_cores = NULL, max_iters = 10, log_file = "TCA.log", debug = FALSE, verbose = TRUE )
tca( X, W, C1 = NULL, C1.map = NULL, C2 = NULL, refit_W = FALSE, refit_W.features = NULL, refit_W.sparsity = 500, refit_W.sd_threshold = 0.02, tau = NULL, vars.mle = FALSE, constrain_mu = FALSE, parallel = FALSE, num_cores = NULL, max_iters = 10, log_file = "TCA.log", debug = FALSE, verbose = TRUE )
X |
An |
W |
An |
C1 |
An |
C1.map |
An |
C2 |
An |
refit_W |
A logical value indicating whether to re-estimate the input |
refit_W.features |
A vector with the names of the features in |
refit_W.sparsity |
A numeric value indicating the number of features to select using the ReFACTor algorithm when re-estimating |
refit_W.sd_threshold |
A numeric value indicating a standard deviation threshold to be used for excluding low-variance features in |
tau |
A non-negative numeric value of the standard deviation of the measurement noise (i.e. the i.i.d. component of variation in the model). If |
vars.mle |
A logical value indicating whether to use maximum likelihood estimation when learning the variances in the model. If |
constrain_mu |
A logical value indicating whether to constrain the estimates of the mean parameters (i.e. |
parallel |
A logical value indicating whether to use parallel computing (possible when using a multi-core machine). |
num_cores |
A numeric value indicating the number of cores to use (activated only if |
max_iters |
A numeric value indicating the maximal number of iterations to use in the optimization of the TCA model ( |
log_file |
A path to an output log file. Note that if the file |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
The TCA model assumes that the hidden source-specific values are random variables. Formally, denote by the source-specific value of observation
in feature
source
, the TCA model assumes:
where represent the mean and standard deviation that are specific to feature
, source
. The model further assumes that the observed value of observation
in feature
is a mixture of
different sources:
where is the non-negative proportion of source
in the mixture of observation
such that
, and
is an i.i.d. component of variation that models measurement noise. Note that the mixture proportions in
are, in general, unique for each individual, therefore each entry in
is coming from a unique distribution (i.e. a different mean and a different variance).
In cases where the true W
is unknown, tca
can be provided with noisy estimates of W
and then re-estimate W
as part of the optimization procedure (see argument refit_W
). These initial estimates should not be random but rather capture the information in W
to some extent. When the argument refit_W
is used, it is typically the case that only a subset of the features should be used for re-estimating W
. Therefore, when re-estimating W
, tca
performs feature selection using the ReFACTor algorithm; alternatively, it can also be provided with a user-specified list of features to be used in the re-estimation, assuming that such list of features that are most informative for estimating exist (see argument
refit_W.features
).
Covariates that systematically affect the source-specific values can be further considered (see argument
C1
). In that case, we assume:
where and
correspond to the
covariate values of observation
(i.e. a row vector from
C1
) and their effect sizes, respectively.
Covariates that systematically affect the mixture values , such as variables that capture technical biases in the collection of the measurements, can also be considered (see argument
C2
). In that case, we assume:
where and
correspond to the
covariate values of observation
(i.e. a row vector from
C2
) and their effect sizes, respectively.
Since the standard deviation of is specific to observation
and feature
, we can obtain p-values for the estimates of
and
by dividing each observed data point
by its estimated standard deviation and calculating T-statistics under a standard linear regression framework.
A list with the estimated parameters of the model. This list can be then used as the input to other functions such as tcareg.
W |
An |
mus_hat |
An |
sigmas_hat |
An |
tau_hat |
An estimate of the standard deviation of the i.i.d. component of variation in |
gammas_hat |
An |
deltas_hat |
An |
gammas_hat_pvals |
An |
gammas_hat_pvals.joint |
An |
deltas_hat_pvals |
An |
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.
Rahmani E, Zaitlen N, Baran Y, Eng C, Hu D, Galanter J, Oh S, Burchard EG, Eskin E, Zou J, Halperin E. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nature Methods 2016.
data <- test_data(100, 20, 3, 1, 1, 0.01) tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)
data <- test_data(100, 20, 3, 1, 1, 0.01) tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2)
TCA regression allows to test, under several types of statistical tests, the effects of source-specific values on an outcome of interest (or on mediating components thereof). For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tcareg
allows to test for cell-type-specific effects of methylation on outcomes of interest (or on mediating components thereof).
tcareg( X, tca.mdl, y, C3 = NULL, test = "marginal_conditional", null_model = NULL, alternative_model = NULL, save_results = FALSE, fast_mode = TRUE, output = "TCA", sort_results = FALSE, parallel = FALSE, num_cores = NULL, log_file = "TCA.log", features_metadata = NULL, debug = FALSE, verbose = TRUE )
tcareg( X, tca.mdl, y, C3 = NULL, test = "marginal_conditional", null_model = NULL, alternative_model = NULL, save_results = FALSE, fast_mode = TRUE, output = "TCA", sort_results = FALSE, parallel = FALSE, num_cores = NULL, log_file = "TCA.log", features_metadata = NULL, debug = FALSE, verbose = TRUE )
X |
An |
tca.mdl |
The value returned by applying tca to |
y |
An |
C3 |
An |
test |
A character vector with the type of test to perform on each of the features in |
null_model |
A vector with a subset of the names of the sources in |
alternative_model |
A vector with a subset (or all) of the names of the sources in |
save_results |
A logical value indicating whether to save the returned results in a file. If |
fast_mode |
A logical value indicating whether to use a fast version of TCA regression, in which source-specific-values are first estimated using the |
output |
Prefix for output files (activated only if |
sort_results |
A logical value indicating whether to sort the results by their p-value (i.e. features with lower p-value will appear first in the results). This option is not available if |
parallel |
A logical value indicating whether to use parallel computing (possible when using a multi-core machine). |
num_cores |
A numeric value indicating the number of cores to use (activated only if |
log_file |
A path to an output log file. Note that if the file |
features_metadata |
A path to a csv file containing metadata about the features in |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
TCA models as the source-specific value of observation
in feature
coming from source
(see tca for more details). A TCA regression model tests an outcome
for a linear statistical relation with the source-specific values of a feature
by assuming:
where is an intercept term,
is the effect of source
,
and
correspond to the
covariate values of observation
(i.e. a row vector from
C3
) and their effect sizes, respectively, and . In practice, if
fast_mode == FALSE
then tcareg
fits this model using the conditional distribution , which, effectively, integrates over the random
. Statistical significance is then calculated using a likelihood ratio test (LRT).
Alternatively, in case
fast_mode == TRUE
the above model is fitted by first learning point estimates for using the tensor function and then assessing statistical significance using T-tests and partial F-tests under a standard regression framework. This alternative provides a substantial boost in speed.
Note that the null and alternative models will be set automatically, except when test == 'custom'
, in which case they will be set according to the user-specified null and alternative hypotheses.
Under the TCA regression model, several statistical tests can be performed by setting the argument test
according to one of the following options:
1. If test == 'marginal'
, tcareg
will perform the following for each source . For each feature
,
will be estimated and tested for a non-zero effect, while assuming
for all other sources
.
2. If test == 'marginal_conditional'
, tcareg
will perform the following for each source . For each feature
,
will be estimated and tested for a non-zero effect, while also estimating the effect sizes
for all other sources
(thus accounting for covariances between the estimated effects of different sources).
3. If test == 'joint'
, tcareg
will estimate for each feature the effect sizes of all
sources
and then test the set of
estimates of each feature
j
for a joint effect.
4. If test == 'single_effect'
, tcareg
will estimate for each feature the effect sizes of all
sources
, under the assumption that
, and then test the set of
estimates of each feature
j
for a joint effect.
5. If test == 'custom'
, tcareg
will estimate for each feature the effect sizes of a predefined set of sources (defined by a user-specified alternative model) and then test their estimates for a joint effect, while accounting for a nested predefined set of sources (defined by a user-specified null model).
A list with the results of applying the TCA regression model to each of the features in X
. If test == 'marginal'
or (test == 'marginal_conditional'
and fast_mode == FALSE
) then a list of k
such lists of results are returned, one for the results of each source.
phi |
An estimate of the standard deviation of the i.i.d. component of variation in the TCA regression model. |
beta |
A matrix of effect size estimates for the source-specific effects, such that each row corresponds to the estimated effect sizes of one feature in |
intercept |
An |
alpha |
An |
null_ll |
An |
alternative_ll |
An |
stats |
An |
df |
The degrees of freedom for deriving p-values. |
pvals |
An |
qvals |
An |
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.
n <- 50 m <- 10 k <- 3 p1 <- 1 p2 <- 1 data <- test_data(n, m, k, p1, p2, 0.01) tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2) y <- matrix(rexp(n, rate=.1), ncol=1) rownames(y) <- rownames(data$W) # marginal conditional test: res0 <- tcareg(data$X, tca.mdl, y) # joint test: res1 <- tcareg(data$X, tca.mdl, y, test = "joint") # custom test, testing for a joint effect of sources 1,2 while accounting for source 3 res2 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = c("3"), alternative_model = c("1","2","3")) # custom test, testing for a joint effect of sources 1,2 assuming no effects under the null res3 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = NULL, alternative_model = c("1","2"))
n <- 50 m <- 10 k <- 3 p1 <- 1 p2 <- 1 data <- test_data(n, m, k, p1, p2, 0.01) tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2) y <- matrix(rexp(n, rate=.1), ncol=1) rownames(y) <- rownames(data$W) # marginal conditional test: res0 <- tcareg(data$X, tca.mdl, y) # joint test: res1 <- tcareg(data$X, tca.mdl, y, test = "joint") # custom test, testing for a joint effect of sources 1,2 while accounting for source 3 res2 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = c("3"), alternative_model = c("1","2","3")) # custom test, testing for a joint effect of sources 1,2 assuming no effects under the null res3 <- tcareg(data$X, tca.mdl, y, test = "custom", null_model = NULL, alternative_model = c("1","2"))
Extracts from a fitted TCA model (i.e. a value returned by the function tca
) a subset of the features.
tcasub(tca.mdl, features, log_file = "TCA.log", debug = FALSE, verbose = TRUE)
tcasub(tca.mdl, features, log_file = "TCA.log", debug = FALSE, verbose = TRUE)
tca.mdl |
The value returned by applying the function |
features |
A vector with the identifiers of the features to extract (as they appear in the rows of |
log_file |
A path to an output log file. Note that if the file |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
This function allows to extract a subset of the features from a fitted TCA model (i.e. from a value returned by the function tca
). This allows, for example, to extract and then perform post-hoc tests on only a small set of candidate features (e.g., using the function tcareg
), without the need to run tca
again for fitting the model to the candidate features.
A list with the estimated parameters of the model for the given set of features.
W |
Equals to |
mus_hat |
A |
sigmas_hat |
A |
tau_hat |
Equals to |
gammas_hat |
A |
deltas_hat |
A |
gammas_hat_pvals |
A |
gammas_hat_pvals.joint |
A |
deltas_hat_pvals |
A |
data <- test_data(50, 20, 3, 0, 0, 0.01) tca.mdl <- tca(X = data$X, W = data$W) tca.mdl.subset <- tcasub(tca.mdl, rownames(data$X)[1:10]) y <- matrix(rexp(50, rate=.1), ncol=1) rownames(y) <- rownames(data$W) # run tcareg test with an outcome y: res <- tcareg(data$X[1:10,], tca.mdl.subset, y, test = "joint")
data <- test_data(50, 20, 3, 0, 0, 0.01) tca.mdl <- tca(X = data$X, W = data$W) tca.mdl.subset <- tcasub(tca.mdl, rownames(data$X)[1:10]) y <- matrix(rexp(50, rate=.1), ncol=1) rownames(y) <- rownames(data$W) # run tcareg test with an outcome y: res <- tcareg(data$X[1:10,], tca.mdl.subset, y, test = "joint")
Estimates 3-dimensional signals (features by observations by sources) from input of mixtures (features by observations), under the assumption of the TCA model that each observation is a mixture of unique source-specific values (in each feature in the data). For example, in the context of tissue-level bulk DNA methylation data coming from a mixture of cell types (i.e. the input is methylation sites by individuals), tensor
allows to estimate a tensor of cell-type-specific levels for each individual in each methylation site (i.e. a tensor of methylation sites by individuals by cell types).
tensor( X, tca.mdl, scale = FALSE, parallel = FALSE, num_cores = NULL, log_file = "TCA.log", debug = FALSE, verbose = TRUE )
tensor( X, tca.mdl, scale = FALSE, parallel = FALSE, num_cores = NULL, log_file = "TCA.log", debug = FALSE, verbose = TRUE )
X |
An |
tca.mdl |
The value returned by applying the function |
scale |
A logical value indicating whether to divide the estimate of each entry in the tensor by its estimated standard deviation. |
parallel |
A logical value indicating whether to use parallel computing (possible when using a multi-core machine). |
num_cores |
A numeric value indicating the number of cores to use (activated only if |
log_file |
A path to an output log file. Note that if the file |
debug |
A logical value indicating whether to set the logger to a more detailed debug level; set |
verbose |
A logical value indicating whether to print logs. |
See tca for notations and details about the TCA model. Given estimates of the parameters of the model (given by tca), tensor
uses the conditional distribution for estimating the
source-specific levels of each observation
in each feature
.
A list with the estimated source-specific values. The first element in the list is an m
by n
matrix (features by observations) corresponding to the estimated values coming from the first source, the second element in the list is another m
by n
matrix (features by observations) corresponding to the estimated values coming from the second source and so on.
Rahmani E, Schweiger R, Rhead B, Criswell LA, Barcellos LF, Eskin E, Rosset S, Sankararaman S, Halperin E. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nature Communications 2019.
data <- test_data(50, 20, 3, 2, 2, 0.01) tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2) Z_hat <- tensor(data$X, tca.mdl)
data <- test_data(50, 20, 3, 2, 2, 0.01) tca.mdl <- tca(X = data$X, W = data$W, C1 = data$C1, C2 = data$C2) Z_hat <- tensor(data$X, tca.mdl)
Generates simple test data following the TCA model.
test_data(n, m, k, p1, p2, tau, log_file = "TCA.log", verbose = TRUE)
test_data(n, m, k, p1, p2, tau, log_file = "TCA.log", verbose = TRUE)
n |
The number of observations to simulate. |
m |
The number of features to simulate. |
k |
The number of sources to simulate. |
p1 |
The number of covariates that affect the source-specific values to simulate. |
p2 |
The number of covariates that affect the mixture values to simulate. |
tau |
The variance of the i.i.d. component of variation to add on top of the simulated mixture values. |
log_file |
A path to an output log file. Note that if the file |
verbose |
A logical value indicating whether to print logs. |
See tca for details about the TCA model.
A list with the simulated data and parameters.
X |
An |
Z |
A list with the simulated source-specific values, where the first element in the list is an |
W |
An |
mus |
An |
sigmas |
An |
C1 |
An |
C2 |
An |
gammas |
An |
deltas |
An |
data <- test_data(100, 50, 3, 2, 2, 0.01)
data <- test_data(100, 50, 3, 2, 2, 0.01)