Title: | Progression Analysis of Disease with Survival using Topological Data Analysis |
---|---|
Description: | Mapper-based survival analysis with transcriptomics data is designed to carry out. Mapper-based survival analysis is a modification of Progression Analysis of Disease (PAD) where survival data is taken into account in the filtering function. More details in: J. Fores-Martos, B. Suay-Garcia, R. Bosch-Romeu, M.C. Sanfeliu-Alonso, A. Falco, J. Climent, "Progression Analysis of Disease with Survival (PAD-S) by SurvMap identifies different prognostic subgroups of breast cancer in a large combined set of transcriptomics and methylation studies" <doi:10.1101/2022.09.08.507080>. |
Authors: | Miriam Esteve [aut, cre] |
Maintainer: | Miriam Esteve <[email protected]> |
License: | GPL-3 |
Version: | 1.0.0 |
Built: | 2025-01-28 04:10:20 UTC |
Source: | https://github.com/miriamesteve/gsstda |
Character vector of length 121 containing the group to which each sample belongs.
data(case_tag, package = "GSSTDA")
data(case_tag, package = "GSSTDA")
Character vector length 121.
"NT": control, sample from healthy tissue; "T": case, sample from neoplastic tissue.
The data are from the study GSE42568. Information extracted from the file GSE42568_family.soft.gz available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42568.
Checking the arguments introduces in the mapper
object.
check_arg_mapper( full_data, filter_values, distance_type, clustering_type, linkage_type, optimal_clustering_mode = NA, silhouette_threshold = 0.25, na.rm = TRUE )
check_arg_mapper( full_data, filter_values, distance_type, clustering_type, linkage_type, optimal_clustering_mode = NA, silhouette_threshold = 0.25, na.rm = TRUE )
full_data |
Matrix with the columns of the input matrix corresponding to the individuals belonging to the level. |
filter_values |
Vector obtained after applying the filtering function to the input matrix, i.e, a vector with the filtering function values for each included sample. |
distance_type |
Type of distance to be used for clustering. Choose between correlation ("correlation") and euclidean ("euclidean"). "correlation" default option. |
clustering_type |
Type of clustering method. Choose between "hierarchical" and "PAM" (“partition around medoids”) options. "hierarchical" default option. |
linkage_type |
Linkage criteria used in hierarchical clustering. Choose between "single" for single-linkage clustering, "complete" for complete-linkage clustering or "average" for average linkage clustering (or UPGMA). Only necessary for hierarchical clustering. "single" default option. |
optimal_clustering_mode |
Method for selection optimal number of clusters. It is only necessary if the chosen type of algorithm is hierarchical. In this case, choose between "standard" (the method used in the original mapper article) or "silhouette". In the case of the PAM algorithm, the method will always be "silhouette". |
silhouette_threshold |
Minimum value of |
na.rm |
|
optimal_clustering_mode
Checking the filter_values introduces in the mapper
object.
check_filter_values(full_data, filter_values, na.rm = TRUE)
check_filter_values(full_data, filter_values, na.rm = TRUE)
full_data |
Matrix with the columns of the input matrix corresponding to the individuals belonging to the level. This matrix could be the genes_disease_component. |
filter_values |
Vector obtained after applying the filtering function to the input matrix, i.e, a vector with the filtering function values for each included sample. |
na.rm |
|
filter_value
and full_data
without NAN's
Checking the full_data introduces in the package
check_full_data(full_data, na.rm = TRUE)
check_full_data(full_data, na.rm = TRUE)
full_data |
Matrix with the columns of the input matrix corresponding to the individuals belonging to the level. |
na.rm |
|
Return full_data
without NAN's and as a matrix
Checking the arguments introduces in the gene selection process.
check_gene_selection(num_genes, gen_select_type, percent_gen_select)
check_gene_selection(num_genes, gen_select_type, percent_gen_select)
num_genes |
Number of genes in the full_data |
gen_select_type |
Type of gene selection to be used. Choose between "top_bot" (top-botton) and "abs" (absolute) |
percent_gen_select |
Percentage of genes to be selected |
num_gen_select Number of genes to be selected according to the percent_gen_select value
Checking the survival_time
, survival_event
and case_tag
introduces in the GSSTDA
object.
check_vectors( full_data, survival_time, survival_event, case_tag, control_tag, na.rm = TRUE )
check_vectors( full_data, survival_time, survival_event, case_tag, control_tag, na.rm = TRUE )
full_data |
The genes of the full_data (maybe remove by na.rm = TRUE) |
survival_time |
Time between disease diagnosis and event (if there was no event until the end of follow-up). |
survival_event |
|
case_tag |
The tag of the healthy sample (healthy or not). |
control_tag |
Tag of the healthy sample.E.g. "T" |
na.rm |
|
control_tag Return the tag of the healthy sample.
It performs the clustering of the samples in each of the levels. That is to say, in each interval of values of the filtering function, the samples with a value within that interval are clustered using the proposed clustering algorithm and the proposed method to determine the optimal number of clusters.
clust_all_levels( data, samp_in_lev, distance_type, clustering_type, linkage_type, optimal_clustering_mode, silhouette_threshold, num_bins_when_clustering )
clust_all_levels( data, samp_in_lev, distance_type, clustering_type, linkage_type, optimal_clustering_mode, silhouette_threshold, num_bins_when_clustering )
data |
Input data matrix whose columns are the individuals and rows are the features.BR cambiar nombre. |
samp_in_lev |
A list of character vectors with the individuals
included in each of the levels (i.e. each of the intervals of the values
of the filter functions). It is the output of the |
distance_type |
Type of distance to be used for clustering. Choose between correlation ("correlation") and euclidean ("euclidean"). |
clustering_type |
Type of clustering method. Choose between "hierarchical" and "PAM" (“partition around medoids”) options. |
linkage_type |
Linkage criteria used in hierarchical clustering. Choose between "single" for single-linkage clustering, "complete" for complete-linkage clustering or "average" for average linkage clustering (or UPGMA). Only necessary for hierarchical clustering. |
optimal_clustering_mode |
Method for selection optimal number of clusters. It is only necessary if the chosen type of algorithm is hierarchical. In this case, choose between "standard" (the method used in the original mapper article) or "silhouette". In the case of the PAM algorithm, the method will always be "silhouette". "silhouette". |
silhouette_threshold |
Minimum value of |
num_bins_when_clustering |
Number of bins to generate the histogram employed by the standard optimal number of cluster finder method. Parameter not necessary if the "optimal_clust_mode" option is "silhouette" or the "clust_type" is "PAM". |
List of interger vectors. Each of the vectors contains information about the nodes at each level and the individuals contained in them. The names of the vector values are the names of the samples and the vector values are the node number of that level to which the individual belongs.
It performs clustering of the samples belonging to a particular level (to a particular interval of the filter function) with the proposed clustering algorithm and the proposed method to determine the optimal number of clusters.
clust_lev( data_i, distance_type, clustering_type, linkage_type, optimal_clustering_mode, silhouette_threshold = 0.25, num_bins_when_clustering, level_name )
clust_lev( data_i, distance_type, clustering_type, linkage_type, optimal_clustering_mode, silhouette_threshold = 0.25, num_bins_when_clustering, level_name )
data_i |
Matrix with the columns of the input matrix corresponding to the individuals belonging to the level. |
distance_type |
Type of distance to be used for clustering. Choose between correlation ("correlation") and euclidean ("euclidean"). |
clustering_type |
Type of clustering method. Choose between "hierarchical" and "PAM" (“partition around medoids”) options. |
linkage_type |
Linkage criteria used in hierarchical clustering. Choose between "single" for single-linkage clustering, "complete" for complete-linkage clustering or "average" for average linkage clustering (or UPGMA). Only necessary for hierarchical clustering. The value provided if the type of clustering chosen is hierarchical will be ignored |
optimal_clustering_mode |
Method for selection optimal number of clusters. It is only necessary if the chosen type of algorithm is hierarchical. In this case, choose between "standard" (the method used in the original mapper article) or "silhouette". In the case of the PAM algorithm, the method will always be "silhouette". |
silhouette_threshold |
Minimum value of |
num_bins_when_clustering |
Number of bins to generate the histogram employed by the standard optimal number of cluster finder method. Parameter not necessary if the "optimal_clust_mode" option is "silhouette" or the "clust_type" is "PAM". |
level_name |
Name of the studied level. # ERROR No usado |
Returns a interger vector with the samples included in each cluster for the specific level analyzed. The names of the vector values are the names of the samples and the vector values are the node number to which the individual belongs.
It computes the adjacency matrix between nodes. Two nodes are considered connected if they share at least one individual.
compute_node_adjacency(nodes_list)
compute_node_adjacency(nodes_list)
nodes_list |
Output of the |
It returns a matrix of magnitude n nodes x n nodes that stores a 1 if there are shared samples in two given nodes and a 0 otherwise.
It carries out univariate cox proportional hazard models for the expression levels of each gene included in the provided dataset (case_full_data) and their link with relapse-free or overall survival.
cox_all_genes(case_full_data, survival_time, survival_event)
cox_all_genes(case_full_data, survival_time, survival_event)
case_full_data |
Input matrix whose columns correspond to the patients and rows to the genes, having selected only the columns belonging to disease samples. The names of the rows must be the names of the genes. |
survival_time |
Numeric vector that includes time to the event information |
survival_event |
Numeric vector that indicates if relapse or death have been produced (0 and 1s). |
A matrix with the results of the application of proportional
hazard models using the expression levels of each gene as covariate.
The coef
column corresponds to the regression coefficient; the
exp_coef
column corresponds to the value of e^coef (which is
interpreted as the odds ratio); the se_coef
column corresponds
to the standard error of each coefficient; the Z
column corresponds
to the value of coef/se_coef (the higher the Z value, the higher the
significance of the variable) and the Pr_z
column corresponds to
the p-value for each Z value.
It takes a rectangular matrix composed by the addition of
a signal matrix and a Gaussian noise matrix and returns a matrix of the same
dimension that is denoised through a Singular Value Decomposition
truncation process. The selection of the number of singular values is
chosen following the proposal by "The optimal hard threshold
for singular values is ". It should be used after
the function
flatten_normal_tiss
.
denoise_rectangular_matrix(matrix_flatten_normal_tiss, gamma)
denoise_rectangular_matrix(matrix_flatten_normal_tiss, gamma)
matrix_flatten_normal_tiss |
A rectangular noisy matrix to denoise. It is return by
|
gamma |
A parameter that indicates the magnitude of the noise assumed in
the flat data matrix for the generation of the Healthy State Model. If it
takes the value |
A the normal space which has the same dimension denoised version of the matrix
returned by flatten_normal_tiss
.
Disease-Specific Genomic Analysis (dsga). This analysis, developed by Nicolau et al., allows the calculation of the "disease component" of a expression matrix which consists of, through linear models, eliminating the part of the data that is considered normal or healthy and keeping only the component that is due to the disease. It is intended to precede other techniques like classification or clustering. For more information see Disease-specific genomic analysis: identifying the signature of pathologic biology (doi: 10.1093/bioinformatics/btm033).
dsga( full_data, survival_time, survival_event, case_tag, control_tag = NA, gamma = NA, na.rm = TRUE )
dsga( full_data, survival_time, survival_event, case_tag, control_tag = NA, gamma = NA, na.rm = TRUE )
full_data |
Input matrix whose columns correspond to the patients and rows to the genes. |
survival_time |
Numerical vector of the same length as the number of
columns of |
survival_event |
Numerical vector of the same length as the number of
columns of |
case_tag |
Character vector of the same length as the number of
columns of |
control_tag |
Tag of the healthy sample.E.g. "T" |
gamma |
A parameter that indicates the magnitude of the noise assumed in
the flat data matrix for the generation of the Healthy State Model. If it
takes the value |
na.rm |
|
A dsga
object. It contains: the full_data
without
NAN's values, the label designated for healthy samples (control_tag
),
the case_tag
vector without NAN's values, the survival_event
,
the the survival_time
the matrix with the normal space (linear space
generated from normal tissue samples) and the matrix of the disease
components (the transformed full_data matrix from which the normal component
has been removed).
dsga_obj <- dsga(full_data, survival_time, survival_event, case_tag)
dsga_obj <- dsga(full_data, survival_time, survival_event, case_tag)
Given a matrix containing the expression values of
n
healthy tissue samples, it produces the flattened vector matrix
as reported in "Disease-specific genomic analysis: identifying
the signature of pathology biology".
flatten_normal_tiss(normal_tiss)
flatten_normal_tiss(normal_tiss)
normal_tiss |
A normal tissue data gene expression matrix. The columns should be the samples and the rows should be the genes. |
A gene expression matrix containing the flattened version of the vectors.
normal_tissue_matrix <- matrix(stats::rnorm(36), nrow=6) flatten_normal_tiss(normal_tissue_matrix)
normal_tissue_matrix <- matrix(stats::rnorm(36), nrow=6) flatten_normal_tiss(normal_tissue_matrix)
Matrix containing gene expression profiling of 104 breast cancer and 17 normal breast biopsies. Expression profiling data by array (platform HG-U133_Plus_2).
data(full_data, package = "GSSTDA")
data(full_data, package = "GSSTDA")
Gene expression matrix with 4165 rows and 121 columns.
The columns correspond to the patients and the rows to the genes. The column names correspond to the patient identifier in GEO. The row names correspond to the gene names.
Normalized gene expression data GSE42568. Background correction,
summarization, and quantile normalization were carried out using the
fRMA method implemented in the fRMA
package. Filtered probes that did
not target genes with valid gene id were filtered and those probes
targeting the same gene were collapse by thanking those presenting the
highest row variance using the WGCNA::collapseRows
function.
Subsequently, due to CRAN publication requirements, the number of genes
is randomly reduced to 4165 genes.
The data are from the study GSE42568 available in
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42568.
The data were processed as explained in the details
section.
Gene selection and calculation of filter function values. After fitting a Cox proportional hazard model to each gene, this function makes a selection of genes according to both their variability within the database and their relationship with survival. Subsequently, with the genes selected, the values of the filtering functions are calculated for each patient. The filter function allows to summarise each vector of each individual in a single data. This function takes into account the survival associated with each gene. In particular, the implemented filter function performs the vector magnitude in the Lp norm (as well as k powers of this magnitude) of the vector resulting of weighting each element of the column vector by the Z score obtained in the cox proportional hazard model.
gene_selection(data_object, gen_select_type, percent_gen_select, na.rm = TRUE)
gene_selection(data_object, gen_select_type, percent_gen_select, na.rm = TRUE)
data_object |
Object with:
|
gen_select_type |
Option. Options on how to select the genes to be used in the mapper. Select the "Abs" option, which means that the genes with the highest absolute value are chosen, or the "Top_Bot" option, which means that half of the selected genes are those with the highest value (positive value, i.e. worst survival prognosis) and the other half are those with the lowest value (negative value, i.e. best prognosis). "Top_Bot" default option. |
percent_gen_select |
Percentage (from zero to one hundred) of genes to be selected to be used in mapper. 10 default option. |
na.rm |
|
A gene_selection_object
. It contains:
the full_data
without NAN's values (data
)
the cox_all_matrix
(a matrix with the results of the application of
proportional hazard models: with the regression coefficients, the odds ratios,
the standard errors of each coefficient, the Z values (coef/se_coef) and
the p-values for each Z value)
a vector with the name of the selected genes
the matrix of disease components with only the rows of the selected genes
(genes_disease_component
)
and the vector of the values of the filter function.
data_object <- list("full_data" = full_data, "survival_time" = survival_time, "survival_event" = survival_event, "case_tag" = case_tag) class(data_object) <- "data_object" gene_selection_obj <- gene_selection(data_object, gen_select_type ="top_bot", percent_gen_select=10)
data_object <- list("full_data" = full_data, "survival_time" = survival_time, "survival_event" = survival_event, "case_tag" = case_tag) class(data_object) <- "data_object" gene_selection_obj <- gene_selection(data_object, gen_select_type ="top_bot", percent_gen_select=10)
Private function to gene selection
gene_selection_( full_data, survival_time, survival_event, control_tag_cases, gen_select_type, num_gen_select, matrix_disease_component = NULL )
gene_selection_( full_data, survival_time, survival_event, control_tag_cases, gen_select_type, num_gen_select, matrix_disease_component = NULL )
full_data |
Input matrix whose columns correspond to the patients and rows to the genes. |
survival_time |
Numerical vector of the same length as the number of
columns of |
survival_event |
Numerical vector of the same length as the number of
columns of |
control_tag_cases |
Numeric vector with the indices of the columns corresponding to the healthy sample patients. |
gen_select_type |
Option. Options on how to select the genes to be used in the mapper. Select the "Abs" option, which means that the genes with the highest absolute value are chosen, or the "Top_Bot" option, which means that half of the selected genes are those with the highest value (positive value, i.e. worst survival prognosis) and the other half are those with the lowest value (negative value, i.e. best prognosis). "Top_Bot" default option. |
num_gen_select |
Number of genes to be selected to be used in mapper. |
matrix_disease_component |
Optional, only necessary in case of gene
selection after dsga has been performed. Matrix of the disease components
(the transformed |
A gene_selection_object
. It contains:
the matrix with which the gene selection has been performed without NAN's
values (data
). It is the matrix_disease_component
in case it has been
performed from a dsga_object
or full_data
in the opposite case.
the cox_all_matrix
(a matrix with the results of the application of
proportional hazard models: with the regression coefficients, the odds ratios,
the standard errors of each coefficient, the Z values (coef/se_coef) and
the p-values for each Z value)
a vector with the name of the selected genes
the matrix of disease components with only the rows of the selected genes
(genes_disease_component
)
and the vector of the values of the filter function.
gen_select_type <- "Top_Bot" percent_gen_select <- 10 control_tag_cases <- which(case_tag == "NT") gene_selection_obj <- gene_selection_(full_data, survival_time, survival_event, control_tag_cases, gen_select_type ="top_bot", num_gen_select = 10)
gen_select_type <- "Top_Bot" percent_gen_select <- 10 control_tag_cases <- which(case_tag == "NT") gene_selection_obj <- gene_selection_(full_data, survival_time, survival_event, control_tag_cases, gen_select_type ="top_bot", num_gen_select = 10)
It selects genes for mapper based on the product of standard deviation of the rows (genes) in the disease component matrix plus one times the Z score obtained by fitting a cox proportional hazard model to the level of each gene. For further information see "Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival"
gene_selection_surv( case_disease_component, cox_all_matrix, gen_select_type, num_gen_select )
gene_selection_surv( case_disease_component, cox_all_matrix, gen_select_type, num_gen_select )
case_disease_component |
Disease component matrix (output of the function
|
cox_all_matrix |
Output from the |
gen_select_type |
Option. Select the "Abs" option, which means that the genes with the highest absolute value are chosen, or the "Top_Bot" option, which means that half of the selected genes are those with the highest value (positive value, i.e. worst survival prognosis) and the other half are those with the lowest value (negative value, i.e. best prognosis). |
num_gen_select |
Number of genes to be selected (those with the highest product value). |
Character vector with the names of the selected genes.
Private function to select Gene without dsga process
## Default S3 method: gene_selection(data_object, gen_select_type, percent_gen_select, na.rm = TRUE)
## Default S3 method: gene_selection(data_object, gen_select_type, percent_gen_select, na.rm = TRUE)
data_object |
Object with:
|
gen_select_type |
Option. Options on how to select the genes to be used in the mapper. Select the "Abs" option, which means that the genes with the highest absolute value are chosen, or the "Top_Bot" option, which means that half of the selected genes are those with the highest value (positive value, i.e. worst survival prognosis) and the other half are those with the lowest value (negative value, i.e. best prognosis). "Top_Bot" default option. |
percent_gen_select |
Percentage (from zero to one hundred) of genes to be selected to be used in mapper. 10 default option. |
na.rm |
|
A gene_selection
object. It contains: the full_data without NAN's values,
the control tag of the healthy patient, the matrix with the normal space and
the matrix of the disease components.
data_object <- list("full_data" = full_data, "survival_time" = survival_time, "survival_event" = survival_event, "case_tag" = case_tag) class(data_object) <- "data_object" gene_selection_object <- gene_selection(data_object, gen_select_type ="top_bot", percent_gen_select = 10)
data_object <- list("full_data" = full_data, "survival_time" = survival_time, "survival_event" = survival_event, "case_tag" = case_tag) class(data_object) <- "data_object" gene_selection_object <- gene_selection(data_object, gen_select_type ="top_bot", percent_gen_select = 10)
Private function to select Gene with dsga object
## S3 method for class 'dsga_object' gene_selection(data_object, gen_select_type, percent_gen_select, na.rm = TRUE)
## S3 method for class 'dsga_object' gene_selection(data_object, gen_select_type, percent_gen_select, na.rm = TRUE)
data_object |
dsga object information |
gen_select_type |
Option. Options on how to select the genes to be used in the mapper. Select the "Abs" option, which means that the genes with the highest absolute value are chosen, or the "Top_Bot" option, which means that half of the selected genes are those with the highest value (positive value, i.e. worst survival prognosis) and the other half are those with the lowest value (negative value, i.e. best prognosis). "Top_Bot" default option. |
percent_gen_select |
Percentage (from zero to one hundred) of genes to be selected to be used in mapper. 10 default option. |
na.rm |
|
A gene_selection
object. It contains: the full_data without NAN's values,
the control tag of the healthy patient, the matrix with the normal space and
the matrix of the disease components.
dsga_obj <- dsga(full_data, survival_time, survival_event, case_tag, na.rm = "checked") gene_selection_object <- gene_selection(dsga_obj, gen_select_type ="top_bot", percent_gen_select = 10)
dsga_obj <- dsga(full_data, survival_time, survival_event, case_tag, na.rm = "checked") gene_selection_object <- gene_selection(dsga_obj, gen_select_type ="top_bot", percent_gen_select = 10)
This function produces a disease component matrix from an expression matrix and the denoised flattened matrix constructed from "healthy tissue data".
generate_disease_component(full_data, normal_space)
generate_disease_component(full_data, normal_space)
full_data |
Input matrix whose columns correspond to the patients and rows to the gens. Both tumour and healthy samples should be included. |
normal_space |
Denoised flattened matrix constructed from
"healthy tissue data". Output of the function |
Disease component matrix that contains the disease component of the provided normal space
full_data <- matrix(stats::rnorm(120),ncol=20) normal_tissue <- t(full_data) normal_tissue_f <- flatten_normal_tiss(normal_tissue) normal_tissue_f_d <- denoise_rectangular_matrix(normal_tissue_f, gamma=NA) disease_component <- generate_disease_component(t(full_data),normal_tissue_f_d)
full_data <- matrix(stats::rnorm(120),ncol=20) normal_tissue <- t(full_data) normal_tissue_f <- flatten_normal_tiss(normal_tissue) normal_tissue_f_d <- denoise_rectangular_matrix(normal_tissue_f, gamma=NA) disease_component <- generate_disease_component(t(full_data),normal_tissue_f_d)
It calculates the intervals of the given output values of a filter function according to the given number and percentage of overlap.
get_intervals_One_D(filter_values, num_intervals, percent_overlap)
get_intervals_One_D(filter_values, num_intervals, percent_overlap)
filter_values |
Vector obtained after applying the filtering function to the input matrix, i.e, a vector with the filtering function values for each included sample. |
num_intervals |
Number of intervals to divide the filtering function values in. |
percent_overlap |
Percentage of overlap between intervals. |
Returns a list with the set of intervals for the filtering function values.
Gene Structure Survival using Topological Data Analysis. This function implements an analysis for expression array data based on the Progression Analysis of Disease developed by Nicolau et al. (doi: 10.1073/pnas.1102826108) that allows the information contained in an expression matrix to be condensed into a combinatory graph. The novelty is that information on survival is integrated into the analysis.
The analysis consists of 3 parts: a preprocessing of the data, the gene selection and the filter function, and the mapper algorithm. The preprocessing is specifically the Disease Specific Genomic Analysis (proposed by Nicolau et al.) that consists of, through linear models, eliminating the part of the data that is considered "healthy" and keeping only the component that is due to the disease. The genes are then selected according to their variability and whether they are related to survival and the values of the filtering function for each patient are calculated taking into account the survival associated with each gene. Finally, the mapper algorithm is applied from the disease component matrix and the values of the filter function obtaining a combinatory graph.
gsstda( full_data, survival_time, survival_event, case_tag, control_tag = NA, gamma = NA, gen_select_type = "Top_Bot", percent_gen_select = 10, num_intervals = 5, percent_overlap = 40, distance_type = "correlation", clustering_type = "hierarchical", num_bins_when_clustering = 10, linkage_type = "single", optimal_clustering_mode = NA, silhouette_threshold = 0.25, na.rm = TRUE )
gsstda( full_data, survival_time, survival_event, case_tag, control_tag = NA, gamma = NA, gen_select_type = "Top_Bot", percent_gen_select = 10, num_intervals = 5, percent_overlap = 40, distance_type = "correlation", clustering_type = "hierarchical", num_bins_when_clustering = 10, linkage_type = "single", optimal_clustering_mode = NA, silhouette_threshold = 0.25, na.rm = TRUE )
full_data |
Input matrix whose columns correspond to the patients and rows to the genes. |
survival_time |
Numerical vector of the same length as the number of
columns of |
survival_event |
Numerical vector of the same length as the number of
columns of |
case_tag |
Character vector of the same length as the number of
columns of |
control_tag |
Tag of the healthy sample.E.g. "T" |
gamma |
A parameter that indicates the magnitude of the noise assumed in
the flat data matrix for the generation of the Healthy State Model. If it
takes the value |
gen_select_type |
Option. Options on how to select the genes to be used in the mapper. Select the "Abs" option, which means that the genes with the highest absolute value are chosen, or the "Top_Bot" option, which means that half of the selected genes are those with the highest value (positive value, i.e. worst survival prognosis) and the other half are those with the lowest value (negative value, i.e. best prognosis). "Top_Bot" default option. |
percent_gen_select |
Percentage (from zero to one hundred) of genes to be selected to be used in mapper. 10 default option. |
num_intervals |
Parameter for the mapper algorithm. Number of intervals used to create the first sample partition based on filtering values. 5 default option. |
percent_overlap |
Parameter for the mapper algorithm. Percentage of overlap between intervals. Expressed as a percentage. 40 default option. |
distance_type |
Parameter for the mapper algorithm. Type of distance to be used for clustering. Choose between correlation ("correlation") and euclidean ("euclidean"). "correlation" default option. |
clustering_type |
Parameter for the mapper algorithm. Type of clustering method. Choose between "hierarchical" and "PAM" (“partition around medoids”) options. "hierarchical" default option. |
num_bins_when_clustering |
Parameter for the mapper algorithm. Number of bins to generate the histogram employed by the standard optimal number of cluster finder method. Parameter not necessary if the "optimal_clust_mode" option is "silhouette" or the "clust_type" is "PAM". 10 default option. |
linkage_type |
Parameter for the mapper algorithm. Linkage criteria used in hierarchical clustering. Choose between "single" for single-linkage clustering, "complete" for complete-linkage clustering or "average" for average linkage clustering (or UPGMA). Only necessary for hierarchical clustering. "single" default option. |
optimal_clustering_mode |
Method for selection optimal number of clusters. It is only necessary if the chosen type of algorithm is hierarchical. In this case, choose between "standard" (the method used in the original mapper article) or "silhouette". In the case of the PAM algorithm, the method will always be "silhouette". |
silhouette_threshold |
Minimum value of |
na.rm |
|
A gsstda
object. It contains:
the matrix with the normal space normal_space
,
the matrix of the disease components normal_space matrix_disease_component
,
a matrix with the results of the application of proportional hazard models
for each gene (cox_all_matrix)
,
the genes selected for mapper genes_disease_componen
,
the matrix of the disease components with information from these genes only
genes_disease_component
and a mapper_obj
object. This mapper_obj
object contains the
values of the intervals (interval_data), the samples included in each
interval (sample_in_level), information about the cluster to which the
individuals in each interval belong (clustering_all_levels), a list including
the individuals contained in each detected node (node_samples), their size
(node_sizes), the average of the filter function values of the individuals
of each node (node_average_filt) and the adjacency matrix linking the nodes
(adj_matrix). Moreover, information is provided on the number of nodes,
the average node size, the standard deviation of the node size, the number
of connections between nodes, the proportion of connections to all possible
connections and the number of ramifications.
gsstda_object <- gsstda(full_data, survival_time, survival_event, case_tag, gamma=NA, gen_select_type="Top_Bot", percent_gen_select=10, num_intervals = 4, percent_overlap = 50, distance_type = "euclidean", num_bins_when_clustering = 8, clustering_type = "hierarchical", linkage_type = "single")
gsstda_object <- gsstda(full_data, survival_time, survival_event, case_tag, gamma=NA, gen_select_type="Top_Bot", percent_gen_select=10, num_intervals = 4, percent_overlap = 50, distance_type = "euclidean", num_bins_when_clustering = 8, clustering_type = "hierarchical", linkage_type = "single")
Calculates the incomplete Marcenko-Pastur integral from a lower limit to the upper bound of the distribution's support.
incMarPas(x0, beta, gamma)
incMarPas(x0, beta, gamma)
x0 |
A numeric value representing the lower limit of the integral. |
beta |
A numeric value representing the aspect ratio |
gamma |
A numeric value representing an exponent parameter. |
A numeric value representing the incomplete Marcenko-Pastur integral.
Extract the nodes information based on information about clustering. The individuals who are part of each node are identified
levels_to_nodes(clust_all_levels_list)
levels_to_nodes(clust_all_levels_list)
clust_all_levels_list |
A list with information on the levels
obtained from the |
A list including the individuals content of each detected node. List of character vectors. Each of the vectors contains the names of the individuals at each node.
A filtering function for mapper that projects $$R$^n$ into $R$. It calculates for each column of the matrix (each patient), its value of the filtering function. Specifically, it computes the vector magnitude in the Lp norm (as well as k powers of this magnitude) of the vector resulting of weighting each element of the column vector by the Z score obtained by fitting a cox proportional hazard model to the level of each gene. For further information see "Progression Analysis of Disease with Survival (PAD-S) by SurvMap identifies different prognostic subgroups of breast cancer in a large combined set of transcriptomics and methylation studies"
lp_norm_k_powers_surv(genes_disease_component, p, k, cox_all_matrix)
lp_norm_k_powers_surv(genes_disease_component, p, k, cox_all_matrix)
genes_disease_component |
Disease component matrix (output of the function of
|
p |
integer. It indicates the p norm to be calculated. If k = 1 and p = 2, the function computes the standard (Euclidean) vector magnitude of each column. For larger values of p the weight of genes with larger levels is greater. |
k |
integer. Powers of the vector magnitude. If k = 1 and p = 2, the function computes the standard (Euclidean) vector magnitude of each column. |
cox_all_matrix |
A matrix with the output of the
|
A numeric vector including the values produced by the function for each sample in the dataset.
Auxiliary function that maps a numeric vector, the average node filtering function values, to a color vector.
map_to_color(x, limits = NULL)
map_to_color(x, limits = NULL)
x |
A vector of numeric values storing the average filtering function values found in the samples placed into a specific node. |
limits |
A two element numeric vector including the range of values. This is optional. |
A vector of the same length of x with colors ranging from blue to red.
TDA are persistent homology and mapper. Persistent homology borrows ideas from abstract algebra to identify particular aspects related to the shape of the data such as the number of connected components and the presence of higher-dimensional holes, whereas mapper condenses the information of high-dimensional datasets into a combinatory graph or simplicial complex that is referred to as the skeleton of the dataset. This implementation is the mapper of one dimension, i.e. using only one filter function value.
mapper( data, filter_values, num_intervals = 5, percent_overlap = 40, distance_type = "correlation", clustering_type = "hierarchical", num_bins_when_clustering = 10, linkage_type = "single", optimal_clustering_mode = NA, silhouette_threshold = 0.25, na.rm = TRUE )
mapper( data, filter_values, num_intervals = 5, percent_overlap = 40, distance_type = "correlation", clustering_type = "hierarchical", num_bins_when_clustering = 10, linkage_type = "single", optimal_clustering_mode = NA, silhouette_threshold = 0.25, na.rm = TRUE )
data |
Input matrix whose columns correspond to the individuals and rows to the features. |
filter_values |
Vector obtained after applying the filtering function to the input matrix, i.e, a vector with the filtering function values for each included sample. |
num_intervals |
Number of intervals used to create the first sample partition based on filtering values. 5 default option. |
percent_overlap |
Percentage of overlap between intervals. Expressed as a percentage. 40 default option. |
distance_type |
Type of distance to be used for clustering. Choose between correlation ("correlation") and euclidean ("euclidean"). "correlation" default option. |
clustering_type |
Type of clustering method. Choose between "hierarchical" and "PAM" (“partition around medoids”) options. "hierarchical" default option. |
num_bins_when_clustering |
Number of bins to generate the histogram employed by the standard optimal number of cluster finder method. Parameter not necessary if the "optimal_clustering_mode" option is "silhouette" or the "clustering_type" is "PAM". 10 default option. |
linkage_type |
Linkage criteria used in hierarchical clustering. Choose between "single" for single-linkage clustering, "complete" for complete-linkage clustering or "average" for average linkage clustering (or UPGMA). Only necessary for hierarchical clustering. "single" default option. |
optimal_clustering_mode |
Method for selection optimal number of clusters. It is only necessary if the chosen type of algorithm is hierarchical. In this case, choose between "standard" (the method used in the original mapper article) or "silhouette". In the case of the PAM algorithm, the method will always be "silhouette". |
silhouette_threshold |
Minimum value of |
na.rm |
|
A mapper_obj
object. It contains the values of the intervals
(interval_data), the samples included in each interval (sample_in_level),
information about the cluster to which the individuals in each interval
belong (clustering_all_levels), a list including the individuals contained
in each detected node (node_samples), their size (node_sizes), the
average of the filter function values of the individuals of each node
(node_average_filt) and the adjacency matrix linking the nodes (adj_matrix).
control_tag_cases <- which(case_tag == "NT") gene_selection_object <- gene_selection_(full_data, survival_time, survival_event, control_tag_cases, gen_select_type ="top_bot", num_gen_select = 10) mapper_object <- mapper(data = gene_selection_object[["genes_disease_component"]], filter_values = gene_selection_object[["filter_values"]], num_intervals = 5, percent_overlap = 40, distance_type = "correlation", clustering_type = "hierarchical", linkage_type = "single")
control_tag_cases <- which(case_tag == "NT") gene_selection_object <- gene_selection_(full_data, survival_time, survival_event, control_tag_cases, gen_select_type ="top_bot", num_gen_select = 10) mapper_object <- mapper(data = gene_selection_object[["genes_disease_component"]], filter_values = gene_selection_object[["filter_values"]], num_intervals = 5, percent_overlap = 40, distance_type = "correlation", clustering_type = "hierarchical", linkage_type = "single")
Calculates the median of the Marcenko-Pastur distribution for a given aspect ratio.
MedianMarcenkoPastur(beta)
MedianMarcenkoPastur(beta)
beta |
A numeric value representing the aspect ratio |
A numeric value representing the median of the Marcenko-Pastur distribution for the specified beta
.
Wrapping function to carry out the complete process.
one_D_Mapper(mapper_object_ini)
one_D_Mapper(mapper_object_ini)
mapper_object_ini |
Mapper TDA initializated object generated
by |
A mapper_obj
object. It contains the values of the intervals
(interval_data), the samples included in each interval (sample_in_level),
information about the cluster to which the individuals in each interval
belong (clustering_all_levels), a list including the individuals contained
in each detected node (node_samples), their size (node_sizes), the
average of the filter function values of the individuals of each node
(node_average_filt) and the adjacency matrix linking the nodes (adj_matrix).
Moreover, information is provided on the number of nodes, the average node
size, the standard deviation of the node size, the number of connections
between nodes, the proportion of connections to all possible connections
and the number of ramifications.
Computes the optimal SVHT coefficient when the noise level is known.
optimal_SVHT_coef_gamma_known(beta)
optimal_SVHT_coef_gamma_known(beta)
beta |
A numeric vector representing the aspect ratio |
A numeric vector of optimal SVHT coefficients corresponding to each aspect ratio in beta
.
beta <- 0.5 optimal_SVHT_coef_gamma_known(beta)
beta <- 0.5 optimal_SVHT_coef_gamma_known(beta)
Computes the optimal SVHT coefficient when the noise level is unknown.
optimal_SVHT_coef_gamma_unknown(beta)
optimal_SVHT_coef_gamma_unknown(beta)
beta |
A numeric vector representing the aspect ratio |
A numeric vector of optimal SVHT coefficients corresponding to each aspect ratio in beta
.
beta <- 0.5 optimal_SVHT_coef_gamma_unknown(beta)
beta <- 0.5 optimal_SVHT_coef_gamma_unknown(beta)
It draws the heatmap of the dsga result by selecting the 100 genes with the highest variability between samples.
plot_dsga(selected_matrix_disease_component, case_tag)
plot_dsga(selected_matrix_disease_component, case_tag)
selected_matrix_disease_component |
Disease component matrix of
the selected genes that contains the disease component of all patients.
Output of the function |
case_tag |
Character vector of the same length as the number of columns of full_data. Patients must be in the same order as in full_data. It must be indicated for each patient whether he/she is healthy or not. One value should be used to indicate whether the patient is healthy and another value should be used to indicate whether the patient's sample is tumourous. The user will then be asked which one indicates whether the patient is healthy. Only two values are valid in the vector in total. |
The heatmap of the dsga result.
This function produces an interactive network plot using
the visNetork
function from the mapper results.
plot_mapper(mapper_object, trans_node_size = TRUE, exp_to_res = 1/2)
plot_mapper(mapper_object, trans_node_size = TRUE, exp_to_res = 1/2)
mapper_object |
A list produced as an output of the |
trans_node_size |
Logical, it indicates whether you want to resize
the size of the nodes. |
exp_to_res |
Only necessary if trans_node_size is |
Plots an interactive network using the visNetwork
function.
# Create data object data_object <- list("full_data" = full_data, "survival_time" = survival_time, "survival_event" = survival_event, "case_tag" = case_tag) class(data_object) <- "data_object" #Select gene from data object gene_selection_object <- gene_selection(data_object, gen_select_type="top_bot", percent_gen_select=10) mapper_object <- mapper(data = gene_selection_object[["genes_disease_component"]], filter_values = gene_selection_object[["filter_values"]], num_intervals = 5, percent_overlap = 40, distance_type = "correlation", clustering_type = "hierarchical", linkage_type = "single") plot_mapper(mapper_object)
# Create data object data_object <- list("full_data" = full_data, "survival_time" = survival_time, "survival_event" = survival_event, "case_tag" = case_tag) class(data_object) <- "data_object" #Select gene from data object gene_selection_object <- gene_selection(data_object, gen_select_type="top_bot", percent_gen_select=10) mapper_object <- mapper(data = gene_selection_object[["genes_disease_component"]], filter_values = gene_selection_object[["filter_values"]], num_intervals = 5, percent_overlap = 40, distance_type = "correlation", clustering_type = "hierarchical", linkage_type = "single") plot_mapper(mapper_object)
It calculates the 100 genes with the highest variability in the matrix disease component between samples and use them to draw the heat map.
results_dsga(matrix_disease_component, case_tag)
results_dsga(matrix_disease_component, case_tag)
matrix_disease_component |
Disease component matrix that contains
the disease component of all patients. Output of the function
|
case_tag |
Character vector of the same length as the number of columns of full_data. Patients must be in the same order as in full_data. It must be indicated for each patient whether he/she is healthy or not. One value should be used to indicate whether the patient is healthy and another value should be used to indicate whether the patient's sample is tumourous. The user will then be asked which one indicates whether the patient is healthy. Only two values are valid in the vector in total. |
A heatmap of the 100 genes with the highest variability in the matrix disease component.
This function returns a list of vectors containing the individuals included at each level, i.e. the vectors of individuals with a value of the filter function within each of the intervals.
samples_in_levels(interval_data, filter_values)
samples_in_levels(interval_data, filter_values)
interval_data |
Filter function intervals. List with the set of
intervals for the filtering function values produced by the
|
filter_values |
Vector obtained after applying the filtering function to the input matrix, i.e, a vector with the filtering function values for each included sample. |
A list of character vectors with the samples included in each of the levels (i.e. each of the intervals of the values of the filter functions).
Character vector of length 121 containing whether or not the patient is relapsed.
data(survival_event, package = "GSSTDA")
data(survival_event, package = "GSSTDA")
Character vector of length 121.
A value of "0" indicates that the patient did not relapse
during follow-up, a value of "1" indicates that the patient did. Samples
from healthy tissue contain a value of NA
.
The data are from the study GSE42568. Information extracted from the file GSE42568_family.soft.gz available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42568.
Numeric vector of length 121 containing the time in months until the relapse of the patient or until the end of the follow-up in case the patient has not relapsed.
data(survival_time, package = "GSSTDA")
data(survival_time, package = "GSSTDA")
Numeric vector of length 121.
Time in months. Samples from healthy tissue contain a
value of NA
.
The data are from the study GSE42568. Information extracted from the file GSE42568_family.soft.gz available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42568