Type: | Package |
Title: | Network-Based Interpretation of Changes in Compositional Data |
Version: | 0.1.8 |
Date: | 2024-03-05 |
Maintainer: | Emmanuel Curis <emmanuel.curis@parisdescartes.fr> |
License: | Artistic-2.0 |
Description: | Provides a set of functions to interpret changes in compositional data based on a network representation of all pairwise ratio comparisons: computation of all pairwise ratio, construction of a p-value matrix of all pairwise tests of these ratios between conditions, conversion of this matrix to a network. |
Suggests: | lme4 |
Depends: | igraph |
Imports: | car |
Encoding: | UTF-8 |
NeedsCompilation: | no |
Packaged: | 2024-03-05 09:16:59 UTC; curis |
Author: | Emmanuel Curis |
Repository: | CRAN |
Date/Publication: | 2024-03-05 14:20:02 UTC |
Circadian Genes Expression in Bipolar Disorder Patients
Description
These two datasets give the expression level of main circadian genes in lymphoblastoid cells from bipolar disorder patients, as determined by qRT-PCR. Results are expressed in cycle thresholds (CT) units.
Usage
data(BpLi_J2)
data(BpLi_J4)
Format
Each dataset is a data frame with 78 rows and 26 columns. Each row give the RNA quantification of circadian and control genes for lymphoblastoid cells of a given patient, either with or without lithium in the culture medium.
Phenotype | factor | Patient phenotype, either responding (R) or not (NR) to lithium |
Patient | integer | Patient code |
Li | factor | Is lithium present (Oui) or not (Non) in the culture medium of the cells |
Other columns are the different circadian genes and reference genes expression levels, expressed as quantities in arbitrary units after quantitative reverse transcription PCR assays, using the Syber Green technology. Three sets of assays were done, using three different dilutions according to the explored gene: 1/20 (PER3, BHLHE41, NR1D1, DBP), 1/100 (GSK3b, RORA, PER1, PER2, CLOCK, ARNTL, CRY2, BHLHE40) and 1/200 (ARNTL2, TIMELESS, CRY1, CSNK1E). The numerical suffixe after the reference gene name (SDHA or HPRT) gives this dilution level – for instance, SDHA_20 is for the 1/20 dilution level, HRPT_100 for the 1/100 level...
Patients were classified as presenting a good response (R) or a lack of responce (NR) to lithium treatment based on the ALDA scale, see the original publication for details. Lymphoblastoid cells from each patients, obtained from blood samples, were cultivated for 2 (BpLi_J2) or 4 (BpLi_J4) days either with or without LiCl.
Source
Data courtesy allowed to be included in the package, by Cynthia Marie-Claire.
References
Geoffrey, P. A., Curis E., Courtin, C., Moreira, J., Morvillers, T., Etain, B., Laplanche, J.-L., Bellivier, F. and Marie-Claire, C. (2017). Lithium response in bipolar disorders and core clock genes expression. World J Biol Psychiatry, doi: 10.1080/15622975.2017.1282174.
Gene expression change in bipolar disorder
Description
This data set gives the mRNA quantification of a few genes compared between bipolar disorder patients and healthy volunteers.
Usage
data(GADL1)
Format
A data frame with 13 columns and 56 rows. Each row gives the quantification results, expressed in amounts:
Groupe | factor | The group label, Ctrl for healthy volunteers and BPD for bipolar disorder patients |
Phenotype | factor | the subgroup label, C for healthy volunteers, NR for patients not responding to treatment and R for patients responding to treatment |
Sample | factor | the sample unique identifier |
All other columns are numeric and give the amount, in arbitrary unit, of mRNA of the corresponding gene. The column name gives the name of the gene and the dilution used for the quantification. hprt and sdha are used as reference genes.
Source
Original data communicated by the authors of the experiment.
Examples
data( GADL1 )
# Optimal cut-off for ten genes and alpha = 0.05
# is around 0.22
# First step, is there differences between healthy subjects
# and patients ?
#
M.m <- creer.Mp( d = GADL1, noms = names( GADL1 )[ -c( 1:3 ) ],
f.p = student.fpc, v.X = 'Groupe' )
# 2) L'arbre associé
# [reference gene for plotting purpose]
n.ref <- grep( 'HPRT|SDHA', names( GADL1 ), value = TRUE )
plot( arbre.Mp( M.m, reference = n.ref ),
seuil.p = c( 0.218, 0.207, 0.230 ) )
# 3) Le graphe pour le seuil optimal
# => only IGF1 seems to behave differently
# (but it has missing values, so interpretation is difficult)
plot( grf.Mp( M.m, reference = n.ref, p = 0.22 ) )
# Second step, is there differences between patients
# that respond or not respond to treatment?
d.R <- GADL1[ which( GADL1$Groupe == 'BPD' ), ]
M.R <- creer.Mp( d = d.R, noms = names( GADL1 )[ -c( 1:3 ) ],
f.p = student.fpc, v.X = 'Phenotype' )
# 2) L'arbre associé
plot( arbre.Mp( M.R, reference = n.ref ),
seuil.p = c( 0.218, 0.207, 0.230 ) )
# 3) Le graphe pour le seuil optimal
# => no sign of any difference
plot( grf.Mp( M.R, reference = n.ref, p = 0.22 ) )
Effect of MDMA and SCH23390 on gene expression
Description
This data set gives the mRNA quantification of several genes involved in the dopamin pathway in four different conditions: control, after addition of MDMA, after addition of SCH23390 and after addition of both.
Usage
data(SCH23390)
Format
A data frame with 8 columns and 48 rows. Each row gives the quantification results, expressed in cycle threshold (CT):
Groupe | factor | The group label |
MDMA | factor | MDMA addition indicator (Oui=Yes, Non=No) |
SCH23390 | factor | SCH23390 addition indicator (as above) |
Hprt | numeric | CT for the hprt gene, use as reference gene |
Fos,Fosb,Egr1,Egr2 | numeric | CT for the four genes of interest |
Source
Original data communicated by the authors of the paper.
References
N. Benturquia, C. Courtin, F. Noble, and C. Marie-Claire (2008). Involvement of D1 dopamine receptor in MDMA-induced locomotor activity and striatal gene expression in mice. Brain Research, 1211, 1-5
Examples
data( SCH23390 )
# Optimal cut-off for five genes and alpha = 0.05
# is around 0.13
# First step, experimental check
#
# MDMA should change expression levels of all genes but the reference
# 1) extract the data for the Ctrl vs MDMA groups comparison
d.MDMA <- SCH23390[ which( SCH23390$Groupe %in% c( 'Ctrl', 'MDMA' ) ), ]
M.MDMA <- creer.Mp( d = d.MDMA, noms = names( d.MDMA )[ 4:8 ], log = TRUE,
f.p = student.fpc, v.X = 'MDMA' )
# 2) L'arbre associé
plot( arbre.Mp( M.MDMA, reference = 'Hprt' ),
seuil.p = c( 0.137, 0.128, 0.147 ) )
# 3) Le graphe pour le seuil optimal
# => indeed, all genes are modified by MDMA
# Fos and Fosb seems to have the same behavior
plot( grf.Mp( M.MDMA, reference = 'Hprt', p = 0.13 ) )
# Second step, experiment analysis
# Does SCH23390 modulate the MDMA effect?
# => interaction term in a two-ways analysis of variance
M.I <- creer.Mp( d = SCH23390, noms = names( SCH23390 )[ 4:8 ], log = TRUE,
f.p = anva_SC.fpc,
frm = R ~ MDMA + SCH23390 + MDMA:SCH23390, SC = 3 )
# 2) L'arbre associé
plot( arbre.Mp( M.I, reference = 'Hprt' ),
seuil.p = c( 0.137, 0.128, 0.147 ) )
# 3) Le graphe pour le seuil optimal
# => no clear detection of interaction
plot( grf.Mp( M.I, reference = 'Hprt', p = 0.13 ) )
Grouping composants by changes in cut-off separation
Description
These functions construct a tree based on the successive disjunctions between nodes of the graph when increasing the cut-off value.
Usage
arbre.Mp( Mp, en.log = FALSE, reference = NA, complement = FALSE )
## S3 method for class 'Arbre'
plot(x, seuil.p = 0.05,
xlab = "Composant",
ylab = if ( TRUE == en.log ) "-log seuil" else "Seuil",
col.seuil = "red" , lwd.seuil = 1, lty.seuil = 1,
horiz = FALSE, center = TRUE, edge.root = TRUE,
...)
Arguments
Mp |
A square, symmetric matrix containing p-values. Element
in row |
en.log |
If |
reference |
Either |
complement |
A logical. If TRUE, the tree is built using the complement of the graph, as when using equivalence test to build the graph. |
x |
The tree to be drawn |
seuil.p |
Selected cut-off for analysis. Can also be a |
xlab , ylab |
Legends for the axes |
col.seuil , lwd.seuil , lty.seuil |
Graphical parameters for drawing the analysis cut-off |
horiz , center , edge.root |
Options from
|
... |
Additionnal parameters for |
Details
By increasing the cut-off from 0 to 1, more and more edges between nodes are removed, and disjoint subgraphs appear. This can be used to build a tree of the composants, with nodes of the tree corresponding to the apparition of a new distinct subgraph. Leafs of the tree are the individual components.
Value
The arbre.Mp
function returns a dendrogram.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
creer.Mp
to create a matrix of p-values for all
possible ratios of a compositional vector.
grf.Mp
to convert such a matrix to a graph, once a
cut-off is selected.
coupures.Mp
to obtain the set of p-values
corresponding to the nodes of the tree, that is to the apparition of
new sets of composants.
plot.dendrogram
and
as.dendrogram
for more details on dendrogram
drawing and structure.
Examples
# load the potery data set
data( poteries )
# Compute one-way ANOVA p-values for all ratios in this data set
Mp <- creer.Mp( poteries, c( 'Al', 'Na', 'Fe', 'Ca', 'Mg' ),
f.p = anva1.fpc, v.X = 'Site' )
# Build the tree (in log scale, p-values are all < 0.01)
arbre <- arbre.Mp( Mp, en.log = TRUE )
# It is a dendrogram as defined in the cluster package
str( arbre )
class( arbre )
# Drawing this tree
plot( arbre )
Compute all pairwise ratios of a set of variables
Description
This function computes all pairwise ratios or differences of a set of variables in a given data frame
Usage
calc.rapports( d, noms, log = FALSE, isoler = FALSE )
Arguments
d |
The data frame that contains the variables. Other objects will be
coerced as data frames using |
noms |
A character vector containing the column names of the compositional variables to be used for ratio computations. Names absent from the data frame will be ignored with a warning. Optionnally, an integer vector containing the column numbers can be given instead. They will be converted to column names before further processing. |
log |
If If |
isoler |
If |
Details
Use this function to compute all pairwise ratio of a set of numerical variables. If non-numerical variables are given in the list of variables, they will be ignores with a warning.
Since the ratio of variables i and j is the inverse of the ratio of
variables j and i, only one of them is computed. The order is
determined by the order of the variables in noms
. In matrix
notations, only the upper right matrix is computed, withour the diagonal.
Value
These function returns the original data.frame with additional columns corresponding to all pairwise ratios added as the last columns of the data.frame.
These variables have their name constructed as the concatenation of the names of the two variables used, the first one being at the numerator, separated with a dot and with the additional suffix .r (or .r.log is working on difference of logarithms).
Their order is determined by the order given in noms: the first variable of the list, V1, is used to compute ratios with all others (V1/V2, V1/V3 and so on). Then the second one is used for ratios further ones (V2/V3 and so on), and so on until the last one.
Note
This function is mainly for designing a step-by-step analysis or control purposes. To avoid waste of memory, most of the functions in the package actually compute “on fly” the ratios when constructing the matrix or the data frame of p-values.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
creer.Mp
to create a matrix of p-values for all pairwise
tests of ratio changes.
Examples
# load the potery data set
data( poteries )
# Compute all ratios in the potery data set
d.r <- calc.rapports( d = poteries, noms = c( 'Al', 'Fe', 'Mg', 'Ca',
'Na' ) )
names( d.r )
head( d.r )
identical( d.r$Al.Fe.r, d.r$Al / d.r$Fe )
Cut-off selection by simulations
Description
Obtaining the optimal p-value cut-off for individual tests to achieve a given Type I error level of obtaining disjoint components of the graph
Usage
choisir.seuil( n.genes,
taille.groupes = c( 10, 10 ),
alpha.cible = 0.05,
seuil.p = (5:30)/100,
B = 3000, conf.level = 0.95,
f.p = student.fpc, frm = R ~ Groupe, v.X = 'Groupe',
normaliser = FALSE, en.log = TRUE,
n.quantifies = n.genes, masque,
n.coeurs = 1,
sigma = rep( 1, n.genes ),
... )
Arguments
n.genes |
Number of genes to be quantified simultaneously |
taille.groupes |
An integer vector containing the sample size
for each group. The number of groups is determined by the length of
this vector. Unused if |
alpha.cible |
The target type I error level of obtaining disjoint subnetworks under the null hypothesis that gene expressions are the same in all groups. Should be between 0 and 1. |
seuil.p |
A numeric vector of candidate cutoffs. Values outside the [0,1] interval are automatically removed. The default (from 0.05 to 0.30) is suited for a target type I error of 0.05 and less than 30 genes, roughly. |
B |
How many simulations to do. |
conf.level |
The confidence level of the interval given as a result (see Details). |
f.p |
The function to use for individual tests of each ratio. See
|
frm |
The formula to use. The default is suited for the structure of the simulated data, with R the ratio and Groupe the variable with group membership. |
v.X |
The name of the grouping variable. The default is suited for the structure of the simulated data, with R the ratio and Groupe the variable with group membership. |
normaliser |
Should the simulated data by normalised, that is should their sum be equal to 1? Since ratio are insensitive to the normalisation (by contrast with individual quantities), it is a useless step for usual designs, hence the default. |
en.log |
If The option is also used in the call of |
n.quantifies |
The number of quantified genes amongst the
|
masque |
A data.frame containing the values of needed covariates
for all replicates. If missing, a one-column data.frame generated
using |
n.coeurs |
The number of CPU cores to use in computation, with parallelization using forks (does not work on Windows) with the help of the parallel package. |
sigma |
A vector of length |
... |
additional arguments, to be used by the analysis function f.p |
Details
The choisir.seuil
function simulates B
datasets
of n.genes
“quantities” measured several times, under
the null hypothesis that there is only random variations between
samples. For each of these B
datasets, creer.Mp
is called with the provided test function, then converted to a graph
using in turn all cut-offs given in seuil.p
and the number of
components of the graph is determined. Having more than one is a type
I error.
For each cut-off in seuil.p
, the proportion of false-positive
is then determined, along with its confidence interval (using the
exact, binomial formula). The optimal cut-off to achieve the target
type I error is then found by linear interpolation.
Simulation is done assuming a log-normal distribution, with a
reduced, centered Gaussian on the log scale. Since under the null
hypothesis nothing changes between the groups, the only needed
informations is the total number of values for a given gene, which is
determined from the number of rows of masque
.
All columns of masque
are transfered to the analysis function,
so simulation under virtually any experimental design should be
possible, as far as a complete null hypothesis is wanted (not any
effect of any covariate).
Value
choisir.seuil
returns a data.frame with four columns,
corresponding to the candidate cut-offs, the corresponding estimated
type-I error and its lower and upper confidence bounds, and attributes
giving the estimated optimal cut-off, its confidence interval and
details on simulation condition. This data.frame has the additional
class SARPcompo.H0
, allowing specific print
and
plot
methods to be used.
Warning
The simulated ratios are stored in a column called R, appended to the simulated data.frame. For this reason, do not use any column of this name in the provided masque: it would be overwritten during the simulation process.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
Examples
# What would be the optimal cut-off for 10 genes quantified in two
# groups of 5 replicates?
# For speed reason, only 50 simulations are done here,
# but obviously much more are needed to have a good estimate f the cut-off.
seuil <- choisir.seuil( 10, c( 5, 5 ), B = 50 )
seuil
# Get the cut-off and its confidence interval
attr( seuil, "seuil" )
# Plot the results
plot( seuil )
Cut-off selection by simulations, in the context of equivalence tests
Description
Obtaining the optimal p-value cut-off for individual tests to achieve a given Type I error level of obtaining connected nodes in the graph
Usage
choisir.seuil.equiv( n.genes, taille.groupes,
mu = 10, sigma = 0.5, Delta = 0.5,
alpha.cible = 0.05,
seuil.p = (10:40)/100,
B = 3000, conf.level = 0.95,
f.p = equiv.fpc,
en.log = TRUE,
n.coeurs = 1,
... )
Arguments
n.genes |
Number of genes to be quantified simultaneously |
taille.groupes |
An integer vector containing the sample size
for each group. The number of groups is determined by the length of
this vector. Unused if |
mu |
A numeric vector giving the mean amount for each component
in the first condition, in the log scale ( It can also be a two-lines matrix giving the mean amounts for each component (columns) in the first (firt row) and second (second row) condition. |
sigma |
A numeric vector giving the standard deviation for the
amount of each component in both conditions, in the log scale
( It can also be a two-lines matrix giving the mean amounts for each component (columns) in the first (firt row) and second (second row) condition. |
Delta |
The limit for the equivalence region, |
alpha.cible |
The target type I error level of obtaining disjoint subnetworks under the null hypothesis that gene expressions are the same in all groups. Should be between 0 and 1. |
seuil.p |
A numeric vector of candidate cutoffs. Values outside the [0,1] interval are automatically removed. The default (from 0.05 to 0.30) is suited for a target type I error of 0.05 and less than 30 genes, roughly. |
B |
How many simulations to do. |
conf.level |
The confidence level of the interval given as a result (see Details). |
f.p |
The function to use for individual tests of each
ratio. See |
en.log |
If |
n.coeurs |
The number of CPU cores to use in computation, with parallelization using forks (does not work on Windows) with the help of the parallel package. |
... |
additional arguments, to be used by the analysis function f.p |
Details
The choisir.seuil.equiv
function simulates B
datasets of n.genes
“quantities” measured several times,
under the null hypothesis that variations between samples of two
conditions are given by the difference between the two rows of the
\mu
matrix. If \mu
was given as a single row (or a single
value), the second row is defined as (\mu, \mu + \Delta, \mu +
2\Delta\dots)
– correspondong to the null hypothesis that all
components have a different change between the two conditions, and
that this change is equal to the equivalence region limit
(\Delta
). For each of these B
datasets,
creer.Mp
is called with the provided test function, then
converted to a graph using in turn all cut-offs given in
seuil.p
and the number of edges of the graph is
determined. Having at least one edge is a type I error, since under
the null hypothesis there is no couple of genes having the same
change.
For each cut-off in seuil.p
, the proportion of false-positive
is then determined, along with its confidence interval (using the
exact, binomial formula). The optimal cut-off to achieve the target
type I error is then found by linear interpolation.
Data are generated using a normal (Gaussian) distribution, independantly for each component and each condition.
Value
choisir.seuil.equiv
returns a data.frame with four
columns, corresponding to the candidate cut-offs, the corresponding
estimated type-I error and its lower and upper confidence bounds, and
attributes giving the estimated optimal cut-off, its confidence
interval and details on simulation condition. This data.frame has the
additional class SARPcompo.H0
, allowing specific print
and plot
methods to be used.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
See choisir.seuil
for the case of difference tests and
disjoing subgraphs.
Examples
# What would be the optimal cut-off for 5 genes quantified in two
# groups of 5 replicates?
# Null hypothesis : mean = 0, sd = 1, Delta = 2
# For speed reason, only 50 simulations are done here,
# but obviously much more are needed to have a good estimate f the cut-off.
seuil <- choisir.seuil.equiv( 5, c( 5, 5 ),
mu = 1, sigma = 1, Delta = 1,
B = 50 )
seuil
# Get the cut-off and its confidence interval
attr( seuil, "seuil" )
# Plot the results
plot( seuil )
Convert between matrix and data-frame format
Description
These functions convert all pairwise ratio tests between the matrix format and the data.frame format
Usage
Mp.DFp(DFp, col.noms = c( 1, 2 ), col.p = 'p')
Arguments
DFp |
Results in the data.frame format. |
col.noms |
A length two character vector giving the columns in the data.frame format that contain the names of the two components of which the ratio is tested. If converting from a data.frame, can be given by number. |
col.p |
A length one character vector giving the column containing the p-values of the test in the data.frame format. If converting from a data.frame, can be given by number. |
Details
The matrix format is more convenient for manipulations like finding cut-off probabilities or building the hierarchical tree of the components. However, it can only store one value per couple and is more memory consuming.
The data.frame format is more efficient for computations, since it
allows to store several results at once. It can also be easily saved
and read using text-files and read.table
or
write.table
. However, finding the hierarchical tree of
components or building the graph is not so straightforward.
These utilitary functions allow to convert between the two formats.
Value
The results in the other format: a matrix for Mp.DFp
and a
3-columns data.frame for DFp.Mp
.
In the matrix form, components will be sorted by alphabetical order.
Warning
When converting a data.frame to a matrix, there is no control that all possible combinations are present once, and only once, in the data.frame. Missing combinations will have 0 in the matrix; combinations present several time will have the value of the last replicate.
When converting a matrix to a data.frame, the diagonal is not included in the data.frame. The matrix is expected to be symmetric, and only the upper right part is used.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
creer.Mp
to create a matrix of p-values for all
possible ratios of a compositional vector.
creer.DFp
to create a data.frame of p-values for
all possible ratios of a compositional vector.
Examples
# load the potery data set data( poteries )
Finding cut-offs for graph disjonctions
Description
These functions detect the experimental cut-offs to create distinct subgraphs, and propose adapted graphical representation.
Usage
coupures.Mp( Mp )
## S3 method for class 'Coupures'
plot(x, seuil.p = 0.05, en.log = TRUE,
xlab = "Seuil de p", ylab = "Nombre de composantes",
col.trait = "black", lwd.trait = 1, lty.trait = 1,
col.seuil = "red" , lwd.seuil = 1, lty.seuil = 1,
pch.fin = 19, cex.fin = 1, col.fin ="darkgreen",
pch.deb = ")", cex.deb = 1, col.deb = "darkgreen",
...)
Arguments
Mp |
A square, symmetric matrix containing p-values. Element
in row |
x |
The set of critical values, as obtained by |
seuil.p |
Selected cut-off for analysis. Can also be a |
en.log |
If |
xlab , ylab |
Legends for the axes |
col.trait , lwd.trait , lty.trait , pch.fin , cex.fin , col.fin , pch.deb , cex.deb , col.deb |
Graphical parameters for drawing the number of components in function of the cut-off. ‘trait’ refers to the function itself, ‘deb’ to the first point of a region of constant components number (that does not belong to it: the function is right-discontinuous) and ‘fin’ to the last point of this region (that belongs to it) |
col.seuil , lwd.seuil , lty.seuil |
Graphical parameters for drawing the analysis cut-off |
... |
Additionnal parameters for |
Details
By increasing the cut-off from 0 to 1, more and more edges between nodes are removed, and disjoint subgraphs appear. This function detects in a matrix of p-values which are the “critical” ones, that is the one for which the number of components changes.
Because the edge removal is defined by p < cut-off
, the cut-off
returned for a given number of components is to be understand as the
maximal one that gives this number of components.
The plot
method allows to visualize the evolution of the number
of components with the cut-off, and writes the critical cut-off values.
Value
The coupures.Mp
function returns a data.frame with
additionnal class ‘Coupures’. It contains three columns: one
with the p-value cut-offs, one with the opposite of their
decimal logarithm and one with the number of components when using
exactly this cut-off. The additionnal class allows to provide a
plot
method.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
creer.Mp
to create a matrix of p-values for all
possible ratios of a compositional vector.
grf.Mp
to convert such a matrix to a graph, once a
cut-off is selected.
arbre.Mp
to convert such a matrix to a classification
tree of the components of the compositional vector.
Examples
# load the potery data set
data( poteries )
# Compute one-way ANOVA p-values for all ratios in this data set
Mp <- creer.Mp( poteries, c( 'Al', 'Na', 'Fe', 'Ca', 'Mg' ),
f.p = anva1.fpc, v.X = 'Site' )
# Where would be the cut-offs?
seuils <- coupures.Mp( Mp )
seuils
# Drawing this, in log10 scale
plot( seuils, en.log = TRUE )
Create p-values data-frame from pairwise tests of all possible ratios of a compositional vector
Description
This function performs hypothesis testing on all possible pairwise ratios or differences of a set of variables in a given data frame, and store their results in a data.frame
Usage
creer.DFp( d, noms, f.p = student.fpc,
log = FALSE, en.log = !log,
nom.var = 'R',
noms.colonnes = c( "Cmp.1", "Cmp.2", "p" ),
add.col = "delta", n.coeurs = 1,
... )
Arguments
d |
The data frame that contains the compositional variables. Other
objects will be coerced as data frames using
|
noms |
A character vector containing the column names of the compositional variables to be used for ratio computations. Names absent from the data frame will be ignored with a warning. Optionnally, an integer vector containing the column numbers can be given instead. They will be converted to column names before further processing. |
f.p |
An R function that will perform the hypothesis test on a single
ratio (or log ratio, depending on This function should return a numeric vector, of which the first one
will typically be the p-value from the test — see
Such functions are provided for several common situations, see references at the end of this manual page. |
log |
If If |
en.log |
If |
nom.var |
A length-one character vector giving the name of the
variable containing a single ratio (or log-ratio). No sanity check
is performed on it: if you experience strange behaviour, check you
gave a valid column name, for instance using
|
noms.colonnes |
A length-three character vector giving the names
of, respectively, the two columns of the data frame that will contain
the components identifiers and of the column that will contain the
p-value from the test (the first value returned by |
add.col |
A character vector giving the names of additional
columns of the data.frame, used for storing additional return values
of |
n.coeurs |
The number of CPU cores to use in computation, with parallelization using forks (does not work on Windows) with the help of the parallel package. |
... |
additional arguments to |
Details
This function constructs a data.frame with n\times
(n-1)/2
rows, where n = length( noms )
(after eventually
removing names in noms
that do not correspond to numeric
variables). Each line of the data.frame is the result of the
f.p
function when applied on the ratio of variables whose names
are given in the first two columns (or on its log, if either (log ==
TRUE) && (en.log == FALSE)
or (log == FALSE) && (en.log ==
TRUE)
).
Value
These function returns the data.frame obtained as described above.
Note
Creating a data.frame seems slightly less efficient (in terms of
speed) than creating a dense matrix, so for compositionnal data with
only a few components and simple stastitical analysis were only a
single p-value is needed, consider using creer.Mp
instead.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
Predefined f.p
functions: anva1.fpc
for one-way
analysis of variance; kw.fpc
for the non-parametric
equivalent (Kruskal-Wallis test).
grf.DFp to create a graphe from the obtained matrix.
Examples
# load the Circadian Genes Expression dataset, at day 4
data( "BpLi_J4" )
ng <- names( BpLi_J4 )[ -c( 1:3 ) ] # Name of the genes
# analysis function (complex design)
# 1. the formula to be used
frm <- R ~ (1 | Patient) + Phenotype + Li + Phenotype:Li
# 2. the function itself
# needs the lme4 package
if ( TRUE == require( "lme4" ) ) {
f.p <- function( d, variable, ... ) {
# Fit the model
md <- lmer( frm, data = d )
# Get coefficients and standard errors
cf <- fixef( md )
se <- sqrt( diag( vcov( md ) ) )
# Wald tests on these coefficients
p <- 2 * pnorm( -abs( cf ) / se )
# Sending back the 4 p-values
p
}
# CRAN does not like 'long' computations
# => analyse only the first 6 genes
# (remove for real exemple!)
ng <- ng[ 1:6 ]
# Create the data.frame with all results
DF.p <- creer.DFp( d = BpLi_J4, noms = ng,
f.p = f.p, add.col = c( 'p.NR', 'p.Li', 'p.I' ) )
# Make a graphe from it and plot it
# for the interaction term, at the p = 0.2 threshold
plot( grf.DFp( DF.p, p = 0.20, col.p = 'p.I' ) )
}
Create a graph using a set of p-values from pairwise tests
Description
These functions construct an undirected graph, using the igraph package, to represent and investigate the results of all testing of all ratios of components of a compositional vector.
Usage
grf.Mp( Mp, p = 0.05, reference = NULL, groupes = NULL,
complement = FALSE )
grf.DFp( DFp, col.noms = c( 1, 2 ), p = 0.05, col.p = 'p',
reference = NULL, groupes = NULL )
Arguments
Mp |
A square, symmetric matrix containing p-values. Element
in row |
DFp |
A data frame containing at least three columns: two with component names and one with p-values for the corresponding ratio. |
col.noms |
A length two vector giving the names of the two columns containing the components names. |
p |
The p-value cutoff for adding an edge between two nodes. See details. |
col.p |
The name of the column containing the p-values to use to create the graph. |
reference |
A character vector giving the names of nodes that should be displayed with a different color in the created graph. These names should match column names used in Mp. Typical use would be for reference genes in qRT-PCR experiments. By default, all nodes are displayed in palegreen; reference nodes, if any, will be displayed in orange. |
groupes |
A list of character vectors giving set of logically related nodes, defining groups of nodes that will share common color. Currently unimplemented. |
complement |
A logical. If TRUE, the complement of the graph is returned. It allows to construct graphs using equivalence tests, instead of difference tests (that is, to keep an edge between two nodes if the test is significant, instead of non-significant). |
Details
Consider a compositional vector of n
components. These n
are seen as the nodes of a graph. Nodes i
and j
will be
connected if and only if the p-value for the test of the
\frac{i}{j}
ratio is higher than the cutoff, p
– that
is, if the test is not significant at the level p
.
Strongly connected sets of nodes will represent components that share a similar behaviour between the conditions tested, whereas unrelated sets of nodes will have a different behaviour.
Value
These function returns the created graph. It is an igraph object on which any igraph function can be applied, including plotting, and searching for graph components, cliques or communities.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
creer.Mp
to create a matrix of p-values for all
possible ratios of a compositional vector.
creer.DFp
to create a data.frame of p-values for all
possible ratios of a compositional vector.
Examples
# load the potery data set
data( poteries )
# Compute one-way ANOVA p-values for all ratios in this data set
Mp <- creer.Mp( poteries, c( 'Al', 'Na', 'Fe', 'Ca', 'Mg' ),
f.p = anva1.fpc, v.X = 'Site' )
Mp
# Make a graphe from it and plot it
plot( grf.Mp( Mp ) )
Create p-values matrix from pairwise tests of all possible ratios of a compositional vector
Description
This function performs hypothesis testing on all possible pairwise ratios or differences of a set of variables in a given data frame, and store their results in a (symmetric) matrix
Usage
creer.Mp( d, noms, f.p, log = FALSE, en.log = !log,
nom.var = 'R', n.coeurs = 1, controles = TRUE, ... )
Arguments
d |
The data frame that contains the compositional variables. Other
objects will be coerced as data frames using
|
noms |
A character vector containing the column names of the compositional variables to be used for ratio computations. Names absent from the data frame will be ignored with a warning. Optionnally, an integer vector containing the column numbers can be given instead. They will be converted to column names before further processing. |
f.p |
An R function that will perform the hypothesis test on a single
ratio (or log ratio, depending on This function should return a single numerical value, typically the p-value from the test. This function must accept at least two named arguments: Such functions are provided for several common situations, see references at the end of this manual page. |
log |
If If |
en.log |
If |
nom.var |
A length-one character vector giving the name of the
variable containing a single ratio (or log-ratio). No sanity check
is performed on it: if you experience strange behaviour, check you
gave a valid column name, for instance using
|
n.coeurs |
The number of CPU cores to use in computation, with parallelization using forks (does not work on Windows) with the help of the parallel package. |
controles |
If TRUE, sanity checks are done on the obtained
matrix. Columns and Rows containing only non-finite values are
removed, with warnings; corresponding component names are stored as
attribute |
... |
additional arguments to |
Details
This function constructs a n\times n
matrix, where n =
length( noms )
(after eventually removing names in noms
that
do not correspond to numeric variables). Term (i,j)
in this
matrix is the result of the f.p
function when applied on the
ratio of variables noms[ i ]
and noms[ j ]
(or on its
log, if either (log == TRUE) && (en.log == FALSE)
or (log
== FALSE) && (en.log == TRUE)
).
The f.p
function is always called only once, for i < j
,
and the other term is obtained by symmetry.
The diagonal of the matrix is filled with 1 without calling f.p
,
since corresponding ratios are always identically equal to 1 so
nothing useful can be tested on.
Value
These function returns the matrix obtained as described above,
with row an column names set to the names in noms
(after
conversion into column names and removing all non-numeric variables).
Note
Since the whole matrix is stored and since it is a dense
matrix, memory consumption (and computation time) increases as
n^2
. For compositional data with a large number of components,
like in RNA-Seq data, consider instead creating a file.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
Predefined f.p
functions: anva1.fpc
for one-way
analysis of variance; kw.fpc
for the non-parametric
equivalent (Kruskal-Wallis test).
grf.Mp to create a graphe from the obtained matrix.
Examples
# load the potery data set
data( poteries )
# Compute one-way ANOVA p-values for all ratios in this data set
Mp <- creer.Mp( poteries, c( 'Al', 'Na', 'Fe', 'Ca', 'Mg' ),
f.p = anva1.fpc, v.X = 'Site' )
Mp
# Make a graphe from it and plot it
plot( grf.Mp( Mp ) )
Simulate the distribution of maximal minimal distances in a random graph
Description
This function simulates the distribution of the maximum of the minimal distances between nodes of a random graph, for a given cut-off threshold.
Usage
distrib.distances( n.genes,
taille.groupes = c( 10, 10 ), masque,
me.composition = 0, cv.composition = 1, en.log = TRUE,
seuil.p = 0.05,
B = 3000, conf.level = 0.95,
f.p = student.fpc, frm = R ~ Groupe,
n.coeurs = 1 )
Arguments
conf.level |
The confidence level for the exact confidence intervals of estimated probabilities of maximal minimal distances in the graph. |
n.genes |
Number of components in the system (of nodes in the
total graph). Ignored if |
me.composition |
The expected median quantity of each component,
in the log scale. Can be either a single value, used for two
conditions and |
cv.composition |
The expected coefficient of variation of the
quantified amounts. Should be either a single value, that will be
used for all components and all conditions, or a matrix with the
same structure than |
.
en.log |
If |
taille.groupes |
The sample size for each condition. Unused if
|
masque |
A data.frame that will give the dataset design for a
given experiment. Should contain at least one column containing the
names of the conditions, with values being in the conditions names
in |
f.p , frm |
The function used to analyse the dataset, and its
parameter. See |
seuil.p |
The p-value cut-off to be used when creating the
graph. Should be between 0 and 1. See |
B |
The number of simulations to be done. |
n.coeurs |
The number of CPU cores to use to parallelize the simulation. |
Details
In an undirected graph, minimal distance between two nodes is the minimal number of edges to cross to go from one node to the other. The maximal minimal distance is the largest of all possible minimal distances in a given graph.
The function simulates the distribution of the maximal minimal distance in a graph whose edges were removed according to the specified p-value cut-off. To avoid infinite distances, these distances are computed in the largest connected component of the graph.
In the observed graph, nodes that are at a largest minimal distance than probable maximal minimal distances may signal components belonging to different sets, that could not be disconnected because of some nodes having intermediate changes.
Value
A 4-columns data.frame, with additional attributes giving the number
of simulations (Nombre.simulations
) and their results
(Tirages
). The first column contains the maximal minimal
distances, the second contains their observed frequencies in the
simulated datasets, the third and fourth contain the limits of the
confidence interval of the corresponding probability.
Confidence intervals are exacts, using the Clopper-Pearson method.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
distances
, in package igraph, to compute the matrix of all
minimal distances of a graphe.
creer.Mp
and grf.Mp
, which are used
internally, for details about analysis functions and p-value cutoff.
Utility function to obtain p-value for equivalence tests on individual ratios
Description
These functions can be used in the functions to perform analysis on all pairwise ratios of a compositional dataset, using equivalence tests to ensure edge existence
Usage
equiv.fpc( d, variable, v.X, var.equal = TRUE, Delta = 0.5,
pred = FALSE, ... )
Arguments
d |
The data frame that contains the ratio to test, and all variables of the original data frame that where not used as compositional data. |
variable |
A length-one character vector containing the names of the variable corresponding to the ratio (or log-ratio) to test. |
v.X |
The name of the explanatory (independant, predictor)
variable. This variable should be a factor for |
var.equal |
For |
Delta |
The value giving the positive limit of the equivalence region. For symmetry reasons, the equivalence region will be [-Delta, Delta]. |
pred |
If If |
... |
additional arguments |
Details
These functions are only wrapper to some commonly used equivalence tests.
The basic idea underlying equivalence tests is to try to reject the
null hypothesis that the difference between the two conditions is
higher (in absolute value) than a predefined, fixed, value, given by
Delta
. Consequently, significant tests will mean that the edge
between the two tested nodes should be kept, whereas non-significant
tests will mean that the existence of the edge is uncertain.
As a consequence, conversion of the p-value matrix to graphs should be made using the complement of the graph, and the analysis of the graph should be made in terms of cliques instead of disjoint subgraphs. See example for an illustration.
Value
These function returns the p-value from the corresponding test.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
student.fpc
for a, more usual, approach using difference
tests.
creer.Mp
to use these functions to create a matrix of
p-values.
Examples
# load the reference genes data set
data( generef )
# compute equivalence test of all ratios
# Genes are in column 3 to 35
# Only the first 10 genes are used here, for speed reasons
# Equivalence is defined as a change lower than 0.5 Cq (× 1.41)
# Values are expressed as Cq : already in log
Mp <- creer.Mp( generef, names( generef )[ 3:13 ], log = TRUE,
f.p = equiv.fpc, Delta = 0.5, v.X = 'Group' )
# Make the graph from it, and plot it
# Threshold is set at 0.15, for 10 nodes...
# (using the complement, to keep edges with p < threshold only )
plot( grf.Mp( Mp, p = 0.15, complement = TRUE ) )
# => there is a single clique, of 3 genes : HPRT1, B2M, HSP90AB1
# only these three genes are "proved" to have the same variation
Create and read a file of p-values for all pairwise tests of all possible ratios of a compositional vector
Description
These functions allow to perform hypothesis testing on all possible pairwise ratios or differences of a set of variables in a given data frame, and store or read their results in a file
Usage
creer.Fp( d, nom.fichier,
noms, f.p = student.fpc,
log = FALSE, en.log = !log,
nom.var = 'R',
noms.colonnes = c( "Cmp.1", "Cmp.2", "p" ),
add.col = "delta",
sep = ";", dec = ".", row.names = FALSE, col.names = TRUE,
... )
grf.Fp( nom.fichier, col.noms = c( 1, 2 ), p = 0.05, col.p = 'p',
reference = NULL, groupes = NULL,
sep = ";", dec = ".", header = TRUE,
... )
Arguments
d |
The data frame that contains the compositional variables. Other
objects will be coerced as data frames using
|
nom.fichier |
A length-one character vector giving the name of the file |
noms |
A character vector containing the column names of the compositional variables to be used for ratio computations. Names absent from the data frame will be ignored with a warning. Optionnally, an integer vector containing the column numbers can be given instead. They will be converted to column names before further processing. |
f.p |
An R function that will perform the hypothesis test on a single
ratio (or log ratio, depending on This function should return a numeric vector, of which the first one
will typically be the p-value from the test — see
Such functions are provided for several common situations, see links at the end of this manual page. |
log |
If If |
en.log |
If |
nom.var |
A length-one character vector giving the name of the
variable containing a single ratio (or log-ratio). No sanity check
is performed on it: if you experience strange behaviour, check you
gave a valid column name, for instance using
|
noms.colonnes |
A length-three character vector giving the names
of, respectively, the two columns of the data frame that will contain
the components identifiers and of the column that will contain the
p-value from the test (the first value returned by |
add.col |
A character vector giving the names of additional
columns of the data.frame, used for storing additional return values
of |
sep , dec , row.names , col.names , header |
Options for controling
the file format, used by |
col.noms |
A length-two vector giving the two columns that contain the two components of the ratio. Can be given either as column number or column name. |
col.p |
A length-one vector giving the column that contain the p-value of the ratio. Can be given either as column number or column name. |
p |
The p-value cut-off to be used when creating the
graph, see |
reference |
A character vector giving the names of nodes that should be displayed with a different color in the created graph. These names should match components names present un the file. Typical use would be for reference genes in qRT-PCR experiments. By default, all nodes are displayed in palegreen; reference nodes, if any, will be displayed in orange. |
groupes |
A list of character vectors giving set of logically related nodes, defining groups of nodes that will share common color. Currently unimplemented. |
... |
additional arguments to |
Details
These functions are basically the same as the function that
create data.frames (creer.DFp
) and use data.frames to
create a graph (grf.DFp
), except thatthey work on text
files. This allow to deal with compositionnal data including
thousands of components, like RNA-Seq or microarray data.
Seeing the results as a matrix, computations are done in rows and the file is updated after each row. Only the upper-triangular part, without the diagonal, is stored in the file.
The function that creates the graphe from file is not very efficient
and can take a lot of time for huge matrices. Making a first filter
on the file using shell tools, like gawk
or perl
, or a
dedicated C software and loading the resulting file as a data.frame
before converting it into a graph is a better alternative, but may
lose some isolated nodes.
Value
creer.Fp
does not return anything.
grf.Fp
returns the result graph.
Note
Creating a file and working from a file is quite inefficient (in terms
of speed), so for compositionnal data with only a few components,
consider using creer.DFp
that creates the data.frame
directly in memory and grf.DFp
that creates the graphe
from a data.frame instead.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
Predefined f.p
functions: anva1.fpc
for one-way
analysis of variance; kw.fpc
for the non-parametric
equivalent (Kruskal-Wallis test).
For directly creating and manipulating matrices,
creer.Mp
and grf.Mp
.
Examples
# load the potery data set
data( poteries )
# Create the file name in R temporary directory
nom.fichier <- paste0( tempdir(), "/fichier_test.csv" )
nom.fichier
# Compute one-way ANOVA p-values for all ratios in this data set
# and store them in a text file
creer.Fp( poteries, nom.fichier,
c( 'Al', 'Na', 'Fe', 'Ca', 'Mg' ),
f.p = anva1.fpc, v.X = 'Site',
add.col = c( 'mu0', 'd.C', 'd.CoA', 'd.IT', 'd.L' ) )
# Make a graphe from it and plot it
plot( grf.Fp( nom.fichier ) )
# The file is a simple text-file that can be read as a data.frame
DFp <- read.table( nom.fichier, header = TRUE, sep = ";", dec = "," )
DFp
Expression level of candidate reference genes
Description
This dataset gives the expression level of 30 different candidate reference genes, in control subjects and in subjects with bipolar disorder.
Usage
data(generef)
Format
A data frame with 40 rows and 32 columns. Each row gives the results and characteristics of a given patient. Columns 3 to 32 give the average expression level of the gene whose name is the column name, for the given patient. Columns are sorted in alphabetical order of the gene name. Values are expressed as Cq (Ct) and were obtained through qRT-PCR. RNAs were extracted on lymphoblastoid cell line culture. Cq are the average of three technical replicates, after eventual removal of outliers.
ID | factor | Patient identifier |
Group | factor | Patient group (control or having bipolar disorders) |
ATCB | numeric | Cq for gene ATCB |
... | numeric | Cq for gene ... |
YWHAZ | numeric | Cq for gene YWHAZ |
Source
Experimental data kindly provided by Calypso Nepost & Cynthia Marie-Claire, UMR-S 1144, INSERM-Paris Descartes-Paris Diderot
Create a compositional model for simulations
Description
These functions create and plot a model of compositionnal data for two or more conditions.
Usage
modele_compo( medianes, en.log = FALSE,
noms = colnames( medianes ),
conditions = rownames( medianes ),
reference = NULL, total = 1 )
## S3 method for class 'SARPcompo.modele'
plot( x,
xlab = "Composant",
ylab.absolu = "Quantit\u00e9", ylab.relatif = "Fraction",
taille.noeud = 50, ... )
Arguments
medianes |
A matrix giving the medians of all components quantities in each condition. Each row of this matrix corresponds to a different condition; each column, to one of the components. |
en.log |
If |
noms |
Names of the components. If provided, should be a
character vector whose length is equal to the number of columns of
|
conditions |
Names of the different conditions. If provided,
should be a character vector whose length is equal to the number of
rows of |
reference |
A character vector giving the names of the components used as reference (typically, reference genes in qRT-PCR). |
total |
The total amount. The sum of amounts in each condition will equal this total, when the data are made compositionnal. |
Arguments for the plot method
x |
The modele to be plotted |
xlab |
Legend for the X axis |
ylab.absolu |
Legend for the Y axis, in the amount scale (no constrain) |
ylab.relatif |
Legend for the Y axis, in the compositional scale. |
taille.noeud |
The plot size of nodes of the theoretical graph |
... |
Additionnal parameters for |
Details
The modele_compo
function creates a compositionnal data
model using the quantites provided: it converts amounts in fractions
of the total amount for each condition, then computes the theoretical
graph showing classes of equivalents components, that is components
that have the same evolution between the two conditions. If more than two
conditions are given, graphs correspond to comparison of each
condition with the first one.
The plot
methods represents the original quantities, the
quantities after conversion in compositional data ant the theoretical
graph.
Value
An object of class SARPcompo.modele
, with a plot
method. It is a list with the following elements:
Absolue |
The matrix of quantities in amount scale |
Relative |
The matrix of quantities in compositional data scale |
Graphes |
A list of length |
It also stores a few informations as attributes.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
estimer.puissance
and estimer.alpha
to use
these models in simulations to study power and type I error of the
method in a given situation.
Examples
## Create a toy example: four components, two conditions
## components 1 and 2 do not change between conditions
## component 3 is doubled
## component 4 is halfed
me <- rbind( 'A' = c( 1, 1, 1, 1 ),
'B' = c( 1, 1, 2, 0.5 ) )
colnames( me ) <- paste0( "C-", 1:4 )
md <- modele_compo( me )
## Plot it...
plot( md )
## What is approximately the power to detect that something changes
## between conditions A and B using a Student test
## with a CV of around 50 % ?
## (only a few simulations for speed, should be increased )
puissance <- estimer.puissance( md, cv = 0.50, B = 50, f.p = student.fpc )
plot( puissance )
Composition of Roman poteries
Description
This data set gives the oxide composition of several potteries found in five different archaelogic sites of the United Kingdom. Composition was obtained using atomic absorption spectrometry.
Usage
data(poteries)
Format
A data frame with 14 columns and 48 rows. Each row gives the composition of a pottery (columns 2 to 10), the archaelogical site where it was found (columns 6 and 7):
ID | factor | Pottery sample identifier (see original paper appendix) |
Al | numeric | Percentage of aluminium oxide |
Fe | numeric | Percentage of iron oxide |
Mg | numeric | Percentage of magnesium oxide |
Ca | numeric | Percentage of calcium oxide |
Na | numeric | Percentage of natrium oxide |
K | numeric | Percentage of kalium oxide |
Ti | numeric | Percentage of titanim oxide |
Mn | numeric | Percentage of manganese oxide |
Ba | numeric | Percentage of baryum oxide |
Site | factor | Kiln site |
Pays | factor | Location of the kiln site |
Couleur | factor | External color of the pottery |
Date | factor | Approximate date of the pottery |
Note
The DASL version of the dataset, as presented in the "Pottery stoty", does not include data on the poteries from the Gloucester site, neither the data on K, Ti, Mn and Ba oxides. It neither includes the color and date informations, and codes sites as their first letter only.
The DASL version of the dataset exists in the car
package, as
the Pottery
dataset (with two locations differently spelled).
Source
Downloaded from the DASL (Data and Story Library) website, and completed from the original paper of Tubb et al.
References
A. Tubb, A. J. Parker, and G. Nickless (1980). The analysis of Romano-British pottery by atomic absorption spectrophotometry. Archaeometry, 22, 153-171.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J., and E., O. (1994) A Handbook of Small Data Sets. Chapman and Hall – for the short version of the dataset.
Examples
data( poteries )
# Reconstruct the car version of this dataset
dcar <- poteries[ , c( 'Al', 'Fe', 'Mg', 'Ca', 'Na', 'Site' ) ]
dcar <- droplevels( dcar[ -which( dcar$Site == "College of Art" ), c( 6, 1:5 ) ] )
levels( dcar$Site )[ c( 1, 3, 4 ) ] <- c( "AshleyRails", "Islethorns", "Llanedyrn" )
# Reconstruct the DASL version of this dataset
ddasl <- dcar[ , c( 2:6, 1 ) ]
levels( ddasl$Site ) <- c( 'A', 'C', 'I', 'L' )
Estimate the power and the type-I error of the disjoint-subgraphs method
Description
Estimate the power and the type-I error of the disjoint-graph method to detect a change in compositions between different conditions
Usage
estimer.puissance( composition, cv.composition,
taille.groupes = 10, masque,
f.p, v.X = 'Condition',
seuil.candidats = ( 5:30 ) / 100,
f.correct = groupes.identiques,
groupes.attendus = composition$Graphes[[ 1 ]]$Connexe,
avec.classique = length( attr( composition, "reference" ) ) > 0,
f.correct.classique = genes.trouves,
genes.attendus,
B = 3000, n.coeurs = 1,
... )
estimer.alpha( composition, cv.composition,
taille.groupes = 10, masque,
f.p, v.X = 'Condition',
seuil.candidats = ( 5:30 ) / 100,
avec.classique = length( attr( composition, "reference" ) ) > 0,
B = 3000, n.coeurs = 1,
... )
Arguments
composition |
A composition model, as obtained by
|
cv.composition |
The expected coefficient of variation of the
quantified amounts. Should be either a single value, that will be
used for all components and all conditions, or a matrix with the
same structure than |
.
taille.groupes |
The sample size for each condition. Unused if
|
masque |
A data.frame that will give the dataset design for a
given experiment. Should contain at least one column containing the
names of the conditions, with values being in the conditions names
in |
f.p |
The function used to analyse the dataset. See
|
v.X |
The name of the column identifying the different conditions
in |
seuil.candidats |
A vector of p-value cut-offs to be tested. All values should be between 0 and 1. |
f.correct |
A function to determine if the result of the analysis is the expected one. Defaults to a function that compares the disjoint sub-graphs of a reference graph and the obtained one. |
groupes.attendus |
The reference graph for the above function. Defaults to the theoretical graph of the model, for the comparison between the first and the second conditions. |
avec.classique |
If If requested, the analysis is done with and without multiple testing correction (with Holm's method). The “cut-off p-value” is used as the nominal type~I error level for the individual tests. |
f.correct.classique |
A function to determine if the alr-like method finds the correct answer. Defaults to a function that compares the set of significant tests with the set of expected components. |
genes.attendus |
A character vector giving the names of components expected to behave differently than the reference set. |
B |
The number of simulations to be done. |
n.coeurs |
The number of CPU cores to use in computation, with parallelization using forks (does not work on Windows) with the help of the parallel package. |
... |
Additionnal parameters for helper functions, including |
Details
Use this function to simulate experiments and explore the properties of the disjoint graph method in a specified experimental context. Simulations are done using a log-normal model, so analysis is always done on the log scale. Coefficients of variation in the original scale hence directly translate into standard deviations in the log-scale.
For power analysis, care should be taken that any rejection of the null hypothesis “nothing is different between conditions” is counted as a success, even if the result does not respect the original changes. This is the reason for the additional correct-finding probability estimation. However, defining what is a correct, or at least acceptable, result may be not straightforward, especially for comparison with other analysis methods.
Note also that fair power comparisons can be done only for the same type I error level. Hence, for instance, power of the corrected alr-like method at p = 0.05 should be compared to the power of the disjoint-graph method at its “optimal” cut-off.
Value
An object of class SARPcompo.simulation
, with a plot
method. It is a data.frame with the following columns:
Seuil |
The cut-offs used to build the graph |
Disjoint |
The number of simulations that led to disjoint graphs. |
Correct |
The number of simulations that led to the correct graph
(as defined by the |
If avec.classique
is TRUE
, it has additionnal columns:
DDCt |
The number of simulations that led at least one significant test using the alr-like method. |
DDCt.H |
The number of simulations that led at least one significant test using the alr-like method, after multiple testing correction using Holm's method. |
DDCt.correct |
The number of simulations that detected the
correct components (as defined by the |
DDCt.H.correct |
As above, but after multiple testing correction using Holm's method. |
It also stores a few informations as attributes, including the total
number of simulations (attribute n.simulations
).
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
modele_compo
to create a compositional model for two or
more conditions.
creer.Mp
, which is used internally, for details about
analysis functions.
choisir.seuil
for a simpler interface to estimate the
optimal cut-off.
Examples
## Create a toy example: four components, two conditions
## components 1 and 2 do not change between conditions
## components 3 and 4 are doubled
## component 1 is a reference component
me <- rbind( 'A' = c( 1, 1, 1, 1 ),
'B' = c( 1, 1, 2, 2 ) )
colnames( me ) <- paste0( "C-", 1:4 )
md <- modele_compo( me, reference = 'C-1' )
## How many simulations?
## 50 is for speed; increase for useful results...
B <- 50
## What is the optimal cut-off for this situation?
## (only a few simulations for speed, should be increased)
## (B = 3000 suggests a cut-off between 0.104 and 0.122)
seuil <- choisir.seuil( 4, B = B )
## What is approximately the type I error
## between conditions A and B using a Student test
## with a CV of around 50 % ?
## (only a few simulations for speed, should be increased)
alpha <- estimer.alpha( md, cv = 0.50, B = B,
f.p = student.fpc )
# Plot it : darkgreen = the disjoint graph method
# orange = the alr-like method, Holm's corrected
# salmon = the alr-like method, uncorrected
plot( alpha )
## What is approximately the power to detect that something changes
## between conditions A and B using a Student test
## with a CV of around 50 % ?
## (only a few simulations for speed, should be increased)
puissance <- estimer.puissance( md, cv = 0.50, B = B,
f.p = student.fpc,
genes.attendus = c( 'C-3', 'C-4' ) )
# Plot it : darkgreen = the disjoint graph method
# orange = the alr-like method, Holm's corrected
# salmon = the alr-like method, uncorrected
plot( puissance )
## Do we detect the correct situation in general?
## (that is, exactly two sets: one with C-1 and C-2, the second with
## C-3 and C-4 --- for the alr-like method, that only C-3 and C-4
## are significant)
# darkgreen = the disjoint graph method
# orange = the alr-like method, Holm's corrected
# salmon = the alr-like method, uncorrected
plot( puissance, correct = TRUE )
Utility functions to obtain p-values from tests on individual ratios
Description
These functions can be used in the functions to perform analysis on all pairwise ratios of a compositional dataset
Usage
student.fpc( d, variable, v.X, var.equal = TRUE, ... )
anva1.fpc( d, variable, v.X, frm = NULL, ... )
anva1vi.fpc( d, variable, v.X, frm = NULL, ... )
rls.fpc( d, variable, v.X, frm = NULL, ... )
kw.fpc( d, variable, v.X, frm = NULL, ... )
anva_SC.fpc( d, variable, frm, SC = 1, type = 1, ... )
Arguments
d |
The data frame that contains the ratio to test, and all variables of the original data frame that where not used as compositional data. |
variable |
A length-one character vector containing the names of the variable corresponding to the ratio (or log-ratio) to test. |
v.X |
The name of the explanatory (independant, predictor)
variable. This variable should be a factor for |
frm |
The formula to use. Defaults to For |
SC |
For |
type |
For |
var.equal |
For |
... |
additional arguments |
Details
These functions are only wrapper to some commonly used tests. The correspondance is as follow
student.fpc | Student's T-test | t.test()$p.value |
anva1.fpc | One-way analysis of variance | anova(lm())[ 1, 5 ] |
rls.fpc | Simple linear regression | anova(lm())[ 1, 5 ] |
anva1vi.fpc | One-way analysis of variance, without equal variance assumption | oneway.test()$p.value |
kw.fpc | Kruskal-Wallis test | kruskal.test()$p.value |
anva_SC.fpc
is a generic wrapper for lm
using any
formula. It then extracts the p-value of the line given by
SC
in the analysis of variance table. If type = 1
, the
table is built using anova
and corresponds to type 1
(sequential sum of square). If type = 2
or type = 3
,
the table is built using car::Anova
and corresponds either to
type 2 or type 3 sums of squares.
For Student's test (either with equal or unequal variances), instead of calling t.test, the computation is done internally, hopefully speeding up (less controls are done and only useful computations are done)
Value
These function returns the p-value from the corresponding test.
Note
rls.fpc
is an exact synonym for anva1.fpc
, since the
underlying theory is the same. Distinction is made to help users
without a formal statistical background to find the right test.
Author(s)
Emmanuel Curis (emmanuel.curis@parisdescartes.fr)
See Also
kruskal.test
, lm
, anova
,
Anova
, oneway.test
, for corresponding tests.
equiv.fpc
for an approach using equivalence tests.
creer.Mp
to use these functions to create a matrix of
p-values.
Examples
# load the potery data set
data( poteries )
# Compute one-way ANOVA p-values for all ratios in this data set
Mp <- creer.Mp( poteries, c( 'Al', 'Na', 'Fe', 'Ca', 'Mg' ),
f.p = anva1.fpc, v.X = 'Site', frm = R ~ Site )
Mp
# Make a graphe from it and plot it
plot( grf.Mp( Mp ) )