Type: | Package |
Title: | Partitioning of Individual Autozygosity into Multiple Homozygous-by-Descent Classes |
Version: | 0.4.1 |
Maintainer: | Tom Druet <tom.druet@uliege.be> |
Description: | Functions to identify Homozygous-by-Descent (HBD) segments associated with runs of homozygosity (ROH) and to estimate individual autozygosity (or inbreeding coefficient). HBD segments and autozygosity are assigned to multiple HBD classes with a model-based approach relying on a mixture of exponential distributions. The rate of the exponential distribution is distinct for each HBD class and defines the expected length of the HBD segments. These HBD classes are therefore related to the age of the segments (longer segments and smaller rates for recent autozygosity / recent common ancestor). The functions allow to estimate the parameters of the model (rates of the exponential distributions, mixing proportions), to estimate global and local autozygosity probabilities and to identify HBD segments with the Viterbi decoding. The method is fully described in Druet and Gautier (2017) <doi:10.1111/mec.14324> and Druet and Gautier (2022) <doi:10.1016/j.tpb.2022.03.001>. |
Depends: | R (≥ 3.5.0), methods |
License: | GPL-3 |
Encoding: | UTF-8 |
LazyData: | true |
Imports: | foreach, doParallel, parallel, data.table, RColorBrewer, iterators |
RoxygenNote: | 7.3.2 |
Suggests: | knitr, rmarkdown, testthat |
VignetteBuilder: | knitr |
NeedsCompilation: | yes |
Packaged: | 2025-06-08 06:21:36 UTC; tdruet |
Author: | Tom Druet [aut, cre], Naveen Kumar Kadri [aut], Amandine Bertrand [ctb], Mathieu Gautier [aut] |
Repository: | CRAN |
Date/Publication: | 2025-06-08 11:00:01 UTC |
RZooRoH: Partitioning of Individual Autozygosity into Multiple Homozygous-by-Descent Classes
Description
Functions to identify Homozygous-by-Descent (HBD) segments associated with runs of homozygosity (ROH) and to estimate individual autozygosity (or inbreeding coefficient). HBD segments and autozygosity are assigned to multiple HBD classes with a model-based approach relying on a mixture of exponential distributions. The rate of the exponential distribution is distinct for each HBD class and defines the expected length of the HBD segments. These HBD classes are therefore related to the age of the segments (longer segments and smaller rates for recent autozygosity / recent common ancestor). The functions allow to estimate the parameters of the model (rates of the exponential distributions, mixing proportions), to estimate global and local autozygosity probabilities and to identify HBD segments with the Viterbi decoding. The method is fully described in Druet and Gautier (2017) doi:10.1111/mec.14324 and Druet and Gautier (2022) doi:10.1016/j.tpb.2022.03.001.
Functions to identify Homozygous-by-Descent (HBD) segments associated with runs of homozygosity (RoH) and to estimate individual autozygosity (or inbreeding coefficient). HBD segments and autozygosity are assigned to multiple HBD classes with a model-based approach relying on a mixture of exponential distributions. The rate of the exponential distribution is distinct for each HBD class and defines the expected length of the HBD segments. These HBD classes are therefore related to the age of the segments (longer segments and smaller rates for recent autozygosity / recent common ancestor). The functions allow to estimate the parameters of the model (rates of the exponential distributions, mixing proportions), to estimate global and local autozygosity probabilities and to identify HBD segments with the Viterbi decoding. Functions also allow to compute kinship between individuals and to predict inbreeding in the future progeny of a genotyped couple.
Data pre-processing
Note that the model is designed for autosomes. Other chromosomes and additional filtering (e.g. bi-allelic markers, markers and individuals filtering on call rate, coding of missing genotypes, HWE, etc.) should be performed prior to run RZooRoH with tools such as plink or bcftools. The model works on an ordered map and ignores SNPs with a null position.
RZooRoH functions
The main functions included in the package are zoodata(), zoomodel() and zoorun(). The zoorun() function can also be applied to two (phased) haplotypes to obtain IBD probabilities with the same model. This is possible for haploid data (haploid organism or eventually specific cases with sex chromosomes) or for diploid individuals, but this requires then a prior phasing step. The zookin() functions estimates kinship between pairs of individuals with the ZooRoH model. To that end, it computes the IBD relationship between the four possible pairs of haplotypes between the two individuals (this is only possible with phased data). There are also four functions to plot the results: zooplot_partitioning(), zooplot_hbdseg(), zooplot_prophbd() and zooplot_individuals(). Eight accessors functions help to extract the results: realized(), cumhbd(), rohbd(), probhbd(), merge_zres and update_zres after HBD estimation and cumkin() and predhbd() after using zookin().
You can obtain individual help for each of the functions. By typing for instance: help(zoomodel) or ? zoomodel.
To run RZooRoH, you must first load your data with the zoodata() function. It will create a zooin object required for further analysis. Next, you need to define the model you want to run. You can define a default model by typing for instance, my.mod <- zoomodel(). Finally, you can run the model with the zoorun function. You can choose to estimate parameters with different procedures, estimate global and local homozygous-by-descent (HBD) probabilities with the Forward-Backward procedure or identify HBD segments with the Viterbi algorithm. The results are saved in a zres object.
The four plot functions zooplot_partitioning(), zooplot_hbdseg(), zooplot_prophbd() and zooplot_individuals() use zres objects to make different graphics. Similarly, the accessor functions help to extract information from the zres objects (see vignette for more details).
To get the list of data sets (for examples):
data(package="RZooRoH")
And to get the description of one data set, type ? name_data (with name_data being the name of the data set). For instance:
? genosim
Author(s)
Maintainer: Tom Druet tom.druet@uliege.be
Authors:
Naveen Kumar Kadri
Mathieu Gautier
Other contributors:
Amandine Bertrand [contributor]
Examples
# Start with a small data set with six individuals and external frequencies.
freqfile <- (system.file("exdata","typsfrq.txt",package="RZooRoH"))
typfile <- (system.file("exdata","typs.txt",package="RZooRoH"))
frq <- read.table(freqfile,header=FALSE)
typdata <- zoodata(typfile,supcol=4,chrcol=1,poscol=2,allelefreq=frq$V1)
# Define a model with two HBD classes with rates equal to 10 and 100.
Mod2L <- zoomodel(K=2,base_rate=10)
# Run the model on all individuals.
typ.res <- zoorun(Mod2L,typdata)
# Observe some results: likelihood, realized autozygosity in different
# HBD classes and identified HBD segments.
typ.res@modlik
typ.res@realized
typ.res@hbdseg
# Define a model with one HBD and one non-HBD class and run it.
Mod1R <- zoomodel(K=1,predefined=FALSE)
typ2.res <- zoorun(Mod1R,typdata)
# Print the estimated rates and mixing coefficients.
typ2.res@krates
typ2.res@mixc
# Get the name and location of a second example file.
myfile <- (system.file("exdata","genoex.txt",package="RZooRoH"))
# Load your data with default format:
example2 <- zoodata(myfile)
# Define the default model:
my.model <- zoomodel()
# Run RZooRoH on your data with the model (parameter estimation with optim). This can
# take a few minutes because it is a large model for 20 individuals:
my.res <- zoorun(my.model,example2)
# To run the model on a subset of individuals with 1 thread:
my.res3 <- zoorun(my.model, example2, ids=c(7,12,16,18), nT = 1)
# Define a smaller model and run it on two individuals.
my.mod2 <- zoomodel(K=3,base_rate=10)
my.res4 <- zoorun(my.mod2, example2, ids=c(9,18))
Example for "gp" format specification
Description
A dataset containing real genotype probabilities for 1,000 SNPs. The data is available for ten individuals and 1,000 SNPs on chromosome 10. The data corresponds to a low-fold sequencing experiment. There are three columns per individuals (for genotypes 00, 01 and 11).
Usage
BBB_NMP_GP_subset
Format
A data frame with 1,000 rows and 35 variables:
- chr
The chromosome number
- marker_name
The marker id
- pos
The position of the marker
- allele1
The name of the first marker allele
- allele2
The name of the second marker allele
- id1_gp1
The AA genotype probability for the first individual
- id1_gp2
The AB genotype probability for the first individual
- id1_gp3
The BB genotype probability for the first individual
- id2_gp1
The AA genotype probability for the second individual
- id2_gp2
The AB genotype probability for the second individual
- id2_gp3
The BB genotype probability for the second individual
- id3_gp1, id3_gp2, id3_gp3, id4_gp1, id4_gp2, id4_gp3, id5_gp1, id5_gp2, id5_gp3, id6_gp1, id6_gp2, id6_gp3, id7_gp1, id7_gp2, id7_gp3, id8_gp1, id8_gp2, id8_gp3, id9_gp1, id9_gp2, id9_gp3, id10_gp1, id10_gp2, id10_gp3
The genotype probabilities for the other individuals
Example for "ad" format specification
Description
A dataset containing real allele depth information for 1,000 SNPs. The data is available for ten individuals and the first 1,000 SNPs on chromosome 1. The data corresponds to a low-fold sequencing experiment. There are two columns per individuals (read counts for allele 1 and for allele 2).
Usage
BBB_NMP_ad_subset
Format
A data frame with 1,000 rows and 25 variables:
- chr
The chromosome number
- marker_name
The marker id
- pos
The position of the marker
- allele1
The name of the first marker allele
- allele2
The name of the second marker allele
- id1_ad1
The read count for the first individual at the first markrer
- id1_ad2
The read count for the first individual at the second markrer
- id2_ad1
The read count for the second individual at the first markrer
- id2_ad2
The read count for the second individual at the second markrer
- id3_ad1, id3_ad2, id4_ad1, id4_ad2, id5_ad1, id5_ad2, id6_ad1, id6_ad2, id7_ad1, id7_ad2, id8_ad1, id8_ad2, id9_ad1, id9_ad2, id10_ad1, id10_ad2
The read counts for the other individuals
Example for "pl" format specification
Description
A dataset containing real genotype likelihoods in phred scores for 1,000 SNPs. The data is available for ten individuals and the first 1,000 SNPs on chromosome 1. The data corresponds to a low-fold sequencing experiment. There are three columns per individuals (for genotypes 00, 01 and 11).
Usage
BBB_NMP_pl_subset
Format
A data frame with 1,000 rows and 35 variables:
- chr
The chromosome number
- marker_name
The marker id
- pos
The position of the marker
- allele1
The name of the first marker allele
- allele2
The name of the second marker allele
- id1_pl1
The AA phred likelihood for the first individual
- id1_pl2
The AB phred likelihood for the first individual
- id1_pl3
The BB phred likelihood for the first individual
- id2_pl1
The AA phred likelihood for the second individual
- id2_pl2
The AB phred likelihood for the second individual
- id2_pl3
The BB phred likelihood for the second individual
- id3_pl1, id3_pl2, id3_pl3, id4_pl1, id4_pl2, id4_pl3, id5_pl1, id5_pl2, id5_pl3, id6_pl1, id6_pl2, id6_pl3, id7_pl1, id7_pl2, id7_pl3, id8_pl1, id8_pl2, id8_pl3, id9_pl1, id9_pl2, id9_pl3, id10_pl1, id10_pl2, id10_pl3
The phred likelihoods for the other individuals
Example for "gt" format specification
Description
A dataset containing real genotypes for 1,000 SNPs. The data is available for ten individuals and the first 1,000 SNPs on chromosome 1. The data corresponds to a WGS experiment. There is one columns per individual.
Usage
BBB_PE_gt_subset
Format
A data frame with 1,000 rows and 14 variables:
- chr
The chromosome number
- pos
The position of the marker
- allele1
The name of the first marker allele
- allele2
The name of the second marker allele
- id1
The genotypes for the first individuals (id1)
- id2
The genotypes for the second individuals (id2)
- id3, id4, id5, id6, id7, id8, id9, id10
The genotypes of the remaining individuals
A file with names or IDs for ten samples.
Description
The names (or IDs) are provided for ten samples.
Usage
BBB_samples
Format
A data frame with 10 rows and one variable.
- IDs
The ids for ten individuals
Computes the realized inbreeding coefficient
Description
Computes the realized inbreeding coefficient with respect to a base population by summing the autozygosity from all HBD class with a rate lower or equal to the threshold T. This amounts to set the base population approximately 0.5*T generations ago. HBD classes with a higher rate are then no longer considered as autozygous.
Usage
cumhbd(zres, T = NULL)
Arguments
zres |
The name of the zres object created by the zoorun function. |
T |
The value chosen to define the base population. When T is not provided, all HBD classes are considered to estimate the inbreeding coefficient. |
Value
An array with the compute inbreeding coefficients for all the individuals in the analysis.
Computes the realized kinship
Description
Computes the realized kinship with respect to a base population by summing the relatedness from all classes with a rate lower or equal to the threshold T. This amounts to set the base population approximately 0.5*T generations ago. Classes with a higher rate are then no longer considered as IBD.
Usage
cumkin(kres, T = NULL)
Arguments
kres |
The name of the kres object created by the zoorun function. |
T |
The value chosen to define the base population. When T is not provided, all classes are considered to estimate the kinship. |
Value
An array with the computed kinship for all the pairs of individuals in the analysis.
Subset of a dataset with genotypes for 20 sheeps
Description
A dataset containing real genotypes for 20 individuals. Genotypes are available for 14,831 SNPs from the first three chromosomes were selected. The twenty last columns correspond to the genotypes. Missing genotypes are set to 9.
Usage
genoex
Format
A data frame with 14,831 rows and 25 variables:
- chr
The chromosome number
- marker_name
The name of the marker
- pos
The position of the marker
- allele1
The name of the first marker allele
- allele2
The name of the second marker allele
- id1
The genotype for the first individual
- id2
The genotype for the second individual
- id3, id4, id5, id6, id7, id8, id9, id10, id11, id12, id13, id14, id15, id16, id17, id18, id19, id20
The genotypes of the remaining individuals
Example from a small simulated data set
Description
A dataset containing simulated genotypes for 20 individuals. Genotypes are available for 10,000 SNPs on 10 chromosomes. The ten last columns correspond to the genotypes.
Usage
genosim
Format
A data frame with 10,000 rows and 24 variables:
- chr
The chromosome number
- pos
The position of the marker
- allele1
The name of the first marker allele
- allele2
The name of the second marker allele
- id1
The genotype for the first individual
- id2
The genotype for the second individual
- id3, id4, id5, id6, id7, id8, id9, id10, id11, id12, id13, id14, id15, id16, id17, id18, id19, id20
The genotypes of the remaining individuals
The result of local IBD probabilities for a pair of individuals among the 18 cows from the Amsterdam Island.
Description
The results were obtained by running the a model with 4 layers and the predhbd function for all HBD classes. Results were extracted for chromosome 1.
Usage
hbd1
Format
the result is an array.
Details
Results were extracted for chromosome 1 and the map is also provided (map1).
The result of local IBD probabilities for a pair of individuals among the 18 cows from the Amsterdam Island (recent only).
Description
The results were obtained by running the a model with 4 layers and the predhbd function for HBD classes with a rate < 10. Results were extracted for chromosome 1.
Usage
hbd2
Format
the result is an array.
The result of a kinship analysis on 18 cattle from the Amsterdam Island.
Description
The results were obtained by running the a model with 4 layers and the kinship option among all pairs of individuals (genotyped at 23679 SNPs after filtering).
Usage
kintaf_mix4l
Format
the results are a zres object.
The map for local IBD probabilities for a pair of individuals among the 18 cows from the Amsterdam Island (recent only).
Description
The map corresponds to the results obtained by running the a model with 4 layers and the predhbd function (see hbd1 and hbd2). Results were extracted for chromosome 1.
Usage
map1
Format
the result is an array.
Merge several zres objects generated by zoorun
Description
The function is used for example when the analysis has been split in several tasks. All the zres must come from the same original zoodata (zoorun is applied to the same zoodata with the same model). In addition, the data should not overlap (the same individual should not be present in multiple zres).
Usage
merge_zres(list_zres)
Arguments
list_zres |
A list with the name of the zres objects to be merged. |
Value
a single zres object containing the results from the merged zres objects. Note that the HBD segments are not sorted by ID number.
Extracts the IBD probabilities from the kres object
Description
Extracts the locus specific IBD probabilities for a pair of individuals. This is the probability that one pair of chromosomes (one sampled in each individual) are IBD at that position. A specific chromosomal region can be specified. A threshold T can be used to determine which classes are used in the computation of the IBD probability. This function requires that the option "localhbd" was set to TRUE when creating the kres object.
Usage
predhbd(
kres,
zooin,
num,
chrom = NULL,
startPos = NULL,
endPos = NULL,
T = FALSE
)
Arguments
kres |
The name of the kres object created by the zookin function. |
zooin |
The name of the zdata object created by the zoodata function. See "zoodata" for more details. |
num |
The number of the pair of individuals to extract (correspond to the position in the list of pairs). |
chrom |
the number of the chromosome where we are looking for IBD segments. This chromosome number refers to the position of the chromosome in the list of all chromsomes present in the input genotype data. |
startPos |
The starting position (on the chromosome) of the interval from which we will extract IBD segments (1 by default). |
endPos |
The ending position (on the chromosome) of the interval from which we will extract IBD segments (last position by default). |
T |
The value chosen to define the base population (to determine which classes are used in estimated of IBD probability, which classes are considered autozygous). When T is not provided, all IBD classes are considered to estimate the local IBD probability. |
Value
The function returns a vector of IBD probabilities averaged over the four combination of pairs of parental haplotypes for the specified pair of individuals and chromosomal region. The IBD probabilities are computed as the sum of the probabilities for each class with a rate smaller or equal than the threshold (the sum from all the IBD classes when T is not specified).
Extracts the HBD probabilities from the zres object
Description
Extracts the locus specific HBD probabilities for an individual. A specific chromosomal region can be specified. A threshold T can be used to determine which HBD classes are used in the computation of the HBD probability. This function requires that the option "localhbd" was set to TRUE when creating the zres object.
Usage
probhbd(
zres,
zooin,
id,
chrom = NULL,
startPos = NULL,
endPos = NULL,
T = FALSE
)
Arguments
zres |
The name of the zres object created by the zoorun function. |
zooin |
The name of the zdata object created by the zoodata function. See "zoodata" for more details. |
id |
The number of the individual to extract. |
chrom |
the number of the chromosome where we are looking for HBD segments. This chromosome number refers to the position of the chromosome in the list of all chromsomes present in the input genotype data. |
startPos |
The starting position (on the chromosome) of the interval from which we will extract HBD segments (1 by default). |
endPos |
The ending position (on the chromosome) of the interval from which we will extract HBD segments (last position by default). |
T |
The value chosen to define the base population (to determine which classes are used in estimated of HBD probability, which classes are considered autozygous). When T is not provided, all HBD classes are considered to estimate the local HBD probability. |
Value
The function returns a vector of HBD probabilities for the specified individual and chromosomal region. The HBD probabilities are computed as the sum of the probabilities for each HBD class with a rate smaller or equal than the threshold (the sum from all the HBD classes when T is not specified).
The result of an analysis on 22 sheeps from Rasa Aragonesa population.
Description
The results were obtained by running the default model (10 layers with pre-defined rates) on 22 individuals genotyped at 37465 SNPs.
Usage
rara_mix10l
Format
the results are a zres object.
Extracts the realized autozygosity from the zres object
Description
Extracts the realized autozygosity from the zres object. Extraction is performed for the indicated classes (all by default) and names are added to the columns. The function must be used with more than one individual is the zres object.
Usage
realized(zres, classNum = NULL)
Arguments
zres |
The name of the zres object created by the zoorun function. |
classNum |
An array with the number of the classes to extract. All classes are extracted by default. |
Value
The function returns a data frame with one row per individual and one column per extracted classes. In addition, it gives names to the columns. For a pre-defined model, the names of HBD classes are "R_X" where X is the rate of the corresponding class. For a model with rate estimation, the names of the HBD classes are "HBDclassX" where X is the number of the HBD class. For non-HBD classes, we use "NonHBD".
Extracts the HBD segments from the zres object
Description
Extracts the HBD segments (or RoH) from the zres object. Extraction is performed for the indicated individuals and the selected region (all by default).
Usage
rohbd(
zres,
ids = NULL,
chrom = NULL,
startPos = NULL,
endPos = NULL,
inside = TRUE
)
Arguments
zres |
The name of the zres object created by the zoorun function. |
ids |
An array with the ids of the individuals to extract. All individuals are extracted by default. |
chrom |
the number of the chromosome where we are looking for HBD segments. This chromosome number refers to the position of the chromosome in the list of all chromsomes present in the input genotype data. |
startPos |
The starting position (on the chromosome) of the interval from which we will extract HBD segments (1 by default). |
endPos |
The ending position (on the chromosome) of the interval from which we will extract HBD segments (last position by default). |
inside |
A logical indicating whether we extract only segment within the interval (TRUE) or overlapping with the interval (FALSE). By 'within the interval', we mean that both starting and end position of the HBD segment should be in the interval. By 'overlapping', we mean that at least one part of the HBD segment should be located in the interval. |
Value
The function returns a data frame with the HBD segments fitting the filtering rules (id and position). The data frame has one line per identified HBD segment and nine columns: id is the number of the individual in which the HBD segments is located, chrom is the chromosome of the HBD segments, start_snp is the number of the SNP at which the HBD segment starts (the SNP number within the chromosome), start_end is the number of the SNP at which the HBD segment ends (the SNP number within the chromosome), start_pos is the position at which the HBD segment starts (within the chromosome), start_end is the position at which the HBD segment ends (within the chromosome), number_snp is the number of consecutive SNPs in the HBD segment, length is the length of the HBD segment (for instance in bp or in cM/1000000) and HBDclass is the HBD class associated with the HBD segment.
The result of an analysis on 110 sheeps from the Soay population.
Description
The results were obtained by running the default model (10 layers with pre-defined rates) on 110 individuals genotyped at 37465 SNPs.
Usage
soay_mix10l
Format
the results are a zres object.
Subset of a dataset with genotypes for 6 individuals from a cattle population.
Description
A dataset containing real genotypes for 6 individuals. Genotypes are available for a low density array with 6370 SNPs on 29 autosomes. The six last columns correspond to the genotypes. Missing genotypes are set to 9.
Usage
typs
Format
A data frame with 6370 rows and 10 variables:
- chr
The chromosome number
- pos
The position of the marker
- allele1
The name of the first marker allele
- allele2
The name of the second marker allele
- id1
The genotypes for the first individuals (id1)
- id2
The genotypes for the second individuals (id2)
- id3
The genotypes for the third individuals (id3)
- id4
The genotypes for the fourth individuals (id4)
- id5
The genotypes for the fifth individuals (id5)
- id6
The genotypes for the sixth individuals (id6)
A file with marker allele frequencies for the cattle population.
Description
The allele frequencies of the first allele were computed in a larger sample.
Usage
typsfrq
Format
A data frame with 6370 rows and one variable.
- frq
The allele frequencies estimated for the first allele
Update one main zres object with new results
Description
The function is used for example when the main analysis failed for one individual. The analysis is repeated for that individual with other parameters. The function can then be used to insert the new results in the main zres object. To avoid generating to large files, the updated zres object can take the same name as zres1 doing zres1 <- update_zres(zres1,zres2).
Usage
update_zres(zres1, zres2)
Arguments
zres1 |
The main zres object that will be modified. |
zres2 |
The new zres object, with the new results that will be inserted in the main zres object. |
Value
an updated zres object containing the results from zres1 updated by those from zres2.
The result of an analysis on 23 sheeps from Wiltshire population.
Description
The results were obtained by running the default model (10 layers with pre-defined rates) on 23 individuals genotyped at 37465 SNPs.
Usage
wilt_mix10l
Format
the results are a zres object.
Read the genotype data file
Description
Read a data file and convert it to the RZooRoH format in a 'zooin' object required for further analysis.
Usage
zoodata(
genofile,
min_maf = 0,
zformat = "gt",
chrcol = 1,
poscol = 0,
supcol = 0,
haploid = FALSE,
allelefreq = NULL,
freqem = FALSE,
samplefile = NA
)
Arguments
genofile |
The name of the input data file. Note that the model is designed for autosomes. Other chromosomes and additional filtering (e.g. call rate, missing, HWE, etc.) should be performed prior to run RZooRoH with tools such as PLINK or bcftools for instance. The model works on an ordered map and ignores SNPs with a null position. |
min_maf |
The minimum allele frequency to keep variants in the analysis (optional / set to 0.00 by default to keep all markers). Values such as 0.01 allows to exclude monomorphic markers that are not informative and to reduce the size of the data and computational costs. There is no marker exclusion on call rate. However, we expect that data filtering is done prior to RZooRoH with tools such as PLINK or vcftools. |
zformat |
The code corresponding to the format of the data file ("gt" for genotypes, "gp" for genotype probabilities, "gl" for genotype likelihoods in Phred scores, "ad" for allelic depths). For all these formats, markers are ordered per rows and individuals per columns. Variants should be ordered by chromosome and position. By default, the format is inspired from the Oxford/GEN format, and the first five columns are chromosome identification (e.g, "1", "chr1"), the name of the marker, the position of the marker in base pairs or better in cM multiplied by 1,000,000 when genetic distances are known, the first marker allele and the second marker allele. Information per individual varies according to the format. With the "gt" format we have one column per individual with 0, 1 and 2 indicating the number of copies of the first allele (and 9 for missing). With the "gp" format we have three column per individual with the probabilities of genotype 11 (homozygous for the first allele), genotype 12 and genotype 22 (this corresponds to the oxford GEN format). Similarly, with the "gl" format, we have three column per individual with the likelihoods for genotypes 11, 12 and 22 in Phred scores. Finally, with the "ad" format, we expect two columns per individual: the number of reads for allele 1 and the number of reads for allele 2. For these three last formats, missing values must be indicated by setting all elements to 0. If one of the columns is non-null for one individual, the genotype will be considered non-missing. Note that the marker alleles specified in columns 4 and 5 are not used. Conversion of a PLINK ped file or a VCF file to RZooRoH format can easily be performed using PLINK (version 1.9) or using bcftools. For ped files, recode them to oxford gen format with plink –file myinput –recode oxford –autosome –out myoutput. The autosome option keeps only SNPs on autosomes as required by RZooRoH. For vcf files, bcftools can be used to recode a vcf to the oxford gen format with the convert option: bcftools convert -t ^chrX,chrY,chrM -g outfile –chrom –tag GT myfile.vcf. The –chrom option is important to obtain chromosome number in the first column. The tag option allows to select which field from the vcf file (GT, PL, GL or GP) is used to generate the genotype probabilities exported in the oxford gen format. The -t option allows to exclude chromosomes (this is an example and chromosome names must be adapted if necessary). The needed output data is then outfile.gen. If some genotype probabilities are missing, with a value of "-nan", you must replace them with "0" (triple 0 is considered as missing). This can be done with this command: sed -e 's/-nan/0/g' file.gen > newfile.gen Note that some software recode missing as 0.333, 0.333 and 0.333. This was not previously considered as missing. This has no consequence of emission probabilities but impacts slightly estimation of allele frequencies (when the freqem option is false). Now, if the three genotype probabilities are above 0.33, the genotype is considered missing. Ideally, missing should be set to 0 0 0. For applications related to identity-by-descent (IBD) estimation, the ZooRoH model is applied on two phased chromosomes or two haploid chromosomes. Two additional formats are therefore available to provide haplotypes (or haploid data). The first is "vcf" which refers to a phased VCF (for example, the output from Beagle5). In that case, the VCF must contain only the phased haplotype information (e.g 0|1, 0|0, 1|1, etc). The phased VCF can only be used with diploid individuals (after phasing). By default, we assume that the first and second column indicate chromosome and SNP position and that the individual haplotypes information starts in column 10. The second format is "haps" that is similar to the "GEN" format with five columns (chromosome identification (e.g, "1", "chr1"), the name of the marker, the position of the marker in base pairs or better in cM multiplied by 1,000,000 when genetic distances are known, the first marker allele and the second marker allele) followed by the haplotypes (two columns for diploid individuals and one in case of haploid chromosome). The alleles are coded as 0 and 1. The "haps" format is the only format that can be used with haploid data. As for the genotype formats, it is possible to use tools such as bcftools to format file in the correct format. |
chrcol |
An optional argument that indicates the column number where the chromosome information is indicated (first column by default for all formats). |
poscol |
An optional argument that indicates the column number where the marker position is indicated (third column by default for all formats except phased vcf where it is the second column). |
supcol |
An optional argument that indicates the number of additional columns before the individuals genotypes or haplotypes (five columns are expected by default as described in the zformat argument description, except for phased VCF where this value is set to 9). Note that the function requires at least two information columns: the chromosome number and the marker position. |
haploid |
An optional argument that indicates whether we work on an haploid organism or chromosome (default = FALSE). This is only compatible with the 'IBD' option from the zoorun function (we don't estimate HBD in haploid data!). In the case you use haploid data, the only possible format is "haps". |
allelefreq |
A vector with allele frequencies for the first marker allele (optional). By default, the allele frequencies are estimated from the data. The option allows to skip this computation or to provide external allele frequencies estimated by another method or on another data set. |
freqem |
A logical indicating whether allele frequencies should be estimated with an EM algorithm. By default they are estimated with simpler approaches. The approach is ignored with the GT format. For high confidence genotypes (e.g., genotyping arrays, high-coverage sequencing data), it is not necessary to use this EM approach as genotypes are known. The approach is more relevant with low-fold sequencing for example, and more so with the PL or GP format (the approximation with the AD format being closer to the EM). |
samplefile |
A file with names of the samples (optional). It must match with the number of genotypes. If none is provided, the position in the genofile is used as ID. |
Value
The function return a zooin object called containing the following elements: zooin@genos a matrix with the genotypes, genotype probabilities or haplotypes, zooin@bp an array with marker positions, zooin@chrbound a matrix with the first and last marker number for each chromosome, zooin@nind the number of individuals, zooin@nsnps the number of markers conserved after filtering for minor allele frequency, zooin@freqs an array with the marker allele frequencies, zooin@nchr the number of chromosomes, zooin@zformat the format of the data ("gt","gp","gl","ad","vcf","haps") and zooin@sample_ids (the names of the samples).
Examples
# Get the name and location of example files
myfile1 <- system.file("exdata","genoex.txt",package="RZooRoH")
myfile2 <- system.file("exdata","genosim.txt",package="RZooRoH")
# Load your data with default format into a zooin object named "data1":
data1 <- zoodata(myfile1)
# Load the first data file with default format and filtering out markers with MAF < 0.02
# into a zooin object called "data1frq002":
data1frq002 <- zoodata(myfile1, min_maf = 0.02)
# Load the first data file with default format, with external allele frequencies
# (here a random set we create) and filtering out markers with MAF < 0.01:
myrandomfreq <- runif(14831)
data1c <- zoodata(myfile1, allelefreq = myrandomfreq, min_maf = 0.01)
# Load the second data file and indicate your own format (chromosome number in column 1,
# map position in column 2, 4 columns before genotypes) and filtering out markers with
# MAF < 0.01. The created zooin object is called "Sim5":
Sim5 <- zoodata(myfile2, chrcol = 1, poscol =2, supcol = 4, min_maf = 0.01)
Use the ZooRoH model to estimate kinship between pairs of individuals
Description
Apply the defined model to estimate kinship for pairs of individuals by running the ZooRoH model to estimate IBD probabilities between the four possible pairs of chromosome of the two individuals (one chromosome from each individual). As for zoorun this includes parameter estimation, computation of realized IBD, IBD probabilities, and identification of IBD segments. Options are similar to the zoorun function.
Usage
zookin(
zoomodel,
zooin,
parameters = TRUE,
fb = TRUE,
vit = TRUE,
localhbd = FALSE,
nT = 1,
optim_method = "L-BFGS-B",
maxiter = 1000,
minmix = 1,
maxr = 1e+08,
kinpairs = NULL,
RecTable = FALSE,
trim_ad = FALSE,
hemiprob = 0
)
Arguments
zoomodel |
A valid zmodel object as defined by the zoomodel function. The model indicates whether rates of exponential distributions are estimated or predefined, the number of classes, the starting values for mixing coefficients and rates, the error probabilities. See "zoomodel" for more details. zookin can not be run with a KL model because four different rate parameters would be estimated for each pair of haplotypes, making interpretation difficult. |
zooin |
A valid zdata object as obtained by the zoodata function. See "zoodata" for more details. |
parameters |
Specifies whether the parameters are estimated by optimization with the L-BFGS-B method from the optim function (optional argument - true by default). If the user doesn't want to estimate the parameters he must set parameters=FALSE. In that case, the forward-backaward and Viterbi algorithms are run with the provided parameters. |
fb |
A logical indicating whether the forward-backward algorithm is run (optional argument - true by default). The Forward-Backward algorithm estimates the local probabilities to belong to each IBD or non-IBD class. By default, the function returns only the IBD probabilities for each class, averaged genome-wide, and corresponding to the realized autozygosity associated with each class. To obtain HBD probabilities at every marker position, the option localhbd must be set to true (this generates larger outputs). |
vit |
A logical indicating whether the Viterbi algorithm is run (optional argument - false by default). The Viterbi algorithm performs the decoding (determining the underlying class at every marker position). Whereas the Forward-Backward algorithms provide IBD probabilities (and how confident a region can be declared IBD), the Viterbi algorithm assigns every marker position to one of the defined classes (IBD or non-IBD). When informativity is high (many SNPs per IBD segments), results from the Forward-Backward and the Viterbi algorithm are very similar. The Viterbi algorithm is best suited to identify IBD segments. To estimate realized kinship and determine IBD status of a position, we recommend to use the Forward-Backward algorithm that better reflects uncertainty. |
localhbd |
A logical indicating whether the IBD probabilities for each individual at each marker are returned when using the Forward-Backward algorithm (fb option). This is an optional argument that is false by default. |
nT |
Indicates the number of threads used when running RZooRoH in parallel (optional argument - one thread by default). |
optim_method |
Indicates which method the optim R function will use to estimate the parameters of the model ("L-BFGS-B" by default). The possible methods are "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN" and "Brent". Type "? optim" to have more information. In our experience, the "L-BFGS-B" method works well but the method achieving the best likelihood is variable (according to the data sets, the model, the priors, the constraints). The same goes for the efficiency (speed). When the zoorun does not converge, you can test with another method. Note that the only method allowing to put constraints on parameters is "L-BFGS-B" (other methods are unconstrained). |
maxiter |
Indicates the maximum number of iterations when estimating the parameters with the R optim function (optional argument - 100 by default). Iterations are not defined identically across methods. For instance, in one iteration of the "L-BFGS-B" method, the likelihood of the model, estimated with the forward algorithm, is evaluated multiple times. So, a value of 100 iterations is good for the "L-BFGS-B" method but large values are required for some other algorithms. |
minmix |
This indicates the minimal value for the mixing coefficients. By default it is set to 0 with the classical mixkl and kl models (unconstrained). However, when using the step option or the "Interval" HBDclass, the values is set to 1e-16 to avoid numerical problems. Note that constraints are only allowed with the "L-BFGS-B" method from optim. |
maxr |
This indicates the maximum difference between rates of successive classes. It is an optional argument set to an arbitrarily large value (100000000). Adding such constraints might slow down the speed of convergence and we recommend to run first without this constraint (constraints are only allowed with the "L-BFGS-B" method from optim). |
kinpairs |
A matrix with two columns, indicating the pairs of individuals being analyzed. This information is required for a zookin analysis. |
RecTable |
This is an optimal parameter indicating whether a finite number of genetic distances are used (false by default). This function can be used only with the "Interval" HBDclass. The "Interval" option can be slow, in particular if large "intervals" of generations are defined. To speed up computations, some variables are precomputed for a finite set of genetic distances, select to cover a broad range of possible values. The real genetic distance between two genetic markers is then replaced by the closest value in the table (the difference between the true and used genetic distances being also lower than 10"%"). |
trim_ad |
This is an option still under evaluation (for testing only) |
hemiprob |
This is an option still under evaluation (for testing only) |
Value
The function return a kinres object with several slots accesses by the "@" symbol. It is very similar to the zoores object. The three main results are kinres@realized (the matrix with partitioning of the genome in different IBD classes for each pair of individual), These values are scale like coancestry coefficient and represent the probability that two alleles sampled in the two individuals are IBD (and in that length class). The sum of the realized values in all IBD classes correspond to the kinship. kinres@hbdseg (a data frame with identified IBD segments) and zoores@hbdp (a list of matrices with IBD probabilities per SNP and per class). This gives at one marker position, the probability that two haplotypes, one sampled in each individual, are IBD (and in that length class) at that position. It corresponds also to the predicted HBD values in their future progeny.
Some slots present in zoores objects are not reported because for each pair of individuals we would then have four values. If you are interested in these values for each pair of haplotypes, you can run zoorun with the ibd option. These slots are @mixc, @modlik, @modbic, @niter, @optimerr.
Here is a list with all the slots and their description:
kinres@npairs the number of pairs of individuals in the analysis,
kinres@ids a matrix containing the numbers of the analyzed pairs of individuals (their position in the data file),
kinres@krates the rates for the exponential distributions associated with each IBD or non-IBD class for all individuals,
kinres@realized a matrix with estimated realized IBD per layer (columns) for each pair of individuals (rows). These values are obtained with the Forward-Backward algorithm - fb option - and averaging over the four possible haplotype pairs (between the two individuals),
kinres@ibdp a list of matrices with the local probabilities of IBD in different layers (computed for every class and every pair of individuals). The IBD probability is the probability that two alleles sampled at that locus in the two individuals (one in each) are IBD with respect to that layer (e.g., the IBD segment must correspond to the layer). Each matrix has one row per layer / class and one column per snp. To access the matrix for pair (of individuals) i, use the brackets "[[]]", for instance kinres@hbdp[[i]],
zookin@ibdseg a data frame with the list of identified IBD segments with the Viterbi algorithm (the columns are the haplotype pair number, the chromosome number, the first and last SNP of the segment, the positions of the first and last SNP of the segment, the number of SNPs in the segment, the length of the segment, the HBD state of the segment),
kinres@sampleids is a vector with the names of the pair of samples (when provided in the zooin object through the zoodata function). The ids from the two individuals are separated by a "_".
kinres@haplotype_ids is a vector with the information on haplotype pairs in the analysis and used in the ibdseg table.
Define the model for the RZooRoH
Description
Help the user to create a model for RZooRoH, including default parameters. The output is a zmodel object necessary to run RZooRoH.
Usage
zoomodel(
predefined = TRUE,
K = 10,
mix_coef = rep(0, K),
base_rate = 2,
krates = rep(0, K),
err = 0.001,
seqerr = 0.001,
step = FALSE,
XM = rep(0, K),
HBDclass = "SingleRate"
)
Arguments
predefined |
Logical (TRUE or FALSE) to define whether rates of HBD and non-HBD-classes will be estimated by the model ("kl" model) or whether the rates of these classes are fixed and pre-defined by the user ("mixkl" model). The default value is "predefined = TRUE". |
K |
The number of layers with the nested 1R model implemented in 2022, this represents thus the number of HBD classes. By default, K is set to 10 but this is not optimal for all data sets. Hence, we recommend to the users to select their own value. If K is set to 1 and rates are estimated, RZooRoH will use the same rate for the HBD and the non-HBD class (so-called 1R model). |
mix_coef |
The starting value for the mixing coefficients in each layer. The mixing coefficients determine the frequency of the segments from different classes, they determine the probability to start a new HBD segment in a given layer (after reaching that layer after a coancestry change). The mixing coefficients can also be interpreted as the rate of inbreeding in their respective layer and should take values between 0 and 1. The function expects K mixing coefficients (e.g. the number of layers). The default values are 0.01 for HBD classes (smaller values may be better for large values of K). In case the parameters are not estimated (e.g. when running the forward-backward or the Viterbi algorithm alone), these are the mixing coefficients used by the RZooRoH model. Is it possible to force some mixing coefficient to take the same value with the step option. In that case, the expected number of mixing coefficients might be different. |
base_rate |
is a integer used to define the rates of successive layers or HBD classes (see krates below). This parameter is most useful when using a model with predefined rates. The rate of each HBD class will be equal to the base_rate raised to the exponent k (the class number). The non-HBD class will have the same rate as the last HBD class. For instance, with a base_rate of 2 and four layers, we have the following rates: 2, 4, 8, 16 and 16. Similarly, with a base_rate of 10 and three layers, we have 10, 100, 1000 and 1000. With this method, more HBD classes are defined for more recent ancestors (for which we have more information to estimate R) and less for ancient HBD classes (it doesn't make sense to try to distinguish R = 1000 from R = 1010). In addition, since the expected length of HBD segments is expected to be 1/R, the ratio between successive expected HBD lengths remains the same. This ratio also determines the ability of the model to distinguish segments from distinct classes. By keeping the ratio constant, the aptitude to discriminate between HBD classes is also constant. In addition, this method allows to cover a wide range of generations in the past, with more emphasis on recent ancestors. The default value for the base_rate is 2. |
krates |
Is an array with a rate in each layer. The function expects K positive rates. These rates are parameters of the exponential distribution that together with the distance in centimorgans defines the probability to end a HBD segments between two markers. Each layer / HBD class has a distinct rate. Therefore, the expected length of HBD classes is defined by the rates. The expected length is equal to 1/R. These krates are associated with the age of the common ancestor of the HBD segment. The rate is approximately equal to the size of the inbreeding loop (twice the number of generations to the common ancestor) when the map is given in Morgans. This is not an exact dating but an approximation, in simplified conditions. By default, the rates are defined by the base_rate parameter (2, 4, 8, 16, ...). |
err |
Indicates the error term, the probability to observe an heterozygous genotype in a HBD segment. The genotype could be heterozygous due to a mutation occuring on the path to the common ancestor. It can also be associated with a genotype calling error or a technical error. In case GP or GL formats are used (with genotyped probabilities or phred scores) or when an AD format is used (based on read counts), this error term still represents the probability to observe an heterozygous genotype in a HBD segment. When an heterozygous genotype was called with a probability equal to 1.00, this heterozygosity in an HBD track might be associated to a mutation or to errors not accounted for by the model used to estimate the genotype probabilities (e.g., GATK). The emission probability to observe a heterozygous genotype in an HBD class will never go below the error term. The default value is 0.001. |
seqerr |
This parameters is used only with the AD format. In the AD format the user gives the number of reads for both alleles. A simple model is then used to estimate the genotype probabilities based on the read counts. In that model, the seqerr represents the probability to have a sequencing error in one read. The default value is 0.001. |
step |
Logical (TRUE or FALSE). This allows to use a step function to model the mixing coefficients. First a model with many layers is defined. For example, one layer for each even number (50 layers from 2 to 100 for example). Although such a model allows a finer resolution, it requires the estimation of many parameters (probably without enough information to estimate precisely all mixing coefficients). Therefore, we propose a stepfunction that forces mixing coefficients to be constant for several consecutive layers (by blocks of layers). When this option is used, K represents the total number of fitted layers, all their rates must be defined. Regarding the mixing coefficients, one mixing coefficient must be defined per step (block). An incidence matrix relating the full set of mixing coefficients from HBD classes from all layers to the steps or blocks must be provided. |
XM |
Is an incidence matrix relating the full set of mixing coefficients to the reduced set of mixing coefficients (corresponding to steps or blocks). |
HBDclass |
By default, this value is set to "SingleRate". This means that the layer is defined by a single rate (like an average or global rate for that layer). With the option "Interval", one rate is first defined for every past generation. If we assume discrete past generations (1, 2, 3, 4, 5, ...), the corresponding rates would approximately be equal to 2, 4, 6, ... We insist that this is a crude approximation. However, it allows easier interpretation of the results, for instance by understanding better how many generations are approximately captured per layer. One layer corresponds then to multiple "generations" and is defined as an interval going from generations g1 to g2. In this approach, we use only discrete generations and start at generation 1. The values in the krates vector correspond then to the last generation in each layer. For example, setting krates to (10,20,50,100) would define four layers from generations 1 to 10, 11 to 20, 21 to 50 and 51 to 100. To obtain the corresponding rates, these values are multiplied by two (inside the function). The rates used by ZooRoH would then go from 2 to 20, 22 to 40, 42 to 100 and 102 to 200. |
Value
The function return an object that defines a model for RZooRoH and including the following elements: zmodel@typeModel equal to "kl", "mixkl" or "step_mixkl" according to the selected model, zmodel@mix_coef an array with mixing coefficients, zmodel@krates an array with the rates of the HBD classes, zmodel@err the parameter defining the probability to observe an heterozygous genotype in an HBD class, and zmodel@seqerr the parameter defining the probability of sequencing error per read, zmodel@XM the incidence matrix relating individual layers to blocks or steps, zmodel@typeClass indicating the type of HBD classes,
Examples
# To define a the default model, with 10 layers (10 HBD and 1 non-HBD class)
# and with pre-defined rates for HBD classes with a base of 2 (2, 4, 8, ...):
Mix10L <- zoomodel()
# To see the parameters of the defined model, just type:
Mix10L
# To define a model with four layers and using a base of 10 to define
# rates (10, 100, 1000, ...):
Mix4L <- zoomodel(K=4,base=10)
# To define a model with two classes, with estimation of rates for HBD classes
# and starting with a rate 10:
my.mod1R <- zoomodel(predefined=FALSE,K=1,krates=c(10))
Plot HBD segments identified with the ZooROH model
Description
Plot HBD segments identified with the ZooRoH model for one or several populations.
Usage
zooplot_hbdseg(
input,
chr = NULL,
coord = NULL,
minlen = 0,
cols = NULL,
plotids = TRUE,
toplot = NULL,
randomids = FALSE,
nrandom = (rep(10, length(input))),
seed = 100
)
Arguments
input |
a named list with one or several zres objects obtained after running zoorun. The zres objects are the output of the zoorun function. For instance, putting list(name1 = zres1, name2 = zres2). The function will then use the names in the plot (in case several zres objects are used). |
chr |
the number of the chromosome where we are looking for HBD segments. This chromosome number refers to the position of the chromosome in the list of all chromsomes present in the input genotype data. |
coord |
a vector with the start and end position (in bp) of the region to plot. |
minlen |
the minimal length (in cM or Mb) of HBD segments to be plotted (set to 0 by default). |
cols |
a vector with the colors to be used for each population or zres object. |
plotids |
a logical indicating whether the IDs of the individuals are plotted on the graph (TRUE by default). |
toplot |
a list of vectors indicating the zres@ids to be plotted. This option can be used to select the individuals to plot. The list must contain one vector per population or zres object. By default, all individuals are plotted. |
randomids |
a logical indicating whether a randomset of individuals is plotted. This option allows to reduce the number of individuals in the plot. The option can not be used simultaneously with the toplot option. By default, randomids is FALSE. |
nrandom |
a vector indicating the number of individuals to be randomly sampled per population or per zres object when randomids is TRUE. By default, we select 10 individuals per zres object. This vector must have the same length as the input list. |
seed |
a value for the random seed used to sample individuals to plot (when the randomids option is TRUE). |
Value
The function plots the HBD segments identified in the region, using different colors for different zres object. Each line represents a different individual.
Plot individual curves with proportion of the genome in each HBD class or cumulated proportion in HBD classes with rates smaller than a threshold.
Description
For each individual, the function plots the mean percentage of the genome in different HBD classes or the inbreeding coefficient obtained by summing autozygosity associated with HBD classes with a rate lower or equal to a threshold (e.g., including all HBD classes with longer and more recent HBD segments than a selected threshold).
Usage
zooplot_individuals(input, cumulative = TRUE, toplot = NULL, ncols = 2)
Arguments
input |
a named list with one or several zres objects obtained after running zoorun. The zres objects are the output of the zoorun function. For instance, putting list(name1 = zres1, name2 = zres2). The function will then use the names in the plot (in case several zres objects are used). |
cumulative |
a logical indicating whether individual autozygosity is plotted per class (FALSE) or summed over all HBD class with a rate smaller than a value (these cumulated values are obtained for every rate defined in the model). By default, this value is TRUE. When FALSE, the percentages correspond to the individual genome-wide probabilities of belonging to each HBD-class or to the fraction of the genome in an autozygosity class. When TRUE, we obtain the probability of belonging to an HBD class with a rate smaller or equal than a threshold (here we use the pre-defined rates of the model as thresholds), averaged over the whole genome for each individual. This corresponds to report individual genomic inbreeding coefficients estimated with respect to different base populations obtained by selecting different thresholds T that determine which HBD classes are considered in the estimation of the genomic inbreeding coefficient (setting the base population approximately 0.5 * T generations ago). |
toplot |
A list of vectors indicating the zres@ids to be plotted. This option can be used to select the individuals to plot. The list must contain one vector per population or zres object. By default, all individuals are plotted. |
ncols |
when several populations are plotted, ncols determines how many results (graphs) are plotted per row. |
Value
The function plots either the individual proportions of the genome associated with different HBD classes or individual genomic inbreeding coefficients estimated with respect to different base populations (from young to older). With both option, the average values are plotted in red.
Plot the partitioning of the genome in different HBD classes for each individual
Description
Plot the partitioning of the genome in different HBD classes for each individual
Usage
zooplot_partitioning(
input,
cols = NULL,
plotids = TRUE,
toplot = NULL,
randomids = FALSE,
nrandom = NULL,
seed = 100,
ylim = c(0, 1),
border = TRUE,
nonhbd = TRUE,
vertical = FALSE
)
Arguments
input |
a named list with one or several zres objects obtained after running zoorun. The zres objects are the output of the zoorun function. For instance, putting list(name1 = zres1, name2 = zres2). The function will then use the names in the plot (in case several zres objects are used). |
cols |
A vector with the colors to be used for each class in the model. |
plotids |
A logical indicating whether the IDs of the individuals are plotted on the graph (TRUE by default). |
toplot |
A list of vectors indicating the zres@ids to be plotted. This option can be used to select the individuals to plot. The list must contain one vector per population or zres object. By default, all individuals are plotted. |
randomids |
A logical indicating whether a randomset of individuals is plotted. This option allows to reduce the number of individuals in the plot. The option can not be used simultaneously with the toplot option. By default, randomids is FALSE. |
nrandom |
A vector indicating the number of individuals to be randomly sampled per population or per zres object when randomids is TRUE. By default, we select 10 individuals per zres object. This vector must have the same length as the input list. |
seed |
A value for the random seed used to sample individuals to plot (when the randomids option is TRUE). |
ylim |
The limits of the y-axis. |
border |
Whether a border is plotted around each block of the barplot or not. When set to FALSE, it allows to get a less dense plot when many individuals are plotted. |
nonhbd |
Whether the a border is plotted around the non-hbd contribution. When set to FALSE, it allows to get a less dense plot when many individuals are plotted. |
vertical |
Whether the populations or zres labels are printed vertically or not. |
Value
Individuals are presented with stacked barplots. Each vertical stack of bars represents one individual. Each class is represented with a bar of a different color. The height of the bar represents the proportion associated with the corresponding class. The total height of the stack is the total autozygosity.
Plot proportion of the genome associated with different HBD classes
Description
Plot the mean percentage of the genome in different HBD classes or the inbreeding coefficient obtained by summing autozygosity associated with HBD classes with a rate lower or equal to a threshold (e.g., including all HBD classes with longer and more recent HBD segments than a selected threshold).
Usage
zooplot_prophbd(input, cols = NULL, style = "barplot", cumulative = FALSE)
Arguments
input |
a named list with one or several zres objects obtained after running zoorun. The zres objects are the output of the zoorun function. For instance, putting list(name1 = zres1, name2 = zres2). The function will then use the names in the plot (in case several zres objects are used). |
cols |
a vector with the colors to be used for each population or zres object. |
style |
select "barplot", "lines" or "boxplot" for the graphic styles. Boxplot can be used with a single zres file or population. |
cumulative |
a logical indicating whether mean autozygosity is estimated per class (FALSE) or summed over all HBD class with a rate smaller than a value (these cumulated values are obtained for every rate defined in the model). By default, this value is FALSE. When FALSE, the percentages correspond to the mean individual genome-wide probabilities of belonging to each HBD-class or to the fraction of the genome in an autozygosity class. When TRUE, we obtain the mean probability of belonging to an HBD class with a rate smaller or equal than a threshold (here we use the pre-defined rates of the model as thresholds), averaged over the whole genome and all individuals. This corresponds to report mean genomic inbreeding coefficients estimated with respect to different base populations obtained by selecting different thresholds T that determine which HBD classes are considered in the estimation of the genomic inbreeding coefficient (setting the base population approximately 0.5 * T generations ago). |
Value
The function plots either the average proportion of the genome associated with different HBD classes or the average genomic inbreeding coefficient estimated with respect to different base populations (from young to older).
Run the ZooRoH model
Description
Apply the defined model on a group of individuals: parameter estimation, computation of realized autozygosity and homozygous-by-descent probabilities, and identification of HBD segments (decoding). It is also possible to apply the model to a pair of phased haplotypes with the 'ibd' option.
Usage
zoorun(
zoomodel,
zooin,
ids = NULL,
parameters = TRUE,
fb = TRUE,
vit = TRUE,
localhbd = FALSE,
nT = 1,
optim_method = "L-BFGS-B",
maxiter = 100,
minmix = 1,
maxr = 1e+08,
ibd = FALSE,
ibdpairs = NULL,
haploid = FALSE,
RecTable = FALSE,
trim_ad = FALSE,
hemiprob = 0
)
Arguments
zoomodel |
A valid zmodel object as defined by the zoomodel function. The model indicates whether rates of exponential distributions are estimated or predefined, the number of classes, the starting values for mixing coefficients and rates, the error probabilities. See "zoomodel" for more details. |
zooin |
A valid zdata object as obtained by the zoodata function. See "zoodata" for more details. |
ids |
An optional argument indicating the individual (its position in the data file) that must be proceeded. It can also be a vector containing the list of numbers that must be proceeded. By default, the model runs for all individuals. |
parameters |
Specifies whether the parameters are estimated by optimization with the L-BFGS-B method from the optim function (optional argument - true by default). If the user doesn't want to estimate the parameters he must set parameters=FALSE. In that case, the forward-backaward and Viterbi algorithms are run with the provided parameters. |
fb |
A logical indicating whether the forward-backward algorithm is run (optional argument - true by default). The Forward-Backward algorithm estimates the local probabilities to belong to each HBD or non-HBD class. By default, the function returns only the HBD probabilities for each class, averaged genome-wide, and corresponding to the realized autozygosity associated with each class. To obtain HBD probabilities at every marker position, the option localhbd must be set to true (this generates larger outputs). |
vit |
A logical indicating whether the Viterbi algorithm is run (optional argument - false by default). The Viterbi algorithm performs the decoding (determining the underlying class at every marker position). Whereas the Forward-Backward algorithms provide HBD probabilities (and how confident a region can be declared HBD), the Viterbi algorithm assigns every marker position to one of the defined classes (HBD or non-HBD). When informativity is high (many SNPs per HBD segments), results from the Forward-Backward and the Viterbi algorithm are very similar. The Viterbi algorithm is best suited to identify HBD segments. To estimate realized inbreeding and determine HBD status of a position, we recommend to use the Forward-Backward algorithm that better reflects uncertainty. |
localhbd |
A logical indicating whether the HBD probabilities for each individual at each marker are returned when using the Forward-Backward algorithm (fb option). This is an optional argument that is false by default. |
nT |
Indicates the number of threads used when running RZooRoH in parallel (optional argument - one thread by default). |
optim_method |
Indicates which method the optim R function will use to estimate the parameters of the model ("L-BFGS-B" by default). The possible methods are "Nelder-Mead", "BFGS", "CG", "L-BFGS-B", "SANN" and "Brent". Type "? optim" to have more information. In our experience, the "L-BFGS-B" method works well but the method achieving the best likelihood is variable (according to the data sets, the model, the priors, the constraints). The same goes for the efficiency (speed). When the zoorun does not converge, you can test with another method. Note that the only method allowing to put constraints on parameters is "L-BFGS-B" (other methods are unconstrained). |
maxiter |
Indicates the maximum number of iterations when estimating the parameters with the R optim function (optional argument - 100 by default). Iterations are not defined identically across methods. For instance, in one iteration of the "L-BFGS-B" method, the likelihood of the model, estimated with the forward algorithm, is evaluated multiple times. So, a value of 100 iterations is good for the "L-BFGS-B" method but larger values are required for some other algorithms. |
minmix |
This indicates the minimal value for the mixing coefficients. By default it is set to 0 with the classical mixkl and kl models (unconstrained). However, when using the step option or the "Interval" HBDclass, the values is set to 1e-16 to avoid numerical problems. Note that constraints are only allowed with the "L-BFGS-B" method from optim. |
maxr |
This indicates the maximum difference between rates of successive classes. It is an optional argument set to an arbitrarily large value (100000000). Adding such constraints might slow down the speed of convergence and we recommend to run first without this constraint (constraints are only allowed with the "L-BFGS-B" method from optim). |
ibd |
A logical indicating whether the function will be used to compute IBD between pairs of phased haplotypes instead of HBD within individuals. In that case, the user must provide a matrix with the pairs of haplotypes that will be analyzed. This option can only be used if phased data are provided as input with the zformat set to "vcf" or "haps". This is an optional parameter set to false by default. |
ibdpairs |
A matrix with four columns, indicating the pair of haplotypes being analyzed in an IBD analysis. Haplotypes are indicated by two columns, one column for the id of the individuals and a second column for the haplotype number within individual (1 or 2). The first and third columns indicate the id of the individuals carrying the first and second haplotype, respectively. The second and four columns indicates the haplotype numbers within the first and second individuals, respectively. With haploid data, the matrix must have only two columns indicating the number of the first and second haplotypes from the pairs. This is an optional parameter, the matrix must be provided only when the ibd option is true. |
haploid |
This is an optional parameter indicating whether haplotypes belong to an haploid organisms or chromosome (false by default). It can be used only in combination with the 'ibd' option and requires phased data as input with the zformat set to "haps". When haploid is true, then the ibdpairs matrix has only two columns indicating simply the haplotype numbers. When haploid is true the number of haplotypes can be uneven while even numbers are required when haploid is set to false. |
RecTable |
This is an optional parameter indicating whether a finite number of genetic distances are used (false by default). This function can be used only with the "Interval" HBDclass. The "Interval" option can be slow, in particular if large "intervals" of generations are defined. To speed up computations, some variables are precomputed for a finite set of genetic distances, select to cover a broad range of possible values. The real genetic distance between two genetic markers is then replaced by the closest value in the table (the difference between the true and used genetic distances being also lower than 10"%"). |
trim_ad |
This is an option still under evaluation (for testing only) |
hemiprob |
This is an option still under evaluation (for testing only) |
Value
The function return a zoores object with several slots accesses by the "@" symbol. The three main results are zoores@realized (the matrix with partitioning of the genome in different HBD classes for each individual), zoores@hbdseg (a data frame with identified HBD segments) and zoores@hbdp (a list of matrices with HBD probabilities per SNP and per class).
Here is a list with all the slots and their description:
zoores@nind the number of individuals in the analysis,
zoores@ids a vector containing the numbers of the analyzed individuals (their position in the data file),
zoores@mixc the (estimated) mixing coefficients per class for all individuals,
zoores@krates the (estimated) rates for the exponential distributions associated with each HBD or non-HBD class for all individuals,
zoores@niter the number of iterations for estimating the parameters - the number of calls of the forward algorithm (per individual),
zoores@modlik a vector containing the likelihood of the model for each individual,
zoores@modbic a vector containing the value of the BIC for each individual,
zoores@realized a matrix with estimated realized autozygosity per HBD class (columns) for each individual (rows). These values are obtained with the Forward-Backward algorithm - fb option),
zoores@hbdp a list of matrices with the local probabilities to belong to an underlying hidden state (computed for every class and every individual). Each matrix has one row per class and one column per snp. To access the matrix from individual i, use the brackets "[[]]", for instance zoores@hbdp[[i]],
zoores@hbdseg a data frame with the list of identified HBD segments with the Viterbi algorithm (the columns are the individual number, the chromosome number, the first and last SNP of the segment, the positions of the first and last SNP of the segment, the number of SNPs in the segment, the length of the segment, the HBD state of the segment). In case of IBD, the first column indicates the number of the pair of haplotypes,
zoores@optimerr a vector indicating whether optim ran with or without error (0 indicates successful completion / 1 indicates that the iteration limit has been reached / 51 and 52 indicate warnings from the "L-BFGS-B" method / 99 indicates numerical problem). See the optim R function for more details.
zoores@sampleids is a vector with the names of the samples (when provided in the zooin object through the zoodata function). When an IBD analysis is run, it indicates the pairs of haplotypes separated by "_". For diploid individuals, it combines individual ID, haplotype ID (1 or 2) for the two individuals.
Examples
# Start with a small data set with six individuals and external frequencies.
freqfile <- (system.file("exdata","typsfrq.txt",package="RZooRoH"))
typfile <- (system.file("exdata","typs.txt",package="RZooRoH"))
frq <- read.table(freqfile,header=FALSE)
typ <- zoodata(typfile,supcol=4,chrcol=1,poscol=2,allelefreq=frq$V1)
# Define a model with two HBD classes with rates equal to 10 and 100.
Mod2L <- zoomodel(K=2,base_rate=10)
# Run the model on all individuals.
typ.res <- zoorun(Mod2L, typ)
# Observe some results: likelihood, realized autozygosity in different
# HBD classes and identified HBD segments.
typ.res@modlik
typ.res@realized
typ.res@hbdseg
# Define a model with one HBD and one non-HBD class and run it.
Mod1R <- zoomodel(K=1,predefined=FALSE)
typ2.res <- zoorun(Mod1R, typ)
# Print the estimated rates and mixing coefficients.
typ2.res@krates
typ2.res@mixc
# Get the name and location of a second example file and load the data:
myfile <- (system.file("exdata","genoex.txt",package="RZooRoH"))
ex2 <- zoodata(myfile)
# Run RZooRoH to estimate parameters on your data with the 1 HBD and 1 non-HBD
# class (parameter estimation with optim).
my.mod1R <- zoomodel(predefined=FALSE,K=1,krates=c(10))
my.res <- zoorun(my.mod1R, ex2, fb = FALSE, vit = FALSE)
# The estimated rates and mixing coefficients:
my.res@mixc
my.res@krates
# Run the same model and run the Forward-Backward alogrithm to estimate
# realized autozygosity and the Viterbi algorithm to identify HBD segments:
my.res2 <- zoorun(my.mod1R, ex2)
# The table with estimated realized autozygosity:
my.res2@realized
# Run a model with 3 layers (3 HBD classes / 1 non-HBD class) and estimate
# the rates of HBD classes with one thread:
my.mod3L <- zoomodel(predefined=FALSE,K=3,krates=c(16,64,256))
my.res3 <- zoorun(my.mod3L, ex2, fb = FALSE, vit = FALSE, nT =1)
# The estimated rates for the 3 classes and the 20 individuals:
my.res3@krates
# Run a model with 4 layers and predefined rates.
# The model is run only for a subset of four selected individuals.
# The parameters are estimated, the Forward-Backward alogrithm is ued to
# estimate realized autozygosity and the Viterbi algorithm to identify
# HBD segments. One thread is used.
mix4L <- zoomodel(K=4,base=10)
my.res4 <- zoorun(mix4L,ex2,ids=c(7,12,16,18), nT = 1)
# The table with all identified HBD segments:
my.res4@hbdseg