All Downloads are FREE. Search and download functionalities are using the official Maven repository.

net.maizegenetics.dna.snp.bit.package-info Maven / Gradle / Ivy

Go to download

TASSEL is a software package to evaluate traits associations, evolutionary patterns, and linkage disequilibrium.

The newest version!
/**
 * Support for bit level encoding of allele presence and absence.
 * Background on Bit Encoding in TASSEL:

TASSEL has been optimized for several things. One of TASSEL’s optimization is for very fast comparison of the sequence of two taxa for millions of sites (e.g. genetic distance or imputation), or the comparison of two sites for thousands of taxa (e.g. LD).

To do this, the presence of an allele is coded with a single bit (presence= 1, absence=0). With this encoding, then bit level operations can be used calculate genetic distance or LD.

So how is this done? The main assumptions: (1) we generally only care about the two most alleles at any given locus. This may not be true for GWAS at a specific site, but across a genome it is a close enough approximation. (2) phase is generally not tracked in the bit array, however, if everything is phased, then the haplotypes can be used. (3) If allele 2 is unknown, the genotype is assumed to be homozygous for allele 1. e.g. A/N assumed to be A/A. Unknown heterozygotes are not support. Completely missing genotypes are supported.

So the approach is slightly lossy – a bit of information is lots – just like some compression algorithms, in exchange for some very nice features.

While the bit level conserves memory space, it is not the main purpose. The above sequence if store as text would take 16 bytes (8 sites x 2 alleles in a diploid). Using TASSEL half byte approach, it fits into 8 bytes. The bit version would take one byte for the major allele and one for the minor allele for a total of 2 bytes.

In practice, we store these bits of allele presence in sets of 64 (a long). This works very efficiently as most computer architectures are also 64-bit, so when we process the sequence everything is being done at 64 sites at a time. THESE RESULTS IN THE ALGORITHMS THAT USE BIT LEVEL ENCODING TO BE 25 TO 50 TIMES FASTER. */ package net.maizegenetics.dna.snp.bit;




© 2015 - 2024 Weber Informatics LLC | Privacy Policy