At long last, version 0.12.2 of GEMINI supports multi-allelic variants, thanks to great work from Brent Pedersen. To provide this support, GEMINI now requires that your input VCF file undergo additional preprocessing: multi-allelic variants must be decomposed and normalized using the vt toolset from the Abecasis lab. Note that we have also decomposed and normalized all of the VCF-based annotation files (e.g., ExAC, dbSNP, ClinVar) so that variants and alleles are properly annotated and both false negative and false positive annotations are minimized. For a great discussion of why this is necessary, please read this blog post from Eric Minikel in Daniel MacArthur’s lab.
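To make the idea of decomposition concrete, here is a toy sketch of the record-splitting step only. It is not how vt works internally: the helper name `decompose_toy` is hypothetical, and unlike `vt decompose` it does not rewrite INFO fields or genotypes, nor does it normalize alleles. It only shows a record with several ALT alleles becoming one record per allele.

```shell
# Toy illustration (hypothetical helper, NOT a replacement for vt decompose):
# split every VCF record whose ALT column (field 5) lists several alleles
# into one record per alternate allele. Reads a tab-separated VCF on stdin.
decompose_toy() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    /^#/ { print; next }                 # pass header lines through unchanged
    {
      n = split($5, alts, ",")           # ALT alleles are comma-separated
      for (i = 1; i <= n; i++) {
        $5 = alts[i]                     # keep a single ALT allele per record
        print
      }
    }'
}

# Example: one record with ALT "G,T" becomes two records.
printf '1\t100\t.\tA\tG,T\n' | decompose_toy
# prints:
# 1	100	.	A	G
# 1	100	.	A	T
```

For real data, use vt as described here; the sketch above deliberately ignores everything except the ALT column.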
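Essentially, VCF preprocessing for GEMINI now boils down to the following steps.
1. Decompose the original VCF so that each multi-allelic variant is expanded into separate records, one per alternate allele.
2. Normalize the decomposed VCF against the reference genome.
3. Annotate the normalized VCF with snpEff or VEP.
4. Compress the annotated VCF with bgzip and index it with tabix.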
A workflow for the above steps is given below.
```shell
# setup
VCF=/path/to/my.vcf
NORMVCF=/path/to/my.norm.vcf.gz   # bgzip-compressed output, hence .gz
REF=/path/to/human.b37.fasta
SNPEFFJAR=/path/to/snpEff.jar
db=/path/to/my.db                 # placeholder path for the GEMINI database

# decompose, normalize and annotate VCF with snpEff.
# NOTE: can also swap snpEff with VEP
# NOTE: -classic and -formatEff flags needed with snpEff >= v4.1
# NOTE: the sed call redeclares the AD header field as Number=R
#       (one value per allele, reference included)
zless "$VCF" \
   | sed 's/ID=AD,Number=./ID=AD,Number=R/' \
   | vt decompose -s - \
   | vt normalize -r "$REF" - \
   | java -Xmx4G -jar "$SNPEFFJAR" -formatEff -classic GRCh37.75 \
   | bgzip -c > "$NORMVCF"
tabix -p vcf "$NORMVCF"

# load the pre-processed VCF into GEMINI
gemini load --cores 3 -t snpEff -v "$NORMVCF" "$db"

# query away
gemini query -q "select chrom, start, end, ref, alt, (gts).(*) from variants" \
             --gt-filter "gt_types.mom == HET and \
                          gt_types.dad == HET and \
                          gt_types.kid == HOM_ALT" \
             "$db"
```
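Before loading, it can be worth checking that the preprocessed file no longer contains multi-allelic records. The following is a minimal sketch, not part of GEMINI or vt: the helper name `count_multiallelic` is hypothetical, and it simply counts records whose ALT column still contains a comma. It assumes a tab-separated VCF on stdin and may miscount unusual symbolic alleles whose names contain commas.

```shell
# Hypothetical helper: count VCF records whose ALT column (field 5)
# still lists more than one allele. Reads an uncompressed VCF on stdin;
# header lines (starting with '#') are skipped.
count_multiallelic() {
  awk -F'\t' '!/^#/ && $5 ~ /,/ { n++ } END { print n + 0 }'
}

# Usage sketch, assuming NORMVCF points at the bgzip-compressed output
# of the workflow (bgzip files can be read with gzip -dc):
#   gzip -dc "$NORMVCF" | count_multiallelic
```

A result of `0` suggests that decomposition covered every record; any other value means some records still carry several ALT alleles and the preprocessing should be revisited.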
GEMINI (GEnome MINIng) is designed to be a flexible framework for exploring genetic variation in the context of the wealth of genome annotations available for the human genome. By placing genetic variants, sample genotypes, and useful genome annotations into an integrated database framework, GEMINI provides a simple, flexible, yet very powerful system for exploring genetic variation for disease and population genetics.
Using the GEMINI framework begins by loading a VCF file (and an optional PED file) into a database. Each variant is automatically annotated by comparing it to several genome annotations from sources such as ENCODE tracks, UCSC tracks, OMIM, dbSNP, KEGG, and HPRD. All of this information is stored in a portable SQLite database that allows one to explore and interpret both coding and non-coding variation using “off-the-shelf” tools or an enhanced SQL engine.
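Because the result is an ordinary SQLite file, basic columns can be inspected with the standard `sqlite3` command-line shell. The sketch below assumes `sqlite3` is installed and that the database has a `variants` table with `chrom`, `start`, `end`, `ref`, and `alt` columns, as used in the `gemini query` examples in this documentation. The helper name `peek_variants` and the path `my.db` are hypothetical. Genotype-level filtering still goes through `gemini query`.

```shell
# Hypothetical helper: print the first few variants from a GEMINI database
# using the plain sqlite3 shell. "end" is quoted because END is an SQL keyword.
#   $1 = path to the database, $2 = optional row limit (default 5)
peek_variants() {
  sqlite3 "$1" "SELECT chrom, start, \"end\", ref, alt FROM variants LIMIT ${2:-5};"
}

# Usage sketch (my.db is a placeholder path):
#   peek_variants my.db 10
```

By default `sqlite3` prints one row per line with columns separated by `|`.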
Please also see the original manuscript.
This video provides more details about GEMINI’s aims and utility.
If you use GEMINI in your research, please cite the following manuscript:
Paila U, Chapman BA, Kirchner R, Quinlan AR (2013) GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations. PLoS Comput Biol 9(7): e1003153. doi:10.1371/journal.pcbi.1003153