Annotation with snpEff or VEP¶
GEMINI depends upon external tools to predict the functional consequence of variants in a VCF file. We currently support annotations produced by either SnpEff or VEP.
Note
Versions tested: VEP versions 73 through 75 and core SnpEff versions 3.0 through 3.6. GEMINI supports ENSEMBL annotations hence users are expected to download genome databases for these tools as represented in the examples below.
Note
Version support would be subsequently updated here, as we test along and add or edit changes available with the latest version of these tools.
Recommended instructions for annotating existing VCF files with these tools are summarized here.
Stepwise installation and usage of VEP¶
Download the Variant Effect Predictor “standalone perl script” from Ensembl. You can choose a specific version of VEP to download here
Example:
Download version 74
Untar the tarball into the current directory
$ tar -zxvf variant_effect_predictor.tar.gz
This will create the variant_effect_predictor directory. Now do the following for install:
$ cd variant_effect_predictor
$ perl INSTALL.pl [options]
By default this would install the API’s, bioperl-1.2.3 and the cache files (in the $HOME/.vep directory).
Homebrew or Anaconda VEP installation¶
If you are a Homebrew, Linuxbrew or Anaconda user, there is an automated recipe to install the main VEP script and plugins in the CloudBioLinux homebrew repository:
$ brew tap chapmanb/cbl
$ brew update
$ brew install vep
For Anaconda/Miniconda, just make sure you are pointing to the bioconda channel:
$ conda install variant-effect-predictor -c bioconda
Manual installation of VEP¶
For those (e.g mac users) who have a problem installing through this install script, try a
manual installation of the API’s, BioPerl-1.2.3 and set all pre-requisites for running VEP
(DBI
and DBD::mysql
modules required). The appropriate pre-build caches should be
downloaded for Human to the $HOME/.vep
directory and then untar.
You may follow instructions at http://www.ensembl.org/info/docs/api/api_installation.html which provides alternate options for the API installation and additional tips for windows/mac users. It also has information for setting up your environment to run VEP.
Example download of the cache files
$ wget ftp://ftp.ensembl.org/pub/release-73/variation/VEP/homo_sapiens_vep_73.tar.gz
You may change the release date in this example to get the appropriate cache files for your version of VEP that you have installed.
Example
$ wget ftp://ftp.ensembl.org/pub/release-74/variation/VEP/homo_sapiens_vep_74.tar.gz
Cache requires the gzip
and zcat
utilities. VEP uses zcat
to decompress cached files. For systems where zcat may not be installed or may
not work, the following option needs to be added along with the --cache
option:
--compress "gunzip -c"
Running VEP¶
You may now run VEP as:
$ perl variant_effect_predictor.pl [OPTIONS]
We recommend running VEP with the following options as currently we support VEP fields specified as below:
$ perl variant_effect_predictor.pl -i example.vcf \
--cache \
--sift b \
--polyphen b \
--symbol \
--numbers \
--biotype \
--total_length \
-o output \
--vcf \
--fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE
A documentation for the above specified options may be found at http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html
As of GEMINI version 0.8.0, you can also run VEP with additional fields, which will be automatically added to the variants table as columns. As an example, run VEP on your VCF with the dbNSFP and LOFTEE plugins to annotate potential high impact variations:
$ variant_effect_predictor.pl --sift b --polyphen b --symbol --numbers --biotype \
--total_length --canonical --ccds \
--fields Consequence,Codons,Amino_acids,Gene,SYMBOL,Feature,EXON,PolyPhen,SIFT,Protein_position,BIOTYPE,CANONICAL,CCDS,RadialSVM_score,RadialSVM_pred,LR_score,LR_pred,CADD_raw,CADD_phred,Reliability_index,LoF,LoF_filter,LoF_flags \
--plugin dbNSFP,/path/to/dbNSFP_v2.5.gz,RadialSVM_score,RadialSVM_pred,LR_score,LR_pred,CADD_raw,CADD_phred,Reliability_index \
--plugin LoF,human_ancestor_fa:/path/to/human_ancestor.fa
Feeding this into GEMINI produces a variants table with columns for each of the
additional VEP metrics. The annotation loader names each column by prefixing
vep_
to the origin VEP name, so select on vep_radialsvm_score
or
vep_lof_filter
in the final database.
Stepwise installation and usage of SnpEff¶
Note
Basic Requirements: Java v1.7 or later; at least 4GB of memory
Download the supported versions of SnpEff from http://snpeff.sourceforge.net/download.html
Example:
$ wget http://sourceforge.net/projects/snpeff/files/snpEff_v3_6_core.zip
Note
SnpEff should be installed preferably in snpEff
directory in your
home directory. Else, you must update the data_dir
parameter in
your snpEff.config file. For e.g. if the installation of snpEff has been done
in ~/src
instead of ~/
then change the data_dir parameter in
snpEff.config to data_dir = ~/src/snpEff/data/
Unzip the downloaded package.
$ unzip snpEff_v3_6_core.zip
Change to the snpEff
directory and download the genome database.
$ cd snpEff_v3_6_core
$ java -jar snpEff.jar download GRCh37.69
Unzip the downloaded genome database. This will create and place the genome in the ‘data’ directory
$ unzip snpEff_v3_6_GRCh37.69.zip
To annotate a vcf using snpEff, use the default
options as below:
Note
Memory options for the run may be specified as -Xmx4G
(4GB)
$ java -Xmx4G -jar snpEff.jar -i vcf -o vcf GRCh37.69 example.vcf > example_snpeff.vcf
If running from a directory different from the installation directory, the complete path needs to be specified as, e.g.:
$ java -Xmx4G -jar path/to/snpEff/snpEff.jar -c path/to/snpEff/snpEff.config GRCh37.69 path/to/example.vcf > example_snpeff.vcf
Note
When using the latest versions of snpEff (e.g. 4.1) annotate your VCF with the additional parameters -classic
and -formatEff
.
This would ensure proper loading of the gene info columns in the variants table.
Columns populated by snpEff/VEP tools¶
The following variant consequence columns in the variant/variant_impacts table, are populated with these annotations, which are otherwise set to null.
- anno_id
- gene
- transcript
- exon
- is_exonic
- is_lof
- is_coding
- codon_change
- aa_change
- aa_length
- biotype
- impact
- impact_so
- impact_severity
- polyphen_pred
- polyphen_score
- sift_pred
- sift_score
Standardizing impact
definitions for GEMINI¶
GEMINI uses slightly modified impact terms (for ease) to describe the functional consequence of a given variant as provided by snpEff/VEP.
The table below shows the alternate GEMINI terms used for snpEff/VEP.
GEMINI terms | snpEff terms | VEP terms (uses SO by default) |
---|---|---|
splice_acceptor | SPLICE_SITE_ACCEPTOR | splice_acceptor_variant |
splice_donor | SPLICE_SITE_DONOR | splice_donor_variant |
stop_gain | STOP_GAINED | stop_gained |
stop_loss | STOP_LOST | stop_lost |
frame_shift | FRAME_SHIFT | frameshift_variant |
start_loss | START_LOST | null |
exon_deleted | EXON_DELETED | null |
non_synonymous_start | NON_SYNONYMOUS_START | null |
transcript_codon_change | null | initiator_codon_variant |
chrom_large_del | CHROMOSOME_LARGE_DELETION | null |
rare_amino_acid | RARE_AMINO_ACID | null |
non_syn_coding | NON_SYNONYMOUS_CODING | missense_variant |
inframe_codon_gain | CODON_INSERTION | inframe_insertion |
inframe_codon_loss | CODON_DELETION | inframe_deletion |
inframe_codon_change | CODON_CHANGE | null |
codon_change_del | CODON_CHANGE_PLUS_CODON_DELETION | null |
codon_change_ins | CODON_CHANGE_PLUS_CODON_INSERTION | null |
UTR_5_del | UTR_5_DELETED | null |
UTR_3_del | UTR_3_DELETED | null |
splice_region | SPLICE_SITE_REGION | splice_region_variant |
mature_miRNA | null | mature_miRNA_variant |
regulatory_region | null | regulatory_region_variant |
TF_binding_site | null | TF_binding_site_variant |
regulatory_region_ablation | null | regulatory_region_ablation |
regulatory_region_amplification | null | regulatory_region_amplification |
TFBS_ablation | null | TFBS_ablation |
TFBS_amplification | null | TFBS_amplification |
synonymous_stop | SYNONYMOUS_STOP | stop_retained_variant |
synonymous_coding | SYNONYMOUS_CODING | synonymous_variant |
UTR_5_prime | UTR_5_PRIME | 5_prime_UTR_variant |
UTR_3_prime | UTR_3_PRIME | 3_prime_UTR_variant |
intron | INTRON | intron_variant |
CDS | CDS | coding_sequence_variant |
upstream | UPSTREAM | upstream_gene_variant |
downstream | DOWNSTREAM | downstream_gene_variant |
intergenic | INTERGENIC | intergenic_variant |
intergenic_conserved | INTERGENIC_CONSERVED | null |
intragenic | INTRAGENIC | null |
gene | GENE | null |
transcript | TRANSCRIPT | null |
exon | EXON | null |
start_gain | START_GAINED | null |
synonymous_start | SYNONYMOUS_START | null |
intron_conserved | INTRON_CONSERVED | null |
nc_transcript | null | nc_transcript_variant (should have been returned by VEP as: non_coding_transcript_variant) |
NMD_transcript | null | NMD_transcript_variant |
incomplete_terminal_codon | null | incomplete_terminal_codon_variant |
nc_exon | null | non_coding_exon_variant (should have been returned by VEP as: non_coding_transcript_exon_variant) |
transcript_ablation | null | transcript_ablation |
transcript_amplification | null | transcript_amplification |
feature elongation | null | feature_elongation |
feature truncation | null | feature_truncation |
Note: “null” refers to the absence of the corresponding term in the alternate database
SO impact
definitions in GEMINI¶
The below table shows the Sequence Ontology (SO) term mappings for the GEMINI impacts, which is otherwise contained in the
impact_so
column of the variants/variant_impacts table of the GEMINI database. The last column shows the severity terms
defined in GEMINI for these impacts.
GEMINI terms (column: impact) | Sequence Ontology terms (column: impact_so) | Impact severity |
---|---|---|
splice_acceptor | splice_acceptor_variant | HIGH |
splice_donor | splice_donor_variant | HIGH |
stop_gain | stop_gained | HIGH |
stop_loss | stop_lost | HIGH |
frame_shift | frameshift_variant | HIGH |
start_loss | start_lost | HIGH |
exon_deleted | exon_loss_variant | HIGH |
non_synonymous_start | initiator_codon_variant | HIGH |
transcript_codon_change | initiator_codon_variant | HIGH |
chrom_large_del | chromosomal_deletion | HIGH |
rare_amino_acid | rare_amino_acid_variant | HIGH |
non_syn_coding | missense_variant | MED |
inframe_codon_gain | inframe_insertion | MED |
inframe_codon_loss | inframe_deletion | MED |
inframe_codon_change | coding_sequence_variant | MED |
codon_change_del | disruptive_inframe_deletion | MED |
codon_change_ins | disruptive_inframe_insertion | MED |
UTR_5_del | 5_prime_UTR_truncation + exon_loss_variant | MED |
UTR_3_del | 3_prime_UTR_truncation + exon_loss_variant | MED |
splice_region | splice_region_variant | MED |
mature_miRNA | mature_miRNA_variant | MED |
regulatory_region | regulatory_region_variant | MED |
TF_binding_site | TF_binding_site_variant | MED |
regulatory_region_ablation | regulatory_region_ablation | MED |
regulatory_region_amplification | regulatory_region_amplification | MED |
TFBS_ablation | TFBS_ablation | MED |
TFBS_amplification | TFBS_amplification | MED |
synonymous_stop | stop_retained_variant | LOW |
synonymous_coding | synonymous_variant | LOW |
UTR_5_prime | 5_prime_UTR_variant | LOW |
UTR_3_prime | 3_prime_UTR_variant | LOW |
intron | intron_variant | LOW |
CDS | coding_sequence_variant | LOW |
upstream | upstream_gene_variant | LOW |
downstream | downstream_gene_variant | LOW |
intergenic | intergenic_variant | LOW |
intergenic_conserved | conserved_intergenic_variant | LOW |
intragenic | intragenic_variant | LOW |
gene | gene_variant | LOW |
transcript | transcript_variant | LOW |
exon | exon_variant | LOW |
start_gain | 5_prime_UTR_premature_start_codon_gain_variant | LOW |
synonymous_start | start_retained_variant | LOW |
intron_conserved | conserved_intron_variant | LOW |
nc_transcript | nc_transcript_variant | LOW |
NMD_transcript | NMD_transcript_variant | LOW |
incomplete_terminal_codon | incomplete_terminal_codon_variant | LOW |
nc_exon | non_coding_exon_variant | LOW |
transcript_ablation | transcript_ablation | LOW |
transcript_amplification | transcript_amplification | LOW |
feature elongation | feature_elongation | LOW |
feature truncation | feature_truncation | LOW |