About HEX

Glossary

3'

3', pronounced 3 prime, refers to the third carbon of the sugar rings that form the DNA backbone. The 3' end is the end of the DNA that terminates at the hydroxyl group attached to the third carbon of the sugar ring. DNA sequences are written with the 5' end to the left and the 3' end to the right, reflecting the direction of nucleic acid synthesis in vivo.

3' UTR (3 prime untranslated region)

The 3' untranslated region extends from the base following the terminal codon to the polyA tail. This region of the gene is transcribed into mRNA, but does not contain protein-coding sequence. The 3' UTR often contains sequences that influence translation efficiency or stability of mRNA.

5'

5', pronounced 5 prime, refers to the fifth carbon of the sugar rings that form the DNA backbone. The 5' end is the end of the DNA that has the fifth carbon of the sugar ring at its terminus. DNA sequences are written with the 5' end to the left and the 3' end to the right, reflecting the direction of nucleic acid synthesis in vivo.

5' UTR (5 prime untranslated region) 

The 5' untranslated region extends from the transcription start site (cap site) to the base just before the initiator codon. This region of the gene is transcribed into mRNA, but does not contain protein-coding sequence. The 5' UTR often contains sequences that influence translation efficiency or stability of mRNA.

AA change (amino acid change)

A change in one or more of a protein's amino acids, caused by a genetic variant.

Allele

An allele is one version of a nucleotide sequence at a given location in the genome. In general, each individual has two alleles at a given location, one inherited from each parent. The term is applicable to a single nucleotide or to larger sequences of nucleotides, such as a gene or a complete genetic locus.

Allele count

The number of instances of a particular variant allele in HEX.

Alternative sequence

An alternative sequence is any genomic sequence that differs from the genomic DNA on the primary assembly, which is considered the reference sequence.

Annotation

Sequence Ontology term describing the nature of the variant. (See below for Definitions of Annotations.)

Assembly

An assembly is a reference genome sequence for a particular species. It is generated by piecing together sequences of DNA fragments, first into contigs, then scaffolds, and sometimes into entire chromosomes. The resulting collection of sequences is called a genome assembly. HEX uses GRCh37 as the reference genome assembly.

Average sample read depth

The average coverage across all samples in HEX for a given locus.

Canonical transcript

A canonical transcript is the reference transcript for a particular gene. HEX uses the canonical transcripts as defined in Ensembl.

CDS (coding sequence)

A coding sequence is a region of a gene or an mRNA transcript that codes for a protein.

Consequence

The effect of a variant in a coding region on the encoded protein: synonymous (no amino acid change) or nonsynonymous (an amino acid change).

Coverage

Coverage refers to the number of sequenced fragments (reads) used to analyze each individual at a given locus. The higher the coverage, the more confident one can be in the assignment of specific nucleotides to a given locus. Variants with high coverage are more reliable than those with low coverage. In general, for next-generation sequencing, coverage >30x is thought to provide high-quality variant identification.
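
As a rough illustration of the coverage rule of thumb above, the following Python sketch averages per-sample read depths at one locus and reports the fraction of samples above 30x. The function name and the depth values are hypothetical and are not taken from HEX.

```python
from statistics import mean

# The ">30x" rule of thumb quoted in the Coverage entry.
HIGH_QUALITY_DEPTH = 30


def summarize_coverage(depths):
    """Return (average depth, fraction of samples above 30x) for one locus.

    `depths` holds one read depth per sample; the values are illustrative.
    """
    average = mean(depths)
    fraction_high = sum(d > HIGH_QUALITY_DEPTH for d in depths) / len(depths)
    return average, fraction_high


# Four hypothetical samples at a single locus.
average, fraction_high = summarize_coverage([12, 35, 48, 31])
print(average, fraction_high)
```

A low fraction at a locus suggests that variant calls there deserve more caution, even when the average depth looks acceptable.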

Coverage plot

A coverage plot is a graphical representation of coverage, where coverage (the number of sequence reads) is plotted against location on the chromosome.

dbSNP

The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple (short) genetic polymorphisms. It is a comprehensive repository of identified genetic variability.

Downstream

Downstream refers to direction along a DNA sequence, 3' from a reference point.

Enhancer

An enhancer is a sequence that binds cell- or region-specific transcription factors to increase transcription.

Ensembl

Ensembl is a centralized resource for genomics researchers, offering a popular genome browser for the exploration of sequence data from humans and several other vertebrate species. Ensembl is a collaboration between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute.

EXAC

The Exome Aggregation Consortium collects and synthesizes exome sequencing data from large-scale sequencing projects and makes summary data publicly available.

EXAC Minor Allele Frequency

Frequency of a variant in the EXAC database.

Exome sequencing

A technique whereby exonic regions of the genome are sequenced comprehensively, commonly using next-generation sequencing. Note that because of the way exome sequencing works, sequence data from adjacent introns may also be collected.

Exome

The collection of protein-coding sequences (exons) within the genome.

Exon

An exon is a nucleotide sequence in DNA that is included in mRNA transcripts and encodes protein information.

Forward/Reverse strand

For each reference chromosome, one strand of DNA is arbitrarily designated the "forward" (+) strand and the other the "reverse" (-) strand.
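
To make the relationship between the two strands concrete, here is a minimal Python sketch that converts a forward-strand sequence into the corresponding reverse-strand sequence, both read 5' to 3'. The function name is an illustrative choice, and the sketch handles only the four standard bases.

```python
# Translation table mapping each base to its complement (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the opposite-strand sequence, written 5' to 3'."""
    # Complement every base, then reverse so the result also reads 5' to 3'.
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```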

Gene symbol

Unique abbreviation for a gene name. HEX uses gene symbols approved by the HUGO Gene Nomenclature Committee, an international body that assigns standardized nomenclature to human genes.

HEX

The Healthy Exomes database is a catalogue of genetic variants in people over age 60 who did not have Alzheimer’s pathology at the time of death.

HEX Minor Allele Frequency

Frequency of a variant in the HEX database.
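
As a hedged illustration of how an allele count relates to an allele frequency, the sketch below divides the count by the total number of alleles sampled. It assumes a diploid autosomal site at which every sample was successfully genotyped; the function name and numbers are hypothetical, and HEX may compute its frequencies differently.

```python
def allele_frequency(allele_count: int, n_samples: int, ploidy: int = 2) -> float:
    """Return allele_count divided by the total number of alleles sampled.

    Assumes every sample contributes `ploidy` alleles at the site.
    """
    total_alleles = ploidy * n_samples
    if total_alleles <= 0:
        raise ValueError("n_samples and ploidy must be positive")
    if not 0 <= allele_count <= total_alleles:
        raise ValueError("allele_count must lie between 0 and the total allele number")
    return allele_count / total_alleles


# Hypothetical example: 3 copies of a variant allele among 100 diploid samples.
print(allele_frequency(3, 100))
```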

Initiator codon

The initiator codon, also called the start codon, is the first codon of the protein-coding sequence of an mRNA, and always codes for methionine. Most commonly, the initiator codon is AUG.

Intron

An intron is a nucleotide sequence in DNA that is spliced out of mRNA transcripts and does not encode protein information.

Non-coding

A non-coding RNA does not result in a protein product, but may have important regulatory roles. Non-coding RNAs include, but are not limited to, microRNAs (miRNAs), small interfering RNAs (siRNAs), and large intergenic non-coding RNAs (lincRNAs).

Promoter

The promoter is the sequence to which RNA polymerase and its co-factors bind to initiate transcription of a gene.

Regulatory region

Regulatory regions are sequences that influence the level of expression of a gene. Regulatory regions include promoters and enhancers. The promoter for a gene is located near, and upstream of, the transcription start site. Enhancers are most commonly located upstream of the initiator codon or downstream of the terminal codon, and can occur up to several hundred thousand base pairs away from the genes they regulate. Additionally, regulatory elements have been found in introns and exons.

rsID

Identification number of a variant in the dbSNP database.

SNP (single nucleotide polymorphism)

A single nucleotide polymorphism is a single base pair variation in a genome. Historically, SNPs have been defined as common variants that have at least 1% frequency in the general population.

Splice acceptor

The splice acceptor site is the splicing site at the end (3' end) of an intron, and contains the nucleotides AG.

Splice donor

The splice donor site is the splicing site at the beginning (5' end) of an intron, and contains the nucleotides GT.
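
The donor and acceptor definitions can be sketched as a simple check on an intron's DNA sequence. This Python example is illustrative only: the function name and the test sequence are hypothetical, and real splice-site recognition depends on more sequence context than these two dinucleotides.

```python
def has_canonical_splice_sites(intron: str) -> bool:
    """Return True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    # Require at least four bases so the donor and acceptor cannot overlap.
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


# Hypothetical intron sequence, written 5' to 3'.
print(has_canonical_splice_sites("GTAAGTCTTTCAG"))
```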

Terminal codon

The terminal codon, also called the stop codon, marks the end of the protein-coding sequence. In DNA, stop codons may be TAA, TAG, or TGA.
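
The definitions of the 5' UTR, coding sequence, and 3' UTR can be tied together with a small Python sketch that splits a transcript at its initiator and terminal codons. It is deliberately simplistic: it assumes the first ATG in the sequence is the initiator codon and uses the DNA alphabet, whereas real annotation relies on curated transcript models. The function name and example sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_transcript(seq: str):
    """Split a transcript (DNA alphabet) into (5' UTR, CDS, 3' UTR).

    Returns None if no ATG or no in-frame stop codon is found.
    """
    seq = seq.upper()
    start = seq.find("ATG")  # simplification: first ATG is taken as the initiator
    if start == -1:
        return None
    # Walk codon by codon from the initiator until the first in-frame stop.
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            end = i + 3  # the CDS includes the terminal codon
            return seq[:start], seq[start:end], seq[end:]
    return None


print(split_transcript("GGATGAAATAGCC"))
```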

Transcript

The mRNA resulting from transcription of the genomic DNA. Multiple transcripts, or splice variants, can be derived from a single gene due to alternative splicing.

Transcript ID

Unique identification number assigned by Ensembl, which allows identification of any transcript.

Transcription factor binding site

A transcription factor binding site is a DNA sequence that binds transcription factors (proteins that control the rate of transcription). Promoter and enhancer regions contain transcription factor binding sites.

Upstream

Upstream refers to direction along a DNA sequence, 5' from a reference point.
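
Because upstream and downstream are defined relative to the 5' and 3' ends, their meaning in genome coordinates depends on the strand a gene lies on. The following Python sketch illustrates this; the function name and the coordinates are hypothetical.

```python
def relative_to_gene(pos: int, gene_start: int, gene_end: int, strand: str) -> str:
    """Classify a position as 'upstream', 'within', or 'downstream' of a gene.

    gene_start <= gene_end are forward-strand coordinates; strand is '+' or '-'.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if gene_start <= pos <= gene_end:
        return "within"
    before = pos < gene_start  # lower coordinate than the gene
    if strand == "+":
        return "upstream" if before else "downstream"
    # On the reverse strand the gene reads right to left, so the labels swap.
    return "downstream" if before else "upstream"


print(relative_to_gene(900, 1000, 2000, "+"))
print(relative_to_gene(900, 1000, 2000, "-"))
```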

UTR (untranslated region)

Untranslated regions are part of mRNA but do not code for protein sequence. The 3' UTR commonly contains regulatory regions that influence gene expression.

Variant call

Nucleotide or sequence of nucleotides that differs from the reference sequence at a specific location in the genome.

VCF (variant call format)

Variant Call Format is a plain text format created to hold genomic variability data.
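
To show what a VCF record looks like in practice, here is a minimal Python sketch that parses one tab-separated data line into the eight fixed VCF columns. The example line, including its rsID and counts, is invented for illustration and does not come from HEX; production code would normally use a dedicated VCF library rather than hand-written parsing.

```python
# The eight fixed columns of a VCF data line, in order.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_vcf_line(line: str) -> dict:
    """Parse one VCF data line (not a '#' header line) into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:8]))
    record["POS"] = int(record["POS"])
    info = {}
    if record["INFO"] != ".":  # "." means no INFO entries
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True  # flags carry no value
    record["INFO"] = info
    return record


# Hypothetical record: an A>G variant with allele count 3 out of 200 alleles.
example = "1\t12345\trs0000\tA\tG\t50\tPASS\tAC=3;AN=200"
print(parse_vcf_line(example))
```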

Definitions of Annotations

HEX uses Sequence Ontology (SO) terms to describe the nature of each variant. These definitions are derived from SO definitions:

3_prime_UTR_variant a variant in the 3' untranslated region of a transcript
5_prime_UTR_variant a variant in the 5' untranslated region of a transcript
coding_sequence_variant a variant within a sequence that codes for a protein
downstream_gene_variant a variant located within 5 kb downstream (3') of a gene
feature_elongation a variant that causes the elongation of a genomic feature
feature_truncation a variant that causes the truncation of a genomic feature
frameshift_variant a variant that causes a disruption of the translational reading frame, because the number of nucleotides inserted or deleted is not a multiple of three
incomplete_terminal_codon_variant a variant where at least one base of the final codon of an incompletely annotated transcript is changed
inframe_deletion an inframe non-synonymous variant that deletes bases from the coding sequence in multiples of three
inframe_insertion an inframe non-synonymous variant that inserts bases into the coding sequence in multiples of three
intergenic_variant a sequence variant located between genes
intron_variant a variant occurring within an intron
mature_miRNA_variant a variant located within the sequence of a mature miRNA
missense_variant a variant that results in a different amino acid sequence, while preserving the length of the protein product
NMD_transcript_variant a variant in a transcript that is the target of nonsense-mediated decay
non_coding_transcript_exon_variant a variant that changes an exon sequence in a non-coding transcript
non_coding_transcript_variant a transcript variant of a gene for a non-coding RNA
protein_altering_variant a variant that is predicted to change the identity of the protein encoded by the coding sequence
regulatory_region_ablation a variant causing a deletion that includes a regulatory region
regulatory_region_amplification a variant causing amplification of a regulatory region
regulatory_region_variant a variant located within a regulatory region
splice_acceptor_variant a splice variant that changes the 2-base region at the 3' end of an intron
splice_donor_variant a splice variant that changes the 2-base region at the 5' end of an intron
splice_region_variant a variant within the region of the splice site, either within 1-3 bases of the exon or 3-8 bases of the intron
start_lost a variant that changes one or more bases of the canonical start codon
stop_gained a variant that results in a premature stop codon, and therefore a truncated transcript
stop_lost a variant in the stop codon, resulting in an elongated transcript
stop_retained_variant a variant where one or more bases in the stop codon is changed, but the stop function remains
synonymous_variant a variant that does not result in a change in amino acid
TF_binding_site_variant a variant within a transcription factor binding site
TFBS_ablation a variant that causes a deletion that includes a transcription factor binding site
TFBS_amplification a variant that causes amplification of a transcription factor binding site
transcript_ablation a variant causing a deletion that includes a transcript feature
transcript_amplification a variant causing amplification of a region that contains a transcript
upstream_gene_variant a variant located within 5 kb upstream (5') of a gene
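
The distinction between frameshift_variant, inframe_insertion, and inframe_deletion comes down to whether the number of inserted or deleted bases is a multiple of three. The Python sketch below illustrates that rule for a variant known to lie within a coding sequence. It is a simplification with a hypothetical function name; real annotation tools also consider the transcript context before assigning an SO term.

```python
def classify_coding_indel(ref: str, alt: str) -> str:
    """Classify a coding-sequence indel from its reference and alternate alleles."""
    length_change = len(alt) - len(ref)
    if length_change == 0:
        return "not_an_indel"  # same length: a substitution, not handled here
    if length_change % 3 != 0:
        # The reading frame shifts because the change is not a multiple of three.
        return "frameshift_variant"
    return "inframe_insertion" if length_change > 0 else "inframe_deletion"


print(classify_coding_indel("A", "ATT"))   # two bases inserted
print(classify_coding_indel("A", "ATTT"))  # three bases inserted
print(classify_coding_indel("ATTT", "A"))  # three bases deleted
```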