Glossary
3-D
Three-dimensional.
Accession number
An Accession number is a unique identifier given to a
sequence when it is submitted to one of the DNA repositories (GenBank, EMBL,
DDBJ). The initial deposition of a sequence record is referred to as version 1.
If the sequence is updated, the version number is incremented, but the Accession
number will remain constant.
Alu
The Alu repeat family comprises short interspersed
elements (SINEs) present in multiple copies in the genomes of humans and other
primates. The Alu sequence is approximately 300 bp in length and is found
commonly in introns, 3' untranslated regions of genes, and intergenic genomic
regions. Alu elements are mobile and are present in the human genome in
extremely high copy number. Almost 1 million copies of the Alu sequence are
estimated to be present, making it the most abundant mobile element. The Alu
sequence is so named because of the presence of a recognition site for the AluI
endonuclease in the middle of the Alu sequence. Because of the widespread
occurrence of the Alu repeat in the genome, the Alu sequence is used as a
universal primer for PCR in animal cell lines; it binds in both forward and
reverse directions. The Alu universal primer sequence is as follows: 5'-GTG GAT
CAC CTG AGG TCA GGA GTT TC-3' (26-mer).
Allele
One of the variant forms of a gene at a particular locus
on a chromosome. Different alleles often produce variation in inherited
characteristics such as hair color or blood type. In an individual, one form of
the allele (the dominant one) may be expressed more than another form (the
recessive one). When "genes" are considered simply as segments of a nucleotide
sequence, allele refers to each of the possible alternative nucleotides at a
specific position in the sequence. For example, a C/T polymorphism such as
CCT[C/T]CCAT would have two alleles: C and T.
Application Programming Interface
An API is a set of routines that an application uses to
request and carry out lower-level services performed by a computer's operating
system. For computers running a graphical user interface, an API manages an
application's windows, icons, menus, and dialog boxes.
ASN1
Abstract Syntax Notation 1 is an international standard
data-representation format used to achieve interoperability between computer
platforms. It allows for the reliable exchange of data in terms of structure and
content by computer and software systems of all types.
BAC
Bacterial Artificial Chromosome. A BAC is a large segment
of DNA (100,000-200,000 bp) from another species cloned into bacteria. Once the
foreign DNA has been cloned into the host bacteria, many copies of it can be
made.
Batch Entrez
Batch Entrez is a tool for retrieving large numbers of
records from Entrez. A simple text file containing accession numbers or
unique identifiers can be used as input.
BankIt
BankIt is a tool for the online submission of one or a
few sequences into GenBank and is designed to make the submission process quick
and easy.
bit score
The value S' is derived from the raw alignment score S in
which the statistical properties of the scoring system used have been taken into
account. By normalizing a raw score using the formula:
S' = (lambda × S - ln K) / ln 2
a bit score of S' is attained, which has a standard set of units, and where K
and lambda are the statistical parameters of the scoring system. Because bit
scores have been normalized with respect to the scoring system, they can be used
to compare alignment scores from different searches.
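As a minimal sketch of this normalization, the Python snippet below assumes the standard Karlin-Altschul form S' = (lambda × S - ln K) / ln 2. The function name and the parameter values in the example call are illustrative assumptions, not values taken from this glossary; real values of lambda and K depend on the scoring system in use.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'.

    Assumes the Karlin-Altschul normalization
    S' = (lambda * S - ln K) / ln 2.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative (assumed) parameters; they are not taken from this glossary.
print(round(bit_score(100, lam=0.267, k=0.041), 1))
```

Because the result no longer depends on lambda and K, two bit scores computed under different scoring systems can be compared directly.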
BLAST
Basic Local Alignment Search Tool (Altschul et al., J Mol
Biol 215:403-410; 1990). A sequence comparison algorithm that is optimized for
speed and used to search sequence databases for optimal local alignments to a
query.
blastn
nucleotide-nucleotide BLAST. blastn takes nucleotide
sequences in FASTA format, GenBank Accession numbers, or GI numbers and compares
them against the NCBI Nucleotide databases.
blastp
protein-protein BLAST. blastp takes protein sequences in
FASTA format, GenBank Accession numbers, or GI numbers and compares them against
the NCBI Protein databases.
BLAT
A DNA/Protein sequence analysis program to quickly find
sequences of 95% and greater similarity of length 40 bases or more. It may miss
more divergent or shorter sequence alignments. BLAT on proteins finds sequences
of 80% and greater similarity of length 20 amino acids or more. BLAT is not
BLAST. (See the BLAT web page.)
BLink
BLAST Link. BLink displays the results of BLAST searches
that have been done for every protein sequence in the Entrez Protein database.
It can be accessed by following the BLink link displayed beside any hit in the
results of an Entrez Protein search. In contrast to Entrez's Related Sequences
feature, which lists the titles of similar sequences, BLink displays the
graphical output of precomputed blastp results against the non-redundant (nr)
protein database. The output includes the positions of up to 200 BLAST hits on
the query sequence, scores, and alignments. BLink offers a variety of display
options, including the distribution of hits by taxonomic grouping, the best hit
to each organism, the protein domains in the query sequence, similar sequences
that have known 3D structures, and more. Additional options allow you to specify
which taxa to exclude, to increase or decrease the BLAST
cutoff score, or to filter the BLAST hits to show only those from a specific source
database, such as RefSeq or SWISS-PROT.
BLOB
Binary Large Object (or binary data object). BLOB refers
to a large piece of data, such as a bitmap. A BLOB is characterized by large
field values, an unpredictable table size, and data that are formless from the
perspective of a program. It is also a keyword designating the BLOB structure,
which contains information about a block of data.
BLOSUM 62
Blocks Substitution Matrix. A substitution matrix in
which scores for each position are derived from observations of the frequencies
of substitutions in blocks of local alignments in related proteins. Each matrix
is tailored to a particular evolutionary distance. In the BLOSUM 62 matrix, for
example, the alignment from which scores were derived was created using
sequences sharing no more than 62% identity. Sequences more identical than 62%
are represented by a single sequence in the alignment to avoid overweighting
closely related family members (Henikoff and Henikoff, Proc Natl Acad Sci U S A
89:10915-10919; 1992).
Boolean
This term refers to binary algebra that uses the logical
operators AND, OR, XOR, and NOT; the outcomes consist of logical values (either
TRUE or FALSE). The keyword boolean indicates that the expression or constant
expression associated with the identifier takes the value TRUE or FALSE. The
logical-AND (&&) operator produces the value 1 if both operands have nonzero
values; otherwise, it produces the value 0. The logical-OR (||) operator
produces the value 1 if either of its operands has a nonzero value. The
logical-NOT (!) operator produces the value 0 if its operand is true (nonzero)
and the value 1 if its operand is FALSE (0). The exclusive OR (XOR) operator
yields TRUE only if one of its operands is TRUE and the other is FALSE. If both
operands are the same (either TRUE or FALSE), the operation yields FALSE.
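The operator rules above can be sketched in Python as small functions that treat any nonzero value as TRUE and return 1 or 0. The function names are hypothetical and chosen only for this illustration.

```python
def logical_and(a, b):
    """Return 1 if both operands are nonzero, otherwise 0."""
    return 1 if (a != 0 and b != 0) else 0


def logical_or(a, b):
    """Return 1 if either operand is nonzero, otherwise 0."""
    return 1 if (a != 0 or b != 0) else 0


def logical_not(a):
    """Return 0 if the operand is nonzero (true), otherwise 1."""
    return 0 if a != 0 else 1


def logical_xor(a, b):
    """Return 1 only if exactly one operand is nonzero."""
    return 1 if (a != 0) != (b != 0) else 0


# Print a truth table: a, b, AND, OR, XOR
for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b), logical_xor(a, b))
```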
Build
A run of the genome assembly and annotation process or
the set of products generated by that run.
CCAP
Cancer Chromosome Aberration Project. CCAP was designed
to expedite the definition and detailed characterization of the distinct
chromosomal alterations that are associated with malignant transformation. The
project is a collaboration among the NCI, the NCBI, and numerous research labs.
CD
Conserved Domain. CD refers to a domain (a distinct
functional and/or structural unit of a protein) that has been conserved during
evolution. CDs are generated from multiple sequence alignments and may be
refined by comparison to solved structures. During evolution, amino acid changes
occur in ways that preserve the physico-chemical properties of critical
residues, and hence the structural and/or functional properties of that domain.
CDART
Conserved Domain Architecture Retrieval Tool. When given
a protein query sequence, CDART displays the functional domains that make up the
protein and lists proteins with similar domain architectures. The functional
domains for a sequence are found by comparing the protein sequence to a database
of conserved domain alignments, the CDD, using RPS-BLAST.
CDD
Conserved Domain Database. This is a collection of
sequence alignments, also called profiles or position specific scoring matrices
(PSSMs), representing protein domains conserved during evolution.
cDNA
Complementary DNA. A DNA sequence obtained by reverse
transcription of a messenger RNA (mRNA) sequence.
CDS
coding region, coding sequence. CDS refers to the portion
of a genomic DNA sequence that is translated, from the start codon to the stop
codon, inclusively, if complete. A partial CDS lacks part of the complete CDS
(it may lack either or both the start and stop codons). Successful translation
of a CDS results in the synthesis of a protein.
CEPH
Centre d'Etude du Polymorphisme Humain
CGAP
Cancer Genome Anatomy Project. CGAP is an
interdisciplinary program to identify the human genes expressed in different
cancerous states, based on cDNA (EST) libraries, and to determine the molecular
profiles of normal, precancerous, and malignant cells. The project is a
collaboration among the NCI, the NCBI, and numerous research labs.
CGH
Comparative Genomic Hybridization. CGH is a fluorescent
molecular cytogenetic technique that identifies chromosomal aberrations and maps
these changes to metaphase chromosomes. CGH can be used to generate a map of DNA
copy number changes in tumor genomes. CGH is based on quantitative two-color
fluorescence in situ hybridization (FISH). DNA extracted from tumor cells is
labeled in one color (e.g., green) and mixed in a 1:1 ratio with DNA from normal
cells, which is labeled in a different color (e.g., red). The mixture is then
applied to normal metaphase chromosomes. Portions of the genome that are equally
represented in normal and tumor cells will appear orange, regions that are
deleted in the tumor sample relative to the normal sample will appear red, and
regions that are present in higher copy number in the tumor sample (because of
amplification) will appear green. Special image analysis tools are necessary to
quantitate the ratio of green-to-red fluorescence to determine whether a given
region is more highly represented in the normal or in the tumor sample.
CGI
Common Gateway Interface. A mechanism that allows a Web
server to run a program or script on the server and send the output to a Web
browser.
Cluster
A group that is created based on certain criteria. For
example, a gene cluster may include a set of genes whose expression
profiles are found to be similar according to certain criteria, or a cluster may
refer to a group of clones that are related to each other by homology.
Cn3D
"See in 3-D" is a structure and sequence alignment viewer
for NCBI databases. It allows viewing of 3-D structures and sequence-structure
or structure-structure alignments. Cn3D can work as a helper application to the
browser or as a client-server application that retrieves structure records from
the Molecular Modeling Database (MMDB, see below) directly from the internet.
The Cn3D homepage provides access to information on how to install the program,
a tutorial to get started, and a comprehensive help document.
Codon
Sequence of three nucleotides in DNA or mRNA that
specifies a particular amino acid during protein synthesis; also called a
triplet. Of the 64 possible codons, 3 are stop codons, which do not specify
amino acids.
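The count of 64 codons follows from 4 bases taken 3 at a time (4 × 4 × 4). The short Python sketch below enumerates them and separates out the stop codons. It assumes the standard genetic code written in the DNA alphabet (TAA, TAG, TGA); other genetic codes use different stop codons.

```python
from itertools import product

BASES = "ACGT"
# Stop codons of the standard genetic code (an assumption; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Every ordered triplet of the four bases.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
# Codons that specify an amino acid.
sense_codons = [codon for codon in all_codons if codon not in STOP_CODONS]

print(len(all_codons), len(sense_codons))
```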
COGs
Clusters of Orthologous Groups (of proteins) were
delineated by comparing protein sequences from completely sequenced genomes.
Each COG consists of individual proteins or groups of paralogs from at least
three lineages and thus corresponds to an ancient conserved domain.
Consensus sequence
The nucleotides or amino acids found most commonly at
each position in the sequences of homologous DNAs, RNAs, or proteins.
Contig
A contiguous segment of the genome made by joining
overlapping clones or sequences. A clone contig consists of a group of cloned
(copied) pieces of DNA representing overlapping regions of a particular
chromosome. A sequence contig is an extended sequence created by merging primary
sequences that overlap. A contig map shows the regions of a chromosome where
contiguous DNA segments overlap. Contig maps provide the ability to study a
complete and often large segment of the genome by examining a series of
overlapping clones, which then provide an unbroken succession of information
about that region.
Coriell
Coriell Institute of Aging Cell Repository
CPU
Central Processing Unit. The CPU is the computational and
control unit of a computer, the device that interprets and executes
instructions.
dbSNP
The Single Nucleotide Polymorphism database (dbSNP) is a
public-domain archive for a broad collection of simple genetic polymorphisms.
This collection of polymorphisms includes single-base nucleotide substitutions
(also known as single nucleotide polymorphisms or SNPs), small-scale multi-base
deletions or insertions (also called deletion insertion polymorphisms or DIPs),
and retroposable element insertions and microsatellite repeat variations (also
called short tandem repeats or STRs).
DDBJ
DNA Data Bank of Japan
Definition line
A sequence in FASTA format begins with a single-line
description, followed by lines of sequence data. The definition line or
description line is distinguished from the sequence data by a "greater than" (>)
symbol in the first column (see example); also DEFLINE, as in a flatfile.
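To illustrate the format, the Python sketch below splits FASTA-formatted text into (definition line, sequence) pairs by looking for ">" in the first column. The record names and sequences in EXAMPLE are made up for this illustration, and the parser is a simplified assumption that ignores blank lines and does no validation.

```python
def parse_fasta(text):
    """Return a list of (definition_line, sequence) tuples."""
    records = []
    defline, chunks = None, []
    for line in text.splitlines():
        line = line.rstrip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):  # ">" in the first column starts a record
            if defline is not None:
                records.append((defline, "".join(chunks)))
            defline, chunks = line[1:], []
        elif defline is not None:
            chunks.append(line)  # sequence data may span several lines
    if defline is not None:
        records.append((defline, "".join(chunks)))
    return records


# Hypothetical records, for illustration only.
EXAMPLE = """>seq1 hypothetical example record
ACGTACGTAC
GTTAGC
>seq2 another made-up record
TTGACA
"""

for defline, sequence in parse_fasta(EXAMPLE):
    print(defline, len(sequence))
```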
DNA
Deoxyribonucleic acid is the chemical inside the nucleus
of a cell that carries the genetic instructions for making living organisms. DNA
is composed of two anti-parallel strands, each a linear polymer of nucleotides.
Each nucleotide has a phosphate group linked by a phosphoester bond to a pentose
(a five-carbon sugar molecule, deoxyribose), that in turn is linked to one of
four organic bases, adenine, guanine, cytosine, or thymine, abbreviated A, G, C,
and T, respectively. The bases are of two types: purines, which have two rings
and are slightly larger (A and G); and pyrimidines, which have only one ring (C
and T). Each nucleotide is joined to the next nucleotide in the chain by a
covalent phosphodiester bond between the 5' carbon of one deoxyribose group and
the 3' carbon of the next. DNA is a helical molecule with the sugar-phosphate
backbone on the outside and the nucleotides extending toward the central axis.
There is specific base-pairing between the bases on opposite strands in such a
way that A always pairs with T and G always pairs with C.
Domain
A "domain" refers to a discrete portion of a protein
assumed to fold independently of the rest of the protein and which possesses its
own function.
Draft sequence
Draft sequence refers to DNA sequence that is not yet
finished but is generally of high quality (i.e., an accuracy of greater than
90%). Draft sequence data are mostly in the form of 10,000 base pair-sized
fragments, the approximate chromosomal locations of which are known. The
following keywords are associated with draft sequence: phase 0, light-pass
coverage of a clone, generally only 1× coverage; phase 1, 4-10× coverage of a
BAC clone (order and orientation of the fragments are unknown); and phase 2,
4-10× coverage of a BAC clone (order and orientation of the fragments are
known). Phase 3 refers to the completely finished sequence.
DTD
Document Type Definition. The DTD is an optional part of
the prolog of an XML document that defines the rules of the document. It sets
constraints for an XML document by specifying which elements are present in the
document and the relationships between elements, e.g., which tags can contain
other tags, the number and sequence of the tags, and attributes of the tags. The
DTD helps to validate the data when the receiving application does not have a
built-in description of the incoming data.
DUST
A program for filtering low-complexity regions from
nucleic acid sequences.
E-value
Expect value. The E-value is a parameter that describes
the number of hits one can "expect" to see by chance when searching a database
of a particular size. It decreases exponentially with the score (S) that is
assigned to a match between two sequences. Essentially, the E-value describes
the random background noise that exists for matches between sequences. For
example, an E-value of 1 assigned to a hit can be interpreted as meaning that in
a database of the current size, one might expect to see one match with a similar
score simply by chance. This means that the lower the E-value, or the closer it
is to "0", the higher is the "significance" of the match. However, it is
important to note that matches to short query sequences can be virtually
identical and still have relatively high E-values. This is because the
calculation of the E-value also takes into account the length of the query
sequence, and shorter sequences have a high probability of occurring in the
database purely by chance. For more information, see the following tutorial.
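The dependence of the E-value on score and search-space size can be sketched with the standard Karlin-Altschul relation E = m × n × 2^(-S'), where S' is the bit score, m is the query length, and n is the database length. This relation and the function name are assumptions brought in for illustration; the glossary text itself does not state the formula, and real BLAST searches apply additional corrections (such as effective lengths) that are omitted here.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E-value from a bit score and the search-space size.

    Assumes E = m * n * 2**(-S'), ignoring edge-effect corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Each additional bit of score halves the expected number of chance hits,
# and doubling the database size doubles it (values are illustrative).
for score in (20, 30, 40):
    print(score, expect_value(score, query_len=100, db_len=1_000_000))
```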
EC number
A number assigned to a type of enzyme according to a
scheme of standardized enzyme nomenclature developed by the Enzyme Commission of
the Nomenclature Committee of the International Union of Biochemistry and
Molecular Biology (IUBMB). EC numbers may be found in ENZYME, the Enzyme
nomenclature database, maintained at the ExPASy molecular biology server.
EMBL
European Molecular Biology Laboratory
Entrez
Entrez is a retrieval system for searching several linked
databases. It provides access to the following NCBI databases: PubMed, GenBank,
Protein, Structure, Genome, PopSet, OMIM, Taxonomy, Books, ProbeSet, 3D Domains,
UniSTS, SNP, and CDD.
Entrez Gene
(formerly known as LocusLink). Entrez Gene provides
tracked, unique identifiers for genes (GeneIDs) and reports information
associated with those identifiers for unrestricted public use.
Entrez Genome
The NCBI Entrez Genome database provides views for a
variety of genomes, complete chromosomes, sequence maps with contigs, and
integrated genetic and physical maps. The database is organized in six major
organism groups: Archaea, Bacteria, Eukaryotae, Viruses, Viroids, and Plasmids
and includes complete chromosomes, organelles and plasmids as well as draft
genome assemblies.
Entrez Genome Project
The NCBI Entrez Genome Project database is a searchable
collection of complete and incomplete large-scale sequencing, assembly,
annotation, and mapping projects for cellular organisms. The database is
organized into organism-specific overviews that function as portals from which
all projects in the database pertaining to that organism can be browsed and
retrieved.
Entrez Nucleotide
The Entrez Nucleotide database is a collection of
sequences from several sources. It includes sequences from the International
Collaboration of Sequence Databases (GenBank, EMBL, and DDBJ), PDB as well as
the NCBI Reference Sequence (RefSeq) records. The database is partitioned into
three components: "EST" (containing EST sequences), "GSS" (containing GSS
sequences) and "CoreNucleotide" (which comprises the remaining nucleotide
sequences).
Entrez Protein
The Entrez Protein database contains sequence data from
the translated coding regions from DNA sequences in GenBank, EMBL, and DDBJ as
well as protein sequences submitted to Protein Information Resource (PIR),
SWISS-PROT, Protein Research Foundation (PRF), and Protein Data Bank (PDB).
EST
Expressed Sequence Tag. ESTs are short (usually
approximately 300-500 base pairs), single-pass sequence reads from cDNA.
Typically, they are produced in large batches. They represent the genes
expressed in a given tissue and/or at a given developmental stage. They are tags
(some coding, others not) of expression for a given cDNA library. They are
useful in identifying full-length genes and in mapping.
e-PCR
Electronic PCR is used to compare a query sequence to
mapped sequence-tagged sites (STSs) to find a possible map location for the
query sequence. e-PCR finds STSs in DNA sequences by searching for subsequences
that closely match the PCR primers present in mapped markers. The subsequences
must have the correct order, orientation, and spacing such that they could plausibly
prime the amplification of a PCR product of the correct molecular weight.
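The idea can be sketched in Python as follows. This is a simplified illustration, not the e-PCR program itself: it assumes exact primer matches (the real tool allows close matches), searches only one strand, and expresses the expected product as a length in base pairs. The function names, primers, and template are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def find_sts(template, fwd_primer, rev_primer, expected_size, tolerance=0):
    """Return (start, product_size) for each plausible STS hit.

    A hit needs the forward primer, then downstream the reverse complement
    of the reverse primer, spaced to give roughly the expected product size.
    """
    hits = []
    rev_site = reverse_complement(rev_primer)
    start = template.find(fwd_primer)
    while start != -1:
        end = template.find(rev_site, start + len(fwd_primer))
        while end != -1:
            size = end + len(rev_site) - start
            if abs(size - expected_size) <= tolerance:
                hits.append((start, size))
            end = template.find(rev_site, end + 1)
        start = template.find(fwd_primer, start + 1)
    return hits


# Made-up template containing one 20-bp product.
template = "TT" + "ACGTAC" + "GGGGGGGG" + reverse_complement("TTCAGA") + "AA"
print(find_sts(template, "ACGTAC", "TTCAGA", expected_size=20))
```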
epub
"Ahead-of-print" citation. PubMed now accepts
citations from publishers for articles that have been published electronically
ahead of the printed issue. PubMed displays the category "[epub ahead of print]"
in the part of the citation where the volume and pagination would ordinarily
display. For example: Proc Natl Acad Sci U S A. 2000 May 2 [epub ahead of
print].
ExoFish
Exon Finding by Sequence Homology. Exofish is a tool
based on homology searches for the rapid and reliable identification of human
genes. It relies on the sequence of another vertebrate, the pufferfish Tetraodon
nigroviridis (similar to Fugu), to detect conserved sequences with a very low
background. The genome of T. nigroviridis is eight times more compact than the
human genome and has been used in the comparative identification of human genes
from the rough draft of the human genome (Roest Crollius et al., Nat Genet
25:235-238; 2000).
Exon
Refers to the portion of a gene that encodes a part
of that gene's mRNA. A gene may comprise many exons, some of which may include
only protein-coding sequence; however, an exon may also include 5' or 3'
untranslated sequence.
Exon-trapped
Exon trapping is a technique for cloning exon sequences
from genomic DNA by selecting for functional splice sites, relying on the
cellular splicing machinery. The genomic DNA containing the putative exon(s) is
cloned into an exon-trap vector, which has a promoter, polyadenylation signals,
and splice sites, and then transfected into a cell line. If there are functional
splice sites in the genomic DNA fragment, the segments of DNA between the splice
sites will be removed. Total RNA is isolated and reverse-transcribed. After cDNA
synthesis and PCR amplification, the exon of interest is cloned.
ExPASy
Expert Protein Analysis System is a proteomics server of
the Swiss Institute of Bioinformatics (SIB).
FASTA
A sequence similarity search tool developed by William
Pearson and David Lipman. The term FASTA is also used to identify a text format
for sequences that is widely used. A FASTA-formatted sequence file may contain
multiple sequences. Each sequence in the file is identified by a single line
title preceded by the greater than sign (">").
Fingerprint
The pattern of bands on a gel produced by a clone when
restricted by a particular enzyme, such as HindIII.
Finished sequence
High-quality, low-error DNA sequence that is free of
gaps. To qualify as a finished sequence, only a single error out of every 10,000
bases (i.e., an accuracy of 99.99%) is allowed.
FISH
Fluorescence in situ hybridization. In this technique,
fluorescent molecules are used to label a DNA probe, which can then hybridize to
a specific DNA sequence in a chromosome spread so that the site becomes visible
through a microscope. FISH has been used to highlight the locations of genes,
subchromosome regions, entire chromosomes, or specific DNA sequences. It has
been used for mapping and the detection of genomic rearrangements, as well as
studies on DNA replication.
Flatfile or Flat file
A flat file is a data file that contains records (each
corresponding to a row in a table); however, these records have no structured
relationships. To interpret these files, the format properties of the file
should be known. For example, a database management system may allow the user to
export data to a comma-delimited file. Such a file is called a flat file because
it has no inherent information about the data, and interpretation requires
additional information. Files in a database management system have more complex
storage structures.
Freeze
To copy changing data so as to preserve the dataset as it
existed at a particular point in time. Also used to refer to the resulting set
of frozen data.
FTP
File Transfer Protocol. A method of retrieving files over
a network directly to the user's computer or to his/her home directory using a
set of protocols that govern how the data are to be transported.
Gap
A gap is a space introduced into an alignment to
compensate for insertions and deletions in one sequence relative to another. To
prevent the accumulation of too many gaps in an alignment, introduction of a gap
causes the deduction of a fixed amount (the gap score) from the alignment score.
Extension of the gap to encompass additional nucleotides or amino acids is also
penalized in the scoring of an alignment.
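A minimal Python sketch of this scoring scheme follows. It assumes an affine convention in which the first gapped position costs gap_open and every further position in the same gap costs gap_extend; programs differ in how they define these two values, and the match, mismatch, and gap costs used here are arbitrary illustrative numbers.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap_open=5, gap_extend=2):
    """Score two aligned rows of equal length, with '-' marking gap positions."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    previous_gap_row = None  # which row the gap in the previous column was in
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot be a gap in both rows")
        if x == "-" or y == "-":
            gap_row = "a" if x == "-" else "b"
            # Opening a new gap costs more than extending an existing one.
            score -= gap_extend if gap_row == previous_gap_row else gap_open
            previous_gap_row = gap_row
        else:
            score += match if x == y else mismatch
            previous_gap_row = None
    return score


# Six matches (+6) and one gap of length 2 (-5 - 2) under the assumed costs.
print(score_alignment("ACGTACGT", "AC--ACGT"))
```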
Gene
A DNA segment containing biological information and hence
coding for an RNA and/or polypeptide molecule.
GB
Gigabytes
GBFF
GenBank Flat File. Refers to the GenBank flat file format (file extension .gbff).
GenBank
GenBank is a database of nucleotide sequences from more
than 100,000 organisms. Records that are annotated with coding region features
also include amino acid translations. GenBank belongs to an international
collaboration of sequence databases that also includes EMBL and DDBJ.
GeneID
GeneID is a unique identifier that is assigned to a gene
record in Entrez Gene. It is an integer and is species specific. In other words,
the integer assigned to dystrophin in human is different from that in any other
species. For genomes that had been represented in LocusLink, the GeneID is the
same as the LocusID. The GeneID is reported in RefSeq records as a 'db_xref'
(e.g. /db_xref="GeneID:856646", in GenBank format).
Genetic code
The instructions in a gene that tell the cell how to make
a specific protein. A, T, G, and C are the "letters" of the DNA code; they stand
for the chemicals adenine, thymine, guanine, and cytosine, respectively, that
make up the nucleotide bases of DNA. Each gene's code combines the four
chemicals in various ways to spell out three-letter "words" that specify which
amino acid is needed at every position for making a protein.
GenomeScan
A gene identification algorithm that is used to identify
exon-intron structures in genomic DNA sequence.
Genotype
The genetic identity of an individual that does not show
as outward characteristics. The genotype refers to the pair of alleles for a
given region of the genome that an individual carries.
GEO
Gene Expression Omnibus. GEO is a gene expression data
repository and online resource for the retrieval of gene expression data from
any organism or artificial source. Many types of gene expression data from
platform types, such as spotted microarray, high-density oligonucleotide array,
hybridization filter, and serial analysis of gene expression (SAGE) data, are
accepted, accessioned, and archived as a public dataset. [See the GEO chapter
(Chapter 6) or the GEO web page.]
GI
The GenInfo Identifier is a sequence identification
number for a nucleotide sequence. If a nucleotide sequence changes in any way, a
new GI number will be assigned. A separate GI number is also assigned to each
protein translation within a nucleotide sequence record, and a new GI is
assigned if the protein translation changes in any way. GI sequence identifiers
run parallel to the new accession.version system of sequence identifiers (see
the description of Version).
GSS
Genome Survey Sequences are analogous to ESTs except that
the sequences are genomic in origin, rather than cDNA (mRNA). The GSS division
of GenBank contains (but is not limited to) the following types of data: random
"single-pass read" genome survey sequences, cosmid/BAC/YAC end sequences,
exon-trapped genomic sequences, and Alu-PCR sequences.
Heterozygosity
The probability that a diploid individual will have two
different alleles at a particular genome locus. These individuals are defined as
heterozygous, whereas individuals who have two identical alleles at the locus
are defined as homozygous. The probability can be estimated by sampling a
representative number of individuals from the population and dividing the number
of heterozygotes by the total number sampled.
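The estimate described above can be sketched in a few lines of Python. The function name and the sample genotypes are hypothetical, and each genotype is represented as a pair of alleles at a single locus.

```python
def observed_heterozygosity(genotypes):
    """Fraction of sampled individuals carrying two different alleles."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    heterozygotes = sum(1 for allele_1, allele_2 in genotypes if allele_1 != allele_2)
    return heterozygotes / len(genotypes)


# Made-up sample of four individuals at a C/T polymorphism.
sample = [("C", "T"), ("C", "C"), ("T", "T"), ("T", "C")]
print(observed_heterozygosity(sample))
```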
HIV
Human Immunodeficiency Virus. HIV-1 is a retrovirus that
is recognized as the causative agent of AIDS (Acquired Immunodeficiency
Syndrome).
HomoloGene
NCBI's HomoloGene is a system for automated detection of
homologs among the annotated genes of several completely sequenced eukaryotic
genomes.
Homologs
Two biological entities (structures or molecules) are
said to be homologs (or are homologous) if it is thought that they descend from a
common ancestral structure or molecule. Corresponding body parts and genes in
different or the same species can be homologous. The term has often been extended
to include sequences as well. However, it is incorrect to report a relative
homology or percent homology, as is sometimes said of sequences; genes or
sequences are either homologous or they are not. See also ortholog and paralog.
Homogeneously staining region
A region of the chromosome identified cytologically by
DNA staining or the FISH technique because of the presence of multiple copies of
a subchromosomal region resulting from amplification.
Homologous
The term refers to similarity attributable to descent
from a common ancestor. Homologous chromosomes are members of a pair of
essentially identical chromosomes, each derived from one parent. They have the
same or allelic genes with genetic loci arranged in the same order. Homologous
chromosomes synapse during meiosis.
HTGS
High-Throughput Genomic Sequences. The sources of HTGS are
large-scale genome sequencing centers; unfinished sequences are in phases 0, 1,
and 2, and finished sequences are in phase 3.
HTGS_CANCELLED
A keyword added to GenBank entries by sequencing centers
to indicate that work has stopped on a clone and that the existing sequence will
not be finished. Sequencing centers may stop work because the clone is redundant
or for various other reasons.
HTGS_PHASE0, HTGS_PHASE1, HTGS_PHASE2, HTGS_PHASE3
Keywords added to GenBank entries by sequencing centers
to indicate the status (phase) of the sequence (see phase definitions described
under draft sequence).
HTML
Hypertext Markup Language. HTML is derived from SGML. It
is a text-based mark-up language and is used primarily to display information
using a web browser and to link pieces of information via hyperlinks. The tags
used in an HTML document provide information only on how the content is to be
displayed but do not provide information about the content they encompass.
Ideogram
A diagrammatic representation of the karyotype of an
organism.
IMAGE Consortium
Integrated Molecular Analysis of Genomes and their
Expression. A consortium of academic groups that share high-quality, arrayed
cDNA libraries and place sequence, map, and expression data of the clones in
these arrays into the public domain. With the use of this information, unique
clones can be rearrayed to form a "master array", with the aim of ultimately
having a representative cDNA from every gene in the genome under study. To date,
human, mouse, rat, zebrafish, and Xenopus laevis genomes have been studied.
Intron
Noncoding DNA which separates neighboring exons in a
gene. During gene expression, introns are transcribed into RNA, and the intron
sequences are removed from the pre-mRNA by splicing. (See also splice sites.)
Karyotype
The particular chromosome complement of an individual or
a related group of individuals, as defined by both the number and morphology of
the chromosomes, usually in mitotic metaphase, and arranged by pairs according
to the standard classification.
LANL
Los Alamos National Laboratory
LinkOut
A registry service to create links from specific
articles, journals, or biological data in Entrez to resources on external web
sites. Third parties can provide a URL, resource name, brief description of
their web sites, and specification of the NCBI data from which they would like
to establish links. The specification can be written as a valid Boolean query to
Entrez or as a list of identifiers for specific articles or sequences. Entrez
PubMed users can then select which external links are visible in their searches
through the NCBI Cubby service (see above). (See the LinkOut chapter
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books&cmd=search&doptcmdl=TOCView&term=linkout+AND+helplinkout%5Bbook%5D
or web page http://www.ncbi.nlm.nih.gov/entrez/linkout/ .)
Locus
In a genomic context, locus refers to a fixed position on
a chromosome. It may, therefore, refer to a marker, a gene, or any other
landmark that can be described.
MACAW
Multiple Alignment Construction and Analysis Workbench.
MACAW is a program for locating, analyzing, and editing blocks of localized
sequence similarity among multiple sequences and linking them into a composite
multiple alignment.
Map Viewer
The Map Viewer is a software component of Entrez Genomes
that provides special browsing capabilities for a subset of organisms. It allows
one to view and search an organism's complete genome, display chromosome maps,
and zoom into progressively greater levels of detail, down to the sequence data
for a region of interest. If multiple maps are available for a chromosome, it
displays them aligned to each other based on shared marker and gene names and,
for the sequence maps, based on a common sequence coordinate system. The
organisms currently represented in the Map Viewer are listed in the Entrez Map
Viewer help document, which provides general information on how to use that
tool. The number and types of available maps vary by organism and are described
in the "data and search tips" file provided for each organism.
MB
megabytes
MEDLINE
MEDLINE is NLM's database of indexed journal citations
and abstracts in the fields of biomedicine and healthcare. It encompasses nearly
4,500 journals published in the United States and more than 70 other countries.
(For more information, see the Fact Sheet,
http://www.nlm.nih.gov/pubs/factsheets/medline.html )
MegaBLAST
MegaBLAST is a program for aligning sequences that differ
slightly as a result of sequencing or other similar "errors". When a larger word
size is used, it is up to 10 times faster than more common sequence-similarity
search programs. MegaBLAST can also efficiently handle much longer DNA
sequences than the blastn program of the traditional BLAST algorithm. It uses
a greedy algorithm for nucleotide sequence alignment searches.
MeSH
Medical Subject Headings. MeSH refers to the controlled
vocabulary of NLM used for indexing articles in PubMed. MeSH terminology
provides a consistent way to retrieve information that may use different
terminology for the same concepts. (See the MeSH homepage
http://www.nlm.nih.gov/mesh/MBrowser.html )
mFASTA
Multi-FASTA format.
MGC
Mammalian Gene Collection. MGC is a project of the NIH to
provide a complete set of full-length (open reading frame) sequences and cDNA
clones of expressed genes for human and mouse. This program has been expanded to
include isolation of a set of full-ORF rat clones.
MGD
Mouse Genome Database. MGD contains information on mouse
genetic markers, molecular segments, phenotypes, comparative mapping data,
experimental mapping data, and graphical displays for genetic, physical, and
cytogenetic maps.
MGI
Mouse Genome Informatics. MGI houses a database that
provides integrated access to data on the genetics, genomics, and biology of the
laboratory mouse (http://www.informatics.jax.org/).
Microsatellite
Repetitive stretches of short sequences of DNA used as
genetic markers to track inheritance in families (e.g., CC[TATATATA]CCCT). Also
known as short tandem repeats (STRs).
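As an illustration only, a short tandem repeat such as the TA run in the
example above can be located with a regular expression. The function name,
the unit-length range, and the minimum repeat count below are assumptions
chosen for this sketch, not part of any NCBI tool.

```python
import re

def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, unit, repeat_count) for each tandem repeat found.

    Hypothetical helper: scans for a unit of min_unit..max_unit bases that is
    repeated back to back at least min_repeats times.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        # Number of complete copies of the unit in the matched run.
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# The glossary example, written without the brackets that mark the repeat.
hits = find_microsatellites("CCTATATATACCCT")
```

Here `hits` holds one entry for the TA repeat, starting at zero-based
position 2 with four copies of the unit.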
MIM
Mendelian Inheritance in Man. First published in 1966,
Mendelian Inheritance in Man (MIM) is a genetic knowledge base that serves
clinical medicine and biomedical research, including the Human Genome Project.
minimal tiling path
An ordered list or map that defines the minimal set of
overlapping clones needed to provide complete coverage of a chromosome or other
extended segment of DNA (compare with tiling path).
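One way to picture how such a path could be chosen is a greedy interval
cover: starting at the left end of the region, repeatedly pick the clone that
begins within the covered part and reaches furthest to the right. The sketch
below assumes clones are given as hypothetical (name, start, end) tuples with
half-open coordinates, and it accepts clones that merely abut; a real
assembly would normally require a minimum overlap.

```python
def minimal_tiling_path(clones, start, end):
    """Return clone names forming a smallest cover of [start, end).

    clones: list of (name, clone_start, clone_end) tuples (hypothetical layout).
    Returns None if the clones leave a gap in the region.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered = start  # everything in [start, covered) is already covered
    i = 0
    while covered < end:
        best = None
        # Among clones that begin inside the covered part, keep the one
        # that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            return None  # no clone extends the coverage: there is a gap
        path.append(best[0])
        covered = best[2]
    return path

clones = [("A", 0, 100), ("B", 50, 180), ("C", 90, 150),
          ("D", 170, 300), ("E", 120, 200)]
path = minimal_tiling_path(clones, 0, 300)
```

In this made-up example, all five clones together form a tiling path, while
A, B, and D alone are enough to cover the region.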
MMDB
Molecular Modeling Database. MMDB is a database of
three-dimensional biomolecular structures derived from X-ray crystallography and
nuclear magnetic resonance (NMR) spectroscopy.
MMDB-ID
Molecular Modeling Database Accession number.
mRNA
messenger RNA. mRNA describes the section of a genomic
DNA sequence that is transcribed, and can include the 5' untranslated region
(5'UTR), coding region (CDS), and 3' untranslated region (3'UTR). Successful
translation of the CDS section of an mRNA results in the synthesis of a protein.
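The three regions can be illustrated with a simplified sketch that treats the
first ATG as the start codon and the first in-frame stop codon as the end of
the CDS. Both rules are assumptions made for illustration; annotated records
state the CDS coordinates explicitly, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def split_mrna(mrna):
    """Split an mRNA sequence into (5'UTR, CDS, 3'UTR).

    Simplifying assumptions: translation starts at the first ATG and ends at
    the first in-frame stop codon. Returns None if either is missing.
    """
    seq = mrna.upper().replace("U", "T")  # accept RNA or DNA alphabet
    start = seq.find("ATG")
    if start == -1:
        return None
    # Walk codon by codon from the start codon, staying in frame.
    for pos in range(start, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            return seq[:start], seq[start:pos + 3], seq[pos + 3:]
    return None

utr5, cds, utr3 = split_mrna("GGCACCATGGCTTGGTAAGCTTAAA")
```

For this invented sequence the CDS is ATGGCTTGGTAA, flanked by GGCACC on the
5' side and GCTTAAA on the 3' side.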
Motif
A motif is a short, well-conserved nucleotide or amino
acid sequence that represents a minimal functional domain. It is often a
consensus for several aligned sequences.
Mutation
A permanent structural alteration in DNA. In most cases,
DNA changes have either no effect or cause harm, but occasionally a mutation can
improve an organism's chance of surviving, and the beneficial change is passed
on to the organism's descendants. Typically, mutations are rarer than
polymorphisms in population samples because natural selection acts against their
lower fitness and removes them from the population.
NCBI
National Center for Biotechnology Information
NCBI Toolkit
The NCBI Toolkit contains supported software tools from
the Information Engineering Branch (IEB) of the NCBI. It describes the three
components of the ToolBox (data model, data encoding, and programming libraries)
and provides access to documentation for the DataModel, C Toolkit, C++ Toolkit,
NCBI C Toolkit Source Browser, XML Demo Program, XML DTDs, and the FTP site.
NCI
National Cancer Institute
NEXUS
NEXUS refers to a file format designed to contain data
for processing by computer programs. NEXUS files should end with .nxs or .nex
for purposes of clarity (Maddison et al., Syst Biol 46:590-621; 1997).
NIH
National Institutes of Health
NLM
National Library of Medicine
NMR
Nuclear Magnetic Resonance. NMR is a spectroscopic
technique used for the determination of protein structure.
Non-synonymous SNP
The terms "synonymous" and "non-synonymous" are used for
SNPs that are in predicted protein coding regions (i.e., exons of genes).
A non-synonymous SNP is one whose alleles encode different amino acids; a
synonymous SNP is one whose alleles encode the same amino acid.
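The distinction can be checked by translating the codon that contains the SNP
with each allele in turn. The sketch below assumes the standard genetic code
and a codon already in the correct reading frame; the function name and its
arguments are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order; "*" is stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def classify_snp(codon, position, alt_base):
    """Classify a coding SNP as 'synonymous' or 'non-synonymous'.

    codon: reference codon (three DNA bases, in frame)
    position: zero-based position of the SNP within the codon (0, 1, or 2)
    alt_base: the alternative allele at that position
    """
    ref_codon = codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    if CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]:
        return "synonymous"
    return "non-synonymous"

silent = classify_snp("CCT", 2, "C")    # CCT and CCC both encode proline
changed = classify_snp("GAG", 1, "T")   # GAG is glutamate, GTG is valine
```

A change to or from a stop codon is also reported as non-synonymous here,
because the stop symbol is compared like any other amino acid.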
OMIM
Online Mendelian Inheritance in Man. OMIM is a directory
of human genes and genetic disorders, with links to literature references,
sequence records, maps, and related databases.
Ortholog
Orthologs are genes derived from a common ancestor
through vertical descent. This is often stated as the same gene in different
species. In contrast, paralogs are genes within the same genome that have
evolved by duplication.
The hemoglobin genes are a good example. Two separate genes, alpha and beta,
encode the proteins that make up the hemoglobin molecule. The alpha and beta DNA
sequences are very similar, and it is believed that they arose from duplication
of a single gene, followed by separate evolution of each copy. Alpha and beta
are therefore considered paralogs, whereas alpha hemoglobins in different
species are considered orthologs.
Orthology
Synteny
Tax BLAST