An Accession number is a unique identifier given to a sequence when it is submitted to one of the DNA repositories (GenBank, EMBL, DDBJ). The initial deposition of a sequence record is referred to as version 1. If the sequence is updated, the version number is incremented, but the Accession number will remain constant.
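As a minimal sketch of how the accession and version parts relate, the snippet below splits an identifier written in the accession.version form. The function name and the example identifier are hypothetical and chosen only for illustration.

```python
def split_accession_version(identifier: str) -> tuple[str, int]:
    """Split an 'accession.version' string into its two parts.

    The accession stays constant across updates; only the integer
    version after the dot is incremented.
    """
    accession, sep, version = identifier.partition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version form: {identifier!r}")
    return accession, int(version)


# Hypothetical identifier, used only as an example.
print(split_accession_version("AF123456.2"))
```

Two identifiers that share the accession but differ in version refer to the same record at different points in its update history.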
The Alu repeat family comprises short interspersed elements (SINEs) present in multiple copies in the genomes of humans and other primates. The Alu sequence is approximately 300 bp in length and is found commonly in introns, 3' untranslated regions of genes, and intergenic genomic regions. Alu elements are mobile and are present in the human genome in extremely high copy number. Almost 1 million copies of the Alu sequence are estimated to be present, making it the most abundant mobile element. The Alu sequence is so named because of the presence of a recognition site for the AluI endonuclease in the middle of the Alu sequence. Because of the widespread occurrence of the Alu repeat in the genome, the Alu sequence is used as a universal primer for PCR in animal cell lines; it binds in both forward and reverse directions. The Alu universal primer sequence is as follows: 5'-GTG GAT CAC CTG AGG TCA GGA GTT TC-3' (26-mer).
One of the variant forms of a gene at a particular locus on a chromosome. Different alleles often produce variation in inherited characteristics such as hair color or blood type. In an individual, one form of the allele (the dominant one) may be expressed more than another form (the recessive one). When "genes" are considered simply as segments of a nucleotide sequence, allele refers to each of the possible alternative nucleotides at a specific position in the sequence. For example, a CT polymorphism such as CCT[C/T]CCAT would have two alleles: C and T.
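The bracket notation in the example above can be expanded mechanically. The following is a small sketch, not a standard tool: the function name is hypothetical, and it assumes a single bracketed site with alleles separated by "/".

```python
import re


def expand_alleles(pattern: str) -> list[str]:
    """Expand a sequence with one [X/Y] polymorphic site into one
    sequence per allele. Only the first bracketed site is handled."""
    match = re.search(r"\[([^\]]+)\]", pattern)
    if match is None:
        return [pattern]  # no polymorphic site present
    left, right = pattern[:match.start()], pattern[match.end():]
    return [left + allele + right for allele in match.group(1).split("/")]


print(expand_alleles("CCT[C/T]CCAT"))
```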
Application Programming Interface
An API is a set of routines that an application uses to request and carry out lower-level services performed by a computer's operating system. For computers running a graphical user interface, an API manages an application's windows, icons, menus, and dialog boxes.
Abstract Syntax Notation 1 is an international standard data-representation format used to achieve interoperability between computer platforms. It allows for the reliable exchange of data in terms of structure and content by computer and software systems of all types.
Bacterial Artificial Chromosome. A BAC is a large segment of DNA (100,000-200,000 bp) from another species cloned into bacteria. Once the foreign DNA has been cloned into the host bacteria, many copies of it can be made.
Batch Entrez is a tool for retrieving large numbers of records from Entrez. A simple text file containing accession numbers or other unique identifiers can be used as input.
BankIt is a tool for the online submission of one or a few sequences into GenBank and is designed to make the submission process quick and easy.
The value S' is derived from the raw alignment score S in which the statistical properties of the scoring system used have been taken into account. By normalizing a raw score using the formula:
S' = (lambda * S - ln K) / ln 2
a bit score of S' is attained, which has a standard set of units, and where K and lambda are the statistical parameters of the scoring system. Because bit scores have been normalized with respect to the scoring system, they can be used to compare alignment scores from different searches.
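The normalization can be sketched in a few lines, assuming the standard Karlin-Altschul form S' = (lambda * S - ln K) / ln 2. The function name and the parameter values in the usage line are illustrative assumptions, not values taken from any particular scoring system.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S into a bit score S'.

    lam and k are the statistical parameters (lambda and K) of the
    scoring system that produced the raw score.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameter values only.
print(bit_score(raw_score=100, lam=0.3, k=0.1))
```

Because lambda and K are folded into the result, two bit scores can be compared even when the underlying searches used different scoring systems.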
Basic Local Alignment Search Tool (Altschul et al., J Mol Biol 215:403-410; 1990). A sequence comparison algorithm that is optimized for speed and used to search sequence databases for optimal local alignments to a query.
nucleotide-nucleotide BLAST. blastn takes nucleotide sequences in FASTA format, GenBank Accession numbers, or GI numbers and compares them against the NCBI Nucleotide databases.
protein-protein BLAST. blastp takes protein sequences in FASTA format, GenBank Accession numbers, or GI numbers and compares them against the NCBI Protein databases.
A DNA/Protein sequence analysis program to quickly find sequences of 95% and greater similarity of length 40 bases or more. It may miss more divergent or shorter sequence alignments. BLAT on proteins finds sequences of 80% and greater similarity of length 20 amino acids or more. BLAT is not BLAST. (See the BLAT web page.)
BLAST Link. BLink displays the results of BLAST searches that have been done for every protein sequence in the Entrez Protein database. It can be accessed by following the BLink link displayed beside any hit in the results of an Entrez Protein search. In contrast to Entrez's Related Sequences feature, which lists the titles of similar sequences, BLink displays the graphical output of precomputed blastp results against the non-redundant (nr) protein database. The output includes the positions of up to 200 BLAST hits on the query sequence, scores, and alignments. BLink offers a variety of display options, including the distribution of hits by taxonomic grouping, the best hit to each organism, the protein domains in the query sequence, similar sequences that have known 3D structures, and more. Additional options allow you to specify which taxa you would like to exclude, to increase or decrease the BLAST cutoff score, or to filter the BLAST hits to show only those from a specific source database, such as RefSeq or SWISS-PROT.
Binary Large Object (or binary data object). BLOB refers to a large piece of data, such as a bitmap. A BLOB is characterized by large field values, an unpredictable table size, and data that are formless from the perspective of a program. It is also a keyword designating the BLOB structure, which contains information about a block of data.
Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM 62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment to avoid overweighting closely related family members (Henikoff and Henikoff, Proc Natl Acad Sci U S A 89:10915-10919; 1992).
This term refers to binary algebra that uses the logical operators AND, OR, XOR, and NOT; the outcomes consist of logical values (either TRUE or FALSE). The keyword boolean indicates that the expression or constant expression associated with the identifier takes the value TRUE or FALSE. The logical-AND (&&) operator produces the value 1 if both operands have nonzero values; otherwise, it produces the value 0. The logical-OR (||) operator produces the value 1 if either of its operands has a nonzero value. The logical-NOT (!) operator produces the value 0 if its operand is TRUE (nonzero) and the value 1 if its operand is FALSE (0). The exclusive OR (XOR) operator yields TRUE only if one of its operands is TRUE and the other is FALSE. If both operands are the same (either TRUE or FALSE), the operation yields FALSE.
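The four operators can be tabulated with a short sketch. Note that the &&, ||, and ! symbols above are C-style notation; Python spells them and, or, and not, and has no dedicated logical XOR keyword, so the xor helper below is an illustrative stand-in.

```python
def xor(a: bool, b: bool) -> bool:
    """Exclusive OR: True only when exactly one operand is True."""
    return a != b


def truth_table() -> list[tuple[bool, ...]]:
    """Return rows of (a, b, a AND b, a OR b, a XOR b, NOT a)."""
    rows = []
    for a in (False, True):
        for b in (False, True):
            rows.append((a, b, a and b, a or b, xor(a, b), not a))
    return rows


for row in truth_table():
    print(row)
```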
A run of the genome assembly and annotation process or the set of products generated by that run.
Cancer Chromosome Aberration Project. CCAP was designed to expedite the definition and detailed characterization of the distinct chromosomal alterations that are associated with malignant transformation. The project is a collaboration among the NCI, the NCBI, and numerous research labs.
Conserved Domain. CD refers to a domain (a distinct functional and/or structural unit of a protein) that has been conserved during evolution. CDs are generated from multiple sequence alignments and may be refined by comparison to solved structures. During evolution, amino acid changes occur in ways that preserve the physico-chemical properties of critical residues, and hence the structural and/or functional properties of that domain.
Conserved Domain Architecture Retrieval Tool. When given a protein query sequence, CDART displays the functional domains that make up the protein and lists proteins with similar domain architectures. The functional domains for a sequence are found by comparing the protein sequence to a database of conserved domain alignments, CDD, using RPS-BLAST.
Conserved Domain Database. This is a collection of sequence alignments, also called profiles or position specific scoring matrices (PSSMs), representing protein domains conserved during evolution.
Complementary DNA. A DNA sequence obtained by reverse transcription of a messenger RNA (mRNA) sequence.
coding region, coding sequence. CDS refers to the portion of a genomic DNA sequence that is translated, from the start codon to the stop codon, inclusively, if complete. A partial CDS lacks part of the complete CDS (it may lack either or both the start and stop codons). Successful translation of a CDS results in the synthesis of a protein.
Centre d'Etude du Polymorphisme Humain
Cancer Genome Anatomy Project. CGAP is an interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (EST) libraries, and to determine the molecular profiles of normal, precancerous, and malignant cells. The project is a collaboration among the NCI, the NCBI, and numerous research labs.
Comparative Genomic Hybridization. CGH is a fluorescent molecular cytogenetic technique that identifies chromosomal aberrations and maps these changes to metaphase chromosomes. CGH can be used to generate a map of DNA copy number changes in tumor genomes. CGH is based on quantitative two-color fluorescence in situ hybridization (FISH). DNA extracted from tumor cells is labeled in one color (e.g., green) and mixed in a 1:1 ratio with DNA from normal cells, which is labeled in a different color (e.g., red). The mixture is then applied to normal metaphase chromosomes. Portions of the genome that are equally represented in normal and tumor cells will appear orange, regions that are deleted in the tumor sample relative to the normal sample will appear red, and regions that are present in higher copy number in the tumor sample (because of amplification) will appear green. Special image analysis tools are necessary to quantitate the ratio of green-to-red fluorescence to determine whether a given region is more highly represented in the normal or in the tumor sample.
Common Gateway Interface. A mechanism that allows a Web server to run a program or script on the server and send the output to a Web browser.
A group that is created based on certain criteria. For example, a gene cluster may include a set of genes whose expression profiles are found to be similar according to certain criteria, or a cluster may refer to a group of clones that are related to each other by homology.
"See in 3-D" is a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence-structure or structure-structure alignments. Cn3D can work as a helper application to the browser or as a client-server application that retrieves structure records from the Molecular Modeling Database (MMDB, see below) directly from the internet. The Cn3D homepage provides access to information on how to install the program, a tutorial to get started, and a comprehensive help document.
Sequence of three nucleotides in DNA or mRNA that specifies a particular amino acid during protein synthesis; also called a triplet. Of the 64 possible codons, 3 are stop codons, which do not specify amino acids.
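The counts above can be checked by enumeration. This sketch assumes the standard genetic code, in which the three stop codons (written as DNA) are TAA, TAG, and TGA.

```python
from itertools import product

# All triplets over the four DNA bases: 4 ** 3 = 64 codons.
CODONS = ["".join(triplet) for triplet in product("TCAG", repeat=3)]

# Stop codons of the standard genetic code, written as DNA.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Codons that specify an amino acid.
SENSE_CODONS = [codon for codon in CODONS if codon not in STOP_CODONS]

print(len(CODONS), len(STOP_CODONS), len(SENSE_CODONS))
```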
Clusters of Orthologous Groups (of proteins) were delineated by comparing protein sequences from completely sequenced genomes. Each COG consists of individual proteins or groups of paralogs from at least three lineages and thus corresponds to an ancient conserved domain.
The nucleotides or amino acids found most commonly at each position in the sequences of homologous DNAs, RNAs, or proteins.
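A simple majority-rule consensus can be sketched as follows. This assumes the sequences are already aligned and of equal length; the function name is hypothetical, and ties are broken arbitrarily (by first occurrence in the column), which real tools may handle differently.

```python
from collections import Counter


def consensus(aligned_seqs: list[str]) -> str:
    """Return the most common residue at each aligned position."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("sequences must be non-empty and of equal length")
    # zip(*seqs) yields one column of residues per alignment position.
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned_seqs)
    )


print(consensus(["ACGT", "ACGA", "TCGT"]))
```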
A contiguous segment of the genome made by joining overlapping clones or sequences. A clone contig consists of a group of cloned (copied) pieces of DNA representing overlapping regions of a particular chromosome. A sequence contig is an extended sequence created by merging primary sequences that overlap. A contig map shows the regions of a chromosome where contiguous DNA segments overlap. Contig maps provide the ability to study a complete and often large segment of the genome by examining a series of overlapping clones, which then provide an unbroken succession of information about that region.
Coriell Institute of Aging Cell Repository
Central Processing Unit. The CPU is the computational and control unit of a computer, the device that interprets and executes instructions.
The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms. This collection of polymorphisms includes single-base nucleotide substitutions (also known as single nucleotide polymorphisms or SNPs), small-scale multi-base deletions or insertions (also called deletion insertion polymorphisms or DIPs), and retroposable element insertions and microsatellite repeat variations (also called short tandem repeats or STRs).
DNA Data Bank of Japan
A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The definition line or description line is distinguished from the sequence data by a "greater than" (>) symbol in the first column (see example); also DEFLINE, as in a flatfile.
Deoxyribonucleic acid is the chemical inside the nucleus of a cell that carries the genetic instructions for making living organisms. DNA is composed of two anti-parallel strands, each a linear polymer of nucleotides. Each nucleotide has a phosphate group linked by a phosphoester bond to a pentose (a five-carbon sugar molecule, deoxyribose), which in turn is linked to one of four organic bases, adenine, guanine, cytosine, or thymine, abbreviated A, G, C, and T, respectively. The bases are of two types: purines, which have two rings and are slightly larger (A and G); and pyrimidines, which have only one ring (C and T). Each nucleotide is joined to the next nucleotide in the chain by a covalent phosphodiester bond between the 5' carbon of one deoxyribose group and the 3' carbon of the next. DNA is a helical molecule with the sugar-phosphate backbone on the outside and the bases extending toward the central axis. There is specific base-pairing between the bases on opposite strands in such a way that A always pairs with T and G always pairs with C.
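The pairing rule (A with T, G with C) and the anti-parallel orientation of the strands together mean that the opposite strand, read 5' to 3', is the reverse complement. A minimal sketch, handling only the four unambiguous bases:

```python
# Translation table implementing the pairing rule A<->T, G<->C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand read in its own 5'-to-3' direction."""
    return seq.upper().translate(COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))
```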
A "domain" refers to a discrete portion of a protein assumed to fold independently of the rest of the protein and which possesses its own function.
Draft sequence refers to DNA sequence that is not yet finished but is generally of high quality (i.e., an accuracy of greater than 90%). Draft sequence data are mostly in the form of 10,000 base pair-sized fragments, the approximate chromosomal locations of which are known. The following keywords are associated with draft sequence: phase 0, light-pass coverage of a clone, generally only 1× coverage; phase 1, 4-10× coverage of a BAC clone (order and orientation of the fragments are unknown); and phase 2, 4-10× coverage of a BAC clone (order and orientation of the fragments are known). Phase 3 refers to the completely finished sequence.
Document Type Definition. The DTD is an optional part of the prolog of an XML document that defines the rules of the document. It sets constraints for an XML document by specifying which elements are present in the document and the relationships between elements, e.g., which tags can contain other tags, the number and sequence of the tags, and attributes of the tags. The DTD helps to validate the data when the receiving application does not have a built-in description of the incoming data.
A program for filtering low-complexity regions from nucleic acid sequences.
Expect value. The E-value is a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. It decreases exponentially with the score (S) that is assigned to a match between two sequences. Essentially, the E-value describes the random background noise that exists for matches between sequences. For example, an E-value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size, one might expect to see one match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0", the higher the "significance" of the match. However, it is important to note that matches to short query sequences can be virtually identical and still have a relatively high E-value. This is because the calculation of the E-value also takes into account the length of the query sequence, and shorter sequences have a higher probability of occurring in the database purely by chance. For more information, see the following tutorial.
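The dependence on score and search-space size can be sketched with the commonly quoted relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the database length. This is a simplified illustration: it ignores the effective-length (edge-effect) corrections that real BLAST implementations apply, and the function name and numbers are assumptions made for the example.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate number of chance hits scoring at least bit_score.

    Simplified form E = m * n * 2 ** (-S'); no length corrections.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Same bit score, larger database: more hits expected by chance.
print(expect_value(40, query_len=300, db_len=1_000_000))
print(expect_value(40, query_len=300, db_len=1_000_000_000))
```

Each additional bit of score halves the E-value, which is the exponential decrease described above.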
A number assigned to a type of enzyme according to a scheme of standardized enzyme nomenclature developed by the Enzyme Commission of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (IUBMB). EC numbers may be found in ENZYME, the Enzyme nomenclature database, maintained at the ExPASy molecular biology server.
European Molecular Biology Laboratory
Entrez is a retrieval system for searching several linked databases. It provides access to the following NCBI databases: PubMed, GenBank, Protein, Structure, Genome, PopSet, OMIM, Taxonomy, Books, ProbeSet, 3D Domains, UniSTS, SNP, and CDD.
(formerly known as LocusLink). Entrez Gene provides tracked, unique identifiers for genes (GeneIDs) and reports information associated with those identifiers for unrestricted public use.
The NCBI Entrez Genome database provides views for a variety of genomes, complete chromosomes, sequence maps with contigs, and integrated genetic and physical maps. The database is organized in six major organism groups: Archaea, Bacteria, Eukaryotae, Viruses, Viroids, and Plasmids and includes complete chromosomes, organelles and plasmids as well as draft genome assemblies.
Entrez Genome Project
The NCBI Entrez Genome Project database is a searchable collection of complete and incomplete large-scale sequencing, assembly, annotation, and mapping projects for cellular organisms. The database is organized into organism-specific overviews that function as portals from which all projects in the database pertaining to that organism can be browsed and retrieved.
The Entrez Nucleotide database is a collection of sequences from several sources. It includes sequences from the International Collaboration of Sequence Databases (GenBank, EMBL, and DDBJ), PDB as well as the NCBI Reference Sequence (RefSeq) records. The database is partitioned into three components: "EST" (containing EST sequences), "GSS" (containing GSS sequences) and "CoreNucleotide" (which comprises the remaining nucleotide sequences).
The Entrez Protein database contains sequence data from the translated coding regions from DNA sequences in GenBank, EMBL, and DDBJ as well as protein sequences submitted to Protein Information Resource (PIR), SWISS-PROT, Protein Research Foundation (PRF), and Protein Data Bank (PDB).
Expressed Sequence Tag. ESTs are short (usually approximately 300-500 base pairs), single-pass sequence reads from cDNA. Typically, they are produced in large batches. They represent the genes expressed in a given tissue and/or at a given developmental stage. They are tags (some coding, others not) of expression for a given cDNA library. They are useful in identifying full-length genes and in mapping.
Electronic PCR is used to compare a query sequence to mapped sequence-tagged sites (STSs) to find a possible map location for the query sequence. e-PCR finds STSs in DNA sequences by searching for subsequences that closely match the PCR primers present in mapped markers. The subsequences must have the correct order, orientation, and spacing such that they could plausibly prime the amplification of a PCR product of the correct molecular weight.
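The order, orientation, and spacing test can be illustrated with a toy search. This is a sketch of the idea only, not the e-PCR program's algorithm: it requires exact primer matches (e-PCR tolerates close matches), and the function names, primers, and sequence are all hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def find_sts(seq: str, fwd: str, rev: str, expected_size: int,
             margin: int = 50) -> list[tuple[int, int]]:
    """Return (start, product_size) for each primer pair placement whose
    implied product size is within margin of expected_size.

    The forward primer must occur first; the reverse primer must occur
    downstream as its reverse complement (correct orientation).
    """
    rev_site = reverse_complement(rev)
    hits = []
    start = seq.find(fwd)
    while start != -1:
        end = seq.find(rev_site, start + len(fwd))
        while end != -1:
            size = end + len(rev_site) - start
            if abs(size - expected_size) <= margin:
                hits.append((start, size))
            end = seq.find(rev_site, end + 1)
        start = seq.find(fwd, start + 1)
    return hits


# Hypothetical sequence and primers.
print(find_sts("TTACGTACGGGGGTAACGGAA", "ACGTAC", "CCGTTA",
               expected_size=17, margin=2))
```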
Citation "Ahead-of-print" citation. PubMed now accepts citations from publishers for articles that have been published electronically ahead of the printed issue. PubMed displays the category "[epub ahead of print]" in the part of the citation where the volume and pagination would ordinarily display. For example: Proc Natl Acad Sci U S A. 2000 May 2 [epub ahead of print].
Exon Finding by Sequence Homology. Exofish is a tool based on homology searches for the rapid and reliable identification of human genes. It relies on the sequence of another vertebrate, the pufferfish Tetraodon nigroviridis (similar to Fugu), to detect conserved sequences with a very low background. The genome of T. nigroviridis is eight times more compact than the human genome and has been used in the comparative identification of human genes from the rough draft of the human genome (Roest Crollius et al., Nat Genet 25:235-238; 2000).
Refers to the portion of a gene that encodes a part of that gene's mRNA. A gene may comprise many exons, some of which may include only protein-coding sequence; however, an exon may also include 5' or 3' untranslated sequence.
Exon trapping is a technique for cloning exon sequences from genomic DNA by selecting for functional splice sites, relying on the cellular splicing machinery. The genomic DNA containing the putative exon(s) is cloned into an exon-trap vector, which has a promoter, polyadenylation signals, and splice sites, and then transfected into a cell line. If there are functional splice sites in the genomic DNA fragment, the segments of DNA between the splice sites will be removed. Total RNA is isolated and reverse-transcribed. After cDNA synthesis and PCR amplification, the exon of interest is cloned.
Expert Protein Analysis System is a proteomics server of the Swiss Institute of Bioinformatics (SIB).
A sequence similarity search tool developed by William Pearson and David Lipman. The term FASTA is also used to identify a text format for sequences that is widely used. A FASTA-formatted sequence file may contain multiple sequences. Each sequence in the file is identified by a single line title preceded by the greater than sign (">").
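The FASTA text format described above is simple enough to read with a few lines of code. The following is a minimal sketch with a hypothetical function name; established libraries handle more edge cases.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (title, sequence) pairs.

    A line starting with '>' opens a new record; the lines that follow,
    up to the next '>', are joined to form its sequence.
    """
    records = []
    title = None
    chunks: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title = line[1:].strip()
            chunks = []
        elif title is not None:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


example = ">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n"
print(parse_fasta(example))
```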
The pattern of bands on a gel produced by a clone when restricted by a particular enzyme, such as HindIII.
High-quality, low-error DNA sequence that is free of gaps. To qualify as a finished sequence, only a single error out of every 10,000 bases (i.e., an accuracy of 99.99%) is allowed.
Fluorescence in situ hybridization. In this technique, fluorescent molecules are used to label a DNA probe, which can then hybridize to a specific DNA sequence in a chromosome spread so that the site becomes visible through a microscope. FISH has been used to highlight the locations of genes, subchromosome regions, entire chromosomes, or specific DNA sequences. It has been used for mapping and the detection of genomic rearrangements, as well as studies on DNA replication.
Flatfile or Flat file
A flat file is a data file that contains records (each corresponding to a row in a table); however, these records have no structured relationships. To interpret these files, the format properties of the file should be known. For example, a database management system may allow the user to export data to a comma-delimited file. Such a file is called a flat file because it has no inherent information about the data, and interpretation requires additional information. Files in a database management system have more complex storage structures.
To copy changing data so as to preserve the dataset as it existed at a particular point in time. Also used to refer to the resulting set of frozen data.
File Transfer Protocol. A method of retrieving files over a network directly to the user's computer or to his/her home directory using a set of protocols that govern how the data are to be transported.
A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another. To prevent the accumulation of too many gaps in an alignment, introduction of a gap causes the deduction of a fixed amount (the gap score) from the alignment score. Extension of the gap to encompass additional nucleotides or amino acids is also penalized in the scoring of an alignment.
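The two-part penalty can be sketched as an affine function of gap length. The convention assumed here charges the fixed amount for the first gapped position and the extension cost for each additional one; other conventions exist, and the default values are illustrative rather than those of any particular program.

```python
def gap_penalty(length: int, gap_open: int = 5, gap_extend: int = 2) -> int:
    """Amount deducted from the alignment score for one gap.

    gap_open is charged once when the gap is introduced; gap_extend is
    charged for every additional position the gap covers.
    """
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for n in (1, 2, 3, 10):
    print(n, gap_penalty(n))
```

With an extension cost lower than the opening cost, one long gap is cheaper than several short ones of the same total length, which discourages scattering gaps through the alignment.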
A DNA segment containing biological information and hence coding for an RNA and/or polypeptide molecule.
GenBank Flat File. Refers to the .gbff file format.
GenBank is a database of nucleotide sequences from more than 100,000 organisms. Records that are annotated with coding region features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases that also includes EMBL and DDBJ.
GeneID is a unique identifier that is assigned to a gene record in Entrez Gene. It is an integer and is species specific. In other words, the integer assigned to dystrophin in human is different from that in any other species. For genomes that had been represented in LocusLink, the GeneID is the same as the LocusID. The GeneID is reported in RefSeq records as a 'db_xref' (e.g. /db_xref="GeneID:856646", in GenBank format).
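Since the GeneID appears as a db_xref qualifier in GenBank-format text, it can be pulled out with a regular expression. This is a small sketch with a hypothetical function name; it assumes the qualifier is written exactly as in the example above.

```python
import re

# Matches qualifiers of the form /db_xref="GeneID:856646".
GENEID_RE = re.compile(r'/db_xref="GeneID:(\d+)"')


def extract_gene_ids(flatfile_text: str) -> list[int]:
    """Return every GeneID found in GenBank-format text, as integers."""
    return [int(gene_id) for gene_id in GENEID_RE.findall(flatfile_text)]


print(extract_gene_ids('/db_xref="GeneID:856646"'))
```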
The instructions in a gene that tell the cell how to make a specific protein. A, T, G, and C are the "letters" of the DNA code; they stand for the chemicals adenine, thymine, guanine, and cytosine, respectively, that make up the nucleotide bases of DNA. Each gene's code combines the four chemicals in various ways to spell out three-letter "words" that specify which amino acid is needed at every position for making a protein.
A gene identification algorithm that is used to identify exon-intron structures in genomic DNA sequence.
Genotype
The genetic identity of an individual that does not show as outward characteristics. The genotype refers to the pair of alleles for a given region of the genome that an individual carries.
Gene Expression Omnibus. GEO is a gene expression data repository and online resource for the retrieval of gene expression data from any organism or artificial source. Many types of gene expression data from platform types, such as spotted microarray, high-density oligonucleotide array, hybridization filter, and serial analysis of gene expression (SAGE) data, are accepted, accessioned, and archived as a public dataset. [See the GEO chapter (Chapter 6) or the GEO web page.]
The GenInfo Identifier is a sequence identification number for a nucleotide sequence. If a nucleotide sequence changes in any way, a new GI number will be assigned. A separate GI number is also assigned to each protein translation within a nucleotide sequence record, and a new GI is assigned if the protein translation changes in any way. GI sequence identifiers run parallel to the new accession.version system of sequence identifiers (see the description of Version).
Genome Survey Sequences are analogous to ESTs except that the sequences are genomic in origin, rather than cDNA (mRNA). The GSS division of GenBank contains (but is not limited to) the following types of data: random "single-pass read" genome survey sequences, cosmid/BAC/YAC end sequences, exon-trapped genomic sequences, and Alu-PCR sequences.
The probability that a diploid individual will have two different alleles at a particular genome locus. These individuals are defined as heterozygous, whereas individuals who have two identical alleles at the locus are defined as homozygous. The probability can be estimated by sampling a representative number of individuals from the population and dividing the number of heterozygotes by the total number sampled.
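The estimate described above is a simple proportion. In this sketch each sampled individual is represented as a pair of alleles at the locus; the function name and the sample data are hypothetical.

```python
def observed_heterozygosity(genotypes: list[tuple[str, str]]) -> float:
    """Fraction of sampled individuals carrying two different alleles."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    heterozygotes = sum(1 for first, second in genotypes if first != second)
    return heterozygotes / len(genotypes)


# Hypothetical sample of four individuals at a C/T locus.
sample = [("C", "T"), ("C", "C"), ("T", "T"), ("T", "C")]
print(observed_heterozygosity(sample))
```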
Human Immunodeficiency Virus. HIV-1 is a retrovirus that is recognized as the causative agent of AIDS (Acquired Immunodeficiency Syndrome).
NCBI's HomoloGene is a system for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes.
Two biological entities (structures or molecules) are said to be homologs (or are homologous) if it is thought that they descend from a common ancestral structure or molecule. Corresponding body parts and genes in different or the same species can be homologous. The term has often been extended to include sequences as well. However it is incorrect to report a relative homology or percent homology as is sometimes said of sequences; genes or sequences are either homologous or they are not. See also ortholog and paralog.
Homogeneously staining region
A region of the chromosome identified cytologically by DNA staining or the FISH technique because of the presence of multiple copies of a subchromosomal region resulting from amplification.
The term refers to similarity attributable to descent from a common ancestor. Homologous chromosomes are members of a pair of essentially identical chromosomes, each derived from one parent. They have the same or allelic genes with genetic loci arranged in the same order. Homologous chromosomes synapse during meiosis.
High-Throughput Genomic Sequences. The sources of HTGS are large-scale genome sequencing centers; unfinished sequences are in phases 0, 1, and 2, and finished sequences are in phase 3.
A keyword added to GenBank entries by sequencing centers to indicate that work has stopped on a clone and that the existing sequence will not be finished. Sequencing centers may stop work because the clone is redundant or for various other reasons.
HTGS_PHASE0, HTGS_PHASE1, HTGS_PHASE2, HTGS_PHASE3
Keywords added to GenBank entries by sequencing centers to indicate the status (phase) of the sequence (see phase definitions described under draft sequence).
Hypertext Markup Language. HTML is derived from SGML. It is a text-based mark-up language used primarily to display information in a web browser and to link pieces of information via hyperlinks. The tags used in an HTML document provide information only on how the content is to be displayed but do not provide information about the content they encompass.
A diagrammatic representation of the karyotype of an organism.
Integrated Molecular Analysis of Genomes and their Expression. A consortium of academic groups that share high-quality, arrayed cDNA libraries and place sequence, map, and expression data of the clones in these arrays into the public domain. With the use of this information, unique clones can be rearrayed to form a "master array", with the aim of ultimately having a representative cDNA from every gene in the genome under study. To date, human, mouse, rat, zebrafish, and Xenopus laevis genomes have been studied.
Noncoding DNA that separates neighboring exons in a gene. During gene expression, introns are transcribed into RNA, and the intron sequences are then removed from the pre-mRNA by splicing. (See also splice sites.)
The particular chromosome complement of an individual or a related group of individuals, as defined by both the number and morphology of the chromosomes, usually in mitotic metaphase, and arranged by pairs according to the standard classification.
Los Alamos National Lab
A registry service to create links from specific articles, journals, or biological data in Entrez to resources on external web sites. Third parties can provide a URL, resource name, brief description of their web sites, and specification of the NCBI data from which they would like to establish links. The specification can be written as a valid Boolean query to Entrez or as a list of identifiers for specific articles or sequences. Entrez PubMed users can then select which external links are visible in their searches through the NCBI Cubby service (see above). (See the LinkOut chapter http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books&cmd=search&doptcmdl=TOCView&term=linkout+AND+helplinkout%5Bbook%5D or web page http://www.ncbi.nlm.nih.gov/entrez/linkout/ .)
In a genomic context, locus refers to a fixed position on a chromosome. It may, therefore, refer to a marker, a gene, or any other landmark that can be described.
Multiple Alignment Construction and Analysis Workbench. MACAW is a program for locating, analyzing, and editing blocks of localized sequence similarity among multiple sequences and linking them into a composite multiple alignment.
The Map Viewer is a software component of Entrez Genomes that provides special browsing capabilities for a subset of organisms. It allows one to view and search an organism's complete genome, display chromosome maps, and zoom into progressively greater levels of detail, down to the sequence data for a region of interest. If multiple maps are available for a chromosome, it displays them aligned to each other based on shared marker and gene names and, for the sequence maps, based on a common sequence coordinate system. The organisms currently represented in the Map Viewer are listed in the Entrez Map Viewer help document, which provides general information on how to use that tool. The number and types of available maps vary by organism and are described in the "data and search tips" file provided for each organism.
MEDLINE is NLM's database of indexed journal citations and abstracts in the fields of biomedicine and healthcare. It encompasses nearly 4,500 journals published in the United States and more than 70 other countries. (For more information, see the Fact Sheet, http://www.nlm.nih.gov/pubs/factsheets/medline.html )
MegaBLAST is a program for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When a larger word size is used, it is up to 10 times faster than more common sequence-similarity search programs. MegaBLAST can also efficiently handle much longer DNA sequences than the blastn program of the traditional BLAST algorithm. It uses a greedy algorithm for the nucleotide sequence alignment search.
Medical Subject Headings. MeSH refers to the controlled vocabulary of NLM used for indexing articles in PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts. (See the MeSH homepage http://www.nlm.nih.gov/mesh/MBrowser.html )
Mammalian Gene Collection. MGC is a project of the NIH to provide a complete set of full-length (open reading frame) sequences and cDNA clones of expressed genes for human and mouse. This program has been expanded to include isolation of a set of full-ORF rat clones.
Mouse Genome Database. MGD contains information on mouse genetic markers, molecular segments, phenotypes, comparative mapping data, experimental mapping data, and graphical displays for genetic, physical, and cytogenetic maps.
Mouse Genome Informatics. MGI houses a database that provides integrated access to data on the genetics, genomics, and biology of the laboratory mouse. (See http://www.informatics.jax.org/ )
Repetitive stretches of short sequences of DNA used as genetic markers to track inheritance in families (e.g., CC[TATATATA]CCCT). Also known as short tandem repeats (STRs).
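The bracketed example above can be located programmatically. The following is a minimal sketch, not a standard tool: the function name find_microsatellites and its thresholds (repeat units of 2 to 6 bases, at least 3 tandem copies) are assumptions chosen for illustration, and real STR callers use more careful criteria.

```python
import re


def find_microsatellites(seq, min_unit=2, max_unit=6, min_repeats=3):
    """Return (start, unit, copies) for each perfect tandem repeat in seq.

    The unit-length and copy-number thresholds are illustrative assumptions.
    """
    # A short unit (captured lazily, so the shortest unit wins) followed by
    # at least min_repeats - 1 further exact copies of that same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# The glossary example CC[TATATATA]CCCT with the brackets removed.
print(find_microsatellites("CCTATATATACCCT"))  # [(2, 'TA', 4)]
```

Positions are 0-based. Only perfect repeats are reported; interrupted or degenerate repeats would need a different approach.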
Mendelian Inheritance in Man. First published in 1966, Mendelian Inheritance in Man (MIM) is a genetic knowledge base that serves clinical medicine and biomedical research, including the Human Genome Project.
minimal tiling path
An ordered list or map that defines the minimal set of overlapping clones needed to provide complete coverage of a chromosome or other extended segment of DNA (compare with tiling path).
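One way to picture how such a path could be chosen is the classic greedy interval-cover strategy, sketched below. This is an illustration under stated assumptions, not the procedure used by any particular mapping project: the function name minimal_tiling_path, the (name, start, end) clone tuples, and the coordinates are all hypothetical, and clones that merely abut are accepted as contiguous.

```python
def minimal_tiling_path(clones, region_start, region_end):
    """Pick a smallest set of clones covering [region_start, region_end).

    clones is a list of (name, start, end) tuples with half-open coordinates.
    Returns the clone names in path order, or raises ValueError on a gap.
    """
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    covered = region_start  # everything left of this position is covered
    i = 0
    while covered < region_end:
        best = None
        # Of all clones starting at or before the covered point,
        # keep the one that extends furthest to the right.
        while i < len(ordered) and ordered[i][1] <= covered:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered:
            raise ValueError("gap in clone coverage at position %d" % covered)
        path.append(best[0])
        covered = best[2]
    return path


# Hypothetical clones on a 300-unit region.
clones = [("A", 0, 100), ("B", 50, 120), ("C", 90, 200),
          ("D", 150, 260), ("E", 190, 300)]
print(minimal_tiling_path(clones, 0, 300))  # ['A', 'C', 'E']
```

All five clones together form a tiling path; dropping B and D leaves the minimal one.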
Molecular Modeling Database. MMDB is a database of three-dimensional biomolecular structures derived from X-ray crystallography and nuclear magnetic resonance (NMR) spectroscopy.
Molecular Modeling Database Accession number.
messenger RNA. mRNA describes the section of a genomic DNA sequence that is transcribed, and can include the 5' untranslated region (5'UTR), coding region (CDS), and 3' untranslated region (3'UTR). Successful translation of the CDS section of an mRNA results in the synthesis of a protein.
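The three regions named above can be illustrated with a short sketch. It makes simplifying assumptions that are not part of the definition: the CDS is taken to begin at the first ATG and to end at the first in-frame stop codon, the standard genetic code is used, and the names split_mrna and translate are hypothetical. Real annotation pipelines rely on curated feature coordinates instead.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_mrna(seq):
    """Split a transcript into (5'UTR, CDS, 3'UTR) using a first-ATG rule."""
    seq = seq.upper().replace("U", "T")  # accept RNA or cDNA letters
    start = seq.find("ATG")
    if start == -1:
        return seq, "", ""  # no start codon: treat everything as untranslated
    for i in range(start, len(seq) - 2, 3):
        if CODON_TABLE.get(seq[i:i + 3]) == "*":
            return seq[:start], seq[start:i + 3], seq[i + 3:]
    return seq[:start], seq[start:], ""  # no in-frame stop codon found


def translate(cds):
    """Translate a CDS codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")  # X marks an unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


utr5, cds, utr3 = split_mrna("GGCACCATGGCCAAATAATTTGC")
print(utr5, cds, utr3)  # GGCACC ATGGCCAAATAA TTTGC
print(translate(cds))   # MAK
```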
A motif is a short, well-conserved nucleotide or amino acid sequence that represents a minimal functional domain. It is often a consensus for several aligned sequences.
A permanent structural alteration in DNA. In most cases, DNA changes either have no effect or cause harm, but occasionally a mutation can improve an organism's chance of surviving, and the beneficial change is passed on to the organism's descendants. Typically, mutations are rarer than polymorphisms in population samples because natural selection acts against variants of lower fitness and removes them from the population.
National Center for Biotechnology Information
The NCBI Toolkit contains supported software tools from the Information Engineering Branch (IEB) of the NCBI. It describes the three components of the ToolBox (data model, data encoding, and programming libraries) and provides access to documentation for the DataModel, C Toolkit, C++ Toolkit, NCBI C Toolkit Source Browser, XML Demo Program, XML DTDs, and the FTP site.
National Cancer Institute
NEXUS refers to a file format designed to contain data for processing by computer programs. NEXUS files should end with .nxs or .nex for purposes of clarity (Maddison et al., Syst Biol 46:590-621; 1997).
National Institutes of Health
National Library of Medicine
Nuclear Magnetic Resonance. NMR is a spectroscopic technique used for the determination of protein structure.
The terms "synonymous" and "non-synonymous" are used for SNPs that are in predicted protein coding regions (i.e., exons of genes). Non-synonymous SNPs are those whose alleles encode different amino acids; synonymous SNPs are those whose alleles encode the same amino acid.
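The distinction can be checked mechanically by translating the codon with each allele. The sketch below assumes the standard genetic code and a SNP described by its reference codon, a 0-based position within that codon, and the alternative base; the name classify_snp is hypothetical. A change to or from a stop codon is reported here as non-synonymous, although such changes are usually given their own label.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_snp(codon, position, alt_base):
    """Classify a coding SNP and report the amino acid for each allele.

    codon is the reference codon, position is 0, 1, or 2 within it,
    and alt_base is the alternative allele at that position.
    """
    codon = codon.upper()
    alt_codon = codon[:position] + alt_base.upper() + codon[position + 1:]
    ref_aa = CODON_TABLE[codon]
    alt_aa = CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return kind, ref_aa, alt_aa


print(classify_snp("GAA", 2, "G"))  # ('synonymous', 'E', 'E')
print(classify_snp("GAG", 1, "T"))  # ('non-synonymous', 'E', 'V')
```

In the second call, GAG to GTG replaces glutamate (E) with valine (V), so the two alleles encode different amino acids.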
Online Mendelian Inheritance in Man. OMIM is a directory of human genes and genetic disorders, with links to literature references, sequence records, maps, and related databases.
Orthologs are genes derived from a common ancestor through vertical descent. This is often stated as the same gene in different species. In contrast, paralogs are genes within the same genome that have evolved by duplication.
The hemoglobin genes are a good example. Two separate genes encode the two types of protein chain (alpha and beta) that make up the hemoglobin molecule. The alpha and beta DNA sequences are very similar, and it is believed that they arose from duplication of a single gene, followed by separate evolution of each sequence. Alpha and beta are considered paralogs. Alpha hemoglobins in different species are considered orthologs.