Bioinformatics. Group of authors

into three major parts: the header, containing information and descriptors pertaining to the entire record; the feature table, which provides relevant annotations to the sequence; and the sequence itself.

      The Header

      The header is the most database-specific part of the record. Here, we will use the ENA version of the record for discussion (shown in its entirety in Appendix 1.1), with the corresponding DDBJ and GenBank versions of the header appearing in Appendix 1.2. The first line of the record, appropriately named the ID line, provides basic identifying information about the sequence contained in the record; it corresponds to the LOCUS line in DDBJ/GenBank.

      ID U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP.
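As a minimal sketch, assuming the seven semicolon-separated fields seen in the ID line above, the line can be split into named parts in Python. The function name `parse_id_line` and the field labels are chosen here for illustration and do not come from any ENA tool.

```python
from typing import NamedTuple


class IdLine(NamedTuple):
    # Field labels are illustrative names for the seven ID-line fields.
    accession: str
    sequence_version: int
    topology: str
    molecule_type: str
    data_class: str
    taxonomic_division: str
    length_bp: int


def parse_id_line(line: str) -> IdLine:
    """Split an ENA-style ID line into its semicolon-separated fields."""
    line = line.strip()
    if not line.startswith("ID"):
        raise ValueError("not an ID line")
    # Drop the line code and the trailing period, then split on semicolons.
    fields = [f.strip() for f in line[2:].strip().rstrip(".").split(";")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, found {len(fields)}")
    accession, sv, topology, molecule, data_class, division, length = fields
    return IdLine(
        accession=accession,
        sequence_version=int(sv.split()[1]),   # "SV 1" -> 1
        topology=topology,
        molecule_type=molecule,
        data_class=data_class,
        taxonomic_division=division,
        length_bp=int(length.split()[0]),      # "2881 BP" -> 2881
    )


record = parse_id_line("ID U54469; SV 1; linear; genomic DNA; STD; INV; 2881 BP.")
print(record)
```

The `data_class` field holds one of the codes described in Box 1.1 (here `STD`), so a parsed record can be filtered by functional division.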

      Box 1.1 Functional Divisions in Nucleotide Databases

      The organization of nucleotide sequence records into discrete functional types provides a way for users to query specific subsets of the records within these databases. In addition, knowledge that a particular sequence is from a given technique-oriented database allows users to interpret the data from the proper biological point of view. Several of these divisions are described below, and examples of each of these functional divisions (called “data classes” by ENA) can be found by following the example links on the ENA Data Formats page listed in the Internet Resources section of this chapter.

CON Constructed (or “contigged”) records of chromosomes, genomes, and other long DNA sequences resulting from whole-genome sequencing efforts. The records in this division do not contain sequence data; rather, they contain instructions for the assembly of sequence data found within multiple database records.
EST Expressed Sequence Tags. These records contain short (300–500 bp) single reads from mRNA (cDNA) that are usually produced in large numbers. ESTs represent a snapshot of what is expressed in a given tissue or at a given developmental stage. They represent tags – some coding, some not – of expression for a given cDNA library.
GSS Genome Survey Sequences. Similar to the EST division, except that the sequences are genomic in origin. The GSS division contains (but is not limited to) single-pass read genome survey sequences, bacterial artificial chromosome (BAC) or yeast artificial chromosome (YAC) ends, exon-trapped genomic sequences, and Alu polymerase chain reaction (PCR) sequences.
HTG High-Throughput Genome sequences. Unfinished DNA sequences generated by high-throughput sequencing centers, made available in an expedited fashion to the scientific community for homology and similarity searches. Entries in this division contain keywords indicating their phase within the sequencing process. Once finished, HTG sequences are moved into the appropriate database taxonomic division.
STD A record containing a standard, annotated, and assembled sequence.
STS Sequence-Tagged Sites. Short (200–500 bp) operationally unique sequences that identify a combination of primer pairs used in a PCR assay, generating a reagent that maps to a single position within the genome. The STS division is intended to facilitate cross-comparison of STSs with sequences in other divisions for the purpose of correlating map positions of anonymous sequences with known genes.
WGS Whole-Genome Shotgun sequences. Sequence data from projects using shotgun approaches that generate large numbers of short sequence reads that can then be assembled by computer algorithms into sequence contigs, higher-order scaffolds, and sometimes into near-chromosome- or chromosome-length sequences.

       DT 19-MAY-1996 (Rel. 47, Created)
       DT 23-JUN-2017 (Rel. 133, Last updated, Version 5)

      The release number in each line indicates the first quarterly release made after the entry was created or last updated. The version number for the entry appears on the second line and allows the user to determine easily whether they are looking at the most up-to-date record for a particular sequence. Please note that this is different from the accession.version format described above – while some element of the record may have changed, the sequence may have remained the same, so these two different types of version numbers may not always correspond to one another.
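To make the distinction between the two kinds of version numbers concrete, here is a hedged sketch that extracts the date, release number, and entry version from a DT line. It assumes the layout of the two DT lines shown in the example record; the function name `parse_dt_line` and the dictionary keys are illustrative.

```python
import re
from datetime import date

# Month abbreviations mapped by hand to avoid locale-dependent date parsing.
MONTHS = {name: i for i, name in enumerate(
    ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
     "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"], start=1)}

DT_PATTERN = re.compile(
    r"^DT\s+(?P<date>\d{2}-[A-Z]{3}-\d{4})\s+"
    r"\(Rel\. (?P<release>\d+), (?P<event>Created|Last updated)"
    r"(?:, Version (?P<version>\d+))?\)$"
)


def parse_dt_line(line: str) -> dict:
    """Return the date, release, event, and entry version of one DT line."""
    match = DT_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized DT line: {line!r}")
    day, month, year = match["date"].split("-")
    version = match["version"]
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "release": int(match["release"]),
        "event": match["event"],
        # Only the "Last updated" line carries an entry version.
        "entry_version": int(version) if version is not None else None,
    }


created = parse_dt_line("DT 19-MAY-1996 (Rel. 47, Created)")
updated = parse_dt_line("DT 23-JUN-2017 (Rel. 133, Last updated, Version 5)")
print(created)
print(updated)
```

For the example record, the entry version parsed here is 5, while the ID line reports sequence version 1 (`SV 1`): the annotation has been revised several times without any change to the sequence itself.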

      The next part of the header contains the definition lines, providing a succinct description of the kinds of biological information contained within the record. The definition line (DE in ENA, DEFINITION in DDBJ/GenBank) takes the following form.

       DE Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,
       DE complete cds, alternatively spliced.

      Much care is taken in the generation of these definition lines and, although many of them can be generated automatically from other parts of the record, they are reviewed to ensure that consistency and richness of information are maintained. Obviously, it is quite impossible to capture all of the biology underlying a sequence in a single line of text, but that wealth of information will follow soon enough in downstream parts of the same record.
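Because a definition can wrap across several DE lines in the flatfile, a reader of the file has to rejoin them. The following is a small sketch of that step, assuming each continuation line repeats the two-letter line code; `join_continuation_lines` is a hypothetical helper name.

```python
def join_continuation_lines(lines, code="DE"):
    """Concatenate the text of all lines that carry the given line code."""
    parts = []
    for line in lines:
        # The line code is everything before the first space.
        tag, _, text = line.strip().partition(" ")
        if tag == code:
            parts.append(text.strip())
    return " ".join(parts)


de_lines = [
    "DE Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene,",
    "DE complete cds, alternatively spliced.",
]
print(join_continuation_lines(de_lines))
```

The same helper works for any other line code that wraps in this way, by passing a different `code` argument.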

      Continuing down the flatfile record, one finds the full taxonomic information on the sequence of interest. The OS line (or SOURCE line in DDBJ/GenBank) provides the preferred scientific name of the organism from which the sequence was derived, followed by the common name of the organism in parentheses. The OC lines (or ORGANISM lines in DDBJ/GenBank) contain the complete taxonomic classification of the source organism. The classification is listed top-down, as nodes in a taxonomic tree, with the most general grouping (Eukaryota) given first.

       OS Drosophila melanogaster (fruit fly)
       OC Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;
       OC Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;
       OC Drosophilidae; Drosophila; Sophophora.
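The OS and OC lines can be read into a scientific name, a common name, and an ordered lineage. This is a sketch only, assuming one line code per line, a common name in trailing parentheses, and semicolon-separated taxa ending in a period; `parse_taxonomy` and the returned keys are illustrative names.

```python
def parse_taxonomy(lines):
    """Collect organism names from OS and the top-down lineage from OC lines."""
    scientific_name = None
    common_name = None
    lineage = []
    for line in lines:
        tag, _, text = line.strip().partition(" ")
        text = text.strip()
        if tag == "OS":
            if text.endswith(")") and "(" in text:
                # Split off the trailing "(common name)" part.
                name, _, common = text[:-1].rpartition("(")
                scientific_name, common_name = name.strip(), common.strip()
            else:
                scientific_name = text
        elif tag == "OC":
            # Taxa are separated by semicolons; the last OC line ends with a period.
            lineage.extend(
                taxon.strip()
                for taxon in text.rstrip(".").split(";")
                if taxon.strip()
            )
    return {
        "scientific_name": scientific_name,
        "common_name": common_name,
        "lineage": lineage,  # most general grouping first
    }


taxonomy = parse_taxonomy([
    "OS Drosophila melanogaster (fruit fly)",
    "OC Eukaryota; Metazoa; Ecdysozoa; Arthropoda; Hexapoda; Insecta; Pterygota;",
    "OC Neoptera; Holometabola; Diptera; Brachycera; Muscomorpha; Ephydroidea;",
    "OC Drosophilidae; Drosophila; Sophophora.",
])
print(taxonomy["scientific_name"], "|", " > ".join(taxonomy["lineage"]))
```

Keeping the lineage as an ordered list preserves the top-down structure of the classification, so the first element is always the most general grouping.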

      Each record must have at least one reference or citation, noted within what are called reference blocks. These reference blocks offer scientific credit and set a context explaining why this particular sequence was determined. The reference blocks take the following form.

       RN [1]
       RP 1-2881
       RX DOI; .1074/jbc.271.27.16393.
       RX PUBMED; 8663200.
       RA Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;
       RT "Alternatively spliced transcripts from the Drosophila eIF4E gene produce
       RT two different Cap-binding proteins";
       RL J Biol Chem 271(27):16393-16398(1996).
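A reference block like the one above can be collected into a simple dictionary keyed by the role of each line code. This is a rough sketch under the assumption that each line code (RN, RP, RX, RA, RT, RL) starts its own line, as in the flatfile; `parse_reference_block` and the key names are hypothetical.

```python
def parse_reference_block(lines):
    """Gather one reference block into a dictionary of its components."""
    ref = {"number": None, "positions": None, "cross_refs": {}, "location": None}
    author_parts, title_parts = [], []
    for line in lines:
        tag, _, text = line.strip().partition(" ")
        text = text.strip()
        if tag == "RN":
            ref["number"] = int(text.strip("[]"))       # "[1]" -> 1
        elif tag == "RP":
            ref["positions"] = text                      # e.g. "1-2881"
        elif tag == "RX":
            database, _, identifier = text.partition(";")
            ref["cross_refs"][database.strip()] = identifier.strip().rstrip(".")
        elif tag == "RA":
            author_parts.append(text)
        elif tag == "RT":
            title_parts.append(text)
        elif tag == "RL":
            ref["location"] = text
    # Authors are comma-separated and the list ends with a semicolon.
    authors = " ".join(author_parts).rstrip(";")
    ref["authors"] = [a.strip() for a in authors.split(",") if a.strip()]
    # The title is quoted, may wrap across RT lines, and ends with a semicolon.
    ref["title"] = " ".join(title_parts).rstrip(";").strip('"')
    return ref


reference = parse_reference_block([
    "RN [1]",
    "RP 1-2881",
    "RX PUBMED; 8663200.",
    "RA Lavoie C.A., Lachance P.E., Sonenberg N., Lasko P.;",
    'RT "Alternatively spliced transcripts from the Drosophila eIF4E gene produce',
    'RT two different Cap-binding proteins";',
    "RL J Biol Chem 271(27):16393-16398(1996).",
])
print(reference["authors"])
print(reference["title"])
```

The RP value is kept as plain text here; it gives the range of sequence positions the citation covers, which in the example spans the whole 2881 bp record.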
