Bioinformatics. Group of authors


FT   CDS             join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)
FT                   /codon_start=1
FT                   /gene="eIF4E"
FT                   /product="eukaryotic initiation factor 4E-II"
FT                   /note="Method: conceptual translation with partial peptide
FT                   sequencing"
FT                   /db_xref="GOA:P48598"
FT                   /db_xref="InterPro:IPR001040"
FT                   /db_xref="InterPro:IPR019770"
FT                   /db_xref="InterPro:IPR023398"
FT                   /db_xref="PDB:4AXG"
FT                   /db_xref="PDB:4UE8"
FT                   /db_xref="PDB:4UE9"
FT                   /db_xref="PDB:4UEA"
FT                   /db_xref="PDB:4UEB"
FT                   /db_xref="PDB:4UEC"
FT                   /db_xref="PDB:5ABU"
FT                   /db_xref="PDB:5ABV"
FT                   /db_xref="PDB:5T47"
FT                   /db_xref="PDB:5T48"
FT                   /db_xref="UniProtKB/Swiss-Prot:P48598"
FT                   /protein_id="AAC03524.1"
FT                   /translation="MVVLETEKTSAPSTEQGRPEPPTSAAAPAEAKDVKPKEDPQETGE
FT                   PAGNTATTTAPAGDDAVRTEHLYKHPLMNVWTLWYLENDRSKSWEDMQNEITSFDTVED
FT                   FWSLYNHIKPPSEIKLGSDYSLFKKNIRPMWEDAANKQGGRWVITLNKSSKTDLDNLWL
FT                   DVLLCLIGEAFDHSDQICGAVINIRGKSNKISIWTADGNNEEAALEIGHKLRDALRLGR
FT                   NNSLQYQLHKDTMVKQGSNVKSIYTL"

      Following the mRNA feature is the CDS feature shown above, describing the region that ultimately encodes the protein product. Focusing just on eukaryotic initiation factor 4E-II, the CDS feature also shows a join line with coordinates that are slightly different from those shown in the mRNA feature, specifically at the beginning and end positions. The difference lies in the fact that the 5′ and 3′ untranslated regions (UTRs) are included in the mRNA feature but not in the CDS feature. The CDS feature corresponds to the sequence of amino acids found in the translated protein product whose sequence is shown in the /translation qualifier above. The /codon_start qualifier indicates that the amino acid translation of the first codon begins at the first position of this joined region, with no offset.
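      The arithmetic behind the join line can be checked with a short sketch. The Python below is illustrative only: `parse_join`, `spliced_length`, and `extract_cds` are hypothetical helper names, not part of any ENA or GenBank tool, and the parser assumes the simple forward-strand join(...) form shown above (no complement(), partial-end markers, or remote references).

```python
import re

# Location string copied from the CDS feature shown above.
CDS_LOCATION = "join(201..224,1550..1920,1986..2085,2317..2404,2466..2629)"


def parse_join(location):
    """Return 1-based, inclusive (start, end) pairs from a simple join(...) string."""
    return [(int(start), int(end))
            for start, end in re.findall(r"(\d+)\.\.(\d+)", location)]


def spliced_length(segments):
    """Total number of bases covered by the joined segments."""
    return sum(end - start + 1 for start, end in segments)


def extract_cds(sequence, segments, codon_start=1):
    """Concatenate the segments of `sequence`, then apply the /codon_start offset.

    Coordinates are 1-based and inclusive, as in the feature table, so they are
    shifted by one for Python slicing.
    """
    spliced = "".join(sequence[start - 1:end] for start, end in segments)
    return spliced[codon_start - 1:]


segments = parse_join(CDS_LOCATION)
cds_length = spliced_length(segments)   # bases in the joined coding region
codons = cds_length // 3                # whole codons in that region
```

      For this record the five segments add up to 747 bases, or 249 codons: the 248 residues listed in the /translation qualifier plus, presumably, a terminal stop codon.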

      The /protein_id qualifier shows the accession number for the corresponding entry in the protein databases (AAC03524.1) and is hyperlinked, enabling the user to go directly to that entry. These unique identifiers use a “3 + 5” format: three letters followed by five digits. Versions are indicated by the number after the decimal point; when the protein sequence in the record changes, the version is incremented by one. The assignment of a gene product or protein name (via the /product qualifier) is often subjective, sometimes being based on weak similarities to other (and sometimes poorly annotated) sequences. Given the potential for the transitive propagation of poor annotations (that is, bad data tend to beget more bad data), users are advised to consult curated nucleotide and protein sequence databases for the most up-to-date, accurate information regarding the putative function of a given sequence. Finally, notice the extensive cross-referencing via the /db_xref qualifier to entries in InterPro, the Protein Data Bank (PDB), and UniProtKB/Swiss-Prot, as well as to a Gene Ontology annotation (GOA; Gene Ontology Consortium 2017).
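      As a small illustration of the accession.version convention, the sketch below splits a /protein_id value into its two parts. The name `split_protein_id` is hypothetical, and the pattern deliberately covers only the “3 + 5” format described above; identifiers laid out differently are rejected rather than guessed at.

```python
import re

# Three uppercase letters, five digits, a dot, then the version number.
# Assumption: only the "3 + 5" layout described in the text is accepted.
PROTEIN_ID_PATTERN = re.compile(r"^([A-Z]{3}\d{5})\.(\d+)$")


def split_protein_id(protein_id):
    """Return (accession, version) for an identifier such as 'AAC03524.1'."""
    match = PROTEIN_ID_PATTERN.match(protein_id)
    if match is None:
        raise ValueError(f"not a 3 + 5 accession.version identifier: {protein_id!r}")
    return match.group(1), int(match.group(2))


accession, version = split_protein_id("AAC03524.1")
```

      Keeping the version as an integer makes it easy to compare two records that share an accession and decide which one carries the more recent sequence.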

SQ   Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;
     cggttgcttg ggttttataa catcagtcag tgacaggcat ttccagagtt gccctgttca        60
     acaatcgata gctgcctttg gccaccaaaa tcccaaactt aattaaagaa ttaaataatt       120
     cgaataataa ttaagcccag taacctacgc agcttgagtg cgtaaccgat atctagtata       180
     .
     .<truncated for brevity>
     .
     aaacggaacc ccctttgtta tcaaaaatcg gcataatata aaatctatcc gctttttgta      2820
     gtcactgtca ataatggatt agacggaaaa gtatattaat aaaaacctac attaaaaccg      2880
     g                                                                      2881
//

      Finally, at the end of every nucleotide sequence record, one finds the actual nucleotide sequence, with 60 bases per row. Note that, in the SQ line signaling the beginning of this section of the record, not only is the overall length of the sequence provided, but a count of how many of each individual type of nucleotide base is also provided, making it quite easy to compute the GC content of this sequence.
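      The GC computation mentioned above takes only a few lines. This is a minimal sketch: `parse_sq_line` is a hypothetical helper, and it assumes the SQ line has exactly the layout shown in this record.

```python
import re

# SQ line copied from the record shown above.
SQ_LINE = "SQ   Sequence 2881 BP; 849 A; 699 C; 585 G; 748 T; 0 other;"


def parse_sq_line(line):
    """Return (sequence_length, {base: count}) from an SQ line in this layout."""
    length = int(re.search(r"Sequence\s+(\d+)\s+BP", line).group(1))
    counts = {label: int(number)
              for number, label in re.findall(r"(\d+)\s+(A|C|G|T|other);", line)}
    return length, counts


length, counts = parse_sq_line(SQ_LINE)

# GC content as a fraction of the unambiguous bases (A, C, G, T).
unambiguous = counts["A"] + counts["C"] + counts["G"] + counts["T"]
gc_content = (counts["G"] + counts["C"]) / unambiguous
```

      For this record, G + C = 585 + 699 = 1284 of 2881 bases, a GC content of roughly 44.6%. Because the "other" count is zero here, dividing by the total length gives the same result.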

      Graphical Interfaces


Snapshot of the landing page for ENA record U54469.1, which provides a graphical view of biological features found within the sequence of the Drosophila melanogaster eukaryotic initiation factor 4E (eIF4E) gene.

      Box 1.2 RefSeq

      To address these issues, NCBI developed the RefSeq project, the major goal of which is to provide a reference sequence for each molecule in the central dogma (DNA, mRNA, and protein). As each biological entity is represented only once, RefSeq is, by definition, non-redundant. Nucleotide and protein sequences in RefSeq are explicitly linked to one another. Most importantly, RefSeq entries undergo ongoing curation, assuring that the RefSeq entry represents the most up-to-date state of knowledge regarding a particular DNA, mRNA, or protein sequence.

      RefSeq entries are distinguished from other entries in GenBank through the use of a distinct accession number series. RefSeq accession numbers follow a “2 + 6” format: a two-letter code indicating the type of
