Bioinformatics. Group of authors


in Swiss-Prot. The asterisks in each row indicate the expected, random distribution of hits. The inset is a magnified version of the histogram in that region.

Snapshot depicts the hit list for the protein–protein FASTA search.

       FASTA begins the search by looking for exact matches of words, while BLAST allows for conservative substitutions in the first step.

       BLAST allows for automatic masking of sequences, while FASTA does not.

       FASTA will return one and only one alignment for a sequence in the hit list, while BLAST can return multiple results for the same sequence, each result representing a distinct HSP.

       Since FASTA uses a version of the more rigorous Smith–Waterman alignment method, it generally produces better final alignments and is more apt to find distantly related sequences than BLAST. For highly similar sequences, their performance is fairly similar.

       When comparing translated DNA sequences with protein sequences or vice versa, FASTA (specifically, FASTX/FASTY for translated DNA → protein and TFASTX/TFASTY for protein → translated DNA) allows for frameshifts.

       BLAST runs faster than FASTA, since FASTA is more computationally intensive.
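The first difference in the list above, exact word matches versus conservative substitutions, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the residue groups and the scores in `toy_score` are a toy scheme (not BLOSUM62 or any real matrix), and `exact_seeds` and `neighborhood_seeds` are hypothetical helpers, not functions from either program. The brute-force comparison stands in for the precomputed word lookup that real implementations use.

```python
# Toy conservative-substitution groups. This is an assumption for
# illustration only, not a real scoring matrix such as BLOSUM62.
GROUPS = ["ILVM", "FYW", "KRH", "DE", "ST", "NQ", "AG"]


def toy_score(a, b):
    """Score one residue pair: identity > conservative change > mismatch."""
    if a == b:
        return 4
    if any(a in g and b in g for g in GROUPS):
        return 1
    return -2


def words(seq, k):
    """Return (position, word) pairs for every overlapping k-letter word."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def exact_seeds(query, subject, k=2):
    """FASTA-style first step: positions where identical words occur."""
    index = {}
    for j, w in words(subject, k):
        index.setdefault(w, []).append(j)
    return [(i, j) for i, w in words(query, k) for j in index.get(w, [])]


def neighborhood_seeds(query, subject, k=3, threshold=9):
    """BLAST-style first step: word pairs scoring at least `threshold`.

    With the toy scores, an identical 3-letter word scores 12 and a word
    with one conservative substitution scores 9, so threshold=9 admits
    exactly one conservative change per word.
    """
    seeds = []
    for i, qw in words(query, k):
        for j, sw in words(subject, k):
            score = sum(toy_score(a, b) for a, b in zip(qw, sw))
            if score >= threshold:
                seeds.append((i, j))
    return seeds


query, subject = "MKVLA", "MRILA"
print(exact_seeds(query, subject, k=3))         # no identical 3-letter word
print(neighborhood_seeds(query, subject, k=3))  # VLA vs ILA passes the threshold
```

With these two toy sequences, requiring identical three-letter words yields no seed at all, while the thresholded comparison seeds at the VLA/ILA pair because V and I fall in the same toy group. Shortening the exact word to two letters (`k=2`) recovers a seed at the shared word LA.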

      Several studies have attempted to answer the “which method is better” question by performing systematic analyses with test datasets (Pearson 1995; Agarwal and States 1998; Chen 2003). In one such study, Brenner et al. (1998) performed tests using a dataset derived from already known homologies documented in the Structural Classification of Proteins database (SCOP; Chapter 12). They found that FASTA performed better than BLAST in finding relationships between proteins having >30% sequence identity, and that the performance of all methods declines below 30%. Importantly, while the statistical values reported by BLAST slightly underestimated the true extent of errors when looking for known relationships, they found that BLAST and FASTA (with ktup = 2) were both able to detect most known relationships, calling them both “appropriate for rapid initial searches.”
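      The 30% threshold mentioned above refers to percent sequence identity between two aligned proteins. As a minimal sketch, identity can be computed from a pairwise alignment as shown below. The function name `percent_identity` is hypothetical, and the choice of denominator (aligned columns in which neither sequence has a gap) is an assumption; other conventions exist, and the text does not state which one the cited study used.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns where neither sequence has a gap.

    Both arguments are rows of the same pairwise alignment, so they must
    have equal length. The denominator convention is an assumption.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


# Toy alignment: five ungapped columns, four of them identical.
print(percent_identity("MKV-LA", "MRVTLA"))  # 80.0
```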

      Internet Resources

BLAST
    European Bioinformatics Institute (EBI): www.ebi.ac.uk/blastall
    National Center for Biotechnology Information (NCBI): blast.ncbi.nlm.nih.gov
BLAST-Like Alignment Tool (BLAT): genome.ucsc.edu/cgi-bin/hgBlat
NCBI Conserved Domain Database (CDD): ncbi.nlm.nih.gov/cdd
Cancer Genome Anatomy Project (CGAP): ocg.cancer.gov/programs/cgap
FASTA
    EBI: www.ebi.ac.uk/Tools/sss/fasta
    University of Virginia: fasta.bioch.virginia.edu
RefSeq: ncbi.nlm.nih.gov/refseq
Structural Classification of Proteins (SCOP): scop.berkeley.edu
Swiss-Prot: www.uniprot.org

