Bioinformatics. Group of authors
Figure 4.20 The Synteny view at Ensembl. (a) An overview of the syntenic blocks shared between human chromosome 12 and the mouse genome. The human chromosome is drawn in the middle of the display as a thick white box. The syntenic mouse chromosomes are represented by thinner white boxes along the side. The colored rectangles highlight regions of synteny between the human and mouse. A red outline illustrates the position of the PAH gene on the blue region of human chromosome 12 and on the blue region of mouse chromosome 10. (b) The Location tab for the PAH gene showing both the human and mouse syntenic regions. This is similar to the three-panel location tab shown in Figure 4.16, except that both the human and mouse genomes are depicted. The top panel (not shown) displays the full length human chromosome 12 and mouse chromosome 10. The second panel shows an overview of the genes in the region. The third panel focuses in on the PAH gene. Note that the regions in human and mouse appear to be presented in opposite orientations; in human, the PAH and IGF1 genes are both transcribed from right to left, while in mouse they are transcribed from left to right.
Figure 4.21 Ensembl BLAST output, showing an alignment between the human ADAM18 protein and the lizard genome translated in all six reading frames. On the BLAST/BLAT page at Ensembl, paste the FASTA-formatted sequence of human ADAM18, accession NP_001307242.1, into the Sequence data box. This sequence can be found at www.ncbi.nlm.nih.gov/protein/NP_001307242.1/?report=fasta. Select Genomic sequence from the anole lizard as the DNA database. On the results page, select the Alignment link next to the highest scoring hit in order to view the sequence alignment. The human protein sequence is on top, and the translated lizard genomic sequence is below. Lines indicate identical amino acids.
The Ensembl sequence data can also be queried via a BLAT or BLAST search by following the link at the top of any page. Earlier in this chapter, Figure 4.11 outlined how to use BLAT to look for a lizard homolog of the human ADAM18 gene. Ensembl data can be searched by the more sensitive BLAST algorithm, including the TBLASTN program that is used to compare a protein query with a nucleotide database translated in all six reading frames. Copy and paste the FASTA-formatted protein sequence of NCBI RefSeq NP_001307242.1 into the Sequence data box on the BLAST page and carry out a TBLASTN search against the anole lizard genomic sequence. The sequence alignment of the top hit is shown in Figure 4.21. The human protein query is on the top line, and the translated lizard genomic sequence on the second. The sequences share only 32% sequence identity, but the alignment spans 650 amino acids, and some key sequence features are conserved; note the alignment of almost every cysteine residue. Thus, this lizard genomic sequence is indeed a homolog of human ADAM18. The BLAST algorithm, although about two orders of magnitude slower than BLAT for the same query, is able to find a lizard ortholog of the human protein.
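The phrase "translated in all six reading frames" can be made concrete with a short sketch. The Python below is only an illustration of the idea, not the TBLASTN implementation: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_translations`) are hypothetical helpers introduced for this example.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), with codons enumerated
# in T, C, A, G order at each position. '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate DNA codon by codon; unknown codons (e.g. with N) become 'X'."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCCTAA").items():
        print(frame, protein)
```

A protein query is then compared against each of the six conceptual translations, which is why a TBLASTN search can detect similarity that is not apparent at the nucleotide level.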
Ensembl BioMart
The BioMart tool at Ensembl is akin to the Table Browser at UCSC, in that it provides a web-based interface through which to access the data underlying the Ensembl Genome Browser. Results are returned as text or HTML-formatted tables. Ensembl hosts several mart databases that are described in the online documentation. The Ensembl Genes database contains the Ensembl gene set and integrates Ensembl genes, transcripts, and proteins with a number of resources, including external references, protein domains, sequences, variants, and homology data. After choosing a Database (e.g. Genes) and Dataset (genome assembly, e.g. Homo sapiens), the user specifies the Filters (basically, the input data) and the Attributes (the output data). Users can choose from among seven types of filters, including Region and Gene. A Region could be a chromosomal position, while a Gene could be an accession number, gene name, or even microarray probeset. The list of possible Attributes is long, and includes Ensembl data such as gene and transcript identifiers and positions, links to external data sources including RefSeq, UCSC, Pfam (protein families), and Gene Ontology (GO) terms, as well as mapping to orthologs in the Ensembl genome databases.
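The Dataset, Filters, and Attributes chosen in the web interface can also be written down as an XML query. The sketch below only builds such a query string; it does not contact any server. The internal names used in the usage example (`hsapiens_gene_ensembl`, `chromosome_name`, `ensembl_gene_id`, `ensembl_transcript_id`, `external_gene_name`) and the `Query` element settings are assumptions that should be checked against the current BioMart interface, and `build_biomart_query` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart-style XML query string.

    dataset    -- internal dataset name (assumed, e.g. 'hsapiens_gene_ensembl')
    filters    -- dict of filter name -> value (a string or a list of strings)
    attributes -- list of attribute names to return as output columns
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",        # tab-separated text output
        "header": "1",             # include a header row
        "uniqueRows": "1",         # collapse duplicate rows
        "datasetConfigVersion": "0.6",
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})

    # Filters restrict the input (which genes or regions are considered).
    for name, value in filters.items():
        if not isinstance(value, str):
            value = ",".join(value)  # ID lists are passed comma-separated
        ET.SubElement(ds, "Filter", {"name": name, "value": value})

    # Attributes select the output columns.
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})

    prolog = '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>'
    return prolog + ET.tostring(query, encoding="unicode")


if __name__ == "__main__":
    # Assumed internal names; verify them in the BioMart interface before use.
    print(build_biomart_query(
        "hsapiens_gene_ensembl",
        {"chromosome_name": "12"},
        ["ensembl_gene_id", "ensembl_transcript_id", "external_gene_name"],
    ))
```

Keeping filters and attributes as separate arguments mirrors the distinction drawn in the text: filters describe the input, attributes describe the output.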
In this example, we will identify the mouse orthologs of the human mRNA reference sequences that are associated with common diseases or traits. To do this, we will start with the output of the UCSC Table Browser, the mRNA reference sequences that overlap with a variant from the GWAS Catalog, pull out the corresponding Ensembl gene and transcript identifiers, and then link to the mouse orthologs. The initial step is to retrieve the RefSeq accession numbers that overlap with a variant from the GWAS Catalog by reproducing the search shown in Figure 4.12d, this time changing the output format to sequence. Copy and paste the output from the Table Browser into your favorite text editor to create a list that contains only the accession numbers. Note that BioMart does not accept the accession.version format used by NCBI, so an accession number like NM_001042682.1 would need to be rewritten as NM_001042682.
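Removing the version suffix by hand is tedious for a long list, so it can be scripted. The following is a minimal sketch, assuming the list has already been reduced to one accession per line; the function name `strip_versions` and the accession pattern are illustrative choices, not part of any UCSC or Ensembl tool.

```python
import re

# Two uppercase letters, an underscore, digits, and an optional ".version"
# suffix, e.g. NM_001042682.1 (pattern is an assumption for this sketch).
ACCESSION_RE = re.compile(r"^([A-Z]{2}_\d+)(?:\.\d+)?$")


def strip_versions(lines):
    """Return unique accessions without version suffixes, in input order."""
    seen = set()
    result = []
    for line in lines:
        match = ACCESSION_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything that is not an accession
        accession = match.group(1)
        if accession not in seen:
            seen.add(accession)
            result.append(accession)
    return result


if __name__ == "__main__":
    for accession in strip_versions(["NM_001042682.1", "NM_001042682", ""]):
        print(accession)
```

The resulting list can be pasted directly into the BioMart filter box. Duplicates are dropped because different versions of the same accession collapse to a single identifier once the suffix is removed.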
At BioMart, the first step is to enter these accession numbers as Filters into the Human Genes (GRCh38.p10) Dataset. RefSeq mRNA accession numbers are entered in the filter called Gene → Input external references ID list (Figure 4.22a). The Attributes could be the Ensembl Gene and Transcript identifiers, as well as the Gene name, in the Features → Gene → Ensembl section (Figure 4.22b). To correlate the output with the RefSeq accession numbers entered as Filters, it is necessary to also select the RefSeq accession as an attribute, in the Features → Gene → External References