Preface Introduction

The field of bioinformatics has come into full view recently, primarily because of the significant advances made by the Human Genome Project and other systematic sequencing projects, and the necessity for all biologists to be able to apply at some level these techniques to their own research. It may come as a surprise to most readers that the origins of the field of bioinformatics go well back into the 1960s, with the pioneering work performed by Margaret Dayhoff and her colleagues, who looked...

Suggestions for Further Analysis

The authors are investigating a number of extensions to geneid that are not discussed above. 1. Incorporating homology information into the gene predictions. For instance, such information can be obtained by comparing the query sequence against a database of known amino acid sequences using BLASTX (Altschul et al., 1990) or FASTA (Pearson, 1990). Processed database search results can already be passed to geneid via the -S option. The authors have chosen here not to discuss this option...

Accuracy of geneid Specificity Versus Sensitivity

As discussed above, most gene finders suffer from lack of specificity, predicting a large number of false-positive exons and genes, particularly in large genomic sequences. The authors believe that, comparatively, geneid has superior specificity to other existing gene finders, showing a somewhat more conservative behavior. The price is paid in terms of sensitivity. geneid v1.1 may miss more real exons than other gene finders. This is particularly true for short exons. Compared to other...

Commentary

Entrez, to be clear, is not a database itself but rather the interface through which all of its component databases can be accessed and traversed. The Entrez information space includes PubMed records, nucleotide and protein sequence data, 3-D structure information, and mapping information. The strength of Entrez lies in the fact that all of this information can be accessed by issuing one and only one query. Entrez offers integrated information retrieval through the use of two types of...

cDNA Microarrays

Initial approaches, pioneered by the groups of Jeffrey Trent at the National Human Genome Research Institute and Patrick Brown at Stanford University (Schena et al., 1995; DeRisi et al., 1996; Shalon et al., 1996), involved whole or partial cDNAs (>200 bp) arrayed into microtiter trays. A custom-made spotting robot spots the cDNAs from the trays onto glass slides. The spotting robot is capable of creating spots from mere nanoliters of fluid, allowing for high spot densities. The first and...

The Gene Model

From a large number of candidate exons, geneid selects a proper combination of exons to assemble the predicted gene structure. This assembly must conform to a number of biological constraints, for example, that selected exons cannot overlap, or that an Open Reading Frame (ORF) should be maintained along the assembled gene. These biological constraints are defined as a set of rules in the so-called gene model, included within the parameter file. These rules refer to the order of gene features in...

Figure 4.3.12 geneid Web server output with the sequence example1.fa. From Current Protocols in Bioinformatics Online. Copyright 2002 John Wiley & Sons, Inc. All rights reserved.

Figure 4.3.13 geneid Default Gene Model. Gene.Model...

Figure 1.3.11 Entries resulting from the combination of two individual Entrez queries. The command producing the results is shown in the text box near the top of the window. The information on the individual queries that were combined is given in Figure 1.3.9. See text for details.

Literature Cited

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25:3389-3402.
Burset, M. and Guigo, R. 1996. Evaluation of gene structure prediction programs. Genomics 34:353-367.
Chothia, C. and Lesk, A.M. 1986. The relation between the divergence of sequence and structure in proteins. EMBO J. 5:823-826.
Claverie, J.M. 1997a. Computational methods for the...

Necessary Resources

An up-to-date Web browser, such as Netscape Communicator or Internet Explorer. Please note that there is an alternative implementation to the Web-based version of Entrez, called Network Entrez. This is the fastest of the Entrez programs in that it makes a direct connection to an NCBI dispatcher. The graphical user interface features a series of windows, and each time a new piece of information is requested, a new window appears on the user's screen. Since the client software resides on the...

Translated BLAST searches

While the complete sequence of the human genome (Lander et al., 2001; Venter et al., 2001), along with those of a number of other organisms, has become known, only fractions of the proteome have been identified. The reason for this discrepancy is that sequencing DNA or complementary DNA (cDNA) is far more productive and economically much more attractive (Venter et al., 2001) than sequencing proteins. Until the proteome is determined, protein and gene discovery, as a rule, are confined to start from...

Alternate Protocol Combine Entrez Queries

There is another way to perform an Entrez query, involving some built-in features of the system. Consider an example in which the user is attempting to find all genes coding for DNA-binding proteins in methanobacteria. Although this example is for the nucleotide database, the general strategy works equally well for other Entrez databases. An up-to-date Web browser, such as Netscape Communicator or Internet Explorer. 1. Open a Web browser and go to the Entrez Web page (http://www.ncbi.nlm.nih.gov...
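The idea of combining individual queries can be sketched in a few lines of Python. This is only an illustration of Boolean query composition, not a description of how Entrez processes queries internally. The function name `combine_queries` is hypothetical, and the `[Organism]` field tag is used here as an assumed example of field restriction.

```python
def combine_queries(terms, operator="AND"):
    """Join individual search terms with a Boolean operator.

    Each term is parenthesized so that the combined expression keeps the
    intended grouping. The function name and interface are illustrative.
    """
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported Boolean operator: {operator}")
    return f" {operator} ".join(f"({term})" for term in terms)


# Two individual queries, combined into one search expression.
# The field tag is an assumption made for the example.
query = combine_queries(["DNA-binding", "methanobacteria[Organism]"])
print(query)
```

The resulting string can be pasted into the query text box; the same helper works for OR and NOT combinations.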

Basic Protocol 1 Framesearch Using A Nucleic Acid Query Sequence

This protocol describes the use of Framesearch in the GCG Wisconsin Package environment to search a protein sequence database for sequences that are similar to a query nucleotide sequence. Any user familiar with the GCG Package will find using Framesearch in that environment straightforward. Framesearch has recently been added to the algorithms supported by the SeqWeb version of the GCG Package (Accelrys, 2001), so users who prefer a Web-based interface may find it simpler to run Framesearch...

Basic Protocol 3 Using External Information To Solidify geneid Predictions

One of the strengths of geneid is that it can easily incorporate external information about gene features on the input query sequence in the final gene prediction. As human genomic sequences are being annotated with increasing reliability, this option may be useful, e.g., to analyze in detail apparently void genomic regions lying between known genes, to explore the possibility of alternative exons in known genes with well established constitutive exonic structure, or to extend gene predictions...

Figure 2.2.4 Sequence logos for the IPB001525 blocks (panels include the PSSMs of IPB001525A, IPB001525C, IPB001525D, and IPB001525F; C5_DNA_meth), showing the multiple alignments...

The real use of marginal and transitive similarities

Fortunately, the ultimate objective for most researchers is not to establish the statistically significant similarity between two or more sequences. Instead, the final goal is the prediction of the exon structure of the gene, biological function, or the three-dimensional structure of the protein. In this quest, sequence similarity can be a sufficient indicator, but not a necessary one. Profile-like methods (UNIT 2.3), analysis of domains (UNIT 2.2), structural predictions and comparisons, gene...

Internet Resources

Papers and other related information for MZEF.

Pathway Function Prediction

The simplest thing to do is to take the set of genes of interest and see if anything is known about them. If there are two sets of genes (say one set determined from expression clustering, and another of genes known to be ribosomal proteins), a statistical test can be performed to see if they overlap more than expected. This will produce a p value, which is the probability that an overlap at least this large would arise by chance alone (i.e., that the apparent association is a false positive). If there are n genes in the genome, and pX are known to have property X,...
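One common way to compute such an overlap p value is the hypergeometric tail probability. The sketch below is a minimal illustration under that assumption; the source does not state which test is used, and all parameter names are hypothetical.

```python
from math import comb


def overlap_p_value(n_genome, n_with_property, n_selected, n_overlap):
    """Probability of an overlap of at least n_overlap genes by chance.

    Assumes n_selected genes are drawn at random, without replacement,
    from a genome of n_genome genes, n_with_property of which have the
    property of interest (hypergeometric model).
    """
    upper = min(n_with_property, n_selected)
    total = comb(n_genome, n_selected)
    tail = sum(
        comb(n_with_property, k)
        * comb(n_genome - n_with_property, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 6000 genes, 120 with the property,
# a cluster of 50 genes, 10 of which have the property.
print(overlap_p_value(6000, 120, 50, 10))
```

A small value indicates that the cluster is enriched for the property beyond what random sampling would be expected to produce.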

Background Information Foundation and assumptions

The basis of GlimmerM is a dynamic-programming algorithm (UNIT 3.1) that considers all combinations of possible exons for inclusion in a gene model, and chooses the best of all these combinations. The possible exon-intron combinations are formed after an initial screening of the possible translational start sites and splice sites found in the genome. Both these entities are determined with specially designed modules based primarily on Markov chains. Markov models have been in use for decades as...
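The principle of choosing the best combination of candidate exons by dynamic programming can be sketched as follows. This is a deliberately simplified illustration, not GlimmerM's algorithm: it assumes that each candidate exon already has an additive score, and it enforces only the constraint that chosen exons do not overlap (reading frame, splice-site compatibility, and exon types are ignored). All names are hypothetical.

```python
from bisect import bisect_left


def best_exon_chain(exons):
    """Return (best_total_score, chain) over non-overlapping exons.

    exons: list of (start, end, score) tuples with inclusive coordinates.
    """
    exons = sorted(exons, key=lambda e: e[1])  # sort by end coordinate
    ends = [e[1] for e in exons]
    # best[i] holds the optimal (score, chain) using only the first i exons.
    best = [(0.0, [])]
    for i, (start, end, score) in enumerate(exons):
        # j = number of earlier exons that end strictly before this one starts.
        j = bisect_left(ends, start, 0, i)
        with_current = best[j][0] + score
        if with_current > best[i][0]:
            best.append((with_current, best[j][1] + [(start, end, score)]))
        else:
            best.append(best[i])
    return best[-1]


candidates = [(1, 100, 5.0), (50, 150, 7.0), (120, 200, 4.0), (160, 250, 6.0)]
total, chain = best_exon_chain(candidates)
print(total, chain)
```

Because every prefix of the sorted candidate list is solved once and reused, all combinations are considered implicitly without enumerating them.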

geneid runs correctly but stops with a warning before producing any prediction

The following error message will appear: "Too many predicted sites Change RSITES parameter", or a similar message concerning exon types. In order to minimize memory usage, geneid makes a guess on the maximum number of sites and exons that will be predicted in a given sequence fragment. While for most sequences the guess is correct, in some (particularly anomalous) genomic sequences these numbers are much higher than guessed. The user will need to change the parameters that control how these...

Select format options for the show parameter

Check the Graphical Overview box if desired. This color-coded display of the alignments as horizontal bars is a practical feature for the visualization of polyproteins or multidomain proteins (Fig. 3.4.6). The color of a bar indicates a range of expected frequencies. Moving the cursor over a bar shows the definition and score in the window at the top. Single-clicking on a bar takes the user to the corresponding alignment. 3. Check the NCBI-gi box if desired. By selecting this argument, the NCBI...

Background Information Newick tree format

The Newick tree format (sometimes also referred to as the New Hampshire format) is a widely used standard for describing phylogenetic trees. It was developed by James Archie, William H.E. Day, Joe Felsenstein, Wayne Maddison, Christopher Meacham, F. James Rohlf, and David Swofford in 1986 (the name comes from the lobster restaurant at which the format was agreed upon). Typically a phylogenetic program will store one or more Newick tree descriptions in a file. More elaborate file formats, such...
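To make the format concrete, the following sketch parses a simple Newick description such as `((A:0.1,B:0.2)AB:0.3,C:0.4);` into nested dictionaries. It is a minimal reader written for illustration: it assumes unquoted labels and no bracketed comments, and the function names are hypothetical rather than taken from any phylogenetic package.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts.

    Each node is {'name': str, 'length': float or None, 'children': list}.
    Quoted labels and [comments] are not handled in this sketch.
    """
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick description must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        # Optional node label.
        start = pos
        while text[pos] not in ",():;":
            pos += 1
        name = text[start:pos]
        # Optional branch length after ':'.
        length = None
        if text[pos] == ":":
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if text[pos] != ";":
        raise ValueError(f"unexpected character at position {pos}")
    return root


def leaf_names(node):
    """List the leaf labels of a parsed tree from left to right."""
    if not node["children"]:
        return [node["name"]]
    return [name for child in node["children"] for name in leaf_names(child)]


tree = parse_newick("((A:0.1,B:0.2)AB:0.3,C:0.4);")
print(leaf_names(tree))
```

Nested parentheses correspond to internal nodes, commas separate sibling subtrees, and the number after a colon is the length of the branch leading to that node.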

Key References

Jeanmougin, F., Thompson, J.D., Gouy, M., Higgins, D.G., and Gibson, T.J. 1998. Multiple sequence alignment with ClustalX. Trends Biochem. Sci. 23:403-405. Higgins, D.G., Thompson, J.D., and Gibson, T.J. 1996. Using CLUSTAL for multiple sequence alignments. Methods Enzymol. 266:383-402. Both of these articles give extensive background and descriptive details as to what exactly happens when you try to use Clustal and what all of the parameters mean. They are intended for a lay, nontechnical...

Equation 4.3.7

To estimate this constant, a simple optimization procedure is performed. The value of EW affects the resulting predictions, and it may occasionally be useful to alter its default value (see Critical Parameters and Troubleshooting). Examples of large-scale genomic annotation using geneid. geneid is being used in the Dictyostelium discoideum genome project as the main ab initio gene prediction tool. geneid is also being used in the large-scale analysis of the genome sequence of pufferfish Fugu...

Gene and Exon Scores

Gene and exon scores have a probabilistic interpretation within geneid (see Background Information). Thus, although the authors have not studied exhaustively the false-positive rate of exon predictions as a function of the score, as a rule of thumb, the higher the score of an exon, the higher its likelihood. Note, however, that in geneid the score of an exon depends directly on its length, and that a very short exon cannot, by definition, have a high score. Thus, very short exons may have very...

Other Organisms

For all versions of GlimmerM except the malaria version, the system prints a list of the putative gene models (Fig. 4.4.9). The output is very similar to that produced by the GENSCAN system (Burge and Karlin, 1997); this format is intentionally designed to make it easier for software that parses the output of GENSCAN to use GlimmerM as well. For each gene model, the output contains a list of the exons that comprise that prediction. Four types of exons may appear in the predictions: initial...

Alternate Protocol 3 Submitting E-mail BLAST Search Requests

The NCBI BLAST E-mail server provides another way to run BLAST searches in relatively large numbers without the necessity of using a browser. The query sequence is compared against the specified database using the gapped BLAST algorithm and the results are returned in an E-mail message. The mandatory directives given in the first part of the protocol can be supplemented by the elective argument lines in the second. All directives must precede the sequence information (i.e., they appear before...

Critical Parameters and Troubleshooting

As mentioned above, MZEF requires three input parameters (other than the sequence file itself). Strand: 1 or 2. One should try both strands if the coding strand information is unknown. P0, or prior probability: it reflects the a priori belief about the coding exon density in the genomic region. As one can see from the above examples, when P0 was changed from 0.02 to 0.04, MZEF predicted two more exons, one true exon (2564-2621) and one false-positive exon (17812-17874; see Figures...

Expect Range

The expected frequency is the number of database hits expected to occur by chance with a score equal to or higher than that of the match between the query and the subject (database) sequence. This most important advanced argument may save considerable time when analyzing marginally significant hits. It blocks the reporting of those matches that can be expected to occur in the database at the given absolute frequency. In contrast to relative frequencies, absolute frequencies (the expected numbers of occurrences of an event) can exceed 1....
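The distinction between an absolute expected frequency and a probability can be illustrated with the Karlin-Altschul relation E = K m n exp(-lambda S) and the conversion P = 1 - exp(-E). The sketch below assumes that standard formulation; the values chosen for lambda and K are illustrative placeholders, not parameters reported by any particular BLAST run.

```python
import math


def expected_frequency(raw_score, query_len, db_len, lam, k):
    """Expected number of chance hits scoring at least raw_score.

    Uses E = K * m * n * exp(-lambda * S), with m the query length and
    n the database length. lam and k depend on the scoring system.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


def probability_of_at_least_one(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Illustrative parameter values only.
e = expected_frequency(raw_score=60, query_len=300, db_len=5_000_000,
                       lam=0.267, k=0.041)
print(e, probability_of_at_least_one(e))
# An expected frequency of 10 is a count, so it may exceed 1,
# while the corresponding probability stays below 1.
print(probability_of_at_least_one(10.0))
```

For small values the two quantities are nearly identical, which is why they are often used interchangeably for highly significant hits.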

Future Installments

At present, this unit merely sketches the topics that the chapter will cover. In a future installment, this unit will be replaced by a more comprehensive overview of phylogenetic methods. The first unit available in this chapter (UNIT 6.2) describes using the program TreeView to display phylogenetic trees. The chapter aims to cover the major tree building methods and software packages, including maximum parsimony, neighbor joining, maximum likelihood, and Bayesian methods. Future installments...

Alignment Score

Figure 3.2.2 Distribution of scores generated by using Framesearch to compare nucleotides 52500 through 55000 of gi-15829254_55.seq with all peptide sequences from the example bacterial genome. Since the selected region comprises all of one gene and parts of two flanking genes, there are three very strong hits, highlighted by arrows above. There are also many lower-quality hits with scores below 400. Most likely, hits with scores above 200 represent genes related to the three genes...

BLAST

# BLASTN 2.2.2 [Dec-14-2001]
# Database: nr
# Query: gi|14456711|ref|NM_000558.3| Homo sapiens hemoglobin, alpha 1 (HBA1), mRNA
# Fields: Query id, Subject id, % identity, alignment length, mismatches, gap openings, q. start, q. end, s. start, s. end, e-value, bit score
...

Basic Protocol 3 bl2seq For Comparing Two Sequences

Frequently, only two given sequences are to be aligned, e.g., two known homologs, a gene and its transcript, a gene and the protein for which it codes, or a transcript and the protein. In such a case, performing a full database search would be a waste of time. Instead, the bl2seq program (Tatusov and Madden, 1999) can generate the local alignment or alignments much faster and with little or no noise. bl2seq can be run either on the command line or on the Web (Fig. 3.4.18).

Figure 3.4.10 Pairwise alignment view of a hit to the human dystrophin protein when searching against the human subset of the nr database.

Choosing a PAM matrix

It is extremely important to note that PAM matrices are derived from protein sequence data available in the late 1960s and early 1970s. Most proteins known at that time were small, globular, and hydrophilic. If the researcher believes their protein contains substantial hydrophobic regions, such as membrane-spanning helices or sheets, the PAM matrices are less useful than others described in this unit. Dayhoff et al. (1978) were the first to define the terms protein family and superfamily. A...

Find related material

Select the Related Articles link on the upper right-hand corner of the abstract display (Fig. 1.3.3). In April, 2002, Entrez indicated that there were 162 papers of similar subject matter associated with the original Cayatte reference. Figure 1.3.4 shows the first four related papers. The first paper in the list is the same Cayatte paper because, by definition, it was most related to itself (the parent). The order of the following entries is based on statistical similarity. Thus, the entry...

Figure 6.2.7 The same tree drawn in the four different styles available in TreeView. Taxa shown include Pongo, Hylobates, Macaca fuscata, M. mulatta, M. fascicularis, M. sylvanus, Saimiri sciureus, Lemur catta, and Tarsius syrichta.

Equation 4.3.4

where S_{i,j} is the subsequence of S starting in position i and ending in position j. The score of a potential exon S, LE(S), defined by sites sa (start or acceptor) and sd (donor or stop), is computed as a log-likelihood score. geneid predicts gene structures (which can be multiple genes in both strands) as sequences of frame-compatible nonoverlapping exons. If a gene structure g is a sequence of exons e1, e2, ..., en, the score of the gene is the log-ratio given in Equation 4.3.6. In geneid, the gene...

Contributors And Introduction

Contributed by Matthew Healy, Bristol-Myers Squibb Pharmaceutical Research Institute, Wallingford, Connecticut. Framesearch (Edelman et al., 1995) is an extension of the classic Smith-Waterman pairwise sequence comparison algorithm. As illustrated in the upper portion of Figure 3.2.1, when the classic Smith-Waterman search algorithm is used to compare a nucleotide query sequence against a database of peptide sequences, it compares the six possible translations of that nucleotide query sequence as...
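The six possible translations mentioned above (three reading frames on each strand) can be generated with a short script. This sketch illustrates only the translation step of the classic approach, using the standard genetic code; it is not an implementation of Framesearch, and the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to peptides."""
    seq = seq.upper()
    frames = {}
    for strand, template in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(template[offset:])
    return frames


for label, peptide in six_frame_translations("ATGGCCTAA").items():
    print(label, peptide)
```

Each of the six peptides would then be compared separately against the protein database, which is why a frameshift in the query splits a true match across two frames.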

Background Information

For professionals, the efficiency of interpreting BLAST results can and should be enhanced significantly by providing single-click access to annotations, literature, expression or structural information, other searches, and predictions available over the Internet or by intranet connections. This is true regardless of whether the researcher is using BLAST for just a few single searches or managing a high-performance pipeline generating and analyzing millions of searches. The raw BLAST output can...

Cellular Component

Cellular Component refers to the location of action for a gene product. This location may be a structural component of a cell, such as the nucleus. It can also refer to a location as part of a molecular complex, such as the ribosome.

A model pipeline

To discover new genes coding for a specified class of proteins (e.g., G-protein coupled receptors) in a database of genomic DNA, the following simple pipeline can be constructed. Compile a representative set of the protein class or import it by anonymous FTP from Swiss-Prot. First, probe this set of proteins against the genomic database using the TBLASTN program. For larger scale projects, use a parser, e.g., the BLAST parser in the BioPerl package (http://www.bioperl.org). Otherwise, manually...

Equation 6.3.2

The process stops when r = 2, with the last branch length being equal to the last value in the distance matrix. The successive mergings achieved by NEIGHBOR are available in its outfile. The Q criterion admits several interpretations, the most popular being that it corresponds to the least-squares length estimate of the tree under construction. Accordingly, NJ tends to produce a tree with minimal length. More importantly, when applied to any tree distance D that perfectly fits a tree T, Q...
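The pair-selection step of neighbor joining can be sketched as follows. The formula used here, Q(i,j) = (r - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k), is the commonly cited textbook form and is an assumption: the unit's own equation is not reproduced in this excerpt and may differ in sign or scaling. Function names are hypothetical.

```python
def q_matrix(dist):
    """Compute the neighbor-joining Q criterion for every pair i < j.

    dist is a symmetric distance matrix (list of lists, zero diagonal)
    over the r taxa that remain to be merged.
    """
    r = len(dist)
    row_sums = [sum(row) for row in dist]
    return {
        (i, j): (r - 2) * dist[i][j] - row_sums[i] - row_sums[j]
        for i in range(r)
        for j in range(i + 1, r)
    }


def pair_to_merge(dist):
    """Return the pair (i, j) with the smallest Q value."""
    q = q_matrix(dist)
    return min(q, key=q.get)


# Small illustrative distance matrix for five taxa.
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(pair_to_merge(distances))
```

Note that the pair with the smallest raw distance (taxa 3 and 4 here) is not necessarily the pair chosen, because Q also accounts for how far each taxon is from all the others.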

Accessing Computer Tools

There are many computational tasks that need to be done to analyze gene expression information. Many freely available academic software tools can do some of the tasks below. These are generally best suited for researchers with some informatics proficiency. The analyses below can all be performed using GeneSpring (see Internet Resources; http://www.silicongenetics.com) from Silicon Genetics, and many can be performed by other tools, such as J-Express (UNIT 7.3) and NHGRI's Array...

What Are The Objectives Of The GO Project

The focus of the GO project is threefold. First, the project goal is to compile and provide the GOs, structured vocabularies describing domains of molecular biology. The three domains under development were chosen as ones that are shared by all organisms: Molecular Function, Biological Process, and Cellular Component. These domains are further described below. Second, the project supports the use of these structured vocabularies in the annotation of gene products. Gene products are associated...

Basic parameters

Controlling the output exclusively by the Number of Descriptions and the Number of Alignments arguments at the Format screen would cause BLAST to report biologically uninteresting matches when the specified number of alignments exceeds the number of interesting hits. The default expected frequency threshold is ten, an overly nonconservative value that instructs the program to report any hit of potential interest. At the other extreme, relying always on the top hit would be a risky proposition....

Introduction

Chapter 2 describes methods for recognizing functional domains in protein and nucleic acid sequences. The term functional domains is defined broadly to refer to sub-sequences of larger sequences that share some common functionality. It could include, for instance, gene-finding programs that divide genomic DNA sequence into regions classified as exons, introns, and non-transcribed. However, that particular topic is important enough to have a separate chapter (Chapter 4) and will not be covered...

Guidelines For Understanding Results

The result output contains the following information: File_Name (may be truncated if too long), Sequence_length (in base pairs), G+C_content (see Feature Variables Used in MZEF in the unit's Appendix), and a table of internal coding exons predicted. The nine columns in the table are: Coordinates, the exon coordinates in the input DNA sequence (if Strand 2, one should reverse-complement each output region to get the sense-strand segment); P, the posterior probability (> 0.5) for each exon, how likely...

Alternate Protocol 1 Megablast

NCBI has developed an extremely rapid version of BLAST specialized for searching complete or partial genomes. MEGABLAST is a powerful tool for gene predictions, analyzing single-nucleotide polymorphisms, and some other tasks. It can be accessed from NCBI's top BLAST page (http://www.ncbi.nlm.nih.gov/BLAST) by clicking on the MEGABLAST link. When word sizes (W) of 16 or larger are used, this tool can be up to 10 times faster than BLASTN. MEGABLAST will find and extend any matches of word size W +...

Figure 2.2.11 Corrected version of protein sequence AAF53163.1.

Figure 2.2.12 The first page of a phylogenetic tree made from the block regions of the 158 sequences included in IPB001525 as displayed by the ProWeb TreeViewer.

Masking the Sequence

Some types of interspersed repeats and low-complexity regions exhibit a highly nonrandom sequence composition, often similar to that characterizing protein coding regions (Stormo, 2000). geneid may include these in the gene predictions. It may be advisable, thus, to mask the query sequence for such repeats and regions using, for instance, the program RepeatMasker before running geneid. This strategy may increase the specificity of the predictions. Let us note, however, that real genes often...

Biological Process

Biological Process refers to the broad biological objective or goal in which a gene product could be involved. Biological Process includes the areas of development, cell communication, physiological processes, and behavior. An example of a broad process term is mitosis (the division of the eukaryotic cell nucleus to produce two daughter nuclei that, usually, contain the identical chromosome complement to their mother). An example of a more detailed process term is calcium-dependent cell-matrix...

Basic Protocol 2 Discovering DNA Motifs In A Set Of DNA Sequences With MEME

This protocol describes the use of MEME via the MEME Web interface or from the command line to discover motifs in a family of DNA sequences. It also discusses how to interpret the motifs and to use them to search sequence databases for sequences containing the motifs. Command-line MEME works on many uniprocessor computers, some multiprocessor computers, and clusters that have the MPICH message-passing software installed. A list of supported operating systems and their manufacturers is available...

Background Information History

The program geneid (Guigo et al., 1992) was one of the first programs to predict full exonic structures of vertebrate genes in anonymous DNA sequences. geneid was designed following a simple hierarchical structure: first, gene-defining signals were predicted and scored using weight matrices. Next, potential exons were constructed from these sites, and their coding potential was scored as a function of several coding statistics, such as hexamer composition, whose coefficients were estimated by a...
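Scoring a candidate signal with a weight matrix can be illustrated with a small log-odds example. This is a generic position weight matrix sketch, not geneid's actual matrices or parameters: the training sites, the pseudocount, and the uniform background frequency are all assumptions made for the illustration.

```python
import math


def log_odds_matrix(sites, pseudocount=1.0, background=0.25):
    """Build a log2-odds weight matrix from equal-length aligned sites.

    Each column maps a base to log2(observed frequency / background).
    """
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        total = len(column) + 4 * pseudocount
        matrix.append({
            base: math.log2(
                ((column.count(base) + pseudocount) / total) / background
            )
            for base in "ACGT"
        })
    return matrix


def score_site(matrix, candidate):
    """Sum the per-position log-odds scores of a candidate site."""
    return sum(column[base] for column, base in zip(matrix, candidate))


# Hypothetical donor-like training sites, for illustration only.
training_sites = ["GTAAGT", "GTGAGT", "GTAAGA"]
matrix = log_odds_matrix(training_sites)
print(score_site(matrix, "GTAAGT"), score_site(matrix, "CCCCCC"))
```

Candidates resembling the training sites receive positive scores, while unrelated sequences score below zero, which gives a natural threshold for proposing sites.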

geneid runs correctly and produces a valid gene prediction but the user strongly suspects that the prediction is

For sequences other than short ones encoding single genes, only in a few percent of the cases will geneid prediction be completely correct. In most cases, the geneid prediction will nearly reproduce (at least one of) the exonic structures of the genes encoded in the input DNA sequence. A number of actual exons may be missed (maybe more than when using other gene-prediction programs), and some false exons or genes may additionally be predicted (in comparison to other gene-prediction programs,...

Run GlimmerM to analyze DNA sequences for their coding potential

The program GlimmerM takes two inputs: a DNA sequence file in FASTA format (APPENDIX 1B) and a directory containing the training files for the program. If not specified, the training directory is assumed to be the current working directory. For instance, if the user is running a pre-compiled version of GlimmerM located in the bin directory, the following command should be used: glimmerm_<system> <FASTA file with the DNA sequence to be analyzed> glimmerm_<system> <FASTA file>...

Review the results

Use a Web browser to view the MEME results. For example, using Netscape Navigator, click on Open Page in the File menu and use Choose File to select the file saved in the previous step (lex.zoops.html). Then click Open In Navigator. 8. Click on the First Motif button. This will take the user to the first motif discovered by MEME (Fig. 2.4.30). MEME finds the experimentally verified LexA binding site motif. It automatically determines the correct width for the motif. The extremely low E-value...

Specify how the alignment is to be shown

Select an Alignment View from the pull-down menu: Pairwise (default). Standard BLAST alignment in pairs of query sequence and database match (Fig. 3.4.10). This is the best display to evaluate the similarity between two sequences and is therefore recommended for most searches; however, when the conservation of individual positions is being analyzed across a family of sequences, select one of the query-anchored multiple alignment displays. For automated analysis, the results in tab-delimited...

Support Protocol 3 Displaying Bootstrap Values In ClustalX Trees

Bootstrap values are a measure of support for a node in a tree. These are usually given as the percentage of bootstrap trees in which that node appeared. Bootstrap trees are obtained by generating a large number (typically 1000 or more) of new data sets, each obtained by randomly resampling with replacement from the original alignment, and generating a tree from each data set. Whereas earlier versions of Clustal stored bootstrap values as internal node labels, in ClustalX (Thompson et al.,...
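The resampling step described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the alignment is a hypothetical list of equal-length strings, the function names are invented for this example, and tree building is left out entirely (the support calculation simply takes a list of yes/no flags saying whether a node appeared in each bootstrap tree).

```python
import random


def bootstrap_alignment(alignment, rng):
    """Build one bootstrap data set by resampling alignment columns with replacement."""
    n_cols = len(alignment[0])
    # Draw as many column indices as the alignment has columns; repeats are allowed.
    cols = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[c] for c in cols) for seq in alignment]


def node_support(node_found_flags):
    """Percentage of bootstrap trees in which a given node appeared."""
    return 100.0 * sum(node_found_flags) / len(node_found_flags)


# Hypothetical three-sequence alignment, used only for illustration.
alignment = ["ACGTACGT", "ACGTTCGT", "ACCTACGA"]
rng = random.Random(0)
replicate = bootstrap_alignment(alignment, rng)
```

Each replicate has the same dimensions as the original alignment, and every column in it is a copy of some original column. In practice the replicate would be passed to a tree-building program, and `node_support` would be applied to the resulting set of trees.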

Basic Protocol Running GlimmerM Locally To Identify Genes

The most powerful and flexible way of using GlimmerM is to install and run the Unix-based software on a local system. This gives the user more organism-specific versions of GlimmerM (these are included with the software), and the power to train the system for any organism of choice, provided that one can collect a representative training set (see Support Protocol). Another advantage of having a locally installed GlimmerM is that the parameters of the system can be customized to reflect the...

Principal Component Analysis

One popular analysis technique for dealing with a multi-dimensional data set such as gene expression data from several experiments is called principal component analysis. It tries to convert the n-dimensional scatter plot where each point is a gene and each axis is an experiment into a two-dimensional scatter plot. The method of doing this is to find the most significant patterns over the experiments, and then use each axis on the scatter plot to indicate how much like that pattern a given...
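As a rough illustration of the idea, the following Python sketch projects a small expression matrix onto its first two principal components using a singular value decomposition. The data values and the function name `pca_2d` are made up for this example, and NumPy is assumed to be available; real analyses would normally apply normalization choices not shown here.

```python
import numpy as np


def pca_2d(expr):
    """Project genes (rows) measured over experiments (columns) onto two principal components."""
    # Center each experiment so that the patterns describe variation around the mean.
    centered = expr - expr.mean(axis=0)
    # Rows of vt are the dominant patterns over the experiments, strongest first.
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    coords = centered @ vt[:2].T          # how much each gene resembles each pattern
    explained = (s ** 2) / np.sum(s ** 2)  # fraction of variance carried by each pattern
    return coords, vt[:2], explained[:2]


# Hypothetical data: 6 genes measured in 4 experiments.
expr = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],
    [3.9, 3.2, 2.1, 0.8],
    [2.0, 2.0, 2.1, 1.9],
    [2.1, 1.9, 2.0, 2.0],
])
coords, patterns, explained = pca_2d(expr)
```

Each row of `coords` gives the position of one gene on the two-dimensional scatter plot, and `explained` indicates how much of the total variation the two axes capture.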

How Do GO Vocabularies Relate To Other Resources Such As the TIGR Cellular Role Classifications?

Various other classification schemes have been indexed to GO including the SWISS-PROT keyword set and the TIGR cellular role classification set. These mappings are provided to the public at the GO Web site (http://www.geneontology.org, indices). They are reviewed and updated as needed. From Current Protocols in Bioinformatics Online Copyright 2002 John Wiley & Sons, Inc. All rights reserved. CURRENT PROTOCOLS IN BIOINFORMATICS CHAPTER 7 ANALYSIS OF EXPRESSION DATA UNIT 7.2 The Gene Ontology...

Geneid predictions on submitted sequence

[Figure residue: geneid Web server output in GFF version 2 format for the sequence example1, followed by a graphical representation of the predictions.]

Submit a single motif to the Blocks database and construct a tree

Scroll to the top of the MEME results document. 12. Click on the First Motif button. 13. Scroll down to the Motif 1 in Blocks format section and click on the Submit Block 1 button. 14. On the resulting input form, click on Tree GIF, to see a neighbor-joining tree (UNITS 22 & 6.3) of the sites composing motif 1 (Fig. 2.4.27). The numbers following the sequence names in the leaves of the tree are the positions in the sequence of the sites. In this example, the sites nearer the N- (smaller...

Support Protocol Obtaining The ClustalW And ClustalX Programs

The Clustal series of programs are available by anonymous FTP from ftp-igbmc.u-strasbg.fr or ftp.ebi.ac.uk. ClustalW is written in ANSI standard C and has been tested on a number of Unix platforms, including DEC, SGI, and Sun, as well as Macintosh and PC systems. However, it can be compiled on any platform which supports a C compiler. Executable programs are supplied for Power Macintosh computers and for PCs running either the Windows or DOS operating systems. ClustalX uses the Vibrant...

MAST

Figure 2.4.32 MAST input form for LexA.

Equation 4.2.5

Which maximizes the Fisher criterion $(h_+ - h_-)^2 / (s_+ + s_-)$ (Fisher, 1936). One notices that the Fisher coefficient (Equation 4.2.5) will reduce to that of (Equation 4.2.4) when $S_+ = S_-$, although maximization of the Fisher criterion cannot provide an optimal value for the constant threshold v, which may be chosen by minimizing the classification errors in the linear subspace. Using a linear discriminant function (often the Fisher discriminant function) for classification is called LDA (linear...
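The Fisher discriminant can be illustrated numerically with a short Python sketch. It assumes the usual textbook form, which is not spelled out in the excerpt above: the projection direction is proportional to the inverse of the summed class covariance matrices applied to the difference of class means, and the criterion is the squared difference of projected means divided by the sum of projected variances. The data are randomly generated and the function names are hypothetical.

```python
import numpy as np


def fisher_direction(pos, neg):
    """Direction w proportional to (S_pos + S_neg)^-1 (m_pos - m_neg), scaled to unit length."""
    m_pos, m_neg = pos.mean(axis=0), neg.mean(axis=0)
    s_pos = np.cov(pos, rowvar=False)
    s_neg = np.cov(neg, rowvar=False)
    w = np.linalg.solve(s_pos + s_neg, m_pos - m_neg)
    return w / np.linalg.norm(w)


def fisher_criterion(w, pos, neg):
    """Squared difference of projected means over the sum of projected variances."""
    p, n = pos @ w, neg @ w
    return (p.mean() - n.mean()) ** 2 / (p.var(ddof=1) + n.var(ddof=1))


# Hypothetical two-class data in two dimensions.
rng = np.random.default_rng(0)
pos = rng.normal(loc=[2.0, 1.0], scale=[1.0, 0.5], size=(20, 2))
neg = rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(20, 2))

w = fisher_direction(pos, neg)
best = fisher_criterion(w, pos, neg)
```

Evaluating `fisher_criterion` for any other direction should give a value no larger than `best`. As noted in the text, the threshold on the projected values still has to be chosen separately, for example by minimizing classification errors along `w`.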

The rationale of distance-based approaches

Let S be the set of sequences being studied and T the true evolutionary tree of these sequences. Assume that the sequences have been correctly aligned, so that the sites correspond to homologous positions (see UNITS 2J_ & 24). Now consider the true number of substitutions that is attached to every branch of T, i.e., the number of substitutions that occurred in the past from the sequence situated at one branch extremity to the sequence at the other extremity. These substitution numbers are...

Using BIONJ

BIONJ is available free from This Web page contains documentation and articles, test sets, and executables for Windows PC and PowerMac, as well as the C source code. Once downloaded (and compiled on Unix and related systems), BIONJ must be placed in the PHYLIP. BIONJ asks for the distance matrix input file and the name of the tree output file. The distance matrix must be square and written in PHYLIP format. The file can contain one or several matrices, as obtained when using SEQBOOT plus...

Taxonomy reports

It can be challenging to review the complex taxonomic distribution of hits for families of proteins conserved over aeons of evolution, like copper-zinc superoxide dismutase. The taxonomic hierarchies are presented in a user-friendlier format on the Taxonomy links at the top of the BLAST result pages. Clicking there will display the scientific binomens, the common names, and part of the classification of the organisms (Fig. 3.4.20). Note the taxonomy codes used by NCBI that can easily be used...

Other Ontologies Under Development Complement GO

GO vocabularies do not describe attributes of sequence such as intron/exon parameters, protein domains, or structural features. They do not model protein-protein interactions. They do not describe mutant or disease phenotypes. There are efforts underway to develop ontologies for each of these domains. The GO consortium is supporting the development of other bio-ontologies by providing a Web home for developers to post other emerging ontologies. The requirements for inclusion on this site,...

Why is it necessary to fine-tune the search arguments?

BLAST is one of the most user-friendly bioinformatics tools, particularly when running over the network; however, the ease of this application may be somewhat deceptive, as sometimes the default values will not achieve a specific objective. For example, primer design may require that a decanucleotide would not appear in the transcriptome of the organism in other genes. As the default word size for BLASTN is 11 nucleotides, a shorter word size must be specified (i.e., -w 10; see below). Relying on...
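The effect of the word size on the decanucleotide example can be seen with a simplified Python sketch. It models only the idea that a search is seeded by exact shared words of a given length; it is not an implementation of BLASTN, and the sequences and the function name `shared_words` are hypothetical.

```python
def shared_words(query, subject, word_size):
    """Start positions in `query` of exact words of length `word_size` also found in `subject`."""
    subject_words = {
        subject[i:i + word_size] for i in range(len(subject) - word_size + 1)
    }
    return [
        i
        for i in range(len(query) - word_size + 1)
        if query[i:i + word_size] in subject_words
    ]


primer = "ACGTTGCAAG"                    # hypothetical decanucleotide
transcript = "TTTT" + primer + "CCCC"    # hypothetical transcript containing the primer

seeds_default = shared_words(primer, transcript, 11)  # no 11-letter word fits in a 10-mer
seeds_short = shared_words(primer, transcript, 10)    # the full primer is found
```

With a word size of 11 the 10-nucleotide primer cannot produce any seed, so the perfect match goes unreported; lowering the word size to 10 recovers it.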

GO Does Not Define Evolutionary Relationships

Shared annotation of gene products to GO terms reflects shared association with a defined molecular phenomenon. Multiple biological objects (proteins) can share function or cellular location or involvement in a larger biological process, and not be evolutionarily related in the sense of shared ancestry. That said, many proteins that share molecular function attributes, in particular, do share ancestry. However, the property of shared ancestry is separate from the property of function assignment...

Alternate Protocol 3 Improving Speed Of Framesearch By Using Specialized Hardware

While Alternate Protocols 1 and 2 make practical the routine use of Framesearch on an ordinary computer, they are not a perfect solution to the Framesearch performance problem. Restricting Framesearch to regions surrounding BLAST hits will permit alignments that are interrupted by frameshift errors to span reading frames, thus converting a short BLAST hit into a longer Framesearch hit. However, there may be hits that BLAST will not find at all, because no single reading frame has enough sequence...

Sbjct 301 tccaagtaccgttaaactggag 322

Figure 3.3.8 Pairwise alignment view of the best hit to the mRNA of the human hemoglobin α1 chain when searching against the rodent subset of the nr database (Table 3.3.1). CURRENT PROTOCOLS IN BIOINFORMATICS CHAPTER 3 FINDING SIMILARITIES AND INFERRING HOMOLOGIES UNIT 3.3 Finding Homologs to Nucleotide Sequences Using Network BLAST Searches FIGURE(S) Figure 3.3.9 The Query-Anchored...

GlimmerM Web Server

[Figure residue: GlimmerM Web server input form. The user selects the organism for which predictions are desired and submits the sequence by cutting and pasting it into the sequence window or by entering a filename to upload. Input sequences may be in FASTA format or simple DNA sequences.]

Select format

Select arguments for Formatting options on page with results. By default, no formatting screen (Fig. 3.3.5) is included in the results; however, to have the option of changing the format after reviewing the results, simply display the formatting section on the top of the results. Turn this option off before saving the results into a file. 14. Specify Autoformat as fully automatic (Full-Auto), semi-automatic (Semi-Auto), or Off. When Full-Auto is selected (this is recommended), the status...

MEME

[Figure residue: MEME Web submission form; the text of the form is not recoverable.]

Gene Predictions

Mapping EST sequences to a genomic sequence by BLAST is one of the best methods to predict genes on the condition that ESTs cover all exons in the gene. Fortunately, with the completion of the human genome and the unraveling of most of the human genes, chances are significantly higher to find reliable matches between ESTs, proteins, and the genome than before. Depending on the rate of sequencing errors for the ESTs, we require a high level (i.e., 95% to 97% or higher) identity between the genomic...
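The identity requirement can be expressed as a small filter. The Python sketch below is an assumption-laden illustration: it takes two already aligned strings of equal length (with "-" for gaps), computes the percentage of identical columns, and applies a 95% cutoff taken from the lower end of the range quoted above. The function names and sequences are hypothetical, and real pipelines would usually read the identity directly from the BLAST report.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns in which both sequences carry the same residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


def passes_identity_filter(aligned_a, aligned_b, threshold=95.0):
    """True when the alignment meets the required identity level."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Hypothetical 20-column alignment of a genomic segment and an EST with one mismatch.
genomic = "ACGTACGTACGTACGTACGT"
est = "ACGTACGTACTTACGTACGT"
identity = percent_identity(genomic, est)
```

Here one mismatch in 20 columns gives 95% identity, which is just enough to pass the default cutoff; a second mismatch would cause the match to be rejected.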

Submit a search to MEME

Run MEME on the training set of sequences (tf4.fasta in this example) by doing one of the following: a. For command-line MEME: Type meme tf4.fasta -nmotifs 10 > tf.zoops.html on the command line. Skip to step 7 in this protocol when MEME has finished running. The name of the file containing the training set sequences is always the first parameter after the program name (i.e., meme). The -nmotifs n switch tells MEME how many motifs to find. How to choose the best value for n is detailed...

MAST Motif Alignment and Search Tool

Figure 2.2.6 Part of the MAST output generated by selecting the MAST Search link in Figure 2.2.2. The query in this search was constructed from six position-specific scoring matrices computed from the six IPB001525 blocks, and the database was Drosophila protein sequences. GenBank entry AAF53163.1 is the top hit. CURRENT PROTOCOLS IN BIOINFORMATICS CHAPTER 2 RECOGNIZING FUNCTIONAL...

Discriminant Analysis and Bayes Error

MZEF is based on a classical discrimination method, QDA (Quadratic Discriminant Analysis), which is a direct descendant of LDA (Linear Discriminant Analysis). Discriminant analysis belongs to general statistical pattern recognition methods and has been widely used in many fields for optimal classification (e.g., Fukunaga, 1990). Discriminant analysis is used to answer the following question: given N objects, how can one assign each object into K known classes with minimum error? For simplicity,...

Predicting and scoring sites

PWMs are used to score each potential donor site (GT), acceptor site (AG), and start codon (ATG) along a given sequence. The score of a potential donor site (if assumed to be of length $l$) $S = s_1 s_2 \ldots s_l$ within the sequence is computed as $\sum_{j=1}^{l} D_{s_j j}$. This is the log-likelihood ratio of the sequence $S$ in an actual site versus $S$ in any false GT site. $D_{ij}$ is the logarithm of the ratio of the probability of nucleotide $i$ in position $j$ in an actual donor site over the probability of $i$ in position $j$ in a false site. $D_{ij}$...
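The log-likelihood ratio scoring described above can be sketched in Python as follows. The nucleotide frequencies are invented for a three-position toy site and the false-site model is assumed to be uniform; neither is taken from geneid's actual parameter files, and the function names are hypothetical.

```python
import math


def log_odds_matrix(true_freqs, false_freqs):
    """Per-position log ratios of nucleotide probabilities in real versus false sites."""
    return [
        {nt: math.log(t[nt] / f[nt]) for nt in "ACGT"}
        for t, f in zip(true_freqs, false_freqs)
    ]


def site_score(site, matrix):
    """Sum the log ratios of the nucleotides observed at each position of the site."""
    return sum(matrix[j][nt] for j, nt in enumerate(site))


# Hypothetical frequencies for a three-position site; real matrices are estimated from training data.
true_freqs = [
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},
    {"A": 0.05, "C": 0.05, "G": 0.05, "T": 0.85},
    {"A": 0.60, "C": 0.10, "G": 0.20, "T": 0.10},
]
false_freqs = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}] * 3

matrix = log_odds_matrix(true_freqs, false_freqs)
consensus_score = site_score("GTA", matrix)
weak_score = site_score("CAC", matrix)
```

A candidate that matches the preferred nucleotides at every position receives a positive score, while a candidate built from disfavored nucleotides receives a negative one, so a threshold on the score separates likely sites from unlikely ones.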

Basic Protocol 2 Framesearch Using A Protein Query Sequence

This protocol describes the use of Framesearch in the GCG Wisconsin Package environment to search a nucleotide sequence database for sequences that are similar to a query protein sequence. Any user familiar with the GCG Package will find using Framesearch in that environment straightforward. Framesearch has recently been added to the algorithms supported by the SeqWeb version of the GCG Package (Accelrys, 2001), so users who prefer a Web-based interface may find it simpler to run Framesearch...

Table 3.1.1 Rules for Sequence Similarity Searching
The alignment procedure must provide weights for a gap (g) and for the length of the gap (r)
The weights for g and r must never be zero
The sum of the weights for a gap (g) and its length (r) must be greater than the weight for a single match or mismatch if insertions and deletions are expected to occur less frequently than substitutions (i.e., mismatches)
If there are two strongly matched regions separated by a region of low similarity, the...

Foreword

During the last 25 years, computers have moved from being an esoteric tool of the mathematicians and physicists into the mainstream of our daily existence. Increasingly, they are an essential component of modern living. Nowhere is this more apparent than in biology, where the combination of vast databases of information and clever computer programs to manipulate and mine that data now permeate the practice of our science. The new discipline of bioinformatics has not only gained credibility, but...

Weighbor

WEIGHBOR follows the same agglomerative scheme as NJ. It modifies the reduction step, in a way analogous to BIONJ, but also modifies the selection step to take into account the high variance of long-distance estimates. Instead of using NJ's selection criterion (see above), WEIGHBOR combines two criteria. When i and j are neighbors in T and when D perfectly fits T, then one has the two following properties. Additivity: $d_{ik} - d_{jk}$ is independent of $k$ ($k \neq i, j$). Positivity: $d_{ik} + d_{jl} - d_{ij} - d_{kl} > 0$...

Ontology Based Enhancement of Bioinformatics Resources

Bioinformatics systems have long employed keyword sets to group and query information. Journals typically provide keywords, which subsequently permit indexing of the published articles. Hierarchical classifications (e.g., taxonomies, Enzyme Commission Classification) have been used extensively in biology, and molecular function classifications started to appear with the work of Monica Riley in the early 1990s (Riley, 1993, 1998 Karp et al., 1999). The Unified Medical Language System (UMLS http...

MMDB

Figure 1.3.14 The structure summary for 1HMF, resulting from a direct query of the structures accessible through the Entrez system. The entry shows header information from the corresponding MMDB...

[Middle portion of alignment omitted]

Figure 3.2.4 Alignment of a nucleotide query sequence against a peptide database sequence, generated by Framesearch. Note that the middle portion has been omitted here. The names of the query and database sequences, just above this alignment, have been...

Equation 6.3.3

Where $\lambda$ is any number in $[0,1]$ that varies depending on the merged pair $i, j$ but not on $x$. So once the pair $i, j$ has been selected, BIONJ computes the $\lambda$ value that minimizes the sum of the variances of the $d_{ux}$ estimates. In this way, more reliable estimates will be available to select the pairs of taxa to be agglomerated during the next steps. Moreover, since the process is repeated at each step, these estimates will become better and better in comparison with NJ estimates as the algorithm proceeds. To...
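The role of the weight in the reduction step can be illustrated with a short Python sketch. It assumes the reduction takes the commonly published form $d_{ux} = \lambda (d_{ix} - d_{iu}) + (1 - \lambda)(d_{jx} - d_{ju})$, where u is the new node replacing i and j; that formula is not reproduced in the excerpt above, so it should be read as an assumption, and the function name and distances are hypothetical. The computation of the variance-minimizing weight itself is not shown.

```python
def reduced_distance(d_ix, d_jx, d_iu, d_ju, lam):
    """Distance from the new node u to taxon x as a weighted average of two estimates."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * (d_ix - d_iu) + (1.0 - lam) * (d_jx - d_ju)


# Hypothetical additive tree: i and j hang from u with branch lengths 2 and 3,
# and x lies at distance 4 from u.
d_iu, d_ju = 2.0, 3.0
d_ix, d_jx = 6.0, 7.0

nj_style = reduced_distance(d_ix, d_jx, d_iu, d_ju, 0.5)  # equal weights, as in NJ
weighted = reduced_distance(d_ix, d_jx, d_iu, d_ju, 0.3)  # an arbitrary other weight
```

When the distances fit a tree exactly, every weight gives the same answer (4 in this example). The choice of weight matters only for noisy distance estimates, where favoring the less variable of the two estimates gives a more reliable reduced matrix.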

Locating all potential similarities PAM 250

The most widely used PAM matrix is PAM 250 (Fig. 3.5.1). It has been chosen because it is capable of accurately detecting similarities in the 30% range (i.e., superfamilies), that is, when the two proteins are up to 70% different from each other (George et al., 1990). Another way to think about this is that the PAM 250 matrix provides the best look-back in evolutionary time in a protein sequence comparison. If the goal is to know the widest possible range of proteins similar to the protein of...

Overview And Philosophy

Current Protocols in Bioinformatics is designed to provide the experimentalist with insight into the types of data and protocols required to perform basic tasks in the area of bioinformatics. More importantly, it provides insight into understanding and properly interpreting the data produced by these methods. The Current Protocols series is known for its fast and timely publication of valuable and cutting-edge methods; this book takes that mandate one step further. Initial online installments...

View an individual database record

Select the author hyperlink to display the Abstract view of the selected paper. The Abstract view presents the name of the paper, the list of authors, their institutional affiliation, and the abstract itself, in standard format. See Figure 1.3.3 for the Abstract view of Cayatte et al. 6. To change the display, select the drop-down list next to the Display button. Select Citation and click Display. Switching to this format produces a similar looking entry; however, the cataloging information,...

Enter mandatory information

Address the message to blast@ncbi.nlm.nih.gov. One can receive documentation from the server by sending a message with the word help in the body. Whether submitting searches or requesting documentation, the subject of the message is optional; however, when it is nonempty, the reply from the server will arrive with the same subject, a useful feature for the identification of multiple requests. 2. The keyword program should be the first word on the first line of the message body. It is followed by a...