FN ISI Export Format VR 1.0 PT Journal AU Kel, AE Gossling, E Reuter, I Cheremushkin, E Kel-Margoulis, OV Wingender, E TI MATCH (TM): a tool for searching transcription factor binding sites in DNA sequences SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 BIOBASE GmbH, Halchtersche Str 33, D-38304 Wolfenbuttel, Germany BIOBASE GmbH, D-38304 Wolfenbuttel, Germany Inst Cytol & Genet, Novosibirsk 360090, Russia Univ Gottingen, UKG, Dept Bioinformat, D-37077 Gottingen, Germany ID ELEMENTS; DATABASE AB Match(TM) is a weight matrix-based tool for searching putative transcription factor binding sites in DNA sequences. Match(TM) is closely interconnected and distributed together with the TRANSFAC(R) database. In particular, Match(TM) uses the matrix library collected in TRANSFAC(R) and therefore provides the possibility to search for a great variety of different transcription factor binding sites. Several sets of optimised matrix cut-off values are built in the system to provide a variety of search modes of different stringency. The user may construct and save his/her specific user profiles which are selected subsets of matrices including default or user-defined cut-off values. Furthermore a number of tissue-specific profiles are provided that were compiled by the TRANSFAC(R) team. A public version of the Match(TM) tool is available at: http://www.gene-regulation.com/pub/programs.html#match. The same program with a different web interface can be found at http://compel.bionet.nsc.ru/Match/Match.html. An advanced version of the tool called Match(TM) Professional is available at http://www.biobase.de. TC 0 BP 3576 EP 3579 PG 4 JI Nucleic Acids Res. 
PY 2003 PD JUL 1 VL 31 IS 13 GA 695LT PI OXFORD RP Kel AE BIOBASE GmbH, Halchtersche Str 33, D-38304 Wolfenbuttel, Germany J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000183832900061 ER PT Journal AU Wasserman, WW Krivan, W TI In silico identification of metazoan transcriptional regulatory regions SO NATURWISSENSCHAFTEN LA English DT Review SN 0028-1042 PU SPRINGER-VERLAG C1 Univ British Columbia, Ctr Mol Med & Therapeut, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada Univ British Columbia, Ctr Mol Med & Therapeut, Vancouver, BC V5Z 4H4, Canada Zymogenet Inc, Seattle, WA 98102 USA ID FACTOR-BINDING-SITES; PROTEIN-DNA INTERACTIONS; FACTOR-I GENE; HUMAN GENOME; EXPRESSED GENES; TARGET SITES; EXPECTATION MAXIMIZATION; GLUCOCORTICOID RECEPTOR; CHROMOSOME TERRITORIES; COMPUTATIONAL ANALYSIS AB Transcriptional regulation remains one of the most intriguing and challenging subjects in biomedical research. The catalysis of transcription is a clear example of multiple proteins interacting to orchestrate a biological process, offering a starting point for the study of biological systems. Transcriptional regulation is viewed as one of the principal mechanisms governing the spatial and temporal distribution of gene expression, thus the field of transcriptional regulation provides a natural stage for quantitative studies of multiple gene systems. Building on the body of focused experimental studies and new genomics-driven data, computational biologists are making significant strides in accelerating our understanding of the transcriptional regulatory process in metazoan cells. Recent advances in the computational analysis of the interplay between factors have been fueled by well-defined computational methods for the modeling of the binding of individual transcription factors. We present here an overview of advances in the analysis of regulatory systems and the fundamental methods that underlie the recent developments. 
TC 0 BP 156 EP 166 PG 11 JI Naturwissenschaften PY 2003 PD APR VL 90 IS 4 GA 682ZX PI NEW YORK RP Wasserman WW Univ British Columbia, Ctr Mol Med & Therapeut, 950 W 28th Ave, Vancouver, BC V5Z 4H4, Canada J9 NATURWISSENSCHAFTEN PA 175 FIFTH AVE, NEW YORK, NY 10010 USA UT ISI:000183122900002 ER PT Journal AU Conkright, MD Guzman, E Flechner, L Su, AI Hogenesch, JB Montminy, M TI Genome-wide analysis of CREB target genes reveals a core promoter requirement for cAMP responsiveness SO MOLECULAR CELL LA English DT Article SN 1097-2765 PU CELL PRESS C1 Novartis Res Fdn, Genom Inst, San Diego, CA 92121 USA Novartis Res Fdn, Genom Inst, San Diego, CA 92121 USA Salk Inst Biol Studies, La Jolla, CA 92037 USA ID CONSTITUTIVE ACTIVATION DOMAIN; RESPONSE ELEMENT; CYCLIC-AMP; REPORTER GENE; TRANSCRIPTION; BINDING; COMPLEX; PHOSPHORYLATION; INTERACTS; IDENTIFICATION AB We have employed a hidden Markov model (HMM) based on known cAMP responsive elements to search for putative CREB target genes. The best scoring sites were positionally conserved between mouse and human orthologs, suggesting that this parameter can be used to enrich for true CREB targets. Target validation experiments revealed a core promoter requirement for transcriptional induction via CREB; TATA-less promoters were unresponsive to cAMP compared to TATA-containing genes, despite comparable binding of CREB to both sets of genes in vivo. Indeed, insertion of a TATA box motif rescued cAMP responsiveness on a TATA-less promoter. These results illustrate a mechanism by which subsets of target genes for a transcription factor are differentially regulated depending on core promoter configuration. TC 2 BP 1101 EP 1108 PG 8 JI Mol. 
Cell PY 2003 PD APR VL 11 IS 4 GA 672UF PI CAMBRIDGE RP Montminy M Novartis Res Fdn, Genom Inst, San Diego, CA 92121 USA J9 MOL CELL PA 1100 MASSACHUSETTS AVE, CAMBRIDGE, MA 02138 USA UT ISI:000182540700027 ER PT Journal AU Gupta, M Liu, JS TI Discovery of conserved sequence patterns using a stochastic dictionary model SO JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION LA English DT Article SN 0162-1459 PU AMER STATISTICAL ASSOC C1 Harvard Univ, Dept Stat, Cambridge, MA 02138 USA Harvard Univ, Dept Stat, Cambridge, MA 02138 USA DE data augmentation; gene regulation; missing data; transcription factor binding site ID BINDING-SITES; EM ALGORITHM; PROTEIN; IDENTIFICATION; AUGMENTATION; ALIGNMENT AB Detection of unknown patterns from a randomly generated sequence of observations is a problem arising in fields ranging from signal processing to computational biology. Here we focus on the discovery of short recurring patterns (called motifs) in DNA sequences that represent binding sites for certain proteins in the process of gene regulation. What makes this a difficult problem is that these patterns can vary stochastically. We describe a novel data augmentation strategy for detecting such patterns in biological sequences based on an extension of a "dictionary" model. In this approach, we treat conserved patterns and individual nucleotides as stochastic words generated according to probability weight matrices and the observed sequences generated by concatenations of these words. By using a missing-data approach to find these patterns, we also address other related problems, including determining widths of patterns, finding multiple motifs, handling low-complexity regions, and finding patterns with insertions and deletions. The issue of selecting appropriate models is also discussed. However, the flexibility of this model is also accompanied by a high degree of computational complexity. 
We demonstrate how dynamic programming-like recursions can be used to improve computational efficiency. TC 0 BP 55 EP 66 PG 12 JI J. Am. Stat. Assoc. PY 2003 PD MAR VL 98 IS 461 GA 673MJ PI ALEXANDRIA RP Gupta M Harvard Univ, Dept Stat, Cambridge, MA 02138 USA J9 J AMER STATIST ASSN PA 1429 DUKE ST, ALEXANDRIA, VA 22314 USA UT ISI:000182584500007 ER PT Journal AU Zheng, JS Wu, JJ Sun, ZR TI An approach to identify over-represented cis-elements in related sequences SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 Tsing Hua Univ, Dept Biol Sci & Biotechnol, MOE Key Lab Bioinformat, Inst Bioinformat,State Key Lab Biomembrane & Memb, Beijing 100084, Peoples R China Tsing Hua Univ, Dept Biol Sci & Biotechnol, MOE Key Lab Bioinformat, Inst Bioinformat,State Key Lab Biomembrane & Memb, Beijing 100084, Peoples R China ID GENE-EXPRESSION DATA; REGULATORY REGIONS; PROMOTER; SITES; IDENTIFICATION; SPECIFICITY; DISCOVERY; MOTIFS; CELLS AB Computational identification of transcription factor binding sites is an important research area of computational biology. Positional weight matrix (PWM) is a model to describe the sequence pattern of binding sites. Usually, transcription factor binding sites prediction methods based on PWMs require user-defined thresholds. The arbitrary threshold and also the relatively low specificity of the algorithm prevent the result of such an analysis from being properly interpreted. In this study, a method was developed to identify over-represented cis-elements with PWM-based similarity scores. Three sets of closely related promoters were analyzed, and only over-represented motifs with high PWM similarity scores were reported. The thresholds to evaluate the similarity scores to the PWMs of putative transcription factors binding sites can also be automatically determined during the analysis, which can also be used in further research with the same PWMs. 
The online program is available on the website: http://www.bioinfo.tsinghua.edu.cn/~zhengjsh/OTFBS/. TC 0 BP 1995 EP 2005 PG 11 JI Nucleic Acids Res. PY 2003 PD APR 1 VL 31 IS 7 GA 666DE PI OXFORD RP Sun ZR Tsing Hua Univ, Dept Biol Sci & Biotechnol, MOE Key Lab Bioinformat, Inst Bioinformat,State Key Lab Biomembrane & Memb, Beijing 100084, Peoples R China J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000182160900023 ER PT Journal AU Qin, ZHS McCue, LA Thompson, W Mayerhofer, L Lawrence, CE Liu, JS TI Identification of co-regulated genes through Bayesian clustering of predicted regulatory binding sites SO NATURE BIOTECHNOLOGY LA English DT Article SN 1087-0156 PU NATURE PUBLISHING GROUP C1 Harvard Univ, Dept Stat, Cambridge, MA 02138 USA Harvard Univ, Dept Stat, Cambridge, MA 02138 USA New York State Dept Hlth, Wadsworth Ctr Labs & Res, Albany, NY 12201 USA Rensselaer Polytech Inst, Dept Comp Sci, Troy, NY 12180 USA ID ESCHERICHIA-COLI; PROTEIN; SEQUENCES; EXPRESSION; IRON AB The identification of co-regulated genes and their transcription-factor binding sites (TFBS) are key steps toward understanding transcription regulation. In addition to effective laboratory assays, various computational approaches for the detection of TFBS in promoter regions of coexpressed genes have been developed. The availability of complete genome sequences combined with the likelihood that transcription factors and their cognate sites are often conserved during evolution has led to the development of phylogenetic footprinting(1,2). The modus operandi of this technique is to search for conserved motifs upstream of orthologous genes from closely related species(1,2). The method can identify hundreds of TFBS without prior knowledge of co-regulation or coexpression. 
Because many of these predicted sites are likely to be bound by the same transcription factor, motifs with similar patterns can be put into clusters so as to infer the sets of co-regulated genes, that is, the regulons. This strategy utilizes only genome sequence information and is complementary to and confirmative of gene expression data generated by microarray experiments. However, the limited data available to characterize individual binding patterns, the variation in motif alignment, motif width, and base conservation, and the lack of knowledge of the number and sizes of regulons make this inference problem difficult. We have developed a Gibbs sampling-based(3) Bayesian motif clustering (BMC) algorithm to address these challenges. Tests on simulated data sets show that BMC produces many fewer errors than hierarchical and K-means clustering methods(4). The application of BMC to hundreds of predicted gamma-proteobacterial motifs(2) correctly identified many experimentally reported regulons, inferred the existence of previously unreported members of these regulons, and suggested novel regulons. TC 0 BP 435 EP 439 PG 5 JI Nat. Biotechnol. 
PY 2003 PD APR VL 21 IS 4 GA 664VH PI NEW YORK RP Liu JS Harvard Univ, Dept Stat, Cambridge, MA 02138 USA J9 NAT BIOTECHNOL PA 345 PARK AVE SOUTH, NEW YORK, NY 10010-1707 USA UT ISI:000182082400026 ER PT Journal AU Conlon, EM Liu, XS Lieb, JD Liu, JS TI Integrating regulatory motif discovery and genome-wide expression analysis SO PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA LA English DT Article SN 0027-8424 PU NATL ACAD SCIENCES C1 Harvard Univ, Dept Stat, 1 Oxford St, Cambridge, MA 02138 USA Harvard Univ, Dept Stat, Cambridge, MA 02138 USA Harvard Univ, Sch Publ Hlth, Dana Farber Canc Inst, Dept Biostat, Boston, MA 02115 USA Univ N Carolina, Carolina Ctr Genome Sci, Chapel Hill, NC 27599 USA Univ N Carolina, Dept Biol, Chapel Hill, NC 27599 USA DE sequence motif discovery; microarray data; correlation; transcription regulation ID YEAST SACCHAROMYCES-CEREVISIAE; BINDING-SITES; CELL-CYCLE; DNA; IDENTIFICATION; GENES; SEQUENCES AB We propose MOTIF REGRESSOR for discovering sequence motifs upstream of genes that undergo expression changes in a given condition. The method combines the advantages of matrix-based motif finding and oligomer motif-expression regression analysis, resulting in high sensitivity and specificity. MOTIF REGRESSOR is particularly effective in discovering expression-mediating motifs of medium to long width with multiple degenerate positions. When applied to Saccharomyces cerevisiae, MOTIF REGRESSOR identified the ROX1 and YAP1 motifs from Rox1p and Yap1p overexpression experiments, respectively; predicted that Gcn4p may have increased activity in YAP1 deletion mutants; reported a group of motifs (including GCN4, PHO4, MET4, STRE, USR1, RAN, M3A, and M3B) that may mediate the transcriptional response to amino acid starvation; and found all of the known cell-cycle regulation motifs from 18 expression microarrays over two cell cycles. TC 0 BP 3339 EP 3344 PG 6 JI Proc. Natl. Acad. Sci. U. S. A. 
PY 2003 PD MAR 18 VL 100 IS 6 GA 657PH PI WASHINGTON RP Liu JS Harvard Univ, Dept Stat, 1 Oxford St, Cambridge, MA 02138 USA J9 PROC NAT ACAD SCI USA PA 2101 CONSTITUTION AVE NW, WASHINGTON, DC 20418 USA UT ISI:000181675200066 ER PT Journal AU Kim, JT Martinetz, T Polani, D TI Bioinformatic principles underlying the information content of transcription factor binding sites SO JOURNAL OF THEORETICAL BIOLOGY LA English DT Article SN 0022-5193 PU ACADEMIC PRESS LTD ELSEVIER SCIENCE LTD C1 Inst Neuro & Bioinformat, Seelandstr 1A, D-23569 Lubeck, Germany Inst Neuro & Bioinformat, D-23569 Lubeck, Germany ID FREE-ENERGY; DNA; EVOLUTION; SEQUENCES; FAMILY; GENES AB Empirically, it has been observed in several cases that the information content of transcription factor binding site sequences (R-sequence) approximately equals the information content of binding site positions (R-frequency). A general framework for formal models of transcription factors and binding sites is developed to address this issue. Measures for information content in transcription factor binding sites are revisited and theoretic analyses are compared on this basis. These analyses do not lead to consistent results. A comparative review reveals that these inconsistent approaches do not include a transcription factor-state space. Therefore, a state space for mathematically representing transcription factors with respect to their binding site recognition properties is introduced into the modelling framework. Analysis of the resulting comprehensive model shows that the structure of genome state space favours equality of R-sequence and R-frequency indeed, but the relation between the two information quantities also depends on the structure of the transcription factor state space. This might lead to significant deviations between R-sequence and R-frequency. 
However, further investigation and biological arguments show that the effects of the structure of the transcription factor state space on the relation of R-sequence and R-frequency are strongly limited for systems which are autonomous in the sense that all DNA-binding proteins operating on the genome are encoded in the genome itself. This provides a theoretical explanation for the empirically observed equality. (C) 2003 Elsevier Science Ltd. All rights reserved. TC 0 BP 529 EP 544 PG 16 JI J. Theor. Biol. PY 2003 PD FEB 21 VL 220 IS 4 GA 657ZT PI LONDON RP Kim JT Inst Neuro & Bioinformat, Seelandstr 1A, D-23569 Lubeck, Germany J9 J THEOR BIOL PA 24-28 OVAL RD, LONDON NW1 7DX, ENGLAND UT ISI:000181697700006 ER PT Journal AU Mangalam, HJ TI tacg - a grep for DNA SO BMC BIOINFORMATICS LA English DT Article SN 1471-2105 PU BIOMED CENTRAL LTD C1 tacg Informat, 1 Whistler Ct, Irvine, CA 92612 USA tacg Informat, Irvine, CA 92612 USA ID SEARCH AB Background: Pattern matching is the core of bioinformatics; it is used in database searching, restriction enzyme mapping, and finding open reading frames. It is done repeatedly over increasingly long sequences, thus codes must be efficient and insensitive to sequence length. Such patterns of interest include simple motifs with IUPAC degeneracies, regular expressions, patterns allowing mismatches, and probability matrices. Results: I describe a small application which allows searching for all the above pattern types individually, which further allows these atomic motifs to be assembled into logical rules for more sophisticated analysis. Conclusion: tacg is small, portable, faster and more capable than most alternatives, relatively easy to modify, and freely available in source code. TC 0 BP art. no. 
EP 8 PG 4 JI BMC Bioinformatics PY 2002 VL 3 GA 654CU PI LONDON RP Mangalam HJ tacg Informat, 1 Whistler Ct, Irvine, CA 92612 USA J9 BMC BIOINFORMATICS PA MIDDLESEX HOUSE, 34-42 CLEVELAND ST, LONDON W1T 4LB, ENGLAND UT ISI:000181476800008 ER PT Journal AU Lio, P TI Statistical bioinformatic methods in microbial genome analysis SO BIOESSAYS LA English DT Article SN 0265-9247 PU COMPANY OF BIOLOGISTS LTD C1 Univ Cambridge, Dept Zool, Cambridge CB2 1TN, England Univ Cambridge, Dept Zool, Cambridge CB2 1TN, England European Bioinformat Inst, Hinxton, Cambs, England ID BACTERIAL GENOMES; ESCHERICHIA-COLI; PATHOGENICITY ISLANDS; REGULATORY REGIONS; MAXIMUM-LIKELIHOOD; DNA-SEQUENCES; GENE; EVOLUTION; IDENTIFICATION; PREDICTION AB It is probable that, increasingly, genome investigations are going to be based on statistical formalization. This review summarizes the state of art and potentiality of using statistics in microbial genome analysis. First, I focus on recent advances in functional genomics, such as finding genes and operons, identifying gene conversion events, detecting DNA replication origins and analysing regulatory sites. Then I describe how to use phylogenetic methods in genome analysis and methods for genome-wide scanning for positively selected amino acids. I conclude with speculations on the future course of genome statistical modeling. 
TC 0 BP 266 EP 273 PG 8 JI Bioessays PY 2003 PD MAR VL 25 IS 3 GA 650RH PI CAMBRIDGE RP Lio P Univ Cambridge, Dept Zool, Cambridge CB2 1TN, England J9 BIOESSAYS PA BIDDER BUILDING CAMBRIDGE COMMERCIAL PARK COWLEY RD, CAMBRIDGE CB4 4DL, CAMBS, ENGLAND UT ISI:000181276700010 ER PT Journal AU Bultrini, E Pizzi, E Del Giudice, P Frontali, C TI Pentamer vocabularies characterizing introns and intron-like intergenic tracts from Caenorhabditis elegans and Drosophila melanogaster SO GENE LA English DT Article SN 0378-1119 PU ELSEVIER SCIENCE BV C1 Ist Super Sanita, Biol Cellulaire Lab, Viale Regina Elena 299, I-00161 Rome, Italy Ist Super Sanita, Biol Cellulaire Lab, I-00161 Rome, Italy Ist Super Sanita, Fis Lab, I-00161 Rome, Italy DE introns; Caenorhabditis elegans; Drosophila melanogaster; linguistic properties ID NUCLEOTIDE-SEQUENCES; GENOME; FEATURES AB Overall compositional properties at the level of bases, dinucleotides and longer oligos characterize genomes of different species. In Caenorhabditis elegans, using recurrence analysis, we recognized the existence of a long-range correlation in the oligonucleotide usage of introns and intergenic regions. Through correlation analysis, this is confirmed here to be a genome-wide property of C. elegans non-coding portions. We then investigate the possibility of extracting a typical vocabulary through statistical analysis of experimentally confirmed introns of sufficient length (> 1 kb), deprived of known splice signals, the focus being on distributed lexical features rather than on localized motifs. Lexical preferences typical of introns could be exposed using principal component analysis of pentanucleotide frequency distributions, both in C. elegans and in Drosophila melanogaster. In either species, the introns' pentamer preferences are largely shared by intergenic tracts. The pentamer vocabularies extracted for the two species exhibit interesting symmetry properties and overlap in part. 
A more extensive investigation of the interspecies relationship at the level of oligonucleotide preferences in non-coding regions, not related by sequence similarity, might form the basis of new approaches for the study of the evolutionary behaviour of these regions. (C) 2002 Elsevier Science B.V. All rights reserved. TC 0 BP 183 EP 192 PG 10 JI Gene PY 2003 PD JAN 30 VL 304 GA 647ZT PI AMSTERDAM RP Pizzi E Ist Super Sanita, Biol Cellulaire Lab, Viale Regina Elena 299, I-00161 Rome, Italy J9 GENE PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS UT ISI:000181124500018 ER PT Journal AU Krull, M Voss, N Choi, C Pistor, S Potapov, A Wingender, E TI TRANSPATH (R): an integrated database on signal transduction and a tool for array analysis SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 BIOBASE Biol Databases GmbH, Halchtersche Str 33, D-38304 Wolfenbuttel, Germany BIOBASE Biol Databases GmbH, D-38304 Wolfenbuttel, Germany German Res Ctr Biotechnol, GBF, AG Bioinformat, D-38124 Braunschweig, Germany AB TRANSPATH(R) is a database system about gene regulatory networks that combines encyclopedic information on signal transduction with tools for visualization and analysis. The integration with TRANSFAC(R), a database about transcription factors and their DNA binding sites, provides the possibility to obtain complete signaling pathways from ligand to target genes and their products, which may themselves be involved in regulatory action. As of July 2002, the TRANSPATH Professional release 3.2 contains about 9800 molecules, >1800 genes and >11400 reactions collected from ~5000 references. With the ArrayAnalyzer(TM), an integrated tool has been developed for evaluation of microarray data. It uses the TRANSPATH data set to identify key regulators in pathways connected with up- or down-regulated genes of the respective array. 
The key molecules and their surrounding networks can be viewed with the PathwayBuilder(TM), a tool that offers four different modes of visualization. More information on TRANSPATH is available at http://www.biobase.de/pages/products/databases.html. TC 1 BP 97 EP 100 PG 4 JI Nucleic Acids Res. PY 2003 PD JAN 1 VL 31 IS 1 GA 647EP PI OXFORD RP Krull M BIOBASE Biol Databases GmbH, Halchtersche Str 33, D-38304 Wolfenbuttel, Germany J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000181079700021 ER PT Journal AU Shahmuradov, IA Gammerman, AJ Hancock, JM Bramley, PM Solovyev, VV TI PlantProm: a database of plant promoter sequences SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 Softberry Inc, 116 Radio Circle,Suite 400, Mt Kisco, NY 10549 USA Softberry Inc, Mt Kisco, NY 10549 USA Univ London Royal Holloway & Bedford New Coll, Dept Comp Sci, Egham TW20 0EX, Surrey, England Univ London Royal Holloway & Bedford New Coll, Sch Biol Sci, Egham TW20 0EX, Surrey, England AB PlantProm DB, a plant promoter database, is an annotated, non-redundant collection of proximal promoter sequences for RNA polymerase II with experimentally determined transcription start site(s), TSS, from various plant species. The first release (2002.01) of PlantProm DB contains 305 entries including 71, 220 and 14 promoters from monocot, dicot and other plants, respectively. It provides DNA sequence of the promoter regions (-200:+51) with TSS on the fixed position +201, taxonomic/promoter type classification of promoters and Nucleotide Frequency Matrices (NFM) for promoter elements: TATA-box, CCAAT-box and TSS-motif (Inr). Analysis of TSS-motifs revealed that their composition is different in dicots and monocots, as well as for TATA and TATA-less promoters. The database serves as learning set in developing plant promoter prediction programs. One such program (TSSP) based on discriminant analysis has been created by Softberry Inc. 
and the application of a support vector machine approach for promoter identification is under development. PlantProm DB is available at http://mendel.cs.rhul.ac.uk/ and http://www.softberry.com/. TC 0 BP 114 EP 117 PG 4 JI Nucleic Acids Res. PY 2003 PD JAN 1 VL 31 IS 1 GA 647EP PI OXFORD RP Solovyev VV Softberry Inc, 116 Radio Circle,Suite 400, Mt Kisco, NY 10549 USA J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000181079700025 ER PT Journal AU Munch, R Hiller, K Barg, H Heldt, D Linz, S Wingender, E Jahn, D TI PRODORIC: prokaryotic database of gene regulation SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 Tech Univ Carolo Wilhelmina Braunschweig, Inst Mikrobiol, Spielmannstr 7, D-38106 Braunschweig, Germany Tech Univ Carolo Wilhelmina Braunschweig, Inst Mikrobiol, D-38106 Braunschweig, Germany Gesell Biotechnol Forsch mbH, D-38124 Braunschweig, Germany BIOBASE GmbH, D-38304 Wolfenbuttel, Germany ID ESCHERICHIA-COLI K-12; PROTEINS AB The database PRODORIC aims to systematically organize information on prokaryotic gene expression, and to integrate this information into regulatory networks. The present version focuses on pathogenic bacteria such as Pseudomonas aeruginosa. PRODORIC links data on environmental stimuli with trans-acting transcription factors, cis-acting promoter elements and regulon definition. Interactive graphical representations of operon, gene and promoter structures including regulator-binding sites, transcriptional and translational start sites, supplemented with information on regulatory proteins are available at varying levels of detail. The data collection provided is based on exhaustive analyses of scientific literature and computational sequence prediction. Included within PRODORIC are tools to define and predict regulator binding sites. It is accessible at http://prodoric.tu-bs.de. TC 0 BP 266 EP 269 PG 4 JI Nucleic Acids Res. 
PY 2003 PD JAN 1 VL 31 IS 1 GA 647EP PI OXFORD RP Jahn D Tech Univ Carolo Wilhelmina Braunschweig, Inst Mikrobiol, Spielmannstr 7, D-38106 Braunschweig, Germany J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000181079700062 ER PT Journal AU Matys, V Fricke, E Geffers, R Gossling, E Haubrock, M Hehl, R Hornischer, K Karas, D Kel, AE Kel-Margoulis, OV Kloos, DU Land, S Lewicki-Potapov, B Michael, H Munch, R Reuter, I Rotert, S Saxel, H Scheer, M Thiele, S Wingender, E TI TRANSFAC (R): transcriptional regulation, from patterns to profiles SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 BIOBASE GmbH, Halchtersche Str 33, D-38304 Wolfenbuttel, Germany BIOBASE GmbH, D-38304 Wolfenbuttel, Germany Tech Univ Carolo Wilhelmina Braunschweig, Biozentrum, Inst Genet, D-38106 Braunschweig, Germany Gesell Biotechnol Forsch mbH, D-38124 Braunschweig, Germany ID GENE-EXPRESSION REGULATION; DATABASE; SYSTEM; PROTEINS; COMPEL; TRRD AB The TRANSFAC(R) database on eukaryotic transcriptional regulation, comprising data on transcription factors, their target genes and regulatory binding sites, has been extended and further developed, both in number of entries and in the scope and structure of the collected data. Structured fields for expression patterns have been introduced for transcription factors from human and mouse, using the CYTOMER(R) database on anatomical structures and developmental stages. The functionality of Match(TM), a tool for matrix-based search of transcription factor binding sites, has been enhanced. For instance, the program now comes along with a number of tissue- (or state-)specific profiles and new profiles can be created and modified with Match(TM) Profiler. The GENE table was extended and gained in importance, containing amongst others links to LocusLink, RefSeq and OMIM now. 
Further, (direct) links between factor and target gene on one hand and between gene and encoded factor on the other hand were introduced. The TRANSFAC(R) public release is available at http://www.gene-regulation.com. For yeast an additional release including the latest data was made available separately as TRANSFAC(R) Saccharomyces Module (TSM) at http://transfac.gbf.de. For CYTOMER(R) free download versions are available at http://www.biobase.de:8080/index.html. TC 10 BP 374 EP 378 PG 5 JI Nucleic Acids Res. PY 2003 PD JAN 1 VL 31 IS 1 GA 647EP PI OXFORD RP Matys V BIOBASE GmbH, Halchtersche Str 33, D-38304 Wolfenbuttel, Germany J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000181079700090 ER PT Book in series AU Potapov, AP Wingender, E TI Representing the architecture of signal transduction networks in an algebraic form - Protein target finding SO CELL SIGNALING, TRANSCRIPTION, AND TRANSLATION AS THERAPEUTIC TARGETS LA English DT Article SN 0077-8923 PU NEW YORK ACAD SCIENCES C1 German Res Ctr Biotechnol, Mascheroder Weg 1, D-38124 Braunschweig, Germany German Res Ctr Biotechnol, D-38124 Braunschweig, Germany DE regulatory networks; signal transduction networks; in silico modeling TC 0 BP 1 EP 2 PG 2 JI Ann.NY Acad.Sci. 
SE ANNALS OF THE NEW YORK ACADEMY OF SCIENCES PY 2002 VL 973 GA BV70N PI NEW YORK RP Potapov AP German Res Ctr Biotechnol, Mascheroder Weg 1, D-38124 Braunschweig, Germany J9 ANN N Y ACAD SCI PA 2 EAST 63RD ST, NEW YORK, NY 10021 USA UT ISI:000179853000001 ER PT Journal AU Park, PJ Butte, AJ Kohane, IS TI Comparing expression profiles of genes with similar promoter regions SO BIOINFORMATICS LA English DT Article SN 1367-4803 PU OXFORD UNIV PRESS C1 Childrens Hosp, Informat Program, 300 Longwood Ave, Boston, MA 02115 USA Childrens Hosp, Informat Program, Boston, MA 02115 USA Childrens Hosp, Div Endocrinol, Boston, MA 02115 USA ID TRANSCRIPTIONAL ACTIVATORS BAS1; YEAST SACCHAROMYCES-CEREVISIAE; REGULATORY ELEMENTS; BINDING SITES; CELL-CYCLE; GENOME; IDENTIFICATION; SEQUENCES; DATABASE AB Motivation: Gene regulatory elements are often predicted by seeking common sequences in the promoter regions of genes that are clustered together based on their expression profiles. We consider the problem in the opposite direction: we seek to find the genes that have similar promoter regions and determine the extent to which these genes have similar expression profiles. Results: We use the data sets from experiments on Saccharomyces cerevisiae. Our similarity measure for the promoter regions is based on the set of common mapped or putative transcription factor binding sites and other regulatory elements in the upstream region of the genes, as contained in the Saccharomyces cerevisiae Promoter Database. We pair up the genes with high similarity scores and compare their expression levels in time-course experiment data. We find that genes with similar promoter regions on the average have significantly higher correlation, but it can vary widely depending on the genes. 
This confirms that the presence of similar regulatory elements often does not correspond to similarity in expression profiles and indicates that finding transcription factor binding sites or other regulatory elements starting with the expression patterns may be limited in many cases. Regardless of the correlation, the degree to which the profiles agree under different experimental conditions can be examined to derive hypotheses concerning the role of common regulatory elements. Overall, we find that considering the relationship between the promoter regions and the expression profiles starting with the regulatory elements is a difficult but useful process that can provide valuable insights. TC 0 BP 1576 EP 1584 PG 9 JI Bioinformatics PY 2002 PD DEC VL 18 IS 12 GA 627TG PI OXFORD RP Park PJ Childrens Hosp, Informat Program, 300 Longwood Ave, Boston, MA 02115 USA J9 BIOINFORMATICS PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000179951800005 ER PT Journal AU Moreau, Y De Smet, F Thijs, G Marchal, K De Moor, B TI Functional bioinformatics of microarray data: From expression to regulation SO PROCEEDINGS OF THE IEEE LA English DT Article SN 0018-9219 PU IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC C1 Katholieke Univ Leuven, Dept Elect Engn, Louvain, Belgium Katholieke Univ Leuven, Dept Elect Engn, Louvain, Belgium DE adaptive quality-based clustering; clustering; Gibbs sampling; microarray; motif finding; regulation ID GENE-EXPRESSION; EXPECTATION MAXIMIZATION; CLUSTER-ANALYSIS; COMPUTATIONAL ANALYSIS; NONCODING SEQUENCES; BINDING SITES; WHOLE-GENOME; PATTERNS; IDENTIFICATION; DNA AB Using microarrays is a powerful technique to monitor the expression of thousands of genes in a single experiment. From series of such experiments, it is possible to identify the mechanisms that govern the activation of genes in an organism. Short deoxyribonucleic acid patterns (called binding sites) near the genes serve as switches that control gene expression. 
As a result, similar patterns of expression can correspond to similar binding site patterns. Here we integrate clustering of coexpressed genes with the discovery of binding motifs. We overview several important clustering techniques and present a clustering algorithm (called adaptive quality-based clustering), which we have developed to address several shortcomings of existing methods. We overview the different techniques for motif finding, in particular the technique of Gibbs sampling, and we present several extensions of this technique in our Motif Sampler. Finally, we present an integrated web tool called INCLUSive (available online at http://www.esat.kuleuven.ac.be/~dna/BioI/Software.html) that allows the easy analysis of microarray data for motif finding. TC 1 BP 1722 EP 1743 PG 22 JI Proc. IEEE PY 2002 PD NOV VL 90 IS 11 GA 614RQ PI NEW YORK RP Moreau Y Katholieke Univ Leuven, Dept Elect Engn, Louvain, Belgium J9 PROC IEEE PA 345 E 47TH ST, NEW YORK, NY 10017-2394 USA UT ISI:000179204700004 ER PT Journal AU Sabatti, C Lange, K TI Genomewide motif identification using a dictionary model SO PROCEEDINGS OF THE IEEE LA English DT Article SN 0018-9219 PU IEEE-INST ELECTRICAL ELECTRONICS ENGINEERS INC C1 Univ Calif Los Angeles, Dept Human Genet, Los Angeles, CA 90095 USA Univ Calif Los Angeles, Dept Human Genet, Los Angeles, CA 90095 USA Univ Calif Los Angeles, Biomath Dept, Los Angeles, CA 90095 USA Univ Calif Los Angeles, Dept Stat, Los Angeles, CA 90095 USA DE expectation-maximization algorithm; genomic sequence; maximum a posteriori; text segmentation ID SITES; ALGORITHM AB This paper surveys and extends models and algorithms for identifying binding sites in noncoding regions of DNA. Binding sites control the transcription of genes into messenger RNA in preparation for translation into proteins. The base sequence of most binding sites is not entirely fixed, with the different permitted spellings collectively constituting a "motif." 
After summarizing the underlying biological issues, we review three different models for binding site identification. Each model was developed with a different type of dataset as reference. We then present a unified model that borrows from the previous ones and integrates their main features. In our unified model, one can identify motifs and their unknown positions along a sequence. One can also fit the model to data using maximum likelihood and maximum a posteriori algorithms. These algorithms rely on recursive formulas and the maximization/minorization principle. Finally, we conclude with a prospectus of future data analyses and theoretical research. TC 0 BP 1803 EP 1810 PG 8 JI Proc. IEEE PY 2002 PD NOV VL 90 IS 11 GA 614RQ PI NEW YORK RP Sabatti C Univ Calif Los Angeles, Dept Human Genet, Los Angeles, CA 90095 USA J9 PROC IEEE PA 345 E 47TH ST, NEW YORK, NY 10017-2394 USA UT ISI:000179204700010 ER PT Journal AU Laurio, K Linaker, F Narayanan, A TI Regular biosequence pattern matching with cellular automata SO INFORMATION SCIENCES LA English DT Article SN 0020-0255 PU ELSEVIER SCIENCE INC C1 Univ Skovde, Dept Comp Sci, Box 408, S-54128 Skovde, Sweden Univ Skovde, Dept Comp Sci, S-54128 Skovde, Sweden Univ Exeter, Dept Comp Sci, Exeter EX4 4QF, Devon, England DE cellular automata; biosequence; pattern matching; regular expression ID DATABASE AB Could algorithms designed within the computational model of cellular automata fit well into a future biocomputing environment where data and computation are seamlessly and dynamically distributed? We suggest it is possible by presenting a systematic approach for creating 1D linear cellular automata that in parallel can locate all starting positions of complete matches to a given PROSITE pattern in a string. The cellular automaton requires time proportional to the maximal length of a pattern match, and is inherently suited for distribution out on multiple processors. (C) 2002 Elsevier Science Inc. All rights reserved. 
TC 0 BP 89 EP 101 PG 13 JI Inf. Sci. PY 2002 PD OCT VL 146 IS 1-4 GA 612BW PI NEW YORK RP Laurio K Univ Skovde, Dept Comp Sci, Box 408, S-54128 Skovde, Sweden J9 INFORM SCIENCES PA 360 PARK AVE SOUTH, NEW YORK, NY 10010-1710 USA UT ISI:000179054800007 ER PT Journal AU Sudarsanam, P Pilpel, Y Church, GM TI Genome-wide co-occurrence of promoter elements reveals a cis-regulatory cassette of rRNA transcription motifs in Saccharomyces cerevisiae SO GENOME RESEARCH LA English DT Article SN 1088-9051 PU COLD SPRING HARBOR LAB PRESS C1 Harvard Univ, Sch Med, Dept Genet, Boston, MA 02115 USA Harvard Univ, Sch Med, Dept Genet, Boston, MA 02115 USA Harvard Univ, Sch Med, Lipper Ctr Computat Genet, Boston, MA 02115 USA ID GENE-EXPRESSION; CELL-CYCLE; IDENTIFICATION; YEAST; NETWORKS; SEQUENCES; BINDING; REGIONS; SCALE AB Combinatorial regulation is an important feature of eukaryotic transcription. However, only a limited number of studies have characterized this aspect on a whole-genome level. We have conducted a genome-wide computational survey to identify cis-regulatory motif pairs that co-occur in a significantly high number of promoters in the S. cerevisiae genome. A pair of novel motifs, mRRPE and PAC, co-occur most highly in the genome, primarily in the promoters of genes involved in rRNA transcription and processing. The two motifs show significant positional and orientational bias with mRRPE being closer to the ATG than PAC in most promoters. Two additional rRNA-related motifs, mRRSE3 and mRRSE10, also co-occur with mRRPE and PAC. mRRPE and PAC are the primary determinants of expression profiles while mRRSE3 and mRRSE10 modulate these patterns. We describe a new computational approach for studying the functional significance of the physical locations of promoter elements that combine analyses of genome sequence and microarray data. 
Applying this methodology to the regulatory cassette containing the four rRNA motifs demonstrates that the relative promoter locations of these elements have a profound effect on the expression patterns of the downstream genes. These findings provide a function for these novel motifs and insight into the mechanism by which they regulate gene expression. The methodology introduced here should prove particularly useful for analyzing transcriptional regulation in more complex genomes. TC 2 BP 1723 EP 1731 PG 9 JI Genome Res. PY 2002 PD NOV VL 12 IS 11 GA 612DB PI PLAINVIEW RP Church GM Harvard Univ, Sch Med, Dept Genet, Boston, MA 02115 USA J9 GENOME RES PA 1 BUNGTOWN RD, PLAINVIEW, NY 11724 USA UT ISI:000179058300011 ER PT Journal AU Benos, PV Bulyk, ML Stormo, GD TI Additivity in protein-DNA interactions: how good an approximation is it? SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 Washington Univ, Sch Med, Dept Genet, Campus Box 8232, St Louis, MO 63110 USA Washington Univ, Sch Med, Dept Genet, St Louis, MO 63110 USA Univ Pittsburgh, Dept Human Genet, Pittsburgh, PA 15261 USA Univ Pittsburgh, Ctr Computat Biol & Bioinformat, Pittsburgh, PA 15261 USA Univ Pittsburgh, Inst Canc, Pittsburgh, PA 15261 USA Brigham & Womens Hosp, Dept Med, Div Genet, Boston, MA 02115 USA Brigham & Womens Hosp, Dept Pathol, Boston, MA 02115 USA Harvard Univ, Sch Med, Boston, MA 02115 USA Harvard Mit Div Hlth Sci & Technol, Boston, MA 02115 USA ID BINDING-SITES; RECOGNITION CODE; ZINC FINGERS; TRANSCRIPTION FACTORS; INFORMATION-CONTENT; TARGET SITES; SEQUENCES; PREDICTION; IDENTIFICATION; STRATEGY AB Man and Stormo and Bulyk et al. recently presented their results on the study of the DNA binding affinity of proteins. In both of these studies the main conclusion is that the additivity assumption, usually applied in methods to search for binding sites, is not true. 
In the first study, the analysis of binding affinity data from the Mnt repressor protein bound to all possible DNA (sub)targets at positions 16 and 17 of the binding site, showed that those positions are not independent. In the second study, the authors analysed DNA binding affinity data of the wild-type mouse EGR1 protein and four variants differing on the middle finger. The binding affinity of these proteins was measured to all 64 possible trinucleotide (sub)targets of the middle finger using microarray technology. The analysis of the measurements also showed interdependence among the positions in the DNA target. In the present report, we review the data of both studies and we re-analyse them using various statistical methods, including a comparison with a multiple regression approach. We conclude that despite the fact that the additivity assumption does not fit the data perfectly, in most cases it provides a very good approximation of the true nature of the specific protein-DNA interactions. Therefore, additive models can be very useful for the discovery and prediction of binding sites in genomic DNA. TC 3 BP 4442 EP 4451 PG 10 JI Nucleic Acids Res. 
PY 2002 PD OCT 15 VL 30 IS 20 GA 608CC PI OXFORD RP Stormo GD Washington Univ, Sch Med, Dept Genet, Campus Box 8232, St Louis, MO 63110 USA J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000178826700022 ER PT Journal AU Lin, J Qian, J Greenbaum, D Bertone, P Das, R Echols, N Senes, A Stenger, B Gerstein, M TI GeneCensus: genome comparisons in terms of metabolic pathway activity and protein family sharing SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 Yale Univ, Dept Mol Biophys & Biochem, POB 208114, New Haven, CT 06520 USA Yale Univ, Dept Mol Biophys & Biochem, New Haven, CT 06520 USA ID GENE-EXPRESSION; SACCHAROMYCES-CEREVISIAE; STATISTICAL-ANALYSIS; YEAST GENOME; WILD-TYPE; DATABASE; SYSTEM; IDENTIFICATION; PHYSIOLOGY; PROTEOMICS AB We present a prototype of a new database tool, GeneCensus, which focuses on comparing genomes globally, in terms of the collective properties of many genes, rather than in terms of the attributes of a single gene (e.g. sequence similarity for a particular ortholog). The comparisons are presented in a visual fashion over the web at GeneCensus.org. The system concentrates on two types of comparisons: (i) trees based on the sharing of generalized protein families between genomes, and (ii) whole pathway analysis in terms of activity levels. For the trees, we have developed a module (TreeViewer) that clusters genomes in terms of the folds, superfamilies or orthologs-all can be considered as generalized 'families' or 'protein parts'-they share, and compares the resulting trees side-by-side with those built from sequence similarity of individual genes (e.g. a traditional tree built on ribosomal similarity). We also include comparisons to trees built on whole-genome dinucleotide or codon composition. 
For pathway comparisons, we have implemented a module (PathwayPainter) that graphically depicts, in selected metabolic pathways, the fluxes or expression levels of the associated enzymes (i.e. generalized 'activities'). One can, consequently, compare organisms (and organism states) in terms of representations of these systemic quantities. Development of this module involved compiling, calculating and standardizing flux and expression information from many different sources. We illustrate pathway analysis for enzymes involved in central metabolism. We are able to show that, to some degree, flux and expression fluctuations have characteristic values in different sections of the central metabolism and that control points in this system (e.g. hexokinase, pyruvate kinase, phosphofructokinase, isocitrate dehydrogenase and citric synthase) tend to be especially variable in flux and expression. Both the TreeViewer and PathwayPainter modules connect to other information sources related to individual-gene or organism properties (e.g. a single-gene structural annotation viewer). TC 0 BP 4574 EP 4582 PG 9 JI Nucleic Acids Res. 
PY 2002 PD OCT 15 VL 30 IS 20 GA 608CC PI OXFORD RP Gerstein M Yale Univ, Dept Mol Biophys & Biochem, POB 208114, New Haven, CT 06520 USA J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000178826700036 ER PT Journal AU Sirava, M Schafer, T Eiglsperger, M Kaufmann, M Kohlbacher, O Bornberg-Bauer, E Lenhof, HP TI BioMiner - modeling, analyzing, and visualizing biochemical pathways and networks SO BIOINFORMATICS LA English DT Article SN 1367-4803 PU OXFORD UNIV PRESS C1 Univ Saarland, Ctr Bioinformat, POB 151150, D-66041 Saarbrucken, Germany Univ Saarland, Ctr Bioinformat, D-66041 Saarbrucken, Germany Univ Tubingen, Wilhelm Schickard Inst Informat, D-72076 Tubingen, Germany Univ Manchester, Sch Biol Sci, Manchester M13 9PT, Lancs, England DE biochemical data model; metabolic and regulatory pathways; visualization; Java; XML ID NEW-GENERATION; DATABASE; INFORMATION; GENOMES; ENZYMES; SYSTEM; GENES AB Motivation: Understanding the biochemistry of a newly sequenced organism is an essential task for post-genomic analysis. Since, however, genome and array data grow much faster than biochemical information, it is necessary to infer reactions by comparative analysis. No integrated and easy to use software tool for this purpose exists as yet. Results: We present a new software system-BioMiner-for analyzing and visualizing biochemical pathways and networks. BioMiner is based on a new comprehensive, extensible and reusable data model-BioCore-which can be used to model biochemical pathways and networks. As a first application we present PathFinder, a new tool predicting biochemical pathways by comparing groups of related organisms based on sequence similarity. We successfully tested PathFinder with a number of experiments, e.g. the well studied glycolysis in bacteria. Additionally, an application called PathViewer for the visualization of metabolic networks is presented. 
PathViewer is the first application we are aware of which supports the graphical comparison of metabolic networks of different organisms. TC 1 BP S219 EP S230 PG 12 JI Bioinformatics PY 2002 PD OCT VL 18 SU S GA 608GC PI OXFORD RP Sirava M Univ Saarland, Ctr Bioinformat, POB 151150, D-66041 Saarbrucken, Germany J9 BIOINFORMATICS PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000178836800031 ER PT Journal AU Hannenhalli, S Levy, S TI Predicting transcription factor synergism SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 Celera Genom, Informat Res, 45 W Gude Dr, Rockville, MD 20850 USA Celera Genom, Informat Res, Rockville, MD 20850 USA ID NUCLEAR TRANSLOCATOR ARNT; CREB-BINDING-PROTEIN; NF-KAPPA-B; ESTROGEN-RECEPTOR; GENE-EXPRESSION; RESPONSE ELEMENT; ALPHA-SUBUNIT; BETA-CATENIN; PROMOTER; IDENTIFICATION AB Transcriptional regulation is mediated by a battery of transcription factor (TF) proteins, that form complexes involving protein-protein and protein-DNA interactions. Individual TFs bind to their cognate cis-elements or transcription factor-binding sites (TFBS). TFBS are organized on the DNA proximal to the gene in groups confined to a few hundred base pair regions. These groups are referred to as modules. Various modules work together to provide the combinatorial regulation of gene transcription in response to various developmental and environmental conditions. The sets of modules constitute a promoter model. Determining the TFs that preferentially work in concert as part of a module is an essential component of understanding transcriptional regulation. The TFs that act synergistically in such a fashion are likely to have their cis-elements co-localized on the genome at specific distances apart. We exploit this notion to predict TF pairs that are likely to be part of a transcriptional module on the human genome sequence. 
The computational method is validated statistically, using known interacting pairs extracted from the literature. There are 251 TFBS pairs up to 50 bp apart and 70 TFBS pairs up to 200 bp apart that score higher than any of the known synergistic pairs. Further investigation of 50 pairs randomly selected from each of these two sets using PubMed queries provided additional supporting evidence from the existing biological literature suggesting TF synergism for these novel pairs. TC 1 BP 4278 EP 4284 PG 7 JI Nucleic Acids Res. PY 2002 PD OCT 1 VL 30 IS 19 GA 603KM PI OXFORD RP Hannenhalli S Celera Genom, Informat Res, 45 W Gude Dr, Rockville, MD 20850 USA J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000178558500027 ER PT Journal AU Li, H Rhodius, V Gross, C Siggia, ED TI Identification of the binding sites of regulatory proteins in bacterial genomes SO PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA LA English DT Article SN 0027-8424 PU NATL ACAD SCIENCES C1 Univ Calif San Francisco, Dept Biochem & Biophys, 513 Parnassus Ave, San Francisco, CA 94143 USA Univ Calif San Francisco, Dept Biochem & Biophys, San Francisco, CA 94143 USA Univ Calif San Francisco, Dept Stomatol, San Francisco, CA 94143 USA Univ Calif San Francisco, Dept Immunol & Microbiol, San Francisco, CA 94143 USA Rockefeller Univ, Ctr Studies Phys & Biol, New York, NY 10021 USA DE algorithm; position weight matrix; DNA-binding site; transcription factor; E. coli ID ESCHERICHIA-COLI K-12; COMPUTATIONAL ANALYSIS; EXPRESSION PATTERNS; NONCODING SEQUENCES; WHOLE-GENOME; DNA; TRANSCRIPTION; ELEMENTS; SIGNALS; MOTIFS AB We present an algorithm that extracts the binding sites (represented by position-specific weight matrices) for many different transcription factors from the regulatory regions of a genome, without the need for delineating groups of coregulated genes. 
The algorithm uses the fact that many DNA-binding proteins in bacteria bind to a bipartite motif with two short segments more conserved than the intervening region. It identifies all statistically significant patterns of the form W1NxW2, where W1 and W2 are two short oligonucleotides separated by x arbitrary bases, and groups them into clusters of similar patterns. These clusters are then used to derive quantitative recognition profiles of putative regulatory proteins. For a given cluster, the algorithm finds the matching sequences plus the flanking regions in the genome and performs a multiple sequence alignment to derive position-specific weight matrices. We have analyzed the Escherichia coli genome with this algorithm and found approximately 1,500 significant patterns, which give rise to approximately 160 distinct position-specific weight matrices. A fraction of these matrices match the binding sites of one-third of the approximately 60 characterized transcription factors with high statistical significance. Many of the remaining matrices are likely to describe binding sites and regulons of uncharacterized transcription factors. The significance of these matrices was evaluated by their specificity, the location of the predicted sites, and the biological functions of the corresponding regulons, allowing us to suggest putative regulatory functions. The algorithm is efficient for analyzing newly sequenced bacterial genomes for which little is known about transcriptional regulation. TC 2 BP 11772 EP 11777 PG 6 JI Proc. Natl. Acad. Sci. U. S. A. 
PY 2002 PD SEP 3 VL 99 IS 18 GA 590UW PI WASHINGTON RP Li H Univ Calif San Francisco, Dept Biochem & Biophys, 513 Parnassus Ave, San Francisco, CA 94143 USA J9 PROC NAT ACAD SCI USA PA 2101 CONSTITUTION AVE NW, WASHINGTON, DC 20418 USA UT ISI:000177843100044 ER PT Journal AU GuhaThakurta, D Palomar, L Stormo, GD Tedesco, P Johnson, TE Walker, DW Lithgow, G Kim, S Link, CD TI Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods (vol 12, pg 701, 2002) SO GENOME RESEARCH LA English DT Correction SN 1088-9051 PU COLD SPRING HARBOR LAB PRESS TC 0 BP 1301 EP 1301 PG 1 JI Genome Res. PY 2002 PD AUG VL 12 IS 8 GA 583WQ PI PLAINVIEW J9 GENOME RES PA 1 BUNGTOWN RD, PLAINVIEW, NY 11724 USA UT ISI:000177434300018 ER PT Journal AU Liu, XS Brutlag, DL Liu, JS TI An algorithm for finding protein-DNA binding sites with applications to chromatin-immunoprecipitation microarray experiments SO NATURE BIOTECHNOLOGY LA English DT Article SN 1087-0156 PU NATURE AMERICA INC C1 Harvard Univ, Dept Stat, 1 Oxford St, Cambridge, MA 02138 USA Harvard Univ, Dept Stat, Cambridge, MA 02138 USA Stanford Univ, Dept Biochem, Stanford, CA 94305 USA ID YEAST-CELL-CYCLE; SACCHAROMYCES-CEREVISIAE; REGULATORY SITES; SEQUENCE; IDENTIFICATION; ALIGNMENT; GENES AB Chromatin immunoprecipitation followed by cDNA microarray hybridization (ChIP-array) has become a popular procedure for studying genome-wide protein-DNA interactions and transcription regulation. However, it can only map the probable protein-DNA interaction loci within 1-2 kilobases resolution. To pinpoint interaction sites down to the base-pair level, we introduce a computational method, Motif Discovery scan (MDscan), that examines the ChIP-array-selected sequences and searches for DNA sequence motifs representing the protein-DNA interaction sites. 
MDscan combines the advantages of two widely adopted motif search strategies, word enumeration(1-4) and position-specific weight matrix updating(5-9), and incorporates the ChIP-array ranking information to accelerate searches and enhance their success rates. MDscan correctly identified all the experimentally verified motifs from published ChIP-array experiments in yeast(10-13) (STE12, GAL4, RAP1, SCB, MCB, MCM1, SFF, and SWI5), and predicted two motif patterns for the differential binding of Rap1 protein in telomere regions. In our studies, the method was faster and more accurate than several established motif-finding algorithms(5,8,9). MDscan can be used to find DNA motifs not only in ChIP-array experiments but also in other experiments in which a subgroup of the sequences can be inferred to contain relatively abundant motif sites. The MDscan web server can be accessed at http://BioProspector.stanford.edu/MDscan/. TC 4 BP 835 EP 839 PG 6 JI Nat. Biotechnol. PY 2002 PD AUG VL 20 IS 8 GA 579MU PI NEW YORK RP Liu JS Harvard Univ, Dept Stat, 1 Oxford St, Cambridge, MA 02138 USA J9 NAT BIOTECHNOL PA 345 PARK AVE SOUTH, NEW YORK, NY 10010-1707 USA UT ISI:000177182500037 ER PT Book in series AU Li, H TI Computational approaches to identifying transcription factor binding sites in yeast genome SO GUIDE TO YEAST GENETICS AND MOLECULAR AND CELL BIOLOGY, PT B LA English DT Review SN 0076-6879 PU ACADEMIC PRESS INC C1 Univ Calif Irvine, Dept Biol Chem, Irvine, CA 92697 USA Univ Calif Irvine, Dept Biol Chem, Irvine, CA 92697 USA ID REGULATORY SITES; EXPRESSION; SEQUENCES; DNA TC 1 BP 484 EP 495 PG 12 JI Methods Enzymol. 
SE METHODS IN ENZYMOLOGY PY 2002 VL 350 PN B GA BU60A PI SAN DIEGO RP Li H Univ Calif Irvine, Dept Biol Chem, Irvine, CA 92697 USA J9 METH ENZYMOLOGY PA 525 B STREET, SUITE 1900, SAN DIEGO, CA 92101-4495 USA UT ISI:000176466300027 ER PT Journal AU Halfon, MS Grad, Y Church, GM Michelson, AM TI Computation-based discovery of related transcriptional regulatory modules and motifs using an experimentally validated combinatorial model SO GENOME RESEARCH LA English DT Article SN 1088-9051 PU COLD SPRING HARBOR LAB PRESS C1 Brigham & Womens Hosp, Howard Hughes Med Inst, 75 Francis St, Boston, MA 02115 USA Brigham & Womens Hosp, Howard Hughes Med Inst, Boston, MA 02115 USA Brigham & Womens Hosp, Dept Med, Boston, MA 02115 USA Harvard Univ, Sch Med, Dept Genet, Boston, MA 02115 USA Harvard Univ, Sch Med, Lipper Ctr Computat Genet, Boston, MA 02115 USA ID SEQUENCE COMPARISONS; NUCLEOTIDE-SEQUENCE; GENE-EXPRESSION; DROSOPHILA; IDENTIFICATION; PROMOTER; MESODERM; ELEMENTS; REGIONS; DNA AB Gene expression is regulated by transcription factors that interact with cis-regulatory elements. Predicting these elements from sequence data has proven difficult. We describe here a successful computational search for elements that direct expression in a particular temporal-spatial pattern in the Drosophila embryo, based on a single well characterized enhancer model. The fly genome was searched to identify sequence elements containing the same combination of transcription factors as those found in the model. Experimental evaluation of the search results demonstrates that our method can correctly predict regulatory elements and highlights the importance of functional testing as a means of identifying false-positive results. We also show that the search results enable the identification of additional relevant sequence motifs whose functions can be empirically validated. 
This approach, combined with gene expression and phylogenetic sequence data, allows for genome-wide identification of related regulatory elements, an important step toward understanding the genetic regulatory networks involved in development. TC 11 BP 1019 EP 1028 PG 10 JI Genome Res. PY 2002 PD JUL VL 12 IS 7 GA 569LR PI PLAINVIEW RP Michelson AM Brigham & Womens Hosp, Howard Hughes Med Inst, 75 Francis St, Boston, MA 02115 USA J9 GENOME RES PA 1 BUNGTOWN RD, PLAINVIEW, NY 11724 USA UT ISI:000176604300003 ER PT Journal AU Qiu, P Ding, W Jiang, Y Greene, JR Wang, LQ TI Computational analysis of composite regulatory elements SO MAMMALIAN GENOME LA English DT Article SN 0938-8990 PU SPRINGER-VERLAG C1 Schering Plough Corp, Res Inst, Bioinformat Grp, 2015 Galloping Hill Rd, Kenilworth, NJ 07033 USA Schering Plough Corp, Res Inst, Bioinformat Grp, Kenilworth, NJ 07033 USA Schering Plough Corp, Res Inst, Human Genomic Res Dept, Kenilworth, NJ 07033 USA ID ACTIVATED T-CELLS; TRANSCRIPTION FACTORS; GENE-EXPRESSION; NUCLEAR FACTOR; GENOMIC SEQUENCES; RESPONSE ELEMENT; NF-AT; PROMOTER; DATABASE; RECOGNITION AB Combinatorial regulation is a powerful mechanism for generating specificity in gene expression, and it is thought to play a pivotal role in the formation of the complex gene regulatory networks found in higher eukaryotes. The term "Composite Element" (CE) refers to a minimal functional unit where protein-DNA and protein-protein interactions contribute to a highly specific pattern of gene transcriptional regulation. Identification of composite elements will help to better understand gene regulation networks. Experimentally identified CEs are limited in number, and the currently available CE database COMPEL is based on such published information. Here, based on the statistical analysis of over-represented adjacent transcription factor binding sites, we describe a computational method to predict composite regulatory elements in genomic sequences. 
The algorithm proved to be efficient for extracting composite elements that had been experimentally confirmed and documented in the COMPEL database. Furthermore, putative new composite elements are predicted based on this method, and we have been able to confirm some of our predictions which are not included in the COMPEL database by searching published information. TC 2 BP 327 EP 332 PG 6 JI Mamm. Genome PY 2002 PD JUN VL 13 IS 6 GA 561BK PI NEW YORK RP Qiu P Schering Plough Corp, Res Inst, Bioinformat Grp, 2015 Galloping Hill Rd, Kenilworth, NJ 07033 USA J9 MAMM GENOME PA 175 FIFTH AVE, NEW YORK, NY 10010 USA UT ISI:000176116200009 ER PT Journal AU van Nimwegen, E Zavolan, M Rajewsky, N Siggia, ED TI Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics SO PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA LA English DT Article SN 0027-8424 PU NATL ACAD SCIENCES C1 Rockefeller Univ, Ctr Studies Phys & Biol, 1230 York Ave,Box 75, New York, NY 10021 USA Rockefeller Univ, Ctr Studies Phys & Biol, New York, NY 10021 USA Rockefeller Univ, Lab Computat Genom, New York, NY 10021 USA ID ESCHERICHIA-COLI K-12; REGULATORY SITES; BINDING SITES; PROTEINS; EXPRESSION AB Genome-wide comparisons between enteric bacteria yield large sets of conserved putative regulatory sites on a gene-by-gene basis that need to be clustered into regulons. Using the assumption that regulatory sites can be represented as samples from weight matrices (WMs), we derive a unique probability distribution for assignments of sites into clusters. Our algorithm, "PROCSE" (probabilistic clustering of sequences), uses Monte Carlo sampling of this distribution to partition and align thousands of short DNA sequences into clusters. The algorithm internally determines the number of clusters from the data and assigns significance to the resulting clusters. 
We place theoretical limits on the ability of any algorithm to correctly cluster sequences drawn from WMs when these WMs are unknown. Our analysis suggests that the set of all putative sites for a single genome (e.g., Escherichia coli) is largely inadequate for clustering. When sites from different genomes are combined and all the homologous sites from the various species are used as a block, clustering becomes feasible. We predict 50-100 new regulons as well as many new members of existing regulons, potentially doubling the number of known regulatory sites in E. coli. TC 6 BP 7323 EP 7328 PG 6 JI Proc. Natl. Acad. Sci. U. S. A. PY 2002 PD MAY 28 VL 99 IS 11 GA 557KW PI WASHINGTON RP van Nimwegen E Rockefeller Univ, Ctr Studies Phys & Biol, 1230 York Ave,Box 75, New York, NY 10021 USA J9 PROC NAT ACAD SCI USA PA 2101 CONSTITUTION AVE NW, WASHINGTON, DC 20418 USA UT ISI:000175908600003 ER PT Journal AU GuhaThakurta, D Palomar, L Stormo, GD Tedesco, P Johnson, TE Walker, DW Lithgow, G Kim, S Link, CD TI Identification of a novel cis-regulatory element involved in the heat shock response in Caenorhabditis elegans using microarray gene expression and computational methods SO GENOME RESEARCH LA English DT Article SN 1088-9051 PU COLD SPRING HARBOR LAB PRESS C1 Univ Colorado, Inst Behav Genet, Boulder, CO 80309 USA Univ Colorado, Inst Behav Genet, Boulder, CO 80309 USA Washington Univ, Sch Med, Dept Genet, St Louis, MO 63114 USA Univ Manchester, Sch Biol Sci, Manchester M13 9PT, Lancs, England Stanford Univ, Dept Dev Biol, Sch Med, Stanford, CA 94305 USA ID DNA-BINDING SITES; TRANSCRIPTION FACTORS; MOLECULAR CHAPERONES; PROTEIN INTERACTIONS; FUNCTIONAL ELEMENTS; SEQUENCE ALIGNMENTS; INFORMATION-CONTENT; MOUSE GENOME; FREE-ENERGY; FAMILY AB We report here the identification of a previously unknown transcription regulatory element for heat shock (HS) genes in Caenorhabditis elegans. 
We monitored the expression pattern of 11,917 genes from C. elegans to determine the genes that were up-regulated on HS. Twenty eight genes were observed to be consistently up-regulated in several different repetitions of the experiments. We analyzed the upstream regions of these genes using computational DNA pattern recognition methods. Two potential cis-regulatory motifs were identified in this way. One of these motifs (TTCTAGAA) was the DNA binding motif for the heat shock factor (HSF), whereas the other (GGGTGTC) was previously unreported in the literature. We determined the significance of these motifs for the HS genes using different statistical tests and parameters. Comparative sequence analysis of orthologous HS genes from C. elegans and Caenorhabditis briggsae indicated that the identified DNA regulatory motifs are conserved across related species. The role of the identified DNA sites in regulation of HS genes was tested by in vitro mutagenesis of a green fluorescent protein (GFP) reporter transgene driven by the C. elegans hsp-16-2 promoter. DNA sites corresponding to both motifs are shown to play a significant role in up-regulation of the hsp-16-2 gene on HS. This is one of the rare instances in which a novel regulatory element, identified using computational methods, is shown to be biologically active. The contributions of individual sites toward induction of transcription on HS are nonadditive, which indicates interaction and cross-talk between the sites, possibly through the transcription factors (TFs) binding to these sites. TC 4 BP 701 EP 712 PG 12 JI Genome Res. 
PY 2002 PD MAY VL 12 IS 5 GA 551JE PI PLAINVIEW RP Link CD Univ Colorado, Inst Behav Genet, Boulder, CO 80309 USA J9 GENOME RES PA 1 BUNGTOWN RD, PLAINVIEW, NY 11724 USA UT ISI:000175556500005 ER PT Journal AU Hampson, S Kibler, D Baldi, P TI Distribution patterns of over-represented k-mers in non-coding yeast DNA SO BIOINFORMATICS LA English DT Article SN 1367-4803 PU OXFORD UNIV PRESS C1 Univ Calif Irvine, Dept Comp & Informat Sci, Inst Genomics & Bioinformat, Irvine, CA 92697 USA Univ Calif Irvine, Dept Comp & Informat Sci, Inst Genomics & Bioinformat, Irvine, CA 92697 USA Univ Calif Irvine, Coll Med, Dept Biol Chem, Irvine, CA 92697 USA ID MICROARRAY GENE-EXPRESSION; SACCHAROMYCES-CEREVISIAE; GENOMIC SCALE; COMPUTATIONAL ANALYSIS; STATISTICAL-ANALYSIS; REGULATORY SITES; SEQUENCE; ELEMENTS; PROMOTER; IDENTIFICATION AB Motivation: Over-represented k-mers in genomic DNA regions are often of particular biological interest. For example, over-represented k-mers in co-regulated families of genes are associated with the DNA binding sites of transcription factors. To measure over-representation, we introduce a statistical background model based on single-mismatches, and apply it to the pooled 500 bp ORF Upstream Regions (USRs) of yeast. More importantly, we investigate the context and spatial distribution of over-represented k-mers in yeast USRs. Results: Single and double-stranded spatial distributions of most over-represented k-mers are highly non-random, and predominantly cluster into a small number of classes that are robust with respect to over-representation measures. 
Specifically, we show that the three most common distribution patterns can be related to DNA structure, function, and evolution and correspond to: (a) homologous ORF clusters associated with sharply localized distributions; (b) regulatory elements associated with a symmetric broad hill-shaped distribution in the 50-200 bp USR; and (c) runs of As, Ts, and ATs associated with a broad hill- shaped distribution also in the 50-200 bp USR, with extreme structural properties. Analysis of over-representation, homology, localization, and DNA structure are essential components of a general data-mining approach to finding biologically important k-mers in raw genomic DNA and understanding the 'lexicon' of regulatory regions. Contact: hampson@ics.uci.edu; kibler@ics.uci.edu; pfbaldi@ics.uci.edu. TC 3 BP 513 EP 528 PG 16 JI Bioinformatics PY 2002 PD APR VL 18 IS 4 GA 551CP PI OXFORD RP Baldi P Univ Calif Irvine, Dept Comp & Informat Sci, Inst Genomics & Bioinformat, Irvine, CA 92697 USA J9 BIOINFORMATICS PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000175542200003 ER PT Journal AU Papatsenko, DA Makeev, VJ Lifanov, AP Regnier, M Nazina, AG Desplan, C TI Extraction of functional binding sites from unique regulatory regions: The Drosophila early developmental enhancers SO GENOME RESEARCH LA English DT Article SN 1088-9051 PU COLD SPRING HARBOR LAB PRESS C1 NYU, Dept Biol, New York, NY 10003 USA NYU, Dept Biol, New York, NY 10003 USA State Sci Ctr Genet, Moscow 113545, Russia Moscow Chem Phys Inst, Moscow 117421, Russia Inst Natl Rech Informat & Automat, F-78153 Le Chesnay, France ID TARAZU PROXIMAL ENHANCER; PAIR-RULE STRIPES; RNA-POLYMERASE-II; GENE-EXPRESSION; DNA-SEQUENCES; COMPUTATIONAL ANALYSIS; EMBRYO; IDENTIFICATION; TRANSCRIPTION; PROTEINS AB The early developmental enhancers of Drosophila melanogaster comprise one of the most sophisticated regulatory systems in higher eukaryotes. 
An elaborate code in their DNA sequence translates both maternal and early embryonic regulatory signals into spatial distribution of transcription factors. One of the most striking features of this code is the redundancy of binding sites for these transcription factors (BSTF). Using this redundancy, we explored the possibility of predicting functional binding sites in a single enhancer region without any prior consensus/matrix description or evolutionary sequence comparisons. We developed a conceptually simple algorithm, Scanseq, that employs an original statistical evaluation for identifying the most redundant motifs and locates the position of potential BSTF in a given regulatory region. To estimate the biological relevance of our predictions, we built thorough literature-based annotations for the best-known Drosophila developmental enhancers and we generated detailed distribution maps for the most robust binding sites. The high statistical correlation between the location of BSTF in these experiment-based maps and the location predicted in silico by Scanseq confirmed the relevance of our approach. We also discuss the definition of true binding sites and the possible biological principles that govern patterning of regulatory regions and the distribution of transcriptional signals. TC 8 BP 470 EP 481 PG 12 JI Genome Res. 
PY 2002 PD MAR VL 12 IS 3 GA 527DP PI PLAINVIEW RP Papatsenko DA NYU, Dept Biol, New York, NY 10003 USA J9 GENOME RES PA 1 BUNGTOWN RD, PLAINVIEW, NY 11724 USA UT ISI:000174171300014 ER PT Journal AU Rajewsky, N Socci, ND Zapotocky, M Siggia, ED TI The evolution of DNA regulatory regions for proteo-gamma bacteria by interspecies comparisons SO GENOME RESEARCH LA English DT Article SN 1088-9051 PU COLD SPRING HARBOR LAB PRESS C1 Rockefeller Univ, Ctr Studies Phys & Biol, 1230 York Ave, New York, NY 10021 USA Rockefeller Univ, Ctr Studies Phys & Biol, New York, NY 10021 USA ID ESCHERICHIA-COLI K-12; REPRESSOR-OPERATOR COMPLEX; CRYSTAL-STRUCTURE; SITES; SEQUENCE; IDENTIFICATION; RESOLUTION; ALIGNMENT; GENOMES AB The comparison of homologous noncoding DNA for organisms a suitable evolutionary distance apart is a powerful tool for the identification of cis regulatory elements for transcription and translation and for the study of how they assemble into functional modules. We have fit the three parameters of an affine global probabilistic alignment algorithm to establish the background mutation rate of noncoding sequence between E. coli and a series of gamma proteobacteria ranging from Salmonella to Vibrio. The lower bound we find to the neutral mutation rate is sufficiently high, even for Salmonella, that most of the conservation of noncoding sequence is indicative of selective pressures rather than of insufficient time to evolve. We then use a local version of the alignment algorithm combined with our inferred background mutation rate to assign a significance to the degree of local sequence conservation between orthologous genes, and thereby deduce a probability profile for the upstream regulatory region of all E. coli protein-coding genes. We recover 75%-85% (depending on significance level) of all regulatory sites from a standard compilation for E. coli, and 66%-85% of sigma sites. 
We also trace the evolution of known regulatory sites and the groups associated with a given transcription factor. Furthermore, we find that approximately one-third of paralogous gene pairs in E. coli have a significant degree of correlation in their regulatory sequence. Finally, we demonstrate an inverse correlation between the rate of evolution of transcription factors and the number of genes they regulate. Our predictions are available at http://www.physics.rockefeller.edu/~siggia. TC 11 BP 298 EP 308 PG 11 JI Genome Res. PY 2002 PD FEB VL 12 IS 2 GA 518VY PI PLAINVIEW RP Siggia ED Rockefeller Univ, Ctr Studies Phys & Biol, 1230 York Ave, New York, NY 10021 USA J9 GENOME RES PA 1 BUNGTOWN RD, PLAINVIEW, NY 11724 USA UT ISI:000173689600010 ER PT Journal AU Frith, MC Hansen, U Weng, ZP TI Detection of cis-element clusters in higher eukaryotic DNA SO BIOINFORMATICS LA English DT Article SN 1367-4803 PU OXFORD UNIV PRESS C1 Boston Univ, Bioinformat Program, 44 Cummington St, Boston, MA 02215 USA Boston Univ, Bioinformat Program, Boston, MA 02215 USA Boston Univ, Dept Biol, Boston, MA 02215 USA Boston Univ, Dept Biomed Engn, Boston, MA 02215 USA ID TRANSCRIPTION FACTOR CP2; RNA POLYMERASE-II; IMMUNODEFICIENCY-VIRUS TYPE-1; A-CRYSTALLIN GENE; PROMOTER SEQUENCES; REGULATORY REGIONS; BINDING-SITES; FACTOR LSF; FUNCTIONAL PROMOTER; HIV-1 TRANSCRIPTION AB Motivation: Computational prediction and analysis of transcription regulatory regions in DNA sequences has the potential to accelerate greatly our understanding of how cellular processes are controlled. We present a hidden Markov model based method for detecting regulatory regions in DNA sequences, by searching for clusters of cis-elements. Results: When applied to regulatory targets of the transcription factor LSF, this method achieves a sensitivity of 67%, while making one prediction per 33 kb of nonrepetitive human genomic sequence. 
When applied to muscle specific regulatory regions, we obtain a sensitivity and prediction rate that compare favorably with one of the best alternative approaches. Our method, which we call Cister, can be used to predict different varieties of regulatory region by searching for clusters of cis-elements of any type chosen by the user. Cister is simple to use and is available on the web. TC 19 BP 878 EP 889 PG 12 JI Bioinformatics PY 2001 PD OCT VL 17 IS 10 GA 484LA PI OXFORD RP Weng ZP Boston Univ, Bioinformat Program, 44 Cummington St, Boston, MA 02215 USA J9 BIOINFORMATICS PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000171690300004 ER PT Journal AU Pilpel, Y Sudarsanam, P Church, GM TI Identifying regulatory networks by combinatorial analysis of promoter elements SO NATURE GENETICS LA English DT Article SN 1061-4036 PU NATURE AMERICA INC C1 Harvard Univ, Sch Med, Dept Genet, Boston, MA 02115 USA Harvard Univ, Sch Med, Dept Genet, Boston, MA 02115 USA Harvard Univ, Sch Med, Lipper Ctr Computat Genet, Boston, MA 02115 USA ID YEAST SACCHAROMYCES-CEREVISIAE; TRANSCRIPTION FACTORS; GENE-EXPRESSION; CELL-CYCLE; COMPUTATIONAL ANALYSIS; EXCISION-REPAIR; GENOMIC SCALE; IDENTIFICATION; SEQUENCES; DATABASE AB Several computational methods based on microarray data are currently used to study genome-wide transcriptional regulation. Few studies, however, address the combinatorial nature of transcription, a well-established phenomenon in eukaryotes. Here we describe a new approach using microarray data to uncover novel functional motif combinations in the promoters of Saccharomyces cerevisiae. In addition to identifying novel motif combinations that affect expression patterns during the cell cycle, sporulation and various stress responses, we observed regulatory cross-talk among several of these processes. We have also generated motif-association maps that provide a global view of transcription networks. 
The maps are highly connected, suggesting that a small number of transcription factors are responsible for a complex set of expression patterns in diverse conditions. This approach may be useful for modeling transcriptional regulatory networks in more complex eukaryotes. TC 64 BP 153 EP 159 PG 7 JI Nature Genet. PY 2001 PD OCT VL 29 IS 2 GA 478XP PI NEW YORK RP Church GM Harvard Univ, Sch Med, Dept Genet, Boston, MA 02115 USA J9 NAT GENET PA 345 PARK AVE SOUTH, NEW YORK, NY 10010-1707 USA UT ISI:000171374100020 ER PT Journal AU GuhaThakurta, D Stormo, GD TI Identifying target sites for cooperatively binding factors SO BIOINFORMATICS LA English DT Article SN 1367-4803 PU OXFORD UNIV PRESS C1 Washington Univ, Sch Med, Dept Genet, 4566 Scott Ave,Campus Box 8232, St Louis, MO 63110 USA Washington Univ, Sch Med, Dept Genet, St Louis, MO 63110 USA ID ESCHERICHIA-COLI; SACCHAROMYCES-CEREVISIAE; INFORMATION- CONTENT; GENE-EXPRESSION; COMPUTATIONAL ANALYSIS; FREE-ENERGY; DNA; PROTEIN; SEQUENCE; YEAST AB Motivation: Transcriptional activation in eukaryotic organisms normally requires combinatorial interactions of multiple transcription factors. Though several methods exist for identification of individual protein binding site patterns in DNA sequences, there are few methods for discovery of binding site patterns for cooperatively acting factors. Here we present an algorithm, Co-Bind (for COperative BINDing), for discovering DNA target sites for cooperatively acting transcription factors. The method utilizes a Gibbs sampling strategy to model the cooperativity between two transcription factors and defines position weight matrices for the binding sites. Sequences from both the training set and the entire genome are taken into account, in order to discriminate against commonly occurring patterns in the genome, and produce patterns which are significant only in the training set. 
Results: We have tested Co-Bind on semi-synthetic and real data sets to show it can efficiently identify DNA target site patterns for cooperatively binding transcription factors. In cases where binding site patterns are weak and cannot be identified by other available methods, Co-Bind, by virtue of modeling the cooperativity between factors, can identify those sites efficiently. Though developed to model protein-DNA interactions, the scope of Co-Bind may be extended to combinatorial, sequence specific, interactions in other macromolecules. TC 13 BP 608 EP 621 PG 14 JI Bioinformatics PY 2001 PD JUL VL 17 IS 7 GA 459KG PI OXFORD RP GuhaThakurta D Washington Univ, Sch Med, Dept Genet, 4566 Scott Ave,Campus Box 8232, St Louis, MO 63110 USA J9 BIOINFORMATICS PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000170249100005 ER PT Journal AU Ohler, U Niemann, H TI Identification and analysis of eukaryotic promoters: recent computational approaches SO TRENDS IN GENETICS LA English DT Article SN 0168-9525 PU ELSEVIER SCIENCE LONDON C1 Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung Informat 5, Martensstr 3, D-91058 Erlangen, Germany Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung Informat 5, D-91058 Erlangen, Germany ID YEAST SACCHAROMYCES-CEREVISIAE; REGULATORY ELEMENTS; DROSOPHILA-MELANOGASTER; NONCODING SEQUENCES; GENOMIC SCALE; IN-SILICO; PREDICTION; SITES; REGIONS; RECOGNITION AB The DNA sequence of several higher eukaryotes is now complete, and we know the expression patterns of thousands of genes under a variety of conditions. This gives us the opportunity to identify and analyze the parts of a genome believed to be responsible for most transcription control: the promoters. This article gives a short overview of the state-of-the-art techniques for computational promoter localization and analysis, and comments on the most recent advances in the field. TC 26 BP 56 EP 60 PG 5 JI Trends Genet. 
PY 2001 PD FEB VL 17 IS 2 GA 432WD PI LONDON RP Ohler U Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung Informat 5, Martensstr 3, D-91058 Erlangen, Germany J9 TRENDS GENET PA 84 THEOBALDS RD, LONDON WC1X 8RR, ENGLAND UT ISI:000168718000002 ER PT Journal AU Wingender, E Chen, X Fricke, E Geffers, R Hehl, R Liebich, I Krull, M Matys, V Michael, H Ohnhauser, R Pruss, M Schacherer, F Thiele, S Urbach, S TI The TRANSFAC system on gene expression regulation SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 Gesell Biotechnol Forsch GmbH, Mascheroder Weg 1, D-38124 Braunschweig, Germany Gesell Biotechnol Forsch GmbH, D-38124 Braunschweig, Germany Peking Univ, Coll Life Sci, Natl Lab Prot Engn & Plant Genet Engn, Beijing 100871, Peoples R China BIOBASE GmbH, D-38124 Braunschweig, Germany Tech Univ Braunschweig, Biozentrum, Inst Genet, D-38106 Braunschweig, Germany ID TRANSCRIPTIONAL REGULATION; DATABASE; COMPILATION; COMPEL; TOOLS; TRRD AB The TRANSFAC database on transcription factors and their DNA-binding sites and profiles (http://www.gene-regulation.de/) has been quantitatively extended and supplemented by a number of modules. These modules give information about pathologically relevant mutations in regulatory regions and transcription factor genes (PathoDB), scaffold/matrix attached regions (S/MARt DB), signal transduction (TRANSPATH) and gene expression sources (CYTOMER). Altogether, these distinct database modules constitute the TRANSFAC system. They are accompanied by a number of program routines for identifying potential transcription factor binding sites or for localizing individual components in the regulatory network of a cell. TC 81 BP 281 EP 283 PG 3 JI Nucleic Acids Res. 
PY 2001 PD JAN 1 VL 29 IS 1 GA 391MT PI OXFORD RP Wingender E Gesell Biotechnol Forsch GmbH, Mascheroder Weg 1, D-38124 Braunschweig, Germany J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000166360300077 ER PT Journal AU Bussemaker, HJ Li, H Siggia, ED TI Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis SO PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA LA English DT Article SN 0027-8424 PU NATL ACAD SCIENCES C1 Univ Amsterdam, Swammerdam Inst Life Sci, Kruislaan 318, NL- 1098 SM Amsterdam, Netherlands Rockefeller Univ, Ctr Studies Phys & Biol, New York, NY 10021 USA ID YEAST SACCHAROMYCES-CEREVISIAE; GENE-EXPRESSION; BIOLOGICAL SEQUENCES; SCALE; SPORULATION; MEIOSIS; MOTIFS AB The availability of complete genome sequences and mRNA expression data for all genes creates new opportunities and challenges for identifying DNA sequence motifs that control gene expression. An algorithm, "MobyDick," is presented that decomposes a set of DNA sequences into the most probable dictionary of motifs or words. This method is applicable to any set of DNA sequences: for example, all upstream regions in a genome or all genes expressed under certain conditions. Identification of words is based on a probabilistic segmentation model in which the significance of longer words is deduced from the frequency of shorter ones of various lengths, eliminating the need for a separate set of reference data to define probabilities. We have built a dictionary with 1,200 words for the 6,000 upstream regulatory regions in the yeast genome; the 500 most significant words (some with as few as 10 copies in all of the upstream regions) match 114 of 443 experimentally determined sites (a significance level of 18 standard deviations). 
When analyzing all of the genes up-regulated during sporulation as a group, we find many motifs in addition to the few previously identified by analyzing the subclusters individually to the expression subclusters. Applying MobyDick to the genes derepressed when the general repressor Tup1 is deleted, we find known as well as putative binding sites for its regulatory partners. TC 25 BP 10096 EP 10100 PG 5 JI Proc. Natl. Acad. Sci. U. S. A. PY 2000 PD AUG 29 VL 97 IS 18 GA 349WA PI WASHINGTON RP Bussemaker HJ Univ Amsterdam, Swammerdam Inst Life Sci, Kruislaan 318, NL-1098 SM Amsterdam, Netherlands J9 PROC NAT ACAD SCI USA PA 2101 CONSTITUTION AVE NW, WASHINGTON, DC 20418 USA UT ISI:000089067500052 ER PT Journal AU Wagner, A TI Genes regulated cooperatively by one or more transcription factors and their identification in whole eukaryotic genomes SO BIOINFORMATICS LA English DT Article SN 1367-4803 PU OXFORD UNIV PRESS C1 Univ New Mexico, Dept Biol, 167A Castetter Hall, Albuquerque, NM 87131 USA Univ New Mexico, Dept Biol, Albuquerque, NM 87131 USA Santa Fe Inst, Albuquerque, NM 87131 USA ID SACCHAROMYCES-CEREVISIAE; BINDING-SITES; PROMOTER RECOGNITION; DNA-SEQUENCES; II PROMOTERS; IN-SILICO; YEAST; MCM1; INHOMOGENEITIES; CYTOKINESIS AB Motivation: The question addressed here is how cooperative interactions among transcription factors (TFs), a very frequent phenomenon in eukaryotic transcriptional regulation, can be used to identify genes that are regulated by one or more TFs with known DNA binding specificities. Cooperativity may be homotypic, involving binding of only one transcription factor to multiple sites in a gene's regulatory region. It may also be heterotypic, involving binding of more than one TF. Both types of cooperativity have in common that the binding sites for the respective TFs form tightly linked "clusters", groups of binding sites often more closely associated than expected by chance alone. 
Results: A statistical technique suitable for the identification of statistically significant homotypic or heterotypic TF binding site clusters in whole eukaryotic genomes is presented. It can be used to identify genes likely to be regulated by the TFs. Application of the technique is illustrated with two transcription factors involved in the cell cycle and mating control of the yeast Saccharomyces cerevisiae, indicating that the results obtained are biologically meaningful. This rapid and inexpensive computational method of generating hypotheses about gene regulation thus generates information that may be used to guide subsequent costly and laborious experimental approaches, and that may aid in the assignment of biological functions to putative open reading frames. TC 24 BP 776 EP 784 PG 9 JI Bioinformatics PY 1999 PD OCT VL 15 IS 10 GA 279LU PI OXFORD RP Wagner A Univ New Mexico, Dept Biol, 167A Castetter Hall, Albuquerque, NM 87131 USA J9 BIOINFORMATICS PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000085045900001 ER PT Journal AU Kel-Margoulis, OV Romashchenko, AG Kolchanov, NA Wingender, E Kel, AE TI COMPEL: a database on composite regulatory elements providing combinatorial transcriptional regulation SO NUCLEIC ACIDS RESEARCH LA English DT Article SN 0305-1048 PU OXFORD UNIV PRESS C1 RAN, SB, Inst Cytol & Genet, 10 Lavrentyev P, Novosibirsk 630090, Russia RAN, SB, Inst Cytol & Genet, Novosibirsk 630090, Russia Gesell Biotechnol Forsch GmbH, Res Grp Bioinformat, D-38124 Braunschweig, Germany ID TRANSFAC; TRRD; CLASSIFICATION AB COMPEL is a database on composite regulatory elements, the basic structures of combinatorial regulation. 
Composite regulatory elements contain two closely situated binding sites for distinct transcription factors and represent minimal functional units providing combinatorial transcriptional regulation. Both specific factor-DNA and factor-factor interactions contribute to the function of composite elements (CEs). Information about the structure of known CEs and specific gene regulation achieved through such CEs appears to be extremely useful for promoter prediction, for gene function prediction and for applied gene engineering as well. The structure of the relational model of COMPEL is determined by the concept of molecular structure and regulatory role of CEs. Based on the set of a particular CE, a program has been developed for searching potential CEs in gene regulatory regions. WWW search and browse routines were developed for COMPEL release 3.0. The COMPEL database equipped with the search and browse tools is available at http://compel.bionet.nsc.ru/. The program for prediction of potential CEs of NFAT type is available at http://compel.bionet.nsc.ru/FunSite.html and http://transfac.gbf.de/dbsearch/funsitep/s_comp.html. TC 16 BP 311 EP 315 PG 5 JI Nucleic Acids Res. 
PY 2000 PD JAN 1 VL 28 IS 1 GA 276UT PI OXFORD RP Kel-Margoulis OV RAN, SB, Inst Cytol & Genet, 10 Lavrentyev P, Novosibirsk 630090, Russia J9 NUCL ACID RES PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000084896300092 ER PT Journal AU Hertz, GZ Stormo, GD TI Identifying DNA and protein patterns with statistically significant alignments of multiple sequences SO BIOINFORMATICS LA English DT Article SN 1367-4803 PU OXFORD UNIV PRESS C1 Univ Colorado, Dept Mol Cellular & Dev Biol, Boulder, CO 80309 USA Univ Colorado, Dept Mol Cellular & Dev Biol, Boulder, CO 80309 USA ID BINDING-SITES; INFORMATION-CONTENT; COMPUTER METHODS; IDENTIFICATION; SPECIFICITY; OPERATORS; SIGNALS AB Motivation: Molecular biologists frequently can obtain interesting insight by aligning a set of related DNA, RNA or protein sequences. Such alignments can be used to determine either evolutionary or functional relationships. Our interest is in identifying functional relationships. Unless the sequences are very similar, it is necessary to have a specific strategy for measuring (or scoring) the relatedness of the aligned sequences. If the alignment is not known, one can be determined by finding an alignment that optimizes the scoring scheme. Results: We describe four components to our approach for determining alignments of multiple sequences. First, we review a log-likelihood scoring scheme we call information content. Second, we describe two methods for estimating the P value of an individual information content score: (i) a method that combines a technique from large-deviation statistics with numerical calculations; (ii) a method that is exclusively numerical. Third, we describe how we count the number of possible alignments given the overall amount of sequence data. This count is multiplied by the P value to determine the expected frequency of an information content score and thus, the statistical significance of the corresponding alignment. 
Statistical significance can be used to compare alignments having differing widths and containing differing numbers of sequences. Fourth, we describe a greedy algorithm for determining alignments of functionally related sequences. Finally, we test the accuracy of our P value calculations, and give an example of using our algorithm to identify binding sites for the Escherichia coli CRP protein. Availability: Programs were developed under the UNIX operating system and are available by anonymous ftp from ftp://beagle.colorado.edu/pub/consensus. Contact: hertz@colorado.edu. TC 55 BP 563 EP 577 PG 15 JI Bioinformatics PY 1999 PD JUL-AUG VL 15 IS 7-8 GA 247GT PI OXFORD RP Hertz GZ Univ Colorado, Dept Mol Cellular & Dev Biol, Boulder, CO 80309 USA J9 BIOINFORMATICS PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000083213500006 ER PT Journal AU Klingenhoff, A Frech, K Quandt, K Werner, T TI Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity SO BIOINFORMATICS LA English DT Article SN 1367-4803 PU OXFORD UNIV PRESS C1 GSF, Natl Res Ctr Environm & Hlth, Inst Mammalian Genet, Landstr 1, D-85764 Neuherberg, Germany GSF, Natl Res Ctr Environm & Hlth, Inst Mammalian Genet, D-85764 Neuherberg, Germany Genomatix Software GmbH, D-80333 Munich, Germany ID PROGRESSIVE MULTIFOCAL LEUKOENCEPHALOPATHY; CLASS-I GENES; TRANSCRIPTIONAL REGULATION; BINDING-SITES; EXPRESSION; ELEMENTS; TOOLS; GENOMEINSPECTOR; DATABASES; HLA-B7 AB Motivation: Gene regulation often depends on functional modules which feature a detectable internal organization. Overall sequence similarity of these modules is often insufficient for detection by general search methods like FASTA or even Gapped BLAST. However, it is of interest to evaluate whether modules, often known from experimental analysis of single sequences, are present in other regulatory sequences. 
Results: We developed a new method (FastM) which combines a search algorithm for individual transcription factor binding sites (MatInspector) with a distance correlation function. FastM allows fast definition of a model of correlated binding sites derived from as little as a single promoter or enhancer. ModelInspector results are suitable for evaluation of the significance of the model. We used FastM to define a model for the experimentally verified NF kappa B/IRF1 regulatory module from the major histocompatibility complex (MHC) class I HLA-B gene promoter. Analysis of a test set of sequences as well as database searches with this model showed excellent correlation of the model with the biological function of the module. These results could not be obtained by searches using FASTA or Gapped BLAST, which are based on sequence similarity. We were also able to demonstrate association of a hypothetical GRE-GRE module with viral sequences based on analysis of several GenBank sections with this module. TC 28 BP 180 EP 186 PG 7 JI Bioinformatics PY 1999 PD MAR VL 15 IS 3 GA 192CK PI OXFORD RP Klingenhoff A GSF, Natl Res Ctr Environm & Hlth, Inst Mammalian Genet, Landstr 1, D-85764 Neuherberg, Germany J9 BIOINFORMATICS PA GREAT CLARENDON ST, OXFORD OX2 6DP, ENGLAND UT ISI:000080060500002 ER