A Novel Bayesian Change-point Algorithm for Genome-wide Analysis of Diverse ChIPseq Data Types

Haipeng Xing; Willey Liao; Yifan Mo; Michael Q. Zhang

doi:10.3791/4273

Published: December 10, 2012

Haipeng Xing¹, Willey Liao², Yifan Mo², Michael Q. Zhang³

¹Department of Applied Mathematics & Statistics, Stony Brook University, ²Computational Biology and Bioinformatics, Cold Spring Harbor Laboratory, ³Department of Molecular and Cell Biology, University of Texas at Dallas

Summary

Our Bayesian Change Point (BCP) algorithm builds on state-of-the-art advances in modeling change points via Hidden Markov Models and applies them to chromatin immunoprecipitation sequencing (ChIPseq) data analysis. BCP performs well on both broad and punctate data types, but excels at accurately identifying robust, reproducible islands of diffuse histone enrichment.

Abstract

ChIPseq is a widely used technique for investigating protein-DNA interactions. Read density profiles are generated by next-generation sequencing of protein-bound DNA and alignment of the short reads to a reference genome. Enriched regions are revealed as peaks, which often differ dramatically in shape depending on the target protein¹. For example, transcription factors often bind in a site- and sequence-specific manner and tend to produce punctate peaks, while histone modifications are more pervasive and are characterized by broad, diffuse islands of enrichment². Reliably identifying these regions was the focus of our work.
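As a rough illustration of what a read density profile is, the sketch below counts aligned read start positions in fixed-width bins along a single chromosome. The function name `read_density`, the 200 bp bin size, and the example positions are assumptions made for this illustration; they are not taken from the BCP software.

```python
from collections import Counter


def read_density(read_starts, bin_size=200):
    """Count reads per fixed-width bin along one chromosome.

    read_starts: iterable of 0-based alignment start positions
                 (hypothetical input for this sketch).
    bin_size:    bin width in base pairs (assumed value, not from the paper).
    Returns a list of counts covering bin 0 through the last occupied bin.
    """
    counts = Counter(pos // bin_size for pos in read_starts)
    if not counts:
        return []
    n_bins = max(counts) + 1
    return [counts.get(i, 0) for i in range(n_bins)]


# Toy example: three reads in the first bin, one in the second,
# and a small cluster of four reads in the fifth bin.
starts = [10, 150, 199, 200, 950, 960, 970, 999]
profile = read_density(starts, bin_size=200)
print(profile)
```

A punctate peak would appear in such a profile as a few adjacent high-count bins, whereas a diffuse island would appear as a long run of moderately elevated bins.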

Algorithms for analyzing ChIPseq data have employed various methodologies, from heuristics³⁻⁵ to more rigorous statistical models, e.g., Hidden Markov Models (HMMs)⁶⁻⁸. We sought a solution that minimized the need for difficult-to-define, ad hoc parameters, which often compromise resolution and make a tool less intuitive to use. With respect to HMM-based methods, we aimed to curtail the parameter estimation procedures and the simple, finite state classifications that are often utilized.

Additionally, conventional ChIPseq data analysis involves categorization of the expected read density profiles as either punctate or diffuse followed by subsequent application of the appropriate tool. We further aimed to replace the need for these two distinct models with a single, more versatile model, which can capably address the entire spectrum of data types.

To meet these objectives, we first constructed a statistical framework that naturally models ChIPseq data structures using a recent advance in HMMs⁹ that relies only on explicit formulas, an innovation crucial to its performance advantages. More sophisticated than heuristic models, our HMM accommodates infinite hidden states through a Bayesian model. We applied it to identify reasonable change points in read density, which in turn define segments of enrichment. Our analysis showed that the Bayesian Change Point (BCP) algorithm has reduced computational complexity, evidenced by a shorter run time and a smaller memory footprint. The BCP algorithm was successfully applied to both punctate peak and diffuse island identification with robust accuracy and limited user-defined parameters, illustrating both its versatility and its ease of use. Consequently, we believe it can be implemented readily across broad ranges of data types and end users in a manner that is easily compared and contrasted, making it a useful tool for ChIPseq data analysis that can aid collaboration and corroboration between research groups. Here, we demonstrate the application of BCP to existing transcription factor¹⁰,¹¹ and epigenetic data¹² to illustrate its usefulness.
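To give a feel for how a change point in read density can be inferred with explicit formulas, the following is a minimal toy sketch and not the BCP algorithm itself. It assumes a single change point, Poisson-distributed bin counts, and a conjugate Gamma(a, b) prior on each segment's rate; all of these choices, as well as the function names and the example counts, are assumptions made for illustration. BCP handles multiple change points through an infinite-state HMM, which this sketch does not attempt.

```python
import math


def log_marginal(total, n, a=1.0, b=1.0):
    """Log marginal likelihood of n Poisson counts summing to `total`,
    with the rate integrated out under a Gamma(shape=a, rate=b) prior.

    The factor prod(1 / y_i!) is omitted because it is identical for
    every candidate split and cancels when posteriors are normalized.
    """
    return (a * math.log(b) - math.lgamma(a)
            + math.lgamma(a + total) - (a + total) * math.log(b + n))


def change_point_posterior(counts, a=1.0, b=1.0):
    """Posterior over the position of one change point (toy model).

    Entry i of the result is the probability that the first i + 1 bins
    share one rate and the remaining bins share another, under a uniform
    prior over split positions.
    """
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two bins")
    total = sum(counts)
    log_post = []
    left = 0
    for k in range(1, n):
        left += counts[k - 1]
        log_post.append(log_marginal(left, k, a, b)
                        + log_marginal(total - left, n - k, a, b))
    # Normalize in a numerically stable way.
    m = max(log_post)
    weights = [math.exp(v - m) for v in log_post]
    z = sum(weights)
    return [w / z for w in weights]


# Toy profile: low background followed by an enriched segment.
counts = [1, 0, 2, 1, 1, 9, 11, 8, 10, 12]
post = change_point_posterior(counts)
best_k = post.index(max(post)) + 1  # number of bins before the change
print(best_k)
```

Because the Gamma prior is conjugate to the Poisson likelihood, each segment's marginal likelihood has a closed form, so no iterative parameter estimation is needed in this toy setting.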

Protocol

1. Preparing the Input Files for BCP Analysis. Align the short reads produced from the sequencing runs (ChIP and input libraries) to the appropriate reference genome using your preferred short read alignment software. The mapped locations…

Representative Results

BCP excels at identifying regions of broad enrichment in histone modification data. As a point of reference, we previously compared our results with those of SICER³, an existing tool that has demonstrated strong performance. To better illustrate the advantages…

Discussion

We set out to develop a model for analyzing ChIPseq data that can identify both punctate and diffuse data structures equally well. Until now, regions of enrichment, particularly diffuse regions, which reflect the presumed expectation of large island sizes, …

Disclosures

The authors have nothing to disclose.

Acknowledgements

STARR Foundation award (MQZ), NIH grant ES017166 (MQZ), NSF grant DMS0906593 (HX).

Materials

Name of the reagent	Company	Catalogue number	Comments (optional)
Linux-based workstation

References

1. Park, P. J. ChIP-seq: advantages and challenges of a maturing technology. Nat. Rev. Genet. 10, 669-680 (2009).
2. Barski, A., et al. High-resolution profiling of histone methylations in the human genome. Cell. 129, 823-837 (2007).
3. Zhang, Y., et al. Model-based Analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
4. Zang, C., et al. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 25, 1952-1958 (2009).
5. Jothi, R., Cuddapah, S., Barski, A., Cui, K., Zhao, K. Genome-wide identification of in vivo protein-DNA binding sites from ChIP-Seq data. Nucleic Acids Res. 36, 5221-5231 (2008).
6. Qin, Z. S., et al. HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC Bioinformatics. 11, 369 (2010).
7. Song, Q., Smith, A. D. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics. 27, 870-871 (2011).
8. Spyrou, C., Stark, R., Lynch, A. G., Tavaré, S. BayesPeak: Bayesian analysis of ChIP-seq data. BMC Bioinformatics. 10, 299 (2009).
9. Lai, T., Xing, H. A simple Bayesian approach to multiple change-points. Statistica Sinica. (2011).
10. Robertson, G., et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat. Methods. 4, 651-657 (2007).
11. Stitzel, M. L., et al. Global epigenomic analysis of primary human pancreatic islets provides insights into type 2 diabetes susceptibility loci. Cell Metab. 12, 443-455 (2010).
12. Bernstein, B. E., et al. The NIH Roadmap Epigenomics Mapping Consortium. Nat. Biotechnol. 28, 1045-1048 (2010).
13. Karolchik, D., et al. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, 493-496 (2004).
14. Matys, V., et al. TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31, 374-378 (2003).
15. Portales-Casamar, E., et al. JASPAR 2010: the greatly expanded open-access database of transcription factor binding profiles. Nucleic Acids Res. 38, D105-D110 (2010).

Cite This Article

Xing, H., Liao, W., Mo, Y., Zhang, M. Q. A Novel Bayesian Change-point Algorithm for Genome-wide Analysis of Diverse ChIPseq Data Types. J. Vis. Exp. (70), e4273, doi:10.3791/4273 (2012).
