Project: Next Generation Sequencing: From Computational Challenges to Biological Insight

Team: Dr. Sascha Sauer, Annabell Witzke, Cornelius Fischer

Research institution: Max Planck Institute for Molecular Genetics (MPIMG)

Abstract: The aim of this project is to analyze the interplay of small molecules with cellular and inflammatory processes and to study the mechanisms underlying gene regulation. Resveratrol, a small molecule found in the skin of grapes, is widely known to activate the NAD+-dependent deacetylase Sirt1 and nuclear receptors, which are associated with the prevention of diabetes and obesity. In this project, molecules interacting with Sirt1 and nuclear receptors shall be identified and characterized using approaches such as chromatin immunoprecipitation coupled to next generation sequencing (ChIP-seq) and transcriptome profiling by RNA-seq. Understanding the mechanism of action of Sirt1 and nuclear receptors can provide a unique opportunity to gain biological insight into diabetes pathogenesis and to develop new compounds for prevention and treatment.

Project Description

Next generation sequencing (NGS) is changing the way researchers approach the management and analysis of biological information. However, one of the main bottlenecks in NGS applications is the computational analysis of experimental data. NGS has increased the already daunting volume of data generated in laboratories by orders of magnitude. For example, machines such as Illumina's state-of-the-art HiSeq 2000 can generate terabytes of data per day. Importantly, NGS data analysis is a rapidly developing field of research, and many aspects of data interpretation are incompletely understood. The result is an immense information technology challenge that affects all biomedical research.

An emerging trend in high-throughput computational analysis of NGS data is to use external providers for cloud and grid computing rather than building and running institution-based data centers. Institution-based environments often use established pipelines to increase throughput for the majority of long-term in-house research projects. In a running pipeline, however, it is difficult to keep the modules up to date with the rapid evolution of bioinformatics tools. Innovative bioinformatics ideas, uncommon experimental approaches and difficult datasets from relatively small research groups therefore need a flexible and up-to-date computational environment.

In two European consortia (READNA, ESGI) we are involved in adapting and optimising these sequencing approaches for specific applications related to the analysis of gene regulation in cell differentiation. Our main applications of NGS are ChIP-seq and RNA-seq. ChIP-seq is used to identify the binding sites of DNA-associated regulatory proteins (e.g. transcription factors) and the genomic domains of proteins associated with DNA packaging and regulation (histone modifications). RNA-seq is expected to replace traditional microarray approaches for many applications that involve determining the structure and dynamics of the transcriptome.

Workflow

The computational pipeline for primary data analysis in both ChIP-seq and RNA-seq involves quality assessment and mapping of the obtained sequence tags to the genome or the annotated transcriptome, using the sequence alignment algorithms implemented in Bowtie or TopHat. The post-mapping analysis workflows involve determining genomic regions that are significantly enriched in aligned tags, using peak-calling algorithms such as MACS. These basic post-mapping analyses are followed by diverse high-dimensional data analysis methods implemented in the R statistical programming language and collected in the Bioconductor software project.
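The idea behind the peak-calling step can be illustrated with a minimal Python sketch. It is not the MACS algorithm: it assumes fixed, non-overlapping windows and a single genome-wide Poisson background rate, whereas MACS additionally models the shift between strands and a local background. The function names, the window size and the significance threshold are hypothetical choices made for illustration only.

```python
import math
from collections import Counter


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam) by summing the lower tail.

    Suitable for the small background rates of this toy example; values
    below roughly 1e-16 are not resolved by the final subtraction.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) derived from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def enriched_windows(tag_positions, genome_length, window=200, alpha=1e-5):
    """Report windows whose tag count exceeds a uniform Poisson background.

    tag_positions: 0-based start positions of aligned tags on one sequence.
    Returns a list of (start, end, count, p_value) tuples.
    """
    counts = Counter(pos // window for pos in tag_positions)
    n_windows = math.ceil(genome_length / window)
    lam = len(tag_positions) / n_windows  # assumed genome-wide background rate
    hits = []
    for idx in sorted(counts):
        p_value = poisson_sf(counts[idx], lam)
        if p_value < alpha:
            start = idx * window
            end = min(start + window, genome_length)
            hits.append((start, end, counts[idx], p_value))
    return hits


# Toy data: one background tag every 100 bp plus a cluster of 30 tags.
tags = list(range(0, 10000, 100)) + list(range(5000, 5030))
for start, end, count, _ in enriched_windows(tags, genome_length=10000):
    print(start, end, count)  # prints: 5000 5200 32
```

In practice the significance threshold would also need correction for the number of windows tested, which this sketch leaves out.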

Dynamic pipeline

Dynamic pipeline for computational analysis of high-throughput next generation sequencing (NGS) data.

ChIP-seq

Nuclear chromatin is cross-linked (1) and sheared (2) prior to enrichment of target protein complexes by immunoprecipitation (3). Afterwards, short reads obtained by parallel sequencing (4) are mapped to the reference genome, resulting in a distribution of tags on the genome (5).

ChIP-seq workflow (Szalkowski and Schmid (2010))
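Step (5), the distribution of tags on the genome, can be sketched as a simple binning of aligned reads. The Python example below assumes the aligner output is available as SAM text lines and counts the leftmost alignment position of each mapped read per fixed-size bin. The function name and the bin size are hypothetical, and strand and fragment length are deliberately ignored.

```python
from collections import Counter

UNMAPPED = 0x4  # SAM FLAG bit set when a read has no alignment


def tag_distribution(sam_lines, bin_size=1000):
    """Count aligned tags per (reference name, bin start).

    Uses only the mandatory SAM columns FLAG (2), RNAME (3) and POS (4).
    Bin starts are reported as 0-based coordinates.
    """
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        if flag & UNMAPPED or chrom == "*":
            continue  # unmapped reads carry no genomic position
        bin_start = (pos - 1) // bin_size * bin_size  # SAM POS is 1-based
        counts[(chrom, bin_start)] += 1
    return counts


# Made-up records: three mapped reads and one unmapped read.
example = [
    "@HD\tVN:1.0\tSO:unsorted",
    "r1\t0\tchr1\t150\t30\t36M\t*\t0\t0\t*\t*",
    "r2\t16\tchr1\t980\t30\t36M\t*\t0\t0\t*\t*",
    "r3\t0\tchr1\t1001\t30\t36M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
for (chrom, start), n in sorted(tag_distribution(example).items()):
    print(chrom, start, n)
# prints:
# chr1 0 2
# chr1 1000 1
```

Per-bin counts of this kind are the raw input that an enrichment test or peak caller would then evaluate.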

References

  • Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009).
  • Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods (2012). doi:10.1038/nmeth.1923
  • Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105–1111 (2009).
  • Zhang, Y. et al. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 9, R137 (2008).
  • Ihaka, R. & Gentleman, R. R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 5, 299–314 (1996).
  • Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
  • Szalkowski, A. M. & Schmid, C. D. Rapid innovation in ChIP-seq peak-calling algorithms is outdistancing benchmarking efforts. Brief. Bioinform. 12, 626–633 (2011).

Last modified on Nov 16, 2012 11:01:13 AM
