Research

Computational biology and statistics

Our work has continued its focus on five main areas: statistical methods for microarray data, analysis of Solexa second-generation sequencing data, evolutionary approaches to cancer, statistical genetics, and analysis of genomics data.

We have continued our development of statistical methods for the analysis of data from Illumina BeadArray technologies. Illumina's technology uses randomly assembled arrays of beads, each of which carries a probe for a specific genomic feature. Illumina's software can report full bead-level data, access to which allows for more detailed quality assessment and more flexible statistical analyses. In collaboration with Mark Dunning (Bioinformatics Core), we have redesigned our open-source Bioconductor software, beadarray, to provide a statistical environment for such analyses.

beadarray 2.0 is a complete re-write of the original beadarray package with the aim of providing a more flexible interface to analyse all kinds of Illumina BeadArray data. Rather than being specific to expression data, users can now import methylation or genotyping data and easily adapt the standard visualisation and summarisation options provided in the package to their particular data type. The new package structure also facilitates the development and implementation of new methods for expression data.

Another aspect of our research develops evolutionary approaches to cancer. Realistic agent-based models are proving useful in capturing the complex interactions, spanning multiple space and time scales, that drive cancer growth. We have developed cellular Potts models to describe the dynamics of colon crypts and colorectal tumours. Statistical inference for agent-based models is a challenging problem that has hampered their adoption for complex systems. To address this, we have developed the first approximate Bayesian computation methods for such models, an approach that is likely to revolutionise agent-based modelling. We are using molecular markers such as DNA methylation to infer aspects of the stem cell dynamics in crypts, and to study heterogeneity in colorectal tumours.
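To illustrate the general idea of approximate Bayesian computation for an agent-based model, the following is a minimal rejection-sampling sketch. It is not the method developed in the lab: the toy simulator (cells carrying a single methylation mark that switches on at random), the uniform prior, the summary statistic and the names simulate_crypt and abc_rejection are all assumptions made for illustration, and a cellular Potts model would be far richer.

```python
import random


def simulate_crypt(flip_prob, n_cells=50, n_steps=10, rng=random):
    """Toy agent-based simulator (hypothetical, for illustration only).

    Each cell carries one binary methylation mark. At every step an
    unmethylated cell becomes methylated with probability flip_prob;
    methylated cells stay methylated. Returns the final methylated fraction,
    which serves as the summary statistic.
    """
    methylated = [False] * n_cells
    for _ in range(n_steps):
        methylated = [m or rng.random() < flip_prob for m in methylated]
    return sum(methylated) / n_cells


def abc_rejection(observed, n_draws=2000, tolerance=0.05, seed=0):
    """Rejection ABC: keep prior draws whose simulated summary is close to the data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        candidate = rng.uniform(0.0, 0.3)  # assumed uniform prior on flip_prob
        simulated = simulate_crypt(candidate, rng=rng)
        if abs(simulated - observed) <= tolerance:
            accepted.append(candidate)
    return accepted


if __name__ == "__main__":
    # Hypothetical observation: 65% of cells methylated after 10 steps.
    posterior_sample = abc_rejection(observed=0.65)
    mean = sum(posterior_sample) / len(posterior_sample)
    print(f"accepted {len(posterior_sample)} draws, posterior mean ~ {mean:.3f}")
```

The accepted draws approximate the posterior distribution of the parameter without ever evaluating a likelihood, which is what makes the approach attractive when the model can only be simulated. The tolerance trades accuracy against the acceptance rate.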

Measuring the methylation status of single molecules continues to be a focus of our experimental work. We have developed a cheap, automated, emulsion PCR-based method for identifying CpG methylation in hundreds of thousands of molecules at a small number of CpG islands, together with the computational tools needed to analyse such data. We are using this method to generate data for the cancer modelling described above. This technique has a number of other uses, such as the quantification of allele-specific expression, the identification of rare variants, and the validation of findings from second-generation sequencing experiments.

We continue to collaborate on several projects that involve the statistical analysis of resequencing data. We have continued development of our BayesPeak package for the analysis of ChIP-seq experiments, which is now freely available as part of Bioconductor. In collaboration with Illumina, we have developed statistical methods for identifying copy number aberrations from second-generation sequencing data obtained from matched tumour-normal pairs. Data-mining and data visualisation remain an important focus of our work. To this end, we have developed custom databases to mine network interactions in genomic data. These tools incorporate information from diverse data types, and have been used in collaborations with the Caldas and Narita labs at the CRI, and the Fearon lab at the University.
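As a rough illustration of the matched tumour-normal comparison mentioned above, the sketch below computes depth-normalised log2 ratios of read counts in genomic bins and applies a simple threshold. This is a generic textbook-style approach and is not the statistical method developed with Illumina; the function names, the pseudocount, the threshold and the example counts are all assumptions.

```python
import math


def log2_ratios(tumour_counts, normal_counts, pseudocount=0.5):
    """Per-bin log2(tumour / normal) after scaling for total sequencing depth."""
    if len(tumour_counts) != len(normal_counts):
        raise ValueError("tumour and normal must cover the same bins")
    # Crude library-size normalisation: rescale tumour to the normal's total depth.
    scale = sum(normal_counts) / sum(tumour_counts)
    return [
        math.log2(((t + pseudocount) * scale) / (n + pseudocount))
        for t, n in zip(tumour_counts, normal_counts)
    ]


def call_bins(ratios, threshold=0.4):
    """Label each bin as gain, loss or neutral using a fixed, assumed threshold."""
    calls = []
    for r in ratios:
        if r > threshold:
            calls.append("gain")
        elif r < -threshold:
            calls.append("loss")
        else:
            calls.append("neutral")
    return calls


if __name__ == "__main__":
    # Hypothetical read counts in eight bins for one tumour-normal pair.
    tumour = [100, 100, 200, 50, 100, 100, 100, 100]
    normal = [100] * 8
    print(call_bins(log2_ratios(tumour, normal)))
```

A real analysis would also need to account for GC content, mappability, tumour cellularity and noise between neighbouring bins, typically through segmentation rather than per-bin thresholds.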

We have continued our work with the Caldas laboratory on the statistical analysis of the METABRIC project, which has assayed germline and somatic copy number variants, and their impact on expression variation, in some 2,500 breast tumours and matched normal samples using high-density microarrays (Figure 1). Following up on our initial findings, we are performing targeted resequencing of specific patient sub-populations to survey the mutational spectrum of putative cancer genes. We are also generating transcriptomic and paired-end sequence data on a subset of these tumours to study alternative splicing and structural variation. Additionally, we are pursuing integrative bioinformatic and statistical methods to further delineate mechanisms of differential pathway disruption within the novel subgroups. In collaboration with the Bioinformatics Core, we have developed analysis pipelines for production runs on Illumina expression arrays. We have also developed novel methods for detecting aberrations in tumour samples that exploit cross-sample information, and for identifying interacting SNPs in genome-wide association studies.

Trans-acting aberration hotspots are evident in breast cancer and modulate concerted molecular pathways (Tavare report 2010; figure 1)
Figure 1
Trans-acting aberration hotspots are evident in breast cancer and modulate concerted molecular pathways. From top to bottom: Manhattan plot illustrating the genomic location of cis and trans expression-associated CNAs, where the directionality of the association is indicated by shading (cis-positive, red; cis-negative, pink; trans-positive, blue; trans-negative, light blue). A matrix of predictor-expression associations with Bayes factor greater than five is plotted, illustrating a strong off-diagonal pattern at several loci, including regions on chromosomes 1q, 7p, 8, 11q, 14q, 16, 17q, and 20q. The frequency of mRNAs associated with a particular predictor further illuminates these trans-acting aberration 'hotspots'.

Dr Benilton Carvalho joined the lab as a postdoc. Benilton completed his PhD in Biostatistics with Rafael Irizarry at Johns Hopkins University, where he developed CRLMM, a genotyping algorithm for SNP arrays. He is now working on statistical methods for the next generation of Affymetrix and Illumina SNP chips, and for second-generation sequencing data. Daniel Andrews completed Part III in Mathematics at Cambridge, and joined us as a research student. Daniel is working on methods for inferring cellularity from sequencing data.

Sergii Ivakhno and Doug Speed completed their PhDs. Sergii is now in the bioinformatics group at Illumina, and Doug is a postdoc in the Balding laboratory at University College London. Dr Nuno Barbosa-Morais obtained a Marie Curie Fellowship to work in Ben Blencowe's lab at the University of Toronto, and Dr Christina Curtis has taken up her position as an Assistant Professor in Preventive Medicine in the Keck School of Medicine at the University of Southern California.