ABS10 - 2010 Applied Bayesian Statistics School
BAYESIAN MACHINE LEARNING WITH BIOMEDICAL APPLICATIONS
EURAC, Bolzano/Bozen, Italy
June 11-15, 2010
Simon Aeschbacher, University of Edinburgh, United Kingdom
The choice of summary statistics in Approximate Bayesian Computation
For some models in population genetics, the computation of a likelihood is
prohibitively expensive or even impossible. Approximate Bayesian Computation
(ABC) has been introduced to avoid explicit calculation of the likelihood.
Instead, a rejection sampling scheme is used to directly sample from the
posterior distribution. To increase the efficiency of the algorithm, it is
necessary to collapse the full data onto a set of summary statistics. However,
hardly any summary statistic in population genetics is sufficient. How should
summary statistics be chosen? Ideally, we want to extract as much information
from the data as possible. On the other hand, we want to keep the number of
statistics, and hence the number of dimensions, as low as possible. We used
binomial boosting to choose parameter-specific sets of statistics and were able
to reduce the full set of candidate statistics. However, there might be more
sophisticated methods to tackle this problem and to weight the statistics
according to their importance. These weights could then be used to adjust the
rejection kernel in the rejection algorithm.
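A minimal sketch of the rejection scheme described above, on a toy Poisson model of our own choosing (the functions simulate and summaries and the weights argument are illustrative assumptions, not the author's implementation); weights marks the place where per-statistic importance weights could adjust the rejection kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=100):
    """Toy stochastic model: data are Poisson draws with rate theta."""
    return rng.poisson(theta, size=n)

def summaries(x):
    """Collapse the full data onto a small set of summary statistics."""
    return np.array([x.mean(), x.var()])

def abc_rejection(observed, prior_sample, n_sims=20_000, eps=0.75, weights=None):
    """Keep prior draws whose simulated summaries fall within eps of the
    observed summaries (weighted Euclidean distance)."""
    s_obs = summaries(observed)
    w = np.ones_like(s_obs) if weights is None else np.asarray(weights)
    accepted = []
    for _ in range(n_sims):
        theta = prior_sample()               # draw a parameter from the prior
        s_sim = summaries(simulate(theta))   # summarize the simulated data
        if np.sqrt(np.sum(w * (s_sim - s_obs) ** 2)) < eps:  # rejection kernel
            accepted.append(theta)
    return np.array(accepted)                # approximate posterior sample

observed = rng.poisson(3.0, size=100)
posterior = abc_rejection(observed, lambda: rng.uniform(0, 10))
print(posterior.mean(), posterior.std())
```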
Laura Azzimonti, Politecnico di Milano, Italy
Mixed-effects models for growth curves: an application to the study of
reconstitution kinetics of lymphocyte subpopulations
In this work we describe an application of nonlinear mixed-effects models
to biological growth curves. The growth curves represent the reconstitution
of lymphocyte subpopulations in the peripheral blood of pediatric patients
who underwent Hematopoietic Stem Cell Transplantation; in particular, we
focus on iNKT frequency among T cells and on iNKT CD161+ cells among iNKT
CD4+ and iNKT CD4- cells. The aim of the study is to describe reconstitution
kinetics of these subpopulations and to highlight potential dependencies
of these curves on the onset of relapse. We use a logistic mixed-effects model
to describe iNKT frequency reconstitution and an asymptotic exponential
mixed-effects model to describe iNKT CD161+ frequency growth.
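For illustration only, the two curve shapes named in the abstract can be sketched as below; the parameter names (asym, xmid, scal, rate) are ours, and the fit is a plain fixed-effects fit on simulated data, not the mixed-effects analysis of the study:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, asym, xmid, scal):
    """Logistic growth: asymptote asym, inflection at xmid, scale scal."""
    return asym / (1.0 + np.exp((xmid - t) / scal))

def asymptotic_exp(t, asym, rate):
    """Asymptotic exponential growth toward asym at rate `rate`
    (the second model form; it would be fitted analogously)."""
    return asym * (1.0 - np.exp(-rate * t))

# Toy data: one subject's frequency trajectory over days post-transplant
t = np.linspace(0, 365, 20)
y = logistic(t, asym=1.2, xmid=120, scal=40) + np.random.normal(0, 0.05, t.size)

popt, _ = curve_fit(logistic, t, y, p0=[1.0, 100.0, 30.0])
print(dict(zip(["asym", "xmid", "scal"], popt)))
```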
Stefano Baraldo, Politecnico di Milano, Italy
Statistical models for hazard functions: a case study of
hospitalizations in heart failure telemonitoring
Home telemonitoring is gradually spreading as a possible solution for
reducing the heavy costs associated with the usual-care treatment of heart
failure patients. In this work, a very general statistical model for
recurrent events is applied to the analysis of cardiovascular hospitalizations
in the Lombardia region, concerning patients who underwent a period of
telemonitoring. The aim of the study is to identify crucial characteristics
of these subjects. The model is applied to the whole known patient history
and to the pre- and post-telemonitoring periods separately, and cumulative
hazard functions related to the hospitalization processes are estimated.
In particular, hazard functions could be used to perform a classification
analysis, treating them as functional data.
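The abstract does not name a specific estimator; as one standard choice for such event processes, a Nelson-Aalen-style estimate of the cumulative hazard can be sketched as follows (toy numbers, tied events listed individually):

```python
import numpy as np

def nelson_aalen(event_times, at_risk_counts):
    """Nelson-Aalen estimate of the cumulative hazard,
    H(t) = sum over event times t_i <= t of 1 / n_i,
    where n_i subjects are at risk just before t_i."""
    order = np.argsort(event_times)
    times = np.asarray(event_times, float)[order]
    increments = 1.0 / np.asarray(at_risk_counts, float)[order]
    return times, np.cumsum(increments)

# Hospitalization times (days) and number of subjects still at risk
times, H = nelson_aalen([30, 45, 45, 90, 200], [50, 49, 48, 40, 25])
print(list(zip(times, H.round(3))))
```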
Ramiro Barrantes, University of Vermont, USA
Using shifts in amino acid frequency or substitution rate to
identify latent structural characters improves our understanding of
the structure, function and evolution of base-excision repair enzymes
We describe a novel method for identification of the gain and loss
of structural features in a protein phylogeny, given a set of
homologous protein sequences and a high-resolution structure. We then
apply this method to study the evolution of amino acid sequence,
structure, and function of a family of DNA glycosylases. Protein
structure evolution includes transitions between states of
phylogenetic characters that are latent in sequence data, for example,
the gain or loss of a salt bridge during evolution. Our first goal is
to annotate the phylogeny of the Fpg/Nei family of base excision
repair enzymes with states of latent structural characters
(LSC). First, we identified instances in which amino acid
frequencies or overall substitution rates change during evolution
using methods developed by Xun Gu and coworkers. Second, we
found sets of amino acids near each other in the structure exhibiting
correlations of such changes. Third, we used these sets of amino acids
to manually identify LSC in clades within the Fpg/Nei phylogeny. We
describe seven LSC; an accompanying Proteopedia page
(http://proteopedia.org/wiki/index.php/Fpg_Nei_Protein_Family)
describes these in greater detail and facilitates 3D viewing. Our
method captures familiar examples, such as a Zn finger, as well as
more subtle interactions. Given that identification is based in large
part on sequence evolution, the LSC provide a surprisingly complete
picture of the interaction of the protein with the DNA. Our second
goal is to use these LSC to understand the Fpg/Nei phylogeny. Our
methods identified the Fpg substrate specificity loop as an LSC,
quantified conservation in each clade, and predicted that the role of
this region in specificity varies substantially throughout the family.
Phylogenetic inference based on LSC provided convincing
evidence of independent losses of Zn fingers in an ancestor of the
plant and fungal proteins and in an ancestor of metazoan Neil1 proteins.
Lastly, we found that the majority of amino acids exhibit a
statistically significant change in amino acid substitution rate or
frequency, suggesting that explicit modeling of transition involving
LSC may prove widely useful in understanding the evolution of protein
structure and function.
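As a rough sketch of the second step of the method (hypothetical inputs: per-site shift scores and C-alpha coordinates; both cutoffs are arbitrary), residues that show a significant shift and lie near each other in the structure can be grouped as candidate LSC:

```python
import numpy as np

def correlated_neighbor_pairs(coords, shift_scores, dist_cutoff=8.0,
                              score_cutoff=0.95):
    """Pair residues that (a) show a significant frequency/rate shift and
    (b) lie within dist_cutoff angstroms of each other in the structure."""
    hits = np.where(shift_scores >= score_cutoff)[0]
    pairs = []
    for i in hits:
        for j in hits:
            if i < j and np.linalg.norm(coords[i] - coords[j]) <= dist_cutoff:
                pairs.append((i, j))
    return pairs

# Toy input: five residues, C-alpha coordinates in angstroms
coords = np.array([[0, 0, 0], [3, 0, 0], [20, 0, 0], [21, 1, 0], [50, 0, 0]], float)
scores = np.array([0.99, 0.97, 0.98, 0.99, 0.10])
print(correlated_neighbor_pairs(coords, scores))  # -> [(0, 1), (2, 3)]
```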
Antonio Canale, Università di Padova, Italy
Bayesian Mixture for Counts
Although Bayesian nonparametric mixture models for continuous data are well
developed, there is a limited literature on related approaches for count
data. A common strategy is to use a mixture of Poissons, which is
unfortunately quite restrictive, as it cannot account for distributions with variance less
than the mean. Other approaches include mixing multinomials, which requires
finite support, and using a Dirichlet process prior with a Poisson base
measure, which does not allow smooth deviations from the Poisson. As a
broad class of alternative models, we propose to use nonparametric mixtures
of rounded continuous kernels. We provide sufficient conditions on the
kernels and prior for the mixing measure under which all count distributions
fall within the Kullback-Leibler support. This is shown to imply both weak
and strong posterior consistency. The conditions are shown to hold for
Dirichlet process mixtures of rounded Gaussian, log Gaussian and gamma priors.
Focusing on the rounded Gaussian case, we generalize the modeling framework
to account for multivariate count data, joint modeling with continuous and
categorical variables, and other complications. An efficient Gibbs sampler
is developed for posterior computation, and the methods are illustrated
through application to marketing data.
(Joint work with David Dunson, Duke University, USA)
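A small sketch of the rounded-kernel idea, with a finite mixture standing in for the nonparametric mixture (the thresholds a_0 = -infinity and a_j = j - 1 for j >= 1 are one concrete choice; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def round_to_count(z):
    """Rounded kernel: count j is observed when the latent continuous draw z
    falls in [a_j, a_{j+1}), with a_0 = -inf and a_j = j - 1 for j >= 1,
    so negative z maps to count 0."""
    return np.where(z < 0, 0, np.floor(z) + 1).astype(int)

def sample_counts(weights, means, sds, n):
    """Draw counts from a finite mixture of rounded Gaussian kernels."""
    comp = rng.choice(len(weights), size=n, p=weights)
    z = rng.normal(np.asarray(means)[comp], np.asarray(sds)[comp])
    return round_to_count(z)

y = sample_counts(weights=[0.7, 0.3], means=[2.0, 8.0], sds=[0.4, 2.0], n=10_000)
print(y.mean(), y.var())  # unlike a Poisson, variance need not equal the mean
```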
Francesca Ieva, Politecnico di Milano, Italy
A hierarchical random-effects model for survival in patients with
Acute Myocardial Infarction
Studies of variations in health care utilization and outcome involve the
analysis of multilevel clustered data. These analyses involve estimation
of a cluster-specific adjusted response, covariate effects and components
of variance. Beyond reporting on the extent of observed variations, these
studies examine the role of contributing factors, including patients' and
providers' characteristics. In addition, they may assess the relationship
between health-care process and outcomes. In this talk we present a
case-study, considering firstly a Hierarchical Generalized Linear Model
(HGLM) formulation, then a semi-parametric Dirichlet Process Mixture (DPM) model,
and propose their application to the analysis of MOMI2 (MOnth MOnitoring
Myocardial Infarction in MIlan) study on patients admitted with ST-Elevation
Myocardial Infarction diagnosis. We develop a Bayesian approach to fitting
data using Markov Chain Monte Carlo methods and discuss some issues about
model fitting.
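A minimal sketch of a Bayesian random-intercept logistic HGLM fitted by random-walk Metropolis on simulated data (the cluster structure, priors, and tuning constants are our assumptions, not the MOMI2 analysis):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: binary outcomes for patients nested in hospitals (clusters)
n_hosp, n_per = 10, 50
hosp = np.repeat(np.arange(n_hosp), n_per)
x = rng.normal(size=n_hosp * n_per)            # patient-level covariate
b_true = rng.normal(0, 0.8, n_hosp)            # hospital random intercepts
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x + b_true[hosp]))))

def log_post(beta, b, tau=1.0):
    """Log posterior: logistic likelihood, N(0,1) priors on the fixed
    effects, N(0, tau^2) priors on the random intercepts."""
    eta = beta[0] + beta[1] * x + b[hosp]
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - 0.5 * np.sum(beta ** 2) - 0.5 * np.sum((b / tau) ** 2)

beta, b = np.zeros(2), np.zeros(n_hosp)
lp, samples = log_post(beta, b), []
for it in range(5000):                         # random-walk Metropolis
    beta_p = beta + rng.normal(0, 0.05, 2)
    b_p = b + rng.normal(0, 0.05, n_hosp)
    lp_p = log_post(beta_p, b_p)
    if np.log(rng.uniform()) < lp_p - lp:
        beta, b, lp = beta_p, b_p, lp_p
    samples.append(beta.copy())
print(np.mean(samples[2500:], axis=0))         # posterior means, fixed effects
```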
Andrea Mognon, Fondazione Bruno Kessler, Italy
Brain Decoding: Biases in Error Estimation
Multivariate approaches for the analysis of high-dimensional data are attracting
wide interest and increasing adoption within the neuroscience community.
Neuroimaging techniques make it possible to record brain activity from a
subject exposed to stimuli according to a stimulation protocol.
The brain decoding challenge is to recognize a relation between brain activity
and different categories of stimuli, in order to investigate the brain function of interest.
A recent trend in neuroscience is to perform hypothesis testing via
classification: when classification accuracy is significantly better than
chance, the data likely contain information that discriminates between the
different stimuli. A classification algorithm is therefore trained on the
recorded data, and the misclassification rate of its predictions is
estimated and used in a statistical test.
This generic classification problem can be implemented in several ways but
some implementations produce biased estimates due to circular analysis issues
that could invalidate the conclusions of the scientific study; the most
suitable implementation of the classification problem must therefore be used.
According to a recent review [1], biased implementations are frequent in
neuroscience publications, even in the most prestigious journals.
In this talk we propose different processes to estimate the quality of
classification when it comprises a variable selection step together
with a parameter selection step. For each different implementation we
investigate the associated bias.
Analyses are conducted on synthetic data as well as on magnetoencephalography
(MEG) data from a left vs. right attention task. The effects of different
implementations of the classification algorithm are quantified by means of
expected misclassification rate. Results demonstrate the importance of adopting a
proper error estimation process.
[1]: N. Kriegeskorte, W. K. Simmons, P. S. F. Bellgowan, and C. I. Baker,
Circular analysis in systems neuroscience: the dangers of double dipping,
Nature Neuroscience, vol. 12, no. 5, pp. 535-540, April 2009. [Online].
Available: http://dx.doi.org/10.1038/nn.2303
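The bias discussed above is easy to reproduce. The sketch below (scikit-learn, pure-noise data, so any accuracy above 0.5 is an artifact) contrasts a biased implementation, in which variable selection sees all the data before cross-validation, with an unbiased one, in which selection is re-fit inside each training fold:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X, y = rng.normal(size=(40, 5000)), np.repeat([0, 1], 20)  # pure noise

# Biased ("double dipping"): select features on ALL data, then cross-validate
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
print("biased:  ", cross_val_score(SVC(), X_sel, y, cv=5).mean())

# Unbiased: selection is refitted inside each training fold via a Pipeline
pipe = Pipeline([("sel", SelectKBest(f_classif, k=20)), ("clf", SVC())])
print("unbiased:", cross_val_score(pipe, X, y, cv=5).mean())
```

On pure noise the biased estimate typically lands far above chance, while the pipeline stays near 0.5.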
Cristian Pattaro, Mirko Modenese and Fabiola Del Greco, EURAC, Italy
Selection of SNPs to be replicated in GWA studies: a Bayesian approach
In this study we aim to develop a novel methodological approach to the
selection of SNPs (single nucleotide polymorphisms) for replication, where
prior biological knowledge on individual SNPs is used to support the evidence
coming from the GWA (Genome-Wide Association) investigation in identifying
the "most promising" associations to be tested in the replication sample.
This approach will be developed within a Bayesian framework using a robust
methodology that is transparent in its assumptions and is reproducible.
Bayesian methods have become increasingly popular in all fields of medical
research, particularly in those instances where there is a need for
integration of evidence from different sources, due to the possibility of
formally including prior knowledge into the analysis (through specification
of informative prior distributions) and its great flexibility in terms of
statistical modeling.
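One concrete way such prior integration could look (a sketch, not the method under development) is a Wakefield-style approximate Bayes factor from GWA summary statistics combined with a SNP-specific prior probability of association; the prior variance and all numbers below are illustrative assumptions:

```python
import numpy as np

def approx_bayes_factor(beta_hat, se, prior_var=0.04):
    """Asymptotic Bayes factor for H1 (effect ~ N(0, prior_var))
    against H0 (no effect), from a GWA effect estimate and its SE."""
    V, W = se ** 2, prior_var
    z2 = (beta_hat / se) ** 2
    return np.sqrt(V / (V + W)) * np.exp(0.5 * z2 * W / (V + W))

def posterior_prob(beta_hat, se, prior_prob):
    """Combine a SNP-specific prior probability of association (from
    biological knowledge) with the GWA evidence."""
    odds = prior_prob / (1 - prior_prob) * approx_bayes_factor(beta_hat, se)
    return odds / (1 + odds)

# Same GWA evidence, different biological priors: the ranking changes
for prior in (1e-4, 1e-2):
    print(prior, posterior_prob(beta_hat=0.15, se=0.04, prior_prob=prior))
```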
Emanuela Raffinetti, Università di Pavia, Italy
Lorenz zonoids and dependence measures: a proposal
In recent years, dependence analysis has assumed a relevant role in both
economic and statistical applications: the literature provides a wide set of
statistical tools focused on extracting information about the dependence
problem. In this paper the idea is to focus attention on the Lorenz zonoid:
in the univariate case, the Lorenz zonoid corresponds to the Gini measure.
Our aim is to extend the application of Lorenz zonoids to the multivariate
setting. In particular, we first consider the Lorenz zonoid of a linear
regression function with k explanatory variables and then define, in terms of
dependence measures, the partial contribution due to the introduction of a
(k+1)-th explanatory variable. The effect of introducing a new explanatory
variable into the model translates into an increase of the dilation measure.
The final result is the definition of a new dependence measure, which we call
the "partial Lorenz dependence measure".
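As a rough numerical illustration in the spirit of the abstract (not the authors' exact construction), one can compare the Gini measure of the fitted values of a regression before and after adding an explanatory variable; the increase plays the role of a partial dependence contribution:

```python
import numpy as np

def gini(y):
    """Gini measure of a nonnegative variable via the Lorenz curve."""
    y = np.sort(np.asarray(y, float))
    n = y.size
    lorenz = np.cumsum(y) / y.sum()
    return 1 - 2 * np.sum(lorenz) / n + 1 / n

def fitted(X, y):
    """OLS fitted values with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ beta

rng = np.random.default_rng(4)
n = 500
x1, x2 = rng.uniform(1, 5, n), rng.uniform(1, 5, n)
y = 2 + x1 + 3 * x2 + rng.normal(0, 1, n)

g_k = gini(fitted(x1.reshape(-1, 1), y))              # k = 1 covariate
g_k1 = gini(fitted(np.column_stack([x1, x2]), y))     # k + 1 covariates
print("increase from adding x2:", g_k1 - g_k)
```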