ABS10 - 2010 Applied Bayesian Statistics School
BAYESIAN MACHINE LEARNING WITH BIOMEDICAL APPLICATIONS
EURAC, Bolzano/Bozen, Italy
June 11-15, 2010
Simon Aeschbacher, University of Edinburgh, United Kingdom
The choice of summary statistics in Approximate Bayesian Computation
For some models in population genetics, the computation of a likelihood is
prohibitively expensive or even impossible. Approximate Bayesian Computation
(ABC) has been introduced to avoid explicit calculation of the likelihood.
Instead, a rejection sampling scheme is used to directly sample from the
posterior distribution. To increase the efficiency of the algorithm, it is
necessary to collapse the full data onto a set of summary statistics. However,
hardly any summary statistic in population genetics is sufficient. How should
summary statistics be chosen? Ideally, we want to extract as much information
from the data as possible. On the other hand, we want to keep the number of
statistics, and hence the number of dimensions, as low as possible. We used
binomial boosting to choose parameter-specific sets of statistics and were able
to reduce the full set of candidate statistics. However, there might be more
sophisticated methods to tackle this problem and to weight the statistics
according to their importance. These weights could then be used to adjust the
rejection kernel in the rejection algorithm.
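A minimal sketch of the rejection scheme described above, on a toy Poisson model of our own choosing (the functions simulate and summaries and the weights argument are illustrative assumptions, not the author's implementation); weights marks the place where per-statistic importance weights could adjust the rejection kernel:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n=100):
    """Toy stochastic model: data are Poisson draws with rate theta."""
    return rng.poisson(theta, size=n)

def summaries(x):
    """Collapse the full data onto a small set of summary statistics."""
    return np.array([x.mean(), x.var()])

def abc_rejection(observed, prior_sample, n_sims=20_000, eps=0.75, weights=None):
    """Keep prior draws whose simulated summaries fall within eps of the
    observed summaries (weighted Euclidean distance)."""
    s_obs = summaries(observed)
    w = np.ones_like(s_obs) if weights is None else np.asarray(weights)
    accepted = []
    for _ in range(n_sims):
        theta = prior_sample()               # draw a parameter from the prior
        s_sim = summaries(simulate(theta))   # summarize the simulated data
        if np.sqrt(np.sum(w * (s_sim - s_obs) ** 2)) < eps:  # rejection kernel
            accepted.append(theta)
    return np.array(accepted)                # approximate posterior sample

observed = rng.poisson(3.0, size=100)
posterior = abc_rejection(observed, lambda: rng.uniform(0, 10))
print(posterior.mean(), posterior.std())
```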
Laura Azzimonti, Politecnico di Milano, Italy
Mixed-effects models for growth curves: an application to the study of
reconstitution kinetics of lymphocyte subpopulations
In this work we describe an application of nonlinear mixed-effects models
to biological growth curves. The growth curves represent the reconstitution
of lymphocyte subpopulations in the peripheral blood of pediatric patients
who underwent Hematopoietic Stem Cell Transplantation; in particular, we
focus on iNKT frequency among T cells and on iNKT CD161+ cells among iNKT
CD4+ and iNKT CD4- cells. The aim of the study is to describe reconstitution
kinetics of these subpopulations and to highlight potential dependencies
of these curves on the onset of relapse. We use a logistic mixed-effects model
to describe iNKT frequency reconstitution and an asymptotic exponential
mixed-effects model to describe iNKT CD161+ frequency growth.
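For illustration only, the two curve shapes named in the abstract can be sketched as below; the parameter names (asym, xmid, scal, rate) are ours, and the fit is a plain fixed-effects fit on simulated data, not the mixed-effects analysis of the study:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, asym, xmid, scal):
    """Logistic growth: asymptote asym, inflection at xmid, scale scal."""
    return asym / (1.0 + np.exp((xmid - t) / scal))

def asymptotic_exp(t, asym, rate):
    """Asymptotic exponential growth toward asym at rate `rate`
    (the second model form; it would be fitted analogously)."""
    return asym * (1.0 - np.exp(-rate * t))

# Toy data: one subject's frequency trajectory over days post-transplant
t = np.linspace(0, 365, 20)
y = logistic(t, asym=1.2, xmid=120, scal=40) + np.random.normal(0, 0.05, t.size)

popt, _ = curve_fit(logistic, t, y, p0=[1.0, 100.0, 30.0])
print(dict(zip(["asym", "xmid", "scal"], popt)))
```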
Stefano Baraldo, Politecnico di Milano, Italy
Statistical models for hazard functions: a case study of
hospitalizations in heart failure telemonitoring
Home telemonitoring is gradually spreading as a possible solution for
reducing the heavy costs associated with the usual-care treatment of heart
failure patients. In this work, a very general statistical model for
recurrent events is applied to the analysis of cardiovascular hospitalizations
in the Lombardia region, concerning patients who underwent a period of
telemonitoring. The aim of the study is to identify crucial characteristics
of these subjects. The model is applied to the whole known patient history
and to the pre- and post-telemonitoring periods separately, and cumulative
hazard functions related to the hospitalization processes are estimated.
In particular, hazard functions could be used to perform a classification
analysis, treating them as functional data.
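The abstract does not name a specific estimator; as one standard choice for such event processes, a Nelson-Aalen-style estimate of the cumulative hazard can be sketched as follows (toy numbers, tied events listed individually):

```python
import numpy as np

def nelson_aalen(event_times, at_risk_counts):
    """Nelson-Aalen estimate of the cumulative hazard,
    H(t) = sum over event times t_i <= t of 1 / n_i,
    where n_i subjects are at risk just before t_i."""
    order = np.argsort(event_times)
    times = np.asarray(event_times, float)[order]
    increments = 1.0 / np.asarray(at_risk_counts, float)[order]
    return times, np.cumsum(increments)

# Hospitalization times (days) and number of subjects still at risk
times, H = nelson_aalen([30, 45, 45, 90, 200], [50, 49, 48, 40, 25])
print(list(zip(times, H.round(3))))
```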
Ramiro Barrantes, University of Vermont, USA
Using shifts in amino acid frequency or substitution rate to
identify latent structural characters improves our understanding of
the structure, function and evolution of base-excision repair enzymes
We describe a novel method for identification of the gain and loss
of structural features in a protein phylogeny, given a set of
homologous protein sequences and a high-resolution structure. We then
apply this method to study the evolution of amino acid sequence,
structure, and function of a family of DNA glycosylases. Protein
structure evolution includes transitions between states of
phylogenetic characters that are latent in sequence data, for example,
the gain or loss of a salt bridge during evolution. Our first goal is
to annotate the phylogeny of the Fpg/Nei family of base excision
repair enzymes with states of latent structural characters
(LSC). First, we identified instances in which amino acid
frequencies or overall substitution rates change during evolution
using methods developed by Xun Gu and coworkers. Second, we
found sets of amino acids near each other in the structure exhibiting
correlations of such changes. Third, we used these sets of amino acids
to manually identify LSC in clades within the Fpg/Nei phylogeny. We
describe seven LSC; an accompanying Proteopedia page
(http://proteopedia.org/wiki/index.php/Fpg_Nei_Protein_Family)
describes these in greater detail and facilitates 3D viewing. Our
method captures familiar examples, such as a Zn finger, as well as
more subtle interactions. Given that identification is based in large
part on sequence evolution, the LSC provide a surprisingly complete
picture of the interaction of the protein with the DNA. Our second
goal is to use these LSC to understand the Fpg/Nei phylogeny. Our
methods identified the Fpg substrate specificity loop as an LSC,
quantified conservation in each clade, and predicted that the role of
this region in specificity varies substantially throughout the family.
Phylogenetic inference based on LSC provided convincing
evidence of independent losses of Zn fingers in an ancestor of the
plant and fungal proteins and in an ancestor of metazoan Neil1 proteins.
Lastly, we found that the majority of amino acids exhibit a
statistically significant change in amino acid substitution rate or
frequency, suggesting that explicit modeling of transition involving
LSC may prove widely useful in understanding the evolution of protein
structure and function.
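As a rough sketch of the second step of the method (hypothetical inputs: per-site shift scores and C-alpha coordinates; both cutoffs are arbitrary), residues that show a significant shift and lie near each other in the structure can be grouped as candidate LSC:

```python
import numpy as np

def correlated_neighbor_pairs(coords, shift_scores, dist_cutoff=8.0,
                              score_cutoff=0.95):
    """Pair residues that (a) show a significant frequency/rate shift and
    (b) lie within dist_cutoff angstroms of each other in the structure."""
    hits = np.where(shift_scores >= score_cutoff)[0]
    pairs = []
    for i in hits:
        for j in hits:
            if i < j and np.linalg.norm(coords[i] - coords[j]) <= dist_cutoff:
                pairs.append((i, j))
    return pairs

# Toy input: five residues, C-alpha coordinates in angstroms
coords = np.array([[0, 0, 0], [3, 0, 0], [20, 0, 0], [21, 1, 0], [50, 0, 0]], float)
scores = np.array([0.99, 0.97, 0.98, 0.99, 0.10])
print(correlated_neighbor_pairs(coords, scores))  # -> [(0, 1), (2, 3)]
```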
Antonio Canale, Università di Padova, Italy
Bayesian Mixture for Counts
Although Bayesian nonparametric mixture models for continuous data are well
developed, there is a limited literature on related approaches for count
data. A common strategy is to use a mixture of Poissons, which is
unfortunately quite restrictive, as it cannot account for distributions with variance less
than the mean. Other approaches include mixing multinomials, which requires
finite support, and using a Dirichlet process prior with a Poisson base
measure, which does not allow smooth deviations from the Poisson. As a
broad class of alternative models, we propose to use nonparametric mixtures
of rounded continuous kernels. We provide sufficient conditions on the
kernels and prior for the mixing measure under which all count distributions
fall within the Kullback-Leibler support. This is shown to imply both weak
and strong posterior consistency. The conditions are shown to hold for
Dirichlet process mixtures of rounded Gaussian, log Gaussian and gamma priors.
Focusing on the rounded Gaussian case, we generalize the modeling framework
to account for multivariate count data, joint modeling with continuous and
categorical variables, and other complications. An efficient Gibbs sampler
is developed for posterior computation, and the methods are illustrated
through application to marketing data.
(Joint work with David Dunson, Duke University, USA)
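A small sketch of the rounded-kernel idea, with a finite mixture standing in for the nonparametric mixture (the thresholds a_0 = -infinity and a_j = j - 1 for j >= 1 are one concrete choice; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def round_to_count(z):
    """Rounded kernel: count j is observed when the latent continuous draw z
    falls in [a_j, a_{j+1}), with a_0 = -inf and a_j = j - 1 for j >= 1,
    so negative z maps to count 0."""
    return np.where(z < 0, 0, np.floor(z) + 1).astype(int)

def sample_counts(weights, means, sds, n):
    """Draw counts from a finite mixture of rounded Gaussian kernels."""
    comp = rng.choice(len(weights), size=n, p=weights)
    z = rng.normal(np.asarray(means)[comp], np.asarray(sds)[comp])
    return round_to_count(z)

y = sample_counts(weights=[0.7, 0.3], means=[2.0, 8.0], sds=[0.4, 2.0], n=10_000)
print(y.mean(), y.var())  # unlike a Poisson, variance need not equal the mean
```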
Francesca Ieva, Politecnico di Milano, Italy
A hierarchical random-effects model for survival in patients with
Acute Myocardial Infarction
Studies of variations in health care utilization and outcome involve the
analysis of multilevel clustered data. These analyses involve estimation
of a cluster-specific adjusted response, covariate effects and components
of variance. Beyond reporting on the extent of observed variations, these
studies examine the role of contributing factors, including patients' and
providers' characteristics. In addition, they may assess the relationship
between health-care process and outcomes. In this talk we present a
case-study, considering firstly a Hierarchical Generalized Linear Model
(HGLM) formulation, then a semi-parametric Dirichlet Process Mixture (DPM) model,
and propose their application to the analysis of MOMI2 (MOnth MOnitoring
Myocardial Infarction in MIlan) study on patients admitted with ST-Elevation
Myocardial Infarction diagnosis. We develop a Bayesian approach to fitting
data using Markov Chain Monte Carlo methods and discuss some issues about
model fitting.
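A minimal sketch of a Bayesian random-intercept logistic HGLM fitted by random-walk Metropolis on simulated data (the cluster structure, priors, and tuning constants are our assumptions, not the MOMI2 analysis):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy data: binary outcomes for patients nested in hospitals (clusters)
n_hosp, n_per = 10, 50
hosp = np.repeat(np.arange(n_hosp), n_per)
x = rng.normal(size=n_hosp * n_per)            # patient-level covariate
b_true = rng.normal(0, 0.8, n_hosp)            # hospital random intercepts
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.0 * x + b_true[hosp]))))

def log_post(beta, b, tau=1.0):
    """Log posterior: logistic likelihood, N(0,1) priors on the fixed
    effects, N(0, tau^2) priors on the random intercepts."""
    eta = beta[0] + beta[1] * x + b[hosp]
    loglik = np.sum(y * eta - np.log1p(np.exp(eta)))
    return loglik - 0.5 * np.sum(beta ** 2) - 0.5 * np.sum((b / tau) ** 2)

beta, b = np.zeros(2), np.zeros(n_hosp)
lp, samples = log_post(beta, b), []
for it in range(5000):                         # random-walk Metropolis
    beta_p = beta + rng.normal(0, 0.05, 2)
    b_p = b + rng.normal(0, 0.05, n_hosp)
    lp_p = log_post(beta_p, b_p)
    if np.log(rng.uniform()) < lp_p - lp:
        beta, b, lp = beta_p, b_p, lp_p
    samples.append(beta.copy())
print(np.mean(samples[2500:], axis=0))         # posterior means, fixed effects
```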
Andrea Mognon, Fondazione Bruno Kessler, Italy
Brain Decoding: Biases in Error Estimation
Multivariate approaches for the analysis of high-dimensional data are attracting
wide interest and increasing adoption within the neuroscience community.
Neuroimaging techniques make it possible to record brain activity from a
subject exposed to stimuli according to a stimulation protocol.
The brain decoding challenge is to recognize a relation between brain activity
and different categories of stimuli, in order to investigate the brain function of interest.
A recent trend in neuroscience is to perform hypothesis testing via
classification: when classification accuracy is significantly better than
chance, the data likely contain information that discriminates between the
different stimuli. A classification algorithm is therefore trained on the
recorded data, and the misclassification rate of its predictions is
estimated and used in a statistical test.
This generic classification problem can be implemented in several ways but
some implementations produce biased estimates due to circular analysis issues
that could invalidate the conclusions of the scientific study; the most
suitable implementation of the classification problem must therefore be used.
According to a recent review [1], biased implementations are frequent in
neuroscience publications, even in the most prestigious journals.
In this talk we propose different processes to estimate the quality of
classification when it comprises a variable selection step together
with a parameter selection step. For each different implementation we
investigate the associated bias.
Analyses are conducted on synthetic data as well as on magnetoencephalography
(MEG) data from a left vs. right attention task. The effects of different
implementations of the classification algorithm are quantified by means of
expected misclassification rate. Results demonstrate the importance of adopting a
proper error estimation process.
[1]: N. Kriegeskorte, W. K. Simmons, P. S. F. Bellgowan, and C. I. Baker,
Circular analysis in systems neuroscience: the dangers of double dipping,
Nature Neuroscience, vol. 12, no. 5, pp. 535-540, April 2009. [Online].
Available: http://dx.doi.org/10.1038/nn.2303
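The bias discussed above is easy to reproduce. The sketch below (scikit-learn, pure-noise data, so any accuracy above 0.5 is an artifact) contrasts a biased implementation, in which variable selection sees all the data before cross-validation, with an unbiased one, in which selection is re-fit inside each training fold:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X, y = rng.normal(size=(40, 5000)), np.repeat([0, 1], 20)  # pure noise

# Biased ("double dipping"): select features on ALL data, then cross-validate
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
print("biased:  ", cross_val_score(SVC(), X_sel, y, cv=5).mean())

# Unbiased: selection is refitted inside each training fold via a Pipeline
pipe = Pipeline([("sel", SelectKBest(f_classif, k=20)), ("clf", SVC())])
print("unbiased:", cross_val_score(pipe, X, y, cv=5).mean())
```

On pure noise the biased estimate typically lands far above chance, while the pipeline stays near 0.5.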
Cristian Pattaro, Mirko Modenese and Fabiola Del Greco, EURAC, Italy
Selection of SNPs to be replicated in GWA studies: a Bayesian approach
In this study we aim to develop a novel methodological approach to the
selection of SNPs (single nucleotide polymorphisms) for replication, where
prior biological knowledge on individual SNPs is used to support the evidence
coming from the GWA (Genome-Wide Association) investigation in identifying
the "most promising" associations to be tested in the replication sample.
This approach will be developed within a Bayesian framework using a robust
methodology that is transparent in its assumptions and is reproducible.
Bayesian methods have become increasingly popular in all fields of medical
research, particularly in those instances where there is a need for
integration of evidence from different sources, due to the possibility of
formally including prior knowledge into the analysis (through specification
of informative prior distributions) and its great flexibility in terms of
statistical modeling.
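One concrete way such prior integration could look (a sketch, not the method under development) is a Wakefield-style approximate Bayes factor from GWA summary statistics combined with a SNP-specific prior probability of association; the prior variance and all numbers below are illustrative assumptions:

```python
import numpy as np

def approx_bayes_factor(beta_hat, se, prior_var=0.04):
    """Asymptotic Bayes factor for H1 (effect ~ N(0, prior_var))
    against H0 (no effect), from a GWA effect estimate and its SE."""
    V, W = se ** 2, prior_var
    z2 = (beta_hat / se) ** 2
    return np.sqrt(V / (V + W)) * np.exp(0.5 * z2 * W / (V + W))

def posterior_prob(beta_hat, se, prior_prob):
    """Combine a SNP-specific prior probability of association (from
    biological knowledge) with the GWA evidence."""
    odds = prior_prob / (1 - prior_prob) * approx_bayes_factor(beta_hat, se)
    return odds / (1 + odds)

# Same GWA evidence, different biological priors: the ranking changes
for prior in (1e-4, 1e-2):
    print(prior, posterior_prob(beta_hat=0.15, se=0.04, prior_prob=prior))
```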
Emanuela Raffinetti, Università di Pavia, Italy
Lorenz zonoids and dependence measures: a proposal
In recent years, dependence analysis has assumed a relevant role in both
economic and statistical applications: the literature provides a wide set of
statistical tools focused on extracting information about the dependence
problem. In this paper the idea is to focus attention on the Lorenz zonoid:
in the univariate case, the Lorenz zonoid corresponds to the Gini measure.
Our aim is to extend the application of Lorenz zonoids to the multivariate
setting. In particular, we first consider the Lorenz zonoid of a linear
regression function with k explanatory variables and then define, in terms of
dependence measures, the partial contribution due to the introduction of a
(k+1)-th explanatory variable. The effect of introducing a new explanatory
variable into the model translates into an increase of the dilation measure.
The final result is the definition of a new dependence measure, which we call
the "partial Lorenz dependence measure".
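As a rough numerical illustration in the spirit of the abstract (not the authors' exact construction), one can compare the Gini measure of the fitted values of a regression before and after adding an explanatory variable; the increase plays the role of a partial dependence contribution:

```python
import numpy as np

def gini(y):
    """Gini measure of a nonnegative variable via the Lorenz curve."""
    y = np.sort(np.asarray(y, float))
    n = y.size
    lorenz = np.cumsum(y) / y.sum()
    return 1 - 2 * np.sum(lorenz) / n + 1 / n

def fitted(X, y):
    """OLS fitted values with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return X1 @ beta

rng = np.random.default_rng(4)
n = 500
x1, x2 = rng.uniform(1, 5, n), rng.uniform(1, 5, n)
y = 2 + x1 + 3 * x2 + rng.normal(0, 1, n)

g_k = gini(fitted(x1.reshape(-1, 1), y))              # k = 1 covariate
g_k1 = gini(fitted(np.column_stack([x1, x2]), y))     # k + 1 covariates
print("increase from adding x2:", g_k1 - g_k)
```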