ABS09 - 2009 Applied Bayesian Statistics School
BAYESIAN METHODOLOGY FOR CLUSTERING, CLASSIFICATION AND CATEGORICAL
DATA ANALYSIS
Accademia Cusano, Bressanone/Brixen (BZ), Italy
June 15-18, 2009
Rupali Rajendra Akerkar, Norwegian University of Science
and Technology, Norway
Approximate Bayesian Inference for Survival models
In this talk I will discuss the use of "integrated nested Laplace
approximations" (INLA) to solve Bayesian inferential problems for
survival models. INLA is a generic tool that approximates posterior
marginals for latent Gaussian models, of which generalized additive
(mixed) models are one important example. Examples include Weibull
distributed lifetime models with frailty and piecewise-constant baseline
hazard models. The power of this approach is illustrated by reanalyzing
the spatial survival model of Henderson et al. (2002, JASA), which
includes semi-parametric effects of covariates, frailty and spatial effects.
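As a rough illustration of one ingredient named above, the following sketch evaluates a piecewise-constant baseline hazard and the corresponding right-censored log-likelihood. It is not INLA and contains no covariates, frailty or spatial terms; the function names, cut points and hazard values are assumptions chosen only for the example.

```python
import math

def hazard_at(t, cut_points, hazards):
    """Return the constant hazard of the interval containing time t."""
    k = 0
    while k + 1 < len(cut_points) and t >= cut_points[k + 1]:
        k += 1
    return hazards[k]

def cumulative_hazard(t, cut_points, hazards):
    """Cumulative hazard H(t) for a piecewise-constant baseline hazard.

    cut_points: increasing interval starts beginning at 0, e.g. [0, 1, 3]
    hazards:    one rate per interval; the last interval is open-ended
    """
    total = 0.0
    for k, rate in enumerate(hazards):
        start = cut_points[k]
        end = cut_points[k + 1] if k + 1 < len(cut_points) else math.inf
        if t <= start:
            break
        total += rate * (min(t, end) - start)
    return total

def survival(t, cut_points, hazards):
    """Survival function S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard(t, cut_points, hazards))

def log_likelihood(times, events, cut_points, hazards):
    """Right-censored log-likelihood: sum of event * log h(t) - H(t)."""
    ll = 0.0
    for t, d in zip(times, events):
        if d:  # observed event contributes the log hazard at t
            ll += math.log(hazard_at(t, cut_points, hazards))
        ll -= cumulative_hazard(t, cut_points, hazards)
    return ll

# Hypothetical values, for illustration only
cuts = [0.0, 1.0, 3.0]
rates = [0.5, 0.2, 0.1]
example_ll = log_likelihood([2.0, 5.0], [1, 0], cuts, rates)
```

In a latent Gaussian formulation the log hazards of the intervals would receive a Gaussian prior; the sketch only shows the likelihood part that such a prior would be combined with.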
Eunice Campiran, National University of Mexico
How to use product partition models to reflect the prior knowledge
of the stratification in finite population sampling
In Bayesian statistics, the prior distribution of the parameters reflects
the initial beliefs of the researcher. After data are observed, we have a
learning process which is reflected in the posterior distribution of the
parameters. In finite population sampling the researcher has prior knowledge
not only of the parameters of interest, but also of the structure of the
population. In classical statistics, the experience of the researcher can help
to divide the population into relatively homogeneous subgroups in order to
reduce the variability of the estimates of interest. We propose to use product
partition models to model the prior knowledge of the structure of the
population and the parameters of interest. Using the ideas presented in
Quintana and Iglesias (2003), we use a loss function that allows us to choose
the least expensive stratification. We explore in which cases this procedure
is a learning process, so that we can use the posterior distribution of the
partitions to stratify another survey related to
the first one. Finally, we study a generalization of this model in which we do
not assume a specific distribution for each stratum; instead, we use a
random measure to make inference.
Fernando A. Quintana and Pilar L. Iglesias. Bayesian clustering and
product partition models. Journal of the Royal Statistical Society: Series B
(Statistical Methodology), 65(2):557-574, 2003.
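To make the idea of a prior over stratifications concrete, here is a minimal sketch of a product partition prior on a very small population, where the prior probability of a partition is proportional to the product of a cohesion function over its blocks (strata). The cohesion alpha * (|S| - 1)! is an assumption made for this example (a standard choice that gives the partition distribution induced by a Dirichlet process); the abstract does not state which cohesion is used, and all function names are hypothetical.

```python
from math import factorial

def set_partitions(items):
    """Yield every partition of `items` as a list of blocks (lists)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partition in set_partitions(rest):
        # Put `first` into each existing block in turn
        for i in range(len(partition)):
            yield partition[:i] + [[first] + partition[i]] + partition[i + 1:]
        # Or open a new block containing only `first`
        yield [[first]] + partition

def cohesion(block, alpha=1.0):
    """Assumed cohesion c(S) = alpha * (|S| - 1)!."""
    return alpha * factorial(len(block) - 1)

def ppm_prior(items, alpha=1.0):
    """Return (partition, probability) pairs with p(rho) proportional to prod c(S)."""
    weighted = []
    for partition in set_partitions(items):
        w = 1.0
        for block in partition:
            w *= cohesion(block, alpha)
        weighted.append((partition, w))
    total = sum(w for _, w in weighted)
    return [(p, w / total) for p, w in weighted]

# Three hypothetical population units
prior = ppm_prior(["a", "b", "c"], alpha=1.0)
```

Exhaustive enumeration is only feasible for a handful of units, since the number of partitions grows very quickly; it is used here purely to show how the prior assigns mass to candidate stratifications.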
Miguel de Carvalho, Universidade Nova de Lisboa, Portugal
Bayesian Methods and Extreme Value Statistics: A Marriage of Convenience
Extreme value statistics plays an important role in the modelling of extreme
events. The domains of application of these methods range from hydrology to
finance. In this work we concern ourselves with the use of the Bayesian
paradigm as a means to suitably conduct an extreme value analysis.
Amelie Crepet, French Food Safety Agency, France
Nonparametric Bayesian model to cluster co-exposure to
pesticides found in the French diet
This work introduces a specific application of the Bayesian nonparametric
methodology in the food risk analysis framework. Namely, the joint
distribution of the exposures to a large number of pesticides is assessed from
the available consumption data and contamination analyses. We propose to
model the exposures by a mixture of Dirichlet processes so as to determine
clusters of pesticides jointly present in the diet at high doses. The goal of
this analysis is to give directions for future toxicological experiments for
studying possible combined effects of multiple pesticide residues simultaneously
present in the diet. Two approaches are compared: the exposures to each
pesticide are either linked together in a hierarchical Dirichlet process mixture
based on a univariate Gaussian kernel, or they are assumed to arise from a
multivariate Gaussian kernel in a classical Dirichlet process mixture. In both
cases, posterior distributions are computed through a Gibbs sampler based on
stick-breaking priors. Finally, the clustering among individuals, also obtained
as an auxiliary output of these analyses, is discussed from a risk management
perspective.
Joint work with Jessica Tressou, HKUST-ISOM, Hong Kong & INRA-Met@Risk,
France
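The stick-breaking representation mentioned in the abstract can be sketched as follows. This draws from a truncated Dirichlet process mixture prior with a univariate Gaussian kernel; it is not the Gibbs sampler used in the work. The truncation level, the base measure N(0, 3^2), the unit kernel variance and all names are assumptions made for illustration.

```python
import random

def stick_breaking_weights(alpha, truncation, rng):
    """Truncated stick-breaking weights for a Dirichlet process
    with concentration parameter alpha."""
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(v * remaining)
        remaining *= 1.0 - v
    weights.append(remaining)  # last component takes the rest, so weights sum to 1
    return weights

def sample_dp_mixture(n, alpha, truncation, seed=0):
    """Draw n observations from a truncated DP mixture of Gaussians."""
    rng = random.Random(seed)
    weights = stick_breaking_weights(alpha, truncation, rng)
    # Assumed base measure: component means drawn from N(0, 3^2)
    means = [rng.gauss(0.0, 3.0) for _ in range(truncation)]
    # Cluster labels, then observations from the assumed unit-variance kernel
    labels = rng.choices(range(truncation), weights=weights, k=n)
    data = [rng.gauss(means[z], 1.0) for z in labels]
    return weights, labels, data

weights, labels, data = sample_dp_mixture(n=50, alpha=1.0, truncation=20)
```

The labels show how the construction induces a clustering: observations sharing a label belong to the same mixture component, and small values of alpha concentrate the weight on a few components.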
Vanda Inacio, University of Lisbon, Portugal
An Overview of Statistical Methods in Medical Tests
Diagnostic tests are procedures used to discriminate individuals with some
disease from those without it, with some chance of error. Our aim in this
work is to quantify this chance of error. To this end we offer a survey
of the statistical methods which can be employed in medical tests.
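One common way to quantify the chance of error of a diagnostic test, shown here only as an assumed example since the abstract does not list the methods surveyed, is through sensitivity, specificity and the positive predictive value obtained from Bayes' theorem. The counts and prevalence below are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Probability that a diseased individual tests positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Probability that a healthy individual tests negative."""
    return true_neg / (true_neg + false_pos)

def positive_predictive_value(sens, spec, prevalence):
    """P(disease | positive test) by Bayes' theorem."""
    true_pos_mass = sens * prevalence
    false_pos_mass = (1.0 - spec) * (1.0 - prevalence)
    return true_pos_mass / (true_pos_mass + false_pos_mass)

# Hypothetical validation study: 100 diseased and 1000 healthy individuals
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=950, false_pos=50)
# Hypothetical prevalence of 1% in the target population
ppv = positive_predictive_value(sens, spec, prevalence=0.01)
```

With a rare disease, even a test with high sensitivity and specificity can have a low positive predictive value, which is why the prevalence enters the calculation explicitly.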
Svetlana Ledyaeva, Helsinki School of Economics, Finland
Determinants of entry mode choice of foreign/multinational firms
in Russia: empirical study
The aim of this paper is to empirically analyze the motivations of foreign
enterprises producing in Russian regions for choosing full ownership (FO)
(setting up a wholly foreign-owned enterprise (WFOE) or engaging in a full
acquisition) or shared ownership with a local partner (firms, individuals,
governmental (state) authorities), i.e. joint ventures (JV). As empirical
tools we use binary and ordered logit models and multinomial models. The
main findings can be summarized as follows. We found some evidence that
institutions influence such entry strategies; in particular, foreign entrants
prefer higher control modes when the institutional environment is better in a
particular Russian region. We also found that when the investment risk is
higher in a particular Russian region, foreign investors tend to establish
lower control modes. We also preliminarily conclude that the more
state-owned enterprises dominate in a particular Russian region, the fewer
foreign firms take the form of WFOEs and higher-control-mode JVs.
Furthermore, we found that the more state-owned enterprises dominate in a
particular Russian region, the more foreign investors prefer partnerships
with governmental authorities to WFOEs and JVs with private partners. Our
results also indicate that foreign entrants into the Russian market prefer higher
control modes when the capital of a firm is high; higher human capital and
economic growth potential encourage foreign investors to establish higher
control modes; JVs are more likely in resource-based industries.
Joint work with Päivi Karhunen, Helsinki School of Economics
Magdalena Malina, Mathematical Institute, University of Wroclaw
Logic regression in application to detection of SNP-SNP interactions
We consider the biological problem of detecting genes that are responsible for
quantitative features. The data that we consider are SNPs (Single-Nucleotide
Polymorphisms), i.e. DNA sequence variations occurring when a single nucleotide
A, T, C, or G in the genome differs between members of a species or between
paired chromosomes. In this context an issue of great importance is the
detection of interactions among many SNPs, which may cause differences in, e.g.,
disease status.
Logic regression, introduced in [2] by Ruczinski, Kooperberg and LeBlanc, is a
regression method that attempts to construct predictors as Boolean combinations
of binary variables. Several versions of logic regression now exist: the
classical version with simulated annealing as a search algorithm, proposed by
Schwender in [1]; Monte Carlo logic regression, by Kooperberg and Ruczinski [3];
and the Bayesian version of logic regression introduced by Fritsch and Ickstadt
in [4] and Fritsch in [5]. All these methods identify combinations of predictors
associated with an outcome and can be applied to genetic data.
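The notion of a predictor built as a Boolean combination of binary variables can be illustrated with a minimal sketch. It evaluates one fixed, hypothetical logic term on binary-coded SNPs and fits a single-term linear model by least squares; the search over logic terms (simulated annealing, Monte Carlo or Bayesian) that the methods above perform is deliberately omitted. The binary SNP coding, the term itself, the data and all names are assumptions.

```python
def logic_term(snp):
    """Hypothetical Boolean combination: (X1 and not X2) or X3."""
    x1, x2, x3 = snp
    return int((x1 and not x2) or x3)

def fit_single_term(snps, outcomes, term):
    """Least-squares fit of outcome = b0 + b1 * L, with L the Boolean term.

    For a binary predictor the least-squares solution is the mean outcome
    where L = 0 (b0) and the difference in group means (b1).
    Both groups must be non-empty.
    """
    values = [term(s) for s in snps]
    group1 = [y for l, y in zip(values, outcomes) if l == 1]
    group0 = [y for l, y in zip(values, outcomes) if l == 0]
    b0 = sum(group0) / len(group0)
    b1 = sum(group1) / len(group1) - b0
    return b0, b1

# Hypothetical binary-coded SNPs (e.g. presence of a minor allele) and a
# hypothetical quantitative outcome
snps = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 0, 0), (1, 1, 1), (0, 1, 0)]
outcomes = [2.0, 1.0, 2.2, 0.8, 1.8, 1.2]
b0, b1 = fit_single_term(snps, outcomes, logic_term)
```

A full logic regression would score many candidate terms in this way and let the search algorithm move between them.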
When there are many SNPs, however, as in genome-wide association studies,
a multiple testing problem arises, and classical model selection
criteria need an additional penalty for model dimension (Bogdan et al. [6]).
In Bayesian versions of model selection criteria, the problem of choosing the
penalty is replaced by the problem of properly selecting prior distributions
(Scott and Berger [7]).
[1] Schwender H., Statistical analysis of genotype and gene expression data,
PhD thesis, URL: hdl.handle.net/2003/23306
[2] Ruczinski I., Kooperberg C., LeBlanc M., Logic regression, J. Comput.
Graphical Statist. 12(3), 474-511 (2003),
URL: biostat.jhsph.edu/iruczins/publications/publications.html
[3] Kooperberg C., Ruczinski I., Identifying interacting SNPs using Monte
Carlo logic regression, Genetic Epidemiology 28, 157-170 (2005)
[4] Fritsch A., Ickstadt K., Comparing logic regression based methods for
identifying SNP interactions, Lecture Notes in Computer Science 4414,
Springer Berlin / Heidelberg, 90-103 (2007)
[5] Fritsch A., A full Bayesian version of logic regression for SNP data,
Diploma thesis (2006)
[6] Bogdan M., Ghosh J.K., Zak-Szatkowska M., Selecting explanatory variables
with the modified version of Bayesian Information Criterion, Quality and
Reliability Engineering International 24, 627-641 (2008)
[7] Scott J.G., Berger J.O., Bayes and empirical-Bayes multiplicity
adjustment in the variable-selection problem, Duke University Department
of Statistical Science Technical Report (2008)
Joint work with Małgorzata Bogdan, Institute of Mathematics and Computer
Science, Wroclaw University of Technology
Emanuele Olivetti, Fondazione Bruno Kessler, Trento, Italy
Automatic fiber bundle segmentation in the human brain: learning
from examples
Recent neuroimaging techniques such as Diffusion Spectrum Imaging
(DSI) allow major improvements in the reconstruction of white matter
fibers inside the human brain. The problem of segmenting fibers into fiber
bundles, for anatomical interpretation, is traditionally approached as
an unsupervised problem in which clustering techniques estimate bundles
using little prior information. In this work we address the same
segmentation problem with the help of side information: segmented
(example) bundles provided by expert neuroanatomists on a single brain.
The problem is then to find the equivalent bundles in a target brain by
exploiting the information from the examples. We propose a method based
on fiber-pair classification and graph clustering that does not rely
on the spatial location of the bundles. We tested the proposed method on
the Pittsburgh Brain Competition 2009 (PBC2009) dataset.
Cristian Pattaro, Institute of Genetic Medicine, European Academy
Bozen/Bolzano (EURAC), Italy
Genome-wide association analysis of Serum Creatinine in five European
populations
Serum creatinine is used to estimate the renal filtration rate, which is
the main indicator of renal condition. Decreased renal function can lead to
renal insufficiency and is an important risk factor for cardiovascular disease
mortality and morbidity. We present the results of a meta-analysis of
genome-wide association scans from five European populations.
Joan Petur Petersen, Technical University of Denmark, Denmark (Faroe
Islands)
Energy optimization for propulsion of ocean going vessels
This project aims at developing an advanced, mathematical model for
analyzing the dynamics of ocean-going vessels, especially with regard to
modeling fuel efficiency. Traditionally, physical models, e.g. hydrodynamics,
the study of motions of liquids, have been used to model the dynamics and
fuel efficiency of ships. However, these models are severely limited when it
comes to modeling the real-life conditions of ocean-going vessels.
Therefore, the goal of this project is to develop powerful models, based
on machine-learning approaches, that are able to adapt to actual conditions
of these ships.
Our systems approach integrates visualization, feature extraction and
prediction of energy usage from the measured features. These
machine learning approaches include principal component analysis, clustering
methods and artificial neural networks.