ABS09 - 2009 Applied Bayesian Statistics School

BAYESIAN METHODOLOGY FOR CLUSTERING, CLASSIFICATION AND CATEGORICAL DATA ANALYSIS

Accademia Cusano, Bressanone/Brixen (BZ), Italy

June 15-18, 2009

PARTICIPANTS' TALKS

Rupali Rajendra Akerkar, Norwegian University of Science and Technology, Norway

Approximate Bayesian Inference for Survival models

This talk discusses the use of "integrated nested Laplace approximations" (INLA) to solve Bayesian inferential problems for survival models. INLA is a generic tool that approximates posterior marginals for latent Gaussian models, of which generalized additive (mixed) models are one important example. Examples include Weibull-distributed lifetime models with frailty and piecewise-constant baseline hazard models. The power of this approach is illustrated by reanalyzing the spatial survival model of Henderson et al. (2002, JASA), which includes semi-parametric effects of covariates, frailty and spatial effects.
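The two lifetime models named in the abstract can be made concrete with a minimal sketch of their survival functions. This is only an illustration of the model definitions, not of INLA itself; the function names, the shape/scale parameterization of the Weibull, and the example numbers are assumptions introduced here.

```python
import math

def piecewise_survival(t, breakpoints, hazards):
    """Survival S(t) = exp(-H(t)) under a piecewise-constant hazard.

    breakpoints: increasing interval start times, beginning with 0.0
    hazards:     constant hazard on each interval; the last interval
                 is taken to extend to infinity (an assumption)
    """
    cumulative = 0.0
    for i, h in enumerate(hazards):
        start = breakpoints[i]
        end = breakpoints[i + 1] if i + 1 < len(breakpoints) else math.inf
        if t <= start:
            break
        # Add the hazard accumulated over the part of this interval before t.
        cumulative += h * (min(t, end) - start)
    return math.exp(-cumulative)

def weibull_survival(t, shape, scale):
    """Weibull survival S(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))

# Hypothetical example: hazard 0.1 on [0, 2), then 0.5 afterwards.
print(piecewise_survival(3.0, [0.0, 2.0], [0.1, 0.5]))
print(weibull_survival(2.0, shape=1.0, scale=2.0))
```

With shape equal to 1 the Weibull reduces to the exponential model, which is also the single-interval case of the piecewise-constant hazard.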

Eunice Campiran, National University of Mexico

How to use product partitions models to reflect the prior knowledge of the stratification in finite population sampling

In Bayesian statistics, the prior distribution of the parameters reflects the initial beliefs of the researcher. After data are observed, we have a learning process which is reflected in the posterior distribution of the parameters. In finite population sampling the researcher has prior knowledge not only of the parameters of interest, but also of the structure of the population. In classical statistics, the experience of the researcher can help to divide the population into relatively homogeneous subgroups in order to reduce the variability of the estimates of interest. We propose to use product partition models to model the prior knowledge of the structure of the population and the parameters of interest. Using the ideas presented in Quintana and Iglesias (2003), we use a loss function that allows us to choose the least expensive stratification. We explore in which cases this procedure is a learning process, so that we can use the posterior distribution of the partitions to stratify another survey related to the first one. Finally, we study a generalization of this model in which we do not assume a specific distribution for each stratum; instead, we use a random measure to make inference.

Fernando A. Quintana and Pilar L. Iglesias. Bayesian clustering and product partition models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(2):557-574, 2003.

Miguel de Carvalho, Universidade Nova de Lisboa, Portugal

Bayesian Methods and Extreme Value Statistics: A Marriage of Convenience

Extreme value statistics plays an important role in the modelling of extreme events. The domains of application of these methods range from hydrology to finance. In this work we concern ourselves with the use of the Bayesian paradigm as a means to suitably conduct an extreme value analysis.

Amelie Crepet, French Food Safety Agency, France

Nonparametric Bayesian model to cluster co-exposure to pesticides found in the French diet

This work introduces a specific application of the Bayesian nonparametric methodology in the food risk analysis framework. Namely, the joint distribution of the exposures to a large number of pesticides is assessed from the available consumption data and contamination analyses. We propose to model the exposures by a mixture of Dirichlet processes so as to determine clusters of pesticides jointly present in the diet at high doses. The goal of this analysis is to give directions for future toxicological experiments studying possible combined effects of multiple pesticide residues simultaneously present in the diet. Two approaches are compared: the exposures to each pesticide are either linked together in a hierarchical Dirichlet process mixture based on a univariate Gaussian kernel, or they are assumed to arise from a multivariate Gaussian kernel in a classical Dirichlet process mixture. In both cases, posterior distributions are computed through a Gibbs sampler based on stick-breaking priors. Finally, the clustering among individuals, also obtained as an auxiliary output of these analyses, is discussed from a risk management perspective.

Joint work with Jessica Tressou, HKUST-ISOM, Hong Kong & INRA-Met@Risk, France
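The stick-breaking priors mentioned in the abstract above can be illustrated with a short sketch of the truncated stick-breaking construction of Dirichlet process mixture weights. This is not the authors' sampler; the function name, the truncation level and the concentration value are assumptions chosen for illustration.

```python
import random

def stick_breaking_weights(alpha, n_components, seed=0):
    """Draw mixture weights from a truncated stick-breaking prior.

    Each fraction v_k ~ Beta(1, alpha) breaks off a piece of the
    remaining stick; the final component takes whatever is left, so
    the weights sum to one (truncation is an assumption of this sketch).
    """
    rng = random.Random(seed)
    weights = []
    remaining = 1.0
    for _ in range(n_components - 1):
        v = rng.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)
    return weights

# Hypothetical settings: concentration 2.0, truncation at 10 components.
w = stick_breaking_weights(alpha=2.0, n_components=10)
print(w, sum(w))
```

Smaller values of alpha concentrate the mass on the first few components, which corresponds to fewer clusters a priori.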

Vanda Inacio, University of Lisbon, Portugal

An Overview of Statistical Methods in Medical Tests

Diagnostic tests are procedures used to discriminate individuals with some disease from those without it, with some chance of error. Our aim in this work is to quantify this chance of error. To this end, we offer a survey of the statistical methods which can be employed in medical tests.
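One common way to quantify the chance of error of a diagnostic test is through sensitivity, specificity and, via Bayes' theorem, the positive predictive value. The sketch below is a generic illustration rather than a method taken from the talk; the function names and the counts are hypothetical.

```python
def sensitivity_specificity(tp, fn, fp, tn):
    """Return (sensitivity, specificity) from a 2x2 table of counts.

    tp/fn: diseased individuals testing positive/negative
    fp/tn: healthy individuals testing positive/negative
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test) by Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical counts: 100 diseased and 100 healthy individuals.
sens, spec = sensitivity_specificity(tp=90, fn=10, fp=20, tn=80)
print(sens, spec, positive_predictive_value(sens, spec, prevalence=0.1))
```

The example shows why prevalence matters: a test with good sensitivity and specificity can still have a modest positive predictive value when the disease is rare.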

Svetlana Ledyaeva, Helsinki School of Economics, Finland

Determinants of entry mode choice of foreign/multinational firms in Russia: empirical study

The aim of this paper is to empirically analyze the motivations for foreign enterprises to produce in Russian regions choosing full ownership (FO) (setting up a wholly foreign owned enterprise (WFOE) or engaging in a full acquisition) or sharing ownership with a local partner (firms, individuals, governmental (state) authorities), i.e. joint ventures (JV). As empirical tools we use binary and ordered logit models and a multinomial model. The main findings can be summarized as follows. We found some evidence that institutions influence such entry strategies; in particular, foreign entrants prefer higher control modes when the institutional environment is better in a particular Russian region. We also found that when the investment risk is higher in a particular Russian region, foreign investors tend to establish lower control modes. We also tentatively conclude that the more state-owned enterprises dominate in a particular Russian region, the fewer foreign firms take the form of WFOEs and higher-control JVs. Furthermore, we found that the more state-owned enterprises dominate in a particular Russian region, the more foreign investors prefer partnerships with governmental authorities to WFOEs and JVs with private partners. Our results also indicate that foreign entrants into the Russian market prefer higher control modes when the capital of a firm is high; that higher human capital and economic growth potential encourage foreign investors to establish higher control modes; and that JVs are more likely in the resource-based industries.

Joint work with Päivi Karhunen, Helsinki School of Economics

Magdalena Malina, Mathematical Institute, University of Wroclaw

Logic regression in application to detection of SNP-SNP interactions

We consider the biological problem of detecting genes that are responsible for quantitative features. The data that we consider are SNPs (single-nucleotide polymorphisms), i.e. DNA sequence variations occurring when a single nucleotide A, T, C, or G in the genome differs between members of a species or between paired chromosomes. In this context, an issue of great importance is the detection of interactions among many SNPs, which may cause differences in, e.g., disease status.

Logic regression, introduced in [2] by Ruczinski, Kooperberg and LeBlanc, is a regression method that attempts to construct predictors as Boolean combinations of binary variables. There are by now many versions of logic regression: the classical version with simulated annealing as the search algorithm, proposed by Schwender in [1], Monte Carlo logic regression by Kooperberg and Ruczinski [3], or the Bayesian version of logic regression introduced by Fritsch and Ickstadt in [4] and Fritsch in [5]. All these methods identify combinations of predictors associated with an outcome and can be applied to genetic data.
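To illustrate what a predictor built as a Boolean combination of binary variables looks like, the sketch below evaluates one fixed, hypothetical logic term on binary SNP indicators and compares the mean outcome between the two resulting groups. It does not perform the search over Boolean expressions that logic regression carries out; the term, the function names and the toy data are assumptions.

```python
def logic_predictor(snps):
    """Hypothetical Boolean term: (SNP1 AND NOT SNP2) OR SNP3."""
    s1, s2, s3 = snps
    return int((s1 and not s2) or s3)

def mean_outcome_by_term(genotypes, outcomes, term=logic_predictor):
    """Mean outcome among individuals where the term is 0 and where it is 1."""
    groups = {0: [], 1: []}
    for snps, y in zip(genotypes, outcomes):
        groups[term(snps)].append(y)
    # None marks an empty group, so the division is always safe.
    return {k: (sum(v) / len(v) if v else None) for k, v in groups.items()}

# Toy data: four individuals, three binary SNP indicators each.
genotypes = [(1, 0, 0), (1, 1, 0), (0, 0, 1), (0, 0, 0)]
outcomes = [2.0, 1.0, 3.0, 1.0]
print(mean_outcome_by_term(genotypes, outcomes))
```

A search algorithm such as simulated annealing would propose changes to the Boolean expression and keep those that improve a fit criterion computed from a comparison like this one.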

When there are many SNPs, however, as in genome-wide association studies, a multiple testing problem arises, and classical model selection criteria need an additional penalty for model dimension (Bogdan et al. [6]). In Bayesian versions of model selection criteria, the problem of choosing the penalty is replaced by the problem of properly selecting prior distributions (Scott and Berger [7]).

[1] Schwender H., Statistical analysis of genotype and gene expression data, PhD thesis, URL: hdl.handle.net/2003/23306
[2] Ruczinski I., Kooperberg C., LeBlanc M., Logic regression, J. Comput. Graphical Statist. 12 (3), 474-511 (2003), URL: biostat.jhsph.edu/iruczins/publications/publications.html
[3] Kooperberg C., Ruczinski I., Identifying Interacting SNPs Using Monte Carlo Logic Regression, Genetic Epidemiology 28, 157-170 (2005)
[4] Fritsch A., Ickstadt K., Comparing Logic Regression Based Methods for Identifying SNP Interactions, Springer Berlin / Heidelberg, Lecture Notes in Computer Science, Volume 4414, 90-103 (2007)
[5] Fritsch A., A Full Bayesian Version of Logic Regression for SNP Data, Diploma Thesis (2006)
[6] Bogdan M., Ghosh J.K., Zak-Szatkowska M., Selecting explanatory variables with the modified version of Bayesian Information Criterion, Quality and Reliability Engineering International 24, 627-641 (2008)
[7] Scott J.G., Berger J.O., Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem, Duke University Department of Statistical Science Technical Report (2008)

Joint work with Małgorzata Bogdan, Institute of Mathematics and Computer Science, Wroclaw University of Technology

Emanuele Olivetti, Fondazione Bruno Kessler, Trento, Italy

Automatic fiber bundle segmentation in the human brain: learning from examples

Recent neuroimaging techniques like Diffusion Spectrum Imaging (DSI) allow major improvements in the reconstruction of white matter fibers inside the human brain. The problem of segmenting fibers into fiber bundles, for anatomical interpretation, is traditionally approached as an unsupervised problem where clustering techniques estimate bundles using little prior information. In this work we deal with the same segmentation problem, but aided by side information in the form of segmented (example) bundles provided by expert neuroanatomists on a single brain. The problem then is to find the equivalent bundles in a target brain by exploiting the information from the examples. We propose a method based on fiber-pair classification and graph clustering which does not rely on the spatial location of the bundles. We tested the proposed method on the Pittsburgh Brain Competition 2009 (PBC2009) dataset.

Cristian Pattaro, Institute of Genetic Medicine, European Academy Bozen/Bolzano (EURAC), Italy

Genome-wide association analysis of Serum Creatinine in five European populations

Serum creatinine is used to estimate the renal filtration rate, which is the main indicator of renal condition. Decreased renal function can lead to renal insufficiency and is an important risk factor for cardiovascular disease mortality and morbidity. We present the results of a meta-analysis of genome-wide association scans from five European populations.

Joan Petur Petersen, Technical University of Denmark, Denmark (Faroe Islands)

Energy optimization for propulsion of ocean going vessels

This project aims at developing an advanced mathematical model for analyzing the dynamics of ocean-going vessels, especially with regard to modeling fuel efficiency. Traditionally, physical models, e.g. from hydrodynamics, the study of the motion of liquids, have been used to model the dynamics and fuel efficiency of ships. However, these models are severely limited when it comes to modeling the real-life conditions of ocean-going vessels. Therefore, the goal of this project is to develop powerful models, based on machine-learning approaches, that are able to adapt to the actual conditions of these ships.

Our system approach integrates visualization, feature extraction and prediction of energy usage from the measured features. The machine-learning approaches used include principal component analysis, clustering methods and artificial neural networks.
