ABS13 - 2013 Applied Bayesian Statistics School
BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA
Villa del Grumello, Como, Italy
June 17-21, 2013
Silvia Bozza, Università di Venezia, Italy
The evaluation of handwriting evidence in forensic science in
the form of multivariate data
The evaluation of handwritten characters selected from an anonymous
letter and from written material of a suspect is an open problem
in forensic science. The individualization of handwriting is largely
dependent on examiners who evaluate the characteristics in a qualitative
and subjective way. Precise individual characterization of the shape of
handwritten characters is possible through, for example, Fourier analysis:
each handwritten character can be described through a set of variables
such as the surface and several harmonics. The assessment of the value
of the evidence is performed through the derivation of a likelihood ratio
for multivariate data. One of the criticisms leveled against the use of
multivariate statistical techniques in forensic science is the lack of
background information from which to estimate parameters. To approach
the problem of multidimensionality, statistical independence among
variables is often assumed; however, this assumption is seldom warranted
(for example, the amplitudes and the phases of the harmonics retained
for the description of characters show in some cases a high degree of
correlation). Multilevel models that incorporate several levels of
variation (the within-writer and the between-writer variability, but also
the variability among different types of characters coming from the same
author) need to be introduced. Numerical procedures must be implemented
to handle the complexity arising from the non-constant variability within
sources and the large number of variables, and to compute the marginal
likelihood under competing propositions.
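Under simplifying assumptions (multivariate normality at both levels, known
within-writer covariance W and between-writer covariance B, constant
variability across writers), the two-level likelihood ratio has a closed
form. The Python sketch below shows that baseline computation only; variable
names are illustrative, and the non-constant variability discussed above
requires numerical (e.g. MCMC-based) procedures instead.

    import numpy as np
    from scipy.stats import multivariate_normal

    def likelihood_ratio(y1, y2, mu, B, W, n1, n2):
        """Two-level normal LR for Hp 'same writer' vs Hd 'different
        writers'. y1, y2: mean feature vectors (e.g. Fourier descriptors)
        of the questioned and the known material, averaged over n1 and n2
        characters; mu, B, W: background mean, between-writer and
        within-writer covariances (assumed known in this sketch)."""
        S1 = B + W / n1
        S2 = B + W / n2
        # Hd: the two samples come from independent writers drawn from
        # the background population.
        log_den = (multivariate_normal.logpdf(y1, mu, S1)
                   + multivariate_normal.logpdf(y2, mu, S2))
        # Hp: both samples share one writer mean theta ~ N(mu, B), so the
        # stacked vector is jointly normal with cross-covariance B.
        log_num = multivariate_normal.logpdf(
            np.concatenate([y1, y2]),
            np.concatenate([mu, mu]),
            np.block([[S1, B], [B, S2]]))
        return np.exp(log_num - log_den)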
Paola Cerchiello, Università di Pavia, Italy
Bayesian credit ratings
Our research aim is to improve ordinal variable selection in the context of
causal models for credit risk estimation. In this regard, we propose an
approach that provides a formal inferential tool to compare the explanatory
power of each covariate and, therefore, to select an effective model for
classification purposes. Our proposed model is Bayesian nonparametric and
thus keeps the amount of model specification to a minimum. We consider the
case
in which information from the covariates is at the ordinal level. A
noticeable instance of this regards the situation in which ordinal
variables result from rankings of companies that are to be evaluated
according to different macroeconomic and microeconomic aspects, leading to
ordinal covariates that correspond to various ratings, which entail
different magnitudes of the probability of default. For each given
covariate, we suggest partitioning the statistical units into as many
groups as the number of observed levels of the covariate. We then assume
individual defaults to
be homogeneous within each group and heterogeneous across groups. Our aim
is to compare and, therefore, select the partition structures resulting
from the consideration of different explanatory covariates. The metric we
choose for variable comparison is the calculation of the posterior
probability of each partition. The application of our proposal to a
European credit risk database shows that it performs well, leading to a
coherent and clear method for variable averaging of the estimated default
probabilities.
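The model in the abstract is Bayesian nonparametric; as a simplified
parametric stand-in, the sketch below scores the partition induced by each
ordinal covariate with a Beta-Binomial marginal likelihood (defaults
homogeneous within groups, independent across groups). Under a uniform
prior over the candidate partitions, posterior probabilities are
proportional to these marginal likelihoods. All counts are invented.

    import numpy as np
    from scipy.special import betaln

    def log_marginal_likelihood(defaults, totals, a=1.0, b=1.0):
        """Log marginal likelihood of a partition: within each group the
        defaults are i.i.d. Bernoulli with its own Beta(a, b) prior on
        the default probability, and groups are independent."""
        d = np.asarray(defaults, float)
        n = np.asarray(totals, float)
        return float(np.sum(betaln(a + d, b + n - d) - betaln(a, b)))

    # Partitions induced by two hypothetical ordinal covariates:
    # (defaults, companies) per observed covariate level.
    lml_rating = log_marginal_likelihood([2, 5, 14], [100, 80, 60])
    lml_size   = log_marginal_likelihood([9, 12], [150, 90])
    # With equal prior probability on the two partitions, the posterior
    # odds equal the Bayes factor.
    bayes_factor = np.exp(lml_rating - lml_size)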
Giorgio Corani, IDSIA, Lugano, Switzerland
An ensemble of Bayesian networks for multilabel classification
We present a novel approach for multilabel classification based on an
ensemble of Bayesian networks. The class variables are connected by a
tree; each model of the ensemble uses a different class as root of the
tree. We assume the features to be conditionally independent given the
classes, thus generalizing the naive Bayes assumption to the multi-class
case. This assumption allows us to optimally identify the correlations
between classes and features; such correlations are moreover shared
across all models of the ensemble. Inferences are drawn from the
ensemble via logarithmic opinion pooling. To minimize Hamming loss,
we compute the marginal probability of
the classes by running standard inference on each Bayesian network in
the ensemble, and then pooling the inferences. To instead minimize the
subset 0/1 loss, we pool the joint
distributions of each model and cast the problem as a MAP inference in
the corresponding graphical model. Experiments show that the approach is
competitive with state-of-the-art methods for multilabel classification.
Joint work with A. Antonucci, D. Mauá and S. Gabaglio.
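A minimal sketch of the logarithmic opinion pooling step that combines the
per-model class marginals (the Bayesian network inference producing the
marginals is omitted, and the uniform weights are an assumption):

    import numpy as np

    def log_opinion_pool(marginals, weights=None):
        """Pool per-model marginals for one class variable by a weighted
        geometric mean, computed in log space, then renormalize.
        marginals: (n_models, n_states) array, one distribution per row."""
        P = np.asarray(marginals, float)
        w = (np.full(len(P), 1.0 / len(P)) if weights is None
             else np.asarray(weights, float))
        log_pool = w @ np.log(P)      # weighted sum of log probabilities
        log_pool -= log_pool.max()    # stabilize before exponentiating
        pool = np.exp(log_pool)
        return pool / pool.sum()

    # Three ensemble members' marginals for one binary label; to minimize
    # Hamming loss, threshold each label's pooled marginal at 0.5.
    pooled = log_opinion_pool([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
    predict_label = pooled[1] >= 0.5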
Marzia Cremona, Politecnico di Milano, Italy
Clustering of ChIP-Seq data through peak shape
Ten years after the sequencing of the human genome, many techniques are
available to study genetic and epigenetic processes. We focus on a
particular "next generation sequencing" method called ChIP-Seq (Chromatin
ImmunoPrecipitation Sequencing), which makes it possible to investigate
protein-DNA interactions such as transcription factor binding and histone
modifications.
At present, the analysis of ChIP-Seq data is mainly restricted to the
detection of enriched areas (peaks) of the genome. The innovative approach
we want to develop takes into consideration the shape of such peaks: the
idea is that peak shape might reveal new insights into chromatin function.
We use clustering techniques to assess whether there exist groups of
peaks, by looking at their shapes. We look at peaks from different points of
view: we select some shape indices to integrate the data in the framework of
multivariate statistical analysis, but we also treat ChIP-Seq peaks as
functional data or as probability density functions; lastly, we study peaks as
spike trains.
The aim of this novel approach is to detect statistically significant
differences in peak shape and to associate the shape with a functional role
and a biological meaning.
Joint work with L. Sangalli, P. Secchi and S. Vantini.
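As an illustration of the shape-index route only (the indices and the
synthetic peaks below are invented stand-ins; the functional-data, density
and spike-train representations are not shown), one can summarize each peak
with a few numbers and cluster the resulting vectors:

    import numpy as np
    from sklearn.cluster import KMeans

    def shape_indices(coverage):
        """Summarize one ChIP-Seq peak (a vector of per-base read
        coverage) with a few illustrative shape indices."""
        x = np.asarray(coverage, float)
        pos = np.arange(len(x))
        w = x / x.sum()                        # treat the profile as a density
        mean = w @ pos
        sd = np.sqrt(w @ (pos - mean) ** 2)
        return [x.max(),                       # peak height
                len(x),                        # peak width in bases
                sd,                            # spread of the profile
                w @ ((pos - mean) / sd) ** 3]  # skewness of the profile

    rng = np.random.default_rng(0)
    peaks = [rng.gamma(2.0, 5.0, rng.integers(200, 400))
             for _ in range(100)]              # synthetic stand-in peaks
    X = np.array([shape_indices(p) for p in peaks])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)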
Georg Kropat, University of Lausanne, Switzerland
Machine learning based mapping of indoor radon concentrations in
Switzerland
Radon is a radioactive gas that occurs everywhere in nature as a decay
product of uranium. It can concentrate to substantial amounts in
residential buildings and is considered the second leading cause of
lung cancer after smoking. The entry of radon from the soil into houses
is a complex process that is determined by a variety of influencing
factors like geology, architectural characteristics, meteorological
variables and anthropogenic influences. Reliable maps to estimate local
radon potential are therefore necessary tools for decision-making in
public health strategies. The aim of this project is to predict and map
indoor radon concentrations in Switzerland with the least possible
uncertainty. Indoor radon measurements have been carried out regularly
all over Switzerland since the early 1980s. This has led to a database of
about 210,000 indoor radon measurements carried out in more than 120,000
houses. We use these data to develop methods for the prediction of indoor
radon concentrations at locations where no measurements have been
carried out. In a first attempt, we used kernel regression methods and
random forests in order to learn models from indoor radon measurements
by taking into account variables like building type, foundation type,
year of construction, outdoor temperature, topographic coordinates
and altitude. With random forests we can explain 38% of the variance of
the log-transformed radon concentrations and with kernel regression
about 33%. Up to now, the development of probabilistic indoor radon
maps remains a challenging question. In the future we plan to take
into account geological information as well as meteorological
variables like wind and pressure.
Joint work with J-P. Laedermann, C. Murith, M. Jaboyedoff, F. Bochud,
S. Baechler.
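A sketch of the random forest variant (the file name and column names below
are hypothetical; the actual predictor set is the one listed above):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Hypothetical measurement table, one row per measured dwelling.
    df = pd.read_csv("radon_measurements.csv")
    X = pd.get_dummies(df[["building_type", "foundation_type",
                           "construction_year", "outdoor_temperature",
                           "x_coord", "y_coord", "altitude"]])
    y = np.log(df["radon_concentration"])   # model the log concentration

    rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=5,
                               n_jobs=-1, random_state=0)
    # Cross-validated R^2; the abstract reports about 38% explained
    # variance for random forests on the log scale.
    print(cross_val_score(rf, X, y, cv=10, scoring="r2").mean())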
Daniela Pauger, Johannes Kepler University Linz, Austria
Analysing formalisation of management accounting by Bayesian variable
selection in a cumulative logit model
In many applications, especially in the social and business sciences, it is
of interest which variables out of a set of potential predictors are actually
associated with an ordinal response variable. As an example we present
an analysis where the response variable 'formalisation of management
accounting' in firms is measured on an ordinal scale with three categories
ranging from 'less or not recorded' to 'fully recorded'.
We use a Bayesian cumulative logit model and implement variable selection
by specifying spike and slab priors for the regression coefficients.
Posterior inference is feasible by MCMC methods and data augmentation,
extending the auxiliary mixture sampler of Frühwirth-Schnatter and
Frühwirth (2010) to ordinal data.
We apply the sampler to data from a survey on Austrian and German firms
and consider as potential predictors in our model annual sales, number
of employees, business sector, state, structure (family firm or
non-family firm) and generation.
Results indicate that only two of these potential regressors ('structure'
and 'number of employees') are associated with the degree of formalisation
of management accounting.
Joint work with C. Duller and H. Wagner.
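A sketch of the cumulative logit likelihood underlying the model (names are
illustrative; the spike-and-slab prior and the auxiliary mixture sampler are
only summarized in the closing comment):

    import numpy as np
    from scipy.special import expit   # logistic CDF

    def cumlogit_loglik(beta, cutpoints, X, y):
        """Cumulative logit model with K ordered categories:
        P(y <= k | x) = expit(c_k - x @ beta), with increasing cutpoints
        c_1 < ... < c_{K-1} and y coded as integers 0, ..., K-1."""
        y = np.asarray(y, int)
        eta = X @ beta
        c = np.concatenate([[-np.inf], cutpoints, [np.inf]])
        return np.sum(np.log(expit(c[y + 1] - eta) - expit(c[y] - eta)))

    # Bayesian variable selection places a spike-and-slab prior on each
    # coefficient, e.g. beta_j ~ w * N(0, slab_var) + (1 - w) * delta_0;
    # posterior sampling then uses MCMC with data augmentation rather
    # than direct maximization of this likelihood.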
Florian Stimberg, TU Berlin, Germany
MCMC inference for Poisson processes with Chinese restaurant process
driven rate -- with an application to neural spike trains
We introduce a model where the rate of an inhomogeneous Poisson process
is modified by a Chinese restaurant process. Applying an MCMC sampler to
this model allows us to perform posterior Bayesian inference about the
number of states in Poisson-like data. Our sampler is shown to produce
accurate results on synthetic data, and we apply it to V1 neuron spike
data to
find discrete firing rate states depending on the orientation of a stimulus.
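The exact construction of the model is not reproduced here; as a toy
generative reading, assume time is divided into fixed segments that are
seated by a Chinese restaurant process, with each table owning one firing
rate (all names and parameters below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def crp_rate_process(n_segments, seg_len, alpha=1.0, mean_rate=10.0):
        """Seat successive time segments at CRP tables; each table owns
        one firing rate, and spikes within a segment follow a homogeneous
        Poisson process at that table's rate."""
        counts, rates, spikes = [], [], []
        for s in range(n_segments):
            probs = np.array(counts + [alpha], float)
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(counts):          # open a new table / rate state
                counts.append(1)
                rates.append(rng.gamma(2.0, mean_rate / 2.0))
            else:
                counts[k] += 1
            n_spikes = rng.poisson(rates[k] * seg_len)
            spikes.append(s * seg_len
                          + np.sort(rng.uniform(0.0, seg_len, n_spikes)))
        return np.concatenate(spikes), rates

    spike_times, rates = crp_rate_process(n_segments=20, seg_len=1.0)
    # Posterior inference (the MCMC sampler of the talk) would invert
    # this: infer the seating and the rates from spike_times alone.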
Anja Zernig, KAI, Kompetenzzentrum für Automobil- und
Industrieelektronik, Villach, Austria
Device level maverick screening
In the semiconductor industry, the reliability demands on chips grow as
they are increasingly used in safety-relevant applications. To check their
reliability, a method is needed which distinguishes between good, bad and
risky chips, the latter also called Mavericks. Bad chips are those which
fail electrically or are out of specification, whereas all other (at the
moment good) chips are examined with the so-called Burn-In. This procedure
tests chips for several hours under intensified conditions such as high
temperature and high voltage to detect further chips which have to be
rejected rather than delivered to customers. Because of the undesirable
side effects of Burn-In, such as high costs caused by testing time,
special equipment (implying routine maintenance) and specially trained
staff, a screening method that reduces Burn-In becomes attractive.
In the current production process, state-of-the-art methods are already
used, e.g. "Part Average Testing" (PAT) [1] methods, which are based on
detecting outliers (corresponding to risky chips) in data distributions,
or the "Nearest Neighbor Yield" (NNY) [2], which focuses on distance-based
dependencies between good and risky chips. Unfortunately, these
statistical methods can detect only some of the chips at risk, and often
at the expense of rejecting good chips. A promising approach presented by
Turakhia et al. [3] is the use of "Independent Component Analysis"
(ICA) [4], which shows an improvement in detecting Mavericks while at the
same time rejecting fewer good chips. Further promising methods that will
be investigated are machine learning classifiers, e.g. "Support Vector
Machines" (SVM) [5], used to separate good and bad chips into clearly
distinct groups.
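As a simplified illustration of the PAT idea only (the actual AEC-Q001
procedure [1] sets part-average limits per test and per lot; the threshold
and data below are invented), outliers of a parametric test distribution can
be flagged as Maverick candidates:

    import numpy as np

    def part_average_test(values, k=6.0):
        """Flag chips whose parametric test value lies outside a robust
        band around the lot median (median +/- k * sigma, with sigma
        estimated from the interquartile range)."""
        x = np.asarray(values, float)
        center = np.median(x)
        sigma = (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349
        return np.abs(x - center) > k * sigma

    # Synthetic IDDQ-like measurements for one lot; flagged chips are
    # the Maverick candidates that would be screened or further tested.
    rng = np.random.default_rng(0)
    iddq = rng.lognormal(mean=0.0, sigma=0.3, size=10_000)
    mavericks = part_average_test(iddq)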
[1] AEC-Q001 Rev-D, "Guidelines for Part Average Testing",
http://www.aecouncil.com/AECDocuments.html
[2] W. C. Riordan, R. Miller, E. R. St. Pierre, "Reliability Improvement
and Burn-In Optimization through the Use of Die Level Predictive
Modeling", 43rd Annual IEEE International Reliability Physics Symposium,
San Jose, 2005
[3] R. Turakhia, B. Benware, R. Madge, "Defect Screening Using Independent
Component Analysis on IDDQ", Proceedings of the 23rd IEEE VLSI Test
Symposium (VTS'05)
[4] A. Hyvärinen, J. Karhunen, E. Oja, "Independent Component Analysis",
John Wiley & Sons, 2001
[5] N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector
Machines and Other Kernel-based Learning Methods", Cambridge University
Press, 2000