ABS13 - 2013 Applied Bayesian Statistics School
BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA
Villa del Grumello, Como, Italy
June 17-21, 2013
Silvia Bozza, Università di Venezia, Italy
The evaluation of handwriting evidence in forensic science in
the form of multivariate data
The evaluation of handwritten characters selected from an anonymous
letter and from written material of a suspect is an open problem
in forensic science. The individualization of handwriting is largely
dependent on examiners who evaluate the characteristics in a qualitative
and subjective way. Precise individual characterization of the shape of
handwritten characters is possible through, for example, Fourier analysis:
each handwritten character can be described through a set of variables
such as the surface and several harmonics. The assessment of the value
of the evidence is performed through the derivation of a likelihood ratio
for multivariate data. One of the criticisms leveled against the use of
multivariate statistical techniques in forensic science is the lack of
background information from which to estimate parameters. To approach
the problem of multidimensionality, statistical independence among
variables is often assumed; however, this assumption is seldom warranted
(for example, the amplitudes and the phases of the harmonics retained
for the description of characters show in some cases a high degree of
correlation). Multilevel models that incorporate several levels of
variation (the within-writer and the between-writer variability, but also
the variability among different types of characters coming from the same
author) need to be introduced. Numerical procedures must be implemented
to handle the complexity arising from the non-constant variability within
sources and the large number of variables, and to compute the marginal
likelihood under competing propositions.
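Under simplifying assumptions (multivariate normality at both levels, known
within-writer covariance W and between-writer covariance B, constant
variability across writers), the two-level likelihood ratio has a closed
form. The Python sketch below shows that baseline computation only; variable
names are illustrative, and the non-constant variability discussed above
requires numerical (e.g. MCMC-based) procedures instead.

    import numpy as np
    from scipy.stats import multivariate_normal

    def likelihood_ratio(y1, y2, mu, B, W, n1, n2):
        """Two-level normal LR for Hp 'same writer' vs Hd 'different
        writers'. y1, y2: mean feature vectors (e.g. Fourier descriptors)
        of the questioned and the known material, averaged over n1 and n2
        characters; mu, B, W: background mean, between-writer and
        within-writer covariances (assumed known in this sketch)."""
        S1 = B + W / n1
        S2 = B + W / n2
        # Hd: the two samples come from independent writers drawn from
        # the background population.
        log_den = (multivariate_normal.logpdf(y1, mu, S1)
                   + multivariate_normal.logpdf(y2, mu, S2))
        # Hp: both samples share one writer mean theta ~ N(mu, B), so the
        # stacked vector is jointly normal with cross-covariance B.
        log_num = multivariate_normal.logpdf(
            np.concatenate([y1, y2]),
            np.concatenate([mu, mu]),
            np.block([[S1, B], [B, S2]]))
        return np.exp(log_num - log_den)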
Paola Cerchiello, Università di Pavia, Italy
Bayesian credit ratings
Our research aim is to improve ordinal variable selection in the context of
causal models for credit risk estimation. In this regard, we propose an
approach that provides a formal inferential tool to compare the explanatory
power of each covariate and, therefore, to select an effective model for
classification purposes. Our proposed model is Bayesian nonparametric and
thus keeps the amount of model specification to a minimum. We consider the
case
in which information from the covariates is at the ordinal level. A
noticeable instance of this regards the situation in which ordinal
variables result from rankings of companies that are to be evaluated
according to different macroeconomic and microeconomic aspects, leading to
ordinal covariates that correspond to various ratings, which entail
different magnitudes of the probability of default. For each given
covariate, we suggest partitioning the statistical units into as many
groups as the number of observed levels of the covariate. We then assume
individual defaults to
be homogeneous within each group and heterogeneous across groups. Our aim
is to compare and, therefore, select the partition structures resulting
from the consideration of different explanatory covariates. The metric we
choose for variable comparison is the calculation of the posterior
probability of each partition. The application of our proposal to a
European credit risk database shows that it performs well, leading to a
coherent and clear method for variable averaging of the estimated default
probabilities.
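The model in the abstract is Bayesian nonparametric; as a simplified
parametric stand-in, the sketch below scores the partition induced by each
ordinal covariate with a Beta-Binomial marginal likelihood (defaults
homogeneous within groups, independent across groups). Under a uniform
prior over the candidate partitions, posterior probabilities are
proportional to these marginal likelihoods. All counts are invented.

    import numpy as np
    from scipy.special import betaln

    def log_marginal_likelihood(defaults, totals, a=1.0, b=1.0):
        """Log marginal likelihood of a partition: within each group the
        defaults are i.i.d. Bernoulli with its own Beta(a, b) prior on
        the default probability, and groups are independent."""
        d = np.asarray(defaults, float)
        n = np.asarray(totals, float)
        return float(np.sum(betaln(a + d, b + n - d) - betaln(a, b)))

    # Partitions induced by two hypothetical ordinal covariates:
    # (defaults, companies) per observed covariate level.
    lml_rating = log_marginal_likelihood([2, 5, 14], [100, 80, 60])
    lml_size   = log_marginal_likelihood([9, 12], [150, 90])
    # With equal prior probability on the two partitions, the posterior
    # odds equal the Bayes factor.
    bayes_factor = np.exp(lml_rating - lml_size)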
Giorgio Corani, IDSIA, Lugano, Switzerland
An ensemble of Bayesian networks for multilabel classification
We present a novel approach for multilabel classification based on an
ensemble of Bayesian networks. The class variables are connected by a
tree; each model of the ensemble uses a different class as root of the
tree. We assume the features to be conditionally independent given the
classes, thus generalizing the naive Bayes assumption to the multi-class
case. This assumption allows us to optimally identify the correlations
between classes and features; such correlations are moreover shared
across all models of the ensemble. Inferences are drawn from the
ensemble via logarithmic opinion pooling. To minimize Hamming loss,
we compute the marginal probability of
the classes by running standard inference on each Bayesian network in
the ensemble, and then pooling the inferences. To instead minimize the
subset 0/1 loss, we pool the joint
distributions of each model and cast the problem as a MAP inference in
the corresponding graphical model. Experiments show that the approach is
competitive with state-of-the-art methods for multilabel classification.
Joint work with A. Antonucci, D. Mauá and S. Gabaglio.
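A minimal sketch of the logarithmic opinion pooling step that combines the
per-model class marginals (the Bayesian network inference producing the
marginals is omitted, and the uniform weights are an assumption):

    import numpy as np

    def log_opinion_pool(marginals, weights=None):
        """Pool per-model marginals for one class variable by a weighted
        geometric mean, computed in log space, then renormalize.
        marginals: (n_models, n_states) array, one distribution per row."""
        P = np.asarray(marginals, float)
        w = (np.full(len(P), 1.0 / len(P)) if weights is None
             else np.asarray(weights, float))
        log_pool = w @ np.log(P)      # weighted sum of log probabilities
        log_pool -= log_pool.max()    # stabilize before exponentiating
        pool = np.exp(log_pool)
        return pool / pool.sum()

    # Three ensemble members' marginals for one binary label; to minimize
    # Hamming loss, threshold each label's pooled marginal at 0.5.
    pooled = log_opinion_pool([[0.9, 0.1], [0.7, 0.3], [0.8, 0.2]])
    predict_label = pooled[1] >= 0.5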
Marzia Cremona, Politecnico di Milano, Italy
Clustering of ChIP-Seq data through peak shape
Ten years after the sequencing of the human genome, many techniques are
available to study genetic and epigenetic processes. We focus on a
particular "next generation sequencing" method called ChIP-Seq (Chromatin
ImmunoPrecipitation Sequencing), which makes it possible to investigate
protein-DNA interactions such as transcription factor binding and histone
modifications.
At present, the analysis of ChIP-Seq data is mainly restricted to the
detection of enriched areas (peaks) of the genome. The innovative approach
we want to develop takes into consideration the shape of such peaks: the
idea is that peak shape might reveal new insights into chromatin function.
We use clustering techniques to assess whether there exist groups of
peaks, by looking at their shapes. We look at peaks from different points of
view: we select some shape indices to integrate the data in the framework of
multivariate statistical analysis, but we also treat ChIP-Seq peaks as
functional data or as probability density functions; lastly, we study peaks as
spike trains.
The aim of this novel approach is to detect statistically significant
differences in peak shape and to associate the shape with a functional role
and a biological meaning.
Joint work with L. Sangalli, P. Secchi and S. Vantini.
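As an illustration of the shape-index route only (the indices and the
synthetic peaks below are invented stand-ins; the functional-data, density
and spike-train representations are not shown), one can summarize each peak
with a few numbers and cluster the resulting vectors:

    import numpy as np
    from sklearn.cluster import KMeans

    def shape_indices(coverage):
        """Summarize one ChIP-Seq peak (a vector of per-base read
        coverage) with a few illustrative shape indices."""
        x = np.asarray(coverage, float)
        pos = np.arange(len(x))
        w = x / x.sum()                        # treat the profile as a density
        mean = w @ pos
        sd = np.sqrt(w @ (pos - mean) ** 2)
        return [x.max(),                       # peak height
                len(x),                        # peak width in bases
                sd,                            # spread of the profile
                w @ ((pos - mean) / sd) ** 3]  # skewness of the profile

    rng = np.random.default_rng(0)
    peaks = [rng.gamma(2.0, 5.0, rng.integers(200, 400))
             for _ in range(100)]              # synthetic stand-in peaks
    X = np.array([shape_indices(p) for p in peaks])
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)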
Georg Kropat, University of Lausanne, Switzerland
Machine learning based mapping of indoor radon concentrations in
Switzerland
Radon is a radioactive gas that occurs everywhere in nature as a decay
product of uranium. It can concentrate to substantial amounts in
residential buildings and is considered the second leading cause of
lung cancer after smoking. The entry of radon from the soil into houses
is a complex process that is determined by a variety of influencing
factors like geology, architectural characteristics, meteorological
variables and anthropogenic influences. Reliable maps to estimate local
radon potential are therefore necessary tools for decision-making in
public health strategies. The aim of this project is to predict and map
indoor radon concentrations in Switzerland with the least possible
uncertainty. Indoor radon measurements have been carried out regularly
all over Switzerland since the early 1980s. This has led to a database of
about 210,000 indoor radon measurements carried out in more than 120,000
houses. We use these data to develop methods for the prediction of indoor
radon concentrations at locations where no measurements have been
carried out. In a first attempt, we used kernel regression methods and
random forests in order to learn models from indoor radon measurements
by taking into account variables like building type, foundation type,
year of construction, outdoor temperature, topographic coordinates
and altitude. With random forests we can explain 38% of the variance of
the log-transformed radon concentrations and with kernel regression
about 33%. Up to now, the development of probabilistic indoor radon
maps remains a challenging question. In the future we plan to take
into account geological information as well as meteorological
variables like wind and pressure.
Joint work with J-P. Laedermann, C. Murith, M. Jaboyedoff, F. Bochud,
S. Baechler.
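A sketch of the random forest variant (the file name and column names below
are hypothetical; the actual predictor set is the one listed above):

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    # Hypothetical measurement table, one row per measured dwelling.
    df = pd.read_csv("radon_measurements.csv")
    X = pd.get_dummies(df[["building_type", "foundation_type",
                           "construction_year", "outdoor_temperature",
                           "x_coord", "y_coord", "altitude"]])
    y = np.log(df["radon_concentration"])   # model the log concentration

    rf = RandomForestRegressor(n_estimators=500, min_samples_leaf=5,
                               n_jobs=-1, random_state=0)
    # Cross-validated R^2; the abstract reports about 38% explained
    # variance for random forests on the log scale.
    print(cross_val_score(rf, X, y, cv=10, scoring="r2").mean())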
Daniela Pauger, Johannes Kepler University Linz, Austria
Analysing formalisation of management accounting by Bayesian variable
selection in a cumulative logit model
In many applications, especially in the social and business sciences, it is
of interest which variables out of a set of potential predictors are actually
associated with an ordinal response variable. As an example we present
an analysis where the response variable 'formalisation of management
accounting' in firms is measured on an ordinal scale with three categories
ranging from 'less or not recorded' to 'fully recorded'.
We use a Bayesian cumulative logit model and implement variable selection
by specifying spike and slab priors for the regression coefficients.
Posterior inference is feasible by MCMC methods and data augmentation,
extending the auxiliary mixture sampler of Frühwirth-Schnatter and
Frühwirth (2010) to ordinal data.
We apply the sampler to data from a survey on Austrian and German firms
and consider as potential predictors in our model annual sales, number
of employees, business sector, state, structure (family firm or
non-family firm) and generation.
Results indicate that only two of these potential regressors ('structure'
and 'number of employees') are associated with the degree of formalisation
of management accounting.
Joint work with C. Duller and H. Wagner.
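A sketch of the cumulative logit likelihood underlying the model (names are
illustrative; the spike-and-slab prior and the auxiliary mixture sampler are
only summarized in the closing comment):

    import numpy as np
    from scipy.special import expit   # logistic CDF

    def cumlogit_loglik(beta, cutpoints, X, y):
        """Cumulative logit model with K ordered categories:
        P(y <= k | x) = expit(c_k - x @ beta), with increasing cutpoints
        c_1 < ... < c_{K-1} and y coded as integers 0, ..., K-1."""
        y = np.asarray(y, int)
        eta = X @ beta
        c = np.concatenate([[-np.inf], cutpoints, [np.inf]])
        return np.sum(np.log(expit(c[y + 1] - eta) - expit(c[y] - eta)))

    # Bayesian variable selection places a spike-and-slab prior on each
    # coefficient, e.g. beta_j ~ w * N(0, slab_var) + (1 - w) * delta_0;
    # posterior sampling then uses MCMC with data augmentation rather
    # than direct maximization of this likelihood.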
Florian Stimberg, TU Berlin, Germany
MCMC inference for Poisson processes with Chinese restaurant process
driven rate -- with an application to neural spike trains
We introduce a model where the rate of an inhomogeneous Poisson process
is modified by a Chinese restaurant process. Applying an MCMC sampler to
this model allows us to perform posterior Bayesian inference about the
number of states in Poisson-like data. Our sampler is shown to produce
accurate results on synthetic data, and we apply it to V1 neuron spike
data to
find discrete firing rate states depending on the orientation of a stimulus.
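The exact construction of the model is not reproduced here; as a toy
generative reading, assume time is divided into fixed segments that are
seated by a Chinese restaurant process, with each table owning one firing
rate (all names and parameters below are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)

    def crp_rate_process(n_segments, seg_len, alpha=1.0, mean_rate=10.0):
        """Seat successive time segments at CRP tables; each table owns
        one firing rate, and spikes within a segment follow a homogeneous
        Poisson process at that table's rate."""
        counts, rates, spikes = [], [], []
        for s in range(n_segments):
            probs = np.array(counts + [alpha], float)
            k = rng.choice(len(probs), p=probs / probs.sum())
            if k == len(counts):          # open a new table / rate state
                counts.append(1)
                rates.append(rng.gamma(2.0, mean_rate / 2.0))
            else:
                counts[k] += 1
            n_spikes = rng.poisson(rates[k] * seg_len)
            spikes.append(s * seg_len
                          + np.sort(rng.uniform(0.0, seg_len, n_spikes)))
        return np.concatenate(spikes), rates

    spike_times, rates = crp_rate_process(n_segments=20, seg_len=1.0)
    # Posterior inference (the MCMC sampler of the talk) would invert
    # this: infer the seating and the rates from spike_times alone.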
Anja Zernig, KAI, Kompetenzzentrum für Automobil- und
Industrieelektronik, Villach, Austria
Device level maverick screening
In the semiconductor industry, the reliability demands on chips grow as
they are increasingly used in safety-relevant applications. To check their
reliability, a method is needed which distinguishes between good, bad and
risky chips, the latter also called Mavericks. Bad chips are those which
fail electrically or are out of specification, whereas all other (at the
moment good) chips are examined with the so-called Burn-In. This procedure
tests chips for several hours under intensified conditions such as high
temperature and high voltage to detect further chips which have to be
rejected rather than delivered to customers. Because of the undesirable
side effects of Burn-In, such as high costs caused by testing time,
special equipment (implying routine maintenance) and specially trained
staff, a screening method that reduces Burn-In becomes attractive.
In the current production process, state-of-the-art methods are already
used, e.g. "Part Average Testing" (PAT) [1] methods, which are based on
detecting outliers (corresponding to risky chips) in data distributions,
or the "Nearest Neighbor Yield" (NNY) [2], which focuses on distance-based
dependencies between good and risky chips. Unfortunately, these
statistical methods can detect only some of the chips at risk, and often
at the expense of rejecting good chips. A promising approach presented by
Turakhia et al. [3] is the use of "Independent Component Analysis"
(ICA) [4], which shows an improvement in detecting Mavericks while at the
same time rejecting fewer good chips. Further promising methods that will
be investigated are machine learning classifiers, e.g. "Support Vector
Machines" (SVM) [5], used to separate good and bad chips into clearly
distinct groups.
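As a simplified illustration of the PAT idea only (the actual AEC-Q001
procedure [1] sets part-average limits per test and per lot; the threshold
and data below are invented), outliers of a parametric test distribution can
be flagged as Maverick candidates:

    import numpy as np

    def part_average_test(values, k=6.0):
        """Flag chips whose parametric test value lies outside a robust
        band around the lot median (median +/- k * sigma, with sigma
        estimated from the interquartile range)."""
        x = np.asarray(values, float)
        center = np.median(x)
        sigma = (np.percentile(x, 75) - np.percentile(x, 25)) / 1.349
        return np.abs(x - center) > k * sigma

    # Synthetic IDDQ-like measurements for one lot; flagged chips are
    # the Maverick candidates that would be screened or further tested.
    rng = np.random.default_rng(0)
    iddq = rng.lognormal(mean=0.0, sigma=0.3, size=10_000)
    mavericks = part_average_test(iddq)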
[1] AEC-Q001 Rev-D, "Guidelines for Part Average Testing",
http://www.aecouncil.com/AECDocuments.html
[2] W. C. Riordan, R. Miller, E. R. St. Pierre, "Reliability Improvement
and Burn-In Optimization through the Use of Die Level Predictive
Modeling", 43rd Annual IEEE International Reliability Physics Symposium,
San Jose, 2005
[3] R. Turakhia, B. Benware, R. Madge, "Defect Screening Using Independent
Component Analysis on IDDQ", Proceedings of the 23rd IEEE VLSI Test
Symposium (VTS'05)
[4] A. Hyvärinen, J. Karhunen, E. Oja, "Independent Component Analysis",
John Wiley & Sons, 2001
[5] N. Cristianini and J. Shawe-Taylor, "An Introduction to Support Vector
Machines and Other Kernel-based Learning Methods", Cambridge University
Press, 2000