GENETICS

by Siva Sivaganesan
siva@math.uc.edu
We present an annotated bibliography of Bayesian applications in Genetics.



The list below is not exhaustive, and may not even be representative. Our goal is to give readers a sense of the current use of Bayesian analysis in Genetics.
$ \bullet$ J.S. SINSHEIMER, J.A. LAKE, AND R.J.A. LITTLE (1996). Bayesian Hypothesis Testing of Four-Taxon Topologies Using Molecular Sequence Data. Biometrics 52, pp193-210.
Contact: Roderick J. A. Little, University of Michigan, <rlittle@umich.edu>.

Bayesian analysis is used to test three hypotheses concerning the correct topology, given the available DNA sequences. Classical hypothesis testing is reported to be difficult because of the multiple alternative hypotheses involved, whereas Bayesian analysis is conceptually straightforward. Multinomial and multivariate normal sampling models are used with uniform priors on the parameters to obtain the posterior probabilities of the hypotheses. Using a large simulation study to assess the frequentist properties of the Bayesian tests, the authors conclude that the tests are well calibrated and have reasonable discriminating power over a wide range of realistic conditions.
$ \bullet$ K. SJÖLANDER, K. KARPLUS, M. BROWN, R. HUGHEY, A. KROGH, I. S. MIAN AND D. HAUSSLER (1996). Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Computer Applications in the Biosciences 12, pp327-345.
Contact: Kimmen Sjölander, University of California, Santa Cruz, <kimmen@cse.ucsc.edu>.

Bayesian methods are used to find remote homologs with low primary sequence identity. The authors use a Dirichlet mixture prior for amino acid distributions. The prior is estimated by maximum likelihood from counts of amino acids observed in clusters, taken from multiple alignment databases. The observed amino acid frequencies are then combined with this prior to obtain posterior estimates of the amino acid probabilities at each position in a profile.
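The posterior estimate described above can be sketched as follows. This is a minimal illustration of the standard posterior-mean calculation under a Dirichlet mixture prior, not the authors' code; the three-letter alphabet, the component parameters, and the names `log_beta` and `posterior_mean` are assumptions made for the example.

```python
from math import exp, lgamma, log


def log_beta(alpha):
    """Log of the multivariate Beta function B(alpha)."""
    return sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))


def posterior_mean(counts, weights, components):
    """Posterior mean of the letter probabilities given observed counts.

    weights[j] is the mixture coefficient of component j and
    components[j] is its Dirichlet parameter vector.
    """
    # Log evidence of the counts under each component. The multinomial
    # coefficient is the same for every component, so it is omitted.
    log_w = [
        log(q) + log_beta([a + n for a, n in zip(alpha, counts)]) - log_beta(alpha)
        for q, alpha in zip(weights, components)
    ]
    top = max(log_w)
    resp = [exp(lw - top) for lw in log_w]
    norm = sum(resp)
    resp = [r / norm for r in resp]  # posterior weight of each component

    total = sum(counts)
    # Mix the per-component Dirichlet posterior means.
    return [
        sum(r * (n_i + alpha[i]) / (total + sum(alpha))
            for r, alpha in zip(resp, components))
        for i, n_i in enumerate(counts)
    ]


# Hypothetical 3-letter alphabet with two prior components.
weights = [0.5, 0.5]
components = [[5.0, 1.0, 1.0], [1.0, 5.0, 1.0]]
counts = [3, 0, 1]
estimate = posterior_mean(counts, weights, components)
print(estimate)
```

A letter with zero observed count still receives a nonzero estimate, which is what makes the approach useful when only a few sequences are available at a profile position.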
$ \bullet$ I. HOESCHELE, P. UIMARI, F. E. GRIGNOLA, Q. ZHANG, K.M. GAGE (1997). Advances in Statistical Methods to Map Quantitative Trait Loci in Outbred Populations. Genetics 147, pp1445-1457.
Contact: Ina Hoeschele, Virginia Polytechnic Institute and State University, <ina@vt.edu>.

Six different statistical methods used for gene mapping are reviewed, including maximum likelihood and exact and approximate Bayesian methods. The authors comment that Bayesian analysis takes full account of the uncertainty associated with all unknowns in the problem, and allows fitting of different models of quantitative trait loci variation. References to many other related works using Bayesian methods are also given.
$ \bullet$ P. UIMARI AND I. HOESCHELE (1997). Mapping-Linked Quantitative Trait Loci Using Bayesian Analysis and Markov Chain Monte Carlo Algorithms. Genetics 146, pp735-743.
Contact: Ina Hoeschele, Virginia Polytechnic Institute and State University, <ina@vt.edu>.

A Bayesian analysis for mapping linked quantitative trait loci (QTL) using multiple linked genetic markers is given. This approach was motivated by evidence that least-squares analysis can detect a single "ghost QTL" when in fact two linked QTL are segregating. Here, the authors extend existing Bayesian linkage analysis to fit models that allow multiple linked markers; specifically, models with zero, one and two QTL linked to the markers. Model selection among these linkage models is studied using data simulated under four different designs with specified map positions and effects. Three different MCMC algorithms are used to fit a mixed effects model to each data set. These algorithms differ in how the models are fitted: one uses indicator variables in the model, one uses a variable selection approach, and one uses reversible jump MCMC. All three MCMC methods are found to do well, and detailed comparisons of them are given. The authors conclude that it is feasible to fit linked QTL simultaneously using Bayesian analysis, and that the analysis provides estimates of all genetic parameters and can fit alternative QTL models.
$ \bullet$ R. L. DUNBRACK, JR. AND F. E. COHEN (1997). Bayesian statistical analysis of protein side-chain rotamer preferences. Protein Science, 6, pp1661-1681.
Contact: Roland L. Dunbrack, Jr., Institute for Cancer Research, Philadelphia, <rl_dunbrack@fccc.edu>.

A Bayesian analysis is used to account for the varying amounts of information in the Protein Data Bank for $\chi_1$ backbone-dependent rotamer distributions, and to obtain more complete estimates of these distributions. In addition, Bayesian analysis is used to provide better estimates of the probabilities of occurrence of other rare rotamers. Multinomial models and Dirichlet priors are used. Parameters of the prior distribution are derived from previous data or from pooling some of the present data. Model checking is done using a Bayesian version of the p-value, calculated by simulating both parameters and data.
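The model check mentioned above can be illustrated with a generic posterior predictive p-value for a multinomial model with a Dirichlet prior. This is a sketch under stated assumptions rather than the procedure used in the paper: the chi-square discrepancy, the three rotamer classes, the counts, and the prior parameters are all hypothetical choices made for the example.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one probability vector from Dirichlet(alpha) via gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_multinomial(n, probs, rng):
    """Draw multinomial counts for n trials with the given probabilities."""
    counts = [0] * len(probs)
    for k in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[k] += 1
    return counts


def chi_square(counts, probs):
    """Chi-square discrepancy between counts and their expected values."""
    n = sum(counts)
    return sum((c - n * p) ** 2 / (n * p) for c, p in zip(counts, probs))


def posterior_predictive_p(counts, prior, draws=2000, seed=0):
    """Fraction of simulations whose replicated data fit worse than the observed."""
    rng = random.Random(seed)
    posterior = [a + c for a, c in zip(prior, counts)]  # Dirichlet posterior
    n = sum(counts)
    exceed = 0
    for _ in range(draws):
        theta = sample_dirichlet(posterior, rng)    # simulate the parameters
        rep = sample_multinomial(n, theta, rng)     # simulate replicated data
        if chi_square(rep, theta) >= chi_square(counts, theta):
            exceed += 1
    return exceed / draws


# Hypothetical counts for three rotamer classes in one backbone bin,
# with a hypothetical Dirichlet prior built from pooled data.
observed = [12, 3, 1]
prior = [6.0, 3.0, 1.0]
p_value = posterior_predictive_p(observed, prior)
print(p_value)
```

Values of the p-value close to 0 or 1 would suggest that the observed counts are unusual relative to data simulated from the fitted model.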
$ \bullet$ G. PARMIGIANI, D. A. BERRY AND O. AGUILAR (1998). Determining Carrier Probabilities for Breast Cancer Susceptibility Genes BRCA1 and BRCA2. American Journal of Human Genetics, 62, pp145-158.
Contact: Giovanni Parmigiani, Duke University, <gp@stat.duke.edu>.

Breast cancer susceptibility genes BRCA1 and BRCA2 have recently been identified on the human genome. Women who carry a mutation of one of these genes have a greatly increased chance of developing breast and ovarian cancer, and they usually develop the disease at a much younger age, compared with normal individuals. Women can be tested to see whether they are carriers. A woman who undergoes genetic counseling before testing can be told the probabilities that she is a carrier, given her family history. In this paper we develop a model for evaluating the probabilities that a woman is a carrier of a mutation of BRCA1 and BRCA2, on the basis of her family history of breast and ovarian cancer in first- and second-degree relatives. Of special importance are the relationships of the family members with cancer, the ages at onset of the diseases, and the ages of family members who do not have the diseases. This information can be elicited during genetic counseling and prior to genetic testing. The carrier probabilities are obtained from Bayes's rule, by use of family history as the evidence and by use of the mutation prevalences as the prior distribution. In addressing an individual's carrier probabilities, we incorporate uncertainty about some of the key inputs of the model, such as the age-specific incidence of diseases and the overall prevalence of mutations. There is some evidence that other, undiscovered genes may be important in explaining familial breast cancer. Users of the current version of the model should be aware of this limitation. The methodology that we describe can be extended to more than two genes, should data become available about other genes.
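The use of Bayes's rule described above, with mutation prevalence as the prior and family history as the evidence, can be sketched in a few lines. This is a deliberately simplified illustration and not the model of the paper: it collapses the family history into a single likelihood value per genotype, ignores joint carriers, and every number below is made up for the example.

```python
def carrier_posterior(prior, likelihood):
    """Posterior genotype probabilities from Bayes's rule.

    prior[g] is the prevalence of genotype g and likelihood[g] is the
    probability of the observed family history given genotype g.
    """
    joint = {g: prior[g] * likelihood[g] for g in prior}
    total = sum(joint.values())
    return {g: value / total for g, value in joint.items()}


# Hypothetical prevalences (prior) for the counselee's genotype.
prior = {"non-carrier": 0.9970, "BRCA1": 0.0012, "BRCA2": 0.0018}

# Hypothetical probabilities of the observed family history under each genotype.
likelihood = {"non-carrier": 0.001, "BRCA1": 0.050, "BRCA2": 0.030}

posterior = carrier_posterior(prior, likelihood)
for genotype, probability in posterior.items():
    print(f"{genotype}: {probability:.3f}")
```

In the full model the likelihood would be built from the relatives' relationships, ages at onset, and current ages, and uncertainty in the prevalence and incidence inputs would be carried through to the carrier probabilities.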
$ \bullet$ G. M. PETERSEN, G. PARMIGIANI AND D. THOMAS (1998). Missense Mutations in Disease Genes: A Bayesian Approach to Evaluate Causality. American Journal of Human Genetics, 62, pp1516-1524.
Contact: Gloria M. Petersen, Johns Hopkins University, <gpeterse@jhsph.edu>.

The problem of interpreting missense mutations of disease-causing genes is an increasingly important one. Because these point mutations result in alteration of only a single amino acid of the protein product, it is often unclear whether this change alone is sufficient to cause disease. We propose a Bayesian approach that utilizes genetic information on affected relatives in families ascertained through known missense-mutation carriers. This method is useful in evaluating known disease genes for common disease phenotypes, such as breast cancer or colorectal cancer. The posterior probability that a missense mutation is disease causing is conditioned on the relationship of the relatives to the proband, the population frequency of the mutation, and the phenocopy rate of the disease. The approach is demonstrated in two cancer data sets: BRCA1 R841W and APC I1307K. In both examples, this method helps establish that these mutations are likely to be disease causing, with Bayes factors in favor of causality of 5.09 and 66.97, respectively, and posterior probabilities of .836 and .985. We also develop a simple approximation for rare alleles and consider the case of unknown penetrance and allele frequency.
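The relation between the Bayes factors and posterior probabilities quoted above is posterior odds = Bayes factor x prior odds. The short check below shows that the reported pairs are arithmetically consistent with a prior probability of causality of 0.5; that prior is inferred here from the numbers and is an assumption, and the function name is hypothetical.

```python
def posterior_probability(bayes_factor, prior_probability):
    """Convert a Bayes factor and a prior probability into a posterior probability."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Bayes factors quoted in the abstract; the prior of 0.5 is an assumption.
brca1_r841w = posterior_probability(5.09, 0.5)
apc_i1307k = posterior_probability(66.97, 0.5)

print(round(brca1_r841w, 3))  # 0.836
print(round(apc_i1307k, 3))   # 0.985
```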
$ \bullet$ J. S. LIU AND C. E. LAWRENCE (1999). Bayesian Inference on biopolymer models. Bioinformatics, 15, pp38-52.
Contact: Jun S. Liu, Stanford University, <jliu@stat.stanford.edu>.

This article introduces Bayesian methods and their use to researchers in bioinformatics. It gives a tutorial introduction to Bayesian methods using an example involving data from tossing two different coins. This example is then extended to illustrate applications in bioinformatics through two specific examples: sequence segmentation and global sequence alignment. The authors state that the need for setting parameter values has been the subject of much discussion, and that a distinct advantage of the Bayesian method is the added modeling flexibility in the specification of parameters. The authors comment that the recursions developed over the rich history of computation in bioinformatics, such as dynamic programming recursions, can be modified to complete the high-dimensional computation required by Bayesian methods, and that through the use of these recursions the full power of the Bayesian methodology can be brought to bear on a wide range of problems previously addressed by dynamic programming.
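A toy version of the two-coin idea, written as a one-change-point segmentation problem, may help convey the flavor of the tutorial. This is not the authors' example: it assumes a single change point, independent uniform Beta(1, 1) priors on the two coin biases, and a uniform prior on the change position, and the function names are hypothetical. Segmentation with many change points would need the dynamic programming recursions the annotation refers to.

```python
from math import exp, lgamma


def log_marginal(heads, tails):
    """Log marginal likelihood of a toss sequence under a Beta(1, 1) prior.

    Integrating the bias out gives B(heads + 1, tails + 1).
    """
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)


def change_point_posterior(tosses):
    """Posterior over the change position k = 1, ..., n - 1.

    The first k tosses come from coin 1 and the rest from coin 2.
    Element i of the result is the posterior probability that k = i + 1.
    """
    n = len(tosses)
    total_heads = sum(tosses)
    log_post = []
    heads_left = 0
    for k in range(1, n):
        heads_left += tosses[k - 1]
        tails_left = k - heads_left
        heads_right = total_heads - heads_left
        tails_right = (n - k) - heads_right
        log_post.append(
            log_marginal(heads_left, tails_left)
            + log_marginal(heads_right, tails_right)
        )
    top = max(log_post)
    weights = [exp(lp - top) for lp in log_post]
    norm = sum(weights)
    return [w / norm for w in weights]


# Made-up data: 1 = heads, 0 = tails, with an obvious switch after toss 6.
tosses = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
posterior = change_point_posterior(tosses)
best_k = posterior.index(max(posterior)) + 1
print(best_k)  # 6
```

Because the coin biases are integrated out rather than fixed in advance, no parameter values have to be set by hand, which is the modeling flexibility the authors emphasize.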