BYU Statistics Seminar Series 2010–2011


Date Speaker Title/Abstract

September

16

Jared Lunceford

Evaluating Surrogate Variables for Improving Microarray Multiple Testing Inference

Merck
High-throughput technologies (e.g., microarray) are producing vast amounts of data and, consequently, investigations where many hypotheses are tested simultaneously. It is also widely recognized that dependencies are present in many high-throughput studies, and it is desirable to account for these dependencies when conducting multiple testing procedures. The use of surrogate variables (Leek and Storey, 2007, 2008) has been proposed as a means to capture, for a given observed set of data, the sources driving the dependency structure among high-dimensional sets of features and to remove the effects of those sources and their potential negative impact on simultaneous inference. We illustrate the potential effects of latent variables on the dependence among tests and the resulting impact on multiple inference, and we briefly review the method of surrogate variable analysis proposed by Leek and Storey (2008). We assess that method via simulations intended to mimic the complexity of feature dependence observed in real-world microarray data. The method is also assessed via application to a recent Merck microarray data set.
Reading
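A minimal Python sketch of the kind of latent-structure problem the talk addresses: it simulates a latent factor that induces dependence across null features and then estimates a surrogate variable from the leading singular vector of the residuals, in the spirit of Leek and Storey's approach. The simulation settings and variable names are illustrative assumptions, not the speaker's code.

import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 1000                          # samples, features
group = np.repeat([0.0, 1.0], n // 2)    # primary variable (two-group design)
latent = rng.normal(size=n)              # unobserved factor driving dependence

# Null features: no group effect, but many features load on the latent factor
loadings = rng.normal(size=m) * (rng.uniform(size=m) < 0.5)
Y = np.outer(latent, loadings) + rng.normal(size=(n, m))

# Estimate a surrogate variable: remove the primary-variable fit, then take
# the leading left singular vector of the residual matrix
X = np.column_stack([np.ones(n), group])
beta_hat = np.linalg.lstsq(X, Y, rcond=None)[0]
resid = Y - X @ beta_hat
surrogate = np.linalg.svd(resid, full_matrices=False)[0][:, 0]

# The surrogate should track the unobserved factor closely (up to sign)
print(np.corrcoef(surrogate, latent)[0, 1])

Including such a surrogate as a covariate in the feature-level regressions is what removes the dependence-driving signal before multiple testing.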

23

C. Arden Pope III

Statistical Modeling in Exploring the Human Health Effects of Air Pollution

Mary Lou Fulton Professor of Economics, Brigham Young University
The science of particulate matter air pollution and health has a long, rich, and complex history. Episode studies, daily time-series studies, panel-based acute exposure studies, and case-crossover studies provide a large and rich evidence base indicating that short-term air pollution exposure exacerbates existing cardio-pulmonary disease and increases the risk of becoming symptomatic, requiring medical attention, or even dying. Natural experiment studies, cohort- and panel-based studies, and case-control studies indicate even larger effects of long-term air pollution exposure. A major aspect of this research has been the use and development of statistical modeling. This presentation will discuss the role of various statistical modeling approaches used to analyze the different types of health endpoint data generated from these study designs. Specific, high-profile, local and national examples will be provided.
Reading 1
Reading 2
Stephen Colbert on Arden Pope: The Colbert Report, "Cheating Death - Lung Health" (www.colbertnation.com)

30

Valen Johnson

Consistent Bayesian model selection in p < n settings

Department of Biostatistics, MD Anderson Cancer Center
Suppose Y denotes an n x 1 random vector, X an n x p matrix of real numbers, and b a p x 1 regression vector. My talk focuses on the selection of the non-zero components of b when it is assumed that Y ~ N(Xb, sigma^2 I). Model selection is based on the calculation of posterior model probabilities using non-local prior densities on the regression coefficients for each possible model. The non-local prior densities used for model definition are obtained as products of normal moment priors and are called pMOM prior densities. Under mild conditions on the matrix X, I demonstrate that the use of these priors guarantees that the posterior probability of the true model converges to 1 as the sample size increases, and that the resulting model selection procedure exhibits an "oracle" property in p < n settings.
Reading 1
Reading 2
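For readers unfamiliar with non-local priors, a first-order moment (MOM) prior on a single coefficient b_i can be written as

pi(b_i | tau, sigma^2) = (b_i^2 / (tau sigma^2)) * (2 pi tau sigma^2)^(-1/2) * exp(-b_i^2 / (2 tau sigma^2)),

i.e., a normal density multiplied by b_i^2 and renormalized, so the prior vanishes at b_i = 0, in contrast to local priors such as the normal or Cauchy. The pMOM densities in the talk are, as the abstract notes, products of such terms over the coefficients in a model; this sketch of the r = 1 case is included only as orientation, not as a substitute for the speaker's definitions.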

October

7

Joshua Tebbs

Informative retesting

Department of Statistics, University of South Carolina
In situations where individuals are screened for an infectious disease or other binary characteristic and where resources for testing are limited, group testing can offer substantial benefits. Group testing, where subjects are tested initially in groups (pools), has been successfully applied to problems in blood bank screening, public health, drug discovery, genetics, and many other areas. In these applications, often the goal is to identify each individual as positive or negative using initial group tests and subsequent retests of individuals within positive groups. Many group testing identification procedures have been proposed; however, the vast majority of them fail to incorporate heterogeneity among the individuals being screened. In this talk, we present a new approach to identify positive individuals when covariate information is available on each individual. This covariate information is used to structure how retesting is implemented within positive groups; therefore, we call this new approach "informative retesting." We derive closed-form expressions and implementation algorithms for the probability mass functions of the number of tests needed to decode positive groups. These informative retesting procedures are illustrated through a number of examples and are applied to chlamydia and gonorrhea testing in Nebraska for the Infertility Prevention Project. Overall, our work shows compelling evidence that informative retesting can dramatically decrease the number of tests while providing accuracy similar to established noninformative retesting procedures.
Reading 1
Reading 2
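As a simplified, hypothetical illustration (in Python) of why covariate information helps, the sketch below compares the expected number of tests under Dorfman-style two-stage retesting when individuals are pooled at random versus pooled by similar estimated risk. The prevalence model and pool size are assumptions, and this is not the authors' informative retesting procedure, which uses the covariate information within positive pools directly.

import numpy as np

rng = np.random.default_rng(1)

def expected_tests(p, pool_size):
    """Expected tests under Dorfman two-stage retesting: a pool of size k uses
    one test, plus k individual retests if at least one member is positive."""
    total = 0.0
    for start in range(0, len(p), pool_size):
        pool = p[start:start + pool_size]
        prob_pool_positive = 1.0 - np.prod(1.0 - pool)
        total += 1.0 + len(pool) * prob_pool_positive
    return total

p = rng.beta(1, 20, size=1000)    # heterogeneous individual risks (assumed)
k = 5
print("random pooling:     ", expected_tests(rng.permutation(p), k))
print("risk-sorted pooling:", expected_tests(np.sort(p), k))

Grouping low-risk individuals together lets more pools test negative at the first stage, which is one simple way covariates can reduce the total number of tests.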

14

Leanna House

Second-order Exchangeable Functions with Application to Multi-deterministic Simulators

Department of Statistics, Virginia Tech
Analysts often use deterministic computer models to predict the behavior of complex physical systems when observational data are limited. However, inferences based partially or entirely on simulated data require adequate assessments of model uncertainty that can be hard to quantify. The deterministic nature of computer models limits the information we can extract from simulations to separate model signal from model error. In this paper we present a new approach to assess the uncertainty of computer models which we refer to as multi-deterministic. Evaluations from a multi-deterministic computer model can be considered a collection of deterministic simulators which share the same input and output space, do not present obvious theoretical or computational advantages over one another, and generate disparate predictions. To quantify the uncertainty of predictions from multi-deterministic models, we use the construct of a latent model about which we learn from observed evaluations. We assume that outcomes from multi-deterministic models are sequences of second-order exchangeable functions (SOEF) and use Bayes linear methods to assess the latent model a posteriori. We demonstrate our methods using multi-deterministic results from a galaxy formation model called Galform.
Reading 1
Reading 2

21

Athanasios Kottas

Nonparametric Bayesian Modeling for Developmental Toxicology Data

Department of Applied Mathematics and Statistics, UC-Santa Cruz
We present a Bayesian nonparametric mixture modeling framework for replicated count responses in dose-response settings. We explore the methodology for modeling and risk assessment in developmental toxicity studies, where the primary objective is to determine the relationship between the level of exposure to a toxic chemical and the probability of a physiological or biochemical response. Data from these experiments typically involve features that cannot be captured by standard parametric approaches. To provide flexibility in the functional form of both the response distribution and the probability of a positive response, the proposed mixture model is built from a dependent Dirichlet process prior, with the dependence of the mixing distributions governed by the dose level. The methodology is tested with a simulation study, which involves comparison with semiparametric Bayesian approaches to highlight the practical utility of the dependent Dirichlet process nonparametric mixture model. Further illustration will be provided through the analysis of data from two developmental toxicity studies. Joint work with Ph.D. student (and BYU alumna) Kassandra Fronczyk.
Reading

28

Nick Polson

Sequential learning, predictive regressions, and optimal portfolio returns

Graduate School of Business, University of Chicago
We analyze sequential learning for predictive regression models. We evaluate the economic benefits to an investor who adopts multiple models to exploit predictability when forming portfolios. To do this, we develop a new particle-based method for sequential learning about parameters, state variables, and multiple models. Our sequential perspective allows us to quantify how an investor's views about predictability and multiple models vary over time, naturally mimicking the learning problem encountered in practice. Our models account for drifting coefficients and stochastic volatility, and our on-line method quantifies the time variation of these estimates together with model probabilities. Our Bayesian optimal portfolios outperform rolling and cumulative regressions that ignore parameter uncertainty.
Reading
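To give a flavor of particle-based sequential learning, the Python sketch below runs a bootstrap particle filter for a predictive regression with a slowly drifting coefficient. The model, priors, and tuning constants are illustrative assumptions; the authors' method additionally handles stochastic volatility and sequential model probabilities.

import numpy as np

rng = np.random.default_rng(2)

# Simulate y_t = b_t * x_t + e_t with a random-walk coefficient b_t
T, sigma_y, sigma_b = 200, 1.0, 0.05
x = rng.normal(size=T)
b_true = np.cumsum(rng.normal(scale=sigma_b, size=T))
y = b_true * x + rng.normal(scale=sigma_y, size=T)

# Bootstrap particle filter: propagate, weight by the likelihood, resample
N = 2000
particles = rng.normal(scale=1.0, size=N)    # prior draws for b_0
filtered_mean = np.empty(T)
for t in range(T):
    particles = particles + rng.normal(scale=sigma_b, size=N)
    w = np.exp(-0.5 * ((y[t] - particles * x[t]) / sigma_y) ** 2)
    w /= w.sum()
    filtered_mean[t] = np.sum(w * particles)
    particles = rng.choice(particles, size=N, p=w)

print("RMSE of filtered coefficient:", np.sqrt(np.mean((filtered_mean - b_true) ** 2)))

Running several such filters in parallel, one per candidate model, and comparing their sequential predictive likelihoods is the basic idea behind on-line model probabilities.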

November

4

Ruth Kerry

Geostatistical Methods in Geography

Geographers are concerned with analyzing "spatial data" and thus use many spatial as well as aspatial statistics. The development of spatial statistics can be traced to the early part of the 20th century, to the analysis of agricultural field trial data by statisticians. Geostatistics is a component of spatial statistics, but its evolution has been led principally by applied scientists and mathematicians rather than classically trained statisticians. Any methodology for analyzing spatial data needs to acknowledge Tobler's (1970) "First Law of Geography", that "everything is related to everything else, but near things are more related than distant things"; or, in other words, the fundamental property of spatial dependence or spatial autocorrelation. The primary concern of geostatistical analysis is to investigate the spatial autocorrelation in data. Haining (2003) noted that quantifying spatial dependence matters whether the purpose of an analysis is to interpolate, to fit a regression model, or to test hypotheses. Along with spatial dependence, geographical data acquire properties as a consequence of the chosen representation of geographic space. The areal units into which a study region is partitioned for reporting attribute values often vary in size and shape (e.g., census tracts). If the population denominator for rates (e.g., crime rates) varies, the standard errors of such statistics are not constant across a map, and data for areas with small populations suffer from the "small number problem". Methods used in geography must not only be able to deal with spatial dependence, but also with these types of properties acquired as a consequence of the chosen representation of geographic space (Haining et al., 2010). This presentation will show some of the ways aspatial and spatial statistics have been used in my recent research, with particular emphasis placed on geostatistical methods. Issues addressed will include working with nominal or ordinal data, designing sampling schemes, and working with areal and rate data. Case studies will include the analysis of soil and crop data for precision farming, crime patterns in the Baltic States, and large herbivore distribution patterns in Kruger National Park, South Africa.
References
Haining, R.P. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge University Press.
Haining, R., Kerry, R. & Oliver, M. A. 2010. Geography, Spatial Data Analysis and Geostatistics: An Overview. Geographical Analysis 42, 7-31.
Tobler, W. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46, 234-240.
Reading
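For readers new to geostatistics, the Python sketch below computes the classical (method-of-moments) empirical semivariogram that underlies much of the spatial-dependence analysis described above. The simulated field and lag bins are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(3)

# Illustrative spatial data: coordinates and an attribute with a spatial trend plus noise
coords = rng.uniform(0, 100, size=(300, 2))
z = 0.05 * coords[:, 0] + rng.normal(scale=1.0, size=300)

# Classical estimator: gamma(h) = (1 / (2 N(h))) * sum of (z_i - z_j)^2 over pairs at lag h
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
sq_diff = (z[:, None] - z[None, :]) ** 2
iu = np.triu_indices(len(z), k=1)          # consider each pair of sites once
bins = np.arange(0, 60, 10)
for lo, hi in zip(bins[:-1], bins[1:]):
    in_bin = (d[iu] >= lo) & (d[iu] < hi)
    gamma = sq_diff[iu][in_bin].mean() / 2.0
    print(f"lag [{lo}, {hi}): gamma = {gamma:.3f} from {in_bin.sum()} pairs")

The semivariogram typically rises with lag distance and levels off, and the model fitted to it is what drives the kriging weights used in interpolation.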

11

Alun Thomas

On the Statistical Analysis of Dirty Graphs

Department of Biomedical Informatics, University of Utah
Starting from a general framework, we consider estimating a graph given indirect, partial, or possibly erroneous information about its structure. More specifically, I'll describe the estimation of the conditional independence graphs of graphical models using Markov chain Monte Carlo methods. This will be applied to some simulated examples and to real genetic data. Finally, this approach is used to model allelic association, or linkage disequilibrium, between marker loci in modern, dense genotyping assays.
Reading

18

No Seminar: Graduate Student Elective Course Taste Testing

December

2

Matt Heaton

Kernel Averaged Predictors for Spatio-Temporal Processes

Department of Statistical Science, Duke University
For spatio-temporal processes, predictors from multiple locations can affect the response at a different location. For example, predictors such as precipitation, temperature, and pollution emissions are often used to explain ground-level ozone production. Due to weather and other factors, however, the relationship between these predictors and ozone is not confined to a single spatial location or time period, as is often assumed. Similarly, the effect of pollution on mortality is spatially and temporally lagged because mortality does not, typically, occur at the time of exposure. Here, kernels are proposed as a tool to properly weight predictor surfaces in spatio-temporal regression models. The kernels are assumed to be parametric, with parameters that are estimable from the data. Distributional results are provided for the case of a univariate predictor and response in the Gaussian process setting. Additionally, relations to previously proposed models and computational details are discussed.
Reading
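As a rough sketch of the kernel-averaging idea (not the paper's model), the Python code below forms the predictor at each response site as a Gaussian-kernel weighted average of a predictor surface observed at many other locations. In the actual approach the kernel is parametric and its parameters are estimated from the data; here the bandwidth is simply assumed.

import numpy as np

rng = np.random.default_rng(4)

sites = rng.uniform(0, 10, size=(50, 2))    # locations of the response
grid = rng.uniform(0, 10, size=(400, 2))    # locations where the predictor surface is observed
x_surface = np.sin(grid[:, 0]) + rng.normal(scale=0.2, size=400)

def kernel_averaged_predictor(sites, grid, x_surface, bandwidth):
    """Gaussian-kernel weighted average of the predictor surface at each response site."""
    d2 = ((sites[:, None, :] - grid[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-0.5 * d2 / bandwidth ** 2)
    w /= w.sum(axis=1, keepdims=True)
    return w @ x_surface

x_tilde = kernel_averaged_predictor(sites, grid, x_surface, bandwidth=1.0)
print(x_tilde[:5])    # smoothed predictor values that would enter the regression for the response

A wide bandwidth borrows predictor information from far away while a narrow one reduces to the usual single-location predictor, which is why treating the bandwidth as estimable lets the data decide how lagged the relationship is.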

January

20

William Christensen

Filtered Kriging for Spatial Data with Heterogeneous Measurement Error Variances

Department of Statistics, Brigham Young University
When predicting values for the measurement-error-free component of an observed spatial process, it is generally assumed that the process has a common measurement error variance. However, it is often the case that each measurement in a spatial data set has a known, site-specific measurement error variance, rendering the observed process nonstationary. We present a simple approach for estimating the semivariogram of the unobservable measurement-error-free process using a bias adjustment of the classical semivariogram formula. We then develop a new kriging predictor which filters the measurement errors. For scenarios where each site's measurement error variance is a function of the process of interest, we recommend an approach which also uses a variance-stabilizing transformation. The properties of the heterogeneous variance measurement-error-filtered kriging (HFK) predictor and variance-stabilized HFK predictor, and the improvement of these approaches over standard measurement-error-filtered kriging, are demonstrated using simulation. The approach is illustrated with climate model output from the Hudson Strait area in northern Canada. In the illustration, locations with high or low measurement error variances are appropriately down- or up-weighted in the prediction of the underlying process, yielding a realistically smooth picture of the phenomenon of interest.
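One way to read the bias adjustment described above (a sketch of the general idea, not necessarily the exact estimator in the paper): if Z(s_i) = Y(s_i) + e_i with known, site-specific error variance sigma_i^2 and the errors are independent of Y and of each other, then

E[ (Z(s_i) - Z(s_j))^2 / 2 ] = gamma_Y(h) + (sigma_i^2 + sigma_j^2) / 2,

so subtracting the average of a pair's known error variances from each squared half-difference before binning yields an approximately unbiased empirical semivariogram for the error-free process Y.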

27

Sudipto Banerjee

Computationally Feasible Hierarchical Modeling Strategies for Large Spatial Datasets

Division of Biostatistics, University of Minnesota
Large point-referenced datasets are common in the environmental and natural sciences. The computational burden involved in fitting large spatial datasets undermines the estimation of Bayesian models. We explore several improvements to low-rank and other scalable spatial process models, including reduction of biases and process-based modeling of the "centers" or "knots" that determine optimal subspaces for data projection. I also consider alternate strategies for handling massive spatial datasets. One approach concerns developing process-based super-population models and Bayesian finite-population sampling techniques for spatial data. I also explore model-based simultaneous dimension reduction in space, time, and the number of variables. Flexible and rich hierarchical modeling applications in forestry are demonstrated.
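One common low-rank construction consistent with the abstract's mention of knots (a sketch of the general predictive-process idea, not the specific improvements discussed in the talk): given a parent Gaussian process w(s) with covariance function C(., .) and knots s*_1, ..., s*_r, replace w(s) with

w~(s) = c(s)' C*^{-1} w*,

where w* = (w(s*_1), ..., w(s*_r))', C* is the r x r covariance matrix among the knots, and c(s) is the r x 1 vector of covariances between s and the knots. Model fitting then involves r x r rather than n x n matrix operations, and the bias-reduction and knot-selection improvements mentioned above address known shortcomings of this basic construction.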

February

3

Robert delMas

A Different Flavor of Introductory Statistics: Teaching Students to Really Cook

Department of Educational Psychology, University of Minnesota
The NSF-funded CATALST project is developing a radically different undergraduate introductory statistics course based on ideas presented by George Cobb and Danny Kaplan (Cobb, 2007a, b; Kaplan, 2007). Standard parametric tests of significance, such as the two-sample t-test and chi-square analyses, are not taught in the course. Instead, a carefully designed sequence of activities based on research in mathematics and statistics education helps students develop their understanding of randomness, chance models, randomization tests, and bootstrap coverage intervals. For each unit in this course, students first engage in a Model-Eliciting Activity (MEA; Lesh & Doerr, 2003; Zawojewski, Bowman, & Diefes-Dux, 2008) that primes them for learning the statistical content of the unit (Schwartz, 2004). This is followed by activities where the students explore how to build and use chance models with modeling software such as TinkerPlots and then transition to carrying out randomization tests and estimating bootstrap coverage intervals. The talk will present activities from different parts of the course to illustrate this approach, as well as results from preliminary data gathered in fall 2010.
References
Cobb, G. (2007a). The introductory statistics course: A Ptolemaic curriculum? Technology Innovations in Statistics Education, 1(1) [Online]. http://repositories.cdlib.org/uclastat/cts/tise/vol1/iss1/art1
Cobb, G. (2007b). One possible frame for thinking about experiential learning. International Statistical Review.
Kaplan, D. (2007). Computing and introductory statistics. Technology Innovations in Statistics Education, 1(1) [Online]. http://repositories.cdlib.org/uclastat/cts/tise/vol1/iss1/art5
Lesh, R., & Doerr, H. M. (2003). Foundations of a models and modeling perspective on mathematics teaching, learning, and problem solving. In R. Lesh & H. M. Doerr (Eds.), Beyond constructivism: Models and modeling perspectives on mathematics teaching, learning, and problem solving (pp. 3-33). Mahwah, NJ: Lawrence Erlbaum.
Schwartz, D. L. (2004). Inventing to prepare for future learning: The hidden efficiency of encouraging original student production in statistics instruction. Cognition and Instruction, 22(2), 129-184.
Zawojewski, J., Bowman, K., & Diefes-Dux, H. A. (in press). Mathematical modeling in engineering education: Designing experiences for all students. Rotterdam, the Netherlands: Sense Publishers.
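For readers unfamiliar with the inferential tools mentioned above, the Python sketch below carries out a two-group randomization test and a percentile bootstrap interval on made-up data; the course itself has students build these ideas with TinkerPlots rather than code, so this is only a translation of the logic.

import numpy as np

rng = np.random.default_rng(5)

# Illustrative two-group data (assumed values, not from the course)
a = np.array([12.1, 9.8, 11.4, 13.0, 10.2, 12.7])
b = np.array([9.5, 10.1, 8.8, 10.9, 9.2, 9.9])
observed = a.mean() - b.mean()

# Randomization test: reshuffle group labels under the null of no group difference
pooled = np.concatenate([a, b])
reps = 10_000
diffs = np.empty(reps)
for i in range(reps):
    perm = rng.permutation(pooled)
    diffs[i] = perm[:len(a)].mean() - perm[len(a):].mean()
p_value = np.mean(np.abs(diffs) >= abs(observed))

# Percentile bootstrap interval for the difference in means
boot = np.array([rng.choice(a, len(a)).mean() - rng.choice(b, len(b)).mean()
                 for _ in range(reps)])
ci = np.percentile(boot, [2.5, 97.5])

print(f"observed diff = {observed:.2f}, randomization p-value = {p_value:.3f}, "
      f"95% bootstrap interval = ({ci[0]:.2f}, {ci[1]:.2f})")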

10

Katherine Bennett Ensor

Estimating the Term Structure With a Semiparametric Bayesian Hierarchical Model: An Application to Corporate Bonds

Department of Statistics, Center for Computational Finance and Economic Systems, Rice University
The term structure of interest rates is used to price defaultable bonds and credit derivatives, as well as to infer the quality of bonds for risk management purposes. We introduce methodology that jointly estimates term structures by means of a Bayesian hierarchical model with a prior probability model based on Dirichlet process mixtures. The modeling methodology borrows strength across term structures for purposes of estimation. The main advantage of our framework is its ability to produce reliable estimators at the company level even when there are only a few bonds per company. After describing the proposed model, we discuss an empirical application in which the term structure of 197 individual companies is estimated. The sample of 197 consists of 143 companies with only one or two bonds. In-sample and out-of-sample tests are used to quantify the improvement in accuracy that results from approximating the term structure of corporate bonds with estimators by company rather than by credit rating, the latter being a popular choice in the financial literature.

17

Gary Parker

Department of Statistics and Actuarial Science, Simon Fraser University

24

Ryan Elmore

Data, Informatics, and Systems Group, Computational Science, National Renewable Energy Lab

March

3

Amanda Cox

Graphics Editor, The New York Times

10

17

Michael Rendall

RAND Population Research Center

24

Vanja Dukic

Department of Applied Mathematics, University of Colorado at Boulder

31

Crystal Linkletter

Department of Community Health, Brown University

April

7

Brad Efron

Department of Statistics, Stanford University



Department of Statistics
Brigham Young University
Provo UT 84602
blades [at] stat.byu.edu
office 218 TMCB
801.422.5076