| Date | Speaker | Title/Abstract |
September | 16 | Jared Lunceford | Evaluating Surrogate Variables for Improving Microarray Multiple Testing Inference |
| | Merck | High-throughput technologies (e.g., microarray) are producing vast amounts of data and, consequently, investigations in which many hypotheses are tested simultaneously. It is also widely recognized that dependencies are present in many high-throughput studies, and it is desirable to account for these dependencies when conducting multiple testing procedures. The use of surrogate variables (Leek and Storey; 2007, 2008) has been proposed as a means to capture, for a given observed set of data, the sources driving the dependency structure among high-dimensional sets of features and to remove the effects of those sources and their potential negative impact on simultaneous inference. We illustrate the potential effects of latent variables on testing dependence and the resulting impact on multiple inference, and we briefly review the method of surrogate variable analysis proposed by Leek and Storey (2008). We assess that method via simulations intended to mimic the complexity of feature dependence observed in real-world microarray data. The method is also assessed via application to a recent Merck microarray data set. |
| | | Reading |
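To make the dependence-removal idea above concrete, here is a minimal Python sketch in the spirit of surrogate variable analysis, not Leek and Storey's actual algorithm (which iteratively selects and validates singular vectors): simulate features driven by an unmodeled latent factor, then recover surrogate variables from the SVD of the residuals of the primary model. All names and simulation settings are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 20                            # features (e.g., genes) x arrays
group = np.repeat([0.0, 1.0], n // 2)      # primary variable of interest
latent = rng.normal(size=n)                # unmodeled source of dependence

# Simulate expression: 50 features respond to group; many load on the latent factor
beta = np.concatenate([rng.normal(2.0, 1.0, 50), np.zeros(m - 50)])
gamma = rng.normal(0.0, 1.0, m)
Y = np.outer(beta, group) + np.outer(gamma, latent) + rng.normal(size=(m, n))

# Residuals from the feature-by-feature fit of the primary model
X = np.column_stack([np.ones(n), group])
H = X @ np.linalg.inv(X.T @ X) @ X.T       # hat matrix for the primary design
R = Y - Y @ H                              # rows are features, so project on the right

# Leading right singular vectors of the residual matrix act as surrogate variables
_, _, Vt = np.linalg.svd(R, full_matrices=False)
X_adjusted = np.column_stack([X, Vt[0]])   # augment the design before testing
```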
| 23 | C. Arden Pope III | Statistical Modeling in Exploring the Human Health Effects of Air Pollution |
| | Mary Lou Fulton Professor of Economics, Brigham Young University | The science of particulate matter air pollution and health has a long, rich, and complex history. Episode studies, daily time-series studies, panel-based acute exposure studies, and case-crossover studies provide a large and rich evidence base indicating that short-term air pollution exposure exacerbates existing cardio-pulmonary disease and increases the risk of becoming symptomatic, requiring medical attention, or even dying. Natural experiment studies, cohort- and panel-based studies, and case-control studies indicate even larger effects of long-term air pollution exposure. A major aspect of this research has been the use and development of statistical modeling. This presentation will discuss the role of various statistical modeling approaches used to analyze the different types of health endpoint data generated from these study designs. Specific, high-profile, local and national examples will be provided. |
| | | Reading 1 |
| | | Reading 2 |
| | | Stephen Colbert on Arden Pope |
| 30 | Valen Johnson | Consistent Bayesian model selection in p < n settings |
| | Department of Biostatistics, MD Anderson Cancer Center | Suppose Y denotes an n×1 random vector, X an n×p matrix of real numbers, and b a p×1 regression vector. My talk focuses on the selection of the non-zero components of b when it is assumed that Y ~ N(Xb, σ²I). Model selection is based on the calculation of posterior model probabilities using non-local prior densities on the regression coefficients for each possible model. The non-local prior densities used for model definition are obtained as products of normal moment priors and are called pMOM prior densities. Under mild conditions on the matrix X, I demonstrate that the use of these priors guarantees that the posterior probability of the true model converges to 1 as the sample size increases, and that the resulting model selection procedure exhibits an "oracle" property in p < n settings. |
| | | Reading 1 |
| | | Reading 2 |
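For readers unfamiliar with non-local priors, the LaTeX fragment below sketches the first-order moment (MOM) density behind the pMOM construction; the scale τ and order r = 1 are illustrative choices in the spirit of Johnson and Rossell's product-of-moment-priors definition, not necessarily the exact form used in the talk.

```latex
% First-order MOM density for one coefficient: it vanishes at beta = 0,
% unlike a local (e.g., normal) prior, which is what drives consistency.
\pi(\beta \mid \tau, \sigma^2)
  = \frac{\beta^2}{\tau\sigma^2}\,
    \frac{1}{\sqrt{2\pi\tau\sigma^2}}
    \exp\!\left(-\frac{\beta^2}{2\tau\sigma^2}\right),
\qquad
% A pMOM prior on a k-dimensional model multiplies such densities:
\pi(\boldsymbol{\beta} \mid \tau, \sigma^2)
  = \prod_{i=1}^{k} \pi(\beta_i \mid \tau, \sigma^2).
```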
October | 7 | Joshua Tebbs | Informative retesting |
| | Department of Statistics, University of South Carolina | In situations where individuals are screened for an infectious disease or other binary characteristic and where resources for testing are limited, group testing can offer substantial benefits. Group testing, where subjects are tested in groups (pools) initially, has been successfully applied to problems in blood bank screening, public health, drug discovery, genetics, and many other areas. In these applications, the goal is often to identify each individual as positive or negative using initial group tests and subsequent retests of individuals within positive groups. Many group testing identification procedures have been proposed; however, the vast majority of them fail to incorporate heterogeneity among the individuals being screened. In this talk, we present a new approach to identify positive individuals when covariate information is available on each individual. This covariate information is used to structure how retesting is implemented within positive groups; therefore, we call this new approach "informative retesting." We derive closed-form expressions and implementation algorithms for the probability mass functions of the number of tests needed to decode positive groups. These informative retesting procedures are illustrated through a number of examples and are applied to chlamydia and gonorrhea testing in Nebraska for the Infertility Prevention Project. Overall, our work shows compelling evidence that informative retesting can dramatically decrease the number of tests while providing accuracy similar to established noninformative retesting procedures. |
| | | Reading 1 |
| | | Reading 2 |
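The talk's algorithms are not reproduced here, but a short Python sketch of the baseline they improve on helps fix ideas: under two-stage (Dorfman) retesting, a pool is tested once and every member is retested individually if the pool tests positive, and simply ordering individuals by covariate-based risk before pooling already reduces the expected number of tests. The risk values below are made up for illustration.

```python
import numpy as np

def expected_tests_dorfman(p):
    """Expected tests to decode one pool under two-stage (Dorfman) retesting,
    given individual positivity probabilities p: one pool test, plus
    len(p) retests whenever at least one member is positive."""
    p = np.asarray(p, dtype=float)
    prob_pool_positive = 1.0 - np.prod(1.0 - p)
    return 1.0 + len(p) * prob_pool_positive

# Hypothetical covariate-based risk estimates for six individuals
risk = np.array([0.01, 0.02, 0.03, 0.25, 0.28, 0.30])

# Mixing low- and high-risk individuals across pools...
mixed = expected_tests_dorfman(risk[[0, 3, 1, 4]]) + expected_tests_dorfman(risk[[2, 5]])
# ...versus grouping like with like, as covariate-informed pooling would:
informed = expected_tests_dorfman(risk[:4]) + expected_tests_dorfman(risk[4:])
print(mixed, informed)   # the informed arrangement needs fewer tests on average
```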
| 14 | Leanna House | Second-order Exchangeable Functions with Application to Multi-deterministic Simulators |
| | Department of Statistics, Virginia Tech | Analysts often use deterministic computer models to predict the behavior of complex physical systems when observational data are limited. However, inferences based partially or entirely on simulated data require adequate assessments of model uncertainty that can be hard to quantify. The deterministic nature of computer models limits the information we can extract from simulations to separate model signal from model error. In this paper we present a new approach to assess the uncertainty of computer models which we refer to as multi-deterministic. Evaluations from a multi-deterministic computer model can be considered a collection of deterministic simulators which share the same input and output space, do not present obvious theoretical or computational advantages over one another, and generate disparate predictions. To quantify the uncertainty of predictions from multi-deterministic models, we use the construct of a latent model about which we learn from observed evaluations. We assume that outcomes from multi-deterministic models are sequences of second-order exchangeable functions (SOEF) and use Bayes linear methods to assess the latent model a posteriori. We demonstrate our methods using multi-deterministic results from a galaxy formation model called Galform. |
| | | Reading 1 |
| | | Reading 2 |
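The Bayes linear adjustment the abstract invokes has a simple closed form; below is a minimal Python sketch with toy numbers (two noisy evaluations of one latent quantity), not the Galform application.

```python
import numpy as np

def bayes_linear_adjust(EX, VX, ED, VD, CXD, D):
    """Bayes linear update: adjusted expectation and variance of X given D,
       E_D(X)   = E(X) + Cov(X,D) Var(D)^{-1} (D - E(D))
       Var_D(X) = Var(X) - Cov(X,D) Var(D)^{-1} Cov(D,X)."""
    K = CXD @ np.linalg.inv(VD)
    return EX + K @ (D - ED), VX - K @ CXD.T

# Toy second-order specification: D_i = X + e_i, Var(X) = 1, Var(e_i) = 0.2,
# so Var(D_i) = 1.2, Cov(D_1, D_2) = Cov(X, D_i) = 1.
EX, VX = np.array([0.0]), np.array([[1.0]])
ED, VD = np.zeros(2), np.array([[1.2, 1.0], [1.0, 1.2]])
CXD = np.array([[1.0, 1.0]])
E_adj, V_adj = bayes_linear_adjust(EX, VX, ED, VD, CXD, D=np.array([0.8, 1.1]))
print(E_adj, V_adj)   # expectation pulled toward the evaluations; variance reduced
```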
| 21 | Athanasios Kottas | Nonparametric Bayesian Modeling for Developmental Toxicology Data |
| | Department of Applied Mathematics and Statistics, UC-Santa Cruz | We present a Bayesian nonparametric mixture modeling framework for replicated count responses in dose-response settings. We explore the methodology for modeling and risk assessment in developmental toxicity studies, where the primary objective is to determine the relationship between the level of exposure to a toxic chemical and the probability of a physiological or biochemical response. Data from these experiments typically involve features that cannot be captured by standard parametric approaches. To provide flexibility in the functional form of both the response distribution and the probability of positive response, the proposed mixture model is built from a dependent Dirichlet process prior, with the dependence of the mixing distributions governed by the dose level. The methodology is tested with a simulation study, which involves comparison with semiparametric Bayesian approaches to highlight the practical utility of the dependent Dirichlet process nonparametric mixture model. Further illustration will be provided through the analysis of data from two developmental toxicity studies. Joint work with Ph.D. student (and BYU alumna) Kassandra Fronczyk. |
| | | Reading |
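As background for readers new to the machinery, here is a short Python sketch of a truncated Dirichlet process draw via stick-breaking; the dose-dependent mixing distributions in the talk build on this construction, which is all that is sketched here (the base measure and truncation level are illustrative).

```python
import numpy as np

def dp_stick_breaking(alpha, K, base_draw, rng):
    """Truncated Dirichlet process draw: K atoms from the base measure with
    stick-breaking weights w_k = v_k * prod_{j<k} (1 - v_j), v_k ~ Beta(1, alpha)."""
    v = rng.beta(1.0, alpha, size=K)
    w = v * np.cumprod(np.concatenate([[1.0], 1.0 - v[:-1]]))
    return w / w.sum(), base_draw(K)      # renormalize the truncated weights

rng = np.random.default_rng(1)
weights, atoms = dp_stick_breaking(alpha=2.0, K=50,
                                   base_draw=lambda k: rng.gamma(2.0, 2.0, k),
                                   rng=rng)
# Mixing count kernels (e.g., Poisson(atom)) with these weights gives a DP
# mixture; letting the weights/atoms vary with dose yields a dependent DP.
```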
| 28 | Nick Polson | Sequential learning, predictive regressions, and optimal portfolio returns |
| | Graduate School of Business, University of Chicago | We analyze sequential learning for predictive regression models. We evaluate the economic benefits to an investor who adopts multiple models to exploit predictability when forming portfolios. To do this, we develop a new particle-based method for sequential learning about parameters, state variables, and multiple models. Our sequential perspective allows us to quantify how an investor's views about predictability and multiple models vary over time, naturally mimicking the learning problem encountered in practice. Our models account for drifting coefficients and stochastic volatility, and our on-line method quantifies the time-variation of these estimates together with model probabilities. Our Bayesian optimal portfolios outperform rolling and cumulative regressions that ignore parameter uncertainty. |
| | | Reading |
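Polson's particle method also learns fixed parameters and model probabilities sequentially; the Python sketch below shows only the bootstrap particle filter that such methods extend, on an assumed AR(1) state with parameters taken as known for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 100, 2000
phi, s_state, s_obs = 0.95, 0.3, 0.5     # assumed known for this sketch

# Simulate a latent AR(1) state and noisy observations of it
x = np.zeros(T)
for t in range(1, T):
    x[t] = phi * x[t - 1] + s_state * rng.normal()
y = x + s_obs * rng.normal(size=T)

# Bootstrap particle filter: propagate, weight by the likelihood, resample
particles = rng.normal(0.0, 1.0, N)
filtered_means = []
for t in range(T):
    particles = phi * particles + s_state * rng.normal(size=N)  # propagate
    w = np.exp(-0.5 * ((y[t] - particles) / s_obs) ** 2)        # likelihood
    w /= w.sum()
    particles = rng.choice(particles, size=N, p=w)              # resample
    filtered_means.append(particles.mean())                     # E[x_t | y_1:t]
```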
November | 4 | Ruth Kerry | Geostatistical Methods in Geography |
| | | Geographers are concerned with analyzing "spatial data" and thus use many spatial as well as aspatial statistics. The development of spatial statistics can be traced to the early part of the 20th century, to the analysis of agricultural field trial data by statisticians. Geostatistics is a component of spatial statistics, but its evolution has been led principally by applied scientists and mathematicians rather than classically trained statisticians. Any methodology for analyzing spatial data needs to acknowledge Tobler's (1970) "First Law of Geography," that "everything is related to everything else, but near things are more related than distant things"; in other words, the fundamental property of spatial dependence or spatial autocorrelation. The primary concern of geostatistical analysis is to investigate the spatial autocorrelation in data. Haining (2003) noted that quantifying spatial dependence matters, whether the purpose of an analysis is to interpolate, to fit a regression model, or to test hypotheses. Along with spatial dependence, geographical data acquire properties as a consequence of the chosen representation of geographic space. The areal units into which a study region is partitioned for reporting attribute values often vary in size and shape (e.g., census tracts). If the population denominator for rates (e.g., crime rates) varies, the standard errors of such statistics are not constant across a map, and data for areas with small populations suffer from the "small number problem." Methods used in geography must be able to deal not only with spatial dependence but also with these types of properties acquired as a consequence of the chosen representation of geographic space (Haining et al., 2010). This presentation will show some of the ways aspatial and spatial statistics have been used in my recent research, with particular emphasis placed on geostatistical methods. Issues addressed will include working with nominal or ordinal data, designing sampling schemes, and working with areal and rate data. Case studies will include the analysis of soil and crop data for precision farming, crime patterns in the Baltic States, and large herbivore distribution patterns in Kruger National Park, South Africa. |
| | | References |
| | | Haining, R. P. 2003. Spatial Data Analysis: Theory and Practice. Cambridge: Cambridge University Press. |
| | | Haining, R., Kerry, R., & Oliver, M. A. 2010. Geography, Spatial Data Analysis and Geostatistics: An Overview. Geographical Analysis 42, 7-31. |
| | | Tobler, W. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46, 234-240. |
| | | Reading |
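Since the semivariogram is the workhorse quantity of the geostatistics discussed above, a minimal Python sketch of Matheron's classical estimator on simulated data may help; the coordinates, field, and lag bins are made up for illustration.

```python
import numpy as np

def empirical_semivariogram(coords, z, bin_edges):
    """Matheron's classical estimator: for pairs whose separation falls in a
    lag bin, gamma_hat(h) = sum of (z_i - z_j)^2 over the bin / (2 N(h))."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.triu_indices(len(z), k=1)          # each pair counted once
    sq_diff = (z[i] - z[j]) ** 2
    bin_of_pair = np.digitize(d[i, j], bin_edges)
    return np.array([0.5 * sq_diff[bin_of_pair == b].mean()
                     for b in range(1, len(bin_edges))])

rng = np.random.default_rng(3)
coords = rng.uniform(0, 10, size=(200, 2))
z = np.sin(coords[:, 0]) + 0.3 * rng.normal(size=200)   # spatially structured field
gamma = empirical_semivariogram(coords, z, np.linspace(0.0, 5.0, 11))
# gamma rises with lag and levels off: near things are more related than distant ones.
```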
| 11 | Alun Thomas | On the Statistical Analysis of Dirty Graphs |
| | Department of Biomedical Informatics, University of Utah | Starting from a general framework, we consider estimating a graph given indirect, partial, or possibly erroneous information about its structure. More specifically, I'll describe the estimation of the conditional independence graphs of graphical models using Markov chain Monte Carlo methods. This will be applied to some simulated examples and to real genetic data. Finally, this approach is used to model allelic association, or linkage disequilibrium, between marker loci in modern, dense genotyping assays. |
| | | Reading |
| 18 | | No Seminar: Graduate Student Elective Course Taste Testing |
December | 2 | Matt Heaton | Kernel Averaged Predictors for Spatio-Temporal Processes |
| | Department of Statistical Science, Duke University | For spatio-temporal processes, predictors from multiple locations affect the response at a separate location. For example, predictors such as precipitation, temperature, and pollution emissions are often used to explain ground-level ozone production. Due to weather and other factors, however, the relationship between these predictors and ozone is not confined to a single spatial location or time period, as is often assumed. Similarly, the effect of pollution on mortality is spatially and temporally lagged because mortality does not, typically, occur at the time of exposure. Here, kernels are proposed as a tool to properly weight predictor surfaces in spatio-temporal regression models. The kernels are assumed to be parametric, with parameters that are estimable from the data. Distributional results are provided for the case of a univariate predictor and response in the Gaussian process setting. Additionally, relations to previously proposed models and computational details are discussed. |
| | | Reading |
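A minimal Python sketch of the core idea, with a Gaussian kernel standing in for whatever parametric family the talk uses: the local covariate is replaced by a kernel-weighted average of the predictor surface, with the bandwidth treated as estimable rather than fixed. All names and values below are hypothetical.

```python
import numpy as np

def kernel_averaged_predictor(sites, x_surface, s0, bandwidth):
    """Kernel-averaged covariate at location s0:
    x_tilde(s0) = sum_j K(s0 - s_j; bandwidth) x(s_j), Gaussian kernel here."""
    sq_dist = np.sum((sites - s0) ** 2, axis=1)
    w = np.exp(-0.5 * sq_dist / bandwidth ** 2)
    return (w / w.sum()) @ x_surface

rng = np.random.default_rng(4)
sites = rng.uniform(0, 10, size=(300, 2))       # locations of the predictor surface
emissions = rng.gamma(2.0, 1.0, size=300)       # hypothetical predictor values
x_tilde = kernel_averaged_predictor(sites, emissions, s0=np.array([5.0, 5.0]),
                                    bandwidth=1.5)
# The regression y(s0) = b0 + b1 * x_tilde(s0) + error then estimates the
# bandwidth jointly with the coefficients from the data.
```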
January | 20 | William Christensen | Filtered Kriging for Spatial Data with Heterogeneous Measurement Error Variances |
| | Department of Statistics, Brigham Young University | When predicting values for the measurement-error-free component of an observed spatial process, it is generally assumed that the process has a common measurement error variance. However, it is often the case that each measurement in a spatial data set has a known, site-specific measurement error variance, rendering the observed process nonstationary. We present a simple approach for estimating the semivariogram of the unobservable measurement-error-free process using a bias-adjustment of the classical semivariogram formula. We then develop a new kriging predictor which filters the measurement errors. For scenarios where each site's measurement error variance is a function of the process of interest, we recommend an approach which also uses a variance-stabilizing transformation. The properties of the heterogeneous variance measurement-error-filtered kriging (HFK) predictor and variance-stabilized HFK predictor, and the improvement of these approaches over standard measurement-error-filtered kriging, are demonstrated using simulation. The approach is illustrated with climate model output from the Hudson Strait area in northern Canada. In the illustration, locations with high or low measurement error variances are appropriately down- or up-weighted in the prediction of the underlying process, yielding a realistically smooth picture of the phenomenon of interest. |
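One simple version of the bias adjustment mentioned above, sketched in Python (an illustrative correction, not necessarily the paper's exact formula): if Z_i = Y_i + e_i with known error variances, the classical estimator applied to Z overstates the semivariogram of the latent process Y by the average pairwise error variance, which can be subtracted per lag bin.

```python
import numpy as np

def bias_adjusted_semivariogram(coords, z, me_var, bin_edges):
    """For Z_i = Y_i + e_i with known Var(e_i) = me_var[i] and independent errors,
    E[(Z_i - Z_j)^2] / 2 = gamma_Y(h) + (me_var[i] + me_var[j]) / 2,
    so subtracting the mean pairwise error variance de-biases each lag bin."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    i, j = np.triu_indices(len(z), k=1)
    half_sq = 0.5 * (z[i] - z[j]) ** 2
    err = 0.5 * (me_var[i] + me_var[j])
    bin_of_pair = np.digitize(d[i, j], bin_edges)
    return np.array([(half_sq[bin_of_pair == b] - err[bin_of_pair == b]).mean()
                     for b in range(1, len(bin_edges))])
```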
| 27 | Sudipto Banerjee | Computationally Feasible Hierarchical Modeling Strategies for Large Spatial Datasets |
| | Division of Biostatistics, University of Minnesota | Large point-referenced datasets are common in the environmental and natural sciences. The computational burden of fitting large spatial datasets undermines estimation of Bayesian models. We explore several improvements to low-rank and other scalable spatial process models, including reduction of biases and process-based modeling of the "centers" or "knots" that determine optimal subspaces for data projection. I also consider alternate strategies for handling massive spatial datasets. One approach concerns developing process-based super-population models and Bayesian finite-population sampling techniques for spatial data. I also explore model-based simultaneous dimension reduction in space, time, and the number of variables. Flexible and rich hierarchical modeling applications in forestry are demonstrated. |
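The knot-based low-rank idea the abstract alludes to can be sketched briefly in Python; the exponential covariance and random knot placement are illustrative assumptions (the talk concerns, among other things, choosing such knots well and correcting the biases this approximation introduces).

```python
import numpy as np

def exp_cov(A, B, sigma2=1.0, phi=0.5):
    """Exponential covariance between two sets of 2-D locations."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

rng = np.random.default_rng(5)
sites = rng.uniform(0, 10, size=(2000, 2))     # n observation locations
knots = rng.uniform(0, 10, size=(50, 2))       # m << n knots

# Predictive-process-style projection: approximate the n x n covariance by
# C(sites, knots) C(knots, knots)^{-1} C(knots, sites), so the expensive
# linear algebra involves only an m x m matrix.
C_sk = exp_cov(sites, knots)
C_kk = exp_cov(knots, knots)
C_lowrank = C_sk @ np.linalg.solve(C_kk, C_sk.T)   # rank <= 50 surrogate
```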
February | 3 | Robert delMas | A Different Flavor of Introductory Statistics: Teaching Students to Really Cook |
| | Department of Educational Psychology, University of Minnesota | The NSF-funded CATALST project is developing a radically different undergraduate introductory statistics course based on ideas presented by George Cobb and Danny Kaplan (Cobb, 2007a, b; Kaplan, 2007). Standard parametric tests of significance, such as the two-sample t-test and chi-square analyses, are not taught in the course. Instead, a carefully designed sequence of activities, based on research in mathematics and statistics education, helps students develop their understanding of randomness, chance models, randomization tests, and bootstrap coverage intervals. For each unit in this course, students first engage in a Model-Eliciting Activity (MEA; Lesh & Doerr, 2003; Zawojewski, Bowman, & Diefes-Dux, 2008) that primes them for learning the statistical content of the unit (Schwartz, 2004). This is followed by activities in which the students explore chance and chance models using modeling software such as TinkerPlots, and then transition to carrying out randomization tests and estimating bootstrap coverage intervals. The talk will present activities from different parts of the course to illustrate this approach, as well as results from preliminary data gathered in fall 2010. |
| | | References |
| | | Cobb, G. (2007a). The introductory statistics course: A Ptolemaic curriculum? Technology Innovations in Statistics Education, 1(1) [Online]. http://repositories.cdlib.org/uclastat/cts/tise/vol1/iss1/art1 |
| | | Cobb, G. (2007b). One possible frame for thinking about experiential learning. International Statistical Review. |
| | | Kaplan, D. (2007). Computing and introductory statistics. Technology Innovations in Statistics Education, 1(1) [Online]. http://repositories.cdlib.org/uclastat/cts/tise/vol1/iss1/art5 |
| | | Lesh, R., & Doerr, H. M. (2003). Foundations of a models and modeling perspective on mathematics teaching, learning, and problem solving. In R. Lesh & H. M. Doerr (Eds.), Beyond constructivism: Models and modeling perspectives on mathematics teaching, learning, and problem solving (pp. 3-33). Mahwah, NJ: Lawrence Erlbaum. |
| | | Schwartz, D. L. (2004). Inventing to prepare for future learning: The hidden efficiency of encouraging original student production in statistics instruction. Cognition and Instruction, 22(2), 129-184. |
| | | Zawojewski, J., Bowman, K., & Diefes-Dux, H. A. (2008). Mathematical modeling in engineering education: Designing experiences for all students. Rotterdam, the Netherlands: Sense Publishers. |
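Since the course replaces the two-sample t-test with randomization tests, a short Python sketch of such a test (with made-up data) shows what students build toward.

```python
import numpy as np

def randomization_test(a, b, n_perm=10000, seed=0):
    """Two-sample randomization test of the difference in means: re-randomize
    group labels and see how often the shuffled difference is at least as
    extreme as the observed one."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    observed = a.mean() - b.mean()
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[:len(a)].mean() - perm[len(a):].mean()
        hits += abs(diff) >= abs(observed)
    return hits / n_perm   # approximate two-sided p-value

# Illustrative (made-up) response measurements under two conditions
treatment = np.array([245.0, 246, 246, 248, 248, 250, 250, 252, 255, 255])
control = np.array([242.0, 242, 243, 244, 244, 245, 246, 247, 248, 248])
print(randomization_test(treatment, control))
```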
| 10 | Katherine Bennett Ensor | Estimating the Term Structure With a Semiparametric Bayesian Hierarchical Model: An Application to Corporate Bonds |
| | Department of Statistics, Center for Computational Finance and Economic Systems, Rice University | The term structure of interest rates is used to price defaultable bonds and credit derivatives, as well as to infer the quality of bonds for risk management purposes. We introduce methodology that jointly estimates term structures by means of a Bayesian hierarchical model with a prior probability model based on Dirichlet process mixtures. The modeling methodology borrows strength across term structures for purposes of estimation. The main advantage of our framework is its ability to produce reliable estimators at the company level even when there are only a few bonds per company. After describing the proposed model, we discuss an empirical application in which the term structure of 197 individual companies is estimated. The sample of 197 consists of 143 companies with only one or two bonds. In-sample and out-of-sample tests are used to quantify the improvement in accuracy that results from approximating the term structure of corporate bonds with estimators by company rather than by credit rating, the latter being a popular choice in the financial literature. |
| 17 | Gary Parker | |
| | Department of Statistics and Actuarial Science, Simon Fraser University |
| 24 | Ryan Elmore | |
| | Data, Informatics, and Systems Group, Computational Science, National Renewable Energy Lab |
March | 3 | Amanda Cox | |
| | Graphics Editor, The New York Times |
| 10 | | |
| 17 | Michael Rendall | |
| | RAND Population Research Center |
| 24 | Vanja Dukic | |
| | Department of Applied Mathematics, University of Colorado at Boulder |
| 31 | Crystal Linkletter | |
| | Department of Community Health, Brown University |
April | 7 | Brad Efron | |
| | Department of Statistics, Stanford University |