Working Papers

Shmueli, Galit; Lin, M.; Lucas, H. "Is More Always Better? Larger Samples and False Discoveries"

The Internet presents great opportunities for research about information technology, allowing IS researchers to collect very large and rich datasets. It is common to see research papers with tens or even hundreds of thousands of data points, especially when reading about electronic commerce. Large samples are better than smaller samples in that they provide greater statistical power and produce more precise estimates. However, statistical inference using p-values does not scale up to large samples and often leads to erroneous conclusions. We find evidence of an over-reliance on p-values in large-sample IS studies in top IS journals and conferences. In this commentary, we focus on interpreting effects of individual independent variables on a dependent variable in regression-type models. We discuss how p-values become deflated with a large sample and illustrate this deflation in analyzing data from over 340,000 digital camera auctions on eBay. The commentary recommends that IS researchers be more conservative in interpreting statistical significance in large-sample studies and instead interpret results in terms of practical significance. In particular, we suggest that authors of large-sample IS studies report and discuss confidence intervals for independent variables of interest rather than coefficient signs and p-values. We also suggest taking advantage of a large dataset for examining how coefficients and p-values change as sample size increases, and for estimating models on multiple subsamples to further test robustness.
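The p-value deflation described in this abstract can be illustrated with a small simulation. The sketch below is not the authors' eBay analysis: it assumes a hypothetical one-predictor linear model with a tiny true slope (0.03) and unit-variance noise, and the function name `slope_inference` is made up for illustration. It fits the model at increasing sample sizes and reports the slope, a normal-approximation two-sided p-value, and a 95% confidence interval.

```python
import math
import random

def slope_inference(n, true_slope=0.03, seed=0):
    """Fit y = a + b*x by least squares on n simulated points.

    Returns (slope estimate, two-sided p-value, 95% confidence interval).
    The p-value and interval use a normal approximation to the t statistic,
    which is an assumption that is reasonable for the sample sizes used here.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [true_slope * x + rng.gauss(0.0, 1.0) for x in xs]

    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    b = sxy / sxx                      # slope estimate
    a = mean_y - b * mean_x            # intercept estimate
    rss = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)  # standard error of the slope

    z = b / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return b, p_value, (b - 1.96 * se, b + 1.96 * se)

# Same tiny effect, growing sample size (hypothetical sizes).
results = {n: slope_inference(n) for n in (100, 1000, 10000, 100000)}
for n, (b, p, (lo, hi)) in results.items():
    print(f"n={n:>6}  slope={b:+.4f}  p={p:.3g}  95% CI=({lo:+.4f}, {hi:+.4f})")
```

The effect size is the same at every sample size; only the standard error shrinks. Reading the confidence interval shows that a "significant" slope at the largest n is still practically negligible, which is the reporting practice the abstract recommends.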

Shmueli, Galit; Sellers, K. F. "Predicting Censored Count Data with COM-Poisson Regression"

Censored count data are encountered in many applications, often due to a data collection mechanism that introduces censoring. A common example is questionnaires with question answers of the type 0, 1, 2, 3+. We consider the problem of predicting a censored output variable Y, given a set of complete predictors X. The common solution would be to use adaptations for Poisson or negative binomial regression models that account for the censoring. We study two alternatives that allow for both over- and under-dispersion: Conway-Maxwell-Poisson (COM-Poisson) regression and generalized Poisson regression models, each with adaptations for censoring. We compare the predictive power of these models by applying them to a German panel dataset on fertility, where we introduce censoring of different levels into the outcome variable. We explore two additional variants: (1) using the mean versus the median of the predictive count distribution, and (2) ensembles of COM-Poisson models based on the parametric and non-parametric bootstrap.

Keywords: over-dispersion, under-dispersion, predictive distribution, mean versus median predictions, ensembles
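To make the censoring adaptation concrete, here is a minimal sketch of a right-censored COM-Poisson likelihood together with the mean and median of the predictive distribution. It is not the paper's implementation: it assumes a single rate `lam` and dispersion `nu` shared by all observations (a regression would typically link `lam` to the predictors X), it truncates the normalizing sum at a hypothetical `max_terms`, and all function names are illustrative.

```python
import math

def com_poisson_pmf(lam, nu, max_terms=200):
    """COM-Poisson probabilities P(Y = y), proportional to lam**y / (y!)**nu.

    The infinite normalizing sum is truncated at max_terms (an assumption;
    strongly over-dispersed settings may need more terms). nu = 1 is Poisson.
    """
    log_w = [y * math.log(lam) - nu * math.lgamma(y + 1) for y in range(max_terms)]
    shift = max(log_w)                       # subtract the max for numerical stability
    w = [math.exp(v - shift) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def censor(pmf, c):
    """Collapse the distribution to categories 0, 1, ..., c-1, and 'c or more'."""
    return pmf[:c] + [1.0 - sum(pmf[:c])]

def censored_loglik(ys, lam, nu, c):
    """Log-likelihood of counts recorded as min(y, c), i.e. right-censored at c."""
    probs = censor(com_poisson_pmf(lam, nu), c)
    return sum(math.log(probs[min(y, c)]) for y in ys)

def mean_and_median(pmf):
    """Mean and median of the (uncensored) predictive count distribution."""
    mean = sum(y * p for y, p in enumerate(pmf))
    cumulative = 0.0
    for y, p in enumerate(pmf):
        cumulative += p
        if cumulative >= 0.5:
            return mean, y
    return mean, len(pmf) - 1

pmf = com_poisson_pmf(lam=2.0, nu=1.0)       # nu = 1 reduces to Poisson(2)
categories = censor(pmf, c=3)                # probabilities of 0, 1, 2, 3+
mean, median = mean_and_median(pmf)
```

For a questionnaire coded 0, 1, 2, 3+, `censor(pmf, 3)` gives the four category probabilities, and `censored_loglik` is the quantity one would maximize over `lam` and `nu`. The mean and median generally differ for skewed count distributions, which is why the choice of point prediction matters.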

Shmueli, Galit; Koppius, O. "The Challenge of Prediction in IS Research"

Empirical research in Information Systems (IS) is dominated by the use of explanatory statistical models for testing causal hypotheses, and by a focus on explanatory power. Predictive statistical models, which are aimed at predicting out-of-sample observations with high accuracy, are rare, and so is attention to predictive power. The distinction between explanatory and predictive statistical models is key, as the two types of models play different, yet essential, roles in advancing scientific research. Similarly, explanatory power and predictive accuracy are two distinct qualities of a statistical model, and are measured in different ways. A literature review of MISQ and ISR shows that predictive goals, predictive claims, and predictive statistical models are scarce in mainstream empirical IS research. In addition, we find three questionable common practices. First, even when the stated goal of modeling is predictive, explanatory statistical modeling is often employed. Second, the predictive power of a model is often inferred from its explanatory power. Third, the vast majority of explanatory statistical models lack proper predictive assessment, which is a key scientific requirement. In light of the distinction between explanatory and predictive statistical modeling and power, and current practice in IS, we highlight the main differences between them, focusing on practical issues that confront an empirical researcher in the data analysis process.
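The difference between explanatory power and predictive accuracy can be shown by measuring each on the data it refers to. The sketch below is an illustration only, assuming a simple one-predictor linear model and a random holdout split; the function names and the 30% holdout fraction are hypothetical choices, not taken from the paper. In-sample R-squared is computed on the training data, while predictive accuracy is computed as RMSE on observations the model never saw.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b * mean_x, b

def r_squared(xs, ys, a, b):
    """Share of variance in ys explained by the fitted line (explanatory power)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def rmse(xs, ys, a, b):
    """Root mean squared prediction error of the fitted line on (xs, ys)."""
    return math.sqrt(sum((y - a - b * x) ** 2 for x, y in zip(xs, ys)) / len(xs))

def assess(xs, ys, holdout_fraction=0.3, seed=0):
    """Fit on a random training split; report in-sample R^2 and holdout RMSE."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    cut = len(idx) - int(len(idx) * holdout_fraction)
    train, test = idx[:cut], idx[cut:]
    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return {
        "in_sample_r2": r_squared([xs[i] for i in train], [ys[i] for i in train], a, b),
        "holdout_rmse": rmse([xs[i] for i in test], [ys[i] for i in test], a, b),
    }
```

A call such as `assess(xs, ys)` returns both numbers side by side. A high in-sample R-squared does not by itself guarantee a low holdout error, so the two should be reported separately.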

Shmueli, Galit; Jank, W.; Bapna, Ravi. "Measuring Consumer Surplus on eBay: An Empirical Study"

Keywords: online auctions, consumer surplus, eBay, sniping

Shmueli, Galit. "Simulating Multivariate Syndromic Time Series and Outbreak Signatures"

A serious challenge to research in the field of biosurveillance is the lack of authentic syndromic data available to researchers. This significantly limits the possibility of algorithm development and evaluation, and hinders the comparison of methods across different groups of researchers. Since syndromic datasets are usually proprietary and tightly held by their owners, a robust simulation method for multivariate time series derived from syndromic data is required. This paper describes a method for simulating multivariate syndromic count data, in the form of daily counts from multiple syndromic series. The simulator can be used to generate multivariate syndromic data by specifying the requested statistical structure, and as a method to mimic an existing set of syndromic data for the purpose of creating a new dataset with the same statistical properties. The product of this study is both a software program that generates multivariate semi-authentic data and a set of datasets that can serve as an initial repository for researchers. An additional component of the data simulator is an outbreak simulator that generates multivariate signatures of artificial outbreaks of different natures, which can then be embedded within the simulated data to evaluate detection algorithm performance.
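The two components named in this abstract, a baseline simulator and an outbreak signature embedded in its output, can be sketched in a few lines. This is not the paper's simulator, whose statistical structure is not given here. The sketch assumes Poisson daily counts, a multiplicative day-of-week effect, and a shared daily factor to induce correlation across series; the series names, rates, and signature values are all hypothetical.

```python
import math
import random

def poisson(rng, lam):
    """Draw one Poisson(lam) count with Knuth's multiplication method.

    Suitable for moderate lam; exp(-lam) underflows for very large rates.
    """
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_baseline(n_days, base_rates, dow_effect, seed=0):
    """Simulate daily counts for several series with a day-of-week pattern.

    A shared lognormal daily factor (an assumption) makes the series correlated.
    """
    rng = random.Random(seed)
    series = {name: [] for name in base_rates}
    for day in range(n_days):
        shared = math.exp(rng.gauss(0.0, 0.1))
        for name, rate in base_rates.items():
            series[name].append(poisson(rng, rate * dow_effect[day % 7] * shared))
    return series

def embed_outbreak(series, signature, start_day):
    """Return a copy of the series with extra outbreak cases added from start_day."""
    out = {name: list(counts) for name, counts in series.items()}
    for name, extra in signature.items():
        for offset, cases in enumerate(extra):
            if start_day + offset < len(out[name]):
                out[name][start_day + offset] += cases
    return out

# Hypothetical two-series example: lower counts on the last two days of each week.
baseline = simulate_baseline(
    28,
    {"respiratory": 40.0, "gastrointestinal": 15.0},
    [1.0, 1.0, 1.0, 1.0, 1.0, 0.6, 0.6],
)
signature = {"respiratory": [2, 5, 9, 5, 2], "gastrointestinal": [0, 1, 2, 1, 0]}
with_outbreak = embed_outbreak(baseline, signature, start_day=10)
```

Because the injected cases and their start day are known, a detection algorithm run on `with_outbreak` can be scored against ground truth, which is the evaluation use the abstract describes.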