Getting Started with Tweedie Models
Tweedie models are a special kind of generalized linear model (GLM) that can be useful when we want to model an outcome that sometimes equals 0 but is otherwise positive and continuous. Examples include daily precipitation and annual income. Data like this can have zeroes, often lots of zeroes, in addition to positive values. When modeling data of this nature, we may want to ensure our model does not predict negative values. We may also want to log-transform the data without dropping the zeroes. Tweedie models allow us to do both.
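To make this concrete, here is a minimal sketch of fitting a Tweedie GLM in R, assuming the {statmod} package for the tweedie() family; the data, variable names, and variance power are illustrative, not prescriptive.

```r
# A minimal sketch of a Tweedie GLM, assuming the {statmod} package
library(statmod)

set.seed(1)
# Hypothetical data: many exact zeroes, otherwise positive and continuous
n <- 500
x <- runif(n)
rain <- ifelse(rbinom(n, 1, 0.4) == 1, 0,
               rgamma(n, shape = 2, rate = 1 / exp(x)))

# var.power between 1 and 2 gives the compound Poisson-gamma Tweedie;
# link.power = 0 requests a log link, so predictions are always non-negative
fit <- glm(rain ~ x, family = tweedie(var.power = 1.5, link.power = 0))
summary(fit)
head(predict(fit, type = "response"))  # predictions on the original scale
```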
Getting Started with Multilevel Regression and Poststratification
Multilevel Regression and Poststratification (MRP) is a method of adjusting model estimates for non-response. By “non-response” we mean under-sampled groups in a population. For example, imagine conducting a phone survey to estimate the percentage of a population that approves of an elected official. It’s likely that certain age groups in the population will be under-sampled because they’re less likely to answer a call from an unfamiliar number. MRP allows us to analyze the data and adjust the estimate by taking the under-sampled groups into account.
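As a rough illustration of the two steps, here is a minimal sketch assuming a toy survey with a single grouping variable (age group) and the {lme4} package; real applications typically use several demographic and geographic variables together with census counts.

```r
# A minimal sketch of the MRP idea with made-up survey data and population shares
library(lme4)

set.seed(1)
# Hypothetical survey: older respondents over-represented, approval varies by age
survey <- data.frame(
  age_group = sample(c("18-29", "30-44", "45-64", "65+"),
                     size = 500, replace = TRUE,
                     prob = c(0.10, 0.20, 0.30, 0.40))
)
true_p <- c("18-29" = 0.60, "30-44" = 0.55, "45-64" = 0.45, "65+" = 0.35)
survey$approve <- rbinom(500, 1, true_p[survey$age_group])

# Step 1: multilevel regression gives a partially pooled estimate for each group
fit <- glmer(approve ~ (1 | age_group), data = survey, family = binomial)

# Step 2: poststratification weights the group predictions by population shares
cells <- data.frame(age_group = names(true_p),
                    pop_share = c(0.20, 0.25, 0.30, 0.25))  # assumed census shares
cells$p_hat <- predict(fit, newdata = cells, type = "response")
weighted.mean(cells$p_hat, cells$pop_share)  # MRP-adjusted approval estimate
```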
Using Wavelets to Analyze Time Series Data
Time series data can contain a lot of information. Often, it is difficult to visually detect patterns in a time series, and even harder to quantify them. How can we analyze our time series data to understand its underlying signals and how those signals change through time?
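One way to begin answering that question is with a wavelet transform. Below is a minimal sketch assuming the {WaveletComp} package (the package and settings here are illustrative, not necessarily the ones used later); it decomposes a toy series that mixes a fast and a slow oscillation.

```r
# A minimal sketch of a continuous wavelet transform, assuming {WaveletComp}
library(WaveletComp)

set.seed(1)
time_index <- 1:512
x <- sin(2 * pi * time_index / 16) +   # short-period signal
     sin(2 * pi * time_index / 64) +   # long-period signal
     rnorm(512, sd = 0.3)
dat <- data.frame(x = x)

# Continuous wavelet transform: which periods carry power, and when
wt <- analyze.wavelet(dat, my.series = "x", dt = 1,
                      lowerPeriod = 4, upperPeriod = 128,
                      make.pval = FALSE)
wt.image(wt)  # heat map of wavelet power by time and period
```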
Understanding t-tests, ANOVA, and MANOVA
Imagine you love baking cookies and invite your friends over for a cookie party. You want to know how many cookies to make, so you ask each friend how many cookies they think they will eat. They respond:
- Francesca: 5 cookies
- Sydney: 3 cookies
- Noelle: 1 cookie
- James: 7 cookies
- Brooke: 2 cookies
Adding these numbers together, we estimate that about 18 cookies will be eaten in total at our party.
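For readers who like the arithmetic spelled out, the same calculation in R:

```r
# Each friend's estimate, summed to get the total for the party
cookies <- c(Francesca = 5, Sydney = 3, Noelle = 1, James = 7, Brooke = 2)
sum(cookies)   # 18 cookies in total
mean(cookies)  # 3.6 cookies per friend, on average
```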
Understanding Polychoric Correlation
Polychoric correlation is a measure of association between two ordered categorical variables, each assumed to represent a latent continuous variable, where the two latent variables have a bivariate standard normal distribution. When we say two variables have a bivariate standard normal distribution, we mean each is normally distributed with mean 0 and standard deviation 1, and that they are linearly correlated.
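A small simulation can make the latent-variable idea concrete. The sketch below assumes the {MASS} and {polycor} packages; the cutpoints and the true correlation of 0.5 are made up for illustration.

```r
# A minimal sketch of the latent-variable idea behind polychoric correlation
library(MASS)     # mvrnorm()
library(polycor)  # polychor()

set.seed(1)
# Latent bivariate standard normal variables with correlation 0.5
latent <- mvrnorm(n = 1000, mu = c(0, 0),
                  Sigma = matrix(c(1, 0.5, 0.5, 1), nrow = 2))

# We observe only coarsened, ordered categories of each latent variable
x <- cut(latent[, 1], breaks = c(-Inf, -0.5, 0.5, Inf),
         labels = 1:3, ordered_result = TRUE)
y <- cut(latent[, 2], breaks = c(-Inf, 0, 1, Inf),
         labels = 1:3, ordered_result = TRUE)

polychor(x, y)  # estimates the latent correlation (close to 0.5)
```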
Power and Sample Size Calculations for Ordered Categorical Data
In this article we demonstrate how to calculate the sample size needed to achieve a desired power for experiments with an ordered categorical outcome. We assume the data will be analyzed with a proportional odds logistic regression model. We’ll use the R statistical computing environment and functions from the {Hmisc} package to implement the calculations.
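As a preview of the kind of calculation involved, here is a rough sketch using the {Hmisc} functions posamsize() and popower(); the cell probabilities and odds ratio below are illustrative, not necessarily the values used later in the article.

```r
# A rough sketch of power/sample-size calculations for a proportional odds model
library(Hmisc)

# Assumed marginal probabilities of the ordered outcome categories (sum to 1)
p <- c(0.2, 0.3, 0.3, 0.2)

# Total sample size needed to detect an odds ratio of 1.5 with 80% power,
# assuming two equally sized groups
posamsize(p, odds.ratio = 1.5, fraction = 0.5, alpha = 0.05, power = 0.8)

# Power achieved with a total of 300 participants
popower(p, odds.ratio = 1.5, n = 300, alpha = 0.05)
```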
Bootstrapped Association Rule Mining in R
Association rule mining is a machine learning technique designed to uncover associations between categorical variables in data. It produces association rules that reflect how frequently categories are found together. A common application is the “Frequently Bought Together” section of an online store’s checkout page: by mining that store’s transactions, items that are frequently bought together (e.g., shaving cream and razors) can be identified.
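Here is a minimal sketch of mining such rules in R, assuming the {arules} package; the toy transactions are made up for illustration.

```r
# A minimal sketch of association rule mining with {arules}
library(arules)

baskets <- list(
  c("razor", "shaving cream", "toothpaste"),
  c("razor", "shaving cream"),
  c("bread", "milk", "razor", "shaving cream"),
  c("bread", "milk"),
  c("milk", "toothpaste")
)
trans <- as(baskets, "transactions")

# Mine rules that meet minimum support and confidence thresholds
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.8))
inspect(rules)  # e.g., {shaving cream} => {razor}
```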
Spatial R: Using the sf package
The spread of disease, politics, the movement of animals, regions vulnerable to earthquakes, and where people are most likely to buy Frosted Flakes are all informed by spatial data. Spatial data links information to specific positions on Earth and can tell us about patterns that play out from location to location. We can use spatial data to uncover processes over space and tackle complex problems.
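As a small taste of the {sf} workflow, the sketch below turns a plain data frame of coordinates into an sf object; the locations and attribute values are made up.

```r
# A minimal sketch of creating and plotting an sf object
library(sf)

stores <- data.frame(
  name  = c("Store A", "Store B", "Store C"),
  sales = c(120, 85, 240),
  lon   = c(-78.48, -78.50, -78.45),
  lat   = c(38.03, 38.06, 38.02)
)

# Convert to an sf object using WGS84 (EPSG:4326) coordinates
stores_sf <- st_as_sf(stores, coords = c("lon", "lat"), crs = 4326)

stores_sf                 # the geometry column now stores each location
plot(stores_sf["sales"])  # quick map of sales by location
```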
Assessing Model Assumptions with Lineup Plots
When fitting a linear model we make two assumptions about the distribution of the residuals: that they are normally distributed, and that they have constant variance (homoskedasticity).
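To preview the idea, here is a hand-rolled sketch of a lineup plot using base R and {ggplot2} (packages such as {nullabor} automate this): the real residual plot is hidden among plots of residuals simulated from a model in which the assumptions hold by construction.

```r
# A hand-rolled sketch of a lineup plot for residual diagnostics
library(ggplot2)

set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
fit <- lm(y ~ x)

# 19 simulated residual sets (assumptions true by construction) plus the real one
panels <- lapply(1:20, function(i) {
  data.frame(panel = i, fitted = fitted(fit), resid = rnorm(n, sd = sigma(fit)))
})
real <- sample(20, 1)               # position of the real residuals, kept secret
panels[[real]]$resid <- resid(fit)
lineup_data <- do.call(rbind, panels)

# If the real panel is easy to pick out, the residuals look unlike the null plots
ggplot(lineup_data, aes(fitted, resid)) +
  geom_point(size = 0.6) +
  facet_wrap(~ panel)
# real  # uncomment after guessing to reveal which panel held the real residuals
```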
Bootstrapping Residuals for Linear Models with Heteroskedastic Errors Invites Trouble
Bootstrapping—resampling data with replacement and recomputing quantities of interest—lets analysts approximate sampling distributions for complex estimators and frees them of the reliably unmet assumptions of traditional, parametric inferential statistics. It’s an elegant, intuitive approach in which an analyst exploits the (often) parallel resample-to-sample and sample-to-population relationships to understand uncertainty in an estimate.
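To fix ideas, here is a minimal sketch of the residual bootstrap for a simple linear model in base R; the data and number of replicates are illustrative. The article goes on to examine what happens when the errors are heteroskedastic rather than, as here, constant in variance.

```r
# A minimal sketch of the residual bootstrap for a simple linear model
set.seed(1)
n <- 100
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)   # homoskedastic errors, where this method behaves well
fit <- lm(y ~ x)

B <- 2000
boot_slopes <- replicate(B, {
  # Resample residuals with replacement and add them back to the fitted values
  y_star <- fitted(fit) + sample(resid(fit), replace = TRUE)
  coef(lm(y_star ~ x))["x"]
})

sd(boot_slopes)                          # bootstrap standard error of the slope
quantile(boot_slopes, c(0.025, 0.975))   # percentile confidence interval
```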