StatLab Articles

Getting Started with Shiny

What is Shiny?

Shiny is an R package for building interactive web apps with R code. Apps can be hosted locally, on the shinyapps.io server, or on your own server. Shiny apps can range from extremely simple to incredibly sophisticated. They can be written purely with R code or supplemented with HTML, CSS, or JavaScript. Visit RStudio’s Shiny Gallery to view some examples.

R, visualization, Shiny, Laura White

Databases for Data Scientists

As data scientists, we’re often most excited about the final layer of analysis. Once all the data are cleaned and stored in a format readable by our favorite programming language (Python, R, Stata, etc.), the most fun part of our work is finding counterintuitive causal relationships with statistical methods. If you can prove that the mutual presence of McDonald’s really does prevent wars between countries or that an increase in diversity really does boost business profitability, that is a good day.

data wrangling, SQL, Srikar Gullapalli

Modeling Non-Constant Variance

One of the basic assumptions of linear modeling is constant, or homogeneous, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept.

Below we create a sorted vector of numbers ranging from 1 to 10 called x, and then create a vector of numbers called y that is a function of x. When we plot x vs y, we get a straight line with an intercept of 1.2 and a slope of 2.1.
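The simulation described above can be sketched as follows. This is a minimal illustration in Python using only the standard library (the article itself works in R). The sample size of 100, the seed, and the noise standard deviation `sigma` are assumptions for illustration, not values from the article.

```python
import random

rng = random.Random(1)  # arbitrary seed (assumption), for reproducibility

# Sorted vector of 100 values between 1 and 10 (sample size is an assumption)
x = sorted(rng.uniform(1, 10) for _ in range(100))

# y is an exact linear function of x: intercept 1.2, slope 2.1
y = [1.2 + 2.1 * xi for xi in x]

# Adding noise with a single fixed standard deviation gives constant variance:
# the spread around the line is the same at every value of x
sigma = 1.5  # hypothetical value
y_noisy = [yi + rng.gauss(0, sigma) for yi in y]
```

Plotting `x` against `y` gives the straight line; plotting `x` against `y_noisy` gives points scattered evenly around it.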

R, linear regression, statistical methods, GLS, Clay Ford

Creating an SQLite Database for Use with R

When you import or load data into R, the data are stored in random-access memory (RAM). This memory is cleared when you close R or shut off your computer. It’s very fast but temporary. If you save your data, it is saved to your hard drive. But when you open R again and load the data, it is loaded into RAM once more. While many newer computers come with lots of RAM (such as 16 GB), it’s not an infinite amount. When you open RStudio, you’re using RAM even if no data are loaded. Open a web browser or any other program and they too are loaded into RAM.
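One way around the RAM limit is to keep the data in an on-disk database and pull only the rows you need into memory. The sketch below uses Python's built-in `sqlite3` module rather than R; the table name `measurements`, its columns, and the temporary file location are all hypothetical.

```python
import os
import sqlite3
import tempfile

# Create the database file on disk (a temporary directory, for illustration)
db_path = os.path.join(tempfile.mkdtemp(), "example.sqlite")

con = sqlite3.connect(db_path)
con.execute("CREATE TABLE measurements (id INTEGER PRIMARY KEY, value REAL)")
con.executemany(
    "INSERT INTO measurements (value) VALUES (?)",
    [(v,) for v in (1.5, 2.5, 10.0, 12.0)],
)
con.commit()
con.close()  # the data now live on disk, not in RAM

# Later: reconnect and load only the subset that matches a condition
con = sqlite3.connect(db_path)
rows = con.execute(
    "SELECT id, value FROM measurements WHERE value > ?", (5,)
).fetchall()
con.close()
```

Only the rows returned by the query are held in memory; the rest of the table stays in the file.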

R, data wrangling, SQL, SQLite, Clay Ford

Simulating Data for Count Models

A count model is a linear model where the dependent variable is a count. Examples include the number of times a car breaks down, the number of rats in a litter, and the number of times a young student gets out of his seat. Counts are either 0 or a positive whole number, which means we need to use special distributions to generate the data.
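As a rough sketch of generating such data, the Python code below draws counts from a Poisson distribution whose mean depends on a predictor through a log link. The coefficients `b0` and `b1`, the sample size, and the helper `rpois` are hypothetical choices for illustration; the article itself works in R. Because the Python standard library has no Poisson sampler, the helper uses Knuth's multiplication method, which is adequate for small means.

```python
import math
import random

rng = random.Random(2)  # arbitrary seed (assumption)

def rpois(lam, rng):
    """Draw one Poisson(lam) value using Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

# Hypothetical coefficients on the log scale
b0, b1 = 0.2, 0.3

# One predictor, 200 observations (sample size is an assumption)
x = [rng.uniform(0, 5) for _ in range(200)]

# The expected count is exp(b0 + b1 * x), so it is always positive
lam = [math.exp(b0 + b1 * xi) for xi in x]

# Each simulated response is 0 or a positive whole number
counts = [rpois(l, rng) for l in lam]
```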

R, simulation, statistical methods, poisson regression, negative binomial regression, zero-inflated models, Clay Ford

Simulating a Logistic Regression Model

Logistic regression is a method for modeling binary data as a function of other variables. For example, we might want to model the occurrence or non-occurrence of a disease given predictors such as age, race, weight, etc. The result is a model that returns a predicted probability of occurrence (or non-occurrence, depending on how we set up our data) given certain values of our predictors. We might also be able to interpret the coefficients in our model to summarize how a change in one predictor affects the odds of occurrence.
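To make the link between coefficients, probabilities, and odds concrete, here is a small Python sketch that simulates binary outcomes from a logistic model with one predictor. The coefficients `b0` and `b1`, the predictor range, and the sample size are hypothetical; the article itself works in R.

```python
import math
import random

rng = random.Random(3)  # arbitrary seed (assumption)

# Hypothetical coefficients on the log-odds scale
b0, b1 = -1.0, 0.5

def prob(x):
    """Probability of occurrence at x, via the inverse logit."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

def odds(x):
    """Odds of occurrence at x: p / (1 - p)."""
    p = prob(x)
    return p / (1.0 - p)

# One predictor, 500 observations (sample size is an assumption)
x = [rng.uniform(0, 10) for _ in range(500)]

# Each outcome is 1 with probability prob(x) and 0 otherwise
y = [1 if rng.random() < prob(xi) else 0 for xi in x]

# A one-unit increase in x multiplies the odds by exp(b1)
odds_ratio = odds(3.0) / odds(2.0)
```

Here `odds_ratio` equals `math.exp(b1)` up to floating-point error, which is the sense in which a coefficient summarizes how a change in one predictor affects the odds.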

R, logistic regression, power analysis, simulation, statistical methods, Clay Ford

An Introduction to Analyzing Twitter Data with R

NOTE: As of March 2023, the free version of the Twitter API no longer allows read requests. This means the instructions below to create a developer account, access Twitter, and download tweets no longer work as written. If you have a paid "Basic" tier or higher, then these instructions may work for you, but we have not verified this.

R, text analysis, text mining, Leah Malkovich

Getting Started with Multiple Imputation in R

Whenever we work with a dataset, we almost always run into a problem that may decrease our confidence in our results: missing data! Missing data often appear in surveys, where respondents intentionally refrain from answering a question, skip a question because it is not applicable to them, or simply forget to give an answer. Likewise, a dataset on trade in agricultural products for country pairs over several years could suffer from missing data because some countries fail to report their accounts for certain years.

R, linear regression, statistical methods, multiple imputation, Aycan Katitas

Digital Governance Lab Proposal

Related Scholarship

A Guide to Python in QGIS

This post is something I’ve been thinking about writing for a while. I was inspired to write it by my own trials and tribulations, which are still ongoing, while working with the QGIS API and trying to do things programmatically in QGIS instead of relying on available widgets and plugins. I have spent, and will probably continue to spend, many hours scouring the internet, especially Stack Overflow, looking for answers on how to use various classes, methods, attributes, etc.

Python, data wrangling, QGIS, Erich Purpur