linear regression

Understanding Ordered Factors in a Linear Model

Consider the following data from the text Design and Analysis of Experiments, 7th ed. (Montgomery, 2009, Table 3.1). It has two variables: power and rate. power is a discrete setting on a tool used to etch circuits into a silicon wafer. There are four levels to choose from. rate is the distance etched measured in Angstroms per minute. (An Angstrom is one ten-billionth of a meter.) Of interest is how (or if) the power setting affects the etch rate.

Understanding Robust Standard Errors

What are robust standard errors? How do we calculate them? Why use them? Why not use them all the time if they’re so robust? Those are the kinds of questions this post intends to address.

Modeling Non-Constant Variance

One of the basic assumptions of linear modeling is constant, or homogeneous, variance. What does that mean exactly? Let’s simulate some data that satisfies this condition to illustrate the concept.

Below we create a sorted vector of numbers ranging from 1 to 10 called x, and then create a vector of numbers called y that is a function of x. When we plot x vs y, we get a straight line with an intercept of 1.2 and a slope of 2.1.

Getting Started with Multiple Imputation in R

Whenever we are dealing with a dataset, we almost always run into a problem that may decrease our confidence in the results that we are getting - missing data! Examples of missing data can be found in surveys - where respondents intentionally refrained from answering a question, didn’t answer a question because it is not applicable to them, or simply forgot to give an answer. Or our dataset on trade in agricultural products for country-pairs over years could suffer from missing data as some countries fail to report their accounts for certain years.

Interpreting Log Transformations in a Linear Model

Log transformations are often recommended for skewed data, such as monetary measures or certain biological and demographic measures. Log transforming data usually has the effect of spreading out clumps of data and bringing together spread-out data. For example, below is a histogram of the areas of all 50 US states. It is skewed to the right due to Alaska, California, Texas and a few others.

Hierarchical Linear Regression

Note: This post is not about hierarchical linear modeling (HLM; multilevel modeling). Hierarchical regression is model comparison of nested regression models.

Understanding Diagnostic Plots for Linear Regression Analysis

You ran a linear regression analysis and the stats software spit out a bunch of numbers. The results were significant (or not). You might think that you’re done with analysis. No, not yet. After running a regression analysis, you should check if the model works well for the data.

Should I Always Transform My Variables to Make Them Normal?

When I first learned data analysis, I always checked normality for each variable and made sure they were normally distributed before running any analyses, such as t-test, ANOVA, or linear regression. I thought normal distribution of variables was the important assumption to proceed to analyses. That’s why stats textbooks show you how to draw histograms and QQ-plots in the beginning of data analysis in the early chapters and see if variables are normally distributed, isn’t it?