Missing Data Using SPSS
Missing a large share of your data can cause several problems. The most apparent one is that there may simply not be enough data points to run your analyses. EFA, CFA, and path models all require a certain number of data points in order to compute estimates, and this number increases with the complexity of your model. If too many values are missing, the analysis may not run at all.
Additionally, missing data may signal bias. Some people may have skipped particular questions in your survey for a common reason. For example, if you asked about gender, and females are less likely to report their gender than males, then you will have male-biased data. Perhaps only 50% of the females reported their gender, but 95% of the males did. If you use gender in your causal models, your results will be heavily biased toward males, because the responses with unreported gender will not be used.
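To make the size of that bias concrete, here is a small arithmetic sketch in Python. It assumes a hypothetical sample with equal numbers of females and males (100 each); only the 50% and 95% reporting rates come from the example above.

```python
# Hypothetical sample sizes (an assumption for illustration only).
n_female, n_male = 100, 100

# Reporting rates taken from the example in the text.
female_rate, male_rate = 0.50, 0.95

# Respondents whose gender is known and who can therefore be used in the model.
reported_female = n_female * female_rate
reported_male = n_male * male_rate

# Share of males among the usable responses.
share_male = reported_male / (reported_female + reported_male)
print(f"Share of males among usable responses: {share_male:.1%}")
```

Even though the hypothetical sample is split evenly, roughly two thirds of the usable responses come from males.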
To find out how many missing values each variable has, in SPSS go to Analyze, then Descriptive Statistics, then Frequencies. Enter the variables in the variables list. Then click OK. The table in the output will show the number of missing values for each variable.
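Outside SPSS, the same per-variable count can be sketched in a few lines of Python. This is only an illustration: the variable names and values are made up, and `None` is assumed to mark a missing answer.

```python
# Hypothetical survey responses; None marks a missing answer.
responses = [
    {"age": 34, "satisfaction": 4, "income": 52000},
    {"age": None, "satisfaction": 5, "income": None},
    {"age": 29, "satisfaction": None, "income": 48000},
    {"age": 41, "satisfaction": 3, "income": None},
]

def count_missing(rows):
    """Return {variable: number of missing values} across all rows."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            # (value is None) is True/False, which adds as 1/0.
            counts[name] = counts.get(name, 0) + (value is None)
    return counts

missing = count_missing(responses)
for name, n in missing.items():
    print(f"{name}: {n} missing of {len(responses)}")
```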
The threshold for missing data is flexible, but generally, if you are missing more than 10% of the responses on a particular variable, or from a particular respondent, that variable or respondent may be problematic. There are several ways to deal with problematic variables:
- Just don’t use that variable.
- If it makes sense, impute the missing values. This should only be done for continuous or interval data (like age, or Likert-scale responses treated as interval), not for categorical data (like gender).
- If your dataset is large enough, exclude the responses that have missing values for that variable. This may create a bias, however, if more than 10% of the responses are missing.
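The 10% guideline and the third option (dropping incomplete responses) can be sketched as follows. The dataset, the variable names, and the helper functions are hypothetical; the threshold is the flexible rule of thumb from above, not a fixed standard.

```python
THRESHOLD = 0.10  # the flexible 10% guideline from the text

def missing_share(rows, name):
    """Fraction of rows in which variable `name` is missing (None)."""
    return sum(row[name] is None for row in rows) / len(rows)

def flag_variables(rows, threshold=THRESHOLD):
    """Variables whose share of missing values exceeds the threshold."""
    return [n for n in rows[0].keys() if missing_share(rows, n) > threshold]

def drop_incomplete(rows, name):
    """Keep only the responses that have a value for variable `name`."""
    return [row for row in rows if row[name] is not None]

# Hypothetical dataset of 20 respondents.
rows = [{"age": 30 + i, "income": 40000 + 1000 * i} for i in range(20)]
rows[3]["age"] = None              # 1 of 20 missing (5%)
for i in (1, 5, 9, 13):
    rows[i]["income"] = None       # 4 of 20 missing (20%)

flagged = flag_variables(rows)
complete = drop_incomplete(rows, "income")
print(flagged)        # variables above the threshold
print(len(complete))  # responses left after dropping missing income
```

Here only `income` is flagged, and dropping its incomplete responses removes a fifth of the sample, which is exactly the situation where the resulting bias deserves a second look.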
To impute values in SPSS, go to Transform, then Replace Missing Values; select the variables that need imputing, and click OK. See the screenshots below. In this screenshot, I use the Mean replacement method, but there are other options, including Median replacement. Typically with Likert-type data, you want to use median replacement, because means are less meaningful for ordinal responses. For more information on when to use which type of imputation, refer to Lynch (2003).
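The difference between mean and median replacement is easy to see on a small Likert-type variable. The sketch below uses Python's standard `statistics` module and fills every gap with the mean or median of all observed values. That is a simplifying assumption for illustration; it is not a reproduction of what the SPSS dialog computes, and the data are made up.

```python
from statistics import mean, median

def impute(values, method="median"):
    """Replace None with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if method == "median" else mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical 5-point Likert responses; None marks a missing answer.
likert = [5, 4, None, 2, 5, None, 5]

mean_filled = impute(likert, "mean")
median_filled = impute(likert, "median")
print(mean_filled)
print(median_filled)
```

Mean replacement inserts 4.2, a value no respondent could actually have chosen on a 5-point scale, while median replacement inserts 5, which is a valid scale point.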
Handling problematic respondents is somewhat more difficult. If a respondent did not answer a large portion of the questions, their other responses may be useless when it comes to testing causal models. For example, if they answered questions about diet but not about weight loss, we cannot use this individual to test a causal model arguing that diet has a positive effect on weight loss; we simply do not have the data for that person. My recommendation is to first determine which variables will actually be used in your model (we often collect data on more variables than we end up using), and then determine whether the respondent is problematic. If so, remove that respondent from the analysis.
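That two-step recommendation (pick the model variables first, then screen respondents) might be sketched like this. The variable names, the respondent records, and the reuse of the 10% guideline at the respondent level are all assumptions for illustration.

```python
# Hypothetical: only these variables are used in the causal model.
MODEL_VARS = ["diet", "weight_loss"]

def respondent_missing_share(row, variables):
    """Fraction of the model variables this respondent left unanswered."""
    return sum(row.get(v) is None for v in variables) / len(variables)

def keep_usable(rows, variables, threshold=0.10):
    """Keep respondents whose missing share on model variables is within the threshold."""
    return [r for r in rows if respondent_missing_share(r, variables) <= threshold]

# Hypothetical respondents; None marks a missing answer.
respondents = [
    {"id": 1, "diet": 3, "weight_loss": 2.5, "hobby": "chess"},
    {"id": 2, "diet": 4, "weight_loss": None, "hobby": "golf"},
    {"id": 3, "diet": 2, "weight_loss": 1.0, "hobby": None},
]

usable = keep_usable(respondents, MODEL_VARS)
print([r["id"] for r in usable])
```

Respondent 3 is kept even though `hobby` is missing, because that variable is not part of the model; respondent 2 is removed because the outcome variable itself is missing.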