Pre-Class Readings and Videos
Examining the frequency distribution of each of your variables is key to guiding the decisions involved in quantitative research.
EXAMPLE:
A random sample of 1,200 U.S. college students was asked the following question as part of a larger survey: “What is your perception of your own body? Do you feel that you are overweight, underweight, or about right?” The following table shows part of the data (5 of the 1,200 observations):
STUDENT | BODY IMAGE |
---|---|
Student 25 | Overweight |
Student 26 | About Right |
Student 27 | Underweight |
Student 28 | About Right |
Student 29 | About Right |
Here is some information that would be interesting to get from these data:
- What percentage of the sampled students fall into each category?
- How are students divided across the three body image categories? Are they equally divided? If not, do the percentages follow some other kind of pattern?
We cannot answer these questions by looking at the raw data, which are simply a long list of 1,200 responses. However, both questions are easily answered once we summarize and look at the frequency distribution of the variable BodyImage (i.e., once we summarize how often each of the categories occurs).
To summarize the distribution of a categorical variable, we ask our statistical software to create a table of the different values (categories) the variable takes, how many times each value occurs (count), and, more importantly, what share of the observations each value accounts for (percentage). Here is the table (i.e., the frequency distribution) for our example:
CATEGORY | COUNT | PERCENTAGE |
---|---|---|
About Right | 855 | 71.3% |
Overweight | 235 | 19.6% |
Underweight | 110 | 9.2% |
Total | 1200 | 100% |
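The table above can be reproduced with a few lines of code. The following is a minimal Python sketch, not the output of the course software (R, SAS, or Stata). Because the raw survey file is not shown here, it rebuilds the 1,200 responses from the counts in the table. The function name `frequency_table` is a hypothetical helper introduced only for illustration.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

# Rebuild the 1,200 responses from the counts reported in the table
# (the raw survey file itself is not reproduced in this reading).
responses = ["About Right"] * 855 + ["Overweight"] * 235 + ["Underweight"] * 110


def frequency_table(values):
    """Return (category, count, percent) rows, most frequent category first."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    for category, count in counts.most_common():
        # Round half up to one decimal place so 71.25 is reported as 71.3.
        percent = (Decimal(count * 100) / Decimal(total)).quantize(
            Decimal("0.1"), rounding=ROUND_HALF_UP
        )
        rows.append((category, count, percent))
    return rows


body_image_table = frequency_table(responses)
for category, count, percent in body_image_table:
    print(f"{category:<12} {count:>5} {str(percent):>6}%")
```

Exact decimal arithmetic with explicit half-up rounding is used here because Python's built-in `round` rounds ties to the nearest even digit, which could report 71.25 as 71.2 rather than the 71.3 shown in the table.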
Please watch the Data Management video for your statistical software (R, SAS, or Stata).
Data Management
During the class session, we will begin to work through how to make decisions about data management and how to put those decisions into action.
An understanding of basic operations to be used with your statistical software is a good place to start.
Examples of data management decisions:
1. Need to identify missing data
Often, you must define the response categories that represent missing data. For example, if the number 9 is used to represent a missing value, you must either designate in your program that this value represents missingness or else you must recode the variable into a missing data character that your statistical software recognizes. If you do not, the 9 will be treated as a real/meaningful value and will be included in each of your analyses.
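To see why this matters, here is a minimal Python sketch of recoding a missing-value code. The responses are made up for illustration, the code 9 follows the example above, and `recode_missing` is a hypothetical helper; your statistical software will have its own way of marking missing values.

```python
MISSING_CODE = 9  # assumed missing-value code, as in the example above


def recode_missing(values, missing_code=MISSING_CODE):
    """Replace the missing-value code with None so it is not analyzed as a real value."""
    return [None if v == missing_code else v for v in values]


raw = [1, 2, 9, 1, 9, 2]  # made-up responses for illustration
clean = recode_missing(raw)

# The mean changes once the 9s stop counting as real responses.
observed = [v for v in clean if v is not None]
mean_raw = sum(raw) / len(raw)
mean_clean = sum(observed) / len(observed)
print(mean_raw, mean_clean)
```

With the 9s left in, the mean is 4.0, a value no respondent could actually have given on a 1 to 2 scale. After recoding it is 1.5.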
2. Need to recode responses to “no” based on skip patterns
Some data sets contain a number of skip patterns. For example, if we ask someone whether or not they have ever used marijuana, and they say “no”, it would not make sense to ask them more detailed questions about their marijuana use (e.g., quantity, frequency, onset, impairment).
When analyzing more detailed questions regarding marijuana (e.g., have you ever smoked marijuana daily for a month or more?), those individuals who never used the substance may show up as missing data. Since they have never used marijuana, we can assume that their answer is “no”: they have never smoked marijuana daily. This would need to be explicitly recoded. Note that we commonly code a no as 0 and a yes as 1.
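The skip-pattern recode can be sketched in Python as follows. The item names `ever_used` and `daily_use` and the four respondents are hypothetical, and `None` stands in for a missing value.

```python
def recode_skip(ever_used, daily_use):
    """Fill a skipped follow-up item with 0 ("no") when the screening item is 0 ("no")."""
    if daily_use is None and ever_used == 0:
        return 0
    return daily_use


# (ever_used, daily_use) pairs; hypothetical data, None marks a missing value.
respondents = [(1, 1), (1, 0), (0, None), (None, None)]
recoded = [recode_skip(ever_used, daily_use) for ever_used, daily_use in respondents]
print(recoded)
```

Only the third respondent is changed. The last one stays missing because we do not know whether that person ever used marijuana, so there is no basis for assuming a “no”.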
3. Need to collapse response categories
If a variable has many response categories, it can be difficult to interpret the statistical analyses in which it is used. Alternatively, there may be too few subjects or observations identified by one or more response categories to allow for a successful analysis. In these cases, you would need to collapse across categories. Consider the variable S1Q6A from the data frame NESARC, which has 14 levels that record the highest level of education of the participant. To collapse the categories into a dichotomous variable that indicates the presence of a high school degree, use the ifelse function. The levels 1, 2, 3, 4, 5, 6, and 7 of the variable S1Q6A correspond to education levels less than completing high school.
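The same collapse can be sketched in Python as a rough analogue of the ifelse approach. It relies on the statement above that levels 1 through 7 of S1Q6A mean less than high school, so levels 8 through 14 are treated as having a high school degree. The function name and the sample values are hypothetical, and this is not the NESARC data itself.

```python
def has_high_school_degree(s1q6a):
    """Collapse the 14 education levels into a Yes/No indicator."""
    if s1q6a is None:
        return None  # keep missing values missing
    # Levels 1-7 are below high school completion; 8-14 are treated as a degree.
    return "No" if 1 <= s1q6a <= 7 else "Yes"


sample_levels = [3, 7, 8, 14, None]  # made-up values for illustration
high_school = [has_high_school_degree(level) for level in sample_levels]
print(high_school)
```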
4. Need to aggregate variables
In many cases, you will want to combine multiple variables into one. Consider creating a new variable DepressLife, which is Yes if the variable MAJORLIFE or DYSLIFE is a 1 (data frame NESARC).
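As a minimal Python sketch of this aggregation, the function below returns Yes when either source variable equals 1. Coding every other case as No is an assumption made here for illustration, and the three rows are made up rather than taken from NESARC.

```python
def depress_life(majorlife, dyslife):
    """Return "Yes" if either MAJORLIFE or DYSLIFE is 1, otherwise "No" (assumed)."""
    return "Yes" if majorlife == 1 or dyslife == 1 else "No"


# Made-up rows standing in for records of the NESARC data frame.
rows = [
    {"MAJORLIFE": 1, "DYSLIFE": 0},
    {"MAJORLIFE": 0, "DYSLIFE": 1},
    {"MAJORLIFE": 0, "DYSLIFE": 0},
]
for row in rows:
    row["DepressLife"] = depress_life(row["MAJORLIFE"], row["DYSLIFE"])

print([row["DepressLife"] for row in rows])
```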
5. Need to create continuous variables
If you are working with a number of items that represent a single construct, it may be useful to create a composite variable/score. For example, I want to use a list of nicotine dependence symptoms meant to address the presence or absence of nicotine dependence (e.g., tolerance, withdrawal, craving). Rather than using a dichotomous variable (i.e., nicotine dependence present/absent), I want to examine the construct as a dimensional scale (i.e., number of nicotine dependence symptoms). In this case, I would recode each symptom variable so that yes=1 and no=0, and then sum the items so that they represent one composite score.
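A composite score of this kind might be built as in the Python sketch below. The raw coding (1 = yes, 2 = no), the item names, and the respondents are all hypothetical. Returning a missing score whenever any item is missing is one possible design choice, not the only defensible one.

```python
RAW_TO_BINARY = {1: 1, 2: 0}  # assumed raw coding: 1 = yes, 2 = no


def symptom_count(items):
    """Recode each symptom item to 1/0 and sum the items into one score."""
    if any(value is None for value in items.values()):
        return None  # do not score respondents with a missing item
    return sum(RAW_TO_BINARY[value] for value in items.values())


# Made-up respondents with three hypothetical symptom items.
respondents = [
    {"tolerance": 1, "withdrawal": 1, "craving": 2},
    {"tolerance": 2, "withdrawal": 2, "craving": 2},
    {"tolerance": 1, "withdrawal": None, "craving": 1},
]
scores = [symptom_count(person) for person in respondents]
print(scores)
```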
6. Labeling variable responses/values
Because nominal and ordinal variables have, or are given, numeric response values (i.e., dummy codes), it can be useful to label those values so that the labels are displayed in your output.
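A small Python sketch of value labeling: a lookup table maps each numeric code to its label before the frequencies are tabulated. The codes, labels, and responses are made up for illustration.

```python
from collections import Counter

VALUE_LABELS = {0: "No", 1: "Yes"}  # hypothetical labels for a 0/1 variable

responses = [1, 0, 0, 1, 0]  # made-up coded responses
# Any code without a label (including None) is shown as "Missing".
labeled = [VALUE_LABELS.get(value, "Missing") for value in responses]
label_counts = Counter(labeled)
print(label_counts)
```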
7. Need to further subset the sample
When using large data sets, it is often necessary to subset the data so that you include only those observations that can help answer your particular research question. In these cases, you may want to select your own sample from within the survey’s sampling frame. For example, if you are interested in identifying demographic predictors of depression among Type II diabetes patients, you would plan to subset the data to subjects endorsing Type II diabetes.
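Subsetting amounts to keeping only the rows that satisfy a condition. In the Python sketch below, the field name `type2_diabetes`, the helper `subset`, and the records are hypothetical.

```python
def subset(rows, condition):
    """Keep only the rows for which the condition is true."""
    return [row for row in rows if condition(row)]


# Made-up records; type2_diabetes is a hypothetical 0/1 indicator.
records = [
    {"id": 1, "type2_diabetes": 1, "age": 54},
    {"id": 2, "type2_diabetes": 0, "age": 37},
    {"id": 3, "type2_diabetes": 1, "age": 61},
]
diabetes_sample = subset(records, lambda row: row["type2_diabetes"] == 1)
print([row["id"] for row in diabetes_sample])
```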
NOTE: Often, you will need to create groups or sub-samples from the data set for the purpose of making comparisons. It is important to be certain that the groups that you would like to compare are of adequate size and number. For example, if you were interested in comparing complications of depression in parents who had lost a child through miscarriage vs. parents who had lost a child in the first year of life, it would be important to have large enough groups of each. It would not be appropriate to attempt to compare 5000 observations in the miscarriage group to only 9 observations in the first year group.
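Before comparing groups, it helps to tabulate their sizes. The Python sketch below uses the group sizes from the example above. The cutoff of 30 is an arbitrary placeholder chosen only for illustration, not a recommended rule, and `small_groups` is a hypothetical helper.

```python
from collections import Counter


def small_groups(group_labels, minimum_size):
    """Return the groups (with their sizes) that have fewer than minimum_size observations."""
    counts = Counter(group_labels)
    return {group: n for group, n in counts.items() if n < minimum_size}


# Group sizes taken from the example above: 5000 vs. 9 observations.
groups = ["miscarriage"] * 5000 + ["first year"] * 9
flagged = small_groups(groups, minimum_size=30)  # 30 is an arbitrary placeholder
print(flagged)
```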
Pre-Class Quiz
After reviewing the material above, take Quiz 4 in Moodle. Please note that you have 2 attempts for this quiz and the higher grade prevails.
During Class Tasks
Mini-Assignment 3
Project Component E