Project Component A


The first step of your project is to find a data set that interests you. Some students may gravitate towards political sentiment, mental health, or science. It is up to you to decide what data set (among the ones we have provided) will spark the most interesting questions for you. What makes a data set interesting to you may be the general topics that are covered or specific questions/topics that are covered within the data set.

Next, try to narrow down particular parts of the data set you find interesting. Your goal is to brainstorm some possible research questions. One of the simplest research questions that can be asked is whether two constructs are associated.

For example, in GSS participants were asked about their happiness in marriage (rated Very Happy, Somewhat Happy, or Not Too Happy) and also about their self-assessment of their own health (Excellent, Good, Fair, or Poor). One question might be: Is there a relationship between health and relationship happiness?

As another example, in NESARC participants were asked about whether their blood/natural father was depressed (No, Yes, or Unknown), whether their blood/natural mother was depressed (No, Yes, or Unknown) and also a multitude of questions about their own mental health (A list of questions that reveal some form of depression). One question could be: Does parental depression predict an individual’s level of depression?

Remember that you can tweak your question as we move forward, but it will benefit you greatly to spend time now deciding a direction for your project.

The requirement of this assignment is to: select a data set (and explain why it interests you), discuss potential research questions, and copy and paste the relevant components of the codebook into a document. (This will help you stay organized in the coming weeks; you will likely need to update it as your project and research question evolve.) If applicable, let us know whether you are having trouble picking a topic or have any other concerns about how to move forward.

Sample Submission:

After looking through the codebook for the NESARC study, I have decided that I am particularly interested in studying family background and depression. Examining mental health and trying to understand contributing factors to depression is something that I explored during my summer internship. While the internship focused on activity level and depression, I was always interested in how a parental figure’s own depression (either biologically or through interactions) may contribute to their child’s depression in adulthood.

My personal codebook includes all questions in the NESARC study that give me information about mother and father depression and also includes some signs of an individual’s own depression:

4126 1. Yes
32192 2. No
6775 9. Unknown


7134 1. Yes
31448 2. No
4511 9. Unknown

2399 1. Yes
39510 2. No
1244 9. Unknown

1292 1. Yes
1013 2. No
34 9. Unknown
40754 BL – never had 2+ years of low mood

805 1. Yes
1505 2. No
29 9. Unknown
40754 BL – never had 2+ years of low mood

1613 1. Yes
701 2. No
25 9. Unknown
40754 BL – never had 2+ years of low mood

1177 1. Yes
1134 2. No
28 9. Unknown
40754 BL – never had 2+ years of low mood

Please note that your personal codebook can contain as few as two codebook items or run to multiple pages.

Project Component B

Directions: It is important that you examine the existing literature on your topic or association of interest. This will allow you to understand what researchers have already studied on your topic. Your ultimate objective is to go beyond what is already known through your project in this course. To achieve this, you must familiarize yourself with what researchers have studied. The requirement of this assignment is to: describe the association or topic that you have decided to examine and the key words you found helpful in your search. List at least 5 of the most appropriate references that you have found and read. Describe findings and interesting themes that you have uncovered, and list a tentative research question or two that you hope to pursue. Be brief and use bullets.

Sample Submission:

Given the association that I have decided to examine, I use such keywords as nicotine dependence, tobacco dependence, and smoking. After reading through several titles and abstracts, I notice that there has been relatively little attention in the research literature to the association between smoking exposure and nicotine dependence. I expand a bit to include other substance use that provides relevant background as well.


Caraballo, R. S., Novak, S. P., & Asman, K. (2009). Linking quantity and frequency profiles of cigarette smoking to the presence of nicotine dependence symptoms among adolescent smokers: Findings from the 2004 National Youth Tobacco Survey. Nicotine & Tobacco Research, 11(1), 49-57.

Chen, K., & Kandel, D. (2002). Relationship between extent of cocaine use and dependence among adolescents and adults in the United States. Drug and Alcohol Dependence, 68, 65-85.

Chen, K., Kandel, D. B., & Davies, M. (1997). Relationships between frequency and quantity of marijuana use and last year proxy dependence among adolescents and adults in the United States. Drug and Alcohol Dependence, 46, 53-67.

Dierker, L., He, J. P., Kalaydjian, A., Swendsen, J., Degenhardt, L., Glantz, M., & Merikangas, K. (2008). The importance of timing of transitions for risk of regular smoking and nicotine dependence. Annals of Behavioral Medicine, 36(1), 87-92.

Dierker, L. C., Donny, E., Tiffany, S., Colby, S. M., Perrine, N., Clayton, R. R., & Tobacco Etiology Research Network. (2007). The association between cigarette smoking and DSM-IV nicotine dependence among first year college students. Drug and Alcohol Dependence, 86(2-3), 106-114.

Lessov-Schlaggar, C. N., Hops, H., Brigham, J., Hudmon, K. S., Andrews, J. A., Tildesley, E., . . . Swan, G. E. (2008). Adolescent smoking trajectories and nicotine dependence. Nicotine & Tobacco Research, 10(2), 341-351.

Riggs, N. R., Chou, C. P., Li, C. Y., & Pentz, M. A. (2007). Adolescent to emerging adulthood smoking trajectories: When do smoking trajectories diverge, and do they predict early adulthood nicotine dependence? Nicotine & Tobacco Research, 9(11), 1147-1154.

Van De Ven, M. O. M., Greenwood, P. A., Engels, R., Olsson, C. A., & Patton, G. C. (2010). Patterns of adolescent smoking and later nicotine dependence in young adults: A 10-year prospective study. Public Health, 124(2), 65-70.

Based on my reading of the above articles as well as others, I have noted a few common and interesting themes:

  1. While it is true that smoking exposure is a necessary requirement for nicotine dependence, frequency and quantity of smoking are markedly imperfect indices for determining an individual’s probability of exhibiting nicotine dependence (this is true for other drugs as well)
  2. The association may differ based on ethnicity, age, and gender (although there is little work on this)
  3. One of the most potent risk factors consistently implicated in the etiology of smoking behavior and nicotine dependence is depression.

I have decided to further focus my question by examining whether the association between nicotine dependence and smoking differs based on whether a person is experiencing depression. I am wondering if at low levels of smoking compared to high levels, nicotine dependence is more common among individuals with major depression than those without major depression. I add relevant depression questions/items/variables to my personal codebook as well as several demographic measures (age, gender, ethnicity, etc.) and any other variables I may wish to consider.

Project Component C

Directions: You will continue to frame your topic and research question and outline your research intentions. In preparation for creating your final project, you will organize your research by outlining your ideas. For this assignment you are expected to think about a title and the 3 sections that will become part of your final poster: the Introduction, Methods, and Predicted Results/Implications. The details below explain what is expected in each of those sections. For your submission, you are expected only to make an outline of each of the sections below:

  • Title: Your title should summarize the main idea of your research question and should include the variables under investigation. The title should be fully explanatory when standing alone.
  • Introduction (Literature Review): Your introduction should describe your topic and rationale for your research question. Your objective is to convince the reader why they should care about the topic and frame how you are going to contribute to the literature. Your introduction should:
    • Include an opening statement about your main topic.
    • Describe what is known in the literature about your topic or association. (You should have at least 3 main points to discuss here)
    • Justify your research
    • Use specific examples and describe major findings.
    • Describe what is not known about your topic.
    • Summarize any gaps found in the literature and describe how your analyses contribute to filling this gap
    • Your research question
  • Methods: 
    • Name data set and at least 3 key features of the sample or way data were collected.
    • Describe your measures
      • What type of variables are you using? Explain if you are combining several measures.
  • Predicted Results/Implications
    • What do you expect that your research will reveal?
    • Why would these findings be important? Could anything actionable happen as a result of your findings?

Sample Submission:

  • Title: The Association Between Nicotine Dependence and Major Depression
  • Introduction
    • Major depression is a major risk factor for the development of nicotine dependence.
    • Depression has been shown to increase risk of later smoking. This temporal ordering suggests the possibility of a causal relationship.
    • Research shows major depression increases the probability and amount of smoking
    • A substantial number of individuals reporting daily and/or heavy smoking do not meet criteria for nicotine dependence. (Kandel & Chen, 2000)
    • It is unclear whether those with major depression experience nicotine dependence beyond what would be expected by smoking exposure alone.
    • Is there a relationship between major depression and nicotine dependence? Does the relationship between nicotine dependence and major depression exist above and beyond smoking exposure?


  • Methods
    • NESARC
      • The sample from the first wave of the National Epidemiologic Survey on Alcohol and Related Conditions (NESARC) represents the civilian, non-institutionalized adult population of the United States
      • The NESARC included over sampling of Blacks, Hispanics and young adults aged 18 to 24 years.
      • Face-to-face computer assisted interviews were conducted in respondents’ homes following informed consent procedures.
      • The sample included 43,093 participants.
    • Measures
      • Nicotine Dependence: Using the tobacco module, the criteria for nicotine dependence were assessed.
      • Nicotine Use: Smoking frequency (“About how often did you usually smoke in the past year?”), coded dichotomously in terms of the presence or absence of daily smoking, and quantity (“On the days that you smoked in the last year, about how many cigarettes did you usually smoke?”). These questions were combined to estimate approximately how many cigarettes were smoked per month.
      • Major Depression: Lifetime major depression (i.e., experienced in the past 12 months or prior to the past 12 months) was assessed using the NIAAA Alcohol Use Disorder and Associated Disabilities Interview Schedule – DSM-IV.


  • Predicted Results/Implications
    • It is understood that nicotine use predicts nicotine dependence.
    • It is not yet clear whether major depression will predict nicotine dependence after controlling for nicotine use.
    • If individuals with major depression are more sensitive to the development of nicotine dependence, they would represent an important population subgroup for targeted smoking intervention programs.


Project Component D

Directions: The requirement of this assignment is to: call in the appropriate data set, select the columns (i.e. variables), and possibly rows (i.e. observations), of interest, and explore the distributions for your chosen variables. You should include:

  1. Your program.
  2. The output that displays three of your variables with summary information for their distributions.
  3. A few sentences describing the distributions of your variables.
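
As a rough illustration of what such a program does, the sketch below uses only the Python standard library; your course software will have its own syntax for each step. The data, the column names (`age`, `father_dep`, `mother_dep`), and the age cutoff are all made up for the example and do not come from any real codebook.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for a real data file; column names and values are invented.
RAW = """id,age,father_dep,mother_dep
1,23,2,1
2,45,2,2
3,31,1,9
4,19,2,2
5,60,9,2
6,22,1,1
"""

def load_rows(text):
    """Read CSV text into a list of dictionaries, one per observation."""
    return list(csv.DictReader(io.StringIO(text)))

def frequency_table(rows, column):
    """Return {value: (count, percent)} for one column."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: (n, round(100 * n / total, 1)) for value, n in sorted(counts.items())}

rows = load_rows(RAW)

# Keep only the columns of interest, and only respondents aged 18 to 30.
subset = [
    {"age": int(r["age"]), "father_dep": r["father_dep"], "mother_dep": r["mother_dep"]}
    for r in rows
    if 18 <= int(r["age"]) <= 30
]

for column in ("father_dep", "mother_dep"):
    print(column, frequency_table(subset, column))
```

Each frequency table pairs a count with a percentage, which is the summary information you would then describe in a few sentences.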

Project Component E

Directions: You will begin to work through how to make decisions about data management and how to put those decisions into action. You should include:

  1. A program that manages your data
  2. Output that displays 3 of your secondary (i.e. data managed) variables as frequency tables.
  3. Briefly describe the findings from your frequency tables.
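
The following is a minimal sketch of three common data management decisions: setting "Unknown" codes to missing, collapsing a quantitative variable into ordered categories, and combining two variables into a new one. It assumes that 9 and 99 are the missing-value codes and that smoking is recorded as days per month and cigarettes per day; those codes, the field names, and the toy records are assumptions for illustration only, so check your own codebook.

```python
from collections import Counter

def recode_missing(value, missing_codes):
    """Turn an 'Unknown' code into None so it can be excluded from analyses."""
    return None if value in missing_codes else value

def cigarettes_category(per_day):
    """Collapse cigarettes per day into 5 ordered categories."""
    if per_day is None:
        return None
    for upper, label in ((5, "1 to 5"), (10, "6 to 10"), (15, "11 to 15"), (20, "16 to 20")):
        if per_day <= upper:
            return label
    return ">20"

def cigarettes_per_month(days_per_month, per_day):
    """Combine frequency and quantity into an approximate monthly total."""
    if days_per_month is None or per_day is None:
        return None
    return days_per_month * per_day

# Invented records; 99 and 9 are assumed to mean "Unknown".
respondents = [
    {"days": 30, "per_day": 12, "mother_dep": 1},
    {"days": 30, "per_day": 4, "mother_dep": 2},
    {"days": 4, "per_day": 2, "mother_dep": 9},
    {"days": 30, "per_day": 99, "mother_dep": 2},
    {"days": 14, "per_day": 25, "mother_dep": 2},
]

managed = []
for r in respondents:
    per_day = recode_missing(r["per_day"], {99})
    managed.append({
        "cig_cat": cigarettes_category(per_day),
        "cig_month": cigarettes_per_month(r["days"], per_day),
        "mother_dep": recode_missing(r["mother_dep"], {9}),
    })

# Frequency tables for two of the managed (secondary) variables.
for name in ("cig_cat", "mother_dep"):
    print(name, Counter(m[name] for m in managed))
```

Running frequency tables on the managed variables, as in the last loop, is a quick way to confirm that each recode did what you intended.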

Project Component F

Directions: There are a variety of conventional ways to visualize data – tables, histograms, bar graphs, etc. Now that your data have been managed, it is time to graph your variables one at a time and examine both center and spread. Include your univariate graphs of your two main constructs (i.e. data managed variables). Write a few sentences describing what your graphs reveal in terms of shape, spread, and center (if the variable is quantitative) or the most/least likely categories (if the variable is categorical).
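
To make "center and spread" concrete, here is a small sketch that summarizes one quantitative variable and draws a crude text bar chart for one categorical variable. It is not a substitute for the graphs your statistical software produces; the values and variable names are invented for the example.

```python
import statistics
from collections import Counter

def describe(values):
    """Center and spread for a quantitative variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def text_bar_chart(categories):
    """One line per category, most common first, with a bar of '#' marks."""
    counts = Counter(categories)
    return [f"{label:<10} {'#' * n} ({n})" for label, n in counts.most_common()]

# Invented data for illustration only.
cigs_per_day = [2, 4, 5, 10, 10, 12, 15, 20, 20, 40]
health = ["Good", "Excellent", "Good", "Fair", "Good", "Poor", "Excellent", "Good"]

print(describe(cigs_per_day))
for line in text_bar_chart(health):
    print(line)
```

In this toy sample the mean of cigarettes per day is above the median, which is what you would expect from a distribution with a long right tail; that is the kind of observation to include in your written description.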

Project Component G

Directions: (1) Construct a graph that shows the association between your explanatory and response variables (bivariate graph). Write a few sentences describing what your graph reveals in terms of the relationship between the variables. How does this correspond with your predictions? Does the graph reveal anything unexpected or interesting about your relationship of interest? (2) OPTIONAL: Construct a 2nd graph that shows the association between another explanatory variable and your response variable. Again, write a few sentences describing what your graph reveals in terms of the relationships among the variables.
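
When the explanatory variable is categorical and the response is binary, a bivariate bar graph typically plots the proportion of "yes" responses within each explanatory category. The sketch below computes those proportions and prints a rough text version of the graph; the data and category labels are invented, and the response is assumed to be coded 0/1.

```python
from collections import defaultdict

def rate_by_group(pairs):
    """Return {group: proportion with response == 1} from (group, response) pairs."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for group, response in pairs:
        totals[group] += 1
        positives[group] += response
    return {group: positives[group] / totals[group] for group in totals}

# Invented data: (cigarettes per day category, nicotine dependence coded 0/1).
data = [
    ("1 to 5", 0), ("1 to 5", 0), ("1 to 5", 1), ("1 to 5", 0),
    ("6 to 10", 0), ("6 to 10", 1), ("6 to 10", 1), ("6 to 10", 0),
    ("11 to 15", 1), ("11 to 15", 1), ("11 to 15", 0), ("11 to 15", 1),
]

rates = rate_by_group(data)
for group, rate in rates.items():
    # Bar length is proportional to the rate (20 marks = 100%).
    print(f"{group:<9} {'#' * round(rate * 20)} {rate:.0%}")
```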

Project Component H

Directions: Determine what the appropriate statistical test is for your main two variables of interest. Your options are:

  • Analysis of variance (ANOVA) assesses whether the means of two or more groups are statistically different from each other. This analysis is appropriate whenever you want to compare the means (quantitative variables) of groups (categorical variables). The null hypothesis is that there is no difference in the mean of the quantitative variable across groups (categorical variable), while the alternative is that there is a difference.
  • A Chi-Square Test of Independence compares frequencies of one categorical variable for different values of a second categorical variable. The null hypothesis is that the relative proportions of one variable are independent of the second variable; in other words, the proportions of one variable are the same for different values of the second variable. The alternative hypothesis is that the relative proportions of one variable are associated with the second variable. Note: although it is possible to run large Chi-Square tables (e.g. 5 x 5, 4 x 6, etc.), the test is really only interpretable when your response variable has 2 levels (see the Graphing decisions flow chart in the bivariate graphing chapter).
  • Correlation coefficient assesses the degree of linear relationship between two variables. It ranges from +1 to -1. A correlation of +1 means that there is a perfect, positive, linear relationship between the two variables. A correlation of -1 means there is a perfect, negative linear relationship between the two variables. In both cases, knowing the value of one variable, you can perfectly predict the value of the second. Note: Two 3+ level categorical variables can be used to generate a correlation coefficient if the categories are ordered and the average (i.e. mean) can be interpreted. The scatterplot, on the other hand, will not be useful. In general, the scatterplot is not useful for discrete variables (i.e. those that take on a limited number of values). When we square r, it tells us the proportion of the variability in one variable that is described by variation in the second variable (aka RSquare or Coefficient of Determination).
  • Please note: If you have a quantitative explanatory variable and a categorical response, you will eventually be using logistic regression. For now, categorize your explanatory variable and use a chi-square test as explained above.

The requirement of this assignment is to: Run the appropriate test, post the syntax used, and interpret your findings. In addition, use post-hoc tests if appropriate. Please see the samples below for guidance in writing statistical findings.
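
To show what the F statistic in an ANOVA actually compares, the sketch below computes it by hand for two small groups: variability between the group means relative to variability within the groups. The numbers are invented, and the p-value is left to your statistical software, since it requires the F distribution.

```python
import statistics

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of groups of numbers."""
    all_values = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_values)
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: how far each value is from its own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Invented cigarettes-per-day values for two groups.
dependent = [14, 16, 15, 18, 12]
not_dependent = [10, 12, 11, 13, 9]

f_stat, df_b, df_w = one_way_anova([dependent, not_dependent])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

The two degrees of freedom printed here are the same two numbers reported in parentheses after F when you write up ANOVA results.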

Sample Submission: 

  • Example of how to write results for ANOVA:
    • When examining the association between current number of cigarettes smoked (quantitative response) and past year nicotine dependence (categorical explanatory), an Analysis of Variance (ANOVA) revealed that among daily, young adult smokers (my sample), those with nicotine dependence reported smoking significantly more cigarettes per day (Mean=14.6, s.d. ±9.15) compared to those without nicotine dependence (Mean=11.4, s.d. ±7.43), F(1, 1313)=44.68, p=.0001.
    • Post hoc ANOVA results: ANOVA revealed that among daily, young adult smokers (my sample), number of cigarettes smoked per day (collapsed into 5 ordered categories, which is the categorical explanatory variable) and number of nicotine dependence symptoms (quantitative response variable) were significantly associated, F (4, 1308)=11.79, p=.0001. Post hoc comparisons of mean number of nicotine dependence symptoms by pairs of cigarettes per day categories revealed that those individuals smoking more than 10 cigarettes per day (i.e. 11 to 15, 16 to 20 and >20) reported significantly more nicotine dependence symptoms compared to those smoking 10 or fewer cigarettes per day (i.e. 1 to 5 and 6 to 10). All other comparisons were statistically similar.
  • Chi-Square Test of Independence
    • When examining the association between lifetime major depression (categorical response) and past year nicotine dependence (categorical explanatory), a chi-square test of independence revealed that among daily, young adult smokers (my sample), those with past year nicotine dependence were more likely to have experienced major depression in their lifetime (36.2%) compared to those without past year nicotine dependence (12.7%), X2=88.60, 1 df, p=.0001.
    • Post hoc Chi-Square results: A Chi-Square test of independence revealed that among daily, young adult smokers (my sample), number of cigarettes smoked per day (collapsed into 5 ordered categories) and past year nicotine dependence (binary categorical variable) were significantly associated, X2=45.16, 4 df, p=.0001. Post hoc comparisons of rates of nicotine dependence by pairs of cigarettes per day categories revealed that higher rates of nicotine dependence were seen among those smoking more cigarettes, up to 11 to 15 cigarettes per day. In comparison, prevalence of nicotine dependence was statistically similar among those groups smoking 11 to 15, 16 to 20, and >20 cigarettes per day.
  • Correlation
    • Among daily, young adult smokers (my sample), the correlation between number of cigarettes smoked per day (quantitative) and number of nicotine dependence symptoms experienced in the past year (quantitative) was 0.17 (p=.0001), suggesting that only 3% (i.e. 0.17 squared) of the variance in number of current nicotine dependence symptoms can be explained by number of cigarettes smoked per day.
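
The chi-square statistic and the r-squared interpretation used in the samples above can be checked by hand. The sketch below does both on invented data: it computes a Pearson chi-square from a 2 x 2 table of counts, a p-value that is valid only for 1 degree of freedom, and a Pearson correlation with its square. The table and the paired values are assumptions made up for the example.

```python
import math

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

def p_value_1df(stat):
    """Upper-tail p-value; this shortcut is valid only for 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2))

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented counts: rows = nicotine dependence (yes, no), columns = depression (yes, no).
table = [[30, 20], [10, 40]]
stat, df = chi_square(table)
print(f"X2 = {stat:.2f}, {df} df, p = {p_value_1df(stat):.5f}")

# Invented paired values: cigarettes per day category and number of symptoms.
r = pearson_r([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
print(f"r = {r:.2f}, r squared = {r * r:.2f}")
```

The same squaring step is what turns the sample correlation of 0.17 into roughly 3% of variance explained.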

Project Component I

Directions: In statistics, moderation occurs when the relationship between two variables depends on a third variable. In this case, the third variable is referred to as the moderating variable or simply the moderator. The effect of a moderating variable is often characterized statistically as an interaction; that is, a third variable that affects the direction and/or strength of the relation between your explanatory and response variable. When testing a potential moderator, we are asking the question whether there is an association between two constructs, but separately for different subgroups within the sample.

For this assignment, post to your journal the syntax used to test moderation, along with the corresponding output and a few sentences of interpretation. Note: for our purposes, the third variable must be categorical.
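
One way to picture a moderation check is to compute the same association separately within each level of the third variable and compare the results. The sketch below does this descriptively with invented counts, where the response is nicotine dependence (0/1), the explanatory variable is smoking level, and the assumed moderator is depression status. A formal test would still be run in your statistical software within each subgroup.

```python
from collections import defaultdict

def rates_by_subgroup(records):
    """Return {moderator: {explanatory: proportion with response == 1}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [positives, total]
    for moderator, explanatory, response in records:
        cell = counts[moderator][explanatory]
        cell[0] += response
        cell[1] += 1
    return {
        m: {e: pos / total for e, (pos, total) in groups.items()}
        for m, groups in counts.items()
    }

def expand(moderator, explanatory, positives, total):
    """Build individual records from a cell count (helper for the toy data)."""
    return ([(moderator, explanatory, 1)] * positives
            + [(moderator, explanatory, 0)] * (total - positives))

# Invented counts: number dependent out of 4 in each cell.
records = (
    expand("depressed", "low", 3, 4) + expand("depressed", "high", 3, 4)
    + expand("not depressed", "low", 1, 4) + expand("not depressed", "high", 3, 4)
)

rates = rates_by_subgroup(records)
for moderator, by_level in rates.items():
    difference = by_level["high"] - by_level["low"]
    print(f"{moderator}: low={by_level['low']:.2f}, high={by_level['high']:.2f}, "
          f"difference={difference:.2f}")
```

In this toy example, smoking level is related to dependence only among those without depression, which is the pattern you would describe as the association differing by subgroup.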

Project Component J

Directions: Run the appropriate regression that uses all of your explanatory variables of interest. (Reminder: multiple regression is used when the response variable is quantitative, and logistic regression is used when the response variable is binary categorical.) Interpret your regression output.

Sample Submission:

  • Example of how to write results for Multiple Regression: After adjusting for potential confounding factors (list them) major depression (Beta=1.34, p=.0001) was significantly and positively associated with number of nicotine dependence symptoms.
  • Example of how to write results for Logistic Regression: After adjusting for potential confounding factors (list them) major depression (O.R. 4.0, CI 2.94-5.37) was significantly and positively associated with the likelihood of meeting criteria for nicotine dependence.
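
To help read an odds ratio and its confidence interval, the sketch below computes an unadjusted odds ratio from a 2 x 2 table with a 95% interval based on the standard error of the log odds ratio. This is a simplification: it does not adjust for confounders the way a logistic regression does, the counts are invented, and the 95% level is an assumption for the example.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Unadjusted odds ratio and confidence interval from a 2 x 2 table.

    a = exposed with outcome,   b = exposed without outcome,
    c = unexposed with outcome, d = unexposed without outcome.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Invented counts: exposure = major depression, outcome = nicotine dependence.
odds_ratio, lower, upper = odds_ratio_ci(a=40, b=60, c=20, d=120)
print(f"O.R. {odds_ratio:.1f}, CI {lower:.2f}-{upper:.2f}")

# An interval that excludes 1 indicates a statistically significant association.
print("significant" if lower > 1 or upper < 1 else "not significant")
```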