Mini-Assignments – Applied Data Analysis

Table of Contents

Assignment 1
Assignment 2
Assignment 3
Assignment 4
Assignment 5
Assignment 6
Assignment 7
Assignment 8
Assignment 9

Assignment 1
Directions:

Assignment 2
For this assignment you will be working with the Health Evaluation and Linkage to Primary Care (HELP) data set. The HELP study was a clinical trial for adult inpatients recruited from a detoxification unit. Patients with no primary care physician were randomized to receive either a multidisciplinary assessment and a brief motivational intervention or usual care, with the goal of linking them to primary medical care. The data set contains 453 observations on the following variables:

age: subject age at baseline (in years)
anysub: use of any substance post-detox: a factor with levels no yes
cesd: Center for Epidemiologic Studies Depression measure at baseline (high scores indicate more depressive symptoms)
d1: lifetime number of hospitalizations for medical problems (measured at baseline)
daysanysub: time (in days) to first use of any substance post-detox
dayslink: time (in days) to linkage to primary care
drugrisk: Risk Assessment Battery drug risk scale at baseline
e2b: number of times in past 6 months entered a detox program (measured at baseline)
female: 0 for male, 1 for female
sex: a factor with levels male female
g1b: experienced serious thoughts of suicide in last 30 days (measured at baseline): a factor with levels no yes
homeless: housing status: a factor with levels housed homeless
i1: average number of drinks (standard units) consumed per day, in the past 30 days (measured at baseline)
i2: maximum number of drinks (standard units) consumed per day, in the past 30 days (measured at baseline)
id: subject identifier
indtot: Inventory of Drug Use Consequences (InDUC) total score (measured at baseline)
linkstatus: post-detox linkage to primary care (0 = no, 1 = yes)
link: post-detox linkage to primary care: no yes
mcs: SF-36 Mental Component Score (measured at baseline, lower scores indicate worse status)
pcs: SF-36 Physical Component Score (measured at baseline, lower scores indicate worse status)
pss_fr: perceived social support by friends (measured at baseline, higher scores indicate more support)
racegrp: race/ethnicity: levels black hispanic other white
satreat: any BSAS substance abuse treatment at baseline: no yes
sexrisk: Risk Assessment Battery sex risk score (measured at baseline)
substance: primary substance of abuse: alcohol cocaine heroin
treat: randomized to HELP clinic: no yes

Directions:

Load in the HELP data set.
Make a frequency table for sex and d1.
Answer Questions 1-3 below
Now, subset the data to only include patients whose primary substance of abuse is cocaine and who are at least 40 years old.
Make a frequency table for sex based on this subset.
Answer Questions 4-5 below
Transfer your answers into tabview 2 Answer Submission
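
The frequency-table and subsetting steps above can be sketched as follows. This is only a minimal illustration in Python using the standard library: the five rows are made up (the real HELP data has 453 rows), and the helper name frequency_table is hypothetical. Only the column names (sex, age, substance, d1) come from the variable list above.

```python
from collections import Counter

# Hypothetical rows standing in for the HELP data (NOT real study values).
patients = [
    {"sex": "male", "age": 45, "substance": "cocaine", "d1": 0},
    {"sex": "female", "age": 38, "substance": "heroin", "d1": 2},
    {"sex": "male", "age": 52, "substance": "cocaine", "d1": 7},
    {"sex": "female", "age": 41, "substance": "cocaine", "d1": 1},
    {"sex": "male", "age": 29, "substance": "alcohol", "d1": 0},
]

def frequency_table(rows, column):
    """Return {value: (count, percent)} for one column (hypothetical helper)."""
    counts = Counter(row[column] for row in rows)
    total = sum(counts.values())
    return {value: (n, 100 * n / total) for value, n in sorted(counts.items())}

sex_table = frequency_table(patients, "sex")
d1_table = frequency_table(patients, "d1")

# Percentage hospitalized fewer than 5 times: add up the d1 rows below 5.
pct_under_5 = sum(pct for value, (n, pct) in d1_table.items() if value < 5)

# Subset: primary substance is cocaine AND age is at least 40.
subset = [p for p in patients if p["substance"] == "cocaine" and p["age"] >= 40]
subset_sex_table = frequency_table(subset, "sex")

print(sex_table)
print(d1_table)
print(pct_under_5)
print(len(subset), subset_sex_table)
```

Note that "at least 40" translates to age >= 40, not age > 40, and that both conditions must hold at the same time.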

Question 1: How many patients in the study are female?

Question 2: How many patients in the study have never been hospitalized for medical problems?

Question 3: What percentage of patients in the study have been hospitalized fewer than 5 times? (Round to the nearest whole percentage.)

Question 4: How many patients in the study are at least 40 years old and have cocaine listed as their primary substance of abuse?

Question 5: What percentage of patients who are at least 40 years old and have cocaine listed as their primary substance of abuse are male?

Submit your answers here

Assignment 3
Directions:

Load in the HELP data set
Patients with a mental component score (mcs) less than 20 are thought to be at extreme risk of returning to the detoxification unit within the next 12 months. Make a new variable called “ExtremeMCS” and code it as 1 if a patient is at risk based on his/her mcs score and 0 otherwise.
Make a new variable “SuicidalThought” based off of the variable g1b. Have 1 indicate a patient has had suicidal thoughts and a 0 indicate he/she has not.
Make a new variable “HomelessStatus” based off of the variable “homeless”. Have a 1 indicate that a patient is homeless and a 0 indicate he/she is housed.
Suppose we want to assess the overall risk a patient has to return to the detoxification unit and it is believed “ExtremeMCS”, “SuicidalThought”, and “HomelessStatus” are considered risk factors. Construct a new variable called ‘RiskTotal’ which computes the number of risk factors a particular patient has. That is, make it a sum of these 3 variables.
Make a frequency table of ‘ExtremeMCS’, ‘SuicidalThought’, ‘RiskTotal’ and answer questions 1-3 below.
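
The recoding steps above can be sketched as follows. This is a minimal Python illustration with four made-up patients, not the real data. The column names mcs, g1b, and homeless and their levels are taken from the Assignment 2 variable list, and the new variable names are the ones requested in the directions.

```python
# Hypothetical patients (NOT real study values); levels follow the variable list.
patients = [
    {"mcs": 15.2, "g1b": "yes", "homeless": "homeless"},
    {"mcs": 35.0, "g1b": "no", "homeless": "housed"},
    {"mcs": 19.9, "g1b": "no", "homeless": "homeless"},
    {"mcs": 20.0, "g1b": "yes", "homeless": "housed"},
]

for p in patients:
    # Strictly less than 20 counts as extreme, so mcs == 20.0 is coded 0.
    p["ExtremeMCS"] = 1 if p["mcs"] < 20 else 0
    p["SuicidalThought"] = 1 if p["g1b"] == "yes" else 0
    p["HomelessStatus"] = 1 if p["homeless"] == "homeless" else 0
    # RiskTotal is the number of risk factors present (0 to 3).
    p["RiskTotal"] = p["ExtremeMCS"] + p["SuicidalThought"] + p["HomelessStatus"]

extreme_flags = [p["ExtremeMCS"] for p in patients]
risk_totals = [p["RiskTotal"] for p in patients]
print(extreme_flags)
print(risk_totals)
```

Because each indicator is coded 0/1, the sum is a count, and the mean of an indicator is the proportion of patients with that risk factor.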

Question 1: How many patients are thought to be at risk based on their mcs score?

Question 2: What percentage of patients have experienced suicidal thoughts? (Round your answer to the nearest percentage.)

Question 3: What percentage of patients in the study have fewer than 3 risk factors? (Round your answer to the nearest percentage.)

Submit your answers here

Assignment 4
Directions:

Import the Utilities data set. (Look at the codebook to familiarize yourself with the variables; see CODEBOOK below.)
Make an appropriate graph to display the distribution of customers’ total monthly bill. Answer Question 1 below.
Make an appropriate graph to display gas bill by month.
Make an appropriate graph to display electric bill by month.
Answer Questions 2-3 below.
Make a graph to display the relationship between kwh usage and gas bill.
Make a graph to display the relationship between kwh usage and electric bill.
Answer Question 4 below.
Make a graph to determine whether there is a relationship between season and donation.

Start by making a new variable to categorize each bill into “season”. Have the variable equal winter if the bill was from December, January, or February. Have the variable equal spring if the bill was from March, April, or May. Have the variable equal summer if the bill was from June, July, or August. Have the variable equal fall if the bill was from September, October, or November.
Make a new variable DonorStatus and set equal to 1 if the billee donated money to Operation Fuel and 0 otherwise.
Now make a bar chart to display the proportion of donors by season.
Answer Question 5 below.
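
The season and DonorStatus steps above can be sketched as follows. This is a minimal Python illustration on six made-up bills. The column names month and donate come from the codebook, and the lookup table name SEASON_BY_MONTH is hypothetical. The resulting proportions are the bar heights for the requested chart; the plotting step is left to whatever graphing tool you are using.

```python
from collections import defaultdict

# Month number -> season, following the definition in the directions.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}

# Hypothetical bills (NOT real Utilities values).
bills = [
    {"month": 1, "donate": "yes"},
    {"month": 2, "donate": "no"},
    {"month": 4, "donate": "no"},
    {"month": 7, "donate": "yes"},
    {"month": 8, "donate": "yes"},
    {"month": 10, "donate": "no"},
]

for bill in bills:
    bill["season"] = SEASON_BY_MONTH[bill["month"]]
    bill["DonorStatus"] = 1 if bill["donate"] == "yes" else 0

# Group the 0/1 indicator by season; its mean is the proportion of donors.
status_by_season = defaultdict(list)
for bill in bills:
    status_by_season[bill["season"]].append(bill["DonorStatus"])

donor_proportion = {s: sum(v) / len(v) for s, v in status_by_season.items()}
print(donor_proportion)
```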

Question 1: What best describes the distribution of customers’ total monthly bill?

Question 2: What month has the highest average gas bill?

Question 3: What month has the highest average electric bill?

Question 4: Does there appear to be a relationship between kwh and gas bill? Does there appear to be a relationship between kwh and electric bill?

Question 5: Order the seasons from lowest proportion of donors to highest proportion of donors. Does there seem to be a relationship between season and donation?

Submit your answers here

CODEBOOK: Utilities

The data frame contains a random sample of 117 utility bills with the following variables.

month: month (coded as a number)
day: day of month on which bill was calculated
year: year of bill
temp: average temperature (F) for billing period
kwh: electricity usage (kwh)
ccf: gas usage (ccf)
thermsPerDay: a numeric vector
billingDays: number of billing days in billing period
totalbill: total bill (in dollars)
gasbill: gas bill (in dollars)
elecbill: electric bill (in dollars)
notes: notes about the billing period
donate: (yes or no) did the person add money to their bill to be donated to Operation Fuel – a charity providing heat to families/small businesses in need?

Assignment 5
Directions:

Load in the Utilities data set from Mini-assignment 4 (look at the codebook to familiarize yourself with the variables) and apply your data management code from the previous assignment that created your season variable.
Use an appropriate statistical test to determine whether there is a relationship between total bill and season. Answer Questions 1 and 2 below.
Apply post-hoc tests (if appropriate) to determine which seasons vary significantly. Answer Question 3 below.
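
When a quantitative response is compared across several groups, one candidate test is a one-way ANOVA. The sketch below computes its F statistic by hand in Python, assuming that is the test you settle on. The bills are made up, and the function name one_way_anova_f is hypothetical. The standard library has no F distribution, so the p-value would still have to come from a statistical package or an F table.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a dict of group -> list of values."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Between-group sum of squares: group size times squared distance of
    # each group mean from the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())
    # Within-group sum of squares: squared distance of each value from its group mean.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical total bills in dollars (NOT real Utilities values).
bills_by_season = {
    "winter": [200, 220, 240],
    "spring": [170, 190, 210],
    "summer": [140, 160, 180],
    "fall": [170, 190, 210],
}

f_stat, df_between, df_within = one_way_anova_f(bills_by_season)
print(f_stat, df_between, df_within)
```

A significant F only says that at least one season differs; post-hoc comparisons are what identify which pairs differ.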

Question 1: What is the test statistic of the test you selected?

Question 2: Is there a significant relationship between total bill and season?

Question 3: Which seasons (if any) have significantly different mean total bills?

Submit your answers here

Assignment 6
Directions:

Familiarize yourself with the CPS read me file.
Import/load the CPS data set.
Find the mean wage earned per hour for males and females. Answer Question 1 below.
Implement the appropriate statistical test to determine whether males have significantly higher wages earned per hour than females. Answer Question 2 below.
Implement the appropriate statistical test to determine whether there is a significant linear relationship between wages earned per hour and number of years of work experience. Answer Question 3 below.
Implement the appropriate statistical test to determine whether job satisfaction varies significantly among the different job sectors. Answer Question 4 below.
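
For the comparison of mean wages between two groups, a two-sample t statistic is one option. The sketch below uses made-up wages and the Welch (unequal variance) form. Whether the pooled or Welch version is expected for this course is an assumption left to you, and the function name welch_t is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical hourly wages (NOT real CPS values); keys follow the sex levels F and M.
wages = {
    "M": [10.0, 12.0, 14.0, 16.0],
    "F": [8.0, 9.0, 10.0, 11.0],
}

group_means = {sex: mean(values) for sex, values in wages.items()}

def welch_t(a, b):
    """Two-sample t statistic without assuming equal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

t_stat = welch_t(wages["M"], wages["F"])
print(group_means)
print(round(t_stat, 3))
```

Since the question asks whether males earn more (a directional claim), the p-value for this statistic would be one-sided.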

Question 1: What is the mean wage earned per hour for males and females?

Question 2: What is the value of the test statistic to answer whether males have significantly higher wages earned per hour than females? What conclusions can be drawn?

Question 3: What is the value of the test statistic to answer whether there is a significant linear relationship between wages earned per hour and number of years of work experience? What conclusions can be drawn?

Question 4: What is the value of the test statistic to determine whether job satisfaction varies significantly among the different job sectors? What conclusions can be drawn? Are post-hoc tests necessary?

Submit your answers here

CODEBOOK: CPS

This data comes from the 1985 Current Population Survey

Description

The Current Population Survey (CPS) is used to supplement census information between census years. These data consist of a random sample of persons from the survey, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership.

A data frame with 534 observations on the following variables.

wage: wage (US dollars per hour)
educ: number of years of education
race: a factor with levels NW (nonwhite) or W (white)
sex: a factor with levels F M
hispanic: a factor with levels Hisp NH
south: a factor with levels NS S
married: a factor with levels Married Single
exper: number of years of work experience (inferred from age and educ)
union: a factor with levels Not Union
age: age in years
sector: a factor with levels clerical const manag manuf other prof sales service
satisfaction: 1 implies participant satisfied with current employment, 0 otherwise

Assignment 7
Directions:

Re-familiarize yourself with the CPS CODEBOOK from Mini-assignment 6.
Import/load the CPS data set.
Suppose you want to test whether number of years of experience is significantly associated with wages earned per hour. During your last assignment, you found a sample correlation coefficient of r = 0.09 that was significantly different from 0. This indicated that there was a significant linear relationship between experience and wages.
Determine whether gender moderates the relationship between experience and wages earned per hour. Answer Question 1 below.
Now suppose you want to test whether job sector is significantly associated with wages earned per hour. Implement the appropriate statistical test and answer Question 2 below.
Determine whether race moderates the relationship between job sector and wages earned per hour. Answer Question 3 below.
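
To see why a correlation as small as r = 0.09 can still be significant with 534 observations, the sketch below converts r to its t statistic, which comes out at about 2.08 on 532 degrees of freedom, just above the two-sided 5% cutoff of roughly 1.96. The second half is a first, purely descriptive look at moderation: it fits the experience-wage slope separately within each gender on made-up data. A formal moderation test would use an interaction term in a regression model; the function names and the toy values are hypothetical.

```python
from math import sqrt
from statistics import mean

def t_from_r(r, n):
    """t statistic for testing zero correlation, given sample r and sample size n."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

# r and n as stated in the assignment text and the CPS codebook.
t_stat = t_from_r(0.09, 534)
print(f"t = {t_stat:.2f} on {534 - 2} degrees of freedom")

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical data (NOT real CPS values): same experience levels in both groups.
exper = [0, 10, 20, 30]
wage_by_gender = {
    "M": [8.0, 10.0, 12.0, 14.0],
    "F": [8.0, 8.5, 9.0, 9.5],
}

slopes = {g: slope(exper, w) for g, w in wage_by_gender.items()}
print(slopes)
```

If gender moderates the relationship, the slopes differ between groups, as they do by construction in this toy example.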

Question 1: Does gender moderate the relationship between experience and wages earned per hour? Why?

Question 2: What are the test statistic, p-value, and appropriate conclusion to answer the question “Is job sector significantly associated with wages earned per hour?”

Question 3: Does race moderate the relationship between job sector and wages earned per hour? Why?

Submit your answers here

Assignment 8
Directions:

Familiarize yourself with the read me file for the Saratoga Houses dataset
Import/load the Saratoga Houses dataset.
Determine whether price and number of bathrooms are significantly associated based on an appropriate bivariate testing procedure from the previous weeks. Answer Question 1 below.
Now construct a regression equation with price as your response variable and number of bathrooms as your explanatory variable. Build an appropriate model. Answer Questions 2-4 below.
Test the bivariate association between living area and price. Answer Question 5 below.
Test the bivariate association between living area and number of bathrooms. Answer Question 6 below.
Construct a multiple regression equation to test whether number of bathrooms is significantly associated with price after controlling for living area. Answer Question 7 below.
Familiarize yourself with the read me file for the Student Health dataset.
Import/load the Student Health dataset.
Determine the mean weight of students based on major. Answer Question 8 below.
Determine whether weight and major are significantly associated based on an appropriate bivariate testing procedure from the previous weeks. Answer Question 9 below.
Now construct a regression equation with weight as your response variable and major as your explanatory variable. Build an appropriate model. Answer Questions 10-11 below.
Test the bivariate association between gender and weight. Answer Question 12 below.
Test the bivariate association between gender and major. Answer Question 13 below.
Construct the appropriate model to determine whether weight and major are significantly associated after controlling for gender. Answer Question 14 below.
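
The idea of "controlling for" a possible confounder can be illustrated with a small sketch. The six houses below are made up so that price depends only on living area, while number of bathrooms merely tends to rise with living area. This is a constructed example, not a claim about the Saratoga data. The function names are hypothetical, and the two-predictor fit solves the centered normal equations by hand rather than calling a statistics package.

```python
from statistics import mean

def simple_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

def two_predictor_slopes(x1, x2, y):
    """OLS slopes (b1, b2) for y ~ x1 + x2 via the centered normal equations."""
    c1 = [v - mean(x1) for v in x1]
    c2 = [v - mean(x2) for v in x2]
    cy = [v - mean(y) for v in y]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, cy))
    s2y = sum(a * b for a, b in zip(c2, cy))
    det = s11 * s22 - s12 ** 2  # zero only if x1 and x2 are perfectly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

# Hypothetical houses (NOT real Saratoga values): price = 50000 + 100 * living area.
living_area = [1000, 1500, 1500, 2000, 2500, 3000]
baths = [1, 1, 2, 2, 3, 3]
price = [50000 + 100 * a for a in living_area]

unadjusted = simple_slope(baths, price)
adjusted_baths, adjusted_area = two_predictor_slopes(baths, living_area, price)
print(unadjusted, adjusted_baths, adjusted_area)
```

In this constructed example the unadjusted bathroom slope is large, yet it drops to essentially zero once living area is in the model. Whether the same happens in the real data is exactly what the multiple regression step asks you to check.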

Question 1: What is the corresponding p-value to test whether number of bathrooms and house price are significantly associated? What conclusion can be drawn?

Question 2: What is the appropriate regression equation? How can you specifically explain the relationship between number of bathrooms and house price?

Question 3: Why is the number of bathrooms significantly associated with house price?

Question 4: Why should living area be considered as a possible confounding variable?

Question 5: What is the corresponding p-value to test whether price and living area are significantly associated? What conclusion can be drawn?

Question 6: What is the corresponding p-value to test whether number of bathrooms and living area are significantly associated? What conclusion can be drawn?

Question 7: After controlling for living area, is there a significant association between number of bathrooms and house price?

Question 8: State the mean weight for nursing and engineering majors below.

Question 9: What are the corresponding test statistic and p-value to test whether major and weight are significantly associated? What conclusion can be drawn?

Question 10: Find an appropriate regression equation that uses weight as the response variable and major as the explanatory variable. What does the regression equation tell us about how the mean weight of Nursing majors and Engineering majors compare?

Question 11: Why is gender a reasonable or not reasonable choice to consider as a confounding variable?

Question 12: What are the corresponding test statistic and p-value to test whether gender and weight are significantly associated? What conclusion can be drawn?

Question 13: What are the corresponding test statistic and p-value to test whether gender and major are significantly associated? What conclusion can be drawn?

Question 14: Are major and weight significantly associated after controlling for gender?

Submit your answers here

CODEBOOK: Saratoga Houses

The dataset contains information on 1,063 houses in Saratoga County, New York, USA in 2006. The variables in the dataset include:

Price: Amount of house in US dollars
Living.Area: Square feet of house
Baths: Number of baths
Bedrooms: Number of bedrooms
Fireplace: “yes” or “no”
Acres: Number of acres on the property
Age: Age of house in years

CODEBOOK: Student Health

The following data comes from self-reported volunteer data of 100 students majoring in nursing and engineering at the University of Miami in 2014. The following data was recorded:

major: Either “Nursing” or “Engineering” based on sample collected
gender: Listed as “Male” or “Female”
sleep: Estimated number of hours slept on a typical night
depression: diagnosed as depressed (1) or not diagnosed as depressed (0)
weight: Student weight in pounds.
smedia: Estimated amount of time (in hours) spent on social media each day.

Assignment 9
Directions:

Familiarize yourself with the read me file for the Email dataset
Import/load the Email dataset.
In this assignment we will focus on predicting whether an email is spam or not. Determine the number of emails that make up this data set and the proportion of the emails that were spam. Answer Question 1 below.
We will start with a simple spam filter that will only use a single predictor “attachment” to classify a message as spam or not. Determine what proportion of emails with an attachment are spam and what proportion of emails without an attachment are spam. Answer Question 2 below.
Construct a chi-square test to determine whether there is an association between spam email and whether an email contained an attachment. Answer Question 3 below.
Now, fit an appropriate regression model between spam and attachment. Answer Question 4 below.
Find the odds ratios of model coefficients. Answer Question 5 below.
Construct an appropriate model that can be used to determine whether the association between whether an email is spam and whether it has an attachment is significant after controlling for the number of characters. Answer Question 6 below.
Find the odds ratios of model coefficients. Answer Question 7 below.
Construct an appropriate model with our response variable (spam) and explanatory variables (attachment, number of characters, and whether there is an exclamation point in the subject). Answer Question 8 below.
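
The proportion, chi-square, and odds ratio steps above can be sketched from a single 2x2 table. The counts below are made up and are not the Email data. The chi-square statistic is computed without a continuity correction, which is an assumption, since some software applies one by default for 2x2 tables. The last lines use the fact that, for a logistic regression with one binary predictor, the fitted slope equals the log of the sample odds ratio and the intercept equals the log odds in the reference group.

```python
from math import log

# Hypothetical counts (NOT the real Email data).
a, b = 20, 80    # emails with an attachment: spam, not spam
c, d = 80, 820   # emails without an attachment: spam, not spam

prop_spam_attach = a / (a + b)
prop_spam_no_attach = c / (c + d)

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table, no continuity correction."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat

chi2 = chi_square_2x2(a, b, c, d)

# Odds of spam with an attachment divided by odds of spam without one.
odds_ratio = (a / b) / (c / d)

# Coefficients a one-predictor logistic regression would recover on this table.
logit_slope = log(odds_ratio)
logit_intercept = log(c / d)

print(prop_spam_attach, prop_spam_no_attach)
print(chi2, odds_ratio, logit_slope, logit_intercept)
```

Going the other way, exponentiating a fitted logistic regression coefficient gives the corresponding odds ratio, which is what the "find the odds ratios" steps ask for.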

Question 1: How many emails make up this data set? What percent of the emails were spam?

Question 2: ____% of emails without an attachment were classified as spam, whereas _____ % of emails with an attachment were classified as spam.

Question 3: State the corresponding test statistic and p-value to test this association. What is your conclusion?

Question 4: What type of regression model is appropriate? Why?

Question 5: Interpret the odds ratio to compare emails with attachments to emails without attachments.

Question 6: Is there a significant association between whether an email is spam and whether the email has an attachment after controlling for the number of characters in the email message? What are the associated p-value and conclusion?

Question 7: What is the correct interpretation of the odds ratio corresponding to characters?

Question 8: Controlling for all other predictor variables, is whether a message is spam independently associated with whether there is an exclamation point in the subject? Why?

Submit your answers here

CODEBOOK: E-mail Data

Today we will be working with a corpus of emails received by a single Gmail account over the first three months of 2012. Just like any other email address, this account received and sent regular emails and also received a large amount of spam (unsolicited bulk email). We will be using what we have learned about logistic regression models to see if we can build a model that is able to predict whether or not a message is spam based on a variety of characteristics of the email (e.g. inclusion of words like winner, inherit, or password, the number of exclamation marks used, etc.). While the spam filters used by large corporations like Google and Microsoft are quite a bit more complex, the fundamental idea is the same: binary classification based on a set of predictors. The description of the data is as follows:

spam: Indicator for whether the email was spam.
tomultiple: Indicator for whether the email was addressed to more than one recipient.
from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
cc: Indicator for whether anyone was CCed.
sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
image: Indicates whether any images were attached.
attach: Indicates whether any files were attached.
dollar: Indicates whether a dollar sign or the word ‘dollar’ appeared in the email.
winner: Indicates whether “winner” appeared in the email.
inherent: Indicates whether “inherit” (or an extension, such as inheritance) appeared in the email.
password: Indicates whether “password” appeared in the email.
num_char: The number of characters in the email, in thousands.
line_breaks: The number of line breaks in the email (does not count text wrapping).
format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.
re_subj: Indicates whether the subject started with “Re:”, “RE:”, “re:”, or “rE”.
exclaim_subj: Indicates whether there was an exclamation point in the subject.
urgent_subj: Indicates whether the word “urgent” was in the email subject.
exclaim_mess: The number of exclamation points in the email message.
number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.