Week 2 (Activity A) Description

A marketing consultant observed 50 consecutive shoppers at a grocery store, and recorded how much money each shopper spent in the store. The dataset is listed below.


The Data

First things first: create a new RStudio Project. Using a separate project for each assignment is probably best. Don’t worry about version control unless you either a) plan on working on these assignments across multiple devices, or b) already know and are comfortable with Git or Subversion.

You’ll need to use this activityA.csv data for your assignment. The data is also displayed below.

Variable Name: spending = amount spent in USD. The R code further down imports the data by loading the file remotely. Alternatively, you can download the file, keep a local copy, and use dfa <- read.csv("activityA.csv", header = TRUE), though this assumes your .csv is in the project’s working directory.

Why are we calling it dfa? Well, df is often used as an abbreviation for data frame, which is what our new csv/table is called in R. It’s dfa because it’s the data frame for Activity A. It’s important to name data and variables meaningfully.

Note: If you want to reference the content in the spending column, you’ll need to reference the data frame AND the variable: dfa$spending. Remember: dfa refers to the data frame, spending refers to the column/variable in that data frame, and the $ tells R the relationship (i.e., spending is a variable in the dfa data frame). Simple as that!
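
As a small illustration of the $ notation, here is a sketch using a toy data frame built from the first five values in the dataset. The name toy is made up for this example and is not part of the assignment.

```r
# A toy data frame holding the first five spending values (hypothetical name: toy).
toy <- data.frame(spending = c(2.32, 6.61, 6.90, 8.04, 9.45))

# The $ operator pulls one column out of the data frame as a plain numeric vector.
toy$spending

# Double square brackets with the column name return the same vector.
toy[["spending"]]

# names() lists the columns you can reach with $.
names(toy)
```

Once the real data is loaded, swapping toy for dfa gives the same kind of result with all 50 values.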

# Load the packages used to display the table.
library(knitr)       # provides kable()
library(kableExtra)  # provides kable_styling() and scroll_box(); also makes the %>% pipe available

# Create the data frame called dfa.
dfa <- read.csv(url("https://302.ryanstraight.com/activityA.csv"), header = TRUE) # This loads the data from the remote .csv file and saves it in our environment.

# Display our newly found data.
kable(dfa, caption = "Spending") %>% # This displays the data frame we've just created as a nice looking table. You could also simply type dfa. Try them both out.
  kable_styling(bootstrap_options = c("striped", "hover")) %>% # This uses the kableExtra package and places alternately colored stripes for each row, easing readability.
  scroll_box(width = "50%", height = "500px") # This puts the kable in a scrolling window.
Spending
spending
2.32
6.61
6.90
8.04
9.45
10.26
11.34
11.63
12.66
12.95
13.67
13.72
14.35
14.52
14.55
15.01
15.33
16.55
17.15
18.22
18.30
18.71
19.54
19.55
20.58
20.89
20.91
21.13
23.85
26.04
27.07
28.76
29.15
30.54
31.99
32.82
33.26
33.80
34.76
36.22
37.52
39.28
40.80
43.97
45.58
52.36
61.57
63.85
64.30
69.49

Want your code to appear in the document along with the results? (Hint: for this class, you should!) Unless instructed otherwise, make sure the knitr::opts_chunk$set() line in the r setup chunk has echo = TRUE. If that line reads echo = FALSE, switch it to TRUE.
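
For reference, that line usually sits in the first chunk of the .Rmd file. A minimal sketch of what the chunk's contents would look like with code display switched on (this assumes the knitr package is installed, which it needs to be for knitting anyway):

```r
# Goes inside the setup chunk at the top of the .Rmd file.
# echo = TRUE tells knitr to print each chunk's code along with its results.
knitr::opts_chunk$set(echo = TRUE)
```

Individual chunks can still override this in their own chunk header if a later assignment asks you to hide the code.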


Assignment

For this activity, create and submit a document with the following (doing the coding in an R script file and then putting that code in an RMarkdown file is required!):

  1. Summarize the data by creating and describing the following descriptive statistics:
    1. mean
    2. median
    3. standard deviation
    4. interquartile range
    5. (optional) any other descriptive statistics you find interesting
  2. Show with a histogram that the distribution of the data, although slightly skewed with a long right tail, is approximately normally distributed.
  3. It’s easiest to write your code in a .R file (called an R script file) so you can easily test it while working. Then, when you’ve got everything above taken care of, create a .Rmd file (RMarkdown) and use that to present your data rather than simply turning in code and the results. Here’s a great write-up on how code from an R script can be used in an R Markdown file.
    1. For this assignment and all others, having read the introduction to RMarkdown page is absolutely key.
    2. This is very likely going to take some trial and error. Set aside 2-3 times the amount of time you think this will take to account for fixing errors and debugging. R code is relatively straightforward and easy to use, but it can be somewhat intimidating to the beginner. You’re encouraged to read through most of the R Markdown book as it will make things much easier on you in the long run. When in doubt: copy example code that works and tweak it to your specifications.
  4. Submitting the assignment:
    1. Submit both your Rmd and your PDF to the Activity A dropbox in the LMS by the stated due date and time.
    2. Remember: the point of using this file system is reproducibility. If I can’t see the content, you won’t get credit for it. That sounds obvious, right? This is why a PDF is important: if you just knit your Rmd file to HTML, you may be referencing local files in that page, files that I may not have in the same location as you. So: Rmd AND PDF submissions, please!
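
Step 3 mentions using code from an R script in an R Markdown file. One possible way to do that is base R's source() function, which runs a script file and leaves the objects it creates in your environment. The sketch below writes a tiny stand-in script to a temporary folder so the example runs on its own. The file name activityA.R and the object names are hypothetical; in practice you would source the script you actually wrote.

```r
# Create a stand-in script so this example is self-contained.
# (Hypothetical file name; your own script would already exist in the project.)
script_path <- file.path(tempdir(), "activityA.R")
writeLines(
  c(
    "spending <- c(2.32, 6.61, 6.90)",
    "spending_mean <- mean(spending)"
  ),
  script_path
)

# In an R Markdown chunk, source() runs the script and creates its objects.
source(script_path)

# The objects defined in the script are now available to the rest of the document.
spending_mean
```

This is only one approach; the write-up linked in step 3 may describe a different one.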

Results

We can find each of these respective descriptives using individual commands, such as mean(dfa$spending), median(dfa$spending), sd(dfa$spending), and IQR(dfa$spending). We can even get a very basic histogram with the hist(dfa$spending) command.
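
Put together, those commands might look like the following sketch. To keep it runnable on its own, the 50 values from the dataset are typed in directly rather than loaded from the .csv. The descriptives object is just an illustrative name.

```r
# The 50 spending values from the dataset, entered by hand for this sketch.
dfa <- data.frame(spending = c(
   2.32,  6.61,  6.90,  8.04,  9.45, 10.26, 11.34, 11.63, 12.66, 12.95,
  13.67, 13.72, 14.35, 14.52, 14.55, 15.01, 15.33, 16.55, 17.15, 18.22,
  18.30, 18.71, 19.54, 19.55, 20.58, 20.89, 20.91, 21.13, 23.85, 26.04,
  27.07, 28.76, 29.15, 30.54, 31.99, 32.82, 33.26, 33.80, 34.76, 36.22,
  37.52, 39.28, 40.80, 43.97, 45.58, 52.36, 61.57, 63.85, 64.30, 69.49
))

# Collect the four required descriptives in one named vector.
descriptives <- c(
  mean   = mean(dfa$spending),
  median = median(dfa$spending),
  sd     = sd(dfa$spending),
  iqr    = IQR(dfa$spending)
)

# Round to two decimals for display.
round(descriptives, 2)
```

The mean comes out larger than the median, which is what you would expect from data with a long right tail.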

We could simply use an inline R chunk like this: `r mean(dfa$spending)`. This would show up in our RMarkdown files as 25.8364. Remember, the point of using RMarkdown is that you don’t need to work out the answer and then paste it somewhere else. The inline R code lets you place the calculated answer wherever you want, and it updates whenever the data changes. Add a few more rows to the bottom of the activityA.csv and that mean calculation will change without you having to do a thing.
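
The unrounded value is rarely what you want in a sentence. One option, sketched below, is to wrap the calculation in round() or sprintf() inside the inline chunk. The value 25.8364 is typed in here as a stand-in for mean(dfa$spending) so the example runs without the data loaded.

```r
# Stand-in for mean(dfa$spending), using the value reported above.
spending_mean <- 25.8364

# round() returns a number rounded to two decimal places.
round(spending_mean, 2)

# sprintf() returns text and always keeps two decimals, which suits dollar amounts.
sprintf("%.2f", spending_mean)
```

In the document itself, the same idea would go between the backticks of the inline chunk, for example r round(mean(dfa$spending), 2).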

That said, there are other ways to go about doing this.

Summary

Here we have a summary of the spending data in two forms: descriptives and a table of relevant information like valid cases, missing cases, and a brief histogram. (You will need to install the summarytools package before this chunk will work.)

# First, load the library.
library(summarytools)    # That's how you make sure this chunk will run even if you run it all on its own.

# Next, display the summary.
print(descr(dfa), method = 'render', style = 'rmarkdown', table.classes = 'st-small')

Descriptive Statistics

spending

N: 50
spending
Mean 25.84
Std.Dev 16.15
Min 2.32
Q1 14.35
Median 20.73
Q3 33.80
Max 69.49
MAD 12.99
IQR 19.27
CV 0.63
Skewness 1.04
SE.Skewness 0.34
Kurtosis 0.38
N.Valid 50
Pct.Valid 100.00

Generated by summarytools 0.9.4 (R version 3.5.3)
2019-10-02

Data Frame Summary

dfa

Dimensions: 50 x 1
Duplicates: 0
No: 1
Variable: dfa [numeric]
Stats / Values: Mean (sd): 25.8 (16.2); min < med < max: 2.3 < 20.7 < 69.5; IQR (CV): 19.3 (0.6)
Freqs (% of Valid): 50 distinct values
Graph: (small histogram)
Valid: 50 (100%)
Missing: 0 (0%)

Generated by summarytools 0.9.4 (R version 3.5.3)
2019-10-02
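
The chunk shown earlier only includes the call for the descriptives. The Data Frame Summary looks like output from the summarytools dfSummary() function, so a call along these lines would presumably produce it. This is an assumption, since the original code for that table is not shown. The toy data frame below stands in for dfa.

```r
library(summarytools)

# Stand-in data frame (first five spending values); use dfa in the real document.
toy <- data.frame(spending = c(2.32, 6.61, 6.90, 8.04, 9.45))

# Build the data frame summary, then render it as HTML for a knitted document.
toy_summary <- dfSummary(toy)
print(toy_summary, method = "render")
```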

Histogram

Though there is a small histogram displayed with the dfSummary output, we can plot a larger one to see it better. Using R’s built-in histogram function, hist:

hist(dfa$spending)
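
The default plot is fairly bare. As a sketch of what base R can do without extra packages, the version below adds a title, an axis label, explicit bins, and a dashed line at the mean. The 5-dollar bin width is an arbitrary choice for illustration, and the data is typed in by hand so the example runs on its own.

```r
# The 50 spending values from the dataset, entered by hand for this sketch.
dfa <- data.frame(spending = c(
   2.32,  6.61,  6.90,  8.04,  9.45, 10.26, 11.34, 11.63, 12.66, 12.95,
  13.67, 13.72, 14.35, 14.52, 14.55, 15.01, 15.33, 16.55, 17.15, 18.22,
  18.30, 18.71, 19.54, 19.55, 20.58, 20.89, 20.91, 21.13, 23.85, 26.04,
  27.07, 28.76, 29.15, 30.54, 31.99, 32.82, 33.26, 33.80, 34.76, 36.22,
  37.52, 39.28, 40.80, 43.97, 45.58, 52.36, 61.57, 63.85, 64.30, 69.49
))

# hist() draws the plot and also returns the bin breaks and counts.
h <- hist(
  dfa$spending,
  breaks = seq(0, 70, by = 5),  # bins 5 dollars wide (an arbitrary choice)
  main = "Spending histogram",
  xlab = "Amount spent (USD)",
  col = "lightblue"
)

# Mark the mean with a dashed vertical line.
abline(v = mean(dfa$spending), lty = 2, col = "blue")

# Counts per bin, useful when describing the long right tail in words.
h$counts
```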

If we like, we can use the ggplot2 package to create a more informative and visually pleasing one with labels, a different fill color, and a mean line:

library(ggplot2)
# Build the x-axis label in R; inline `r ...` chunks are not evaluated inside a code chunk.
x_label <- paste0("Spent (USD), mean ", round(mean(dfa$spending), 2))
ggplot(dfa, aes(x = spending)) +
  geom_histogram(binwidth = 2.5, color = "black", fill = "lightblue") +
  geom_vline(xintercept = mean(dfa$spending), color = "blue", linetype = "dashed") +
  labs(title = "Spending histogram plot", x = x_label, y = "Count") +
  theme_classic()


Copyright © 2019 Ryan Straight. All rights reserved.