This activity explores **correlation** and shows you how to create a scatterplot using R.

Using content from this website (note that we’re using RStudio, not Excel), here is the context for this activity:

A researcher was interested in whether there was an association between communication skills and quality of peer relationships in third grade classrooms. Teachers in each class completed a communicative skills checklist and a rating scale of peer relations for each child. The items for each scale were averaged to provide an overall score for each child.

You’ll need to use this **correlation.csv** data for your assignment.

Variable names:

- `ID` = case identification number
- `comm` = communication skills
- `peers` = quality of peer relations

Import the data using whatever method you prefer, either the code below or the **Import Dataset** option in the **Environment** tab. It’s easiest to simply use the code below as it will point to the file on the server, taking its location out of the equation for your own code.

Protip: load this data frame as **dfb** (it’s the data for Activity B, so… **dfb** but, really, you can call it whatever you like) and create values for the variables using the following commands:

```
dfb <- read.csv(url("https://302.ryanstraight.com/correlation.csv"), header = TRUE)
comm <- dfb$comm
peers <- dfb$peers
```
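Whichever import method you use, it is worth checking that the data frame has the columns you expect before going further. The sketch below uses a tiny stand-in data frame named `toy`, built from inline text with made-up values (both the name and the values are assumptions for illustration; the real file has 200 cases). The same three checks should work on **dfb**.

```r
# Hypothetical stand-in rows, NOT the real correlation.csv contents.
csv_text <- "ID,comm,peers
1,4.0,3.8
2,6.5,5.2
3,2.0,3.0
4,5.0,4.8"

# read.csv() accepts inline text through the `text` argument.
toy <- read.csv(text = csv_text, header = TRUE)

str(toy)              # column names and types
head(toy)             # first few rows
colSums(is.na(toy))   # number of missing values per column
```

Swap `toy` for `dfb` to run the same checks on the assignment data.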

Install the **summarytools** and **ggplot2** packages in RStudio. This is the one we looked at while reviewing Activity A’s solutions.
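If you would rather install from the console than from the **Packages** tab, something like the following should work. It is a minimal sketch that assumes you have an internet connection and a default CRAN mirror configured; the variable names `pkgs` and `to_install` are just illustrative.

```r
# Install only the packages that are not already available, then load both.
pkgs <- c("summarytools", "ggplot2")
to_install <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]
if (length(to_install) > 0) install.packages(to_install)

library(summarytools)
library(ggplot2)
```

You only need to install a package once, but `library()` has to be called in every new R session.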

- Scatterplot in RStudio
- How to make a scatterplot in R (with regression line)
- Plotting with ggplot2 to get the basics on how to produce the plots you need below.

This activity goes well beyond simply displaying frequencies and descriptives for common concepts like means and medians. You will need to dig around the R documentation for the commands to best answer the questions below. For this activity, create and submit a document that includes *and answers* the following:

- A **scatterplot** that displays the relationship between the **peers** and **comm** variables. Answer the following:
  - What does this chart tell you about the relationship between the two variables?
  - What direction is this association?
  - How did you determine this?
  - If you had to identify the association, would you label it *small*, *moderate*, or *strong*? Why?

- Determine the *Pearson correlation* for the **peers** and **comm** variables.
  - Interpret this number.
  - What is the strength of the association?
  - What is the direction?
  - Like the scatterplot above, do you think it is small, moderate, or strong?
  - How does your interpretation of the Pearson correlation coefficient compare to that of the scatterplot?
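To see what the Pearson correlation actually measures, it can help to compute it by hand once. The sketch below uses two short made-up vectors, `toy_comm` and `toy_peers` (hypothetical values, not the assignment data), and compares R's built-in `cor()` with the definition: the covariance divided by the product of the two standard deviations.

```r
# Hypothetical toy values for illustration only.
toy_comm  <- c(2, 3, 4, 5, 6, 7)
toy_peers <- c(3, 2.5, 4, 4.5, 6, 5.5)

# Built-in version; cor() uses the Pearson method by default.
r_builtin <- cor(toy_comm, toy_peers)

# By definition: covariance scaled by both standard deviations.
r_manual <- cov(toy_comm, toy_peers) / (sd(toy_comm) * sd(toy_peers))

all.equal(r_builtin, r_manual)  # TRUE: the two calculations agree
sign(r_builtin)                 # +1 or -1 gives the direction of the association
```

The sign tells you the direction and the absolute value tells you the strength, which is what the questions above ask you to interpret.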

- Determine R²
  - Compute the square of the correlation coefficient you previously calculated.
  - Interpret this value. What does it indicate about the association?
  - Write a statement about the meaning of the R-squared (R²) value in terms of the variables.
  - How does R² compare to what you saw in the scatterplot and the Pearson correlation coefficient?
  - Do you think this is a more valuable statistic? Why?
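With one predictor and an intercept, the squared Pearson correlation and the R² reported by `lm()` are the same number, and it does not matter which variable you treat as the outcome. Here is a minimal sketch of that identity using made-up vectors (`toy_comm` and `toy_peers` are hypothetical, not the assignment data):

```r
# Hypothetical toy values for illustration only.
toy_comm  <- c(2, 3, 4, 5, 6, 7)
toy_peers <- c(3, 2.5, 4, 4.5, 6, 5.5)

# Square of the correlation coefficient.
r2_from_r <- cor(toy_comm, toy_peers)^2

# R-squared from a simple linear regression, in both directions.
fit_peers_on_comm <- lm(toy_peers ~ toy_comm)
fit_comm_on_peers <- lm(toy_comm ~ toy_peers)

all.equal(r2_from_r, summary(fit_peers_on_comm)$r.squared)  # TRUE
all.equal(r2_from_r, summary(fit_comm_on_peers)$r.squared)  # TRUE
```

The slopes of the two models differ, but the proportion of shared variance is identical either way.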

Your RStudio results should look like the following. I’ve displayed the R code chunks before the results.

First, let’s get some descriptives of the data. Remember that `ID` is simply a case identifier and isn’t important for this.

```
library(summarytools)
descr(dfb, style = "rmarkdown")
```

```
## ### Descriptive Statistics
##
## | | comm | ID | peers |
## |----------------:|-------:|-------:|-------:|
## | **Mean** | 5.01 | 100.50 | 4.82 |
## | **Std.Dev** | 1.61 | 57.88 | 1.27 |
## | **Min** | 1.00 | 1.00 | 1.00 |
## | **Q1** | 4.00 | 50.50 | 4.00 |
## | **Median** | 5.00 | 100.50 | 4.80 |
## | **Q3** | 6.50 | 150.50 | 5.80 |
## | **Max** | 7.00 | 200.00 | 7.00 |
## | **MAD** | 1.85 | 74.13 | 1.33 |
## | **IQR** | 2.50 | 99.50 | 1.80 |
## | **CV** | 0.32 | 0.58 | 0.26 |
## | **Skewness** | -0.48 | 0.00 | -0.38 |
## | **SE.Skewness** | 0.17 | 0.17 | 0.17 |
## | **Kurtosis** | -0.71 | -1.22 | -0.15 |
## | **N.Valid** | 200.00 | 200.00 | 200.00 |
## | **Pct.Valid** | 100.00 | 100.00 | 100.00 |
```

Let’s take a closer look.

```
library(summarytools)
dfSummary(dfb, plain.ascii = FALSE, style = "grid",
graph.magnif = 0.75, valid.col = FALSE)
```

Using the simple base R **plot** command:

```
plot(comm, peers, # plot the variables
xlab="Communication", # x axis label
ylab="Peer relationships") # y axis label
abline(lm(peers ~ comm)) # draw the trend line (y ~ x, matching the plot axes)
```

Using the more enjoyable **ggplot2** command that provides a nicer looking scatterplot *and* confidence intervals around the trend line:

```
library(ggplot2)
ggplot(dfb, aes(x=comm, y=peers)) +
geom_point(shape=1) +
geom_smooth(method=lm)
```

Pearson’s correlation coefficient: 0.5194169

`cor(comm, peers)`

`## [1] 0.5194169`

To determine the R² we’ll need to examine the coefficients a bit more closely. It’s easier if we create an object in RStudio to do just that:

```
model1 <- lm(comm~peers)
summary(model1)
```

```
##
## Call:
## lm(formula = comm ~ peers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4562 -0.9695 0.1863 1.0161 2.9949
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.82838 0.38462 4.754 3.84e-06 ***
## peers 0.65961 0.07712 8.553 3.26e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.383 on 198 degrees of freedom
## Multiple R-squared: 0.2698, Adjusted R-squared: 0.2661
## F-statistic: 73.16 on 1 and 198 DF, p-value: 3.258e-15
```
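As a quick arithmetic check, squaring the Pearson coefficient reported earlier should reproduce the **Multiple R-squared** line in the model summary. This sketch only uses the number printed above; the names `r` and `r2` are just illustrative.

```r
r  <- 0.5194169   # Pearson correlation reported by cor(comm, peers)
r2 <- r^2         # proportion of variance shared by the two variables

round(r2, 4)        # 0.2698, matching Multiple R-squared in the summary
round(100 * r2, 1)  # the same value as a percentage, about 27
```

If you have the fitted model in your session, `summary(model1)$r.squared` pulls the same value directly from the model object.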

Copyright © 2019 Ryan Straight. All rights reserved.