Hi everyone, my name is Keana, and I’m a second-year grad student working with Coren Apicella. For my research, I focus on factors that may affect gender differences in willingness to compete. If you’d like to hear more about my work (I’m more than happy to talk about it!), feel free to talk to me after this workshop or shoot me an email. Today, I’ll be leading you through a workshop on data management, which will mainly focus on data cleaning, arguably the most time-consuming part of analyzing your data.
I started learning R in my last year of undergrad (so it was spring/summer of 2017). I first tried using DataCamp and swirl to learn it. But it’s hard to really force yourself to practice unless you’re working with real data, so I’d recommend either taking a class (where you are forced to do homework and projects regularly - there are some good courses in the statistics department that I have been taking) or just forcing yourself to analyze your data completely in R (without any help from Excel). I did this and I learned so much faster - fair warning, though, you will have to be patient with it.
Before we get started on writing our code for today’s session, I want to really quickly review what you learned last week during the intro to R workshop. Does anyone have any specific questions about any of these concepts they would like for me to address? For each of these, please feel free to shout out any questions that come to your mind.
directories
workspace
scripts
the save button
installing packages
basic R syntax
basic analyses
1 ggplot example
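As a quick refresher on the first few items in that list, here is a minimal sketch using only base R. The object names (`scores`, `avg`) and the numbers are made up for illustration, and the `setwd()` and `install.packages()` lines are commented out so nothing changes on your computer unless you choose to run them.

```r
# Directories: where is R currently reading and writing files?
getwd()
# setwd("path/to/your/project")  # hypothetical path; swap in your own folder

# Workspace: objects you create live here until you remove them
scores <- c(4, 8, 15, 16, 23, 42)  # made-up numbers, just for practice
ls()                               # lists the objects in your workspace

# Installing vs. loading packages: install once, load every session
# install.packages("ggplot2")      # uncomment the first time only
# library(ggplot2)

# Basic syntax and a basic analysis
avg <- mean(scores)
avg
summary(scores)
```

If you type these into the script panel and run them line by line, you can watch `scores` and `avg` show up in the Environment tab.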
So one thing I think is really important for trying to work in R is being familiar with the R environment. So I’ll quickly refresh your memory on some important parts of the display here. By default, on the left-hand side of the screen, we have two separate sections. The top box is where you can write and save your script for later use. Note: you can still run the code and get output up here, it’s just that your code will also be saved. In the bottom box, you can run code and get output, but it will not be saved. So my personal recommendation is to always write in the top box and save your script at regular intervals. Next, we have the panels on the right. There are a few tabs, but some are less useful than others:
Any questions?
Alright, so let’s review one other thing that I definitely found confusing when I first started learning R: graphing with ggplot.
library(ggplot2)
graph <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))
## what does this "graph" look like before we add anything else?
graph
## essentially a template for us to add the data and any other settings we want to set. You NEED to do this every time. Then you add on to it (literally, by using the addition symbol). But you need a basic structure on which to build (think of it like building the foundation of a house and adding on to that). Can't build without the foundation though.
## first we have to tell it we want a scatterplot.
graph + geom_point()
### Seems like there's some separation here, as if there are two different groups that have different slopes etc. Do you remember what groups we defined last time that seemed to be driving this separation?
## Species! Let's add to our graph a setting that differentiates between species based on color AND shape.
graph + geom_point(aes(color=Species, shape=Species))
## so here, each species has a unique color and shape. We could also just use color as their identifiers in the key:
graph + geom_point(aes(color=Species))
## Now each species still has its own color, but they all share the same shape (the default shape is a circle, but you can change that with the shape argument of geom_point() if you have a preference for another shape)
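For example, here is one way to change that default shape. This is just a small sketch: the shape code 17 (a filled triangle), the size of 3, and the object name `triangles` are values I picked for illustration, not anything we settled on last week.

```r
library(ggplot2)

# same template as before
graph <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# shape and size sit OUTSIDE aes() because they are fixed settings,
# not tied to a column in the data; color stays INSIDE aes()
# because it depends on Species
triangles <- graph + geom_point(aes(color = Species), shape = 17, size = 3)
triangles
```

The rule of thumb is that anything that should vary with the data goes inside `aes()`, and anything that should be the same for every point goes outside it.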
## One other thing we did was add axis labels and a title, like so:
graph + xlab("Sepal Length") + ylab("Sepal Width") +
ggtitle("Sepal Length-Width")
### mention saving a graph here: notice how it got rid of the points, even though it changed the title and axis labels. Does anyone know why?
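Here is a small sketch of what is going on with the disappearing points, and of how to save a finished graph. The object names `graph_final` and `out_file` are hypothetical choices for this demo, and I'm saving to a temporary file; in your own project you would give `ggsave()` a real file name in your working directory.

```r
library(ggplot2)

graph <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# graph + geom_point() draws a plot but does not change `graph` itself,
# so the stored template still has no layers (no points)
length(graph$layers)

# To keep what you add, assign the result to an object
graph_final <- graph +
  geom_point(aes(color = Species, shape = Species)) +
  xlab("Sepal Length") + ylab("Sepal Width") +
  ggtitle("Sepal Length-Width")
length(graph_final$layers)

# Save the stored plot to a file (a temporary file here, just for the demo)
out_file <- tempfile(fileext = ".png")
ggsave(out_file, plot = graph_final, width = 6, height = 4)
file.exists(out_file)
```

Passing `plot = graph_final` explicitly means you don't have to rely on `ggsave()` picking up whatever plot was displayed most recently.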
## finally, ggplot2 has built-in themes that make formatting much easier, so if you're a person like me who doesn't know anything about aesthetics etc, you don't have to think too hard. So let's tack on a theme and put it all together.
graph + theme_minimal() +
  geom_point(aes(color = Species, shape = Species)) +
  xlab("Sepal Length") + ylab("Sepal Width") +
  ggtitle("Sepal Length-Width")