R is something I’ve been interested in for a while. “For why???” you ask. Idk, partly for the data analysis, partly for the statistics, and, I’ll admit it, I’m a sucker for good data visualizations.
So what am I using to do all this R stuff??? Well, download R itself first (an error message will pop up if you only download R Studio without it), and then download R Studio.
After you’ve downloaded all that, start up R Studio and make a new R Script file.
NOTE: To execute a specific line, highlight that line within R Studio, and then press Ctrl+Enter.
Some of the nicer things I noticed about the R programming language are:
- Not having to declare a variable type:

    # Creating variables
    l <- TRUE
    i <- 123L
    n <- 123.45
    c <- "ABC 123"
I like not having to declare a variable’s type all the time. When you’ve typed “Integer” and “Boolean” thousands of times, it’s just tedious and burdensome. As a longtime programmer you just inherently know (off the top of your head) what that variable is.
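If you ever do want to double-check what R decided a variable is, here’s a minimal sketch using base R’s class() on the same four variables:

```r
# Same assignments as in the snippet above; R infers each type from the value.
l <- TRUE
i <- 123L      # the L suffix makes this an integer rather than a double
n <- 123.45
c <- "ABC 123"

# class() reports the type R picked for each variable.
class(l)  # "logical"
class(i)  # "integer"
class(n)  # "numeric"
class(c)  # "character"
```
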
- Easy to install and load libraries:

    # Installing packages (command line)
    install.packages("ggplot2")

    # Loading libraries
    library(ggplot2)
This goes without saying… in comparison to my least favorite language to load packages in (cough Python). I really like how simple and easy it is to load a package.
- It’s easy to mess around with tables in this language:

    # Creating a data frame
    df <- data.frame(
      Name = c("Cat", "Dog", "Cow", "Pig"),
      HowMany = c(5, 10, 15, 20),
      IsPet = c(TRUE, TRUE, FALSE, FALSE))
    df

    # Subsetting data frames (rows go before the comma, columns after)
    df[c(2, 4), ]
    df[2:4, ]
    df[c(TRUE, FALSE, TRUE, FALSE), ]
    df[df$IsPet == TRUE, ]
    df[df$HowMany > 10, ]
    df[df$Name %in% c("Cat", "Cow"), ]
The data.frame() function is a useful way to create a table, and then there are so many different ways to get subsets of those tables. There’s just a never-ending smorgasbord of functions to perform on those subsets.
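To give a taste of that smorgasbord, here’s a small sketch that rebuilds the same data frame and runs a few base R functions on subsets of it. The `pets` variable is just a name picked for this example.

```r
# Same data frame as above.
df <- data.frame(
  Name    = c("Cat", "Dog", "Cow", "Pig"),
  HowMany = c(5, 10, 15, 20),
  IsPet   = c(TRUE, TRUE, FALSE, FALSE)
)

# Take a subset, then run ordinary functions on it.
pets <- df[df$IsPet, ]
nrow(pets)                              # 2 (Cat and Dog)
sum(pets$HowMany)                       # 15
mean(df[df$HowMany > 10, "HowMany"])    # 17.5

# subset() is a base R alternative to the bracket syntax.
big_groups <- subset(df, HowMany > 10, select = c(Name, HowMany))
big_groups
```
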
Let’s dive right into inputting a .txt file into R, and then outputting that info as a .csv file (after it has been sanitized):
NOTE: My .txt file was relatively tame, and I don’t wanna get into all the minutiae of the formatting of that .txt file.
Set your working directory:

    # Set working directory
    # (use forward slashes or doubled backslashes; a single \ is an escape character in R strings)
    setwd("<your C:/ path goes here>")
Use the read.table() function:

    # Load data from tab-delimited file
    movies <- read.table(
      file = "Movies.txt",
      sep = "\t",
      header = TRUE,
      quote = "\"")
Now output your table as a .csv file using the write.csv() function:

    # Save data in a CSV file
    write.csv(tableVariable, "tableName.csv")
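Here’s the whole round trip as one runnable sketch. The file contents are made up for illustration (the original Movies.txt isn’t shown), and everything is written to a temporary folder instead of a working directory so nothing on disk gets overwritten.

```r
# Work in a temporary folder (an assumption for this sketch; setwd() works too).
tmp <- tempdir()
txt_path <- file.path(tmp, "Movies.txt")
csv_path <- file.path(tmp, "Movies.csv")

# Hypothetical tab-delimited input: made-up rows, not the original file.
writeLines(
  c("Title\tYear",
    "\"Movie A\"\t2001",
    "\"Movie B\"\t2002"),
  txt_path
)

# Load data from the tab-delimited file.
movies <- read.table(file = txt_path, sep = "\t", header = TRUE, quote = "\"")

# row.names = FALSE keeps R's 1, 2, 3... row labels out of the CSV.
write.csv(movies, csv_path, row.names = FALSE)

# Read the CSV back in to check that the data survived the trip.
check <- read.csv(csv_path)
check
```

By default write.csv() adds a column of row names, so row.names = FALSE is worth knowing about if you only want your own columns in the output.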
So after looking around online, I decided to make a web scraper with R. Well, what should we scrape? The movie Doctor Strange is about to come out, so why don’t we scrape the IMDB page of Doctor Strange (off topic: I think the CGI for this film is being done in the building I work in???).
First off you’re gonna have to familiarize yourself with a Chrome Extension called SelectorGadget. There’s a video here that will help you familiarize yourself with the tool. We’re gonna use this tool to find the following:
1. The IMDB score of Doctor Strange
2. The names of the cast of Doctor Strange
3. The usernames and messages off the message board at the bottom of the page
So let’s start writing the R script. You’re gonna need to load rvest along with the XML packages it needs (otherwise an error gets thrown every time you try running an rvest function).
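The original package list isn’t shown here, so treat this as a guess at the setup rather than the exact script. It assumes a current rvest, which brings in its own XML parsing dependency (xml2) automatically; if you hit the error described above, installing and loading the XML package as well is the fix the text points to.

```r
# One-time install (uncomment if rvest isn't installed yet).
# install.packages("rvest")

# rvest provides the scraping functions and also makes the %>% pipe available.
library(rvest)

# Assumption: only needed on setups where rvest complains about missing XML support.
# library(XML)
```
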
With that, let’s load the Doctor Strange webpage into an object using read_html(). (If you look around online you might get told to use the deprecated html() function; don’t use it.)
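A minimal sketch of that step. To keep it runnable offline, it parses a tiny stand-in HTML string instead of the live page; the markup and the names `html_source` and `page` are made up for this example. For the real thing you’d pass the IMDB page’s URL to read_html() instead.

```r
library(rvest)

# Stand-in HTML (made up) so the sketch runs without a network connection.
# For the real page: page <- read_html("<the IMDB page URL goes here>")
html_source <- "<html><head><title>Doctor Strange</title></head><body><p>placeholder</p></body></html>"

# read_html() accepts a URL, a file path, or a literal HTML string.
page <- read_html(html_source)

# The result is a parsed document you can query with the other rvest functions.
class(page)
```
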
Using the SelectorGadget tool, I determined that the CSS selector for the IMDB score is “strong span”. We’re gonna use the function html_node() to find the first html tidbit that matches “strong span”. Then we use html_text() to extract the text from that html. Finally we convert that text into a numeric value.
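Those three steps can be sketched like this. The HTML is a made-up stand-in (including the score value), so the only thing taken from the real page is the “strong span” selector.

```r
library(rvest)

# Made-up stand-in for the part of the page that holds the score.
page <- read_html("<html><body><strong><span>8.1</span></strong> out of 10</body></html>")

score <- page %>%
  html_node("strong span") %>%  # first element matching the selector
  html_text() %>%               # its text content, as a character string
  as.numeric()                  # convert the string to a number

score
```
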
NOTE: From what I’ve seen, the pipe operator (%>%) just moves the value on its left into the first argument of the function being called on its right.
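A quick sketch of what that means in practice, using the magrittr package (where %>% comes from). The variable names are just for illustration.

```r
library(magrittr)  # provides %>%

x <- c(4, 9, 16)

# These two are the same call: the left side becomes the first argument.
a <- x %>% sqrt()
b <- sqrt(x)

# Extra arguments stay where they are; x is slotted in ahead of them.
r <- x %>% round(digits = 1)    # same as round(x, digits = 1)

# A dot puts the left side somewhere other than the first argument.
d <- 2 %>% seq(1, 10, by = .)   # same as seq(1, 10, by = 2)
```
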
We can get the cast in a like fashion, but this time we will use html_nodes() instead of html_node(). It extracts every html tidbit that matches the selector “.itemprop .itemprop”, not just the first one.
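Here’s a sketch of the cast step. Again the HTML is a made-up stand-in with placeholder names; only the “.itemprop .itemprop” selector comes from the text above.

```r
library(rvest)

# Made-up stand-in for the cast list markup.
page <- read_html('<html><body><table>
  <tr><td class="itemprop"><a><span class="itemprop">Actor One</span></a></td></tr>
  <tr><td class="itemprop"><a><span class="itemprop">Actor Two</span></a></td></tr>
</table></body></html>')

cast <- page %>%
  html_nodes(".itemprop .itemprop") %>%  # every match, not just the first
  html_text()

cast
```

The selector means “an element with class itemprop inside another element with class itemprop”, which is why only the inner spans are picked up.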
Finally, let’s get the messages as a table, with columns for the message and the username. On your own, try putting in any number from 1 to N (the number of tables on the page) inside the brackets (.[]) to scroll through the many different tables on the webpage.
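A sketch of the table step, assuming the messages live in an ordinary html table. The two tables, the usernames, and the messages below are all made up, and the double-bracket form `.[[2]]` is one way to pick a single table out of the set.

```r
library(rvest)

# Made-up page with two tables; pretend the second one is the message board.
page <- read_html('<html><body>
  <table><tr><th>Other</th></tr><tr><td>ignore me</td></tr></table>
  <table>
    <tr><th>User</th><th>Message</th></tr>
    <tr><td>user_a</td><td>First made-up message</td></tr>
    <tr><td>user_b</td><td>Second made-up message</td></tr>
  </table>
</body></html>')

# All tables on the page; length(tables) is the N you can scroll through.
tables <- page %>% html_nodes("table")
length(tables)

# Pick one table by position and convert it to a data frame.
messages <- tables %>%
  .[[2]] %>%
  html_table()

messages
```
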
R is pretty interesting. I can see why people would want to use this language; it’s got some good stuff. Try using it on your own for something cool and then tell me about it @ email@example.com!!!