This document addresses one solution to the data-wrangling exercise described in Exercise3.pdf.
The data.table
format does not seem to display consistently – or I’m missing something – within the usual nb.html
format. The html
version here may be prettier. The R Notebook with executable code – the Rmd file – should be downloadable from the pulldown at upper right of the nb.html
file or can be downloaded separately here: Exercise3-datatableSolution.Rmd.
The input to the exercise is the raw data on 2016 Centre County, PA, precinct level votes found in “CentreCountyPrecinctResults2016GeneralElection.txt”, which were retrieved from http://centrecountypa.gov/Index.aspx?NID=802. (Update - 2020: This link is no longer accurate, and the original file does not appear to be available there any more. You can, however, find it on the Internet Archive’s Wayback Machine, for example, here: http://web.archive.org/web/20181106161514/centrecountypa.gov/index.aspx?NID=802. The file is no longer available due to the website being “updated” to provide the data in even less accessible form (pdf).)
The exercise asks you to extract the data on votes cast by precinct in statewide elections, and process them into a new table with precinct level data on total votes, Democratic share of two-party vote, and ballot rolloff from presidential votes to votes in other statewide races.
This solution uses the R data.table
library. For alternative solutions see https://burtmonroe.github.io/SoDA501/Exercise3-2018.
data.table
We will proceed with the same general strategy as with the tidyverse solution, creating three tables – total votes, rolloffs, and two party shares – that we then merge in the final step.
First we read in the raw data (this time using the fread
command to read into a data.table
), and look at the first 50 rows.
library(data.table)
rawdata.dt <- fread("CentreCountyPrecinctResults2016GeneralElection.txt") # 3520 rows, 16 columns
dim(rawdata.dt)
## [1] 3520 16
rawdata.dt
These data have, roughly, 39 rows for each of 91 precincts in Centre County, PA, identified by the variables PrecNo
(which ranges from 0001
to 0091
) and PrecName
(which ranges from 01 BELLEFONTE NORTH
to 91 FERGUSON NORTH CENTRAL
). Each precinct starts with three rows for summary information, identified in the Contest
variable by values REGISTERED VOTERS - TOTAL
, BALLOTS CAST - TOTAL
, VOTER TURNOUT - TOTAL
, followed by five rows with information on straight ticket voting for each of five parties (which can be ignored for our purposes here). From there, each row contains information for a single candidate in a particular contest. We care specifically about the statewide contests that are held in every precinct here: President (Contest=="PRESIDENTIAL ELECTORS"
), US Senator (Contest=="UNITED STATES SENATOR"
), Attorney General (Contest=="ATTORNEY GENERAL"
), Auditor General (Contest=="AUDITOR GENERAL"
), and Treasurer (Contest=="STATE TREASURER"
). All of our calculations are based on the number in the Count
variable.
This solution creates three data.tables that it joins together in the last step: the total votes, the two-party shares, and the rolloffs. Each of these data.tables should ultimately have 91 rows.
Creating the total vote table requires just picking the right rows and relabeling the Count variable.
Tot.dt <- rawdata.dt[Contest=="BALLOTS CAST - TOTAL",.(PrecNo,PrecName, Tot=Count)]
Tot.dt
As with the tidyverse solution, there are two nontrivial steps here. The first is the grouped summary step (the third assignment below), where vote counts are summed by Precinct Number and Contest. This is more or less the core process in data.table
, so its implementation is very compact. The second is the dcast
command which does the equivalent of what pivot_wider
(spread
) does in the tidyverse. The notation here uses a formula (here PrecNo ~ Con
) to define the “key” on which the table is cast.
# Create data.table with just vars PrecNo, Count, and Con (abbreviated Contest)
NeededColumnsRows.dt <- rawdata.dt[,.(PrecNo,Count,Con=substr(Contest,1,3))] # pick columns, abbreviate
NeededColumnsRows.dt <- NeededColumnsRows.dt[Con %in% c("PRE","UNI","ATT","AUD","STA"),] # pick rows
NeededColumnsRows.dt # 2093 rows, 3 columns
# Create grouped summary data.table with Total votes by Precinct-Contest
PrecinctContestsLong.dt <- NeededColumnsRows.dt[,.(ConTot = sum(Count)), by=.(PrecNo,Con)]
PrecinctContestsLong.dt <- PrecinctContestsLong.dt[,.(PrecNo,Con,ConTot)] # could be "chained" w above
PrecinctContestsLong.dt # 455 rows, 6 columns
# "Cast" the data by Contest (spread from long to wide)
PrecinctContestsWide.dt <- dcast(PrecinctContestsLong.dt, PrecNo ~ Con, value.var = "ConTot")
PrecinctContestsWide.dt # 91 rows, 6 columns
# Calculate Rolloff Variables
Rolloffs.dt <- PrecinctContestsWide.dt[,.(PrecNo, # Keep Precinct Number
ROSen=100*(1-UNI/PRE), # Rolloff for US Senator
ROAtt=100*(1-ATT/PRE), # Rolloff for Attorney General
ROAud=100*(1-AUD/PRE), # Rolloff for Auditor General
ROTre=100*(1-STA/PRE))] # Rolloff for Treasurer
Rolloffs.dt
Now we merge
the tables and format the Precinct Number and Name as requested in the Exercise.
Ex3Data.dt <- merge(Tot.dt,Rolloffs.dt, by="PrecNo") # Merge Total with Rolloffs
Ex3Data.dt <- merge(Ex3Data.dt,DemTwoPartyShares.dt, by="PrecNo") # Merge that with Dem 2-party Shares
Ex3Data.dt[,PrecNo := as.integer(PrecNo)] # Make Precinct number a number
Ex3Data.dt[,PrecName := substr(PrecName,4,stop=40L)] # Strip duplicate info from Name
Ex3Data.dt