This document addresses one solution to the data-wrangling exercise described in Exercise3.pdf.

The data.table format does not seem to display consistently – or I’m missing something – within the usual nb.html format. The html version here may be prettier. The R Notebook with executable code – the Rmd file – should be downloadable from the pulldown at upper right of the nb.html file or can be downloaded separately here: Exercise3-datatableSolution.Rmd.

The exercise asks you to extract the data on votes cast by precinct in statewide elections, and process them into a new table with precinct level data on total votes, Democratic share of two-party vote, and ballot rolloff from presidential votes to votes in other statewide races.

Solving with `data.table`

We will proceed with the same general strategy as with the tidyverse solution, creating three tables – total votes, rolloffs, and two party shares – that we then merge in the final step.

First we read in the raw data (this time using the fread command to read into a data.table), and look at the first 50 rows.

library(data.table)
rawdata.dt <- fread("CentreCountyPrecinctResults2016GeneralElection.txt") # 3520 rows, 16 columns
dim(rawdata.dt)

## [1] 3520   16

rawdata.dt

These data have, roughly, 39 rows for each of 91 precincts in Centre County, PA, identified by the variables PrecNo (which ranges from 0001 to 0091) and PrecName (which ranges from 01 BELLEFONTE NORTH to 91 FERGUSON NORTH CENTRAL). Each precinct starts with three rows for summary information, identified in the Contest variable by values REGISTERED VOTERS - TOTAL, BALLOTS CAST - TOTAL, VOTER TURNOUT - TOTAL, followed by five rows with information on straight ticket voting for each of five parties (which can be ignored for our purposes here). From there, each row contains information for a single candidate in a particular contest. We care specifically about the statewide contests that are held in every precinct here: President (Contest=="PRESIDENTIAL ELECTORS"), US Senator (Contest=="UNITED STATES SENATOR"), Attorney General (Contest=="ATTORNEY GENERAL"), Auditor General (Contest=="AUDITOR GENERAL"), and Treasurer (Contest=="STATE TREASURER"). All of our calculations are based on the number in the Count variable.

Solve in three pieces

This solution creates three data.tables that it joins together in the last step: the total votes, the two-party shares, and the rolloffs. Each of these data.tables should ultimately have 91 rows.

Table 1: Total votes by precinct

Creating the total vote table requires just picking the right rows and relabeling the Count variable.

Tot.dt <- rawdata.dt[Contest=="BALLOTS CAST - TOTAL",.(PrecNo,PrecName, Tot=Count)]
Tot.dt

Table 2: Rolloffs in down-ballot races

As with the tidyverse solution, there are two nontrivial steps here. The first is the grouped summary step (the third assignment below), where vote counts are summed by Precinct Number and Contest. This is more or less the core process in data.table, so its implementation is very compact. The second is the dcast command which does the equivalent of what pivot_wider (spread) does in the tidyverse. The notation here uses a formula (here PrecNo ~ Con) to define the “key” on which the table is cast.

# Create data.table with just vars PrecNo, Count, and Con (abbreviated Contest)
NeededColumnsRows.dt <- rawdata.dt[,.(PrecNo,Count,Con=substr(Contest,1,3))] # pick columns, abbreviate
NeededColumnsRows.dt <- NeededColumnsRows.dt[Con %in% c("PRE","UNI","ATT","AUD","STA"),] # pick rows
NeededColumnsRows.dt   #  2093 rows, 3 columns

# Create grouped summary data.table with Total votes by Precinct-Contest 
PrecinctContestsLong.dt <- NeededColumnsRows.dt[,.(ConTot = sum(Count)), by=.(PrecNo,Con)]
PrecinctContestsLong.dt <- PrecinctContestsLong.dt[,.(PrecNo,Con,ConTot)] # could be "chained" w above
PrecinctContestsLong.dt     # 455 rows, 6 columns

# "Cast" the data by Contest (spread from long to wide)
PrecinctContestsWide.dt <- dcast(PrecinctContestsLong.dt, PrecNo ~ Con, value.var = "ConTot")
PrecinctContestsWide.dt     # 91 rows, 6 columns

# Calculate Rolloff Variables
Rolloffs.dt <- PrecinctContestsWide.dt[,.(PrecNo,                 # Keep Precinct Number
                                          ROSen=100*(1-UNI/PRE),  # Rolloff for US Senator
                                          ROAtt=100*(1-ATT/PRE),  # Rolloff for Attorney General
                                          ROAud=100*(1-AUD/PRE),  # Rolloff for Auditor General 
                                          ROTre=100*(1-STA/PRE))] # Rolloff for Treasurer
Rolloffs.dt

Table 3: Democratic share of two-party vote

In the tidyverse version, we had to create a single column of Contest-Party indicators to act as a key. With data.table we can use two keys, just listing them on the right side of the formula (here, PrecNo ~ Con + Pty), and the keys from all pairings are constructed automatically. So this is a couple of commands more compact in data.table.

# Create data.table with just vars PrecNo, Count, and Con (abbreviated Contest)
NeededColumnsRows.dt <- rawdata.dt[,.(PrecNo,Pty=substr(Party,1,3),Con=substr(Contest,1,3), Count)] # pick columns, abbreviate
NeededColumnsRows.dt <- NeededColumnsRows.dt[Pty %in% c("DEM","REP") & Con %in% c("PRE","UNI","ATT","AUD","STA"),] # pick rows
NeededColumnsRows.dt   #  910 rows, 3 columns

# "Cast" the data by Party & Contest (spread from long to wide)
#    Note that data.table has no problem automatically creating a key based on two columns
PrecPartyContestsWide.dt <- dcast(NeededColumnsRows.dt, PrecNo ~ Con + Pty, value.var = "Count")
PrecPartyContestsWide.dt     # 91 rows, 6 columns

# Calculate Democratic Two-Party Share Variables
DemTwoPartyShares.dt <- PrecPartyContestsWide.dt[,.(PrecNo,                 # Keep Precinct Number
                         D2Pre=100*(PRE_DEM/(PRE_DEM+PRE_REP)),# D2Pre = Dem 2 party share, President
                         D2Sen=100*(UNI_DEM/(UNI_DEM+UNI_REP)),# D2Pre = Dem 2 party share, US Senator
                         D2Att=100*(ATT_DEM/(ATT_DEM+ATT_REP)),# D2Pre = Dem 2 party share, Attorney Gen
                         D2Aud=100*(AUD_DEM/(AUD_DEM+AUD_REP)),# D2Pre = Dem 2 party share, Auditor Gen
                         D2Tre=100*(STA_DEM/(STA_DEM+STA_REP)))]# D2Pre = Dem 2 party share, State Treas
DemTwoPartyShares.dt

Final output table: Merged Data

Now we merge the tables and format the Precinct Number and Name as requested in the Exercise.

Ex3Data.dt <- merge(Tot.dt,Rolloffs.dt, by="PrecNo")              # Merge Total with Rolloffs
Ex3Data.dt <- merge(Ex3Data.dt,DemTwoPartyShares.dt, by="PrecNo") # Merge that with Dem 2-party Shares
Ex3Data.dt[,PrecNo := as.integer(PrecNo)]                         # Make Precinct number a number
Ex3Data.dt[,PrecName := substr(PrecName,4,stop=40L)]              # Strip duplicate info from Name
Ex3Data.dt

SoDA 501, Exercise 3 - data.table Solution

Burt L. Monroe

Solving with `data.table`

Solve in three pieces

Table 1: Total votes by precinct

Table 2: Rolloffs in down-ballot races

Final output table: Merged Data

SoDA 501, Exercise 3 - data.table Solution

Burt L. Monroe

Solving with data.table

Solve in three pieces

Table 1: Total votes by precinct

Table 2: Rolloffs in down-ballot races

Table 3: Democratic share of two-party vote

Final output table: Merged Data

Solving with `data.table`