This document addresses one solution to the data-wrangling exercise described in Exercise3.pdf.
The data.table
format does not seem to display consistently – or I’m missing something – within the usual nb.html
format. The html
version here may be prettier. The R Notebook with executable code – the Rmd file – should be downloadable from the pulldown at upper right of the nb.html
file or can be downloaded separately here: Exercise3-datatableSolution.Rmd.
The input to the exercise is the raw data on 2016 Centre County, PA, precinct level votes found in “CentreCountyPrecinctResults2016GeneralElection.txt”, which were retrieved from http://centrecountypa.gov/Index.aspx?NID=802. (Update - 2020: This link is no longer accurate, and the original file does not appear to be available there any more. You can, however, find it on the Internet Archive’s Wayback Machine, for example, here: http://web.archive.org/web/20181106161514/centrecountypa.gov/index.aspx?NID=802. The file is no longer available due to the website being “updated” to provide the data in even less accessible form (pdf).)
The exercise asks you to extract the data on votes cast by precinct in statewide elections, and process them into a new table with precinct level data on total votes, Democratic share of two-party vote, and ballot rolloff from presidential votes to votes in other statewide races.
This solution uses the R data.table
library. For alternative solutions see https://burtmonroe.github.io/SoDA501/Exercise3-2018.
Solving with data.table
We will proceed with the same general strategy as with the tidyverse solution, creating three tables – total votes, rolloffs, and two party shares – that we then merge in the final step.
First we read in the raw data (this time using the fread
command to read into a data.table
), and look at the first 50 rows.
library(data.table)
rawdata.dt <- fread("CentreCountyPrecinctResults2016GeneralElection.txt") # 3520 rows, 16 columns
dim(rawdata.dt)
[1] 3520 16
County
1: CENTRE COUNTY, PENNSYLVANIA
2: CENTRE COUNTY, PENNSYLVANIA
3: CENTRE COUNTY, PENNSYLVANIA
4: CENTRE COUNTY, PENNSYLVANIA
5: CENTRE COUNTY, PENNSYLVANIA
---
3516: CENTRE COUNTY, PENNSYLVANIA
3517: CENTRE COUNTY, PENNSYLVANIA
3518: CENTRE COUNTY, PENNSYLVANIA
3519: CENTRE COUNTY, PENNSYLVANIA
3520: CENTRE COUNTY, PENNSYLVANIA
ElecYear Election
1: 2016 GENERAL ELECTION
2: 2016 GENERAL ELECTION
3: 2016 GENERAL ELECTION
4: 2016 GENERAL ELECTION
5: 2016 GENERAL ELECTION
---
3516: 2016 GENERAL ELECTION
3517: 2016 GENERAL ELECTION
3518: 2016 GENERAL ELECTION
3519: 2016 GENERAL ELECTION
3520: 2016 GENERAL ELECTION
ElecDate PrecNo
1: NOVEMBER 8 2016 1
2: NOVEMBER 8 2016 1
3: NOVEMBER 8 2016 1
4: NOVEMBER 8 2016 1
5: NOVEMBER 8 2016 1
---
3516: NOVEMBER 8 2016 91
3517: NOVEMBER 8 2016 91
3518: NOVEMBER 8 2016 91
3519: NOVEMBER 8 2016 91
3520: NOVEMBER 8 2016 91
PrecName
1: 01 BELLEFONTE NORTH
2: 01 BELLEFONTE NORTH
3: 01 BELLEFONTE NORTH
4: 01 BELLEFONTE NORTH
5: 01 BELLEFONTE NORTH
---
3516: 91 FERGUSON NORTH CENTRAL
3517: 91 FERGUSON NORTH CENTRAL
3518: 91 FERGUSON NORTH CENTRAL
3519: 91 FERGUSON NORTH CENTRAL
3520: 91 FERGUSON NORTH CENTRAL
Party
1:
2:
3:
4: DEMOCRATIC
5: REPUBLICAN
---
3516: DEMOCRATIC
3517: REPUBLICAN
3518:
3519:
3520:
Contest
1: REGISTERED VOTERS - TOTAL
2: BALLOTS CAST - TOTAL
3: VOTER TURNOUT - TOTAL
4: STRAIGHT PARTY
5: STRAIGHT PARTY
---
3516: REPRESENTATIVE IN THE GENERAL ASSEMBLY 171ST LEGISLATIVE
3517: REPRESENTATIVE IN THE GENERAL ASSEMBLY 171ST LEGISLATIVE
3518: REPRESENTATIVE IN THE GENERAL ASSEMBLY 171ST LEGISLATIVE
3519: JUDGES AGE
3520: JUDGES AGE
Descr
1: NA
2: NA
3: NA
4: NA
5: NA
---
3516: NA
3517: NA
3518: NA
3519: NA
3520: NA
VoteFor
1:
2:
3:
4: VOTE FOR NOT MORE THAN 1
5: VOTE FOR NOT MORE THAN 1
---
3516: VOTE FOR NOT MORE THAN 1
3517: VOTE FOR NOT MORE THAN 1
3518: VOTE FOR NOT MORE THAN 1
3519: VOTE FOR NOT MORE THAN 1
3520: VOTE FOR NOT MORE THAN 1
Candidate Posn
1: 10
2: 11
3: 12
4: DEMOCRATIC 17
5: REPUBLICAN 18
---
3516: MELODY S FLECK 7700
3517: KERRY A BENNINGHOFF 7701
3518: WRITE-IN 7702
3519: YES 7707
3520: NO 7708
Count PctCnt Total PctTot
1: 507 NA NA NA
2: 391 NA NA NA
3: NA 77.12 NA NA
4: 46 40.35 NA NA
5: 66 57.89 NA NA
---
3516: 301 43.81 NA NA
3517: 385 56.04 NA NA
3518: 1 0.15 NA NA
3519: 317 53.64 NA NA
3520: 274 46.36 NA NA
These data have, roughly, 39 rows for each of 91 precincts in Centre County, PA, identified by the variables PrecNo
(which ranges from 0001
to 0091
) and PrecName
(which ranges from 01 BELLEFONTE NORTH
to 91 FERGUSON NORTH CENTRAL
). Each precinct starts with three rows for summary information, identified in the Contest
variable by values REGISTERED VOTERS - TOTAL
, BALLOTS CAST - TOTAL
, VOTER TURNOUT - TOTAL
, followed by five rows with information on straight ticket voting for each of five parties (which can be ignored for our purposes here). From there, each row contains information for a single candidate in a particular contest. We care specifically about the statewide contests that are held in every precinct here: President (Contest=="PRESIDENTIAL ELECTORS"
), US Senator (Contest=="UNITED STATES SENATOR"
), Attorney General (Contest=="ATTORNEY GENERAL"
), Auditor General (Contest=="AUDITOR GENERAL"
), and Treasurer (Contest=="STATE TREASURER"
). All of our calculations are based on the number in the Count
variable.
Solve in three pieces
This solution creates three data.tables that it joins together in the last step: the total votes, the two-party shares, and the rolloffs. Each of these data.tables should ultimately have 91 rows.
Table 1: Total votes by precinct
Creating the total vote table requires just picking the right rows and relabeling the Count variable.
Tot.dt <- rawdata.dt[Contest=="BALLOTS CAST - TOTAL",.(PrecNo,PrecName, Tot=Count)]
Tot.dt
Table 2: Rolloffs in down-ballot races
As with the tidyverse solution, there are two nontrivial steps here. The first is the grouped summary step (the third assignment below), where vote counts are summed by Precinct Number and Contest. This is more or less the core process in data.table
, so its implementation is very compact. The second is the dcast
command which does the equivalent of what pivot_wider
(spread
) does in the tidyverse. The notation here uses a formula (here PrecNo ~ Con
) to define the “key” on which the table is cast.
# Create data.table with just vars PrecNo, Count, and Con (abbreviated Contest)
NeededColumnsRows.dt <- rawdata.dt[,.(PrecNo,Count,Con=substr(Contest,1,3))] # pick columns, abbreviate
NeededColumnsRows.dt <- NeededColumnsRows.dt[Con %in% c("PRE","UNI","ATT","AUD","STA"),] # pick rows
NeededColumnsRows.dt # 2093 rows, 3 columns
# Create grouped summary data.table with Total votes by Precinct-Contest
PrecinctContestsLong.dt <- NeededColumnsRows.dt[,.(ConTot = sum(Count)), by=.(PrecNo,Con)]
PrecinctContestsLong.dt <- PrecinctContestsLong.dt[,.(PrecNo,Con,ConTot)] # could be "chained" w above
PrecinctContestsLong.dt # 455 rows, 6 columns
# "Cast" the data by Contest (spread from long to wide)
PrecinctContestsWide.dt <- dcast(PrecinctContestsLong.dt, PrecNo ~ Con, value.var = "ConTot")
PrecinctContestsWide.dt # 91 rows, 6 columns
# Calculate Rolloff Variables
Rolloffs.dt <- PrecinctContestsWide.dt[,.(PrecNo, # Keep Precinct Number
ROSen=100*(1-UNI/PRE), # Rolloff for US Senator
ROAtt=100*(1-ATT/PRE), # Rolloff for Attorney General
ROAud=100*(1-AUD/PRE), # Rolloff for Auditor General
ROTre=100*(1-STA/PRE))] # Rolloff for Treasurer
Rolloffs.dt
Table 3: Democratic share of two-party vote
In the tidyverse
version, we had to create a single column of Contest-Party indicators to act as a key. With data.table
we can use two keys, just listing them on the right side of the formula (here, PrecNo ~ Con + Pty
), and the keys from all pairings are constructed automatically. So this is a couple of commands more compact in data.table.
# Create data.table with just vars PrecNo, Count, and Con (abbreviated Contest)
NeededColumnsRows.dt <- rawdata.dt[,.(PrecNo,Pty=substr(Party,1,3),Con=substr(Contest,1,3), Count)] # pick columns, abbreviate
NeededColumnsRows.dt <- NeededColumnsRows.dt[Pty %in% c("DEM","REP") & Con %in% c("PRE","UNI","ATT","AUD","STA"),] # pick rows
NeededColumnsRows.dt # 910 rows, 3 columns
# "Cast" the data by Party & Contest (spread from long to wide)
# Note that data.table has no problem automatically creating a key based on two columns
PrecPartyContestsWide.dt <- dcast(NeededColumnsRows.dt, PrecNo ~ Con + Pty, value.var = "Count")
PrecPartyContestsWide.dt # 91 rows, 6 columns
# Calculate Democratic Two-Party Share Variables
DemTwoPartyShares.dt <- PrecPartyContestsWide.dt[,.(PrecNo, # Keep Precinct Number
D2Pre=100*(PRE_DEM/(PRE_DEM+PRE_REP)),# D2Pre = Dem 2 party share, President
D2Sen=100*(UNI_DEM/(UNI_DEM+UNI_REP)),# D2Pre = Dem 2 party share, US Senator
D2Att=100*(ATT_DEM/(ATT_DEM+ATT_REP)),# D2Pre = Dem 2 party share, Attorney Gen
D2Aud=100*(AUD_DEM/(AUD_DEM+AUD_REP)),# D2Pre = Dem 2 party share, Auditor Gen
D2Tre=100*(STA_DEM/(STA_DEM+STA_REP)))]# D2Pre = Dem 2 party share, State Treas
DemTwoPartyShares.dt
Final output table: Merged Data
Now we merge
the tables and format the Precinct Number and Name as requested in the Exercise.
Ex3Data.dt <- merge(Tot.dt,Rolloffs.dt, by="PrecNo") # Merge Total with Rolloffs
Ex3Data.dt <- merge(Ex3Data.dt,DemTwoPartyShares.dt, by="PrecNo") # Merge that with Dem 2-party Shares
Ex3Data.dt[,PrecNo := as.integer(PrecNo)] # Make Precinct number a number
Ex3Data.dt[,PrecName := substr(PrecName,4,stop=40L)] # Strip duplicate info from Name
Ex3Data.dt
---
title: "SoDA 501, Exercise 3 - data.table Solution"
author: "Burt L. Monroe"
output:
  html_document:
    df_print: paged
    toc: yes
  html_notebook:
    code_folding: show
    df_print: paged
    highlight: tango
    theme: united
    toc: yes
---


This document addresses one solution to the data-wrangling exercise described in [Exercise3.pdf](https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018/Exercise3.pdf).

The `data.table` format does not seem to display consistently -- or I'm missing something -- within the usual `nb.html` format. The `html` version [here](https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018/datatableSolution/Exercise3-datatableSolution.html) may be prettier. The R Notebook with executable code -- the Rmd file -- should be downloadable from the pulldown at upper right of the `nb.html` file or can be downloaded separately here: [Exercise3-datatableSolution.Rmd](https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018/datatableSolution/Exercise3-datatableSolution.Rmd).  

The input to the exercise is the raw data on 2016 Centre County, PA, precinct level votes found in ["CentreCountyPrecinctResults2016GeneralElection.txt"](https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018/CentreCountyPrecinctResults2016GeneralElection.txt), which were retrieved from [http://centrecountypa.gov/Index.aspx?NID=802](http://centrecountypa.gov/Index.aspx?NID=802). (Update - 2020: This link is no longer accurate, and the original file does not appear to be available there any more. You can, however, find it on the Internet Archive's Wayback Machine, for example, here: [http://web.archive.org/web/20181106161514/centrecountypa.gov/index.aspx?NID=802](http://web.archive.org/web/20181106161514/centrecountypa.gov/index.aspx?NID=802). The file is no longer available due to the website being "updated" to provide the data in even less accessible form (pdf).)


The exercise asks you to extract the data on votes cast by precinct in statewide elections, and process them into a new table with precinct level data on total votes, Democratic share of two-party vote, and ballot rolloff from presidential votes to votes in other statewide races.

This solution uses the R `data.table` library. For alternative solutions see [https://burtmonroe.github.io/SoDA501/Exercise3-2018](https://burtmonroe.github.io/SoDA501/Exercise3-2018).

# Solving with `data.table`

We will proceed with the same general strategy as with the tidyverse solution, creating three tables -- total votes, rolloffs, and two party shares -- that we then merge in the final step.

First we read in the raw data (this time using the `fread` command to read into a `data.table`), and look at the first 50 rows.

```{r}
library(data.table)
rawdata.dt <- fread("CentreCountyPrecinctResults2016GeneralElection.txt") # 3520 rows, 16 columns
dim(rawdata.dt)
rawdata.dt
```

These data have, roughly, 39 rows for each of 91 precincts in Centre County, PA, identified by the variables `PrecNo` (which ranges from `0001` to `0091`) and `PrecName` (which ranges from `01 BELLEFONTE NORTH` to `91 FERGUSON NORTH CENTRAL`). Each precinct starts with three rows for summary information, identified in the `Contest` variable by values `REGISTERED VOTERS - TOTAL`, `BALLOTS CAST - TOTAL`, `VOTER TURNOUT - TOTAL`, followed by five rows with information on straight ticket voting for each of five parties (which can be ignored for our purposes here). From there, each row contains information for a single candidate in a particular contest. We care specifically about the statewide contests that are held in every precinct here: President (`Contest=="PRESIDENTIAL ELECTORS"`), US Senator (`Contest=="UNITED STATES SENATOR"`), Attorney General (`Contest=="ATTORNEY GENERAL"`), Auditor General (`Contest=="AUDITOR GENERAL"`), and Treasurer (`Contest=="STATE TREASURER"`). All of our calculations are based on the number in the `Count` variable.

## Solve in three pieces

This solution creates three data.tables that it joins together in the last step: the total votes, the two-party shares, and the rolloffs. Each of these data.tables should ultimately have 91 rows.

### Table 1: Total votes by precinct

Creating the total vote table requires just picking the right rows and relabeling the Count variable.

```{r}
Tot.dt <- rawdata.dt[Contest=="BALLOTS CAST - TOTAL",.(PrecNo,PrecName, Tot=Count)]
Tot.dt
```

### Table 2: Rolloffs in down-ballot races

As with the tidyverse solution, there are two nontrivial steps here. The first is the grouped summary step (the third assignment below), where vote counts are summed by Precinct Number and Contest. This is more or less the core process in `data.table`, so its implementation is very compact. The second is the `dcast` command which does the equivalent of what `pivot_wider` (`spread`) does in the tidyverse. The notation here uses a formula (here `PrecNo ~ Con`) to define the "key" on which the table is cast.

```{r}
# Create data.table with just vars PrecNo, Count, and Con (abbreviated Contest)
NeededColumnsRows.dt <- rawdata.dt[,.(PrecNo,Count,Con=substr(Contest,1,3))] # pick columns, abbreviate
NeededColumnsRows.dt <- NeededColumnsRows.dt[Con %in% c("PRE","UNI","ATT","AUD","STA"),] # pick rows
NeededColumnsRows.dt   #  2093 rows, 3 columns

# Create grouped summary data.table with Total votes by Precinct-Contest 
PrecinctContestsLong.dt <- NeededColumnsRows.dt[,.(ConTot = sum(Count)), by=.(PrecNo,Con)]
PrecinctContestsLong.dt <- PrecinctContestsLong.dt[,.(PrecNo,Con,ConTot)] # could be "chained" w above
PrecinctContestsLong.dt     # 455 rows, 6 columns

# "Cast" the data by Contest (spread from long to wide)
PrecinctContestsWide.dt <- dcast(PrecinctContestsLong.dt, PrecNo ~ Con, value.var = "ConTot")
PrecinctContestsWide.dt     # 91 rows, 6 columns

# Calculate Rolloff Variables
Rolloffs.dt <- PrecinctContestsWide.dt[,.(PrecNo,                 # Keep Precinct Number
                                          ROSen=100*(1-UNI/PRE),  # Rolloff for US Senator
                                          ROAtt=100*(1-ATT/PRE),  # Rolloff for Attorney General
                                          ROAud=100*(1-AUD/PRE),  # Rolloff for Auditor General 
                                          ROTre=100*(1-STA/PRE))] # Rolloff for Treasurer
Rolloffs.dt
```

### Table 3: Democratic share of two-party vote

In the `tidyverse` version, we had to create a single column of Contest-Party indicators to act as a key. With `data.table` we can use two keys, just listing them on the right side of the formula (here, `PrecNo ~ Con + Pty`), and the keys from all pairings are constructed automatically. So this is a couple of commands more compact in `data.table.`

```{r}
# Create data.table with just vars PrecNo, Count, and Con (abbreviated Contest)
NeededColumnsRows.dt <- rawdata.dt[,.(PrecNo,Pty=substr(Party,1,3),Con=substr(Contest,1,3), Count)] # pick columns, abbreviate
NeededColumnsRows.dt <- NeededColumnsRows.dt[Pty %in% c("DEM","REP") & Con %in% c("PRE","UNI","ATT","AUD","STA"),] # pick rows
NeededColumnsRows.dt   #  910 rows, 3 columns

# "Cast" the data by Party & Contest (spread from long to wide)
#    Note that data.table has no problem automatically creating a key based on two columns
PrecPartyContestsWide.dt <- dcast(NeededColumnsRows.dt, PrecNo ~ Con + Pty, value.var = "Count")
PrecPartyContestsWide.dt     # 91 rows, 6 columns

# Calculate Democratic Two-Party Share Variables
DemTwoPartyShares.dt <- PrecPartyContestsWide.dt[,.(PrecNo,                 # Keep Precinct Number
                         D2Pre=100*(PRE_DEM/(PRE_DEM+PRE_REP)),# D2Pre = Dem 2 party share, President
                         D2Sen=100*(UNI_DEM/(UNI_DEM+UNI_REP)),# D2Pre = Dem 2 party share, US Senator
                         D2Att=100*(ATT_DEM/(ATT_DEM+ATT_REP)),# D2Pre = Dem 2 party share, Attorney Gen
                         D2Aud=100*(AUD_DEM/(AUD_DEM+AUD_REP)),# D2Pre = Dem 2 party share, Auditor Gen
                         D2Tre=100*(STA_DEM/(STA_DEM+STA_REP)))]# D2Pre = Dem 2 party share, State Treas
DemTwoPartyShares.dt
                                        
```

## Final output table: Merged Data

Now we `merge` the tables and format the Precinct Number and Name as requested in the Exercise.

```{r}
Ex3Data.dt <- merge(Tot.dt,Rolloffs.dt, by="PrecNo")              # Merge Total with Rolloffs
Ex3Data.dt <- merge(Ex3Data.dt,DemTwoPartyShares.dt, by="PrecNo") # Merge that with Dem 2-party Shares
Ex3Data.dt[,PrecNo := as.integer(PrecNo)]                         # Make Precinct number a number
Ex3Data.dt[,PrecName := substr(PrecName,4,stop=40L)]              # Strip duplicate info from Name
Ex3Data.dt
```


