This document addresses one solution to the data-wrangling exercise described in Exercise3.pdf. For the accompanying R Notebook with executable code, download the Rmd file from the pulldown “Code” menu on the upper right.

The input to the exercise is the raw data on 2016 Centre County, PA, precinct level votes found in “CentreCountyPrecinctResults2016GeneralElection.txt”, which were retrieved from http://centrecountypa.gov/Index.aspx?NID=802. (Update - 2020: This link is no longer accurate, and the original file does not appear to be available there any more. You can, however, find it on the Internet Archive’s Wayback Machine, for example, here: http://web.archive.org/web/20181106161514/centrecountypa.gov/index.aspx?NID=802. The file is no longer available due to the website being “updated” to provide the data in even less accessible form (pdf).)

The exercise asks you to extract the data on votes cast by precinct in statewide elections, and process them into a new table with precinct level data on total votes, Democratic share of two-party vote, and ballot rolloff from presidential votes to votes in other statewide races.

This solution uses the R tidyverse. For alternative solutions see https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018.

Solving with the tidyverse (mainly dplyr)

Let’s load the tidyverse, read in the data, and look at the first 50 rows:

library(tidyverse)
── Attaching packages ───────────
✔ ggplot2 3.2.1     ✔ purrr   0.3.3
✔ tibble  2.1.3     ✔ dplyr   0.8.3
✔ tidyr   1.0.0     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.4.0
── Conflicts ────────────────────
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
rawdata <- read_csv("CentreCountyPrecinctResults2016GeneralElection.txt") # 3520 rows, 16 columns
Parsed with column specification:
cols(
  County = col_character(),
  ElecYear = col_double(),
  Election = col_character(),
  ElecDate = col_character(),
  PrecNo = col_character(),
  PrecName = col_character(),
  Party = col_character(),
  Contest = col_character(),
  Descr = col_logical(),
  VoteFor = col_character(),
  Candidate = col_character(),
  Posn = col_double(),
  Count = col_double(),
  PctCnt = col_double(),
  Total = col_logical(),
  PctTot = col_logical()
)
dim(rawdata)
[1] 3520   16
head(rawdata, n=50)

These data have, roughly, 39 rows for each of 91 precincts in Centre County, PA, identified by the variables PrecNo (which ranges from 0001 to 0091) and PrecName (which ranges from 01 BELLEFONTE NORTH to 91 FERGUSON NORTH CENTRAL). Each precinct starts with three rows for summary information, identified in the Contest variable by values REGISTERED VOTERS - TOTAL, BALLOTS CAST - TOTAL, VOTER TURNOUT - TOTAL, followed by five rows with information on straight ticket voting for each of five parties (which can be ignored for our purposes here). From there, each row contains information for a single candidate in a particular contest. We care specifically about the statewide contests that are held in every precinct here: President (Contest=="PRESIDENTIAL ELECTORS"), US Senator (Contest=="UNITED STATES SENATOR"), Attorney General (Contest=="ATTORNEY GENERAL"), Auditor General (Contest=="AUDITOR GENERAL"), and Treasurer (Contest=="STATE TREASURER"). All of our calculations are based on the number in the Count variable.

Solve in three pieces

This solution creates three dataframes (tables) that it joins together in the last step: the total votes, the two-party shares, and the rolloffs. Each of these dataframes should ultimately have 91 rows.

Table 1: Total votes by precinct

The total vote numbers are already in the data, we just need to filter rows down to just the 91 we need (the ones that list "BALLOTS CAST - TOTAL" in the Contest column, and then select the columns we need. We’ll keep the precinct number, the precinct name, and the total votes.

Tot <- rawdata %>% 
  filter(Contest=="BALLOTS CAST - TOTAL") %>% # filter to the rows with total votes
  select(PrecNo,PrecName,Tot=Count) #select the id columns and the counts. Rename the counts to Tot
dim(Tot) # 91 rows, 3 columns
[1] 91  3
head(Tot)

Table 2: Rolloffs in down-ballot races

Now we’ll calculate the rolloff data. Ultimately, the output is 91 rows, 1 row per precinct, with four columns containing the rolloff in the non-presidential races in that precinct. To get there, we need an intermediate table of 91 rows with the total votes per statewide contest, including President, by precinct. So the trickiest step is the one where we `spread’ the data from its “long” format to a “wide” format with these variables.

The following code is presented as one long pipeline with one input and one output. The identical code is repeated below in this notebook, with output “glimpsed” for each intermediate step.

Rolloffs <- rawdata %>%                  # Start with the raw data
  select(PrecNo, Contest, Count) %>%     # For rolloff we need the data by precinct & contest
  mutate(Con=substr(Contest,1,3)) %>%    # For cleanliness, create an abbreviated contest variable
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %>%  # Pick the rows with just the statewide contests
  group_by(PrecNo,Con) %>%               # Group into Precinct-Contest units
  summarise(ConTot=sum(Count)) %>%       # Calculate total vote by Precinct-Contest
  spread(Con, ConTot) %>%               # Spread the data by Contest
  mutate(ROSen=100*(1-UNI/PRE),         # Rolloff for Senator (UNI TED STATES SENATOR)
         ROAtt=100*(1-ATT/PRE),         # Rolloff for Attorney General (ATT ORNEY GENERAL)
         ROAud=100*(1-AUD/PRE),         # Rolloff for Auditor General (AUD ITOR GENERAL)
         ROTre=100*(1-STA/PRE)) %>%     # Rolloff for Treasurer (STA TE TREASURER)
  select(PrecNo,ROSen,ROAtt,ROAud,ROTre) # Keep just the Precinct Number and Rolloff variables
dim(Rolloffs) # 91 rows, 5 columns
[1] 91  5
head(Rolloffs)

NB. 2020 Update

Since the original posting of this exercise / solution, the spread and gather commands have been superceded by pivot_wider and pivot_longer. The code above still works, but the preferred formulation would now be:

Rolloffs <- rawdata %>%                  # Start with the raw data
  select(PrecNo, Contest, Count) %>%     # For rolloff we need the data by precinct & contest
  mutate(Con=substr(Contest,1,3)) %>%    # For cleanliness, create an abbreviated contest variable
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %>%  # Pick the rows with just the statewide contests
  group_by(PrecNo,Con) %>%               # Group into Precinct-Contest units
  summarise(ConTot=sum(Count)) %>%       # Calculate total vote by Precinct-Contest
  pivot_wider(names_from = Con, values_from = ConTot) %>% 
  # replaces
  # spread(Con, ConTot) %>%           # Spread the data by Contest
  mutate(ROSen=100*(1-UNI/PRE),         # Rolloff for Senator (UNI TED STATES SENATOR)
         ROAtt=100*(1-ATT/PRE),         # Rolloff for Attorney General (ATT ORNEY GENERAL)
         ROAud=100*(1-AUD/PRE),         # Rolloff for Auditor General (AUD ITOR GENERAL)
         ROTre=100*(1-STA/PRE)) %>%     # Rolloff for Treasurer (STA TE TREASURER)
  select(PrecNo,ROSen,ROAtt,ROAud,ROTre) # Keep just the Precinct Number and Rolloff variables
dim(Rolloffs) # 91 rows, 5 columns
[1] 91  5
head(Rolloffs)

Table 3: Democratic share of two-party vote

This is similar to rolloff in that we need to calculate two intermediate quantities along the way: total vote for Republican and Democrat in each race, by precinct. With rolloff, we needed to spread the data to create contest data by precinct. With party shares, we need to spread the data to create contest-party data by precinct, so we need to create a contest-party indicator variable to act as a “key” for spreading. (2020 note: again, spread should be replaced by pivot_wider.)

DemTwoPartyVotes <- rawdata %>%            # Start with the raw data
  select(PrecNo,Party,Contest,Count) %>%   # All we need is Precinct, the "Contest", and the "Count"
  mutate(Con=substr(Contest,1,3),Pty=substr(Party,1,3)) %>% # Create abbreviated Party & Contest variables
  select(PrecNo,Pty,Con,Count) %>%         # Get rid of the unabbreviated variables
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %>%  # Pick just the rows with elections of interest
  filter(Pty %in% c('DEM','REP')) %>%      # Pick just Democratic and Republican candidates
  mutate(ConPty = paste(Con,Pty,sep="")) %>%  #### CREATE CONTEST-PARTY TO ACT AS KEY FOR SPREAD
  select(PrecNo,ConPty,Count) %>%          # Get rid of columns we don't need.
  pivot_wider(names_from = ConPty, values_from = Count) %>%
  # replaces
  #  spread(ConPty,Count) %>%         ##### SPREAD THE DATA BY THE CONTEST-PARTY KEY
  mutate(D2Pre=100*(PREDEM/(PREDEM+PREREP)),     # D2Pre = Dem share of 2 party vote for President
         D2Sen=100*(UNIDEM/(UNIDEM+UNIREP)),     # D2Sen = Dem share of 2 party vote for US Senator
         D2Att=100*(ATTDEM/(ATTDEM+ATTREP)),     # D2Att = Dem share of 2 party vote for Attorney Genl
         D2Aud=100*(AUDDEM/(AUDDEM+AUDREP)),     # D2Aud = Dem share of 2 party vote for Auditor Genl
         D2Tre=100*(STADEM/(STADEM+STAREP))) %>% # D2Tre = Dem share of 2 party vote for St Treasurer
  select(PrecNo,D2Pre,D2Sen,D2Att,D2Aud,D2Tre)  # Get rid of columns we don't need
dim(DemTwoPartyVotes) # 91 rows, 6 columns
[1] 91  6
head(DemTwoPartyVotes)

Final output table: Merged Data

Now we merge the tables and format the Precinct Number and Name as requested in the Exercise. The main verb here is *_join. Any of left_join, right_join, or inner_join will work in this case to match data from each table by the only matching variable, PrecNo. (These verbs differ in how they treat rows in which the join variables are missing or duplicated in one or the other table, but in this case all three tables have exactly 91 rows with unique PrecNo, so the effect is identical.)

Ex3Data <- Tot %>%                            # Start with the Tot data_frame
  left_join(DemTwoPartyVotes) %>%             # Merge the DemTwoParty table (on shared PrecNo)
  left_join(Rolloffs) %>%                     # Merge the Rolloff table (on shared PrecNo)
  mutate(PrecNo=as.integer(PrecNo)) %>%       # Turn PrecNo into a number
  mutate(PrecName=str_sub(PrecName,start=4L)) # Strip the redundant number off of the Precinct names
Joining, by = "PrecNo"Joining, by = "PrecNo"
Ex3Data

Solution repeated with intermediate steps displayed

Here we repeat the exact steps from above, but with a “glimpse” statement at each step to see all of the intermediate results.

Tot <- rawdata %>% 
  filter(Contest=="BALLOTS CAST - TOTAL") %T>% glimpse()  %>% # filter to the rows with total votes
  select(PrecNo,PrecName,Tot=Count) #select the id columns and the counts. Rename the counts to Tot
Observations: 91
Variables: 16
$ County    <chr> "CENTRE COUN…
$ ElecYear  <dbl> 2016, 2016, …
$ Election  <chr> "GENERAL ELE…
$ ElecDate  <chr> "NOVEMBER 8 …
$ PrecNo    <chr> "0001", "000…
$ PrecName  <chr> "01 BELLEFON…
$ Party     <chr> NA, NA, NA, …
$ Contest   <chr> "BALLOTS CAS…
$ Descr     <lgl> NA, NA, NA, …
$ VoteFor   <chr> NA, NA, NA, …
$ Candidate <chr> NA, NA, NA, …
$ Posn      <dbl> 11, 96, 181,…
$ Count     <dbl> 391, 687, 46…
$ PctCnt    <dbl> NA, NA, NA, …
$ Total     <lgl> NA, NA, NA, …
$ PctTot    <lgl> NA, NA, NA, …
dim(Tot) # 91 rows x 3
[1] 91  3
head(Tot)
Rolloffs <- rawdata %>%                  # Start with the raw data
  select(PrecNo, Contest, Count) %T>% glimpse()  %>%  # Pick data by precinct & contest
  mutate(Con=substr(Contest,1,3)) %T>% glimpse()  %>% # For cleanliness, create abbrev contest variable
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %T>% glimpse() %>% # Pick rows w statewide contests
  group_by(PrecNo,Con) %T>% glimpse() %>%               # Group into Precinct-Contest units
  summarise(ConTot=sum(Count)) %T>% glimpse() %>%       # Calculate total vote by Precinct-Contest
  spread(Con, ConTot) %T>% glimpse() %>%               # Spread the data by Contest
  mutate(ROSen=100*(1-UNI/PRE),         # Rolloff for Senator (UNI TED STATES SENATOR)
         ROAtt=100*(1-ATT/PRE),         # Rolloff for Attorney General (ATT ORNEY GENERAL)
         ROAud=100*(1-AUD/PRE),         # Rolloff for Auditor General (AUD ITOR GENERAL)
         ROTre=100*(1-STA/PRE)) %T>% glimpse()  %>%     # Rolloff for Treasurer (STA TE TREASURER)
  select(PrecNo,ROSen,ROAtt,ROAud,ROTre) # Keep just the Precinct Number and Rolloff variables
Observations: 3,520
Variables: 3
$ PrecNo  <chr> "0001", "0001"…
$ Contest <chr> "REGISTERED VO…
$ Count   <dbl> 507, 391, NA, …
Observations: 3,520
Variables: 4
$ PrecNo  <chr> "0001", "0001"…
$ Contest <chr> "REGISTERED VO…
$ Count   <dbl> 507, 391, NA, …
$ Con     <chr> "REG", "BAL", …
Observations: 2,093
Variables: 4
$ PrecNo  <chr> "0001", "0001"…
$ Contest <chr> "PRESIDENTIAL …
$ Count   <dbl> 188, 174, 2, 3…
$ Con     <chr> "PRE", "PRE", …
Observations: 2,093
Variables: 4
Groups: PrecNo, Con [455]
$ PrecNo  <chr> "0001", "0001"…
$ Contest <chr> "PRESIDENTIAL …
$ Count   <dbl> 188, 174, 2, 3…
$ Con     <chr> "PRE", "PRE", …
Observations: 455
Variables: 3
Groups: PrecNo [91]
$ PrecNo <chr> "0001", "0001",…
$ Con    <chr> "ATT", "AUD", "…
$ ConTot <dbl> 382, 364, 391, …
Observations: 91
Variables: 6
Groups: PrecNo [91]
$ PrecNo <chr> "0001", "0002",…
$ ATT    <dbl> 382, 668, 451, …
$ AUD    <dbl> 364, 657, 448, …
$ PRE    <dbl> 391, 679, 457, …
$ STA    <dbl> 368, 655, 447, …
$ UNI    <dbl> 386, 675, 452, …
Observations: 91
Variables: 10
Groups: PrecNo [91]
$ PrecNo <chr> "0001", "0002",…
$ ATT    <dbl> 382, 668, 451, …
$ AUD    <dbl> 364, 657, 448, …
$ PRE    <dbl> 391, 679, 457, …
$ STA    <dbl> 368, 655, 447, …
$ UNI    <dbl> 386, 675, 452, …
$ ROSen  <dbl> 1.2787724, 0.58…
$ ROAtt  <dbl> 2.3017903, 1.62…
$ ROAud  <dbl> 6.905371, 3.240…
$ ROTre  <dbl> 5.882353, 3.534…
Rolloffs
DemTwoPartyVotes <- rawdata %>%            # Start with the raw data
  select(PrecNo,Party,Contest,Count) %T>% glimpse() %>%   # All we need is Precinct, Contest, & "Count"
  mutate(Con=substr(Contest,1,3),Pty=substr(Party,1,3)) %T>% glimpse() %>% # Abbreviated Party & Contest
  select(PrecNo,Pty,Con,Count) %T>% glimpse() %>%         # Get rid of the unabbreviated variables
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %T>% glimpse() %>%  # Pick rows w elections
  filter(Pty %in% c('DEM','REP')) %T>% glimpse() %>%  # Pick just Democratic and Republican candidates
  mutate(ConPty = paste(Con,Pty,sep="")) %T>% glimpse() %>%  #### CREATE CONTEST-PARTY AS KEY FOR SPREAD
  select(PrecNo,ConPty,Count) %T>% glimpse() %>%          # Get rid of columns we don't need.
  spread(ConPty,Count) %T>% glimpse() %>%                 ##### SPREAD THE DATA BY THE CONTEST-PARTY KEY
  mutate(D2Pre=100*(PREDEM/(PREDEM+PREREP)),     # D2Pre = Dem share of 2 party vote for President
         D2Sen=100*(UNIDEM/(UNIDEM+UNIREP)),     # D2Sen = Dem share of 2 party vote for US Senator
         D2Att=100*(ATTDEM/(ATTDEM+ATTREP)),     # D2Att = Dem share of 2 party vote for Attorney Genl
         D2Aud=100*(AUDDEM/(AUDDEM+AUDREP)),     # D2Aud = Dem share of 2 party vote for Auditor Genl
         D2Tre=100*(STADEM/(STADEM+STAREP))) %T>% glimpse() %>% # D2Tre = Dem share, 2 pty vote St Treasurer
  select(PrecNo,D2Pre,D2Sen,D2Att,D2Aud,D2Tre)  # Get rid of columns we don't need
Observations: 3,520
Variables: 4
$ PrecNo  <chr> "0001", "0001"…
$ Party   <chr> NA, NA, NA, "D…
$ Contest <chr> "REGISTERED VO…
$ Count   <dbl> 507, 391, NA, …
Observations: 3,520
Variables: 6
$ PrecNo  <chr> "0001", "0001"…
$ Party   <chr> NA, NA, NA, "D…
$ Contest <chr> "REGISTERED VO…
$ Count   <dbl> 507, 391, NA, …
$ Con     <chr> "REG", "BAL", …
$ Pty     <chr> NA, NA, NA, "D…
Observations: 3,520
Variables: 4
$ PrecNo <chr> "0001", "0001",…
$ Pty    <chr> NA, NA, NA, "DE…
$ Con    <chr> "REG", "BAL", "…
$ Count  <dbl> 507, 391, NA, 4…
Observations: 2,093
Variables: 4
$ PrecNo <chr> "0001", "0001",…
$ Pty    <chr> "DEM", "REP", "…
$ Con    <chr> "PRE", "PRE", "…
$ Count  <dbl> 188, 174, 2, 3,…
Observations: 910
Variables: 4
$ PrecNo <chr> "0001", "0001",…
$ Pty    <chr> "DEM", "REP", "…
$ Con    <chr> "PRE", "PRE", "…
$ Count  <dbl> 188, 174, 169, …
Observations: 910
Variables: 5
$ PrecNo <chr> "0001", "0001",…
$ Pty    <chr> "DEM", "REP", "…
$ Con    <chr> "PRE", "PRE", "…
$ Count  <dbl> 188, 174, 169, …
$ ConPty <chr> "PREDEM", "PRER…
Observations: 910
Variables: 3
$ PrecNo <chr> "0001", "0001",…
$ ConPty <chr> "PREDEM", "PRER…
$ Count  <dbl> 188, 174, 169, …
Observations: 91
Variables: 11
$ PrecNo <chr> "0001", "0002",…
$ ATTDEM <dbl> 194, 265, 258, …
$ ATTREP <dbl> 187, 401, 192, …
$ AUDDEM <dbl> 170, 264, 238, …
$ AUDREP <dbl> 163, 354, 171, …
$ PREDEM <dbl> 188, 259, 230, …
$ PREREP <dbl> 174, 385, 193, …
$ STADEM <dbl> 177, 252, 229, …
$ STAREP <dbl> 164, 344, 169, …
$ UNIDEM <dbl> 169, 251, 229, …
$ UNIREP <dbl> 185, 386, 191, …
Observations: 91
Variables: 16
$ PrecNo <chr> "0001", "0002",…
$ ATTDEM <dbl> 194, 265, 258, …
$ ATTREP <dbl> 187, 401, 192, …
$ AUDDEM <dbl> 170, 264, 238, …
$ AUDREP <dbl> 163, 354, 171, …
$ PREDEM <dbl> 188, 259, 230, …
$ PREREP <dbl> 174, 385, 193, …
$ STADEM <dbl> 177, 252, 229, …
$ STAREP <dbl> 164, 344, 169, …
$ UNIDEM <dbl> 169, 251, 229, …
$ UNIREP <dbl> 185, 386, 191, …
$ D2Pre  <dbl> 51.93370, 40.21…
$ D2Sen  <dbl> 47.74011, 39.40…
$ D2Att  <dbl> 50.91864, 39.78…
$ D2Aud  <dbl> 51.05105, 42.71…
$ D2Tre  <dbl> 51.90616, 42.28…
DemTwoPartyVotes
Ex3Data <- Tot %>%                                # Start with the Tot data_frame
  left_join(DemTwoPartyVotes) %T>% glimpse() %>%    # Merge the DemTwoParty table (on shared PrecNo)
  left_join(Rolloffs) %T>% glimpse() %>%            # Merge the Rolloff table (on shared PrecNo)
  mutate(PrecNo=as.integer(PrecNo)) %T>% glimpse() %>%       # Turn PrecNo into a number
  mutate(PrecName=str_sub(PrecName,start=4L)) # Strip the redundant number off of the Precinct names
Joining, by = "PrecNo"
Observations: 91
Variables: 8
$ PrecNo   <chr> "0001", "0002…
$ PrecName <chr> "01 BELLEFONT…
$ Tot      <dbl> 391, 687, 461…
$ D2Pre    <dbl> 51.93370, 40.…
$ D2Sen    <dbl> 47.74011, 39.…
$ D2Att    <dbl> 50.91864, 39.…
$ D2Aud    <dbl> 51.05105, 42.…
$ D2Tre    <dbl> 51.90616, 42.…
Joining, by = "PrecNo"
Observations: 91
Variables: 12
$ PrecNo   <chr> "0001", "0002…
$ PrecName <chr> "01 BELLEFONT…
$ Tot      <dbl> 391, 687, 461…
$ D2Pre    <dbl> 51.93370, 40.…
$ D2Sen    <dbl> 47.74011, 39.…
$ D2Att    <dbl> 50.91864, 39.…
$ D2Aud    <dbl> 51.05105, 42.…
$ D2Tre    <dbl> 51.90616, 42.…
$ ROSen    <dbl> 1.2787724, 0.…
$ ROAtt    <dbl> 2.3017903, 1.…
$ ROAud    <dbl> 6.905371, 3.2…
$ ROTre    <dbl> 5.882353, 3.5…
Observations: 91
Variables: 12
$ PrecNo   <int> 1, 2, 3, 4, 5…
$ PrecName <chr> "01 BELLEFONT…
$ Tot      <dbl> 391, 687, 461…
$ D2Pre    <dbl> 51.93370, 40.…
$ D2Sen    <dbl> 47.74011, 39.…
$ D2Att    <dbl> 50.91864, 39.…
$ D2Aud    <dbl> 51.05105, 42.…
$ D2Tre    <dbl> 51.90616, 42.…
$ ROSen    <dbl> 1.2787724, 0.…
$ ROAtt    <dbl> 2.3017903, 1.…
$ ROAud    <dbl> 6.905371, 3.2…
$ ROTre    <dbl> 5.882353, 3.5…
Ex3Data # 91 rows, 12 columns
---
title: "SoDA 501, Exercise 3 (2018) - `tidyverse` Solution"
author: "Burt L. Monroe"
output:
  html_notebook:
    code_folding: show
    highlight: tango
    theme: united
    df_print: paged
    toc: yes
---

This document addresses one solution to the data-wrangling exercise described in [Exercise3.pdf](https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018/Exercise3.pdf). For the accompanying R Notebook with executable code, download the Rmd file from the pulldown "Code" menu on the upper right.  

The input to the exercise is the raw data on 2016 Centre County, PA, precinct level votes found in ["CentreCountyPrecinctResults2016GeneralElection.txt"](https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018/CentreCountyPrecinctResults2016GeneralElection.txt), which were retrieved from [http://centrecountypa.gov/Index.aspx?NID=802](http://centrecountypa.gov/Index.aspx?NID=802). (Update - 2020: This link is no longer accurate, and the original file does not appear to be available there any more. You can, however, find it on the Internet Archive's Wayback Machine, for example, here: [http://web.archive.org/web/20181106161514/centrecountypa.gov/index.aspx?NID=802](http://web.archive.org/web/20181106161514/centrecountypa.gov/index.aspx?NID=802). The file is no longer available due to the website being "updated" to provide the data in even less accessible form (pdf).)

The exercise asks you to extract the data on votes cast by precinct in statewide elections, and process them into a new table with precinct level data on total votes, Democratic share of two-party vote, and ballot rolloff from presidential votes to votes in other statewide races.

This solution uses the R `tidyverse`. For alternative solutions see [https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018](https://burtmonroe.github.io/SoDA501/Exercises/Exercise3-2018).

# Solving with the `tidyverse` (mainly `dplyr`)

Let's load the tidyverse, read in the data, and look at the first 50 rows:

```{r}
library(tidyverse)
rawdata <- read_csv("CentreCountyPrecinctResults2016GeneralElection.txt") # 3520 rows, 16 columns
dim(rawdata)
head(rawdata, n=50)
```

These data have, roughly, 39 rows for each of 91 precincts in Centre County, PA, identified by the variables `PrecNo` (which ranges from `0001` to `0091`) and `PrecName` (which ranges from `01 BELLEFONTE NORTH` to `91 FERGUSON NORTH CENTRAL`). Each precinct starts with three rows for summary information, identified in the `Contest` variable by values `REGISTERED VOTERS - TOTAL`, `BALLOTS CAST - TOTAL`, `VOTER TURNOUT - TOTAL`, followed by five rows with information on straight ticket voting for each of five parties (which can be ignored for our purposes here). From there, each row contains information for a single candidate in a particular contest. We care specifically about the statewide contests that are held in every precinct here: President (`Contest=="PRESIDENTIAL ELECTORS"`), US Senator (`Contest=="UNITED STATES SENATOR"`), Attorney General (`Contest=="ATTORNEY GENERAL"`), Auditor General (`Contest=="AUDITOR GENERAL"`), and Treasurer (`Contest=="STATE TREASURER"`). All of our calculations are based on the number in the `Count` variable.

## Solve in three pieces

This solution creates three dataframes (tables) that it joins together in the last step: the total votes, the two-party shares, and the rolloffs. Each of these dataframes should ultimately have 91 rows.

### Table 1: Total votes by precinct

The total vote numbers are already in the data, we just need to `filter` rows down to just the 91 we need (the ones that list `"BALLOTS CAST - TOTAL"` in the `Contest` column, and then `select` the columns we need. We'll keep the precinct number, the precinct name, and the total votes.

```{r}
Tot <- rawdata %>% 
  filter(Contest=="BALLOTS CAST - TOTAL") %>% # filter to the rows with total votes
  select(PrecNo,PrecName,Tot=Count) #select the id columns and the counts. Rename the counts to Tot
dim(Tot) # 91 rows, 3 columns
head(Tot)
```

### Table 2: Rolloffs in down-ballot races

Now we'll calculate the rolloff data. Ultimately, the output is 91 rows, 1 row per precinct, with four columns containing the rolloff in the non-presidential races in that precinct. To get there, we need an intermediate table of 91 rows with the total votes per statewide contest, including President, by precinct. So the trickiest step is the one where we `spread' the data from its "long" format to a "wide" format with these variables. 

The following code is presented as one long pipeline with one input and one output. The identical code is repeated below in this notebook, with output "glimpsed" for each intermediate step.

```{r}
Rolloffs <- rawdata %>%                  # Start with the raw data
  select(PrecNo, Contest, Count) %>%     # For rolloff we need the data by precinct & contest
  mutate(Con=substr(Contest,1,3)) %>%    # For cleanliness, create an abbreviated contest variable
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %>%  # Pick the rows with just the statewide contests
  group_by(PrecNo,Con) %>%               # Group into Precinct-Contest units
  summarise(ConTot=sum(Count)) %>%       # Calculate total vote by Precinct-Contest
  spread(Con, ConTot) %>%               # Spread the data by Contest
  mutate(ROSen=100*(1-UNI/PRE),         # Rolloff for Senator (UNI TED STATES SENATOR)
         ROAtt=100*(1-ATT/PRE),         # Rolloff for Attorney General (ATT ORNEY GENERAL)
         ROAud=100*(1-AUD/PRE),         # Rolloff for Auditor General (AUD ITOR GENERAL)
         ROTre=100*(1-STA/PRE)) %>%     # Rolloff for Treasurer (STA TE TREASURER)
  select(PrecNo,ROSen,ROAtt,ROAud,ROTre) # Keep just the Precinct Number and Rolloff variables
dim(Rolloffs) # 91 rows, 5 columns
head(Rolloffs)
```

#### NB. 2020 Update

Since the original posting of this exercise / solution, the `spread` and `gather` commands have been superceded by `pivot_wider` and `pivot_longer`. The code above still works, but the preferred formulation would now be:

```{r}
Rolloffs <- rawdata %>%                  # Start with the raw data
  select(PrecNo, Contest, Count) %>%     # For rolloff we need the data by precinct & contest
  mutate(Con=substr(Contest,1,3)) %>%    # For cleanliness, create an abbreviated contest variable
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %>%  # Pick the rows with just the statewide contests
  group_by(PrecNo,Con) %>%               # Group into Precinct-Contest units
  summarise(ConTot=sum(Count)) %>%       # Calculate total vote by Precinct-Contest
  pivot_wider(names_from = Con, values_from = ConTot) %>% 
  # replaces
  # spread(Con, ConTot) %>%           # Spread the data by Contest
  mutate(ROSen=100*(1-UNI/PRE),         # Rolloff for Senator (UNI TED STATES SENATOR)
         ROAtt=100*(1-ATT/PRE),         # Rolloff for Attorney General (ATT ORNEY GENERAL)
         ROAud=100*(1-AUD/PRE),         # Rolloff for Auditor General (AUD ITOR GENERAL)
         ROTre=100*(1-STA/PRE)) %>%     # Rolloff for Treasurer (STA TE TREASURER)
  select(PrecNo,ROSen,ROAtt,ROAud,ROTre) # Keep just the Precinct Number and Rolloff variables
dim(Rolloffs) # 91 rows, 5 columns
head(Rolloffs)
```

### Table 3: Democratic share of two-party vote

This is similar to rolloff in that we need to calculate two intermediate quantities along the way: total vote for Republican and Democrat in each race, by precinct. With rolloff, we needed to spread the data to create contest data by precinct. With party shares, we need to spread the data to create contest-party data by precinct, so we need to create a contest-party indicator variable to act as a "key" for spreading. (2020 note: again, `spread` should be replaced by `pivot_wider`.)

```{r}
DemTwoPartyVotes <- rawdata %>%            # Start with the raw data
  select(PrecNo,Party,Contest,Count) %>%   # All we need is Precinct, the "Contest", and the "Count"
  mutate(Con=substr(Contest,1,3),Pty=substr(Party,1,3)) %>% # Create abbreviated Party & Contest variables
  select(PrecNo,Pty,Con,Count) %>%         # Get rid of the unabbreviated variables
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %>%  # Pick just the rows with elections of interest
  filter(Pty %in% c('DEM','REP')) %>%      # Pick just Democratic and Republican candidates
  mutate(ConPty = paste(Con,Pty,sep="")) %>%  #### CREATE CONTEST-PARTY TO ACT AS KEY FOR SPREAD
  select(PrecNo,ConPty,Count) %>%          # Get rid of columns we don't need.
  pivot_wider(names_from = ConPty, values_from = Count) %>%
  # replaces
  #  spread(ConPty,Count) %>%         ##### SPREAD THE DATA BY THE CONTEST-PARTY KEY
  mutate(D2Pre=100*(PREDEM/(PREDEM+PREREP)),     # D2Pre = Dem share of 2 party vote for President
         D2Sen=100*(UNIDEM/(UNIDEM+UNIREP)),     # D2Sen = Dem share of 2 party vote for US Senator
         D2Att=100*(ATTDEM/(ATTDEM+ATTREP)),     # D2Att = Dem share of 2 party vote for Attorney Genl
         D2Aud=100*(AUDDEM/(AUDDEM+AUDREP)),     # D2Aud = Dem share of 2 party vote for Auditor Genl
         D2Tre=100*(STADEM/(STADEM+STAREP))) %>% # D2Tre = Dem share of 2 party vote for St Treasurer
  select(PrecNo,D2Pre,D2Sen,D2Att,D2Aud,D2Tre)  # Get rid of columns we don't need
dim(DemTwoPartyVotes) # 91 rows, 6 columns
head(DemTwoPartyVotes)
```


## Final output table: Merged Data

Now we merge the tables and format the Precinct Number and Name as requested in the Exercise. The main verb here is `*_join.` Any of `left_join`, `right_join`, or `inner_join` will work in this case to match data from each table by the only matching variable, PrecNo. (These verbs differ in how they treat rows in which the join variables are missing or duplicated in one or the other table, but in this case all three tables have exactly 91 rows with unique PrecNo, so the effect is identical.)

```{r}
Ex3Data <- Tot %>%                            # Start with the Tot data_frame
  left_join(DemTwoPartyVotes) %>%             # Merge the DemTwoParty table (on shared PrecNo)
  left_join(Rolloffs) %>%                     # Merge the Rolloff table (on shared PrecNo)
  mutate(PrecNo=as.integer(PrecNo)) %>%       # Turn PrecNo into a number
  mutate(PrecName=str_sub(PrecName,start=4L)) # Strip the redundant number off of the Precinct names
Ex3Data
```

## Solution repeated with intermediate steps displayed

Here we repeat the exact steps from above, but with a "glimpse" statement at each step to see all of the intermediate results.

```{r}
Tot <- rawdata %>% 
  filter(Contest=="BALLOTS CAST - TOTAL") %T>% glimpse()  %>% # filter to the rows with total votes
  select(PrecNo,PrecName,Tot=Count) #select the id columns and the counts. Rename the counts to Tot
dim(Tot) # 91 rows x 3
head(Tot)
```


```{r}
Rolloffs <- rawdata %>%                  # Start with the raw data
  select(PrecNo, Contest, Count) %T>% glimpse()  %>%  # Pick data by precinct & contest
  mutate(Con=substr(Contest,1,3)) %T>% glimpse()  %>% # For cleanliness, create abbrev contest variable
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %T>% glimpse() %>% # Pick rows w statewide contests
  group_by(PrecNo,Con) %T>% glimpse() %>%               # Group into Precinct-Contest units
  summarise(ConTot=sum(Count)) %T>% glimpse() %>%       # Calculate total vote by Precinct-Contest
  spread(Con, ConTot) %T>% glimpse() %>%               # Spread the data by Contest
  mutate(ROSen=100*(1-UNI/PRE),         # Rolloff for Senator (UNI TED STATES SENATOR)
         ROAtt=100*(1-ATT/PRE),         # Rolloff for Attorney General (ATT ORNEY GENERAL)
         ROAud=100*(1-AUD/PRE),         # Rolloff for Auditor General (AUD ITOR GENERAL)
         ROTre=100*(1-STA/PRE)) %T>% glimpse()  %>%     # Rolloff for Treasurer (STA TE TREASURER)
  select(PrecNo,ROSen,ROAtt,ROAud,ROTre) # Keep just the Precinct Number and Rolloff variables
Rolloffs
```



```{r}
DemTwoPartyVotes <- rawdata %>%            # Start with the raw data
  select(PrecNo,Party,Contest,Count) %T>% glimpse() %>%   # All we need is Precinct, Contest, & "Count"
  mutate(Con=substr(Contest,1,3),Pty=substr(Party,1,3)) %T>% glimpse() %>% # Abbreviated Party & Contest
  select(PrecNo,Pty,Con,Count) %T>% glimpse() %>%         # Get rid of the unabbreviated variables
  filter(Con %in% c("PRE","UNI","ATT","AUD","STA")) %T>% glimpse() %>%  # Pick rows w elections
  filter(Pty %in% c('DEM','REP')) %T>% glimpse() %>%  # Pick just Democratic and Republican candidates
  mutate(ConPty = paste(Con,Pty,sep="")) %T>% glimpse() %>%  #### CREATE CONTEST-PARTY AS KEY FOR SPREAD
  select(PrecNo,ConPty,Count) %T>% glimpse() %>%          # Get rid of columns we don't need.
  spread(ConPty,Count) %T>% glimpse() %>%                 ##### SPREAD THE DATA BY THE CONTEST-PARTY KEY
  mutate(D2Pre=100*(PREDEM/(PREDEM+PREREP)),     # D2Pre = Dem share of 2 party vote for President
         D2Sen=100*(UNIDEM/(UNIDEM+UNIREP)),     # D2Sen = Dem share of 2 party vote for US Senator
         D2Att=100*(ATTDEM/(ATTDEM+ATTREP)),     # D2Att = Dem share of 2 party vote for Attorney Genl
         D2Aud=100*(AUDDEM/(AUDDEM+AUDREP)),     # D2Aud = Dem share of 2 party vote for Auditor Genl
         D2Tre=100*(STADEM/(STADEM+STAREP))) %T>% glimpse() %>% # D2Tre = Dem share, 2 pty vote St Treasurer
  select(PrecNo,D2Pre,D2Sen,D2Att,D2Aud,D2Tre)  # Get rid of columns we don't need
DemTwoPartyVotes
```


```{r}
Ex3Data <- Tot %>%                                # Start with the Tot data_frame
  left_join(DemTwoPartyVotes) %T>% glimpse() %>%    # Merge the DemTwoParty table (on shared PrecNo)
  left_join(Rolloffs) %T>% glimpse() %>%            # Merge the Rolloff table (on shared PrecNo)
  mutate(PrecNo=as.integer(PrecNo)) %T>% glimpse() %>%       # Turn PrecNo into a number
  mutate(PrecName=str_sub(PrecName,start=4L)) # Strip the redundant number off of the Precinct names
Ex3Data # 91 rows, 12 columns
```
