Material for Social Data Analytics 501 @ Penn State
Due Thursday, Jan 30 (which is defined operationally as before Friday, Jan 31, 7:00 am)
Complete the replication and discussion of Michel et al. in Bit by Bit, Chapter 2, Exercise 6, parts (a) through (g).
Technical note: So that we don’t have anyone blowing up their hard drives / bandwidth etc., please note that when Salganik says “Please read all parts of the question before you begin coding” and “Get the raw data from the Google Books NGram Viewer,” he is telling you to get the data you need to answer this question. For part (a) you need counts of single words (unigrams or 1-grams) like “1875” or “1932.” You only need to download the data from a single link on the page he provides. The file at that link is 283M compressed and 1.5G uncompressed. If you’ve downloaded more than that, you’ve downloaded more than you need. Moreover, the data you actually need is a tiny fraction of that 1.5G, so you don’t need to keep all of it, or load all of it into memory – figure out how to load just the part you need. Do all the cleaning / subsetting with code … don’t open a 1.5G file in Excel and delete 1.4G+ of rows.
(For part b, you also need the tiny “total_counts” file.)
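As a rough illustration of "load just the part you need," the sketch below streams a compressed file one line at a time and keeps only the rows for the 1-grams you care about, so the full 1.5G never sits in memory. It rests on assumptions not stated above: that the download is gzip-compressed, and that each row is tab-separated with the n-gram first, then the year, then the match count. Check the actual file before relying on either. The function names and the sample rows (with made-up counts) are hypothetical.

```python
import gzip
import io


def filter_ngram_rows(lines, wanted):
    """Yield (ngram, year, match_count) for rows whose n-gram is in `wanted`.

    ASSUMPTION: each line is tab-separated as
    ngram <TAB> year <TAB> match_count <TAB> ...other columns...
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] in wanted:
            yield fields[0], int(fields[1]), int(fields[2])


def filter_gzip_file(path, wanted):
    """Stream a gzip-compressed file from disk without decompressing it fully.

    ASSUMPTION: the downloaded file is gzip-compressed text.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from filter_ngram_rows(handle, wanted)


# Tiny in-memory stand-in for the real file; the counts are invented.
sample = io.StringIO(
    "1875\t1900\t120\t40\n"
    "apple\t1900\t999\t300\n"
    "1932\t1950\t75\t20\n"
)
wanted = {"1875", "1932"}
rows = list(filter_ngram_rows(sample, wanted))
print(rows)
```

On the real data you would call `filter_gzip_file` with the path to your download and write the surviving rows to a small file for the rest of the analysis.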
You’ll do this in assigned teams of two or three.
Submit your team’s answer, and the code, as an R Notebook or Jupyter (Python) notebook. If you use an R Notebook, include
output: html_notebook
in the preamble. This will create a notebook file (.nb.html) when you “Preview” or “Knit”. Submit - or provide in a github repository - both files.
Due Friday, Feb 21, 5:00pm (Groups as assigned in class)
This exercise is based on Salganik, 2.7, and involves a semi-replication of Penney (2016): Penney, J.W., 2016. Chilling effects: Online surveillance and Wikipedia use. Berkeley Tech. LJ, 31, p.117.
Consider Figure 3, which demonstrates a “chilling effect” around the Snowden revelations on June 6, 2013:
Due Monday, Jan 14 (which is defined operationally as before Tuesday, 7:00 am)
Consider the following search on Google Trends: https://trends.google.com/trends/explore?date=all&geo=US&q=islam (relative use of the search term “islam” in the US for all time available).
You can see what appears to be a seasonal pattern. I want your team to discuss what you think the cause of that is and try to think of comparison search terms that would follow a related pattern if that were the cause. It doesn’t have to be identical … you might think of something that should move in the opposite direction or be the same but shifted 3 months. But of course just finding seasonality isn’t hard.
How about https://trends.google.com/trends/explore?date=all&geo=US&q=islam,oranges :
Or https://trends.google.com/trends/explore?date=all&geo=US&q=islam,basketball :
In any case … does it look like you were right? If not, keep trying.
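If you want something more than eyeballing the plots, one rough check is to correlate two monthly series at several lags and see where the match is strongest. That covers the "same but shifted 3 months" case mentioned above. This is only a sketch: the two series are synthetic sine waves invented for illustration, not Google Trends data, and the function names are hypothetical. Remember also that a high correlation between two seasonal series shows shared timing, not a shared cause.

```python
import math


def pearson(x, y):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlations(x, y, max_lag=6):
    """Correlate x[t] with y[t + lag] for lag = 0 .. max_lag (in months)."""
    result = {}
    for lag in range(max_lag + 1):
        x_part = x[: len(x) - lag]  # drop the last `lag` points of x
        y_part = y[lag:]            # drop the first `lag` points of y
        result[lag] = pearson(x_part, y_part)
    return result


# SYNTHETIC monthly series (4 years): term_b is term_a delayed by 3 months.
months = range(48)
term_a = [50 + 20 * math.sin(2 * math.pi * m / 12) for m in months]
term_b = [50 + 20 * math.sin(2 * math.pi * (m - 3) / 12) for m in months]

corrs = lagged_correlations(term_a, term_b)
best_lag = max(corrs, key=corrs.get)
print(best_lag, round(corrs[best_lag], 3))
```

With real data you would replace `term_a` and `term_b` with the monthly values for your two search terms, in the same date order.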
Write a paragraph giving your team’s best explanation for the pattern, and a small set of comparison terms that you think best support your case.
This figure from the Google Books Ngram Viewer implies that texting peaked in the 17th century:
What the hell is going on here? Use any evidence you want, or just conjecture. Write another paragraph with your team’s best explanation.