SoDA501

Material for Social Data Analytics 501 @ Penn State

SoDA 501 Glossary

Jingle-jangle fallacies refer to the erroneous assumptions that two different things are the same because they bear the same name (jingle fallacy) or that two identical or almost identical things are different because they are labeled differently (jangle fallacy). https://en.wikipedia.org/wiki/Jingle-jangle_fallacies

A/B testing

accuracy

ACI Advanced Cyber-infrastructure. The cluster computing resources provided on Penn State’s campus by ICS.

activation function

active learning

additive smoothing

adjacency matrix

admixture model (see “mixed membership model”)

administrative data / records

adversarial (learning, examples, network)

affine transformation

affinity

aggregation

AI (see “artificial intelligence”)

AIC Akaike Information Criterion. “AIC is an estimator of the relative quality of statistical models for a given set of data.” (Wikipedia). Related and often compared to BIC. Asymptotically equivalent to leave-one-out cross-validation in some models, including OLS.
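
A minimal sketch of the idea, with hypothetical data: AIC computed for OLS fits under the common Gaussian-likelihood convention AIC = n·ln(RSS/n) + 2k (additive constants dropped, which is fine for comparing models fit to the same data). The function name is illustrative.

```python
# Hedged sketch: AIC for OLS under the Gaussian-likelihood convention
# AIC = n*ln(RSS/n) + 2k (additive constants dropped).
import numpy as np

def ols_aic(X, y):
    """Fit OLS by least squares; return the constant-dropped AIC."""
    beta, rss, _, _ = np.linalg.lstsq(X, y, rcond=None)
    n, k = X.shape
    rss = float(rss[0]) if rss.size else float(np.sum((y - X @ beta) ** 2))
    return n * np.log(rss / n) + 2 * k

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=100)
print(ols_aic(X, y))         # full model
print(ols_aic(X[:, :2], y))  # drops the irrelevant third column; lower AIC expected
```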

Airflow / Apache Airflow

algorithmic bias / algorithmic fairness / machine bias

algorithmic confounding

AlphaGo

amplified asking

ANN – Artificial neural net (see “neural net”)

anonymous function

Apache Software Foundation

API Application programming interface.

artificial intelligence (AI)

association

ATE Average treatment effect.

AUC Area under curve. A diagnostic for the performance of a binary classifier. Can refer to area under the Receiver Operating Characteristic curve (“ROC curve”) or area under the Precision-Recall curve (“PR curve”). The usual advice is to use ROC AUC except in cases of class imbalance, in which case PR AUC is preferred. That is, the PR curve is preferable when one case (say, the positive case) is much rarer than the other, or when you care much more about one case than the other. (This is debatable given the mathematical relationships between the two.) The ROC curve plots the true positive rate (aka “sensitivity,” aka “recall,” aka “hit rate,” aka “probability of detection”) against the false positive rate (aka “fall-out,” aka “probability of false alarm,” = 1 − “specificity”) for different thresholds. The PR curve plots precision (aka “positive predictive value”) against recall. Each is a summary of how the “confusion matrix” changes as a function of the decision threshold. A perfect classifier has an AUC of 1.
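
A minimal sketch, assuming scikit-learn and simulated data: ROC AUC and PR AUC computed for the same classifier scores under heavy class imbalance (`average_precision_score` is scikit-learn’s usual stand-in for PR AUC).

```python
# Hedged sketch: ROC AUC vs. PR AUC for the same scores with ~2% positives.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.02).astype(int)  # rare positive class
scores = rng.normal(size=10_000) + 2.0 * y_true   # noisy but informative scores

print("ROC AUC:", roc_auc_score(y_true, scores))
print("PR  AUC:", average_precision_score(y_true, scores))  # average precision
```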

autoencoder

auxiliary data At least three meanings. One in double / two-phase sampling. One in statistical privacy. One in Bayesian estimation.

auxiliary information

Avro / Apache Avro

awk

AWS Amazon Web Services.

Azure / Microsoft Azure

B-tree “… a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree is a generalization of a binary search tree in that a node can have more than two children … It is commonly used in databases and file systems.” (https://en.wikipedia.org/wiki/B-tree). See explainer on “data structures.”

backpropagation

bag of words

bagging

balance

bash

basis

batch

Bayes / Bayesian (statistics, estimation, theory, updating)

Beam

behavioral drift

behavioral science

Berkeley Data Analytics Stack (BDAS)

between-subjects design

betweenness centrality

bias

bias-variance tradeoff

BIC Bayesian Information Criterion

big data

Big O notation

BigQuery see “Google BigQuery”

Bigram An n-gram of length two. (See “n-gram.”)

Bigtable

bit sampling

black box

blind source separation

block diagonal

blocking

blue-team dynamics (see “red-team dynamics”)

Bonferroni correction

Boolean

boosting

bot

breadth-first search

Bridges XSEDE computing resource for large datasets, run by Pittsburgh Supercomputing Center (PSC)

calibration

canonical correlation analysis (CCA)

CAPTCHA

capture-recapture

case-folding

cast

causal discovery

causal effect

causal inference

causal graph

CCA (see “canonical correlation analysis”)

centered

centrality

change of basis

characteristic matrix

chunk

citizen science

classification / classifier

Clojure

cloud

cluster analysis

cluster computing

clustered sampling

CNN (see “convolutional neural net”)

codes / codebook At least three meanings of codebook. Social science meaning. Data compression meaning. Cryptography meaning.

coding

collaborative filtering

collective intelligence

column-oriented database / column store

Common Rule / Revised Common Rule

community / community detection

compositional / compositional data

compression

compression artifact

computational social science

confounder / confounding

confusion matrix

conjugate prior

concurrent validity

constituency parse

construct validity

content validity

control (group, variable)

convergent validity

convolution

convolutional neural net (CNN)

core-sets

correlation

correspondence analysis

cosine similarity

Couchbase

CouchDB

counterfactual

covariance / covariance matrix

coverage bias

coverage error

cross entropy

cross product

curse of dimensionality

cross-validation

crowdsourcing

csv

CUR decomposition

DAG (see Directed acyclic graph)

data augmentation

data deluge

data editing / statistical data editing

data exhaust

data fusion

data lineage

data mapping

data marshalling

data mining

data munging

data privacy

data profiling

data prototyping

data provenance

data science

data squashing

data store

data streams

data structure

data transformations

data type

data wrangling / cleaning / scrubbing / munging

database

database management system

Dataflow

data-intensive

data.table

DCA Discrete Component Analysis.

de-anonymization

decomposition

de-duplication

deep learning

degeneracy (of labels; of ERGMs)

degree

degree centrality

dehydrated data

delimited file

demand effect

dependency parse

design matrix

DFT Discrete Fourier Transform (see Fourier)

difference-in-differences (DID)

differential privacy

differentially private query

digital traces / digital footprints / digital fingerprints

dimensionality reduction

directed acyclic graph / DAG

dirty (data)

discriminant validity

dissimilarity

distance

distributed computing/processing

distributed data collection

distributed data store

distributed representation

distributed sensors

Docker

document-oriented database

document-term-matrix / document-frequency matrix

dot product

DOM

double centered

double sampling Used in at least three distinct ways in the sampling literature. One describes a two-phase sampling procedure in which a sample is taken and then, if inconclusive, a second sample is taken. (cite).

The second, as used by Thompson, is the more relevant for SoDA. This describes a two-phase procedure in which a sample is taken to measure some variable auxiliary to our variable of interest, and then a smaller sample of those is taken to measure our (presumably more expensive or intrusive) variable of interest. The auxiliary variable is presumed to occur at a constant ratio to the variable of interest, so the auxiliary variable can be used to improve our estimate of the variable of interest (see the sketch following this entry). This is very similar to the sort of procedure Salganik calls amplified asking, in which an expensive survey is combined with cheaper “big data,” except that the relationship between the two is estimated by a more general process of supervised learning rather than a ratio assumption.

A third type of double sampling is also a two-phase sampling procedure. Here we want to do stratified sampling, but don’t know the sizes of the strata. An initial sample is used to estimate the sizes of the strata, and then a second stratified sample is taken to measure the variable of interest.
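
A minimal sketch of the second (Thompson) sense, with simulated data: a large, cheap phase-1 sample measures only the auxiliary variable; a small phase-2 subsample also measures the expensive variable of interest; the phase-2 ratio rescales the phase-1 mean.

```python
# Hedged sketch: two-phase (double) sampling with a ratio estimator.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.gamma(shape=5.0, scale=2.0, size=5_000)    # phase 1: cheap auxiliary x
idx = rng.choice(len(x1), size=200, replace=False)  # phase 2: small subsample
x2 = x1[idx]
y2 = 3.0 * x2 + rng.normal(scale=2.0, size=200)     # expensive y, ~constant ratio to x

ratio_hat = y2.mean() / x2.mean()  # estimate the ratio y/x from phase 2
y_hat = ratio_hat * x1.mean()      # improve the estimate of E[y] using all of phase 1
print(y_hat, "vs. phase-2-only mean:", y2.mean())
```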

drift

drill

dropout

DTM (see “document-term matrix”)

dummy observations see “pseudo-observations.”

dummy variable

duplicate detection

ecological momentary assessment (EMA)

econometrics

edit distance

edge

Eigen- (value, vector, decomposition)

Eigenvector centrality

elastic net

Elasticsearch

EM algorithm

embarrassingly parallel / parallelizable

embedding(s)

encouragement design

enriched asking

ensembling / ensembles

entity

entity resolution / disambiguation

entropy

environment

epoch

ERGM (see “Exponential Random Graph Model.”)

ETL (see “Extract, transform, load”)

Euclidean (distance, norm)

event data

Exceed OnDemand Software – remote access client for X-windowing, recommended / provided by ICS for use on the ACI systems. Not to be confused with “XSEDE”, the consortium of remote computing resources funded by NSF (and generally facilitated by ICS-ACI).

exclusion restriction

experiment

Exponential Random Graph Model (ERGM) A generative model of link formation in networks. Extensions include TERGM, GERGM.

exponential family

external validity

Extract, Transform, Load (ETL)

F1

face validity

factor analysis (FA)

factorial design

factorization (of a matrix)

feature

feature engineering

feature extraction

feature learning

feature selection

feedforward networks

FFT Fast Fourier Transform (see Fourier).

field

field view (of geographic data)

filter

first-class function / first-class citizen (in programming)

fixed effects

flat file

floating point

forecasting

formal theory - used in political science to mean theorizing through mathematically coherent models; usually refers to the use of microeconomics-style models or game theory to model some political phenomenon. The main usage is to distinguish it from another subfield of political science, political theory, which in most instances bears closer resemblance to philosophy than to economics.

Fourier (analysis, operator, series, transform) / Fast Fourier Transform (FFT) / Discrete Fourier Transform (DFT)

FP-tree

frame

frame population

Frobenius norm

functional programming

Fundamental Problem of Causal Inference

games with a purpose / gamification

GAN (see “generative adversarial network”)

garden of forking paths

Gaussian

generalization error

generative adversarial network (GAN)

generative model

geometric mean

GIA Geographic Information Analysis

Gibbs sampling

Giraph / Apache Giraph

GIS Geographic Information System

git / github

GloVe

gold standard

Google BigQuery

Google Books

Google Colab

Google Correlate

Google Flu

Google Ngram Viewer

Google Trends

GPU Graphics Processing Unit

gradient boosting

gradient descent see also “stochastic gradient descent”

Gram matrix

granularity

graph

graph mining

graphical database

graphical models

grid computing

ground truth

grouped summaries

H2O

Hamming distance

Hadamard product

Hadoop

harmonic mean

hash / hashing / hash table

Haskell

Hawthorne effect

HBase

HCI Human-computer interaction.

HDFS Hadoop distributed file system. (also, around here, “Human Development and Family Studies”)

HDF5 Hierarchical Data Format, version 5

Hessian

heterogeneity

heterogeneous treatment effects

heteroskedasticity

hidden layer / hidden nodes

Hidden Markov Model (HMM)

hierarchical / hierarchy

higher-order functions

Hilbert space

HITS

Hive

homogeneous-response-propensities-within-group assumption

homophily

honeytrap (for web scrapers)

human computing / human-in-the-loop computation

human subjects

hyperparameters

hypothesis space

ICA (see Independent Component Analysis)

ICS Penn State’s Institute for CyberScience. Administers the ACI (advanced cyber-infrastructure) systems on campus.

IDE Integrated development environment.

Impala

idempotent

identification / identification problem

identity matrix

ill-posed (problem)

image

image filter / image kernel

imputation

incidence matrix

Independent Component Analysis (ICA)

index

indicator variable

indirect effects

influence maximization

information

information retrieval

information theory

informed consent

ingestion (of data)

inner product

instance construction

instance detection

instrument

instrumental variable (IV)

integer

integrated / integration

intention to treat (ITT)

internal validity

interrupted time series

inverse (of a matrix)

inverse problem

inverted index

intervening variable

invertible

IR Usually, in this field, “information retrieval.” Also used for “international relations” and “infrared.”

IRB Institutional Review Board.

IRT (see “item response theory”)

item nonresponse

item response theory (IRT) / IRT model

Jaccard similarity

Javascript

Jetstream

Johnson-Lindenstrauss lemma

join

JSON

Julia

Jupyter / Jupyter notebooks

k-means

k-NN (k Nearest Neighbors)

k-shingle

Kaggle.com kernels

Kagglification

Keras

kernel

kernel density estimation

Kernel PCA (KPCA)

kernel smoothing

kernel trick

key

key-value pair

key-value store

KL-divergence (see “Kullback-Leibler divergence”)

Kullback-Leibler divergence (aka “relative entropy”)

L1-norm / L1 regularization

L2-norm / L2 regularization

lambda operator / function

Laplace (distribution / prior)

Laplacian (of a network)

Laplacian eigenmaps

LASSO

latency

Latent Dirichlet Allocation (LDA)

Latent Semantic Analysis (LSA)

Latent Semantic Indexing (LSI)

latent variables

layer

layer view (of geographic data)

LDA either “Latent Dirichlet Allocation” or (Fisher’s) “Linear Discriminant Analysis”

LDC Linguistic Data Consortium.

Leaflet

leakage (of data)

leave-one-out cross-validation (LOOCV)

lemma / lemmatization

levels of measurement

lifelong learning

likelihood

linear subspace

linear transformation

link

linkage (“record linkage”)

linked data

list

list experiment

literate programming

load / loadings

local average treatment effect (LATE) aka “complier average causal effect (CACE)”

locality-sensitive hashing

locally linear embedding (LLE)

logarithm

logistic regression / logit

long data

longitudinal

loss function

LSTM

Lucene (Apache Lucene)

Luigi

machine bias (see “algorithmic bias”)

machine learning

makefile

manifest

manifold

Mahout

map

map/reduce, MapReduce

MariaDB

Markov (process, chain, model)

Markov Chain Monte Carlo / MCMC

matching In causal inference, matching is a process for analysis of observational data, under which treated units are matched with control units that are otherwise similar by some measure on observable pretreatment variables. In information sciences, matching is another term for “record linkage.”
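
A minimal sketch of the causal-inference sense, with simulated data: each treated unit is matched to its nearest control on one observed pretreatment covariate, and the matched difference in outcomes estimates the effect among the treated.

```python
# Hedged sketch: 1-nearest-neighbor matching on a single covariate.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)                        # observed pretreatment covariate
treat = rng.random(n) < 1 / (1 + np.exp(-x))  # treatment probability depends on x
y = 2.0 * treat + x + rng.normal(size=n)      # true treatment effect = 2

xt, yt = x[treat], y[treat]
xc, yc = x[~treat], y[~treat]
nearest = np.abs(xt[:, None] - xc[None, :]).argmin(axis=1)  # closest control per treated unit
print("matched ATT estimate:", (yt - yc[nearest]).mean())   # ~2 here
```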

max norm

MaxEnt

MCMC (see “Markov Chain Monte Carlo”)

MDS (see “multidimensional scaling”)

measurement model

mechanism

mediator / mediating variable

melt

Memcached

merge

metadata

metric

MinHash

minimum spanning tree

missing data / missingness

mixed membership model (also called an “admixture model”)

mixed models

mixture model

model-based

moderator / moderating variable

Modifiable Areal Unit Problem (MAUP)

moments

MongoDB

Moore-Penrose pseudoinverse

morpheme / morphology

MovieLens

MPI Message Passing Interface.

MRP / “Mister P” Multilevel regression and poststratification.

MTurk

multidimensional scaling (MDS)

multilevel modeling

multiple comparisons

multiple imputation

multiple systems estimation (MSE)

multiscale

multithreaded

multivariate statistics

MusicLab

mutual information (MI), pointwise mutual information (PMI), positive pointwise mutual information (PPMI)

MWE Multiword entity. (See “entity.”)

MySQL

Naive Bayes (classifier)

name matching

named entity recognition / NER

natural experiment

natural language processing / NLP

nearest neighbors

negative sampling

Neo4j

NER (see “named entity recognition”)

Netflix Prize

neural net / artificial neural net (ANN)

n-gram

NLP (See “natural language processing”)

No Free Lunch Theorem

node

noise

noncompliance

non-metric (distance / similarity function)

non-negative matrix factorization (NMF)

nonparametric

non-probability sample

nonreactive (measure)

non-rectangular data

nonresponse bias

non-stationary (time series)

norm / normal / normalize / normalization

normal distribution

normal form

NoSQL

notebooks

nowcasting

NP-hard / NP-complete

numerical computation

NumPy stack

object

object-oriented programming

object view (of geographic data)

observational (data, design)

OCR Optical character recognition. The task of turning images into alphabetic (or similar) characters.

OLS Ordinary least squares

one-hot

online Can be used in its everyday sense (“on the web”), but can also be used like “streaming” as a modifier to “data,” “algorithm,” or “processing” to indicate that data is processed sequentially, with one pass through the data, or even in real time as the data are generated.
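
A minimal sketch of the one-pass sense: Welford’s running mean and variance, in which each observation updates a small state and can then be discarded (function and data are illustrative).

```python
# Hedged sketch: one-pass ("online") mean and sample variance (Welford).
def online_mean_var(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n         # update running mean
        m2 += delta * (x - mean)  # update sum of squared deviations
        yield mean, (m2 / (n - 1) if n > 1 else float("nan"))

for mean, var in online_mean_var([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]):
    pass
print(mean, var)  # 5.0 and ~4.57, matching the batch computation
```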

open data

open science

OpenStreetMap

operationalize / operationalization

ORC Optimized Row Columnar. A big data file format. Like Parquet, a columnar format (contrast with Avro, which is row-based).

orthogonal

orthogonalize

orthonormal

OSEMN workflow

out-of-sample prediction

over-determined

over-fitting

p-hacking

Pachyderm

PageRank

pandas

panel data

parallel worlds design

parametric

Parquet

parse tree

parsing

Pasteur’s quadrant

path

pattern

pattern recognition

pdf (stands for “Portable Document Format”) “Where data goes to die.” – Simon Rogers (Data editor, Google News Lab).

Penrose inverse

persistence

perturb-and-observe experiment

pickle (pkl)

pipe / pipe operator

pipelines

pivot table

placebo / placebo test

PMI Pointwise mutual information. (See “mutual information”)

population drift

POS (in NLP) Part of speech. A “POS tagger” attempts to annotate the tokens/words of input sentences / text with their part of speech.

positive definite

posterior

PostGIS

PostgreSQL

posting

post-SQL

post-stratification

potential outcomes model

PPMI Positive pointwise mutual information. (See “mutual information”)

precision

prediction

preprocessing

Presto

pre-registration

principal angle

Principal Component Analysis (PCA)

prior

privacy-preserving data-mining

probabilistic data structures

Procrustes (analysis)

profiling (of code)

profiling (of data) “Data profiling is the process of examining the data available from an existing information source … and collecting statistics or informative summaries about that data.” (Wikipedia) This is an information science concept, where it is also known as “data archeology” or “data assessment.” I know of no literature overlap, but it is essentially the same as the precursors to the process of “data editing” used by official statistics agencies. Individual record profiling can consist of syntactic profiling (making sure entries fit broad format constraints) or semantic profiling (making sure entries make sense). Set-based profiling involves examining the distribution of values of a variable/field, and parallels the social science practice of providing “descriptive statistics.” See Rattenbury, et al. (2017), Principles of Data Wrangling.
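
A minimal sketch, assuming pandas and invented data: syntactic profiling (do entries fit a format constraint?) and set-based profiling (the distribution of a field’s values).

```python
# Hedged sketch: simple syntactic and set-based data profiling with pandas.
import pandas as pd

df = pd.DataFrame({
    "zip":   ["16802", "1680", "16801", "ABCDE", "16802"],
    "state": ["PA", "PA", "PA", "NY", "PA"],
})

bad_zip = ~df["zip"].str.fullmatch(r"\d{5}")  # syntactic: not exactly five digits
print(df[bad_zip])

print(df["state"].value_counts())             # set-based: distribution of values
print(df.describe(include="all"))             # akin to "descriptive statistics"
```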

projection

provenance (of data)

pruning

pseudo-inverse

pseudo-observations / pseudo-counts

quasi-experiment

QGIS

query

R

random forests

random projections

randomized controlled experiment / trial

raster

RCT randomized controlled trial.

RDD in context of causal inference with observational data, see “regression discontinuity design.” In context of computing with Spark, see “resilient distributed dataset.”

RDF (Resource Description Framework)

RDS (R Data Serialization format)

reactivity

recall

recast

recommender system

reconstruction error

record

record linkage - (see “linkage”)

recurrent neural net (RNN)

red-team dynamics / blue-team dynamics

Redis

redundancy

regression

regression discontinuity design (RDD)

regression to the mean

regression trees

regex (see “regular expression”)

regular expressions

regularization / regularize

reinforcement learning

relation / relational data / relational database

reliability

ReLU (Rectified Linear Unit)

remote sensing

replace, refine, reduce

replicability

report

representation (of data)

representative / representativeness

reproducibility

repurposing (of data)

resilient distributed dataset (RDD)

respondent-driven sampling

RESTful / REST API

ridge regression

RMSE Root mean squared error.

RNN (see “recurrent neural net”)

robots.txt

ROC curve

rotation

row-centered

RStudio

SAC (see “Split-Apply-Combine”)

sample population

sampling

sampling error

sampling frame

Scala

scale

scaling

scatter matrix

schema

scikit-learn

script

SciPy

segmentation

selection bias

semantic web

semantics

semi-parametric

semi-structured data Often used to refer to data not in relational normalized form. Commonly used to describe JSON and XML. In big data contexts, can refer to formats like ORC, Parquet, and Avro.
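
A minimal sketch, assuming pandas and invented records: nested JSON-style data (semi-structured) flattened toward a rectangular table.

```python
# Hedged sketch: flattening semi-structured records with pandas.
import pandas as pd

records = [
    {"user": {"id": 1, "name": "ada"},  "tags": ["a", "b"], "score": 0.9},
    {"user": {"id": 2, "name": "alan"}, "tags": [],         "score": 0.4},
]
flat = pd.json_normalize(records)  # nested keys become "user.id", "user.name"
print(flat)                        # "tags" stays a list-valued (ragged) column
```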

semi-supervised learning

sensitive / sensitivity

sensors

SEO Search engine optimization.

sequence modeling

serialization (of data) / serialization formats

SGD (see “stochastic gradient descent”)

shapefile

shell

shingle / shingling

shrinkage

sigmoid (activation function)

signal processing

simplex

singular

singular value decomposition (SVD)

sketches

smoothing

SNA Social network analysis

snakemake

social

social data

social data analytics

social data stack

social network I’d prefer if this were only used to refer to networks in which the nodes are people or groups of people and the edges indicate some kind of “social” relationship. It is sometimes used in reference to networks more generally. It is also, of course, a term of art used to refer to social media sites or platforms, e.g., Facebook.

social science

softmax

Software Carpentry

Solr (Apache)

Spark

sparse coding

sparse matrix

sparsity

spatial

spatial autocorrelation

spectrum / spectral theory

spider trap

spillover

split-apply-combine (SAC)

spurious

SQL

SQLite

stationarity

statistical conclusion validity

statistical disclosure limitations

statistical learning

stemming / stem

STM Structural Topic Model

stochastic gradient descent / SGD

strata

stratified sampling

streaming / stream processing Can be used in its everyday sense (“streaming video”), but can also mean that data are passed through an algorithm sequentially, with the algorithm updating after each observation.
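
A minimal sketch of the algorithmic sense: a stand-in generator plays the role of an unbounded source, and state is updated record by record without the records ever being stored.

```python
# Hedged sketch: one-pass counting over a (simulated) stream.
from collections import Counter

def record_stream():
    # stand-in for an unbounded source (socket, log tail, message queue)
    yield from ["click", "view", "click", "buy", "view", "click"]

counts = Counter()
for record in record_stream():
    counts[record] += 1  # update state, then discard the record
print(counts.most_common())
```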

structural

structured data

supervised learning

support vector machine (SVM)

SUTVA

SVD (see “singular value decomposition”)

SVM (see “support vector machine”)

systematic sample

systemic drift

tensor

TensorFlow

tessellation

test data / test set (see “training data”)

tf.idf / tf-idf

threats to validity

Thrift / Apache Thrift

tidy data / tidyverse

Tikhonov regularization

tile

topic model

toponymy The study of place names.

total survey error

TPU Tensor processing unit. Currently proprietary chips from Google, but available as an option in Google Colab.

trace

training data/set; test data/set; validation data/set

transfer learning

transformation

transition matrix

transparency

transportability

treatment / treatment effect

triangle inequality

Trifacta Wrangler

TSCS Time-series cross-section

t-SNE

tsv

Tucker decomposition

Turkers

uncertainty

unfolding Lots of disciplines have something they call “unfolding” (e.g., music, biochemistry). Here, it’s most likely to refer to a type of data analysis closely related to multidimensional scaling, which maps individuals and objects over which they have preferences in the same space. The next most likely usage is as “deconvolution”, the reversing of a convolution operation.

Unicode

unit nonresponse

unobtrusive (measure)

unpivot

unsupervised learning

uptake rate

user-attribute inference

UTF-8

“V”s of big data The traditional “three Vs of big data” are volume, velocity, and variety. There are many candidate “fourth Vs,” including those in Monroe’s “five Vs of big data social science.”

validation / validity

variance

variational inference / variational Bayesian methods

variety One of the conventional “three Vs” of “Big Data.”

varimax / VARIMAX

vector

vector quantization (VQ)

vector space / vector space model

VEM Variational Expectation Maximization

version control

vertex

vinculation - the tendency for social data to display interconnectedness (e.g., tied through network edges, exhibiting spatial correlation) that complicates inference and/or is itself the target of inference. Vinculated data may be small in N, but still require computationally intensive methods. One of Monroe’s (2013) “five Vs.” A vinculum is a “bond” or “tie”; used in anatomy, chemistry, and math.

virtual machine

virtualization

visual analytics

Voronoi diagram / tessellation aka “proximity polygons”

wavelet

weak instrument

web driver

weights

WEIRD Western, Educated, Industrialized, Rich, and Democratic. A critique of the typical pool of participants for lab experiments.

wide data

within-subjects design

word2vec Also doc2vec, sense2vec, skip-gram, CBOW, negative sampling

XML

XPath

XSEDE “Extreme Science and Engineering Discovery Environment”

YAML (“YAML Ain’t Markup Language”) A human-readable, text-based data serialization format.
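
A minimal sketch, assuming the third-party PyYAML package: a small YAML document parsed into native Python objects.

```python
# Hedged sketch: parsing YAML with PyYAML (pip install pyyaml).
import yaml

doc = """
experiment: soda501-demo
seed: 42
features:
  - unigrams
  - bigrams
"""
config = yaml.safe_load(doc)
print(config["features"])  # ['unigrams', 'bigrams']
```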

zone