This summer, I joined UVA’s Data Science for the Public Good program as a graduate fellow. I learned a ton, and I can’t speak highly enough of my experience there. One of the first lessons: ten weeks of data science sounds glamorous, but it’s four weeks of data profiling for every four weeks of data wrangling for every two weeks of data analysis.
There’s nothing new I can say about data wrangling. However, I want to take a moment to sing the praises of data profiling. Data profiling is a systematic way to dig into your data and evaluate fitness-for-use, beyond measures of central tendency and the first/last 5 rows. The data profiling method I learned at UVA has three pillars: completeness, uniqueness, and validity. I added an additional step: auditing for accuracy.
When you profile your data, you’re going to find stuff that looks weird! By investigating the weird stuff, you’ll get a real feel for your dataset’s texture and quirks, and a sense of what you can expect when working in it later.
I’ll show what can be learned from completeness, uniqueness, validity, and accuracy, with examples from Count Love’s protest data. R code snippets are included where appropriate, and my full profiling script is available on GitHub.
Before getting into profiling, I highly recommend reading any documentation for the data that is already available! No point in solving a mystery that’s already explained in the data dictionary. For this dataset, I found the Count Love FAQ and blog particularly helpful.
There’s a lot of detail below, so feel free to jump to “In Sum” for the tl;dr version.
Missing data can stop an analysis before it gets started. I don’t want to put together an analysis plan that relies on a field I can’t actually use. Completeness measures what proportion of a field actually holds data.
First, I check whether any wacky symbols or strings are used to indicate missing data. (If I never see “N/A” in a spreadsheet again, it will be too soon.) To do that, I simply sort the values in the field by the number of times they appear and glance through the list:
protest %>% group_by(date) %>% summarise(count = n()) %>% arrange(desc(count))
This isn’t a perfect check, but it’s pretty easy to spot a value that means “missing,” or search through for MISSING, N/A, NONE, etc. For a numeric field, I also check the minimum and maximum values to look for extreme/unlikely values (such as -999, often used in SPSS files to indicate missing data).
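For the min/max check, a quick range() makes sentinel values jump out. (Here "attendees" is a stand-in name for the protest-size column; check the real column name in the raw file.)

```r
# Range check for sentinel values like -999;
# "attendees" is a stand-in name for the protest-size field
range(protest$attendees, na.rm = TRUE)
```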
Once I’m sure I only need to look for NAs, I run the following code snippet, which gives me the percentage of non-missing values:
(nrow(protest) - sum(is.na(protest$date)))/nrow(protest)*100
Fortunately, the CountLove data are almost entirely complete:
The only field with any missing data was number of protestors. That’s a shame—it would be very cool to examine protest size! However, when I checked some sample articles, the protest size simply wasn’t included.
Uniqueness can give a sense of a dataset’s scope: for instance, how many individual places had protests? How many unique tags were used to describe protests?
Getting a count of unique values is as simple as finding the length of a vector of unique values:
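In R, using the location field as an example, that looks like:

```r
# Count of distinct values in the location field
length(unique(protest$location))
```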
Interpreting it is another matter entirely:
In some fields, it’s reasonable to have a unique value for every record; in others, you would expect to see only a few unique values. A few things stand out to me here:
1. Why are there so many unique tag values? Are there really 1,400+ unique tags?
I glanced at the first few values in the Tag field, and saw lots of different tags separated by semicolons:
1 Healthcare; For Affordable Care Act
2 Civil Rights; For racial justice; Martin Luther King, Jr.
3 Environment; For wilderness preservation
4 Civil Rights; For racial justice; Martin Luther King, Jr.
5 Healthcare; For Planned Parenthood
I’ll need to split those apart to get a real sense of how many unique tags exist. I used Andrew Ba Tran’s muckrakr package:
library(muckrakr)
tags <- untangle(protest, "tags", pattern = ";")
This adds a column for each unique value in “tags” separated by semicolons, with a 1 in each row that includes the tag and a 0 in each row where that tag is not found. To find the number of unique tags, I can just look at the number of new columns:
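Since untangle() keeps the original columns and appends one indicator column per tag, the difference in column counts gives the tag total (this arithmetic assumes no other columns were added along the way):

```r
# untangle() appends one 0/1 column per distinct tag,
# so the number of distinct tags is the number of added columns
ncol(tags) - ncol(protest)
```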
That gives me 441 unique tags, which is much more reasonable, especially since I know from the documentation that every protest is tagged with at least one subject, at least one stance, and additional context where needed. For instance, there are three gun-related tags: “Guns,” “For greater gun control,” and “Against greater gun control.”
I would also like to know how many tags tend to be used for each event, so I summed the number of ones in the tag columns and looked at a quick-and-dirty histogram:
tag_num <- apply(tags[, 9:449], 1, sum)
summary(tag_num)
hist(tag_num)
Every protest has at least two tags, and the maximum number of tags is 8. I checked the source for an event with 8 tags, and found that, indeed, the event covered all the listed tags (Civil Rights; Immigration; For racial justice; Against border wall; For compassionate immigration; For greater accountability; May Day; Police).
2. Why are there fewer source articles (16,789) than protests (20,025)?
To answer this, I checked the sources that were used most often:
protest %>%
  group_by(source) %>%
  summarise(count = n()) %>%
  filter(count > 1) %>%
  ggplot(aes(x = count)) +
  geom_histogram() +
  ggtitle("Frequency of sources used for more than one protest")
Some were used 20 times or more! I checked the article that had been used 23 times, and discovered that it covered 23 separate school protests:
Since each protest has its own row, the same URL would need to be repeated for each.
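To surface the most-reused URLs themselves, a quick count-and-sort (a sketch using dplyr’s count()) does the trick:

```r
library(dplyr)

# Tally how many protest rows share each source URL, most-reused first
protest %>%
  count(source, sort = TRUE) %>%
  head(5)
```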
3. That’s a lot of locations (6,000+).
On the other hand, the U.S. is a huge country! I’ll keep an eye on the location field in the Validity step.
Validity is a tricky concept: to know how many valid values there are in a field, you first have to get a sense of what valid data looks like for that field.
The first and easiest step is to check the class of each variable. Any numeric variable represented as a “character” vector in R probably has some invalid text values. Fortunately, the variable classes in the CountLove data are pretty straightforward: the date is a date, and numeric fields are integers (“int”):
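A one-liner like this lists every column’s class at a glance:

```r
# Class of every column, as a named vector
sapply(protest, class)
```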
For numeric values or dates, I like to look at the distribution of values and apply the sniff test. In this case, the dates began early in January 2017 and ended in September 2019, as expected.
For character data, I start by looking at the number of characters in each value:
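For the location field, that’s a quick summary of nchar():

```r
# Distribution of string lengths in the location field
summary(nchar(protest$location))
```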
The locations listed in this dataset have a huge spread, from five characters to seventy-five. Both of those values seem pretty extreme!
I looked at the five-character option and discovered, to my delight, that the protest occurred in space:
The other short locations genuinely had short place names: “Napa, CA,” “Reno, NV,” “Bend, OR,” etc. The very long locations had a specific venue, like a school or courthouse, listed before the city and state (for instance, “Charlotte Douglas International Airport, Charlotte, NC”).
Looking at the extremes of the location data clued me in to a common structure: most of the examples I’ve seen ended with a two-letter state abbreviation. I used regular expressions to check whether that was the case for all locations:
state <- "(?-i:A[LKSZRAEP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$"
unique(protest$location[!str_detect(protest$location, state)])
I found 26 locations that didn’t match the pattern. Those exceptions generally took place in another country (or in space), but in five cases, one letter of the state abbreviation was lower-cased. That gives me a couple of to-dos:
- Decide whether I want to include protests that occurred in other countries
- Fix the capitalization for some protests inside the US.
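The capitalization fix, at least, is quick. Here’s one sketch, assuming the trailing two letters really are a US state code (so the international rows would need to be set aside first):

```r
# Upper-case the final two letters of each location, so "Reno, Nv"
# becomes "Reno, NV" (only safe once non-US locations are excluded)
protest$location <- sub("([A-Za-z]{2})$", "\\U\\1", protest$location, perl = TRUE)
```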
I decided to spot-check the location values for one location that might be described in many ways: New York City.
protest[str_detect(protest$location, coll("New York")) |
          str_detect(protest$location, coll("NYC")) |
          str_detect(protest$location, coll("New York, NY")) |
          str_detect(protest$location, coll("Manhattan")), ] %>%
  group_by(location) %>%
  summarize(count = n()) %>%
  mutate(perc = count / sum(count)) %>%
  arrange(desc(count))
There were 88 unique locations, many of which included information about specific boroughs or neighborhoods (“Far Rockaway, Queens, New York, NY”). I found that New York City was usually referred to as “New York City, NY” or “New York, NY.” On one occasion it was “Manhattan, NY,” but never “NYC.”
That gives me one more to-do:
- If I decide to look at broad metropolitan areas, I need to figure out a way to standardize names (count “New York, NY” and “Manhattan, NY” as part of New York City, NY; count “Far Rockaway, Queens, New York, NY” as part of New York City).
That’s certainly possible! I just need to factor in some extra time to make it happen before trying to dive into analysis based on place name.
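As a rough sketch of what that standardization might look like for New York (the pattern below is my assumption; it would need checking against the full list of 88 variants):

```r
library(dplyr)
library(stringr)

# Collapse borough- and neighborhood-level variants into one metro label
nyc <- "New York|NYC|Manhattan|Queens|Brooklyn|Bronx|Staten Island"
protest <- protest %>%
  mutate(metro = if_else(str_detect(location, nyc),
                         "New York City, NY", location))
```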
Audit for accuracy
The Count Love team was kind enough to include a source for each protest, so it behooves me to fact-check some of the data to make sure I can trust the dates, locations, and tags for each protest.
I drew a random sample of 30 protests to check by hand:
audit <- protest[sample(1:nrow(protest), 30), ]
write.csv(audit, "audit.csv")
After following the linked source for each, I found that the dates, locations, and tags listed were all correct, but not necessarily complete. A few protests were missing tags that I thought could also have been applied. For instance, this article probably could have used the “amazon” tag. The tags that were there, however, accurately described the subject of the protest and the stance of the protestors. That lets me know that I should apply a little caution to my results: I can trust the tags that are there, but my findings may not be descriptive of every protest about a particular subject. However, 90% of the audited articles had complete tags, so I feel comfortable moving forward with the tags in general.
If data profiling looks like a lot of work, well, it is. I can automate some parts of it, but mostly it involves a lot of on-the-fly inquiries, intuition, and ingenuity. Ultimately, I feel extremely lucky with regard to this dataset. There were no huge surprises, and only one disappointment (protest size). I will probably find some more issues as I actually conduct the analysis, but I feel prepared to make a plan.