
Cleaning UFO Data with dplyr

[Image: UFO Sightings]

Want to smash together 10 different data files in R? Got untidy dates in need of formatting? Here’s an extreme[1] example of cleaning and combining several data sets.

I’m not sure about extraterrestrial life. But while performing data analysis here on Earth, we seldom receive clean data in a single file. We’ll learn more about the tidyverse by exploring the universe of variables we might associate with UFO sightings.

Step 0: Read in libraries

First, let’s load the necessary packages. If you don’t have one of them installed, type the command below into your Console. Say you don’t have the tidyverse package installed:

install.packages("tidyverse")
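You can also install several packages in one call:

install.packages(c("here", "tidyverse", "data.table", "janitor", "zoo"))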

Now we’re ready to load those libraries!

library(here)
library(tidyverse)
library(data.table)
library(janitor)
library(noncensus)
library(zoo)
library(plyr)      # plyr masks several dplyr verbs, hence the dplyr:: prefixes later on
library(datetime)
library(purrr)
library(lmtest)
library(ggplot2)
library(ggrepel)
library(GGally)
library(lubridate)

Step 1: Download files

Let’s download the files from this GitHub repo. If you’re not as comfortable with Git, you can just click the green Clone or download button in the upper right corner and choose Download ZIP.
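If you’d rather stay in R, here’s a minimal sketch that downloads and unpacks the ZIP; the URL below is a placeholder, so swap in the actual repo link:

# Placeholder URL -- replace <user>/<repo> with the real repo
zip.url <- "https://github.com/<user>/<repo>/archive/master.zip"
download.file(zip.url, destfile = "repo.zip")
unzip("repo.zip")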

Step 3: Read in files

I used to write file.path() to set relative paths to files, but I recently discovered the here package through Jenny Bryan.

The here() function helps ensure consistent file paths, even when our files are nested in subdirectories.

beer.path <- here("raw", "brew_count_by_state_1984_2017.csv")
movies.path <- here("raw", "alien_movies_per_year.csv")
gdp.path <- here("raw", "gdp_per_capita_per_year.csv")
internet.path <- here("raw", "internet.csv")
internet.rurality.path <- here("raw", "internet_by_rurality.csv")
usafbase.path <- here("raw", "usaf_base_locs.csv")
statepops.path <- here("raw", "state_pops.csv")
sightings.path <- here("raw", "ufo_sightings.csv")
allmovies.path <- here("raw", "number_of_movies_per_year.csv")
statepops3.path <- here("raw", "statepops.db3.csv")
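
By the way, calling here() with no arguments prints the project root that those relative paths are anchored to (the output below is illustrative; yours will differ):

here()
#> [1] "/home/you/ufo-sightings"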

Now we’re ready to read in the CSV files. I used to use read.csv, but I’ve switched to fread from the data.table package since it’s significantly faster for larger data sets. (You can learn more about fread and other ways of reading data into R through DataCamp.)

The na.strings parameter automatically converts specific strings into NA. This is useful since different files denote missing values differently. For example, the beer file encodes null entries as asterisks "*" or empty strings "", while the sightings file uses an empty string, "Unknown", or "--".

beer.raw <- fread(beer.path, header = TRUE, na.strings = c("*", ""))
sightings.raw <- fread(sightings.path, header = TRUE, na.strings = c("", "Unknown", "--"))
movies.raw <- fread(movies.path, header = TRUE)
gdp.raw <- fread(gdp.path, header = TRUE)
internet.raw <- fread(internet.path, header = TRUE)
internet.rurality.raw <- fread(internet.rurality.path, header = TRUE)
usafbase.raw <- fread(usafbase.path, header = TRUE, na.strings = c(""))
statepops.raw <- fread(statepops.path, header = TRUE)
allmovies.raw <- fread(allmovies.path, header = TRUE)
statepops3.raw <- fread(statepops3.path, header = TRUE)

Step 3: Data cleaning

Now that we’ve read in our data, we’re ready to clean each individual data set. Let’s start with the breweries data.

Step 3a: Tidy Beers

In the quest for consistency, the rename function from the dplyr library allows us to easily rename variables. Here we rename STATE to state and assign the result to a new dataframe called beer.db. This helps us distinguish the raw dataframes we read in from the cleaned dataframes we want in the end.

beer.db <- beer.raw %>% 
  rename(state = STATE)

Now let’s get rid of some unwanted text in the file. At the bottom of the state column there are footer notes we don’t need. We can filter out rows where state is NA or contains text like “Total” or “Other”.

beer.db <- beer.db %>% 
  filter(! (is.na(state) | state == "Total" | state == "Other" |
              state == "* No reportable data" | 
              state == "«This list will be updated quarterly."))

It would be cumbersome to keep reassigning the beer.db dataframe every time we wanted to make a change. Fortunately, the pipe operator %>% lets us chain all of these manipulations to the beer.raw dataframe and assign the result to beer.db in one tidy sweep.

beer.db <- beer.raw %>% 
  dplyr::rename(state = STATE) %>%
  filter(! (is.na(state) | state == "Total" | state == "Other" |
              state == "* No reportable data" | 
              state == "«This list will be updated quarterly.")) %>%
  mutate(state = as.factor(state)) %>%
  gather_("year", "breweries", as.character(seq(1984, 2017))) %>%
  mutate(breweries = as.numeric(breweries)) %>%
  mutate(year = as.numeric(year)) %>%   # convert year before comparing it numerically
  filter(1984 <= year & year <= 2016)
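
If gather_() is unfamiliar, here’s a toy wide-to-long reshape on a made-up mini data frame; each year column becomes a row. (In current tidyr, pivot_longer() supersedes gather_().)

wide <- data.frame(state = c("CA", "TX"),
                   `1984` = c(50, 30),
                   `1985` = c(55, 32),
                   check.names = FALSE)
gather_(wide, "year", "breweries", c("1984", "1985"))
#>   state year breweries
#> 1    CA 1984        50
#> 2    TX 1984        30
#> 3    CA 1985        55
#> 4    TX 1985        32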

Step 3b: Tidy the Internets

# Combine the two into one data frame based on year
internet.db <- merge(internet.raw, internet.rurality.raw, by = "Year")

# Rename the columns
colnames(internet.db) <- c("year", "internet_avg", "internet_urban", "internet_suburban", "internet_rural")

Step 4: Group the states into regions

The states data set from the noncensus package comes in handy since we want to bin each state into a region category. For example, California will be considered “west” and Massachusetts will be considered “east.”

data(states)
statepops.db <- statepops.raw %>% 
  gather_("year", "population", as.character(seq(1960, 2010, 10))) %>% 
  filter(Name %in% state.name) %>% 
  mutate(year = as.integer(year)) %>% 
  mutate(population = as.numeric(gsub(",", "", population)))

Let’s keep only the rows that are actual state names, then transpose the result.

statepops.db2 <- statepops.raw %>% 
  filter(Name %in% state.name)  

# Transpose: states become columns, decennial years become rows
statepops.db3 <- data.frame(t(statepops.db2))

The census populations are only decennial, so we’ll spline-interpolate yearly values with zoo. We will only keep data from years 1970 to 2016 since that timeframe coincides with the rest of our data sets. Let’s write the clean data to a csv file within the /clean/ folder to tidy up a little.

yrs <- 1970:2016
yrs <- zoo(NA, datetime::as.year(yrs))  # empty yearly scaffold
zpops <- zoo(select(statepops3.raw, -year), as.year(statepops3.raw$year))

z <- merge(yrs, zpops)
z <- z[, colnames(z) != "yrs"]
interps <- na.spline(z)  # spline-interpolate populations for the missing years
interps <- as.data.frame(interps)
interps$year <- rownames(interps)
colnames(interps) <- mapvalues(colnames(interps), state.name, state.abb)
interps <- interps %>%
    filter(year >= 1974)
fwrite(interps, "./clean/interpolated_pops.csv")
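
To build intuition for what na.spline() did above, here’s a minimal sketch on a toy zoo series that only has values at the decade endpoints:

library(zoo)
toy <- zoo(c(100, rep(NA, 9), 150), order.by = 1960:1970)
na.spline(toy)
#> 1960 1961 1962 1963 1964 1965 1966 1967 1968 1969 1970
#>  100  105  110  115  120  125  130  135  140  145  150

With only two anchor points the spline is just a straight line; the real decennial data has more anchors, so the interpolated curve bends to fit them.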

Step 5: Tidy the UFO Sightings

If you’ve read this far, you must be wondering, “What do those two colons :: mean?!” Since different packages can share function names, this is how R disambiguates them. For example, the filter function from the dplyr package is not the same as filter from the stats package. To be frank, the code below doesn’t look the most elegant, but it gets the job done.

sightings.db <- sightings.raw %>% 
  mutate(state = toupper(state)) %>% 
  mutate(year = as.numeric(format(as.Date(date_time, format="%m/%d/%y"),"%Y"))) %>% 
  dplyr::filter(state %in% state.abb) %>% 
  dplyr::filter(1974 <= year & year <= 2016) %>% 
  dplyr::count(year, state) %>% 
  dplyr::rename(num_sightings = n) %>%
  dplyr::left_join(states, by = c("state" = "state")) %>%
  dplyr::select(-area, -population, -division, -capital) %>%
  dplyr::mutate(decade = year - (year %% 10))
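
As a quick illustration of the name collision, both of these functions exist; without the prefix, whichever package was attached last wins:

dplyr::filter(mtcars, cyl == 6)     # subsets rows of a data frame
stats::filter(1:10, rep(1/3, 3))    # applies a moving-average filter to a series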

Step 6: Normalize the number of sightings by the population of each state

Perhaps the number of sightings is skewed by the scale of each state’s population, so let’s normalize the number of sightings per capita. For example, 50 sightings in a state of 5 million people works out to 1 sighting per 100,000 residents.

interps <- melt(interps, id.vars = "year")  # long format: year, state (variable), population (value)
interps$year <- as.numeric(interps$year)
sightings.db2 <- as.data.frame(sightings.db) %>% 
  dplyr::left_join(interps, by = c("year" = "year", "state" = "variable")) %>%
  dplyr::mutate(sightings_per_100k = (num_sightings / value) * 100000) %>% 
  dplyr::select(year, state, region, sightings_per_100k)

write_csv(sightings.db2, "./clean/sightings_db2.csv")

Using population data from the Census, we can also get a feel for each state’s average population across the decades.

statepops.raw$Name <- mapvalues(statepops.raw$Name, state.name, state.abb)
state_pops <- statepops.raw %>%
    filter(Name %in% state.abb) %>%
    select(one_of(c("1970", "1980", "1990", "2000", "2010"))) %>%
    mutate_all(funs(as.numeric(gsub(",", "", .)))) %>%  # strip comma thousands separators
    apply(1, mean)

Step 7: Normalize movies by total movies that year

Also, remember to stay within the time range of our other data sets.

allmovies.raw <- allmovies.raw %>% 
  filter(1974 <= year & year <= 2016)

Now divide the number of alien movies released each year by the total number of movies released that year.

The resulting proportions are small, so interpretability may be an issue here. Careful!

movies.db <- movies.raw %>% 
  filter(1974 <= year & year <= 2016) %>%
  left_join(allmovies.raw, by = c("year" = "year")) %>% 
  mutate(alienmovies_per_year = num_movies / total_movies) %>% 
  select(year, alienmovies_per_year)

While we’re at it, let’s tidy the per-capita GDP data too:

gdp.db <- gdp.raw %>% 
  clean_names() %>% 
  dplyr::rename(year = year_and_category) %>% 
  filter(year >= 1974) %>% 
  select(year, per_capita_gdp_current)
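
If clean_names() from janitor is new to you, it converts awkward column headers into snake_case, which is where year_and_category and per_capita_gdp_current come from. A toy illustration with made-up column names:

ugly <- data.frame("Year and Category" = 1974,
                   "Per Capita GDP (Current)" = 7000,
                   check.names = FALSE)
names(clean_names(ugly))
#> [1] "year_and_category"      "per_capita_gdp_current"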

The data on the number of US Air Force bases per state is a bit sparse, so be sure to handle missing entries.

usafbase.db <- usafbase.raw %>% 
  group_by(State) %>%
  dplyr::count() %>%
  merge(state.abb, by.x = "State", by.y = 1, all = TRUE) %>%  # keep states with zero bases
  mutate(n = replace(n, is.na(n), 0)) %>%
  filter(!is.na(State)) %>% 
  dplyr::rename(afbase_per_state = n) %>% 
  dplyr::rename(state = State)

Step 8: Join files together

When you join two data sets, rows are matched on shared key values. For example, I want to pair up the entries from beer.db and movies.db that have the same year.
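
Here’s a toy sketch with made-up mini data frames; note how left_join keeps every row of the left table and fills in NA where the right table has no match:

beers_toy  <- data.frame(year = c(1990, 1991, 1992), breweries = c(10, 12, 15))
movies_toy <- data.frame(year = c(1990, 1991), alien_movies = c(3, 5))
left_join(beers_toy, movies_toy, by = "year")
#>   year breweries alien_movies
#> 1 1990        10            3
#> 2 1991        12            5
#> 3 1992        15           NA

Now the real thing: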

all.states.normalized.df <- sightings.db2 %>% 
    left_join(beer.db, by=c("year" = "year", "state"= "state")) %>% 
    left_join(movies.db, by = c("year" = "year")) %>% 
    left_join(gdp.db, by = c("year" = "year")) %>%
    left_join(internet.db, by=c("year" = "year")) %>% 
    left_join(usafbase.db, by=c("state" = "state")) %>%
    left_join(interps, by=c("year"="year", "state"="variable")) %>%
    dplyr::rename(population=value) %>%
    mutate(normalized_afb_count=afbase_per_state/population*1000000) %>%
    select(one_of("state", "year", "region", "sightings_per_100k",
                  "alienmovies_per_year", "per_capita_gdp_current",
                  "internet_avg", "internet_urban", "internet_suburban",
                  "internet_rural", "normalized_afb_count"))

# Treat any remaining NAs as zeroes
all.states.normalized.df[is.na(all.states.normalized.df)] <- 0

We’re almost there!

Now let’s select the data between 1985 and 2014, inclusive.

all.states.normalized.df <- all.states.normalized.df %>%
    filter(year >= 1985 & year <= 2014)

Step 9: Write out

Finally, we’re ready to write the clean data frame to a csv. We did it!

fwrite(all.states.normalized.df, "clean/all_states_normalized.csv")

  1. I consider this an extreme example because my three teammates and I scraped 10 different data sets from the internet, from sources ranging from the UFO Association to IMDb to the Census.