Skip to content

Latest commit

 

History

History
420 lines (338 loc) · 16.9 KB

readme.md

File metadata and controls

420 lines (338 loc) · 16.9 KB

Pets

Australian Pets

The data this week comes from the RSPCA, Townsville Animal Complaints and Brisbane Open Data - Animal Complaints.

h/t to Georgios Karamanis for suggesting part of this data.

The RSPCA is Australia's oldest, largest and most trusted animal welfare organisation. With this privileged position comes great responsibility. This year we received1 124,146 animals into our animal shelters and adoption centres across the country.

With a great deal of effort from RSPCAs all over the country, adoption and reclaiming rates nationally have been increasing over time and significant improvements in the outcomes for cats and dogs (including kittens and puppies) have been achieved. This can be attributed to the introduction of new approaches and programs to increase the number of animals adopted and reunited with their owners.

For general data on the states/regions of Australian with population as of Dec 2019 - Wikipedia.

There's a remarkable amount of possible data cleaning/aggregation here. The animal_outcomes dataset is pretty much ready to go, although you could pivot it longer to stack the regions. If you want to go really far down the rabbit hole, check out the cleaning script and see if you can improve upon it or approach things a different way. Note the data came from PDFs and I had to do a lot of manual spot-checks. Some fun dplyr::update_rows() which saved me a LOT of time.

The brisbane_complaints dataset has an interesting structure. No dates were reported within a dataset, so we had to add them from the file names within a purrr call. This cleaning step is pretty realistic! I left the data_range messy so you can further clean it up.

Note that there is state-level data from the RSPCA, but only two cities for the animal complaints. Up to you if you want to extrapolate between the data, but I'm not confident it will be meaningful.

The RSPCA Report (lots of graphs to recreate) - RSPCA Report.

Journal article - A Retrospective Analysis of Complaints to RSPCA Queensland, Australia, about Dog Welfare.

Get the data here

# Get the Data

# Read in with tidytuesdayR package 
# Install from CRAN via: install.packages("tidytuesdayR")
# This loads the readme and all the datasets for the week of interest

# Either ISO-8601 date or year/week works!

tuesdata <- tidytuesdayR::tt_load('2020-07-21')
tuesdata <- tidytuesdayR::tt_load(2020, week = 30)

animal_outcomes <- tuesdata$animal_outcomes

# Or read in the data manually

animal_outcomes <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-21/animal_outcomes.csv')
animal_complaints <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-21/animal_complaints.csv')
brisbane_complaints <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/main/data/2020/2020-07-21/brisbane_complaints.csv')

Data Dictionary

animal_outcomes.csv

variable class description
year double Full year
animal_type character Animal type (horse, wildlife, dog, cat, etc)
outcome character Animal outcome - euthanized, released, rehomed, etc
ACT double ACT - Australian Capital Territory
NSW double New South Wales
NT double Northern Territory
QLD double Queensland
SA double South Australia
TAS double Tasmania
VIC double Victoria
WA double Western Australian
Total double Australian Total

animal_complaints.csv

variable class description
Animal Type character Animal Type
Complaint Type character Complaint type
Date Received character Date received (Month - year)
Suburb character Suburb/region
Electoral Division character Electoral Division

brisbane_complaints.csv

variable class description
nature character Nature of complaints (animal)
animal_type character Animal type
category character Category of complaint
suburb character Suburb where reported
date_range character Date range (typically 1 quarter + year)
responsible_office character Responsible office for the complaint
city character City (Brisbane)

Cleaning Script


library(tidyverse)
library(pdftools)
library(rvest)
library(glue)

# Brisbane data for complaints:

# Source: https://www.data.brisbane.qld.gov.au/data/dataset/96bec69c-6170-4ef0-93f1-eda279149b97

all_complaints <- list.files("2020/2020-07-21/raw_csv") %>% 
  map_dfr(.f = function(x){
    read_csv(file = paste0("2020/2020-07-21/raw_csv/", x)) %>% 
      mutate(date_range = x)
  })

brisbane_complaints <- all_complaints %>% 
  mutate(date_range = str_remove(date_range, "animal-compliance-|cars-bis-open-data-animal-related-complaints-")) %>% 
  set_names(nm= c("nature", "animal_type", "category", "suburb", "date_range", "responsible_office")) %>% 
  mutate(city = "Brisbane")

brisbane_complaints %>% write_csv("2020/2020-07-21/brisbane_complaints.csv")

library(tidyverse)
library(pdftools)
library(rvest)
library(glue)

# Get all urls for PDFs ---------------------------------------------------

raw_pca <- "https://www.rspca.org.au/what-we-do/our-role-caring-animals/annual-statistics" %>%
  read_html()

all_urls <- raw_pca %>%
  html_node("body > div:nth-child(5) > div.node.node-full-page > div.paragraphs-items.paragraphs-items-field-shared-sections.paragraphs-items-field-shared-sections-full.paragraphs-items-full > div > div > div > div > div > div:nth-child(6)") %>%
  html_nodes("p > a") %>%
  html_attr("href") %>%
  enframe() %>%
  arrange(desc(name)) %>%
  mutate(name = c(1999:2015, 2017)) %>%
  add_row(name = 2018, value = "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%20final%202018-2019.pdf") %>%
  set_names(nm = c("year", "url"))

# Download all the files
save_rspca <- function(year, url){
  download.file(url, destfile = glue::glue("rspca_{year}.pdf"))
}

# Iterate across the files and save
all_urls %>% 
  pwalk(save_rspca)

### Function to get data
get_animals <- function(year, url) {
  url_rspca <- url
  
  # Read PDF as character strings
  raw_pdf <- pdftools::pdf_text(url_rspca)
  
  # Get table number by page
  page_num_df <- raw_pdf %>%
    enframe() %>%
    mutate(value = str_trim(value) %>%
             word(1, 2))
  
  # Add logic for getting the right table number
  page_num_pull <- page_num_df %>%
    filter(str_detect(value, if_else(year < 2007, "Table 3", "Table 4"))) %>%
    pull(name)
  
  # More logic because someone in 2011 decided to further wrap tables :facepalm:
  page_num <- if_else(year == 2011, list(5), list(page_num_pull))[[1]]
  
  # Tables can span up to 2 additional pages
  # Combine these pages and split into rows by new lines
  raw_text <- raw_pdf[c(page_num:(page_num + 2))] %>%
    str_split("\n") %>%
    flatten_chr()
  
  # Find and limit to start/end of table
  table_start <- stringr::str_which(raw_text, "Dogs")
  table_end <- stringr::str_which(raw_text, "Total animals|Total received|Total Animals")
  
  table_end <- table_end[[length(table_end)]]
  
  # Trim table of extra whitespace and limit to right rows
  table_trimmed <- raw_text[c(table_start:table_end)] %>% str_trim()
  
  # squish table together and replace excess whitespace with a '|'
  squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>%
    str_remove_all(",")
  
  raw_df <- enframe(squished_table)
  
  # separate the columns
  animal_df <- raw_df %>%
    separate(
      value,
      into = c(
        "outcome",
        "ACT",
        "NSW",
        "NT",
        "QLD",
        "SA",
        "TAS",
        "VIC",
        "WA",
        "Total"
      ),
      sep = "\\|"
    ) %>%
    # convert columns to double
    mutate(across(c(ACT:Total), as.double)) %>%
    mutate(
      year = year,
      # Clean up the outcome column
      outcome = str_to_title(outcome) %>% str_remove("[:digit:]+")
    ) %>%
    mutate(
      # create animal_type column
      animal_type = if_else(
        str_detect(outcome, paste(c("Dogs", "Cats", "Horses", "Livestock", "Other animals", "Other Animals", "Wildlife"), collapse = "|")),
        outcome,
        NA_character_
      ),
      animal_type = str_to_title(animal_type)
    ) %>%
    # fill down the animal type column
    fill(animal_type)
  
  
  animal_df
}

# Get all the data --------------------------------------------------------

# Note you could do some of this in bulk with purrr, but since I had to
# do so much extra cleaning outside of the core function I decided to 
# just do it manually and merge at the end

data_1999 <- get_animals(1999, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics1999-2000.pdf")
data_2000 <- get_animals(2000, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202000%20-%202001.pdf")
data_2001 <- get_animals(2001, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202001%20-%202002.pdf")
data_2002 <- get_animals(2002, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202002-2003.pdf")
data_2003 <- get_animals(2003, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202003%20-%202004.pdf") %>%
  filter(!is.na(Total))
data_2004 <- get_animals(2004, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202004%20-%202005.pdf") %>%
  filter(!is.na(Total))
data_2005 <- get_animals(2005, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202005%20-%202006.pdf") %>%
  rows_update(
    tibble(name = 5, NT = NA, QLD = 456, SA = 97, TAS = 79, VIC = 472, WA = 133, Total = 1951)
  ) %>%
  rows_update(
    tibble(name = 10, SA = NA, TAS = 8, VIC = 100, WA = 0, Total = 1730)
  ) %>%
  rows_update(
    tibble(name = 7, NT = NA, QLD = 47, SA = 451, TAS = 127, VIC = 0, WA = 0, Total = 1015)
  ) %>%
  rows_update(
    tibble(
      name = c(18, 19), NT = c(NA, NA), QLD = c(474, 21), SA = c(96, 0),
      TAS = c(52, 300), VIC = c(218, 829), WA = c(52, 0), Total = c(1336, 1168)
    )
  ) %>%
  rows_update(
    tibble(
      name = c(60:66), NT = c(rep(NA, 7)),
      QLD = c(54, 694, 80, 0, 645, 97, 1570),
      SA = c(10, 108, 2, 0, 222, NA, 342),
      TAS = c(5, 95, 1, 0, 11, 2, 114),
      VIC = c(47, 329, 76, 0, 373, 20, 845),
      WA = c(0, 6, 0, 0, 1, 0, 7),
      Total = c(162, 1594, 224, 260, 2507, 481, 5228)
    )
  ) %>%
  filter(!is.na(Total))

data_2006 <- get_animals(2006, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202006%20-%202007.pdf") %>%
  rows_update(
    tibble(name = 7, TAS = NA, VIC = 153, WA = 5, Total = 2087)
  ) %>%
  rows_update(
    tibble(name = 32, NT = NA, QLD = 30, SA = 5, TAS = 7, VIC = 42, WA = 0, Total = 95)
  )

# Note decided to use data_2006_upd instead of overwriting
# One of the columns got dropped in parsing step, so I used it to
# update rows in place

data_2006_upd <- data_2006 %>%
  rows_update(
    data_2006 %>%
      filter(name %in% c(44:50, 53:60)) %>%
      select(-ACT) %>%
      set_names(nm = c(
        "name", "outcome", "ACT", "NSW", "NT", "QLD",
        "SA", "TAS", "VIC", "WA", "year", "animal_type"
      )) %>%
      mutate(Total = c(
        1893, 142, 1502, 6035, 1318, 10890, 11051,
        167, 1631, 169, 186, 2291, 227, 4671, 5228
      ), .after = WA) %>%
      mutate(
        outcome = c(
          "Released", "In Stock", "Transferred", "Euthanased", "Other", "Total", "Last Year's Total",
          "Reclaimed", "Rehomed", "In Stock", "Transferred", "Euthanased", "Other", "Total", "Last Year's Total"
        ),
        animal_type = c(rep("Wildlife", 7), rep("Other Animals", 8))
      )
  ) %>%
  filter(!is.na(Total))

data_2007 <- get_animals(2007, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20national%20Statistics%202007-2008.pdf") %>%
  filter(!is.na(Total))

data_2008 <- get_animals(2008, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202008%20-%202009.pdf") %>%
  rows_update(
    tibble(
      name = c(2, 3, 8, 16),
      WA = c(621, 1045, 248, 958),
      Total = c(22896, 19236, 22085, 19666)
    )
  ) %>%
  rows_update(
    tibble(
      name = 19,
      NT = 212, QLD = 11045, SA = 2712, TAS = 1974, VIC = 9801, WA = 99, Total = 39495
    )
  ) %>%
  filter(!is.na(Total))

data_2009 <- get_animals(2009, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202009%20-%202010.pdf") %>%
  filter(!is.na(Total))

data_2010 <- get_animals(2010, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202010%20-%202011.pdf") %>%
  filter(!is.na(Total))

data_2011 <- get_animals(2011, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202011-2012.pdf") %>%
  filter(!is.na(Total))

data_2012 <- get_animals(2012, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA%20Australia%20National%20Statistics%202012%20-%202013.pdf") %>%
  filter(!is.na(Total))

data_2013 <- get_animals(2013, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA_Australia-Annual_Statistics_2013-2014.pdf") %>%
  filter(!is.na(Total))

data_2014 <- get_animals(2014, "https://www.rspca.org.au/sites/default/files/website/The-facts/Statistics/RSPCA_Australia-Annual_Statistics_2014-2015.pdf") %>%
  filter(!is.na(Total))

data_2015 <- get_animals(2015, "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%202015-2016%20.pdf") %>%
  rows_update(
    tibble(
      name = 62, ACT = 581, NSW = 1361, NT = 34, QLD = 720, SA = 449, TAS = 191, VIC = 716, WA = 38, Total = 4090
    )
  ) %>%
  filter(!is.na(Total))

data_2016 <- get_animals(2016, "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%20final%202016-2017.pdf") %>% 
  filter(!is.na(Total))

data_2017 <- get_animals(2017, "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%202017-2018.pdf") %>%
  filter(!is.na(Total))

data_2018 <- get_animals(2018, "https://www.rspca.org.au/sites/default/files/RSPCA%20Australia%20Annual%20Statistics%20final%202018-2019.pdf") %>%
  filter(!is.na(Total))

# Combine all the cleaned data by year!
# note that 2016 was not reported. :sad:
# I would recommend imputing if you want

comb_data <- bind_rows(
  list(
    data_1999, data_2000, data_2001, data_2002, data_2003, data_2004, data_2005,
    data_2006_upd, data_2007, data_2008, data_2009, data_2010, data_2011, 
    data_2012, data_2013, data_2014, data_2015, data_2016, data_2017, data_2018
  )
)

# clean the final data
clean_data <- comb_data %>%
  filter(outcome %in% c(
    "Euthanased", "Other", "Rehomed", "Transferred",
    "Reclaimed", "Currently In Care", "In Stock",
    "Released", "In Our Care"
  )) %>%
  mutate(
    outcome = case_when(
      outcome == "In Our Care" ~ "Currently In Care",
      outcome == "Euthanased" ~ "Euthanized",
      TRUE ~ outcome
    )
  ) %>%
  select(year, animal_type, outcome, ACT:Total)

# Sanity checking
clean_data %>% ggplot(aes(x = year, y = Total, color = outcome)) +
  geom_line() +
  facet_wrap(~animal_type, scales = "free_y")

# Sanity checking
clean_data %>%
  filter(outcome %in% c("Euthanized", "Rehomed"), animal_type != "Wildlife") %>%
  ggplot(aes(x = year, y = Total, fill = outcome)) +
  geom_col(position = "fill") +
  facet_wrap(~animal_type)

# Sanity checking
clean_data %>%
  count(animal_type, sort = T)

# Sanity checking Wildlife being an odd number
# It's because first 3 years only had 3 outcomes

clean_data %>%
  filter(animal_type == "Wildlife") %>%
  count(year)

clean_data %>% 
  write_csv("2020/2020-07-21/animal_outcomes.csv")