OpenPDI is an unofficial effort to document and standardize data submitted to the Police Data Initiative (PDI). The goal is to make the data more accessible by addressing a number of issues related to a lack of standardization—namely:

- File types: While some agencies make use of the Socrata Open Data API, many provide their data in raw `.csv`, `.xlsx`, or `.xls` files of varying structures.
- Column names: Many columns that represent the same data (e.g., `race`) are named differently across departments, cities, and states.
- Value formats: Dates, times, and other comparable fields are submitted in many different formats.
- Column availability: It's currently very difficult to identify data sources that contain certain columns—e.g., Use of Force data specifying the hire date of the involved officer(s).
```shell
$ pip install openpdi
```
| Dataset | ID | Source |
|---|---|---|
| Use of Force | `uof` | https://www.policedatainitiative.org/datasets/use-of-force/ |
```python
import csv

import openpdi

# The library has a single entry point:
dataset = openpdi.Dataset(
    # The dataset ID (see the table above).
    "uof",
    # Limit the data sources to a specific state using its two-letter code.
    #
    # Default: `scope=[]`.
    scope=["TX"],
    # A list of columns that must be provided in every data source included in
    # this dataset. See `openpdi/meta/{ID}/schema.json` for the available
    # columns.
    #
    # Default: `columns=[]`.
    columns=["reason"],
    # If `True`, only return the user-specified columns -- i.e., those listed
    # in the `columns` parameter.
    #
    # Default: `strict=False`.
    strict=False)

# The names of the agencies included in this dataset:
print(dataset.agencies)

# The URLs of the external data sources included in this dataset:
print(dataset.sources)

# `gen` is a generator object for iterating over the CSV-formatted dataset.
gen = dataset.download()

# Write to a CSV file:
with open("dataset.csv", "w+") as f:
    writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_ALL)
    writer.writerows(gen)
```
In an attempt to avoid unnecessary bloat (in terms of GBs), we don't actually store any PDI data in this repository. Instead, we store small, JSON-formatted descriptions of externally hosted datasets—for example, `uof/CA/meta.json`:
```json
[
    {
        "url": "https://www.norwichct.org/Archive.aspx?AMID=61&Type=Recent",
        "type": "csv",
        "start": 1,
        "columns": {
            "date": {
                "index": 0,
                "specifier": "%m/%d/%Y"
            },
            "city": {
                "raw": "Richmond"
            },
            "state": {
                "raw": "CA"
            },
            "service_type": {
                "index": 1
            },
            "force_type": {
                "index": 10
            },
            "light_conditions": {
                "index": 8
            },
            "weather_conditions": {
                "index": 7
            },
            "reason": {
                "index": 2
            },
            "officer_injured": {
                "index": 6
            },
            "officer_race": {
                "index": 9
            },
            "subject_injured": {
                "index": 5
            },
            "aggravating_factors": {
                "index": 3
            },
            "arrested": {
                "index": 4
            }
        }
    }
]
```
This file describes a Use of Force (`uof`) dataset from Richmond, CA. Each entry in the `columns` object maps a column from the externally hosted data to a column in the dataset's schema file (`uof/schema.json`).
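To illustrate the idea, here is a minimal sketch (not OpenPDI's actual implementation) of how a `columns` mapping like the one above can translate a raw CSV row into schema order. The `map_row` function and `schema_order` parameter are hypothetical names introduced for this example:

```python
def map_row(raw_row, columns, schema_order):
    """Build a standardized row from a raw CSV row.

    `columns` maps each schema column name to either an `index` into the
    raw row or a constant `raw` value; `schema_order` fixes the output
    order. Columns absent from the mapping are left empty.
    """
    out = []
    for name in schema_order:
        spec = columns.get(name)
        if spec is None:
            out.append("")           # column not provided by this source
        elif "raw" in spec:
            out.append(spec["raw"])  # constant value (e.g., city, state)
        else:
            out.append(raw_row[spec["index"]])
    return out


# A trimmed-down version of the mapping above, applied to one raw row:
columns = {
    "date": {"index": 0, "specifier": "%m/%d/%Y"},
    "city": {"raw": "Richmond"},
    "reason": {"index": 2},
}
raw = ["01/15/2019", "Traffic stop", "Resisting arrest"]
print(map_row(raw, columns, ["date", "city", "reason"]))
# ['01/15/2019', 'Richmond', 'Resisting arrest']
```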
The `schema.json` file assigns a `format` to every possible column in a particular dataset; each format corresponds to a Python function tasked with standardizing a raw column value (see `openpdi/validators.py`).
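As a rough idea of what such a function might look like, here is a hedged sketch of a date-standardizing format function; the real implementations live in `openpdi/validators.py`, and the `date_format` name here is hypothetical:

```python
from datetime import datetime


def date_format(value, specifier="%m/%d/%Y"):
    """Parse `value` using the meta file's `specifier` and re-emit it
    as an ISO 8601 date, returning an empty string for unparseable
    input so one bad row doesn't abort the whole dataset."""
    try:
        return datetime.strptime(value.strip(), specifier).date().isoformat()
    except ValueError:
        return ""


print(date_format("01/15/2019"))  # -> 2019-01-15
print(date_format("not a date"))  # -> (empty string)
```

The per-source `specifier` field in `meta.json` is what makes this work: each source declares its own raw date format, and the standardizer converts them all to a single representation.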