OpenPDI is an unofficial effort to document and standardize data submitted to the Police Data Initiative (PDI). The goal is to make the data more accessible by addressing a number of issues related to a lack of standardization—namely:

- File types: While some agencies make use of the Socrata Open Data API, many provide their data in raw `.csv`, `.xlsx`, or `.xls` files of varying structures.
- Column names: Many columns that represent the same data (e.g., `race`) are named differently across departments, cities, and states.
- Value formats: Dates, times, and other comparable fields are submitted in many different formats.
- Column availability: It's currently very difficult to identify data sources that contain certain columns—e.g., Use of Force data specifying the hire date of the involved officer(s).
```shell
$ pip install openpdi
```
| Dataset | ID | Source |
|---|---|---|
| Use of Force | `uof` | https://www.policedatainitiative.org/datasets/use-of-force/ |
```python
import csv

import openpdi

# The library has a single entry point:
dataset = openpdi.Dataset(
    # The dataset ID (see the table above).
    "uof",
    # Limit the data sources to a specific state using its two-letter code.
    #
    # Default: `scope=[]`.
    scope=["TX"],
    # A list of columns that must be provided in every data source included in
    # this dataset. See `openpdi/meta/{ID}/schema.json` for the available
    # columns.
    #
    # Default: `columns=[]`.
    columns=["reason"],
    # If `True`, only return the user-specified columns -- i.e., those listed
    # in the `columns` parameter.
    #
    # Default: `strict=False`.
    strict=False)

# The names of the agencies included in this dataset:
print(dataset.agencies)

# The URLs of the external data sources included in this dataset:
print(dataset.sources)

# `gen` is a generator object for iterating over the CSV-formatted dataset.
gen = dataset.download()

# Write to a CSV file:
with open("dataset.csv", "w+") as f:
    writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_ALL)
    writer.writerows(gen)
```
In an attempt to avoid unnecessary bloat (in terms of GBs), we don't actually store any PDI data in this repository. Instead, we store small, JSON-formatted descriptions of externally hosted datasets—for example, `uof/CA/meta.json`:
```json
[
    {
        "url": "https://www.norwichct.org/Archive.aspx?AMID=61&Type=Recent",
        "type": "csv",
        "start": 1,
        "columns": {
            "date": {
                "index": 0,
                "specifier": "%m/%d/%Y"
            },
            "city": {
                "raw": "Richmond"
            },
            "state": {
                "raw": "CA"
            },
            "service_type": {
                "index": 1
            },
            "force_type": {
                "index": 10
            },
            "light_conditions": {
                "index": 8
            },
            "weather_conditions": {
                "index": 7
            },
            "reason": {
                "index": 2
            },
            "officer_injured": {
                "index": 6
            },
            "officer_race": {
                "index": 9
            },
            "subject_injured": {
                "index": 5
            },
            "aggravating_factors": {
                "index": 3
            },
            "arrested": {
                "index": 4
            }
        }
    }
]
```
This file describes a Use of Force (`uof`) dataset from Richmond, CA. Each entry in the `columns` object maps a column from the externally hosted data to a column in the dataset's schema file (`uof/schema.json`).
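To illustrate the idea, here is a minimal sketch (not OpenPDI's actual implementation) of how a `columns` mapping like the one above can translate a raw CSV row into schema order. The `map_row` function and `schema_order` parameter are hypothetical names introduced for this example:

```python
def map_row(raw_row, columns, schema_order):
    """Build a standardized row from a raw CSV row.

    `columns` maps each schema column name to either an `index` into the
    raw row or a constant `raw` value; `schema_order` fixes the output
    order. Columns absent from the mapping are left empty.
    """
    out = []
    for name in schema_order:
        spec = columns.get(name)
        if spec is None:
            out.append("")           # column not provided by this source
        elif "raw" in spec:
            out.append(spec["raw"])  # constant value (e.g., city, state)
        else:
            out.append(raw_row[spec["index"]])
    return out


# A trimmed-down version of the mapping above, applied to one raw row:
columns = {
    "date": {"index": 0, "specifier": "%m/%d/%Y"},
    "city": {"raw": "Richmond"},
    "reason": {"index": 2},
}
raw = ["01/15/2019", "Traffic stop", "Resisting arrest"]
print(map_row(raw, columns, ["date", "city", "reason"]))
# ['01/15/2019', 'Richmond', 'Resisting arrest']
```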
The `schema.json` file assigns a `format` to every possible column in a particular dataset; each format corresponds to a Python function tasked with standardizing a raw column value (see `openpdi/validators.py`).
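As a rough idea of what such a function might look like, here is a hedged sketch of a date-standardizing format function; the real implementations live in `openpdi/validators.py`, and the `date_format` name here is hypothetical:

```python
from datetime import datetime


def date_format(value, specifier="%m/%d/%Y"):
    """Parse `value` using the meta file's `specifier` and re-emit it
    as an ISO 8601 date, returning an empty string for unparseable
    input so one bad row doesn't abort the whole dataset."""
    try:
        return datetime.strptime(value.strip(), specifier).date().isoformat()
    except ValueError:
        return ""


print(date_format("01/15/2019"))  # -> 2019-01-15
print(date_format("not a date"))  # -> (empty string)
```

The per-source `specifier` field in `meta.json` is what makes this work: each source declares its own raw date format, and the standardizer converts them all to a single representation.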