Getting Started#
Welcome to the quickstart guide for OpenPoliceData (OPD)! Here, you should find all you need to learn the basics of OPD.
New to Python?: Check out the free first python notebook course
Questions or Comments?: If you questions or comments about anything related to installing or using OPD, please reach out on our discussion board.
Installation#
Install OPD with pip from PyPI
pip install openpolicedata
For installation in a Jupyter Notebook, replace pip with %pip.
See here for advanced installation including how to install [GeoPandas] alongside OPD to enable geospatial analysis of data loaded by OPD.
Import#
To use OPD, you must always start by importing it into your Python code:
[5]:
import openpolicedata as opd
We recommend shortening openpolicedata to opd to make your code more readable.
The Basics#
OPD provides access to over 300 police datasets with just 2 simple lines of code:
[13]:
# Load traffic stops data from Lousiville for the year 2022.
src = opd.Source("Louisville")
tbl = src.load_from_url(2022, table_type="TRAFFIC STOPS")
The table attribute contains the loaded data as a pandas DataFrame so it can be analyzed with pandas’ simple and powerful capabilities.
[12]:
# View the 1st 5 rows with pandas' head function
tbl.table.head()
[12]:
| TYPE_OF_STOP | CITATION_CONTROL_NUMBER | ACTIVITY_RESULTS | OFFICER_GENDER | OFFICER_RACE | OFFICER_AGE_RANGE | ACTIVITY_DATE | ACTIVITY_TIME | ACTIVITY_LOCATION | ACTIVITY_DIVISION | ACTIVITY_BEAT | DRIVER_GENDER | DRIVER_RACE | DRIVER_AGE_RANGE | NUMBER_OF_PASSENGERS | WAS_VEHCILE_SEARCHED | REASON_FOR_SEARCH | ObjectId | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | COMPLAINT/CRIMINAL VIOLATION | DU03293 | CITATION ISSUED | M | WHITE | 21 - 30 | 01/02/2022 | 21:44 | M ST ... | 4TH DIVISION | BEAT 4 | M | WHITE | 26 - 30 | 2 | YES | 0 | 1 |
| 1 | COMPLAINT/CRIMINAL VIOLATION | DV75866 | CITATION ISSUED | M | WHITE | 51 - 60 | 07/21/2022 | 02:00 | KEEGAN WAY ... | 7TH DIVISION | BEAT 1 | M | HISPANIC | 16 - 19 | 1 | YES | 4 | 2 |
| 2 | COMPLAINT/CRIMINAL VIOLATION | DV87754 | CITATION ISSUED | M | WHITE | 51 - 60 | 07/21/2022 | 02:00 | KEEGAN WAY ... | 7TH DIVISION | BEAT 1 | M | HISPANIC | 16 - 19 | 1 | NO | 0 | 3 |
| 3 | COMPLAINT/CRIMINAL VIOLATION | DW19051 | CITATION ISSUED | M | WHITE | 21 - 30 | 01/25/2022 | 11:23 | 4500 BLOCK SOUTHERN PKWY | 4TH DIVISION | BEAT 6 | M | WHITE | 20 - 25 | 0 | YES | 4 | 4 |
| 4 | COMPLAINT/CRIMINAL VIOLATION | DX65321 | CITATION ISSUED | M | WHITE | 31 - 40 | 01/13/2022 | 05:30 | PRESTON HWY @ OUTER LOOP ... | 7TH DIVISION | BEAT 6 | M | WHITE | 51 - 60 | 1 | YES | 3 | 5 |
Finding Datasets#
OPD provides the datasets module for querying what datasets are available in OPD. To get all available datasets, query the source table with no inputs:
[11]:
all_datasets = opd.datasets.query()
all_datasets.head()
[11]:
| State | SourceName | Agency | AgencyFull | TableType | coverage_start | coverage_end | last_coverage_check | Description | source_url | readme | URL | Year | DataType | date_field | dataset_id | agency_field | min_version | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arizona | Gilbert | Gilbert | Gilbert Police Department | CALLS FOR SERVICE | 2006-11-15 | 2023-05-14 | 05/15/2023 | <NA> | https://data.gilbertaz.gov/maps/2dcb4c20c9a444... | <NA> | https://maps.gilbertaz.gov/arcgis/rest/service... | MULTIPLE | ArcGIS | EventDate | <NA> | <NA> | <NA> |
| 1 | Arizona | Gilbert | Gilbert | Gilbert Police Department | EMPLOYEE | NaT | NaT | 05/15/2023 | A data set of all employees that have previous... | https://data.gilbertaz.gov/datasets/TOG::gilbe... | <NA> | https://services1.arcgis.com/JLuzSHjNrLL4Okwb/... | NONE | ArcGIS | <NA> | <NA> | <NA> | <NA> |
| 2 | Arizona | Gilbert | Gilbert | Gilbert Police Department | STOPS | 2008-01-01 | 2018-05-23 | 05/15/2023 | Standardized stop data from the Stanford Open ... | https://openpolicing.stanford.edu/data/ | https://github.com/stanford-policylab/opp/blob... | https://stacks.stanford.edu/file/druid:yg821jf... | MULTIPLE | CSV | date | <NA> | <NA> | <NA> |
| 3 | Arizona | Mesa | Mesa | Mesa Police Department | CALLS FOR SERVICE | 2017-01-01 | 2023-05-12 | 05/15/2023 | <NA> | https://data.mesaaz.gov/Police/Police-Computer... | <NA> | data.mesaaz.gov | MULTIPLE | Socrata | creation_datetime | ex94-c5ad | <NA> | <NA> |
| 4 | Arizona | Mesa | Mesa | Mesa Police Department | INCIDENTS | 2016-01-01 | 2023-03-31 | 05/15/2023 | Incidents based on initial police reports take... | https://data.mesaaz.gov/Police/Police-Incident... | <NA> | data.mesaaz.gov | MULTIPLE | Socrata | report_date | 39rt-2rfj | <NA> | <NA> |
The source table provides the information needed to create sources and load data as well as background information. It is a DataFrame that can be filtered with pandas filtering operations. Key information includes:
State: Optionally used when creating a
Sourceto distinguish ambiguous sources (i.e. same city name in different states)SourceName: Original source of the data (typically a shortened name for a police department). Used when creating a
Source.Agency: Shortened agency / police department name. Typically the same as SourceName. However, it may be
MULTIPLEif a datasets contains data for multiple agencies.TableType: Type of data (TRAFFIC STOPS, USE OF FORCE, etc.). Used when loading data.
coverage_start: Start date of data contained in dataset. Combined with coverage_end, this determines the years available for this datasets when loading data. NOTE: Often, agencies store their data in different datasets for different years so one table type may be spread across multiple datasets corresponding to each year of data.
coverage_end: Most recently checked date for data contained in dataset. Combined with coverage_start, this determines the years available for this datasets when loading data. If the data has been updated by the dataset owner since the date in
last_coverage_check, more recent years may be available. NOTE: Often, agencies store their data in different datasets for different years so one table type may be spread across multiple datasets corresponding to each year of data.source_url: Homepage for dataset
readme: Direct URL for data dictionary containing definitions of columns, etc. If empty, the
source_urlmay also contain a data dictionary.
With its optional inputs, query can be used to filter for desired data. Here is a very specific query using all optional inputs:
[15]:
ds = opd.datasets.query(source_name="Menlo Park", state="California", agency="Menlo Park", table_type="CALLS FOR SERVICE")
ds
[15]:
| State | SourceName | Agency | AgencyFull | TableType | coverage_start | coverage_end | last_coverage_check | Description | source_url | readme | URL | Year | DataType | date_field | dataset_id | agency_field | min_version | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 73 | California | Menlo Park | Menlo Park | Menlo Park Police Department | CALLS FOR SERVICE | 2018-01-01 | 2018-12-31 | 05/15/2023 | <NA> | https://data.menlopark.org/datasets/4036c27030... | https://data.menlopark.org/datasets/4036c27030... | https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/... | 2018 | ArcGIS | <NA> | <NA> | <NA> | <NA> |
| 74 | California | Menlo Park | Menlo Park | Menlo Park Police Department | CALLS FOR SERVICE | 2019-01-01 | 2019-12-31 | 05/15/2023 | <NA> | https://data.menlopark.org/datasets/e88877f5d9... | https://data.menlopark.org/datasets/e88877f5d9... | https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/... | 2019 | ArcGIS | <NA> | <NA> | <NA> | <NA> |
| 75 | California | Menlo Park | Menlo Park | Menlo Park Police Department | CALLS FOR SERVICE | 2020-01-01 | 2020-12-31 | 05/15/2023 | <NA> | https://data.menlopark.org/datasets/510eb69337... | https://data.menlopark.org/datasets/510eb69337... | https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/... | 2020 | ArcGIS | <NA> | <NA> | <NA> | <NA> |
| 76 | California | Menlo Park | Menlo Park | Menlo Park Police Department | CALLS FOR SERVICE | 2021-01-01 | 2021-12-31 | 05/15/2023 | <NA> | https://data.menlopark.org/datasets/4c04a71c71... | https://data.menlopark.org/datasets/4c04a71c71... | https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/... | 2021 | ArcGIS | <NA> | <NA> | <NA> | <NA> |
get_table_types finds available table types in OPD. Here, we use optional contains input to only get the table types containing the word “STOPS”:
[5]:
table_types = opd.datasets.get_table_types(contains="STOPS")
table_types
[5]:
['PEDESTRIAN STOPS', 'STOPS', 'TRAFFIC STOPS']
Loading Data#
The Source class is used to explore datasets and load data. We first need to create a source, which we can use to view all datasets from that source. Let’s create a source of Columbia, South Carolina. We need to specify the state because there are datasets from Columbias from multiple states
[6]:
src = opd.Source("Columbia", state="South Carolina")
src.datasets
[6]:
| State | SourceName | Agency | AgencyFull | TableType | coverage_start | coverage_end | last_coverage_check | Description | source_url | readme | URL | Year | DataType | date_field | dataset_id | agency_field | min_version | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 744 | South Carolina | Columbia | Columbia | Columbia Police Department | ARRESTS | 2016-01-01 | 2022-12-31 | 07/07/2023 | <NA> | https://coc-colacitygis.opendata.arcgis.com/da... | <NA> | https://services1.arcgis.com/Mnt8FoJcogKtoVBs/... | MULTIPLE | ArcGIS | Arrest_Date | <NA> | <NA> | 0.2 |
| 745 | South Carolina | Columbia | Columbia | Columbia Police Department | FIELD CONTACTS | 2016-01-01 | 2022-12-31 | 07/07/2023 | Field Interview is a collection of data result... | https://coc-colacitygis.opendata.arcgis.com/da... | <NA> | https://services1.arcgis.com/Mnt8FoJcogKtoVBs/... | MULTIPLE | ArcGIS | TOC | <NA> | <NA> | <NA> |
To get a list of available table types:
[17]:
src.get_tables_types()
[17]:
['ARRESTS', 'FIELD CONTACTS']
You can get the number of records for a dataset using get_count. Let’s get the number of records in the year 2022 for the FIELD CONTACTS dataset.
[18]:
src.get_count(2022, "FIELD CONTACTS")
[18]:
2382
You can find which years are available for a given table type:
[12]:
src.get_years(table_type="FIELD CONTACTS")
[12]:
[2016, 2017, 2018, 2019, 2020, 2021, 2022]
Now, let’s load in some field contacts data for 2022.
[7]:
tbl = src.load_from_url(2022, "FIELD CONTACTS")
tbl
[7]:
state: South Carolina,
source_name: Columbia,
agency: Columbia,
table_type: FIELD CONTACTS,
year: 2022,
description: Field Interview is a collection of data resulting from citizen contact related to suspicious activity.,
url: https://services1.arcgis.com/Mnt8FoJcogKtoVBs/arcgis/rest/services/FieldInterview/FeatureServer/0,
date_field: TOC,
source_url: https://coc-colacitygis.opendata.arcgis.com/datasets/ColaCityGIS::field-interview-1-1-2016-3-31-2022/about
The loaded data is contained in a pandas DataFrame in the table attribute.
[13]:
tbl.table.head(2)
[13]:
| OBJECTID | Case_Num | TOC | Address | City | Zip | State | Age | Race | Sex | Contact_Type | Year | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25351 | 220000108 | 2022-01-01 21:47:00 | 12XX Main St | 29201 | 32 | W | M | Field Interview | 2022.0 | POINT (1989801.776 788862.968) | ||
| 1 | 25350 | 220000161 | 2022-01-02 15:05:00 | 21XX Main St | 29201 | 29 | B | M | Field Interview | 2022.0 | POINT (1988210.189 793174.093) |
Data can be saved locally as CSV files. This allows you to:
Open the data using the software of your choice
Re-open the data in OPD from a local copy
[19]:
tbl.to_csv()
new_src = opd.Source("Columbia", state="South Carolina")
new_tbl = new_src.load_from_csv(2022, table_type="FIELD CONTACTS")
new_tbl.table.head(2)
[19]:
| OBJECTID | Case_Num | TOC | Address | City | Zip | State | Age | Race | Sex | Contact_Type | Year | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25351 | 220000108.0 | 2022-01-01 21:47:00 | 12XX Main St | NaN | 29201 | NaN | 32.0 | W | M | Field Interview | 2022.0 | POINT (1989801.7762467265 788862.9678477645) |
| 1 | 25350 | 220000161.0 | 2022-01-02 15:05:00 | 21XX Main St | NaN | 29201 | NaN | 29.0 | B | M | Field Interview | 2022.0 | POINT (1988210.189304456 793174.0931758583) |
Some datasets contain data for every agency in a state. In this case, you may want to know what agencies are available and optionally, only want agencies containing the word Arlington.
[8]:
src = opd.Source("Virginia")
agencies = src.get_agencies(table_type="STOPS", partial_name="Arlington")
agencies
[8]:
["Arlington County Sheriff's Office", 'Arlington County Police Department']
We may also want only load data from a specific agency:
[9]:
tbl = src.load_from_url(2022, table_type="STOPS", agency="Arlington County Police Department")
Data Standardization#
One of the challenges in analyzing police data is that different agencies will use different column names for the same data and will use different codes and terms for the data in the columns. Particularly, if you are looking at multiple datasets, it is valuable for the data to be standardized so that you know in advance what some key columns will be called and what values will be in those columns.
To provide the user with more consistent column names and data, OpenPoliceData provides powerful tools to automatically standardize column names and data in order. Columns that OpenPoliceData can standardize include:
Date
Time,
Gender
Age
Race
Ethnicity
In addition, OpenPoliceData will combine separate date and time columns into a single datetime column and race and ethnicity into a single combined race column.
Let’s examine what columns we have in our data:
[10]:
tbl.table.columns
[10]:
Index(['incident_date', 'agency_name', 'jurisdiction', 'reason_for_stop',
'person_type', 'race', 'ethnicity', 'age', 'gender', 'english_speaking',
'action_taken', 'specific_violation', 'virginia_crime_code',
'person_searched', 'vehicle_searched', 'physical_force_by_officer',
'physical_force_by_subject', 'residency'],
dtype='object')
The DataFrame currently has date, race, ethnicity, age, and gender columns called incident_date, race, ethnicity, age, and gender, respectively.
The unique values in the race and ethnicity columns are:
[11]:
print(f"The unique values in the race column are {tbl.table['race'].unique()}")
print(f"The unique values in the ethnicity column are {tbl.table['ethnicity'].unique()}")
The unique values in the race column are ['WHITE' 'BLACK OR AFRICAN AMERICAN' 'AMERICAN INDIAN OR ALASKA NATIVE'
'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER' 'UNKNOWN']
The unique values in the ethnicity column are ['HISPANIC OR LATINO' 'NOT HISPANIC OR LATINO' 'UNKNOWN']
Now, let’s standardize the data:
[12]:
tbl.standardize()
print("The columns after standardization are:")
tbl.table.columns
The columns after standardization are:
[12]:
Index(['DATE', 'SUBJECT_RACE', 'SUBJECT_AGE', 'SUBJECT_GENDER', 'AGENCY',
'jurisdiction', 'reason_for_stop', 'person_type', 'english_speaking',
'action_taken', 'specific_violation', 'virginia_crime_code',
'person_searched', 'vehicle_searched', 'physical_force_by_officer',
'physical_force_by_subject', 'residency', 'SUBJECT_ETHNICITY',
'SUBJECT_RACE_ONLY', 'RAW_incident_date', 'RAW_agency_name', 'RAW_race',
'RAW_ethnicity', 'RAW_age', 'RAW_gender'],
dtype='object')
The columns have been standardized to include DATE, SUBJECT_RACE (subject indicates the demographics it applies to a subject/civilian rather than a police officer), SUBJECT_AGE, SUBJECT_GENDER, AGENCY, SUBJECT_RACE_ONLY, and SUBJECT_ETHNICTY. Each of these columns has had its data standardized into a common format that would be applied to any data that the function standardize is applied to. Note also that the original non-standardized columns still exist but they
now have “RAW_” appended to their column names.
The original race and ethnicity columns were standardized to SUBJECT_RACE_ONLY and SUBJECT_ETHNICTY and then combined into SUBJECT_RACE. This combination is consistent with practice used in many fields. The standardized race/ethnicities are Latino of all races, White Non-Latino, Black Non-Latino, etc.
The standardized values in the SUBJECT_RACE column are:
[14]:
# This would produce the same thing:
# tbl.table[opd.defs.columns.RACE_SUBJECT].unique()
tbl.table["SUBJECT_RACE"].unique()
[14]:
array(['HISPANIC/LATINO', 'BLACK', 'INDIGENOUS', 'WHITE',
'ASIAN / PACIFIC ISLANDER', 'UNKNOWN'], dtype=object)
These standardized names will be used in all tables with race columns that are standardized (although they can be customized).