Getting Started#

Welcome to the quickstart guide for OpenPoliceData (OPD)! Here, you should find all you need to learn the basics of OPD.

New to Python?: Check out the free first python notebook course and/or the VS Code Python Quick Start Guide and Tutorial.
Questions or Comments?: If you have questions or comments about anything related to installing or using OPD, please reach out on our discussion board.

Installation#

Install OPD with pip from PyPI:

pip install openpolicedata

For installation in a Jupyter Notebook, replace pip with %pip.

See here for advanced installation including how to install GeoPandas alongside OPD to enable geospatial analysis of data loaded by OPD.
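To confirm that the install succeeded, one option is to look up the installed version from Python. The following is a minimal sketch that uses only the standard library; the helper name installed_version is made up for illustration and is not part of OPD.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed.

    Hypothetical helper for illustration; not part of OPD.
    """
    try:
        return version(package)
    except PackageNotFoundError:
        return None


opd_version = installed_version("openpolicedata")
print(opd_version if opd_version else "openpolicedata is not installed")
```

If the message says the package is not installed, re-run the pip command above in the same environment (or %pip in the same notebook kernel).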

Import#

To use OPD, you must always start by importing it into your Python code:

[2]:

import openpolicedata as opd

We recommend shortening openpolicedata to opd to make your code more readable.

The Basics#

OPD provides access to over 550 police datasets with just 2 simple lines of code (in most cases):

[5]:

# Load traffic stops data from New Orleans for the year 2022.
src = opd.Source("New Orleans")
tbl = src.load(table_type="STOPS", date=2022)

The table attribute contains the loaded data as a pandas DataFrame so it can be analyzed with pandas’ simple and powerful capabilities.

[6]:

# View the 1st 5 rows with pandas' head function
tbl.table.head()

[6]:

	fieldinterviewid	nopd_item	eventdate	district	zone	officerassignment	stopdescription	vehiclemake	vehiclemodel	vehiclestyle	...	subjectid	subjectrace	subjectgender	subjectage	subjectheight	subjectweight	subjecteyecolor	subjecthaircolor	subjectdriverlicstate	zip
0	576759	A0342921	2022-01-04 00:46:00	1	I	1st District	SUSPECT VEHICLE	INFINITY	OTHER	SPORTS UTILITY	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	576758	A0340622	2022-01-04 00:00:00	8	C	8th District	TRAFFIC VIOLATION	HONDA	OTHER	SPORTS UTILITY	...	661974	BLACK	MALE	21	71	170	Brown	Black	LA	70130
2	576760	A0346222	2022-01-04 01:38:00	7	H	7th District	TRAFFIC VIOLATION	GMC - GENERAL MOTORS CORP.	OTHER	SPORTS UTILITY	...	661975	BLACK	FEMALE	37	64	240	Brown	Black	TX	NaN
3	576761	A0349922	2022-01-04 02:25:00	7	I	7th District	CITIZEN CONTACT	NaN	NaN	NaN	...	661976	BLACK	MALE	58	69	158	Brown	Black	NaN	70128
4	576762	A0353222	2022-01-04 03:30:00	8	G	8th District	TRAFFIC VIOLATION	OTHER	OTHER	FOUR DOOR	...	661977	BLACK	MALE	20	74	140	Brown	Black	LA	70130

5 rows × 29 columns
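Because the loaded data is an ordinary pandas DataFrame, the usual pandas operations apply to it. The sketch below does not download anything: it builds a small stand-in DataFrame from two of the columns in the 5 rows shown above (the name stops is an assumption for illustration), then counts stop reasons and groups traffic violations by district.

```python
import pandas as pd

# Hypothetical stand-in for tbl.table, using two columns from the 5 rows shown above
stops = pd.DataFrame(
    {
        "district": [1, 8, 7, 7, 8],
        "stopdescription": [
            "SUSPECT VEHICLE",
            "TRAFFIC VIOLATION",
            "TRAFFIC VIOLATION",
            "CITIZEN CONTACT",
            "TRAFFIC VIOLATION",
        ],
    }
)

# Count how often each stop reason occurs
reason_counts = stops["stopdescription"].value_counts()

# Count traffic violations per district
is_traffic = stops["stopdescription"] == "TRAFFIC VIOLATION"
traffic_by_district = stops[is_traffic].groupby("district").size()

print(reason_counts)
print(traffic_by_district)
```

With real data, the same calls would be made on tbl.table instead of the stand-in.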

Finding Datasets#

OPD provides the datasets module for querying what datasets are available in OPD. To get all available datasets, query the source table with no inputs:

[7]:

all_datasets = opd.datasets.query()
all_datasets.head()

[7]:

	State	SourceName	Agency	AgencyFull	TableType	coverage_start	coverage_end	last_coverage_check	Year	agency_originated	...	source_url	readme	URL	DataType	date_field	dataset_id	agency_field	min_version	py_min_version	query
0	Arizona	Chandler	Chandler	Chandler Police Department	ARRESTS	2018-01-01	2025-05-30	05/31/2025	MULTIPLE	NaN	...	https://data.chandlerpd.com/catalog/arrest-boo...	<NA>	https://data.chandlerpd.com/catalog/arrest-boo...	CSV	arrest_date_time	NaN	<NA>	0.2	<NA>	NaN
1	Arizona	Chandler	Chandler	Chandler Police Department	CALLS FOR SERVICE	2018-01-01	2025-05-31	05/31/2025	MULTIPLE	NaN	...	https://data.chandlerpd.com/catalog/calls-for-...	<NA>	https://data.chandlerpd.com/catalog/calls-for-...	CSV	call_received_date_time	NaN	<NA>	<NA>	<NA>	NaN
2	Arizona	Chandler	Chandler	Chandler Police Department	INCIDENTS	2018-01-01	2025-05-24	05/31/2025	MULTIPLE	NaN	...	https://data.chandlerpd.com/catalog/general-of...	<NA>	https://data.chandlerpd.com/catalog/general-of...	CSV	report_event_date	NaN	<NA>	0.4.1	<NA>	NaN
3	Arizona	Gilbert	Gilbert	Gilbert Police Department	CALLS FOR SERVICE	2006-11-15	2025-02-24	05/31/2025	MULTIPLE	NaN	...	https://data.gilbertaz.gov/maps/2dcb4c20c9a444...	<NA>	https://maps.gilbertaz.gov/arcgis/rest/service...	ArcGIS	EventDate	NaN	<NA>	<NA>	<NA>	NaN
4	Arizona	Gilbert	Gilbert	Gilbert Police Department	EMPLOYEE	NaT	NaT	07/06/2023	NONE	NaN	...	https://data.gilbertaz.gov/datasets/TOG::gilbe...	<NA>	https://services1.arcgis.com/JLuzSHjNrLL4Okwb/...	ArcGIS	<NA>	NaN	<NA>	<NA>	<NA>	{“Department”: “POLICE DEPARTMENT”}

5 rows × 22 columns

The source table provides the information needed to create sources and load data as well as background information about each dataset. It is a DataFrame that can be filtered with pandas filtering operations. Key information includes:

State: Optionally used when creating a Source to distinguish ambiguous sources (i.e. same city name in different states)
SourceName: Original source of the data (typically a shortened name of a police department). Used when creating a Source.
Agency: Shortened agency / police department name. Typically the same as SourceName. Value is MULTIPLE if a dataset contains data for multiple agencies.
TableType: Type of data (TRAFFIC STOPS, USE OF FORCE, etc.). Used when loading data.
coverage_start: Start date of data contained in dataset. Combined with coverage_end, this determines the years available for this dataset when loading data. NOTE: Often, agencies store their data in different datasets for different years so one table type may be spread across multiple datasets corresponding to each year of data.
coverage_end: End date of data contained in dataset at the time of the most recent update. Combined with coverage_start, this determines the years available for this dataset when loading data. If the data has been updated by the dataset owner since the date in last_coverage_check, more recent data may be available. NOTE: Often, agencies store their data in different datasets for different years so one table type may be spread across multiple datasets corresponding to each year of data.
source_url: Homepage for dataset
readme: URL for data dictionary containing definitions of columns, etc. If empty, the source_url may also contain a data dictionary.

See the Data Source Table Dictionary for a description of all fields in the data source table.
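Since the source table is a DataFrame, it can be narrowed down with ordinary pandas boolean indexing. The sketch below uses a hand-built stand-in (the name datasets and the reduced set of columns are assumptions for illustration) with values copied from the first 5 rows shown above.

```python
import pandas as pd

# Hypothetical stand-in for the source table, with a few columns from the rows above
datasets = pd.DataFrame(
    {
        "State": ["Arizona"] * 5,
        "SourceName": ["Chandler", "Chandler", "Chandler", "Gilbert", "Gilbert"],
        "TableType": [
            "ARRESTS",
            "CALLS FOR SERVICE",
            "INCIDENTS",
            "CALLS FOR SERVICE",
            "EMPLOYEE",
        ],
        "coverage_start": pd.to_datetime(
            ["2018-01-01", "2018-01-01", "2018-01-01", "2006-11-15", None]
        ),
    }
)

# Keep only the calls-for-service datasets
calls = datasets[datasets["TableType"] == "CALLS FOR SERVICE"]

# Keep only datasets whose coverage starts before 2010 (missing dates never match)
long_running = datasets[datasets["coverage_start"] < pd.Timestamp("2010-01-01")]

print(calls[["SourceName", "TableType"]])
print(long_running[["SourceName", "TableType", "coverage_start"]])
```

The same expressions should work on the real table returned by opd.datasets.query(), since the column names used here appear in its output.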

With its optional inputs, query can be used to filter for desired data. Here is a very specific query using all optional inputs:

[8]:

ds = opd.datasets.query(source_name="Menlo Park", state="California", agency="Menlo Park", table_type="CALLS FOR SERVICE")
ds

[8]:

	State	SourceName	Agency	AgencyFull	TableType	coverage_start	coverage_end	last_coverage_check	Year	agency_originated	...	source_url	readme	URL	DataType	date_field	dataset_id	agency_field	min_version	py_min_version	query
187	California	Menlo Park	Menlo Park	Menlo Park Police Department	CALLS FOR SERVICE	2018-01-01	2018-12-31	07/06/2023	2018	NaN	...	https://data.menlopark.gov/datasets/MenloPark:...	<NA>	https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/...	ArcGIS	<NA>	NaN	<NA>	<NA>	<NA>	NaN
188	California	Menlo Park	Menlo Park	Menlo Park Police Department	CALLS FOR SERVICE	2019-01-01	2019-12-31	07/06/2023	2019	NaN	...	https://data.menlopark.gov/datasets/MenloPark:...	<NA>	https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/...	ArcGIS	<NA>	NaN	<NA>	<NA>	<NA>	NaN
189	California	Menlo Park	Menlo Park	Menlo Park Police Department	CALLS FOR SERVICE	2020-01-01	2020-12-31	07/06/2023	2020	NaN	...	https://data.menlopark.gov/datasets/MenloPark:...	<NA>	https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/...	ArcGIS	<NA>	NaN	<NA>	<NA>	<NA>	NaN
190	California	Menlo Park	Menlo Park	Menlo Park Police Department	CALLS FOR SERVICE	2021-01-01	2021-12-31	07/06/2023	2021	NaN	...	https://data.menlopark.gov/datasets/MenloPark:...	<NA>	https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/...	ArcGIS	<NA>	NaN	<NA>	<NA>	<NA>	NaN
191	California	Menlo Park	Menlo Park	Menlo Park Police Department	CALLS FOR SERVICE	2022-01-01	2022-12-31	04/06/2025	2022	NaN	...	https://data.menlopark.gov/datasets/MenloPark:...	<NA>	https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/...	ArcGIS	<NA>	NaN	<NA>	<NA>	<NA>	NaN
192	California	Menlo Park	Menlo Park	Menlo Park Police Department	CALLS FOR SERVICE	2023-01-01	2023-12-31	04/06/2025	2023	NaN	...	https://data.menlopark.gov/datasets/MenloPark:...	<NA>	https://services7.arcgis.com/uRrQ0O3z2aaiIWYU/...	ArcGIS	<NA>	NaN	<NA>	<NA>	<NA>	NaN

6 rows × 22 columns

get_table_types finds available table types in OPD. Here, we use the optional contains input to only get the table types containing the word “STOPS”:

[9]:

table_types = opd.datasets.get_table_types(contains="STOPS")
table_types

[9]:

['PEDESTRIAN STOPS',
 'STOPS',
 'STOPS - INCIDENTS',
 'STOPS - SUBJECTS',
 'TRAFFIC STOPS',
 'TRAFFIC STOPS - INCIDENTS',
 'TRAFFIC STOPS - SUBJECTS']
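Conceptually, the contains input behaves like a substring filter over the list of table type names. The sketch below illustrates the idea in plain Python; it assumes a simple case-sensitive substring match and is not OPD's actual implementation. The function name filter_table_types and the list known_types are hypothetical.

```python
from typing import List, Optional


def filter_table_types(table_types: List[str], contains: Optional[str] = None) -> List[str]:
    """Return the sorted table types, optionally keeping only those containing a substring.

    Illustration only: assumes a case-sensitive substring match.
    """
    if contains is None:
        return sorted(table_types)
    return sorted(t for t in table_types if contains in t)


# A small, hypothetical subset of table types that appear in this guide
known_types = [
    "ARRESTS",
    "CALLS FOR SERVICE",
    "FIELD CONTACTS",
    "PEDESTRIAN STOPS",
    "STOPS",
    "STOPS - INCIDENTS",
    "STOPS - SUBJECTS",
    "TRAFFIC STOPS",
    "TRAFFIC STOPS - INCIDENTS",
    "TRAFFIC STOPS - SUBJECTS",
    "USE OF FORCE",
]

stops_types = filter_table_types(known_types, contains="STOPS")
print(stops_types)
```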

Loading Data#

The Source class is used to explore datasets and load data. We first need to create a source, which we can use to view all datasets from that source. Let’s create a source for Columbia, South Carolina. We need to specify the state because there are datasets from cities named Columbia in multiple states.

[10]:

src = opd.Source("Columbia", state="South Carolina")
src.datasets

[10]:

	State	SourceName	Agency	AgencyFull	TableType	coverage_start	coverage_end	last_coverage_check	Year	agency_originated	...	source_url	readme	URL	DataType	date_field	dataset_id	agency_field	min_version	py_min_version	query
1368	South Carolina	Columbia	Columbia	Columbia Police Department	ARRESTS	2016-01-01	2024-12-31	06/01/2025	MULTIPLE	NaN	...	https://coc-colacitygis.opendata.arcgis.com/da...	<NA>	https://services1.arcgis.com/Mnt8FoJcogKtoVBs/...	ArcGIS	Arrest_Date	NaN	<NA>	0.2	<NA>	NaN
1369	South Carolina	Columbia	Columbia	Columbia Police Department	FIELD CONTACTS	2016-01-01	2024-12-31	06/01/2025	MULTIPLE	NaN	...	https://coc-colacitygis.opendata.arcgis.com/da...	<NA>	https://services1.arcgis.com/Mnt8FoJcogKtoVBs/...	ArcGIS	TOC	NaN	<NA>	<NA>	<NA>	NaN

2 rows × 22 columns

To get a list of available table types:

[11]:

src.get_tables_types()

[11]:

['ARRESTS', 'FIELD CONTACTS']

You can get the number of records for a dataset using get_count. Let’s get the number of records in the year 2022 for the FIELD CONTACTS dataset.

[12]:

src.get_count("FIELD CONTACTS", 2022)

[12]:

You can find which years are available for a given table type:

[13]:

src.get_years(table_type="FIELD CONTACTS")

[13]:

[2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024]
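For a multi-year dataset, the available years follow from the coverage_start and coverage_end dates in the source table. The sketch below shows that relationship in plain Python for the FIELD CONTACTS coverage dates listed above. The helper years_covered is hypothetical and only illustrates the idea; it is not how OPD computes the result.

```python
from datetime import date
from typing import List


def years_covered(coverage_start: date, coverage_end: date) -> List[int]:
    """List every calendar year touched by an inclusive coverage range (illustration only)."""
    return list(range(coverage_start.year, coverage_end.year + 1))


# Coverage dates of the Columbia FIELD CONTACTS dataset as listed in src.datasets
field_contact_years = years_covered(date(2016, 1, 1), date(2024, 12, 31))
print(field_contact_years)
```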

Now, let’s load in some field contacts data for 2022.

[14]:

tbl = src.load("FIELD CONTACTS", 2022)
tbl

[14]:

src_obj:                State SourceName    Agency                  AgencyFull  \
1368  South Carolina   Columbia  Columbia  Columbia Police Department
1369  South Carolina   Columbia  Columbia  Columbia Police Department

           TableType coverage_start coverage_end last_coverage_check  \
1368         ARRESTS     2016-01-01   2024-12-31          06/01/2025
1369  FIELD CONTACTS     2016-01-01   2024-12-31          06/01/2025

          Year agency_originated  ...  \
1368  MULTIPLE               NaN  ...
1369  MULTIPLE               NaN  ...

                                             source_url readme  \
1368  https://coc-colacitygis.opendata.arcgis.com/da...   <NA>
1369  https://coc-colacitygis.opendata.arcgis.com/da...   <NA>

                                                    URL DataType   date_field  \
1368  https://services1.arcgis.com/Mnt8FoJcogKtoVBs/...   ArcGIS  Arrest_Date
1369  https://services1.arcgis.com/Mnt8FoJcogKtoVBs/...   ArcGIS          TOC

     dataset_id agency_field min_version py_min_version query
1368        NaN         <NA>         0.2           <NA>   NaN
1369        NaN         <NA>        <NA>           <NA>   NaN

[2 rows x 22 columns],
state: South Carolina,
source_name: Columbia,
agency: Columbia,
table_type: FIELD CONTACTS,
date: 2022,
description: Field Interview is a collection of data resulting from citizen contact related to suspicious activity.,
url: https://services1.arcgis.com/Mnt8FoJcogKtoVBs/arcgis/rest/services/FieldInterview/FeatureServer/0,
date_field: TOC,
source_url: https://coc-colacitygis.opendata.arcgis.com/datasets/ColaCityGIS::field-interview-1-1-2016-3-31-2022/about,
urls: {'source_url': 'https://coc-colacitygis.opendata.arcgis.com/datasets/ColaCityGIS::field-interview-1-1-2016-3-31-2022/about', 'readme': None, 'data': 'https://services1.arcgis.com/Mnt8FoJcogKtoVBs/arcgis/rest/services/FieldInterview/FeatureServer/0'}

The loaded data is contained in a pandas DataFrame in the table attribute.

NOTE: Known date fields will automatically be converted to pandas datetime format (pandas Period in rare cases). To keep original date format, set date_format=False when calling load. With date_format=False, the loaded data will be exactly the same as the raw source data.
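To see why the datetime conversion is useful, the sketch below converts a text date column with plain pandas and then uses the .dt accessor on the result. It illustrates the effect only, not OPD's internal conversion; the DataFrame is a hand-built stand-in, and the second timestamp is invented for the example.

```python
import pandas as pd

# Hypothetical raw data: the TOC column holds dates as text
raw = pd.DataFrame({"TOC": ["2022-01-01 21:47:00", "2022-03-15 08:05:00"]})

# Convert the text column to pandas datetime values
converted = raw.copy()
converted["TOC"] = pd.to_datetime(converted["TOC"])

# Datetime columns support date-aware operations through the .dt accessor
months = converted["TOC"].dt.month.tolist()

print(raw["TOC"].dtype)
print(converted["TOC"].dtype)
print(months)
```

If you load with date_format=False, a conversion like this can still be applied afterward to any column you choose.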

[15]:

tbl.table.head(2)

[15]:

	OBJECTID	Case_Num	TOC	Address	City	Zip	State	Age	Race	Sex	Contact_Type	Year	geometry
0	25351	220000108	2022-01-01 21:47:00	12XX Main St		29201		32	W	M	Field Interview	2022.0	POINT (1989801.776 788862.968)
1	44038	220000108	2022-01-01 21:47:00	12XX Main St		29201		32	W	M	Field Interview	2022.0	POINT (1989801.776 788862.968)

To request multiple years of data, you can include the years in the “date” parameter in the form of [Start Year, Stop Year]. Date ranges can only be used for multi-year datasets.

In src.datasets, a multi-year dataset has the value “MULTIPLE” in the “Year” column, and its “coverage_start” and “coverage_end” columns give the range of dates covered by that TableType. For these datasets, the “date” parameter can also be set to “MULTIPLE” to request the entire dataset. For more information on year/date filtering, see the Date Filtering Guide.

[18]:

multiyear_tbl = src.load("FIELD CONTACTS", date=[2021, 2024])
multiyear_tbl.table['Year'].value_counts()

[18]:

Year
2021.0    4786
2023.0    2831
2024.0    2585
2022.0     984
Name: count, dtype: int64

Data can be saved locally as CSV, feather, and parquet files. This allows you to:

Open the data using the software of your choice
Re-open the data in OPD from a local copy

[19]:

tbl.to_csv()  # Can also call to_feather and to_parquet with the same inputs
new_src = opd.Source("Columbia", state="South Carolina")
new_tbl = new_src.load_csv("FIELD CONTACTS", 2022)  # Can also call load_feather and load_parquet with the same inputs
new_tbl.table.head(2)

[19]:

	OBJECTID	Case_Num	TOC	Address	City	Zip	State	Age	Race	Sex	Contact_Type	Year	geometry
0	25351	220000108.0	2022-01-01 21:47:00	12XX Main St	NaN	29201	NaN	32.0	W	M	Field Interview	2022.0	POINT (1989801.7762467265 788862.9678477645)
1	44038	220000108.0	2022-01-01 21:47:00	12XX Main St	NaN	29201	NaN	32.0	W	M	Field Interview	2022.0	POINT (1989801.7762467265 788862.9678477645)
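Note that the reloaded table is not identical to the original: blank City and State values now appear as NaN, and Case_Num is shown as a float. CSV is a plain-text format that does not store column types, so types are inferred again when the file is read. The sketch below reproduces this kind of change with plain pandas and an in-memory buffer; the small DataFrame is a hand-built stand-in, not data loaded by OPD.

```python
import io

import pandas as pd

# Hypothetical stand-in table with a blank text column and a ZIP code stored as text
original = pd.DataFrame(
    {
        "Case_Num": [220000108, 220000108],
        "Address": ["12XX Main St", "12XX Main St"],
        "City": ["", ""],
        "Zip": ["29201", "29201"],
    }
)

# Write to an in-memory CSV and read it back
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Blank strings come back as missing values, and the text ZIP codes come back as integers
print(original["City"].tolist(), "->", reloaded["City"].tolist())
print(original["Zip"].tolist(), "->", reloaded["Zip"].tolist())
```

Binary formats such as feather and parquet store column types alongside the data, so they are worth considering when exact types matter.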

Some datasets contain data for every agency in a state. In this case, you may want to know which agencies are available and, optionally, to list only the agencies whose names contain a given string such as “Arlington”.

[21]:

src = opd.Source("Virginia")
agencies = src.get_agencies(table_type="STOPS", partial_name="Arlington")
agencies

[21]:

['Arlington County Police Department', "Arlington County Sheriff's Office"]

We may also want to load data from only a specific agency:

[22]:

tbl = src.load(table_type="STOPS", date=2022, agency="Arlington County Police Department")


Data Standardization#

One of the challenges in analyzing police data is that different agencies will use different column names for the same data and will use different codes and terms for the data in the columns. Particularly, if you are looking at multiple datasets, it is valuable for the data to be standardized so that you know in advance what some key columns will be called and what values will be in those columns.

To provide the user with more consistent column names and data, OpenPoliceData provides powerful tools to automatically standardize column names and data. Columns that OpenPoliceData can standardize include:

Date
Time
Gender
Age
Race
Ethnicity

In addition, OpenPoliceData will combine separate date and time columns into a single datetime column and race and ethnicity into a single combined race column.
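As an illustration of what combining date and time columns means, the sketch below merges two text columns into one datetime column with plain pandas. It is not OPD's implementation; the column names INC_DATE and INC_TIME and the values are hypothetical.

```python
import pandas as pd

# Hypothetical table with the date and the time stored in separate text columns
incidents = pd.DataFrame(
    {
        "INC_DATE": ["2022-01-04", "2022-01-05"],
        "INC_TIME": ["00:46", "13:30"],
    }
)

# Join the two text columns and parse the result as a single datetime column
incidents["DATETIME"] = pd.to_datetime(incidents["INC_DATE"] + " " + incidents["INC_TIME"])

print(incidents)
```

A single datetime column makes it easier to sort records chronologically or filter by time of day.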

Let’s examine what columns are in the Phoenix Use of Force dataset:

[ ]:

src = opd.Source("Phoenix")
tbl = src.load(table_type="USE OF FORCE", date=2022, pbar=False)  # pbar=False does not show progress bar
# Only showing 1st 30 due to large number of columns
tbl.table.columns[:30]

Index(['INC_IA_NO', 'INC_IR_NO', 'EMP_BADGE_NO', 'CIT_NUMBER', 'INC_DATE',
       'INC_YEAR', 'INC_HOUR', 'INC_DAY_WEEK', 'INC_LOC_COUNTY',
       'HUNDRED_BLOCK', 'INC_CITY', 'INC_STATE', 'INC_ZIPCODE', 'INC_PRECINCT',
       'CIT_INJURY_YN', 'CIT_GENDER', 'CIT_AGE', 'SUBJ_AGE_GROUP', 'CIT_RACE',
       'CIT_ETHNICITY', 'SIMPLE_SUBJ_RE_GRP', 'SIMPLE_EMPL_RE_GRP', 'EMPL_SEX',
       'CIT_RESIST_AGG_ACTIV_AGGRESSN', 'CIT_RESIST_ACTIVE_AGGRESSN',
       'CIT_RESIST_ACTIVE_RESISTANCE', 'CIT_RESIST_PASSIVE_RESISTANCE',
       'CIT_RESIST_PSYCH_INTIMIDATION', 'CIT_RESIST_VRBL_NONCOMPLIANCE',
       'CIT_RESIST_NONE'],
      dtype='object')

The data has several columns related to subject demographics:

‘CIT_GENDER’
‘CIT_AGE’
‘SUBJ_AGE_GROUP’
‘CIT_RACE’
‘CIT_ETHNICITY’
‘SIMPLE_SUBJ_RE_GRP’

These are not common labels used by datasets from other agencies (i.e. they cannot be predicted in advance). Additionally, the column labels are hard to decipher because they are not all clear and are not consistent in their naming conventions. The data uses 2 different short descriptions for the same subject (CIT and SUBJ), and the RE in ‘SIMPLE_SUBJ_RE_GRP’ stands for race/ethnicity, so there are 3 columns related to race and ethnicity.

Similarly, the officer demographics data uses the same RE abbreviation, and the user must know that EMPL is short for employee, which means the officer:

‘SIMPLE_EMPL_RE_GRP’
‘EMPL_SEX’

OPD’s data standardization will automatically identify columns and rename them to standard column names (while optionally keeping the original columns as RAW_{original name}). This enables the user to know in advance what the column names will be.

Now let’s examine what’s in the subject race and ethnicity columns:

[20]:

print(f"The unique values in the race column (CIT_RACE) are {tbl.table['CIT_RACE'].unique()}")
print(f"The unique values in the ethnicity column (CIT_ETHNICITY) are {tbl.table['CIT_ETHNICITY'].unique()}")
print(f"The unique values in the race ethnicity column (SIMPLE_SUBJ_RE_GRP) are {tbl.table['SIMPLE_SUBJ_RE_GRP'].unique()}")

The unique values in the race column (CIT_RACE) are ['White' 'Black' 'American Indian / Alaskan Native'
 'Asian / Pacific Islander' 'Unknown' 'AmIndian']
The unique values in the ethnicity column (CIT_ETHNICITY) are ['Hispanic' 'Non-Hispanic' 'Unknown']
The unique values in the race ethnicity column (SIMPLE_SUBJ_RE_GRP) are ['Hispanic' 'Black or African American' 'White' 'Other']

A few items to note:

Naming conventions are not consistent: Indigenous subjects are labeled ‘American Indian / Alaskan Native’ and ‘AmIndian’. Black subjects are labeled ‘Black’ in the race column and ‘Black or African American’ in the race/ethnicity column.
‘Asian / Pacific Islander’ and ‘American Indian / Alaskan Native’ appear in the race column but not the race/ethnicity column. That would only be correct if ALL Asian/Pacific Islander and Indigenous subjects were Hispanic/Latino (since Hispanic/Latino is typically used for Latinos of all races in race/ethnicity columns), which seems unlikely.

Let’s look at some cases where the race is ‘Asian / Pacific Islander’ or ‘American Indian / Alaskan Native’:

[21]:

i = tbl.table['CIT_RACE'].isin(['American Indian / Alaskan Native', 'Asian / Pacific Islander'])
tbl.table[['CIT_RACE', 'CIT_ETHNICITY', 'SIMPLE_SUBJ_RE_GRP']][i].head()

[21]:

	CIT_RACE	CIT_ETHNICITY	SIMPLE_SUBJ_RE_GRP
19	American Indian / Alaskan Native	Non-Hispanic	Other
20	American Indian / Alaskan Native	Non-Hispanic	Other
24	Asian / Pacific Islander	Non-Hispanic	Other
25	American Indian / Alaskan Native	Hispanic	Hispanic
27	Asian / Pacific Islander	Non-Hispanic	Other

Non-Hispanic subjects labeled as ‘Asian / Pacific Islander’ and ‘American Indian / Alaskan Native’ in the race column are relabeled as ‘Other’ in the race/ethnicity column. Thus, although it is often preferred to use a combined race/ethnicity column, the way that ‘SIMPLE_SUBJ_RE_GRP’ has been generated actually removes key information.
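A cross-tabulation is a compact way to check how one column maps onto another. The sketch below applies pandas' crosstab to a stand-in DataFrame containing only the 5 rows displayed above (the name sample is an assumption), so the counts describe those 5 rows, not the full dataset.

```python
import pandas as pd

# Stand-in for the 5 rows displayed above
sample = pd.DataFrame(
    {
        "CIT_RACE": [
            "American Indian / Alaskan Native",
            "American Indian / Alaskan Native",
            "Asian / Pacific Islander",
            "American Indian / Alaskan Native",
            "Asian / Pacific Islander",
        ],
        "CIT_ETHNICITY": [
            "Non-Hispanic",
            "Non-Hispanic",
            "Non-Hispanic",
            "Hispanic",
            "Non-Hispanic",
        ],
        "SIMPLE_SUBJ_RE_GRP": ["Other", "Other", "Other", "Hispanic", "Other"],
    }
)

# Rows: original race values. Columns: values of the combined race/ethnicity column.
mapping = pd.crosstab(sample["CIT_RACE"], sample["SIMPLE_SUBJ_RE_GRP"])
print(mapping)
```

On the real table, the same call with tbl.table in place of sample would summarize every record at once.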

OPD’s standardization allows the user to more quickly analyze data by automatically identifying columns, renaming them to standard column names, and standardizing the data in those columns.

[22]:

tbl.standardize()

Let’s look at what the standardization did using get_transform_map:

[23]:

std_map = tbl.get_transform_map(minimize=True)
for t in std_map:
    print(f"{t}\n")

orig_column_name: INC_DATE,
new_column_name: DATE,
data_maps: None

orig_column_name: CIT_RACE,
new_column_name: SUBJECT_RACE,
data_maps: {'White': 'WHITE', 'Black': 'BLACK', 'American Indian / Alaskan Native': 'INDIGENOUS', 'Asian / Pacific Islander': 'ASIAN/PACIFIC ISLANDER', 'Unknown': 'UNKNOWN', 'AmIndian': 'INDIGENOUS'}

orig_column_name: SIMPLE_EMPL_RE_GRP,
new_column_name: OFFICER_RACE/ETHNICITY,
data_maps: {'White': 'WHITE', 'Hispanic': 'HISPANIC/LATINO', 'Other': 'OTHER', 'Black or African American': 'BLACK', None: 'UNSPECIFIED'}

orig_column_name: CIT_ETHNICITY,
new_column_name: SUBJECT_ETHNICITY,
data_maps: {'Non-Hispanic': 'NON-HISPANIC/NON-LATINO', 'Hispanic': 'HISPANIC/LATINO', 'Unknown': 'UNKNOWN'}

orig_column_name: ['SUBJECT_RACE', 'SUBJECT_ETHNICITY'],
new_column_name: SUBJECT_RACE/ETHNICITY,
data_maps: None

orig_column_name: SUBJECT_RACE/ETHNICITY,
new_column_name: SUBJECT_RE_GROUP,
data_maps: None

orig_column_name: OFFICER_RACE/ETHNICITY,
new_column_name: OFFICER_RE_GROUP,
data_maps: None

orig_column_name: CIT_INJURY_YN,
new_column_name: SUBJECT_INJURY,
data_maps: {'Yes': 'INJURED', 'No': 'NO INJURY'}

orig_column_name: CIT_AGE,
new_column_name: SUBJECT_AGE,
data_maps: None

orig_column_name: SUBJ_AGE_GROUP,
new_column_name: SUBJECT_AGE_RANGE,
data_maps: {'30s': '30-39', '20s': '20-29', '40s': '40-49', '<20': '0-20', '50s': '50-59', '60s': '60-69', 'Not Available': 'Not Available', '70s': '70-79', '90s': '90-99', '80s': '80-89'}

orig_column_name: CIT_GENDER,
new_column_name: SUBJECT_GENDER,
data_maps: {'Male': 'MALE', 'Female': 'FEMALE'}

orig_column_name: EMPL_SEX,
new_column_name: OFFICER_GENDER,
data_maps: {'Male': 'MALE', 'Female': 'FEMALE', None: 'UNSPECIFIED'}

orig_column_name: INC_ZIPCODE,
new_column_name: ZIP_CODE,
data_maps: None

get_transform_map shows the changes made by standardization, including the following:

All demographics columns were identified.
The more informative CIT_RACE, rather than SIMPLE_SUBJ_RE_GRP, was identified as the race column.
The identified race (CIT_RACE) and ethnicity (CIT_ETHNICITY) columns were converted to SUBJECT_RACE and SUBJECT_ETHNICITY, respectively, and then SUBJECT_RACE and SUBJECT_ETHNICITY were combined into SUBJECT_RACE/ETHNICITY.
SUBJECT_RACE/ETHNICITY was copied to another column called SUBJECT_RE_GROUP. RE_GROUP columns like SUBJECT_RE_GROUP and OFFICER_RE_GROUP are added for those who want to easily use a RACE/ETHNICITY column if it exists or a RACE column otherwise. The RE_GROUP column will be a copy of the RACE/ETHNICITY column if one has been generated, or of the RACE column if a RACE column was found but a RACE/ETHNICITY column was not generated.
EMPL was identified as indicating officer demographics.
The cryptically named SIMPLE_EMPL_RE_GRP was identified as an OFFICER_RACE/ETHNICITY column.
Values of race, gender, age, and age group were standardized to values that will be consistent across all OPD-loaded tables.

In data_maps, get_transform_map also includes dictionaries indicating the original values (keys) and the resulting standardized values (values).
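To make the data maps concrete, the sketch below applies the race and ethnicity maps printed above to a few raw value pairs and then combines them into a single race/ethnicity value. The maps are copied from the output above. The combination rule (a Hispanic/Latino ethnicity takes precedence over race) is an assumption for illustration and is not confirmed to be OPD's exact logic; the function name combine_race_ethnicity is hypothetical.

```python
# Data maps copied from the get_transform_map output above
race_map = {
    "White": "WHITE",
    "Black": "BLACK",
    "American Indian / Alaskan Native": "INDIGENOUS",
    "Asian / Pacific Islander": "ASIAN/PACIFIC ISLANDER",
    "Unknown": "UNKNOWN",
    "AmIndian": "INDIGENOUS",
}
ethnicity_map = {
    "Non-Hispanic": "NON-HISPANIC/NON-LATINO",
    "Hispanic": "HISPANIC/LATINO",
    "Unknown": "UNKNOWN",
}


def combine_race_ethnicity(race: str, ethnicity: str) -> str:
    """Standardize raw race and ethnicity values, then merge them into one value.

    Assumption for illustration: Hispanic/Latino ethnicity takes precedence over race.
    """
    std_race = race_map[race]
    std_ethnicity = ethnicity_map[ethnicity]
    return "HISPANIC/LATINO" if std_ethnicity == "HISPANIC/LATINO" else std_race


# Hypothetical raw (race, ethnicity) pairs using values that occur in the dataset
raw_pairs = [
    ("American Indian / Alaskan Native", "Non-Hispanic"),
    ("AmIndian", "Non-Hispanic"),
    ("Asian / Pacific Islander", "Non-Hispanic"),
    ("American Indian / Alaskan Native", "Hispanic"),
]
combined = [combine_race_ethnicity(race, eth) for race, eth in raw_pairs]
print(combined)
```

Unlike the original ‘Other’ grouping, this keeps the two spellings of Indigenous subjects together and preserves the Asian/Pacific Islander category.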

Printing the columns shows that the standardized columns are at the front, while the original columns are prefixed with RAW_ and moved to the back of the list.

[24]:

print(f"The 1st 12 columns after standardization are: {tbl.table.columns[:12]}")
print(f"The last 9 columns after standardization are: {tbl.table.columns[-9:]}")

The 1st 12 columns after standardization are: Index(['DATE', 'SUBJECT_RACE', 'OFFICER_RACE/ETHNICITY', 'SUBJECT_ETHNICITY',
       'SUBJECT_RACE/ETHNICITY', 'SUBJECT_RE_GROUP', 'OFFICER_RE_GROUP',
       'SUBJECT_INJURY', 'SUBJECT_AGE', 'SUBJECT_AGE_RANGE', 'SUBJECT_GENDER',
       'OFFICER_GENDER'],
      dtype='object')
The last 9 columns after standardization are: Index(['RAW_INC_ZIPCODE', 'RAW_CIT_INJURY_YN', 'RAW_CIT_GENDER', 'RAW_CIT_AGE',
       'RAW_SUBJ_AGE_GROUP', 'RAW_CIT_RACE', 'RAW_CIT_ETHNICITY',
       'RAW_SIMPLE_EMPL_RE_GRP', 'RAW_EMPL_SEX'],
      dtype='object')

Finally, we can view the values that the new SUBJECT_RE_GROUP column contains:

[25]:

tbl.table["SUBJECT_RE_GROUP"].unique()
# NOTE: We also provide a columns enum so that the user does not have to remember the standardized column names. This would produce the same thing:
# tbl.table[opd.defs.columns.SUBJECT_RE_GROUP].unique()

[25]:

array(['HISPANIC/LATINO', 'BLACK', 'WHITE', 'INDIGENOUS',
       'ASIAN/PACIFIC ISLANDER', 'UNKNOWN'], dtype=object)