Data Standardization Guide#

One of the challenges in analyzing police data is that different agencies use different column names for the same type of data and use different codes and terms for the data in those columns. If you are looking at multiple datasets, it is particularly valuable for the data to be standardized so that the user knows in advance what some key columns will be called and what the values in those columns will be. To provide the user with more consistent column names and data, OpenPoliceData (OPD) provides powerful tools to automatically standardize column names and data values.

This guide provides details on how the standardization works and how to apply and customize the standardization.

Standardizable Columns#

Column types and the standardized column names that OpenPoliceData can standardize are shown below. If these column types are identified in the standardization process, they will be standardized.

[5]:

import openpolicedata as opd
import textwrap
# Convenience method for printing long text
def wrap_print(text, newline=True):
    text = "\n".join(textwrap.wrap(text, width=110))
    if newline:
        text = "\n"+text
    print(text)

[6]:

wrap_print("Attribute value below is an attribute of opd.defs.columns (i.e. opd.defs.columns.AGE_OFFICER) that can be used to access column names")
wrap_print('Column Name value below is the corresponding standardized column name (i.e. "OFFICER_AGE") that replaces the original column name '+
           'during the standardization process')

opd.Column


Attribute value below is an attribute of opd.defs.columns (i.e. opd.defs.columns.AGE_OFFICER) that can be used
to access column names

Column Name value below is the corresponding standardized column name (i.e. "OFFICER_AGE") that replaces the
original column name during the standardization process

[6]:

	Attribute	Column Name	Definition
0	AGE_OFFICER	OFFICER_AGE	Age of officer
1	AGE_OFFICER_SUBJECT	OFFICER/SUBJECT_AGE	Age of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
2	AGE_SUBJECT	SUBJECT_AGE	Age of subject
3	AGE_RANGE_OFFICER	OFFICER_AGE_RANGE	Age Range of officer
4	AGE_RANGE_OFFICER_SUBJECT	OFFICER/SUBJECT_AGE_RANGE	Age Range of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
5	AGE_RANGE_SUBJECT	SUBJECT_AGE_RANGE	Age Range of subject
6	AGENCY	AGENCY	Agency
7	DATE	DATE	Date
8	DATETIME	DATETIME	Combination of date and time when both columns are found (not generated when detected date column contains datetime values)
9	ETHNICITY_OFFICER	OFFICER_ETHNICITY	Ethnicity of officer
10	ETHNICITY_OFFICER_SUBJECT	OFFICER/SUBJECT_ETHNICITY	Ethnicity of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
11	ETHNICITY_SUBJECT	SUBJECT_ETHNICITY	Ethnicity of subject
12	FATAL_OFFICER	OFFICER_FATAL	Whether officer was fatally injured in an officer-involved shooting or use of force
13	FATAL_OFFICER_SUBJECT	OFFICER/SUBJECT_FATAL	Whether officer/subject was fatally injured in an officer-involved shooting or use of force
14	FATAL_SUBJECT	SUBJECT_FATAL	Whether subject was fatally injured in an officer-involved shooting or use of force
15	GENDER_OFFICER	OFFICER_GENDER	Gender of officer
16	GENDER_OFFICER_SUBJECT	OFFICER/SUBJECT_GENDER	Gender of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
17	GENDER_SUBJECT	SUBJECT_GENDER	Gender of subject
18	INCIDENT_ID	INCIDENT_ID	A unique incident ID given to an incident (arrest, use of force, etc.). It can be used to relate information across tables. Only standardized by Table.merge function.
19	INJURY_OFFICER	OFFICER_INJURY	Whether officer was injured in an officer-involved shooting or use of force
20	INJURY_OFFICER_SUBJECT	OFFICER/SUBJECT_INJURY	Whether officer/subject was injured in an officer-involved shooting or use of force
21	INJURY_SUBJECT	SUBJECT_INJURY	Whether subject was injured in an officer-involved shooting or use of force
22	NAME_OFFICER	OFFICER_NAME	Name of officer
23	NAME_OFFICER_SUBJECT	OFFICER/SUBJECT_NAME	Name of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
24	NAME_SUBJECT	SUBJECT_NAME	Name of subject
25	RACE_SUBJECT	SUBJECT_RACE	Race of subject
26	RACE_OFFICER_SUBJECT	OFFICER/SUBJECT_RACE	Race of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
27	RACE_OFFICER	OFFICER_RACE	Race of officer
28	RACE_ETHNICITY_SUBJECT	SUBJECT_RACE/ETHNICITY	Race/Ethnicity of subject
29	RACE_ETHNICITY_OFFICER_SUBJECT	OFFICER/SUBJECT_RACE/ETHNICITY	Race/Ethnicity of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
30	RACE_ETHNICITY_OFFICER	OFFICER_RACE/ETHNICITY	Race/Ethnicity of officer
31	RE_GROUP_OFFICER	OFFICER_RE_GROUP	Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of officer
32	RE_GROUP_OFFICER_SUBJECT	OFFICER/SUBJECT_RE_GROUP	Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
33	RE_GROUP_SUBJECT	SUBJECT_RE_GROUP	Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of subject
34	SUBJECT_OR_OFFICER	SUBJECT_OR_OFFICER	Whether row describes an officer or an subject/civilian
35	TIME	TIME	Time
36	ZIP_CODE	ZIP_CODE	Zip Code

NOTE: For consistency, OpenPoliceData always uses the term subject to describe the community member involved in a police interaction. This is consistent with the Stanford Open Policing Project.

Basic Standardization#

Basic standardization only requires calling standardize on a Table object.

[4]:

src = opd.Source("Virginia")
tbl = src.load(table_type="STOPS", year=2022, agency="Arlington County Police Department")
tbl.standardize()

print(f'The columns after standardization are: \n{tbl.table.columns}\n')

The columns after standardization are:
Index(['DATE', 'SUBJECT_RACE', 'SUBJECT_ETHNICITY', 'SUBJECT_RACE/ETHNICITY',
       'SUBJECT_RE_GROUP', 'SUBJECT_AGE', 'SUBJECT_GENDER', 'AGENCY',
       'JURISDICTION', 'REASON FOR STOP', 'PERSON TYPE', 'ENGLISH SPEAKING',
       'ACTION TAKEN', 'VIOLATION TYPE', 'SPECIFIC VIOLATION',
       'VIRGINIA CRIME CODE', 'PERSON SEARCHED', 'VEHICLE SEARCHED',
       'ADDITIONAL ARREST', 'FORCE USED BY OFFICER', 'FORCE USED BY SUBJECT',
       'RESIDENCY', 'RAW_STOP_DATE', 'RAW_AGENCY NAME', 'RAW_RACE',
       'RAW_ETHNICITY', 'RAW_AGE', 'RAW_GENDER'],
      dtype='object')

Standardized columns are always in all caps. These columns are standardized versions of the original columns that have been prepended with RAW_ (the original columns were lowercase in this dataset. Original column names may also be in all caps). By standardizing, the user does not need to know the exact name of a particular column (they vary greatly). The user can check if a standaridzed column exists by doing the following:

[6]:

if opd.Column.RE_GROUP_SUBJECT in tbl.table: # Alternatively: if "RE_GROUP_SUBJECT" in tbl.table
    race_col = tbl.table[opd.Column.RE_GROUP_SUBJECT]  # Alternatively: tbl.table["RE_GROUP_SUBJECT"]
    wrap_print(f"The values in the original subject race column are {tbl.table['RAW_RACE'].unique()}\n")
    wrap_print(f"The values in the original subject race column are {tbl.table['RAW_ETHNICITY'].unique()}\n")
    wrap_print(f"After combining race and ethnicity columns, the values in the standardized subject race/ethnicity column are {race_col.unique()}")
else:
    wrap_print("There is no subject race/ethnicity or race column")


The values in the original subject race column are ['WHITE' 'BLACK OR AFRICAN AMERICAN' 'AMERICAN INDIAN OR
ALASKA NATIVE'  'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER' 'UNKNOWN']

The values in the original subject race column are ['HISPANIC OR LATINO' 'NOT HISPANIC OR LATINO' 'UNKNOWN']

After combining race and ethnicity columns, the values in the standardized subject race/ethnicity column are
['HISPANIC/LATINO' 'BLACK' 'INDIGENOUS' 'WHITE' 'ASIAN/PACIFIC ISLANDER'  'UNKNOWN']

The values in each standardized column have been converted from the raw values to standardized ones. Thus, in the standardized race/ethnicty column, Black individuals will always be labeled as BLACK despite different datasets using many different encodings (including B, African American, Black or African American, and various shortened versions or typos). Data types will also be consistent, such as as dates will use a native pandas date format:

[7]:

tbl.table["DATE"].dtype

[7]:

dtype('<M8[ns]')

get_transform_map can be used to view changes that were made during standardization. If True, the minimize input removes the attribute orig_value_counts which contains the counts for the 5 most used values in each original column that was standardized.

[8]:

std_map = tbl.get_transform_map(minimize=True)
for t in std_map:
    print(f"{t}\n")

orig_column_name: STOP_DATE,
new_column_name: DATE,
data_maps: None

orig_column_name: RACE,
new_column_name: SUBJECT_RACE,
data_maps: {'WHITE': 'WHITE', 'BLACK OR AFRICAN AMERICAN': 'BLACK', 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER': 'ASIAN/PACIFIC ISLANDER', 'UNKNOWN': 'UNKNOWN', 'AMERICAN INDIAN OR ALASKA NATIVE': 'INDIGENOUS'}

orig_column_name: ETHNICITY,
new_column_name: SUBJECT_ETHNICITY,
data_maps: {'NOT HISPANIC OR LATINO': 'NON-HISPANIC/NON-LATINO', 'HISPANIC OR LATINO': 'HISPANIC/LATINO', 'UNKNOWN': 'UNKNOWN'}

orig_column_name: ['SUBJECT_RACE', 'SUBJECT_ETHNICITY'],
new_column_name: SUBJECT_RACE/ETHNICITY,
data_maps: None

orig_column_name: SUBJECT_RACE/ETHNICITY,
new_column_name: SUBJECT_RE_GROUP,
data_maps: None

orig_column_name: AGE,
new_column_name: SUBJECT_AGE,
data_maps: None

orig_column_name: GENDER,
new_column_name: SUBJECT_GENDER,
data_maps: {'MALE': 'MALE', 'FEMALE': 'FEMALE', 'UNKNOWN': 'UNKNOWN', None: 'UNSPECIFIED', 'OTHER': 'OTHER'}

orig_column_name: AGENCY NAME,
new_column_name: AGENCY,
data_maps: None

The original data in the table has been replaced by the standardized data:

[9]:

tbl.table.head()

[9]:

	DATE	SUBJECT_RACE	SUBJECT_ETHNICITY	SUBJECT_RACE/ETHNICITY	SUBJECT_RE_GROUP	SUBJECT_AGE	SUBJECT_GENDER	AGENCY	JURISDICTION	REASON FOR STOP	...	ADDITIONAL ARREST	FORCE USED BY OFFICER	FORCE USED BY SUBJECT	RESIDENCY	RAW_STOP_DATE	RAW_AGENCY NAME	RAW_RACE	RAW_ETHNICITY	RAW_AGE	RAW_GENDER
0	2022-01-01	WHITE	HISPANIC/LATINO	HISPANIC/LATINO	HISPANIC/LATINO	17	MALE	Arlington County Police Department	ARLINGTON CO	CALLS FOR SERVICE	...	None	NO	NO	None	2022-01-01	Arlington County Police Department	WHITE	HISPANIC OR LATINO	17	MALE
1	2022-01-01	BLACK	NON-HISPANIC/NON-LATINO	BLACK	BLACK	38	MALE	Arlington County Police Department	ARLINGTON CO	TRAFFIC VIOLATION	...	None	NO	NO	None	2022-01-01	Arlington County Police Department	BLACK OR AFRICAN AMERICAN	NOT HISPANIC OR LATINO	38	MALE
2	2022-01-01	BLACK	NON-HISPANIC/NON-LATINO	BLACK	BLACK	29	FEMALE	Arlington County Police Department	ARLINGTON CO	TRAFFIC VIOLATION	...	None	NO	NO	None	2022-01-01	Arlington County Police Department	BLACK OR AFRICAN AMERICAN	NOT HISPANIC OR LATINO	29	FEMALE
3	2022-01-01	WHITE	HISPANIC/LATINO	HISPANIC/LATINO	HISPANIC/LATINO	31	MALE	Arlington County Police Department	ARLINGTON CO	TRAFFIC VIOLATION	...	None	NO	NO	None	2022-01-01	Arlington County Police Department	WHITE	HISPANIC OR LATINO	31	MALE
4	2022-01-01	INDIGENOUS	NON-HISPANIC/NON-LATINO	INDIGENOUS	INDIGENOUS	44	MALE	Arlington County Police Department	ARLINGTON CO	CALLS FOR SERVICE	...	None	NO	NO	None	2022-01-01	Arlington County Police Department	AMERICAN INDIAN OR ALASKA NATIVE	NOT HISPANIC OR LATINO	44	MALE

5 rows × 28 columns

Getting Demographics Column Names#

Names of race, gender, and age columns can be accessed with the get_race_col, get_gender_col, and get_age_col functions of the Table class, respectively.

[10]:

subject_race_col = tbl.get_race_col()  # or role='SUBJECT'
officer_race_col = tbl.get_race_col(role='OFFICER')  # This returns None because the data has no officer race column

Combining Columns#

The following columns are combined when they are both identified:

DATE and TIME are combined into DATETIME
SUBJECT_RACE and SUBJECT_ETHNICITY are combined into SUBJECT_RACE/ETHNCITY.
OFFICER_RACE and OFFICER_ETHNICITY are combined into OFFICER_RACE/ETHNCITY.
OFFICER/SUBJECT_RACE and OFFICER/SUBJECT_ETHNICITY are combined into OFFICER/SUBJECT_RACE/ETHNCITY. For more info on OFFICER/SUBJECT columns, see here.

RE_GROUP (Race/Ethnicity Group) Columns#

For users who might want to use the race/ethnicity column if it exists and the race column if not, standardization includes the following columns to enable this capability:

SUBJECT_RE_GROUP: Added if SUBJECT_RACE/ETHNCITY or SUBJECT_RACE exist. Will be a copy of SUBJECT_RACE/ETHNCITY if it exists otherwise a copy of SUBJECT_RACE if it exists.
OFFICER_RE_GROUP: Added if OFFICER_RACE/ETHNCITY or OFFICER_RACE exist. Will be a copy of OFFICER_RACE/ETHNCITY if it exists otherwise a copy of OFFICER_RACE if it exists.
OFFICER/SUBJECT_RE_GROUP: Added if OFFICER/SUBJECT_RACE/ETHNCITY or OFFICER/SUBJECT_RACE exist. Will be a copy of OFFICER/SUBJECT_RACE/ETHNCITY if it exists otherwise a copy of OFFICER/SUBJECT_RACE if it exists. For more info on OFFICER/SUBJECT columns, see here.

Officer and Subject Demographics in Separate Rows#

OFFICER/SUBJECT columns (OFFICER/SUBJECT_RACE, OFFICER/SUBJECT_AGE, etc.) exist when a single row either describes an officer or a subject. In these cases, the SUBJECT_OR_OFFICER column will indicate if a given row is for a subject or an officer. See here for an example.

Advanced Standardization#

This section describes topics related to:

Customizing the standardization process
Explaining how rare cases are handled

Multi-Subject and Multi-Officer Data#

Some datasets encode the demographics of multiple subjects and/or officers into a single row of data. In these cases, the standardized demographics are encoded differently than in a typical signal subject / officer per row case. This is best demonstrated by the following example.

[11]:

src = opd.Source("New Orleans")
tbl = src.load(opd.TableType.USE_OF_FORCE, 2022)
# Find rows with demographics of multiple subjects. New Orleans uses the | to separate persons. This is not typical and would not be known in advance.
is_mult = tbl.table['subject_gender'].str.contains("|",regex=False).fillna(False)
tbl.table[is_mult][['subject_ethnicity','subject_gender','subject_age']].head()

[11]:

	subject_ethnicity	subject_gender	subject_age
6	Black \| Black	Female \| Male	42 \| 42
8	Black \| Black	Male \| Male	46 \| 43
9	Black \| Black \| Black \| Black \| Black \| Black	Male \| Male \| Male \| Female \| Female \| Male	19 \| 18 \| 17 \| 14 \| 17 \| 31
25	Black \| Black \| Black	Female \| Female \| Female	18 \| 40 \| 24
26	Hispanic \| Black \| Black	Male \| Male \| Male	16 \| 16 \| 15

In the New Orleans use of force data, when multiple subjects are involved in a use of force incident, their demographics are separated by a | symbol. There is no standard way for encoding this type of multi-subject or multi-officer case and a variety of other formats have been observed for encoding the demographics of multiple people in a row. OPD standardizes these cases into a common format:

[12]:

tbl.standardize()
tbl.table[[opd.Column.RE_GROUP_SUBJECT, opd.Column.GENDER_SUBJECT, opd.Column.AGE_SUBJECT]][is_mult].head()

[12]:

	SUBJECT_RE_GROUP	SUBJECT_GENDER	SUBJECT_AGE
6	{0: 'BLACK', 1: 'BLACK'}	{0: 'FEMALE', 1: 'MALE'}	{0: 42, 1: 42}
8	{0: 'BLACK', 1: 'BLACK'}	{0: 'MALE', 1: 'MALE'}	{0: 46, 1: 43}
9	{0: 'BLACK', 1: 'BLACK', 2: 'BLACK', 3: 'BLACK...	{0: 'MALE', 1: 'MALE', 2: 'MALE', 3: 'FEMALE',...	{0: 19, 1: 18, 2: 17, 3: 14, 4: 17, 5: 31}
25	{0: 'BLACK', 1: 'BLACK', 2: 'BLACK'}	{0: 'FEMALE', 1: 'FEMALE', 2: 'FEMALE'}	{0: 18, 1: 40, 2: 24}
26	{0: 'HISPANIC/LATINO', 1: 'BLACK', 2: 'BLACK'}	{0: 'MALE', 1: 'MALE', 2: 'MALE'}	{0: 16, 1: 16, 2: 15}

OPD encodes multiple subject or officer demographic data as dictionaries. The dictionary key corresponds to which subject or officer that the demographics describe, and the value indicates the race, gender, age, etc. For example, if SUBJECT_RE_GROUP is {0:’BLACK’,1:’WHITE’} and SUBJECT_GENDER is {0:’MALE’,’1’:’FEMALE’}, the first subject is a Black male and the second is a White female.

In some cases, it may be desirable to expand this data from a single row per incident (i.e. containing demographic data for every person involved in the incident) to a single row per subject or officer. The expand function provides this capability:

[13]:

# mismatch indicates what to do when a column to be expanded has the incorrect number of values
# (such as if there are 2 genders and races but only 1 age). 'nan' fills with NaNs. 'splitsingle'
# uses the single value for each expanded row in cases where a column has only 1 value. Default
# is to throw an error.
tbl.expand(person_type='subject', mismatch='nan')
is_mult_index = is_mult[is_mult].index
tbl.table[[opd.Column.RE_GROUP_SUBJECT, opd.Column.GENDER_SUBJECT, opd.Column.AGE_SUBJECT]].loc[is_mult_index[0]:is_mult_index[0]+7]

[13]:

	SUBJECT_RE_GROUP	SUBJECT_GENDER	SUBJECT_AGE
6	BLACK	FEMALE	42
7	BLACK	MALE	42
8	BLACK	MALE	19
9	BLACK	MALE	46
10	BLACK	MALE	43
11	BLACK	MALE	19
12	BLACK	MALE	18
13	BLACK	MALE	17

In the above example, the 1st 2 multiple demographic rows can be seen to have been split into 2 rows each (Black female 42 and Black male 42 and Black male 46 and Black male 43).

Multi-Race or Multi-Ethnicity Data#

Some datasets list multiple races for multi-racial individuals. The California use of force datasets is an example.

[14]:

src = opd.Source("California")
# USE_OF_FORCE_SUBJECTS_OFFICERS datasets contain information on officers and subjects involved in use of force incidents
# They will have a column indicating whether each row applies to an officer or subject
tbl = src.load(opd.TableType.USE_OF_FORCE_SUBJECTS_OFFICERS, 2022)
# Find multi-racial individuals
is_mult = tbl.table['Race_Ethnic_Group'].str.contains(",").fillna(False)
tbl.table[is_mult][['Civilian_Officer','Race_Ethnic_Group','Age','Gender']].head()

[14]:

	Civilian_Officer	Race_Ethnic_Group	Age	Gender
91	Officer	white, asian	36_40	male
106	Officer	american_indian, white	36_40	male
109	Officer	black, white	31_35	male
210	Officer	hispanic, white	41_45	male
244	Officer	black, hispanic	26_30	male

The race/ethnicity column of this datasets includes multiple races separated by commas for multi-racial individuals (also note that the age column is actually an age range column). OPD identifies that the multiple races apply to an indvidual (as opposed to multiple people) and stores the multiple races as a comma-separated list:

[15]:

tbl.standardize()
cols = [opd.Column.SUBJECT_OR_OFFICER, opd.Column.RE_GROUP_OFFICER_SUBJECT, opd.Column.GENDER_OFFICER_SUBJECT, opd.Column.AGE_RANGE_OFFICER_SUBJECT]
tbl.table[is_mult][cols].head()

[15]:

	SUBJECT_OR_OFFICER	OFFICER/SUBJECT_RE_GROUP	OFFICER/SUBJECT_GENDER	OFFICER/SUBJECT_AGE_RANGE
91	OFFICER	ASIAN, WHITE	MALE	36-40
106	OFFICER	INDIGENOUS, WHITE	MALE	36-40
109	OFFICER	BLACK, WHITE	MALE	31-35
210	OFFICER	HISPANIC/LATINO, WHITE	MALE	41-45
244	OFFICER	BLACK, HISPANIC/LATINO	MALE	26-30

Customizing Race/Ethnicity Categories#

standardize provides several inputs for customizing the race/ethnicity categorization that are described below.

`race_cats`#

standardize has a default set of race category values that it uses:

[16]:

default_race_cats = opd.defs.get_race_cats()
default_race_cats

[16]:

{'AAPI': 'ASIAN/PACIFIC ISLANDER',
 'ASIAN': 'ASIAN',
 'BLACK': 'BLACK',
 'LATINO': 'HISPANIC/LATINO',
 'MULTIPLE': 'MULTIPLE',
 'INDIGENOUS': 'INDIGENOUS',
 'OTHER': 'OTHER',
 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED',
 'WHITE': 'WHITE'}

The race categories in default_race_cats are a dictionary. The values of the dictionary are the race category values that will be used during standardization. The keys of the dictionary are codes used internally to look up which category value to use.

[17]:

src = opd.Source("Virginia")
tbl = src.load("STOPS", 2022,agency="Norfolk Police Department")
tbl.standardize()
s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps

orig_column_name: RACE
new_column_name: SUBJECT_RACE

Data Transformation Map:

[17]:

{'BLACK OR AFRICAN AMERICAN': 'BLACK',
 'WHITE': 'WHITE',
 'UNKNOWN': 'UNKNOWN',
 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER': 'ASIAN/PACIFIC ISLANDER',
 'AMERICAN INDIAN OR ALASKA NATIVE': 'INDIGENOUS'}

The data transformation map is a dictionary showing original data values (keys) and standardized values (dictionary values). It can be seen that each of the original values was able to be identified and transformed to a value in the default_race_cats dictionary shown above.

If the user desires to use different names for the race categories, they can do this by assigning new values:

[18]:

new_race_cats = opd.defs.get_race_cats()
new_race_cats['UNKNOWN'] = "Unrecognized"

tbl = src.load("STOPS", 2022, agency="Norfolk Police Department")
tbl.standardize(race_cats=new_race_cats)
s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps

orig_column_name: RACE
new_column_name: SUBJECT_RACE

Data Transformation Map:

[18]:

{'BLACK OR AFRICAN AMERICAN': 'BLACK',
 'WHITE': 'WHITE',
 'UNKNOWN': 'Unrecognized',
 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER': 'ASIAN/PACIFIC ISLANDER',
 'AMERICAN INDIAN OR ALASKA NATIVE': 'INDIGENOUS'}

The original value UNKNOWN is now replaced by Unrecognized.

Now, let’s see a more complicated example:

[19]:

src = opd.Source("California")
tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, "MULTIPLE")
tbl.standardize()
s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps

orig_column_name: race
new_column_name: SUBJECT_RACE/ETHNICITY

Data Transformation Map:

[19]:

{'White': 'WHITE',
 'Hispanic': 'HISPANIC/LATINO',
 'Black': 'BLACK',
 'Other': 'OTHER',
 'American Indian': 'INDIGENOUS',
 'Other Asian': 'ASIAN',
 'Filipino': 'Filipino',
 'Vietnamese': 'Vietnamese',
 'Pacific Islander': 'ASIAN/PACIFIC ISLANDER',
 'Asian Indian': 'Asian Indian',
 'Chinese': 'Chinese',
 'Korean': 'Korean',
 'Laotian': 'Laotian',
 'Samoan': 'Samoan',
 'Cambodian': 'Cambodian',
 'Japanese': 'Japanese',
 'Hawaiian': 'ASIAN/PACIFIC ISLANDER',
 'Guamanian': 'Guamanian',
 <NA>: 'UNSPECIFIED'}

In this case, some of the original values are converted to standardized values from default_race_cats and some remain unchanged. By default, values that cannot be identified are not changed (see no_id input to change).

We can create an expanded list of race categories that include more categories. These additional categories are based on categories that have been observed in some datasets. The expanded race categories contain all allowable keys for the race_cats input to standardize.

[20]:

expanded_race_cats = opd.defs.get_race_cats(expand=True)
expanded_race_cats

[20]:

{'AAPI': 'ASIAN/PACIFIC ISLANDER',
 'ASIAN': 'ASIAN',
 'BLACK': 'BLACK',
 'LATINO': 'HISPANIC/LATINO',
 'MULTIPLE': 'MULTIPLE',
 'INDIGENOUS': 'INDIGENOUS',
 'OTHER': 'OTHER',
 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED',
 'WHITE': 'WHITE',
 'PACIFIC_ISLANDER': 'HAWAIIAN/PACIFIC ISLANDER',
 'MIDDLE_EASTERN': 'MIDDLE EASTERN',
 'MIDDLE_EASTERN_SOUTH_ASIAN': 'MIDDLE EASTERN/SOUTH ASIAN',
 'SOUTH_ASIAN': 'SOUTH ASIAN'}

This list includes a few more categories than existed previously including Middle Eastern and South Asian categories. These categories were not used previously because their keys were not in the default dictionary used for race_cats. The expanded list can be used instead of the default list by setting race_cats='expand', which is shorthand for race_cats=opd.defs.get_race_cats(expand=True):

[21]:

tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, "MULTIPLE")
tbl.standardize(race_cats='expand')
s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps

orig_column_name: race
new_column_name: SUBJECT_RACE/ETHNICITY

Data Transformation Map:

[21]:

{'White': 'WHITE',
 'Hispanic': 'HISPANIC/LATINO',
 'Black': 'BLACK',
 'Other': 'OTHER',
 'American Indian': 'INDIGENOUS',
 'Other Asian': 'ASIAN',
 'Filipino': 'Filipino',
 'Vietnamese': 'Vietnamese',
 'Pacific Islander': 'HAWAIIAN/PACIFIC ISLANDER',
 'Asian Indian': 'SOUTH ASIAN',
 'Chinese': 'Chinese',
 'Korean': 'Korean',
 'Laotian': 'Laotian',
 'Samoan': 'Samoan',
 'Cambodian': 'Cambodian',
 'Japanese': 'Japanese',
 'Hawaiian': 'HAWAIIAN/PACIFIC ISLANDER',
 'Guamanian': 'Guamanian',
 <NA>: 'UNSPECIFIED'}

In this case, one more original value (Asian Indian) has been standardized (SOUTH ASIAN).

Similarly, more compact race categories can be generated by inputting race_cats='compact' to standardize, which is equivalent to race_cats=opd.defs.get_race_cats(compact=True). The result is that most labels will be abbreviated:

[22]:

opd.defs.get_race_cats(compact=True)

[22]:

{'AAPI': 'AAPI',
 'ASIAN': 'A',
 'BLACK': 'B',
 'LATINO': 'H',
 'MULTIPLE': 'MULTIPLE',
 'INDIGENOUS': 'I',
 'OTHER': 'O',
 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED',
 'WHITE': 'W'}

`agg_race_cat`#

More aggressive categorization is possible by setting agg_race_cat to True. By default, standardize attempts to not reduce the specificity of the values in the raw data. Setting agg_race_cat to True will attempt to standardize more cases by replacing some nationalities with races based on U.S. Census guidance or in columns that contain multiple races and/or ethnicities, replacing all values containing a term for Hispanic/Latino, such as Black Latino, with the standard category for Hispanic/Latino (i.e. this category is now for Hispanic/Latino of all races and all other categories are non-Hispanic/Latino).

[23]:

tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, "MULTIPLE")
tbl.standardize(race_cats='expand', agg_race_cat=True)
s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps

orig_column_name: race
new_column_name: SUBJECT_RACE/ETHNICITY

Data Transformation Map:

[23]:

{'White': 'WHITE',
 'Hispanic': 'HISPANIC/LATINO',
 'Black': 'BLACK',
 'Other': 'OTHER',
 'American Indian': 'INDIGENOUS',
 'Other Asian': 'ASIAN',
 'Filipino': 'ASIAN',
 'Vietnamese': 'ASIAN',
 'Pacific Islander': 'HAWAIIAN/PACIFIC ISLANDER',
 'Asian Indian': 'SOUTH ASIAN',
 'Chinese': 'ASIAN',
 'Korean': 'ASIAN',
 'Laotian': 'ASIAN',
 'Samoan': 'HAWAIIAN/PACIFIC ISLANDER',
 'Cambodian': 'ASIAN',
 'Japanese': 'ASIAN',
 'Hawaiian': 'HAWAIIAN/PACIFIC ISLANDER',
 'Guamanian': 'HAWAIIAN/PACIFIC ISLANDER',
 <NA>: 'UNSPECIFIED'}

In the example above, each nationality is now replaced by a standard value.

`eth_cats`#

There are also a set of ethnicity categories used in a similar manner to the race categories. standardize has a default set of ethnicity category names that it uses:

[24]:

eth_cats = opd.defs.get_eth_cats()
eth_cats

[24]:

{'LATINO': 'HISPANIC/LATINO',
 'MIDDLE_EASTERN': 'MIDDLE EASTERN',
 'NON-LATINO': 'NON-HISPANIC/NON-LATINO',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED'}

The ethnicity categories in eth_cats are a dictionary. The values of the dictionary are the ethnicity category values that will be used during standardization. The keys of the dictionary are codes used internally to look up which category label to use. No keys can be used besides the ones provided by get_eth_cats. Ethnicity categories typically only indicate whether an individual is Hispanic/Latino or non-Hispanic/Latino.

The names assigned for different ethnicity categories can be customized:

[25]:

new_eth_cats = opd.defs.get_eth_cats()
new_eth_cats['LATINO'] = "Latino"

src = opd.Source("Virginia")
tbl = src.load("STOPS", 2022, agency="Norfolk Police Department")
tbl.standardize(eth_cats=new_eth_cats)
s = tbl.get_transform_map(new=opd.Column.ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
print(s.data_maps)

print(f"\nIn previous examples, raw values for Hispanic/Latino were replaced with '{eth_cats['LATINO']}'. "+\
      f"In this case, customization resulted in replacement with '{new_eth_cats['LATINO']}'.")

orig_column_name: ETHNICITY
new_column_name: SUBJECT_ETHNICITY

Data Transformation Map:
{'NOT HISPANIC OR LATINO': 'NON-HISPANIC/NON-LATINO', 'UNKNOWN': 'UNKNOWN', 'HISPANIC OR LATINO': 'Latino'}

In previous examples, raw values for Hispanic/Latino were replaced with 'HISPANIC/LATINO'. In this case, customization resulted in replacement with 'Latino'.

More compact ethnicity categories can be generated by inputting eth_cats='compact to standardize, which is equivalent to eth_cats=opd.defs.get_eth_cats(compact=True). The result is that most labels will be abbreviated:

[26]:

compact_eth_cats = opd.defs.get_eth_cats(compact=True)
tbl = src.load("STOPS", 2022, agency="Norfolk Police Department")
tbl.standardize(eth_cats='compact')
s = tbl.get_transform_map(new=opd.Column.ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
print(s.data_maps)

print(f"\nIn previous examples, raw values for Hispanic/Latino were replaced with '{eth_cats['LATINO']}'. "+\
      f"In this case, customization resulted in replacement with '{compact_eth_cats['LATINO']}'.")

orig_column_name: ETHNICITY
new_column_name: SUBJECT_ETHNICITY

Data Transformation Map:
{'NOT HISPANIC OR LATINO': 'NH', 'UNKNOWN': 'UNKNOWN', 'HISPANIC OR LATINO': 'H'}

In previous examples, raw values for Hispanic/Latino were replaced with 'HISPANIC/LATINO'. In this case, customization resulted in replacement with 'H'.

`race_eth_combo`#

The race_eth_combo input controls whether and how standardize combines race and ethnicity if both columns are found. By default (race_eth_combo='merge'), they will be merged such that the race will be used for the combined race/ethnicity if the ethnicity is Non-Hispanic/Latino and the ethnicity column will be used otherwise. The resulting categories are Hispanic/Latino of all races and individual race categories will only include non-Hispanic/Latino. Additionally, if the ethnicity is unknown, the race/ethnicity will be set to unknown even if there is a valid value for race.

Other options for race_eth_combo are 'concat' and False. race_eth_combo=False will result in race and ethnicity columns not being combined. race_eth_combo='concat' will concatenate race and ethnicity:

[27]:

src = opd.Source("Virginia")
tbl = src.load(table_type="STOPS", year=2022, agency="Arlington County Police Department")
tbl.standardize(race_eth_combo='concat')

tbl.table[[opd.Column.RACE_SUBJECT, opd.Column.ETHNICITY_SUBJECT, opd.Column.RACE_ETHNICITY_SUBJECT]].head()

[27]:

	SUBJECT_RACE	SUBJECT_ETHNICITY	SUBJECT_RACE/ETHNICITY
0	WHITE	HISPANIC/LATINO	WHITE HISPANIC/LATINO
1	BLACK	NON-HISPANIC/NON-LATINO	BLACK
2	BLACK	NON-HISPANIC/NON-LATINO	BLACK
3	WHITE	HISPANIC/LATINO	WHITE HISPANIC/LATINO
4	INDIGENOUS	NON-HISPANIC/NON-LATINO	INDIGENOUS

Customizing Gender Categories#

There are a set of gender categories that are used in a similar manner to the race categories. standardize has a default set of gender category labels that it uses:

[28]:

gender_cats = opd.defs.get_gender_cats()
gender_cats

[28]:

{'MALE': 'MALE',
 'FEMALE': 'FEMALE',
 'TRANSGENDER_MALE': 'TRANSGENDER MALE',
 'TRANSGENDER_FEMALE': 'TRANSGENDER FEMALE',
 'TRANSGENDER': 'TRANSGENDER',
 'GENDER_NONCONFORMING': 'GENDER NON-CONFORMING',
 'TRANSGENDER_OR_GENDER_NONCONFORMING': 'TRANSGENDER OR GENDER NON-CONFORMING',
 'GENDER_NONBINARY': 'GENDER NON-BINARY',
 'OTHER': 'OTHER',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED'}

The gender categories in gender_cats are a dictionary. The values of the dictionary are the gender category values that will be used during the standardization. The keys of the dictionary are codes used internally to look up which category label to use. No keys can be used besides the ones provided by get_gender_cats.

The names assigned for different gender categories can be customized:

[29]:

new_gender_cats = gender_cats.copy()
new_gender_cats["MALE"] = "M"

src = opd.Source("Berkeley")
tbl = src.load(opd.defs.TableType.STOPS, 2022)
tbl.standardize(gender_cats=new_gender_cats)


wrap_print(f"Unique gender values are {tbl.table[opd.Column.GENDER_SUBJECT].unique()}")
wrap_print(f"In previous examples, raw values for Male were replaced with '{gender_cats['MALE']}'. "+\
      f"In this case, customization resulted in replacement with '{new_gender_cats['MALE']}'.")


Unique gender values are ['M' 'FEMALE' 'TRANSGENDER FEMALE' 'TRANSGENDER MALE' 'UNSPECIFIED']

In previous examples, raw values for Male were replaced with 'MALE'. In this case, customization resulted in
replacement with 'M'.

More compact gender categories can be generated by inputting gender_cats='compact to standardize, gender_cats=opd.defs.get_gender_cats(compact=True). The result is that most labels will be abbreviated:

[30]:

compact_gender_cats = opd.defs.get_gender_cats(compact=True)

src = opd.Source("Berkeley")
tbl = src.load(opd.defs.TableType.STOPS, 2022)
tbl.standardize(gender_cats='compact')


wrap_print(f"Unique gender values are {tbl.table[opd.Column.GENDER_SUBJECT].unique()}")
wrap_print(f"In previous examples, raw values for Female were replaced with '{gender_cats['FEMALE']}'. "+\
      f"In this case, customization resulted in replacement with '{compact_gender_cats['FEMALE']}'.")


Unique gender values are ['M' 'F' 'TF' 'TM' 'UNSPECIFIED']

In previous examples, raw values for Female were replaced with 'FEMALE'. In this case, customization resulted
in replacement with 'F'.

Manually Standardizing Column Names#

The known_cols input is provided to allow the user to force standardize to select certain columns for standardization. This might be useful if:

by default, the desired column is not selected to be standardized OR
if there is more than one possible column for a given column type and the one that standardize chooses to standardize is not the one desired by the user.

The value input to known_cols should be a dictionary where the keys should be attributes of opd.defs.columns and the values are columns in the original table. By default, known_cols is {opd.defs.columns.DATE:t.date_field, opd.defs.columns.AGENCY:t.agency_field} where date_field and agency_field are attributes of the Table object that are pulled from OPD’s datasets table. If a user is inputting their own known_cols, it may be advisable to start from the default input as will be shown below:

[31]:

src = opd.Source("Phoenix")
tbl = src.load(table_type="USE OF FORCE", year=2022, pbar=False)  # pbar=False does not show progress bar
default_known_cols = {opd.Column.DATE:tbl.date_field, opd.Column.AGENCY:tbl.agency_field}
default_known_cols

[31]:

{'DATE': 'INC_DATE', 'AGENCY': None}

OPD’s datasets table lists that the Phoenix use of force dataset has a date column called INC_DATE and that it does not have a column that lists the agency (because all data is for the Phoenix Police). By default, standardize will assign INC_DATE as a date column because that has already been verified by the OPD datasets table. The Phoenix use of force dataset has multiple columns for subject race and ethnicity:

CIT_RACE
CIT_ETHNICITY
SIMPLE_SUBJ_RE_GRP (race/ethnicity group of subject)

With the default inputs, standardize selects CIT_RACE to be SUBJECT_RACE, CIT_ETHNICITY to be SUBJECT_ETHNICITY, and combines those into SUBJECT_RACE/ETHNICITY. The generated SUBJECT_RACE/ETHNICITY column is actually more descriptive than Phoenix’s SIMPLE_SUBJ_RE_GRP column. However, if the user wanted to use SIMPLE_SUBJ_RE_GRP instead, the following could be done:

[32]:

known_cols = default_known_cols.copy()
known_cols[opd.defs.columns.RACE_SUBJECT] = 'SIMPLE_SUBJ_RE_GRP'
tbl.standardize(known_cols=known_cols)

s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
print(s.data_maps)

orig_column_name: SIMPLE_SUBJ_RE_GRP
new_column_name: SUBJECT_RACE

Data Transformation Map:
{'Hispanic': 'HISPANIC/LATINO', 'Black or African American': 'BLACK', 'White': 'WHITE', 'Other': 'OTHER'}

Other Inputs to Standardize#

There are several other optional inputs to standardize that are not described above.

`keep_raw`#

If False, raw columns that are standardized will be removed. If True (default), they will be kept and renamed to indcate that they are the original raw columns.

`verbose`#

Default is False. If set to True, details of the standardization process will be printed. If set to a filename, details of the standardization will be logged to that file.

`no_id`#

Determines how values that are not able to be standardized are handled during standardization of demographics:

‘keep’ (default): Keep the original value
‘null’: Replace with a null value
‘error’: Throw an error

`merge_date_time`#

If True (default), if standardized date and standardize time columns are identified, they will be merged into a combined datetime column.

`empty_time`#

Indicates how null times are treated in a standardized datetime column (if merge_date_time is True). If ‘NaT’, datetime values with null times will be a null value (pd.NaT). If ‘ignore’, the resulting datetime will be the date value without a time.

	subject_ethnicity	subject_gender	subject_age
6	Black \| Black	Female \| Male	42 \| 42
8	Black \| Black	Male \| Male	46 \| 43
9	Black \| Black \| Black \| Black \| Black \| Black	Male \| Male \| Male \| Female \| Female \| Male	19 \| 18 \| 17 \| 14 \| 17 \| 31
25	Black \| Black \| Black	Female \| Female \| Female	18 \| 40 \| 24
26	Hispanic \| Black \| Black	Male \| Male \| Male	16 \| 16 \| 15