Data Standardization Guide#

One of the challenges in analyzing police data is that different agencies use different column names for the same type of data and use different codes and terms for the data in those columns. If you are looking at multiple datasets, it is particularly valuable for the data to be standardized so that the user knows in advance what some key columns will be called and what the values in those columns will be. To provide the user with more consistent column names and data, OpenPoliceData (OPD) provides powerful tools to automatically standardize column names and data values.

This guide provides details on how the standardization works and how to apply and customize the standardization.

Standardizable Columns#

Column types and the standardized column names that OpenPoliceData can standardize are shown below. If these column types are identified in the standardization process, they will be standardized.

[5]:
import openpolicedata as opd
import textwrap
# Convenience method for printing long text
def wrap_print(text, newline=True):
    text = "\n".join(textwrap.wrap(text, width=110))
    if newline:
        text = "\n"+text
    print(text)
[6]:
wrap_print("Attribute value below is an attribute of opd.defs.columns (i.e. opd.defs.columns.AGE_OFFICER) that can be used to access column names")
wrap_print('Column Name value below is the corresponding standardized column name (i.e. "OFFICER_AGE") that replaces the original column name '+
           'during the standardization process')

opd.Column

Attribute value below is an attribute of opd.defs.columns (i.e. opd.defs.columns.AGE_OFFICER) that can be used
to access column names

Column Name value below is the corresponding standardized column name (i.e. "OFFICER_AGE") that replaces the
original column name during the standardization process
[6]:
Attribute Column Name Definition
0 AGE_OFFICER OFFICER_AGE Age of officer
1 AGE_OFFICER_SUBJECT OFFICER/SUBJECT_AGE Age of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
2 AGE_SUBJECT SUBJECT_AGE Age of subject
3 AGE_RANGE_OFFICER OFFICER_AGE_RANGE Age Range of officer
4 AGE_RANGE_OFFICER_SUBJECT OFFICER/SUBJECT_AGE_RANGE Age Range of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
5 AGE_RANGE_SUBJECT SUBJECT_AGE_RANGE Age Range of subject
6 AGENCY AGENCY Agency
7 DATE DATE Date
8 DATETIME DATETIME Combination of date and time when both columns are found (not generated when detected date column contains datetime values)
9 ETHNICITY_OFFICER OFFICER_ETHNICITY Ethnicity of officer
10 ETHNICITY_OFFICER_SUBJECT OFFICER/SUBJECT_ETHNICITY Ethnicity of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
11 ETHNICITY_SUBJECT SUBJECT_ETHNICITY Ethnicity of subject
12 FATAL_OFFICER OFFICER_FATAL Whether officer was fatally injured in an officer-involved shooting or use of force
13 FATAL_OFFICER_SUBJECT OFFICER/SUBJECT_FATAL Whether officer/subject was fatally injured in an officer-involved shooting or use of force
14 FATAL_SUBJECT SUBJECT_FATAL Whether subject was fatally injured in an officer-involved shooting or use of force
15 GENDER_OFFICER OFFICER_GENDER Gender of officer
16 GENDER_OFFICER_SUBJECT OFFICER/SUBJECT_GENDER Gender of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
17 GENDER_SUBJECT SUBJECT_GENDER Gender of subject
18 INCIDENT_ID INCIDENT_ID A unique incident ID given to an incident (arrest, use of force, etc.). It can be used to relate information across tables. Only standardized by Table.merge function.
19 INJURY_OFFICER OFFICER_INJURY Whether officer was injured in an officer-involved shooting or use of force
20 INJURY_OFFICER_SUBJECT OFFICER/SUBJECT_INJURY Whether officer/subject was injured in an officer-involved shooting or use of force
21 INJURY_SUBJECT SUBJECT_INJURY Whether subject was injured in an officer-involved shooting or use of force
22 NAME_OFFICER OFFICER_NAME Name of officer
23 NAME_OFFICER_SUBJECT OFFICER/SUBJECT_NAME Name of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
24 NAME_SUBJECT SUBJECT_NAME Name of subject
25 RACE_SUBJECT SUBJECT_RACE Race of subject
26 RACE_OFFICER_SUBJECT OFFICER/SUBJECT_RACE Race of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
27 RACE_OFFICER OFFICER_RACE Race of officer
28 RACE_ETHNICITY_SUBJECT SUBJECT_RACE/ETHNICITY Race/Ethnicity of subject
29 RACE_ETHNICITY_OFFICER_SUBJECT OFFICER/SUBJECT_RACE/ETHNICITY Race/Ethnicity of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
30 RACE_ETHNICITY_OFFICER OFFICER_RACE/ETHNICITY Race/Ethnicity of officer
31 RE_GROUP_OFFICER OFFICER_RE_GROUP Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of officer
32 RE_GROUP_OFFICER_SUBJECT OFFICER/SUBJECT_RE_GROUP Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of either an officer or subject (depending on column "SUBJECT_OR_OFFICER")
33 RE_GROUP_SUBJECT SUBJECT_RE_GROUP Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of subject
34 SUBJECT_OR_OFFICER SUBJECT_OR_OFFICER Whether row describes an officer or an subject/civilian
35 TIME TIME Time
36 ZIP_CODE ZIP_CODE Zip Code

NOTE: For consistency, OpenPoliceData always uses the term subject to describe the community member involved in a police interaction. This is consistent with the Stanford Open Policing Project.

Basic Standardization#

Basic standardization only requires calling standardize on a Table object.

[4]:
src = opd.Source("Virginia")
tbl = src.load(table_type="STOPS", year=2022, agency="Arlington County Police Department")
tbl.standardize()

print(f'The columns after standardization are: \n{tbl.table.columns}\n')
The columns after standardization are:
Index(['DATE', 'SUBJECT_RACE', 'SUBJECT_ETHNICITY', 'SUBJECT_RACE/ETHNICITY',
       'SUBJECT_RE_GROUP', 'SUBJECT_AGE', 'SUBJECT_GENDER', 'AGENCY',
       'JURISDICTION', 'REASON FOR STOP', 'PERSON TYPE', 'ENGLISH SPEAKING',
       'ACTION TAKEN', 'VIOLATION TYPE', 'SPECIFIC VIOLATION',
       'VIRGINIA CRIME CODE', 'PERSON SEARCHED', 'VEHICLE SEARCHED',
       'ADDITIONAL ARREST', 'FORCE USED BY OFFICER', 'FORCE USED BY SUBJECT',
       'RESIDENCY', 'RAW_STOP_DATE', 'RAW_AGENCY NAME', 'RAW_RACE',
       'RAW_ETHNICITY', 'RAW_AGE', 'RAW_GENDER'],
      dtype='object')

Standardized columns are always in all caps. These columns are standardized versions of the original columns that have been prepended with RAW_ (the original columns were lowercase in this dataset. Original column names may also be in all caps). By standardizing, the user does not need to know the exact name of a particular column (they vary greatly). The user can check if a standaridzed column exists by doing the following:

[6]:
if opd.Column.RE_GROUP_SUBJECT in tbl.table: # Alternatively: if "RE_GROUP_SUBJECT" in tbl.table
    race_col = tbl.table[opd.Column.RE_GROUP_SUBJECT]  # Alternatively: tbl.table["RE_GROUP_SUBJECT"]
    wrap_print(f"The values in the original subject race column are {tbl.table['RAW_RACE'].unique()}\n")
    wrap_print(f"The values in the original subject race column are {tbl.table['RAW_ETHNICITY'].unique()}\n")
    wrap_print(f"After combining race and ethnicity columns, the values in the standardized subject race/ethnicity column are {race_col.unique()}")
else:
    wrap_print("There is no subject race/ethnicity or race column")

The values in the original subject race column are ['WHITE' 'BLACK OR AFRICAN AMERICAN' 'AMERICAN INDIAN OR
ALASKA NATIVE'  'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER' 'UNKNOWN']

The values in the original subject race column are ['HISPANIC OR LATINO' 'NOT HISPANIC OR LATINO' 'UNKNOWN']

After combining race and ethnicity columns, the values in the standardized subject race/ethnicity column are
['HISPANIC/LATINO' 'BLACK' 'INDIGENOUS' 'WHITE' 'ASIAN/PACIFIC ISLANDER'  'UNKNOWN']

The values in each standardized column have been converted from the raw values to standardized ones. Thus, in the standardized race/ethnicty column, Black individuals will always be labeled as BLACK despite different datasets using many different encodings (including B, African American, Black or African American, and various shortened versions or typos). Data types will also be consistent, such as as dates will use a native pandas date format:

[7]:
tbl.table["DATE"].dtype
[7]:
dtype('<M8[ns]')

get_transform_map can be used to view changes that were made during standardization. If True, the minimize input removes the attribute orig_value_counts which contains the counts for the 5 most used values in each original column that was standardized.

[8]:
std_map = tbl.get_transform_map(minimize=True)
for t in std_map:
    print(f"{t}\n")
orig_column_name: STOP_DATE,
new_column_name: DATE,
data_maps: None

orig_column_name: RACE,
new_column_name: SUBJECT_RACE,
data_maps: {'WHITE': 'WHITE', 'BLACK OR AFRICAN AMERICAN': 'BLACK', 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER': 'ASIAN/PACIFIC ISLANDER', 'UNKNOWN': 'UNKNOWN', 'AMERICAN INDIAN OR ALASKA NATIVE': 'INDIGENOUS'}

orig_column_name: ETHNICITY,
new_column_name: SUBJECT_ETHNICITY,
data_maps: {'NOT HISPANIC OR LATINO': 'NON-HISPANIC/NON-LATINO', 'HISPANIC OR LATINO': 'HISPANIC/LATINO', 'UNKNOWN': 'UNKNOWN'}

orig_column_name: ['SUBJECT_RACE', 'SUBJECT_ETHNICITY'],
new_column_name: SUBJECT_RACE/ETHNICITY,
data_maps: None

orig_column_name: SUBJECT_RACE/ETHNICITY,
new_column_name: SUBJECT_RE_GROUP,
data_maps: None

orig_column_name: AGE,
new_column_name: SUBJECT_AGE,
data_maps: None

orig_column_name: GENDER,
new_column_name: SUBJECT_GENDER,
data_maps: {'MALE': 'MALE', 'FEMALE': 'FEMALE', 'UNKNOWN': 'UNKNOWN', None: 'UNSPECIFIED', 'OTHER': 'OTHER'}

orig_column_name: AGENCY NAME,
new_column_name: AGENCY,
data_maps: None

The original data in the table has been replaced by the standardized data:

[9]:
tbl.table.head()
[9]:
DATE SUBJECT_RACE SUBJECT_ETHNICITY SUBJECT_RACE/ETHNICITY SUBJECT_RE_GROUP SUBJECT_AGE SUBJECT_GENDER AGENCY JURISDICTION REASON FOR STOP ... ADDITIONAL ARREST FORCE USED BY OFFICER FORCE USED BY SUBJECT RESIDENCY RAW_STOP_DATE RAW_AGENCY NAME RAW_RACE RAW_ETHNICITY RAW_AGE RAW_GENDER
0 2022-01-01 WHITE HISPANIC/LATINO HISPANIC/LATINO HISPANIC/LATINO 17 MALE Arlington County Police Department ARLINGTON CO CALLS FOR SERVICE ... None NO NO None 2022-01-01 Arlington County Police Department WHITE HISPANIC OR LATINO 17 MALE
1 2022-01-01 BLACK NON-HISPANIC/NON-LATINO BLACK BLACK 38 MALE Arlington County Police Department ARLINGTON CO TRAFFIC VIOLATION ... None NO NO None 2022-01-01 Arlington County Police Department BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO 38 MALE
2 2022-01-01 BLACK NON-HISPANIC/NON-LATINO BLACK BLACK 29 FEMALE Arlington County Police Department ARLINGTON CO TRAFFIC VIOLATION ... None NO NO None 2022-01-01 Arlington County Police Department BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO 29 FEMALE
3 2022-01-01 WHITE HISPANIC/LATINO HISPANIC/LATINO HISPANIC/LATINO 31 MALE Arlington County Police Department ARLINGTON CO TRAFFIC VIOLATION ... None NO NO None 2022-01-01 Arlington County Police Department WHITE HISPANIC OR LATINO 31 MALE
4 2022-01-01 INDIGENOUS NON-HISPANIC/NON-LATINO INDIGENOUS INDIGENOUS 44 MALE Arlington County Police Department ARLINGTON CO CALLS FOR SERVICE ... None NO NO None 2022-01-01 Arlington County Police Department AMERICAN INDIAN OR ALASKA NATIVE NOT HISPANIC OR LATINO 44 MALE

5 rows × 28 columns

Getting Demographics Column Names#

Names of race, gender, and age columns can be accessed with the get_race_col, get_gender_col, and get_age_col functions of the Table class, respectively.

[10]:
subject_race_col = tbl.get_race_col()  # or role='SUBJECT'
officer_race_col = tbl.get_race_col(role='OFFICER')  # This returns None because the data has no officer race column

Combining Columns#

The following columns are combined when they are both identified:

  • DATE and TIME are combined into DATETIME

  • SUBJECT_RACE and SUBJECT_ETHNICITY are combined into SUBJECT_RACE/ETHNCITY.

  • OFFICER_RACE and OFFICER_ETHNICITY are combined into OFFICER_RACE/ETHNCITY.

  • OFFICER/SUBJECT_RACE and OFFICER/SUBJECT_ETHNICITY are combined into OFFICER/SUBJECT_RACE/ETHNCITY. For more info on OFFICER/SUBJECT columns, see here.

RE_GROUP (Race/Ethnicity Group) Columns#

For users who might want to use the race/ethnicity column if it exists and the race column if not, standardization includes the following columns to enable this capability:

  • SUBJECT_RE_GROUP: Added if SUBJECT_RACE/ETHNCITY or SUBJECT_RACE exist. Will be a copy of SUBJECT_RACE/ETHNCITY if it exists otherwise a copy of SUBJECT_RACE if it exists.

  • OFFICER_RE_GROUP: Added if OFFICER_RACE/ETHNCITY or OFFICER_RACE exist. Will be a copy of OFFICER_RACE/ETHNCITY if it exists otherwise a copy of OFFICER_RACE if it exists.

  • OFFICER/SUBJECT_RE_GROUP: Added if OFFICER/SUBJECT_RACE/ETHNCITY or OFFICER/SUBJECT_RACE exist. Will be a copy of OFFICER/SUBJECT_RACE/ETHNCITY if it exists otherwise a copy of OFFICER/SUBJECT_RACE if it exists. For more info on OFFICER/SUBJECT columns, see here.

Officer and Subject Demographics in Separate Rows#

OFFICER/SUBJECT columns (OFFICER/SUBJECT_RACE, OFFICER/SUBJECT_AGE, etc.) exist when a single row either describes an officer or a subject. In these cases, the SUBJECT_OR_OFFICER column will indicate if a given row is for a subject or an officer. See here for an example.

Advanced Standardization#

This section describes topics related to:

  • Customizing the standardization process

  • Explaining how rare cases are handled

Multi-Subject and Multi-Officer Data#

Some datasets encode the demographics of multiple subjects and/or officers into a single row of data. In these cases, the standardized demographics are encoded differently than in a typical signal subject / officer per row case. This is best demonstrated by the following example.

[11]:
src = opd.Source("New Orleans")
tbl = src.load(opd.TableType.USE_OF_FORCE, 2022)
# Find rows with demographics of multiple subjects. New Orleans uses the | to separate persons. This is not typical and would not be known in advance.
is_mult = tbl.table['subject_gender'].str.contains("|",regex=False).fillna(False)
tbl.table[is_mult][['subject_ethnicity','subject_gender','subject_age']].head()
[11]:
subject_ethnicity subject_gender subject_age
6 Black | Black Female | Male 42 | 42
8 Black | Black Male | Male 46 | 43
9 Black | Black | Black | Black | Black | Black Male | Male | Male | Female | Female | Male 19 | 18 | 17 | 14 | 17 | 31
25 Black | Black | Black Female | Female | Female 18 | 40 | 24
26 Hispanic | Black | Black Male | Male | Male 16 | 16 | 15

In the New Orleans use of force data, when multiple subjects are involved in a use of force incident, their demographics are separated by a | symbol. There is no standard way for encoding this type of multi-subject or multi-officer case and a variety of other formats have been observed for encoding the demographics of multiple people in a row. OPD standardizes these cases into a common format:

[12]:
tbl.standardize()
tbl.table[[opd.Column.RE_GROUP_SUBJECT, opd.Column.GENDER_SUBJECT, opd.Column.AGE_SUBJECT]][is_mult].head()
[12]:
SUBJECT_RE_GROUP SUBJECT_GENDER SUBJECT_AGE
6 {0: 'BLACK', 1: 'BLACK'} {0: 'FEMALE', 1: 'MALE'} {0: 42, 1: 42}
8 {0: 'BLACK', 1: 'BLACK'} {0: 'MALE', 1: 'MALE'} {0: 46, 1: 43}
9 {0: 'BLACK', 1: 'BLACK', 2: 'BLACK', 3: 'BLACK... {0: 'MALE', 1: 'MALE', 2: 'MALE', 3: 'FEMALE',... {0: 19, 1: 18, 2: 17, 3: 14, 4: 17, 5: 31}
25 {0: 'BLACK', 1: 'BLACK', 2: 'BLACK'} {0: 'FEMALE', 1: 'FEMALE', 2: 'FEMALE'} {0: 18, 1: 40, 2: 24}
26 {0: 'HISPANIC/LATINO', 1: 'BLACK', 2: 'BLACK'} {0: 'MALE', 1: 'MALE', 2: 'MALE'} {0: 16, 1: 16, 2: 15}

OPD encodes multiple subject or officer demographic data as dictionaries. The dictionary key corresponds to which subject or officer that the demographics describe, and the value indicates the race, gender, age, etc. For example, if SUBJECT_RE_GROUP is {0:’BLACK’,1:’WHITE’} and SUBJECT_GENDER is {0:’MALE’,’1’:’FEMALE’}, the first subject is a Black male and the second is a White female.

In some cases, it may be desirable to expand this data from a single row per incident (i.e. containing demographic data for every person involved in the incident) to a single row per subject or officer. The expand function provides this capability:

[13]:
# mismatch indicates what to do when a column to be expanded has the incorrect number of values
# (such as if there are 2 genders and races but only 1 age). 'nan' fills with NaNs. 'splitsingle'
# uses the single value for each expanded row in cases where a column has only 1 value. Default
# is to throw an error.
tbl.expand(person_type='subject', mismatch='nan')
is_mult_index = is_mult[is_mult].index
tbl.table[[opd.Column.RE_GROUP_SUBJECT, opd.Column.GENDER_SUBJECT, opd.Column.AGE_SUBJECT]].loc[is_mult_index[0]:is_mult_index[0]+7]
[13]:
SUBJECT_RE_GROUP SUBJECT_GENDER SUBJECT_AGE
6 BLACK FEMALE 42
7 BLACK MALE 42
8 BLACK MALE 19
9 BLACK MALE 46
10 BLACK MALE 43
11 BLACK MALE 19
12 BLACK MALE 18
13 BLACK MALE 17

In the above example, the 1st 2 multiple demographic rows can be seen to have been split into 2 rows each (Black female 42 and Black male 42 and Black male 46 and Black male 43).

Multi-Race or Multi-Ethnicity Data#

Some datasets list multiple races for multi-racial individuals. The California use of force datasets is an example.

[14]:
src = opd.Source("California")
# USE_OF_FORCE_SUBJECTS_OFFICERS datasets contain information on officers and subjects involved in use of force incidents
# They will have a column indicating whether each row applies to an officer or subject
tbl = src.load(opd.TableType.USE_OF_FORCE_SUBJECTS_OFFICERS, 2022)
# Find multi-racial individuals
is_mult = tbl.table['Race_Ethnic_Group'].str.contains(",").fillna(False)
tbl.table[is_mult][['Civilian_Officer','Race_Ethnic_Group','Age','Gender']].head()

[14]:
Civilian_Officer Race_Ethnic_Group Age Gender
91 Officer white, asian 36_40 male
106 Officer american_indian, white 36_40 male
109 Officer black, white 31_35 male
210 Officer hispanic, white 41_45 male
244 Officer black, hispanic 26_30 male

The race/ethnicity column of this datasets includes multiple races separated by commas for multi-racial individuals (also note that the age column is actually an age range column). OPD identifies that the multiple races apply to an indvidual (as opposed to multiple people) and stores the multiple races as a comma-separated list:

[15]:
tbl.standardize()
cols = [opd.Column.SUBJECT_OR_OFFICER, opd.Column.RE_GROUP_OFFICER_SUBJECT, opd.Column.GENDER_OFFICER_SUBJECT, opd.Column.AGE_RANGE_OFFICER_SUBJECT]
tbl.table[is_mult][cols].head()
[15]:
SUBJECT_OR_OFFICER OFFICER/SUBJECT_RE_GROUP OFFICER/SUBJECT_GENDER OFFICER/SUBJECT_AGE_RANGE
91 OFFICER ASIAN, WHITE MALE 36-40
106 OFFICER INDIGENOUS, WHITE MALE 36-40
109 OFFICER BLACK, WHITE MALE 31-35
210 OFFICER HISPANIC/LATINO, WHITE MALE 41-45
244 OFFICER BLACK, HISPANIC/LATINO MALE 26-30

Customizing Race/Ethnicity Categories#

standardize provides several inputs for customizing the race/ethnicity categorization that are described below.

race_cats#

standardize has a default set of race category values that it uses:

[16]:
default_race_cats = opd.defs.get_race_cats()
default_race_cats
[16]:
{'AAPI': 'ASIAN/PACIFIC ISLANDER',
 'ASIAN': 'ASIAN',
 'BLACK': 'BLACK',
 'LATINO': 'HISPANIC/LATINO',
 'MULTIPLE': 'MULTIPLE',
 'INDIGENOUS': 'INDIGENOUS',
 'OTHER': 'OTHER',
 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED',
 'WHITE': 'WHITE'}

The race categories in default_race_cats are a dictionary. The values of the dictionary are the race category values that will be used during standardization. The keys of the dictionary are codes used internally to look up which category value to use.

[17]:
src = opd.Source("Virginia")
tbl = src.load("STOPS", 2022,agency="Norfolk Police Department")
tbl.standardize()
s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps
orig_column_name: RACE
new_column_name: SUBJECT_RACE

Data Transformation Map:
[17]:
{'BLACK OR AFRICAN AMERICAN': 'BLACK',
 'WHITE': 'WHITE',
 'UNKNOWN': 'UNKNOWN',
 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER': 'ASIAN/PACIFIC ISLANDER',
 'AMERICAN INDIAN OR ALASKA NATIVE': 'INDIGENOUS'}

The data transformation map is a dictionary showing original data values (keys) and standardized values (dictionary values). It can be seen that each of the original values was able to be identified and transformed to a value in the default_race_cats dictionary shown above.

If the user desires to use different names for the race categories, they can do this by assigning new values:

[18]:
new_race_cats = opd.defs.get_race_cats()
new_race_cats['UNKNOWN'] = "Unrecognized"

tbl = src.load("STOPS", 2022, agency="Norfolk Police Department")
tbl.standardize(race_cats=new_race_cats)
s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps
orig_column_name: RACE
new_column_name: SUBJECT_RACE

Data Transformation Map:
[18]:
{'BLACK OR AFRICAN AMERICAN': 'BLACK',
 'WHITE': 'WHITE',
 'UNKNOWN': 'Unrecognized',
 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER': 'ASIAN/PACIFIC ISLANDER',
 'AMERICAN INDIAN OR ALASKA NATIVE': 'INDIGENOUS'}

The original value UNKNOWN is now replaced by Unrecognized.

Now, let’s see a more complicated example:

[19]:
src = opd.Source("California")
tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, "MULTIPLE")
tbl.standardize()
s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps
orig_column_name: race
new_column_name: SUBJECT_RACE/ETHNICITY

Data Transformation Map:
[19]:
{'White': 'WHITE',
 'Hispanic': 'HISPANIC/LATINO',
 'Black': 'BLACK',
 'Other': 'OTHER',
 'American Indian': 'INDIGENOUS',
 'Other Asian': 'ASIAN',
 'Filipino': 'Filipino',
 'Vietnamese': 'Vietnamese',
 'Pacific Islander': 'ASIAN/PACIFIC ISLANDER',
 'Asian Indian': 'Asian Indian',
 'Chinese': 'Chinese',
 'Korean': 'Korean',
 'Laotian': 'Laotian',
 'Samoan': 'Samoan',
 'Cambodian': 'Cambodian',
 'Japanese': 'Japanese',
 'Hawaiian': 'ASIAN/PACIFIC ISLANDER',
 'Guamanian': 'Guamanian',
 <NA>: 'UNSPECIFIED'}

In this case, some of the original values are converted to standardized values from default_race_cats and some remain unchanged. By default, values that cannot be identified are not changed (see no_id input to change).

We can create an expanded list of race categories that include more categories. These additional categories are based on categories that have been observed in some datasets. The expanded race categories contain all allowable keys for the race_cats input to standardize.

[20]:
expanded_race_cats = opd.defs.get_race_cats(expand=True)
expanded_race_cats
[20]:
{'AAPI': 'ASIAN/PACIFIC ISLANDER',
 'ASIAN': 'ASIAN',
 'BLACK': 'BLACK',
 'LATINO': 'HISPANIC/LATINO',
 'MULTIPLE': 'MULTIPLE',
 'INDIGENOUS': 'INDIGENOUS',
 'OTHER': 'OTHER',
 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED',
 'WHITE': 'WHITE',
 'PACIFIC_ISLANDER': 'HAWAIIAN/PACIFIC ISLANDER',
 'MIDDLE_EASTERN': 'MIDDLE EASTERN',
 'MIDDLE_EASTERN_SOUTH_ASIAN': 'MIDDLE EASTERN/SOUTH ASIAN',
 'SOUTH_ASIAN': 'SOUTH ASIAN'}

This list includes a few more categories than existed previously including Middle Eastern and South Asian categories. These categories were not used previously because their keys were not in the default dictionary used for race_cats. The expanded list can be used instead of the default list by setting race_cats='expand', which is shorthand for race_cats=opd.defs.get_race_cats(expand=True):

[21]:
tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, "MULTIPLE")
tbl.standardize(race_cats='expand')
s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps
orig_column_name: race
new_column_name: SUBJECT_RACE/ETHNICITY

Data Transformation Map:
[21]:
{'White': 'WHITE',
 'Hispanic': 'HISPANIC/LATINO',
 'Black': 'BLACK',
 'Other': 'OTHER',
 'American Indian': 'INDIGENOUS',
 'Other Asian': 'ASIAN',
 'Filipino': 'Filipino',
 'Vietnamese': 'Vietnamese',
 'Pacific Islander': 'HAWAIIAN/PACIFIC ISLANDER',
 'Asian Indian': 'SOUTH ASIAN',
 'Chinese': 'Chinese',
 'Korean': 'Korean',
 'Laotian': 'Laotian',
 'Samoan': 'Samoan',
 'Cambodian': 'Cambodian',
 'Japanese': 'Japanese',
 'Hawaiian': 'HAWAIIAN/PACIFIC ISLANDER',
 'Guamanian': 'Guamanian',
 <NA>: 'UNSPECIFIED'}

In this case, one more original value (Asian Indian) has been standardized (SOUTH ASIAN).

Similarly, more compact race categories can be generated by inputting race_cats='compact' to standardize, which is equivalent to race_cats=opd.defs.get_race_cats(compact=True). The result is that most labels will be abbreviated:

[22]:
opd.defs.get_race_cats(compact=True)
[22]:
{'AAPI': 'AAPI',
 'ASIAN': 'A',
 'BLACK': 'B',
 'LATINO': 'H',
 'MULTIPLE': 'MULTIPLE',
 'INDIGENOUS': 'I',
 'OTHER': 'O',
 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED',
 'WHITE': 'W'}

agg_race_cat#

More aggressive categorization is possible by setting agg_race_cat to True. By default, standardize attempts to not reduce the specificity of the values in the raw data. Setting agg_race_cat to True will attempt to standardize more cases by replacing some nationalities with races based on U.S. Census guidance or in columns that contain multiple races and/or ethnicities, replacing all values containing a term for Hispanic/Latino, such as Black Latino, with the standard category for Hispanic/Latino (i.e. this category is now for Hispanic/Latino of all races and all other categories are non-Hispanic/Latino).

[23]:
tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, "MULTIPLE")
tbl.standardize(race_cats='expand', agg_race_cat=True)
s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
s.data_maps
orig_column_name: race
new_column_name: SUBJECT_RACE/ETHNICITY

Data Transformation Map:
[23]:
{'White': 'WHITE',
 'Hispanic': 'HISPANIC/LATINO',
 'Black': 'BLACK',
 'Other': 'OTHER',
 'American Indian': 'INDIGENOUS',
 'Other Asian': 'ASIAN',
 'Filipino': 'ASIAN',
 'Vietnamese': 'ASIAN',
 'Pacific Islander': 'HAWAIIAN/PACIFIC ISLANDER',
 'Asian Indian': 'SOUTH ASIAN',
 'Chinese': 'ASIAN',
 'Korean': 'ASIAN',
 'Laotian': 'ASIAN',
 'Samoan': 'HAWAIIAN/PACIFIC ISLANDER',
 'Cambodian': 'ASIAN',
 'Japanese': 'ASIAN',
 'Hawaiian': 'HAWAIIAN/PACIFIC ISLANDER',
 'Guamanian': 'HAWAIIAN/PACIFIC ISLANDER',
 <NA>: 'UNSPECIFIED'}

In the example above, each nationality is now replaced by a standard value.

eth_cats#

There are also a set of ethnicity categories used in a similar manner to the race categories. standardize has a default set of ethnicity category names that it uses:

[24]:
eth_cats = opd.defs.get_eth_cats()
eth_cats
[24]:
{'LATINO': 'HISPANIC/LATINO',
 'MIDDLE_EASTERN': 'MIDDLE EASTERN',
 'NON-LATINO': 'NON-HISPANIC/NON-LATINO',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED'}

The ethnicity categories in eth_cats are a dictionary. The values of the dictionary are the ethnicity category values that will be used during standardization. The keys of the dictionary are codes used internally to look up which category label to use. No keys can be used besides the ones provided by get_eth_cats. Ethnicity categories typically only indicate whether an individual is Hispanic/Latino or non-Hispanic/Latino.

The names assigned for different ethnicity categories can be customized:

[25]:
new_eth_cats = opd.defs.get_eth_cats()
new_eth_cats['LATINO'] = "Latino"

src = opd.Source("Virginia")
tbl = src.load("STOPS", 2022, agency="Norfolk Police Department")
tbl.standardize(eth_cats=new_eth_cats)
s = tbl.get_transform_map(new=opd.Column.ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
print(s.data_maps)

print(f"\nIn previous examples, raw values for Hispanic/Latino were replaced with '{eth_cats['LATINO']}'. "+\
      f"In this case, customization resulted in replacement with '{new_eth_cats['LATINO']}'.")
orig_column_name: ETHNICITY
new_column_name: SUBJECT_ETHNICITY

Data Transformation Map:
{'NOT HISPANIC OR LATINO': 'NON-HISPANIC/NON-LATINO', 'UNKNOWN': 'UNKNOWN', 'HISPANIC OR LATINO': 'Latino'}

In previous examples, raw values for Hispanic/Latino were replaced with 'HISPANIC/LATINO'. In this case, customization resulted in replacement with 'Latino'.

More compact ethnicity categories can be generated by inputting eth_cats='compact to standardize, which is equivalent to eth_cats=opd.defs.get_eth_cats(compact=True). The result is that most labels will be abbreviated:

[26]:
compact_eth_cats = opd.defs.get_eth_cats(compact=True)
tbl = src.load("STOPS", 2022, agency="Norfolk Police Department")
tbl.standardize(eth_cats='compact')
s = tbl.get_transform_map(new=opd.Column.ETHNICITY_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
print(s.data_maps)

print(f"\nIn previous examples, raw values for Hispanic/Latino were replaced with '{eth_cats['LATINO']}'. "+\
      f"In this case, customization resulted in replacement with '{compact_eth_cats['LATINO']}'.")
orig_column_name: ETHNICITY
new_column_name: SUBJECT_ETHNICITY

Data Transformation Map:
{'NOT HISPANIC OR LATINO': 'NH', 'UNKNOWN': 'UNKNOWN', 'HISPANIC OR LATINO': 'H'}

In previous examples, raw values for Hispanic/Latino were replaced with 'HISPANIC/LATINO'. In this case, customization resulted in replacement with 'H'.

race_eth_combo#

The race_eth_combo input controls whether and how standardize combines race and ethnicity if both columns are found. By default (race_eth_combo='merge'), they will be merged such that the race will be used for the combined race/ethnicity if the ethnicity is Non-Hispanic/Latino and the ethnicity column will be used otherwise. The resulting categories are Hispanic/Latino of all races and individual race categories will only include non-Hispanic/Latino. Additionally, if the ethnicity is unknown, the race/ethnicity will be set to unknown even if there is a valid value for race.

Other options for race_eth_combo are 'concat' and False. race_eth_combo=False will result in race and ethnicity columns not being combined. race_eth_combo='concat' will concatenate race and ethnicity:

[27]:
src = opd.Source("Virginia")
tbl = src.load(table_type="STOPS", year=2022, agency="Arlington County Police Department")
tbl.standardize(race_eth_combo='concat')

tbl.table[[opd.Column.RACE_SUBJECT, opd.Column.ETHNICITY_SUBJECT, opd.Column.RACE_ETHNICITY_SUBJECT]].head()
[27]:
SUBJECT_RACE SUBJECT_ETHNICITY SUBJECT_RACE/ETHNICITY
0 WHITE HISPANIC/LATINO WHITE HISPANIC/LATINO
1 BLACK NON-HISPANIC/NON-LATINO BLACK
2 BLACK NON-HISPANIC/NON-LATINO BLACK
3 WHITE HISPANIC/LATINO WHITE HISPANIC/LATINO
4 INDIGENOUS NON-HISPANIC/NON-LATINO INDIGENOUS

Customizing Gender Categories#

There are a set of gender categories that are used in a similar manner to the race categories. standardize has a default set of gender category labels that it uses:

[28]:
gender_cats = opd.defs.get_gender_cats()
gender_cats
[28]:
{'MALE': 'MALE',
 'FEMALE': 'FEMALE',
 'TRANSGENDER_MALE': 'TRANSGENDER MALE',
 'TRANSGENDER_FEMALE': 'TRANSGENDER FEMALE',
 'TRANSGENDER': 'TRANSGENDER',
 'GENDER_NONCONFORMING': 'GENDER NON-CONFORMING',
 'TRANSGENDER_OR_GENDER_NONCONFORMING': 'TRANSGENDER OR GENDER NON-CONFORMING',
 'GENDER_NONBINARY': 'GENDER NON-BINARY',
 'OTHER': 'OTHER',
 'UNKNOWN': 'UNKNOWN',
 'UNSPECIFIED': 'UNSPECIFIED'}

The gender categories in gender_cats are a dictionary. The values of the dictionary are the gender category values that will be used during the standardization. The keys of the dictionary are codes used internally to look up which category label to use. No keys can be used besides the ones provided by get_gender_cats.

The names assigned for different gender categories can be customized:

[29]:
new_gender_cats = gender_cats.copy()
new_gender_cats["MALE"] = "M"

src = opd.Source("Berkeley")
tbl = src.load(opd.defs.TableType.STOPS, 2022)
tbl.standardize(gender_cats=new_gender_cats)


wrap_print(f"Unique gender values are {tbl.table[opd.Column.GENDER_SUBJECT].unique()}")
wrap_print(f"In previous examples, raw values for Male were replaced with '{gender_cats['MALE']}'. "+\
      f"In this case, customization resulted in replacement with '{new_gender_cats['MALE']}'.")

Unique gender values are ['M' 'FEMALE' 'TRANSGENDER FEMALE' 'TRANSGENDER MALE' 'UNSPECIFIED']

In previous examples, raw values for Male were replaced with 'MALE'. In this case, customization resulted in
replacement with 'M'.

More compact gender categories can be generated by inputting gender_cats='compact to standardize, gender_cats=opd.defs.get_gender_cats(compact=True). The result is that most labels will be abbreviated:

[30]:
compact_gender_cats = opd.defs.get_gender_cats(compact=True)

src = opd.Source("Berkeley")
tbl = src.load(opd.defs.TableType.STOPS, 2022)
tbl.standardize(gender_cats='compact')


wrap_print(f"Unique gender values are {tbl.table[opd.Column.GENDER_SUBJECT].unique()}")
wrap_print(f"In previous examples, raw values for Female were replaced with '{gender_cats['FEMALE']}'. "+\
      f"In this case, customization resulted in replacement with '{compact_gender_cats['FEMALE']}'.")

Unique gender values are ['M' 'F' 'TF' 'TM' 'UNSPECIFIED']

In previous examples, raw values for Female were replaced with 'FEMALE'. In this case, customization resulted
in replacement with 'F'.

Manually Standardizing Column Names#

The known_cols input is provided to allow the user to force standardize to select certain columns for standardization. This might be useful if:

  • by default, the desired column is not selected to be standardized OR

  • if there is more than one possible column for a given column type and the one that standardize chooses to standardize is not the one desired by the user.

The value input to known_cols should be a dictionary where the keys should be attributes of opd.defs.columns and the values are columns in the original table. By default, known_cols is {opd.defs.columns.DATE:t.date_field, opd.defs.columns.AGENCY:t.agency_field} where date_field and agency_field are attributes of the Table object that are pulled from OPD’s datasets table. If a user is inputting their own known_cols, it may be advisable to start from the default input as will be shown below:

[31]:
src = opd.Source("Phoenix")
tbl = src.load(table_type="USE OF FORCE", year=2022, pbar=False)  # pbar=False does not show progress bar
default_known_cols = {opd.Column.DATE:tbl.date_field, opd.Column.AGENCY:tbl.agency_field}
default_known_cols
[31]:
{'DATE': 'INC_DATE', 'AGENCY': None}

OPD’s datasets table lists that the Phoenix use of force dataset has a date column called INC_DATE and that it does not have a column that lists the agency (because all data is for the Phoenix Police). By default, standardize will assign INC_DATE as a date column because that has already been verified by the OPD datasets table. The Phoenix use of force dataset has multiple columns for subject race and ethnicity:

  • CIT_RACE

  • CIT_ETHNICITY

  • SIMPLE_SUBJ_RE_GRP (race/ethnicity group of subject)

With the default inputs, standardize selects CIT_RACE to be SUBJECT_RACE, CIT_ETHNICITY to be SUBJECT_ETHNICITY, and combines those into SUBJECT_RACE/ETHNICITY. The generated SUBJECT_RACE/ETHNICITY column is actually more descriptive than Phoenix’s SIMPLE_SUBJ_RE_GRP column. However, if the user wanted to use SIMPLE_SUBJ_RE_GRP instead, the following could be done:

[32]:
known_cols = default_known_cols.copy()
known_cols[opd.defs.columns.RACE_SUBJECT] = 'SIMPLE_SUBJ_RE_GRP'
tbl.standardize(known_cols=known_cols)

s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)

print(f"orig_column_name: {s.orig_column_name}")
print(f"new_column_name: {s.new_column_name}\n")
print("Data Transformation Map:")
print(s.data_maps)
orig_column_name: SIMPLE_SUBJ_RE_GRP
new_column_name: SUBJECT_RACE

Data Transformation Map:
{'Hispanic': 'HISPANIC/LATINO', 'Black or African American': 'BLACK', 'White': 'WHITE', 'Other': 'OTHER'}

Other Inputs to Standardize#

There are several other optional inputs to standardize that are not described above.

keep_raw#

If False, raw columns that are standardized will be removed. If True (default), they will be kept and renamed to indcate that they are the original raw columns.

verbose#

Default is False. If set to True, details of the standardization process will be printed. If set to a filename, details of the standardization will be logged to that file.

no_id#

Determines how values that are not able to be standardized are handled during standardization of demographics:

  • ‘keep’ (default): Keep the original value

  • ‘null’: Replace with a null value

  • ‘error’: Throw an error

merge_date_time#

If True (default), if standardized date and standardize time columns are identified, they will be merged into a combined datetime column.

empty_time#

Indicates how null times are treated in a standardized datetime column (if merge_date_time is True). If ‘NaT’, datetime values with null times will be a null value (pd.NaT). If ‘ignore’, the resulting datetime will be the date value without a time.