{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"nbsphinx-toctree": {
"maxdepth": 3
}
},
"source": [
"# Data Standardization Guide\n",
"This cell is hidden and is for indicating the max depth of the toc tree"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Standardization Guide\n",
"One of the challenges in analyzing police data is that different agencies use different column names for the same type of data and use different codes and terms for the data in those columns. If you are looking at multiple datasets, it is particularly valuable for the data to be standardized so that the user knows in advance what some key columns will be called and what the values in those columns will be. To provide the user with more consistent column names and data, OpenPoliceData (OPD) provides powerful tools to automatically standardize column names and data values. \n",
"\n",
"This guide provides details on how the standardization works and how to apply and customize the standardization.\n",
"\n",
"## Standardizable Columns\n",
"Column types and the standardized column names that OpenPoliceData can standardize are shown below. If these column types are identified in the standardization process, they will be standardized."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"nbsphinx": "hidden"
},
"outputs": [],
"source": [
"# This cell should have \"nbsphinx\": \"hidden\" in its metadata and not be included in the documentation!\n",
"import sys\n",
"sys.path.append(\"../../../..\")\n",
"\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"import openpolicedata as opd\n",
"import textwrap\n",
"# Convenience method for printing long text\n",
"def wrap_print(text, newline=True):\n",
" text = \"\\n\".join(textwrap.wrap(text, width=110))\n",
" if newline:\n",
" text = \"\\n\"+text\n",
" print(text)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Attribute value below is an attribute of opd.defs.columns (i.e. opd.defs.columns.AGE_OFFICER) that can be used\n",
"to access column names\n",
"\n",
"Column Name value below is the corresponding standardized column name (i.e. \"OFFICER_AGE\") that replaces the\n",
"original column name during the standardization process\n"
]
},
{
"data": {
"text/html": [
"
\n",
" \n",
" \n",
" | \n",
" Attribute | \n",
" Column Name | \n",
" Definition | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" AGE_OFFICER | \n",
" OFFICER_AGE | \n",
" Age of officer | \n",
"
\n",
" \n",
" | 1 | \n",
" AGE_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_AGE | \n",
" Age of either an officer or subject (depending on column \"SUBJECT_OR_OFFICER\") | \n",
"
\n",
" \n",
" | 2 | \n",
" AGE_SUBJECT | \n",
" SUBJECT_AGE | \n",
" Age of subject | \n",
"
\n",
" \n",
" | 3 | \n",
" AGE_RANGE_OFFICER | \n",
" OFFICER_AGE_RANGE | \n",
" Age Range of officer | \n",
"
\n",
" \n",
" | 4 | \n",
" AGE_RANGE_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_AGE_RANGE | \n",
" Age Range of either an officer or subject (depending on column \"SUBJECT_OR_OFFICER\") | \n",
"
\n",
" \n",
" | 5 | \n",
" AGE_RANGE_SUBJECT | \n",
" SUBJECT_AGE_RANGE | \n",
" Age Range of subject | \n",
"
\n",
" \n",
" | 6 | \n",
" AGENCY | \n",
" AGENCY | \n",
" Agency | \n",
"
\n",
" \n",
" | 7 | \n",
" DATE | \n",
" DATE | \n",
" Date | \n",
"
\n",
" \n",
" | 8 | \n",
" DATETIME | \n",
" DATETIME | \n",
" Combination of date and time when both columns are found (not generated when detected date column contains datetime values) | \n",
"
\n",
" \n",
" | 9 | \n",
" ETHNICITY_OFFICER | \n",
" OFFICER_ETHNICITY | \n",
" Ethnicity of officer | \n",
"
\n",
" \n",
" | 10 | \n",
" ETHNICITY_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_ETHNICITY | \n",
" Ethnicity of either an officer or subject (depending on column \"SUBJECT_OR_OFFICER\") | \n",
"
\n",
" \n",
" | 11 | \n",
" ETHNICITY_SUBJECT | \n",
" SUBJECT_ETHNICITY | \n",
" Ethnicity of subject | \n",
"
\n",
" \n",
" | 12 | \n",
" FATAL_OFFICER | \n",
" OFFICER_FATAL | \n",
" Whether officer was fatally injured in an officer-involved shooting or use of force | \n",
"
\n",
" \n",
" | 13 | \n",
" FATAL_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_FATAL | \n",
" Whether officer/subject was fatally injured in an officer-involved shooting or use of force | \n",
"
\n",
" \n",
" | 14 | \n",
" FATAL_SUBJECT | \n",
" SUBJECT_FATAL | \n",
" Whether subject was fatally injured in an officer-involved shooting or use of force | \n",
"
\n",
" \n",
" | 15 | \n",
" GENDER_OFFICER | \n",
" OFFICER_GENDER | \n",
" Gender of officer | \n",
"
\n",
" \n",
" | 16 | \n",
" GENDER_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_GENDER | \n",
" Gender of either an officer or subject (depending on column \"SUBJECT_OR_OFFICER\") | \n",
"
\n",
" \n",
" | 17 | \n",
" GENDER_SUBJECT | \n",
" SUBJECT_GENDER | \n",
" Gender of subject | \n",
"
\n",
" \n",
" | 18 | \n",
" INCIDENT_ID | \n",
" INCIDENT_ID | \n",
" A unique incident ID given to an incident (arrest, use of force, etc.). It can be used to relate information across tables. Only standardized by Table.merge function. | \n",
"
\n",
" \n",
" | 19 | \n",
" INJURY_OFFICER | \n",
" OFFICER_INJURY | \n",
" Whether officer was injured in an officer-involved shooting or use of force | \n",
"
\n",
" \n",
" | 20 | \n",
" INJURY_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_INJURY | \n",
" Whether officer/subject was injured in an officer-involved shooting or use of force | \n",
"
\n",
" \n",
" | 21 | \n",
" INJURY_SUBJECT | \n",
" SUBJECT_INJURY | \n",
" Whether subject was injured in an officer-involved shooting or use of force | \n",
"
\n",
" \n",
" | 22 | \n",
" NAME_OFFICER | \n",
" OFFICER_NAME | \n",
" Name of officer | \n",
"
\n",
" \n",
" | 23 | \n",
" NAME_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_NAME | \n",
" Name of either an officer or subject (depending on column \"SUBJECT_OR_OFFICER\") | \n",
"
\n",
" \n",
" | 24 | \n",
" NAME_SUBJECT | \n",
" SUBJECT_NAME | \n",
" Name of subject | \n",
"
\n",
" \n",
" | 25 | \n",
" RACE_SUBJECT | \n",
" SUBJECT_RACE | \n",
" Race of subject | \n",
"
\n",
" \n",
" | 26 | \n",
" RACE_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_RACE | \n",
" Race of either an officer or subject (depending on column \"SUBJECT_OR_OFFICER\") | \n",
"
\n",
" \n",
" | 27 | \n",
" RACE_OFFICER | \n",
" OFFICER_RACE | \n",
" Race of officer | \n",
"
\n",
" \n",
" | 28 | \n",
" RACE_ETHNICITY_SUBJECT | \n",
" SUBJECT_RACE/ETHNICITY | \n",
" Race/Ethnicity of subject | \n",
"
\n",
" \n",
" | 29 | \n",
" RACE_ETHNICITY_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_RACE/ETHNICITY | \n",
" Race/Ethnicity of either an officer or subject (depending on column \"SUBJECT_OR_OFFICER\") | \n",
"
\n",
" \n",
" | 30 | \n",
" RACE_ETHNICITY_OFFICER | \n",
" OFFICER_RACE/ETHNICITY | \n",
" Race/Ethnicity of officer | \n",
"
\n",
" \n",
" | 31 | \n",
" RE_GROUP_OFFICER | \n",
" OFFICER_RE_GROUP | \n",
" Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of officer | \n",
"
\n",
" \n",
" | 32 | \n",
" RE_GROUP_OFFICER_SUBJECT | \n",
" OFFICER/SUBJECT_RE_GROUP | \n",
" Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of either an officer or subject (depending on column \"SUBJECT_OR_OFFICER\") | \n",
"
\n",
" \n",
" | 33 | \n",
" RE_GROUP_SUBJECT | \n",
" SUBJECT_RE_GROUP | \n",
" Convenience column identical to Race/Ethnicity (if there is a merged Race/Ethncity column) or Race (otherwise) of subject | \n",
"
\n",
" \n",
" | 34 | \n",
" SUBJECT_OR_OFFICER | \n",
" SUBJECT_OR_OFFICER | \n",
" Whether row describes an officer or an subject/civilian | \n",
"
\n",
" \n",
" | 35 | \n",
" TIME | \n",
" TIME | \n",
" Time | \n",
"
\n",
" \n",
" | 36 | \n",
" ZIP_CODE | \n",
" ZIP_CODE | \n",
" Zip Code | \n",
"
\n",
" \n",
"
"
],
"text/plain": [
" Attribute Column Name \\\n",
"0 AGE_OFFICER OFFICER_AGE \n",
"1 AGE_OFFICER_SUBJECT OFFICER/SUBJECT_AGE \n",
"2 AGE_SUBJECT SUBJECT_AGE \n",
"3 AGE_RANGE_OFFICER OFFICER_AGE_RANGE \n",
"4 AGE_RANGE_OFFICER_SUBJECT OFFICER/SUBJECT_AGE_RANGE \n",
"5 AGE_RANGE_SUBJECT SUBJECT_AGE_RANGE \n",
"6 AGENCY AGENCY \n",
"7 DATE DATE \n",
"8 DATETIME DATETIME \n",
"9 ETHNICITY_OFFICER OFFICER_ETHNICITY \n",
"10 ETHNICITY_OFFICER_SUBJECT OFFICER/SUBJECT_ETHNICITY \n",
"11 ETHNICITY_SUBJECT SUBJECT_ETHNICITY \n",
"12 FATAL_OFFICER OFFICER_FATAL \n",
"13 FATAL_OFFICER_SUBJECT OFFICER/SUBJECT_FATAL \n",
"14 FATAL_SUBJECT SUBJECT_FATAL \n",
"15 GENDER_OFFICER OFFICER_GENDER \n",
"16 GENDER_OFFICER_SUBJECT OFFICER/SUBJECT_GENDER \n",
"17 GENDER_SUBJECT SUBJECT_GENDER \n",
"18 INCIDENT_ID INCIDENT_ID \n",
"19 INJURY_OFFICER OFFICER_INJURY \n",
"20 INJURY_OFFICER_SUBJECT OFFICER/SUBJECT_INJURY \n",
"21 INJURY_SUBJECT SUBJECT_INJURY \n",
"22 NAME_OFFICER OFFICER_NAME \n",
"23 NAME_OFFICER_SUBJECT OFFICER/SUBJECT_NAME \n",
"24 NAME_SUBJECT SUBJECT_NAME \n",
"25 RACE_SUBJECT SUBJECT_RACE \n",
"26 RACE_OFFICER_SUBJECT OFFICER/SUBJECT_RACE \n",
"27 RACE_OFFICER OFFICER_RACE \n",
"28 RACE_ETHNICITY_SUBJECT SUBJECT_RACE/ETHNICITY \n",
"29 RACE_ETHNICITY_OFFICER_SUBJECT OFFICER/SUBJECT_RACE/ETHNICITY \n",
"30 RACE_ETHNICITY_OFFICER OFFICER_RACE/ETHNICITY \n",
"31 RE_GROUP_OFFICER OFFICER_RE_GROUP \n",
"32 RE_GROUP_OFFICER_SUBJECT OFFICER/SUBJECT_RE_GROUP \n",
"33 RE_GROUP_SUBJECT SUBJECT_RE_GROUP \n",
"34 SUBJECT_OR_OFFICER SUBJECT_OR_OFFICER \n",
"35 TIME TIME \n",
"36 ZIP_CODE ZIP_CODE \n",
"\n",
" Definition \n",
"0 Age of officer \n",
"1 Age of either an officer or subject (depending... \n",
"2 Age of subject \n",
"3 Age Range of officer \n",
"4 Age Range of either an officer or subject (dep... \n",
"5 Age Range of subject \n",
"6 Agency \n",
"7 Date \n",
"8 Combination of date and time when both columns... \n",
"9 Ethnicity of officer \n",
"10 Ethnicity of either an officer or subject (dep... \n",
"11 Ethnicity of subject \n",
"12 Whether officer was fatally injured in an offi... \n",
"13 Whether officer/subject was fatally injured in... \n",
"14 Whether subject was fatally injured in an offi... \n",
"15 Gender of officer \n",
"16 Gender of either an officer or subject (depend... \n",
"17 Gender of subject \n",
"18 A unique incident ID given to an incident (arr... \n",
"19 Whether officer was injured in an officer-invo... \n",
"20 Whether officer/subject was injured in an offi... \n",
"21 Whether subject was injured in an officer-invo... \n",
"22 Name of officer \n",
"23 Name of either an officer or subject (dependin... \n",
"24 Name of subject \n",
"25 Race of subject \n",
"26 Race of either an officer or subject (dependin... \n",
"27 Race of officer \n",
"28 Race/Ethnicity of subject \n",
"29 Race/Ethnicity of either an officer or subject... \n",
"30 Race/Ethnicity of officer \n",
"31 Convenience column identical to Race/Ethnicity... \n",
"32 Convenience column identical to Race/Ethnicity... \n",
"33 Convenience column identical to Race/Ethnicity... \n",
"34 Whether row describes an officer or an subject... \n",
"35 Time \n",
"36 Zip Code "
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"wrap_print(\"Attribute value below is an attribute of opd.defs.columns (i.e. opd.defs.columns.AGE_OFFICER) that can be used to access column names\")\n",
"wrap_print('Column Name value below is the corresponding standardized column name (i.e. \"OFFICER_AGE\") that replaces the original column name '+\n",
" 'during the standardization process')\n",
"\n",
"opd.Column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"NOTE: For consistency, OpenPoliceData always uses the term subject to describe the community member involved in a police interaction. This is consistent with the [Stanford Open Policing Project](https://github.com/stanford-policylab/opp/blob/master/data_readme.md).\n",
"\n",
"## Basic Standardization\n",
"Basic standardization only requires calling `standardize` on a `Table` object."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The columns after standardization are: \n",
"Index(['DATE', 'SUBJECT_RACE', 'SUBJECT_ETHNICITY', 'SUBJECT_RACE/ETHNICITY',\n",
" 'SUBJECT_RE_GROUP', 'SUBJECT_AGE', 'SUBJECT_GENDER', 'AGENCY',\n",
" 'JURISDICTION', 'REASON FOR STOP', 'PERSON TYPE', 'ENGLISH SPEAKING',\n",
" 'ACTION TAKEN', 'VIOLATION TYPE', 'SPECIFIC VIOLATION',\n",
" 'VIRGINIA CRIME CODE', 'PERSON SEARCHED', 'VEHICLE SEARCHED',\n",
" 'ADDITIONAL ARREST', 'FORCE USED BY OFFICER', 'FORCE USED BY SUBJECT',\n",
" 'RESIDENCY', 'RAW_STOP_DATE', 'RAW_AGENCY NAME', 'RAW_RACE',\n",
" 'RAW_ETHNICITY', 'RAW_AGE', 'RAW_GENDER'],\n",
" dtype='object')\n",
"\n"
]
}
],
"source": [
"src = opd.Source(\"Virginia\")\n",
"tbl = src.load(table_type=\"STOPS\", year=2022, agency=\"Arlington County Police Department\")\n",
"tbl.standardize()\n",
"\n",
"print(f'The columns after standardization are: \\n{tbl.table.columns}\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Standardized columns are always in all caps. These columns are standardized versions of the original columns that have been prepended with `RAW_` (*the original columns were lowercase in this dataset. Original column names may also be in all caps*). By standardizing, the user does not need to know the exact name of a particular column (they vary greatly). The user can check if a standaridzed column exists by doing the following:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"The values in the original subject race column are ['WHITE' 'BLACK OR AFRICAN AMERICAN' 'AMERICAN INDIAN OR\n",
"ALASKA NATIVE' 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER' 'UNKNOWN']\n",
"\n",
"The values in the original subject race column are ['HISPANIC OR LATINO' 'NOT HISPANIC OR LATINO' 'UNKNOWN']\n",
"\n",
"After combining race and ethnicity columns, the values in the standardized subject race/ethnicity column are\n",
"['HISPANIC/LATINO' 'BLACK' 'INDIGENOUS' 'WHITE' 'ASIAN/PACIFIC ISLANDER' 'UNKNOWN']\n"
]
}
],
"source": [
"if opd.Column.RE_GROUP_SUBJECT in tbl.table: # Alternatively: if \"RE_GROUP_SUBJECT\" in tbl.table\n",
" race_col = tbl.table[opd.Column.RE_GROUP_SUBJECT] # Alternatively: tbl.table[\"RE_GROUP_SUBJECT\"]\n",
" wrap_print(f\"The values in the original subject race column are {tbl.table['RAW_RACE'].unique()}\\n\")\n",
" wrap_print(f\"The values in the original subject race column are {tbl.table['RAW_ETHNICITY'].unique()}\\n\")\n",
" wrap_print(f\"After combining race and ethnicity columns, the values in the standardized subject race/ethnicity column are {race_col.unique()}\")\n",
"else:\n",
" wrap_print(\"There is no subject race/ethnicity or race column\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The values in each standardized column have been converted from the raw values to standardized ones. Thus, in the standardized race/ethnicty column, Black individuals will always be labeled as BLACK despite different datasets using many different encodings (including B, African American, Black or African American, and various shortened versions or typos). Data types will also be consistent, such as as dates will use a native pandas date format:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dtype('\n",
"\n",
"\n",
" \n",
" \n",
" | \n",
" DATE | \n",
" SUBJECT_RACE | \n",
" SUBJECT_ETHNICITY | \n",
" SUBJECT_RACE/ETHNICITY | \n",
" SUBJECT_RE_GROUP | \n",
" SUBJECT_AGE | \n",
" SUBJECT_GENDER | \n",
" AGENCY | \n",
" JURISDICTION | \n",
" REASON FOR STOP | \n",
" ... | \n",
" ADDITIONAL ARREST | \n",
" FORCE USED BY OFFICER | \n",
" FORCE USED BY SUBJECT | \n",
" RESIDENCY | \n",
" RAW_STOP_DATE | \n",
" RAW_AGENCY NAME | \n",
" RAW_RACE | \n",
" RAW_ETHNICITY | \n",
" RAW_AGE | \n",
" RAW_GENDER | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" 2022-01-01 | \n",
" WHITE | \n",
" HISPANIC/LATINO | \n",
" HISPANIC/LATINO | \n",
" HISPANIC/LATINO | \n",
" 17 | \n",
" MALE | \n",
" Arlington County Police Department | \n",
" ARLINGTON CO | \n",
" CALLS FOR SERVICE | \n",
" ... | \n",
" None | \n",
" NO | \n",
" NO | \n",
" None | \n",
" 2022-01-01 | \n",
" Arlington County Police Department | \n",
" WHITE | \n",
" HISPANIC OR LATINO | \n",
" 17 | \n",
" MALE | \n",
"
\n",
" \n",
" | 1 | \n",
" 2022-01-01 | \n",
" BLACK | \n",
" NON-HISPANIC/NON-LATINO | \n",
" BLACK | \n",
" BLACK | \n",
" 38 | \n",
" MALE | \n",
" Arlington County Police Department | \n",
" ARLINGTON CO | \n",
" TRAFFIC VIOLATION | \n",
" ... | \n",
" None | \n",
" NO | \n",
" NO | \n",
" None | \n",
" 2022-01-01 | \n",
" Arlington County Police Department | \n",
" BLACK OR AFRICAN AMERICAN | \n",
" NOT HISPANIC OR LATINO | \n",
" 38 | \n",
" MALE | \n",
"
\n",
" \n",
" | 2 | \n",
" 2022-01-01 | \n",
" BLACK | \n",
" NON-HISPANIC/NON-LATINO | \n",
" BLACK | \n",
" BLACK | \n",
" 29 | \n",
" FEMALE | \n",
" Arlington County Police Department | \n",
" ARLINGTON CO | \n",
" TRAFFIC VIOLATION | \n",
" ... | \n",
" None | \n",
" NO | \n",
" NO | \n",
" None | \n",
" 2022-01-01 | \n",
" Arlington County Police Department | \n",
" BLACK OR AFRICAN AMERICAN | \n",
" NOT HISPANIC OR LATINO | \n",
" 29 | \n",
" FEMALE | \n",
"
\n",
" \n",
" | 3 | \n",
" 2022-01-01 | \n",
" WHITE | \n",
" HISPANIC/LATINO | \n",
" HISPANIC/LATINO | \n",
" HISPANIC/LATINO | \n",
" 31 | \n",
" MALE | \n",
" Arlington County Police Department | \n",
" ARLINGTON CO | \n",
" TRAFFIC VIOLATION | \n",
" ... | \n",
" None | \n",
" NO | \n",
" NO | \n",
" None | \n",
" 2022-01-01 | \n",
" Arlington County Police Department | \n",
" WHITE | \n",
" HISPANIC OR LATINO | \n",
" 31 | \n",
" MALE | \n",
"
\n",
" \n",
" | 4 | \n",
" 2022-01-01 | \n",
" INDIGENOUS | \n",
" NON-HISPANIC/NON-LATINO | \n",
" INDIGENOUS | \n",
" INDIGENOUS | \n",
" 44 | \n",
" MALE | \n",
" Arlington County Police Department | \n",
" ARLINGTON CO | \n",
" CALLS FOR SERVICE | \n",
" ... | \n",
" None | \n",
" NO | \n",
" NO | \n",
" None | \n",
" 2022-01-01 | \n",
" Arlington County Police Department | \n",
" AMERICAN INDIAN OR ALASKA NATIVE | \n",
" NOT HISPANIC OR LATINO | \n",
" 44 | \n",
" MALE | \n",
"
\n",
" \n",
"
\n",
"5 rows × 28 columns
\n",
""
],
"text/plain": [
" DATE SUBJECT_RACE SUBJECT_ETHNICITY SUBJECT_RACE/ETHNICITY \\\n",
"0 2022-01-01 WHITE HISPANIC/LATINO HISPANIC/LATINO \n",
"1 2022-01-01 BLACK NON-HISPANIC/NON-LATINO BLACK \n",
"2 2022-01-01 BLACK NON-HISPANIC/NON-LATINO BLACK \n",
"3 2022-01-01 WHITE HISPANIC/LATINO HISPANIC/LATINO \n",
"4 2022-01-01 INDIGENOUS NON-HISPANIC/NON-LATINO INDIGENOUS \n",
"\n",
" SUBJECT_RE_GROUP SUBJECT_AGE SUBJECT_GENDER \\\n",
"0 HISPANIC/LATINO 17 MALE \n",
"1 BLACK 38 MALE \n",
"2 BLACK 29 FEMALE \n",
"3 HISPANIC/LATINO 31 MALE \n",
"4 INDIGENOUS 44 MALE \n",
"\n",
" AGENCY JURISDICTION REASON FOR STOP ... \\\n",
"0 Arlington County Police Department ARLINGTON CO CALLS FOR SERVICE ... \n",
"1 Arlington County Police Department ARLINGTON CO TRAFFIC VIOLATION ... \n",
"2 Arlington County Police Department ARLINGTON CO TRAFFIC VIOLATION ... \n",
"3 Arlington County Police Department ARLINGTON CO TRAFFIC VIOLATION ... \n",
"4 Arlington County Police Department ARLINGTON CO CALLS FOR SERVICE ... \n",
"\n",
" ADDITIONAL ARREST FORCE USED BY OFFICER FORCE USED BY SUBJECT RESIDENCY \\\n",
"0 None NO NO None \n",
"1 None NO NO None \n",
"2 None NO NO None \n",
"3 None NO NO None \n",
"4 None NO NO None \n",
"\n",
" RAW_STOP_DATE RAW_AGENCY NAME \\\n",
"0 2022-01-01 Arlington County Police Department \n",
"1 2022-01-01 Arlington County Police Department \n",
"2 2022-01-01 Arlington County Police Department \n",
"3 2022-01-01 Arlington County Police Department \n",
"4 2022-01-01 Arlington County Police Department \n",
"\n",
" RAW_RACE RAW_ETHNICITY RAW_AGE RAW_GENDER \n",
"0 WHITE HISPANIC OR LATINO 17 MALE \n",
"1 BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO 38 MALE \n",
"2 BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO 29 FEMALE \n",
"3 WHITE HISPANIC OR LATINO 31 MALE \n",
"4 AMERICAN INDIAN OR ALASKA NATIVE NOT HISPANIC OR LATINO 44 MALE \n",
"\n",
"[5 rows x 28 columns]"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tbl.table.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Getting Demographics Column Names\n",
"Names of race, gender, and age columns can be accessed with the get_race_col, get_gender_col, and get_age_col functions of the Table class, respectively."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"subject_race_col = tbl.get_race_col() # or role='SUBJECT'\n",
"officer_race_col = tbl.get_race_col(role='OFFICER') # This returns None because the data has no officer race column"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combining Columns\n",
"The following columns are combined when they are both identified:\n",
"\n",
"- DATE and TIME are combined into DATETIME\n",
"- SUBJECT_RACE and SUBJECT_ETHNICITY are combined into SUBJECT_RACE/ETHNCITY. \n",
"- OFFICER_RACE and OFFICER_ETHNICITY are combined into OFFICER_RACE/ETHNCITY. \n",
"- OFFICER/SUBJECT_RACE and OFFICER/SUBJECT_ETHNICITY are combined into OFFICER/SUBJECT_RACE/ETHNCITY. For more info on OFFICER/SUBJECT columns, see [here](#officer-and-subject-demographics-in-separate-rows).\n",
"\n",
"### RE_GROUP (Race/Ethnicity Group) Columns\n",
"For users who might want to use the race/ethnicity column if it exists and the race column if not, standardization includes the following columns to enable this capability:\n",
"\n",
"- SUBJECT_RE_GROUP: Added if SUBJECT_RACE/ETHNCITY or SUBJECT_RACE exist. Will be a copy of SUBJECT_RACE/ETHNCITY if it exists otherwise a copy of SUBJECT_RACE if it exists. \n",
"- OFFICER_RE_GROUP: Added if OFFICER_RACE/ETHNCITY or OFFICER_RACE exist. Will be a copy of OFFICER_RACE/ETHNCITY if it exists otherwise a copy of OFFICER_RACE if it exists. \n",
"- OFFICER/SUBJECT_RE_GROUP: Added if OFFICER/SUBJECT_RACE/ETHNCITY or OFFICER/SUBJECT_RACE exist. Will be a copy of OFFICER/SUBJECT_RACE/ETHNCITY if it exists otherwise a copy of OFFICER/SUBJECT_RACE if it exists. For more info on OFFICER/SUBJECT columns, see [here](#officer-and-subject-demographics-in-separate-rows)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Officer and Subject Demographics in Separate Rows\n",
"OFFICER/SUBJECT columns (OFFICER/SUBJECT_RACE, OFFICER/SUBJECT_AGE, etc.) exist when a single row either describes an officer or a subject. In these cases, the SUBJECT_OR_OFFICER column will indicate if a given row is for a subject or an officer. See [here](#multi-race-or-multi-ethnicity-data) for an example.\n",
"\n",
"## Advanced Standardization\n",
"This section describes topics related to:\n",
"\n",
"- Customizing the standardization process \n",
"- Explaining how rare cases are handled\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Multi-Subject and Multi-Officer Data\n",
"Some datasets encode the demographics of multiple subjects and/or officers into a single row of data. In these cases, the standardized demographics are encoded differently than in a typical signal subject / officer per row case. This is best demonstrated by the following example."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" subject_ethnicity | \n",
" subject_gender | \n",
" subject_age | \n",
"
\n",
" \n",
" \n",
" \n",
" | 6 | \n",
" Black | Black | \n",
" Female | Male | \n",
" 42 | 42 | \n",
"
\n",
" \n",
" | 8 | \n",
" Black | Black | \n",
" Male | Male | \n",
" 46 | 43 | \n",
"
\n",
" \n",
" | 9 | \n",
" Black | Black | Black | Black | Black | Black | \n",
" Male | Male | Male | Female | Female | Male | \n",
" 19 | 18 | 17 | 14 | 17 | 31 | \n",
"
\n",
" \n",
" | 25 | \n",
" Black | Black | Black | \n",
" Female | Female | Female | \n",
" 18 | 40 | 24 | \n",
"
\n",
" \n",
" | 26 | \n",
" Hispanic | Black | Black | \n",
" Male | Male | Male | \n",
" 16 | 16 | 15 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" subject_ethnicity \\\n",
"6 Black | Black \n",
"8 Black | Black \n",
"9 Black | Black | Black | Black | Black | Black \n",
"25 Black | Black | Black \n",
"26 Hispanic | Black | Black \n",
"\n",
" subject_gender subject_age \n",
"6 Female | Male 42 | 42 \n",
"8 Male | Male 46 | 43 \n",
"9 Male | Male | Male | Female | Female | Male 19 | 18 | 17 | 14 | 17 | 31 \n",
"25 Female | Female | Female 18 | 40 | 24 \n",
"26 Male | Male | Male 16 | 16 | 15 "
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"src = opd.Source(\"New Orleans\")\n",
"tbl = src.load(opd.TableType.USE_OF_FORCE, 2022)\n",
"# Find rows with demographics of multiple subjects. New Orleans uses the | to separate persons. This is not typical and would not be known in advance.\n",
"is_mult = tbl.table['subject_gender'].str.contains(\"|\",regex=False).fillna(False)\n",
"tbl.table[is_mult][['subject_ethnicity','subject_gender','subject_age']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the New Orleans use of force data, when multiple subjects are involved in a use of force incident, their demographics are separated by a | symbol. There is no standard way for encoding this type of multi-subject or multi-officer case and a variety of other formats have been observed for encoding the demographics of multiple people in a row. OPD standardizes these cases into a common format:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" SUBJECT_RE_GROUP | \n",
" SUBJECT_GENDER | \n",
" SUBJECT_AGE | \n",
"
\n",
" \n",
" \n",
" \n",
" | 6 | \n",
" {0: 'BLACK', 1: 'BLACK'} | \n",
" {0: 'FEMALE', 1: 'MALE'} | \n",
" {0: 42, 1: 42} | \n",
"
\n",
" \n",
" | 8 | \n",
" {0: 'BLACK', 1: 'BLACK'} | \n",
" {0: 'MALE', 1: 'MALE'} | \n",
" {0: 46, 1: 43} | \n",
"
\n",
" \n",
" | 9 | \n",
" {0: 'BLACK', 1: 'BLACK', 2: 'BLACK', 3: 'BLACK... | \n",
" {0: 'MALE', 1: 'MALE', 2: 'MALE', 3: 'FEMALE',... | \n",
" {0: 19, 1: 18, 2: 17, 3: 14, 4: 17, 5: 31} | \n",
"
\n",
" \n",
" | 25 | \n",
" {0: 'BLACK', 1: 'BLACK', 2: 'BLACK'} | \n",
" {0: 'FEMALE', 1: 'FEMALE', 2: 'FEMALE'} | \n",
" {0: 18, 1: 40, 2: 24} | \n",
"
\n",
" \n",
" | 26 | \n",
" {0: 'HISPANIC/LATINO', 1: 'BLACK', 2: 'BLACK'} | \n",
" {0: 'MALE', 1: 'MALE', 2: 'MALE'} | \n",
" {0: 16, 1: 16, 2: 15} | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" SUBJECT_RE_GROUP \\\n",
"6 {0: 'BLACK', 1: 'BLACK'} \n",
"8 {0: 'BLACK', 1: 'BLACK'} \n",
"9 {0: 'BLACK', 1: 'BLACK', 2: 'BLACK', 3: 'BLACK... \n",
"25 {0: 'BLACK', 1: 'BLACK', 2: 'BLACK'} \n",
"26 {0: 'HISPANIC/LATINO', 1: 'BLACK', 2: 'BLACK'} \n",
"\n",
" SUBJECT_GENDER \\\n",
"6 {0: 'FEMALE', 1: 'MALE'} \n",
"8 {0: 'MALE', 1: 'MALE'} \n",
"9 {0: 'MALE', 1: 'MALE', 2: 'MALE', 3: 'FEMALE',... \n",
"25 {0: 'FEMALE', 1: 'FEMALE', 2: 'FEMALE'} \n",
"26 {0: 'MALE', 1: 'MALE', 2: 'MALE'} \n",
"\n",
" SUBJECT_AGE \n",
"6 {0: 42, 1: 42} \n",
"8 {0: 46, 1: 43} \n",
"9 {0: 19, 1: 18, 2: 17, 3: 14, 4: 17, 5: 31} \n",
"25 {0: 18, 1: 40, 2: 24} \n",
"26 {0: 16, 1: 16, 2: 15} "
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tbl.standardize()\n",
"tbl.table[[opd.Column.RE_GROUP_SUBJECT, opd.Column.GENDER_SUBJECT, opd.Column.AGE_SUBJECT]][is_mult].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OPD encodes multiple subject or officer demographic data as dictionaries. The dictionary key corresponds to which subject or officer that the demographics describe, and the value indicates the race, gender, age, etc. For example, if SUBJECT_RE_GROUP is {0:'BLACK',1:'WHITE'} and SUBJECT_GENDER is {0:'MALE','1':'FEMALE'}, the first subject is a Black male and the second is a White female.\n",
"\n",
"In some cases, it may be desirable to expand this data from a single row per incident (i.e. containing demographic data for every person involved in the incident) to a single row per subject or officer. The `expand` function provides this capability:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" SUBJECT_RE_GROUP | \n",
" SUBJECT_GENDER | \n",
" SUBJECT_AGE | \n",
"
\n",
" \n",
" \n",
" \n",
" | 6 | \n",
" BLACK | \n",
" FEMALE | \n",
" 42 | \n",
"
\n",
" \n",
" | 7 | \n",
" BLACK | \n",
" MALE | \n",
" 42 | \n",
"
\n",
" \n",
" | 8 | \n",
" BLACK | \n",
" MALE | \n",
" 19 | \n",
"
\n",
" \n",
" | 9 | \n",
" BLACK | \n",
" MALE | \n",
" 46 | \n",
"
\n",
" \n",
" | 10 | \n",
" BLACK | \n",
" MALE | \n",
" 43 | \n",
"
\n",
" \n",
" | 11 | \n",
" BLACK | \n",
" MALE | \n",
" 19 | \n",
"
\n",
" \n",
" | 12 | \n",
" BLACK | \n",
" MALE | \n",
" 18 | \n",
"
\n",
" \n",
" | 13 | \n",
" BLACK | \n",
" MALE | \n",
" 17 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" SUBJECT_RE_GROUP SUBJECT_GENDER SUBJECT_AGE\n",
"6 BLACK FEMALE 42\n",
"7 BLACK MALE 42\n",
"8 BLACK MALE 19\n",
"9 BLACK MALE 46\n",
"10 BLACK MALE 43\n",
"11 BLACK MALE 19\n",
"12 BLACK MALE 18\n",
"13 BLACK MALE 17"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# mismatch indicates what to do when a column to be expanded has the incorrect number of values \n",
"# (such as if there are 2 genders and races but only 1 age). 'nan' fills with NaNs. 'splitsingle'\n",
"# uses the single value for each expanded row in cases where a column has only 1 value. Default\n",
"# is to throw an error.\n",
"tbl.expand(person_type='subject', mismatch='nan') \n",
"is_mult_index = is_mult[is_mult].index\n",
"tbl.table[[opd.Column.RE_GROUP_SUBJECT, opd.Column.GENDER_SUBJECT, opd.Column.AGE_SUBJECT]].loc[is_mult_index[0]:is_mult_index[0]+7]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the above example, the 1st 2 multiple demographic rows can be seen to have been split into 2 rows each (Black female 42 and Black male 42 and Black male 46 and Black male 43).\n",
"\n",
"### Multi-Race or Multi-Ethnicity Data\n",
"Some datasets list multiple races for multi-racial individuals. The California use of force datasets is an example."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" Civilian_Officer | \n",
" Race_Ethnic_Group | \n",
" Age | \n",
" Gender | \n",
"
\n",
" \n",
" \n",
" \n",
" | 91 | \n",
" Officer | \n",
" white, asian | \n",
" 36_40 | \n",
" male | \n",
"
\n",
" \n",
" | 106 | \n",
" Officer | \n",
" american_indian, white | \n",
" 36_40 | \n",
" male | \n",
"
\n",
" \n",
" | 109 | \n",
" Officer | \n",
" black, white | \n",
" 31_35 | \n",
" male | \n",
"
\n",
" \n",
" | 210 | \n",
" Officer | \n",
" hispanic, white | \n",
" 41_45 | \n",
" male | \n",
"
\n",
" \n",
" | 244 | \n",
" Officer | \n",
" black, hispanic | \n",
" 26_30 | \n",
" male | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" Civilian_Officer Race_Ethnic_Group Age Gender\n",
"91 Officer white, asian 36_40 male\n",
"106 Officer american_indian, white 36_40 male\n",
"109 Officer black, white 31_35 male\n",
"210 Officer hispanic, white 41_45 male\n",
"244 Officer black, hispanic 26_30 male"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"src = opd.Source(\"California\")\n",
"# USE_OF_FORCE_SUBJECTS_OFFICERS datasets contain information on officers and subjects involved in use of force incidents\n",
"# They will have a column indicating whether each row applies to an officer or subject\n",
"tbl = src.load(opd.TableType.USE_OF_FORCE_SUBJECTS_OFFICERS, 2022)\n",
"# Find multi-racial individuals\n",
"is_mult = tbl.table['Race_Ethnic_Group'].str.contains(\",\").fillna(False)\n",
"tbl.table[is_mult][['Civilian_Officer','Race_Ethnic_Group','Age','Gender']].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The race/ethnicity column of this datasets includes multiple races separated by commas for multi-racial individuals (also note that the age column is actually an age range column). OPD identifies that the multiple races apply to an indvidual (as opposed to [multiple people](#multi-subject-and-multi-officer-data)) and stores the multiple races as a comma-separated list:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" SUBJECT_OR_OFFICER | \n",
" OFFICER/SUBJECT_RE_GROUP | \n",
" OFFICER/SUBJECT_GENDER | \n",
" OFFICER/SUBJECT_AGE_RANGE | \n",
"
\n",
" \n",
" \n",
" \n",
" | 91 | \n",
" OFFICER | \n",
" ASIAN, WHITE | \n",
" MALE | \n",
" 36-40 | \n",
"
\n",
" \n",
" | 106 | \n",
" OFFICER | \n",
" INDIGENOUS, WHITE | \n",
" MALE | \n",
" 36-40 | \n",
"
\n",
" \n",
" | 109 | \n",
" OFFICER | \n",
" BLACK, WHITE | \n",
" MALE | \n",
" 31-35 | \n",
"
\n",
" \n",
" | 210 | \n",
" OFFICER | \n",
" HISPANIC/LATINO, WHITE | \n",
" MALE | \n",
" 41-45 | \n",
"
\n",
" \n",
" | 244 | \n",
" OFFICER | \n",
" BLACK, HISPANIC/LATINO | \n",
" MALE | \n",
" 26-30 | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" SUBJECT_OR_OFFICER OFFICER/SUBJECT_RE_GROUP OFFICER/SUBJECT_GENDER \\\n",
"91 OFFICER ASIAN, WHITE MALE \n",
"106 OFFICER INDIGENOUS, WHITE MALE \n",
"109 OFFICER BLACK, WHITE MALE \n",
"210 OFFICER HISPANIC/LATINO, WHITE MALE \n",
"244 OFFICER BLACK, HISPANIC/LATINO MALE \n",
"\n",
" OFFICER/SUBJECT_AGE_RANGE \n",
"91 36-40 \n",
"106 36-40 \n",
"109 31-35 \n",
"210 41-45 \n",
"244 26-30 "
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tbl.standardize()\n",
"cols = [opd.Column.SUBJECT_OR_OFFICER, opd.Column.RE_GROUP_OFFICER_SUBJECT, opd.Column.GENDER_OFFICER_SUBJECT, opd.Column.AGE_RANGE_OFFICER_SUBJECT]\n",
"tbl.table[is_mult][cols].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Customizing Race/Ethnicity Categories\n",
"`standardize` provides several inputs for customizing the race/ethnicity categorization that are described below. \n",
"\n",
"#### `race_cats`\n",
"`standardize` has a default set of race category values that it uses:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'AAPI': 'ASIAN/PACIFIC ISLANDER',\n",
" 'ASIAN': 'ASIAN',\n",
" 'BLACK': 'BLACK',\n",
" 'LATINO': 'HISPANIC/LATINO',\n",
" 'MULTIPLE': 'MULTIPLE',\n",
" 'INDIGENOUS': 'INDIGENOUS',\n",
" 'OTHER': 'OTHER',\n",
" 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',\n",
" 'UNKNOWN': 'UNKNOWN',\n",
" 'UNSPECIFIED': 'UNSPECIFIED',\n",
" 'WHITE': 'WHITE'}"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"default_race_cats = opd.defs.get_race_cats()\n",
"default_race_cats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The race categories in `default_race_cats` are a dictionary. The values of the dictionary are the race category values that will be used during standardization. The keys of the dictionary are codes used internally to look up which category value to use. "
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"orig_column_name: RACE\n",
"new_column_name: SUBJECT_RACE\n",
"\n",
"Data Transformation Map:\n"
]
},
{
"data": {
"text/plain": [
"{'BLACK OR AFRICAN AMERICAN': 'BLACK',\n",
" 'WHITE': 'WHITE',\n",
" 'UNKNOWN': 'UNKNOWN',\n",
" 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER': 'ASIAN/PACIFIC ISLANDER',\n",
" 'AMERICAN INDIAN OR ALASKA NATIVE': 'INDIGENOUS'}"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"src = opd.Source(\"Virginia\")\n",
"tbl = src.load(\"STOPS\", 2022,agency=\"Norfolk Police Department\")\n",
"tbl.standardize()\n",
"s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)\n",
"\n",
"print(f\"orig_column_name: {s.orig_column_name}\")\n",
"print(f\"new_column_name: {s.new_column_name}\\n\")\n",
"print(\"Data Transformation Map:\")\n",
"s.data_maps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The data transformation map is a dictionary showing original data values (keys) and standardized values (dictionary values). It can be seen that each of the original values was able to be identified and transformed to a value in the `default_race_cats` dictionary shown above.\n",
"\n",
"If the user desires to use different names for the race categories, they can do this by assigning new values:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"orig_column_name: RACE\n",
"new_column_name: SUBJECT_RACE\n",
"\n",
"Data Transformation Map:\n"
]
},
{
"data": {
"text/plain": [
"{'BLACK OR AFRICAN AMERICAN': 'BLACK',\n",
" 'WHITE': 'WHITE',\n",
" 'UNKNOWN': 'Unrecognized',\n",
" 'ASIAN OR NATIVE HAWAIIAN OR OTHER PACIFIC ISLANDER': 'ASIAN/PACIFIC ISLANDER',\n",
" 'AMERICAN INDIAN OR ALASKA NATIVE': 'INDIGENOUS'}"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_race_cats = opd.defs.get_race_cats()\n",
"new_race_cats['UNKNOWN'] = \"Unrecognized\"\n",
"\n",
"tbl = src.load(\"STOPS\", 2022, agency=\"Norfolk Police Department\")\n",
"tbl.standardize(race_cats=new_race_cats)\n",
"s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)\n",
"\n",
"print(f\"orig_column_name: {s.orig_column_name}\")\n",
"print(f\"new_column_name: {s.new_column_name}\\n\")\n",
"print(\"Data Transformation Map:\")\n",
"s.data_maps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The original value UNKNOWN is now replaced by Unrecognized.\n",
"\n",
"Now, let's see a more complicated example:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"orig_column_name: race\n",
"new_column_name: SUBJECT_RACE/ETHNICITY\n",
"\n",
"Data Transformation Map:\n"
]
},
{
"data": {
"text/plain": [
"{'White': 'WHITE',\n",
" 'Hispanic': 'HISPANIC/LATINO',\n",
" 'Black': 'BLACK',\n",
" 'Other': 'OTHER',\n",
" 'American Indian': 'INDIGENOUS',\n",
" 'Other Asian': 'ASIAN',\n",
" 'Filipino': 'Filipino',\n",
" 'Vietnamese': 'Vietnamese',\n",
" 'Pacific Islander': 'ASIAN/PACIFIC ISLANDER',\n",
" 'Asian Indian': 'Asian Indian',\n",
" 'Chinese': 'Chinese',\n",
" 'Korean': 'Korean',\n",
" 'Laotian': 'Laotian',\n",
" 'Samoan': 'Samoan',\n",
" 'Cambodian': 'Cambodian',\n",
" 'Japanese': 'Japanese',\n",
" 'Hawaiian': 'ASIAN/PACIFIC ISLANDER',\n",
" 'Guamanian': 'Guamanian',\n",
" : 'UNSPECIFIED'}"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"src = opd.Source(\"California\")\n",
"tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, \"MULTIPLE\")\n",
"tbl.standardize()\n",
"s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)\n",
"\n",
"print(f\"orig_column_name: {s.orig_column_name}\")\n",
"print(f\"new_column_name: {s.new_column_name}\\n\")\n",
"print(\"Data Transformation Map:\")\n",
"s.data_maps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, some of the original values are converted to standardized values from `default_race_cats` and some remain unchanged. By default, values that cannot be identified are not changed (see [no_id](#no_id) input to change). \n",
"\n",
"We can create an expanded list of race categories that include more categories. These additional categories are based on categories that have been observed in some datasets. The expanded race categories contain all allowable keys for the `race_cats` input to `standardize`."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'AAPI': 'ASIAN/PACIFIC ISLANDER',\n",
" 'ASIAN': 'ASIAN',\n",
" 'BLACK': 'BLACK',\n",
" 'LATINO': 'HISPANIC/LATINO',\n",
" 'MULTIPLE': 'MULTIPLE',\n",
" 'INDIGENOUS': 'INDIGENOUS',\n",
" 'OTHER': 'OTHER',\n",
" 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',\n",
" 'UNKNOWN': 'UNKNOWN',\n",
" 'UNSPECIFIED': 'UNSPECIFIED',\n",
" 'WHITE': 'WHITE',\n",
" 'PACIFIC_ISLANDER': 'HAWAIIAN/PACIFIC ISLANDER',\n",
" 'MIDDLE_EASTERN': 'MIDDLE EASTERN',\n",
" 'MIDDLE_EASTERN_SOUTH_ASIAN': 'MIDDLE EASTERN/SOUTH ASIAN',\n",
" 'SOUTH_ASIAN': 'SOUTH ASIAN'}"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"expanded_race_cats = opd.defs.get_race_cats(expand=True)\n",
"expanded_race_cats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This list includes a few more categories than existed previously including Middle Eastern and South Asian categories. These categories were not used previously because their keys were not in the default dictionary used for `race_cats`. The expanded list can be used instead of the default list by setting `race_cats='expand'`, which is shorthand for `race_cats=opd.defs.get_race_cats(expand=True)`:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"orig_column_name: race\n",
"new_column_name: SUBJECT_RACE/ETHNICITY\n",
"\n",
"Data Transformation Map:\n"
]
},
{
"data": {
"text/plain": [
"{'White': 'WHITE',\n",
" 'Hispanic': 'HISPANIC/LATINO',\n",
" 'Black': 'BLACK',\n",
" 'Other': 'OTHER',\n",
" 'American Indian': 'INDIGENOUS',\n",
" 'Other Asian': 'ASIAN',\n",
" 'Filipino': 'Filipino',\n",
" 'Vietnamese': 'Vietnamese',\n",
" 'Pacific Islander': 'HAWAIIAN/PACIFIC ISLANDER',\n",
" 'Asian Indian': 'SOUTH ASIAN',\n",
" 'Chinese': 'Chinese',\n",
" 'Korean': 'Korean',\n",
" 'Laotian': 'Laotian',\n",
" 'Samoan': 'Samoan',\n",
" 'Cambodian': 'Cambodian',\n",
" 'Japanese': 'Japanese',\n",
" 'Hawaiian': 'HAWAIIAN/PACIFIC ISLANDER',\n",
" 'Guamanian': 'Guamanian',\n",
" : 'UNSPECIFIED'}"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, \"MULTIPLE\")\n",
"tbl.standardize(race_cats='expand')\n",
"s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)\n",
"\n",
"print(f\"orig_column_name: {s.orig_column_name}\")\n",
"print(f\"new_column_name: {s.new_column_name}\\n\")\n",
"print(\"Data Transformation Map:\")\n",
"s.data_maps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this case, one more original value (Asian Indian) has been standardized (SOUTH ASIAN).\n",
"\n",
"Similarly, more compact race categories can be generated by inputting `race_cats='compact'` to `standardize`, which is equivalent to `race_cats=opd.defs.get_race_cats(compact=True)`. The result is that most labels will be abbreviated:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'AAPI': 'AAPI',\n",
" 'ASIAN': 'A',\n",
" 'BLACK': 'B',\n",
" 'LATINO': 'H',\n",
" 'MULTIPLE': 'MULTIPLE',\n",
" 'INDIGENOUS': 'I',\n",
" 'OTHER': 'O',\n",
" 'OTHER / UNKNOWN': 'OTHER OR UNKNOWN',\n",
" 'UNKNOWN': 'UNKNOWN',\n",
" 'UNSPECIFIED': 'UNSPECIFIED',\n",
" 'WHITE': 'W'}"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"opd.defs.get_race_cats(compact=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `agg_race_cat`\n",
"More aggressive categorization is possible by setting `agg_race_cat` to True. By default, `standardize` attempts to not reduce the specificity of the values in the raw data. Setting `agg_race_cat` to True will attempt to standardize more cases by replacing some nationalities with races based on [U.S. Census guidance](https://www.census.gov/programs-surveys/decennial-census/decade/2020/planning-management/release/faqs-race-ethnicity.html) or in columns that contain [multiple races and/or ethnicities](#multi-raceethnicity-data), replacing all values containing a term for Hispanic/Latino, such as Black Latino, with the standard category for Hispanic/Latino (i.e. this category is now for Hispanic/Latino of all races and all other categories are non-Hispanic/Latino)."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"orig_column_name: race\n",
"new_column_name: SUBJECT_RACE/ETHNICITY\n",
"\n",
"Data Transformation Map:\n"
]
},
{
"data": {
"text/plain": [
"{'White': 'WHITE',\n",
" 'Hispanic': 'HISPANIC/LATINO',\n",
" 'Black': 'BLACK',\n",
" 'Other': 'OTHER',\n",
" 'American Indian': 'INDIGENOUS',\n",
" 'Other Asian': 'ASIAN',\n",
" 'Filipino': 'ASIAN',\n",
" 'Vietnamese': 'ASIAN',\n",
" 'Pacific Islander': 'HAWAIIAN/PACIFIC ISLANDER',\n",
" 'Asian Indian': 'SOUTH ASIAN',\n",
" 'Chinese': 'ASIAN',\n",
" 'Korean': 'ASIAN',\n",
" 'Laotian': 'ASIAN',\n",
" 'Samoan': 'HAWAIIAN/PACIFIC ISLANDER',\n",
" 'Cambodian': 'ASIAN',\n",
" 'Japanese': 'ASIAN',\n",
" 'Hawaiian': 'HAWAIIAN/PACIFIC ISLANDER',\n",
" 'Guamanian': 'HAWAIIAN/PACIFIC ISLANDER',\n",
" : 'UNSPECIFIED'}"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tbl = src.load(opd.TableType.DEATHS_IN_CUSTODY, \"MULTIPLE\")\n",
"tbl.standardize(race_cats='expand', agg_race_cat=True)\n",
"s = tbl.get_transform_map(new=opd.Column.RACE_ETHNICITY_SUBJECT)\n",
"\n",
"print(f\"orig_column_name: {s.orig_column_name}\")\n",
"print(f\"new_column_name: {s.new_column_name}\\n\")\n",
"print(\"Data Transformation Map:\")\n",
"s.data_maps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the example above, each nationality is now replaced by a standard value.\n",
"\n",
"#### `eth_cats`\n",
"There are also a set of ethnicity categories used in a similar manner to the [race categories](#race_cats). `standardize` has a default set of ethnicity category names that it uses:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'LATINO': 'HISPANIC/LATINO',\n",
" 'MIDDLE_EASTERN': 'MIDDLE EASTERN',\n",
" 'NON-LATINO': 'NON-HISPANIC/NON-LATINO',\n",
" 'UNKNOWN': 'UNKNOWN',\n",
" 'UNSPECIFIED': 'UNSPECIFIED'}"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"eth_cats = opd.defs.get_eth_cats()\n",
"eth_cats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The ethnicity categories in `eth_cats` are a dictionary. The values of the dictionary are the ethnicity category values that will be used during standardization. The keys of the dictionary are codes used internally to look up which category label to use. No keys can be used besides the ones provided by `get_eth_cats`. Ethnicity categories typically only indicate whether an individual is Hispanic/Latino or non-Hispanic/Latino.\n",
"\n",
"The names assigned for different ethnicity categories can be customized:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"orig_column_name: ETHNICITY\n",
"new_column_name: SUBJECT_ETHNICITY\n",
"\n",
"Data Transformation Map:\n",
"{'NOT HISPANIC OR LATINO': 'NON-HISPANIC/NON-LATINO', 'UNKNOWN': 'UNKNOWN', 'HISPANIC OR LATINO': 'Latino'}\n",
"\n",
"In previous examples, raw values for Hispanic/Latino were replaced with 'HISPANIC/LATINO'. In this case, customization resulted in replacement with 'Latino'.\n"
]
}
],
"source": [
"new_eth_cats = opd.defs.get_eth_cats()\n",
"new_eth_cats['LATINO'] = \"Latino\"\n",
"\n",
"src = opd.Source(\"Virginia\")\n",
"tbl = src.load(\"STOPS\", 2022, agency=\"Norfolk Police Department\")\n",
"tbl.standardize(eth_cats=new_eth_cats)\n",
"s = tbl.get_transform_map(new=opd.Column.ETHNICITY_SUBJECT)\n",
"\n",
"print(f\"orig_column_name: {s.orig_column_name}\")\n",
"print(f\"new_column_name: {s.new_column_name}\\n\")\n",
"print(\"Data Transformation Map:\")\n",
"print(s.data_maps)\n",
"\n",
"print(f\"\\nIn previous examples, raw values for Hispanic/Latino were replaced with '{eth_cats['LATINO']}'. \"+\\\n",
" f\"In this case, customization resulted in replacement with '{new_eth_cats['LATINO']}'.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More compact ethnicity categories can be generated by inputting `eth_cats='compact` to `standardize`, which is equivalent to `eth_cats=opd.defs.get_eth_cats(compact=True)`. The result is that most labels will be abbreviated:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"orig_column_name: ETHNICITY\n",
"new_column_name: SUBJECT_ETHNICITY\n",
"\n",
"Data Transformation Map:\n",
"{'NOT HISPANIC OR LATINO': 'NH', 'UNKNOWN': 'UNKNOWN', 'HISPANIC OR LATINO': 'H'}\n",
"\n",
"In previous examples, raw values for Hispanic/Latino were replaced with 'HISPANIC/LATINO'. In this case, customization resulted in replacement with 'H'.\n"
]
}
],
"source": [
"compact_eth_cats = opd.defs.get_eth_cats(compact=True)\n",
"tbl = src.load(\"STOPS\", 2022, agency=\"Norfolk Police Department\")\n",
"tbl.standardize(eth_cats='compact')\n",
"s = tbl.get_transform_map(new=opd.Column.ETHNICITY_SUBJECT)\n",
"\n",
"print(f\"orig_column_name: {s.orig_column_name}\")\n",
"print(f\"new_column_name: {s.new_column_name}\\n\")\n",
"print(\"Data Transformation Map:\")\n",
"print(s.data_maps)\n",
"\n",
"print(f\"\\nIn previous examples, raw values for Hispanic/Latino were replaced with '{eth_cats['LATINO']}'. \"+\\\n",
" f\"In this case, customization resulted in replacement with '{compact_eth_cats['LATINO']}'.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### `race_eth_combo`\n",
"The `race_eth_combo` input controls whether and how `standardize` combines race and ethnicity if both columns are found. By default (`race_eth_combo='merge'`), they will be merged such that the race will be used for the combined race/ethnicity if the ethnicity is Non-Hispanic/Latino and the ethnicity column will be used otherwise. The resulting categories are Hispanic/Latino of all races and individual race categories will only include non-Hispanic/Latino. Additionally, if the ethnicity is unknown, the race/ethnicity will be set to unknown even if there is a valid value for race.\n",
"\n",
"Other options for `race_eth_combo` are `'concat'` and `False`. `race_eth_combo=False` will result in race and ethnicity columns not being combined. `race_eth_combo='concat'` will concatenate race and ethnicity:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"\n",
"\n",
"
\n",
" \n",
" \n",
" | \n",
" SUBJECT_RACE | \n",
" SUBJECT_ETHNICITY | \n",
" SUBJECT_RACE/ETHNICITY | \n",
"
\n",
" \n",
" \n",
" \n",
" | 0 | \n",
" WHITE | \n",
" HISPANIC/LATINO | \n",
" WHITE HISPANIC/LATINO | \n",
"
\n",
" \n",
" | 1 | \n",
" BLACK | \n",
" NON-HISPANIC/NON-LATINO | \n",
" BLACK | \n",
"
\n",
" \n",
" | 2 | \n",
" BLACK | \n",
" NON-HISPANIC/NON-LATINO | \n",
" BLACK | \n",
"
\n",
" \n",
" | 3 | \n",
" WHITE | \n",
" HISPANIC/LATINO | \n",
" WHITE HISPANIC/LATINO | \n",
"
\n",
" \n",
" | 4 | \n",
" INDIGENOUS | \n",
" NON-HISPANIC/NON-LATINO | \n",
" INDIGENOUS | \n",
"
\n",
" \n",
"
\n",
"
"
],
"text/plain": [
" SUBJECT_RACE SUBJECT_ETHNICITY SUBJECT_RACE/ETHNICITY\n",
"0 WHITE HISPANIC/LATINO WHITE HISPANIC/LATINO\n",
"1 BLACK NON-HISPANIC/NON-LATINO BLACK\n",
"2 BLACK NON-HISPANIC/NON-LATINO BLACK\n",
"3 WHITE HISPANIC/LATINO WHITE HISPANIC/LATINO\n",
"4 INDIGENOUS NON-HISPANIC/NON-LATINO INDIGENOUS"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"src = opd.Source(\"Virginia\")\n",
"tbl = src.load(table_type=\"STOPS\", year=2022, agency=\"Arlington County Police Department\")\n",
"tbl.standardize(race_eth_combo='concat')\n",
"\n",
"tbl.table[[opd.Column.RACE_SUBJECT, opd.Column.ETHNICITY_SUBJECT, opd.Column.RACE_ETHNICITY_SUBJECT]].head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Customizing Gender Categories\n",
"There are a set of gender categories that are used in a similar manner to the [race categories](#race_cats). `standardize` has a default set of gender category labels that it uses:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'MALE': 'MALE',\n",
" 'FEMALE': 'FEMALE',\n",
" 'TRANSGENDER_MALE': 'TRANSGENDER MALE',\n",
" 'TRANSGENDER_FEMALE': 'TRANSGENDER FEMALE',\n",
" 'TRANSGENDER': 'TRANSGENDER',\n",
" 'GENDER_NONCONFORMING': 'GENDER NON-CONFORMING',\n",
" 'TRANSGENDER_OR_GENDER_NONCONFORMING': 'TRANSGENDER OR GENDER NON-CONFORMING',\n",
" 'GENDER_NONBINARY': 'GENDER NON-BINARY',\n",
" 'OTHER': 'OTHER',\n",
" 'UNKNOWN': 'UNKNOWN',\n",
" 'UNSPECIFIED': 'UNSPECIFIED'}"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"gender_cats = opd.defs.get_gender_cats()\n",
"gender_cats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The gender categories in `gender_cats` are a dictionary. The values of the dictionary are the gender category values that will be used during the standardization. The keys of the dictionary are codes used internally to look up which category label to use. No keys can be used besides the ones provided by `get_gender_cats`. \n",
"\n",
"The names assigned for different gender categories can be customized:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Unique gender values are ['M' 'FEMALE' 'TRANSGENDER FEMALE' 'TRANSGENDER MALE' 'UNSPECIFIED']\n",
"\n",
"In previous examples, raw values for Male were replaced with 'MALE'. In this case, customization resulted in\n",
"replacement with 'M'.\n"
]
}
],
"source": [
"new_gender_cats = gender_cats.copy()\n",
"new_gender_cats[\"MALE\"] = \"M\"\n",
"\n",
"src = opd.Source(\"Berkeley\")\n",
"tbl = src.load(opd.defs.TableType.STOPS, 2022)\n",
"tbl.standardize(gender_cats=new_gender_cats)\n",
"\n",
"\n",
"wrap_print(f\"Unique gender values are {tbl.table[opd.Column.GENDER_SUBJECT].unique()}\")\n",
"wrap_print(f\"In previous examples, raw values for Male were replaced with '{gender_cats['MALE']}'. \"+\\\n",
" f\"In this case, customization resulted in replacement with '{new_gender_cats['MALE']}'.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"More compact gender categories can be generated by inputting `gender_cats='compact` to `standardize`, `gender_cats=opd.defs.get_gender_cats(compact=True)`. The result is that most labels will be abbreviated:"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\n",
"Unique gender values are ['M' 'F' 'TF' 'TM' 'UNSPECIFIED']\n",
"\n",
"In previous examples, raw values for Female were replaced with 'FEMALE'. In this case, customization resulted\n",
"in replacement with 'F'.\n"
]
}
],
"source": [
"compact_gender_cats = opd.defs.get_gender_cats(compact=True)\n",
"\n",
"src = opd.Source(\"Berkeley\")\n",
"tbl = src.load(opd.defs.TableType.STOPS, 2022)\n",
"tbl.standardize(gender_cats='compact')\n",
"\n",
"\n",
"wrap_print(f\"Unique gender values are {tbl.table[opd.Column.GENDER_SUBJECT].unique()}\")\n",
"wrap_print(f\"In previous examples, raw values for Female were replaced with '{gender_cats['FEMALE']}'. \"+\\\n",
" f\"In this case, customization resulted in replacement with '{compact_gender_cats['FEMALE']}'.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Manually Standardizing Column Names\n",
"The `known_cols` input is provided to allow the user to force `standardize` to select certain columns for standardization. This might be useful if:\n",
"\n",
"- by default, the desired column is not selected to be standardized OR \n",
"- if there is more than one possible column for a given column type and the one that `standardize` chooses to standardize is not the one desired by the user.\n",
"\n",
"The value input to `known_cols` should be a dictionary where the keys should be attributes of `opd.defs.columns` and the values are columns in the original table. By default, `known_cols` is {opd.defs.columns.DATE:t.date_field, opd.defs.columns.AGENCY:t.agency_field} where `date_field` and `agency_field` are attributes of the `Table` object that are pulled from OPD's datasets table. If a user is inputting their own `known_cols`, it may be advisable to start from the default input as will be shown below: "
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'DATE': 'INC_DATE', 'AGENCY': None}"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"src = opd.Source(\"Phoenix\")\n",
"tbl = src.load(table_type=\"USE OF FORCE\", year=2022, pbar=False) # pbar=False does not show progress bar\n",
"default_known_cols = {opd.Column.DATE:tbl.date_field, opd.Column.AGENCY:tbl.agency_field}\n",
"default_known_cols"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"OPD's datasets table lists that the Phoenix use of force dataset has a date column called INC_DATE and that it does not have a column that lists the agency (because all data is for the Phoenix Police). By default, `standardize` will assign INC_DATE as a date column because that has already been verified by the OPD datasets table. The Phoenix use of force dataset has multiple columns for subject race and ethnicity:\n",
"\n",
"- CIT_RACE\n",
"- CIT_ETHNICITY\n",
"- SIMPLE_SUBJ_RE_GRP (race/ethnicity group of subject)\n",
"\n",
"With the default inputs, `standardize` selects CIT_RACE to be `SUBJECT_RACE`, CIT_ETHNICITY to be `SUBJECT_ETHNICITY`, and combines those into `SUBJECT_RACE/ETHNICITY`. The generated `SUBJECT_RACE/ETHNICITY` column is actually more descriptive than Phoenix's SIMPLE_SUBJ_RE_GRP column. However, if the user wanted to use SIMPLE_SUBJ_RE_GRP instead, the following could be done:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"orig_column_name: SIMPLE_SUBJ_RE_GRP\n",
"new_column_name: SUBJECT_RACE\n",
"\n",
"Data Transformation Map:\n",
"{'Hispanic': 'HISPANIC/LATINO', 'Black or African American': 'BLACK', 'White': 'WHITE', 'Other': 'OTHER'}\n"
]
}
],
"source": [
"known_cols = default_known_cols.copy()\n",
"known_cols[opd.defs.columns.RACE_SUBJECT] = 'SIMPLE_SUBJ_RE_GRP'\n",
"tbl.standardize(known_cols=known_cols)\n",
"\n",
"s = tbl.get_transform_map(new=opd.Column.RACE_SUBJECT)\n",
"\n",
"print(f\"orig_column_name: {s.orig_column_name}\")\n",
"print(f\"new_column_name: {s.new_column_name}\\n\")\n",
"print(\"Data Transformation Map:\")\n",
"print(s.data_maps)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Other Inputs to Standardize\n",
"There are several other optional inputs to `standardize` that are not described above.\n",
"\n",
"#### `keep_raw`\n",
"If False, raw columns that are standardized will be removed. If True (default), they will be kept and renamed to indcate that they are the original raw columns.\n",
"\n",
"#### `verbose`\n",
"Default is False. If set to True, details of the standardization process will be printed. If set to a filename, details of the standardization will be logged to that file.\n",
"\n",
"#### `no_id`\n",
"Determines how values that are not able to be standardized are handled during standardization of demographics:\n",
"\n",
"- 'keep' (default): Keep the original value\n",
"- 'null': Replace with a null value\n",
"- 'error': Throw an error\n",
"\n",
"#### `merge_date_time`\n",
"If True (default), if standardized date and standardize time columns are identified, they will be merged into a combined datetime column.\n",
"\n",
"#### `empty_time`\n",
"Indicates how null times are treated in a standardized datetime column (if `merge_date_time` is True). If 'NaT', datetime values with null times will be a null value (pd.NaT). If 'ignore', the resulting datetime will be the date value without a time."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "docs-test",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
},
"orig_nbformat": 4
},
"nbformat": 4,
"nbformat_minor": 2
}