Year Filtering Guide#

Year filtering is necessary in multiple methods (load, load_iter, load_from_csv, get_count, get_agencies) of the Source class that is used to interact with datasets from a particular agency. In most cases, the basic year filtering described in the Getting Started Guide is all that is needed. However, this guide is provided to document how to handle the rare more complex cases.

Year filtering serves 2 purposes:

  • Identifying which single dataset to request data or information from

  • Filtering a dataset for only a specified year or for a date/year range

NOTE: The combination of inputs including year for the load, load_iter, load_from_csv, get_count, and get_agencies methods must specify a single dataset. OPD requires this because the structure and contents of different datasets can differ even between what appear to be the same data from the same source but for 2 different years. After loading data with OPD, you can concatenate datasets manually using pandas.

This guide describes how to handle the following cases:

Most datasets associated with a single source and table type are either contained in a single multi-year dataset or in multiple single year datasets. To better demonstrate year filtering concepts, this guide will mostly use Use of Force datasets from Northampton and Asheville because Northampton consists of a multi-year dataset AND a single year dataset and Asheville consists of 2 multi-year datasets.

[2]:
import openpolicedata as opd
northampton = opd.Source('Northampton')
northampton.datasets[northampton.datasets['TableType']=='USE OF FORCE']
[2]:
State SourceName Agency AgencyFull TableType coverage_start coverage_end last_coverage_check Description source_url readme URL Year DataType date_field dataset_id agency_field min_version query
502 Massachusetts Northampton Northampton Northampton Police Department USE OF FORCE 2014-01-01 2020-12-31 8/13/2023 <NA> https://northamptonpd.com/open-data-portal.html <NA> https://northamptonpd.com/images/ODP%20Spreads... MULTIPLE Excel <NA> <NA> <NA> 0.3.1 NaN
503 Massachusetts Northampton Northampton Northampton Police Department USE OF FORCE 2021-01-01 2021-12-31 8/13/2023 <NA> https://northamptonpd.com/open-data-portal.html <NA> https://northamptonpd.com/images/ODP%20Spreads... 2021 Excel <NA> <NA> <NA> 0.3.1 NaN
[3]:
asheville = opd.Source("Asheville")
uof_datasets = asheville.datasets[asheville.datasets['TableType']=='USE OF FORCE']
uof_datasets
[3]:
State SourceName Agency AgencyFull TableType coverage_start coverage_end last_coverage_check Description source_url readme URL Year DataType date_field dataset_id agency_field min_version query
711 North Carolina Asheville Asheville Asheville Police Department USE OF FORCE 2018-04-12 2020-12-26 05/10/2024 APD Use of Force Data View from 2018 through 2... https://data-avl.opendata.arcgis.com/datasets/... https://docs.google.com/document/d/1sScS5Jez1w... https://services.arcgis.com/aJ16ENn1AaqdFlqx/a... MULTIPLE ArcGIS date_occurred <NA> <NA> <NA> NaN
712 North Carolina Asheville Asheville Asheville Police Department USE OF FORCE 2020-12-16 2024-03-29 05/10/2024 APD Use of Force data starting January 2021 https://data-avl.opendata.arcgis.com/datasets/... https://docs.google.com/document/d/1sScS5Jez1w... https://services.arcgis.com/aJ16ENn1AaqdFlqx/a... MULTIPLE ArcGIS occurred_date <NA> <NA> <NA> NaN

Request single year of single year dataset#

Requesting a single year of a single year dataset simply requires inputting the value in the Year column of the datasets table as the year input:

[24]:
table = northampton.load('USE OF FORCE', 2021)
table
[24]:
state: Massachusetts,
source_name: Northampton,
agency: Northampton,
table_type: USE OF FORCE,
year: 2021,
description: <NA>,
url: https://northamptonpd.com/images/ODP%20Spreadsheets/2021/Use%20of%20Force.xlsx,
source_url: https://northamptonpd.com/open-data-portal.html,
urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/2021/Use%20of%20Force.xlsx'}

Request single year of multi-year dataset#

Requesting a single year of a multi-year dataset requires inputting the desired year as the year input. If the desired year corresponds to a single-year dataset, use the url_contains input to indicate that the multi-year dataset is desired.

[25]:
table = northampton.load('USE OF FORCE', 2020)
table
[25]:
state: Massachusetts,
source_name: Northampton,
agency: Northampton,
table_type: USE OF FORCE,
year: MULTIPLE,
description: <NA>,
url: https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx,
source_url: https://northamptonpd.com/open-data-portal.html,
urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx'}

There are no overlapping years between the multi-year (2014-2020) and single year (2021) datasets for Northampton. However, to demonstrate what would be required if the multi-year dataset also contained 2021 data and the user wanted to request 2021 data from that dataset, let’s enter a substring of the URL (from the URL column of the datasets table) of the multi-year dataset (from the datasets table above) to request 2021 data from the multi-year dataset (the returned data will have 0 rows).

NOTE: Overlap between single year and multi-year datasets does not currently exist in OPD’s datasets as of April 2024.

[26]:
table = northampton.load('USE OF FORCE', 2021, url_contains='2014-2020')
table
[26]:
state: Massachusetts,
source_name: Northampton,
agency: Northampton,
table_type: USE OF FORCE,
year: MULTIPLE,
description: <NA>,
url: https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx,
source_url: https://northamptonpd.com/open-data-portal.html,
urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx'}

Request all of a multi-year dataset#

Requesting the entirety of multi-year dataset requires inputting “MULTIPLE” as the year input. If there are multiple multi-year datasets, use the url_contains input to indicate the desired multi-year dataset.

[27]:
table = northampton.load('USE OF FORCE', "MULTIPLE")
table
[27]:
state: Massachusetts,
source_name: Northampton,
agency: Northampton,
table_type: USE OF FORCE,
year: MULTIPLE,
description: <NA>,
url: https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx,
source_url: https://northamptonpd.com/open-data-portal.html,
urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx'}

The Asheville dataset has multiple multi-year datasets for use of force. Use url_contains to distinguish between them (by inputting all or portion of the URL in the URL column of the datasets table).

[28]:
table = asheville.load('USE OF FORCE', "MULTIPLE", url_contains=uof_datasets.iloc[0]['URL'])
table
[28]:
state: North Carolina,
source_name: Asheville,
agency: Asheville,
table_type: USE OF FORCE,
year: MULTIPLE,
description: APD Use of Force Data View from 2018 through 2020.,
url: https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0,
date_field: date_occurred,
source_url: https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about,
readme: https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing,
urls: {'source_url': 'https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about', 'readme': 'https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing', 'data': 'https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0'}

Request year range of multi-year dataset#

You can request a range of years for multi-year datasets. Below, we will review the different possible cases starting with the most common.

First, let’s request a year range for when there is a single multi-year dataset. Enter a list with the start and stop year as the year.

[29]:
table = northampton.load('USE OF FORCE', [2018,2020])
table
[29]:
state: Massachusetts,
source_name: Northampton,
agency: Northampton,
table_type: USE OF FORCE,
year: MULTIPLE,
description: <NA>,
url: https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx,
source_url: https://northamptonpd.com/open-data-portal.html,
urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx'}

A similar request can be made if there are multiple multi-year datasets. If the requested year range is all within a single dataset, you can just input a year range.

[5]:
table = asheville.load('USE OF FORCE', [2018,2019])
table
[5]:
state: North Carolina,
source_name: Asheville,
agency: Asheville,
table_type: USE OF FORCE,
year: [2018, 2019],
description: APD Use of Force Data View from 2018 through 2020.,
url: https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0,
date_field: date_occurred,
source_url: https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about,
readme: https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing,
urls: {'source_url': 'https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about', 'readme': 'https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing', 'data': 'https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0'}

The url_contains input is required if multi-year datasets overlap AND data is requested in the overlapping region.

[6]:
table = asheville.load('USE OF FORCE', [2019,2020], url_contains=uof_datasets.iloc[0]['URL'])
table
[6]:
state: North Carolina,
source_name: Asheville,
agency: Asheville,
table_type: USE OF FORCE,
year: [2019, 2020],
description: APD Use of Force Data View from 2018 through 2020.,
url: https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0,
date_field: date_occurred,
source_url: https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about,
readme: https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing,
urls: {'source_url': 'https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about', 'readme': 'https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing', 'data': 'https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0'}

Request date range of multi-year dataset#

You can request a range of dates for multi-year datasets. Requesting a date range is very similar to requesting a year range so it is recommended that the filtering by year range section be read first.

To request data for a date range, enter a list with the start and stop date in YYYY-MM-DD format as the year input.

[32]:
table = asheville.load('USE OF FORCE', ['2020-09-13', '2020-10-10'])
print(f"Earliest date in table is {table.table['date_occurred'].min()}")
print(f"Latest date in table is {table.table['date_occurred'].max()}")
Earliest date in table is 2020-09-13 00:00:00
Latest date in table is 2020-10-10 00:00:00

Filtering of Excel and CSV Files with load_iter and get_count#

By default, OPD discourages you (by throwing an error) from using year/date filtering when calling load_iter and get_count for file-based datasets (DataType is CSV or Excel). The reason for this is that these files cannot be filtered by year/date without reading in the whole file. In most cases, it is likely more efficent to:

  • load_iter replacement for CSV and Excel datasets: use load instead

  • get_count replacement for CSV and Excel datasets: use load to load the entire dataset and use pandas operations to filter the data and find the number of rows

However, the load_iter and get_count can be used for Excel and CSV files by setting force=True.

The Chicago STOPS data is a CSV file:

[13]:
chicago = opd.Source('Chicago')
chicago.datasets.loc[chicago.datasets['TableType']=='STOPS', ['State', 'SourceName', 'TableType', 'coverage_start','coverage_end','DataType', 'Year']]
[13]:
State SourceName TableType coverage_start coverage_end DataType Year
300 Illinois Chicago STOPS 2012-01-01 2020-05-16 CSV MULTIPLE

Using get_count for Excel and CSV files:

[14]:
year_range = [2019,2020]
count = chicago.get_count('STOPS', year_range, force=True)
print(f'Number of rows from {year_range[0]} to {year_range[1]} is {count}')

Number of rows from 2019 to 2020 is 762553

If the data will also be loaded, this is a more efficient method for getting the count for Excel and CSV files:

[15]:
table = chicago.load('STOPS', 'MULTIPLE')
table.table.columns

[15]:
Index(['raw_row_number', 'date', 'time', 'location', 'lat', 'lng',
       'geocode_source', 'beat', 'district', 'subject_age', 'subject_race',
       'subject_sex', 'officer_id_hash', 'officer_age', 'officer_race',
       'officer_sex', 'officer_years_of_service', 'unit', 'type', 'violation',
       'arrest_made', 'citation_issued', 'outcome', 'contraband_found',
       'contraband_drugs', 'contraband_weapons', 'contraband_alcohol',
       'contraband_other', 'search_conducted', 'search_person',
       'search_vehicle', 'vehicle_make', 'vehicle_model', 'raw_race',
       'raw_driver_race'],
      dtype='object')

The date column for this dataset is called date.

[21]:
df_filtered = table.table[ (table.table['date'].dt.year>=year_range[0]) & (table.table['date'].dt.year<=year_range[1])]

print(f'Number of rows from {year_range[0]} to {year_range[1]} is {len(df_filtered)}')
Number of rows from 2019 to 2020 is 762553