{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Year Filtering Guide\n", "Year filtering is necessary in multiple methods (load, load_iter, load_from_csv, get_count, get_agencies) of the Source class that is used to interact with datasets from a particular agency. In most cases, the basic year filtering described in the [Getting Started Guide](./index.ipynb) is all that is needed. However, this guide is provided to document how to handle the rare more complex cases.\n", "\n", "Year filtering serves 2 purposes:\n", "\n", "* Identifying which single dataset to request data or information from\n", "* Filtering a dataset for only a specified year or for a date/year range\n", "\n", "> **NOTE**: The combination of inputs including `year` for the load, load_iter, load_from_csv, get_count, and get_agencies methods must specify a single dataset. OPD requires this because the structure and contents of different datasets can differ even between what appear to be the same data from the same source but for 2 different years. After loading data with OPD, you can [concatenate](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html) datasets manually using pandas.\n", "\n", "This guide describes how to handle the following cases:\n", "\n", "* [Request single year of single year dataset](#request-single-year-of-single-year-dataset)\n", "* [Request single year of multi-year dataset](#request-single-year-of-multi-year-dataset)\n", "* [Request all of a multi-year dataset](#request-all-of-a-multi-year-dataset)\n", "* [Request year range of a multi-year dataset](#request-year-range-of-multi-year-dataset)\n", "* [Request date range of a multi-year dataset](#request-date-range-of-multi-year-dataset)\n", "* [Filtering of Excel and CSV Files with load_iter and get_count](#filtering-of-excel-and-csv-files-with-load_iter-and-get_count)\n", "\n", "Most datasets associated with a single source and table type are either contained in a single multi-year dataset or in multiple single year datasets. To better demonstrate year filtering concepts, this guide will mostly use Use of Force datasets from Northampton and Asheville because Northampton consists of a multi-year dataset AND a single year dataset and Asheville consists of 2 multi-year datasets." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "nbsphinx": "hidden" }, "outputs": [], "source": [ "# This cell should have \"nbsphinx\": \"hidden\" in its metadata and not be included in the documentation!\n", "import sys\n", "sys.path.append(\"../../..\")\n", "\n", "import warnings\n", "warnings.filterwarnings(\"ignore\")" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateSourceNameAgencyAgencyFullTableTypecoverage_startcoverage_endlast_coverage_checkYearDescriptionsource_urlreadmeURLDataTypedate_fielddataset_idagency_fieldmin_versionquery
499MassachusettsNorthamptonNorthamptonNorthampton Police DepartmentUSE OF FORCE2014-01-012020-12-318/13/2023MULTIPLE<NA>https://northamptonpd.com/open-data-portal.html<NA>https://northamptonpd.com/images/ODP%20Spreads...Excel<NA><NA><NA>0.3.1NaN
500MassachusettsNorthamptonNorthamptonNorthampton Police DepartmentUSE OF FORCE2021-01-012021-12-318/13/20232021<NA>https://northamptonpd.com/open-data-portal.html<NA>https://northamptonpd.com/images/ODP%20Spreads...Excel<NA><NA><NA>0.3.1NaN
\n", "
" ], "text/plain": [ " State SourceName Agency AgencyFull \\\n", "499 Massachusetts Northampton Northampton Northampton Police Department \n", "500 Massachusetts Northampton Northampton Northampton Police Department \n", "\n", " TableType coverage_start coverage_end last_coverage_check Year \\\n", "499 USE OF FORCE 2014-01-01 2020-12-31 8/13/2023 MULTIPLE \n", "500 USE OF FORCE 2021-01-01 2021-12-31 8/13/2023 2021 \n", "\n", " Description source_url readme \\\n", "499 https://northamptonpd.com/open-data-portal.html \n", "500 https://northamptonpd.com/open-data-portal.html \n", "\n", " URL DataType date_field \\\n", "499 https://northamptonpd.com/images/ODP%20Spreads... Excel \n", "500 https://northamptonpd.com/images/ODP%20Spreads... Excel \n", "\n", " dataset_id agency_field min_version query \n", "499 0.3.1 NaN \n", "500 0.3.1 NaN " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import openpolicedata as opd\n", "northampton = opd.Source('Northampton')\n", "northampton.datasets[northampton.datasets['TableType']=='USE OF FORCE']" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateSourceNameAgencyAgencyFullTableTypecoverage_startcoverage_endlast_coverage_checkYearDescriptionsource_urlreadmeURLDataTypedate_fielddataset_idagency_fieldmin_versionquery
729North CarolinaAshevilleAshevilleAsheville Police DepartmentUSE OF FORCE2018-04-122020-12-2605/10/2024MULTIPLEAPD Use of Force Data View from 2018 through 2...https://data-avl.opendata.arcgis.com/datasets/...https://docs.google.com/document/d/1sScS5Jez1w...https://services.arcgis.com/aJ16ENn1AaqdFlqx/a...ArcGISdate_occurred<NA><NA>0.7NaN
730North CarolinaAshevilleAshevilleAsheville Police DepartmentUSE OF FORCE2020-12-162024-03-2905/10/2024MULTIPLEAPD Use of Force data starting January 2021https://data-avl.opendata.arcgis.com/datasets/...https://docs.google.com/document/d/1sScS5Jez1w...https://services.arcgis.com/aJ16ENn1AaqdFlqx/a...ArcGISoccurred_date<NA><NA>0.7NaN
\n", "
" ], "text/plain": [ " State SourceName Agency AgencyFull \\\n", "729 North Carolina Asheville Asheville Asheville Police Department \n", "730 North Carolina Asheville Asheville Asheville Police Department \n", "\n", " TableType coverage_start coverage_end last_coverage_check Year \\\n", "729 USE OF FORCE 2018-04-12 2020-12-26 05/10/2024 MULTIPLE \n", "730 USE OF FORCE 2020-12-16 2024-03-29 05/10/2024 MULTIPLE \n", "\n", " Description \\\n", "729 APD Use of Force Data View from 2018 through 2... \n", "730 APD Use of Force data starting January 2021 \n", "\n", " source_url \\\n", "729 https://data-avl.opendata.arcgis.com/datasets/... \n", "730 https://data-avl.opendata.arcgis.com/datasets/... \n", "\n", " readme \\\n", "729 https://docs.google.com/document/d/1sScS5Jez1w... \n", "730 https://docs.google.com/document/d/1sScS5Jez1w... \n", "\n", " URL DataType \\\n", "729 https://services.arcgis.com/aJ16ENn1AaqdFlqx/a... ArcGIS \n", "730 https://services.arcgis.com/aJ16ENn1AaqdFlqx/a... ArcGIS \n", "\n", " date_field dataset_id agency_field min_version query \n", "729 date_occurred 0.7 NaN \n", "730 occurred_date 0.7 NaN " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "asheville = opd.Source(\"Asheville\")\n", "uof_datasets = asheville.datasets[asheville.datasets['TableType']=='USE OF FORCE']\n", "uof_datasets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Request single year of single year dataset\n", "Requesting a single year of a single year dataset simply requires inputting the value in the `Year` column of the datasets table as the `year` input:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: Massachusetts,\n", "source_name: Northampton,\n", "agency: Northampton,\n", "table_type: USE OF FORCE,\n", "year: 2021,\n", "description: ,\n", "url: https://northamptonpd.com/images/ODP%20Spreadsheets/2021/Use%20of%20Force.xlsx,\n", "source_url: https://northamptonpd.com/open-data-portal.html,\n", "urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/2021/Use%20of%20Force.xlsx'}" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = northampton.load('USE OF FORCE', 2021)\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Request single year of multi-year dataset\n", "Requesting a single year of a multi-year dataset requires inputting the desired year as the `year` input. If the desired year corresponds to a single-year dataset, use the `url_contains` input to indicate that the multi-year dataset is desired." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: Massachusetts,\n", "source_name: Northampton,\n", "agency: Northampton,\n", "table_type: USE OF FORCE,\n", "year: MULTIPLE,\n", "description: ,\n", "url: https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx,\n", "source_url: https://northamptonpd.com/open-data-portal.html,\n", "urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx'}" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = northampton.load('USE OF FORCE', 2020)\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are no overlapping years between the multi-year (2014-2020) and single year (2021) datasets for Northampton. However, to demonstrate what would be required if the multi-year dataset also contained 2021 data and the user wanted to request 2021 data from that dataset, let's enter a substring of the URL (from the `URL` column of the datasets table) of the multi-year dataset (from the datasets table above) to request 2021 data from the multi-year dataset (the returned data will have 0 rows).\n", "\n", "> NOTE: Overlap between single year and multi-year datasets does not currently exist in OPD's datasets as of April 2024." ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: Massachusetts,\n", "source_name: Northampton,\n", "agency: Northampton,\n", "table_type: USE OF FORCE,\n", "year: MULTIPLE,\n", "description: ,\n", "url: https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx,\n", "source_url: https://northamptonpd.com/open-data-portal.html,\n", "urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx'}" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = northampton.load('USE OF FORCE', 2021, url_contains='2014-2020')\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Request all of a multi-year dataset\n", "Requesting the entirety of multi-year dataset requires inputting \"MULTIPLE\" as the `year` input. If there are multiple multi-year datasets, use the `url_contains` input to indicate the desired multi-year dataset." ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: Massachusetts,\n", "source_name: Northampton,\n", "agency: Northampton,\n", "table_type: USE OF FORCE,\n", "year: MULTIPLE,\n", "description: ,\n", "url: https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx,\n", "source_url: https://northamptonpd.com/open-data-portal.html,\n", "urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx'}" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = northampton.load('USE OF FORCE', \"MULTIPLE\")\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Asheville data has multiple multi-year datasets for use of force. Use `url_contains` to distinguish between them (by inputting all or portion of the URL in the `URL` column of the datasets table)." ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: North Carolina,\n", "source_name: Asheville,\n", "agency: Asheville,\n", "table_type: USE OF FORCE,\n", "year: MULTIPLE,\n", "description: APD Use of Force Data View from 2018 through 2020.,\n", "url: https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0,\n", "date_field: date_occurred,\n", "source_url: https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about,\n", "readme: https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing,\n", "urls: {'source_url': 'https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about', 'readme': 'https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing', 'data': 'https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0'}" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = asheville.load('USE OF FORCE', \"MULTIPLE\", url_contains=uof_datasets.iloc[0]['URL'])\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Mesa, AZ data also has multiple multi-year datasets (for calls for service). However, the URL cannot be used to distinguish between them because their URLs are the same." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
SourceNameTableTypecoverage_startcoverage_endYearDataTypeURLdataset_id
6MesaCALLS FOR SERVICE2017-01-012020-12-31MULTIPLESocratadata.mesaaz.govex94-c5ad
7MesaCALLS FOR SERVICE2021-01-012024-06-20MULTIPLESocratadata.mesaaz.govizhu-764k
\n", "
" ], "text/plain": [ " SourceName TableType coverage_start coverage_end Year \\\n", "6 Mesa CALLS FOR SERVICE 2017-01-01 2020-12-31 MULTIPLE \n", "7 Mesa CALLS FOR SERVICE 2021-01-01 2024-06-20 MULTIPLE \n", "\n", " DataType URL dataset_id \n", "6 Socrata data.mesaaz.gov ex94-c5ad \n", "7 Socrata data.mesaaz.gov izhu-764k " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mesa = opd.Source('Mesa')\n", "mesa.datasets.loc[mesa.datasets['TableType']=='CALLS FOR SERVICE', ['SourceName','TableType','coverage_start','coverage_end','Year','DataType','URL','dataset_id']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The URL is the same because for some data types, such as Socrata, a dataset ID is used with the URL to access the dataset. In this case, set the `id_contains` to the dataset ID to distinguish between the datasets." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: Arizona,\n", "source_name: Mesa,\n", "agency: Mesa,\n", "table_type: CALLS FOR SERVICE,\n", "year: MULTIPLE,\n", "description: ,\n", "url: data.mesaaz.gov,\n", "date_field: received_date_time,\n", "source_url: https://data.mesaaz.gov/Police/Police-Dispatch-Events-2021-present/izhu-764k/about_data,\n", "readme: https://data.mesaaz.gov/Police/Police-Dispatch-Events-2021-present/izhu-764k/about_data,\n", "urls: {'source_url': 'https://data.mesaaz.gov/Police/Police-Dispatch-Events-2021-present/izhu-764k/about_data', 'readme': 'https://data.mesaaz.gov/Police/Police-Dispatch-Events-2021-present/izhu-764k/about_data', 'data': 'data.mesaaz.gov'}" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ids = mesa.datasets.loc[mesa.datasets['TableType']=='CALLS FOR SERVICE','dataset_id']\n", "table = mesa.load('CALLS FOR SERVICE', \"MULTIPLE\", id_contains=ids.iloc[1], nrows=1000)\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Request year range of multi-year dataset\n", "You can request a range of years for multi-year datasets. Below, we will review the different possible cases starting with the most common.\n", "\n", "First, let's request a year range for when there is a single multi-year dataset. Enter a list with the start and stop year as the `year`." ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: Massachusetts,\n", "source_name: Northampton,\n", "agency: Northampton,\n", "table_type: USE OF FORCE,\n", "year: MULTIPLE,\n", "description: ,\n", "url: https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx,\n", "source_url: https://northamptonpd.com/open-data-portal.html,\n", "urls: {'source_url': 'https://northamptonpd.com/open-data-portal.html', 'readme': None, 'data': 'https://northamptonpd.com/images/ODP%20Spreadsheets/NPD_Use_of_Force_2014-2020_incident_level_data.xlsx'}" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = northampton.load('USE OF FORCE', [2018,2020])\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A similar request can be made if there are multiple multi-year datasets. If the requested year range is all within a single dataset, you can just input a year range." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: North Carolina,\n", "source_name: Asheville,\n", "agency: Asheville,\n", "table_type: USE OF FORCE,\n", "year: [2018, 2019],\n", "description: APD Use of Force Data View from 2018 through 2020.,\n", "url: https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0,\n", "date_field: date_occurred,\n", "source_url: https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about,\n", "readme: https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing,\n", "urls: {'source_url': 'https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about', 'readme': 'https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing', 'data': 'https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0'}" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = asheville.load('USE OF FORCE', [2018,2019])\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `url_contains` input is required if multi-year datasets overlap AND data is requested in the overlapping region. " ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "state: North Carolina,\n", "source_name: Asheville,\n", "agency: Asheville,\n", "table_type: USE OF FORCE,\n", "year: [2019, 2020],\n", "description: APD Use of Force Data View from 2018 through 2020.,\n", "url: https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0,\n", "date_field: date_occurred,\n", "source_url: https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about,\n", "readme: https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing,\n", "urls: {'source_url': 'https://data-avl.opendata.arcgis.com/datasets/7bc4d78e6bc64f2c9ca02fb2486aad0f_0/about', 'readme': 'https://docs.google.com/document/d/1sScS5Jez1w8I-DUb6bagsj8efad7tDtIkbebicTXUOM/edit?usp=sharing', 'data': 'https://services.arcgis.com/aJ16ENn1AaqdFlqx/arcgis/rest/services/APDUseOfForce/FeatureServer/0'}" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = asheville.load('USE OF FORCE', [2019,2020], url_contains=uof_datasets.iloc[0]['URL'])\n", "table" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Request date range of multi-year dataset\n", "You can request a range of dates for multi-year datasets. Requesting a date range is very similar to requesting a year range so it is recommended that the [filtering by year range](#request-year-range-of-multi-year-dataset) section be read first. \n", "\n", "To request data for a date range, enter a list with the start and stop date in YYYY-MM-DD format as the `year` input." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Earliest date in table is 2020-09-13 00:00:00\n", "Latest date in table is 2020-10-10 00:00:00\n" ] } ], "source": [ "table = asheville.load('USE OF FORCE', ['2020-09-13', '2020-10-10'])\n", "print(f\"Earliest date in table is {table.table['date_occurred'].min()}\")\n", "print(f\"Latest date in table is {table.table['date_occurred'].max()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering of Excel and CSV Files with load_iter and get_count\n", "By default, OPD discourages you (by throwing an error) from using year/date filtering when calling `load_iter` and `get_count` for file-based datasets (`DataType` is CSV or Excel). The reason for this is that these files cannot be filtered by year/date without reading in the whole file. In most cases, it is likely more efficent to:\n", "\n", "* `load_iter` replacement for CSV and Excel datasets: use `load` instead\n", "* `get_count` replacement for CSV and Excel datasets: use `load` to load the entire dataset and use pandas operations to filter the data and find the number of rows\n", "\n", "However, the `load_iter` and `get_count` can be used for Excel and CSV files by setting `force=True`.\n", "\n", "The Chicago STOPS data is a CSV file:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
StateSourceNameTableTypecoverage_startcoverage_endDataTypeYear
300IllinoisChicagoSTOPS2012-01-012020-05-16CSVMULTIPLE
\n", "
" ], "text/plain": [ " State SourceName TableType coverage_start coverage_end DataType \\\n", "300 Illinois Chicago STOPS 2012-01-01 2020-05-16 CSV \n", "\n", " Year \n", "300 MULTIPLE " ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "chicago = opd.Source('Chicago')\n", "chicago.datasets.loc[chicago.datasets['TableType']=='STOPS', ['State', 'SourceName', 'TableType', 'coverage_start','coverage_end','DataType', 'Year']]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using `get_count` for Excel and CSV files:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Number of rows from 2019 to 2020 is 762553\n" ] } ], "source": [ "year_range = [2019,2020]\n", "count = chicago.get_count('STOPS', year_range, force=True)\n", "print(f'Number of rows from {year_range[0]} to {year_range[1]} is {count}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If the data will also be loaded, this is a more efficient method for getting the count for Excel and CSV files:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ " \r" ] }, { "data": { "text/plain": [ "Index(['raw_row_number', 'date', 'time', 'location', 'lat', 'lng',\n", " 'geocode_source', 'beat', 'district', 'subject_age', 'subject_race',\n", " 'subject_sex', 'officer_id_hash', 'officer_age', 'officer_race',\n", " 'officer_sex', 'officer_years_of_service', 'unit', 'type', 'violation',\n", " 'arrest_made', 'citation_issued', 'outcome', 'contraband_found',\n", " 'contraband_drugs', 'contraband_weapons', 'contraband_alcohol',\n", " 'contraband_other', 'search_conducted', 'search_person',\n", " 'search_vehicle', 'vehicle_make', 'vehicle_model', 'raw_race',\n", " 'raw_driver_race'],\n", " dtype='object')" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "table = chicago.load('STOPS', 'MULTIPLE')\n", "table.table.columns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The date column for this dataset is called `date`." ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of rows from 2019 to 2020 is 762553\n" ] } ], "source": [ "df_filtered = table.table[ (table.table['date'].dt.year>=year_range[0]) & (table.table['date'].dt.year<=year_range[1])]\n", "\n", "print(f'Number of rows from {year_range[0]} to {year_range[1]} is {len(df_filtered)}')" ] } ], "metadata": { "kernelspec": { "display_name": "opd-py3.12", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 2 }