Extract-Transform-Load

Extract-Transform-Load functions.

capmetrics_etl.etl.calibrate_day_of_week(timestamp, day_of_week)

Adjusts the passed datetime to reflect the desired day of week. The calibration chooses the requested day of week that is present in the week of the original datetime timestamp. In the case of a Monday timestamp being passed with a ‘sunday’ day of week argument, that means that the returned, calibrated Sunday updated onto the timestamp will not be the immediately previous day but instead the end of the week.

Parameters:
  • timestamp – A Python datetime reflecting when a ridership data point occured.
  • day_of_week (str) – ‘weekday’, ‘saturday’, or ‘sunday’ expected. A Monday is used as the substitute for ‘weekday’.
Returns:

The Python datetime timestamp with date that conforms to required day of week.

capmetrics_etl.etl.deactivate_current_period(route_number, period, ridership_model, session)

“Deactivates” a ridership metric model by setting its current property to False.

Parameters:
  • route_number (int) – A service route’s identifying number.
  • period (dict) – The covered season and year.
  • ridership_model – SQLAlchemy model class that persists a period’s ridership metric.
  • session – SQLAlcemeny session instance.
Returns:

True if an existing instance had its current property changed. False otherwise.

Return type:

bool

capmetrics_etl.etl.extract_day_of_week(period_row, period_column, worksheet)

Extracts the day of week for a ridership data column be searching for ‘weekday’ or ‘saturday’ or ‘sunday’ strings under a period header.

Parameters:
  • period_row – Row for the period header. The day of week should be under it.
  • period_column – Column for the period header.
  • worksheet – The worksheet where the search occurs.
Returns:

A string with the day of week if one present; None otherwise.

capmetrics_etl.etl.find_period(worksheet, periods, column_index, minimum_search=10)

Searches for a performance period in a column and places the period information into a passed-in dictionary.

Parameters:
  • worksheet – The spreadsheet where the column search occurs.
  • periods (dict) – The dict holding period data for the worksheet’s columns.
  • column_index (int) – The column searched.
  • minimum_search (int) – The number of rows without a match searched before quitting.
Returns:

True if a period is found.

Return type:

boolean

capmetrics_etl.etl.get_period_timestamp(day_of_week, season, year)

Selects a Python datetime timestamp to represent a ‘season’ as a specific day of the year. The function selects the first day of week matched the passed argument in the first month of the season matching the passed in season.

For ‘winter’, the selected month is January. For ‘spring’, the selected month is April. For ‘summer’, the selected month is July. For fall, the selected month is October. While not perfectly matching the actual seasons, it matches a quarterly schedule that is intuitive.

Parameters:
  • day_of_week (str) – ‘weekday’, ‘saturday’, and ‘sunday’ are expected.
  • season (str) – ‘winter’, ‘spring’, ‘summer’, ‘fall’ are expected.
  • year (int) – The calendar year.
Returns:

A Python datetime object that represents that ‘season’.

capmetrics_etl.etl.get_periods(worksheet, minimum_search=10)

Scans columns for period headers to create a helpful dict that matches season and year info to a worksheet column.

Period info dictionaries include column, season, year, day_of_week keys.

Parameters:
  • worksheet – Excel worksheet searched.
  • minimum_search – The number of columns without a period that are searched before quitting.
Returns:

Period info dicts keyed to column.

Return type:

dict

capmetrics_etl.etl.get_route_info(file_location, worksheet_name)

This function begins by iterating through the rows in a worksheet, searching for route data or headers for route data.

It returns a dict that indicates if it found properly formatted header data for route numbers, names, and service types, as well as the field data for each route’s row.

Parameters:
  • file_location (str) – Where the file with the data spreadsheets is located.
  • worksheet_name (str) – String name of the individual worksheet with cells that will be searched and processed.
Returns:

The function returns a dict with numbers_available (boolean), names_available (boolean), routes (list)

Return type:

dict

capmetrics_etl.etl.handle_ridership_cell(route_number, period, ridership_cell, ridership_model, session, report=None)

Extracts ridership metric and deactivates previous versions of a performance metric for a specific period.

Parameters:
  • route_number (int) – A transit service route number.
  • period (dict) – Includes day_of_week, season, year, timestamp keys.
  • ridership_cell – The spreadsheet cell being handled.
  • ridership_model – The SQLAlchemy model class that will be persisted.
  • session – The SQLAlchemy session.
  • report – An optional ~.models.ETLReport instance. Default value is None
Returns:

The ETLReport instance if passed into function; None otherwise.

capmetrics_etl.etl.merge_route_data(results)

Ensures that there is only one route info dict per route number. The approach is pretty simple - the last route info dict parsed is the final set value into the returned dict.

Parameters:results (list) – A list of dictionaries with route info keyed to routes.
Returns:A dict of route info dict values keyed to route numbers.
Return type:dict
capmetrics_etl.etl.parse_worksheet_ridership(worksheet, periods, ridership_model, session, report=None)

Parses an Excel worksheet by iterating down rows (routes) and columns (ridership by period) to create/update ridership data.

For each row, the function examines the metrics in each column by iterating the periods dictionary passed into the function through the columns that make up the keys, with additional handling logic finding the data for the row’s route within each column.

Parameters:
  • worksheet – The Excel worksheet to be parsed.
  • periods (dict) – Keyed to the column, with period data as the value.
  • ridership_model – A ridership metric model such as DailyRidership.
  • session – SQLAlchemy database session.
  • report – An optional ETLReport.
capmetrics_etl.etl.run_excel_etl(configuration, session)

Consumes an Excel file with CapMetro data and updates database tables with the file’s data.

Parameters:
  • configuration (dict) – ETL configuration settings.
  • session – SQLAlchemy session.
capmetrics_etl.etl.store_route(session, route_number, route_info, report=None)

Creates or updates a route from passed information.

The route name and service type are converted to all-caps strings.

Parameters:
  • session – An SQLAlchemy session.
  • route_number (str) – The digit-only label for a route number.
  • route_info (dict) – Contains route name and service type data.
  • report – Optional ETLReport model for capturing ETL operations data.
capmetrics_etl.etl.update_route_info(file_location, session, worksheets)

Saves latest route model information into database.

Parameters:
  • file_location (str) – Location of Excel file with data.
  • session – SQLAlchemy database session.
  • worksheets (list) – The string names of the worksheets to be searched for route info.
  • timezone – A pytz-generated timezone info object.
Returns:

A report with basic ETL job metrics

Return type:

ETLReport