Extract-Transform-Load

Extract-Transform-Load functions.

capmetrics_etl.etl.add_ridership_fact(system_facts, fact)

Fills the passed dict with a mapping of ISO 8601 timestamp keys to a ridership data dict. The ridership dict pairs service types with ridership facts, as well as a temporal key associated with a dict containing day_of_week, season, and calendar_year data. Example:

{
    '2015-01-02T03:04:56-00:00': {
        'BUS': 10100,
        'RAIL': 220,
        'temporal': {
            'day_of_week': 'weekday',
            'season': 'winter',
            'calendar_year': 2015
        }
    }
}
Parameters:
  • system_facts (dict) –
  • fact (DailyRidership`) – A DailyRidership instance.
capmetrics_etl.etl.deactivate_current_period(route_number, period, ridership_model, session)

“Deactivates” a ridership metric model by setting its is_current property to False.

Parameters:
  • route_number (int) – A service route’s identifying number.
  • period (dict) – The covered season and year.
  • ridership_model – SQLAlchemy model class that persists a period’s ridership metric.
  • session – SQLAlcemeny session instance.
Returns:

True if an existing instance had its is_current property changed. False otherwise.

Return type:

bool

capmetrics_etl.etl.deactivate_previous_system_ridership_facts(session)

Selects all persisted SystemRidership models that are currently active and commits their is_active property to False.

Parameters:session – SQLAlchemy session.
capmetrics_etl.etl.deactivate_previous_weekly_performance(session)

Selects all persisted WeeklyRidership models that are currently active and commits their is_current property to False.

Parameters:session – SQLAlchemy session.
capmetrics_etl.etl.extract_day_of_week(period_row, period_column, worksheet)

Extracts the day of week for a ridership data column be searching for ‘weekday’ or ‘saturday’ or ‘sunday’ strings under a period header.

Parameters:
  • period_row – Row for the period header. The day of week should be under it.
  • period_column – Column for the period header.
  • worksheet – The worksheet where the search occurs.
Returns:

A string with the day of week if one present; None otherwise.

capmetrics_etl.etl.find_period(worksheet, periods, column_index, minimum_search=10)

Searches for a performance period in a column and places the period information into a passed-in dictionary.

Parameters:
  • worksheet – The spreadsheet where the column search occurs.
  • periods (dict) – The dict holding period data for the worksheet’s columns.
  • column_index (int) – The column searched.
  • minimum_search (int) – The number of rows without a match searched before quitting.
Returns:

True if a period is found.

Return type:

boolean

capmetrics_etl.etl.get_high_ridership_routes(session, timestamp, size=10)

Determines the Routes with highest DailyRidership counts of ridership.

Parameters:
  • session – SQLAlchemy session
  • timestamp – A datetime object serving as the timestamp that will filter ridership data.
  • size – The number of routes to return. The default is 10, so the ‘top ten’ routes are returned.
Returns:

list

capmetrics_etl.etl.get_latest_measurement_timestamp(session)

Queries DailyRidership models to find the latest (most recent) measurement timestamp (which is a datetime object).

Parameters:session – SQLAlchemy session.
Returns:datetime.datetime
capmetrics_etl.etl.get_periods(worksheet, minimum_search=10)

Scans columns for period headers to create a helpful dict that matches season and year info to a worksheet column.

Period info dictionaries include column, season, year, day_of_week keys.

Parameters:
  • worksheet – Excel worksheet searched.
  • minimum_search – The number of columns without a period that are searched before quitting.
Returns:

Period info dicts keyed to column.

Return type:

dict

capmetrics_etl.etl.get_route_info(file_location, worksheet_name)

This function begins by iterating through the rows in a worksheet, searching for route data or headers for route data.

It returns a dict that indicates if it found properly formatted header data for route numbers, names, and service types, as well as the field data for each route’s row.

Parameters:
  • file_location (str) – Where the file with the data spreadsheets is located.
  • worksheet_name (str) – String name of the individual worksheet with cells that will be searched and processed.
Returns:

The function returns a dict with numbers_available (boolean), names_available (boolean), routes (list)

Return type:

dict

capmetrics_etl.etl.handle_ridership_cell(route_number, period, ridership_cell, ridership_model, session, report=None)

Extracts ridership metric and deactivates previous versions of a performance metric for a specific period.

Parameters:
  • route_number (int) – A transit service route number.
  • period (dict) – Includes day_of_week, season, year, timestamp keys.
  • ridership_cell – The spreadsheet cell being handled.
  • ridership_model – The SQLAlchemy model class that will be persisted.
  • session – The SQLAlchemy session.
  • report – An optional ETLReport instance. Default value is None
Returns:

The ETLReport instance if passed into function; None otherwise.

capmetrics_etl.etl.merge_route_data(results)

Ensures that there is only one route info dict per route number. The approach is pretty simple - the last route info dict parsed is the final set value into the returned dict.

Parameters:results (list) – A list of dictionaries with route info keyed to routes.
Returns:A dict of route info dict values keyed to route numbers.
Return type:dict
capmetrics_etl.etl.parse_worksheet_ridership(worksheet, periods, ridership_model, session, report=None)

Parses an Excel worksheet by iterating down rows (routes) and columns (ridership by period) to create/update ridership data.

For each row, the function examines the metrics in each column by iterating the periods dictionary passed into the function through the columns that make up the keys, with additional handling logic finding the data for the row’s route within each column.

Parameters:
  • worksheet – The Excel worksheet to be parsed.
  • periods (dict) – Keyed to the column, with period data as the value.
  • ridership_model – A ridership metric model such as DailyRidership.
  • session – SQLAlchemy database session.
  • report – An optional ETLReport.
capmetrics_etl.etl.run_excel_etl(data_source_file, session, configuration)

Consumes an Excel file with CapMetro data and updates database tables with the file’s data.

Parameters:
  • data_source_file (str) – Location of the Excel file to be analyzed.
  • session – SQLAlchemy session.
  • configuration (dict) – ETL configuration settings.
capmetrics_etl.etl.store_route(session, route_number, route_info, report=None)

Creates or updates a route from passed information.

The route name and service type are converted to all-caps strings.

Parameters:
  • session – An SQLAlchemy session.
  • route_number (str) – The digit-only label for a route number.
  • route_info (dict) – Contains route name and service type data.
  • report – Optional ETLReport model for capturing ETL operations data.
capmetrics_etl.etl.store_system_ridership(system_facts, session)

Creates and saves SystemRidership models from passed data.

Parameters:
  • system_facts (dict) – System ridership data.
  • session – A SQLAlchemy session.
capmetrics_etl.etl.to_service_facts(ridership_facts)

Dictionary key hierarchy for aggregation.

A service_facts dict is keyed to a service type string. That is associated with a season_facts dict which has keys for the different ISO 8601 timestamps for the first day of a fact’s season - regardless of the original day of week for the fact. Each seasonal key is matched with a dict that keys day types in the data with a single ridership number.

Example:

{
    'BUS': {
        '2015-01-02T00:00:00-00:00': {
            'weekday': 10100,
            'saturday': 2300,
            'sunday': 400
        }
    }
}
Parameters:ridership_facts (list) – A list of ridership models.
capmetrics_etl.etl.update_ridership(file_location, worksheet_names, ridership_model, session)
Parameters:
  • file_location (str) – The location of a data store Excel file.
  • worksheet_names (list) – A list of strings with Excel file worksheet names.
  • ridership_model – A ridership model, such as DailyRidership model.
  • session – A SQLAlchemy session.
Returns:

An ETLReport.

capmetrics_etl.etl.update_route_info(file_location, session, worksheets)

Saves latest route model information into database.

Parameters:
  • file_location (str) – Location of Excel file with data.
  • session – SQLAlchemy database session.
  • worksheets (list) – The string names of the worksheets to be searched for route info.
  • timezone – A pytz-generated timezone info object.
Returns:

A report with basic ETL job metrics

Return type:

ETLReport

capmetrics_etl.etl.update_system_ridership(session)

Updates the persisted SystemRidership models; it ‘deactivates’ the existing models and then saves new models.

Parameters:session – An SQLAlchemy session.

Updates SystemTrend models based on existing system trend data for service types.

Parameters:session – SQL Alchemy session.