Data Quality

Data quality assurance functions.

capmetrics_etl.quality.check_for_ridership_columns(file_location, worksheet_names)

Determines if all passed worksheets contain ridership data.

Parameters:
  • file_location (str) – The Excel file to examine.
  • worksheet_names (list) – A list of worksheet name strings.
Returns:

True if all worksheets have ridership data. False otherwise.

Return type:

bool

capmetrics_etl.quality.check_quality(file_location, worksheet_names)

Runs ‘sanity check’ on data quality. Specifically:

  1. Worksheet completeness - Check for the presence of all six of the worksheets from which data is extracted.
  2. Route rows present - Check for at least one data point of route number and route name column in the 6 ridership worksheets.
  3. Ridership columns present - Check for at least one ridership data column in all 6 ridership data worksheets.
Parameters:
  • file_location (str) – The data file location.
  • worksheet_names (list) – A list of string names with the worksheets that will be checked for data quality.
Returns:

True if the worksheets pass the data quality check. False otherwise.

Return type:

bool

capmetrics_etl.quality.check_route_info(file_location, worksheet_name)

Determines if the passed worksheet has route data.

Parameters:
  • file_location (str) – The Excel file to examine.
  • worksheet_name (str) – A worksheet name string.
Returns:

True if all worksheet has route data. False otherwise.

Return type:

bool

capmetrics_etl.quality.check_route_presence(file_location, worksheet_names)

Determines if the passed worksheets have route data.

Parameters:
  • file_location (str) – The Excel file to examine.
  • worksheet_names (list) – A list of worksheet name strings.
Returns:

True if all worksheets have route data. False otherwise.

Return type:

bool

capmetrics_etl.quality.check_worksheet_completeness(file_location, worksheet_names)

Checks for a list of worksheet names for data extraction. For example:

'Ridership by Route Weekday',
'Ridership by Route Saturday',
'Ridership by Route Sunday',
'Riders per Hour Weekday',
'Riders Hour Saturday',
'Riders per Hour Sunday'
Parameters:
  • file_location – The location of the target Excel file.
  • worksheet_names (list) – A list of string representing Excel worksheet names.
Returns:

A boolean indicating whether the required worksheets were found and a list of the worksheet names that were not found (if any).

Return type:

tuple

capmetrics_etl.quality.has_ridership_data_column(worksheet, floor=10)

Determines if passed worksheet contains a ridership data point.

Parameters:
  • worksheet – An Excel worksheet.
  • floor (int) – The minimum number of columns and rows to search for the initial ridership data point.
Returns:

True if worksheet has ridership data. False otherwise.

Return type:

bool