Data Quality¶
Data quality assurance functions.
-
capmetrics_etl.quality.check_for_ridership_columns(file_location, worksheet_names)¶ Determines if all passed worksheets contain ridership data.
Parameters: - file_location (str) – The Excel file to examine.
- worksheet_names (list) – A list of worksheet name strings.
Returns: Trueif all worksheets have ridership data.Falseotherwise.Return type: bool
-
capmetrics_etl.quality.check_quality(file_location, worksheet_names)¶ Runs ‘sanity check’ on data quality. Specifically:
- Worksheet completeness - Check for the presence of all six of the worksheets from which data is extracted.
- Route rows present - Check for at least one data point of route number and route name column in the 6 ridership worksheets.
- Ridership columns present - Check for at least one ridership data column in all 6 ridership data worksheets.
Parameters: - file_location (str) – The data file location.
- worksheet_names (list) – A list of string names with the worksheets that will be checked for data quality.
Returns: Trueif the worksheets pass the data quality check.Falseotherwise.Return type: bool
-
capmetrics_etl.quality.check_route_info(file_location, worksheet_name)¶ Determines if the passed worksheet has route data.
Parameters: - file_location (str) – The Excel file to examine.
- worksheet_name (str) – A worksheet name string.
Returns: Trueif all worksheet has route data.Falseotherwise.Return type: bool
-
capmetrics_etl.quality.check_route_presence(file_location, worksheet_names)¶ Determines if the passed worksheets have route data.
Parameters: - file_location (str) – The Excel file to examine.
- worksheet_names (list) – A list of worksheet name strings.
Returns: Trueif all worksheets have route data.Falseotherwise.Return type: bool
-
capmetrics_etl.quality.check_worksheet_completeness(file_location, worksheet_names)¶ Checks for a list of worksheet names for data extraction. For example:
'Ridership by Route Weekday', 'Ridership by Route Saturday', 'Ridership by Route Sunday', 'Riders per Hour Weekday', 'Riders Hour Saturday', 'Riders per Hour Sunday'
Parameters: - file_location – The location of the target Excel file.
- worksheet_names (list) – A list of string representing Excel worksheet names.
Returns: A boolean indicating whether the required worksheets were found and a list of the worksheet names that were not found (if any).
Return type: tuple
-
capmetrics_etl.quality.has_ridership_data_column(worksheet, floor=10)¶ Determines if passed worksheet contains a ridership data point.
Parameters: - worksheet – An Excel worksheet.
- floor (int) – The minimum number of columns and rows to search for the initial ridership data point.
Returns: Trueif worksheet has ridership data.Falseotherwise.Return type: bool