Process mining techniques generate valuable insights into business processes using automatically generated process execution data. However, despite the extensive opportunities that process mining provides, the "garbage in, garbage out" principle still applies. Data quality issues are widespread in real-life data and can produce misleading results when the data is used for analysis. Currently, there is no systematic way to perform data quality assessment on process-oriented data. To fill this gap, we introduce DaQAPO (Data Quality Assessment for Process-Oriented data). It provides a set of assessment functions to identify a wide array of quality issues.
We identify two stages in the data quality assessment process: reading and preparing the data, and assessing its quality.
Users who wish to remove the anomalies detected by the quality tests can do so.
Before we can perform the first stage, reading data, we need access to the appropriate data sources and knowledge of the expected data structure. Our package supports two input data formats: an activity log and an event log.
Two example datasets are included in `daqapo`. These are `hospital` and `hospital_events`. Below, you can find their respective structures.
str(hospital)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 53 obs. of 7 variables:
#> $ patient_visit_nr: num 510 512 510 512 512 510 517 518 518 518 ...
#> $ activity : chr "registration" "Registration" "Triage" "Triage" ...
#> $ originator : chr "Clerk 9" "Clerk 12" "Nurse 27" "Nurse 27" ...
#> $ start_ts : chr "20/11/2017 10:18:17" "20/11/2017 10:33:14" "20/11/2017 10:34:08" "20/11/2017 10:44:12" ...
#> $ complete_ts : chr "20/11/2017 10:20:06" "20/11/2017 10:37:00" "20/11/2017 10:41:48" "20/11/2017 10:50:17" ...
#> $ triagecode : num 3 3 3 3 3 NA 3 4 4 4 ...
#> $ specialization : chr "TRAU" "URG" "TRAU" "URG" ...
str(hospital_events)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 106 obs. of 8 variables:
#> $ patient_visit_nr : num 510 510 510 510 510 510 512 512 512 512 ...
#> $ activity : chr "registration" "registration" "Triage" "Triage" ...
#> $ originator : chr "Clerk 9" "Clerk 9" "Nurse 27" "Nurse 27" ...
#> $ event_lifecycle_state: chr "start" "complete" "start" "complete" ...
#> $ timestamp : chr "20/11/2017 10:18:17" "20/11/2017 10:20:06" "20/11/2017 10:34:08" "20/11/2017 10:41:48" ...
#> $ triagecode : num 3 3 3 3 NA NA 3 3 3 3 ...
#> $ specialization : chr "TRAU" "TRAU" "TRAU" "TRAU" ...
#> $ event_matching : num 1 1 1 1 1 1 1 1 1 1 ...
Both datasets were artificially created merely to illustrate the package’s functionalities.
First of all, data must be read and prepared such that the quality assessment tests can be executed. Data preparation requires transforming the dataset to a standardised activity log format. However, earlier we mentioned two input data formats: an activity log and an event log. When an event log is available, it needs to be converted to an activity log. `daqapo` provides a set of functions, with the aid of `bupaR`, to assist the user in this process.
As mentioned earlier, the goal of reading and preparing data is to obtain a standardised activity log format. When your source data is already in this format, preparations come down to the following elements:
- `POSIXct` timestamp format

For this section, the dataset `hospital` will be used to illustrate data preparation. Three main functions help users prepare their own dataset:

- `rename`
- `convert_timestamps`
- `activitylog`
The activity log object adds a mapping to the data frame to link each column with its specific meaning. In this regard, the timestamp columns each represent a different lifecycle state. `daqapo` must know which column is which, requiring standardised timestamp names. The accepted timestamp values are:
The two timestamps required by `daqapo` are `start` and `complete`.
hospital %>% rename(start = start_ts, complete = complete_ts) -> hospital
Each timestamp must also be in the `POSIXct` format.
hospital %>% convert_timestamps(c("start","complete"), format = dmy_hms) -> hospital
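To make the target format concrete, the following base-R sketch parses two of the day/month/year strings shown in the structure of `hospital` into `POSIXct` values. It does not use `convert_timestamps`; the variable names and the choice of the UTC time zone are assumptions made for illustration only.

```r
# Base-R sketch: parse day/month/year timestamp strings into POSIXct.
# The strings are the start and complete times of the first row of `hospital`.
raw_ts <- c("20/11/2017 10:18:17", "20/11/2017 10:20:06")

# Assumption: UTC is used here only to keep the example reproducible.
parsed <- as.POSIXct(raw_ts, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")

# Once parsed, durations can be computed directly.
duration_secs <- as.numeric(difftime(parsed[2], parsed[1], units = "secs"))
print(class(parsed))
print(duration_secs)
```

Strings left as `character` cannot be subtracted this way, which is why the conversion has to happen before any duration-based test.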
When the timestamps are edited to the desired format, the activity log object can be created along with the required mapping.
hospital %>%
  activitylog(case_id = "patient_visit_nr",
              activity_id = "activity",
              resource_id = "originator",
              lifecycle_ids = c("start", "complete")) -> hospital
With event logs, things are a bit more complex. In an event log, each row represents only a part of an activity instance. Therefore, more complex data transformations must be executed and several problems could arise. In this section, we will use an event log variant of the activity log used earlier, named `hospital_events`.
hospital_events
#> # A tibble: 106 x 8
#> patient_visit_nr activity originator event_lifecycle~ timestamp
#> 1 510 registr~ Clerk 9 start 20/11/20~
#> 2 510 registr~ Clerk 9 complete 20/11/20~
#> 3 510 Triage Nurse 27 start 20/11/20~
#> 4 510 Triage Nurse 27 complete 20/11/20~
#> 5 510 Clinica~ Doctor 7 start 20/11/20~
#> 6 510 Clinica~ Doctor 4 complete 20/11/20~
#> 7 512 Registr~ Clerk 12 start 20/11/20~
#> 8 512 Registr~ Clerk 12 complete 20/11/20~
#> 9 512 Triage Nurse 27 start 20/11/20~
#> 10 512 Triage Nurse 27 complete 20/11/20~
#> # ... with 96 more rows, and 3 more variables: triagecode,
#> # specialization, event_matching
The same principle regarding the timestamps applies. Therefore, the `POSIXct` format must be applied in advance. Additionally, the event log object also requires an activity instance id. If needed, one can be created manually as illustrated below.
The following functions form the building blocks of the required data preparation, although not all of them need to be called in every case to obtain a fully prepared event log:

- `convert_timestamps`
- `assign_instance_id`
- `check/fix_resource_inconsistencies`
- `standardize_lifecycle`
- `events_to_activitylog`
hospital_events %>%
  convert_timestamps(c("timestamp"), format = dmy_hms) %>%
  mutate(event_matching = paste(patient_visit_nr, activity, event_matching)) %>%
  events_to_activitylog(case_id = "patient_visit_nr",
                        activity_id = "activity",
                        activity_instance_id = "event_matching",
                        timestamp = "timestamp",
                        resource_id = "originator",
                        lifecycle_id = "event_lifecycle_state") -> hospital_events
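Conceptually, converting an event log to an activity log means pairing the start and complete events of one activity instance into a single row. The base-R sketch below illustrates that idea on four events copied from the first rows of `hospital_events`. It is not how `events_to_activitylog` is implemented; the helper column `instance_id` and the use of `merge` are assumptions chosen for illustration.

```r
# Toy event log: one row per event (values taken from hospital_events).
events <- data.frame(
  patient_visit_nr = c(510, 510, 510, 510),
  activity = c("registration", "registration", "Triage", "Triage"),
  event_lifecycle_state = c("start", "complete", "start", "complete"),
  timestamp = c("20/11/2017 10:18:17", "20/11/2017 10:20:06",
                "20/11/2017 10:34:08", "20/11/2017 10:41:48"),
  event_matching = c(1, 1, 1, 1),
  stringsAsFactors = FALSE
)

# Build an instance id that is unique across cases and activities.
events$instance_id <- paste(events$patient_visit_nr, events$activity,
                            events$event_matching)

# Separate start and complete events, then join them on the instance id.
is_start <- events$event_lifecycle_state == "start"
starts <- events[is_start,
                 c("instance_id", "patient_visit_nr", "activity", "timestamp")]
completes <- events[!is_start, c("instance_id", "timestamp")]
names(starts)[names(starts) == "timestamp"] <- "start"
names(completes)[names(completes) == "timestamp"] <- "complete"

# One row per activity instance, with a start and a complete column.
activities <- merge(starts, completes, by = "instance_id")
print(activities)
```

The instance id matters because the raw `event_matching` value repeats across cases and activities; pasting it together with the case and activity makes the join key unambiguous.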
The table below summarizes the different data quality assessment tests available in `daqapo`, after which each test will be briefly demonstrated.
Function name | Description | Output |
---|---|---|
detect_activity_frequency_violations | Function that detects activity frequency anomalies per case | Summary in console + Returns activities in cases which are executed too many times |
detect_activity_order_violations | Function detecting violations in activity order | Summary in console + Returns detected orders which violate the specified order |
detect_attribute_dependencies | Function detecting violations of dependencies between attributes (i.e. condition(s) that should hold when (an)other condition(s) hold(s)) | Summary in console + Returns rows with dependency violations |
detect_case_id_sequence_gaps | Function detecting gaps in the sequence of case identifiers | Summary in console + Returns case IDs which should be expected to be present |
detect_conditional_activity_presence | Function detecting violations of conditional activity presence (i.e. activity/activities that should be present when (a) particular condition(s) hold(s)) | Summary in console + Returns cases violating conditional activity presence |
detect_duration_outliers | Function detecting duration outliers for a particular activity | Summary in console + Returns rows with outliers |
detect_inactive_periods | Function detecting inactive periods, i.e. periods of time in which no activity executions/arrivals are recorded | Summary in console + Returns periods of inactivity |
detect_incomplete_cases | Function detecting incomplete cases in terms of the activities that need to be recorded for a case | Summary in console + Returns traces in which the mentioned activities are not present |
detect_incorrect_activity_names | Function returning the incorrect activity labels in the log | Summary in console + Returns rows with incorrect activities |
detect_missing_values | Function detecting missing values at different levels of aggregation | Summary in console + Returns rows with NAs |
detect_multiregistration | Function detecting the registration of a series of events in a short time period for the same case or by the same resource | Summary in console + Returns rows with multiregistration on resource or case level |
detect_overlaps | Checks if a resource has performed two activities in parallel | Data frame containing the activities, the number of overlaps and average overlap in minutes |
detect_related_activities | Function detecting missing related activities, i.e. activities that should be registered because another activity is registered for a case | Summary in console + Returns cases violating related activities |
detect_similar_labels | Function detecting potential spelling mistakes | Table showing similarities for each label |
detect_time_anomalies | Function detecting activity executions with negative or zero duration | Summary in console + Returns rows with negative or zero durations |
detect_unique_values | Function listing all distinct combinations of the given log attributes | Summary in console + Returns all unique combinations of values in given columns |
detect_value_range_violations | Function detecting violations of the range of acceptable values | Summary in console + Returns rows with value range infringements |
hospital %>% detect_activity_frequency_violations("Registration" = 1, "Clinical exam" = 1)
#> *** OUTPUT ***
#> For 3 cases in the activity log (13.6363636363636%) an anomaly is detected.
#> The anomalies are spread over the following cases:
#> # A tibble: 3 x 3
#> patient_visit_nr activity n
#> 1 518 Registration 3
#> 2 512 Clinical exam 2
#> 3 535 Registration 2
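The idea behind this check can be sketched in base R as counting how often each activity occurs per case and comparing the count with an allowed maximum. The data frame `toy_log` and the vector `max_allowed` below are hypothetical, and the sketch makes no claim about how the package computes its result.

```r
# Hypothetical toy log: case 518 has three registrations, case 512 two exams.
toy_log <- data.frame(
  case = c(518, 518, 518, 512, 512, 510),
  activity = c("Registration", "Registration", "Registration",
               "Clinical exam", "Clinical exam", "Registration"),
  stringsAsFactors = FALSE
)

# Maximum number of times each activity may occur within one case.
max_allowed <- c("Registration" = 1, "Clinical exam" = 1)

# Count occurrences per case and activity.
counts <- as.data.frame(table(case = toy_log$case, activity = toy_log$activity),
                        stringsAsFactors = FALSE)

# Keep the combinations that exceed the allowed frequency.
violations <- counts[counts$Freq > max_allowed[counts$activity], ]
print(violations)
```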
hospital %>% detect_activity_order_violations(activity_order = c("Registration", "Triage", "Clinical exam", "Treatment", "Treatment evaluation"))
#> Warning in detect_activity_order_violations.activitylog(., activity_order
#> = c("Registration", : Some activity instances within the same case overlap.
#> Use detect_overlaps to investigate further.
#> Warning in detect_activity_order_violations.activitylog(., activity_order
#> = c("Registration", : Not all specified activities occur in each case. Use
#> detect_incomplete_cases to investigate further.
#> Selected timestamp parameter value: both
#> *** OUTPUT ***
#> It was checked whether the activity order Registration - Triage - Clinical exam - Treatment - Treatment evaluation is respected.
#> This activity order is respected for 18 (81.82%) of the cases and not for4 (18.18%) of the cases.
#> For cases for which the aformentioned activity order is not respected, the following order is detected (ordered by decreasing frequeny of occurrence):
#> # A tibble: 4 x 3
#> activity_list n case_ids
#> 1 Registration - Registration - Registration 1 518
#> 2 Registration - Registration - Triage - Clinical exam - Tr~ 1 535
#> 3 Registration - Triage - Clinical exam - Clinical exam 1 512
#> 4 Triage - Registration 1 521
hospital %>% detect_attribute_dependencies(antecedent = activity == "Registration", consequent = startsWith(originator,"Clerk"))
#> *** OUTPUT ***
#> The following statement was checked: if condition(s) ~activity == "Registration" hold(s), then ~startsWith(originator, "Clerk") should also hold.
#> This statement holds for 12 (85.71%) of the rows in the activity log for which the first condition(s) hold and does not hold for 2 (14.29%) of these rows.
#> For the following rows, the first condition(s) hold(s), but the second condition does not:
#> # A tibble: 2 x 7
#> patient_visit_nr activity originator start
#> 1 528 Registr~ Nurse 6 2017-11-21 18:10:17
#> 2 534 Registr~ <NA> 2017-11-22 18:35:00
#> # ... with 3 more variables: complete, triagecode,
#> # specialization
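An attribute dependency is an "if antecedent, then consequent" rule evaluated row by row. The base-R sketch below shows the logic on a hypothetical data frame `toy_log`. Treating a missing originator as a violation is an assumption made here; it matches the output above, where the row without an originator is reported.

```r
# Hypothetical rows: two registrations violate the "Clerk" rule,
# one of them because the originator is missing.
toy_log <- data.frame(
  activity = c("Registration", "Registration", "Registration", "Triage"),
  originator = c("Clerk 9", "Nurse 6", NA, "Nurse 27"),
  stringsAsFactors = FALSE
)

antecedent <- toy_log$activity == "Registration"
consequent <- startsWith(toy_log$originator, "Clerk")  # NA for a missing originator

# Assumption: a consequent that is FALSE or NA counts as a violation.
violations <- toy_log[antecedent & !(consequent %in% TRUE), ]
print(violations)
```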
hospital %>% detect_case_id_sequence_gaps()
#> *** OUTPUT ***
#> It was checked whether there are gaps in the sequence of case IDs
#> From the 27 expected cases in the activity log, ranging from 510 to 536, 5 (18.52%) are missing.
#> These case numbers are:
#> case present
#> 1 511 FALSE
#> 2 513 FALSE
#> 3 514 FALSE
#> 4 515 FALSE
#> 5 516 FALSE
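For numeric case identifiers, a gap check amounts to comparing the observed identifiers with the full sequence between the smallest and largest one. A minimal base-R sketch, using a hypothetical vector `case_ids`:

```r
# Hypothetical observed case identifiers.
case_ids <- c(510, 512, 517, 518, 520)

# Every identifier expected between the smallest and largest observed value.
expected <- seq(min(case_ids), max(case_ids))

# Identifiers that are expected but never observed.
missing_ids <- setdiff(expected, case_ids)
share_missing <- 100 * length(missing_ids) / length(expected)
print(missing_ids)
print(share_missing)
```

Such a check only makes sense when case identifiers are assigned consecutively; a gap then hints at cases that were never recorded or were lost during extraction.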
hospital %>% detect_conditional_activity_presence(condition = specialization == "TRAU", activities = "Clinical exam")
#> *** OUTPUT ***
#> The following statement was checked: if condition(s) ~specialization == "TRAU" hold(s), then activity/activities Clinical exam should be recorded
#> The condition(s) hold(s) for 2 cases. From these cases:
#> - the specified activity/activities is/are recorded for 2 case(s) (100%)
#> - the specified activity/activities is/are not recorded for 0 case(s) (0%)
hospital %>% detect_duration_outliers(Treatment = duration_within(bound_sd = 1))
#> *** OUTPUT ***
#> Outliers are detected for following activities
#> Treatment Lower bound: 5.06 Upper bound: 22.2
#> A total of 1 is detected (1.89% of the activity executions)
#> For the following activity instances, outliers are detected:
#> # A tibble: 1 x 13
#> patient_visit_nr activity originator start
#> 1 523 Treatme~ Nurse 17 2017-11-21 18:26:04
#> # ... with 9 more variables: complete, triagecode,
#> # specialization, duration, mean, sd,
#> # bound_sd, lower_bound, upper_bound
hospital %>% detect_duration_outliers(Treatment = duration_within(lower_bound = 0, upper_bound = 15))
#> *** OUTPUT ***
#> Outliers are detected for following activities
#> Treatment Lower bound: 0 Upper bound: 15
#> A total of 1 is detected (1.89% of the activity executions)
#> For the following activity instances, outliers are detected:
#> # A tibble: 1 x 13
#> patient_visit_nr activity originator start
#> 1 523 Treatme~ Nurse 17 2017-11-21 18:26:04
#> # ... with 9 more variables: complete, triagecode,
#> # specialization, duration, mean, sd,
#> # bound_sd, lower_bound, upper_bound
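The two calls above differ in how the acceptable duration range is defined: either from the data (mean plus or minus a number of standard deviations) or from explicit bounds. The base-R sketch below illustrates both on a hypothetical vector of durations; it is only an approximation of the idea and is not taken from the package source.

```r
# Hypothetical treatment durations in minutes.
durations <- c(10, 12, 11, 13, 12, 40)

# Variant 1: bounds derived from the data, mean +/- k standard deviations.
k <- 1
lower_sd <- mean(durations) - k * sd(durations)
upper_sd <- mean(durations) + k * sd(durations)
outliers_sd <- durations[durations < lower_sd | durations > upper_sd]

# Variant 2: bounds supplied explicitly by the analyst.
lower_fixed <- 0
upper_fixed <- 15
outliers_fixed <- durations[durations < lower_fixed | durations > upper_fixed]

print(outliers_sd)
print(outliers_fixed)
```

Explicit bounds are the better choice when domain knowledge is available, since a single extreme value inflates the standard deviation and widens the data-driven range.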
hospital %>% detect_inactive_periods(threshold = 30)
#> Selected timestamp parameter value: both
#> Selected inactivity type:arrivals
#> *** OUTPUT ***
#> Specified threshold of 30 minutes is violated 9 times.
#> Threshold is violated in the following periods:
#> # A tibble: 9 x 3
#> period_start period_end time_gap
#> 1 2017-11-20 10:20:06 2017-11-21 11:35:16 1515.
#> 2 2017-11-21 11:22:16 2017-11-21 11:59:41 37.4
#> 3 2017-11-21 12:05:52 2017-11-21 13:43:16 97.4
#> 4 2017-11-21 14:06:09 2017-11-21 15:12:17 66.1
#> 5 2017-11-21 15:18:19 2017-11-21 16:42:08 83.8
#> 6 2017-11-21 17:06:10 2017-11-21 18:02:10 56
#> 7 2017-11-21 18:15:04 2017-11-22 10:04:57 950.
#> 8 2017-11-22 10:32:56 2017-11-22 16:30:00 357.
#> 9 2017-11-22 17:00:00 2017-11-22 18:00:00 60
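At its core, an inactivity check sorts the relevant timestamps and flags consecutive pairs that lie further apart than the threshold. The following base-R sketch does this for a hypothetical vector of recorded times; which timestamps the package actually compares for each inactivity type is not covered here.

```r
# Hypothetical recorded times (UTC is an assumption for reproducibility).
recorded <- as.POSIXct(c("2017-11-21 11:22:16", "2017-11-21 11:59:41",
                         "2017-11-21 12:05:52", "2017-11-21 13:43:16"),
                       tz = "UTC")
threshold <- 30  # minutes

# Gap in minutes between each timestamp and the next one.
sorted <- sort(recorded)
n <- length(sorted)
gaps <- data.frame(
  period_start = sorted[-n],
  period_end = sorted[-1],
  time_gap = as.numeric(difftime(sorted[-1], sorted[-n], units = "mins"))
)

# Periods in which nothing was recorded for longer than the threshold.
inactive <- gaps[gaps$time_gap > threshold, ]
print(inactive)
```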
hospital %>% detect_incomplete_cases(activities = c("Registration","Triage","Clinical exam","Treatment","Treatment evaluation"))
#> *** OUTPUT ***
#> It was checked whether the activities Clinical exam, Registration, Treatment, Treatment evaluation, Triage are present for cases.
#> These activities are present for 4 (39.62%) of the cases and are not present for 18 (60.38%) of the cases.
#> Note: this function only checks the presence of activities for a particular case, not the completeness of these entries in the activity log or the order of activities.
#> For cases for which the aforementioned activities are not all present, the following activities are recorded (ordered by decreasing frequeny of occurrence):
#> # A tibble: 9 x 3
#> activity n case_ids
#> 1 Triage 11 510 - 512 - 517 - 521 - 524 - 525 - 526 - 527 - ~
#> 2 Registration 9 512 - 518 - 518 - 518 - 521 - 522 - 527 - 528 - ~
#> 3 Clinical exam 5 512 - 510 - 527 - 528 - 512
#> 4 Treatment evalua~ 2 529 - 532
#> 5 0 1 533
#> 6 registration 1 510
#> 7 Trage 1 520
#> 8 Treatment 1 532
#> 9 Triaga 1 522
hospital %>% detect_incorrect_activity_names(allowed_activities = c("Registration","Triage","Clinical exam","Treatment","Treatment evaluation"))
#> *** OUTPUT ***
#> 4 out of 9 (44.44% ) activity labels are identified to be incorrect.
#> These activity labels are:
#> registration - Trage - Triaga - 0
#> Given this information, 4 of 53 (7.55%) rows in the activity log are incorrect. These are the following:
#> # A tibble: 4 x 7
#> patient_visit_nr activity originator start
#> 1 510 registr~ Clerk 9 2017-11-20 10:18:17
#> 2 520 Trage Nurse 17 2017-11-21 13:43:16
#> 3 522 Triaga Nurse 5 2017-11-21 15:15:25
#> 4 533 0 <NA> 2017-11-22 18:35:00
#> # ... with 3 more variables: complete, triagecode,
#> # specialization
hospital %>% detect_missing_values(column = "activity")
#> Selected level of aggregation:overview
#> Warning in detect_missing_values.activitylog(., column = "activity"):
#> Ignoring provided column argument at overview level.
#> *** OUTPUT ***
#> Absolute number of missing values per column:
#>
#> patient_visit_nr 0
#> activity 0
#> originator 2
#> start 1
#> complete 0
#> triagecode 1
#> specialization 0
#> Relative number of missing values per column (expressed as percentage):
#>
#> patient_visit_nr 0.000000
#> activity 0.000000
#> originator 3.773585
#> start 1.886792
#> complete 0.000000
#> triagecode 1.886792
#> specialization 0.000000
#> Overview of activity log rows which are incomplete:
#> # A tibble: 4 x 7
#> patient_visit_nr activity originator start
#> 1 510 Clinica~ Doctor 7 2017-11-20 11:35:01
#> 2 533 0 <NA> 2017-11-22 18:35:00
#> 3 534 Registr~ <NA> 2017-11-22 18:35:00
#> 4 512 Clinica~ Doctor 7 NA
#> # ... with 3 more variables: complete, triagecode,
#> # specialization
## Note: the column argument has no effect here, as the warning above indicates.
hospital %>% detect_missing_values(level_of_aggregation = "activity")
#> Selected level of aggregation:activity
#> *** OUTPUT ***
#> Absolute number of missing values per column (per activity):
#> # A tibble: 9 x 7
#> activity patient_visit_nr originator start complete triagecode
#> 1 0 0 1 0 0 0
#> 2 Clinica~ 0 0 1 0 1
#> 3 registr~ 0 0 0 0 0
#> 4 Registr~ 0 1 0 0 0
#> 5 Trage 0 0 0 0 0
#> 6 Treatme~ 0 0 0 0 0
#> 7 Treatme~ 0 0 0 0 0
#> 8 Triaga 0 0 0 0 0
#> 9 Triage 0 0 0 0 0
#> # ... with 1 more variable: specialization
#> Relative number of missing values per column (per activity, expressed as percentage):
#> # A tibble: 9 x 7
#> activity patient_visit_nr originator start complete triagecode
#> 1 0 0 1 0 0 0
#> 2 Clinica~ 0 0 0.111 0 0.111
#> 3 registr~ 0 0 0 0 0
#> 4 Registr~ 0 0.0714 0 0 0
#> 5 Trage 0 0 0 0 0
#> 6 Treatme~ 0 0 0 0 0
#> 7 Treatme~ 0 0 0 0 0
#> 8 Triaga 0 0 0 0 0
#> 9 Triage 0 0 0 0 0
#> # ... with 1 more variable: specialization
#> Overview of activity log rows which are incomplete:
#> # A tibble: 4 x 7
#> patient_visit_nr activity originator start
#> 1 510 Clinica~ Doctor 7 2017-11-20 11:35:01
#> 2 533 0 <NA> 2017-11-22 18:35:00
#> 3 534 Registr~ <NA> 2017-11-22 18:35:00
#> 4 512 Clinica~ Doctor 7 NA
#> # ... with 3 more variables: complete, triagecode,
#> # specialization
hospital %>% detect_missing_values(level_of_aggregation = "column", column = "triagecode")
#> Selected level of aggregation:column
#> *** OUTPUT ***
#> Absolute number of missing values in columntriagecode:1
#> Relative number of missing values in columntriagecode(expressed as percentage):1.88679245283019
#>
#> Overview of activity log rows in whichtriagecodeis missing:
#> # A tibble: 1 x 7
#> patient_visit_nr activity originator start
#> 1 510 Clinica~ Doctor 7 2017-11-20 11:35:01
#> # ... with 3 more variables: complete, triagecode,
#> # specialization
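The overview-level figures correspond to a per-column count of `NA` values, expressed both in absolute terms and as a percentage of the rows. A minimal base-R sketch on a hypothetical data frame `toy_log`:

```r
# Hypothetical log with one missing originator and one missing triage code.
toy_log <- data.frame(
  activity = c("Triage", "Clinical exam", "Registration"),
  originator = c("Nurse 27", "Doctor 7", NA),
  triagecode = c(3, NA, 3),
  stringsAsFactors = FALSE
)

# Absolute and relative number of missing values per column.
abs_missing <- colSums(is.na(toy_log))
rel_missing <- 100 * abs_missing / nrow(toy_log)

# Rows with at least one missing value.
incomplete_rows <- toy_log[!complete.cases(toy_log), ]

print(abs_missing)
print(round(rel_missing, 2))
print(incomplete_rows)
```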
hospital %>% detect_multiregistration(threshold_in_seconds = 10)
#> Selected level of aggregation: resource
#> Selected timestamp parameter value: complete
#> *** OUTPUT ***
#> Multi-registration is detected for 4 of the 12 resources (33.33%). These resources are:
#> Doctor 7 - Nurse 5 - Nurse 27 - NA
#> For the following rows in the activity log, multi-registration is detected:
#> # A tibble: 9 x 7
#> patient_visit_nr activity originator start
#> 1 512 Clinica~ Doctor 7 2017-11-20 11:27:12
#> 2 512 Clinica~ Doctor 7 NA
#> 3 524 Triage Nurse 5 2017-11-21 17:04:03
#> 4 525 Triage Nurse 5 2017-11-21 17:04:13
#> 5 526 Triage Nurse 5 2017-11-21 17:04:15
#> 6 536 Triage Nurse 27 2017-11-22 15:15:39
#> 7 536 Treatme~ Nurse 27 2017-11-22 15:15:41
#> 8 533 0 <NA> 2017-11-22 18:35:00
#> 9 534 Registr~ <NA> 2017-11-22 18:35:00
#> # ... with 3 more variables: complete, triagecode,
#> # specialization
hospital %>% detect_overlaps()
#> # A tibble: 7 x 4
#> activity_a activity_b n avg_overlap_mins
#> 1 Clinical exam Treatment 2 8.17
#> 2 Registration Clinical exam 1 1.9
#> 3 Registration Triaga 1 2.65
#> 4 Registration Triage 1 1.93
#> 5 Triage Clinical exam 2 5.63
#> 6 Triage Registration 1 0.817
#> 7 Triage Treatment 1 9.33
hospital %>% detect_similar_labels(column_labels = "activity", max_edit_distance = 3)
#> # A tibble: 5 x 3
#> column_labels labels similar_to
#> 1 activity registration Registration
#> 2 activity Registration registration
#> 3 activity Triage Trage - Triaga
#> 4 activity Trage Triage - Triaga
#> 5 activity Triaga Triage - Trage
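Edit distance counts the single-character insertions, deletions and substitutions needed to turn one label into another. The base-R sketch below uses `utils::adist` to compute it for the five labels reported above. Whether the package relies on the same function is not stated here, so treat this purely as an illustration of the measure.

```r
# Labels from the activity column that were reported as similar.
labels <- c("registration", "Registration", "Triage", "Trage", "Triaga")

# Pairwise (case-sensitive) edit distances between all labels.
d <- utils::adist(labels)
dimnames(d) <- list(labels, labels)

# For each label, list the other labels within the maximum edit distance.
max_edit_distance <- 3
similar <- lapply(seq_along(labels), function(i) {
  labels[d[i, ] > 0 & d[i, ] <= max_edit_distance]
})
names(similar) <- labels

print(d)
print(similar)
```

A small maximum distance keeps the list focused on likely typos; a large one starts pairing labels that are genuinely different activities.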
hospital %>% detect_time_anomalies()
#> Selected anomaly type: both
#> *** OUTPUT ***
#> For 5 rows in the activity log (9.43%), an anomaly is detected.
#> The anomalies are spread over the activities as follows:
#> # A tibble: 3 x 3
#> # Groups: activity [3]
#> activity type n
#> 1 Registration negative duration 3
#> 2 Clinical exam zero duration 1
#> 3 Trage negative duration 1
#> Anomalies are found in the following rows:
#> # A tibble: 5 x 9
#> patient_visit_nr activity originator start
#> 1 518 Registr~ Clerk 12 2017-11-21 11:45:16
#> 2 518 Registr~ Clerk 6 2017-11-21 11:45:16
#> 3 518 Registr~ Clerk 9 2017-11-21 11:45:16
#> 4 520 Trage Nurse 17 2017-11-21 13:43:16
#> 5 528 Clinica~ Doctor 1 2017-11-21 19:00:00
#> # ... with 5 more variables: complete, triagecode,
#> # specialization, duration, type
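A time anomaly in this sense is an activity instance whose complete timestamp does not lie after its start timestamp. The base-R sketch below classifies three hypothetical rows; the complete times and the UTC time zone are made up for illustration.

```r
# Hypothetical activity instances (complete times are invented).
toy_log <- data.frame(
  activity = c("Registration", "Clinical exam", "Triage"),
  start = as.POSIXct(c("2017-11-21 11:45:16", "2017-11-21 19:00:00",
                       "2017-11-21 13:43:16"), tz = "UTC"),
  complete = as.POSIXct(c("2017-11-21 11:22:16", "2017-11-21 19:00:00",
                          "2017-11-21 13:50:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Duration in minutes; negative means the activity "ended" before it started.
toy_log$duration <- as.numeric(difftime(toy_log$complete, toy_log$start,
                                        units = "mins"))

# Label each row; rows with a positive duration get no label.
toy_log$type <- ifelse(toy_log$duration < 0, "negative duration",
                       ifelse(toy_log$duration == 0, "zero duration",
                              NA_character_))

anomalies <- toy_log[!is.na(toy_log$type), ]
print(anomalies)
```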
hospital %>% detect_unique_values(column_labels = "activity")
#> *** OUTPUT ***
#> Distinct entries are computed for the following columns:
#> activity
#> # A tibble: 9 x 1
#> activity
#> 1 registration
#> 2 Registration
#> 3 Triage
#> 4 Clinical exam
#> 5 Trage
#> 6 Treatment
#> 7 Triaga
#> 8 Treatment evaluation
#> 9 0
hospital %>% detect_unique_values(column_labels = c("activity", "originator"))
#> *** OUTPUT ***
#> Distinct entries are computed for the following columns:
#> activity - originator
#> # A tibble: 22 x 2
#> activity originator
#> 1 registration Clerk 9
#> 2 Registration Clerk 12
#> 3 Triage Nurse 27
#> 4 Clinical exam Doctor 7
#> 5 Triage Nurse 17
#> 6 Registration Clerk 6
#> 7 Registration Clerk 9
#> 8 Trage Nurse 17
#> 9 Clinical exam Doctor 4
#> 10 Registration Clerk 3
#> # ... with 12 more rows
hospital %>% detect_value_range_violations(triagecode = domain_numeric(from = 0, to = 5))
#> *** OUTPUT ***
#> The domain range for column triagecode is checked.
#> Values allowed between 0 and 5
#> The values fall within the specified domain range for 46 (86.79%) of the rows in the activity log and outside the domain range for 7 (13.21%) of these rows.
#>
#> The following rows fall outside the specified domain range for indicated column:
#> # A tibble: 7 x 8
#> column_checked patient_visit_nr activity originator start
#> 1 triagecode 510 Clinica~ Doctor 7 2017-11-20 11:35:01
#> 2 triagecode 529 Treatme~ Doctor 1 2017-11-22 16:30:00
#> 3 triagecode 530 Triage Nurse 17 2017-11-22 18:00:00
#> 4 triagecode 531 Triage Nurse 17 2017-11-22 18:05:00
#> 5 triagecode 532 Treatme~ Nurse 17 2017-11-22 18:15:00
#> 6 triagecode 532 Treatme~ Doctor 7 2017-11-22 18:27:00
#> 7 triagecode 533 0 <NA> 2017-11-22 18:35:00
#> # ... with 3 more variables: complete, triagecode,
#> # specialization
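A numeric range check reduces to testing each value against a lower and an upper bound. In the base-R sketch below, the vector `triagecode` is hypothetical, and counting a missing value as a violation is an assumption. It appears consistent with the output above, where the row that the missing-value test reported for `triagecode` is also listed.

```r
# Hypothetical triage codes, including one missing and one out-of-range value.
triagecode <- c(3, 4, NA, 7, 0)
from <- 0
to <- 5

# Assumption: a missing value is treated as falling outside the range.
in_range <- !is.na(triagecode) & triagecode >= from & triagecode <= to

violating_positions <- which(!in_range)
share_in_range <- 100 * mean(in_range)

print(violating_positions)
print(share_in_range)
```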
Hasselt University, Research group Business Informatics | Research Foundation Flanders (FWO). niels.martin@uhasselt.be
Hasselt University, Research group Business Informatics. greg.vanhoudt@uhasselt.be
Hasselt University, Research group Business Informatics. gert.janssenswillen@uhasselt.be