Data Quality

Despite the extensive opportunities that process mining techniques provide, the "garbage in, garbage out" principle still applies. Data quality issues are widespread in real-life data and can produce misleading analysis results. daqapo (Data Quality Assessment for Process-Oriented data) provides a set of assessment functions to identify a wide array of quality issues.

Getting started

In the examples below, we use the dataset hospital_actlog, which is an artificial event log with data quality issues provided by daqapo.

library(daqapo)
library(dplyr)
data("hospital_actlog")
data("hospital_events")
hospital_actlog <- activitylog(hospital_actlog)

Attribute Dependencies

Detect violations of dependencies between attributes, i.e. conditions that should hold whenever one or more other conditions hold.

Example: when the activity is “Registration”, the originator should start with “Clerk”.

hospital_actlog %>% 
  detect_attribute_dependencies(antecedent = activity == "Registration",
                                consequent = startsWith(originator,"Clerk"))
## *** OUTPUT ***
## The following statement was checked: if condition(s) ~activity == "Registration" hold(s), then ~startsWith(originator, "Clerk") should also hold.
## This statement holds for 12 (85.71%) of the rows in the activity log for which the first condition(s) hold and does not hold for 2 (14.29%) of these rows.
## For the following rows, the first condition(s) hold(s), but the second condition does not:
## # Log of 10 events consisting of:
## 2 traces 
## 4 cases 
## 5 instances of 1 activity 
## 5 resources 
## Events occurred from 2017-11-21 18:10:17 until 2017-11-22 18:37:00 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 5 × 8
##   patient_visi…¹ activ…² origi…³ start               complete            triag…⁴
##            <dbl> <chr>   <chr>   <dttm>              <dttm>                <dbl>
## 1            528 Regist… Nurse 6 2017-11-21 18:10:17 2017-11-21 18:15:04       3
## 2            535 Regist… Clerk 3 2017-11-22 10:04:57 2017-11-22 10:06:46       2
## 3            536 Regist… Clerk 9 2017-11-22 10:26:41 2017-11-22 10:32:56       5
## 4            535 Regist… Clerk 6 2017-11-22 11:05:42 2017-11-22 11:11:11       2
## 5            534 Regist… <NA>    2017-11-22 18:35:00 2017-11-22 18:37:00       0
## # … with 2 more variables: specialization <chr>, .order <int>, and abbreviated
## #   variable names ¹​patient_visit_nr, ²​activity, ³​originator, ⁴​triagecode
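
The idea behind such a dependency check can be sketched in base R as an implication test on each row. This is only an illustration on a made-up data frame (`toy` and its rows are hypothetical), not daqapo's implementation, and it assumes that a missing originator counts as a violation.

```r
# Toy data: invented rows, not taken from hospital_actlog.
toy <- data.frame(
  activity   = c("Registration", "Registration", "Triage", "Registration"),
  originator = c("Clerk 3", "Nurse 6", "Nurse 5", NA),
  stringsAsFactors = FALSE
)

# Rows where the antecedent holds.
antecedent <- toy$activity == "Registration"

# Rows where the consequent holds; a missing originator is treated
# as not satisfying the consequent (an assumption of this sketch).
consequent <- !is.na(toy$originator) & startsWith(toy$originator, "Clerk")

# A violation is a row where the antecedent holds but the consequent does not.
violations <- toy[antecedent & !consequent, ]
print(violations)
```

Here the second and fourth rows are reported, because they are registrations that were not recorded by a clerk.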

Case ID Sequence Gaps

Detect gaps in the sequence of case identifiers.

hospital_actlog %>%
  detect_case_id_sequence_gaps()
## *** OUTPUT ***
## It was checked whether there are gaps in the sequence of case IDs
## From the 27 expected cases in the activity log, ranging from 510 to 536, 5 (18.52%) are missing.
## These missing case numbers are:
## # A tibble: 2 × 3
##    from    to n_missing
##   <dbl> <dbl>     <dbl>
## 1   511   511         1
## 2   513   516         4
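
As a rough base R sketch of what a gap check involves, the snippet below compares the observed identifiers with the full expected sequence and groups consecutive missing values into ranges. The vector `case_ids` is made up for illustration, and the grouping logic is an assumption rather than daqapo's own code.

```r
# Hypothetical case identifiers with two gaps (511, and 513 to 516).
case_ids <- c(510, 512, 517, 518, 519)

# Every identifier expected between the smallest and largest observed value.
expected <- seq(min(case_ids), max(case_ids))

# Identifiers that are expected but never observed.
missing_ids <- setdiff(expected, case_ids)

# Start a new run whenever two missing identifiers are not consecutive.
run <- cumsum(c(1, diff(missing_ids) != 1))

# Summarise each run as a from/to range with its length.
gaps <- data.frame(
  from      = as.numeric(tapply(missing_ids, run, min)),
  to        = as.numeric(tapply(missing_ids, run, max)),
  n_missing = as.numeric(tapply(missing_ids, run, length))
)
print(gaps)
```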

Conditional Activity Presence

Check whether certain activities are present when a specific condition is satisfied.

For example, if specialization is “TRAU”, then the activity “Clinical exam” must take place.

hospital_actlog %>%
  detect_conditional_activity_presence(condition = specialization == "TRAU",
                                       activities = "Clinical exam")
## *** OUTPUT ***
## The following statement was checked: if condition(s) ~specialization == "TRAU" hold(s), then activity/activities Clinical exam should be recorded
## The condition(s) hold(s) for 2 cases. From these cases:
## - the specified activity/activities is/are recorded for 2 case(s) (100%)
## - the specified activity/activities is/are not recorded for 0 case(s) (0%)
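
A minimal base R sketch of the same kind of check is given below. It assumes that the condition is evaluated per case (a case qualifies when at least one of its rows satisfies the condition); the data frame `toy` and the column `case_id` are hypothetical, and daqapo may evaluate the condition differently.

```r
# Toy data: invented rows for three cases.
toy <- data.frame(
  case_id        = c(1, 1, 2, 2, 3),
  activity       = c("Triage", "Clinical exam", "Triage", "Treatment", "Triage"),
  specialization = c("TRAU", "TRAU", "TRAU", "TRAU", "PED"),
  stringsAsFactors = FALSE
)

# Cases for which the condition holds on at least one row.
cond_cases <- unique(toy$case_id[toy$specialization == "TRAU"])

# For each of those cases, check whether the required activity is recorded.
has_activity <- vapply(
  cond_cases,
  function(id) "Clinical exam" %in% toy$activity[toy$case_id == id],
  logical(1)
)

# Cases that satisfy the condition but lack the activity.
missing_cases <- cond_cases[!has_activity]
print(missing_cases)
```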

Duration Outliers

Detect duration outliers for particular activities.

For example, the duration of “Treatment” should be within 1 standard deviation of its mean duration.

hospital_actlog %>%
  detect_duration_outliers(Treatment = duration_within(bound_sd = 1))
## *** OUTPUT ***
## Outliers are detected for following activities
## Treatment     Lower bound: 5.06   Upper bound: 22.2
## A total of 1 is detected (1.89% of the activity executions)
## For the following activity instances, outliers are detected:
## # Log of 2 events consisting of:
## 1 trace 
## 1 case 
## 1 instance of 1 activity 
## 1 resource 
## Events occurred from 2017-11-21 18:26:04 until 2017-11-21 18:55:00 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 1 × 14
##   patient_visi…¹ activ…² origi…³ start               complete            triag…⁴
##            <dbl> <chr>   <chr>   <dttm>              <dttm>                <dbl>
## 1            523 Treatm… Nurse … 2017-11-21 18:26:04 2017-11-21 18:55:00       3
## # … with 8 more variables: specialization <chr>, .order <int>, duration <dbl>,
## #   mean <dbl>, sd <dbl>, bound_sd <dbl>, lower_bound <dbl>, upper_bound <dbl>,
## #   and abbreviated variable names ¹​patient_visit_nr, ²​activity, ³​originator,
## #   ⁴​triagecode

Alternatively, the duration of “Treatment” should lie between 0 and 15 minutes.

hospital_actlog %>%
  detect_duration_outliers(Treatment = duration_within(lower_bound = 0, upper_bound = 15))
## *** OUTPUT ***
## Outliers are detected for following activities
## Treatment     Lower bound: 0      Upper bound: 15
## A total of 1 is detected (1.89% of the activity executions)
## For the following activity instances, outliers are detected:
## # Log of 2 events consisting of:
## 1 trace 
## 1 case 
## 1 instance of 1 activity 
## 1 resource 
## Events occurred from 2017-11-21 18:26:04 until 2017-11-21 18:55:00 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 1 × 14
##   patient_visi…¹ activ…² origi…³ start               complete            triag…⁴
##            <dbl> <chr>   <chr>   <dttm>              <dttm>                <dbl>
## 1            523 Treatm… Nurse … 2017-11-21 18:26:04 2017-11-21 18:55:00       3
## # … with 8 more variables: specialization <chr>, .order <int>, duration <dbl>,
## #   mean <dbl>, sd <dbl>, bound_sd <dbl>, lower_bound <dbl>, upper_bound <dbl>,
## #   and abbreviated variable names ¹​patient_visit_nr, ²​activity, ³​originator,
## #   ⁴​triagecode
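
To make the two kinds of bounds concrete, the following base R sketch computes them by hand for a made-up vector of durations. The variable names mirror the output columns above, but the numbers are invented and the code is not daqapo's implementation.

```r
# Made-up durations in minutes for a single activity.
durations <- c(10, 12, 14, 11, 13, 29)

# Bounds at bound_sd standard deviations around the mean.
bound_sd    <- 1
lower_bound <- mean(durations) - bound_sd * sd(durations)
upper_bound <- mean(durations) + bound_sd * sd(durations)

# Durations outside the statistical bounds are flagged as outliers.
sd_outliers <- durations[durations < lower_bound | durations > upper_bound]

# The same idea with fixed bounds of 0 and 15 minutes.
fixed_outliers <- durations[durations < 0 | durations > 15]

print(sd_outliers)
print(fixed_outliers)
```

Both variants flag the 29-minute execution in this toy example.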

Inactive Periods

Detect periods of time in which no activity executions are recorded, using a threshold specified in minutes.

For example, detect whether there are periods of more than 30 minutes without any activity executions.

hospital_actlog %>%
  detect_inactive_periods(threshold = 30)
## Selected timestamp parameter value: both
## Selected inactivity type:arrivals
## *** OUTPUT ***
## Specified threshold of 30 minutes is violated 9 times.
## Threshold is violated in the following periods:
##          period_start          period_end   time_gap
## 1 2017-11-20 10:20:06 2017-11-21 11:35:16 1515.16667
## 2 2017-11-21 11:22:16 2017-11-21 11:59:41   37.41667
## 3 2017-11-21 12:05:52 2017-11-21 13:43:16   97.40000
## 4 2017-11-21 14:06:09 2017-11-21 15:12:17   66.13333
## 5 2017-11-21 15:18:19 2017-11-21 16:42:08   83.81667
## 6 2017-11-21 17:06:10 2017-11-21 18:02:10   56.00000
## 7 2017-11-21 18:15:04 2017-11-22 10:04:57  949.88333
## 8 2017-11-22 10:32:56 2017-11-22 16:30:00  357.06667
## 9 2017-11-22 17:00:00 2017-11-22 18:00:00   60.00000
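
The underlying idea can be sketched in base R by sorting timestamps and measuring the gap between consecutive ones. This simplified illustration uses a single made-up vector of timestamps and ignores the timestamp and inactivity type options reported in the output above.

```r
# Made-up timestamps of activity executions.
stamps <- as.POSIXct(
  c("2017-11-21 11:00:00", "2017-11-21 11:10:00", "2017-11-21 12:00:00"),
  tz = "UTC"
)
stamps <- sort(stamps)

# Gap in minutes between each timestamp and the next one.
gap_mins <- as.numeric(
  difftime(stamps[-1], stamps[-length(stamps)], units = "mins")
)

# Keep only the gaps that exceed the threshold (in minutes).
threshold <- 30
periods <- data.frame(
  period_start = stamps[-length(stamps)],
  period_end   = stamps[-1],
  time_gap     = gap_mins
)
inactive <- periods[periods$time_gap > threshold, ]
print(inactive)
```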

Incomplete Cases

Check whether there are cases that miss a specific activity.

For example, do any of the cases miss one or more of the five listed activities?

hospital_actlog %>%
  detect_incomplete_cases(activities = c("Registration","Triage","Clinical exam","Treatment","Treatment evaluation"))
## *** OUTPUT ***
## It was checked whether the activities Clinical exam, Registration, Treatment, Treatment evaluation, Triage are present for cases.
## These activities are present for 4 (39.62%) of the cases and are not present for 18 (60.38%) of the cases.
## Note: this function only checks the presence of activities for a particular case, not the completeness of these entries in the activity log or the order of activities.
## For cases for which the aforementioned activities are not all present, the following activities are recorded (ordered by decreasing frequeny of occurrence):
## # A tibble: 9 × 3
##   activity                 n case_ids                                           
##   <chr>                <int> <chr>                                              
## 1 Triage                  11 510 - 512 - 517 - 521 - 524 - 525 - 526 - 527 - 52…
## 2 Registration             9 512 - 518 - 518 - 518 - 521 - 522 - 527 - 528 - 534
## 3 Clinical exam            5 512 - 510 - 527 - 528 - 512                        
## 4 Treatment evaluation     2 529 - 532                                          
## 5 0                        1 533                                                
## 6 registration             1 510                                                
## 7 Trage                    1 520                                                
## 8 Treatment                1 532                                                
## 9 Triaga                   1 522
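
As the note in the output says, only the presence of activities is checked. A base R sketch of such a presence check on invented data (the data frame `toy` and the vector `required` are hypothetical) could look like this:

```r
# Toy data: invented activity rows for three cases.
toy <- data.frame(
  case_id  = c(1, 1, 1, 2, 2, 3),
  activity = c("Registration", "Triage", "Treatment",
               "Registration", "Triage", "Triage"),
  stringsAsFactors = FALSE
)
required <- c("Registration", "Triage", "Treatment")

# For each case, check whether every required activity occurs at least once.
complete <- tapply(toy$activity, toy$case_id,
                   function(a) all(required %in% a))

# Identifiers of the cases that miss at least one required activity.
incomplete_cases <- names(complete)[!complete]
print(incomplete_cases)
```

Note that exact string matching is used, so a misspelled label such as "Trage" would not count as "Triage".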

Incorrect Activity Names

Given a set of allowed activity labels, detect the labels in the log that do not belong to it.

hospital_actlog %>%
  detect_incorrect_activity_names(allowed_activities = c("Registration","Triage","Clinical exam","Treatment","Treatment evaluation"))
## *** OUTPUT ***
## 4 out of 9 (44.44% ) activity labels are identified to be incorrect.
## These activity labels are:
## registration - Trage - Triaga - 0
## Given this information, 4 of 53 (7.55%) rows in the activity log are incorrect. These are the following:
## # Log of 8 events consisting of:
## 4 traces 
## 4 cases 
## 4 instances of 4 activities 
## 4 resources 
## Events occurred from 2017-11-20 10:18:17 until 2017-11-22 18:37:00 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 4 × 8
##   patient_visi…¹ activ…² origi…³ start               complete            triag…⁴
##            <dbl> <chr>   <chr>   <dttm>              <dttm>                <dbl>
## 1            510 regist… Clerk 9 2017-11-20 10:18:17 2017-11-20 10:20:06       3
## 2            520 Trage   Nurse … 2017-11-21 13:43:16 2017-11-21 13:39:00       5
## 3            522 Triaga  Nurse 5 2017-11-21 15:15:25 2017-11-21 15:18:04       2
## 4            533 0       <NA>    2017-11-22 18:35:00 2017-11-22 18:37:00       7
## # … with 2 more variables: specialization <chr>, .order <int>, and abbreviated
## #   variable names ¹​patient_visit_nr, ²​activity, ³​originator, ⁴​triagecode
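
Conceptually, this comes down to a set difference between the observed labels and the allowed ones. A small base R sketch, using a made-up vector of labels rather than the log itself:

```r
# Made-up activity labels as they might appear in a log.
labels  <- c("Registration", "registration", "Triage", "Trage", "Triaga", "0")
allowed <- c("Registration", "Triage", "Clinical exam",
             "Treatment", "Treatment evaluation")

# Distinct labels that are not in the allowed set (matching is case-sensitive).
incorrect <- setdiff(unique(labels), allowed)
print(incorrect)

# Share of distinct labels that are incorrect, as a percentage.
pct_incorrect <- 100 * length(incorrect) / length(unique(labels))
```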

Missing Values

Analyse the missing values of the log. This can be done in general, or at the level of activities or specific columns.

hospital_actlog %>%
  detect_missing_values()
## Selected level of aggregation:overview
## *** OUTPUT ***
## Absolute number of missing values per column:
##                   
## patient_visit_nr 0
## activity         0
## originator       2
## start            1
## complete         0
## triagecode       1
## specialization   0
## .order           0
## Relative number of missing values per column (expressed as percentage):
##                          
## patient_visit_nr 0.000000
## activity         0.000000
## originator       3.773585
## start            1.886792
## complete         0.000000
## triagecode       1.886792
## specialization   0.000000
## .order           0.000000
## Overview of activity log rows which are incomplete:
## # Log of 7 events consisting of:
## 3 traces 
## 4 cases 
## 4 instances of 3 activities 
## 2 resources 
## Events occurred from NA until NA 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 4 × 8
##   patient_visi…¹ activ…² origi…³ start               complete            triag…⁴
##            <dbl> <chr>   <chr>   <dttm>              <dttm>                <dbl>
## 1            510 Clinic… Doctor… 2017-11-20 11:35:01 2017-11-20 11:36:09      NA
## 2            533 0       <NA>    2017-11-22 18:35:00 2017-11-22 18:37:00       7
## 3            534 Regist… <NA>    2017-11-22 18:35:00 2017-11-22 18:37:00       0
## 4            512 Clinic… Doctor… NA                  2017-11-20 11:33:57       3
## # … with 2 more variables: specialization <chr>, .order <int>, and abbreviated
## #   variable names ¹patient_visit_nr, ²activity, ³originator, ⁴triagecode

At the level of activities:

hospital_actlog %>%
  detect_missing_values(level_of_aggregation = "activity")
## Selected level of aggregation:activity
## *** OUTPUT ***
## Absolute number of missing values per column (per activity):
## # A tibble: 9 × 8
##   activity             patient_vi…¹ origi…² start compl…³ triag…⁴ speci…⁵ .order
##   <chr>                       <int>   <int> <int>   <int>   <int>   <int>  <int>
## 1 0                               0       1     0       0       0       0      0
## 2 Clinical exam                   0       0     1       0       1       0      0
## 3 registration                    0       0     0       0       0       0      0
## 4 Registration                    0       1     0       0       0       0      0
## 5 Trage                           0       0     0       0       0       0      0
## 6 Treatment                       0       0     0       0       0       0      0
## 7 Treatment evaluation            0       0     0       0       0       0      0
## 8 Triaga                          0       0     0       0       0       0      0
## 9 Triage                          0       0     0       0       0       0      0
## # … with abbreviated variable names ¹​patient_visit_nr, ²​originator, ³​complete,
## #   ⁴​triagecode, ⁵​specialization
## Relative number of missing values per column (per activity, expressed as percentage):
## # A tibble: 9 × 8
##   activity             patient_vi…¹ origi…² start compl…³ triag…⁴ speci…⁵ .order
##   <chr>                       <dbl>   <dbl> <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
## 1 0                               0  1      0           0   0           0      0
## 2 Clinical exam                   0  0      0.111       0   0.111       0      0
## 3 registration                    0  0      0           0   0           0      0
## 4 Registration                    0  0.0714 0           0   0           0      0
## 5 Trage                           0  0      0           0   0           0      0
## 6 Treatment                       0  0      0           0   0           0      0
## 7 Treatment evaluation            0  0      0           0   0           0      0
## 8 Triaga                          0  0      0           0   0           0      0
## 9 Triage                          0  0      0           0   0           0      0
## # … with abbreviated variable names ¹​patient_visit_nr, ²​originator, ³​complete,
## #   ⁴​triagecode, ⁵​specialization
## Overview of activity log rows which are incomplete:
## # Log of 7 events consisting of:
## 3 traces 
## 4 cases 
## 4 instances of 3 activities 
## 2 resources 
## Events occurred from NA until NA 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 4 × 8
##   patient_visi…¹ activ…² origi…³ start               complete            triag…⁴
##            <dbl> <chr>   <chr>   <dttm>              <dttm>                <dbl>
## 1            510 Clinic… Doctor… 2017-11-20 11:35:01 2017-11-20 11:36:09      NA
## 2            533 0       <NA>    2017-11-22 18:35:00 2017-11-22 18:37:00       7
## 3            534 Regist… <NA>    2017-11-22 18:35:00 2017-11-22 18:37:00       0
## 4            512 Clinic… Doctor… NA                  2017-11-20 11:33:57       3
## # … with 2 more variables: specialization <chr>, .order <int>, and abbreviated
## #   variable names ¹patient_visit_nr, ²activity, ³originator, ⁴triagecode

At the level of a specific column:

hospital_actlog %>%
  detect_missing_values(level_of_aggregation = "column",
                        column = "triagecode")
## Selected level of aggregation:column
## *** OUTPUT ***
## Absolute number of missing values in columntriagecode:1
## Relative number of missing values in columntriagecode(expressed as percentage):1.88679245283019
## 
## Overview of activity log rows in whichtriagecodeis missing:
## # Log of 2 events consisting of:
## 1 trace 
## 1 case 
## 1 instance of 1 activity 
## 1 resource 
## Events occurred from 2017-11-20 11:35:01 until 2017-11-20 11:36:09 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 1 × 8
##   patient_visi…¹ activ…² origi…³ start               complete            triag…⁴
##            <dbl> <chr>   <chr>   <dttm>              <dttm>                <dbl>
## 1            510 Clinic… Doctor… 2017-11-20 11:35:01 2017-11-20 11:36:09      NA
## # … with 2 more variables: specialization <chr>, .order <int>, and abbreviated
## #   variable names ¹​patient_visit_nr, ²​activity, ³​originator, ⁴​triagecode
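
The overview-level counts correspond to counting `NA` values per column. A base R sketch on a made-up data frame (`toy` is hypothetical) shows the absolute and relative numbers:

```r
# Toy data with a few missing values.
toy <- data.frame(
  originator = c("Clerk 3", NA, "Nurse 5", NA),
  triagecode = c(3, 2, NA, 5),
  stringsAsFactors = FALSE
)

# Absolute number of missing values per column.
n_missing <- colSums(is.na(toy))

# Relative number of missing values per column, as a percentage.
pct_missing <- 100 * colMeans(is.na(toy))

# Rows with at least one missing value.
incomplete_rows <- toy[!complete.cases(toy), ]

print(n_missing)
print(pct_missing)
```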

Multiregistration

Detect whether multiple activity executions are registered by the same resource (or for the same case) within a short period of time. This period is specified as a threshold in seconds.

hospital_actlog %>%
  detect_multiregistration(threshold_in_seconds = 10)
## Selected level of aggregation: resource
## Selected timestamp parameter value: complete
## *** OUTPUT ***
## Multi-registration is detected for 4 of the 12 resources (33.33%). These resources are:
## Doctor 7 - Nurse 27 - Nurse 5 - NA
## For the following rows in the activity log, multi-registration is detected:
## # Log of 17 events consisting of:
## 5 traces 
## 7 cases 
## 9 instances of 5 activities 
## 4 resources 
## Events occurred from NA until NA 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 9 × 8
##   originator patient_v…¹ activ…² start               complete            triag…³
##   <chr>            <dbl> <chr>   <dttm>              <dttm>                <dbl>
## 1 Doctor 7           512 Clinic… 2017-11-20 11:27:12 2017-11-20 11:33:57       3
## 2 Doctor 7           512 Clinic… NA                  2017-11-20 11:33:57       3
## 3 Nurse 27           536 Triage  2017-11-22 15:15:39 2017-11-22 15:25:01       5
## 4 Nurse 27           536 Treatm… 2017-11-22 15:15:41 2017-11-22 15:25:03       5
## 5 Nurse 5            524 Triage  2017-11-21 17:04:03 2017-11-21 17:06:05       3
## 6 Nurse 5            525 Triage  2017-11-21 17:04:13 2017-11-21 17:06:08       3
## 7 Nurse 5            526 Triage  2017-11-21 17:04:15 2017-11-21 17:06:10       4
## 8 <NA>               533 0       2017-11-22 18:35:00 2017-11-22 18:37:00       7
## 9 <NA>               534 Regist… 2017-11-22 18:35:00 2017-11-22 18:37:00       0
## # … with 2 more variables: specialization <chr>, .order <int>, and abbreviated
## #   variable names ¹​patient_visit_nr, ²​activity, ³​triagecode
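
One way to think about this check is to look at the gaps between consecutive completion timestamps per resource. The base R sketch below does that on invented data; whether daqapo treats the threshold as inclusive is not stated here, so the `<=` comparison is an assumption.

```r
# Toy data: invented completion timestamps for two resources.
toy <- data.frame(
  originator = c("Nurse 5", "Nurse 5", "Nurse 5", "Clerk 3", "Clerk 3"),
  complete   = as.POSIXct(
    c("2017-11-21 17:06:05", "2017-11-21 17:06:08", "2017-11-21 17:06:10",
      "2017-11-21 10:00:00", "2017-11-21 10:30:00"),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)
threshold_in_seconds <- 10

# For each resource, the smallest gap (in seconds) between consecutive
# completion timestamps.
min_gap <- tapply(as.numeric(toy$complete), toy$originator,
                  function(t) min(diff(sort(t))))

# Resources with at least one gap at or below the threshold.
flagged <- names(min_gap)[min_gap <= threshold_in_seconds]
print(flagged)
```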

Overlaps

Check if a resource has performed two or more activities in parallel.

hospital_actlog %>%
  detect_overlaps()
## # A tibble: 7 × 4
##   activity_a    activity_b        n avg_overlap_mins
##   <chr>         <chr>         <int>            <dbl>
## 1 Clinical exam Treatment         2            8.17 
## 2 Registration  Clinical exam     1            1.9  
## 3 Registration  Triaga            1            2.65 
## 4 Registration  Triage            1            1.93 
## 5 Triage        Clinical exam     2            5.63 
## 6 Triage        Registration      1            0.817
## 7 Triage        Treatment         1            9.33
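
Two activity executions overlap when each one starts before the other completes. A base R sketch of that test for a single pair of made-up intervals (all four timestamps are invented) is shown below; it is an illustration of the concept, not of how daqapo pairs up rows.

```r
# Two made-up activity executions by the same resource.
start_a    <- as.POSIXct("2017-11-21 10:00:00", tz = "UTC")
complete_a <- as.POSIXct("2017-11-21 10:10:00", tz = "UTC")
start_b    <- as.POSIXct("2017-11-21 10:06:00", tz = "UTC")
complete_b <- as.POSIXct("2017-11-21 10:20:00", tz = "UTC")

# The intervals overlap when each starts before the other completes.
overlaps <- start_a < complete_b && start_b < complete_a

# Length of the shared period in minutes (meaningful only if they overlap).
overlap_mins <- as.numeric(
  difftime(min(complete_a, complete_b), max(start_a, start_b), units = "mins")
)
print(overlap_mins)
```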

Similar Labels

Check for similar labels in a specific column. Both the column and the maximum edit distance at which two labels are still considered similar can be configured.

hospital_actlog %>%
  detect_similar_labels(column_labels = "activity", max_edit_distance = 3)
## Warning in detect_similar_labels.activitylog(., column_labels = "activity", :
## Not all provided columns are of type character or factor and will be ignored:
## patient_visit_nr,start,complete,.order
## # A tibble: 16 × 3
##    column_labels labels       similar_to                   
##    <chr>         <chr>        <chr>                        
##  1 activity      registration Registration                 
##  2 activity      Registration registration                 
##  3 activity      Triage       Trage - Triaga               
##  4 activity      Trage        Triage - Triaga              
##  5 activity      Triaga       Triage - Trage               
##  6 originator    Clerk 9      Clerk 12 - Clerk 6 - Clerk 3 
##  7 originator    Clerk 12     Clerk 9 - Clerk 6 - Clerk 3  
##  8 originator    Nurse 27     Nurse 17 - Nurse 5 - Nurse 6 
##  9 originator    Doctor 7     Doctor 4 - Doctor 1          
## 10 originator    Nurse 17     Nurse 27 - Nurse 5 - Nurse 6 
## 11 originator    Clerk 6      Clerk 9 - Clerk 12 - Clerk 3 
## 12 originator    Doctor 4     Doctor 7 - Doctor 1          
## 13 originator    Clerk 3      Clerk 9 - Clerk 12 - Clerk 6 
## 14 originator    Nurse 5      Nurse 27 - Nurse 17 - Nurse 6
## 15 originator    Nurse 6      Nurse 27 - Nurse 17 - Nurse 5
## 16 originator    Doctor 1     Doctor 7 - Doctor 4
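
Edit distance can be explored directly with base R's `adist()`, which computes the generalized Levenshtein distance between strings. The sketch below applies it to a made-up vector of labels; whether daqapo uses the same distance measure internally is an assumption.

```r
# Made-up activity labels.
labels <- c("Registration", "registration", "Triage",
            "Trage", "Triaga", "Treatment")
max_edit_distance <- 3

# Pairwise edit distances between all labels (case-sensitive).
d <- adist(labels)
dimnames(d) <- list(labels, labels)

# For each label, the other labels within the maximum edit distance.
similar <- lapply(labels, function(l) {
  others <- setdiff(labels, l)
  others[d[l, others] <= max_edit_distance]
})
names(similar) <- labels

print(similar[["Triage"]])
```

A small maximum distance works well for typos in long labels, but short labels that differ only in a number (such as resource names) quickly fall within the threshold as well.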

Time Anomalies

Detect activity executions with negative or zero duration.

hospital_actlog %>%
  detect_time_anomalies()
## Selected anomaly type: both
## *** OUTPUT ***
## For 5 rows in the activity log (9.43%), an anomaly is detected.
## The anomalies are spread over the activities as follows:
## # A tibble: 3 × 3
##   activity      type                  n
##   <chr>         <chr>             <int>
## 1 Registration  negative duration     3
## 2 Clinical exam zero duration         1
## 3 Trage         negative duration     1
## Anomalies are found in the following rows:
## # Log of 10 events consisting of:
## 3 traces 
## 3 cases 
## 5 instances of 3 activities 
## 5 resources 
## Events occurred from 2017-11-21 11:22:16 until 2017-11-21 19:00:00 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 5 × 10
##   patient_visi…¹ activ…² origi…³ start               complete            triag…⁴
##            <dbl> <chr>   <chr>   <dttm>              <dttm>                <dbl>
## 1            518 Regist… Clerk … 2017-11-21 11:45:16 2017-11-21 11:22:16       4
## 2            518 Regist… Clerk 6 2017-11-21 11:45:16 2017-11-21 11:22:16       4
## 3            518 Regist… Clerk 9 2017-11-21 11:45:16 2017-11-21 11:22:16       4
## 4            520 Trage   Nurse … 2017-11-21 13:43:16 2017-11-21 13:39:00       5
## 5            528 Clinic… Doctor… 2017-11-21 19:00:00 2017-11-21 19:00:00       3
## # … with 4 more variables: specialization <chr>, .order <int>, duration <dbl>,
## #   type <chr>, and abbreviated variable names ¹​patient_visit_nr, ²​activity,
## #   ³​originator, ⁴​triagecode
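
The classification into negative and zero durations can be reproduced by hand in base R. The data frame `toy` below is made up for illustration, with durations expressed in minutes.

```r
# Toy data: invented start and complete timestamps.
toy <- data.frame(
  start    = as.POSIXct(c("2017-11-21 11:45:16", "2017-11-21 19:00:00",
                          "2017-11-21 10:00:00"), tz = "UTC"),
  complete = as.POSIXct(c("2017-11-21 11:22:16", "2017-11-21 19:00:00",
                          "2017-11-21 10:05:00"), tz = "UTC")
)

# Duration of each execution in minutes.
toy$duration <- as.numeric(difftime(toy$complete, toy$start, units = "mins"))

# Label negative and zero durations; regular rows get NA.
toy$type <- ifelse(toy$duration < 0, "negative duration",
                   ifelse(toy$duration == 0, "zero duration", NA_character_))

# Keep only the anomalous rows.
anomalies <- toy[!is.na(toy$type), ]
print(anomalies)
```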

Unique Values

List all unique combinations of the specified columns.

hospital_actlog %>%
  detect_unique_values(column_labels = "activity")
## *** OUTPUT ***
## Distinct entries are computed for the following columns: 
## activity
## # Log of 105 events consisting of:
## 14 traces 
## 22 cases 
## 53 instances of 9 activities 
## 12 resources 
## Events occurred from NA until NA 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 53 × 6
##    activity      patien…¹ origi…² start               complete            .order
##    <chr>            <dbl> <chr>   <dttm>              <dttm>               <int>
##  1 registration       510 Clerk 9 2017-11-20 10:18:17 2017-11-20 10:20:06      1
##  2 Registration       512 Clerk … 2017-11-20 10:33:14 2017-11-20 10:37:00      2
##  3 Triage             510 Nurse … 2017-11-20 10:34:08 2017-11-20 10:41:48      3
##  4 Triage             512 Nurse … 2017-11-20 10:44:12 2017-11-20 10:50:17      4
##  5 Clinical exam      512 Doctor… 2017-11-20 11:27:12 2017-11-20 11:33:57      5
##  6 Clinical exam      510 Doctor… 2017-11-20 11:35:01 2017-11-20 11:36:09      6
##  7 Triage             517 Nurse … 2017-11-21 11:35:16 2017-11-21 11:39:00      7
##  8 Registration       518 Clerk … 2017-11-21 11:45:16 2017-11-21 11:22:16      8
##  9 Registration       518 Clerk 6 2017-11-21 11:45:16 2017-11-21 11:22:16      9
## 10 Registration       518 Clerk 9 2017-11-21 11:45:16 2017-11-21 11:22:16     10
## # … with 43 more rows, and abbreviated variable names ¹​patient_visit_nr,
## #   ²originator

Multiple columns can be combined:

hospital_actlog %>%
  detect_unique_values(column_labels = c("activity", "originator"))
## *** OUTPUT ***
## Distinct entries are computed for the following columns: 
## activity - originator
## # Log of 105 events consisting of:
## 14 traces 
## 22 cases 
## 53 instances of 9 activities 
## 12 resources 
## Events occurred from NA until NA 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 53 × 6
##    activity      origin…¹ patie…² start               complete            .order
##    <chr>         <chr>      <dbl> <dttm>              <dttm>               <int>
##  1 registration  Clerk 9      510 2017-11-20 10:18:17 2017-11-20 10:20:06      1
##  2 Registration  Clerk 12     512 2017-11-20 10:33:14 2017-11-20 10:37:00      2
##  3 Triage        Nurse 27     510 2017-11-20 10:34:08 2017-11-20 10:41:48      3
##  4 Triage        Nurse 27     512 2017-11-20 10:44:12 2017-11-20 10:50:17      4
##  5 Clinical exam Doctor 7     512 2017-11-20 11:27:12 2017-11-20 11:33:57      5
##  6 Clinical exam Doctor 7     510 2017-11-20 11:35:01 2017-11-20 11:36:09      6
##  7 Triage        Nurse 17     517 2017-11-21 11:35:16 2017-11-21 11:39:00      7
##  8 Registration  Clerk 12     518 2017-11-21 11:45:16 2017-11-21 11:22:16      8
##  9 Registration  Clerk 6      518 2017-11-21 11:45:16 2017-11-21 11:22:16      9
## 10 Registration  Clerk 9      518 2017-11-21 11:45:16 2017-11-21 11:22:16     10
## # … with 43 more rows, and abbreviated variable names ¹​originator,
## #   ²​patient_visit_nr
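
For a plain data frame, distinct combinations of columns can be listed with base R's `unique()`. The sketch below uses an invented data frame and is meant only to illustrate what "unique combinations" refers to.

```r
# Toy data: invented activity and originator pairs, one of them repeated.
toy <- data.frame(
  activity   = c("Triage", "Triage", "Registration", "Triage"),
  originator = c("Nurse 5", "Nurse 5", "Clerk 3", "Nurse 6"),
  stringsAsFactors = FALSE
)

# Distinct combinations of the two columns.
combos <- unique(toy[, c("activity", "originator")])
print(combos)
```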

Value Range Violations

Detect values that fall outside a specified domain. For example, the triage code should be a number between 0 and 5.

hospital_actlog %>%
  detect_value_range_violations(triagecode = domain_numeric(from = 0, to = 5))
## $triagecode
## $type
## [1] "numeric"
## 
## $from
## [1] 0
## 
## $to
## [1] 5
## 
## attr(,"class")
## [1] "value_range" "list"
## *** OUTPUT ***
## The domain range for column triagecode is checked.
## Values allowed between 0 and 5
## The values fall within the specified domain range for 46 (86.79%) of the rows in the activity log and outside the domain range for 7 (13.21%) of these rows.
## 
## The following rows fall outside the specified domain range for indicated column:
## # Log of 14 events consisting of:
## 5 traces 
## 6 cases 
## 7 instances of 5 activities 
## 4 resources 
## Events occurred from 2017-11-20 11:35:01 until 2017-11-23 18:33:00 
##  
## # Variables were mapped as follows:
## Case identifier:     patient_visit_nr 
## Activity identifier:     activity 
## Resource identifier:     originator 
## Timestamps:      start, complete 
## 
## # A tibble: 7 × 9
##   column_checked patie…¹ activ…² origi…³ start               complete           
##   <chr>            <dbl> <chr>   <chr>   <dttm>              <dttm>             
## 1 triagecode         510 Clinic… Doctor… 2017-11-20 11:35:01 2017-11-20 11:36:09
## 2 triagecode         529 Treatm… Doctor… 2017-11-22 16:30:00 2017-11-22 17:00:00
## 3 triagecode         530 Triage  Nurse … 2017-11-22 18:00:00 2017-11-22 18:05:00
## 4 triagecode         531 Triage  Nurse … 2017-11-22 18:05:00 2017-11-22 18:10:00
## 5 triagecode         532 Treatm… Nurse … 2017-11-22 18:15:00 2017-11-22 18:25:00
## 6 triagecode         532 Treatm… Doctor… 2017-11-22 18:27:00 2017-11-23 18:33:00
## 7 triagecode         533 0       <NA>    2017-11-22 18:35:00 2017-11-22 18:37:00
## # … with 3 more variables: triagecode <dbl>, specialization <chr>,
## #   .order <int>, and abbreviated variable names ¹​patient_visit_nr, ²​activity,
## #   ³​originator
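
A numeric domain check amounts to a simple comparison against the bounds. In the base R sketch below, missing values are also flagged, which appears consistent with the output above but is an assumption about daqapo's behaviour; the vector `triagecode` is made up.

```r
# Made-up triage codes, including a missing value and an out-of-range value.
triagecode <- c(3, 2, NA, 7, 0, 5)
from <- 0
to   <- 5

# A value violates the domain when it is missing or outside [from, to].
outside <- is.na(triagecode) | triagecode < from | triagecode > to

# Positions of the violating values and their share as a percentage.
violating_rows <- which(outside)
pct_outside    <- 100 * mean(outside)
print(violating_rows)
```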


Copyright © 2023 bupaR - Hasselt University