Augment logs

library(bupaverse)
library(dplyr)

You can enrich an event log with calculated metrics using augment(). For example, consider trace_length(), which at the case level returns the number of activity instances in each case.

traffic_fines %>%
    trace_length(level = "case") 
## # A tibble: 10,000 × 2
##    case_id absolute
##    <chr>      <int>
##  1 A10249         9
##  2 A10338         9
##  3 A10619         9
##  4 A10858         9
##  5 A12027         9
##  6 A12414         9
##  7 A13217         9
##  8 A1327          9
##  9 A13617         9
## 10 A13984         9
## # ℹ 9,990 more rows

Feeding the resulting table back to traffic_fines with augment() makes the trace length metric available as a case attribute for further analysis.

traffic_fines %>%
    trace_length(level = "case") %>%
    augment(traffic_fines) %>%
    glimpse()
## Rows: 34,724
## Columns: 19
## $ case_id              <chr> "A1", "A1", "A100", "A100", "A100", "A100", "A100…
## $ activity             <fct> Create Fine, Send Fine, Create Fine, Send Fine, I…
## $ lifecycle            <fct> complete, complete, complete, complete, complete,…
## $ resource             <fct> 561, NA, 561, NA, NA, NA, NA, 561, NA, NA, NA, NA…
## $ timestamp            <dttm> 2006-07-24, 2006-12-05, 2006-08-02, 2006-12-12, …
## $ amount               <chr> "35.0", NA, "35.0", NA, NA, "71.5", NA, "36.0", N…
## $ article              <dbl> 157, NA, 157, NA, NA, NA, NA, 157, NA, NA, NA, NA…
## $ dismissal            <chr> "NIL", NA, "NIL", NA, NA, NA, NA, "NIL", NA, NA, …
## $ expense              <chr> NA, "11.0", NA, "11.0", NA, NA, NA, NA, "13.0", N…
## $ lastsent             <chr> NA, NA, NA, NA, "P", NA, NA, NA, NA, "P", NA, NA,…
## $ matricola            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ notificationtype     <chr> NA, NA, NA, NA, "P", NA, NA, NA, NA, "P", NA, NA,…
## $ paymentamount        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 870, …
## $ points               <dbl> 0, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0, N…
## $ totalpaymentamount   <chr> "0.0", NA, "0.0", NA, NA, NA, NA, "0.0", NA, NA, …
## $ vehicleclass         <chr> "A", NA, "A", NA, NA, NA, NA, "A", NA, NA, NA, NA…
## $ activity_instance_id <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10"…
## $ .order               <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15…
## $ absolute             <int> 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6…
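
Once the metric is a case attribute, ordinary dplyr verbs can use it. The sketch below shows one possible follow-up analysis and is not part of the original example: it counts how many cases have each trace length. The names fines_augmented and trace_length_counts are made up for illustration, and the call to as.data.frame() is an assumption made here so that the subsequent verbs operate on a plain data frame rather than on a log object.

```r
library(bupaverse)
library(dplyr)

# Add the trace length to every event of the log (new column `absolute`).
fines_augmented <- traffic_fines %>%
    trace_length(level = "case") %>%
    augment(traffic_fines)

# Assumption: drop the log class first so the verbs below are plain dplyr.
trace_length_counts <- fines_augmented %>%
    as.data.frame() %>%
    distinct(case_id, absolute) %>%    # one row per case
    count(absolute, name = "n_cases")  # number of cases per trace length

trace_length_counts
```

Because the attribute is repeated on every event of a case, reducing to one row per case before counting avoids weighting long cases more heavily.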

Adjust names

Using the prefix argument, you can add a descriptive prefix to the name of the new variable. In the current example, where the variable is called absolute, it might be useful to add the prefix trace_length.

traffic_fines %>%
    trace_length(level = "case") %>%
    augment(traffic_fines, prefix = "trace_length") %>%
    glimpse()
## Rows: 34,724
## Columns: 19
## $ case_id               <chr> "A1", "A1", "A100", "A100", "A100", "A100", "A10…
## $ activity              <fct> Create Fine, Send Fine, Create Fine, Send Fine, …
## $ lifecycle             <fct> complete, complete, complete, complete, complete…
## $ resource              <fct> 561, NA, 561, NA, NA, NA, NA, 561, NA, NA, NA, N…
## $ timestamp             <dttm> 2006-07-24, 2006-12-05, 2006-08-02, 2006-12-12,…
## $ amount                <chr> "35.0", NA, "35.0", NA, NA, "71.5", NA, "36.0", …
## $ article               <dbl> 157, NA, 157, NA, NA, NA, NA, 157, NA, NA, NA, N…
## $ dismissal             <chr> "NIL", NA, "NIL", NA, NA, NA, NA, "NIL", NA, NA,…
## $ expense               <chr> NA, "11.0", NA, "11.0", NA, NA, NA, NA, "13.0", …
## $ lastsent              <chr> NA, NA, NA, NA, "P", NA, NA, NA, NA, "P", NA, NA…
## $ matricola             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ notificationtype      <chr> NA, NA, NA, NA, "P", NA, NA, NA, NA, "P", NA, NA…
## $ paymentamount         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 870,…
## $ points                <dbl> 0, NA, 0, NA, NA, NA, NA, 0, NA, NA, NA, NA, 0, …
## $ totalpaymentamount    <chr> "0.0", NA, "0.0", NA, NA, NA, NA, "0.0", NA, NA,…
## $ vehicleclass          <chr> "A", NA, "A", NA, NA, NA, NA, "A", NA, NA, NA, N…
## $ activity_instance_id  <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10…
## $ .order                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ trace_length_absolute <int> 2, 2, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, …

Select variables

Some metrics return several variables. Say you want to add information on the processing time of each activity to the data.

patients %>% 
    processing_time(level = "activity", units = "hours")
## # A tibble: 7 × 11
##   handling              min    q1    mean  median q3    max   st_dev   iqr total
##   <fct>                 <drtn> <drt> <drt> <drtn> <drt> <drt>  <dbl> <dbl> <drt>
## 1 Registration          0.828…  2.0…  2.7…  2.71…  3.4…  5.6…  0.954 1.33  1376…
## 2 Triage and Assessment 5.868… 11.3… 13.1… 13.34… 15.0… 18.8…  2.76  3.68  6552…
## 3 Discuss Results       1.333…  2.3…  2.7…  2.77…  3.2…  4.5…  0.628 0.906 1374…
## 4 Check-out             0.667…  1.6…  2.0…  2.07…  2.4…  3.8…  0.620 0.860 1014…
## 5 X-Ray                 2.294…  3.8…  4.8…  4.79…  5.6…  8.1…  1.28  1.76  1264…
## 6 Blood test            3.089…  4.7…  5.5…  5.46…  6.2…  8.1…  1.06  1.51  1311…
## 7 MRI SCAN              2.489…  3.6…  4.1…  4.09…  4.6…  5.9…  0.735 1.09   979…
## # ℹ 1 more variable: relative_frequency <dbl>

Calling augment() without any further arguments adds all columns, from min through relative_frequency, to the data.

patients %>% 
    processing_time(level = "activity", units = "hours") %>%
    augment(patients) %>%
    glimpse()
## Rows: 5,442
## Columns: 17
## $ handling           <fct> Registration, Registration, Registration, Registrat…
## $ patient            <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", …
## $ employee           <fct> r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r1,…
## $ handling_id        <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", …
## $ registration_type  <fct> start, start, start, start, start, start, start, st…
## $ time               <dttm> 2017-01-02 11:41:53, 2017-01-02 11:41:53, 2017-01-…
## $ .order             <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, …
## $ min                <drtn> 0.8288889 hours, 0.8288889 hours, 0.8288889 hours,…
## $ q1                 <drtn> 2.070417 hours, 2.070417 hours, 2.070417 hours, 2.…
## $ mean               <drtn> 2.7538 hours, 2.7538 hours, 2.7538 hours, 2.7538 h…
## $ median             <drtn> 2.713611 hours, 2.713611 hours, 2.713611 hours, 2.…
## $ q3                 <drtn> 3.402014 hours, 3.402014 hours, 3.402014 hours, 3.…
## $ max                <drtn> 5.634722 hours, 5.634722 hours, 5.634722 hours, 5.…
## $ st_dev             <dbl> 0.9539039, 0.9539039, 0.9539039, 0.9539039, 0.95390…
## $ iqr                <dbl> 1.331597, 1.331597, 1.331597, 1.331597, 1.331597, 1…
## $ total              <drtn> 1376.9 hours, 1376.9 hours, 1376.9 hours, 1376.9 h…
## $ relative_frequency <dbl> 0.183756, 0.183756, 0.183756, 0.183756, 0.183756, 0…

Using the columns argument, we can specify which columns to use for augmenting the log. For example, say we are only interested in the mean and median processing time. Let’s also add a descriptive prefix to these columns.

patients %>% 
    processing_time(level = "activity", units = "hours") %>%
    augment(patients, columns = c("mean","median"), prefix = "processing_time") %>%
    glimpse()
## Rows: 5,442
## Columns: 9
## $ handling               <fct> Registration, Registration, Registration, Regis…
## $ patient                <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "1…
## $ employee               <fct> r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r1,…
## $ handling_id            <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "1…
## $ registration_type      <fct> start, start, start, start, start, start, start…
## $ time                   <dttm> 2017-01-02 11:41:53, 2017-01-02 11:41:53, 2017…
## $ .order                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
## $ processing_time_mean   <drtn> 2.7538 hours, 2.7538 hours, 2.7538 hours, 2.75…
## $ processing_time_median <drtn> 2.713611 hours, 2.713611 hours, 2.713611 hours…
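
Conceptually, the combination of columns and prefix resembles selecting a few metric columns, renaming them, and joining them to the events on the shared key. The sketch below illustrates that idea with plain dplyr on made-up toy data. It is an analogy for the resulting table, not a description of how augment() is implemented, and all object names and values in it are hypothetical.

```r
library(dplyr)

# Toy stand-ins (hypothetical data): three events and an activity-level metric table.
events <- tibble(
    handling = c("Registration", "X-Ray", "Registration"),
    patient  = c("1", "1", "2")
)
metric <- tibble(
    handling = c("Registration", "X-Ray"),
    mean     = c(2.75, 4.80),
    median   = c(2.71, 4.79),
    max      = c(5.63, 8.10)
)

# Keep only the requested columns and prefix them, leaving the join key untouched.
selected <- metric %>%
    select(handling, mean, median) %>%
    rename_with(~ paste0("processing_time_", .x), .cols = -handling)

# Attach the selected metric columns to every matching event.
events_augmented <- left_join(events, selected, by = "handling")

events_augmented
```

Each event keeps its original columns and gains processing_time_mean and processing_time_median, while the unselected max column is left out.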

Adding multiple metrics

When you want to add multiple metrics, you must save the intermediate result after each augment() call. Consider the example below.

patients %>%
    trace_length(level = "case") %>%
    augment(patients, prefix = "trace_length") %>%
    trace_coverage(level = "case") %>%
    augment(patients, prefix = "trace_frequency") %>%
    glimpse()
## Rows: 5,442
## Columns: 10
## $ handling                 <fct> Registration, Registration, Registration, Reg…
## $ patient                  <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ employee                 <fct> r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r…
## $ handling_id              <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ registration_type        <fct> start, start, start, start, start, start, sta…
## $ time                     <dttm> 2017-01-02 11:41:53, 2017-01-02 11:41:53, 20…
## $ .order                   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ trace_frequency_trace    <chr> "Registration,Triage and Assessment,Blood tes…
## $ trace_frequency_absolute <int> 234, 258, 234, 234, 258, 234, 234, 258, 258, …
## $ trace_frequency_relative <dbl> 0.468, 0.516, 0.468, 0.468, 0.516, 0.468, 0.4…

As you can see, only the trace_coverage() values from the second augment() call are added, while the result of the first call is lost. This is because the second call joins the new metric onto the original patients data set, which was never updated with the output of the first augment() call. The proper way is as follows.

patients %>%
    trace_length(level = "case") %>%
    augment(patients, prefix = "trace_length") -> patients

patients %>%
    trace_coverage(level = "case") %>%
    augment(patients, prefix = "trace_frequency") %>%
    glimpse()
## Rows: 5,442
## Columns: 11
## $ handling                 <fct> Registration, Registration, Registration, Reg…
## $ patient                  <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ employee                 <fct> r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r1, r…
## $ handling_id              <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", …
## $ registration_type        <fct> start, start, start, start, start, start, sta…
## $ time                     <dttm> 2017-01-02 11:41:53, 2017-01-02 11:41:53, 20…
## $ .order                   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14…
## $ trace_length_absolute    <int> 6, 5, 6, 6, 5, 6, 6, 5, 5, 5, 5, 6, 6, 5, 6, …
## $ trace_frequency_trace    <chr> "Registration,Triage and Assessment,Blood tes…
## $ trace_frequency_absolute <int> 234, 258, 234, 234, 258, 234, 234, 258, 258, …
## $ trace_frequency_relative <dbl> 0.468, 0.516, 0.468, 0.468, 0.516, 0.468, 0.4…
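
Assigning the result back to patients creates an object in your workspace that masks the packaged data set for the rest of the session. A variant that avoids this is sketched below, assuming a fresh session in which patients is still the original log. The names patients_with_length and patients_both are hypothetical and chosen only for this illustration.

```r
library(bupaverse)
library(dplyr)

# First metric: store the augmented log under a new name.
patients_with_length <- patients %>%
    trace_length(level = "case") %>%
    augment(patients, prefix = "trace_length")

# Second metric: compute it on, and join it back to, the already augmented log.
patients_both <- patients_with_length %>%
    trace_coverage(level = "case") %>%
    augment(patients_with_length, prefix = "trace_frequency")

glimpse(patients_both)
```

The key point is unchanged: the log passed to augment() must be the one that already contains the earlier metrics.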

Copyright © 2023 bupaR - Hasselt University