Wrangling

library(bupaverse)
library(dplyr)

To make event logs easy to manipulate, well-known dplyr verbs have been adapted to work with them. This page serves as a general introduction to the wrangling verbs. Their usage is illustrated throughout the documentation in Manipulation, Analysis and Visualization.

group_by

Using the group_by() function, event logs can be grouped according to (a set of) variables, such that all further computations happen for each of these different groups.

In the next example, the number of cases is computed for each value of vehicleclass.

traffic_fines %>%
    group_by(vehicleclass) %>%
    n_cases()
## # A tibble: 4 × 2
##   vehicleclass n_cases
##   <chr>          <int>
## 1 A               9973
## 2 C                 21
## 3 M                  6
## 4 <NA>           10000
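
The counts above add up to 20,000, although the log contains only 10,000 cases. This suggests that a case is counted once in every group in which it has at least one event. The sketch below mimics that behaviour with plain dplyr on a small, hypothetical event table; the data and the object names are made up for illustration, and it is not the actual implementation of n_cases().

```r
library(dplyr)

# Toy event table (hypothetical data, not one of the bundled logs)
events <- tibble(
  case_id      = c("c1", "c1", "c2", "c2", "c3"),
  activity     = c("Create Fine", "Send Fine", "Create Fine", "Payment", "Create Fine"),
  vehicleclass = c("A", NA, "A", NA, "C")
)

# Count the distinct cases that have at least one event in each group
cases_per_class <- events %>%
  group_by(vehicleclass) %>%
  summarise(n_cases = n_distinct(case_id), .groups = "drop")

cases_per_class
# A: 2 cases, C: 1 case, NA: 2 cases (3 distinct cases in total)
```

Cases c1 and c2 appear both under "A" and under NA, which is why the group counts sum to more than the three cases in the table.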

Predefined groupings

For specific groupings, some auxiliary functions are available.

  • group_by_case - group by cases
  • group_by_activity - group by activity types
  • group_by_resource - group by resources
  • group_by_activity_resource - group by activity-resource pairs
  • group_by_activity_instance - group by activity instances

For example, the number of cases in which each resource occurs can be computed as follows:

sepsis %>%
    group_by_resource() %>%
    n_cases()
## # A tibble: 26 × 2
##    resource n_cases
##    <fct>      <int>
##  1 ?            294
##  2 A            985
##  3 B           1013
##  4 C           1050
##  5 D             46
##  6 E            782
##  7 F            200
##  8 G            147
##  9 H             50
## 10 I            118
## # ℹ 16 more rows

Note that each of the descriptive metrics discussed here can be rewritten using these lower-level functions. The example above is equivalent to the resource_involvement metric at resource-level, which reports the number of cases in which each resource is involved.

Grouping on id

When you want to group on a combination of mapping variables, for example, for each combination of case and activity, you can use group_by_ids(). The following example counts the number of events per case and per activity:

patients %>%
    group_by_ids(case_id, activity_id) %>%
    n_events()  
## # A tibble: 2,721 × 3
##    patient handling              n_events
##    <chr>   <fct>                    <int>
##  1 1       Blood test                   2
##  2 1       Check-out                    2
##  3 1       Discuss Results              2
##  4 1       MRI SCAN                     2
##  5 1       Registration                 2
##  6 1       Triage and Assessment        2
##  7 10      Check-out                    2
##  8 10      Discuss Results              2
##  9 10      Registration                 2
## 10 10      Triage and Assessment        2
## # ℹ 2,711 more rows

Note that the arguments of group_by_ids() are not the variable names of case (patient) and activity (handling) columns, but unquoted mapping id-functions. You can thus use this function while being agnostic of the precise variable names.
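
As a minimal sketch of what being agnostic of the variable names means in practice, the code below looks up the column names through the id-functions and wraps the grouping in a small helper. The helper events_per_case_activity() is hypothetical and introduced only for illustration; the sketch assumes the patients log comes from the eventdataR package.

```r
library(bupaR)
library(eventdataR)  # assumed source of the patients log
library(dplyr)       # for %>%

# The id-functions return the column names stored in the mapping
case_col <- case_id(patients)          # "patient"
activity_col <- activity_id(patients)  # "handling"

# Hypothetical helper: no column name is hard-coded, so it can be
# applied to any event log, whatever its columns are called
events_per_case_activity <- function(log) {
  log %>%
    group_by_ids(case_id, activity_id) %>%
    n_events()
}

counts <- events_per_case_activity(patients)
```

The same helper could be applied to sepsis or traffic_fines without changes, since each log carries its own mapping.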

Remove grouping

When a grouping is no longer needed, it can be removed using ungroup_eventlog().
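
A short sketch of the round trip, assuming the patients log from the eventdataR package: on a grouped log n_cases() returns one row per group, while after ungroup_eventlog() it returns a single number for the whole log again.

```r
library(bupaR)
library(eventdataR)  # assumed source of the patients log
library(dplyr)       # for %>%

# Group by case, then remove the grouping again
grouped <- patients %>% group_by_case()
ungrouped <- grouped %>% ungroup_eventlog()

# Back to one overall count for the log
n_cases(ungrouped)
```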

mutate

You can use mutate() to add new variables to an event log, possibly derived from existing variables. In the next example, the total amount of lacticacid is computed for each case.

sepsis %>%
    group_by_case() %>%
    mutate(total_lacticacid = sum(lacticacid, na.rm = TRUE))
## # Groups: [case_id]
## Grouped # Log of 15214 events consisting of:
## 846 traces 
## 1050 cases 
## 15214 instances of 16 activities 
## 26 resources 
## Events occurred from 2013-11-07 08:18:29 until 2015-06-05 12:25:11 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 15,214 × 35
##    case_id activity  lifecycle resource timestamp             age   crp diagnose
##    <chr>   <fct>     <fct>     <fct>    <dttm>              <dbl> <dbl> <chr>   
##  1 A       ER Regis… complete  A        2014-10-22 11:15:41    85    NA A       
##  2 A       Leucocyt… complete  B        2014-10-22 11:27:00    NA    NA <NA>    
##  3 A       CRP       complete  B        2014-10-22 11:27:00    NA   210 <NA>    
##  4 A       LacticAc… complete  B        2014-10-22 11:27:00    NA    NA <NA>    
##  5 A       ER Triage complete  C        2014-10-22 11:33:37    NA    NA <NA>    
##  6 A       ER Sepsi… complete  A        2014-10-22 11:34:00    NA    NA <NA>    
##  7 A       IV Liquid complete  A        2014-10-22 14:03:47    NA    NA <NA>    
##  8 A       IV Antib… complete  A        2014-10-22 14:03:47    NA    NA <NA>    
##  9 A       Admissio… complete  D        2014-10-22 14:13:19    NA    NA <NA>    
## 10 A       CRP       complete  B        2014-10-24 09:00:00    NA  1090 <NA>    
## # ℹ 15,204 more rows
## # ℹ 27 more variables: diagnosticartastrup <lgl>, diagnosticblood <lgl>,
## #   diagnosticecg <lgl>, diagnosticic <lgl>, diagnosticlacticacid <lgl>,
## #   diagnosticliquor <lgl>, diagnosticother <lgl>, diagnosticsputum <lgl>,
## #   diagnosticurinaryculture <lgl>, diagnosticurinarysediment <lgl>,
## #   diagnosticxthorax <lgl>, disfuncorg <lgl>, hypotensie <lgl>, hypoxie <lgl>,
## #   infectionsuspected <lgl>, infusion <lgl>, lacticacid <dbl>, …
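
Because the new column is cut off in the printed output, the effect of a grouped mutate() is easier to see on a small table. The following sketch uses plain dplyr on hypothetical toy data (not the sepsis log): the per-case total is repeated on every event of the case, and missing measurements are skipped through na.rm = TRUE.

```r
library(dplyr)

# Toy measurements (hypothetical values), most events have no lacticacid value
events <- tibble(
  case_id    = c("A", "A", "A", "B", "B"),
  lacticacid = c(2.2, NA, 1.0, NA, 0.8)
)

# Grouped mutate: one total per case, attached to each of its events
with_totals <- events %>%
  group_by(case_id) %>%
  mutate(total_lacticacid = sum(lacticacid, na.rm = TRUE)) %>%
  ungroup()

with_totals
# case A carries 3.2 on all three rows, case B carries 0.8 on both rows
```

Without na.rm = TRUE, every case with at least one missing value would get NA as its total.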

filter

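Generic filtering of events can be done using filter(), which takes an event log and any number of logical conditions. The example below keeps the events where the vehicle class is “C” and the amount is greater than 300. Note that amount is stored as a character column in this log (see <chr> in the output), so the comparison is made on text rather than on numbers.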

traffic_fines %>%
    filter(vehicleclass == "C", amount > 300)
## # Log of 20 events consisting of:
## 1 trace 
## 20 cases 
## 20 instances of 1 activity 
## 10 resources 
## Events occurred from 2006-08-10 until 2008-02-09 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 20 × 18
##    case_id activity    lifecycle resource timestamp           amount article
##    <chr>   <fct>       <fct>     <fct>    <dttm>              <chr>    <dbl>
##  1 A10060  Create Fine complete  541      2007-03-08 00:00:00 36.0       157
##  2 A10497  Create Fine complete  558      2007-03-30 00:00:00 36.0       157
##  3 A10818  Create Fine complete  561      2007-04-08 00:00:00 36.0       157
##  4 A11707  Create Fine complete  550      2007-04-24 00:00:00 36.0       157
##  5 A11936  Create Fine complete  557      2007-04-29 00:00:00 36.0       157
##  6 A12073  Create Fine complete  557      2007-05-03 00:00:00 36.0       157
##  7 A1408   Create Fine complete  559      2006-08-20 00:00:00 35.0       157
##  8 A14883  Create Fine complete  561      2007-06-29 00:00:00 36.0       157
##  9 A17130  Create Fine complete  541      2007-07-15 00:00:00 36.0       157
## 10 A1815   Create Fine complete  563      2006-08-10 00:00:00 35.0       157
## 11 A19109  Create Fine complete  556      2007-07-17 00:00:00 36.0       157
## 12 A23000  Create Fine complete  550      2007-12-29 00:00:00 36.0       157
## 13 A24247  Create Fine complete  561      2007-12-03 00:00:00 36.0       157
## 14 A24366  Create Fine complete  541      2008-02-09 00:00:00 36.0       157
## 15 A24634  Create Fine complete  537      2007-11-21 00:00:00 36.0       157
## 16 A24942  Create Fine complete  561      2007-12-30 00:00:00 36.0       157
## 17 A25581  Create Fine complete  559      2007-11-23 00:00:00 36.0       157
## 18 A25599  Create Fine complete  559      2007-11-24 00:00:00 36.0       157
## 19 A26099  Create Fine complete  559      2007-12-09 00:00:00 36.0       157
## 20 A26277  Create Fine complete  538      2008-01-07 00:00:00 36.0       157
## # ℹ 11 more variables: dismissal <chr>, expense <chr>, lastsent <chr>,
## #   matricola <dbl>, notificationtype <chr>, paymentamount <dbl>, points <dbl>,
## #   totalpaymentamount <chr>, vehicleclass <chr>, activity_instance_id <chr>,
## #   .order <int>
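
In the output above, amount has type <chr> and rows with an amount of 36.0 or 35.0 satisfy amount > 300. The likely reason is that R compares a character column with a number as text. The sketch below reproduces the effect with plain dplyr on hypothetical toy data and shows one possible way around it, converting the column with as.numeric() inside the condition.

```r
library(dplyr)

# Toy fines (hypothetical values) with the amount stored as text
fines <- tibble(
  case_id      = c("A1", "A2", "A3"),
  vehicleclass = c("C", "C", "A"),
  amount       = c("36.0", "350.0", "500.0")
)

# Text comparison: "36.0" sorts after "300", so the row is kept
as_text <- fines %>%
  filter(vehicleclass == "C", amount > 300)

# Numeric comparison: only amounts above 300 remain
as_number <- fines %>%
  filter(vehicleclass == "C", as.numeric(amount) > 300)
```

Here as_text keeps A1 and A2, while as_number keeps only A2.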

select

Variables of an event log can be selected using select(). By default, select() always retains the mapping variables, because without them the result would no longer be a valid eventlog object.

traffic_fines %>%
    select(vehicleclass)
## # Log of 34724 events consisting of:
## 44 traces 
## 10000 cases 
## 34724 instances of 11 activities 
## 16 resources 
## Events occurred from 2006-06-17 until 2012-03-26 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 34,724 × 8
##    vehicleclass case_id activity        activity_instance_id timestamp          
##    <chr>        <chr>   <fct>           <chr>                <dttm>             
##  1 A            A1      Create Fine     1                    2006-07-24 00:00:00
##  2 <NA>         A1      Send Fine       2                    2006-12-05 00:00:00
##  3 A            A100    Create Fine     3                    2006-08-02 00:00:00
##  4 <NA>         A100    Send Fine       4                    2006-12-12 00:00:00
##  5 <NA>         A100    Insert Fine No… 5                    2007-01-15 00:00:00
##  6 <NA>         A100    Add penalty     6                    2007-03-16 00:00:00
##  7 <NA>         A100    Send for Credi… 7                    2009-03-30 00:00:00
##  8 A            A10000  Create Fine     8                    2007-03-09 00:00:00
##  9 <NA>         A10000  Send Fine       9                    2007-07-17 00:00:00
## 10 <NA>         A10000  Insert Fine No… 10                   2007-08-02 00:00:00
## # ℹ 34,714 more rows
## # ℹ 3 more variables: resource <fct>, lifecycle <fct>, .order <int>

By setting the argument force_df = TRUE, the mapping variables are no longer retained automatically, and the output is a data.frame instead of an eventlog object. This holds even when all mapping variables are selected.

traffic_fines %>%
    select(case_id, vehicleclass, amount, force_df = TRUE)
## # A tibble: 34,724 × 3
##    case_id vehicleclass amount
##    <chr>   <chr>        <chr> 
##  1 A1      A            35.0  
##  2 A1      <NA>         <NA>  
##  3 A100    A            35.0  
##  4 A100    <NA>         <NA>  
##  5 A100    <NA>         <NA>  
##  6 A100    <NA>         71.5  
##  7 A100    <NA>         <NA>  
##  8 A10000  A            36.0  
##  9 A10000  <NA>         <NA>  
## 10 A10000  <NA>         <NA>  
## # ℹ 34,714 more rows
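
The difference between the two calls can also be checked on the class of the result. This is a small sketch, assuming the traffic_fines log from the eventdataR package and that event logs carry the class "eventlog"; the object names are made up for illustration.

```r
library(bupaR)
library(eventdataR)  # assumed source of the traffic_fines log
library(dplyr)       # for select() and %>%

# Default: mapping variables are added back, the result is still an event log
kept <- traffic_fines %>%
  select(vehicleclass)

# force_df = TRUE: only the requested columns, no longer an event log
plain <- traffic_fines %>%
  select(case_id, vehicleclass, amount, force_df = TRUE)

inherits(kept, "eventlog")
inherits(plain, "eventlog")
```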

Selecting id

Similar to group_by_ids(), select_ids() can be used to select the mapping variables.

patients %>%
    select_ids(case_id, activity_id)
## # A tibble: 5,442 × 2
##    patient handling    
##    <chr>   <fct>       
##  1 1       Registration
##  2 2       Registration
##  3 3       Registration
##  4 4       Registration
##  5 5       Registration
##  6 6       Registration
##  7 7       Registration
##  8 8       Registration
##  9 9       Registration
## 10 10      Registration
## # ℹ 5,432 more rows

Note again how the arguments are unquoted id-functions instead of raw variable names. select_ids() always returns a data.frame object, as typically not all ids in the mapping will be selected.
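
Since the result is an ordinary data.frame, it can be passed on to regular dplyr verbs. As a sketch, assuming the patients log from the eventdataR package, the following lists each case-activity combination once; the number of rows should match the number of groups obtained with group_by_ids(case_id, activity_id) earlier on this page.

```r
library(bupaR)
library(eventdataR)  # assumed source of the patients log
library(dplyr)       # for distinct() and %>%

# One row per combination of case and activity
pairs <- patients %>%
  select_ids(case_id, activity_id) %>%
  distinct()

nrow(pairs)
```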

arrange

Event data can be sorted using arrange(). Wrap a variable in desc() to sort on it in descending order.

# sort descending on time
patients %>%
    arrange(desc(time))
## # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 × 7
##    handling   patient employee handling_id registration_type time               
##    <fct>      <chr>   <fct>    <chr>       <fct>             <dttm>             
##  1 Triage an… 500     r2       1000        complete          2018-05-05 07:16:02
##  2 Discuss R… 495     r6       2229        complete          2018-05-05 02:49:57
##  3 X-Ray      498     r5       1734        complete          2018-05-05 01:34:30
##  4 Triage an… 500     r2       1000        start             2018-05-04 23:53:27
##  5 Triage an… 499     r2       999         complete          2018-05-04 23:53:27
##  6 Discuss R… 495     r6       2229        start             2018-05-04 23:50:05
##  7 Discuss R… 489     r6       2223        complete          2018-05-04 23:50:05
##  8 X-Ray      498     r5       1734        start             2018-05-04 21:50:07
##  9 X-Ray      497     r5       1733        complete          2018-05-04 21:50:07
## 10 Discuss R… 489     r6       2223        start             2018-05-04 20:24:44
## # ℹ 5,432 more rows
## # ℹ 1 more variable: .order <int>
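
Several sorting keys can be combined. The sketch below uses plain dplyr on a hypothetical toy table (the column names only mimic the patients log) to sort by case first and then by time in descending order within each case.

```r
library(dplyr)

# Toy events (hypothetical values)
events <- tibble(
  patient  = c("1", "2", "1", "2"),
  handling = c("Registration", "Registration", "Triage", "Triage"),
  time     = as.POSIXct(
    c("2017-01-02 11:41:53", "2017-01-02 12:40:20",
      "2017-01-02 12:40:20", "2017-01-02 22:32:25"),
    tz = "UTC"
  )
)

# Ascending on patient, descending on time within each patient
sorted <- events %>%
  arrange(patient, desc(time))

sorted
# for each patient, Triage (the later event) now comes before Registration
```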


Copyright © 2023 bupaR - Hasselt University