Wrangling

library(bupaverse)
library(dplyr)

To make logs easy to manipulate, well-known dplyr verbs have been adapted to event logs. This page serves as a general introduction to these wrangling verbs. Their usage is illustrated throughout the documentation in Manipulation, Analysis and Visualization.

group_by

Using the group_by() function, event logs can be grouped according to (a set of) variables, such that all further computations happen for each of these different groups.

In the next example, the number of cases is computed for each value of vehicleclass.

traffic_fines %>%
    group_by(vehicleclass) %>%
    n_cases()
## # A tibble: 4 × 2
##   vehicleclass n_cases
##   <chr>          <int>
## 1 A               9973
## 2 C                 21
## 3 M                  6
## 4 <NA>           10000

Predefined groupings

For specific groupings, some auxiliary functions are available.

  • group_by_case - group by cases
  • group_by_activity - group by activity types
  • group_by_resource - group by resources
  • group_by_activity_resource - group by activity-resource pairs
  • group_by_activity_instance - group by activity instances

For example, the number of cases in which each resource occurs can be computed as follows:

sepsis %>%
    group_by_resource() %>%
    n_cases()
## # A tibble: 26 × 2
##    resource n_cases
##    <fct>      <int>
##  1 ?            294
##  2 A            985
##  3 B           1013
##  4 C           1050
##  5 D             46
##  6 E            782
##  7 F            200
##  8 G            147
##  9 H             50
## 10 I            118
## # … with 16 more rows

Note that each of the descriptive metrics discussed here can be rewritten using these lower-level functions. The example above is equivalent to the resource_involvement metric at case-level.

Grouping on id

When you want to group on a combination of mapping variables, for example on each combination of case and activity, you can use group_by_ids(). The following example counts the number of events per case and per activity:

patients %>%
    group_by_ids(case_id, activity_id) %>%
    n_events()  
## # A tibble: 2,721 × 3
##    patient handling              n_events
##    <chr>   <fct>                    <int>
##  1 1       Blood test                   2
##  2 1       Check-out                    2
##  3 1       Discuss Results              2
##  4 1       MRI SCAN                     2
##  5 1       Registration                 2
##  6 1       Triage and Assessment        2
##  7 10      Check-out                    2
##  8 10      Discuss Results              2
##  9 10      Registration                 2
## 10 10      Triage and Assessment        2
## # … with 2,711 more rows

Note that the arguments of group_by_ids() are not the variable names of the case (patient) and activity (handling) columns, but unquoted mapping id-functions. You can thus use this function without knowing the precise variable names.

Remove grouping

When a grouping is no longer needed, it can be removed using ungroup_eventlog().
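
A minimal sketch of removing a grouping, assuming the sepsis log used in the examples above. The object names grouped and ungrouped are only illustrative. With the grouping in place, n_cases() returns one row per group; after ungroup_eventlog() it is computed on the log as a whole.

```r
library(bupaverse)
library(dplyr)

# group the log by resource: metrics are now computed per resource
grouped <- sepsis %>%
    group_by_resource()

# remove the grouping again: metrics are computed on the whole log
ungrouped <- grouped %>%
    ungroup_eventlog()

n_cases(grouped)    # a tibble with one row per resource
n_cases(ungrouped)  # a single number for the complete log
```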

mutate

You can use mutate() to add new variables to an event log, possibly based on existing variables. In the next example, the total amount of lacticacid is computed for each case.

sepsis %>%
    group_by_case() %>%
    mutate(total_lacticacid = sum(lacticacid, na.rm = TRUE))
## # Groups: [case_id]
## Grouped # Log of 15214 events consisting of:
## 846 traces 
## 1050 cases 
## 15214 instances of 16 activities 
## 26 resources 
## Events occurred from 2013-11-07 08:18:29 until 2015-06-05 12:25:11 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 15,214 × 35
##    case_id activity      lifec…¹ resou…² timestamp             age   crp diagn…³
##    <chr>   <fct>         <fct>   <fct>   <dttm>              <dbl> <dbl> <chr>  
##  1 A       ER Registrat… comple… A       2014-10-22 11:15:41    85    NA A      
##  2 A       Leucocytes    comple… B       2014-10-22 11:27:00    NA    NA <NA>   
##  3 A       CRP           comple… B       2014-10-22 11:27:00    NA   210 <NA>   
##  4 A       LacticAcid    comple… B       2014-10-22 11:27:00    NA    NA <NA>   
##  5 A       ER Triage     comple… C       2014-10-22 11:33:37    NA    NA <NA>   
##  6 A       ER Sepsis Tr… comple… A       2014-10-22 11:34:00    NA    NA <NA>   
##  7 A       IV Liquid     comple… A       2014-10-22 14:03:47    NA    NA <NA>   
##  8 A       IV Antibioti… comple… A       2014-10-22 14:03:47    NA    NA <NA>   
##  9 A       Admission NC  comple… D       2014-10-22 14:13:19    NA    NA <NA>   
## 10 A       CRP           comple… B       2014-10-24 09:00:00    NA  1090 <NA>   
## # … with 15,204 more rows, 27 more variables: diagnosticartastrup <lgl>,
## #   diagnosticblood <lgl>, diagnosticecg <lgl>, diagnosticic <lgl>,
## #   diagnosticlacticacid <lgl>, diagnosticliquor <lgl>, diagnosticother <lgl>,
## #   diagnosticsputum <lgl>, diagnosticurinaryculture <lgl>,
## #   diagnosticurinarysediment <lgl>, diagnosticxthorax <lgl>, disfuncorg <lgl>,
## #   hypotensie <lgl>, hypoxie <lgl>, infectionsuspected <lgl>, infusion <lgl>,
## #   lacticacid <dbl>, leucocytes <chr>, oligurie <lgl>, …
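
As a further sketch, mutate() can also be used without a grouping, for instance to derive a numeric version of a column that is stored as text. This assumes that the amount column of traffic_fines is a character column holding numbers. The names amount_num and fines_num are hypothetical and chosen only for this illustration.

```r
library(bupaverse)
library(dplyr)

# add a numeric copy of the amount column;
# amount_num is a hypothetical column name used for this sketch
fines_num <- traffic_fines %>%
    mutate(amount_num = as.numeric(amount))
```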

filter

Generic filtering of events can be done using filter(), which takes an event log and any number of logical conditions. The example below keeps the events for which the vehicle class is "C" and the amount is greater than 300.

traffic_fines %>%
    filter(vehicleclass == "C", amount > 300)
## # Log of 20 events consisting of:
## 1 trace 
## 20 cases 
## 20 instances of 1 activity 
## 10 resources 
## Events occurred from 2006-08-10 until 2008-02-09 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 20 × 18
##    case_id activity   lifec…¹ resou…² timestamp           amount article dismi…³
##    <chr>   <fct>      <fct>   <fct>   <dttm>              <chr>    <dbl> <chr>  
##  1 A10060  Create Fi… comple… 541     2007-03-08 00:00:00 36.0       157 NIL    
##  2 A10497  Create Fi… comple… 558     2007-03-30 00:00:00 36.0       157 NIL    
##  3 A10818  Create Fi… comple… 561     2007-04-08 00:00:00 36.0       157 NIL    
##  4 A11707  Create Fi… comple… 550     2007-04-24 00:00:00 36.0       157 NIL    
##  5 A11936  Create Fi… comple… 557     2007-04-29 00:00:00 36.0       157 NIL    
##  6 A12073  Create Fi… comple… 557     2007-05-03 00:00:00 36.0       157 NIL    
##  7 A1408   Create Fi… comple… 559     2006-08-20 00:00:00 35.0       157 NIL    
##  8 A14883  Create Fi… comple… 561     2007-06-29 00:00:00 36.0       157 NIL    
##  9 A17130  Create Fi… comple… 541     2007-07-15 00:00:00 36.0       157 NIL    
## 10 A1815   Create Fi… comple… 563     2006-08-10 00:00:00 35.0       157 NIL    
## 11 A19109  Create Fi… comple… 556     2007-07-17 00:00:00 36.0       157 NIL    
## 12 A23000  Create Fi… comple… 550     2007-12-29 00:00:00 36.0       157 NIL    
## 13 A24247  Create Fi… comple… 561     2007-12-03 00:00:00 36.0       157 NIL    
## 14 A24366  Create Fi… comple… 541     2008-02-09 00:00:00 36.0       157 NIL    
## 15 A24634  Create Fi… comple… 537     2007-11-21 00:00:00 36.0       157 NIL    
## 16 A24942  Create Fi… comple… 561     2007-12-30 00:00:00 36.0       157 NIL    
## 17 A25581  Create Fi… comple… 559     2007-11-23 00:00:00 36.0       157 NIL    
## 18 A25599  Create Fi… comple… 559     2007-11-24 00:00:00 36.0       157 NIL    
## 19 A26099  Create Fi… comple… 559     2007-12-09 00:00:00 36.0       157 NIL    
## 20 A26277  Create Fi… comple… 538     2008-01-07 00:00:00 36.0       157 NIL    
## # … with 10 more variables: expense <chr>, lastsent <chr>, matricola <dbl>,
## #   notificationtype <chr>, paymentamount <dbl>, points <dbl>,
## #   totalpaymentamount <chr>, vehicleclass <chr>, activity_instance_id <chr>,
## #   .order <int>, and abbreviated variable names ¹​lifecycle, ²​resource,
## #   ³​dismissal
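
Note that the output above lists amount as a character column (<chr>), so amount > 300 is evaluated as a string comparison. This is why values such as 36.0 pass the filter. The base R sketch below uses made-up values to illustrate the difference. Converting with as.numeric() is one possible way to obtain a numeric comparison.

```r
# illustrative values only, stored as text like the amount column above
amount <- c("36.0", "35.0", "71.5", "500.0")

# character versus number: R coerces 300 to "300" and compares strings
as_text <- amount > 300

# converting to numeric first compares the actual values
as_number <- as.numeric(amount) > 300

print(as_text)
print(as_number)
```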

select

Variables in an event log can be selected using select(). By default, select() always retains the mapping variables, since without them the result would no longer be a valid eventlog object.

traffic_fines %>%
    select(vehicleclass)
## # Log of 34724 events consisting of:
## 44 traces 
## 10000 cases 
## 34724 instances of 11 activities 
## 16 resources 
## Events occurred from 2006-06-17 until 2012-03-26 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 34,724 × 8
##    vehiclec…¹ case_id activ…² activ…³ timestamp           resou…⁴ lifec…⁵ .order
##    <chr>      <chr>   <fct>   <chr>   <dttm>              <fct>   <fct>    <int>
##  1 A          A1      Create… 1       2006-07-24 00:00:00 561     comple…      1
##  2 <NA>       A1      Send F… 2       2006-12-05 00:00:00 <NA>    comple…      2
##  3 A          A100    Create… 3       2006-08-02 00:00:00 561     comple…      3
##  4 <NA>       A100    Send F… 4       2006-12-12 00:00:00 <NA>    comple…      4
##  5 <NA>       A100    Insert… 5       2007-01-15 00:00:00 <NA>    comple…      5
##  6 <NA>       A100    Add pe… 6       2007-03-16 00:00:00 <NA>    comple…      6
##  7 <NA>       A100    Send f… 7       2009-03-30 00:00:00 <NA>    comple…      7
##  8 A          A10000  Create… 8       2007-03-09 00:00:00 561     comple…      8
##  9 <NA>       A10000  Send F… 9       2007-07-17 00:00:00 <NA>    comple…      9
## 10 <NA>       A10000  Insert… 10      2007-08-02 00:00:00 <NA>    comple…     10
## # … with 34,714 more rows, and abbreviated variable names ¹​vehicleclass,
## #   ²​activity, ³​activity_instance_id, ⁴​resource, ⁵​lifecycle

By setting the argument force_df = TRUE, the mapping variables are not retained automatically, and the output is a data.frame instead of an eventlog object. Note that the output is a data.frame even when all mapping variables are selected.

traffic_fines %>%
    select(case_id, vehicleclass, amount, force_df = TRUE)
## # A tibble: 34,724 × 3
##    case_id vehicleclass amount
##    <chr>   <chr>        <chr> 
##  1 A1      A            35.0  
##  2 A1      <NA>         <NA>  
##  3 A100    A            35.0  
##  4 A100    <NA>         <NA>  
##  5 A100    <NA>         <NA>  
##  6 A100    <NA>         71.5  
##  7 A100    <NA>         <NA>  
##  8 A10000  A            36.0  
##  9 A10000  <NA>         <NA>  
## 10 A10000  <NA>         <NA>  
## # … with 34,714 more rows

Selecting id

Similar to group_by_ids(), select_ids() can be used to select the mapping variables.

patients %>%
    select_ids(case_id, activity_id)
## # A tibble: 5,442 × 2
##    patient handling    
##    <chr>   <fct>       
##  1 1       Registration
##  2 2       Registration
##  3 3       Registration
##  4 4       Registration
##  5 5       Registration
##  6 6       Registration
##  7 7       Registration
##  8 8       Registration
##  9 9       Registration
## 10 10      Registration
## # … with 5,432 more rows

Note again that the arguments are unquoted id-functions instead of raw variable names. select_ids() always returns a data.frame object, as typically not all ids in the mapping will be selected.

arrange

Event data can be sorted using arrange(). The desc() helper can be used to sort in descending order on an attribute.

# sort descending on time
patients %>%
    arrange(desc(time))
## # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 × 7
##    handling           patient emplo…¹ handl…² regis…³ time                .order
##    <fct>              <chr>   <fct>   <chr>   <fct>   <dttm>               <int>
##  1 Triage and Assess… 500     r2      1000    comple… 2018-05-05 07:16:02   3721
##  2 Discuss Results    495     r6      2229    comple… 2018-05-05 02:49:57   4950
##  3 X-Ray              498     r5      1734    comple… 2018-05-05 01:34:30   4455
##  4 Triage and Assess… 500     r2      1000    start   2018-05-04 23:53:27   1000
##  5 Triage and Assess… 499     r2      999     comple… 2018-05-04 23:53:27   3720
##  6 Discuss Results    495     r6      2229    start   2018-05-04 23:50:05   2229
##  7 Discuss Results    489     r6      2223    comple… 2018-05-04 23:50:05   4944
##  8 X-Ray              498     r5      1734    start   2018-05-04 21:50:07   1734
##  9 X-Ray              497     r5      1733    comple… 2018-05-04 21:50:07   4454
## 10 Discuss Results    489     r6      2223    start   2018-05-04 20:24:44   2223
## # … with 5,432 more rows, and abbreviated variable names ¹​employee,
## #   ²​handling_id, ³​registration_type
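
As a small sketch, several sort keys can be combined in one arrange() call. The example assumes the patients log shown above, with its patient and time columns, and orders the events by case and then from newest to oldest within each case. The object name sorted is only illustrative.

```r
library(bupaverse)
library(dplyr)

# sort by case first, then from newest to oldest event within each case
sorted <- patients %>%
    arrange(patient, desc(time))
```

Since patient is a character column, the cases are ordered alphabetically rather than numerically.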


Copyright © 2023 bupaR - Hasselt University