Developer Guide


Object classes

bupaR has two main object classes: eventlog and activitylog. Both are special types of data.frame. In addition, there is the overarching object class log, which is used by functions that do not need to distinguish between the two. It serves only as a higher-level classification of eventlog and activitylog objects and cannot stand on its own: an object that has the class log always has one of the two subclasses as well.

The defining characteristics of a log are stored in regular variables, whose names can be obtained with the mapping() function.

mapping(patients)

## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type

mapping(patients_act)

## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Timestamps:      start, complete

Note that both eventlog and activitylog have some mapping-elements in common:

case identifier
activity identifier
resource identifier

While other mapping-elements are slightly different:

the activity instance identifier only exists for eventlog. For activitylog, each row is an activity instance by definition
the lifecycle information for eventlog consists of a single lifecycle column next to a single timestamp column. For activitylog, it consists of multiple timestamp columns, one per lifecycle status. (At least the start and complete columns are required, although they can contain NAs.) These columns are stored under the timestamps mapping element.

Note that there are two classes for the mapping, one for eventlog and one for activitylog. (Note also that the eventlog mapping has a dedicated print() method, while the activitylog mapping does not (yet) and prints as a regular list.)

Individual mapping variables can be obtained with the dedicated id functions. They work both on the logs themselves and on the mappings.

activity_id(patients)

## [1] "handling"

activity_id(patients_act)

## [1] "handling"

mapping_event <- mapping(patients)
mapping_act <- mapping(patients_act)
activity_id(mapping_event)

## [1] "handling"

activity_id(mapping_act)

## [1] "handling"

During data manipulation, it can happen (or be necessary) that the log is converted to a regular data.frame for some operations. If the final output of the function should again be a log object (and not a visual or summary table), the stored mapping can be used to restore the original log class and mapping. This can be done using re_map().

patients_df <- as.data.frame(patients)
class(patients_df)

## [1] "data.frame"

patients_log <- re_map(patients_df, mapping_event)
class(patients_log)

## [1] "eventlog"   "log"        "tbl_df"     "tbl"        "data.frame"

re_map() recognizes the class of the mapping, and thus works for both activitylog and eventlog mappings. It will always return to the original type. (I.e. if the mapping originates from an activitylog object, it will result once again in an activitylog object.) It can never be used to convert activitylog to eventlog, or vice versa.

While re_map() is exported by bupaR, it is primarily intended for internal use. It is useful to end users only in more advanced scenarios.
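The pattern described above (store the mapping, work on a plain data.frame, restore the log) can be sketched as a small helper. This is only an illustration: add_case_event_count() and the column n_events_in_case are hypothetical names, and the example assumes that the patients log is loaded from the eventdataR package. Since mutate() already has log methods, the detour via a data.frame is not strictly needed here; it only serves to show the round trip.

```r
library(bupaR)
library(eventdataR)  # assumption: provides the patients event log
library(dplyr)

# Hypothetical helper: count the events per case on a plain data.frame,
# then restore the original log class with re_map().
add_case_event_count <- function(log) {
  log_mapping <- mapping(log)            # remember the mapping first
  log %>%
    as.data.frame() %>%                  # drop the log class
    group_by(.data[[case_id(log_mapping)]]) %>%
    mutate(n_events_in_case = n()) %>%
    ungroup() %>%
    re_map(log_mapping)                  # back to the original log type
}

patients_counted <- add_case_event_count(patients)
class(patients_counted)
```

Because re_map() dispatches on the class of the mapping, the same helper should also return an activitylog when it is given one.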

Note that functions that are not exported can always be accessed with the ::: operator instead of the :: operator. For instance, we can use the non-exported activity_id_() function outside of bupaR as follows:

bupaR:::activity_id_(patients)

## handling

While you should typically not need these functions outside of bupaR, except perhaps for developing or testing some code interactively, we will use the ::: notation in this manual whenever we refer to internal functions.

activity_id_() is a variant of activity_id(): instead of returning a character string, it returns a symbol. This symbol is useful when you want to use the mapping variable while programming.

For example, suppose you want to filter the patients log to keep only the case where patient == 1, but you don’t know that the case identifier is “patient”. You therefore use the case_id() function to look it up.

The following will not work.

patients %>%
    filter(case_id(patients) == 1)

## EMPTY EVENT LOG
## # A tibble: 0 × 7
## # ℹ 7 variables: handling <fct>, patient <chr>, employee <fct>,
## #   handling_id <chr>, registration_type <fct>, time <dttm>, .order <int>

Neither will the following.

patients %>%
    filter("patient" == 1)

## EMPTY EVENT LOG
## # A tibble: 0 × 7
## # ℹ 7 variables: handling <fct>, patient <chr>, employee <fct>,
## #   handling_id <chr>, registration_type <fct>, time <dttm>, .order <int>
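Both attempts fail for the same reason: the condition compares a character string with a number instead of comparing the values of a column. A minimal base R sketch of what filter() effectively evaluates (the variable name case_var is only illustrative):

```r
# case_id(patients) returns the column name as a string, "patient" in this log
case_var <- "patient"

# The string itself is compared with 1, not the values in the patient column
comparison <- case_var == 1
comparison  # FALSE
```

The result is a single FALSE, which applies to every row, so no events are kept.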

In order to successfully do this, we could use the symbol:

patients %>%
    filter(!!bupaR:::case_id_(patients) == 1)

## # Log of 12 events consisting of:
## 1 trace 
## 1 case 
## 6 instances of 6 activities 
## 6 resources 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-09 19:45:45 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 12 × 7
##    handling   patient employee handling_id registration_type time               
##    <fct>      <chr>   <fct>    <chr>       <fct>             <dttm>             
##  1 Registrat… 1       r1       1           start             2017-01-02 11:41:53
##  2 Triage an… 1       r2       501         start             2017-01-02 12:40:20
##  3 Blood test 1       r3       1001        start             2017-01-05 08:59:04
##  4 MRI SCAN   1       r4       1238        start             2017-01-05 21:37:12
##  5 Discuss R… 1       r6       1735        start             2017-01-07 07:57:49
##  6 Check-out  1       r7       2230        start             2017-01-09 17:09:43
##  7 Registrat… 1       r1       1           complete          2017-01-02 12:40:20
##  8 Triage an… 1       r2       501         complete          2017-01-02 22:32:25
##  9 Blood test 1       r3       1001        complete          2017-01-05 14:34:27
## 10 MRI SCAN   1       r4       1238        complete          2017-01-06 01:54:23
## 11 Discuss R… 1       r6       1735        complete          2017-01-07 10:18:08
## 12 Check-out  1       r7       2230        complete          2017-01-09 19:45:45
## # ℹ 1 more variable: .order <int>

More on symbols and !!: https://adv-r.hadley.nz/quasiquotation.html

Alternatively, the following notation works as well.

patients %>%
    filter(.data[[case_id(patients)]] == 1)

## # Log of 12 events consisting of:
## 1 trace 
## 1 case 
## 6 instances of 6 activities 
## 6 resources 
## Events occurred from 2017-01-02 11:41:53 until 2017-01-09 19:45:45 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 12 × 7
##    handling   patient employee handling_id registration_type time               
##    <fct>      <chr>   <fct>    <chr>       <fct>             <dttm>             
##  1 Registrat… 1       r1       1           start             2017-01-02 11:41:53
##  2 Triage an… 1       r2       501         start             2017-01-02 12:40:20
##  3 Blood test 1       r3       1001        start             2017-01-05 08:59:04
##  4 MRI SCAN   1       r4       1238        start             2017-01-05 21:37:12
##  5 Discuss R… 1       r6       1735        start             2017-01-07 07:57:49
##  6 Check-out  1       r7       2230        start             2017-01-09 17:09:43
##  7 Registrat… 1       r1       1           complete          2017-01-02 12:40:20
##  8 Triage an… 1       r2       501         complete          2017-01-02 22:32:25
##  9 Blood test 1       r3       1001        complete          2017-01-05 14:34:27
## 10 MRI SCAN   1       r4       1238        complete          2017-01-06 01:54:23
## 11 Discuss R… 1       r6       1735        complete          2017-01-07 10:18:08
## 12 Check-out  1       r7       2230        complete          2017-01-09 19:45:45
## # ℹ 1 more variable: .order <int>

The .data used here is a pronoun that can be used inside dplyr functions to refer to a column of the data by its name. More information here: https://adv-r.hadley.nz/quasiquotation.html

In bupaR, the latter notation is preferred. It has the advantage that it can be used in scripts both inside and outside bupaR (whereas the !! notation shown above relies on the non-exported case_id_() function and therefore needs the bupaR::: prefix outside the package). It is also slightly easier to understand than the workings of !!.

That said, the use of case_id_() and symbol(case_id()) is still widespread in bupaR, but the goal is to phase out this usage.
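As an illustration of the preferred .data notation, the sketch below wraps the filter in a function that works for any log, whatever its case identifier is called. The helper filter_one_case() and its argument case_value are hypothetical names, and the example assumes the patients log is loaded from eventdataR.

```r
library(bupaR)
library(eventdataR)  # assumption: provides the patients event log
library(dplyr)

# Hypothetical helper: keep a single case without hard-coding the
# name of the case identifier column.
filter_one_case <- function(log, case_value) {
  log %>%
    filter(.data[[case_id(log)]] == case_value)
}

first_patient <- filter_one_case(patients, 1)
n_cases(first_patient)
```

Only exported functions are used, so the same code can run in a script outside the package.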

dplyr verbs

The following dplyr verbs have received methods for activity logs and event logs.

filter()
group_by()
arrange()
mutate()
select()

They will all return a proper log, i.e. there is no risk of losing the defined mapping.

Special attention has to be given to the following:

Select

A conventional select() does not ensure that a log keeps the variables it needs to be considered a log. The select() methods for logs therefore keep both the listed variables and the variables that define the log.

The following code returns an eventlog object with the attribute oligurie, as well as the 6 variables needed to define the event log (plus the .order variable, see further).

sepsis %>%
    select(oligurie)

## # Log of 15214 events consisting of:
## 846 traces 
## 1050 cases 
## 15214 instances of 16 activities 
## 26 resources 
## Events occurred from 2013-11-07 08:18:29 until 2015-06-05 12:25:11 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 15,214 × 8
##    oligurie case_id activity   activity_instance_id timestamp           resource
##    <lgl>    <chr>   <fct>      <chr>                <dttm>              <fct>   
##  1 FALSE    A       ER Regist… 1                    2014-10-22 11:15:41 A       
##  2 NA       A       Leucocytes 2                    2014-10-22 11:27:00 B       
##  3 NA       A       CRP        3                    2014-10-22 11:27:00 B       
##  4 NA       A       LacticAcid 4                    2014-10-22 11:27:00 B       
##  5 NA       A       ER Triage  5                    2014-10-22 11:33:37 C       
##  6 NA       A       ER Sepsis… 6                    2014-10-22 11:34:00 A       
##  7 NA       A       IV Liquid  7                    2014-10-22 14:03:47 A       
##  8 NA       A       IV Antibi… 8                    2014-10-22 14:03:47 A       
##  9 NA       A       Admission… 9                    2014-10-22 14:13:19 D       
## 10 NA       A       CRP        10                   2014-10-24 09:00:00 B       
## # ℹ 15,204 more rows
## # ℹ 2 more variables: lifecycle <fct>, .order <int>

This behavior can be turned off by setting force_df = TRUE. In that case, select() works just like a traditional select(), and the result is a data.frame rather than an eventlog.

sepsis %>%
    select(oligurie, force_df = TRUE)

## # A tibble: 15,214 × 1
##    oligurie
##    <lgl>   
##  1 FALSE   
##  2 NA      
##  3 NA      
##  4 NA      
##  5 NA      
##  6 NA      
##  7 NA      
##  8 NA      
##  9 NA      
## 10 NA      
## # ℹ 15,204 more rows
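The difference between the two calls can also be checked programmatically. A short sketch, assuming the sepsis log is loaded from eventdataR (the object names kept_log and plain_df are only illustrative):

```r
library(bupaR)
library(eventdataR)  # assumption: provides the sepsis event log
library(dplyr)

kept_log <- sepsis %>% select(oligurie)                   # stays an eventlog
plain_df <- sepsis %>% select(oligurie, force_df = TRUE)  # plain table

inherits(kept_log, "eventlog")
inherits(plain_df, "eventlog")
names(plain_df)
```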

Because the select() methods for logs always keep the mapping variables, calling select() without any arguments returns just the variables that define the event log.

sepsis %>%
    select()

## # Log of 15214 events consisting of:
## 846 traces 
## 1050 cases 
## 15214 instances of 16 activities 
## 26 resources 
## Events occurred from 2013-11-07 08:18:29 until 2015-06-05 12:25:11 
##  
## # Variables were mapped as follows:
## Case identifier:     case_id 
## Activity identifier:     activity 
## Resource identifier:     resource 
## Activity instance identifier:    activity_instance_id 
## Timestamp:           timestamp 
## Lifecycle transition:        lifecycle 
## 
## # A tibble: 15,214 × 7
##    case_id activity  activity_instance_id timestamp           resource lifecycle
##    <chr>   <fct>     <chr>                <dttm>              <fct>    <fct>    
##  1 A       ER Regis… 1                    2014-10-22 11:15:41 A        complete 
##  2 A       Leucocyt… 2                    2014-10-22 11:27:00 B        complete 
##  3 A       CRP       3                    2014-10-22 11:27:00 B        complete 
##  4 A       LacticAc… 4                    2014-10-22 11:27:00 B        complete 
##  5 A       ER Triage 5                    2014-10-22 11:33:37 C        complete 
##  6 A       ER Sepsi… 6                    2014-10-22 11:34:00 A        complete 
##  7 A       IV Liquid 7                    2014-10-22 14:03:47 A        complete 
##  8 A       IV Antib… 8                    2014-10-22 14:03:47 A        complete 
##  9 A       Admissio… 9                    2014-10-22 14:13:19 D        complete 
## 10 A       CRP       10                   2014-10-24 09:00:00 B        complete 
## # ℹ 15,204 more rows
## # ℹ 1 more variable: .order <int>

If you want to select only specific eventlog classifiers, you can use select_ids(). Because you would typically not select all ids (otherwise you can use select()), this will by default turn your object into a data.frame.

sepsis %>%
    bupaR::select_ids(activity_id, case_id)

## # A tibble: 15,214 × 2
##    activity         case_id
##    <fct>            <chr>  
##  1 ER Registration  A      
##  2 Leucocytes       A      
##  3 CRP              A      
##  4 LacticAcid       A      
##  5 ER Triage        A      
##  6 ER Sepsis Triage A      
##  7 IV Liquid        A      
##  8 IV Antibiotics   A      
##  9 Admission NC     A      
## 10 CRP              A      
## # ℹ 15,204 more rows

Note how the different classifiers are specified: using the _id() functions, but without the brackets, and not as character strings.

Group by

While group_by() is defined for logs, it should be noted that each function needs a dedicated method for grouped logs before it is “compatible” with them. Some utility functions for this do exist, however (see further).

There are some shortcuts for typical groupings when programming in bupaR:

group_by_case()
group_by_activity()
group_by_activity_instance()
group_by_resource()
group_by_resource_activity()

patients %>%
    group_by_case()

## # Groups: [patient]
## Grouped # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 × 7
##    handling   patient employee handling_id registration_type time               
##    <fct>      <chr>   <fct>    <chr>       <fct>             <dttm>             
##  1 Registrat… 1       r1       1           start             2017-01-02 11:41:53
##  2 Registrat… 2       r1       2           start             2017-01-02 11:41:53
##  3 Registrat… 3       r1       3           start             2017-01-04 01:34:05
##  4 Registrat… 4       r1       4           start             2017-01-04 01:34:04
##  5 Registrat… 5       r1       5           start             2017-01-04 16:07:47
##  6 Registrat… 6       r1       6           start             2017-01-04 16:07:47
##  7 Registrat… 7       r1       7           start             2017-01-05 04:56:11
##  8 Registrat… 8       r1       8           start             2017-01-05 04:56:11
##  9 Registrat… 9       r1       9           start             2017-01-06 05:58:54
## 10 Registrat… 10      r1       10          start             2017-01-06 05:58:54
## # ℹ 5,432 more rows
## # ℹ 1 more variable: .order <int>

is equivalent to

patients %>%
    group_by(.data[[case_id(patients)]])

## # Groups: [patient]
## Grouped # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 × 7
##    handling   patient employee handling_id registration_type time               
##    <fct>      <chr>   <fct>    <chr>       <fct>             <dttm>             
##  1 Registrat… 1       r1       1           start             2017-01-02 11:41:53
##  2 Registrat… 2       r1       2           start             2017-01-02 11:41:53
##  3 Registrat… 3       r1       3           start             2017-01-04 01:34:05
##  4 Registrat… 4       r1       4           start             2017-01-04 01:34:04
##  5 Registrat… 5       r1       5           start             2017-01-04 16:07:47
##  6 Registrat… 6       r1       6           start             2017-01-04 16:07:47
##  7 Registrat… 7       r1       7           start             2017-01-05 04:56:11
##  8 Registrat… 8       r1       8           start             2017-01-05 04:56:11
##  9 Registrat… 9       r1       9           start             2017-01-06 05:58:54
## 10 Registrat… 10      r1       10          start             2017-01-06 05:58:54
## # ℹ 5,432 more rows
## # ℹ 1 more variable: .order <int>

Apart from the common resource-activity combination, no shortcuts are provided for combinations of groupings. However, the internal group_by_ids() allows the use of any combination of _id() functions. For example:

patients %>%
    bupaR:::group_by_ids(activity_id, case_id)

## # Groups: [handling, patient]
## Grouped # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 × 7
##    handling   patient employee handling_id registration_type time               
##    <fct>      <chr>   <fct>    <chr>       <fct>             <dttm>             
##  1 Registrat… 1       r1       1           start             2017-01-02 11:41:53
##  2 Registrat… 2       r1       2           start             2017-01-02 11:41:53
##  3 Registrat… 3       r1       3           start             2017-01-04 01:34:05
##  4 Registrat… 4       r1       4           start             2017-01-04 01:34:04
##  5 Registrat… 5       r1       5           start             2017-01-04 16:07:47
##  6 Registrat… 6       r1       6           start             2017-01-04 16:07:47
##  7 Registrat… 7       r1       7           start             2017-01-05 04:56:11
##  8 Registrat… 8       r1       8           start             2017-01-05 04:56:11
##  9 Registrat… 9       r1       9           start             2017-01-06 05:58:54
## 10 Registrat… 10      r1       10          start             2017-01-06 05:58:54
## # ℹ 5,432 more rows
## # ℹ 1 more variable: .order <int>

Note that the notation is analogous to select_ids(): specify the id functions, without quotation marks or brackets.
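Outside the package, a similar grouping can probably be obtained with exported functions only, by combining group_by() with the .data pronoun for each identifier. The sketch below assumes the patients log is loaded from eventdataR and that the grouped log behaves like a regular grouped tibble with respect to dplyr::group_vars().

```r
library(bupaR)
library(eventdataR)  # assumption: provides the patients event log
library(dplyr)

# Group by activity and case without using the internal group_by_ids()
grouped <- patients %>%
  group_by(.data[[activity_id(patients)]], .data[[case_id(patients)]])

# Inspect which columns define the groups
group_vars(grouped)
```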