Creating logs

Transforming your raw data into an event log object is one of the most challenging tasks in process analysis. On this page, we cover the typical situations and challenges that you can encounter.

We start with some important terminology:

  • Case: The subject of your process, e.g. a customer, an order, a patient.
  • Activity: A step in your process, e.g. receive order, send payment, perform MRI SCAN, etc.
  • Activity instance: The execution of a specific step for a specific case.
  • Event: A registration connected to an activity instance, characterized by a single timestamp. E.g. the start of Perform MRI SCAN for Patient X.
  • Resource: A person or machine that is related to the execution of (part of) an activity instance. E.g. the radiologist in charge of our MRI SCAN.
  • Lifecycle status: An indication of the status of an activity instance connected to an event. Typical values are start and complete. Other possible values are schedule, suspend, resume, etc.
  • Trace: A sequence of activities. The activity instances that belong to a case result in a specific trace when ordered by the time at which each instance occurred.
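
The terminology above can be made concrete with a small base R sketch. The data frame and its column names (case, instance_id, lifecycle, ...) are hypothetical and only serve as illustration; they are not a format required by bupaR.

```r
# Hypothetical toy data: 4 events, 2 activity instances, 1 case.
events <- data.frame(
  case        = c("X", "X", "X", "X"),
  activity    = c("Registration", "Registration", "MRI SCAN", "MRI SCAN"),
  instance_id = c(1, 1, 2, 2),
  lifecycle   = c("start", "complete", "start", "complete"),
  resource    = c("r1", "r1", "r4", "r4"),
  timestamp   = as.POSIXct(c("2017-11-15 16:50:59", "2017-11-15 18:45:18",
                             "2017-11-16 09:00:00", "2017-11-16 09:45:00"),
                           tz = "UTC")
)

# Each row is an event; keep the first event of every activity instance.
ordered   <- events[order(events$case, events$timestamp), ]
instances <- ordered[!duplicated(ordered$instance_id), ]

# The trace of a case is its sequence of activities, ordered by time.
traces <- tapply(instances$activity, instances$case, paste, collapse = " > ")
print(traces)
```

Here the four events collapse into two activity instances, which in turn form a single trace for case X.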

Logs: eventlog vs activitylog

bupaR supports two different kinds of log formats, both of which are extensions of the R data.frame:

  • eventlog: Event logs are created from a data.frame in which each row represents a single event. This means that each row has a single timestamp.
  • activitylog: Activity logs are created from a data.frame in which each row represents a single activity instance. This means that each row can have multiple timestamps, stored in different columns.

The data model below shows the difference between these two levels of observations, i.e. activity instances vs events.

The example below shows an excerpt of an event log containing 6 events. It can be seen that each event is linked to a single timestamp. As there can be more than one event within a single activity instance, each event also needs to be linked to a lifecycle status (here the registration_type). Furthermore, an activity instance identifier (handling_id) is needed to indicate which events belong to the same activity instance.

handling               patient  employee  handling_id  registration_type  time
Registration           333      r1        333          start              2017-11-15 16:50:59
Registration           333      r1        333          complete           2017-11-15 18:45:18
Triage and Assessment  333      r2        833          start              2017-11-16 20:37:26
Triage and Assessment  333      r2        833          complete           2017-11-17 08:21:08
Blood test             333      r3        1152         start              2017-11-17 22:27:09
Blood test             333      r3        1152         complete           2017-11-18 03:16:03

Transactional lifecycle?

An event is an atomic registration related to an activity instance. It thus contains one (and only one) timestamp. Additionally, the event should include a reference to a lifecycle transition. More specifically, multiple events can describe different lifecycle transitions of a single activity instance. For example, one event might record when a surgery is scheduled, another when it is started, yet another when it is completed, etc.

Figure: The standard transactional lifecycle.

The table below shows the same data as above, but now using the activitylog format. It can be seen that there are now just 3 rows instead of 6, but each row has 2 timestamps, representing 2 events. The lifecycle status represented by each timestamp is now given by the name of its column.

handling               patient  employee  handling_id  complete             start
Registration           333      r1        333          2017-11-15 18:45:18  2017-11-15 16:50:59
Triage and Assessment  333      r2        833          2017-11-17 08:21:08  2017-11-16 20:37:26
Blood test             333      r3        1152         2017-11-18 03:16:03  2017-11-17 22:27:09
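
The relation between the two tables can be sketched with a plain reshape. The snippet below is only an illustration using tidyr::pivot_wider() on an ordinary data.frame built from the excerpt above (timestamps are assumed to be in UTC); for actual log objects, bupaR's own to_activitylog() is the function to use.

```r
library(tidyr)

# The six events from the excerpt above, one row per event.
events <- data.frame(
  handling          = rep(c("Registration", "Triage and Assessment", "Blood test"), each = 2),
  patient           = "333",
  employee          = rep(c("r1", "r2", "r3"), each = 2),
  handling_id       = rep(c("333", "833", "1152"), each = 2),
  registration_type = rep(c("start", "complete"), times = 3),
  time              = as.POSIXct(c("2017-11-15 16:50:59", "2017-11-15 18:45:18",
                                   "2017-11-16 20:37:26", "2017-11-17 08:21:08",
                                   "2017-11-17 22:27:09", "2017-11-18 03:16:03"),
                                 tz = "UTC")
)

# One row per activity instance: each lifecycle status becomes a timestamp column.
activities <- pivot_wider(events,
                          names_from  = registration_type,
                          values_from = time)
print(activities)
```

The reshape only works this cleanly because all other columns (here the employee) are constant within each activity instance.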

As these examples show, both formats can often be used for representing the same process data. However, there are some important differences between them:

  • the eventlog format has much more flexibility in terms of lifecycle. There is no limit to the number of events that can occur in a single activity instance. If your data contains lifecycle statuses such as suspend, resume or reassign, they can be recorded multiple times within a single activity instance. In the activitylog format, as each lifecycle status gets its own column, it isn’t possible to have two events of the same lifecycle status in a single activity instance.
  • the level of observation in an eventlog is an event. As a result, attribute values can be stored at the event level. In an activitylog, the level of observation is an activity instance. This means that all additional attributes that you have about your process should be at this higher level. For example, an activity instance can only be connected to a single resource in the activitylog format, whereas in an eventlog different events within the same activity instance can have different resources, or different values for any other attribute.
  • because of the limited flexibility, an activitylog is easier to make, and typically closer to the format that your data is already in (see further below on how to construct log objects). As a result of this, there are many situations in which the analysis of an activitylog will be much faster compared to eventlog, where a lot of additional complexity needs to be taken into account.
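
To illustrate the first difference, the base R sketch below counts lifecycle statuses within one activity instance. The data are hypothetical (an MRI scan that is suspended and resumed twice); any status that occurs more than once cannot be stored in a single activitylog column.

```r
# Hypothetical events of one activity instance with repeated suspend/resume.
events <- data.frame(
  handling_id = "1291",
  lifecycle   = c("start", "suspend", "resume", "suspend", "resume", "complete")
)

# Count how often each lifecycle status occurs per activity instance.
counts <- as.data.frame(table(handling_id = events$handling_id,
                              lifecycle   = events$lifecycle))

# Statuses occurring more than once would need more than one column,
# so this instance can only be represented in the eventlog format.
repeated <- counts[counts$Freq > 1, ]
print(repeated)
```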

The right log for the job

Functionalities in bupaR core packages support both formats [1]. As such, the goal of your analysis does not impact the decision; only the complexity of your data matters. The precise format of your raw data further determines the preparatory steps that are needed. We can distinguish between 3 typical scenarios. The flowchart below helps you on your way.

An activitylog is the best option when each row in your data is an activity instance, or when events belonging to the same activity instance have equal attribute values (e.g. all events are executed by the same resource). When neither of these criteria holds, you need an eventlog object.
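
For event-level data, the rule above can be sketched as a small helper. The function suggest_log_type() is hypothetical and not part of bupaR; it simply checks whether any of the given attributes varies within an activity instance.

```r
# Hypothetical helper: suggest a log format for event-level data.
suggest_log_type <- function(data, instance_col, attribute_cols) {
  varies <- vapply(attribute_cols, function(col) {
    # number of distinct attribute values per activity instance
    n_values <- tapply(data[[col]], data[[instance_col]],
                       function(x) length(unique(x)))
    any(n_values > 1)
  }, logical(1))
  if (any(varies)) "eventlog" else "activitylog"
}

# Same resource within each activity instance.
consistent <- data.frame(
  handling_id = c("227", "227", "727", "727"),
  employee    = c("r1", "r1", "r2", "r2")
)

# Different resources within each activity instance.
differing <- data.frame(
  handling_id = c("116", "116", "616", "616"),
  employee    = c("r2", "r6", "r1", "r7")
)

suggest_log_type(consistent, "handling_id", "employee")
suggest_log_type(differing, "handling_id", "employee")
```

In practice you would pass every additional attribute column, not only the resource.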

Scenario 1

If each row in your data.frame is already an activity instance, the activitylog format is the best way to go. Consider the data sample below.

patient  handling               activity_started     activity_ended
464      Blood test             2018-04-06 20:04:09  2018-04-07 01:18:17
464      Check-out              2018-04-12 19:02:11  2018-04-12 21:41:01
464      Discuss Results        2018-04-12 11:00:16  2018-04-12 13:59:44
464      MRI SCAN               2018-04-07 06:30:56  2018-04-07 09:37:26
464      Registration           2018-03-20 19:07:17  2018-03-20 21:15:41
464      Triage and Assessment  2018-03-21 15:58:55  2018-03-22 05:21:56

As each row contains multiple timestamps, i.e. activity_started and activity_ended, it is clear that each row represents an activity instance. Turning this dataset into an activitylog requires the following steps:

  1. Timestamp variables should be named in correspondence with the standard Transactional lifecycle.
  2. Timestamp variables should be of type Date or POSIXct.
  3. Use the activitylog constructor function.
library(bupaR)
library(lubridate) # provides ymd_hms

data %>%
    # rename timestamp variables to match the standard transactional lifecycle
    dplyr::rename(start = activity_started,
                  complete = activity_ended) %>%
    # convert timestamps to POSIXct
    convert_timestamps(columns = c("start", "complete"), format = ymd_hms) %>%
    activitylog(case_id = "patient",
                activity_id = "handling",
                timestamps = c("start", "complete"))
## # Log of 12 events consisting of:
## 1 trace 
## 1 case 
## 6 instances of 6 activities 
## 0 resources 
## Events occurred from 2018-03-20 19:07:17 until 2018-04-12 21:41:01 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Timestamps:      start, complete 
## 
## # A tibble: 6 × 5
##   patient handling              start               complete            .order
##   <chr>   <fct>                 <dttm>              <dttm>               <int>
## 1 464     Blood test            2018-04-06 20:04:09 2018-04-07 01:18:17      1
## 2 464     Check-out             2018-04-12 19:02:11 2018-04-12 21:41:01      2
## 3 464     Discuss Results       2018-04-12 11:00:16 2018-04-12 13:59:44      3
## 4 464     MRI SCAN              2018-04-07 06:30:56 2018-04-07 09:37:26      4
## 5 464     Registration          2018-03-20 19:07:17 2018-03-20 21:15:41      5
## 6 464     Triage and Assessment 2018-03-21 15:58:55 2018-03-22 05:21:56      6

Note that if a resource identifier is available, it can be added to the activitylog call through the resource_id argument.
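
Step 2 of the list above asks for timestamps of type Date or POSIXct. As an alternative to convert_timestamps(), the conversion and a quick type check can be sketched in base R. The snippet uses two rows of the sample data; the UTC time zone is an assumption made for this illustration.

```r
# Two rows of the sample data, with timestamps still stored as text.
raw <- data.frame(
  patient          = "464",
  handling         = c("Blood test", "Check-out"),
  activity_started = c("2018-04-06 20:04:09", "2018-04-12 19:02:11"),
  activity_ended   = c("2018-04-07 01:18:17", "2018-04-12 21:41:01"),
  stringsAsFactors = FALSE
)

timestamp_cols <- c("activity_started", "activity_ended")

# Parse the text columns into POSIXct (time zone assumed to be UTC).
raw[timestamp_cols] <- lapply(raw[timestamp_cols], as.POSIXct,
                              format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Check the type requirement and a basic sanity rule before building the log.
is_time <- vapply(raw[timestamp_cols],
                  function(x) inherits(x, c("Date", "POSIXct")),
                  logical(1))
stopifnot(all(is_time))
stopifnot(all(raw$activity_ended >= raw$activity_started))
```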

Scenario 2

If each row in your data.frame is an event, but all events that belong to the same activity instance share the same attribute values, the activitylog format is again the best way to go. Consider the data sample below.

patient  handling               employee  handling_id  registration_type  time
227      Registration           r1        227          started            2017-08-09 19:55:30
227      Triage and Assessment  r2        727          started            2017-08-09 22:17:43
227      Registration           r1        227          completed          2017-08-09 22:17:43
227      Triage and Assessment  r2        727          completed          2017-08-10 15:21:30
227      Blood test             r3        1109         started            2017-08-17 03:01:24
227      Blood test             r3        1109         completed          2017-08-17 09:17:20
227      MRI SCAN               r4        1346         started            2017-08-17 13:15:04
227      MRI SCAN               r4        1346         completed          2017-08-17 18:47:44
227      Discuss Results        r6        1961         started            2017-08-22 13:33:38
227      Check-out              r7        2456         started            2017-08-22 15:38:38
227      Discuss Results        r6        1961         completed          2017-08-22 15:38:38
227      Check-out              r7        2456         completed          2017-08-22 17:12:46

The resource identifier (employee) has been added as an additional attribute. Note that though each row is an event, they can be grouped into activity instances using the handling_id column, which we will call the activity instance id. Using the latter, we can see that the resource attribute is the same within each activity instance, which allows us to create an activitylog. The steps to do so are the following.

  1. The values of the lifecycle variable should correspond to the standard Transactional lifecycle.
  2. Timestamp variable should be of type Date or POSIXct.
  3. Use the eventlog constructor function.
  4. Convert to activitylog using to_activitylog for reduced memory usage and improved performance.
data %>%
    # recode lifecycle variable appropriately
    dplyr::mutate(registration_type = forcats::fct_recode(registration_type, 
                                                          "start" = "started",
                                                          "complete" = "completed")) %>%
    convert_timestamps(columns = "time", format = ymd_hms) %>%
    eventlog(case_id = "patient",
             activity_id = "handling",
             activity_instance_id = "handling_id",
             lifecycle_id = "registration_type",
             timestamp = "time",
             resource_id = "employee") %>%
    to_activitylog() -> tmp_act

Note that the resource identifier is optional, and can be left out of the eventlog call if such an attribute does not exist in your data. If the activity instance id does not exist, some heuristics are available to generate it: see Missing activity instance id below.

Scenario 3

If each row is an event, and events of the same activity instance have differing attribute values, the flexibility of eventlog objects is required. Consider the data sample below.

patient  handling               employee  handling_id  registration_type  time
116      Registration           r2        116          started            2017-04-29 03:24:59
116      Registration           r6        116          completed          2017-04-29 06:23:09
116      Triage and Assessment  r1        616          started            2017-04-29 15:41:27
116      Triage and Assessment  r7        616          completed          2017-04-30 03:04:21
116      Blood test             r4        1054         started            2017-04-30 15:13:28
116      Blood test             r6        1054         completed          2017-04-30 21:24:18
116      MRI SCAN               r1        1291         started            2017-05-01 01:12:51
116      MRI SCAN               r4        1291         completed          2017-05-01 05:32:37
116      Discuss Results        r3        1850         started            2017-05-01 09:44:20
116      Discuss Results        r7        1850         completed          2017-05-01 14:00:48
116      Check-out              r3        2345         started            2017-05-03 04:02:35
116      Check-out              r2        2345         completed          2017-05-03 06:16:03

In this example, different resources (employees) perform the start and complete events of the same activity instance. Therefore, we resort to the eventlog format, which has no problems storing this. The steps to take are the following:

  1. The values of the lifecycle variable should correspond to the standard Transactional lifecycle.
  2. Timestamp variable should be of type Date or POSIXct.
  3. Use the eventlog constructor function.
data %>%
    # recode lifecycle variable appropriately
    dplyr::mutate(registration_type = forcats::fct_recode(registration_type, 
                                                          "start" = "started",
                                                          "complete" = "completed")) %>%
    convert_timestamps(columns = "time", format = ymd_hms) %>%
    eventlog(case_id = "patient",
             activity_id = "handling",
             activity_instance_id = "handling_id",
             lifecycle_id = "registration_type",
             timestamp = "time",
             resource_id = "employee")
## Warning in validate_eventlog(eventlog): The following activity instances are
## connected to more than one resource: 1054,116,1291,1850,2345,616
## # Log of 12 events consisting of:
## 1 trace 
## 1 case 
## 6 instances of 6 activities 
## 6 resources 
## Events occurred from 2017-04-29 03:24:59 until 2017-05-03 06:16:03 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 12 × 7
##    patient handling   employee handling_id registration_type time               
##    <chr>   <fct>      <fct>    <chr>       <fct>             <dttm>             
##  1 116     Registrat… r2       116         start             2017-04-29 03:24:59
##  2 116     Registrat… r6       116         complete          2017-04-29 06:23:09
##  3 116     Triage an… r1       616         start             2017-04-29 15:41:27
##  4 116     Triage an… r7       616         complete          2017-04-30 03:04:21
##  5 116     Blood test r4       1054        start             2017-04-30 15:13:28
##  6 116     Blood test r6       1054        complete          2017-04-30 21:24:18
##  7 116     MRI SCAN   r1       1291        start             2017-05-01 01:12:51
##  8 116     MRI SCAN   r4       1291        complete          2017-05-01 05:32:37
##  9 116     Discuss R… r3       1850        start             2017-05-01 09:44:20
## 10 116     Discuss R… r7       1850        complete          2017-05-01 14:00:48
## 11 116     Check-out  r3       2345        start             2017-05-03 04:02:35
## 12 116     Check-out  r2       2345        complete          2017-05-03 06:16:03
## # ℹ 1 more variable: .order <int>

Note that we need an eventlog irrespective of which attribute values differ: it can be the resources, but also any additional variable you have in your data set. For the special case of resource values, a different resource executing events in the same activity instance might be a data quality issue. If so, some functions can help you to identify this issue: see Inconsistent Resources below.

Again, if the activity instance id does not exist, some heuristics are available to generate it: see Missing activity instance id below.

Typical problems

Missing activity instance id

In order to correlate events that belong to the same activity instance, an activity instance identifier is required. For example, in the data shown below, a patient has gone through several surgeries, each with its own start and complete events. The activity instance identifier then allows you to distinguish which events belong together and which do not. It is important to note that this instance identifier should be unique, also across different cases and activities.

patient   activity   timestamp            status    activity_instance
John Doe  check-in   2017-05-10 08:33:26  complete  1
John Doe  surgery    2017-05-10 08:53:16  start     2
John Doe  surgery    2017-05-10 09:25:19  complete  2
John Doe  treatment  2017-05-10 10:01:25  start     3
John Doe  treatment  2017-05-10 10:35:18  complete  3
John Doe  surgery    2017-05-10 10:41:35  start     4
John Doe  surgery    2017-05-10 11:05:56  complete  4
John Doe  check-out  2017-05-11 14:52:36  complete  5

If the activity instance identifier is not available, you can use the assign_instance_id() function, which uses a heuristic to create the missing identifier. Alternatively, you can try to create the identifier on your own using dplyr::mutate() and other manipulation functions.
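
As an illustration of the do-it-yourself route, the sketch below builds an identifier with dplyr::mutate() for the data shown above. The rule used here is an assumption for this example and not necessarily the heuristic implemented in assign_instance_id(): within each patient and activity, a new instance begins at every start event, and at every other event that does not directly follow a start.

```r
library(dplyr)

# The events from the table above, without the activity_instance column.
events <- data.frame(
  patient   = "John Doe",
  activity  = c("check-in", "surgery", "surgery", "treatment",
                "treatment", "surgery", "surgery", "check-out"),
  timestamp = as.POSIXct(c("2017-05-10 08:33:26", "2017-05-10 08:53:16",
                           "2017-05-10 09:25:19", "2017-05-10 10:01:25",
                           "2017-05-10 10:35:18", "2017-05-10 10:41:35",
                           "2017-05-10 11:05:56", "2017-05-11 14:52:36"),
                         tz = "UTC"),
  status    = c("complete", "start", "complete", "start",
                "complete", "start", "complete", "complete")
)

with_ids <- events %>%
  arrange(patient, timestamp) %>%
  group_by(patient, activity) %>%
  mutate(
    # an event opens a new instance unless it directly follows a start event
    opens_instance = status == "start" |
      dplyr::lag(status, default = "complete") != "start",
    instance_nr = cumsum(opens_instance)
  ) %>%
  ungroup() %>%
  # make the identifier unique across cases and activities
  mutate(activity_instance = paste(patient, activity, instance_nr, sep = "_"))

print(with_ids)
```

For this sample the sketch yields five distinct identifiers, matching the five activity instances in the table. It assumes that instances of the same activity within a case do not overlap in time.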

Large Datasets and Validation

By default, bupaR validates certain properties of the activity instance identifier that is supplied when creating an event log:

  • a single activity instance identifier must not be connected to multiple cases,
  • a single activity instance identifier must not be connected to multiple activity labels.

However, these checks are not efficient and may lead to considerable performance issues for large data frames. It is possible to deactivate the validation in case you already know that your data fulfills all the requirements, using the argument validate = FALSE when creating the eventlog. Note that when the activity instance id was created with the assign_instance_id() function, you can assume the above properties hold.
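
If you want to convince yourself that your data fulfills these requirements before setting validate = FALSE, the two properties can be checked by hand. The helper check_instance_ids() and the toy data below are hypothetical and not part of bupaR; this is a minimal base R sketch of the idea, not bupaR's own validation code.

```r
# Hypothetical helper: find activity instance ids that are connected to
# more than one case or more than one activity label.
check_instance_ids <- function(data, instance_col, case_col, activity_col) {
  n_distinct_per_instance <- function(values) {
    tapply(values, data[[instance_col]], function(x) length(unique(x)))
  }
  n_cases      <- n_distinct_per_instance(data[[case_col]])
  n_activities <- n_distinct_per_instance(data[[activity_col]])
  list(
    multiple_cases      = names(n_cases)[n_cases > 1],
    multiple_activities = names(n_activities)[n_activities > 1]
  )
}

# Toy data: instance "2" is (wrongly) linked to two activity labels.
events <- data.frame(
  patient     = c("A", "A", "B", "B"),
  handling    = c("Registration", "Registration", "Registration", "Blood test"),
  handling_id = c("1", "1", "2", "2")
)

problems <- check_instance_ids(events, "handling_id", "patient", "handling")
print(problems)
```

Only when both elements of the result are empty do the two properties hold.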

Inconsistent Resources

Each event can contain the notion of a resource. It is possible that different events belonging to the same activity instance are executed by different resources, as in the eventlog below.

patient  handling               employee  handling_id  registration_type  time                 .order
206      Registration           r4        206          start              2017-07-19 15:48:14  1
206      Triage and Assessment  r6        706          start              2017-07-19 17:03:44  2
206      Registration           r3        206          complete           2017-07-19 17:03:44  3
206      Triage and Assessment  r7        706          complete           2017-07-20 07:28:53  4
206      Blood test             r1        1100         start              2017-07-25 03:02:14  5
206      Blood test             r3        1100         complete           2017-07-25 08:14:46  6
206      MRI SCAN               r6        1337         start              2017-07-25 12:37:36  7
206      MRI SCAN               r2        1337         complete           2017-07-25 16:52:16  8
206      Discuss Results        r2        1940         start              2017-07-26 07:36:36  9
206      Discuss Results        r4        1940         complete           2017-07-26 11:08:03  10
206      Check-out              r1        2435         start              2017-07-28 02:54:17  11
206      Check-out              r7        2435         complete           2017-07-28 03:55:13  12

If you have a large dataset and want an overview of the activity instances that have more than one resource connected to them, you can use the detect_resource_inconsistencies() function.

log %>%
    detect_resource_inconsistencies()
## # A tibble: 6 × 5
##   patient handling              handling_id complete start
##   <chr>   <fct>                 <chr>       <chr>    <chr>
## 1 206     Blood test            1100        r3       r1   
## 2 206     Check-out             2435        r7       r1   
## 3 206     Discuss Results       1940        r4       r2   
## 4 206     MRI SCAN              1337        r2       r6   
## 5 206     Registration          206         r3       r4   
## 6 206     Triage and Assessment 706         r7       r6

If you want to remove these inconsistencies, a quick fix is to merge the resource labels with fix_resource_inconsistencies(). Note that this is not needed for an eventlog, but it is for an activitylog. While creating an eventlog emits a warning when resource inconsistencies exist, this should mostly be seen as a data quality warning. That said, analyses that count resources might produce odd results when such inconsistencies are present.

log %>%
    fix_resource_inconsistencies()
## *** OUTPUT ***
## A total of 6 activity executions in the event log are classified as inconsistencies.
## They are spread over the following cases and activities:
## # A tibble: 6 × 5
##   patient handling              handling_id complete start
##   <chr>   <fct>                 <chr>       <chr>    <chr>
## 1 206     Blood test            1100        r3       r1   
## 2 206     Check-out             2435        r7       r1   
## 3 206     Discuss Results       1940        r4       r2   
## 4 206     MRI SCAN              1337        r2       r6   
## 5 206     Registration          206         r3       r4   
## 6 206     Triage and Assessment 706         r7       r6
## Inconsistencies solved succesfully.
## # Log of 12 events consisting of:
## 1 trace 
## 1 case 
## 6 instances of 6 activities 
## 6 resources 
## Events occurred from 2017-07-19 15:48:14 until 2017-07-28 03:55:13 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 12 × 7
##    patient handling   employee handling_id registration_type time               
##    <chr>   <fct>      <chr>    <chr>       <fct>             <dttm>             
##  1 206     Registrat… r3 - r4  206         start             2017-07-19 15:48:14
##  2 206     Triage an… r7 - r6  706         start             2017-07-19 17:03:44
##  3 206     Registrat… r3 - r4  206         complete          2017-07-19 17:03:44
##  4 206     Triage an… r7 - r6  706         complete          2017-07-20 07:28:53
##  5 206     Blood test r3 - r1  1100        start             2017-07-25 03:02:14
##  6 206     Blood test r3 - r1  1100        complete          2017-07-25 08:14:46
##  7 206     MRI SCAN   r2 - r6  1337        start             2017-07-25 12:37:36
##  8 206     MRI SCAN   r2 - r6  1337        complete          2017-07-25 16:52:16
##  9 206     Discuss R… r4 - r2  1940        start             2017-07-26 07:36:36
## 10 206     Discuss R… r4 - r2  1940        complete          2017-07-26 11:08:03
## 11 206     Check-out  r7 - r1  2435        start             2017-07-28 02:54:17
## 12 206     Check-out  r7 - r1  2435        complete          2017-07-28 03:55:13
## # ℹ 1 more variable: .order <int>
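
The idea behind both functions can be sketched on a plain data.frame with dplyr. This is an illustration only, not the implementation used by bupaR: the merged labels are sorted alphabetically here, so their order can differ from the output of fix_resource_inconsistencies() shown above. Instance 9999 is hypothetical and added as a consistent counterexample.

```r
library(dplyr)

# Two inconsistent instances from the log above, plus one hypothetical
# consistent instance (9999).
events <- data.frame(
  handling_id       = c("206", "206", "1100", "1100", "9999", "9999"),
  registration_type = rep(c("start", "complete"), times = 3),
  employee          = c("r4", "r3", "r1", "r3", "r5", "r5")
)

# Detect: activity instances connected to more than one resource.
inconsistent <- events %>%
  group_by(handling_id) %>%
  summarise(n_resources = n_distinct(employee)) %>%
  filter(n_resources > 1)

# Fix: give all events of an instance the same, merged resource label.
merged <- events %>%
  group_by(handling_id) %>%
  mutate(employee = paste(sort(unique(employee)), collapse = " - ")) %>%
  ungroup()

print(inconsistent)
print(merged)
```

After merging, every activity instance is connected to exactly one resource label, which is what the activitylog format requires.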

  1. Currently both eventlog and activitylog are supported by the packages bupaR, edeaR and processmapR. The daqapo package only supports activitylog, while all other packages only support eventlog. While the goal is to extend support for both to all packages, in the meantime you can always convert the format of your log using the functions to_eventlog() and to_activitylog().


Copyright © 2023 bupaR - Hasselt University