Transforming your raw data into an event log object is one of the most challenging tasks in process analysis. On this page, we cover all the possible situations and challenges that you can encounter.
We start with some important terminology:
eventlog vs activitylog
bupaR supports two different kinds of log formats, both of which are extensions of the R data.frame:
- eventlog: Event logs are created from a data.frame in which each row represents a single event. This means that each row has a single timestamp.
- activitylog: Activity logs are created from a data.frame in which each row represents a single activity instance. This means that each row can have multiple timestamps, stored in different columns.
The data model below shows the difference between these two levels of observation, i.e. activity instances vs events.
The example below shows an excerpt of an event log containing 6 events. Each event is linked to a single timestamp. As there can be more than one event within a single activity instance, each event also needs to be linked to a lifecycle status (here the registration_type). Furthermore, an activity instance identifier (handling_id) is needed to indicate which events belong to the same activity instance.
handling | patient | employee | handling_id | registration_type | time |
---|---|---|---|---|---|
Registration | 333 | r1 | 333 | start | 2017-11-15 16:50:59 |
Registration | 333 | r1 | 333 | complete | 2017-11-15 18:45:18 |
Triage and Assessment | 333 | r2 | 833 | start | 2017-11-16 20:37:26 |
Triage and Assessment | 333 | r2 | 833 | complete | 2017-11-17 08:21:08 |
Blood test | 333 | r3 | 1152 | start | 2017-11-17 22:27:09 |
Blood test | 333 | r3 | 1152 | complete | 2017-11-18 03:16:03 |
The table below shows the same data as above, but now using the activitylog format. There are now just 3 rows instead of 6, but each row has 2 timestamps, representing 2 events. The lifecycle status represented by each timestamp is now given by the name of the column it is stored in.
handling | patient | employee | handling_id | complete | start |
---|---|---|---|---|---|
Registration | 333 | r1 | 333 | 2017-11-15 18:45:18 | 2017-11-15 16:50:59 |
Triage and Assessment | 333 | r2 | 833 | 2017-11-17 08:21:08 | 2017-11-16 20:37:26 |
Blood test | 333 | r3 | 1152 | 2017-11-18 03:16:03 | 2017-11-17 22:27:09 |
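
The relation between the two tables can be sketched in a few lines of base R. This is only an illustration of the reshaping idea, not how bupaR converts logs internally; the object names (events, activities, id_cols) are made up for this example.

```r
# Event-level data: one row per event, one timestamp per row.
events <- data.frame(
  handling = rep(c("Registration", "Triage and Assessment", "Blood test"), each = 2),
  patient = "333",
  employee = rep(c("r1", "r2", "r3"), each = 2),
  handling_id = rep(c("333", "833", "1152"), each = 2),
  registration_type = rep(c("start", "complete"), times = 3),
  time = as.POSIXct(c(
    "2017-11-15 16:50:59", "2017-11-15 18:45:18",
    "2017-11-16 20:37:26", "2017-11-17 08:21:08",
    "2017-11-17 22:27:09", "2017-11-18 03:16:03"
  ), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Columns that are constant within an activity instance.
id_cols <- c("handling", "patient", "employee", "handling_id")

# Split the events by lifecycle status and name each timestamp column after it.
starts <- events[events$registration_type == "start", c(id_cols, "time")]
completes <- events[events$registration_type == "complete", c(id_cols, "time")]
names(starts)[names(starts) == "time"] <- "start"
names(completes)[names(completes) == "time"] <- "complete"

# Activity-level data: one row per activity instance, one column per lifecycle status.
activities <- merge(starts, completes, by = id_cols)
activities <- activities[order(activities$start), ]
print(activities)
```

Note that the merge only works this cleanly because employee has the same value for both events of each activity instance.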
As these examples show, both formats can often be used for representing the same process data. However, there are some important differences between them:
- The eventlog format has much more flexibility in terms of lifecycle. There is no limit to the number of events that can occur in a single activity instance. If your data contains lifecycle statuses such as suspend, resume or reassign, they can be recorded multiple times within a single activity instance. In the activitylog format, as each lifecycle status gets its own column, it isn't possible to have two events of the same lifecycle status in a single activity instance.
- The level of observation in an eventlog is an event. As a result, attribute values can be stored at the event level. In an activitylog, the level of observation is an activity instance. This means that all additional attributes that you have about your process should be at this higher level. For example, an activity instance can only be connected to a single resource in the activitylog format, whereas in an eventlog different events within the same activity instance can have different resources, or different values for any other attribute.
- An activitylog is easier to make, and typically closer to the format that your data is already in (see further below on how to construct log objects). As a result, there are many situations in which the analysis of an activitylog will be much faster compared to an eventlog, where a lot of additional complexity needs to be taken into account.
Functionalities in bupaR core packages support both formats.[1] As such, the goal of your analysis does not impact the decision; only the complexity of your data matters. The precise format your raw data is in will further define the preparatory steps that are needed. We can distinguish between 3 typical scenarios. The flowchart below helps you on your way.
An activitylog is the best option when each row in your data is an activity instance, or when events belonging to the same activity instance have equal attribute values (e.g. all events are executed by the same resource). When neither of these criteria holds, you can create an eventlog object.
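The second criterion can be checked directly on event-level data. The following is a minimal base R sketch; needs_eventlog() is a hypothetical helper written for this page, not a bupaR function, and the two small data frames are illustrative.

```r
# Hypothetical helper (not part of bupaR): returns TRUE when at least one
# activity instance has more than one distinct value in any attribute column.
needs_eventlog <- function(events, instance_col, attribute_cols) {
  for (col in attribute_cols) {
    n_values <- tapply(events[[col]], events[[instance_col]],
                       function(x) length(unique(x)))
    if (any(n_values > 1)) return(TRUE)
  }
  FALSE
}

# Same resource within each activity instance: an activitylog is sufficient.
same_resource <- data.frame(
  handling_id = c("227", "227", "727", "727"),
  employee = c("r1", "r1", "r2", "r2"),
  stringsAsFactors = FALSE
)

# Different resources within an activity instance: an eventlog is required.
mixed_resource <- data.frame(
  handling_id = c("116", "116", "616", "616"),
  employee = c("r2", "r6", "r1", "r7"),
  stringsAsFactors = FALSE
)

needs_eventlog(same_resource, "handling_id", "employee")
needs_eventlog(mixed_resource, "handling_id", "employee")
```

The first call returns FALSE and the second returns TRUE. In practice you would pass every additional attribute column, not only the resource.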
If each row in your data.frame is already an activity instance, the activitylog format is the best way to go. Consider the data sample below.
patient | handling | activity_started | activity_ended |
---|---|---|---|
464 | Blood test | 2018-04-06 20:04:09 | 2018-04-07 01:18:17 |
464 | Check-out | 2018-04-12 19:02:11 | 2018-04-12 21:41:01 |
464 | Discuss Results | 2018-04-12 11:00:16 | 2018-04-12 13:59:44 |
464 | MRI SCAN | 2018-04-07 06:30:56 | 2018-04-07 09:37:26 |
464 | Registration | 2018-03-20 19:07:17 | 2018-03-20 21:15:41 |
464 | Triage and Assessment | 2018-03-21 15:58:55 | 2018-03-22 05:21:56 |
As each row contains multiple timestamps, i.e. activity_started and activity_ended, it is clear that each row represents an activity instance. Turning this dataset into an activitylog requires the following steps:
1. Rename the timestamp variables appropriately.
2. Convert the timestamps to Date or POSIXct.
3. Create the log with the activitylog constructor function.
library(bupaR)
library(lubridate)  # provides ymd_hms
data %>%
    # rename timestamp variables appropriately
    dplyr::rename(start = activity_started,
                  complete = activity_ended) %>%
    # convert timestamps to POSIXct
    convert_timestamps(columns = c("start", "complete"), format = ymd_hms) %>%
    activitylog(case_id = "patient",
                activity_id = "handling",
                timestamps = c("start", "complete"))
## # Log of 12 events consisting of:
## 1 trace
## 1 case
## 6 instances of 6 activities
## 0 resources
## Events occurred from 2018-03-20 19:07:17 until 2018-04-12 21:41:01
##
## # Variables were mapped as follows:
## Case identifier: patient
## Activity identifier: handling
## Resource identifier: employee
## Timestamps: start, complete
##
## # A tibble: 6 × 5
## patient handling start complete .order
## <chr> <fct> <dttm> <dttm> <int>
## 1 464 Blood test 2018-04-06 20:04:09 2018-04-07 01:18:17 1
## 2 464 Check-out 2018-04-12 19:02:11 2018-04-12 21:41:01 2
## 3 464 Discuss Results 2018-04-12 11:00:16 2018-04-12 13:59:44 3
## 4 464 MRI SCAN 2018-04-07 06:30:56 2018-04-07 09:37:26 4
## 5 464 Registration 2018-03-20 19:07:17 2018-03-20 21:15:41 5
## 6 464 Triage and Assessment 2018-03-21 15:58:55 2018-03-22 05:21:56 6
Note that in case a resource identifier is available, this information can be added in the activitylog call.
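As a rough sketch of what that could look like: the example below prepares a small activity-level data frame in base R and, only if bupaR is installed, passes the resource column to the constructor. The employee values are invented for illustration (the sample above has no such column), and the argument name resource_id is assumed to match the one used in the eventlog calls on this page.

```r
# Activity-level data with a hypothetical resource column (employee).
activities <- data.frame(
  patient = "464",
  handling = c("Registration", "Triage and Assessment"),
  employee = c("r1", "r2"),  # invented values, for illustration only
  activity_started = c("2018-03-20 19:07:17", "2018-03-21 15:58:55"),
  activity_ended = c("2018-03-20 21:15:41", "2018-03-22 05:21:56"),
  stringsAsFactors = FALSE
)

# Rename the timestamp columns to lifecycle names and convert them to POSIXct.
names(activities)[names(activities) == "activity_started"] <- "start"
names(activities)[names(activities) == "activity_ended"] <- "complete"
activities$start <- as.POSIXct(activities$start, tz = "UTC")
activities$complete <- as.POSIXct(activities$complete, tz = "UTC")

# Only build the log when bupaR is available; resource_id is an assumed
# argument name, mirroring the eventlog() calls elsewhere on this page.
if (requireNamespace("bupaR", quietly = TRUE)) {
  log <- bupaR::activitylog(activities,
                            case_id = "patient",
                            activity_id = "handling",
                            resource_id = "employee",
                            timestamps = c("start", "complete"))
  print(log)
}
```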
If each row in your data.frame is an event, but all events that belong to the same activity instance share the same attribute values, the activitylog format is again the best way to go. Consider the data sample below.
patient | handling | employee | handling_id | registration_type | time |
---|---|---|---|---|---|
227 | Registration | r1 | 227 | started | 2017-08-09 19:55:30 |
227 | Triage and Assessment | r2 | 727 | started | 2017-08-09 22:17:43 |
227 | Registration | r1 | 227 | completed | 2017-08-09 22:17:43 |
227 | Triage and Assessment | r2 | 727 | completed | 2017-08-10 15:21:30 |
227 | Blood test | r3 | 1109 | started | 2017-08-17 03:01:24 |
227 | Blood test | r3 | 1109 | completed | 2017-08-17 09:17:20 |
227 | MRI SCAN | r4 | 1346 | started | 2017-08-17 13:15:04 |
227 | MRI SCAN | r4 | 1346 | completed | 2017-08-17 18:47:44 |
227 | Discuss Results | r6 | 1961 | started | 2017-08-22 13:33:38 |
227 | Check-out | r7 | 2456 | started | 2017-08-22 15:38:38 |
227 | Discuss Results | r6 | 1961 | completed | 2017-08-22 15:38:38 |
227 | Check-out | r7 | 2456 | completed | 2017-08-22 17:12:46 |
The resource identifier (employee) has been added as an additional attribute. Note that though each row is an event, the rows can be grouped into activity instances using the handling_id column, which we will call the activity instance id. Using the latter, we can see that the resource attribute is the same within each activity instance, which allows us to create an activitylog. The steps to do so are the following:
1. Recode the lifecycle variable appropriately.
2. Convert the timestamps to Date or POSIXct.
3. Create the log with the eventlog constructor function.
4. Convert it to an activitylog using to_activitylog for reduced memory usage and improved performance.
data %>%
    # recode lifecycle variable appropriately
    dplyr::mutate(registration_type = forcats::fct_recode(registration_type,
                                                          "start" = "started",
                                                          "complete" = "completed")) %>%
    convert_timestamps(columns = "time", format = ymd_hms) %>%
    eventlog(case_id = "patient",
             activity_id = "handling",
             activity_instance_id = "handling_id",
             lifecycle_id = "registration_type",
             timestamp = "time",
             resource_id = "employee") %>%
    to_activitylog() -> tmp_act
Note that the resource identifier is optional, and can be left out of the eventlog call if such an attribute does not exist in your data. If the activity instance id does not exist, some heuristics are available to generate it: [Missing activity instance identifier].
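The recoding step can also be written without forcats. The sketch below uses a named lookup vector in base R and stops when it meets a label it does not know, which can be useful when raw data contains unexpected lifecycle values. The names lifecycle_map and unknown are illustrative, and the target labels start and complete are simply the ones used throughout this page.

```r
events <- data.frame(
  handling_id = c("227", "227", "727", "727"),
  registration_type = c("started", "completed", "started", "completed"),
  stringsAsFactors = FALSE
)

# Lookup from raw labels (names) to the labels used on this page (values).
lifecycle_map <- c(started = "start", completed = "complete")

# Fail early on labels the lookup does not cover, instead of creating NAs.
unknown <- setdiff(unique(events$registration_type), names(lifecycle_map))
if (length(unknown) > 0) {
  stop("Unmapped lifecycle labels: ", paste(unknown, collapse = ", "))
}

events$registration_type <- factor(
  unname(lifecycle_map[events$registration_type]),
  levels = c("start", "complete")
)
print(events)
```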
If each row is an event, and events of the same activity instance have differing attribute values, the flexibility of eventlog objects is required. Consider the data sample below.
patient | handling | employee | handling_id | registration_type | time |
---|---|---|---|---|---|
116 | Registration | r2 | 116 | started | 2017-04-29 03:24:59 |
116 | Registration | r6 | 116 | completed | 2017-04-29 06:23:09 |
116 | Triage and Assessment | r1 | 616 | started | 2017-04-29 15:41:27 |
116 | Triage and Assessment | r7 | 616 | completed | 2017-04-30 03:04:21 |
116 | Blood test | r4 | 1054 | started | 2017-04-30 15:13:28 |
116 | Blood test | r6 | 1054 | completed | 2017-04-30 21:24:18 |
116 | MRI SCAN | r1 | 1291 | started | 2017-05-01 01:12:51 |
116 | MRI SCAN | r4 | 1291 | completed | 2017-05-01 05:32:37 |
116 | Discuss Results | r3 | 1850 | started | 2017-05-01 09:44:20 |
116 | Discuss Results | r7 | 1850 | completed | 2017-05-01 14:00:48 |
116 | Check-out | r3 | 2345 | started | 2017-05-03 04:02:35 |
116 | Check-out | r2 | 2345 | completed | 2017-05-03 06:16:03 |
In this example, different resources (employees) sometimes perform the start and complete event of the same activity instance. Therefore, we resort to the eventlog format, which has no problems storing this. The steps to take are the following:
1. Recode the lifecycle variable appropriately.
2. Convert the timestamps to Date or POSIXct.
3. Create the log with the eventlog constructor function.
data %>%
    # recode lifecycle variable appropriately
    dplyr::mutate(registration_type = forcats::fct_recode(registration_type,
                                                          "start" = "started",
                                                          "complete" = "completed")) %>%
    convert_timestamps(columns = "time", format = ymd_hms) %>%
    eventlog(case_id = "patient",
             activity_id = "handling",
             activity_instance_id = "handling_id",
             lifecycle_id = "registration_type",
             timestamp = "time",
             resource_id = "employee")
## Warning in validate_eventlog(eventlog): The following activity instances are
## connected to more than one resource: 1054,116,1291,1850,2345,616
## # Log of 12 events consisting of:
## 1 trace
## 1 case
## 6 instances of 6 activities
## 6 resources
## Events occurred from 2017-04-29 03:24:59 until 2017-05-03 06:16:03
##
## # Variables were mapped as follows:
## Case identifier: patient
## Activity identifier: handling
## Resource identifier: employee
## Activity instance identifier: handling_id
## Timestamp: time
## Lifecycle transition: registration_type
##
## # A tibble: 12 × 7
## patient handling employee handling_id registration_type time
## <chr> <fct> <fct> <chr> <fct> <dttm>
## 1 116 Registrat… r2 116 start 2017-04-29 03:24:59
## 2 116 Registrat… r6 116 complete 2017-04-29 06:23:09
## 3 116 Triage an… r1 616 start 2017-04-29 15:41:27
## 4 116 Triage an… r7 616 complete 2017-04-30 03:04:21
## 5 116 Blood test r4 1054 start 2017-04-30 15:13:28
## 6 116 Blood test r6 1054 complete 2017-04-30 21:24:18
## 7 116 MRI SCAN r1 1291 start 2017-05-01 01:12:51
## 8 116 MRI SCAN r4 1291 complete 2017-05-01 05:32:37
## 9 116 Discuss R… r3 1850 start 2017-05-01 09:44:20
## 10 116 Discuss R… r7 1850 complete 2017-05-01 14:00:48
## 11 116 Check-out r3 2345 start 2017-05-03 04:02:35
## 12 116 Check-out r2 2345 complete 2017-05-03 06:16:03
## # ℹ 1 more variable: .order <int>
Note that we need an eventlog irrespective of which attribute values differ, i.e. it can be resources, but also any additional variables you have in your data set. For the special case of resource values, a different resource executing events in the same activity instance might be a data quality issue. If so, some functions can help you to identify this issue: Inconsistent Resources.
Again, if the activity instance id does not exist, some heuristics are available to generate it: [Missing activity instance identifier].
In order to correlate events which belong to the same activity instance, an activity instance identifier is required. For example, in the data shown below, a patient has gone through several surgeries, each with its own start and complete event. The activity instance identifier then makes it possible to distinguish which events belong together and which do not. It is important to note that this instance identifier should be unique, also across different cases and activities.
patient | activity | timestamp | status | activity_instance |
---|---|---|---|---|
John Doe | check-in | 2017-05-10 08:33:26 | complete | 1 |
John Doe | surgery | 2017-05-10 08:53:16 | start | 2 |
John Doe | surgery | 2017-05-10 09:25:19 | complete | 2 |
John Doe | treatment | 2017-05-10 10:01:25 | start | 3 |
John Doe | treatment | 2017-05-10 10:35:18 | complete | 3 |
John Doe | surgery | 2017-05-10 10:41:35 | start | 4 |
John Doe | surgery | 2017-05-10 11:05:56 | complete | 4 |
John Doe | check-out | 2017-05-11 14:52:36 | complete | 5 |
If the activity instance identifier is not available, you can use the assign_instance_id() function, which uses a heuristic to create the missing identifier. Alternatively, you can try to create the identifier on your own using dplyr::mutate() and other manipulation functions.
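To give an idea of what creating the identifier on your own can involve, here is a minimal base R sketch on the data shown above. It uses one simple rule, which is an assumption of this example and not necessarily the heuristic implemented by assign_instance_id(): within a case and activity, a new instance begins at every start event, and at every other event that does not directly follow a start.

```r
events <- data.frame(
  patient = "John Doe",
  activity = c("check-in", "surgery", "surgery", "treatment",
               "treatment", "surgery", "surgery", "check-out"),
  timestamp = as.POSIXct(c(
    "2017-05-10 08:33:26", "2017-05-10 08:53:16", "2017-05-10 09:25:19",
    "2017-05-10 10:01:25", "2017-05-10 10:35:18", "2017-05-10 10:41:35",
    "2017-05-10 11:05:56", "2017-05-11 14:52:36"
  ), tz = "UTC"),
  status = c("complete", "start", "complete", "start",
             "complete", "start", "complete", "complete"),
  stringsAsFactors = FALSE
)

# The rule below relies on chronological order.
events <- events[order(events$timestamp), ]

# Number the instances separately within each case/activity combination.
group <- paste(events$patient, events$activity, sep = " | ")
local_nr <- integer(nrow(events))
for (rows in split(seq_len(nrow(events)), group)) {
  s <- events$status[rows]
  prev <- c(NA, s[-length(s)])
  # A new instance opens at a start event, at the first event of the group,
  # or at any event that does not directly follow a start event.
  opens_new <- s == "start" | is.na(prev) | prev != "start"
  local_nr[rows] <- cumsum(opens_new)
}

# Turn (group, local number) into one identifier that is unique across
# all cases and activities, numbered in order of first appearance.
key <- paste(group, local_nr, sep = " | ")
events$activity_instance <- as.integer(factor(key, levels = unique(key)))
print(events[, c("activity", "status", "activity_instance")])
```

For this sample the rule reproduces the activity_instance column of the table above. It will mislabel events when two instances of the same activity overlap within one case, so check the result against what you know about your process.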
By default, bupaR validates certain properties of the activity instances that are supplied when creating an event log:
- a single activity instance identifier must not be connected to multiple cases, and
- a single activity instance identifier must not be connected to multiple activities.
However, these checks are not efficient and may lead to considerable performance issues for large data frames. It is possible to deactivate the validation in case you already know that your data fulfills all the requirements, using the argument validate = FALSE when creating the eventlog. Note that when the activity instance id was created with the assign_instance_id() function, you can assume the above properties hold.
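If you want to convince yourself that an identifier is unique across cases and activities before switching validation off, a check along the following lines can be run once on the raw data. It is a base R sketch of the uniqueness requirement described earlier, not a copy of bupaR's own validation; instance_ids_valid() and the two data frames are hypothetical.

```r
# Hypothetical helper (not part of bupaR): TRUE when every activity instance
# identifier belongs to exactly one case and exactly one activity.
instance_ids_valid <- function(events, instance_col, case_col, activity_col) {
  count_distinct <- function(col) {
    tapply(events[[col]], events[[instance_col]],
           function(x) length(unique(x)))
  }
  all(count_distinct(case_col) == 1) && all(count_distinct(activity_col) == 1)
}

good <- data.frame(
  patient = c("A", "A", "B", "B"),
  activity = c("surgery", "surgery", "surgery", "surgery"),
  activity_instance = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# Same events, but one identifier is reused across two cases.
bad <- good
bad$activity_instance <- c(1, 1, 1, 1)

instance_ids_valid(good, "activity_instance", "patient", "activity")
instance_ids_valid(bad, "activity_instance", "patient", "activity")
```

The first call returns TRUE and the second returns FALSE.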
Each event can contain the notion of a resource. It is possible that different events belonging to the same activity instance are executed by different resources, as in the eventlog below.
patient | handling | employee | handling_id | registration_type | time | .order |
---|---|---|---|---|---|---|
206 | Registration | r4 | 206 | start | 2017-07-19 15:48:14 | 1 |
206 | Triage and Assessment | r6 | 706 | start | 2017-07-19 17:03:44 | 2 |
206 | Registration | r3 | 206 | complete | 2017-07-19 17:03:44 | 3 |
206 | Triage and Assessment | r7 | 706 | complete | 2017-07-20 07:28:53 | 4 |
206 | Blood test | r1 | 1100 | start | 2017-07-25 03:02:14 | 5 |
206 | Blood test | r3 | 1100 | complete | 2017-07-25 08:14:46 | 6 |
206 | MRI SCAN | r6 | 1337 | start | 2017-07-25 12:37:36 | 7 |
206 | MRI SCAN | r2 | 1337 | complete | 2017-07-25 16:52:16 | 8 |
206 | Discuss Results | r2 | 1940 | start | 2017-07-26 07:36:36 | 9 |
206 | Discuss Results | r4 | 1940 | complete | 2017-07-26 11:08:03 | 10 |
206 | Check-out | r1 | 2435 | start | 2017-07-28 02:54:17 | 11 |
206 | Check-out | r7 | 2435 | complete | 2017-07-28 03:55:13 | 12 |
If you have a large dataset, and want to have an overview of the activity instances that have more than one resource connected to them, you can use the detect_resource_inconsistencies() function.
log %>%
    detect_resource_inconsistencies()
## # A tibble: 6 × 5
## patient handling handling_id complete start
## <chr> <fct> <chr> <chr> <chr>
## 1 206 Blood test 1100 r3 r1
## 2 206 Check-out 2435 r7 r1
## 3 206 Discuss Results 1940 r4 r2
## 4 206 MRI SCAN 1337 r2 r6
## 5 206 Registration 206 r3 r4
## 6 206 Triage and Assessment 706 r7 r6
If you want to remove these inconsistencies, a quick fix is to merge the resource labels together with fix_resource_inconsistencies(). Note that this is not needed for an eventlog, but it is for an activitylog. While the creation of the eventlog will emit a warning when resource inconsistencies exist, this should mostly be seen as a data quality warning. That said, there might be analyses related to the counting of resources where such inconsistencies lead to odd results.
log %>%
    fix_resource_inconsistencies()
## *** OUTPUT ***
## A total of 6 activity executions in the event log are classified as inconsistencies.
## They are spread over the following cases and activities:
## # A tibble: 6 × 5
## patient handling handling_id complete start
## <chr> <fct> <chr> <chr> <chr>
## 1 206 Blood test 1100 r3 r1
## 2 206 Check-out 2435 r7 r1
## 3 206 Discuss Results 1940 r4 r2
## 4 206 MRI SCAN 1337 r2 r6
## 5 206 Registration 206 r3 r4
## 6 206 Triage and Assessment 706 r7 r6
## Inconsistencies solved succesfully.
## # Log of 12 events consisting of:
## 1 trace
## 1 case
## 6 instances of 6 activities
## 6 resources
## Events occurred from 2017-07-19 15:48:14 until 2017-07-28 03:55:13
##
## # Variables were mapped as follows:
## Case identifier: patient
## Activity identifier: handling
## Resource identifier: employee
## Activity instance identifier: handling_id
## Timestamp: time
## Lifecycle transition: registration_type
##
## # A tibble: 12 × 7
## patient handling employee handling_id registration_type time
## <chr> <fct> <chr> <chr> <fct> <dttm>
## 1 206 Registrat… r3 - r4 206 start 2017-07-19 15:48:14
## 2 206 Triage an… r7 - r6 706 start 2017-07-19 17:03:44
## 3 206 Registrat… r3 - r4 206 complete 2017-07-19 17:03:44
## 4 206 Triage an… r7 - r6 706 complete 2017-07-20 07:28:53
## 5 206 Blood test r3 - r1 1100 start 2017-07-25 03:02:14
## 6 206 Blood test r3 - r1 1100 complete 2017-07-25 08:14:46
## 7 206 MRI SCAN r2 - r6 1337 start 2017-07-25 12:37:36
## 8 206 MRI SCAN r2 - r6 1337 complete 2017-07-25 16:52:16
## 9 206 Discuss R… r4 - r2 1940 start 2017-07-26 07:36:36
## 10 206 Discuss R… r4 - r2 1940 complete 2017-07-26 11:08:03
## 11 206 Check-out r7 - r1 2435 start 2017-07-28 02:54:17
## 12 206 Check-out r7 - r1 2435 complete 2017-07-28 03:55:13
## # ℹ 1 more variable: .order <int>
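The idea behind both functions can be sketched in base R on a plain data frame: count the distinct resources per activity instance, then replace the resource of every event in an inconsistent instance by a merged label. This is only an illustration under simplified assumptions; here the merged label is built from the sorted resource names, so its ordering can differ from the labels bupaR produces in the output above.

```r
events <- data.frame(
  handling_id = c("206", "706", "206", "706"),
  registration_type = c("start", "start", "complete", "complete"),
  employee = c("r4", "r6", "r3", "r7"),
  stringsAsFactors = FALSE
)

# Detect: activity instances connected to more than one resource.
resources_per_instance <- tapply(events$employee, events$handling_id,
                                 function(x) length(unique(x)))
inconsistent_ids <- names(resources_per_instance)[resources_per_instance > 1]
print(inconsistent_ids)

# Fix: give all events of an instance the same, merged resource label.
events$employee <- ave(events$employee, events$handling_id,
                       FUN = function(x) paste(sort(unique(x)), collapse = " - "))
print(events)
```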
[1] Currently both eventlog and activitylog are supported by the packages bupaR, edeaR and processmapR. The daqapo package only supports activitylog, while all other packages only support eventlog. While the goal is to extend support for both to all packages, you can in the meanwhile always convert the format of your log using the functions to_eventlog() and to_activitylog().
Copyright © 2023 bupaR - Hasselt University