Event filters

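library(bupaR)
library(edeaR)        # filter_* functions
library(eventdataR)   # patients event log
library(processmapR)  # process_map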

Activity

The filter_activity function can be used to filter activities by name. It has three arguments:

the event log
a vector of activities
the reverse argument (FALSE or TRUE)

patients %>%
    filter_activity(c("X-Ray", "Blood test")) %>%
    activities

## # A tibble: 2 × 3
##   handling   absolute_frequency relative_frequency
##   <fct>                   <int>              <dbl>
## 1 X-Ray                     261              0.524
## 2 Blood test                237              0.476

As one can see, there are only 2 distinct activities left in the event log.
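
The reverse argument inverts the selection. As a minimal sketch, assuming the patients log from eventdataR with its seven activities, setting reverse = TRUE drops the two listed activities and keeps the other five. The name without_tests is only illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Drop X-Ray and Blood test instead of keeping them
without_tests <- patients %>%
    filter_activity(c("X-Ray", "Blood test"), reverse = TRUE)

# List the activities that remain after the reversed filter
without_tests %>%
    activities()
```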

Activity Frequency

Relative filtering - using percentage

It is also possible to filter on activity frequency. This filter uses a percentile cut-off: it selects the most frequent activities until the required percentage of events has been reached. Thus, a cut-off of 80% selects the activities needed to represent 80% of the events. In the example below, the reverse argument is TRUE, so the most frequent activities needed to cover 50% of the events are removed and only the less frequent activities remain.

patients %>%
    filter_activity_frequency(percentage = 0.5, reverse = TRUE) %>%
    activities

## # A tibble: 4 × 3
##   handling   absolute_frequency relative_frequency
##   <fct>                   <int>              <dbl>
## 1 Check-out                 492              0.401
## 2 X-Ray                     261              0.213
## 3 Blood test                237              0.193
## 4 MRI SCAN                  236              0.192
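
The cut-off logic can be illustrated in base R. This is only a sketch of the idea, not edeaR's actual implementation: select_by_percentage is a hypothetical helper, and the frequencies are the absolute activity frequencies of the patients log reported on this page. With these numbers, the reversed call returns the same four activities as the output above.

```r
# Absolute activity frequencies of the patients log, as reported on this page
freq <- c(
    "Registration" = 500, "Triage and Assessment" = 500,
    "Discuss Results" = 495, "Check-out" = 492,
    "X-Ray" = 261, "Blood test" = 237, "MRI SCAN" = 236
)

# Hypothetical helper: illustrates the cut-off logic only,
# it is not the code used by filter_activity_frequency
select_by_percentage <- function(freq, percentage, reverse = FALSE) {
    sorted <- sort(freq, decreasing = TRUE)
    coverage <- cumsum(sorted) / sum(sorted)
    # Keep the most frequent activities up to and including the first
    # one whose cumulative share reaches the requested percentage
    n_keep <- which(coverage >= percentage)[1]
    selected <- names(sorted)[seq_len(n_keep)]
    if (reverse) setdiff(names(sorted), selected) else selected
}

# Most frequent activities needed to cover 50% of the events
select_by_percentage(freq, 0.5)

# The complement: what remains when reverse = TRUE
select_by_percentage(freq, 0.5, reverse = TRUE)
```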

Absolute filtering - using interval

Instead of providing a target percentage, we can provide a target frequency interval. For example, only retain the activities which occur between 300 and 500 times.

patients %>%
    filter_activity_frequency(interval = c(300,500)) %>%
    activities

## # A tibble: 4 × 3
##   handling              absolute_frequency relative_frequency
##   <fct>                              <int>              <dbl>
## 1 Registration                         500              0.252
## 2 Triage and Assessment                500              0.252
## 3 Discuss Results                      495              0.249
## 4 Check-out                            492              0.248

When we don’t know the maximal frequency (500 in this case), we can use a half-open interval by setting the upper bound to NA.

patients %>%
    filter_activity_frequency(interval = c(300, NA)) %>%
    activities

## # A tibble: 4 × 3
##   handling              absolute_frequency relative_frequency
##   <fct>                              <int>              <dbl>
## 1 Registration                         500              0.252
## 2 Triage and Assessment                500              0.252
## 3 Discuss Results                      495              0.249
## 4 Check-out                            492              0.248
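
The lower bound can presumably be left open in the same way. The sketch below assumes that NA is also accepted as the first element of interval. Going by the frequencies shown earlier, this would keep only the three activities that occur at most 300 times (X-Ray, Blood test and MRI SCAN). The name rare is only illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Assumption: NA as lower bound means "no minimum frequency"
rare <- patients %>%
    filter_activity_frequency(interval = c(NA, 300))

# Inspect which activities are left
rare %>%
    activities()
```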

Activity Instance

Specific activity instances can be selected using filter_activity_instance.

patients %>%
    filter_activity_instance(activity_instances = 10)

## # Log of 2 events consisting of:
## 1 trace 
## 1 case 
## 1 instance of 1 activity 
## 1 resource 
## Events occurred from 2017-01-06 05:58:54 until 2017-01-06 09:13:28 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 2 × 7
##   handling    patient employee handling_id registration_type time               
##   <fct>       <chr>   <fct>    <chr>       <fct>             <dttm>             
## 1 Registrati… 10      r1       10          start             2017-01-06 05:58:54
## 2 Registrati… 10      r1       10          complete          2017-01-06 09:13:28
## # ℹ 1 more variable: .order <int>
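
Several activity instances can be selected at once by passing a vector of identifiers. A minimal sketch, assuming the identifiers are given as character values because handling_id is a character column in the patients log. The name selected is only illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Select three activity instances by their identifier (handling_id)
selected <- patients %>%
    filter_activity_instance(activity_instances = c("1", "2", "10"))

# Count the cases these instances belong to
selected %>%
    n_cases()
```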

Lifecycle

filter_lifecycle can be used to select events with a specific lifecycle status.

patients %>%
    filter_lifecycle("complete")

## # Log of 2721 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 12:40:20 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 2,721 × 7
##    handling   patient employee handling_id registration_type time               
##    <fct>      <chr>   <fct>    <chr>       <fct>             <dttm>             
##  1 Registrat… 1       r1       1           complete          2017-01-02 12:40:20
##  2 Registrat… 2       r1       2           complete          2017-01-02 15:16:38
##  3 Registrat… 3       r1       3           complete          2017-01-04 06:36:54
##  4 Registrat… 4       r1       4           complete          2017-01-04 04:25:06
##  5 Registrat… 5       r1       5           complete          2017-01-04 20:07:50
##  6 Registrat… 6       r1       6           complete          2017-01-04 18:12:46
##  7 Registrat… 7       r1       7           complete          2017-01-05 06:27:49
##  8 Registrat… 8       r1       8           complete          2017-01-05 07:58:17
##  9 Registrat… 9       r1       9           complete          2017-01-06 07:18:32
## 10 Registrat… 10      r1       10          complete          2017-01-06 09:13:28
## # ℹ 2,711 more rows
## # ℹ 1 more variable: .order <int>

Lifecycle Presence

We can select activity instances that contain a specific status with filter_lifecycle_presence. Its workings are comparable to filter_activity_presence.
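
As a hedged sketch, the call below keeps the activity instances that have both a start and a complete event. It assumes that the statuses can be passed as the first argument after the log and that method = "all" behaves as it does in filter_activity_presence. Check the function's help page before relying on this. The name both is only illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Assumption: method = "all" requires every listed status to be present
both <- patients %>%
    filter_lifecycle_presence(c("start", "complete"), method = "all")

# Number of activity instances that have both statuses
both %>%
    n_activity_instances()
```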

Resource Labels

Similar to the activity filter, the resource filter can be used to filter events by listing one or more resources.

patients %>%
    filter_resource(c("r1","r4")) %>%
    resource_frequency("resource")

## # A tibble: 2 × 3
##   employee absolute relative
##   <fct>       <int>    <dbl>
## 1 r1            500    0.679
## 2 r4            236    0.321

Resource Frequency

Instead of filtering events by the resource that performed the activity, we can also filter events by the frequency of the resource. This works in the same way as the activity frequency filter. The filter below keeps the activity instances performed by the most frequent resources, adding resources until at least 80% of the activity instances are covered.

patients %>%
    filter_resource_frequency(percentage = 0.80) %>%
    resources()

## # A tibble: 5 × 3
##   employee absolute_frequency relative_frequency
##   <fct>                 <int>              <dbl>
## 1 r1                      500              0.222
## 2 r2                      500              0.222
## 3 r6                      495              0.220
## 4 r7                      492              0.219
## 5 r5                      261              0.116

Alternatively, using the interval argument, we can select resources who perform between 200 and 300 activity instances.

patients %>%
    filter_resource_frequency(interval = c(200,300)) %>%
    resources()

## # A tibble: 3 × 3
##   employee absolute_frequency relative_frequency
##   <fct>                 <int>              <dbl>
## 1 r5                      261              0.356
## 2 r3                      237              0.323
## 3 r4                      236              0.322
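
Combining interval with reverse = TRUE inverts this selection. A minimal sketch: given the seven resources in the patients log, the call below should leave the four resources whose frequency falls outside the 200 to 300 range. The name outside is only illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Keep only resources whose frequency lies outside [200, 300]
outside <- patients %>%
    filter_resource_frequency(interval = c(200, 300), reverse = TRUE)

# List the remaining resources and their frequencies
outside %>%
    resources()
```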

Trim to Endpoints

The trim filter is a special event filter, as it also takes into account the notion of cases. It trims cases so that they start with one of a set of start activities and end with one of a set of end activities. It requires two vectors: one with the possible start activities and one with the possible end activities. Each case is trimmed from the first appearance of a start activity to the last appearance of an end activity. When reversed, these slices of the event log are removed instead of preserved.

patients %>%
    filter_trim(start_activities = "Registration", end_activities =  c("MRI SCAN","X-Ray")) %>%
    process_map(type = performance())
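
The reversed variant can be sketched as follows. With reverse = TRUE, the events between the first start activity and the last end activity are removed, so only the events outside these slices remain. The name outside_trim is only illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Remove the slice from Registration up to the last MRI SCAN or X-Ray
outside_trim <- patients %>%
    filter_trim(start_activities = "Registration",
                end_activities = c("MRI SCAN", "X-Ray"),
                reverse = TRUE)

# Activities that still occur outside the trimmed slices
outside_trim %>%
    activities()
```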

Trim to Time Window

Instead of trimming cases to a particular start and/or end activity, we can also trim cases to a particular time window. For this we use the function filter_time_period with filter_method "trim". This filter needs a time interval, which is a vector of length 2 containing date/datetime values. These can be created easily using lubridate functions, e.g. ymd for year-month-day formats.

This example takes only activity instances which happened (at least partly, i.e. some events) in December of 2017.

library(lubridate)
patients %>%
    filter_time_period(interval = ymd(c(20171201, 20171231)), filter_method = "trim") %>%
    summary()

## Number of events:  290
## Number of cases:  36
## Number of traces:  13
## Number of distinct activities:  7
## Average trace length:  8.055556
## 
## Start eventlog:  2017-11-30 20:29:12
## End eventlog:  2017-12-31 08:00:08

##                   handling    patient          employee handling_id       
##  Blood test           :30   Length:290         r1:52    Length:290        
##  Check-out            :48   Class :character   r2:52    Class :character  
##  Discuss Results      :54   Mode  :character   r3:30    Mode  :character  
##  MRI SCAN             :30                      r4:30                      
##  Registration         :52                      r5:24                      
##  Triage and Assessment:52                      r6:54                      
##  X-Ray                :24                      r7:48                      
##  registration_type      time                         .order      
##  complete:145      Min.   :2017-11-30 20:29:12   Min.   :  1.00  
##  start   :145      1st Qu.:2017-12-06 01:04:43   1st Qu.: 73.25  
##                    Median :2017-12-13 13:12:47   Median :145.50  
##                    Mean   :2017-12-13 20:14:51   Mean   :145.50  
##                    3rd Qu.:2017-12-19 18:09:13   3rd Qu.:217.75  
##                    Max.   :2017-12-31 08:00:08   Max.   :290.00  
##

Using a different filter method (start, complete, contained or intersecting), this filter can also act as a case filter (see below).
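
As a sketch of one such method, the call below uses filter_method = "contained" on the same December 2017 window. It is assumed here that this keeps only the cases that fall entirely within the window. The names december and contained are only illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)
library(lubridate)

# Same time window as in the trim example
december <- ymd(c(20171201, 20171231))

# Act as a case filter: keep whole cases instead of trimming events
contained <- patients %>%
    filter_time_period(interval = december, filter_method = "contained")

# Number of cases that satisfy the condition
contained %>%
    n_cases()
```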
