library(bupaverse)
library(dplyr)
To make event logs easy to manipulate, the well-known dplyr verbs have been adapted to work with them. This page serves as a general introduction to these wrangling verbs. Their usage is illustrated throughout the documentation in Manipulation, Analysis and Visualization.
Using the group_by() function, event logs can be grouped according to one or more variables, such that all further computations happen separately for each of these groups.
In the next example, the number of cases is computed for each value of vehicleclass.
traffic_fines %>%
  group_by(vehicleclass) %>%
  n_cases()
## # A tibble: 4 × 2
## vehicleclass n_cases
## <chr> <int>
## 1 A 9973
## 2 C 21
## 3 M 6
## 4 <NA> 10000
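To make explicit what the grouped computation does, the sketch below reproduces the same counts with plain dplyr on the underlying data. It assumes that the case identifier of traffic_fines is stored in the column case_id (as the mapping printed further down this page suggests) and that n_cases() on a grouped log amounts to counting distinct case identifiers per group.

```r
library(bupaverse)
library(dplyr)

# Grouped computation on the event log itself.
via_bupar <- traffic_fines %>%
  group_by(vehicleclass) %>%
  n_cases()

# The same count on a plain data.frame: distinct case ids per vehicle class.
# Assumption: the case identifier column of traffic_fines is `case_id`.
via_dplyr <- traffic_fines %>%
  as.data.frame() %>%
  group_by(vehicleclass) %>%
  summarise(n_cases = n_distinct(case_id))

# Under the assumption above, both tables list the same counts per group.
all.equal(via_bupar$n_cases, via_dplyr$n_cases)
```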
For specific groupings, some auxiliary functions are available.
group_by_case - group by cases
group_by_activity - group by activity types
group_by_resource - group by resources
group_by_activity_resource - group by activity resource pair
group_by_activity_instance - group by activity instances.
For example, the number of cases in which a specific resource occurs can be computed as follows:
sepsis %>%
  group_by_resource() %>%
  n_cases()
## # A tibble: 26 × 2
## resource n_cases
## <fct> <int>
## 1 ? 294
## 2 A 985
## 3 B 1013
## 4 C 1050
## 5 D 46
## 6 E 782
## 7 F 200
## 8 G 147
## 9 H 50
## 10 I 118
## # … with 16 more rows
Note that each of the descriptive metrics discussed here can be rewritten using these lower-level functions. The example above is equal to the resource_involvement metric at case-level.
When you want to group on a combination of mapping variables, for example, for each combination of case and activity, you can use group_by_ids(). The following example counts the number of events per case and per activity:
patients %>%
  group_by_ids(case_id, activity_id) %>%
  n_events()
## # A tibble: 2,721 × 3
## patient handling n_events
## <chr> <fct> <int>
## 1 1 Blood test 2
## 2 1 Check-out 2
## 3 1 Discuss Results 2
## 4 1 MRI SCAN 2
## 5 1 Registration 2
## 6 1 Triage and Assessment 2
## 7 10 Check-out 2
## 8 10 Discuss Results 2
## 9 10 Registration 2
## 10 10 Triage and Assessment 2
## # … with 2,711 more rows
Note that the arguments of group_by_ids() are not the variable names of the case (patient) and activity (handling) columns, but unquoted mapping id-functions. You can thus use this function while being agnostic of the precise variable names.
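As an illustration of this agnosticism, the sketch below first asks the log which columns are mapped, and then wraps the grouping in a small helper that never mentions a column name. The helper events_per_case_activity() is hypothetical and only introduced here for the example; case_id() and activity_id() are the id-functions referred to above.

```r
library(bupaverse)
library(dplyr)

# The id-functions return the name of the mapped column for a given log.
case_id(patients)      # name of the case identifier column
activity_id(patients)  # name of the activity identifier column

# Hypothetical helper: it only refers to id-functions, never to concrete
# column names, so it can be applied to differently named event logs.
events_per_case_activity <- function(log) {
  log %>%
    group_by_ids(case_id, activity_id) %>%
    n_events()
}

events_per_case_activity(patients)
events_per_case_activity(sepsis)
```

The same helper works on both logs even though their case and activity columns are named differently.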
When a grouping is no longer needed, it can be removed using ungroup_eventlog().
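A minimal sketch of adding and removing a grouping is shown below. It assumes that dplyr's group_vars() can be used to inspect the grouping of an event log, which is expected to hold because grouped logs are built on top of grouped tibbles.

```r
library(bupaverse)
library(dplyr)

# Group the log by case and inspect the grouping variables.
grouped <- sepsis %>%
  group_by_case()
group_vars(grouped)

# Remove the grouping again; no grouping variables should remain.
ungrouped <- grouped %>%
  ungroup_eventlog()
group_vars(ungrouped)
```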
You can use mutate() to add new variables to an event log, possibly by using existing variables. In the next example, the total amount of lacticacid is computed for each case.
sepsis %>%
  group_by_case() %>%
  mutate(total_lacticacid = sum(lacticacid, na.rm = TRUE))
## # Groups: [case_id]
## Grouped # Log of 15214 events consisting of:
## 846 traces
## 1050 cases
## 15214 instances of 16 activities
## 26 resources
## Events occurred from 2013-11-07 08:18:29 until 2015-06-05 12:25:11
##
## # Variables were mapped as follows:
## Case identifier: case_id
## Activity identifier: activity
## Resource identifier: resource
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle
##
## # A tibble: 15,214 × 35
## case_id activity lifec…¹ resou…² timestamp age crp diagn…³
## <chr> <fct> <fct> <fct> <dttm> <dbl> <dbl> <chr>
## 1 A ER Registrat… comple… A 2014-10-22 11:15:41 85 NA A
## 2 A Leucocytes comple… B 2014-10-22 11:27:00 NA NA <NA>
## 3 A CRP comple… B 2014-10-22 11:27:00 NA 210 <NA>
## 4 A LacticAcid comple… B 2014-10-22 11:27:00 NA NA <NA>
## 5 A ER Triage comple… C 2014-10-22 11:33:37 NA NA <NA>
## 6 A ER Sepsis Tr… comple… A 2014-10-22 11:34:00 NA NA <NA>
## 7 A IV Liquid comple… A 2014-10-22 14:03:47 NA NA <NA>
## 8 A IV Antibioti… comple… A 2014-10-22 14:03:47 NA NA <NA>
## 9 A Admission NC comple… D 2014-10-22 14:13:19 NA NA <NA>
## 10 A CRP comple… B 2014-10-24 09:00:00 NA 1090 <NA>
## # … with 15,204 more rows, 27 more variables: diagnosticartastrup <lgl>,
## # diagnosticblood <lgl>, diagnosticecg <lgl>, diagnosticic <lgl>,
## # diagnosticlacticacid <lgl>, diagnosticliquor <lgl>, diagnosticother <lgl>,
## # diagnosticsputum <lgl>, diagnosticurinaryculture <lgl>,
## # diagnosticurinarysediment <lgl>, diagnosticxthorax <lgl>, disfuncorg <lgl>,
## # hypotensie <lgl>, hypoxie <lgl>, infectionsuspected <lgl>, infusion <lgl>,
## # lacticacid <dbl>, leucocytes <chr>, oligurie <lgl>, …
Generic filtering of events can be done using filter(), which takes an event log and any number of logical conditions. The example below keeps the events where the vehicle class is "C" and the amount is greater than 300.
traffic_fines %>%
  filter(vehicleclass == "C", amount > 300)
## # Log of 20 events consisting of:
## 1 trace
## 20 cases
## 20 instances of 1 activity
## 10 resources
## Events occurred from 2006-08-10 until 2008-02-09
##
## # Variables were mapped as follows:
## Case identifier: case_id
## Activity identifier: activity
## Resource identifier: resource
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle
##
## # A tibble: 20 × 18
## case_id activity lifec…¹ resou…² timestamp amount article dismi…³
## <chr> <fct> <fct> <fct> <dttm> <chr> <dbl> <chr>
## 1 A10060 Create Fi… comple… 541 2007-03-08 00:00:00 36.0 157 NIL
## 2 A10497 Create Fi… comple… 558 2007-03-30 00:00:00 36.0 157 NIL
## 3 A10818 Create Fi… comple… 561 2007-04-08 00:00:00 36.0 157 NIL
## 4 A11707 Create Fi… comple… 550 2007-04-24 00:00:00 36.0 157 NIL
## 5 A11936 Create Fi… comple… 557 2007-04-29 00:00:00 36.0 157 NIL
## 6 A12073 Create Fi… comple… 557 2007-05-03 00:00:00 36.0 157 NIL
## 7 A1408 Create Fi… comple… 559 2006-08-20 00:00:00 35.0 157 NIL
## 8 A14883 Create Fi… comple… 561 2007-06-29 00:00:00 36.0 157 NIL
## 9 A17130 Create Fi… comple… 541 2007-07-15 00:00:00 36.0 157 NIL
## 10 A1815 Create Fi… comple… 563 2006-08-10 00:00:00 35.0 157 NIL
## 11 A19109 Create Fi… comple… 556 2007-07-17 00:00:00 36.0 157 NIL
## 12 A23000 Create Fi… comple… 550 2007-12-29 00:00:00 36.0 157 NIL
## 13 A24247 Create Fi… comple… 561 2007-12-03 00:00:00 36.0 157 NIL
## 14 A24366 Create Fi… comple… 541 2008-02-09 00:00:00 36.0 157 NIL
## 15 A24634 Create Fi… comple… 537 2007-11-21 00:00:00 36.0 157 NIL
## 16 A24942 Create Fi… comple… 561 2007-12-30 00:00:00 36.0 157 NIL
## 17 A25581 Create Fi… comple… 559 2007-11-23 00:00:00 36.0 157 NIL
## 18 A25599 Create Fi… comple… 559 2007-11-24 00:00:00 36.0 157 NIL
## 19 A26099 Create Fi… comple… 559 2007-12-09 00:00:00 36.0 157 NIL
## 20 A26277 Create Fi… comple… 538 2008-01-07 00:00:00 36.0 157 NIL
## # … with 10 more variables: expense <chr>, lastsent <chr>, matricola <dbl>,
## # notificationtype <chr>, paymentamount <dbl>, points <dbl>,
## # totalpaymentamount <chr>, vehicleclass <chr>, activity_instance_id <chr>,
## # .order <int>, and abbreviated variable names ¹lifecycle, ²resource,
## # ³dismissal
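In the output above, amount is printed as a <chr> column, and the retained rows show amounts of 35.0 and 36.0 rather than values above 300. One plausible explanation, offered here as an assumption rather than a statement about the dataset, is that R compares a character value with a number as text. The base R sketch below shows that behaviour and one possible way around it, namely converting to numeric first. The amount vector in the sketch is made up for illustration.

```r
# Comparing a character value with a number coerces the number to text,
# so the comparison is made on strings rather than on numeric values.
"36.0" > 300              # TRUE: "36.0" sorts after "300"

# Converting to numeric first gives the numeric comparison.
as.numeric("36.0") > 300  # FALSE

# Applied to a small, made-up character vector shaped like the amount column:
amount <- c("35.0", "36.0", "71.5", "350.0", NA)
amount[which(as.numeric(amount) > 300)]  # "350.0"
```

If amount is indeed stored as text, the condition in the pipeline could be written as as.numeric(amount) > 300 instead.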
Variables of an event log can be selected using select(). By default, select() will always make sure that the mapping variables are retained. Otherwise, the result would no longer function as an eventlog object.
traffic_fines %>%
  select(vehicleclass)
## # Log of 34724 events consisting of:
## 44 traces
## 10000 cases
## 34724 instances of 11 activities
## 16 resources
## Events occurred from 2006-06-17 until 2012-03-26
##
## # Variables were mapped as follows:
## Case identifier: case_id
## Activity identifier: activity
## Resource identifier: resource
## Activity instance identifier: activity_instance_id
## Timestamp: timestamp
## Lifecycle transition: lifecycle
##
## # A tibble: 34,724 × 8
## vehiclec…¹ case_id activ…² activ…³ timestamp resou…⁴ lifec…⁵ .order
## <chr> <chr> <fct> <chr> <dttm> <fct> <fct> <int>
## 1 A A1 Create… 1 2006-07-24 00:00:00 561 comple… 1
## 2 <NA> A1 Send F… 2 2006-12-05 00:00:00 <NA> comple… 2
## 3 A A100 Create… 3 2006-08-02 00:00:00 561 comple… 3
## 4 <NA> A100 Send F… 4 2006-12-12 00:00:00 <NA> comple… 4
## 5 <NA> A100 Insert… 5 2007-01-15 00:00:00 <NA> comple… 5
## 6 <NA> A100 Add pe… 6 2007-03-16 00:00:00 <NA> comple… 6
## 7 <NA> A100 Send f… 7 2009-03-30 00:00:00 <NA> comple… 7
## 8 A A10000 Create… 8 2007-03-09 00:00:00 561 comple… 8
## 9 <NA> A10000 Send F… 9 2007-07-17 00:00:00 <NA> comple… 9
## 10 <NA> A10000 Insert… 10 2007-08-02 00:00:00 <NA> comple… 10
## # … with 34,714 more rows, and abbreviated variable names ¹vehicleclass,
## # ²activity, ³activity_instance_id, ⁴resource, ⁵lifecycle
By setting the argument force_df = TRUE, the mapping variables will not be retained, and the output will be a data.frame rather than an eventlog object. Note that this holds even when all mapping variables are selected.
traffic_fines %>%
  select(case_id, vehicleclass, amount, force_df = TRUE)
## # A tibble: 34,724 × 3
## case_id vehicleclass amount
## <chr> <chr> <chr>
## 1 A1 A 35.0
## 2 A1 <NA> <NA>
## 3 A100 A 35.0
## 4 A100 <NA> <NA>
## 5 A100 <NA> <NA>
## 6 A100 <NA> 71.5
## 7 A100 <NA> <NA>
## 8 A10000 A 36.0
## 9 A10000 <NA> <NA>
## 10 A10000 <NA> <NA>
## # … with 34,714 more rows
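The difference between the two calls can also be checked programmatically. The sketch below is a small check under the assumption that event logs carry the class "eventlog", as the text above suggests; the object names as_log and as_df are only illustrative.

```r
library(bupaverse)
library(dplyr)

# Without force_df, the mapping variables are kept and the result
# is still an event log.
as_log <- traffic_fines %>%
  select(vehicleclass)
inherits(as_log, "eventlog")

# With force_df = TRUE, only the requested columns remain and the
# result no longer carries the event log class.
as_df <- traffic_fines %>%
  select(case_id, vehicleclass, amount, force_df = TRUE)
inherits(as_df, "eventlog")
names(as_df)
```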
Similar to group_by_ids(), select_ids() can be used to select the mapping variables.
patients %>%
  select_ids(case_id, activity_id)
## # A tibble: 5,442 × 2
## patient handling
## <chr> <fct>
## 1 1 Registration
## 2 2 Registration
## 3 3 Registration
## 4 4 Registration
## 5 5 Registration
## 6 6 Registration
## 7 7 Registration
## 8 8 Registration
## 9 9 Registration
## 10 10 Registration
## # … with 5,432 more rows
Note again how the arguments are unquoted id-functions instead of raw variable names. select_ids() always returns a data.frame object, as typically not all ids in the mapping will be selected.
Event data can be sorted using arrange(). The desc() helper can be used to sort in descending order on an attribute.
# sort descending on time
patients %>%
  arrange(desc(time))
## # Log of 5442 events consisting of:
## 7 traces
## 500 cases
## 2721 instances of 7 activities
## 7 resources
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02
##
## # Variables were mapped as follows:
## Case identifier: patient
## Activity identifier: handling
## Resource identifier: employee
## Activity instance identifier: handling_id
## Timestamp: time
## Lifecycle transition: registration_type
##
## # A tibble: 5,442 × 7
## handling patient emplo…¹ handl…² regis…³ time .order
## <fct> <chr> <fct> <chr> <fct> <dttm> <int>
## 1 Triage and Assess… 500 r2 1000 comple… 2018-05-05 07:16:02 3721
## 2 Discuss Results 495 r6 2229 comple… 2018-05-05 02:49:57 4950
## 3 X-Ray 498 r5 1734 comple… 2018-05-05 01:34:30 4455
## 4 Triage and Assess… 500 r2 1000 start 2018-05-04 23:53:27 1000
## 5 Triage and Assess… 499 r2 999 comple… 2018-05-04 23:53:27 3720
## 6 Discuss Results 495 r6 2229 start 2018-05-04 23:50:05 2229
## 7 Discuss Results 489 r6 2223 comple… 2018-05-04 23:50:05 4944
## 8 X-Ray 498 r5 1734 start 2018-05-04 21:50:07 1734
## 9 X-Ray 497 r5 1733 comple… 2018-05-04 21:50:07 4454
## 10 Discuss Results 489 r6 2223 start 2018-05-04 20:24:44 2223
## # … with 5,432 more rows, and abbreviated variable names ¹employee,
## # ²handling_id, ³registration_type
Copyright © 2023 bupaR - Hasselt University