1 Getting Started
This tutorial assumes that you are familiar with the best-known R packages of the tidyverse, in particular the tibble operations provided by dplyr and the forward-pipe operator %>% provided by magrittr.
1.1 Downloading the dataset
This tutorial presents a use case of the data.cube library on a fairly simple dataset (downloadable here) that records the citations of country names in press articles. This dataset was extracted from the corpus of articles gathered by the ANR GEOMEDIA Project.
First, import the main file of this dataset as a data.frame (actually, as a tibble):
library (readr)
df <- read_csv ('data/articles.csv')
## Parsed with column specification:
## cols(
## id_media = col_character(),
## week = col_date(format = ""),
## id_country = col_character(),
## article_nb = col_double()
## )
head (df)
## # A tibble: 6 x 4
## id_media week id_country article_nb
## <chr> <date> <chr> <dbl>
## 1 en_NZL_nzhera_int 2015-06-01 SVK 0.333
## 2 en_NZL_nzhera_int 2015-03-09 VEN 10.5
## 3 es_VEN_univer_int 2014-05-19 LBY 9
## 4 en_MYS_starmy_int 2014-10-13 PSE 9
## 5 en_NZL_nzhera_int 2014-06-09 GTM 1.5
## 6 en_MYS_starmy_int 2015-06-22 ARG 3
- The first column, id_media, contains standardised identifiers of the selected newspapers.
- The second column, week, contains publication dates at the week level (date of the first day of the week, in YYYY-MM-DD format).
- The third column, id_country, contains standardised identifiers of the cited countries (ISO 3166-1 alpha-3).
- The last column, article_nb, gives the corresponding number of articles, that is, the number of articles published by id_media during week and citing id_country.
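The column types listed above can also be declared explicitly when reading the file, which avoids relying on the guessed specification. The following is a minimal sketch using the col_types argument of read_csv; the two-row temporary file and the names tmp and sample_df are hypothetical stand-ins for data/articles.csv, introduced only so that the example runs on its own.

```r
library(readr)

# Hypothetical two-row file with the same four columns as data/articles.csv.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id_media,week,id_country,article_nb",
  "en_NZL_nzhera_int,2015-06-01,SVK,0.333",
  "es_VEN_univer_int,2014-05-19,LBY,9"
), tmp)

# Declare the expected type of each column instead of letting readr guess.
sample_df <- read_csv(
  tmp,
  col_types = cols(
    id_media   = col_character(),
    week       = col_date(format = ""),
    id_country = col_character(),
    article_nb = col_double()
  )
)

# Basic sanity checks on the parsed columns.
stopifnot(
  inherits(sample_df$week, "Date"),
  is.numeric(sample_df$article_nb),
  all(nchar(sample_df$id_country) == 3)  # alpha-3 codes have three letters
)
```

With an explicit specification, readr no longer prints the "Parsed with column specification" message, and a file whose columns do not match the declared types is reported as a parsing problem.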
For example, the third line of the dataset above indicates that the Venezuelan newspaper El Universal (es_VEN_univer_int) published 9 articles mentioning Libya (LBY) during the week starting on 19 May 2014 (2014-05-19). Note that the number of articles is not necessarily an integer (see, for example, the first line): an article citing n countries simultaneously is weighted by 1/n and distributed across n lines.
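The 1/n weighting can be illustrated on a toy example. The sketch below is not the GEOMEDIA preprocessing itself, which is not described here; it only shows one way fractional counts of this kind could arise with dplyr. The citations table, its id_article column, and the three made-up articles are assumptions for illustration.

```r
library(dplyr)

# Hypothetical input: one row per (article, cited country) pair.
# Article a1 cites one country, a2 cites three, a3 cites two.
citations <- tibble(
  id_article = c("a1", "a2", "a2", "a2", "a3", "a3"),
  id_media   = "en_NZL_nzhera_int",
  week       = as.Date("2015-06-01"),
  id_country = c("FRA", "SVK", "CZE", "POL", "FRA", "SVK")
)

weighted <- citations %>%
  group_by(id_article) %>%
  mutate(weight = 1 / n()) %>%  # an article citing n countries counts 1/n per country
  ungroup() %>%
  group_by(id_media, week, id_country) %>%
  summarise(article_nb = sum(weight), .groups = "drop")

# The weights of each article sum to 1, so the total equals the number of articles.
stopifnot(isTRUE(all.equal(sum(weighted$article_nb), 3)))
```

In this toy case FRA receives 1 + 1/2 = 1.5 and SVK receives 1/3 + 1/2, which is how non-integer values such as 0.333 or 10.5 can appear in article_nb.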
1.2 Loading the library
First, load the library:
library (data.cube)
1.3 Building the cube with as.data.cube
The as.data.cube function transforms a classical data.frame (or tibble) object into a data.cube, the data structure used throughout the library. You must specify which columns correspond to the cube's dimensions (in our case, the first three) and which columns correspond to the observed variables (in our case, the last one). Note that these dimensions and variables can also be renamed when transforming the data.frame into a data.cube.
geomedia <-
  df %>%
  as.data.cube (
    dim.names = list (media = id_media, week, country = id_country),
    var.names = list (articles = article_nb)
  )
## Warning in as.data.cube_.data.frame(., str.dim.names, str.var.names):
## Observations are assumed to be unique. Check for potential duplicates in
## the input data.frame if unsure.
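The warning above asks for a check that each combination of dimension values appears only once. A minimal sketch of such a check with dplyr is given below; it does not use data.cube at all. The toy tibble stands in for df and deliberately contains one duplicated (id_media, week, id_country) key, and summing duplicated rows is only one possible way to resolve them, assumed here for illustration.

```r
library(dplyr)

# Hypothetical stand-in for df, with the first key deliberately repeated.
toy <- tibble(
  id_media   = c("en_NZL_nzhera_int", "en_NZL_nzhera_int", "es_VEN_univer_int"),
  week       = as.Date(c("2015-06-01", "2015-06-01", "2014-05-19")),
  id_country = c("SVK", "SVK", "LBY"),
  article_nb = c(0.333, 1, 9)
)

# Keys that occur more than once would violate the uniqueness assumption.
duplicates <- toy %>%
  count(id_media, week, id_country, name = "n_rows") %>%
  filter(n_rows > 1)

# One possible resolution (an assumption): add up the duplicated observations.
deduplicated <- toy %>%
  group_by(id_media, week, id_country) %>%
  summarise(article_nb = sum(article_nb), .groups = "drop")
```

If duplicates has zero rows for the real df, the warning can be safely ignored.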
1.4 What’s in the cube with summary
The summary function then prints a short summary of the data contained in the resulting structure.
geomedia %>% summary ()
## data.cube of 3 dimensions and 1 variable
##
## -> Dimension media
## - Element number: 36
## - Class (type): character (character)
## - Element names: en_NZL_nzhera_int, es_VEN_univer_int, en_MYS_starmy_int, ...
##
## -> Dimension week
## - Element number: 79
## - Class (type): Date (double)
## - Element names: 2015-06-01, 2015-03-09, 2014-05-19, 2014-10-13, 2014-06-09, ...
##
## -> Dimension country
## - Element number: 205
## - Class (type): character (character)
## - Element names: SVK, VEN, LBY, PSE, GTM, ARG, ARM, AFG, NRU, MEX, IRN, ...
##
## -> Variable articles
## - Dimensions: media x week x country
## - Class (type): numeric (double)
## - NA value: num 0
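The element numbers reported for each dimension can be cross-checked against the input frame by counting distinct values per column. The sketch below does this with dplyr on a hypothetical three-row stand-in for df (the first three rows shown earlier); applied to the full dataset, the same call would be expected to return 36, 79 and 205.

```r
library(dplyr)

# Hypothetical stand-in for df: the first three rows of the dataset.
toy <- tibble(
  id_media   = c("en_NZL_nzhera_int", "en_NZL_nzhera_int", "es_VEN_univer_int"),
  week       = as.Date(c("2015-06-01", "2015-03-09", "2014-05-19")),
  id_country = c("SVK", "VEN", "LBY"),
  article_nb = c(0.333, 10.5, 9)
)

# Number of distinct elements along each future dimension of the cube.
element_counts <- toy %>%
  summarise(
    n_media   = n_distinct(id_media),
    n_week    = n_distinct(week),
    n_country = n_distinct(id_country)
  )
```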
1.5 Back to frame with as.data.frame
The as.data.frame function transforms a data.cube object back into a data.frame object (actually, a tibble).
geomedia %>% as.data.frame ()
## # A tibble: 95,674 x 4
## media week country articles
## <chr> <date> <chr> <dbl>
## 1 en_NZL_nzhera_int 2015-06-01 SVK 0.333
## 2 en_NZL_nzhera_int 2015-03-09 VEN 10.5
## 3 es_VEN_univer_int 2014-05-19 LBY 9
## 4 en_MYS_starmy_int 2014-10-13 PSE 9
## 5 en_NZL_nzhera_int 2014-06-09 GTM 1.5
## 6 en_MYS_starmy_int 2015-06-22 ARG 3
## 7 en_USA_wapost_int 2015-01-05 ARM 0.5
## 8 es_VEN_univer_int 2014-05-26 AFG 4.5
## 9 en_GBR_dailyt_int 2015-05-04 NRU 1
## 10 fr_BEL_derheu_int 2014-06-16 MEX 0.333
## # … with 95,664 more rows
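The row count above can be compared with the dimension sizes reported by summary to see how sparsely the cube is filled. This is plain arithmetic on the numbers printed in this tutorial; the variable names are chosen here for illustration.

```r
# Dimension sizes reported by summary() and row count of the returned tibble.
n_media   <- 36
n_week    <- 79
n_country <- 205
n_rows    <- 95674

# Number of cells in the full media x week x country cross product.
n_cells <- n_media * n_week * n_country

# Share of cells that actually hold an observation.
density <- n_rows / n_cells
```

The full cross product has 583,020 cells, so only about 16% of the possible (media, week, country) combinations are present as rows.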