1 Getting Started

This tutorial assumes that you are familiar with the best-known R packages of the tidyverse, in particular the tibble operations provided by dplyr and the forward-pipe operator %>% provided by magrittr.

1.1 Downloading the dataset

This tutorial presents a use case of the data.cube library on a fairly simple dataset (downloadable here) that records citations of country names in press articles. The dataset was extracted from the corpus of articles gathered by the ANR GEOMEDIA Project.

First, import the main file of this dataset as a data.frame (actually, as a tibble):

library (readr)
df <- read_csv ('data/articles.csv')
## Parsed with column specification:
## cols(
##   id_media = col_character(),
##   week = col_date(format = ""),
##   id_country = col_character(),
##   article_nb = col_double()
## )
head (df)
## # A tibble: 6 x 4
##   id_media          week       id_country article_nb
##   <chr>             <date>     <chr>           <dbl>
## 1 en_NZL_nzhera_int 2015-06-01 SVK             0.333
## 2 en_NZL_nzhera_int 2015-03-09 VEN            10.5  
## 3 es_VEN_univer_int 2014-05-19 LBY             9    
## 4 en_MYS_starmy_int 2014-10-13 PSE             9    
## 5 en_NZL_nzhera_int 2014-06-09 GTM             1.5  
## 6 en_MYS_starmy_int 2015-06-22 ARG             3
  • First column id_media contains standardised identifiers of selected newspapers.
  • Second column week contains publication dates at the week level (date of the first day of the week, in YYYY-MM-DD format).
  • Third column id_country contains standardised identifiers of cited countries (ISO 3166-1 alpha-3).
  • Last column article_nb gives the corresponding number of articles, that is the number of articles published by id_media during week and citing id_country.

For example, the third line of the dataset above indicates that the Venezuelan newspaper El Universal (es_VEN_univer_int) published 9 articles talking about Libya (LBY) during the week starting on the 19th of May, 2014 (2014-05-19). Note that the number of articles is not necessarily an integer (see for example the first line): an article citing n countries at once is split evenly, contributing a weight of 1/n to each of the n corresponding lines.
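
The 1/n weighting can be illustrated with a small sketch in base R. The input table citations and its column names are hypothetical (the tutorial dataset ships already aggregated). The sketch only shows how fractional counts such as 0.333 or 1.5 can arise, and is not the procedure actually used by the GEOMEDIA project.

```r
# Hypothetical toy input: one row per (article, cited country) pair.
citations <- data.frame (
    article = c ("a1", "a2", "a2", "a3", "a3", "a3"),
    country = c ("LBY", "LBY", "VEN", "LBY", "VEN", "ARG"),
    stringsAsFactors = FALSE
)

# n = number of countries cited by each article; each pair then weighs 1/n.
n <- ave (seq_along (citations$article), citations$article, FUN = length)
citations$weight <- 1 / n

# Summing the weights per country gives fractional article counts.
article_nb <- tapply (citations$weight, citations$country, sum)
print (round (article_nb, 3))
```

With this scheme, the weights of one article always add up to 1, so the total over all countries equals the number of articles (3 in this toy example).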

1.2 Loading the library

First, load the library:

library (data.cube)

1.3 Building the cube with as.data.cube

Function as.data.cube transforms a classical data.frame (or tibble) object into a data.cube, the data structure used by the rest of the library. You need to specify which columns correspond to the cube’s dimensions (in our case, the first three) and which columns correspond to the observed variables (in our case, the last one). These dimensions and variables can also be renamed in the same call, as done here for media, country and articles.

geomedia <-
    df %>%
    as.data.cube (
        dim.names = list (media = id_media, week, country = id_country),
        var.names = list (articles = article_nb)
    )
## Warning in as.data.cube_.data.frame(., str.dim.names, str.var.names):
## Observations are assumed to be unique. Check for potential duplicates in
## the input data.frame if unsure.
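
The warning asks you to check that each combination of dimension values appears only once. Below is a minimal sketch of such a check in base R, run on a hypothetical toy table (toy) that has the same columns as df and one deliberate duplicate. The same two duplicated calls could be applied to df itself before calling as.data.cube.

```r
# Hypothetical toy data: rows 1 and 3 share the same
# (id_media, week, id_country) combination.
toy <- data.frame (
    id_media   = c ("en_NZL_nzhera_int", "es_VEN_univer_int", "en_NZL_nzhera_int"),
    week       = as.Date (c ("2015-06-01", "2014-05-19", "2015-06-01")),
    id_country = c ("SVK", "LBY", "SVK"),
    article_nb = c (0.333, 9, 1),
    stringsAsFactors = FALSE
)

dim_cols <- c ("id_media", "week", "id_country")

# Flag every row whose dimension combination occurs more than once
# (scanning from both ends so that the first occurrence is flagged too).
is_dup <- duplicated (toy[dim_cols]) | duplicated (toy[dim_cols], fromLast = TRUE)
print (toy[is_dup, ])
```

If any(is_dup) is FALSE on your data, the uniqueness assumption mentioned in the warning holds.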

1.4 What’s in the cube with summary

Function summary then prints a short summary of the data contained in the resulting structure.

geomedia %>% summary ()
## data.cube of 3 dimensions and 1 variable
## 
## -> Dimension media
##  - Element number: 36
##  - Class (type):   character (character)
##  - Element names:  en_NZL_nzhera_int, es_VEN_univer_int, en_MYS_starmy_int, ...
## 
## -> Dimension week
##  - Element number: 79
##  - Class (type):   Date (double)
##  - Element names:  2015-06-01, 2015-03-09, 2014-05-19, 2014-10-13, 2014-06-09, ...
## 
## -> Dimension country
##  - Element number: 205
##  - Class (type):   character (character)
##  - Element names:  SVK, VEN, LBY, PSE, GTM, ARG, ARM, AFG, NRU, MEX, IRN, ...
## 
## -> Variable articles
##  - Dimensions:   media x week x country
##  - Class (type): numeric (double)
##  - NA value:     num 0

1.5 Back to frame with as.data.frame

Function as.data.frame transforms a data.cube object back into a data.frame object (actually, a tibble).

geomedia %>% as.data.frame ()
## # A tibble: 95,674 x 4
##    media             week       country articles
##    <chr>             <date>     <chr>      <dbl>
##  1 en_NZL_nzhera_int 2015-06-01 SVK        0.333
##  2 en_NZL_nzhera_int 2015-03-09 VEN       10.5  
##  3 es_VEN_univer_int 2014-05-19 LBY        9    
##  4 en_MYS_starmy_int 2014-10-13 PSE        9    
##  5 en_NZL_nzhera_int 2014-06-09 GTM        1.5  
##  6 en_MYS_starmy_int 2015-06-22 ARG        3    
##  7 en_USA_wapost_int 2015-01-05 ARM        0.5  
##  8 es_VEN_univer_int 2014-05-26 AFG        4.5  
##  9 en_GBR_dailyt_int 2015-05-04 NRU        1    
## 10 fr_BEL_derheu_int 2014-06-16 MEX        0.333
## # … with 95,664 more rows
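
As a quick sanity check, the figures printed by summary and as.data.frame can be combined to estimate how densely the cube is filled. The sketch below assumes that each of the 95,674 rows corresponds to a distinct cell (as the warning above presumes) and that all other cells carry the NA value 0 reported by summary. The variable names are illustrative only.

```r
# Figures copied from the outputs above.
n_media   <- 36
n_week    <- 79
n_country <- 205
n_rows    <- 95674

# Total number of cells in the media x week x country cube.
n_cells <- n_media * n_week * n_country

# Share of cells holding an explicit observation.
fill <- n_rows / n_cells

print (n_cells)
print (round (fill, 3))
```

Under these assumptions, the cube has 583,020 cells, of which roughly 16% hold an explicit observation.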