Additional Features



linea offers a few useful features to make modelling quicker, simpler, and more accurate. This page covers a basic implementation of the features below:

  • Categories: Aggregate and visualize your variables
  • Seasonality: Automatically generate seasonality variables
  • Testing: Run multiple models to quickly test different variables
  • GTrends: Import Google Trends data
  • Pools: Build models with panel data

We will run simple models on some fictitious data sourced from Google Trends. The aim of this exercise is to demonstrate the features above.

We start by importing linea and some other useful libraries.

library(linea) # modelling
library(tidyverse) # data manipulation
library(plotly) # visualization
library(DT) # visualization

Categories

The output of the linea::decomp_chart() function can be grouped based on a data.frame mapping variables to categories and, optionally, to specific operations (e.g. min). This simplifies the visualization and focuses attention on specific groups of variables. Let's start by looking at a non-aggregated variable decomposition.

First, we import some data…

data_path = 'https://raw.githubusercontent.com/paladinic/data/main/ecomm_data.csv'

data = read_xcsv(file = data_path)

data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

…and run a model.

dv = 'ecommerce'
ivs = c('christmas','covid','black.friday','offline_media')

model = data %>% 
  run_model(dv = dv,
            ivs = ivs)

summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22738.0  -4713.4     -4.6   4550.7  21995.4 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.642e+04  5.486e+02 102.849  < 2e-16 ***
## christmas     2.913e+02  2.523e+01  11.546  < 2e-16 ***
## covid         3.014e+02  1.606e+01  18.775  < 2e-16 ***
## black.friday  2.796e+02  3.791e+01   7.374 2.29e-12 ***
## offline_media 5.538e+00  6.509e-01   8.507 1.51e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7038 on 256 degrees of freedom
## Multiple R-squared:  0.7752, Adjusted R-squared:  0.7717 
## F-statistic: 220.8 on 4 and 256 DF,  p-value: < 2.2e-16

Now we can plot our variable decomposition.

model %>% 
  decomp_chart(variable_decomp = TRUE)

Now let's create a categories data.frame to group the ‘christmas’ and ‘black.friday’ variables together.

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media')
)

model = run_model(
  data = data,
  dv = dv,
  ivs = ivs,
  categories = categories,
  id_var = 'date' # specify horizontal axis
) 

model %>% 
  decomp_chart(variable_decomp = FALSE)

The ‘christmas’ and ‘black.friday’ variables are derived from Google Trends, which captures the impact of these events over time. As there is always some level of search for these keywords throughout the year, the series never reaches zero. Using the calc column of the categories data.frame, we can tell linea to add this minimum level of search to the intercept, isolating the impact of the variable's variation.
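In other words, for a variable with calc = 'min', the constant part of its contribution, min(x) times the coefficient, is moved into the base, while the remainder stays with the variable. The arithmetic can be sketched in base R (an illustration of the idea, not linea's internal code; the numbers are made up):

```r
# illustrative split of a contribution when calc = 'min'
x    <- c(20, 25, 90, 30, 22)  # e.g. a weekly search index that never hits zero
coef <- 290                    # hypothetical model coefficient

base_contribution     <- min(x) * coef        # constant, added to the intercept
variable_contribution <- (x - min(x)) * coef  # varies over time, shown as the variable

# the two parts always sum back to the full contribution x * coef
all.equal(base_contribution + variable_contribution, x * coef)
```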

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media'),
  calc = c('min','none','min','none')
)

model = run_model(
  data = data,
  dv = dv,
  ivs = ivs,
  categories = categories,
  id_var = 'date' # specify horizontal axis
) 

model %>% 
  decomp_chart(variable_decomp = FALSE)


Seasonality

While the model above captures some of the variation in our ecommerce variable, a lot is still left unexplained. Using a date column of type Date, we can generate seasonality variables with linea::get_seasonality(). Several columns will be added to the original data.frame. These are mainly dummy variables that capture some basic holidays, as well as year, month, and week number. A trend variable is also added: a column running from 1 to n, where n is the number of rows.
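As a rough base-R illustration of the kind of columns generated (hypothetical, not linea's implementation; column names are made up for the sketch):

```r
# sketch of the kinds of columns a seasonality helper adds
dates <- seq(as.Date('2019-01-06'), by = 'week', length.out = 10)
df    <- data.frame(date = dates)

df$trend  <- seq_len(nrow(df))                  # 1 to n
df$week   <- as.integer(format(df$date, '%V'))  # ISO week number
df$month  <- format(df$date, '%b')              # month label
df$week_2 <- as.integer(df$week == 2)           # dummy for week 2

head(df)
```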

data = data %>%
  get_seasonality(
    date_col_name = 'date',
    date_type = 'weekly ending')

data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
plot_ly(data) %>%
  add_bars(y = ~ week_26,
           x = ~ date,
           name = 'week_26',
           color = color_palette()[1]) %>%
  add_bars(y = ~ new_years_eve,
           x = ~ date,
           name = 'new_years_eve',
           color = color_palette()[2]) %>%
  add_bars(y = ~ year_2019,
           x = ~ date,
           name = 'year_2019',
           color = color_palette()[3]) %>%
  layout(yaxis = list(title = 'value'),
         title = 'Seasonality Variables',         
         plot_bgcolor  = "rgba(0, 0, 0, 0)",
         paper_bgcolor = "rgba(0, 0, 0, 0)")

These variables can be used in the model to capture the seasonal component of the dependent variable, among other things (e.g. trend).

ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec')

model = run_model(data = data,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20899.1  -3149.9   -871.3   2667.1  20500.2 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.822e+04  7.763e+02  62.115  < 2e-16 ***
## christmas     2.546e+02  3.201e+01   7.955 5.93e-14 ***
## covid         1.482e+02  1.738e+01   8.525 1.38e-15 ***
## black.friday  2.713e+02  3.215e+01   8.438 2.47e-15 ***
## offline_media 5.609e+00  5.098e-01  11.003  < 2e-16 ***
## trend         8.142e+01  6.384e+00  12.753  < 2e-16 ***
## month_Dec     1.573e+03  2.083e+03   0.755    0.451    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5510 on 254 degrees of freedom
## Multiple R-squared:  0.8633, Adjusted R-squared:  0.8601 
## F-statistic: 267.3 on 6 and 254 DF,  p-value: < 2.2e-16

Thanks to the new variables, this model has a better R squared (~86%) than the previous one. The impact of these variables can be seen clearly using the linea::decomp_chart() function.

model %>%
  decomp_chart()

To simplify this visualization, it is worth using categories, as demonstrated previously.

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media','Base','seasonality'),
  calc = c('min','none','min','none','none','none')
)

model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

model %>% decomp_chart()


Testing

While the model is improving thanks to the seasonal variables introduced, selecting which variable to add next can be tricky and tedious. The linea::what_next() function helps speed up this process.

df = model %>% what_next()

df %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

As shown above, the linea::what_next() function generates a data.frame where each row represents a variable in our data, and the impact it would have on our model in terms of:

  • adjusted R squared
  • coefficient
  • t statistic
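Conceptually, this kind of search can be approximated by refitting the model with each remaining column added in turn and recording the fit statistics. A simplified base-R sketch of the idea (not linea's actual implementation):

```r
# for each candidate column, refit the model with it added and
# record adjusted R squared, coefficient, and t statistic
test_next_vars <- function(df, dv, ivs, candidates) {
  base_f <- paste(sprintf('`%s`', dv), '~',
                  paste(sprintf('`%s`', ivs), collapse = ' + '))
  do.call(rbind, lapply(candidates, function(v) {
    fit   <- lm(as.formula(paste0(base_f, ' + `', v, '`')), data = df)
    s     <- summary(fit)
    coefs <- s$coefficients
    data.frame(variable    = v,
               adj_R2      = s$adj.r.squared,
               coefficient = coefs[nrow(coefs), 'Estimate'],   # last term added
               t_stat      = coefs[nrow(coefs), 't value'])
  }))
}
```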

We can now quickly see which variables are more likely to benefit the model.

ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51')

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media','Base','seasonality','covid','seasonality'),
  calc = c('min','none','min','none','none','none','none','none')
)

model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14353.5  -2856.5   -891.7   2910.0  20611.0 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.765e+04  6.658e+02  71.572  < 2e-16 ***
## christmas      3.112e+02  2.947e+01  10.560  < 2e-16 ***
## covid          1.910e+02  1.575e+01  12.132  < 2e-16 ***
## black.friday   2.483e+02  2.777e+01   8.940  < 2e-16 ***
## offline_media  4.756e+00  4.441e-01  10.709  < 2e-16 ***
## trend          8.429e+01  5.472e+00  15.404  < 2e-16 ***
## month_Dec      2.243e+03  1.782e+03   1.259    0.209    
## year_2021     -1.210e+04  1.599e+03  -7.567 7.19e-13 ***
## week_51       -1.625e+04  2.612e+03  -6.219 2.07e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4707 on 252 degrees of freedom
## Multiple R-squared:  0.901,  Adjusted R-squared:  0.8979 
## F-statistic: 286.7 on 8 and 252 DF,  p-value: < 2.2e-16
model %>% decomp_chart()

GTrends

The model is getting better and better, with an adjusted R squared almost reaching 90%. This doesn't mean it can't be improved further! Google Trends can be a very useful source of data, as Google search volumes are often correlated with events and can be used as a proxy for a missing variable. The function linea::gt_f() returns the original data.frame with the added Google Trends variable.

data = data %>%
  gt_f(kw = 'ramadan',append = TRUE) %>%
  gt_f(kw = 'trump',append = TRUE) %>%
  gt_f(kw = 'prime day',append = TRUE) %>%
  gt_f(kw = 'amazon workers',append = TRUE)

data %>%
  datatable(options = list(scrollX = TRUE),rownames = NULL)
plot_ly(data) %>%
  add_lines(y = ~ gtrends_ramadan,
           x = ~ date,
           name = 'gtrends_ramadan',
           color = color_palette()[1]) %>%
  add_lines(y = ~ gtrends_trump,
           x = ~ date,
           name = 'gtrends_trump',
           color = color_palette()[2]) %>%
  add_lines(y = ~ `gtrends_prime day`,
           x = ~ date,
           name = 'gtrends_prime day',
           color = color_palette()[3]) %>%
  layout(yaxis = list(title = 'value'),
         title = 'Google Trend Variables',         
         plot_bgcolor  = "rgba(0, 0, 0, 0)",
         paper_bgcolor = "rgba(0, 0, 0, 0)")

Now that these variables are part of our data, we can use the linea::what_next() function to see if they can be added to the model.

df = model %>% what_next(data = data)

df %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

As shown in the table above, the new variable, gtrends_prime day, seems like a sensible addition to the model.

ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51','gtrends_prime day')

model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14685.4  -2728.5   -665.1   2782.3  14956.6 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.731e+04  6.102e+02  77.528  < 2e-16 ***
## christmas          3.171e+02  2.694e+01  11.771  < 2e-16 ***
## covid              1.930e+02  1.439e+01  13.409  < 2e-16 ***
## black.friday       2.537e+02  2.539e+01   9.994  < 2e-16 ***
## offline_media      4.688e+00  4.059e-01  11.549  < 2e-16 ***
## trend              8.201e+01  5.010e+00  16.368  < 2e-16 ***
## month_Dec          2.501e+03  1.628e+03   1.536    0.126    
## year_2021         -1.149e+04  1.463e+03  -7.854 1.18e-13 ***
## week_51           -1.651e+04  2.387e+03  -6.917 3.82e-11 ***
## gtrends_prime day  1.760e+02  2.468e+01   7.131 1.06e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4301 on 251 degrees of freedom
## Multiple R-squared:  0.9177, Adjusted R-squared:  0.9147 
## F-statistic: 310.9 on 9 and 251 DF,  p-value: < 2.2e-16

Using the variable decomposition we can see the new variable is nicely fitting that July peak.

model %>% decomp_chart(variable_decomp = TRUE)

The model now has an R squared greater than 90% and can be presented in a more polished way using categories and other charting functions.

ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51','gtrends_prime day')

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','retail events','media','Base','seasonality','covid','seasonality','retail events'),
  calc = c('min','none','min','none','none','none','none','none','none')
)

model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

model %>% 
  decomp_chart()
model %>% fit_chart()

Pools

Another set of features relates to panel data and pooled models. linea's pooling functionality divides the dependent variable by the mean of each group (pool, panel, region, etc.). When the coefficients are then multiplied by that same mean, we get a scaled coefficient for each group.
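The normalisation can be sketched in a few lines of dplyr (an illustration of the idea, not linea's code; the toy data and names are made up): divide the dependent variable by its group mean before fitting, then multiply the fitted coefficient back by each group's mean to get group-level coefficients.

```r
library(dplyr)

# toy panel: two countries on very different scales
df <- data.frame(
  country = rep(c('UK', 'India'), each = 4),
  y       = c(100, 120, 110, 130, 10, 12, 11, 13),
  x       = c(  1,   2, 1.5, 2.5,  1,  2, 1.5, 2.5)
)

# normalise the dependent variable by the mean of each pool
df <- df %>%
  group_by(country) %>%
  mutate(pool_mean = mean(y), y_norm = y / pool_mean) %>%
  ungroup()

# one model across all pools, on the normalised scale
fit <- lm(y_norm ~ x, data = df)

# re-scale the common coefficient back to each pool's units
scaled_coefs <- df %>%
  distinct(country, pool_mean) %>%
  mutate(x_coef = coef(fit)[['x']] * pool_mean)
```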

Let's start by looking at some pooled data. As we can see, the data below, again generated through Google Trends, has a non-numeric variable: country.

data_path = 'https://raw.githubusercontent.com/paladinic/data/main/pooled%20data.csv'

data = read_xcsv(file = data_path)

data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

To run a pooled model we must pass a pool_var, a character string with the name of the pool variable (i.e. country), to linea::run_model(). To apply the normalisation, the normalise_by_pool parameter must be set to TRUE.

dv = 'amazon'
ivs = c('christmas','rakhi','diwali')
id_var = 'Week'

pool_var = 'country'

model = run_model(data = data,
                  dv = dv,
                  ivs = ivs,
                  id_var = id_var,
                  pool_var = pool_var,
                  normalise_by_pool = TRUE)

model %>% 
  decomp_chart()

In the decomposition above, the model's decomposition is aggregated across pools, while still using the re-scaled coefficients. Visualization functions such as linea::decomp_chart() allow you to filter the visualization by pool, as shown below.

model %>% 
  decomp_chart(pool = 'UK')
model %>% 
  decomp_chart(pool = 'India') 

Next Steps

  1. The Getting Started page is a good place to start learning how to build basic linear models with linea.

  2. The Advanced Features page shows how to implement the features of linea that allow users to capture non-linear relationships.