Additional Features



linea offers a few useful features to make modelling quicker, simpler, and more accurate. This page covers a basic implementation of the features below:

  • Categories: Aggregate and visualize your variables
  • Seasonality: Automatically generate seasonality variables
  • Testing: Run multiple models to quickly test different variables
  • GTrends: Import Google Trends data
  • Pools: Build models with panel data

We will run simple models on some fictitious data sourced from Google Trends. The aim of this exercise is to demonstrate the features above.

We start by importing linea and some other useful libraries.

library(linea) # modelling
library(tidyverse) # data manipulation
library(plotly) # visualization
library(DT) # visualization

Categories

The output of the linea::decomp_chart() function can be grouped based on a data.frame mapping variables to categories and, optionally, to specific operations (e.g. min). This simplifies the visualization and focuses attention on specific groups of variables. Let's start by looking at a non-aggregated variable decomposition.

First, we import some data…

data_path = 'https://raw.githubusercontent.com/paladinic/data/main/ecomm_data.csv'

data = read_xcsv(file = data_path)

data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

…and run a model.

dv = 'ecommerce'
ivs = c('christmas','covid','black.friday','offline_media')

model = data %>% 
  run_model(dv = dv,
            ivs = ivs)

summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -22738.0  -4713.4     -4.6   4550.7  21995.4 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   5.642e+04  5.486e+02 102.849  < 2e-16 ***
## christmas     2.913e+02  2.523e+01  11.546  < 2e-16 ***
## covid         3.014e+02  1.606e+01  18.775  < 2e-16 ***
## black.friday  2.796e+02  3.791e+01   7.374 2.29e-12 ***
## offline_media 5.538e+00  6.509e-01   8.507 1.51e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7038 on 256 degrees of freedom
## Multiple R-squared:  0.7752, Adjusted R-squared:  0.7717 
## F-statistic: 220.8 on 4 and 256 DF,  p-value: < 2.2e-16

Now we can plot our variable decomposition.

model %>% 
  decomp_chart(variable_decomp = TRUE)

Now let's create a categories data.frame to group the ‘christmas’ and ‘black.friday’ variables together.

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media')
)

model = run_model(
  data = data,
  dv = dv,
  ivs = ivs,
  categories = categories,
  id_var = 'date' # specify horizontal axis
) 

model %>% 
  decomp_chart(variable_decomp = FALSE)

The ‘christmas’ and ‘black.friday’ variables are derived from Google Trends, which captures the impact of these events over time. As there is always some level of search for these keywords throughout the year, the series never reaches zero. Using the calc column of the categories data.frame, we can tell linea to add this minimum level of search to the intercept, isolating the impact of the variable's variation.
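In other words, for a variable with calc = 'min', the constant part of its contribution, min(x) times the coefficient, is moved into the base, while the remainder stays with the variable. The arithmetic can be sketched in base R (an illustration of the idea, not linea's internal code; the numbers are made up):

```r
# illustrative split of a contribution when calc = 'min'
x    <- c(20, 25, 90, 30, 22)  # e.g. a weekly search index that never hits zero
coef <- 290                    # hypothetical model coefficient

base_contribution     <- min(x) * coef        # constant, added to the intercept
variable_contribution <- (x - min(x)) * coef  # varies over time, shown as the variable

# the two parts always sum back to the full contribution x * coef
all.equal(base_contribution + variable_contribution, x * coef)
```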

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media'),
  calc = c('min','none','min','none')
)

model = run_model(
  data = data,
  dv = dv,
  ivs = ivs,
  categories = categories,
  id_var = 'date' # specify horizontal axis
) 

model %>% 
  decomp_chart(variable_decomp = FALSE)


Seasonality

While the model above captures some of the variation in our ecommerce variable, a lot is still left unexplained. Using a date column of type Date, we can generate seasonality variables with linea::get_seasonality(). Several columns will be added to the original data.frame. These are mainly dummy variables that capture some basic holidays, as well as year, month, and week number. A trend variable is also added: a column running from 1 to n, where n is the number of rows.
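As a rough base-R illustration of the kind of columns generated (hypothetical, not linea's implementation; column names are made up for the sketch):

```r
# sketch of the kinds of columns a seasonality helper adds
dates <- seq(as.Date('2019-01-06'), by = 'week', length.out = 10)
df    <- data.frame(date = dates)

df$trend  <- seq_len(nrow(df))                  # 1 to n
df$week   <- as.integer(format(df$date, '%V'))  # ISO week number
df$month  <- format(df$date, '%b')              # month label
df$week_2 <- as.integer(df$week == 2)           # dummy for week 2

head(df)
```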

data = data %>%
  get_seasonality(
    date_col_name = 'date',
    date_type = 'weekly ending')

data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))
plot_ly(data) %>%
  add_bars(y = ~ week_26,
           x = ~ date,
           name = 'week_26',
           color = color_palette()[1]) %>%
  add_bars(y = ~ new_years_eve,
           x = ~ date,
           name = 'new_years_eve',
           color = color_palette()[2]) %>%
  add_bars(y = ~ year_2019,
           x = ~ date,
           name = 'year_2019',
           color = color_palette()[3]) %>%
  layout(yaxis = list(title = 'value'),
         title = 'Seasonality Variables',         
         plot_bgcolor  = "rgba(0, 0, 0, 0)",
         paper_bgcolor = "rgba(0, 0, 0, 0)")

These variables can be used in the model to capture the seasonal component of the dependent variable, among other things (e.g. trend).

ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec')

model = run_model(data = data,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20899.1  -3149.9   -871.3   2667.1  20500.2 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.822e+04  7.763e+02  62.115  < 2e-16 ***
## christmas     2.546e+02  3.201e+01   7.955 5.93e-14 ***
## covid         1.482e+02  1.738e+01   8.525 1.38e-15 ***
## black.friday  2.713e+02  3.215e+01   8.438 2.47e-15 ***
## offline_media 5.609e+00  5.098e-01  11.003  < 2e-16 ***
## trend         8.142e+01  6.384e+00  12.753  < 2e-16 ***
## month_Dec     1.573e+03  2.083e+03   0.755    0.451    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5510 on 254 degrees of freedom
## Multiple R-squared:  0.8633, Adjusted R-squared:  0.8601 
## F-statistic: 267.3 on 6 and 254 DF,  p-value: < 2.2e-16

Thanks to the new variables, this model has a better R squared (~86%) than the previous one. The impact of these variables can be seen clearly using the linea::decomp_chart() function.

model %>%
  decomp_chart()

To simplify this visualization, it is worth using categories, as demonstrated previously.

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media','Base','seasonality'),
  calc = c('min','none','min','none','none','none')
)

model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

model %>% decomp_chart()


Testing

While the model is improving thanks to the seasonal variables introduced, selecting which variable to add next can be tricky and tedious. The linea::what_next() function helps speed up this process.

df = model %>% what_next()

df %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

As shown above, the linea::what_next() function generates a data.frame where each row represents a variable in our data, and the impact it would have on our model in terms of:

  • adjusted R squared
  • coefficient
  • t statistic
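Conceptually, this kind of search can be approximated by refitting the model with each remaining column added in turn and recording the fit statistics. A simplified base-R sketch of the idea (not linea's actual implementation):

```r
# for each candidate column, refit the model with it added and
# record adjusted R squared, coefficient, and t statistic
test_next_vars <- function(df, dv, ivs, candidates) {
  base_f <- paste(sprintf('`%s`', dv), '~',
                  paste(sprintf('`%s`', ivs), collapse = ' + '))
  do.call(rbind, lapply(candidates, function(v) {
    fit   <- lm(as.formula(paste0(base_f, ' + `', v, '`')), data = df)
    s     <- summary(fit)
    coefs <- s$coefficients
    data.frame(variable    = v,
               adj_R2      = s$adj.r.squared,
               coefficient = coefs[nrow(coefs), 'Estimate'],   # last term added
               t_stat      = coefs[nrow(coefs), 't value'])
  }))
}
```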

We can now quickly see which variables are more likely to benefit the model.

ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51')

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','seasonality','media','Base','seasonality','covid','seasonality'),
  calc = c('min','none','min','none','none','none','none','none')
)

model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14353.5  -2856.5   -891.7   2910.0  20611.0 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    4.765e+04  6.658e+02  71.572  < 2e-16 ***
## christmas      3.112e+02  2.947e+01  10.560  < 2e-16 ***
## covid          1.910e+02  1.575e+01  12.132  < 2e-16 ***
## black.friday   2.483e+02  2.777e+01   8.940  < 2e-16 ***
## offline_media  4.756e+00  4.441e-01  10.709  < 2e-16 ***
## trend          8.429e+01  5.472e+00  15.404  < 2e-16 ***
## month_Dec      2.243e+03  1.782e+03   1.259    0.209    
## year_2021     -1.210e+04  1.599e+03  -7.567 7.19e-13 ***
## week_51       -1.625e+04  2.612e+03  -6.219 2.07e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4707 on 252 degrees of freedom
## Multiple R-squared:  0.901,  Adjusted R-squared:  0.8979 
## F-statistic: 286.7 on 8 and 252 DF,  p-value: < 2.2e-16
model %>% decomp_chart()

GTrends

The model is getting better and better, with an adjusted R squared almost reaching 90%. This doesn't mean it can't be improved further! Google Trends can be a very useful source of data, as Google search volumes are often correlated with events and can be used as a proxy for a missing variable. The function linea::gt_f() returns the original data.frame with the added Google Trends variable.

data = data %>%
  gt_f(kw = 'ramadan',append = TRUE) %>%
  gt_f(kw = 'trump',append = TRUE) %>%
  gt_f(kw = 'prime day',append = TRUE) %>%
  gt_f(kw = 'amazon workers',append = TRUE)

data %>%
  datatable(options = list(scrollX = TRUE),rownames = NULL)
plot_ly(data) %>%
  add_lines(y = ~ gtrends_ramadan,
           x = ~ date,
           name = 'gtrends_ramadan',
           color = color_palette()[1]) %>%
  add_lines(y = ~ gtrends_trump,
           x = ~ date,
           name = 'gtrends_trump',
           color = color_palette()[2]) %>%
  add_lines(y = ~ `gtrends_prime day`,
           x = ~ date,
           name = 'gtrends_prime day',
           color = color_palette()[3]) %>%
  layout(yaxis = list(title = 'value'),
         title = 'Google Trend Variables',         
         plot_bgcolor  = "rgba(0, 0, 0, 0)",
         paper_bgcolor = "rgba(0, 0, 0, 0)")

Now that these variables are part of our data, we can use the linea::what_next() function to see if they can be added to the model.

df = model %>% what_next(data = data)

df %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

As shown in the table above, the new variable, gtrends_prime day, seems like a sensible addition to the model.

ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51','gtrends_prime day')

model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

summary(model)
## 
## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14685.4  -2728.5   -665.1   2782.3  14956.6 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.731e+04  6.102e+02  77.528  < 2e-16 ***
## christmas          3.171e+02  2.694e+01  11.771  < 2e-16 ***
## covid              1.930e+02  1.439e+01  13.409  < 2e-16 ***
## black.friday       2.537e+02  2.539e+01   9.994  < 2e-16 ***
## offline_media      4.688e+00  4.059e-01  11.549  < 2e-16 ***
## trend              8.201e+01  5.010e+00  16.368  < 2e-16 ***
## month_Dec          2.501e+03  1.628e+03   1.536    0.126    
## year_2021         -1.149e+04  1.463e+03  -7.854 1.18e-13 ***
## week_51           -1.651e+04  2.387e+03  -6.917 3.82e-11 ***
## gtrends_prime day  1.760e+02  2.468e+01   7.131 1.06e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4301 on 251 degrees of freedom
## Multiple R-squared:  0.9177, Adjusted R-squared:  0.9147 
## F-statistic: 310.9 on 9 and 251 DF,  p-value: < 2.2e-16

Using the variable decomposition we can see the new variable is nicely fitting that July peak.

model %>% decomp_chart(variable_decomp = TRUE)

The model now has an R squared greater than 90% and can be presented in a more polished way using categories and other charting functions.

ivs = c('christmas','covid','black.friday','offline_media','trend','month_Dec','year_2021','week_51','gtrends_prime day')

categories = data.frame(
  variable = ivs, # variables from the model
  category = c('seasonality','covid','retail events','media','Base','seasonality','covid','seasonality','retail events'),
  calc = c('min','none','min','none','none','none','none','none','none')
)

model = run_model(data = data,
                  categories = categories,
                  dv = dv,
                  ivs = ivs,
                  id_var = 'date')

model %>% 
  decomp_chart()
model %>% fit_chart()

Pools

Another set of features relates to panel data and pooled models. linea's pooling functionality divides the dependent variable by the mean of each group (pool, panel, region, etc.). When the coefficients are then multiplied by that same mean, we get a scaled coefficient for each group.
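The normalisation can be sketched in a few lines of dplyr (an illustration of the idea, not linea's code; the toy data and names are made up): divide the dependent variable by its group mean before fitting, then multiply the fitted coefficient back by each group's mean to get group-level coefficients.

```r
library(dplyr)

# toy panel: two countries on very different scales
df <- data.frame(
  country = rep(c('UK', 'India'), each = 4),
  y       = c(100, 120, 110, 130, 10, 12, 11, 13),
  x       = c(  1,   2, 1.5, 2.5,  1,  2, 1.5, 2.5)
)

# normalise the dependent variable by the mean of each pool
df <- df %>%
  group_by(country) %>%
  mutate(pool_mean = mean(y), y_norm = y / pool_mean) %>%
  ungroup()

# one model across all pools, on the normalised scale
fit <- lm(y_norm ~ x, data = df)

# re-scale the common coefficient back to each pool's units
scaled_coefs <- df %>%
  distinct(country, pool_mean) %>%
  mutate(x_coef = coef(fit)[['x']] * pool_mean)
```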

Let's start by looking at some pooled data. As we can see, the data below, again generated through Google Trends, has a non-numeric variable: country.

data_path = 'https://raw.githubusercontent.com/paladinic/data/main/pooled%20data.csv'

data = read_xcsv(file = data_path)

data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

To run a pooled model we must pass a pool_var, a character string with the name of the pool variable (i.e. country), to linea::run_model(). To apply the normalisation, the normalise_by_pool parameter must be set to TRUE.

dv = 'amazon'
ivs = c('christmas','rakhi','diwali')
id_var = 'Week'

pool_var = 'country'

model = run_model(data = data,
                  dv = dv,
                  ivs = ivs,
                  id_var = id_var,
                  pool_var = pool_var,
                  normalise_by_pool = TRUE)

model %>% 
  decomp_chart()

In the decomposition above, the model's decomposition is aggregated across pools, while still using the re-scaled coefficients. Visualization functions such as linea::decomp_chart() allow you to filter the visualization by pool, as shown below.

model %>% 
  decomp_chart(pool = 'UK')
model %>% 
  decomp_chart(pool = 'India') 

Next Steps

  1. The Getting Started page is a good place to start learning how to build basic linear models with linea.

  2. The Advanced Features page shows how to implement the features of linea that allow users to capture non-linear relationships.