An Interface for Linear Regression

LINEA is an open-source R library aimed at simplifying and accelerating the development of linear models to understand the relationship between two or more variables.

Linear models are commonly used in a variety of contexts including natural and social sciences, and various business applications (e.g. marketing, finance).

This page covers the basics of setting up the linea library to analyse a time series. We’ll cover installation, importing data, running models, and generating insights.


To use this library an understanding of the following is assumed:


The library can be installed from CRAN using install.packages('linea') or from GitHub using devtools::install_github('paladinic/linea'). Once installed, you can check the installed version.
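One possible way to run that check is sketched below. It uses only base R's requireNamespace() and utils::packageVersion(); the helper name installed_version is hypothetical and not part of linea.

```r
# Hypothetical helper (not part of linea): return the installed version of a
# package as a package_version object, or NULL if the package is missing.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    utils::packageVersion(pkg)
  } else {
    NULL
  }
}

# Prints the version if linea is installed, NULL otherwise
installed_version('linea')
```

Returning NULL rather than raising an error makes the check safe to run before installation.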

## [1] '0.1.1'

Quick Start

The linea library works well with pipes. Used with dplyr and plotly, it can perform data analysis and visualization with elegant code. Let’s build a quick model to illustrate what linea can do.

Import Data

We start by importing linea, some other useful libraries, and some data.

# libraries
library(linea) # modelling
library(tidyverse) # data manipulation
library(plotly) # visualization
library(DT) # visualization

# fictitious ecommerce data
data_path = ''

# importing flat file
data = read_xcsv(file = data_path)

# adding seasonality and Google trends variables
data = data %>%
  get_seasonality(date_col_name = 'date', date_type = 'weekly starting') %>%
  gt_f(kw = 'prime day', append = TRUE)

# visualize data
data %>%
  datatable(rownames = NULL,
            options = list(scrollX = TRUE))

Run Models

Now let’s build a model to understand what drives changes in the ecommerce variable. We can start by selecting a few initial independent variables (i.e. christmas, black.friday, trend, gtrends_prime day).

model = run_model(data = data,
                  dv = 'ecommerce',
                  ivs = c('christmas','black.friday','trend','gtrends_prime day'),
                  id_var = 'date')

## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## Residuals:
##    Min     1Q Median     3Q    Max 
## -20604  -4502   -405   2982  54637 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       43679.24     948.66  46.043  < 2e-16 ***
## christmas           300.86      26.38  11.405  < 2e-16 ***
## black.friday        320.44      39.03   8.209 1.10e-14 ***
## trend               129.16       6.11  21.139  < 2e-16 ***
## gtrends_prime day   182.86      42.42   4.311 2.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 7417 on 256 degrees of freedom
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.7465 
## F-statistic: 192.4 on 4 and 256 DF,  p-value: < 2.2e-16
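The Call line in the summary above shows the fit going through base R's lm(). As a rough, self-contained sketch of that underlying step (not linea's internals), the snippet below fits the same kind of model on made-up data. The toy data frame, its column names, and the true coefficients are assumptions for illustration only and are unrelated to the ecommerce dataset.

```r
# Toy weekly data, invented purely for illustration
set.seed(123)
n <- 104
toy <- data.frame(
  trend = 1:n,                               # linear time trend
  christmas = as.numeric((1:n) %% 52 == 0)   # dummy flag once per 52 weeks
)
toy$sales <- 1000 + 3 * toy$trend + 200 * toy$christmas + rnorm(n, sd = 25)

# Ordinary least squares fit: dependent variable ~ independent variables
fit <- lm(sales ~ trend + christmas, data = toy)

# Same style of summary as the output shown above
summary(fit)

# Adjusted R-squared, the headline fit statistic used on this page
adj_r2 <- summary(fit)$adj.r.squared
```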

Our next steps can be guided by functions like what_next(), which will test all other variables in our data. From the output below, it seems like the variables covid and offline_media would improve the model most.

model %>%
  what_next()
## # A tibble: 81 × 5
##    variable      adj_R2 t_stat       coef adj_R2_diff
##    <chr>          <dbl>  <dbl>      <dbl>       <dbl>
##  1 offline_media  0.837  12.0        6.44     0.121  
##  2 covid          0.815   9.78     191.       0.0917 
##  3 year_2020      0.814   9.69   12076.       0.0904 
##  4 year_2019      0.781  -6.40   -7115.       0.0458 
##  5 christmas_eve  0.777  -6.05 -170926.       0.0414 
##  6 week_48        0.771   5.30   21444.       0.0325 
##  7 christmas_day  0.768  -5.02 -137025.       0.0293 
##  8 week_52        0.766  -4.69  -21223.       0.0258 
##  9 promo          0.758   3.66       5.59     0.0157 
## 10 year_2017      0.754   2.93    3683.       0.00974
## # … with 71 more rows
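The idea behind this kind of table can be sketched in base R: refit the model once per candidate variable and record how the fit changes. This is a minimal illustration on made-up data, not linea's implementation. It assumes adj_R2_diff is the relative change in adjusted R-squared, which is consistent with the figures above (0.837 / 0.7465 - 1 ≈ 0.121).

```r
# Toy data, invented for illustration: y depends on x1 and x2 but not x3
set.seed(1)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + 3 * toy$x2 + rnorm(n)

base_ivs <- 'x1'                                        # variables already in the model
candidates <- setdiff(names(toy), c('y', base_ivs))     # variables still to test

# Baseline fit and its adjusted R-squared
base_fit <- lm(reformulate(base_ivs, response = 'y'), data = toy)
base_adj_r2 <- summary(base_fit)$adj.r.squared

# Refit with each candidate added, one at a time
rows <- lapply(candidates, function(v) {
  s <- summary(lm(reformulate(c(base_ivs, v), response = 'y'), data = toy))
  data.frame(
    variable = v,
    adj_R2 = s$adj.r.squared,
    t_stat = s$coefficients[v, 't value'],
    coef = s$coefficients[v, 'Estimate'],
    adj_R2_diff = s$adj.r.squared / base_adj_r2 - 1     # assumed: relative change
  )
})

# Rank candidates by the adjusted R-squared they would give
ranked <- do.call(rbind, rows)
ranked <- ranked[order(-ranked$adj_R2), ]
ranked
```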

Adding these variables to the model brings the adjusted R-squared to ~88%.

model = run_model(data = data,
                  dv = 'ecommerce',
                  ivs = c('christmas','black.friday','trend','gtrends_prime day','covid','offline_media'),
                  id_var = 'date')

## Call:
## lm(formula = formula, data = trans_data[, c(dv, ivs_t, fixed_ivs_t)])
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21541.6  -2909.5   -718.2   2661.9  16287.3 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.781e+04  7.247e+02  65.977  < 2e-16 ***
## christmas         2.812e+02  1.849e+01  15.208  < 2e-16 ***
## black.friday      2.668e+02  2.770e+01   9.629  < 2e-16 ***
## trend             7.930e+01  5.959e+00  13.309  < 2e-16 ***
## gtrends_prime day 1.840e+02  2.940e+01   6.257 1.66e-09 ***
## covid             1.522e+02  1.621e+01   9.392  < 2e-16 ***
## offline_media     5.507e+00  4.752e-01  11.588  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 5135 on 254 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.8785 
## F-statistic: 314.2 on 6 and 254 DF,  p-value: < 2.2e-16

Generate Insights

Now that we have a decent model we can start extracting insights from it. We can start by looking at the contribution of each independent variable over time.

model %>%
  decomp_chart()
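In a linear model, the contribution of a variable at a given time is its coefficient multiplied by its value at that time, and the contributions in each period sum to the fitted value. The base R sketch below illustrates that arithmetic on made-up data. The column names and numbers are assumptions, and this is not how linea computes its decomposition internally.

```r
# Toy weekly data, invented for illustration
set.seed(42)
n <- 52
toy <- data.frame(
  media = runif(n, 0, 100),
  promo = rbinom(n, 1, 0.2)
)
toy$sales <- 500 + 5 * toy$media + 80 * toy$promo + rnorm(n, sd = 20)

fit <- lm(sales ~ media + promo, data = toy)

# Design matrix: one column per coefficient, including the intercept
X <- model.matrix(fit)

# Multiply each column by its coefficient to get per-period contributions
contrib <- sweep(X, 2, coef(fit), `*`)

# Each row of contributions adds up to that period's fitted value
rows_match_fitted <- isTRUE(all.equal(unname(rowSums(contrib)),
                                      unname(fitted(fit))))

head(contrib)
```

Plotting the columns of contrib as stacked series over time gives the usual decomposition view.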

We can also visualize the relationships between our independent and dependent variables using response curves. From this we can see that, for example, when offline_media is 10, ecommerce increases by ~55. To capture non-linear relationships (i.e. response curves that aren’t straight lines) see the Advanced Features page.

model %>%
  response_curves(x_min = 0)
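The ~55 figure quoted above follows directly from the fitted coefficient. For a purely linear term the response is the coefficient times the input, as the short sketch below shows using the offline_media estimate from the model summary (5.507). The input values are arbitrary examples.

```r
# Coefficient for offline_media, copied from the model summary above
coef_offline_media <- 5.507

# Example input levels (arbitrary, for illustration)
offline_media <- c(0, 10, 20)

# Linear response: each extra unit of offline_media adds the same amount
response <- coef_offline_media * offline_media
response
```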

Next Steps

  1. The Getting Started page is a good place to start learning how to build linear models with linea.

  2. The Advanced Features page shows how to implement the features of linea that allow users to capture non-linear relationships.

  3. The Additional Features page illustrates all other functions of the library.

