[["preface.html", "Machine Learning for Factor Investing Preface What this book is not about The targeted audience How this book is structured Companion website Why R? Coding instructions Acknowledgments Future developments", " Machine Learning for Factor Investing Guillaume Coqueret and Tony Guida 2021-01-08 Preface This book is intended to cover some advanced modelling techniques applied to equity investment strategies that are built on firm characteristics. The content is threefold. First, we try to simply explain the ideas behind most mainstream machine learning algorithms that are used in equity asset allocation. Second, we mention a wide range of academic references for the readers who wish to push a little further. Finally, we provide hands-on R code samples that show how to apply the concepts and tools on a realistic dataset which we share to encourage reproducibility. What this book is not about This book deals with machine learning (ML) tools and their applications in factor investing. Factor investing is a subfield of a large discipline that encompasses asset allocation, quantitative trading and wealth management. Its premise is that differences in the returns of firms can be explained by the characteristics of these firms. Thus, it departs from traditional analyses which rely on price and volume data only, like classical portfolio theory à la Markowitz (1952), or high frequency trading. For a general and broad treatment of Machine Learning in Finance, we refer to Matthew F. Dixon, Halperin, and Bilokon (2020). The topics we discuss are related to other themes that will not be covered in the monograph. These themes include: Applications of ML in other financial fields, such as fraud detection or credit scoring. We refer to Ngai et al. (2011) and Baesens, Van Vlasselaer, and Verbeke (2015) for general purpose fraud detection, to Bhattacharyya et al. (2011) for a focus on credit cards and to Ravisankar et al. (2011) and Abbasi et al. (2012) for studies on fraudulent financial reporting. On the topic of credit scoring, Wang et al. (2011) and Brown and Mues (2012) provide overviews of methods and some empirical results. Also, we do not cover ML algorithms for data sampled at higher (daily or intraday) frequencies (microstructure models, limit order book). The chapter from Kearns and Nevmyvaka (2013) and the recent paper by Sirignano and Cont (2019) are good introductions on this topic. Use cases of alternative datasets that show how to leverage textual data from social media, satellite imagery, or credit card logs to predict sales, earning reports, and, ultimately, future returns. The literature on this topic is still emerging (see, e.g., Blank, Davis, and Greene (2019), Jha (2019) and Ke, Kelly, and Xiu (2019)) but will likely blossom in the near future. Technical details of machine learning tools. While we do provide some insights on specificities of some approaches (those we believe are important), the purpose of the book is not to serve as reference manual on statistical learning. We refer to Hastie, Tibshirani, and Friedman (2009), Cornuejols, Miclet, and Barra (2018) (written in French), James et al. (2013) (coded in R!) and Mohri, Rostamizadeh, and Talwalkar (2018) for a general treatment on the subject.1 Moreover, Du and Swamy (2013) and Goodfellow et al. (2016) are solid monographs on neural networks particularly and Sutton and Barto (2018) provide a self-contained and comprehensive tour in reinforcement learning. 
Finally, the book does not cover methods of natural language processing (NLP) that can be used to evaluate sentiment which can in turn be translated into investment decisions. This topic has nonetheless been trending lately and we refer to Loughran and McDonald (2016), Cong, Liang, and Zhang (2019a), Cong, Liang, and Zhang (2019b) and Gentzkow, Kelly, and Taddy (2019) for recent advances on the matter. The targeted audience Who should read this book? This book is intended for two types of audiences. First, postgraduate students who wish to pursue their studies in quantitative finance with a view towards investment and asset management. The second target groups are professionals from the money management industry who either seek to pivot towards allocation methods that are based on machine learning or are simply interested in these new tools and want to upgrade their set of competences. To a lesser extent, the book can serve scholars or researchers who need a manual with a broad spectrum of references both on recent asset pricing issues and on machine learning algorithms applied to money management. While the book covers mostly common methods, it also shows how to implement more exotic models, like causal graphs (Chapter 14), Bayesian additive trees (Chapter 9), and hybrid autoencoders (Chapter 7). The book assumes basic knowledge in algebra (matrix manipulation), analysis (function differentiation, gradients), optimization (first and second order conditions, dual forms), and statistics (distributions, moments, tests, simple estimation method like maximum likelihood). A minimal financial culture is also required: simple notions like stocks, accounting quantities (e.g., book value) will not be defined in this book. Lastly, all examples and illustrations are coded in R. A minimal culture of the language is sufficient to understand the code snippets which rely heavily on the most common functions of the tidyverse (Wickham et al. (2019), www.tidyverse.org), and piping (Bache and Wickham (2014), Mailund (2019)). How this book is structured The book is divided into four parts. Part I gathers preparatory material and starts with notations and data presentation (Chapter 1), followed by introductory remarks (Chapter 2). Chapter 3 outlines the economic foundations (theoretical and empirical) of factor investing and briefly sums up the dedicated recent literature. Chapter 4 deals with data preparation. It rapidly recalls the basic tips and warns about some major issues. Part II of the book is dedicated to predictive algorithms in supervised learning. Those are the most common tools that are used to forecast financial quantities (returns, volatilities, Sharpe ratios, etc.). They range from penalized regressions (Chapter 5), to tree methods (Chapter 6), encompassing neural networks (Chapter 7), support vector machines (Chapter 8) and Bayesian approaches (Chapter 9). The next portion of the book bridges the gap between these tools and their applications in finance. Chapter 10 details how to assess and improve the ML engines defined beforehand. Chapter 11 explains how models can be combined and often why that may not be a good idea. Finally, one of the most important chapters (Chapter 12) reviews the critical steps of portfolio backtesting and mentions the frequent mistakes that are often encountered at this stage. The end of the book covers a range of advanced topics connected to machine learning more specifically. The first one is interpretability. 
ML models are often considered to be black boxes and this raises trust issues: how and why should one trust ML-based predictions? Chapter 13 is intended to present methods that help understand what is happening under the hood. Chapter 14 is focused on causality, which is both a much more powerful concept than correlation and also at the heart of many recent discussions in Artificial Intelligence (AI). Most ML tools rely on correlation-like patterns and it is important to underline the benefits of techniques related to causality. Finally, Chapters 15 and 16 are dedicated to non-supervised methods. The latter can be useful, but their financial applications should be wisely and cautiously motivated. Companion website This book is entirely available at http://www.mlfactor.com. It is important that not only the content of the book be accessible, but also the data and code that are used throughout the chapters. They can be found at https://github.com/shokru/mlfactor.github.io/tree/master/material. The online version of the book will be updated beyond the publication of the printed version. Why R? The supremacy of Python as the dominant ML programming language is a widespread belief. This is because almost all applications of deep learning (which, as of 2020, is one of the most fashionable branches of ML) are coded in Python via Tensorflow or Pytorch. The fact is that R has a lot to offer as well. First of all, let us not forget that one of the most influential textbooks in ML (Hastie, Tibshirani, and Friedman (2009)) is written by statisticians who code in R. Moreover, many statistics-orientated algorithms (e.g., BARTs in Section 9.5) are primarily coded in R and not always in Python. The R offering in Bayesian packages in general (https://cran.r-project.org/web/views/Bayesian.html) and in Bayesian learning in particular is probably unmatched. There are currently several ML frameworks available in R. caret: https://topepo.github.io/caret/index.html, a compilation of more than 200 ML models; tidymodels: https://github.com/tidymodels, a recent collection of packages for ML workflows (developed by Max Kuhn at RStudio, which is a token of high quality material!); rtemis: https://rtemis.netlify.com, a general purpose package for ML and visualization; mlr3: https://mlr3.mlr-org.com/index.html, also a simple framework for ML models; h2o: https://github.com/h2oai/h2o-3/tree/master/h2o-r, a large set of tools provided by h2o (coded in Java); OpenML: https://github.com/openml/openml-r, the R version of the OpenML (www.openml.org) community. Moreover, via the reticulate package, it is possible (but not always easy) to benefit from Python tools as well. The most prominent example is the adaptation of the tensorflow and keras libraries to R. Thus, some very advanced Python material is readily available to R users. This is also true for other resources, like Stanford’s CoreNLP library (in Java) which was adapted to R in the package coreNLP (which we will not use in this book). Coding instructions One of the purposes of the book is to propose a large-scale tutorial of ML applications in financial predictions and portfolio selection. Thus, one keyword is REPRODUCIBILITY! In order to duplicate our results (up to possible randomness in some learning algorithms), you will need running versions of R and RStudio on your computer. The best books to learn R are also often freely available online. A short list can be found here: https://rstudio.com/resources/books/. 
The monograph R for Data Science is probably the most crucial. In terms of coding requirements, we rely heavily on the tidyverse, which is a collection of packages (or libraries). The three packages we use most are dplyr, which implements simple data manipulations (filter, select, arrange), tidyr, which formats data in a tidy fashion, and ggplot2, for graphical outputs. A list of the packages we use can be found in Table 0.1 below. Packages with a star \\(*\\) need to be installed via bioconductor.2 Packages with a plus \\(^+\\) need to be installed manually.3
TABLE 0.1: List of all packages used in the book.
Package | Purpose | Chapter(s)
BART | Bayesian additive trees | 10
broom | Tidy regression output | 5
CAM\\(^+\\) | Causal Additive Models | 15
caTools | AUC curves | 11
CausalImpact | Causal inference with structural time series | 15
cowplot | Stacking plots | 4 & 13
breakDown | Breakdown interpretability | 14
dummies | One-hot encoding | 8
e1071 | Support Vector Machines | 9
factoextra | PCA visualization | 16
fastAdaboost | Boosted trees | 7
forecast | Autocorrelation function | 4
FNN | Nearest Neighbors detection | 16
ggpubr | Combining plots | 11
glmnet | Penalized regressions | 6
iml | Interpretability tools | 14
keras | Neural networks | 8
lime | Interpretability | 14
lmtest | Granger causality | 15
lubridate | Handling dates | All (or many)
naivebayes | Naive Bayes classifier | 10
pcalg | Causal graphs | 15
quadprog | Quadratic programming | 12
quantmod | Data extraction | 4, 12
randomForest | Random forests | 7
rBayesianOptimization | Bayesian hyperparameter tuning | 11
ReinforcementLearning | Reinforcement Learning | 17
Rgraphviz\\(^*\\) | Causal graphs | 15
rpart and rpart.plot | Simple decision trees | 7
spBayes | Bayesian linear regression | 10
tidyverse | Environment for data science, data wrangling | All
xgboost | Boosted trees | 7
xtable | Table formatting | 4
Of all of these packages (or collections thereof), the tidyverse and lubridate are compulsory in almost all sections of the book. To install a new package in R, just type install.packages("name_of_the_package") in the console. Sometimes, because of function name conflicts (especially with the select() function), we use the syntax package::function() to make sure the function call is from the right source. The exact versions of the packages used to compile the book are listed in the “renv.lock” file available on the book’s GitHub web page https://github.com/shokru/mlfactor.github.io. One minor comment is the following: while the functions gather() and spread() from the tidyr package have been superseded by pivot_longer() and pivot_wider(), we still use them because of their much more compact syntax (a short illustration of the equivalence closes this subsection). As much as we could, we created short code chunks and commented each line whenever we felt it was useful. Comments are displayed at the end of a row and preceded with a single hashtag #. The book is constructed as a very big notebook, thus results are often presented below code chunks. They can be graphs or tables. Sometimes, they are simple numbers and are preceded with two hashtags ##. The example below illustrates this formatting.
1+2 # Example
## [1] 3
The book can be viewed as a very big tutorial. Therefore, most of the chunks depend on previously defined variables. When replicating parts of the code (via online code), please make sure that the environment includes all relevant variables. One best practice is to always start by running all code chunks from Chapter 1. For the exercises, we often resort to variables created in the corresponding chapters. 
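As an aside, the snippet below illustrates the equivalence between the superseded and the newer reshaping verbs on a small toy tibble (the tibble and its column names are purely illustrative and are not part of the book's dataset); up to row ordering, both calls return the same tidy table.
library(tidyverse)                                      # Loads tidyr, dplyr, tibble and the pipe
toy <- tibble(stock_id = c(1, 2),                       # A toy tibble, unrelated to data_ml
              R1M_Usd  = c(0.02, -0.01),
              R3M_Usd  = c(0.05,  0.03))
toy %>% gather(key = label, value = value, -stock_id)   # Compact (superseded) syntax
toy %>% pivot_longer(-stock_id,                         # Equivalent modern syntax
                     names_to = "label",
                     values_to = "value")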
Acknowledgments The core of the book was prepared for a series of lectures given by one of the authors to students of master’s degrees in finance at EMLYON Business School and at the Imperial College Business School in the Spring of 2019. We are grateful to those students who asked fruitful questions and thereby contributed to improve the content of the book. We are grateful to Bertrand Tavin and Gautier Marti for their thorough screening of the book. We also thank Eric André, Aurélie Brossard, Alban Cousin, Frédérique Girod, Philippe Huber, Jean-Michel Maeso, Javier Nogales and for friendly reviews; Christophe Dervieux for his help with bookdown; Mislav Sagovac and Vu Tran for their early feedback; John Kimmel for making this happen and Jonathan Regenstein for his availability, no matter the topic. Lastly, we are grateful for the anonymous reviews collected by John. Future developments Machine learning and factor investing are two immense research domains and the overlap between the two is also quite substantial and developing at a fast pace. The content of this book will always constitute a solid background, but it is naturally destined to obsolescence. Moreover, by construction, some subtopics and many references will have escaped our scrutiny. Our intent is to progressively improve the content of the book and update it with the latest ongoing research. We will be grateful to any comment that helps correct or update the monograph. Thank you for sending your feedback directly (via pull requests) on the book’s website which is hosted at https://github.com/shokru/mlfactor.github.io. References "],["notdata.html", "Chapter 1 Notations and data 1.1 Notations 1.2 Dataset", " Chapter 1 Notations and data 1.1 Notations This section aims at providing the formal mathematical conventions that will be used throughout the book. Bold notations indicate vectors and matrices. We use capital letters for matrices and lower case letters for vectors. \\(\\mathbf{v}'\\) and \\(\\mathbf{M}'\\) denote the transposes of \\(\\mathbf{v}\\) and \\(\\mathbf{M}\\). \\(\\mathbf{M}=[m]_{i,j}\\), where \\(i\\) is the row index and \\(j\\) the column index. We will work with two notations in parallel. The first one is the pure machine learning notation in which the labels (also called output, dependent variables or predicted variables) \\(\\mathbf{y}=y_i\\) are approximated by functions of features \\(\\mathbf{X}_i=(x_{i,1},\\dots,x_{i,K})\\). The dimension of the feature matrix \\(\\mathbf{X}\\) is \\(I\\times K\\): there are \\(I\\) instances, records, or observations and each one of them has \\(K\\) attributes, features, inputs, or predictors which will serve as independent and explanatory variables (all these terms will be used interchangeably). Sometimes, to ease notations, we will write \\(\\textbf{x}_i\\) for one instance (one row) of \\(\\textbf{X}\\) or \\(\\textbf{x}_k\\) for one (feature) column vector of \\(\\textbf{X}\\). The second notation type pertains to finance and will directly relate to the first. We will often work with discrete returns \\(r_{t,n}=p_{t,n}/p_{t-1,n}-1\\) computed from price data. Here \\(t\\) is the time index and \\(n\\) the asset index. Unless specified otherwise, the return is always computed over one period, though this period can sometimes be one month or one year. Whenever confusion might occur, we will specify other notations for returns. In line with our previous conventions, the number of return dates will be \\(T\\) and the number of assets, \\(N\\). 
The features or characteristics of assets will be denoted with \\(x_{t,n}^{(k)}\\): it is the time-\\(t\\) value of the \\(k^{th}\\) attribute of firm or asset \\(n\\). In stacked notation, \\(\\mathbf{x}_{t,n}\\) will stand for the vector of characteristics of asset \\(n\\) at time \\(t\\). Moreover, \\(\\mathbf{r}_t\\) stands for all returns at time \\(t\\) while \\(\\mathbf{r}_n\\) stands for all returns of asset \\(n\\). Often, returns will play the role of the dependent variable, or label (in ML terms). For the riskless asset, we will use the notation \\(r_{t,f}\\). The link between the two notations will most of the time be the following. One instance (or observation) \\(i\\) will consist of one couple (\\(t,n\\)) of one particular date and one particular firm (if the data is perfectly rectangular with no missing field, \\(I=T\\times N\\)). The label will usually be some performance measure of the firm computed over some future period, while the features will consist of the firm attributes at time \\(t\\). Hence, the purpose of the machine learning engine in factor investing will be to determine the model that maps the time-\\(t\\) characteristics of firms to their future performance. In terms of canonical matrices: \\(\\mathbf{I}_N\\) will denote the \\((N\\times N)\\) identity matrix. From the probabilistic literature, we employ the expectation operator \\(\\mathbb{E}[\\cdot]\\) and the conditional expectation \\(\\mathbb{E}_t[\\cdot]\\), where the corresponding filtration \\(\\mathcal{F}_t\\) corresponds to all information available at time \\(t\\). More precisely, \\(\\mathbb{E}_t[\\cdot]=\\mathbb{E}[\\cdot | \\mathcal{F}_t]\\). \\(\\mathbb{V}[\\cdot]\\) will denote the variance operator. Depending on the context, probabilities will be written simply \\(P\\), but sometimes we will use the heavier notation \\(\\mathbb{P}\\). Probability density functions (pdfs) will be denoted with lowercase letters (\\(f\\)) and cumulative distribution functions (cdfs) with uppercase letters (\\(F\\)). We will write equality in distribution as \\(X \\overset{d}{=}Y\\), which is equivalent to \\(F_X(z)=F_Y(z)\\) for all \\(z\\) on the support of the variables. For a random process \\(X_t\\), we say that it is stationary if the law of \\(X_t\\) is constant through time, i.e., \\(X_t\\overset{d}{=}X_s\\), where \\(\\overset{d}{=}\\) means equality in distribution. Sometimes, asymptotic behaviors will be characterized with the usual Landau notation \\(o(\\cdot)\\) and \\(O(\\cdot)\\). The symbol \\(\\propto\\) refers to proportionality: \\(x\\propto y\\) means that \\(x\\) is proportional to \\(y\\). With respect to derivatives, we use the standard notation \\(\\frac{\\partial}{\\partial x}\\) when differentiating with respect to \\(x\\). We resort to the compact symbol \\(\\nabla\\) when all derivatives are computed (gradient vector). In equations, the left-hand side and right-hand side can be written more compactly: l.h.s. and r.h.s., respectively. Finally, we turn to functions. We list a few below: - \\(1_{\\{x \\}}\\): the indicator function of the condition \\(x\\), which is equal to one if \\(x\\) is true and to zero otherwise. - \\(\\phi(\\cdot)\\) and \\(\\Phi(\\cdot)\\) are the standard Gaussian pdf and cdf. - card\\((\\cdot)=\\#(\\cdot)\\) are two notations for the cardinal function which evaluates the number of elements in a given set (provided as argument of the function). - \\(\\lfloor \\cdot \\rfloor\\) is the integer part function. 
- for a real number \\(x\\), \\([x]^+\\) is the positive part of \\(x\\), that is \\(\\max(0,x)\\). - tanh\\((\\cdot)\\) is the hyperbolic tangent: tanh\\((x)=\\frac{e^x-e^{-x}}{e^x+e^{-x}}\\). - ReLu\\((\\cdot)\\) is the rectified linear unit: ReLu\\((x)=\\max(0,x)\\). - s\\((\\cdot)\\) will be the softmax function: \\(s(\\textbf{x})_i=\\frac{e^{x_i}}{\\sum_{j=1}^Je^{x_j}}\\), where the subscript \\(i\\) refers to the \\(i^{th}\\) element of the vector. 1.2 Dataset Throughout the book, and for the sake of reproducibility, we will illustrate the concepts we present with examples of implementation based on a single financial dataset available at https://github.com/shokru/mlfactor.github.io/tree/master/material. This dataset comprises information on 1,207 stocks listed in the US (possibly originating from Canada or Mexico). The time range starts in November 1998 and ends in March 2019. For each point in time, 93 characteristics describe the firms in the sample. These attributes cover a wide range of topics: valuation (earning yields, accounting ratios); profitability and quality (return on equity); momentum and technical analysis (past returns, relative strength index); risk (volatilities); estimates (earnings-per-share); volume and liquidity (share turnover). The sample is not perfectly rectangular: there are no missing points, but the number of firms and their attributes is not constant through time. This makes the computations in the backtest more tricky, but also more realistic. library(tidyverse) # Activate the data science package library(lubridate) # Activate the date management package load("data_ml.RData") # Load the data data_ml <- data_ml %>% filter(date > "1999-12-31", # Keep the date with sufficient data points date < "2019-01-01") %>% arrange(stock_id, date) # Order the data data_ml[1:6, 1:6] # Sample values ## # A tibble: 6 x 6 ## stock_id date Advt_12M_Usd Advt_3M_Usd Advt_6M_Usd Asset_Turnover ## <int> <date> <dbl> <dbl> <dbl> <dbl> ## 1 1 2000-01-31 0.41 0.39 0.42 0.19 ## 2 1 2000-02-29 0.41 0.39 0.4 0.19 ## 3 1 2000-03-31 0.4 0.37 0.37 0.2 ## 4 1 2000-04-30 0.39 0.36 0.37 0.2 ## 5 1 2000-05-31 0.4 0.42 0.4 0.2 ## 6 1 2000-06-30 0.41 0.47 0.42 0.21 The data has 99 columns and 268336 rows. The first two columns indicate the stock identifier and the date. The next 93 columns are the features (see Table 17.1 in the Appendix for details). The last four columns are the labels. The points are sampled at the monthly frequency. As is always the case in practice, the number of assets changes with time, as is shown in Figure 1.1. data_ml %>% group_by(date) %>% # Group by date summarize(nb_assets = stock_id %>% # Count nb assets as.factor() %>% nlevels()) %>% ggplot(aes(x = date, y = nb_assets)) + geom_col() + # Plot coord_fixed(3) FIGURE 1.1: Number of assets through time. There are four immediate labels in the dataset: R1M_Usd, R3M_Usd, R6M_Usd and R12M_Usd, which correspond to the 1-month, 3-month, 6-month and 12-month future/forward returns of the stocks. The returns are total returns, that is, they incorporate potential dividend payments over the considered periods. This is a better proxy of financial gain compared to price returns only. We refer to the analysis of Hartzmark and Solomon (2019) for a study on the impact of decoupling price returns and dividends. These labels are located in the last 4 columns of the dataset. We provide their descriptive statistics below. 
## # A tibble: 4 x 5 ## Label mean sd min max ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 R12M_Usd 0.137 0.738 -0.991 96.0 ## 2 R1M_Usd 0.0127 0.176 -0.922 30.2 ## 3 R3M_Usd 0.0369 0.328 -0.929 39.4 ## 4 R6M_Usd 0.0723 0.527 -0.98 107. In anticipation for future models, we keep the name of the predictors in memory. In addition, we also keep a much shorter list of predictors. features <- colnames(data_ml[3:95]) # Keep the feature's column names (hard-coded, beware!) features_short <- c("Div_Yld", "Eps", "Mkt_Cap_12M_Usd", "Mom_11M_Usd", "Ocf", "Pb", "Vol1Y_Usd") The predictors have been uniformized, that is, for any given feature and time point, the distribution is uniform. Given 1,207 stocks, the graph below cannot display a perfect rectangle. data_ml %>% filter(date == "2000-02-29") %>% ggplot(aes(x = Div_Yld)) + geom_histogram(bins = 100) + coord_fixed(0.03) FIGURE 1.2: Distribution of the dividend yield feature on date 2000-02-29. The original labels (future returns) are numerical and will be used for regression exercises, that is, when the objective is to predict a scalar real number. Sometimes, the exercises can be different and the purpose may be to forecast categories (also called classes), like “buy”, “hold” or “sell”. In order to be able to perform this type of classification analysis, we create additional labels that are categorical. data_ml <- data_ml %>% group_by(date) %>% # Group by date mutate(R1M_Usd_C = R1M_Usd > median(R1M_Usd), # Create the categorical labels R12M_Usd_C = R12M_Usd > median(R12M_Usd)) %>% ungroup() %>% mutate_if(is.logical, as.factor) The new labels are binary: they are equal to 1 (true) if the original return is above that of the median return over the considered period and to 0 (false) if not. Hence, at each point in time, half of the sample has a label equal to zero and the other half to one: some stocks overperform and others underperform. In machine learning, models are estimated on one portion of data (training set) and then tested on another portion of the data (testing set) to assess their quality. We split our sample accordingly. separation_date <- as.Date("2014-01-15") training_sample <- filter(data_ml, date < separation_date) testing_sample <- filter(data_ml, date >= separation_date) We also keep in memory a few key variables, like the list of asset identifiers and a rectangular version of returns. For simplicity, in the computation of the latter, we shrink the investment universe to keep only the stocks for which we have the maximum number of points. stock_ids <- levels(as.factor(data_ml$stock_id)) # A list of all stock_ids stock_days <- data_ml %>% # Compute the number of data points per stock group_by(stock_id) %>% summarize(nb = n()) stock_ids_short <- stock_ids[which(stock_days$nb == max(stock_days$nb))] # Stocks with full data returns <- data_ml %>% # Compute returns, in matrix format, in 3 steps: filter(stock_id %in% stock_ids_short) %>% # 1. Filtering the data dplyr::select(date, stock_id, R1M_Usd) %>% # 2. Keep returns along with dates & firm names spread(key = stock_id, value = R1M_Usd) # 3. Put in matrix shape References "],["intro.html", "Chapter 2 Introduction 2.1 Context 2.2 Portfolio construction: the workflow 2.3 Machine learning is no magic wand", " Chapter 2 Introduction Conclusions often echo introductions. This chapter was completed at the very end of the writing of the book. It outlines principles and ideas that are probably more relevant than the sum of technical details covered subsequently. 
When stuck with disappointing results, we advise the reader to take a step away from the algorithm and come back to this section to get a broader perspective of some of the issues in predictive modelling. 2.1 Context The blossoming of machine learning in factor investing has its source at the confluence of three favorable developments: data availability, computational capacity, and economic groundings. First, the data. Nowadays, classical providers, such as Bloomberg and Reuters, have seen their playing field invaded by niche players and aggregation platforms.4 In addition, high-frequency data and derivative quotes have become mainstream. Hence, firm-specific attributes are easy and often cheap to compile. This means that the size of \\(\\mathbf{X}\\) in (2.1) is now sufficiently large to be plugged into ML algorithms. The order of magnitude (in 2019) that can be reached is the following: a few hundred monthly observations over several thousand stocks (US listed at least) covering a few hundred attributes. This makes a dataset of dozens of millions of points. While it is a reasonably high figure, we highlight that the chronological depth is probably the weak point and will remain so for decades to come because accounting figures are only released on a quarterly basis. Needless to say, this drawback does not hold for high-frequency strategies. Second, computational power, both through hardware and software. Storage and processing speed are not technical hurdles anymore and models can even be run on the cloud thanks to services hosted by major actors (Amazon, Microsoft, IBM and Google) and by smaller players (Rackspace, Techila). On the software side, open source has become the norm, funded by corporations (TensorFlow & Keras by Google, Pytorch by Facebook, h2o, etc.), universities (Scikit-Learn by INRIA, CoreNLP by Stanford, NLTK by UPenn) and small groups of researchers (caret, xgboost, tidymodels, to list but a few). Consequently, ML is no longer the private turf of a handful of expert computer scientists, but is on the contrary accessible to anyone willing to learn and code. Finally, economic framing. Machine learning applications in finance were initially introduced by computer scientists and information system experts (e.g., Braun and Chandler (1987), White (1988)) and exploited shortly after by academics in financial economics (Bansal and Viswanathan (1993)), and hedge funds (see, e.g., Zuckerman (2019)). Nonlinear relationships then became more mainstream in asset pricing (Freeman and Tse (1992), Bansal, Hsieh, and Viswanathan (1993)). These contributions started to pave the way for the more brute-force approaches that have blossomed since the 2010 decade and which are mentioned throughout the book. In the synthetic proposal of R. Arnott, Harvey, and Markowitz (2019), the first piece of advice is to rely on a model that makes sense economically. We agree with this stance, and the only assumption that we make in this book is that future returns depend on firm characteristics. The relationship between these features and performance is largely unknown and probably time-varying. This is why ML can be useful: to detect some hidden patterns beyond the documented asset pricing anomalies. Moreover, dynamic training allows the models to adapt to changing market conditions. 2.2 Portfolio construction: the workflow Building successful portfolio strategies requires many steps. This book covers many of them but focuses predominantly on the prediction part. 
Indeed, allocating to assets most of the time requires to make bets and thus to presage and foresee which ones will do well and which ones will not. In this book, we mostly resort to supervised learning to forecast returns in the cross-section. The baseline equation in supervised learning, \\[\\begin{equation} \\mathbf{y}=f(\\mathbf{X})+\\mathbf{\\epsilon}, \\tag{2.1} \\end{equation}\\] is translated in financial terms as \\[\\begin{equation} \\mathbf{r}_{t+1,n}=f(\\mathbf{x}_{t,n})+\\mathbf{\\epsilon}_{t+1,n}, \\tag{2.2} \\end{equation}\\] where \\(f(\\mathbf{x}_{t,n})\\) can be viewed as the expected return for time \\(t+1\\) computed at time \\(t\\), that is, \\(\\mathbb{E}_t[r_{t+1,n}]\\). Note that the model is common to all assets (\\(f\\) is not indexed by \\(n\\)), thus it shares similarity with panel approaches. Building accurate predictions requires to pay attention to all terms in the above equation. Chronologically, the first step is to gather data and to process it (see Chapter 4). To the best of our knowledge, the only consensus is that, on the \\(\\textbf{x}\\) side, the features should include classical predictors reported in the literature: market capitalization, accounting ratios, risk measures, momentum proxies (see Chapter 3). For the dependent variable, many researchers and practitioners work with monthly returns, but other maturities may perform better out-of-sample. While it is tempting to believe that the most crucial part is the choice of \\(f\\) (it is the most sophisticated, mathematically), we believe that the choice and engineering of inputs, that is, the variables, are at least as important. The usual modelling families for \\(f\\) are covered in Chapters 5 to 9. Finally, the errors \\(\\mathbf{\\epsilon}_{t+1,n}\\) are often overlooked. People consider that vanilla quadratic programming is the best way to go (the most common for sure!), thus the mainstream objective is to minimize squared errors. In fact, other options may be wiser choices (see for instance Section 7.4.3). Even if the overall process, depicted in Figure 2.1, seems very sequential, it is more judicious to conceive it as integrated. All steps are intertwined and each part should not be dealt with independently from the others.5 The global framing of the problem is essential, from the choice of predictors, to the family of algorithms, not to mention the portfolio weighting schemes (see Chapter 12 for the latter). FIGURE 2.1: Simplified workflow in ML-based portfolio construction. 2.3 Machine learning is no magic wand By definition, the curse of predictions is that they rely on past data to infer patterns about subsequent fluctuations. The more or less explicit hope of any forecaster is that the past will turn out to be a good approximation of the future. Needless to say, this is a pious wish; in general, predictions fare badly. Surprisingly, this does not depend much on the sophistication of the econometric tool. In fact, heuristic guesses are often hard to beat. To illustrate this sad truth, the baseline algorithms that we detail in Chapters 5 to 7 yield at best mediocre results. This is done on purpose. This forces the reader to understand that blindly feeding data and parameters to a coded function will seldom suffice to reach satisfactory out-of-sample accuracy. Below, we sum up some key points that we have learned through our exploratory journey in financial ML. The first point is that causality is key. 
If one is able to identify \\(X \\rightarrow y\\), where \\(y\\) are expected returns, then the problem is solved. Unfortunately, causality is incredibly hard to uncover. Thus, researchers most of the time have to make do with simple correlation patterns, which are far less informative and robust. Relatedly, financial datasets are extremely noisy. It is a daunting task to extract signals out of them. No-arbitrage reasonings imply that if a simple pattern yielded durable profits, it would mechanically and rapidly vanish. The no-free-lunch theorem of Wolpert (1992a) imposes that the analyst formulates views on the model. This is why economic or econometric framing is key. The assumptions and choices that are made regarding both the dependent variables and the explanatory features are decisive. As a corollary, data is key. The inputs given to the models are probably much more important than the choice of the model itself. To maximize out-of-sample efficiency, the right question is probably to paraphrase Jeff Bezos: what’s not going to change? Persistent series are more likely to unveil enduring patterns. Everybody makes mistakes. Errors in loops or variable indexing are part of the journey. What matters is to learn from those lapses. To conclude, we remind the reader of this obvious truth: nothing will ever replace practice. Gathering and cleaning data, coding backtests, tuning ML models, testing weighting schemes, debugging, starting all over again: these are all absolutely indispensable steps and tasks that must be repeated indefinitely. There is no substitute for experience. References "],["factor.html", "Chapter 3 Factor investing and asset pricing anomalies 3.1 Introduction 3.2 Detecting anomalies 3.3 Factors or characteristics? 3.4 Hot topics: momentum, timing and ESG 3.5 The links with machine learning 3.6 Coding exercises", " Chapter 3 Factor investing and asset pricing anomalies Asset pricing anomalies are the foundations of factor investing. In this chapter, our aim is twofold: present simple ideas and concepts (basic factor models and common empirical facts, such as the time-varying nature of returns and risk premia); provide the reader with lists of articles that go much deeper to stimulate and satisfy curiosity. The purpose of this chapter is not to provide a full treatment of the many topics related to factor investing. Rather, it is intended to give a broad overview and cover the essential themes so that the reader is guided towards the relevant references. As such, it can serve as a short, non-exhaustive, review of the literature. The subject of factor modelling in finance is incredibly vast and the number of papers dedicated to it is substantial and still rapidly increasing. The universe of peer-reviewed financial journals can be split in two. The first kind comprises academic journals. Their articles are mostly written by professors, and the audience consists mostly of scholars. The articles are long and often technical. Prominent examples are the Journal of Finance, the Review of Financial Studies and the Journal of Financial Economics. The second type is more practitioner-orientated. The papers are shorter, easier to read, and target finance professionals predominantly. Two emblematic examples are the Journal of Portfolio Management and the Financial Analysts Journal. This chapter reviews and mentions articles published essentially in the first family of journals. 
Beyond academic articles, several monographs are already dedicated to the topic of style allocation (a synonym of factor investing used for instance in theoretical articles (Barberis and Shleifer (2003)) or practitioner papers (Asness et al. (2015))). To cite but a few, we mention: Ilmanen (2011): an exhaustive excursion into risk premia, across many asset classes, with a large spectrum of descriptive statistics (across factors and periods), Ang (2014): covers factor investing with a strong focus on the money management industry, Bali, Engle, and Murray (2016): very complete book on the cross-section of signals with statistical analyses (univariate metrics, correlations, persistence, etc.), Jurczenko (2017): a tour on various topics given by field experts (factor purity, predictability, selection versus weighting, factor timing, etc.). Finally, we mention a few wide-scope papers on this topic: Goyal (2012), Cazalet and Roncalli (2014) and Baz et al. (2015). 3.1 Introduction The topic of factor investing, though a decades-old academic theme, has gained traction concurrently with the rise of exchange traded funds (ETFs) as vectors of investment. Both have gathered momentum in the 2010 decade. Not so surprisingly, the feedback loop between practical financial engineering and academic research has stimulated both sides in a mutually beneficial manner. Practitioners rely on key scholarly findings (e.g., asset pricing anomalies) while researchers dig deeper into pragmatic topics (e.g., factor exposure or transaction costs). Recently, researchers have also tried to quantify and qualify the impact of factor indices on financial markets. For instance, Krkoska and Schenk-Hoppé (2019) analyze herding behaviors while Cong and Xu (2019) show that the introduction of composite securities increases volatility and cross-asset correlations. The core aim of factor models is to understand the drivers of asset prices. Broadly speaking, the rationale behind factor investing is that the financial performance of firms depends on factors, whether they be latent and unobservable, or related to intrinsic characteristics (like accounting ratios for instance). Indeed, as Cochrane (2011) frames it, the first essential question is: which characteristics really provide independent information about average returns? Answering this question helps understand the cross-section of returns and may open the door to their prediction. Theoretically, linear factor models can be viewed as special cases of the arbitrage pricing theory (APT) of Ross (1976), which assumes that the return of an asset \\(n\\) can be modelled as a linear combination of underlying factors \\(f_k\\): \\[\\begin{equation} \\tag{3.1} r_{t,n}= \\alpha_n+\\sum_{k=1}^K\\beta_{n,k}f_{t,k}+\\epsilon_{t,n}, \\end{equation}\\] where the usual econometric constraints on linear models hold: \\(\\mathbb{E}[\\epsilon_{t,n}]=0\\), \\(\\text{cov}(\\epsilon_{t,n},\\epsilon_{t,m})=0\\) for \\(n\\neq m\\) and \\(\\text{cov}(\\textbf{f}_n,\\boldsymbol{\\epsilon}_n)=0\\). If such factors do exist, then they are in contradiction with the cornerstone model in asset pricing: the capital asset pricing model (CAPM) of Sharpe (1964), Lintner (1965) and Mossin (1966). Indeed, according to the CAPM, the only driver of returns is the market portfolio. This explains why factors are also called ‘anomalies’. Empirical evidence of asset pricing anomalies has accumulated since the dual publication of Fama and French (1992) and Fama and French (1993). 
This seminal work has paved the way for a blossoming stream of literature that has its meta-studies (e.g., Green, Hand, and Zhang (2013), Harvey, Liu, and Zhu (2016) and McLean and Pontiff (2016)). The regression (3.1) can be evaluated once (unconditionally) or sequentially over different time frames. In the latter case, the parameters (coefficient estimates) change and the models are thus called conditional (we refer to Ang and Kristensen (2012) and to Cooper and Maio (2019) for recent results on this topic as well as for a detailed review on the related research). Conditional models are more flexible because they acknowledge that the drivers of asset prices may not be constant, which seems like a reasonable postulate. 3.2 Detecting anomalies 3.2.1 Challenges Obviously, a crucial step is to be able to identify an anomaly and the complexity of this task should not be underestimated. Given the publication bias towards positive results (see, e.g., Harvey (2017) in financial economics), researchers are often tempted to report partial results that are sometimes invalidated by further studies. The need for replication is therefore high and many findings have no tomorrow (Linnainmaa and Roberts (2018), Johannesson, Ohlson, and Zhai (2020)), especially if transaction costs are taken into account (Patton and Weller (2020), A. Y. Chen and Velikov (2020)). Nevertheless, as is demonstrated by Chen (2019), \\(p\\)-hacking alone cannot account for all the anomalies documented in the literature. One way to reduce the risk of spurious detection is to increase the hurdles (often, the \\(t\\)-statistics), though the debate is still ongoing (Harvey, Liu, and Zhu (2016), A. Y. Chen (2020a)); another is to resort to multiple testing (Harvey, Liu, and Saretto (2020), Vincent, Hsu, and Lin (2020)). Nevertheless, the large sample sizes used in finance may mechanically lead to very low \\(p\\)-values and we refer to Michaelides (2020) for a discussion on this topic. Some researchers document fading anomalies because of publication: once the anomaly becomes public, agents invest in it, which pushes prices up and the anomaly disappears. McLean and Pontiff (2016) and Shanaev and Ghimire (2020) document this effect in the US but Jacobs and Müller (2020) find that all other countries experience sustained post-publication factor returns (see also Zaremba, Umutlu, and Maydubura (2020)). With a different methodology, A. Y. Chen and Zimmermann (2020) introduce a publication bias adjustment for returns and the authors note that this (negative) adjustment is in fact rather small. Likewise, A. Y. Chen (2020b) finds that \\(p\\)-hacking cannot be responsible for all the anomalies reported in the literature. Penasse (2019) recommends the notion of alpha decay to study the persistence or attenuation of anomalies. Horenstein (2020) even builds a model in which agents invest according to the anomalies reported in academic research. The destruction of factor premia may be due to herding (Krkoska and Schenk-Hoppé (2019), Volpati et al. (2020)) and could be accelerated by the democratization of so-called smart-beta products (ETFs notably) that allow investors to directly invest in particular styles (value, low volatility, etc.) - see S. Huang, Song, and Xiang (2020). For a theoretical perspective on the attractiveness of factor investing, we refer to Jin (2019). 
On the other hand, DeMiguel, Martin Utrera, and Uppal (2019) argue that the price impact of crowding in the smart-beta universe is mitigated by trading diversification stemming from external institutions that trade according to strategies outside this space (e.g., high frequency traders betting via order-book algorithms). The remainder of this subsection was inspired by Baker, Luo, and Taliaferro (2017) and C. Harvey and Liu (2019). 3.2.2 Simple portfolio sorts This is the most common procedure and the one used in Fama and French (1992). The idea is simple. On one date, rank firms according to a particular criterion (e.g., size, book-to-market ratio); form \\(J\\ge 2\\) portfolios (i.e., homogeneous groups) consisting of the same number of stocks according to the ranking (usually, \\(J=2\\), \\(J=3\\), \\(J=5\\) or \\(J=10\\) portfolios are built, based on the median, terciles, quintiles or deciles of the criterion); the weight of stocks inside the portfolio is either uniform (equal weights), or proportional to market capitalization; at a future date (usually one month later), report the returns of the portfolios. Then, iterate the procedure until the chronological end of the sample is reached. The outcome is a time series of portfolio returns \\(r_t^j\\) for each grouping \\(j\\). An anomaly is identified if the \\(t\\)-test between the first (\\(j=1\\)) and the last group (\\(j=J\\)) unveils a significant difference in average returns. More robust tests are described in Cattaneo et al. (2020). A strong limitation of this approach is that the sorting criterion could have a non-monotonic impact on returns and a test based on the two extreme portfolios would not detect it. Several articles address this concern: Patton and Timmermann (2010) and Romano and Wolf (2013) for instance. Another concern is that these sorted portfolios may capture not only the priced risk associated with the characteristic, but also some unpriced risk. K. Daniel et al. (2020) show that it is possible to disentangle the two and make the most of altered sorted portfolios. Instead of focusing on only one criterion, it is possible to group assets according to several characteristics. The original paper by Fama and French (1992) also combines market capitalization with book-to-market ratios. Each characteristic is divided into 10 buckets, which makes 100 portfolios in total. Beyond data availability, there is no upper bound on the number of features that can be included in the sorting process. In fact, some authors investigate more complex sorting algorithms that can manage a potentially large number of characteristics (see, e.g., Feng, Polson, and Xu (2019) and Bryzgalova, Pelger, and Zhu (2019)). Finally, we refer to Ledoit, Wolf, and Zhao (2020) for refinements that take into account the covariance structure of asset returns and to Cattaneo et al. (2020) for a theoretical study on the statistical properties of the sorting procedure (including theoretical links with regression-based approaches). Notably, the latter paper discusses the optimal number of portfolios and suggests that it is probably larger than the usual 10 often used in the literature. In the code and Figure 3.1 below, we compute size portfolios (equally weighted: above versus below the median capitalization). According to the size anomaly, the firms with below median market cap should earn higher returns on average. This is verified whenever the orange bar in the plot is above the blue one (it happens most of the time). 
data_ml %>% group_by(date) %>% mutate(large = Mkt_Cap_12M_Usd > median(Mkt_Cap_12M_Usd)) %>% # Creates the cap sort ungroup() %>% # Ungroup mutate(year = lubridate::year(date)) %>% # Creates a year variable group_by(year, large) %>% # Analyze by year & cap summarize(avg_return = mean(R1M_Usd)) %>% # Compute average return ggplot(aes(x = year, y = avg_return, fill = large)) + # Plot! geom_col(position = "dodge") + # Bars side-to-side theme(legend.position = c(0.8, 0.2)) + # Legend location coord_fixed(124) + theme(legend.title=element_blank()) + # x/y aspect ratio scale_fill_manual(values=c("#F87E1F", "#0570EA"), name = "", # Colors labels=c("Small", "Large")) + ylab("Average returns") + theme(legend.text=element_text(size=9)) FIGURE 3.1: The size factor: average returns of small versus large firms. 3.2.3 Factors The construction of so-called factors follows the same lines as above. Portfolios are based on one characteristic and the factor is a long-short ensemble of one extreme portfolio minus the opposite extreme (small minus large for the size factor or high book-to-market ratio minus low book-to-market ratio for the value factor). Sometimes, subtleties include forming bivariate sorts and aggregating several portfolios together, as in the original contribution of Fama and French (1993). The most common factors are listed below, along with a few references. We refer to the books listed at the beginning of the chapter for a more exhaustive treatment of factor idiosyncrasies. For most anomalies, theoretical justifications have been brought forward, whether risk-based or behavioral. We list the most frequently cited factors below: Size (SMB = small firms minus large firms): Banz (1981), Fama and French (1992), Fama and French (1993), Van Dijk (2011), Asness et al. (2018) and Astakhov, Havranek, and Novak (2019). Value (HML = high minus low: undervalued minus `growth’ firms): Fama and French (1992), Fama and French (1993), C. S. Asness, Moskowitz, and Pedersen (2013). Momentum (WML = winners minus losers): Jegadeesh and Titman (1993), Carhart (1997) and C. S. Asness, Moskowitz, and Pedersen (2013). The winners are the assets that have experienced the highest returns over the last year (sometimes the computation of the return is truncated to omit the last month). Cross-sectional momentum is linked, but not equivalent, to time series momentum (trend following), see e.g., Moskowitz, Ooi, and Pedersen (2012) and Lempérière et al. (2014). Momentum is also related to contrarian movements that occur both at higher and lower frequencies (short-term and long-term reversals), see Luo, Subrahmanyam, and Titman (2020). Profitability (RMW = robust minus weak profits): Fama and French (2015), Bouchaud et al. (2019). In the former reference, profitability is measured as (revenues - (cost and expenses))/equity. Investment (CMA = conservative minus aggressive): Fama and French (2015), Hou, Xue, and Zhang (2015). Investment is measured via the growth of total assets (divided by total assets). Aggressive firms are those that experience the largest growth in assets. Low `risk’ (sometimes, BAB = betting against beta): Ang et al. (2006), Baker, Bradley, and Wurgler (2011), Frazzini and Pedersen (2014), Boloorforoosh et al. (2020), Baker, Hoeyer, and Wurgler (2020) and Asness et al. (2020). In this case, the computation of risk changes from one article to the other (simple volatility, market beta, idiosyncratic volatility, etc.). 
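To make the long-short construction concrete, the chunk below sketches a (deliberately coarse) small-minus-large series from the data_ml sample: the breakpoint is the median market capitalization, both legs are equally weighted, and the column name SMB_proxy is ours. This is only an illustration, not the Fama-French recipe, which relies on finer and bivariate sorts.
data_ml %>%
    group_by(date) %>%
    mutate(small = Mkt_Cap_12M_Usd <= median(Mkt_Cap_12M_Usd)) %>%  # Below-median market caps
    group_by(date, small) %>%
    summarize(ret = mean(R1M_Usd)) %>%                              # Equally weighted leg returns
    ungroup() %>%
    spread(key = small, value = ret) %>%                            # One column per leg
    mutate(SMB_proxy = `TRUE` - `FALSE`) %>%                        # Long small caps, short large caps
    summarize(mean_premium = mean(SMB_proxy))                       # Average monthly long-short return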
With the notable exception of the low risk premium, the most mainstream anomalies are kept and updated in the data library of Kenneth French (https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html). Of course, the computation of the factors follows a particular set of rules, but they are generally accepted in the academic sphere. Another source of data is the AQR repository: https://www.aqr.com/Insights/Datasets. In the dataset we use for the book, we proxy the value anomaly not with the book-to-market ratio but with the price-to-book ratio (the book value is located in the denominator). As is shown in Clifford Asness and Frazzini (2013), the choice of the variable for value can have sizable effects. Below, we import data from Ken French’s data library. We will use it later on in the chapter. library(quantmod) # Package for data extraction library(xtable) # Package for LaTeX exports min_date <- "1963-07-31" # Start date max_date <- "2020-03-28" # Stop date temp <- tempfile() KF_website <- "http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/" KF_file <- "ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip" link <- paste0(KF_website,KF_file) # Link of the file download.file(link, temp, quiet = TRUE) # Download! FF_factors <- read_csv(unz(temp, "F-F_Research_Data_5_Factors_2x3.CSV"), skip = 3) %>% # Check the number of lines to skip! rename(date = X1, MKT_RF = `Mkt-RF`) %>% # Change the name of first columns mutate_at(vars(-date), as.numeric) %>% # Convert values to number mutate(date = ymd(parse_date_time(date, "%Y%m"))) %>% # Date in right format mutate(date = rollback(date + months(1))) # End of month date FF_factors <- FF_factors %>% mutate(MKT_RF = MKT_RF / 100, # Scale returns SMB = SMB / 100, HML = HML / 100, RMW = RMW / 100, CMA = CMA / 100, RF = RF/100) %>% filter(date >= min_date, date <= max_date) # Finally, keep only recent points knitr::kable(head(FF_factors), booktabs = TRUE, caption = "Sample of monthly factor returns.") # A look at the data (see table) TABLE 3.1: Sample of monthly factor returns. date MKT_RF SMB HML RMW CMA RF 1963-07-31 -0.0039 -0.0047 -0.0083 0.0066 -0.0115 0.0027 1963-08-31 0.0507 -0.0079 0.0167 0.0040 -0.0040 0.0025 1963-09-30 -0.0157 -0.0048 0.0018 -0.0076 0.0024 0.0027 1963-10-31 0.0253 -0.0129 -0.0010 0.0275 -0.0224 0.0029 1963-11-30 -0.0085 -0.0084 0.0171 -0.0045 0.0222 0.0027 1963-12-31 0.0183 -0.0189 -0.0012 0.0007 -0.0030 0.0029 Posterior to the discovery of these stylized facts, some contributions have aimed at building theoretical models that capture these properties. We cite a handful below: size and value: Berk, Green, and Naik (1999), K. D. Daniel, Hirshleifer, and Subrahmanyam (2001), Barberis and Shleifer (2003), Gomes, Kogan, and Zhang (2003), Carlson, Fisher, and Giammarino (2004), Arnott et al. (2014); momentum: Johnson (2002), Grinblatt and Han (2005), Vayanos and Woolley (2013), Choi and Kim (2014). In addition, recent bridges have been built between risk-based factor representations and behavioural theories. We refer essentially to Barberis, Mukherjee, and Wang (2016) and K. Daniel, Hirshleifer, and Sun (2020) and the references therein. While these factors (i.e., long-short portfolios) exhibit time-varying risk premia and are magnified by corporate news and announcements (Engelberg, McLean, and Pontiff (2018)), it is well-documented (and accepted) that they deliver positive returns over long horizons. 
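As a quick (and admittedly rough) sanity check of this last statement, the chunk below computes annualized average returns and plain t-statistics of the series stored in the FF_factors tibble created above; the computation ignores autocorrelation and is only meant to give orders of magnitude.
FF_factors %>%
    gather(key = factor, value = value, -date) %>%               # One row per factor-date pair
    group_by(factor) %>%
    summarize(avg_ann_return = 12 * mean(value),                 # Annualized average return
              t_stat = sqrt(n()) * mean(value) / sd(value))      # Simple t-statistic of the mean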
We refer to Gagliardini, Ossola, and Scaillet (2016) and to the survey by Gagliardini, Ossola, and Scaillet (2019), as well as to the related bibliography for technical details on estimation procedures of risk premia and the corresponding empirical results. Large sample studies that document regime changes in factor premia were also carried out by Ilmanen et al. (2019) and Smith and Timmermann (2020). Moreover, the predictability of returns is also time-varying (as documented in Farmer, Schmidt, and Timmermann (2019), Tsiakas, Li, and Zhang (2020) and Liu, Pan, and Wang (2020)), and estimation methods can be improved (Johnson (2019)). In Figure 3.2, we plot the average monthly return aggregated over each calendar year for five common factors. The risk-free rate (which is not a factor per se) is the most stable, while the market factor (aggregate market returns minus the risk-free rate) is the most volatile. This makes sense because it is the only long equity factor among the five series. FF_factors %>% mutate(date = year(date)) %>% # Turn date into year gather(key = factor, value = value, - date) %>% # Put in tidy shape group_by(date, factor) %>% # Group by year and factor summarise(value = mean(value)) %>% # Compute average return ggplot(aes(x = date, y = value, color = factor)) + # Plot geom_line() + coord_fixed(500) # Fix x/y ratio FIGURE 3.2: Average returns of common anomalies (1963-2020). Source: Ken French library. The individual attributes of investors who allocate towards particular factors are a blossoming topic. We list a few references below, even though they somewhat lie out of the scope of this book. Betermier, Calvet, and Sodini (2017) show that value investors are older, wealthier and face lower income risk compared to growth investors, who are in the best position to take financial risks. The study by Cronqvist, Siegel, and Yu (2015) leads to different conclusions: it finds that the propensity to invest in value versus growth assets has roots in genetics and in life events (the latter effect being confirmed in Cocco, Gomes, and Lopes (2020), and the former being further detailed in a more general context in Cronqvist et al. (2015)). Psychological traits can also explain some factors: when agents extrapolate, they are likely to fuel momentum (this topic is thoroughly reviewed in Barberis (2018)). Micro- and macro-economic consequences of these preferences are detailed in Bhamra and Uppal (2019). To conclude this paragraph, we mention that theoretical models have also been proposed that link agents’ preferences and beliefs (via prospect theory) to market anomalies (see for instance Barberis, Jin, and Wang (2020)). Finally, we highlight the need for replicability of factor premia and echo the recent editorial by Harvey (2020). As is shown by Linnainmaa and Roberts (2018) and Hou, Xue, and Zhang (2020), many proclaimed factors are in fact very much data-dependent and often fail to deliver sustained profitability when the investment universe is altered or when the definition of the variable changes (Clifford Asness and Frazzini (2013)). Campbell Harvey and his co-authors, in a series of papers, tried to synthesize the research on factors in Harvey, Liu, and Zhu (2016), C. Harvey and Liu (2019) and Harvey and Liu (2019). This work underlines the need to set high bars for an anomaly to be called a ‘true’ factor. 
Increasing thresholds for \\(p\\)-values is only a partial answer, as it is always possible to resort to data snooping in order to find an optimized strategy that will fail out-of-sample but that will deliver a \\(t\\)-statistic larger than three (or even four). Harvey (2017) recommends to resort to a Bayesian approach which blends data-based significance with a prior into a so-called Bayesianized p-value (see subsection below). Following this work, researchers have continued to explore the richness of this zoo. Bryzgalova, Huang, and Julliard (2019) propose a tractable Bayesian estimation of large-dimensional factor models and evaluate all possible combinations of more than 50 factors, yielding an incredibly large number of coefficients. This combined with a Bayesianized Fama and MacBeth (1973) procedure allows to distinguish between pervasive and superfluous factors. Chordia, Goyal, and Saretto (2020) use simulations of 2 million trading strategies to estimate the rate of false discoveries, that is, when a spurious factor is detected (type I error). They also advise to use thresholds for t-statistics that are well above three. In a similar vein, Harvey and Liu (2020) also underline that sometimes true anomalies may be missed because of a one time \\(t\\)-statistic that is too low (type II error). The propensity of journals to publish positive results has led researchers to estimate the difference between reported returns and true returns. A. Y. Chen and Zimmermann (2020) call this difference the publication bias and estimate it as roughly 12%. That is, if a published average return is 8%, the actual value may in fact be closer to (1-12%)*8%=7%. Qualitatively, this estimation of 12% is smaller than the out-of-sample reduction in returns found in McLean and Pontiff (2016). 3.2.4 Predictive regressions, sorts, and p-value issues For simplicity, we assume a simple form: \\[\\begin{equation} \\tag{3.2} \\textbf{r} = a+b\\textbf{x}+\\textbf{e}, \\end{equation}\\] where the vector \\(\\textbf{r}\\) stacks all returns of all stocks and \\(\\textbf{x}\\) is a lagged variable so that the regression is indeed predictive. If the estimate \\(\\hat{b}\\) is significant given a specified threshold, then it can be tempting to conclude that \\(\\textbf{x}\\) does a good job at predicting returns. Hence, long-short portfolios related to extreme values of \\(\\textbf{x}\\) (mind the sign of \\(\\hat{b}\\)) are expected to generate profits. This is unfortunately often false because \\(\\hat{b}\\) gives information on the past ability of \\(\\textbf{x}\\) to forecast returns. What happens in the future may be another story. Statistical tests are also used for portfolio sorts. Assume two extreme portfolios are expected to yield very different average returns (like very small cap versus very large cap, or strong winners versus bad losers). The portfolio returns are written \\(r_t^+\\) and \\(r_t^-\\). The simplest test for the mean is \\(t=\\sqrt{T}\\frac{m_{r_+}-m_{r_-}}{\\sigma_{r_+-r_-}}\\), where \\(T\\) is the number of points and \\(m_{r_\\pm}\\) denotes the means of returns and \\(\\sigma_{r_+-r_-}\\) is the standard deviation of the difference between the two series, i.e., the volatility of the long-short portfolio. In short, the statistic can be viewed as a scaled Sharpe ratio (though usually these ratios are computed for long-only portfolios) and can in turn be used to compute \\(p\\)-values to assess the robustness of an anomaly. 
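To make this test concrete, below is a minimal sketch in R. The two return series are simulated, so the names r_plus and r_minus and the chosen parameters are arbitrary placeholders; in practice, they would be the returns of the two extreme portfolios of the sort. 
set.seed(42)                                           # Reproducibility of the toy example 
r_plus <- rnorm(240, mean = 0.010, sd = 0.05)          # Simulated returns of the "top" portfolio 
r_minus <- rnorm(240, mean = 0.004, sd = 0.05)         # Simulated returns of the "bottom" portfolio 
r_ls <- r_plus - r_minus                               # Long-short portfolio returns 
t_stat <- sqrt(length(r_ls)) * mean(r_ls) / sd(r_ls)   # Scaled Sharpe ratio of the spread 
p_value <- 2 * pnorm(-abs(t_stat))                     # Two-sided Gaussian p-value 
c(t_stat = t_stat, p_value = p_value) 
The built-in t.test(r_plus, r_minus, paired = TRUE) yields a very similar statistic, with Student (rather than Gaussian) quantiles. 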
As is shown in Linnainmaa and Roberts (2018) and Hou, Xue, and Zhang (2020), many factors discovered by researchers fail to survive in out-of-sample tests. One reason why people are overly optimistic about anomalies they detect is the widespread reverse interpretation of the \\(p\\)-value. Often, it is thought of as the probability of one hypothesis (e.g., my anomaly exists) given the data. In fact, it is the opposite: it is the probability of observing the data sample given that the hypothesis being tested (the null, i.e., the absence of anomaly) holds. \\[\\begin{align*} p-\\text{value} &= P[D|H] \\\\ \\text{target prob.}& = P[H|D]=\\frac{P[D|H]}{P[D]}\\times P[H], \\end{align*}\\] where \\(H\\) stands for hypothesis and \\(D\\) for data. The equality in the second row is a plain application of Bayes’ identity: the interesting probability is in fact a transform of the \\(p\\)-value. Two articles (at least) discuss this idea. Harvey (2017) introduces Bayesianized \\(p\\)-values: \\[\\begin{equation} \\tag{3.3} \\text{Bayesianized } p-\\text{value}=\\text{Bpv}= e^{-t^2/2}\\times\\frac{\\text{prior}}{1+e^{-t^2/2}\\times \\text{prior}} , \\end{equation}\\] where \\(t\\) is the \\(t\\)-statistic obtained from the regression (i.e., the one that defines the \\(p\\)-value) and prior denotes the analyst’s prior odds in favour of the null (i.e., of the absence of anomaly). The prior is coded as follows. Suppose there is a p% chance that the null holds (i.e., (1-p)% for the anomaly). The odds are coded as \\(p/(1-p)\\). Thus, if the \\(t\\)-statistic is equal to 2 (corresponding roughly to a \\(p\\)-value of 5%) and the prior odds are equal to 6, then the Bpv is equal to \\(e^{-2}\\times 6 \\times(1+e^{-2}\\times 6)^{-1}\\approx 0.448\\) and there is a 44.8% chance that the null is true. This interpretation stands in sharp contrast with the original \\(p\\)-value, which cannot be viewed as a probability that the null holds. Of course, one drawback is that the level of the prior is crucial and solely user-specified. The work of Chinco, Neuhierl, and Weber (2020) is very different but shares some key concepts, like the introduction of Bayesian priors in regression outputs. They show that constraining the predictive regression with an \\(L^2\\) penalty (see the ridge regression in Chapter 5) amounts to introducing views on what the true distribution of \\(b\\) is. The stronger the constraint, the more the estimate \\(\\hat{b}\\) will be shrunk towards zero. One key idea in their work is the assumption of a distribution for the true \\(b\\) across many anomalies. It is assumed to be Gaussian and centered. The interesting parameter is the standard deviation: the larger it is, the more frequently significant anomalies are discovered. Notably, the authors show that this parameter changes through time and we refer to the original paper for more details on this subject. 3.2.5 Fama-Macbeth regressions Another detection method was proposed by Fama and MacBeth (1973) through a two-stage regression analysis of risk premia. The first stage is a simple estimation of the relationship (3.1): the regressions are run on a stock-by-stock basis over the corresponding time series. The resulting estimates \\(\\hat{\\beta}_{i,k}\\) are then plugged into a second series of regressions: \\[\\begin{equation} r_{t,n}= \\gamma_{t,0} + \\sum_{k=1}^K\\gamma_{t,k}\\hat{\\beta}_{n,k} + \\varepsilon_{t,n}, \\end{equation}\\] which are run date-by-date on the cross-section of assets.6 Theoretically, the betas would be known and the regression would be run on the \\(\\beta_{n,k}\\) instead of their estimated values. 
The \\(\\hat{\\gamma}_{t,k}\\) estimate the premia of factor \\(k\\) at time \\(t\\). Under suitable distributional assumptions on the \\(\\varepsilon_{t,n}\\), statistical tests can be performed to determine whether these premia are significant or not. Typically, the statistic on the time-aggregated (average) premia \\(\\hat{\\gamma}_k=\\frac{1}{T}\\sum_{t=1}^T\\hat{\\gamma}_{t,k}\\): \\[t_k=\\frac{\\hat{\\gamma}_k}{\\hat{\\sigma_k}/\\sqrt{T}}\\] is often used in pure Gaussian contexts to assess whether or not the factor is significant (\\(\\hat{\\sigma}_k\\) is the standard deviation of the \\(\\hat{\\gamma}_{t,k}\\)). We refer to Jagannathan and Wang (1998) and Petersen (2009) for technical discussions on the biases and losses in accuracy that can be induced by standard ordinary least squares (OLS) estimations. Moreover, as the \\(\\hat{\\beta}_{i,k}\\) in the second-pass regression are estimates, a second level of errors can arise (the so-called errors in variables). The interested reader will find some extensions and solutions in Shanken (1992), Ang, Liu, and Schwarz (2018) and Jegadeesh et al. (2019). Below, we perform Fama and MacBeth (1973) regressions on our sample. We start by the first pass: individual estimation of betas. We build a dedicated function below and use some functional programming to automate the process. We stick to the original implementation of the estimation and perform synchronous regressions. nb_factors <- 5 # Number of factors data_FM <- left_join(data_ml %>% # Join the 2 datasets dplyr::select(date, stock_id, R1M_Usd) %>% # (with returns... filter(stock_id %in% stock_ids_short), # ... over some stocks) FF_factors, by = "date") %>% group_by(stock_id) %>% # Grouping mutate(R1M_Usd = lag(R1M_Usd)) %>% # Lag returns ungroup() %>% na.omit() %>% # Remove missing points spread(key = stock_id, value = R1M_Usd) models <- lapply(paste0("`", stock_ids_short, '` ~ MKT_RF + SMB + HML + RMW + CMA'), # Model spec function(f){ lm(as.formula(f), data = data_FM, # Call lm(.) na.action="na.exclude") %>% summary() %>% # Gather the output "$"(coef) %>% # Keep only coefs data.frame() %>% # Convert to dataframe dplyr::select(Estimate)} # Keep the estimates ) betas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) %>% # Extract the betas data.frame(row.names = stock_ids_short) # Format: row names colnames(betas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Format: col names TABLE 3.2: Sample of beta values (row numbers are stock IDs). Constant MKT_RF SMB HML RMW CMA 1 0.008 1.431 0.524 0.635 0.998 -0.397 3 -0.002 0.829 1.101 0.889 0.310 -0.541 4 0.005 0.362 0.298 -0.049 0.588 0.201 7 0.006 0.424 0.681 0.255 0.309 0.116 9 0.004 0.843 0.662 1.076 0.041 0.052 11 -0.001 0.993 0.142 0.483 -0.103 -0.002 In the table, MKT_RF is the market return minus the risk free rate. The corresponding coefficient is often referred to as the beta, especially in univariate regressions. We then reformat these betas from Table 3.2 to prepare the second pass. Each line corresponds to one asset: the first 5 columns are the estimated factor loadings and the remaining ones are the asset returns (date by date). 
loadings <- betas %>% # Start from loadings (betas) dplyr::select(-Constant) %>% # Remove constant data.frame() # Convert to dataframe ret <- returns %>% # Start from returns dplyr::select(-date) %>% # Keep the returns only data.frame(row.names = returns$date) %>% # Set row names t() # Transpose FM_data <- cbind(loadings, ret) # Aggregate both TABLE 3.3: Sample of reformatted beta values (ready for regression). MKT_RF SMB HML RMW CMA 2000-01-31 2000-02-29 2000-03-31 1 1.4308383 0.5237022 0.6347215 0.9976083 -0.3974365 -0.036 0.263 0.031 3 0.8285128 1.1007130 0.8893534 0.3104149 -0.5410804 0.077 -0.024 0.018 4 0.3624986 0.2983925 -0.0487881 0.5875738 0.2014851 -0.016 0.000 0.153 7 0.4243414 0.6810480 0.2554230 0.3094124 0.1159429 -0.009 0.027 0.000 9 0.8426102 0.6624523 1.0758153 0.0412878 0.0515565 0.032 0.076 -0.025 11 0.9929190 0.1423543 0.4831593 -0.1034593 -0.0019200 0.144 0.258 0.049 We observe that the values of the first column (market betas) revolve around one, which is what we would expect. Finally, we are ready for the second round of regressions. models <- lapply(paste("`", returns$date, "`", ' ~ MKT_RF + SMB + HML + RMW + CMA', sep = ""), function(f){ lm(as.formula(f), data = FM_data) %>% # Call lm(.) summary() %>% # Gather the output "$"(coef) %>% # Keep only the coefs data.frame() %>% # Convert to dataframe dplyr::select(Estimate)} # Keep only estimates ) gammas <- matrix(unlist(models), ncol = nb_factors + 1, byrow = T) %>% # Switch to dataframe data.frame(row.names = returns$date) # & set row names colnames(gammas) <- c("Constant", "MKT_RF", "SMB", "HML", "RMW", "CMA") # Set col names TABLE 3.4: Sample of gamma (premia) values. Constant MKT_RF SMB HML RMW CMA 2000-01-31 -0.013 0.043 0.218 -0.137 -0.272 0.035 2000-02-29 0.012 0.077 -0.130 0.044 0.086 -0.027 2000-03-31 0.007 -0.011 -0.014 0.052 0.039 0.043 2000-04-30 0.136 -0.154 -0.104 0.157 0.078 -0.058 2000-05-31 0.050 -0.009 0.072 -0.096 -0.093 -0.054 2000-06-30 0.026 -0.029 -0.018 0.053 0.045 0.017 Visually, the estimated premia are also very volatile. We plot their estimated values for the market, SMB and HML factors. gammas[2:nrow(gammas),] %>% # Take gammas: # The first row is omitted because the first row of returns is undefined dplyr::select(MKT_RF, SMB, HML) %>% # Select 3 factors bind_cols(date = data_FM$date) %>% # Add date gather(key = factor, value = gamma, -date) %>% # Put in tidy shape ggplot(aes(x = date, y = gamma, color = factor)) + # Plot geom_line() + facet_grid( factor~. ) + # Lines & facets scale_color_manual(values=c("#F87E1F", "#0570EA", "#F81F40")) + # Colors coord_fixed(980) # Fix x/y ratio FIGURE 3.3: Time series plot of gammas (premia) in Fama-Macbeth regressions. The two spikes at the end of the sample signal potential colinearity issues; two factors seem to compensate in an unclear aggregate effect. This underlines the usefulness of penalized estimates (see Chapter 5). 3.2.6 Factor competition The core purpose of factors is to explain the cross-section of stock returns. For theoretical and practical reasons, it is preferable if redundancies within factors are avoided. Indeed, redundancies imply collinearity which is known to perturb estimates (Belsley, Kuh, and Welsch (2005)). In addition, when asset managers decompose the performance of their returns into factors, overlaps (high absolute correlations) between factors yield exposures that are less interpretable; positive and negative exposures compensate each other spuriously. 
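As a simple illustration of such overlaps, and assuming the FF_factors sample built earlier in the chapter is still in memory, the correlation matrix of the five factors can be computed in one pipe: 
FF_factors %>% 
    dplyr::select(MKT_RF, SMB, HML, RMW, CMA) %>%    # Keep the five factor returns only 
    cor() %>%                                        # Sample correlation matrix 
    round(3)                                         # Round for readability 
Large absolute entries in this matrix point to the kind of redundancy that the protocol below formalizes. 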
A simple protocol to sort out redundant factors is to run regressions of each factor against all others: \\[\\begin{equation} \\tag{3.4} f_{t,k} = a_k +\\sum_{j\\neq k} \\delta_{k,j} f_{t,j} + \\epsilon_{t,k}. \\end{equation}\\] The interesting metric is then the test statistic associated to the estimation of \\(a_k\\). If \\(a_k\\) is significantly different from zero, then the cross-section of (other) factors fails to explain exhaustively the average return of factor \\(k\\). Otherwise, the return of the factor can be captured by exposures to the other factors and is thus redundant. One mainstream application of this technique was performed in Fama and French (2015), in which the authors show that the HML factor is redundant when taking into account four other factors (Market, SMB, RMW and CMA). Below, we reproduce their analysis on an updated sample. We start our analysis directly with the database maintained by Kenneth French. We can run the regressions that determine the redundancy of factors via the procedure defined in Equation (3.4). factors <- c("MKT_RF", "SMB", "HML", "RMW", "CMA") models <- lapply(paste(factors, ' ~ MKT_RF + SMB + HML + RMW + CMA-',factors), function(f){ lm(as.formula(f), data = FF_factors) %>% # Call lm(.) summary() %>% # Gather the output "$"(coef) %>% # Keep only the coefs data.frame() %>% # Convert to dataframe filter(rownames(.) == "(Intercept)") %>% # Keep only the Intercept dplyr::select(Estimate,`Pr...t..`)} # Keep the coef & p-value ) alphas <- matrix(unlist(models), ncol = 2, byrow = T) %>% # Switch from list to dataframe data.frame(row.names = factors) # alphas # To see the alphas (optional) We obtain the vector of \\(\\alpha\\) values from Equation ((3.4)). Below, we format these figures along with \\(p\\)-value thresholds and export them in a summary table. The significance levels of coefficients is coded as follows: \\(0<(***)<0.001<(**)<0.01<(*)<0.05\\). results <- matrix(NA, nrow = length(factors), ncol = length(factors) + 1) # Coefs signif <- matrix(NA, nrow = length(factors), ncol = length(factors) + 1) # p-values for(j in 1:length(factors)){ form <- paste(factors[j], ' ~ MKT_RF + SMB + HML + RMW + CMA-',factors[j]) # Build model fit <- lm(form, data = FF_factors) %>% summary() # Estimate model coef <- fit$coefficients[,1] # Keep coefficients p_val <- fit$coefficients[,4] # Keep p-values results[j,-(j+1)] <- coef # Fill matrix signif[j,-(j+1)] <- p_val } signif[is.na(signif)] <- 1 # Kick out NAs results <- results %>% round(3) %>% data.frame() # Basic formatting results[signif<0.001] <- paste(results[signif<0.001]," (***)") # 3 star signif results[signif>0.001&signif<0.01] <- # 2 star signif paste(results[signif>0.001&signif<0.01]," (**)") results[signif>0.01&signif<0.05] <- # 1 star signif paste(results[signif>0.01&signif<0.05]," (*)") results <- cbind(as.character(factors), results) # Add dep. variable colnames(results) <- c("Dep. Variable","Intercept", factors) # Add column names TABLE 3.5: Factor competition among the Fama and French (2015) five factors. Dep. Variable Intercept MKT_RF SMB HML RMW CMA MKT_RF 0.008 (***) NA 0.265 (***) 0.093 -0.344 (***) -0.895 (***) SMB 0.003 (*) 0.134 (***) NA 0.078 -0.428 (***) -0.127 HML 0 0.025 0.041 NA 0.157 (***) 1.02 (***) RMW 0.004 (***) -0.09 (***) -0.222 (***) 0.155 (***) NA -0.285 (***) CMA 0.002 (***) -0.107 (***) -0.03 0.46 (***) -0.13 (***) NA We confirm that the HML factor remains redundant when the four others are present in the asset pricing model. 
The figures we obtain are very close to the ones in the original paper (Fama and French (2015)), which makes sense, since we only add 5 years to their initial sample. At a more macro-level, researchers also try to figure out which models (i.e., combinations of factors) are the most likely, given the data empirically observed (and possibly given priors formulated by the econometrician). For instance, this stream of literature seeks to quantify to what extent the 3-factor model of Fama and French (1993) outperforms the 5-factor model of Fama and French (2015). In this direction, De Moor, Dhaene, and Sercu (2015) introduce a novel computation of \\(p\\)-values that compares the relative likelihood that two models pass a zero-alpha test. More generally, the Bayesian method of Barillas and Shanken (2018) was subsequently improved by Chib, Zeng, and Zhao (2020) - see also Chib and Zeng (2020). For a discussion on model comparison from a transaction cost perspective, we refer to S. A. Li, DeMiguel, and Martin-Utrera (2020). Lastly, even the optimal number of factors remains a subject of disagreement in recent work. While the traditional literature focuses on a limited number (3-5) of factors, more recent research by DeMiguel et al. (2020), He, Huang, and Zhou (2020), Kozak, Nagel, and Santosh (2019) and Freyberger, Neuhierl, and Weber (2020) advocates the use of at least 15 factors (in contrast, Kelly, Pruitt, and Su (2019) argue that a small number of latent factors may suffice). Green, Hand, and Zhang (2017) even find that the number of characteristics that help explain the cross-section of returns varies in time.7 3.2.7 Advanced techniques The ever-increasing number of factors, combined with their importance in asset management, has led researchers to craft more subtle methods in order to ‘organize’ the so-called factor zoo and, more importantly, to detect spurious anomalies and compare different asset pricing model specifications. We list a few of them below. - Feng, Giglio, and Xiu (2020) combine LASSO selection with Fama-MacBeth regressions to test whether new factor models are worthwhile. They quantify the gain of adding one new factor to a set of predefined factors and show that many factors reported in papers published in the 2010 decade do not add much incremental value; - C. Harvey and Liu (2019) (in a similar vein) use bootstrap on orthogonalized factors. They make the case that correlations among predictors are a major issue and their method aims at solving this problem. Their lengthy procedure seeks to test whether the maximal additional contribution of a candidate variable is significant; - Fama and French (2018) compare asset pricing models through squared maximum Sharpe ratios; - Giglio and Xiu (2019) estimate factor risk premia using a three-pass method based on principal component analysis; - Pukthuanthong, Roll, and Subrahmanyam (2018) disentangle priced and non-priced factors via a combination of principal component analysis and Fama and MacBeth (1973) regressions; - Gospodinov, Kan, and Robotti (2019) warn against factor misspecification (when spurious factors are included in the list of regressors). Traded factors (\\(resp.\\) macro-economic factors) seem more likely (\\(resp.\\) less likely) to yield robust identifications (see also Bryzgalova (2019)). There is obviously no infallible method, but the number of contributions in the field highlights the need for robustness. This is evidently a major concern when crafting investment decisions based on factor intuitions. 
One major hurdle for short-term strategies is the likely time-varying nature of factors. We refer for instance to Ang and Kristensen (2012) and Cooper and Maio (2019) for practical results and to Gagliardini, Ossola, and Scaillet (2016) and S. Ma et al. (2020) for more theoretical treatments (with additional empirical results). 3.3 Factors or characteristics? The decomposition of returns into linear factor models is convenient because of its simple interpretation. There is nonetheless a debate in the academic literature about whether firm returns are indeed explained by exposure to macro-economic factors or simply by the characteristics of firms. In their early study, Lakonishok, Shleifer, and Vishny (1994) argue that one explanation of the value premium comes from incorrect extrapolation of past earnings growth rates. Investors are overly optimistic about firms that have recently been profitable. Consequently, future returns are (also) driven by the core (accounting) features of the firm. The question is then to disentangle which effect is the most pronounced when explaining returns: characteristics versus exposures to macro-economic factors. In their seminal contribution on this topic, Daniel and Titman (1997) provide evidence in favour of the former (two follow-up papers are K. Daniel, Titman, and Wei (2001) and Daniel and Titman (2012)). They show that firms with high book-to-market ratios or small capitalizations display higher average returns, even if they are negatively loaded on the HML or SMB factors. Therefore, it seems that it is indeed the intrinsic characteristics that matter, and not the factor exposure. For further material on characteristics’ role in return explanation or prediction, we refer to the following contributions: - Haugen and Baker (1996) estimate predictive regressions based on firm characteristics and show that it is possible to build profitable portfolios based on the resulting predictions. Their method was subsequently enhanced with the adaptive LASSO by Guo (2020). - Section 2.5.2 of Goyal (2012) surveys pre-2010 results on this topic; - Chordia, Goyal, and Shanken (2019) find that characteristics explain a larger proportion of variation in estimated expected returns than factor loadings; - Kozak, Nagel, and Santosh (2018) reconcile factor-based explanations of premia with a theoretical model in which some agents’ demands are sentiment driven; - Han et al. (2019) show with penalized regressions that 20 to 30 characteristics (out of 94) are useful for the prediction of monthly returns of US stocks. Their methodology is interesting: they regress returns against characteristics to build forecasts and then regress the returns on the forecasts to assess whether they are reliable. The latter regression uses a LASSO-type penalization (see Chapter 5) so that useless characteristics are excluded from the model. The penalization is extended to the elasticnet in Rapach and Zhou (2019). - Kelly, Pruitt, and Su (2019) and Kim, Korajczyk, and Neuhierl (2019) both estimate models in which factors are latent but loadings (betas) and possibly alphas depend on characteristics. Kirby (2020) generalizes the first approach by introducing regime-switching. In contrast, Lettau and Pelger (2020a) and Lettau and Pelger (2020b) estimate latent factors without any link to particular characteristics (and provide large sample asymptotic properties of their methods). 
- In the same vein as Hoechle, Schmid, and Zimmermann (2018), Gospodinov, Kan, and Robotti (2019) and Bryzgalova (2019) discuss potential errors that arise when working with portfolio sorts that yield long-short returns. The authors show that in some cases, tests based on this procedure may be misleading. This happens when the characteristic chosen to perform the sort is correlated with an external (unobservable) factor. They propose a novel regression-based approach aimed at bypassing this problem. More recently and in a separate stream of literature, R. S. J. Koijen and Yogo (2019) have introduced a demand model in which investors form their portfolios according to their preferences towards particular firm characteristics. They show that this allows them to mimic the portfolios of large institutional investors. In their model, aggregate demands (and hence, prices) are directly linked to characteristics, not to factors. In a follow-up paper, R. S. Koijen, Richmond, and Yogo (2019) show that a few sets of characteristics suffice to predict future returns. They also show that, based on institutional holdings from the UK and the US, the largest investors are those who are the most influential in the formation of prices. In a similar vein, Betermier, Calvet, and Jo (2019) derive an elegant (theoretical) general equilibrium model that generates some well-documented anomalies (size, book-to-market). The models of Arnott et al. (2014) and Alti and Titman (2019) are also able to theoretically generate known anomalies. Finally, in I. Martin and Nagel (2019), characteristics influence returns via the role they play in the predictability of dividend growth. This paper discusses the asymptotic case in which the number of assets and the number of characteristics are proportional and both increase to infinity. 3.4 Hot topics: momentum, timing and ESG 3.4.1 Factor momentum A recent body of literature unveils a time series momentum property of factor returns. For instance, Gupta and Kelly (2019) report that autocorrelation patterns within these returns are statistically significant.8 Similar results are obtained in Falck, Rej, and Thesmar (2020). In the same vein, Arnott et al. (2020) make the case that the industry momentum found in Moskowitz and Grinblatt (1999) can in fact be explained by this factor momentum. Going even further, Ehsani and Linnainmaa (2019) conclude that the original momentum factor is in fact the aggregation of the autocorrelation that can be found in all other factors. Acknowledging the profitability of factor momentum, H. Yang (2020b) seeks to understand its source and decomposes stock factor momentum portfolios into two components: a factor timing portfolio and a static portfolio. The former seeks to profit from the serial correlations of factor returns while the latter tries to harness factor premia. The author shows that it is the static portfolio that explains the larger portion of factor momentum returns. In H. Yang (2020a), the same author presents a new estimator to gauge factor momentum predictability. Given the data obtained on Ken French’s website, we compute the autocorrelation function (ACF) of factors. 
We recall that \\[\\text{ACF}_k(\\textbf{x}_t)=\\frac{\\mathbb{E}[(\\textbf{x}_t-\\bar{\\textbf{x}})(\\textbf{x}_{t+k}-\\bar{\\textbf{x}})]}{\\text{Var}(\\textbf{x}_t)}.\\] library(cowplot) # For stacking plots library(forecast) # For autocorrelation function acf_SMB <- ggAcf(FF_factors$SMB, lag.max = 10) + labs(title = "") # ACF SMB acf_HML <- ggAcf(FF_factors$HML, lag.max = 10) + labs(title = "") # ACF HML acf_RMW <- ggAcf(FF_factors$RMW, lag.max = 10) + labs(title = "") # ACF RMW acf_CMA <- ggAcf(FF_factors$CMA, lag.max = 10) + labs(title = "") # ACF CMA plot_grid(acf_SMB, acf_HML, acf_RMW, acf_CMA, # Plot labels = c('SMB', 'HML', 'RMW', 'CMA')) FIGURE 3.4: Autocorrelograms of common factor portfolios. Of the four chosen series, only the size factor is not significantly autocorrelated at the first order. 3.4.2 Factor timing Given the abundance of evidence of the time-varying nature of factor premia, it is legitimate to wonder if it is possible to predict when factors will perform well or badly. The evidence on the effectiveness of timing is diverse: positive for Greenwood and Hanson (2012), Hodges et al. (2017), Hasler, Khapko, and Marfe (2019), Haddad, Kozak, and Santosh (2020) and Lioui and Tarelli (2020), negative for Asness et al. (2017) and mixed for Dichtl et al. (2019). There is no consensus on which predictors to use (general macroeconomic indicators in Hodges et al. (2017), stock issuances versus repurchases in Greenwood and Hanson (2012), and aggregate fundamental data in Dichtl et al. (2019)). A method for building reasonable timing strategies for long-only portfolios with sustainable transaction costs is laid out in Leippold and Rüegg (2020). In ML-based factor investing, it is possible to resort to more granularity by combining firm-specific attributes with large-scale economic data, as we explain in Section 4.7.2. 3.4.3 The green factors The demand for ethical financial products has risen sharply during the 2010 decade, leading to the creation of funds dedicated to socially responsible investing (SRI - see Camilleri (2020)). Though this phenomenon is not really new (Schueth (2003), Hill et al. (2007)), its acceleration has prompted research about whether or not characteristics related to ESG criteria (environment, social, governance) are priced. Dozens and possibly even hundreds of papers have been devoted to this question, but no consensus has been reached. Increasingly, researchers study the financial impact of climate change (see Bernstein, Gustafson, and Lewis (2019), Hong, Li, and Xu (2019) and Hong, Karolyi, and Scheinkman (2020)) and the societal push for responsible corporate behavior (Fabozzi (2020), Kurtz (2020)). We gather below a very short list of papers that suggests conflicting results: favorable: ESG investing works (Kempf and Osthoff (2007), Cheema-Fox et al. (2020)), can work (Nagy, Kassam, and Lee (2016), Alessandrini and Jondeau (2020)), or can at least be rendered efficient (Branch and Cai (2012)). A large meta-study reports overwhelmingly favorable results (Friede, Busch, and Bassen (2015)), but of course, these could well stem from the publication bias towards positive results. unfavorable: Ethical investing is not profitable according to Adler and Kritzman (2008) and Blitz and Swinkels (2020). An ESG factor should be long unethical firms and short ethical ones (Lioui (2018)). mixed: ESG investing may be beneficial globally but not locally (Chakrabarti and Sen (2020)). 
Portfolios relying on ESG screening do not significantly outperform those with no screening but are subject to lower levels of volatility (Gibson et al. (2020), Gougler and Utz (2020)). As is often the case, the devil is in the details, and results depend on whether E, S or G is used (Bruder et al. (2019)). On top of these conflicting results, several articles point towards complexities in the measurement of ESG. Depending on the chosen criteria and on the data provider, results can change drastically (see Galema, Plantinga, and Scholtens (2008), Berg, Koelbel, and Rigobon (2020) and Atta-Darkua et al. (2020)). We end this short section by noting that ESG criteria can of course be directly integrated into ML models, as is done for instance in Franco et al. (2020). 3.5 The links with machine learning Given the exponential increase in data availability, the obvious temptation of any asset manager is to try to infer future returns from the abundance of attributes available at the firm level. We allude to classical data like accounting ratios and to alternative data, such as sentiment. This task is precisely the aim of Machine Learning. Given a large set of predictor variables (\\(\\mathbf{X}\\)), the goal is to predict a proxy for future performance \\(\\mathbf{y}\\) through a model of the form (2.1). If fundamental data (accounting ratios, earnings, relative valuations, etc.) help predict returns, then one refinement is to predict this fundamental data upfront. This may make it possible to anticipate changes or gain informational edges. Recent contributions in this direction include Cao and You (2020) and Huang et al. (2020). Some earlier attempts to explain and predict returns with firm attributes had already been made (e.g., Brandt, Santa-Clara, and Valkanov (2009), Hjalmarsson and Manchev (2012), Ammann, Coqueret, and Schade (2016), DeMiguel et al. (2020) and McGee and Olmo (2020)), but originally without any ML intent or focus. In retrospect, these approaches do share some links with ML tools. The general formulation is the following. At time \\(T\\), the agent or investor seeks to solve the following program: \\[\\begin{align*} \\underset{\\boldsymbol{\\theta}_T}{\\max} \\ \\mathbb{E}_T\\left[ u(r_{p,T+1})\\right] = \\underset{\\boldsymbol{\\theta}_T}{\\max} \\ \\mathbb{E}_T\\left[ u\\left(\\left(\\bar{\\textbf{w}}_T+\\textbf{x}_T\\boldsymbol{\\theta}_T\\right)'\\textbf{r}_{T+1}\\right)\\right] , \\end{align*}\\] where \\(u\\) is some utility function and \\(r_{p,T+1}=\\left(\\bar{\\textbf{w}}_T+\\textbf{x}_T\\boldsymbol{\\theta}_T\\right)'\\textbf{r}_{T+1}\\) is the return of the portfolio, which is defined as a benchmark \\(\\bar{\\textbf{w}}_T\\) plus some deviations from this benchmark that are a linear function of features \\(\\textbf{x}_T\\boldsymbol{\\theta}_T\\). The above program may be subject to some external constraints (e.g., to limit leverage). In practice, the vector \\(\\boldsymbol{\\theta}_T\\) must be estimated using past data (from \\(T-\\tau\\) to \\(T-1\\)): the agent seeks the solution of \\[\\begin{align} \\tag{3.5} \\underset{\\boldsymbol{\\theta}_T}{\\text{max}} \\ \\frac{1}{\\tau} \\sum_{t=T-\\tau}^{T-1} u \\left( \\sum_{i=1}^{N_T}\\left(\\bar{w}_{i,t}+ \\boldsymbol{\\theta}'_T \\textbf{x}_{i,t} \\right)r_{i,t+1} \\right) \\end{align}\\] on a sample of size \\(\\tau\\), where \\(N_T\\) is the number of assets in the universe. 
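For illustration purposes only, the chunk below sketches a brute-force version of program (3.5) on simulated data, with a power (CRRA) utility and an equally-weighted benchmark. All names (x_sim, r_sim) and dimensions are arbitrary placeholders; this is not the original implementation of the papers cited above. 
set.seed(42) 
N <- 50                                                        # Number of assets 
T_samp <- 120                                                  # Number of past dates 
K <- 2                                                         # Number of characteristics 
x_sim <- array(rnorm(N * T_samp * K), dim = c(N, T_samp, K))   # Simulated (standardized) characteristics 
r_sim <- matrix(rnorm(N * T_samp, 0.01, 0.05), N, T_samp)      # Simulated subsequent returns 
u <- function(r, gamma = 5) (1 + r)^(1 - gamma) / (1 - gamma)  # CRRA utility function 
avg_util <- function(theta){                                   # Sample counterpart of (3.5) 
    w_bench <- rep(1 / N, N)                                   # Equally-weighted benchmark 
    utils <- sapply(1:T_samp, function(t){ 
        w <- w_bench + (x_sim[ , t, ] %*% theta) / N           # Benchmark + characteristic-driven tilts 
        u(sum(w * r_sim[ , t]))                                # Utility of the portfolio return 
    }) 
    mean(utils) 
} 
optim(par = rep(0, K), fn = function(theta) -avg_util(theta))$par   # optim() minimizes, hence the sign 
In real applications, the characteristics would be lagged firm attributes and the returns those realized over the following period, exactly as in Equation (3.5). 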
The above formulation can be viewed as a learning task in which the parameters are chosen such that the reward (average return) is maximized. 3.5.1 A short list of recent references Independent of a characteristics-based approach, ML applications in finance have blossomed, initially working with price data only and later on integrating firm characteristics as predictors. We cite a few references below, grouped by methodological approach: penalized quadratic programming: Goto and Xu (2015), Ban, El Karoui, and Lim (2016) and Perrin and Roncalli (2019), regularized predictive regressions: Rapach, Strauss, and Zhou (2013) and Alexander Chinco, Clark-Joseph, and Ye (2019), support vector machines: Cao and Tay (2003) (and the references therein), model comparison and/or aggregation: Kim (2003), Huang, Nakamori, and Wang (2005), Matı́as and Reboredo (2012), Reboredo, Matı́as, and Garcia-Rubio (2012), Dunis et al. (2013), Gu, Kelly, and Xiu (2020b) and Guida and Coqueret (2018b). The latter two more recent articles work with a large cross-section of characteristics. We provide more detailed lists for tree-based methods, neural networks and reinforcement learning techniques in Chapters 6, 7 and 16, respectively. Moreover, we refer to Ballings et al. (2015) for a comparison of classifiers and to Henrique, Sobreiro, and Kimura (2019) and Bustos and Pomares-Quimbaya (2020) for surveys on ML-based forecasting techniques. 3.5.2 Explicit connections with asset pricing models The first and obvious link between factor investing and asset pricing is (average) return prediction. The main canonical academic reference is Gu, Kelly, and Xiu (2020b). Let us first write the general equation and then comment on it: \\[\\begin{equation} \\tag{3.6} r_{t+1,n}=g(\\textbf{x}_{t,n}) + \\epsilon_{t+1}. \\end{equation}\\] The interesting discussion lies in the differences between the above model and that of Equation (3.1). The first obvious difference is the introduction of the nonlinear function \\(g\\): indeed, there is no reason (beyond simplicity and interpretability) why we should restrict the model to linear relationships. One early reference for nonlinearities in asset pricing kernels is Bansal and Viswanathan (1993). More importantly, the second difference between (3.6) and (3.1) is the shift in the time index. Indeed, from an investor’s perspective, the interest is to be able to predict some information about the structure of the cross-section of assets. Explaining asset returns with synchronous factors is not useful because the realization of factor values is not known in advance. Hence, if one seeks to extract value from the model, there needs to be a time interval between the observation of the state space (which we call \\(\\textbf{x}_{t,n}\\)) and the occurrence of the returns. Once the model \\(\\hat{g}\\) is estimated, the time-\\(t\\) (measurable) value \\(g(\\textbf{x}_{t,n})\\) will give a forecast for the (average) future returns. These predictions can then serve as signals in the crafting of portfolio weights (see Chapter 12 for more on that topic). While most studies do work with returns on the l.h.s. of (3.6), there is no reason why other indicators should not be used. Returns are straightforward and simple to compute, but they could very well be replaced by more sophisticated metrics, like the Sharpe ratio, for instance. The firms’ features would then be used to predict a risk-adjusted performance rather than simple returns. 
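Purely as a teaser for the remainder of the book (and assuming the data_ml dataset described in Chapter 1 is loaded), a naive instance of Equation (3.6) can be obtained by letting a regression tree play the role of \\(g\\); the choice of features below is arbitrary and no train/test split is performed, so this is in no way a serious predictive exercise. 
library(rpart)                                               # Regression trees (detailed in Chapter 6) 
fit_g <- rpart(R1M_Usd ~ Mkt_Cap_12M_Usd + Pb + Vol1Y_Usd,   # Small, arbitrary feature set 
               data = data_ml,                               # Features at date t, return over the next month 
               cp = 0.001)                                   # Loose complexity parameter, for illustration 
head(predict(fit_g, data_ml))                                # In-sample fitted values of g(x) 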
Beyond the explicit form of Equation (3.6), several other ML-related tools can also be used to estimate asset pricing models. This can be achieved in several ways, some of which we list below. First, one mainstream problem in asset pricing is to characterize the stochastic discount factor (SDF) \\(M_t\\), which satisfies \\(\\mathbb{E}_t[M_{t+1}(r_{t+1,n}-r_{t+1,f})]=0\\) for any asset \\(n\\) (see Cochrane (2009)). This equation is a natural playing field for the generalized method of moment (Hansen (1982)): \\(M_t\\) must be such that \\[\\begin{equation} \\tag{3.7} \\mathbb{E}[M_{t+1}R_{t+1,n}g(V_t)]=0, \\end{equation}\\] where the instrumental variables \\(V_t\\) are \\(\\mathcal{F}_t\\)-measurable (i.e., are known at time \\(t\\)) and the capital \\(R_{t+1,n}\\) denotes the excess return of asset \\(n\\). In order to reduce and simplify the estimation problem, it is customary to define the SDF as a portfolio of assets (see chapter 3 in Back (2010)). In Luyang Chen, Pelger, and Zhu (2020), the authors use a generative adversarial network (GAN, see Section 7.6.1) to estimate the weights of the portfolios that are the closest to satisfy (3.7) under a strongly penalizing form. A second approach is to try to model asset returns as linear combinations of factors, just as in (3.1). We write in compact notation \\[r_{t,n}=\\alpha_n+\\boldsymbol{\\beta}_{t,n}'\\textbf{f}_t+\\epsilon_{t,n},\\] and we allow the loadings \\(\\boldsymbol{\\beta}_{t,n}\\) to be time-dependent. The trick is then to introduce the firm characteristics in the above equation. Traditionally, the characteristics are present in the definition of factors (as in the seminal definition of Fama and French (1993)). The decomposition of the return is made according to the exposition of the firm’s return to these factors constructed according to market size, accounting ratios, past performance, etc. Given the exposures, the performance of the stock is attributed to particular style profiles (e.g., small stock, or value stock, etc.). Habitually, the factors are heuristic portfolios constructed from simple rules like thresholding. For instance, firms below the 1/3 quantile in book-to-market are growth firms and those above the 2/3 quantile are the value firms. A value factor can then be defined by the long-short portfolio of these two sets, with uniform weights. Note that Fama and French (1993) use a more complex approach which also takes market capitalization into account both in the weighting scheme and also in the composition of the portfolios. One of the advances enabled by machine learning is to automate the construction of the factors. It is for instance the approach of Feng, Polson, and Xu (2019). Instead of building the factors heuristically, the authors optimize the construction to maximize the fit in the cross-section of returns. The optimization is performed via a relatively deep feed-forward neural network and the feature space is lagged so that the relationship is indeed predictive, as in Equation (3.6). Theoretically, the resulting factors help explain a substantially larger proportion of the in-sample variance in the returns. The prediction ability of the model depends on how well it generalizes out-of-sample. A third approach is that of Kelly, Pruitt, and Su (2019) (though the statistical treatment is not machine learning per se).9 Their idea is the opposite: factors are latent (unobserved) and it is the betas (loadings) that depend on the characteristics. 
This allows many degrees of freedom because in \\(r_{t,n}=\\alpha_n+(\\boldsymbol{\\beta}_{t,n}(\\textbf{x}_{t-1,n}))'\\textbf{f}_t+\\epsilon_{t,n},\\) only the characteristics \\(\\textbf{x}_{t-1,n}\\) are known and both the factors \\(\\textbf{f}_t\\) and the functional forms \\(\\boldsymbol{\\beta}_{t,n}(\\cdot)\\) must be estimated. In their article, Kelly, Pruitt, and Su (2019) work with a linear form, which is naturally more tractable. Lastly, a fourth approach (introduced in Gu, Kelly, and Xiu (2020a)) goes even further and combines two neural network architectures. The first neural network takes characteristics \\(\\textbf{x}_{t-1}\\) as inputs and generates factor loadings \\(\\boldsymbol{\\beta}_{t-1}(\\textbf{x}_{t-1})\\). The second network transforms returns \\(\\textbf{r}_t\\) into factor values \\(\\textbf{f}_t(\\textbf{r}_t)\\) (in Feng, Polson, and Xu (2019)). The aggregate model can then be written: \\[\\begin{equation} \\tag{3.8} \\textbf{r}_t=\\boldsymbol{\\beta}_{t-1}(\\textbf{x}_{t-1})'\\textbf{f}_t(\\textbf{r}_t)+\\boldsymbol{\\epsilon}_t. \\end{equation}\\] The above specification is quite special because the output (on the l.h.s.) is also present as input (in the r.h.s.). In machine learning, autoencoders (see Section 7.6.2) share the same property. Their aim, just like in principal component analysis, is to find a parsimonious nonlinear representation form for a dataset (in this case, returns). In Equation (3.8), the input is \\(\\textbf{r}_t\\) and the output function is \\(\\boldsymbol{\\beta}_{t-1}(\\textbf{x}_{t-1})'\\textbf{f}_t(\\textbf{r}_t)\\). The aim is to minimize the difference between the two just as is any regression-like model. Autoencoders are neural networks which have outputs as close as possible to the inputs with an objective of dimensional reduction. The innovation in Gu, Kelly, and Xiu (2020a) is that the pure autoencoder part is merged with a vanilla perceptron used to model the loadings. The structure of the neural network is summarized below. \\[\\left. \\begin{array}{rl} \\text{returns } (\\textbf{r}_t) & \\overset{NN_1}{\\longrightarrow} \\quad \\text{ factors } (\\textbf{f}_t=NN_1(\\textbf{r}_t)) \\\\ \\text{characteristics } (\\textbf{x}_{t-1}) & \\overset{NN_2}{\\longrightarrow} \\quad \\text{ loadings } (\\boldsymbol{\\beta}_{t-1}=NN_2(\\textbf{x}_{t-1})) \\end{array} \\right\\} \\longrightarrow \\text{ returns } (r_t)\\] A simple autoencoder would consist of only the first line of the model. This specification is discussed in more details in Section 7.6.2. As a conclusion of this chapter, it appears undeniable that the intersection between the two fields of asset pricing and machine learning offers a rich variety of applications. The literature is already exhaustive and it is often hard to disentangle the noise from the great ideas in the continuous flow of publications on these topics. Practice and implementation is the only way forward to extricate value from hype. This is especially true because agents often tend to overestimate the role of factors in the allocation decision process of real-world investors (see Alex Chinco, Hartzmark, and Sussman (2019) and Castaneda and Sabat (2019)). 3.6 Coding exercises Compute annual returns of the growth versus value portfolios, that is, the average return of firms with above median price-to-book ratio (the variable is called `Pb’ in the dataset). Same exercise, but compute the monthly returns and plot the value (through time) of the corresponding portfolios. 
Instead of a unique threshold, compute simply sorted portfolios based on quartiles of market capitalization. Compute their annual returns and plot them. References "],["Data.html", "Chapter 4 Data preprocessing 4.1 Know your data 4.2 Missing data 4.3 Outlier detection 4.4 Feature engineering 4.5 Labelling 4.6 Handling persistence 4.7 Extensions 4.8 Additional code and results 4.9 Coding exercises", " Chapter 4 Data preprocessing The methods we describe in this chapter are driven by financial applications. For an introduction to non-financial data processing, we recommend two references: chapter 3 from the general purpose ML book by Boehmke and Greenwell (2019) and the monograph on this dedicated subject by Kuhn and Johnson (2019). 4.1 Know your data The first step, as in any quantitative study, is obviously to make sure the data is trustworthy, i.e., comes from a reliable provider (a minima). The landscape in financial data provision is vast to say the least: some providers are well established (e.g., Bloomberg, Thomson-Reuters, Datastream, CRSP, Morningstar), some are more recent (e.g., Capital IQ, Ravenpack) and some focus on alternative data niches (see https://alternativedata.org/data-providers/ for an exhaustive list). Unfortunately, and to the best of our knowledge, no study has been published that evaluates a large spectrum of these providers in terms of data reliability. The second step is to have a look at summary statistics: ranges (minimum and maximum values), and averages and medians. Histograms or plots of time series carry of course more information but cannot be analyzed properly in high dimensions. They are nonetheless sometimes useful to track local patterns or errors for a given stock and/or a particular feature. Beyond first order moments, second order quantities (variances and covariances/correlations) also matter because they help spot colinearities. When two features are highly correlated, problems may arise in some models (e.g., simple regressions, see Section 15.1). Often, the number of predictors is so large that it is unpractical to look at these simple metrics. A minimal verification is recommended. To further ease the analysis: focus on a subset of predictors, e.g., the ones linked to the most common factors (market-capitalization, price-to-book or book-to-market, momentum (past returns), profitability, asset growth, volatility); track outliers in the summary statistics (when the maximum/median or median/minimum ratios seem suspicious). Below, in Figure 4.1, we show a box plot that illustrates the distribution of correlations between features and the one month ahead return. The correlations are computed on a date-by-date basis, over the whole cross-section of stocks. They are mostly located close to zero, but some dates seem to experience extreme shifts (outliers are shown with black circles). The market capitalization has the median which is the most negative while volatility is the only predictor with positive median correlation (this particular example seems to refute the low risk anomaly). data_ml %>% dplyr::select(c(features_short, "R1M_Usd", "date")) %>% # Keep few features, label & date group_by(date) %>% # Group: dates! 
summarise_all(funs(cor(.,R1M_Usd))) %>% # Compute correlations dplyr::select(-R1M_Usd) %>% # Remove label gather(key = Predictor, value = value, -date) %>% # Put in tidy format ggplot(aes(x = Predictor, y = value, color = Predictor)) + # Plot geom_boxplot(outlier.colour = "black") + coord_flip() + theme(aspect.ratio = 0.6) + xlab(element_blank()) FIGURE 4.1: Boxplot of correlations with the 1M forward return (label). More importantly, when seeking to work with supervised learning (as we will do most of the time), the link of some features with the dependent variable can be further characterized by the smoothed conditional average because it shows how the features impact the label. The use of the conditional average has a deep theoretical grounding. Suppose there is only one feature \\(X\\) and that we seek a model \\(Y=f(X)+\\text{error}\\), where variables are real-valued. The function \\(f\\) that minimizes the average squared error \\(\\mathbb{E}[(Y-f(X))^2]\\) is the so-called regression function (see Section 2.4 in Hastie, Tibshirani, and Friedman (2009)): \\[\\begin{equation} \\tag{4.1} f(x)=\\mathbb{E}[Y|X=x]. \\end{equation}\\] In Figure 4.2, we plot two illustrations of this function when the dependent variable (\\(Y\\)) is the one month ahead return. The first one pertains to the average market capitalization over the past year and the second to the volatility over the past year as well. Both predictors have been uniformized (see Section 4.4.2 below) so that their values are uniformly distributed in the cross-section of assets for any given time period. Thus, the range of features is \\([0,1]\\) and is shown on the \\(x\\)-axis of the plot. The grey corridors around the lines show 95% level confidence interval for the computation of the mean. Essentially, it is narrow when both (i) many data points are available and (ii) these points are not too dispersed. data_ml %>% # From dataset: ggplot(aes(y = R1M_Usd)) + # Plot geom_smooth(aes(x = Mkt_Cap_12M_Usd, color = "Market Cap")) + # Cond. Exp. Mkt_cap geom_smooth(aes(x = Vol1Y_Usd, color = "Volatility")) + # Cond. Exp. Vol scale_color_manual(values=c("#F87E1F", "#0570EA")) + # Change color coord_fixed(10) + # Change x/y ratio labs(color = "Predictor") + xlab(element_blank()) FIGURE 4.2: Conditional expectations: average returns as smooth functions of features. The two variables have a close to monotonic impact on future returns. Returns, on average, decrease with market capitalization (thereby corroborating the so-called size effect). The reverse pattern is less pronounced for volatility: the curve is rather flat for the first half of volatility scores and progressively increases, especially over the last quintile of volatility values (thereby contradicting the low-volatility anomaly). One important empirical property of features is autocorrelation (or absence thereof). A high level of autocorrelation for one predictor makes it plausible to use simple imputation techniques when some data points are missing. But autocorrelation is also important when moving towards prediction tasks and we discuss this issue shortly below in Section 4.6. In Figure 4.3, we build the histogram of autocorrelations, computed stock-by-stock and feature-by-feature. 
autocorrs <- data_ml %>% # From dataset: dplyr::select(c("stock_id", features)) %>% # Keep ids & features gather(key = feature, value = value, -stock_id) %>% # Put in tidy format group_by(stock_id, feature) %>% # Group summarize(acf = acf(value, lag.max = 1, plot = FALSE)$acf[2]) # Compute ACF autocorrs %>% ggplot(aes(x = acf)) + xlim(-0.1,1) + # Plot geom_histogram(bins = 60) FIGURE 4.3: Histogram of sample feature autocorrelations. Given the large number of values to evaluate, the above chunk is quite time-consuming. The output shows that predictors are highly autocorrelated: most of them have a first order autocorrelation above 0.80. 4.2 Missing data Similarly to any empirical discipline, portfolio management is bound to face missing data issues. The topic is well known and several books detail solutions to this problem (e.g., Allison (2001), Enders (2010), Little and Rubin (2014) and Van Buuren (2018)). While researchers continuously propose new methods to cope with absent points (Honaker and King (2010) or Che et al. (2018) to cite but a few), we believe that a simple, heuristic treatment is usually sufficient as long as some basic cautious safeguards are enforced. First of all, there are mainly two ways to deal with missing data: removal and imputation. Removal is agnostic but costly, especially if one whole instance is eliminated because of only one missing feature value. Imputation is often preferred but relies on some underlying and potentially erroneous assumption. A simplified classification of imputation is the following: A basic imputation choice is the median (or mean) of the feature for the stock over the past available values. If there is a trend in the time series, this will nonetheless alter the trend. Relatedly, this method can be forward-looking, unless the training and testing sets are treated separately. In time series contexts with views towards backtesting, the most simple imputation comes from previous values: if \\(x_t\\) is missing, replace it with \\(x_{t-1}\\). This makes sense most of the time because past values are all that is available and are by definition backward-looking. However, in some particular cases, this may be a very bad choice (see words of caution below). Medians and means can also be computed over the cross-section of assets. This roughly implies that the missing feature value will be relocated in the bulk of observed values. When many values are missing, this creates an atom in the distribution of the feature and alters the original distribution. One advantage is that this imputation is not forward-looking. Many techniques rely on some modelling assumptions for the data generating process. We refer to nonparametric approaches (Stekhoven and Bühlmann (2011) and Shah et al. (2014), which rely on random forests, see Chapter 6), Bayesian imputation (Schafer (1999)), maximum likelihood approaches (Enders (2001), Enders (2010)), interpolation or extrapolation and nearest neighbor algorithms (Garcı́a-Laencina et al. (2009)). More generally, the four books cited at the begining of the subsection detail many such imputation processes. Advanced techniques are much more demanding computationally. A few words of caution: Interpolation should be avoided at all cost. Accounting values or ratios that are released every quarter must never be linearly interpolated for the simple reason that this is forward-looking. 
If numbers are disclosed in January and April, then interpolating February and March requires the knowledge of the April figure, which, in live trading will not be known. Resorting to past values is a better way to go. Nevertheless, there are some feature types for which imputation from past values should be avoided. First of all, returns should not be replicated. By default, a superior choice is to set missing return indicators to zero (which is often close to the average or the median). A good indicator that can help the decision is the persistence of the feature through time. If it is highly autocorrelated (and the time series plot create a smooth curve, like for market capitalization), then imputation from the past can make sense. If not, then it should be avoided. There are some cases that can require more attention. Let us consider the following fictitious sample of dividend yield: TABLE 4.1: Challenges with chronological imputation. Date Original yield Replacement value 2015-02 NA preceding (if it exists) 2015-03 0.02 untouched (none) 2015-04 NA 0.02 (previous) 2015-05 NA 0.02 (previous) 2015-06 NA <= Problem! In this case, the yield is released quarterly, in March, June, September, etc. But in June, the value is missing. The problem is that we cannot know if it is missing because of a genuine data glitch, or because the firm simply did not pay any dividends in June. Thus, imputation from past value may be erroneous here. There is no perfect solution but a decision must nevertheless be taken. For dividend data, three options are: Keep the previous value. In R, the function na.locf() from the zoo package is incredibly efficient for this task. Extrapolate from previous observations (this is very different from interpolation): for instance, evaluate a trend on past data and pursue that trend. Set the value to zero. This is tempting but may be sub-optimal due to dividend smoothing practices from executives (see for instance Leary and Michaely (2011) and Chen, Da, and Priestley (2012) for details on the subject). For persistent time series, the first two options are probably better. Tests can be performed to evaluate the relative performance of each option. It is also important to remember these design choices. There are so many of them that they are easy to forget. Keeping track of them is obviously compulsory. In the ML pipeline, the scripts pertaining to data preparation are often key because they do not serve only once! 4.3 Outlier detection The topic of outlier detection is also well documented and has its own surveys (Hodge and Austin (2004), Chandola, Banerjee, and Kumar (2009) and Gupta et al. (2014)) and a few dedicated books (Aggarwal (2013) and Rousseeuw and Leroy (2005), though the latter is very focused on regression analysis). Again, incredibly sophisticated methods may require a lot of efforts for possibly limited gain. Simple heuristic methods, as long as they are documented in the process, may suffice. They often rely on ‘hard’ thresholds: for one given feature (possibly filtered in time), any point outside the interval \\([\\mu-m\\sigma, \\mu+m\\sigma]\\) can be deemed an outlier. Here \\(\\mu\\) is the mean of the sample and \\(\\sigma\\) the standard deviation. The multiple value \\(m\\) usually belongs to the set \\(\\{3, 5, 10\\}\\), which is of course arbitrary. likewise, if the largest value is above \\(m\\) times the second-to-largest, then it can also be classified as an outlier (the same reasoning applied for the other side of the tail). 
Finally, for a given small threshold \\(q\\), any value outside the \\([q,1-q]\\) quantile range can be considered an outlier.
This latter idea was popularized by winsorization. Winsorizing amounts to setting to \\(x^{(q)}\\) all values below \\(x^{(q)}\\) and to \\(x^{(1-q)}\\) all values above \\(x^{(1-q)}\\). The winsorized variable \\(\\tilde{x}\\) is: \\[\\tilde{x}_i=\\left\\{\\begin{array}{ll} x_i & \\text{ if } x_i \\in [x^{(q)},x^{(1-q)}] \\quad \\text{ (unchanged)}\\\\ x^{(q)} & \\text{ if } x_i < x^{(q)} \\\\ x^{(1-q)} & \\text{ if } x_i > x^{(1-q)} \\end{array} \\right. .\\] The range for \\(q\\) is usually \\((0.5\\%, 5\\%)\\) with 1% and 2% being the most often used. The winsorization stage must be performed on a feature-by-feature and a date-by-date basis. However, keeping a time series perspective is also useful. For instance, an $800B market capitalization may seem out of range, except when looking at the history of Apple’s capitalization. We conclude this subsection by recalling that true outliers (i.e., extreme points that are not due to data extraction errors) are valuable because they are likely to carry important information.
4.4 Feature engineering
Feature engineering is a very important step of the portfolio construction process. Computer scientists often refer to the saying “garbage in, garbage out”. It is thus paramount to prevent the ML engine of the allocation from being trained on ill-designed variables. We invite the interested reader to have a look at the recent work of Kuhn and Johnson (2019) on this topic. The (shorter) academic reference is Guyon and Elisseeff (2003).
4.4.1 Feature selection
The first step is selection. Given a large set of predictors, it seems a sound idea to filter out unwanted or redundant exogenous variables. Heuristically, simple methods include:
computing the correlation matrix of all features and making sure that no (absolute) value is above a threshold (0.7 is a common value) so that redundant variables do not pollute the learning engine;
carrying out a linear regression and removing the non-significant variables (e.g., those with \\(p\\)-value above 0.05);
performing a clustering analysis over the set of features and retaining only one feature within each cluster (see Chapter 15).
These methods are somewhat reductive and overlook nonlinear relationships. Another approach would be to fit a decision tree (or a random forest) and retain only the features that have a high variable importance. These methods will be developed in Chapter 6 for trees and Chapter 13 for variable importance.
4.4.2 Scaling the predictors
The premise of the need to pre-process the data comes from the large variety of scales in financial data: returns are most of the time smaller than one in absolute value; stock volatility lies usually between 5% and 80%; market capitalization is expressed in million or billion units of a particular currency; accounting values as well; accounting ratios can have inhomogeneous units; synthetic attributes like sentiment also have their idiosyncrasies. While it is widely considered that monotonic transformations of the features have a marginal impact on prediction outcomes, Galili and Meilijson (2016) show that this is not always the case (see also Section 4.8.2). Hence, the choice of normalization may in fact very well matter.
If we write \\(x_i\\) for the raw input and \\(\\tilde{x}_i\\) for the transformed data, common scaling practices include:
standardization: \\(\\tilde{x}_i=(x_i-m_x)/\\sigma_x\\), where \\(m_x\\) and \\(\\sigma_x\\) are the mean and standard deviation of \\(x\\), respectively;
min-max rescaling over [0,1]: \\(\\tilde{x}_i=(x_i-\\min(\\mathbf{x}))/(\\max(\\mathbf{x})-\\min(\\mathbf{x}))\\);
min-max rescaling over [-1,1]: \\(\\tilde{x}_i=2\\frac{x_i-\\min(\\mathbf{x})}{\\max(\\mathbf{x})-\\min(\\mathbf{x})}-1\\);
uniformization: \\(\\tilde{x}_i=F_\\mathbf{x}(x_i)\\), where \\(F_\\mathbf{x}\\) is the empirical c.d.f. of \\(\\mathbf{x}\\). In this case, the vector \\(\\tilde{\\mathbf{x}}\\) is defined to follow a uniform distribution over [0,1].
Sometimes, it is possible to apply a logarithmic transform to variables with both large values (market capitalization) and large outliers. The scaling can come after this transformation. Obviously, this technique is prohibited for features with negative values. It is often advised to scale inputs so that they range in [0,1] before sending them through the training of neural networks for instance. The dataset that we use in this book is based on variables that have been uniformized: for each point in time, the cross-sectional distribution of each feature is uniform over the unit interval. In factor investing, the scaling of features must be operated separately for each date and each feature. This point is critical. It makes sure that for every rebalancing date, the predictors will have a similar shape and carry information on the cross-section of stocks. Uniformization is sometimes presented differently: for a given characteristic and time, characteristic values are ranked and the rank is then divided by the number of non-missing points. This is done in Freyberger, Neuhierl, and Weber (2020) for example. In Kelly, Pruitt, and Su (2019), the authors perform this operation but then subtract 0.5 from all features so that their values lie in [-0.5,0.5]. Scaling features across dates should be proscribed. Take for example the case of market capitalization. In the long run (market crashes notwithstanding), this feature increases through time. Thus, scaling across dates would lead to small values at the beginning of the sample and large values at the end of the sample. This would completely alter and dilute the cross-sectional content of the features.
4.5 Labelling
4.5.1 Simple labels
There are several ways to define labels when constructing portfolio policies. Of course, the end goal is the portfolio weight, but it is rarely considered as the best choice for the label.10 Usual labels in factor investing are the following:
raw asset returns;
future relative returns (versus some benchmark: market-wide index, or sector-based portfolio for instance). One simple choice is to take returns minus a cross-sectional mean or median;
the probability of positive return (or of return above a specified threshold);
the probability of outperforming a benchmark (computed over a given time frame);
the binary version of the above: YES (outperforming) versus NO (underperforming);
risk-adjusted versions of the above: Sharpe ratios, information ratios, MAR or CALMAR ratios (see Section 12.3).
When creating binary variables, it is often tempting to create a test that compares returns to zero (profitable versus non-profitable). This is not optimal because it is very much time-dependent.
In good times, many assets will have positive returns, while in market crashes, few will experience positive returns, thereby creating very unbalanced classes. It is a better idea to split the returns in two by comparing them to their time-\\(t\\) median (or average). In this case, the indicator is relative and the two classes are much more balanced. As we will discuss later in this chapter, these choices still leave room for additional degrees of freedom. Should the labels be rescaled, just like features are processed? What is the best time horizon on which to compute performance metrics?
4.5.2 Categorical labels
In a typical ML analysis, when \\(y\\) is a proxy for future performance, the ML engine will try to minimize some distance between the predicted value and the realized values. For mathematical convenience, the sum of squared errors (\\(L^2\\) norm) is used because it has the simplest derivative and makes gradient descent accessible and easy to compute. Sometimes, it can be interesting not to focus on raw performance proxies, like returns or Sharpe ratios, but on discrete investment decisions, which can be derived from these proxies. A simple example (decision rule) is the following: \\[\\begin{equation} \\tag{4.2} y_{t,i}=\\left\\{ \\begin{array}{rll} -1 & \\text{ if } & \\hat{r}_{t,i} < r_- \\\\ 0 & \\text{ if } & \\hat{r}_{t,i} \\in [r_-,r_+] \\\\ +1 & \\text{ if } & \\hat{r}_{t,i} > r_+ \\\\ \\end{array} \\right., \\end{equation}\\] where \\(\\hat{r}_{t,i}\\) is the performance proxy (e.g., returns or Sharpe ratio) and \\(r_\\pm\\) are the decision thresholds. When the predicted performance is below \\(r_-\\), the decision is -1 (e.g., sell), when it is above \\(r_+\\), the decision is +1 (e.g., buy) and when it is in the middle (the model is neither very optimistic nor very pessimistic), then the decision is neutral (e.g., hold). The performance proxy can of course be relative to some benchmark so that the decision is directly related to this benchmark. It is often advised that the thresholds \\(r_\\pm\\) be chosen such that the three categories are relatively balanced, that is, so that they end up having a comparable number of instances. In this case, the final output can be considered as categorical or numerical because it belongs to an important subgroup of categorical variables: the ordered categorical (ordinal) variables. If \\(y\\) is taken as a number, the usual regression tools apply. When \\(y\\) is treated as a non-ordered (nominal) categorical variable, then a new layer of processing is required because ML tools only work with numbers. Hence, the categories must be recoded into digits. The mapping that is most often used is called ‘one-hot encoding’. The vector of classes is split into a sparse matrix in which each column is dedicated to one class. The matrix is filled with zeros and ones. A one is allocated to the column corresponding to the class of the instance. We provide a simple illustration in the table below.
TABLE 4.2: Concise example of one-hot encoding.
Initial data    One-hot encoding
Position        Sell   Hold   Buy
buy             0      0      1
buy             0      0      1
hold            0      1      0
sell            1      0      0
buy             0      0      1
In classification tasks, the output has a larger dimension. For each instance, it gives the probability of belonging to each class assigned by the model. As we will see in Chapters 6 and 7, this is easily handled via the softmax function. From the standpoint of allocation, handling categorical predictions is not necessarily easy.
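Before turning to this question, we provide a minimal sketch of one-hot encoding in base R. The vector of positions below is a toy input mirroring Table 4.2; the encoding routine used in practice may of course differ.
positions <- factor(c("buy", "buy", "hold", "sell", "buy"),     # Toy classes, as in Table 4.2
                    levels = c("sell", "hold", "buy"))          # Impose the column order
one_hot <- model.matrix(~ positions - 1)                        # One indicator column per class (no intercept)
colnames(one_hot) <- levels(positions)                          # Clean column names
one_hot                                                         # A 5 x 3 matrix filled with zeros and ones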
For long-short portfolios, plus or minus one signals can provide the sign of the position. For long-only portfolios, there are two possible solutions: (i) work with binary classes (in versus out of the portfolio) or (ii) adapt weights according to the prediction: zero weight for a -1 prediction, 0.5 weight for a 0 prediction and full weight for a +1 prediction. Weights are then of course normalized so as to comply with the budget constraint.
4.5.3 The triple barrier method
We conclude this section with an advanced labelling technique mentioned in De Prado (2018). The idea is to consider the full dynamics of a trading strategy and not a simple performance proxy. The rationale for this extension is that often money managers implement P&L triggers that cash in when gains are sufficient or opt out to stop their losses. Upon inception of the strategy, three barriers are fixed (see Figure 4.4): one above the current level of the asset (magenta line), which measures a reasonable expected profit; one below the current level of the asset (cyan line), which acts as a stop-loss signal to prevent large negative returns; and finally, one that fixes the horizon of the strategy after which it will be terminated (black line). If the strategy hits the first (resp. second) barrier, the output is +1 (resp. -1), and if it hits the last barrier, the output is equal to zero or to some linear interpolation (between -1 and +1) that represents the position of the terminal value relative to the two horizontal barriers. Computationally, this method is much more demanding, as it evaluates a whole trajectory for each instance. It is nonetheless considered as more realistic because trading strategies are often accompanied by automatic triggers such as stop-losses.
FIGURE 4.4: Illustration of the triple barrier method.
4.5.4 Filtering the sample
One of the main challenges in Machine Learning is to extract as much signal as possible. By signal, we mean patterns that will hold out-of-sample. Intuitively, it may seem reasonable to think that the more data we gather, the more signal we can extract. This is in fact false in all generality because more data also means more noise. Surprisingly, filtering the training samples can improve performance. This idea was for example implemented successfully in Fu et al. (2018), Guida and Coqueret (2018a) and Guida and Coqueret (2018b). In Coqueret and Guida (2020), we investigate why smaller samples may lead to superior out-of-sample accuracy for a particular type of ML algorithm: decision trees (see Chapter 6). We focus on a particular kind of filter: we exclude the labels (e.g., returns) that are not extreme and retain the 20% values that are the smallest and the 20% that are the largest (the bulk of the distribution is removed). In doing so, we alter the structure of trees in two ways:
- when the splitting points are altered, they are always closer to the center of the distribution of the splitting variable (i.e., the resulting clusters are more balanced and possibly more robust);
- the choice of splitting variables is (sometimes) pushed towards the features that have a monotonic impact on the label.
These two properties are desirable. The first reduces the risk of fitting to small groups of instances that may be spurious. The second gives more importance to features that appear globally more relevant in explaining the returns. However, the filtering must not be too intense.
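As an illustration, a minimal sketch of such a filter is given below. It assumes the data_ml dataset and the R1M_Usd label used elsewhere in the book; whether the quantiles are computed over the whole sample (as here) or date-by-date is an additional design choice.
q_low <- quantile(data_ml$R1M_Usd, 0.20)                # Lower quintile of the label
q_high <- quantile(data_ml$R1M_Usd, 0.80)               # Upper quintile of the label
data_filtered <- data_ml %>%                            # Keep only the 20% smallest...
    filter(R1M_Usd <= q_low | R1M_Usd >= q_high)        # ...and the 20% largest label values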
If, instead of retaining 20% of each tail of the predictor, we keep just 10%, then the loss in signal becomes too severe and the performance deteriorates. 4.5.5 Return horizons This subsection deals with one of the least debated issues in factor-based machine learning models: horizons. Several horizons come into play during the whole ML-driven allocation workflow: the horizon of the label, the estimation window (chronological depth of the training samples) and the holding periods. One early reference that looks at these aspects is the founding academic paper on momentum by Jegadeesh and Titman (1993). The authors compute the profitability of portfolios based on the returns over the past \\(J=3, 6, 9, 12\\) months. Four holding periods are tested: \\(K=3,6,9,12\\) months. They report: “The most successful zero-cost (long-short) strategy selects stocks based on their returns over the previous 12 months and then holds the portfolio for 3 months.” While there is no machine learning whatsoever in this contribution, it is possible that their conclusion that horizons matter may also hold for more sophisticated methods. This topic is in fact much discussed, as is shown by the continuing debate on the impact of horizons in momentum profitability (see, e.g., Novy-Marx (2012), Gong, Liu, and Liu (2015) and Goyal and Wahal (2015)). This debate should also be considered when working with ML algorithms. The issues of estimation windows and holding periods are mentioned later in the book, in Chapter 12. Naturally, in the present chapter, the horizon of the label is the important ingredient. Heuristically, there are four possible combinations if we consider only one feature for simplicity: oscillating label and feature; oscillating label, smooth feature (highly autocorrelated); smooth label, oscillating feature; smooth label and feature. Of all of these options, the last one is probably preferable because it is more robust, all things being equal.11 By all things being equal, we mean that in each case, a model is capable of extracting some relevant pattern. A pattern that holds between two slowly moving series is more likely to persist in time. Thus, since features are often highly autocorrelated (cf Figure 4.3), combining them with smooth labels is probably a good idea. To illustrate how critical this point is, we will purposefully use 1-month returns in most of the examples of the book and show that the corresponding results are often disappointing. These returns are very weakly autocorrelated while 6-month or 12-month returns are much more persistent and are better choices for labels. Theoretically, it is possible to understand why that may be the case. For simplicity, let us assume a single feature \\(x\\) that explains returns \\(r\\): \\(r_{t+1}=f(x_t)+e_{t+1}\\). If \\(x_t\\) is highly autocorrelated and the noise embeded in \\(e_{t+1}\\) is not too large, then the two-period ahead return \\((1+r_{t+1})(1+r_{t+2})-1\\) may carry more signal than \\(r_{t+1}\\) because the relationship with \\(x_t\\) has diffused and compounded through time. Consequently, it may also be beneficial to embed memory considerations directly into the modelling function, as is done for instance in Matthew F Dixon (2020). We discuss some practicalities related to autocorrelations in the next section. 4.6 Handling persistence While we have separated the steps of feature engineering and labelling in two different subsections, it is probably wiser to consider them jointly. 
One important property of the dataset processed by the ML algorithm should be the consistency of persistence between features and labels. Intuitively, the autocorrelation patterns between the label \\(y_{t,n}\\) (future performance) and the features \\(x_{t,n}^{(k)}\\) should not be too distant. One problematic example is when the dataset is sampled at the monthly frequency (not unusual in the money management industry) with the labels being monthly returns and the features being risk-based or fundamental attributes. In this case, the label is very weakly autocorrelated, while the features are often highly autocorrelated. In this situation, most sophisticated forecasting tools will arbitrage between features which will probably result in a lot of noise. In linear predictive models, this configuration is known to generate bias in estimates (see the study of Stambaugh (1999) and the review by Gonzalo and Pitarakis (2018)). Among other more technical options, there are two simple solutions when facing this issue: either introduce autocorrelation into the label, or remove it from the features. Again, the first option is not advised for statistical inference on linear models. Both are rather easy econometrically: to increase the autocorrelation of the label, compute performance over longer time ranges. For instance, when working with monthly data, considering annual or biennial returns will do the trick. to get rid of autocorrelation, the shortest route is to resort to differences/variations: \\(\\Delta x_{t,n}^{(k)}=x_{t,n}^{(k)}-x_{t-1,n}^{(k)}\\). One advantage of this procedure is that it makes sense, economically: variations in features may be better drivers of performance, compared to raw levels. A mix between persistent and oscillating variables in the feature space is of course possible, as long as it is driven by economic motivations. 4.7 Extensions 4.7.1 Transforming features The feature space can easily be augmented through simple operations. One of them is lagging, that is, considering older values of features and assuming some memory effect for their impact on the label. This is naturally useful mostly if the features are oscillating (adding a layer of memory on persistent features can be somewhat redundant). New variables are defined by \\(\\breve{x}_{t,n}^{(k)}=x_{t-1,n}^{(k)}\\). In some cases (e.g., insufficient number of features), it is possible to consider ratios or products between features. Accounting ratios like price-to-book, book-to-market, debt-to-equity are examples of functions of raw features that make sense. The gains brought by a larger spectrum of features are not obvious. The risk of overfitting increases, just like in a simple linear regression adding variables mechanically increases the \\(R^2\\). The choices must make sense, economically. Another way to increase the feature space (mentioned above) is to consider variations. Variations in sentiment, variations in book-to-market ratio, etc., can be relevant predictors because sometimes, the change is more important than the level. In this case, a new predictor is \\(\\breve{x}_{t,n}^{(k)}=x_{t,n}^{(k)}-x_{t-1,n}^{(k)}\\). 4.7.2 Macro-economic variables Finally, we discuss a very important topic. The data should never be separated from the context it comes from (its environment). In classical financial terms, this means that a particular model is likely to depend on the overarching situation which is often proxied by macro-economic indicators. 
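Before turning to macro-economic conditioning, we sketch the two remedies mentioned above (longer-horizon labels and feature differences). The snippet below is only illustrative: it assumes that data_ml contains a date column, that R1M_Usd is the one-month-ahead return, and that the panel is complete at the monthly frequency; the column names R3M_comp and D_Mkt_Cap are arbitrary.
data_aug <- data_ml %>%
    group_by(stock_id) %>%                                        # One time series per stock
    arrange(date, .by_group = TRUE) %>%                           # Chronological order within each stock
    mutate(R3M_comp = (1 + R1M_Usd) *                             # Compounded 3-month label:
               (1 + lead(R1M_Usd, 1)) *                           # more autocorrelated than the
               (1 + lead(R1M_Usd, 2)) - 1,                        # 1-month return
           D_Mkt_Cap = Mkt_Cap_3M_Usd - lag(Mkt_Cap_3M_Usd)) %>%  # Feature variation (less autocorrelated)
    ungroup()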
One way to take this into account at the data level is simply to multiply the feature by an exogenous indicator \\(z_{t}\\) and in this case, the new predictor is \\[\\begin{equation} \\tag{4.3} \\breve{x}_{t,n}^{(k)}=z_t \\times x_{t,n}^{(k)} \\end{equation}\\] This technique is used by Gu, Kelly, and Xiu (2020b) who use 8 economic indicators (plus the original predictors (\\(z_t=1\\))). This increases the feature space ninefold. Another route that integrates shifting economic environments is conditional engineering. Suppose that labels are coded via formula (4.2). The thresholds can be made dependent on some exogenous variable. In times of turbulence, it might be a good idea to increase both \\(r_+\\) (buy threshold) and \\(r_-\\) (sell threshold) so that the labels become more conservative: it takes a higher return to make it to the buy category, while short positions are favored. One such example of dynamic thresholding could be \\[\\begin{equation} \\tag{4.4} r_{t,\\pm}=r_{\\pm} \\times e^{\\pm\\delta(\\text{VIX}_t-\\bar{\\text{VIX}})}, \\end{equation}\\] where \\(\\text{VIX}_t\\) is the time-\\(t\\) value of the VIX, while \\(\\bar{\\text{VIX}}\\) is some average or median value. When the VIX is above its average and risk seems to be increasing, the thresholds also increase. The parameter \\(\\delta\\) tunes the magnitude of the correction. In the above example, we assume \\(r_-<0<r_+\\). 4.7.3 Active learning We end this section with the notion of active learning. To the best of our knowledge, it is not widely used in quantitative investment, but the underlying concept is enlightening, hence we dedicate a few paragraphs to this notion for the sake of completeness. In general supervised learning, there is sometimes an asymmetry in the ability to gather features versus labels. For instance, it is free to have access to images, but the labelling of the content of the image (e.g., “a dog”, “a truck”, “a pizza”, etc.) is costly because it requires human annotation. In formal terms, \\(\\textbf{X}\\) is cheap but the corresponding \\(\\textbf{y}\\) is expensive. As is often the case when facing cost constraints, an evident solution is greed. Ahead of the usual learning process, a filter (often called query) is used to decide which data to label and train on (possibly in relationship with the ML algorithm). The labelling is performed by a so-called oracle (which/who knows the truth), usually human. This technique that focuses on the most informative instances is referred to as active learning. We refer to the surveys of Settles (2009) and Settles (2012) for a detailed account of this field (which we briefly summarize below). The term active comes from the fact that the learner does not passively accept data samples but actively participates in the choices of items it learns from. One major dichotomy in active learning pertains to the data source \\(\\textbf{X}\\) on which the query is based. One obvious case is when the original sample \\(\\textbf{X}\\) is very large and not labelled and the learner asks for particular instances within this sample to be labelled. The second case is when the learner has the ability to simulate/generate its own values \\(\\textbf{x}_i\\). This can sometimes be problematic if the oracle does not recognize the data that is generated by the machine. For instance, if the purpose is to label images of characters and numbers, the learner may generate shapes that do not correspond to any letter or digit: the oracle cannot label it. 
In active learning, one key question is, how does the learner choose the instances to be labelled? Heuristically, the answer is by picking those observations that maximize learning efficiency. In binary classification, a simple criterion is the probability of belonging to one particular class. If this probability is far from 0.5, then the algorithm will have no difficulty of picking one class (even though it can be wrong). The interesting case is when the probability is close to 0.5: the machine may hesitate for this particular instance. Thus, having the oracle label it is useful in this case because it helps the learner in a configuration in which it is undecided. Other methods seek to estimate the fit that can be obtained when including particular (new) instances in the training set, and then to optimize this fit. Recalling Section 3.1 in Geman, Bienenstock, and Doursat (1992) on the variance-bias tradeoff, we have, for a training dataset \\(D\\) and one instance \\(x\\) (we omit the bold font for simplicity), \\[\\mathbb{E}\\left[\\left.(y-\\hat{f}(x;D))^2\\right|\\{D,x\\}\\right]=\\mathbb{E}\\left[\\left.\\underbrace{(y-\\mathbb{E}[y|x])^2}_{\\text{indep. from }D\\text{ and }\\hat{f}} \\right|\\{D,x\\} \\right]+(\\hat{f}(x;D)-\\mathbb{E}[y|x])^2,\\] where the notation \\(f(x;D)\\) is used to highlight the dependence between the model \\(\\hat{f}\\) and the dataset \\(D\\): the model has been trained on \\(D\\). The first term is irreducible, as it does not depend on \\(\\hat{f}\\). Thus, only the second term is of interest. If we take the average of this quantity, taken over all possible values of \\(D\\): \\[\\mathbb{E}_D\\left[(\\hat{f}(x;D)-\\mathbb{E}[y|x])^2 \\right]=\\underbrace{\\left(\\mathbb{E}_D\\left[\\hat{f}(x;D)-\\mathbb{E}[y|x]\\right]\\right)^2}_{\\text{squared bias}} \\ + \\ \\underbrace{\\mathbb{E}_D\\left[(\\hat{f}(x,D)-\\mathbb{E}_D[\\hat{f}(x;D)])^2\\right]}_{\\text{variance}}\\] If this expression is not too complicated to compute, the learner can query the \\(x\\) that minimizes the tradeoff. Thus, on average, this new instance will be the one that yields the best learning angle (as measured by the \\(L^2\\) error). Beyond this approach (which is limited because it requires the oracle to label a possibly irrelevant instance), many other criteria exist for querying and we refer to section 3 from Settles (2009) for an exhaustive list. One final question: is active learning applicable to factor investing? One straightfoward answer is that data cannot be annotated by human intervention. Thus, the learners cannot simulate their own instances and ask for corresponding labels. One possible option is to provide the learner with \\(\\textbf{X}\\) but not \\(\\textbf{y}\\) and keep only a queried subset of observations with the corresponding labels. In spirit, this is close to what is done in Coqueret and Guida (2020) except that the query is not performed by a machine but by the human user. Indeed, it is shown in this paper that not all observations carry the same amount of signal. Instances with ‘average’ label values seem to be on average less informative compared to those with extreme label values. 4.8 Additional code and results 4.8.1 Impact of rescaling: graphical representation We start with a simple illustration of the different scaling methods. We generate an arbitrary series and then rescale it. The series is not random so that each time the code chunk is executed, the output remains the same. 
Length <- 100                                       # Length of the sequence
x <- exp(sin(1:Length))                             # Original data
data <- data.frame(index = 1:Length, x = x)         # Data framed into dataframe
ggplot(data, aes(x = index, y = x)) + geom_bar(stat = "identity")   # Plot
We define and plot the scaled variables below.
norm_unif <- function(v){                           # This function uniformises a vector (empirical c.d.f. transform)
    v <- v %>% as.matrix()
    return(ecdf(v)(v))
}
norm_0_1 <- function(v){                            # This function rescales a vector into the [0,1] interval (min-max)
    return((v-min(v))/(max(v)-min(v)))
}
data_norm <- data.frame(                            # Formatting the data
    index = 1:Length,                               # Index of point/instance
    standard = (x - mean(x)) / sd(x),               # Standardisation
    norm_0_1 = norm_0_1(x),                         # [0,1] reduction
    unif = norm_unif(x)) %>%                        # Uniformisation
    gather(key = Type, value = value, -index)       # Putting in tidy format
ggplot(data_norm, aes(x = index, y = value, fill = Type)) +   # Plot!
    geom_bar(stat = "identity") +
    facet_grid(Type~.)                              # This option creates 3 concatenated graphs to ease comparison
Finally, we look at the histogram of the newly created variables.
ggplot(data_norm, aes(x = value, fill = Type)) + geom_histogram(position = "dodge")
With respect to shape, the green and red distributions are close to the original one. It is only the support that changes: the min/max rescaling ensures all values lie in the \\([0,1]\\) interval. In both cases, the smallest values (on the left) display a spike in distribution. By construction, this spike disappears under the uniformization: the points are evenly distributed over the unit interval.
4.8.2 Impact of rescaling: toy example
To illustrate the impact of choosing one particular rescaling method,12 we build a simple dataset, comprising 3 firms and 3 dates.
firm <- c(rep(1,3), rep(2,3), rep(3,3))             # Firms (3 lines for each)
date <- rep(c(1,2,3),3)                             # Dates
cap <- c(10, 50, 100,                               # Market capitalization
         15, 10, 15,
         200, 120, 80)
return <- c(0.06, 0.01, -0.06,                      # Return values
            -0.03, 0.00, 0.02,
            -0.04, -0.02, 0.00)
data_toy <- data.frame(firm, date, cap, return)     # Aggregation of data
data_toy <- data_toy %>%                            # Transformation of data
    group_by(date) %>%
    mutate(cap_0_1 = norm_0_1(cap), cap_u = norm_unif(cap))
TABLE 4.3: Sample data for a toy example.
firm   date   cap    return   cap_0_1   cap_u
1      1      10      0.06    0.000     0.333
1      2      50      0.01    0.364     0.667
1      3      100    -0.06    1.000     1.000
2      1      15     -0.03    0.026     0.667
2      2      10      0.00    0.000     0.333
2      3      15      0.02    0.000     0.333
3      1      200    -0.04    1.000     1.000
3      2      120    -0.02    1.000     1.000
3      3      80      0.00    0.765     0.667
Let’s briefly comment on this synthetic data. We assume that dates are ordered chronologically and far away: each date stands for a year or the beginning of a decade, but the (forward) returns are computed on a monthly basis. The first firm is hugely successful and multiplies its cap ten times over the periods. The second firm remains stable cap-wise, while the third one plummets. If we look at ‘local’ future returns, they are strongly negatively related to size for the first and third firms. For the second one, there is no clear pattern. Date-by-date, the analysis is fairly similar, though slightly nuanced. On date 1, the smallest firm has the largest return and the two others have negative returns. On date 2, the biggest firm has a negative return while the two smaller firms do not. On date 3, returns are decreasing with size. While the relationship is not always perfectly monotonic, there seems to be a link between size and return and, typically, investing in the smallest firm would be a very good strategy with this sample. Now let us look at the output of simple regressions.
Below, the package broom is part of the tidyverse. It is great to format regression outputs. lm(return ~ cap_0_1, data = data_toy) %>% # First regression (min-max rescaling) broom::tidy() %>% knitr::kable(caption = 'Regression output when the independent var. comes from min-max rescaling', booktabs = T) TABLE 4.4: Regression output when the independent var. comes from min-max rescaling term estimate std.error statistic p.value (Intercept) 0.0162778 0.0137351 1.185121 0.2746390 cap_0_1 -0.0497032 0.0213706 -2.325777 0.0529421 lm(return ~ cap_u, data = data_toy) %>% # Second regression (uniformised feature) broom::tidy() %>% knitr::kable(caption = 'Regression output when the indep. var. comes from uniformization', booktabs = T) TABLE 4.5: Regression output when the indep. var. comes from uniformization term estimate std.error statistic p.value (Intercept) 0.06 0.0198139 3.028170 0.0191640 cap_u -0.10 0.0275162 -3.634219 0.0083509 In terms of p-value (last column), the first estimation for the cap coefficient is above 5% (in Table 4.4) while the second is below 1% (in Table 4.5). One possible explanation for this discrepancy is the standard deviation of the variables. The deviations are equal to 0.47 and 0.29 for cap_0 and cap_u, respectively. Values like market capitalizations can have very large ranges and are thus subject to substantial deviations (even after scaling). Working with uniformized variables reduces dispersion and can help solve this problem. Note that this is a double-edged sword: while it can help avoid false negatives, it can also lead to false positives. 4.9 Coding exercises The Federal Reserve of Saint Louis (https://fred.stlouisfed.org) hosts thousands of time series of economic indicators that can serve as conditioning variables. Pick one and apply formula (4.3) to expand the number of predictors. If need be, use the function defined above. Create a new categorical label based on formulae (4.4) and (4.2). The time series of the VIX can also be retrieved from the Federal Reserve’s website: https://fred.stlouisfed.org/series/VIXCLS. Plot the histogram of the R12M_Usd variable. Clearly, some outliers are present. Identify the stock with highest value for this variable and determine if the value can be correct or not. References "],["lasso.html", "Chapter 5 Penalized regressions and sparse hedging for minimum variance portfolios 5.1 Penalized regressions 5.2 Sparse hedging for minimum variance portfolios 5.3 Predictive regressions 5.4 Coding exercise", " Chapter 5 Penalized regressions and sparse hedging for minimum variance portfolios In this chapter, we introduce the widespread concept of regularization for linear models. There are in fact several possible applications for these models. The first one is straightforward: resort to penalizations to improve the robustness of factor-based predictive regressions. The outcome can then be used to fuel an allocation scheme. For instance, Han et al. (2019) and Rapach and Zhou (2019) use penalized regressions to improve stock return prediction when combining forecasts that emanate from individual characteristics. Similar ideas can be developed for macroeconomic predictions for instance, as in Uematsu and Tanaka (2019). The second application stems from a less known result which originates from Stevens (1998). It links the weights of optimal mean-variance portfolios to particular cross-sectional regressions. The idea is then different and the purpose is to improve the quality of mean-variance driven portfolio weights. 
We present the two approaches below after an introduction on regularization techniques for linear models. Other examples of financial applications of penalization can be found in d’Aspremont (2011), Ban, El Karoui, and Lim (2016) and Kremer et al. (2019). In any case, the idea is the same as in the seminal paper Tibshirani (1996): standard (unconstrained) optimization programs may lead to noisy estimates, thus adding a structuring constraint helps remove some noise (at the cost of a possible bias). For instance, Kremer et al. (2019) use this concept to build more robust mean-variance (Markowitz (1952)) portfolios and Freyberger, Neuhierl, and Weber (2020) use it to single out the characteristics that really help explain the cross-section of equity returns. 5.1 Penalized regressions 5.1.1 Simple regressions The ideas behind linear models are at least two centuries old (Legendre (1805) is an early reference on least squares optimization). Given a matrix of predictors \\(\\textbf{X}\\), we seek to decompose the output vector \\(\\textbf{y}\\) as a linear function of the columns of \\(\\textbf{X}\\) (written \\(\\textbf{X}\\boldsymbol{\\beta}\\)) plus an error term \\(\\boldsymbol{\\epsilon}\\): \\(\\textbf{y}=\\textbf{X}\\boldsymbol{\\beta}+\\boldsymbol{\\epsilon}\\). The best choice of \\(\\boldsymbol{\\beta}\\) is naturally the one that minimizes the error. For analytical tractability, it is the sum of squared errors that is minimized: \\(L=\\boldsymbol{\\epsilon}'\\boldsymbol{\\epsilon}=\\sum_{i=1}^I\\epsilon_i^2\\). The loss \\(L\\) is called the sum of squared residuals (SSR). In order to find the optimal \\(\\boldsymbol{\\beta}\\), it is imperative to differentiate this loss \\(L\\) with respect to \\(\\boldsymbol{\\beta}\\) because the first order condition requires that the gradient be equal to zero: \\[\\begin{align*} \\nabla_{\\boldsymbol{\\beta}} L&=\\frac{\\partial}{\\partial \\boldsymbol{\\beta}}(\\textbf{y}-\\textbf{X}\\boldsymbol{\\beta})'(\\textbf{y}-\\textbf{X}\\boldsymbol{\\beta})=\\frac{\\partial}{\\partial \\boldsymbol{\\beta}}\\boldsymbol{\\beta}'\\textbf{X}'\\textbf{X}\\boldsymbol{\\beta}-2\\textbf{y}'\\textbf{X}\\boldsymbol{\\beta} \\\\ &=2\\textbf{X}'\\textbf{X}\\boldsymbol{\\beta} -2\\textbf{X}'\\textbf{y} \\end{align*}\\] so that the first order condition \\(\\nabla_{\\boldsymbol{\\beta}}=\\textbf{0}\\) is satisfied if \\[\\begin{equation} \\tag{5.1} \\boldsymbol{\\beta}^*=(\\textbf{X}'\\textbf{X})^{-1}\\textbf{X}'\\textbf{y}, \\end{equation}\\] which is known as the standard ordinary least squares (OLS) solution of the linear model. If the matrix \\(\\textbf{X}\\) has dimensions \\(I \\times K\\), then the \\(\\textbf{X}'\\textbf{X}\\) can only be inverted if the number of rows \\(I\\) is strictly superior to the number of columns \\(K\\). In some cases, that may not hold; there are more predictors than instances and there is no unique value of \\(\\boldsymbol{\\beta}\\) that minimizes the loss. If \\(\\textbf{X}'\\textbf{X}\\) is nonsingular (or positive definite), then the second order condition ensures that \\(\\boldsymbol{\\beta}^*\\) yields a global minimum for the loss \\(L\\) (the second order derivative of \\(L\\) with respect to \\(\\boldsymbol{\\beta}\\), the Hessian matrix, is exactly \\(\\textbf{X}'\\textbf{X}\\)). Up to now, we have made no distributional assumption on any of the above quantities. 
Standard assumptions are the following: - \\(\\mathbb{E}[\\textbf{y}|\\textbf{X}]=\\textbf{X}\\boldsymbol{\\beta}\\): linear shape for the regression function; - \\(\\mathbb{E}[\\boldsymbol{\\epsilon}|\\textbf{X}]=\\textbf{0}\\): errors are independent of predictors; - \\(\\mathbb{E}[\\boldsymbol{\\epsilon}\\boldsymbol{\\epsilon}'| \\textbf{X}]=\\sigma^2\\textbf{I}\\): homoscedasticity - errors are uncorrelated and have identical variance; - the \\(\\epsilon_i\\) are normally distributed. Under these hypotheses, it is possible to perform statistical tests related to the \\(\\hat{\\boldsymbol{\\beta}}\\) coefficients. We refer to chapters 2 to 4 in Greene (2018) for a thorough treatment on linear models as well as to chapter 5 of the same book for details on the corresponding tests. 5.1.2 Forms of penalizations Penalized regressions have been popularized since the seminal work of Tibshirani (1996). The idea is to impose a constraint on the coefficients of the regression, namely that their total magnitude be restrained. In his original paper, Tibshirani (1996) proposes to estimate the following model (LASSO): \\[\\begin{equation} \\tag{5.2} y_i = \\sum_{j=1}^J \\beta_jx_{i,j} + \\epsilon_i, \\quad i =1,\\dots,I, \\quad \\text{s.t.} \\quad \\sum_{j=1}^J |\\beta_j| < \\delta, \\end{equation}\\] for some strictly positive constant \\(\\delta\\). Under least square minimization, this amounts to solve the Lagrangian formulation: \\[\\begin{equation} \\tag{5.3} \\underset{\\mathbf{\\beta}}{\\min} \\, \\left\\{ \\sum_{i=1}^I\\left(y_i - \\sum_{j=1}^J \\beta_jx_{i,j} \\right)^2+\\lambda \\sum_{j=1}^J |\\beta_j| \\right\\}, \\end{equation}\\] for some value \\(\\lambda>0\\) which naturally depends on \\(\\delta\\) (the lower the \\(\\delta\\), the higher the \\(\\lambda\\): the constraint is more binding). This specification seems close to the ridge regression (\\(L^2\\) regularization), which is in fact anterior to the Lasso: \\[\\begin{equation} \\tag{5.4} \\underset{\\mathbf{\\beta}}{\\min} \\, \\left\\{ \\sum_{i=1}^I\\left(y_i - \\sum_{j=1}^J\\beta_jx_{i,j} \\right)^2+\\lambda \\sum_{j=1}^J \\beta_j^2 \\right\\}, \\end{equation}\\] and which is equivalent to estimating the following model \\[\\begin{equation} \\tag{5.5} y_i = \\sum_{j=1}^J \\beta_jx_{i,j} + \\epsilon_i, \\quad i =1,\\dots,I, \\quad \\text{s.t.} \\quad \\sum_{j=1}^J \\beta_j^2 < \\delta, \\end{equation}\\] but the outcome is in fact quite different, which justifies a separate treatment. Mechanically, as \\(\\lambda\\), the penalization intensity, increases (or as \\(\\delta\\) in (5.5) decreases), the coefficients of the ridge regression all slowly decrease in magnitude towards zero. In the case of the LASSO, the convergence is somewhat more brutal as some coefficients shrink to zero very quickly. For \\(\\lambda\\) sufficiently large, only one coefficient will remain nonzero, while in the ridge regression, the zero value is only reached asymptotically for all coefficients. We invite the interested read to have a look at the survey in Hastie (2020) about all applications of ridge regressions in data science with links to other topics like cross-validation and dropout regularization, among others. To depict the difference between the Lasso and the ridge regression, let us consider the case of \\(K=2\\) predictors which is shown in Figure 5.1. The optimal unconstrained solution \\(\\boldsymbol{\\beta}^*\\) is pictured in red in the middle of the space. The problem is naturally that it does not satisfy the imposed conditions. 
These constraints are shown in light grey: they take the shape of a square \\(|\\beta_1|+|\\beta_2| \\le \\delta\\) in the case of the Lasso and a circle \\(\\beta_1^2+\\beta_2^2 \\le \\delta\\) for the ridge regression. In order to satisfy these constraints, the optimization needs to look in the vicinity of \\(\\boldsymbol{\\beta}^*\\) by allowing for larger error levels. These error levels are shown as orange ellipsoids in the figure. When the requirement on the error is loose enough, one ellipsoid touches the acceptable boundary (in grey) and this is where the constrained solution is located. FIGURE 5.1: Schematic view of Lasso (left) versus ridge (right) regressions. Both methods work when the number of exogenous variables surpasses that of observations, i.e., in the case where classical regressions are ill-defined. This is easy to see in the case of the ridge regression for which the OLS solution is simply \\[\\hat{\\boldsymbol{\\beta}}=(\\mathbf{X}'\\mathbf{X}+\\lambda \\mathbf{I}_N)^{-1}\\mathbf{X}'\\mathbf{Y}.\\] The additional term \\(\\lambda \\mathbf{I}_N\\) compared to Equation (5.1) ensures that the inverse matrix is well-defined whenever \\(\\lambda>0\\). As \\(\\lambda\\) increases, the magnitudes of the \\(\\hat{\\beta}_i\\) decrease, which explains why penalizations are sometimes referred to as shrinkage methods (the estimated coefficients see their values shrink). Zou and Hastie (2005) propose to benefit from the best of both worlds when combining both penalizations in a convex manner (which they call the elasticnet): \\[\\begin{equation} \\tag{5.6} y_i = \\sum_{j=1}^J \\beta_jx_{i,j} + \\epsilon_i, \\quad \\text{s.t.} \\quad \\alpha \\sum_{j=1}^J |\\beta_j| +(1-\\alpha)\\sum_{j=1}^J \\beta_j^2< \\delta, \\quad i =1,\\dots,N, \\end{equation}\\] which is associated to the optimization program \\[\\begin{equation} \\tag{5.7} \\underset{\\mathbf{\\beta}}{\\min} \\, \\left\\{ \\sum_{i=1}^I\\left(y_i - \\sum_{j=1}^J\\beta_jx_{i,j} \\right)^2+\\lambda \\left(\\alpha\\sum_{j=1}^J |\\beta_j|+ (1-\\alpha)\\sum_{j=1}^J \\beta_j^2\\right) \\right\\}. \\end{equation}\\] The main advantage of the LASSO compared to the ridge regression is its selection capability. Indeed, given a very large number of variables (or predictors), the LASSO will progressively rule out those that are the least relevant. The elasticnet preserves this selection ability and Zou and Hastie (2005) argue that in some cases, it is even more effective than the LASSO. The parameter \\(\\alpha \\in [0,1]\\) tunes the smoothness of convergence (of the coefficients) towards zero. The closer \\(\\alpha\\) is to zero, the smoother the convergence. 5.1.3 Illustrations We begin with simple illustrations of penalized regressions. We start with the LASSO. The original implementation by the authors is in R, which is practical. The syntax is slightly different, compared to usual linear models. The illustrations are run on the whole dataset. First, we estimate the coefficients. By default, the function chooses a large array of penalization values so that the results for different penalization intensities (\\(\\lambda\\)) can be shown immediately. library(glmnet) y_penalized <- data_ml$R1M_Usd # Dependent variable x_penalized <- data_ml %>% # Predictors dplyr::select(all_of(features)) %>% as.matrix() fit_lasso <- glmnet(x_penalized, y_penalized, alpha = 1) # Model alpha = 1: LASSO Once the coefficients are computed, they require some wrangling before plotting. Also, there are too many of them, so we only plot a subset of them. 
lasso_res <- summary(fit_lasso$beta)                            # Extract LASSO coefs
lambda <- fit_lasso$lambda                                      # Values of the penalisation const
lasso_res$Lambda <- lambda[lasso_res$j]                         # Put the labels where they belong
lasso_res$Feature <- features[lasso_res$i] %>% as.factor()      # Add names of variables to output
lasso_res[1:120,] %>%                                           # Take the first 120 estimates
    ggplot(aes(x = Lambda, y = x, color = Feature)) +           # Plot!
    geom_line() + coord_fixed(0.25) + ylab("beta") +            # Change aspect ratio of graph
    theme(legend.text = element_text(size = 7))                 # Reduce legend font size
FIGURE 5.2: LASSO model. The dependent variable is the 1 month ahead return.
The graph plots the evolution of coefficients as the penalization intensity, \\(\\lambda\\), increases. For some characteristics, like Ebit_Ta (in orange), the convergence to zero is rapid. Other variables resist the penalization longer, like Mkt_Cap_3M_Usd, which is the last one to vanish. Essentially, this means that at the first order, this variable is an important driver of future 1-month returns in our sample. Moreover, the negative sign of its coefficient is a confirmation (again, in this sample) of the size anomaly, according to which small firms experience higher future returns compared to their larger counterparts. Next, we turn to ridge regressions.
fit_ridge <- glmnet(x_penalized, y_penalized, alpha = 0)        # alpha = 0: ridge
ridge_res <- summary(fit_ridge$beta)                            # Extract ridge coefs
lambda <- fit_ridge$lambda                                      # Penalisation const
ridge_res$Feature <- features[ridge_res$i] %>% as.factor()
ridge_res$Lambda <- lambda[ridge_res$j]                         # Set labels right
ridge_res %>%
    filter(Feature %in% levels(droplevels(lasso_res$Feature[1:120]))) %>%   # Keep same features
    ggplot(aes(x = Lambda, y = x, color = Feature)) + ylab("beta") +        # Plot!
    geom_line() + scale_x_log10() + coord_fixed(45) +           # Aspect ratio
    theme(legend.text = element_text(size = 7))
FIGURE 5.3: Ridge regression. The dependent variable is the 1 month ahead return.
In Figure 5.3, the convergence to zero is much smoother. We underline that the x-axis (penalization intensities) has a log scale. This makes it possible to see the early patterns (close to zero, to the left) more clearly. As in the previous figure, the Mkt_Cap_3M_Usd predictor clearly dominates, with again large negative coefficients. Nonetheless, as \\(\\lambda\\) increases, its domination over the other predictors fades. By definition, the elasticnet will produce curves that behave like a blend of the two above approaches. Nonetheless, as long as \\(\\alpha >0\\), the selective property of the LASSO will be preserved: some features will see their coefficients shrink rapidly to zero. In fact, the strength of the LASSO is such that a balanced mix of the two penalizations is not reached at \\(\\alpha = 1/2\\), but rather at a much smaller value (possibly below 0.1).
5.2 Sparse hedging for minimum variance portfolios
5.2.1 Presentation and derivations
The idea of constructing sparse portfolios is not new per se (see, e.g., Brodie et al. (2009), Fastrich, Paterlini, and Winker (2015)) and the link with the selective property of the LASSO is rather straightforward in classical quadratic programs. Note that the choice of the \\(L^1\\) norm is imperative because when enforcing a simple \\(L^2\\) norm, the diversification of the portfolio increases (see Coqueret (2015)). The idea behind this section stems from Goto and Xu (2015) but the cornerstone result was first published by Stevens (1998) and we present it below.
We provide details because the derivations are not commonplace in the literature. In usual mean-variance allocations, one core ingredient is the inverse covariance matrix of assets \\(\\mathbf{\\Sigma}^{-1}\\). For instance, the maximum Sharpe ratio (MSR) portfolio is given by \\[\\begin{equation} \\tag{5.8} \\mathbf{w}^{\\text{MSR}}=\\frac{\\mathbf{\\Sigma}^{-1}\\boldsymbol{\\mu}}{\\mathbf{1}'\\mathbf{\\Sigma}^{-1}\\boldsymbol{\\mu}}, \\end{equation}\\] where \\(\\mathbf{\\mu}\\) is the vector of expected (excess) returns. Taking \\(\\mathbf{\\mu}=\\mathbf{1}\\) yields the minimum variance portfolio, which is agnostic in terms of the first moment of expected returns (and, as such, usually more robust than most alternatives which try to estimate \\(\\boldsymbol{\\mu}\\) and often fail). Usually, the traditional way is to estimate \\(\\boldsymbol{\\Sigma}\\) and to invert it to get the MSR weights. However, several approaches aim at estimating \\(\\boldsymbol{\\Sigma}^{-1}\\) and we present one of them below. We proceed one asset at a time, that is, one line of \\(\\boldsymbol{\\Sigma}^{-1}\\) at a time. If we decompose the matrix \\(\\mathbf{\\Sigma}\\) into: \\[\\mathbf{\\Sigma}= \\left[\\begin{array}{cc} \\sigma^2 & \\mathbf{c}' \\\\ \\mathbf{c}& \\mathbf{C}\\end{array} \\right],\\] classical partitioning results (e.g., Schur complements) imply \\[\\small \\mathbf{\\Sigma}^{-1}= \\left[\\begin{array}{cc} (\\sigma^2 -\\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c})^{-1} & - (\\sigma^2 -\\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c})^{-1}\\mathbf{c}'\\mathbf{C}^{-1} \\\\ - (\\sigma^2 -\\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c})^{-1}\\mathbf{C}^{-1}\\mathbf{c}& \\mathbf{C}^{-1}+ (\\sigma^2 -\\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c})^{-1}\\mathbf{C}^{-1}\\mathbf{cc}'\\mathbf{C}^{-1}\\end{array} \\right].\\] We are interested in the first line, which has 2 components: the factor \\((\\sigma^2 -\\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c})^{-1}\\) and the line vector \\(\\mathbf{c}'\\mathbf{C}^{-1}\\). \\(\\mathbf{C}\\) is the covariance matrix of assets \\(2\\) to \\(N\\) and \\(\\mathbf{c}\\) is the covariance between the first asset and all other assets. The first line of \\(\\mathbf{\\Sigma}^{-1}\\) is \\[\\begin{equation} \\tag{5.9} (\\sigma^2 -\\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c})^{-1} \\left[1 \\quad \\underbrace{-\\mathbf{c}'\\mathbf{C}^{-1}}_{N-1 \\text{ terms}} \\right]. \\end{equation}\\] We now consider an alternative setting. We regress the returns of the first asset on those of all other assets: \\[\\begin{equation} \\tag{5.10} r_{1,t}=a_1+\\sum_{n=2}^N\\beta_{1|n}r_{n,t}+\\epsilon_t, \\quad \\text{ i.e., } \\quad \\mathbf{r}_1=a_1\\mathbf{1}_T+\\mathbf{R}_{-1}\\mathbf{\\beta}_1+\\epsilon_1, \\end{equation}\\] where \\(\\mathbf{R}_{-1}\\) gathers the returns of all assets except the first one. The OLS estimator for \\(\\mathbf{\\beta}_1\\) is \\[\\begin{equation} \\tag{5.11} \\hat{\\mathbf{\\beta}}_{1}=\\mathbf{C}^{-1}\\mathbf{c}, \\end{equation}\\] and this is the partitioned form (when a constant is included to the regression) stemming from the Frisch-Waugh-Lovell theorem (see chapter 3 in Greene (2018)). In addition, \\[\\begin{equation} \\tag{5.12} (1-R^2)\\sigma_{\\mathbf{r}_1}^2=\\sigma_{\\mathbf{r}_1}^2- \\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c} =\\sigma^2_{\\epsilon_1}. \\end{equation}\\] The proof of this last fact is given below. 
With \\(\\mathbf{X}\\) being the concatenation of \\(\\mathbf{1}_T\\) with returns \\(\\mathbf{R}_{-1}\\) and with \\(\\mathbf{y}=\\mathbf{r}_1\\), the classical expression of the \\(R^2\\) is \\[R^2=1-\\frac{\\mathbf{\\epsilon}'\\mathbf{\\epsilon}}{T\\sigma_Y^2}=1-\\frac{\\mathbf{y}'\\mathbf{y}-\\hat{\\mathbf{\\beta}'}\\mathbf{X}'\\mathbf{X}\\hat{\\mathbf{\\beta}}}{T\\sigma_Y^2}=1-\\frac{\\mathbf{y}'\\mathbf{y}-\\mathbf{y}'\\mathbf{X}\\hat{\\mathbf{\\beta}}}{T\\sigma_Y^2},\\] with fitted values \\(\\mathbf{X}\\hat{\\mathbf{\\beta}}= \\hat{a_1}\\mathbf{1}_T+\\mathbf{R}_{-1}\\mathbf{C}^{-1}\\mathbf{c}\\). Hence, \\[\\begin{align*} T\\sigma_{\\mathbf{r}_1}^2R^2&=T\\sigma_{\\mathbf{r}_1}^2-\\mathbf{r}'_1\\mathbf{r}_1+\\hat{a_1}\\mathbf{1}'_T\\mathbf{r}_1+\\mathbf{r}'_1\\mathbf{R}_{-1}\\mathbf{C}^{-1}\\mathbf{c} \\\\ T(1-R^2)\\sigma_{\\mathbf{r}_1}^2&=\\mathbf{r}'_1\\mathbf{r}_1-\\hat{a_1}\\mathbf{1}'_T\\mathbf{r}_1-\\left(\\mathbf{\\tilde{r}}_1+\\frac{\\mathbf{1}_T\\mathbf{1}'_T}{T}\\mathbf{r}_1\\right)'\\left(\\tilde{\\mathbf{R}}_{-1}+\\frac{\\mathbf{1}_T\\mathbf{1}'_T}{T}\\mathbf{R}_{-1}\\right)\\mathbf{C}^{-1}\\mathbf{c} \\\\ T(1-R^2)\\sigma_{\\mathbf{r}_1}^2&=\\mathbf{r}'_1\\mathbf{r}_1-\\hat{a_1}\\mathbf{1}'_T\\mathbf{r}_1-T\\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c} -\\mathbf{r}'_1\\frac{\\mathbf{1}_T\\mathbf{1}'_T}{T}\\mathbf{R}_{-1} \\mathbf{C}^{-1}\\mathbf{c} \\\\ T(1-R^2)\\sigma_{\\mathbf{r}_1}^2&=\\mathbf{r}'_1\\mathbf{r}_1-\\frac{(\\mathbf{1}'_T\\mathbf{r}_1)^2}{T}- T\\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c} \\\\ (1-R^2)\\sigma_{\\mathbf{r}_1}^2&=\\sigma_{\\mathbf{r}_1}^2- \\mathbf{c}'\\mathbf{C}^{-1}\\mathbf{c} \\end{align*}\\] where in the fourth equality we have plugged \\(\\hat{a}_1=\\frac{\\mathbf{1'}_T}{T}(\\mathbf{r}_1-\\mathbf{R}_{-1}\\mathbf{C}^{-1}\\mathbf{c})\\). Note that there is probably a simpler proof, see, e.g., section 3.5 in Greene (2018). Combining ((5.9), ((5.11)) and ((5.12)), we get that the first line of \\(\\mathbf{\\Sigma}^{-1}\\) is equal to \\[\\begin{equation} \\tag{5.13} \\frac{1}{\\sigma^2_{\\epsilon_1}}\\times \\left[ 1 \\quad -\\hat{\\boldsymbol{\\beta}}_1'\\right]. \\end{equation}\\] Given the first line of \\(\\mathbf{\\Sigma}^{-1}\\), it suffices to multiply by \\(\\boldsymbol{\\mu}\\) to get the portfolio weight in the first asset (up to a scaling constant). There is a nice economic intuition behind the above results which justifies the term “sparse hedging”. We take the case of the minimum variance portfolio, for which \\(\\boldsymbol{\\mu}=\\boldsymbol{1}\\). In Equation (5.10), we try to explain the return of asset 1 with that of all other assets. In the above equation, up to a scaling constant, the portfolio has a unit position in the first asset and \\(-\\hat{\\boldsymbol{\\beta}}_1\\) positions in all other assets. Hence, the purpose of all other assets is clearly to hedge the return of the first one. In fact, these positions are aimed at minimizing the squared errors of the aggregate portfolio for the first asset (these errors are exactly \\(\\mathbf{\\epsilon}_1\\)). Moreover, the scaling factor \\(\\sigma^{-2}_{\\epsilon_1}\\) is also simple to interpret: the more we trust the regression output (because of a small \\(\\sigma^{2}_{\\epsilon_1}\\)), the more we invest in the hedging portfolio of the asset. This reasoning is easily generalized for any line of \\(\\mathbf{\\Sigma}^{-1}\\), which can be obtained by regressing the returns of asset \\(i\\) on the returns of all other assets. 
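As a quick numerical sanity check of this identity (a minimal sketch on simulated Gaussian returns, not part of the original derivation), the first row of the inverse sample covariance matrix can be compared with the regression-based expression of Equation (5.13).
library(MASS)                                                # For multivariate Gaussian simulation
set.seed(42)
N <- 5                                                       # Number of assets (toy example)
T_obs <- 500                                                 # Number of periods
Sigma_true <- 0.02 * (0.3 + 0.7 * diag(N))                   # Simple covariance matrix (correlation 0.3)
R <- mvrnorm(T_obs, mu = rep(0, N), Sigma = Sigma_true)      # Simulated returns
fit <- lm(R[, 1] ~ R[, -1])                                  # Regress asset 1 on all other assets
line_reg <- c(1, -coef(fit)[-1]) / var(residuals(fit))       # Equation (5.13)
line_inv <- solve(cov(R))[1, ]                               # First line of the inverse covariance matrix
rbind(line_reg, line_inv)                                    # The two lines coincide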
If the allocation scheme has the form ((5.8)) for given values of \\(\\boldsymbol{\\mu}\\), then the pseudo-code for the sparse portfolio strategy is the following. At each date (which we omit for notational convenience), For all stocks \\(i\\), estimate the elasticnet regression over the \\(t=1,\\dots,T\\) samples to get the \\(i^{th}\\) line of \\(\\hat{\\mathbf{\\Sigma}}^{-1}\\): \\[ \\small \\left[\\hat{\\mathbf{\\Sigma}}^{-1}\\right]_{i,\\cdot}= \\underset{\\mathbf{\\beta}_{i|}}{\\text{argmin}}\\, \\left\\{\\sum_{t=1}^T\\left( r_{i,t}-a_i+\\sum_{n\\neq i}^N\\beta_{i|n}r_{n,t}\\right)^2+\\lambda \\alpha || \\mathbf{\\beta}_{i|}||_1+\\lambda (1-\\alpha)||\\mathbf{\\beta}_{i|}||_2^2\\right\\} \\] to get the weights of asset \\(i\\), we compute the \\(\\mathbf{\\mu}\\)-weighted sum: \\(w_i= \\sigma_{\\epsilon_i}^{-2}\\left(\\mu_i- \\sum_{j\\neq i}\\mathbf{\\beta}_{i|j}\\mu_j\\right)\\), where we recall that the vectors \\(\\mathbf{\\beta}_{i|}=[\\mathbf{\\beta}_{i|1},\\dots,\\mathbf{\\beta}_{i|i-1},\\mathbf{\\beta}_{i|i+1},\\dots,\\mathbf{\\beta}_{i|N}]\\) are the coefficients from regressing the returns of asset \\(i\\) against the returns of all other assets. The introduction of the penalization norms is the new ingredient, compared to the original approach of Stevens (1998). The benefits are twofold: first, introducing constraints yields weights that are more robust and less subject to errors in the estimates of \\(\\mathbf{\\mu}\\); second, because of sparsity, weights are more stable, less leveraged and thus the strategy is less impacted by transaction costs. Before we turn to numerical applications, we mention a more direct route to the estimation of a robust inverse covariance matrix: the Graphical LASSO. The GLASSO estimates the precision matrix (inverse covariance matrix) via maximum likelihood while imposing constraints/penalizations on the weights of the matrix. When the penalization is strong enough, this yields a sparse matrix, i.e., a matrix in which some and possibly many coefficients are zero. We refer to the original article Friedman, Hastie, and Tibshirani (2008) for more details on this subject. 5.2.2 Example The interest of sparse hedging portfolios is to propose a robust approach to the estimation of minimum variance policies. Indeed, since the vector of expected returns \\(\\boldsymbol{\\mu}\\) is usually very noisy, a simple solution is to adopt an agnostic view by setting \\(\\boldsymbol{\\mu}=\\boldsymbol{1}\\). In order to test the added value of the sparsity constraint, we must resort to a full backtest. In doing so, we anticipate the content of Chapter 12. We first prepare the variables. Sparse portfolios are based on returns only; we thus base our analysis on the dedicated variable in matrix/rectangular format (returns) which were created at the end of Chapter 1. Then, we initialize the output variables: portfolio weights and portfolio returns. We want to compare three strategies: an equally weighted (EW) benchmark of all stocks, the classical global minimum variance portfolio (GMV) and the sparse-hedging approach to minimum variance. t_oos <- returns$date[returns$date > separation_date] %>% # Out-of-sample dates unique() %>% # Remove duplicates as.Date(origin = "1970-01-01") # Transform in date format Tt <- length(t_oos) # Nb of dates, avoid T nb_port <- 3 # Nb of portfolios/strats. portf_weights <- array(0, dim = c(Tt, nb_port, ncol(returns) - 1)) # Initial portf. weights portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Initial portf. 
returns Next, because it is the purpose of this section, we isolate the computation of the weights of sparse-hedging portfolios. In the case of minimum variance portfolios, when \\(\\boldsymbol{\\mu}=\\boldsymbol{1}\\), the weight in asset 1 will simply be the sum of all terms in Equation (5.13) and the other weights have similar forms. weights_sparsehedge <- function(returns, alpha, lambda){ # The parameters are defined here w <- 0 # Initiate weights for(i in 1:ncol(returns)){ # Loop on the assets y <- returns[,i] # Dependent variable x <- returns[,-i] # Independent variable fit <- glmnet(x,y, family = "gaussian", alpha = alpha, lambda = lambda) err <- y-predict(fit, x) # Prediction errors w[i] <- (1-sum(fit$beta))/var(err) # Output: weight of asset i } return(w / sum(w)) # Normalisation of weights } In order to benchmark our strategy, we define a meta-weighting function that embeds three strategies: (1) the EW benchmark, (2) the classical GMV and (3) the sparse-hedging minimum variance. For the GMV, since there are much more assets than dates, the covariance matrix is singular. Thus, we have a small heuristic shrinkage term. For a more rigorous treatment of this technique, we refer to the original article Ledoit and Wolf (2004) and to the recent improvements mentioned in Ledoit and Wolf (2017). In short, we use \\(\\hat{\\boldsymbol{\\Sigma}}=\\boldsymbol{\\Sigma}_S+\\delta \\boldsymbol{I}\\) for some small constant \\(\\delta\\) (equal to 0.01 in the code below). weights_multi <- function(returns,j, alpha, lambda){ N <- ncol(returns) if(j == 1){ # j = 1 => EW return(rep(1/N,N)) } if(j == 2){ # j = 2 => Minimum Variance sigma <- cov(returns) + 0.01 * diag(N) # Covariance matrix + regularizing term w <- solve(sigma) %*% rep(1,N) # Inverse & multiply return(w / sum(w)) # Normalize } if(j == 3){ # j = 3 => Penalised / elasticnet w <- weights_sparsehedge(returns, alpha, lambda) } } Finally, we proceed to the backtesting loop. Given the number of assets, the execution of the loop takes a few minutes. At the end of the loop, we compute the standard deviation of portfolio returns (monthly volatility). This is the key indicator as minimum variance seeks to minimize this particular metric. for(t in 1:length(t_oos)){ # Loop = rebal. dates temp_data <- returns %>% # Data for weights filter(date < t_oos[t]) %>% # Expand. window dplyr::select(-date) %>% as.matrix() realised_returns <- returns %>% # OOS returns filter(date == t_oos[t]) %>% dplyr::select(-date) for(j in 1:nb_port){ # Loop over strats portf_weights[t,j,] <- weights_multi(temp_data, j, 0.1, 0.1) # Hard-coded params! portf_returns[t,j] <- sum(portf_weights[t,j,] * realised_returns) # Portf. returns } } colnames(portf_returns) <- c("EW", "MV", "Sparse") # Colnames apply(portf_returns, 2, sd) # Portfolio volatilities (monthly scale) ## EW MV Sparse ## 0.04180422 0.03350424 0.02672169 The aim of the sparse hedging restrictions is to provide a better estimate of the covariance structure of assets so that the estimation of minimum variance portfolio weights is more accurate. From the above exercise, we see that the monthly volatility is indeed reduced when building covariance matrices based on sparse hedging relationships. This is not the case if we use the shrunk sample covariance matrix because there is probably too much noise in the estimates of correlations between assets. Working with daily returns would likely improve the quality of the estimates. 
But the above backtest shows that the penalized methodology performs well even when the number of observations (dates) is small compared to the number of assets. 5.3 Predictive regressions 5.3.1 Literature review and principle The topic of predictive regressions sits on a collection of very interesting articles. One influential contribution is Stambaugh (1999), where the author shows the perils of regressions in which the independent variables are autocorrelated. In this case, the usual OLS estimate is biased and must therefore be corrected. The results have since then been extended in numerous directions (see Campbell and Yogo (2006) and Hjalmarsson (2011), the survey in Gonzalo and Pitarakis (2018) and, more recently, the study of Xu (2020) on predictability over multiple horizons). A second important topic pertains to the time-dependence of the coefficients in predictive regressions. One contribution in this direction is Dangl and Halling (2012), where coefficients are estimated via a Bayesian procedure. More recently Kelly, Pruitt, and Su (2019) use time-dependent factor loadings to model the cross-section of stock returns. The time-varying nature of coefficients of predictive regressions is further documented by Henkel, Martin, and Nardari (2011) for short term returns. Lastly, Farmer, Schmidt, and Timmermann (2019) introduce the concept of pockets of predictability: assets or markets experience different phases; in some stages, they are predictable and in some others, they aren’t. Pockets are measured both by the number of days that a t-statistic is above a particular threshold and by the magnitude of the \\(R^2\\) over the considered period. Formal statistical tests are developed by Demetrescu et al. (2020). The introduction of penalization within predictive regressions goes back at least to Rapach, Strauss, and Zhou (2013), where they are used to assess lead-lag relationships between US markets and other international stock exchanges. More recently, Alexander Chinco, Clark-Joseph, and Ye (2019) use LASSO regressions to forecast high frequency returns based on past returns (in the cross-section) at various horizons. They report statistically significant gains. Han et al. (2019) and Rapach and Zhou (2019) use LASSO and elasticnet regressions (respectively) to improve forecast combinations and single out the characteristics that matter when explaining stock returns. These contributions underline the relevance of the overlap between predictive regressions and penalized regressions. In simple machine-learning based asset pricing, we often seek to build models such as that of Equation (3.6). If we stick to a linear relationship and add penalization terms, then the model becomes: \\[r_{t+1,n} = \\alpha_n + \\sum_{k=1}^K\\beta_n^kf^k_{t,n}+\\epsilon_{t+1,n}, \\quad \\text{s.t.} \\quad (1-\\alpha)\\sum_{j=1}^J |\\beta_j| +\\alpha\\sum_{j=1}^J \\beta_j^2< \\theta\\] where we use \\(f^k_{t,n}\\) or \\(x_{t,n}^k\\) interchangeably and \\(\\theta\\) is some penalization intensity. Again, one of the aims of the regularization is to generate more robust estimates. If the patterns extracted hold out of sample, then \\[\\hat{r}_{t+1,n} = \\hat{\\alpha}_n + \\sum_{k=1}^K\\hat{\\beta}_n^kf^k_{t,n},\\] will be a relatively reliable proxy of future performance. 5.3.2 Code and results Given the form of our dataset, implementing penalized predictive regressions is easy. 
y_penalized_train <- training_sample$R1M_Usd # Dependent variable x_penalized_train <- training_sample %>% # Predictors dplyr::select(all_of(features)) %>% as.matrix() fit_pen_pred <- glmnet(x_penalized_train, y_penalized_train, # Model alpha = 0.1, lambda = 0.1) We then report two key performance measures: the mean squared error and the hit ratio, which is the proportion of times that the prediction guesses the sign of the return correctly. A detailed account of metrics is given later in the book (Chapter 12). x_penalized_test <- testing_sample %>% # Predictors dplyr::select(all_of(features)) %>% as.matrix() mean((predict(fit_pen_pred, x_penalized_test) - testing_sample$R1M_Usd)^2) # MSE ## [1] 0.03699696 mean(predict(fit_pen_pred, x_penalized_test) * testing_sample$R1M_Usd > 0) # Hit ratio ## [1] 0.5460346 From an investor’s standpoint, the MSEs (or even the mean absolute error) are hard to interpret because it is complicated to map them mentally into some intuitive financial indicator. In this perspective, the hit ratio is more natural. It tells the proportion of correct signs achieved by the predictions. If the investor is long in positive signals and short in negative ones, the hit ratio indicates the proportion of ‘correct’ bets (the positions that go in the expected direction). A natural threshold is 50% but because of transaction costs, 51% of accurate forecasts probably won’t be profitable. The figure 0.546 can be deemed a relatively good hit ratio, though not a very impressive one. 5.4 Coding exercise On the test sample, evaluate the impact of the two elastic net parameters on out-of-sample accuracy. References "],["trees.html", "Chapter 6 Tree-based methods 6.1 Simple trees 6.2 Random forests 6.3 Boosted trees: Adaboost 6.4 Boosted trees: extreme gradient boosting 6.5 Discussion 6.6 Coding exercises", " Chapter 6 Tree-based methods Classification and regression trees are simple yet powerful clustering algorithms popularized by the monograph of Breiman et al. (1984). Decision trees and their extensions are known to be quite efficient forecasting tools when working on tabular data. A large proportion of winning solutions in ML contests (especially on the Kaggle website13) resort to improvements of simple trees. For instance, the meta-study in bioinformatics by Olson et al. (2018) finds that boosted trees and random forests are the top 2 algorithms from a group of 13, excluding neural networks. Recently, the surge in Machine Learning applications in Finance has led to multiple publications that use trees in portfolio allocation problems. A long, though not exhaustive, list includes: Ballings et al. (2015), Patel, Shah, Thakkar, and Kotecha (2015a), Patel, Shah, Thakkar, and Kotecha (2015b), Moritz and Zimmermann (2016), Krauss, Do, and Huck (2017), Gu, Kelly, and Xiu (2020b), Guida and Coqueret (2018a), Coqueret and Guida (2020) and Simonian et al. (2019). One notable contribution is Bryzgalova, Pelger, and Zhu (2019) in which the authors create factors from trees by sorting portfolios via simple trees, which they call Asset Pricing Trees. In this chapter, we review the methodologies associated to trees and their applications in portfolio choice. 6.1 Simple trees 6.1.1 Principle Decision trees seek to partition datasets into homogeneous clusters. Given an exogenous variable \\(\\mathbf{Y}\\) and features \\(\\mathbf{X}\\), trees iteratively split the sample into groups (usually two at a time) which are as homogeneous in \\(\\mathbf{Y}\\) as possible. 
The splits are made according to one variable within the set of features. A short word on nomenclature: when \\(\\mathbf{Y}\\) consists of real numbers, we talk about regression trees and when \\(\\mathbf{Y}\\) is categorical, we use the term classification trees. Before formalizing this idea, we illustrate this process in Figure 6.1. There are 12 stars with three features: color, size and complexity (number of branches). FIGURE 6.1: Elementary tree scheme; visualization of the splitting process. The dependent variable is the color (let’s consider the wavelength associated to the color for simplicity). The first split is made according to size or complexity. Clearly, complexity is the better choice: complicated stars are blue and green, while simple stars are yellow, orange and red. Splitting according to size would have mixed blue and yellow stars (small ones) and green and orange stars (large ones). The second step is to split the two clusters one level further. Since only one variable (size) is relevant, the secondary splits are straightforward. In the end, our stylized tree has four consistent clusters. The analogy with factor investing is simple: the color represents performance: red for high performance and blue for mediocre performance. The features (size and complexity of stars) are replaced by firm-specific attributes, such as capitalization, accounting ratios, etc. Hence, the purpose of the exercise is to find the characteristics that allow to split firms into the ones that will perform well versus those likely to fare more poorly. We now turn to the technical construction of regression trees (splitting process). We follow the standard literature as exposed in Breiman et al. (1984) or in chapter 9 of Hastie, Tibshirani, and Friedman (2009). Given a sample of (\\(y_i\\),\\(\\mathbf{x}_i\\)) of size \\(I\\), a regression tree seeks the splitting points that minimize the total variation of the \\(y_i\\) inside the two child clusters. These two clusters need not have the same size. In order to do that, it proceeds in two steps. First, it finds, for each feature \\(x_i^{(k)}\\), the best splitting point (so that the clusters are homogeneous in \\(\\mathbf{Y}\\)). Second, it selects the feature that achieves the highest level of homogeneity. Homogeneity in regression trees is closely linked to variance. Since we want the \\(y_i\\) inside each cluster to be similar, we seek to minimize their variability (or dispersion) inside each cluster and then sum the two figures. We cannot sum the variances because this would not take into account the relative sizes of clusters. Hence, we work with total variation, which is the variance times the number of elements in the clusters. Below, the notation is a bit heavy because we resort to superscripts \\(k\\) (the index of the feature), but it is largely possible to ignore these superscripts to ease understanding. 
The first step is to find the best split for each feature, that is, solve \\(\\underset{c^{(k)}}{\\text{argmin}} \\ V^{(k)}_I(c^{(k)})\\) with \\[\\begin{equation} \\tag{6.1} V^{(k)}_I(c^{(k)})= \\underbrace{\\sum_{x_i^{(k)}<c^{(k)}}\\left(y_i-m_I^{k,-}(c^{(k)}) \\right)^2}_{\\text{Total dispersion of first cluster}} + \\underbrace{\\sum_{x_i^{(k)}>c^{(k)}}\\left(y_i-m_I^{k,+}(c^{(k)}) \\right)^2}_{\\text{Total dispersion of second cluster}}, \\end{equation}\\] where \\[\\begin{align*} m_I^{k,-}(c^{(k)})&=\\frac{1}{\\#\\{i,x_i^{(k)}<c^{(k)} \\}}\\sum_{\\{x_i^{(k)}<c^{(k)} \\}}y_i \\quad \\text{ and } \\\\ m_I^{k,+}(c^{(k)})&=\\frac{1}{\\#\\{i,x_i^{(k)}>c^{(k)} \\}}\\sum_{\\{x_i^{(k)}>c^{(k)} \\}}y_i \\end{align*}\\] are the average values of \\(Y\\), conditional on \\(X^{(k)}\\) being smaller or larger than \\(c^{(k)}\\). The cardinal function \\(\\#\\{\\cdot\\}\\) counts the number of instances of its argument. For feature \\(k\\), the optimal split \\(c^{k,*}\\) is thus the one for which the total dispersion over the two subgroups is the smallest: \\(c^{k,*}= \\underset{c^{(k)}}{\\text{argmin}} \\ V^{(k)}_I(c^{(k)})\\). Of all the possible splitting variables, the tree will choose the one that minimizes the total dispersion not only over all splits, but also over all variables: \\(k^*=\\underset{k}{\\text{argmin}} \\ V^{(k)}_I(c^{k,*})\\). After one split is performed, the procedure continues on the two newly formed clusters. There are several criteria that can determine when to stop the splitting process (see Section 6.1.3). One simple criterion is to fix a maximum number of levels (the depth) for the tree. Another usual condition is to impose a minimum expected gain for each split: if the reduction in dispersion after the split is only marginal and below a specified threshold, then the split is not executed. For further technical discussions on decision trees, we refer for instance to section 9.2.4 of Hastie, Tibshirani, and Friedman (2009). When the tree is built (trained), a prediction for new instances is easy to make. Given its feature values, the instance ends up in one leaf of the tree. Each leaf has an average value for the label: this is the predicted outcome. Of course, this only works when the label is numerical. We discuss below the changes that occur when it is categorical. 6.1.2 Further details on classification Classification exercises are somewhat more complex than regression tasks. The most obvious difference is the measure of dispersion or heterogeneity: the loss function must take into account the fact that the final output is not a simple number, but a vector. The output \\(\\tilde{\\textbf{y}}_i\\) has as many elements as there are categories in the label and each element is the probability that the instance belongs to the corresponding category. For instance, if there are 3 categories (buy, hold and sell), then each instance has a label with as many columns as there are classes. Following our example, the label of a buy position would be (1,0,0). We refer to Section 4.5.2 for an introduction on this topic. Inside a tree, labels are aggregated at each cluster level. A typical output would look like (0.6,0.1,0.3): these are the proportions of each class represented within the cluster. In this case, the cluster has 60% of buy, 10% of hold and 30% of sell. The loss function must take into account this multidimensionality of the label.
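As a toy illustration (the label vector below is made up for the example and does not come from the book's dataset), the class proportions attached to a cluster are obtained by simple counting, and the loss functions presented in the next paragraph are direct transformations of these proportions:
cluster_labels <- c("buy", "buy", "hold", "sell", "buy", "buy", "hold", "buy", "sell", "buy") # Hypothetical cluster content
p <- table(cluster_labels) / length(cluster_labels)  # Class proportions within the cluster
p                                                    # Here: 60% buy, 20% hold, 20% sell
1 - sum(p^2)                                         # Gini impurity (defined just below)
1 - max(p)                                           # Misclassification error (defined just below)
-sum(p * log(p))                                     # Entropy (defined just below)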
When building trees, since the aim is to favor homogeneity, the loss penalizes outputs that are not concentrated towards one class. Indeed, facing a diversified output of (0.3,0.4,0.3) is much harder to handle than the concentrated case of (0.8,0.1,0.1). The algorithm is thus seeking purity: it searches for a splitting criterion that will lead to clusters that are as pure as possible, i.e., with one very dominant class, or at least just a few dominant classes. There are several metrics proposed by the literature and all are based on the proportions generated by the output. If there are \\(J\\) classes, we denote these proportions with \\(p_j\\). For each leaf, the usual loss functions are: the Gini impurity index: \\(1-\\sum_{j=1}^Jp_j^2;\\) the misclassification error: \\(1-\\underset{j}{\\text{max}}\\, p_j;\\) entropy: \\(-\\sum_{j=1}^J\\log(p_j)p_j.\\) The Gini index is nothing but one minus the Herfindahl index which measures the diversification of a portfolio. Trees seek partitions that are the least diversified. The minimum value of the Gini index is zero and is reached when one \\(p_j=1\\) and all others are equal to zero. The maximum value is equal to \\(1-1/J\\) and is reached when all \\(p_j=1/J\\). Similar relationships hold for the other two losses. One drawback of the misclassification error is its lack of differentiability, which explains why the other two options are often favored. Once the tree is grown, new instances automatically belong to one final leaf. This leaf is associated with the proportions of classes it nests. Usually, to make a prediction, the class with the highest proportion (or probability) is chosen when a new instance is associated with the leaf. 6.1.3 Pruning criteria When building a tree, the splitting process can be pursued until the full tree is grown, that is, when: all instances belong to separate leaves, and/or all leaves comprise instances that cannot be further segregated based on the current set of features. At this stage, the splitting process cannot be pursued. Obviously, fully grown trees often lead to almost perfect fits when the predictors are relevant, numerous and numerical. Nonetheless, the fine-grained idiosyncrasies of the training sample are of little interest for out-of-sample predictions. For instance, being able to perfectly match the patterns of 2000 to 2006 will probably not be very interesting in the period from 2007 to 2009. The most reliable sections of the trees are those closest to the root because they embed large portions of the data: the average values in the early clusters are trustworthy because they are computed on a large number of observations. The first splits are those that matter the most because they determine the most general patterns. The deepest splits only deal with the peculiarities of the sample. Thus, it is imperative to limit the size of the tree to avoid overfitting. There are several ways to prune the tree and all depend on some particular criteria. We list a few of them below: Impose a minimum number of instances for each terminal node (leaf). This ensures that each final cluster is composed of a sufficient number of observations. Hence, the average value of the label will be reliable because it is calculated on a large amount of data. Similarly, it can be imposed that a cluster has a minimal size before even considering any further split. This criterion is of course related to the one above. Require a certain threshold of improvement in the fit.
If a split does not sufficiently reduce the loss, then it can be deemed unnecessary. The user specifies a small number \\(\\epsilon>0\\) and a split is only validated if the loss obtained post-split is smaller than \\(1-\\epsilon\\) times the loss before the split. Limit the depth of the tree. The depth is defined as the overall maximum number of splits between the root and any leaf of the tree. In the example below, we implement all of these criteria at the same time, but usually, two of them at most should suffice. 6.1.4 Code and interpretation We start with a simple tree and its interpretation. We use the package rpart and its plotting engine rpart.plot. The label is the future 1-month return and the features are all predictors available in the sample. The tree is trained on the full sample. ## Loading required package: rpart library(rpart) # Tree package library(rpart.plot) # Tree plot package formula <- paste("R1M_Usd ~", paste(features, collapse = " + ")) # Defines the model formula <- as.formula(formula) # Forcing formula object fit_tree <- rpart(formula, data = data_ml, # Data source: full sample minbucket = 3500, # Min nb of obs required in each terminal node (leaf) minsplit = 8000, # Min nb of obs required to continue splitting cp = 0.0001, # Precision: smaller = more leaves maxdepth = 3 # Maximum depth (i.e. tree levels) ) rpart.plot(fit_tree) # Plot the tree FIGURE 6.2: Simple characteristics-based tree. The dependent variable is the 1-month future return. There usually exists a convention in the representation of trees. At each node, a condition describes the split with a Boolean expression. If the expression is true, then the instance goes to the left cluster; if not, it goes to the right cluster. Given the whole sample, the initial split in this tree (Figure 6.2) is performed according to the price-to-book ratio. If the Pb score (or value) of the instance is above 0.025, then the instance is placed in the left bucket; otherwise, it goes in the right bucket. At each node, there are two important metrics. The first one is the average value of the label in the cluster, and the second one is the proportion of instances in the cluster. At the top of the tree, all instances (100%) are present and the average 1-month future return is 1.3%. One level below, the left cluster is by far the most crowded, with roughly 98% of observations averaging a 1.2% return. The right cluster is much smaller (2%) but concentrates instances with a much higher average return (5.9%). This is possibly an idiosyncrasy of the sample. The splitting process continues similarly at each node until some condition is satisfied (typically here: the maximum depth is reached). A color codes the average return: from white (low return) to blue (high return). The leftmost cluster with the lowest average return consists of firms that satisfy all the following criteria: have a Pb score above 0.025; have a 3-month market capitalization score above 0.16; have a score of average daily volume over the past 3 months above 0.85. Notice that one peculiarity of trees is their possible heterogeneity in cluster sizes. Sometimes, a few clusters gather almost all of the observations while a few small groups embed some outliers. This is not a favorable property of trees, as small groups are more likely to be flukes and may fail to generalize out-of-sample. This is why we imposed restrictions during the construction of the tree.
The first one (minbucket = 3500 in the code) imposes that each cluster consists of at least 3500 instances. The second one (minsplit) further imposes that a cluster comprises at least 8000 observations in order to pursue the splitting process. These values logically depend on the size of the training sample. The cp = 0.0001 parameter in the code requires any split to reduce the loss below 0.9999 times its original value before the split. Finally, the maximum depth of three essentially means that there are at most three splits between the root of the tree and any terminal leaf. The complexity of the tree (measured by the number of terminal leaves) is a decreasing function of minbucket, minsplit and cp and an increasing function of maximum depth. Once the model has been trained (i.e., the tree is grown), a prediction for any instance is the average value of the label within the cluster where the instance should land. predict(fit_tree, data_ml[1:6,]) # Test (prediction) on the first six instances of the sample ## 1 2 3 4 5 6 ## 0.01088066 0.01088066 0.01088066 0.01088066 0.01088066 0.01088066 Given the figure, we immediately conclude that these first six instances all belong to the second cluster (starting from the left). As a verification of the first splits, we plot the smoothed average of future returns, conditionally on market capitalization, price-to-book ratio and trading volume. data_ml %>% ggplot() + stat_smooth(aes(x = Mkt_Cap_3M_Usd, y = R1M_Usd, color = "Market Cap"), se = FALSE) + stat_smooth(aes(x = Pb, y = R1M_Usd, color = "Price-to-Book"), se = FALSE) + stat_smooth(aes(x = Advt_3M_Usd, y = R1M_Usd, color = "Volume"), se = FALSE) + xlab("Predictor") + coord_fixed(11) + labs(color = "Characteristic") FIGURE 6.3: Average of 1-month future returns, conditionally on market capitalization, price-to-book and volatility scores. The graph shows the relevance of clusters based on market capitalizations and price-to-book ratios. For low score values of these two features, the average return is high (close to +4% on a monthly basis on the left of the curves). The pattern is more pronounced compared to volume for instance. Finally, we assess the predictive quality of a single tree on the testing set (the tree is grown on the training set). We use a deeper tree, with a maximum depth of five. fit_tree2 <- rpart(formula, data = training_sample, # Data source: training sample minbucket = 1500, # Min nb of obs required in each terminal node (leaf) minsplit = 4000, # Min nb of obs required to continue splitting cp = 0.0001, # Precision: smaller cp = more leaves maxdepth = 5 # Maximum depth (i.e. tree levels) ) mean((predict(fit_tree2, testing_sample) - testing_sample$R1M_Usd)^2) # MSE ## [1] 0.03700039 mean(predict(fit_tree2, testing_sample) * testing_sample$R1M_Usd > 0) # Hit ratio ## [1] 0.5416619 The mean squared error is usually hard to interpret. It’s not easy to map an error on returns into the impact on investment decisions. The hit ratio is a more intuitive indicator because it evaluates the proportion of correct guesses (and hence profitable investments). Obviously, it is not perfect: 55% of small gains can be mitigated by 45% of large losses. Nonetheless, it is a popular metric and moreover it corresponds to the usual accuracy measure often computed in binary classification exercises. Here, an accuracy of 0.542 is satisfactory. Even if any number above 50% may seem valuable, it must not be forgotten that transaction costs will curtail benefits. 
Hence, the benchmark threshold is probably at least at 52%. 6.2 Random forests While trees give intuitive representations of relationships between \\(\\mathbf{Y}\\) and \\(\\mathbf{X}\\), they can be improved via the simple idea of ensembles in which predicting tools are combined (this topic of model aggregation is discussed both more generally and in more detail in Chapter 11). 6.2.1 Principle Most of the time, when having several modelling options at hand, it is not obvious upfront which individual model is the best; hence, a combination seems a reasonable path towards the diversification of prediction errors (when they are not too correlated). Some theoretical foundations of model diversification were laid out in Schapire (1990). More practical considerations were proposed later in Ho (1995) and more importantly in Breiman (2001), which is the major reference for random forests. Bagging is successfully used in Yin (2020) to aggregate equity forecasts. There are two ways to create multiple predictors from simple trees, and random forests combine both: first, the model can be trained on similar yet different datasets. One way to achieve this is via bootstrap: the instances are resampled with or without replacement (for each individual tree), yielding new training data each time a new tree is built. Second, the data can be altered by curtailing the number of predictors. Alternative models are built based on different sets of features. The user chooses how many features to retain and then the algorithm selects these features randomly at each try. Hence, it becomes simple to grow many different trees and the ensemble is simply a weighted combination of all trees. Usually, equal weights are used, which is an agnostic and robust choice. We illustrate the idea of simple combinations (also referred to as bagging) in Figure 6.4 below. The terminal prediction is simply the mean of all intermediate predictions. FIGURE 6.4: Combining tree outputs via random forests. Random forests, because they are built on the idea of bootstrapping, are more efficient than simple trees. They are used by Ballings et al. (2015), Patel, Shah, Thakkar, and Kotecha (2015a), Krauss, Do, and Huck (2017), and Huck (2019) and they are shown to perform very well in these papers. The original theoretical properties of random forests are demonstrated in Breiman (2001) for classification trees. In classification exercises, the decision is taken by a vote: each tree votes for a particular class and the class with the most votes wins (with possible random picks in case of ties). Breiman (2001) defines the margin function as \\[mg=M^{-1}\\sum_{m=1}^M1_{\\{h_m(\\textbf{x})=y\\}}-\\max_{j\\neq y}\\left(M^{-1}\\sum_{m=1}^M1_{\\{h_m(\\textbf{x})=j\\}}\\right),\\] where the left part is the average number of votes, based on the \\(M\\) trees \\(h_m\\), for the correct class (i.e., the cases in which the learners \\(h_m\\) applied to \\(\\textbf{x}\\) match the true value \\(y\\)). The right part is the maximum average for any other class. The margin reflects the confidence that the aggregate forest will classify properly. The generalization error is the probability that \\(mg\\) is strictly negative. Breiman (2001) shows that the inaccuracy of the aggregation (as measured by generalization error) is bounded by \\(\\bar{\\rho}(1-s^2)/s^2\\), where \\(s\\) is the strength (average quality14) of the individual classifiers and \\(\\bar{\\rho}\\) is the average correlation between the learners.
Notably, Breiman (2001) also shows that as the number of trees grows to infinity, the inaccuracy converges to some finite number, which explains why random forests are not prone to overfitting. While the original paper of Breiman (2001) is dedicated to classification models, many articles have since then tackled the problem of regression trees. We refer the interested reader to Biau (2012) and Scornet et al. (2015). Finally, further results on classification ensembles can be obtained in Biau, Devroye, and Lugosi (2008) and we mention the short survey paper by Denil, Matheson, and De Freitas (2014), which sums up recent results in this field. 6.2.2 Code and results Several implementations of random forests exist. For simplicity, we choose to work with the original R library, but another choice could be the one developed by h2o, which is a highly efficient meta-environment for machine learning (coded in Java). The syntax of randomForest follows that of many ML libraries. The full list of options for some random forest implementations is prohibitively large.15 Below, we train a model and exhibit the predictions for the first 5 instances of the testing sample. library(randomForest) set.seed(42) # Sets the random seed fit_RF <- randomForest(formula, # Same formula as for simple trees! data = training_sample, # Data source: training sample sampsize = 10000, # Size of (random) sample for each tree replace = FALSE, # Is the sampling done with replacement? nodesize = 250, # Minimum size of terminal cluster ntree = 40, # Nb of random trees mtry = 30 # Nb of predictive variables for each tree ) predict(fit_RF, testing_sample[1:5,]) # Prediction over the first 5 test instances ## 1 2 3 4 5 ## 0.009787728 0.012507087 0.008722386 0.009398814 -0.011511758 One first comment is that each instance has its own prediction, which contrasts with the predictions obtained from a single tree. Combining many trees leads to tailored forecasts. Note that the second line of the chunk freezes the random number generation. Indeed, random forests are by construction contingent on the arbitrary combinations of instances and features that are chosen to build the individual learners. In the above example, each individual learner (tree) is built on 10,000 randomly chosen instances (without replacement) and each terminal leaf (cluster) must comprise at least 250 elements (observations). In total, 40 trees are aggregated and each tree is constructed based on 30 randomly chosen predictors (out of the whole set of features). Unlike for simple trees, it is not possible to simply illustrate the outcome of the learning process (though solutions exist, see Section 13.1.1). It could be possible to extract all 40 trees, but a synthetic visualization is out-of-reach. A simplified view can be obtained via variable importance, as is discussed in Section 13.1.2. Finally, we can assess the accuracy of the model. mean((predict(fit_RF, testing_sample) - testing_sample$R1M_Usd)^2) # MSE ## [1] 0.03698197 mean(predict(fit_RF, testing_sample) * testing_sample$R1M_Usd > 0) # Hit ratio ## [1] 0.5370186 The MSE is smaller than 4% and the hit ratio is close to 54%, which is reasonably above both the 50% and 52% thresholds. Let's see if we can improve the hit ratio by resorting to a classification exercise. We start by training the model on a new formula (the label is R1M_Usd_C).
formula_C <- paste("R1M_Usd_C ~", paste(features, collapse = " + ")) # Defines the model formula_C <- as.formula(formula_C) # Forcing formula object fit_RF_C <- randomForest(formula_C, # New formula! data = training_sample, # Data source: training sample sampsize = 20000, # Size of (random) sample for each tree replace = FALSE, # Is the sampling done with replacement? nodesize = 250, # Minimum size of terminal cluster ntree = 40, # Number of random trees mtry = 30 # Number of predictive variables for each tree ) We can then assess the proportion of correct (binary) guesses. mean(predict(fit_RF_C, testing_sample) == testing_sample$R1M_Usd_C) # Hit ratio ## [1] 0.4980629 The accuracy is disappointing. There are two potential explanations for this (beyond the possibility of very different patterns in the training and testing sets). The first one is the sample size, which may be too small. The original training set has more than 200,000 observations, hence we retain only one in 10 in the above training specification. We are thus probably sidelining relevant information and the cost can be heavy. The second reason is the number of predictors, which is set to 30, i.e., one third of the total at our disposal. Unfortunately, this leaves room for the algorithm to pick less pertinent predictors. The default numbers of predictors chosen by the routines are \\(\\sqrt{p}\\) and \\(p/3\\) for classification and regression tasks, respectively. Here \\(p\\) is the total number of features. 6.3 Boosted trees: Adaboost The idea of boosting is slightly more advanced compared to agnostic aggregation. In random forest, we hope that the diversification through many trees will improve the overall quality of the model. In boosting, it is sought to iteratively improve the model whenever a new tree is added. There are many ways to boost learning and we present two that can easily be implemented with trees. The first one (Adaboost, for adaptive boosting) improves the learning process by progressively focusing on the instances that yield the largest errors. The second one (xgboost) is a flexible algorithm in which each new tree is only focused on the minimization of the training sample loss. 6.3.1 Methodology The origins of adaboost go back to Freund and Schapire (1997) and Freund and Schapire (1996), and for the sake of completeness, we also mention the book dedicated to boosting by Schapire and Freund (2012). Extensions of these ideas are proposed in Friedman et al. (2000) (the so-called real Adaboost algorithm) and in Drucker (1997) (for regression analysis). Theoretical treatments were derived by Breiman and others (2004). We start by directly stating the general structure of the algorithm: set equal weights \\(w_i=I^{-1}\\); For \\(m=1,\\dots,M\\) do: Find a learner \\(l_m\\) that minimizes the weighted loss \\(\\sum_{i=1}^Iw_iL(l_m(\\textbf{x}_i),\\textbf{y}_i)\\); Compute a learner weight \\[\\begin{equation} \\tag{6.2} a_m=f_a(\\textbf{w},l_m(\\textbf{x}),\\textbf{y}); \\end{equation}\\] Update the instance weights \\[\\begin{equation} \\tag{6.3} w_i \\leftarrow w_ie^{f_w(l_m(\\textbf{x}_i), \\textbf{y}_i)}; \\end{equation}\\] Normalize the \\(w_i\\) to sum to one. The output for instance \\(\\textbf{x}_i\\) is a simple function of \\(\\sum_{m=1}^M a_ml_m(\\textbf{x}_i)\\), \\[\\begin{equation} \\tag{6.4} \\tilde{y}_i=f_y\\left(\\sum_{m=1}^M a_ml_m(\\textbf{x}_i) \\right). \\end{equation}\\] Let us comment on the steps of the algorithm. 
The formulation holds for many variations of Adaboost, and we will specify the functions \\(f_a\\) and \\(f_w\\) below. The first step seeks to find a learner (tree) \\(l_m\\) that minimizes a weighted loss. Here the base loss function \\(L\\) essentially depends on the task (regression versus classification). The second and third steps are the most interesting because they are the heart of Adaboost: they define the way the algorithm adapts sequentially. Because the purpose is to aggregate models, a more sophisticated approach than uniform weights across learners is to assign a tailored weight to each learner. A natural property of \\(f_a\\) is that a learner that yields a smaller error should have a larger weight, because it is more accurate. The third step is to change the weights of observations. In this case, because the model aims at improving the learning process, \\(f_w\\) is constructed to give more weight to observations for which the current model does not do a good job (i.e., generates the largest errors). Hence, the next learner will be incentivized to pay more attention to these pathological cases. The last step is a simple scaling (normalization) procedure. In Table 6.1, we detail two examples of weighting functions used in the literature. For the original Adaboost (Freund and Schapire (1996), Freund and Schapire (1997)), the label is binary with values +1 and -1 only. The second example stems from Drucker (1997) and is dedicated to regression analysis (with real-valued label). The interested reader can have a look at other possibilities in Schapire (2003) and Ridgeway, Madigan, and Richardson (1999). TABLE 6.1: Examples of functions for Adaboost-like algorithms. Individual error: \\(\\epsilon_i=\\textbf{1}_{\\left\\{y_i\\neq l_m(\\textbf{x}_i) \\right\\}}\\) for binary classification (original Adaboost), and \\(\\epsilon_i=\\frac{|y_i- l_m(\\textbf{x}_i)|}{\\underset{i}{\\max}|y_i- l_m(\\textbf{x}_i)|}\\) for regression (Drucker (1997)). Weight of learner via \\(f_a\\): in both cases, \\(f_a=\\log\\left(\\frac{1-\\epsilon}{\\epsilon} \\right)\\), with \\(\\epsilon=I^{-1}\\sum_{i=1}^Iw_i \\epsilon_i\\). Weight of instances via \\(f_w(i)\\): in both cases, \\(f_w=f_a\\epsilon_i\\). Output function via \\(f_y\\): \\(f_y(x) = \\text{sign}(x)\\) for binary classification, and the weighted median of predictions for regression. Let us comment on the original Adaboost specification. The basic error term \\(\\epsilon_i=\\textbf{1}_{\\left\\{y_i\\neq l_m(\\textbf{x}_i) \\right\\}}\\) is a dummy variable indicating whether the prediction is erroneous (we recall that only two values are possible, +1 and -1). The average error \\(\\epsilon\\in [0,1]\\) is simply a weighted average of individual errors and the weight of the \\(m^{th}\\) learner defined in Equation (6.2) is given by \\(a_m=\\log\\left(\\frac{1-\\epsilon}{\\epsilon} \\right)\\). The function \\(x\\mapsto \\log((1-x)x^{-1})\\) decreases on \\((0,1)\\) and switches sign (from positive to negative) at \\(x=1/2\\). Hence, when the average error is small, the learner has a large positive weight, but when the error becomes large, the learner can even obtain a negative weight. Indeed, the threshold \\(\\epsilon>1/2\\) indicates that the learner is wrong more than 50% of the time. Obviously, this indicates a problem and the learner should be discarded altogether. The change in instance weights follows a similar logic. The new weight is proportional to \\(w_i\\left(\\frac{1-\\epsilon}{\\epsilon} \\right)^{\\epsilon_i}\\).
If the prediction is right and \\(\\epsilon_i=0\\), the weight is unchanged. If the prediction is wrong and \\(\\epsilon_i=1\\), the weight is adjusted depending on the aggregate error \\(\\epsilon\\). If the error is small and the learner efficient (\\(\\epsilon<1/2\\)), then \\((1-\\epsilon)/\\epsilon>1\\) and the weight of the instance increases. This means that for the next round, the learner will have to focus more on instance \\(i\\). Lastly, the final prediction of the model corresponds to the sign of the weighted sums of individual predictions: if the sum is positive, the model will predict +1 and it will yield -1 otherwise.16 The odds of a zero sum are negligible. In the case of numerical labels, the process is slightly more complicated and we refer to Section 3, step 8 of Drucker (1997) for more details on how to proceed. We end this presentation with one word on instance weighting. There are two ways to deal with this topic. The first one works at the level of the loss functions. For regression trees, Equation (6.1) would naturally generalize to \\[V^{(k)}_N(c^{(k)}, \\textbf{w})= \\sum_{x_i^{(k)}<c^{(k)}}w_i\\left(y_i-m_N^{k,-}(c^{(k)}) \\right)^2 + \\sum_{x_i^{(k)}>c^{(k)}}w_i\\left(y_i-m_N^{k,+}(c^{(k)}) \\right)^2,\\] and hence an instance with a large weight \\(w_i\\) would contribute more to the dispersion of its cluster. For classification objectives, the alteration is more complex and we refer to Ting (2002) for one example of an instance-weighted tree-growing algorithm. The idea is closely linked to the alteration of the misclassification risk via a loss matrix (see Section 9.2.4 in Hastie, Tibshirani, and Friedman (2009)). The second way to enforce instance weighting is via random sampling. If instances have weights \\(w_i\\), then the training of learners can be performed over a sample that is randomly extracted with distribution equal to \\(w_i\\). In this case, an instance with a larger weight will have more chances to be represented in the training sample. The original adaboost algorithm relies on this method. 6.3.2 Illustration Below, we test an implementation of the original Adaboost classifier. As such, we work with the R1M_Usd_C variable and change the model formula. The computational cost of Adaboost is high on large datasets, thus we work with a smaller sample and we only impose three iterations. library(fastAdaboost) # Adaboost package subsample <- (1:52000)*4 # Target small sample fit_adaboost_C <- adaboost(formula_C, # Model spec. data = data.frame(training_sample[subsample,]), # Data source nIter = 3) # Number of trees Finally, we evaluate the performance of the classifier. mean(testing_sample$R1M_Usd_C == predict(fit_adaboost_C, testing_sample)$class) ## [1] 0.5028202 The accuracy (as evaluated by the hit ratio) is clearly not satisfactory. One reason for this may be the restrictions we enforced for the training (smaller sample and only three trees). 6.4 Boosted trees: extreme gradient boosting The ideas behind tree boosting were popularized, among others, by Mason et al. (2000), Friedman (2001), and Friedman (2002). In this case, the combination of learners (prediction tools) is not agnostic as in random forest, but adapted (or optimized) at the learner level. At each step \\(s\\), the sum of models \\(M_S=\\sum_{s=1}^{S-1}m_s+m_S\\) is such that the last learner \\(m_S\\) was precisely designed to reduce the loss of \\(M_S\\) on the training sample. Below, we follow closely the original work of T. 
Chen and Guestrin (2016) because their algorithm yields incredibly accurate predictions and also because it is highly customizable. It is their implementation that we use in our empirical section. The other popular alternative is lightgbm (see Ke et al. (2017)). What XGBoost seeks to minimize is the objective \\[O=\\underbrace{\\sum_{i=1}^I \\text{loss}(y_i,\\tilde{y}_i)}_{\\text{error term}} \\quad + \\underbrace{\\sum_{j=1}^J\\Omega(T_j)}_{\\text{regularization term}}.\\] The first term (over all instances) measures the distance between the true label and the output from the model. The second term (over all trees) penalizes models that are too complex. For simplicity, we propose the full derivation with the simplest loss function \\(\\text{loss}(y,\\tilde{y})=(y-\\tilde{y})^2\\), so that: \\[O=\\sum_{i=1}^I \\left(y_i-m_{J-1}(\\mathbf{x}_i)-T_J(\\mathbf{x}_i)\\right)^2+ \\sum_{j=1}^J\\Omega(T_j).\\] 6.4.1 Managing loss Let us assume that we have already built all trees \\(T_{j}\\) up to \\(j=1,\\dots,J-1\\) (and hence model \\(M_{J-1}\\)): how to choose tree \\(T_J\\) optimally? We rewrite \\[\\begin{align*} O&=\\sum_{i=1}^I \\left(y_i-m_{J-1}(\\mathbf{x}_i)-T_J(\\mathbf{x}_i)\\right)^2+ \\sum_{j=1}^J\\Omega(T_j) \\\\ &=\\sum_{i=1}^I\\left\\{y_i^2+m_{J-1}(\\mathbf{x}_i)^2+T_J(\\mathbf{x}_i)^2 \\right\\} + \\sum_{j=1}^{J-1}\\Omega(T_j)+\\Omega(T_J) \\quad \\text{(squared terms + penalization)}\\\\ & \\quad -2 \\sum_{i=1}^I\\left\\{y_im_{J-1}(\\mathbf{x}_i)+y_iT_J(\\mathbf{x}_i)-m_{J-1}(\\mathbf{x}_i) T_J(\\mathbf{x}_i))\\right\\}\\quad \\text{(cross terms)} \\\\ &= \\sum_{i=1}^I\\left\\{-2 y_iT_J(\\mathbf{x}_i)+2m_{J-1}(\\mathbf{x}_i) T_J(\\mathbf{x}_i))+T_J(\\mathbf{x}_i)^2 \\right\\} +\\Omega(T_J) + c \\end{align*}\\] All terms known at step \\(J\\) (i.e., indexed by \\(J-1\\)) vanish because they do not enter the optimization scheme. They are embedded in the constant \\(c\\). Things are fairly simple with quadratic loss. For more complicated loss functions, Taylor expansions are used (see the original paper). 6.4.2 Penalization In order to go any further, we need to specify the way the penalization works. For a given tree \\(T\\), we specify its structure by \\(T(x)=w_{q(x)}\\), where \\(w\\) is the output value of some leaf and \\(q(\\cdot)\\) is the function that maps an input to its final leaf. This encoding is illustrated in Figure 6.5. The function \\(q\\) indicates the path, while the vector \\(\\textbf{w}=w_i\\) codes the terminal leaf values. FIGURE 6.5: Coding a decision tree: decomposition between structure and node and leaf values. We write \\(l=1,\\dots,L\\) for the indices of the leaves of the tree. In XGBoost, complexity is defined as: \\[\\Omega(T)=\\gamma L+\\frac{\\lambda}{2}\\sum_{l=1}^Lw_l^2,\\] where the first term penalizes the total number of leaves; the second term penalizes the magnitude of output values (this helps reduce variance). The first penalization term reduces the depth of the tree, while the second shrinks the size of the adjustments that will come from the latest tree. 6.4.3 Aggregation We aggregate both sections of the objective (loss and penalization). We write \\(I_l\\) for the set of the indices of the instances belonging to leaf \\(l\\). 
Then, \\[\\begin{align*} O&= 2\\sum_{i=1}^I\\left\\{ -y_iT_J(\\mathbf{x}_i)+m_{J-1}(\\mathbf{x}_i) T_J(\\mathbf{x}_i))+\\frac{T_J(\\mathbf{x}_i)^2}{2} \\right\\} + \\gamma L+\\frac{\\lambda}{2}\\sum_{l=1}^Lw_l^2 \\\\ &=2\\sum_{i=1}^I\\left\\{- y_iw_{q(\\mathbf{x}_i)}+m_{J-1}(\\mathbf{x}_i)w_{q(\\mathbf{x}_i)})+\\frac{w_{q(\\mathbf{x}_i)}^2}{2} \\right\\} + \\gamma L+\\frac{\\lambda}{2}\\sum_{l=1}^Lw_l^2 \\\\ &=2 \\sum_{l=1}^L \\left(w_l\\sum_{i\\in I_l}(-y_i +m_{J-1}(\\mathbf{x}_i))+ \\frac{w_l^2}{2}\\sum_{i\\in I_l}\\left(1+\\frac{\\lambda}{2}\\right)\\right)+ \\gamma L \\end{align*}\\] The function is of the form \\(aw_l+\\frac{b}{2}w_l^2\\), which has minimum values \\(-\\frac{a^2}{2b}\\) at point \\(w_l=-a/b\\). Thus, writing #(.) for the cardinal function that counts the number of items in a set, \\[\\begin{align} \\tag{6.5} \\mathbf{\\rightarrow} \\quad w^*_l&=\\frac{\\sum_{i\\in I_l}(y_i -m_{J-1}(\\mathbf{x}_i))}{\\left(1+\\frac{\\lambda}{2}\\right)\\#\\{i\\in I_l\\}}, \\text{ so that} \\\\ O_L(q)&=-\\frac{1}{2}\\sum_{l=1}^L \\frac{\\left(\\sum_{i\\in I_l}(y_i -m_{J-1}(\\mathbf{x}_i))\\right)^2}{\\left(1+\\frac{\\lambda}{2}\\right)\\#\\{i\\in I_l\\}}+\\gamma L, \\nonumber \\end{align}\\] where we added the dependence of the objective both in \\(q\\) (structure of tree) and \\(L\\) (number of leaves). Indeed, the meta-shape of the tree remains to be determined. 6.4.4 Tree structure Final problem: the tree structure! Let us take a step back. In the construction of a simple regression tree, the output value at each node is equal to the average value of the label within the node (or cluster). When adding a new tree in order to reduce the loss, the node values must be computed completely differently, which is the purpose of Equation (6.5). Nonetheless, the growing of the iterative trees follows similar lines as simple trees. Features must be tested in order to pick the one that minimizes the objective for each given split. The final question is then: what’s the best depth and when to stop growing the tree? The method is to proceed node-by-node; for each node, look at whether a split is useful (in terms of objective) or not: \\[\\text{Gain}=\\frac{1}{2}\\left(\\text{Gain}_L+\\text{Gain}_R-\\text{Gain}_O \\right)-\\gamma\\] each gain is computed with respect to the instances in each bucket (cluster): \\[\\text{Gain}_\\mathcal{X}= \\frac{\\left(\\sum_{i\\in I_\\mathcal{X}}(y_i -m_{J-1}(\\mathbf{x}_i))\\right)^2}{\\left(1+\\frac{\\lambda}{2}\\right)\\#\\{i\\in I_\\mathcal{X}\\}},\\] where \\(I_\\mathcal{X}\\) is the set of instances within cluster \\(\\mathcal{X}\\). \\(\\text{Gain}_O\\) is the original gain (no split) and \\(\\text{Gain}_L\\) and \\(\\text{Gain}_R\\) are the gains of the left and right clusters, respectively. One word about the \\(-\\gamma\\) adjustment in the above formula: there is one unit of new leaves (two new minus one old)! This makes a one leaf difference; hence \\(\\Delta L =1\\) and the penalization intensity for each new leaf is equal to \\(\\gamma\\). Lastly, we underline the fact that XGBoost also applies a learning rate: each new tree is scaled by a factor \\(\\eta\\), with \\(\\eta \\in (0,1]\\). After each step of boosting the new tree \\(T_J\\) sees its values discounted by multiplying them by \\(\\eta\\). This is very useful because a pure aggregation of 100 optimized trees is the best way to overfit the training sample. 6.4.5 Extensions Several additional features are available to further prevent boosted trees to overfit. 
Indeed, given a sufficiently large number of trees, the aggregation is able to match the training sample very well, but may fail to generalize well out-of-sample. Following the pioneering work of Srivastava et al. (2014), the DART (Dropout for Additive Regression Trees) model was proposed by Rashmi and Gilad-Bachrach (2015). The idea is to omit a specified number of trees during training. The trees that are removed from the model are chosen randomly. The full specifications can be found at https://xgboost.readthedocs.io/en/latest/tutorials/dart.html and we use a 10% dropout in the first example below.. Monotonicity constraints are another element that is featured both in xgboost and lightgbm. Sometimes, it is expected that one particular feature has a monotonic impact on the label. For instance, if one deeply believes in momentum, then past returns should have an increasing impact on future returns (in the cross-section of stocks). Given the recursive nature of the splitting algorithm, it is possible to choose when to perform a split (according to a particular variable) and when not to. In Figure 6.6, we show how the algorithm proceeds. All splits are performed according to the same feature. For the first split, things are easy because it suffices to verify that the averages of each cluster are ranked in the right direction. Things are more complicated for the splits that occur below. Indeed, the average values set by all above splits matter as they give bounds for acceptable values for the future average values in lower splits. If a split violates these bounds, then it is overlooked and another variable will be chosen instead. FIGURE 6.6: Imposing monotonic constraints. The constraints are shown in bold blue in the bottom leaves. 6.4.6 Code and results In this section, we train a model using the XGBoost library. Other options include catboost, gbm, lightgbm, and h2o’s own version of boosted machines. Unlike many other packages, the XGBoost function requires a particular syntax and dedicated formats. The first step is thus to encapsulate the data accordingly. Moreover, because training times can be long, we shorten the training sample as advocated in Coqueret and Guida (2020). We retain only the 40% most extreme observations (in terms of label values: top 20% and bottom 20%) and work with the small subset of features. In all coding sections dedicated to boosted trees in this book, the models will be trained with only 7 features. library(xgboost) # The package for boosted trees train_features_xgb <- training_sample %>% filter(R1M_Usd < quantile(R1M_Usd, 0.2) | R1M_Usd > quantile(R1M_Usd, 0.8)) %>% # Extreme values only! dplyr::select(all_of(features_short)) %>% as.matrix() # Independent variable train_label_xgb <- training_sample %>% filter(R1M_Usd < quantile(R1M_Usd, 0.2) | R1M_Usd > quantile(R1M_Usd, 0.8)) %>% dplyr::select(R1M_Usd) %>% as.matrix() # Dependent variable train_matrix_xgb <- xgb.DMatrix(data = train_features_xgb, label = train_label_xgb) # XGB format! The second (optional) step is to determine the monotonicity constraints that we want to impose. For simplicity, we will only enforce three constraints on market capitalization (negative, because large firms have smaller returns under the size anomaly); price-to-book ratio (negative, because overvalued firms also have smaller returns under the value anomaly); past annual returns (positive, because winners outperform losers under the momentum anomaly). 
mono_const <- rep(0, length(features)) # Initialize the vector mono_const[which(features == "Mkt_Cap_12M_Usd")] <- (-1) # Decreasing in market cap mono_const[which(features == "Pb")] <- (-1) # Decreasing in price-to-book mono_const[which(features == "Mom_11M_Usd")] <- 1 # Increasing in past return The third step is to train the model on the formatted training data. We include the monotonicity constraints and the DART feature (via rate_drop). Just like random forests, boosted trees can grow individual trees on subsets of the data: both row-wise (by selecting random instances) and column-wise (by keeping a smaller portion of predictors). These options are implemented below with the subsample and colsample_bytree in the arguments of the function. fit_xgb <- xgb.train(data = train_matrix_xgb, # Data source eta = 0.3, # Learning rate objective = "reg:squarederror", # Objective function max_depth = 4, # Maximum depth of trees subsample = 0.6, # Train on random 60% of sample colsample_bytree = 0.7, # Train on random 70% of predictors lambda = 1, # Penalisation of leaf values gamma = 0.1, # Penalisation of number of leaves nrounds = 30, # Number of trees used (rather low here) monotone_constraints = mono_const, # Monotonicity constraints rate_drop = 0.1, # Drop rate for DART verbose = 0 # No comment from the algo ) ## [21:27:42] WARNING: amalgamation/../src/learner.cc:516: ## Parameters: { rate_drop } might not be used. ## ## This may not be accurate due to some parameters are only used in language bindings but ## passed down to XGBoost core. Or some parameters are not used but slip through this ## verification. Please open an issue if you find above cases. Finally, we evaluate the performance of the model. Note that before that, a proper formatting of the testing sample is required. xgb_test <- testing_sample %>% # Test sample => XGB format dplyr::select(all_of(features_short)) %>% as.matrix() mean((predict(fit_xgb, xgb_test) - testing_sample$R1M_Usd)^2) # MSE ## [1] 0.03908855 mean(predict(fit_xgb, xgb_test) * testing_sample$R1M_Usd > 0) # Hit ratio ## [1] 0.5077626 The performance is comparable to those observed for other predictive tools. As a final exercise, we show one implementation of a classification task under XGBoost. Only the label changes. In XGBoost, labels must be coded with integer number, starting at zero exactly. In R, factors are numerically coded as integer numbers starting from one, hence the mapping is simple. train_label_C <- training_sample %>% filter(R1M_Usd < quantile(R1M_Usd, 0.2) | # Either low 20% returns R1M_Usd > quantile(R1M_Usd, 0.8)) %>% # Or top 20% returns dplyr::select(R1M_Usd_C) train_matrix_C <- xgb.DMatrix(data = train_features_xgb, label = as.numeric(train_label_C == "TRUE")) # XGB format! When working with categories, the loss function is usually the softmax function (see Section 1.1). fit_xgb_C <- xgb.train(data = train_matrix_C, # Data source (pipe input) eta = 0.8, # Learning rate objective = "multi:softmax", # Objective function num_class = 2, # Number of classes max_depth = 4, # Maximum depth of trees nrounds = 10, # Number of trees used verbose = 0 # No warning message ) We can then proceed to the assessment of the quality of the model. We adjust the prediction to the value of the true label and count the proportion of accurate forecasts. 
mean(predict(fit_xgb_C, xgb_test) + 1 == as.numeric(testing_sample$R1M_Usd_C)) # Hit ratio ## [1] 0.495613 Consistently with the previous classification attempts, the results are underwhelming, as if switching to binary labels incurred a loss of information. 6.4.7 Instance weighting In the computation of the aggregate loss, it is possible to introduce some flexibility and assign weights to instances: \\[O=\\underbrace{\\sum_{i=1}^I\\mathcal{W}_i \\times \\text{loss}(y_i,\\tilde{y}_i)}_{\\text{weighted error term}} \\quad + \\underbrace{\\sum_{j=1}^J\\Omega(T_j)}_{\\text{regularization term (unchanged)}}.\\] In factor investing, these weights can very well depend on the feature values (\\(\\mathcal{W}_i=\\mathcal{W}_i(\\textbf{x}_i)\\)). For instance, for one particular characteristic \\(\\textbf{x}^k\\), weights can be increasing thereby giving more importance to assets with high values of this characteristic (e.g., value stocks are favored compared to growth stocks). One other option is to increase weights when the values of the characteristic become more extreme (deep value and deep growth stocks have larger weights). If the features are uniform, the weights can simply be \\(\\mathcal{W}_i(x_i^k)\\propto|x_i^k-0.5|\\): firms with median value 0.5 have zero weight and as the feature value shifts towards 0 or 1, the weight increases. Specifying weights on instances biases the learning process just like views introduced à la Black and Litterman (1992) influence the asset allocation process. The difference is that the nudge is performed well ahead of the portfolio choice problem. In xgboost, the implementation instance weighting is done very early, in the definition of the xgb.DMatrix: inst_weights <- runif(nrow(train_features_xgb)) # Random weights inst_weights <- inst_weights / sum(inst_weights) # Normalization train_matrix_xgb <- xgb.DMatrix(data = train_features_xgb, label = train_label_xgb, weight = inst_weights) # Weights! Then, in the subsequent stages, the optimization will be performed with these hard-coded weights. The splitting points can be altered (via the total weighted loss in clusters) and the terminal weight values (6.5) are also impacted. 6.5 Discussion We end this chapter by a discussion on the choice of predictive engine with a view towards portfolio construction. As recalled in Chapter 2, the ML signal is just one building stage of construction of the investment strategy. At some point, this signal must be translated into portfolio weights. From this perspective, simple trees appear suboptimal. Tree depths are usually set between 3 and 6. This implies between 8 and 64 terminal leaves at most, with possibly very unbalanced clusters. The likelihood of having one cluster with 20% to 30% of the sample is high. This means that when it comes to predictions, roughly 20% to 30% of the instances will be given the same value. On the other side of the process, portfolio policies commonly have a fixed number of assets. Thus, having assets with equal signals does not permit to discriminate and select a subset to be included in the portfolio. For instance, if the policy requires exactly 100 stocks and 105 stocks have the same signal, the signal cannot be used for selection purposes. It would have to be combined with exogenous information such as the covariance matrix in a mean-variance type allocation. Overall, this is one reason to prefer aggregate models. 
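To make the point above concrete, one can simply count the number of distinct signal values produced by a single tree versus an ensemble on the testing sample. The short sketch below is an illustration only: it assumes that a simple tree (fit_tree, built with rpart) and a random forest (fit_RF, built with randomForest) were trained earlier in the chapter, and these object names should be adapted to whatever is actually in memory.
length(unique(predict(fit_tree, testing_sample)))  # Few distinct values: one per terminal leaf
length(unique(predict(fit_RF, testing_sample)))    # Averaging over trees yields (almost) asset-specific signals
A shallow tree typically returns a handful of values, whereas the forest delivers a quasi-continuous signal that can be used to rank and select assets.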
When the number of learners is sufficiently large (5 is almost enough), the predictions for assets will be unique and tailored to these assets. It then becomes possible to discriminate via the signal and select only those assets that have the most favorable signal. In practice, random forests and boosted trees are probably the best choices. 6.6 Coding exercises Using the formula in the chunks above, build two simple trees on the training sample with only one parameter: cp. For the first tree, take cp=0.001 and for the second take cp=0.01. Evaluate the performance of both models on the testing sample. Comment. With the smaller set of predictors, build random forests on the training sample. Restrict the learning on 30,000 instances and over 5 predictors. Construct the forests on 10, 20, 40, 80 and 160 trees and evaluate their performance on the training sample. Is complexity worthwhile in this case and why? Plot a tree based on data from calendar year 2008 and then from 2009. Compare. References "],["NN.html", "Chapter 7 Neural networks 7.1 The original perceptron 7.2 Multilayer perceptron 7.3 How deep we should go and other practical issues 7.4 Code samples and comments for vanilla MLP 7.5 Recurrent networks 7.6 Other common architectures 7.7 Coding exercise", " Chapter 7 Neural networks Neural networks (NNs) are an immensely rich and complicated topic. In this chapter, we introduce the simple ideas and concepts behind the most simple architectures of NNs. For more exhaustive treatments on NN idiosyncracies, we refer to the monographs by Haykin (2009), Du and Swamy (2013) and Goodfellow et al. (2016). The latter is available freely online: www.deeplearningbook.org. For a practical introduction, we recommend the great book of Chollet (2017). For starters, we briefly comment on the qualification “neural network”. Most experts agree that the term is not very well chosen, as NNs have little to do with how the human brain works (of which we know not that much). This explains why they are often referred to as “artificial neural networks” - we do not use the adjective for notational simplicity. Because we consider it more appropriate, we recall the definition of NNs given by François Chollet: “chains of differentiable, parameterised geometric functions, trained with gradient descent (with gradients obtained via the chain rule)”. Early references of neural networks in finance are Bansal and Viswanathan (1993) and Eakins, Stansell, and Buck (1998). Both have very different goals. In the first one, the authors aim to estimate a nonlinear form for the pricing kernel. In the second one, the purpose is to identify and quantify relationships between institutional investments in stocks and the attributes of the firms (an early contribution towards factor investing). An early review (Burrell and Folarin (1997)) lists financial applications of NNs during the 1990s. More recently, Sezer, Gudelek, and Ozbayoglu (2019), W. Jiang (2020) and Lim and Zohren (2020) survey the attempts to forecast financial time series with deep-learning models, mainly by computer science scholars. The pure predictive ability of NNs in financial markets is a popular subject and we further cite for example Kimoto et al. (1990), Enke and Thawornwong (2005), Zhang and Wu (2009), Guresen, Kayakutlu, and Daim (2011), Krauss, Do, and Huck (2017), Fischer and Krauss (2018), Aldridge and Avellaneda (2019), Babiak and Barunik (2020), Y. 
Ma, Han, and Wang (2020), and Soleymani and Paquet (2020).17 The last reference even combines several types of NNs embedded inside an overarching reinforcement learning structure. This list is very far from exhaustive. In the field of financial economics, recent research on neural networks includes: Feng, Polson, and Xu (2019) use neural networks to find factors that are the best at explaining the cross-section of stock returns. Gu, Kelly, and Xiu (2020b) map firm attributes and macro-economic variables into future returns. This creates a strong predictive tool that is able to forecast future returns very accurately. Luyang Chen, Pelger, and Zhu (2020) estimate the pricing kernel with a complex neural network structure including a generative adversarial network. This again gives crucial information on the structure of expected stock returns and can be used for portfolio construction (by building an accurate maximum Sharpe ratio policy). 7.1 The original perceptron The origins of NNs go back at least to Rosenblatt (1958). The aim of the original perceptron is binary classification. For simplicity, let us assume that the output is \\(\\{0\\) = do not invest\\(\\}\\) versus \\(\\{1\\) = invest\\(\\}\\) (e.g., derived from return, negative versus positive). Given the current nomenclature, a perceptron can be defined as an activated linear mapping. The model is the following: \\[f(\\mathbf{x})=\\left\\{ \\begin{array}{lll} 1 & \\text{if } \\mathbf{x}'\\mathbf{w}+b >0\\\\ 0 &\\text{otherwise} \\end{array}\\right.\\] The vector of weights \\(\\mathbf{w}\\) scales the variables and the bias \\(b\\) shifts the decision barrier. Given values for \\(b\\) and \\(w_i\\), the error is \\(\\epsilon_i=y_i-1_{\\left\\{\\sum_{j=1}^Jx_{i,j}w_j+b>0\\right\\}}\\). As is customary, we set \\(b=w_0\\) and add an initial constant column to \\(x\\): \\(x_{i,0}=1\\), so that \\(\\epsilon_i=y_i-1_{\\left\\{\\sum_{j=0}^Jx_{i,j}w_j>0\\right\\}}\\). In contrast to regressions, perceptrons do not have closed-form solutions. The optimal weights can only be approximated. Just like for regression, one way to derive good weights is to minimize the sum of squared errors. To this purpose, the simplest way to proceed is to compute the current model value at point \\(\\textbf{x}_i\\): \\(\\tilde{y}_i=1_{\\left\\{\\sum_{j=0}^Jw_jx_{i,j}>0\\right\\}}\\), adjust the weight vector: \\(w_j \\leftarrow w_j + \\eta (y_i-\\tilde{y}_i)x_{i,j}\\), which amounts to shifting the weights in the right direction. Just like for tree methods, the scaling factor \\(\\eta\\) is the learning rate. A large \\(\\eta\\) will imply large shifts: learning will be rapid but convergence may be slow or may not even occur. A small \\(\\eta\\) is usually preferable, as it helps reduce the risk of overfitting. In Figure 7.1, we illustrate this mechanism. The initial model (dashed grey line) was trained on 7 points (3 red and 4 blue). A new black point comes in. FIGURE 7.1: Scheme of a perceptron. if the point is red, there is no need for adjustment: it is labelled correctly as it lies on the right side of the border. if the point is blue, then the model needs to be updated appropriately. Given the rule mentioned above, this means adjusting the slope of the line downwards. Depending on \\(\\eta\\), the shift will be sufficient to change the classification of the new point - or not. At the time of its inception, the perceptron was an immense breakthrough which received intense media coverage (see Olazaran (1996) and Anderson and Rosenfeld (2000)). 
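Before generalizing the perceptron, we illustrate the updating rule above with a minimal, self-contained sketch on simulated data (this example is not based on the book's dataset; all numerical values are arbitrary).
set.seed(42)                                      # Reproducibility
n <- 200                                          # Number of points
X <- cbind(1, matrix(rnorm(2 * n), ncol = 2))     # Constant column (x_0 = 1) + 2 features
w_true <- c(-0.2, 1, -1)                          # "True" separating weights
y <- as.numeric(X %*% w_true > 0)                 # Binary labels (0/1)
w <- rep(0, 3)                                    # Initial weights (w_0 plays the role of the bias)
eta <- 0.1                                        # Learning rate
for(i in 1:n){                                    # One pass over the sample
  y_tilde <- as.numeric(sum(X[i, ] * w) > 0)      # Current prediction for point i
  w <- w + eta * (y[i] - y_tilde) * X[i, ]        # Shift the weights in the right direction
}
mean(as.numeric(X %*% w > 0) == y)                # In-sample accuracy of the learned rule
With a small learning rate and possibly several passes over the sample, the rule converges towards a separating hyperplane whenever one exists.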
The perceptron's rather simple structure was progressively generalized to networks (combinations) of perceptrons. Each one of them is a simple unit, and units are gathered into layers. The next section describes the organization of simple multilayer perceptrons (MLPs). 7.2 Multilayer perceptron 7.2.1 Introduction and notations A perceptron can be viewed as a linear model to which is applied a particular function: the Heaviside (step) function. Other choices of functions are naturally possible. In the NN jargon, they are called activation functions. Their purpose is to introduce nonlinearity in otherwise very linear models. Just as random forests combine trees, the idea behind neural networks is to combine perceptron-like building blocks. A popular representation of neural networks is shown in Figure 7.2. This scheme is overly simplistic. It hides what is really going on: there is a perceptron in each green circle and each output is activated by some function before it is sent to the final output aggregation. This is why such a model is called a Multilayer Perceptron (MLP). FIGURE 7.2: Simplified scheme of a multi-layer perceptron. A more faithful account of what is going on is laid out in Figure 7.3. FIGURE 7.3: Detailed scheme of a perceptron with 2 intermediate layers. Before we proceed with comments, we introduce some notation that will be used throughout the chapter. The data is separated into a matrix \\(\\textbf{X}=x_{i,j}\\) of features and a vector of output values \\(\\textbf{y}=y_i\\). \\(\\textbf{x}\\) or \\(\\textbf{x}_i\\) denotes one line of \\(\\textbf{X}\\). A neural network will have \\(L\\ge1\\) layers and for each layer \\(l\\), the number of units is \\(U_l\\ge1\\). The weights for unit \\(k\\) located in layer \\(l\\) are denoted with \\(\\textbf{w}_{k}^{(l)}=w_{k,j}^{(l)}\\) and the corresponding biases \\(b_{k}^{(l)}\\). The length of \\(\\textbf{w}_{k}^{(l)}\\) is equal to \\(U_{l-1}\\). \\(k\\) refers to the location of the unit in layer \\(l\\) while \\(j\\) refers to the unit in layer \\(l-1\\). Outputs (post-activation) are denoted \\(o_{i,k}^{(l)}\\) for instance \\(i\\), layer \\(l\\) and unit \\(k\\). The process is the following. When entering the network, the data goes through the initial linear mapping: \\[v_{i,k}^{(1)}=\\textbf{x}_i'\\textbf{w}^{(1)}_k+b_k^{(1)}, \\text{for } l=1, \\quad k \\in [1,U_1], \\] which is then transformed by a non-linear function \\(f^{(1)}\\). The result of this alteration is then given as input of the next layer and so on. The linear forms will be repeated (with different weights) for each layer of the network: \\[v_{i,k}^{(l)}=(\\textbf{o}^{(l-1)}_i)'\\textbf{w}^{(l)}_k+b_k^{(l)}, \\text{for } l \\ge 2, \\quad k \\in [1,U_l]. \\] The connections between the layers are the so-called outputs, which are basically the linear mappings to which the activation functions \\(f^{(l)}\\) have been applied. The output of layer \\(l\\) is the input of layer \\(l+1\\). \\[o_{i,k}^{(l)}=f^{(l)}\\left(v_{i,k}^{(l)}\\right).\\] Finally, the terminal stage aggregates the outputs from the last layer: \\[\\tilde{y}_i =f^{(L+1)} \\left((\\textbf{o}^{(L)}_i)'\\textbf{w}^{(L+1)}+b^{(L+1)}\\right).\\] In the forward-propagation of the input, the activation function naturally plays an important role. In Figure 7.4, we plot the most usual activation functions used by neural network libraries. FIGURE 7.4: Plot of the most common activation functions. Let us rephrase the process through the lens of factor investing. 
The inputs \\(\\textbf{x}\\) are the characteristics of the firms. The first step is to multiply their value by weights and add a bias. This is performed for all the units of the first layer. The output, which is a linear combination of the inputs, is then transformed by the activation function. Each unit provides one value and all of these values are fed to the second layer following the same process. This is iterated until the end of the network. The purpose of the last layer is to yield an output shape that corresponds to the label: if the label is numerical, the output is a single number; if it is categorical, then usually it is a vector with length equal to the number of categories. This vector indicates the probability that the value belongs to one particular category. It is possible to use a final activation function after the output. This can have a huge importance on the result. Indeed, if the labels are returns, applying a sigmoid function at the very end will be disastrous because the sigmoid is always positive. 7.2.2 Universal approximation One reason neural networks work well is that they are universal approximators. Given any bounded continuous function, there exists a one-layer network that can approximate this function up to arbitrary precision (see Cybenko (1989) for early references, section 4.2 in Du and Swamy (2013) and section 6.4.1 in Goodfellow et al. (2016) for more exhaustive lists of papers, and Guliyev and Ismailov (2018) for recent results). Formally, a one-layer perceptron is defined by \\[f_n(\\textbf{x})=\\sum_{l=1}^nc_l\\phi(\\textbf{x}\\textbf{w}_l+\\textbf{b}_l)+c_0,\\] where \\(\\phi\\) is a (non-constant) bounded continuous function. Then, for any continuous function \\(f\\) on the unit hypercube \\([0,1]^d\\) and any \\(\\epsilon>0\\), it is possible to find one \\(n\\) such that \\[|f(\\textbf{x})-f_n(\\textbf{x})|< \\epsilon, \\quad \\forall \\textbf{x} \\in [0,1]^d.\\] This result is rather intuitive: it suffices to add units to the layer to improve the fit. The process is more or less analogous to polynomial approximation, though some subtleties arise depending on the properties of the activation functions (boundedness, smoothness, convexity, etc.). We refer to Costarelli, Spigler, and Vinti (2016) for a survey on this topic. The raw results on universal approximation imply that any well-behaved function \\(f\\) can be approached sufficiently closely by a simple neural network, as long as the number of units can be arbitrarily large. However, they do not directly relate to the learning phase, i.e., when the model is optimized with respect to a particular dataset. In a series of papers (Barron (1993) and Barron (1994), notably), Barron gives a much more precise characterization of what neural networks can achieve. Barron (1993), for instance, proves a more precise version of universal approximation: for particular neural networks (with sigmoid activation), \\(\\mathbb{E}[(f(\\textbf{x})-f_n(\\textbf{x}))^2]\\le c_f/n\\), which gives a speed of convergence related to the size of the network. In the expectation, the random term is \\(\\textbf{x}\\): this corresponds to the case where the data is considered to be a sample of i.i.d. observations of a fixed distribution (this is the most common assumption in machine learning). Below, we state one important result that is easy to interpret; it is taken from Barron (1994). 
In the sequel, \\(f_n\\) corresponds to a possibly penalized neural network with only one intermediate layer with \\(n\\) units and sigmoid activation function. Moreover, both the supports of the predictors and the label are assumed to be bounded (which is not a major constraint). The most important metric in a regression exercise is the mean squared error (MSE) and the main result is a bound (in order of magnitude) on this quantity. For \\(N\\) randomly sampled i.i.d. points \\(y_i=f(x_i)+\\epsilon_i\\) on which \\(f_n\\) is trained, the best possible empirical MSE behaves like \\[\\begin{equation} \\tag{7.1} \\mathbb{E}\\left[(f(x)-f_n(x))^2 \\right]=\\underbrace{O\\left(\\frac{c_f}{n} \\right)}_{\\text{size of network}}+\\ \\underbrace{O\\left(\\frac{nK \\log(N)}{N} \\right)}_{\\text{size of sample}}, \\end{equation}\\] where \\(K\\) is the dimension of the input (number of columns) and \\(c_f\\) is a constant that depends on the generator function \\(f\\). The above quantity provides a bound on the error that can be achieved by the best possible neural network given a dataset of size \\(N\\). There are clearly two components in the decomposition of this bound. The first one pertains to the complexity of the network. Just as in the original universal approximation theorem, the error decreases with the number of units in the network. But this is not enough! Indeed, the sample size is of course a key driver in the quality of learning (of i.i.d. observations). The second component of the bound indicates that the error decreases at a slightly slower pace with respect to the number of observations (\\(\\log(N)/N\\)) and is linear in the number of units and the size of the input. This clearly underlines the link (trade-off?) between sample size and model complexity: having a very complex model is useless if the sample is small, just as a simple model will not catch the fine relationships in a large dataset. Overall, a neural network is a possibly very complicated function with a lot of parameters. In linear regressions, it is possible to increase the fit by spuriously adding exogenous variables. In neural networks, it suffices to increase the number of parameters by arbitrarily adding units to the layer(s). This is of course a very bad idea because high-dimensional networks will mostly capture the particularities of the sample they are trained on. 7.2.3 Learning via back-propagation Just like for tree methods, neural networks are trained by minimizing some loss function subject to some penalization: \\[O=\\sum_{i=1}^I \\text{loss}(y_i,\\tilde{y}_i)+ \\text{penalization},\\] where \\(\\tilde{y}_i\\) are the values obtained by the model and \\(y_i\\) are the true values of the instances. A simple requirement that eases computation is that the loss function be differentiable. The most common choices are the squared error for regression tasks and cross-entropy for classification tasks. We discuss the technicalities of classification in the next subsection. The training of a neural network amounts to altering the weights (and biases) of all units in all layers so that the objective \\(O\\) defined above is as small as possible. To ease the notation and given that the \\(y_i\\) are fixed, let us write \\(D(\\tilde{y}_i(\\textbf{W}))=\\text{loss}(y_i,\\tilde{y}_i)\\), where \\(\\textbf{W}\\) denotes the entirety of weights and biases in the network. 
The updating of the weights will be performed via gradient descent, i.e., via \\[\\begin{equation} \\tag{7.2} \\textbf{W} \\leftarrow \\textbf{W}-\\eta \\frac{\\partial D(\\tilde{y}_i) }{\\partial \\textbf{W}}. \\end{equation}\\] This mechanism is the most classical in the optimization literature and we illustrate it in Figure 7.5. We highlight the possible suboptimality of large learning rates. In the diagram, the descent associated with the high \\(\\eta\\) will oscillate around the optimal point, whereas the one related to the small \\(\\eta\\) will converge more directly. The complicated task in the above equation is to compute the gradient (derivative), which tells in which direction the adjustment should be made. The problem is that the successive nested layers and associated activations require many iterations of the chain rule for differentiation. FIGURE 7.5: Outline of gradient descent. The most common way to approximate a derivative is probably the finite difference method. Under the usual assumptions (the loss is twice differentiable), the centered difference satisfies: \\[\\frac{\\partial D(\\tilde{y}_i(w_k))}{\\partial w_k} = \\frac{D(\\tilde{y}_i(w_k+h))-D(\\tilde{y}_i(w_k-h))}{2h}+O(h^2),\\] where \\(h>0\\) is some arbitrarily small number. In spite of its apparent simplicity, this method is costly computationally because it requires a number of operations of the order of magnitude of the number of weights. Luckily, there is a small trick that can considerably ease and speed up the computation. The idea is to simply follow the chain rule and recycle terms along the way. Let us start by recalling \\[\\tilde{y}_i =f^{(L+1)} \\left((\\textbf{o}^{(L)}_i)'\\textbf{w}^{(L+1)}+b^{(L+1)}\\right)=f^{(L+1)}\\left(b^{(L+1)}+\\sum_{k=1}^{U_L} w^{(L+1)}_ko^{(L)}_{i,k} \\right),\\] so that if we differentiate with respect to the most immediate weights and biases, we get: \\[\\begin{align} \\frac{\\partial D(\\tilde{y}_i)}{\\partial w_k^{(L+1)}}&=D'(\\tilde{y}_i) \\left(f^{(L+1)} \\right)'\\left( b^{(L+1)}+\\sum_{k=1}^{U_L} w^{(L+1)}_ko^{(L)}_{i,k} \\right)o^{(L)}_{i,k} \\\\ \\tag{7.3} &= D'(\\tilde{y}_i) \\left(f^{(L+1)} \\right)'\\left( v^{(L+1)}_{i,k} \\right)o^{(L)}_{i,k} \\\\ \\frac{\\partial D(\\tilde{y}_i)}{\\partial b^{(L+1)}}&=D'(\\tilde{y}_i) \\left(f^{(L+1)} \\right)'\\left( b^{(L+1)}+\\sum_{k=1}^{U_L} w^{(L+1)}_ko^{(L)}_{i,k} \\right). \\end{align}\\] This is the easiest part. We must now go back one layer and this can only be done via the chain rule. To access layer \\(L\\), we recall identity \\(v_{i,k}^{(L)}=(\\textbf{o}^{(L-1)}_i)'\\textbf{w}^{(L)}_k+b_k^{(L)}=b_k^{(L)}+\\sum_{j=1}^{U_{L-1}}o^{(L-1)}_{i,j}w^{(L)}_{k,j}\\). We can then proceed: \\[\\begin{align} \\frac{\\partial D(\\tilde{y}_i)}{\\partial w_{k,j}^{(L)}}&=\\frac{\\partial D(\\tilde{y}_i)}{\\partial v^{(L)}_{i,k}}\\frac{\\partial v^{(L)}_{i,k}}{\\partial w_{k,j}^{(L)}} = \\frac{\\partial D(\\tilde{y}_i)}{\\partial v^{(L)}_{i,k}}o^{(L-1)}_{i,j}\\\\ &=\\frac{\\partial D(\\tilde{y}_i)}{\\partial o^{(L)}_{i,k}} \\frac{\\partial o^{(L)}_{i,k} }{\\partial v^{(L)}_{i,k}} o^{(L-1)}_{i,j} = \\frac{\\partial D(\\tilde{y}_i)}{\\partial o^{(L)}_{i,k}} (f^{(L)})'(v_{i,k}^{(L)}) o^{(L-1)}_{i,j} \\\\ &=\\underbrace{D'(\\tilde{y}_i) \\left(f^{(L+1)} \\right)'\\left(v^{(L+1)}_{i,k} \\right)}_{\\text{computed above!}} w^{(L+1)}_k (f^{(L)})'(v_{i,k}^{(L)}) o^{(L-1)}_{i,j}, \\end{align}\\] where, as we show in the last line, one part of the derivative was already computed in the previous step (Equation (7.3)). 
Hence, we can recycle this number and only focus on the right part of the expression. The magic of the so-called back-propagation is that this will hold true for each step of the differentiation. When computing the gradient for weights and biases in layer \\(l\\), there will be two parts: one that can be recycled from previous layers and another, local part, that depends only on the values and activation function of the current layer. A nice illustration of this process is given by the Google developer team: playground.tensorflow.org. When the data is formatted using tensors, it is possible to resort to vectorization so that the number of calls is limited to the order of magnitude of the number of nodes (units) in the network. The back-propagation algorithm can be summarized as follows. Given a sample of points (possibly just one): the data flows from left to right, as described in Figure 7.6. The blue arrows show the forward pass; this allows the computation of the error or loss function; all derivatives of this function (w.r.t. weights and biases) are computed, starting from the last layer and diffusing to the left (hence the term back-propagation) - the green arrows show the backward pass; all weights and biases can be updated to take the sample points into account (the model is adjusted to reduce the loss/error stemming from these points). FIGURE 7.6: Diagram of back-propagation. This operation can be performed any number of times with different sample sizes. We discuss this issue in Section 7.3. The learning rate \\(\\eta\\) can be refined. One option to reduce overfitting is to impose that after each epoch, the intensity of the update decreases. One possible parametric form is \\(\\eta=\\alpha e^{- \\beta t}\\), where \\(t\\) is the epoch and \\(\\alpha,\\beta>0\\). One further sophistication is to resort to so-called momentum (which originates from Polyak (1964)): \\[\\begin{align} \\tag{7.4} \\textbf{W}_{t+1} & \\leftarrow \\textbf{W}_{t} - \\textbf{m}_t \\quad \\text{with} \\nonumber \\\\ \\textbf{m}_t & \\leftarrow \\eta \\frac{\\partial D(\\tilde{y}_i)}{\\partial \\textbf{W}_{t}}+\\gamma \\textbf{m}_{t-1}, \\end{align}\\] where \\(t\\) is the index of the weight update. The idea of momentum is to speed up the convergence by including a memory term of the last adjustment (\\(\\textbf{m}_{t-1}\\)) and going in the same direction in the current update. The parameter \\(\\gamma\\) is often taken to be 0.9. More complex and enhanced methods have progressively been developed: - Nesterov (1983) improves the momentum term by forecasting the future shift in parameters; - Adagrad (Duchi, Hazan, and Singer (2011)) uses a different learning rate for each parameter; - Adadelta (Zeiler (2012)) and Adam (Kingma and Ba (2014)) combine the ideas of Adagrad and momentum. Lastly, in some degenerate cases, some gradients may explode and push weights far from their optimal values. In order to avoid this phenomenon, learning libraries implement gradient clipping. The user specifies a maximum magnitude for gradients, usually expressed as a norm. Whenever the gradient surpasses this magnitude, it is rescaled to reach the authorized threshold. Thus, the direction remains the same, but the adjustment is smaller. 7.2.4 Further details on classification In decision trees, the ultimate goal is to create homogeneous clusters, and the process to reach this goal was outlined in the previous chapter. 
For neural networks, things work differently because the objective is explicitly to minimize the error between the prediction \\(\\tilde{\\textbf{y}}_i\\) and a target label \\(\\textbf{y}_i\\). Again, here \\(\\textbf{y}_i\\) is a vector full of zeros with only one one denoting the class of the instance. Facing a classification problem, the trick is to use an appropriate activation function at the very end of the network. The dimension of the terminal output of the network should be equal to \\(J\\) (number of classes to predict), and if, for simplicity, we write \\(\\textbf{x}_i\\) for the values of this output, the most commonly used activation is the so-called softmax function: \\[\\tilde{\\textbf{y}}_i=s(\\textbf{x})_i=\\frac{e^{x_i}}{\\sum_{j=1}^Je^{x_j}}.\\] The justification of this choice is straightforward: it can take any value as input (over the real line) and it sums to one over any (finite-valued) output. Similarly as for trees, this yields a ‘probability’ vector over the classes. Often, the chosen loss is a generalization of the entropy used for trees. Given the target label \\(\\textbf{y}_i=(y_{i,1},\\dots,y_{i,L})=(0,0,\\dots,0,1,0,\\dots,0)\\) and the predicted output \\(\\tilde{\\textbf{y}}_i=(\\tilde{y}_{i,1},\\dots,\\tilde{y}_{i,L})\\), the cross-entropy is defined as \\[\\begin{equation} \\tag{7.5} \\text{CE}(\\textbf{y}_i,\\tilde{\\textbf{y}}_i)=-\\sum_{j=1}^J\\log(\\tilde{y}_{i,j})y_{i,j}. \\end{equation}\\] Basically, it is a proxy of the dissimilarity between its two arguments. One simple interpretation is the following. For the nonzero label value, the loss is \\(-\\log(\\tilde{y}_{i,l})\\), while for all others, it is zero. In the log, the loss will be minimal if \\(\\tilde{y}_{i,l}=1\\), which is exactly what we seek (i.e., \\(y_{i,l}=\\tilde{y}_{i,l}\\)). In applications, this best case scenario will not happen, and the loss will simply increase when \\(\\tilde{y}_{i,l}\\) drifts away downwards from one. 7.3 How deep we should go and other practical issues Beyond the ones presented in the previous sections, the user faces many degrees of freedom when building a neural network. We present a few classical choices that are available when constructing and training neural networks. 7.3.1 Architectural choices Arguably, the first choice pertains to the structure of the network. Beyond the dichotomy feed-forward versus recurrent (see Section 7.5), the immediate question is: how big (or how deep) the networks should be. First of all, let us calculate the number of parameters (i.e., weights plus biases) that are estimated (optimized) in a network. For the first layer, this gives \\((U_0+1)U_1\\) parameters, where \\(U_0\\) is the number of columns in \\(\\mathbb{X}\\) (i.e., number of explanatory variables) and \\(U_1\\) is the number of units in the layer. For layer \\(l\\in[2,L]\\), the number of parameters is \\((U_{l-1}+1)U_l\\). For the final output, there are simply \\(U_L+1\\) parameters. In total, this means the total number of values to optimize is \\[\\mathcal{N}=\\left(\\sum_{l=1}^L(U_{l-1}+1)U_l\\right)+U_L+1\\] As in any model, the number of parameters should be much smaller than the number of instances. There is no fixed ratio, but it is preferable if the sample size is at least ten times larger than the number of parameters. Below a ratio of 5, the risk of overfitting is high. Given the amount of data readily available, this constraint is seldom an issue, unless one wishes to work with a very large network. 
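To make the parameter-count formula above concrete, the short helper below (a sketch, not taken from the book's code) counts the parameters of a feed-forward network for a regression task. The architecture used in the call (93 features and two hidden layers of 16 and 8 units) anticipates the regression example of Section 7.4.
nb_params <- function(units){                               # units = c(U_0, U_1, ..., U_L)
  hidden <- sum((units[-length(units)] + 1) * units[-1])    # Sum of (U_{l-1}+1)*U_l over the hidden layers
  hidden + units[length(units)] + 1                         # Output layer: U_L weights + 1 bias
}
nb_params(c(93, 16, 8))                                     # 93 features, hidden layers of 16 and 8 units
The call returns 1,649, which matches the total number of trainable parameters reported by Keras in the model summary of Section 7.4.1.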
The number of hidden layers in current financial applications rarely exceeds three or four. The number of units per layer \\((U_k)\\) is often chosen to follow the geometric pyramid rule (see, e.g., Masters (1993)). If there are \\(L\\) hidden layers, with \\(I\\) features in the input and \\(O\\) dimensions in the output (for regression tasks, \\(O=1\\)), then, for the \\(k^{th}\\) layer, a rule of thumb for the number of units is \\[U_k\\approx \\left\\lfloor O\\left( \\frac{I}{O}\\right)^{\\frac{L+1-k}{L+1}}\\right\\rfloor.\\] If there is only one intermediate layer, the recommended proxy is the integer part of \\(\\sqrt{IO}\\). If not, the network starts with many units and the number of units decreases exponentially towards the output size. Often, the number of units per layer is a power of two because, in high dimensions, networks are trained on Graphics Processing Units (GPUs) or Tensor Processing Units (TPUs). Both pieces of hardware can be used optimally when the inputs have sizes equal to powers of two. Several studies have shown that very large architectures do not always perform better than shallower ones (e.g., Gu, Kelly, and Xiu (2020b) and Orimoloye et al. (2019) for high frequency data, i.e., not factor-based). As a rule of thumb, a maximum of three hidden layers seems to be sufficient for prediction purposes. 7.3.2 Frequency of weight updates and learning duration In the expression (7.2), it is implicit that the computation is performed for one given instance. If the sample size is very large (hundreds of thousands or millions of instances), updating the weights according to each point is computationally too costly. The updating is then performed on groups of instances which are called batches. The sample is (randomly) split into batches of fixed sizes and each update is performed following the rule: \\[\\begin{equation} \\tag{7.6} \\textbf{W} \\leftarrow \\textbf{W}-\\eta \\frac{\\partial \\sum_{i \\in \\text{batch}} D(\\tilde{y}_i)/\\text{card}(\\text{batch}) }{\\partial \\textbf{W}}. \\end{equation}\\] The change in weights is computed from the average loss over all instances in the batch. The terminology for training includes: epoch: one epoch is reached when each instance of the sample has contributed to the update of the weights (i.e., the training). Often, training a NN requires several epochs and up to a few dozen. batch size: the batch size is the number of samples used for one single update of weights. iterations: the number of iterations can mean alternatively the ratio of sample size divided by batch size or this ratio multiplied by the number of epochs. It’s either the number of weight updates required to reach one epoch or the total number of updates during the whole training. When the batch is equal to only one instance, the method is referred to as ‘stochastic gradient descent’ (SGD): the instance is chosen randomly. When the batch size is strictly above one and below the total number of instances, the learning is performed via ‘mini’ batches, that is, small groups of instances. The batches are also chosen randomly, but without replacement in the sample because for one epoch, the union of batches must be equal to the full training sample. It is impossible to know in advance what a good number of epochs is. Sometimes, the network stops learning after just 5 epochs (the validation loss does not decrease anymore). 
In some cases when the validation sample is drawn from a distribution close to that of the training sample, the network continues to learn even after 200 epochs. It is up to the user to test different values to evaluate the learning speed. In the examples below, we keep the number of epochs low for computational purposes. 7.3.3 Penalizations and dropout At each level (layer), it is possible to enforce constraints or penalizations on the weights (and biases). Just as for tree methods, this helps slow down the learning to prevent overfitting on the training sample. Penalizations are enforced directly on the loss function and the objective function takes the form \\[O=\\sum_{i=1}^I \\text{loss}(y_i,\\tilde{y}_i)+ \\sum_{k} \\lambda_k||\\textbf{W}_k||_1+ \\sum_j\\delta_j||\\textbf{W}_j||_2^2,\\] where the subscripts \\(k\\) and \\(j\\) pertain to the weights to which the \\(L^1\\) and (or) \\(L^2\\) penalization is applied. In addition, specific constraints can be enforced on the weights directly during the training. Typically, two types of constraints are used: norm constraints: a maximum norm is fixed for the weight vectors or matrices; non-negativity constraint: all weights must be positive or zero. Lastly, another (somewhat exotic) way to reduce the risk of overfitting is simply to reduce the size (number of parameters) of the model. Srivastava et al. (2014) propose to omit units during training (hence the term ‘dropout’). The weights of randomly chosen units are set to zero during training. All links from and to the unit are ignored, which mechanically shrinks the network. In the testing phase, all units are back, but the values (weights) must be scaled to account for the missing activations during the training phase. The interested reader can check the advice compiled in Bengio (2012), Hanin and Rolnick (2018), and Smith (2018) for further tips on how to configure neural networks. A paper dedicated to hyperparameter tuning for stock return prediction is Lee (2020). 7.4 Code samples and comments for vanilla MLP There are several frameworks and libraries that allow robust and flexible constructions of neural networks. Among them, Keras and Tensorflow (developed by Google) are probably the most used at the time we write this book (PyTorch, from Facebook, is one alternative). For simplicity and because we believe it is the best choice, we implement the NN with Keras (which is the high level API of Tensorflow, see https://www.tensorflow.org). The original Python implementation is referenced on https://keras.io, and the details for the R version can be found here: https://keras.rstudio.com. We recommend a thorough installation before proceeding. Because the native versions of Tensorflow and Keras are written in Python (and accessed by R via the reticulate package), a running version of Python is required below. To install Keras, please follow the instructions provided at https://keras.rstudio.com. In this section, we provide a detailed (though far from exhaustive) account of how to train neural networks with Keras. For the sake of completeness, we proceed in two steps. The first one relates to a very simple regression exercise. Its purpose is to get the reader familiar with the syntax of Keras. In the second step, we lay out many of the options proposed by Keras to perform a classification exercise. With these two examples, we thus cover most of the mainstream topics falling under the umbrella of feed-forward multilayered perceptrons. 
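Because the installation step is a frequent source of frustration, we recall one possible route below. This is a minimal sketch; the pages mentioned above remain the authoritative reference and the exact procedure may depend on the operating system and the Python setup.
install.packages("keras")   # R interface to Keras (relies on reticulate to reach Python)
library(keras)
install_keras()             # Installs TensorFlow and Keras in a dedicated Python environment
Once this has completed without errors, the code chunks of this section can be run as is.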
7.4.1 Regression example Before we head to the core of the NN, a short stage of data preparation is required. Just as for penalized regressions (glmnet package) and boosted trees (xgboost package), the data must be sorted into four parts which are the combination of two dichotomies: training versus testing and labels versus features. We define the corresponding variables below. For simplicity, the first example is a regression exercise. A classification task will be detailed below. NN_train_features <- dplyr::select(training_sample, features) %>% # Training features as.matrix() # Matrix = important NN_train_labels <- training_sample$R1M_Usd # Training labels NN_test_features <- dplyr::select(testing_sample, features) %>% # Testing features as.matrix() # Matrix = important NN_test_labels <- testing_sample$R1M_Usd # Testing labels In Keras, the training of neural networks is performed through three steps: Defining the structure/architecture of the network; Setting the loss function and learning process (options on the updating of weights); Train by specifying the batch sizes and number of rounds (epochs). We start with a very simple architecture with two hidden layers. library(keras) # install_keras() # To complete installation model <- keras_model_sequential() model %>% # This defines the structure of the network, i.e. how layers are organized layer_dense(units = 16, activation = 'relu', input_shape = ncol(NN_train_features)) %>% layer_dense(units = 8, activation = 'tanh') %>% layer_dense(units = 1) # No activation means linear activation: f(x) = x. The definition of the structure is very intuitive and uses the sequential syntax in which one input is iteratively transformed by a layer until the last iteration which gives the output. Each layer depends on two parameters: the number of units and the activation function that is applied to the output of the layer. One important point is the input_shape parameter for the first layer. It is required for the first layer and is equal to the number of features. For the subsequent layers, the input_shape is dictated by the number of units of the previous layer; hence it is not required. The activations that are currently available are listed on https://keras.io/activations/. We use the hyperbolic tangent in the second-to-last layer because it yields both positive and negative outputs. Of course, the last layer can generate negative values as well, but it’s preferable to satisfy this property one step ahead of the final output. 
model %>% compile( # Model specification loss = 'mean_squared_error', # Loss function optimizer = optimizer_rmsprop(), # Optimisation method (weight updating) metrics = c('mean_absolute_error') # Output metric ) summary(model) # Model architecture ## Model: "sequential" ## __________________________________________________________________________________________ ## Layer (type) Output Shape Param # ## ========================================================================================== ## dense (Dense) (None, 16) 1504 ## __________________________________________________________________________________________ ## dense_1 (Dense) (None, 8) 136 ## __________________________________________________________________________________________ ## dense_2 (Dense) (None, 1) 9 ## ========================================================================================== ## Total params: 1,649 ## Trainable params: 1,649 ## Non-trainable params: 0 ## __________________________________________________________________________________________ The summary of the model lists the layers in their order from input to output (forward pass). Because we are working with 93 features, the number of parameters for the first layer (16 units) is (93+1)*16 = 1504, where the additional unit accounts for the bias. For the second layer, the number of inputs is equal to the size of the output from the previous layer (16). Hence, given that the second layer has 8 units, the total number of parameters is (16+1)*8 = 136. We set the loss function to the standard mean squared error. Other losses are listed on https://keras.io/losses/; some of them work only for regressions (MSE, MAE) and others only for classification (categorical cross-entropy, see Equation (7.5)). The RMS propagation (RMSprop) optimizer is the classical mini-batch back-propagation implementation. For other weight updating algorithms, we refer to https://keras.io/optimizers/. The metric is the function used to assess the quality of the model. It can be different from the loss: for instance, using entropy for training and accuracy as the performance metric. The final stage fits the model to the data and requires some additional training parameters: fit_NN <- model %>% fit(NN_train_features, # Training features NN_train_labels, # Training labels epochs = 10, batch_size = 512, # Training parameters validation_data = list(NN_test_features, NN_test_labels) # Test data ) plot(fit_NN) # Plot, evidently! FIGURE 7.7: Output from a trained neural network (regression task). The batch size is quite arbitrary. For technical reasons pertaining to training on GPUs, these sizes are often powers of 2. In Keras, the plot of the trained model shows four different curves (shown here in Figure 7.7). The top graph displays the improvement (or lack thereof) in loss as the number of epochs increases. Usually, the algorithm starts by learning rapidly and then converges to a point where any additional epoch does not improve the fit. In the example above, this point arrives rather quickly because it is hard to notice any gain beyond the fourth epoch. The two colors show the performance on the two samples: the training sample and the testing sample. By construction, the loss will always improve (even marginally) on the training sample. 
When the impact is negligible on the testing sample (the curve is flat, as is the case here), the model fails to generalize out-of-sample: the gains obtained by training on the original sample do not translate to gains on previously unseen data; thus, the model seems to be learning noise. The second graph shows the same behavior but is computed using the metric function. The correlation (in absolute terms) between the two curves (loss and metric) is usually high. If one of them is flat, the other should be as well. In order to obtain the parameters of the model, the user can call get_weights(model).18 We do not execute the code here because the size of the output is much too large, as there are thousands of weights. Finally, from a practical point of view, the prediction is obtained via the usual predict() function. We use this function below on the testing sample to calculate the hit ratio. mean(predict(model, NN_test_features) * NN_test_labels > 0) # Hit ratio ## [1] 0.5427159 Again, the hit ratio lies between 50% and 55%, which seems reasonably good. Most of the time, neural networks have their weights initialized randomly. Hence, two independently trained networks with the same architecture and same training data may well lead to very different predictions and performance! One way to bypass this issue is to freeze the random number generator. Models can also be easily exchanged by loading weights via the set_weights() function. 7.4.2 Classification example We pursue our exploration of neural networks with a much more detailed example. The aim is to carry out a classification task on the binary label R1M_Usd_C. Before we proceed, we need to format the label properly. To this purpose, we resort to one-hot encoding (see Section 4.5.2). library(dummies) # Package for one-hot encoding NN_train_labels_C <- training_sample$R1M_Usd_C %>% dummy() # One-hot encoding of the label NN_test_labels_C <- testing_sample$R1M_Usd_C %>% dummy() # One-hot encoding of the label The labels NN_train_labels_C and NN_test_labels_C have two columns: the first flags the instances with above median returns and the second flags those with below median returns. Note that we do not alter the feature variables: they remain unchanged. Below, we set the structure of the networks with many additional features compared to the first one. model_C <- keras_model_sequential() model_C %>% # This defines the structure of the network, i.e. how layers are organized layer_dense(units = 16, activation = 'tanh', # Nb units & activation input_shape = ncol(NN_train_features), # Size of input kernel_initializer = "random_normal", # Initialization of weights kernel_constraint = constraint_nonneg()) %>% # Weights should be nonneg layer_dropout(rate = 0.25) %>% # Dropping out 25% units layer_dense(units = 8, activation = 'elu', # Nb units & activation bias_initializer = initializer_constant(0.2), # Initialization of biases kernel_regularizer = regularizer_l2(0.01)) %>% # Penalization of weights layer_dense(units = 2, activation = 'softmax') # Softmax for categorical output Before we start commenting on the many options used above, we highlight that Keras models, unlike many R variables, are mutable objects. This means that any piping %>% after calling a model will alter it. Hence, successive trainings do not start from scratch but from the result of the previous training. First, the options used above and below were chosen as illustrative examples and do not serve to particularly improve the quality of the model. 
The first change compared to Section 7.4.1 is the activation functions. The first two are simply new cases, while the third one (for the output layer) is imperative. Indeed, since the goal is classification, the dimension of the output must be equal to the number of categories of the labels. The activation that yields a multivariate is the softmax function. Note that we must also specify the number of classes (categories) in the terminal layer. The second major innovation is options pertaining to parameters. One family of options deals with the initialization of weights and biases. In Keras, weights are referred to as the ‘kernel’. The list of initializers is quite long and we suggest the interested reader has a look at the Keras reference (https://keras.io/initializers/). Most of them are random, but some of them are constant. Another family of options is the constraints and norm penalization that are applied on the weights and biases during training. In the above example, the weights of the first layer are coerced to be non-negative, while the weights of the second layer see their magnitude penalized by a factor (0.01) times their \\(L^2\\) norm. Lastly, the final novelty is the dropout layer (see Section 7.3.3) between the first and second layers. According to this layer, one fourth of the units in the first layer will be (randomly) omitted during training. The specification of the training is outlined below. model_C %>% compile( # Model specification loss = 'binary_crossentropy', # Loss function optimizer = optimizer_adam(lr = 0.005, # Optimisation method (weight updating) beta_1 = 0.9, beta_2 = 0.95), metrics = c('categorical_accuracy') # Output metric ) summary(model_C) # Model structure ## Model: "sequential_1" ## __________________________________________________________________________________________ ## Layer (type) Output Shape Param # ## ========================================================================================== ## dense_3 (Dense) (None, 16) 1504 ## __________________________________________________________________________________________ ## dropout (Dropout) (None, 16) 0 ## __________________________________________________________________________________________ ## dense_4 (Dense) (None, 8) 136 ## __________________________________________________________________________________________ ## dense_5 (Dense) (None, 2) 18 ## ========================================================================================== ## Total params: 1,658 ## Trainable params: 1,658 ## Non-trainable params: 0 ## __________________________________________________________________________________________ Here again, many changes have been made: all levels have been revised. The loss is now the cross-entropy. Because we work with two categories, we resort to a specific choice (binary cross-entropy), but the more general form is the option categorical_crossentropy and works for any number of classes (strictly above 1). The optimizer is also different and allows for several parameters and we refer to Kingma and Ba (2014). Simply put, the two beta parameters control decay rates for exponentially weighted moving averages used in the update of weights. The two averages are estimates for the first and second moment of the gradient and can be exploited to increase the speed of learning. The performance metric in the above chunk is the categorical accuracy. In multiclass classification, the accuracy is defined as the average accuracy over all classes and all predictions. 
Since a prediction for one instance is a vector of weights, the ‘terminal’ prediction is the class that is associated with the largest weight. The accuracy then measures the proportion of times when the prediction is equal to the realized value (i.e., when the class is correctly guessed by the model). Finally, we proceed with the training of the model. fit_NN_C <- model_C %>% fit(NN_train_features, # Training features NN_train_labels_C, # Training labels epochs = 20, batch_size = 512, # Training parameters validation_data = list(NN_test_features, NN_test_labels_C), # Test data verbose = 0, # No comments from algo callbacks = list( callback_early_stopping(monitor = "val_loss", # Early stopping: min_delta = 0.001, # Improvement threshold patience = 3, # Nb epochs with no improvmt verbose = 0 # No warnings ) ) ) plot(fit_NN_C) FIGURE 7.8: Output from a trained neural network (classification task) with early stopping. There is only one major difference here compared to the previous training call. In Keras, callbacks are functions that can be used at given stages of the learning process. In the above example, we use one such function to stop the algorithm when no progress has been made for some time. When datasets are large, the training can be long, especially when batch sizes are small and/or the number of epochs is high. It is not guaranteed that going to the full number of epochs is useful, as the loss or metric functions may be plateauing much sooner. Hence, it can be very convenient to stop the process if no improvement is achieved during a specified time-frame. We set the number of epochs to 20, but the process will likely stop before that. In the above code, the improvement is monitored on the validation loss (“val_loss”; one alternative is “val_acc”, the validation accuracy). The min_delta value sets the minimum improvement that needs to be attained for the algorithm to continue. Therefore, unless the validation loss improves by at least 0.001 at each epoch, the training will stop. Nevertheless, some flexibility is introduced via the patience parameter, which in our case asserts that the halting decision is made only after three consecutive epochs with no improvement. Among the options, the verbose parameter dictates the amount of comments made by the function. For simplicity, we do not want any comments, hence this value is set to zero. In Figure 7.8, the two graphs yield very different curves. One reason for that is the scale of the second graph. The range of accuracies is very narrow. Any change in this range does not represent much variation overall. The pattern is relatively clear on the training sample: the loss decreases, while the accuracy improves. Unfortunately, this does not translate to the testing sample, which indicates that the model does not generalize well out-of-sample. 7.4.3 Custom losses In Keras, it is possible to define user-specified loss functions. This may be interesting in some cases. For instance, the quadratic error has three terms \\(y_i^2\\), \\(\\tilde{y}_i^2\\) and \\(-2y_i\\tilde{y}_i\\). In practice, it can make sense to focus more on the latter term because it is the most essential: we do want predictions and realized values to have the same sign! Below we show how to optimize on a simple (product) function in Keras, \\(l(y_i,\\tilde{y}_i)=(\\tilde{y}_i-\\tilde{m})^2-\\gamma (y_i-m)(\\tilde{y}_i-\\tilde{m})\\), where \\(m\\) and \\(\\tilde{m}\\) are the sample averages of \\(y_i\\) and \\(\\tilde{y}_i\\). With \\(\\gamma>2\\), we give more weight to the cross term. 
We start with a simple architecture. model_custom <- keras_model_sequential() model_custom %>% # This defines the structure of the network, i.e. how layers are organized layer_dense(units = 16, activation = 'relu', input_shape = ncol(NN_train_features)) %>% layer_dense(units = 8, activation = 'sigmoid') %>% layer_dense(units = 1) # No activation means linear activation: f(x) = x. Then we code the loss function and integrate it to the model. The important trick is to resort to functions that are specific to the library (the k_functions). We code the variance of predicted values minus the scaled covariance between realized and predicted values. Below we use a scale of five. # Defines the loss, we use gamma = 5 metric_cust <- custom_metric("custom_loss", function(y_true, y_pred) { k_mean((y_pred - k_mean(y_pred))*(y_pred - k_mean(y_pred)))-5*k_mean((y_true - k_mean(y_true))*(y_pred - k_mean(y_pred))) }) model_custom %>% compile( # Model specification loss = metric_cust, #function(y_true, y_pred) custom_loss(y_true, y_pred), # New loss function! optimizer = optimizer_rmsprop(), # Optim method metrics = c('mean_absolute_error') # Output metric ) Finally, we are ready to train and briefly evaluate the performance of the model. fit_NN_cust <- model_custom %>% fit(NN_train_features, # Training features NN_train_labels, # Training labels epochs = 10, batch_size = 512, # Training parameters validation_data = list(NN_test_features, NN_test_labels) # Test data ) plot(fit_NN_cust) The curves may go in opposite direction. One reason for that is that while improving correlation between realized and predicted values, we are also increasing the sum of squared predicted returns. mean(predict(model_custom, NN_test_features) * NN_test_labels > 0) # Hit ratio ## [1] 0.5460346 The outcome could be improved. There are several directions that could help. One of them is arguably that the model should be dynamic and not static (see Chapter 12). 7.5 Recurrent networks 7.5.1 Presentation Multilayer perceptrons are feed-forward networks because the data flows from left to right with no looping in between. For some particular tasks with sequential linkages (e.g., time-series or speech recognition), it might be useful to keep track of what happened with the previous sample (i.e., there is a natural ordering). One simple way to model ‘memory’ would be to consider the following network with only one intermediate layer: \\[\\begin{align*} \\tilde{y}_i&=f^{(y)}\\left(\\sum_{j=1}^{U_1}h_{i,j}w^{(y)}_j+b^{(2)}\\right) \\\\ \\textbf{h}_{i} &=f^{(h)}\\left(\\sum_{k=1}^{U_0}x_{i,k}w^{(h,1)}_k+b^{(1)}+ \\underbrace{\\sum_{k=1}^{U_1} w^{(h,2)}_{k}h_{i-1,k}}_{\\text{memory part}} \\right), \\end{align*}\\] where \\(h_0\\) is customarily set at zero (vector-wise). These kinds of models are often referred to as Elman (1990) models or to Jordan (1997) models if in the latter case \\(h_{i-1}\\) is replaced by \\(y_{i-1}\\) in the computation of \\(h_i\\). Both types of models fall under the overarching umbrella of Recurrent Neural Networks (RNNs). The \\(h_i\\) is usually called the state or the hidden layer. The training of this model is complicated and must be done by unfolding the network over all instances to obtain a simple feed-forward network and train it regularly. We illustrate the unfolding principle in Figure 7.9. It shows a very deep network. The first input impacts the first layer and then the second one via \\(h_1\\) and all following layers in the same fashion. 
Likewise, the second input impacts all layers except the first, and each instance \\(i-1\\) is going to impact the output \\(\\tilde{y}_i\\) and all outputs \\(\\tilde{y}_j\\) for \\(j \\ge i\\). In Figure 7.9, the parameters that are trained are shown in blue. They appear many times, in fact, at each level of the unfolded network. FIGURE 7.9: Unfolding a recurrent network. The main problem with the above architecture is the loss of memory induced by vanishing gradients. Because of the depth of the model, the chain rule used in the back-propagation will imply a large number of products of derivatives of activation functions. Now, as is shown in Figure 7.4, these functions are very smooth and their derivatives are most of the time smaller than one (in absolute value). Hence, multiplying many numbers smaller than one leads to very small figures: beyond some layers, the learning does not propagate because the adjustments are too small. One way to prevent this progressive discounting of the memory was introduced in Hochreiter and Schmidhuber (1997) (the Long Short-Term Memory, or LSTM, model). This model was subsequently simplified in Chung et al. (2015) and we present this more parsimonious version below. The Gated Recurrent Unit (GRU) is a slightly more complicated version of the vanilla recurrent network defined above. It has the following representation: \\[\\begin{align*} \\tilde{y}_i&=z_i\\tilde{y}_{i-1}+ (1-z_i)\\tanh \\left(\\textbf{w}_y'\\textbf{x}_i+ b_y+ u_yr_i\\tilde{y}_{i-1}\\right) \\quad \\text{output (prediction)} \\\\ z_i &= \\text{sig}(\\textbf{w}_z'\\textbf{x}_i+b_z+u_z\\tilde{y}_{i-1}) \\hspace{9mm} \\text{`update gate'} \\ \\in (0,1)\\\\ r_i &= \\text{sig}(\\textbf{w}_r'\\textbf{x}_i+b_r+u_r\\tilde{y}_{i-1}) \\hspace{9mm} \\text{`reset gate'} \\ \\in (0,1). \\end{align*}\\] In compact form, this gives \\[\\tilde{y}_i=\\underbrace{z_i}_{\\text{weight}}\\underbrace{\\tilde{y}_{i-1}}_{\\text{past value}}+ \\underbrace{(1-z_i)}_{\\text{weight}}\\underbrace{\\tanh \\left(\\textbf{w}_y'\\textbf{x}_i+ b_y+ u_yr_i\\tilde{y}_{i-1}\\right)}_{\\text{candidate value (classical RNN)}}, \\] where the \\(z_i\\) decides the optimal mix between the current and past values. For the candidate value, \\(r_i\\) decides how much of the past/memory to retain. \\(r_i\\) is commonly referred to as the ‘reset gate’ and \\(z_i\\) as the ‘update gate’. There are some subtleties in the training of a recurrent network. Indeed, because of the chaining between the instances, each batch must correspond to a coherent time series. A logical choice is thus one batch per asset with instances (logically) chronologically ordered. Lastly, one option in some frameworks is to keep some memory between the batches by passing the final value of \\(\\tilde{y}_i\\) to the next batch (for which it will be \\(\\tilde{y}_0\\)). This is often referred to as the stateful mode and should be handled with care. It does not seem desirable in a portfolio prediction setting if the batch size corresponds to all observations for each asset: there is no particular link between assets. If the dataset is divided into several parts for each given asset, then the training must be handled very cautiously. Recurrent networks, and LSTMs especially, have been found to be good forecasting tools in financial contexts (see, e.g., Fischer and Krauss (2018) and Wang et al. (2020)). 7.5.2 Code and results Recurrent networks are theoretically more complicated than multilayer perceptrons. 
In practice, they are also more challenging in their implementation. Indeed, the serial linkages require more attention compared to feed-forward architectures. In an asset pricing framework, we must separate the assets because the stock-specific time series cannot be bundled together. The learning will be sequential, one stock at a time. The dimensions of variables are crucial. In Keras, they are defined for RNNs as: The size of the batch: in our case, it will be the number of assets. Indeed, the recurrence relationship holds at the asset level, hence each asset will represent a new batch on which the model will learn. The time steps: in our case, it will simply be the number of dates. The number of features: in our case, there is only one possible figure which is the number of predictors. For simplicity and in order to reduce computation times, we will use the same subset of stocks as that from Section 5.2.2. This yields a perfectly rectangular dataset in which all dates have the same number of observations. First, we create some new, intermediate variables. data_rnn <- data_ml %>% # Dedicated dataset filter(stock_id %in% stock_ids_short) training_sample_rnn <- filter(data_rnn, date < separation_date) testing_sample_rnn <- filter(data_rnn, date > separation_date) nb_stocks <- length(stock_ids_short) # Nb stocks nb_feats <- length(features) # Nb features nb_dates_train <- nrow(training_sample) / nb_stocks # Nb training dates (size of sample) nb_dates_test <- nrow(testing_sample) / nb_stocks # Nb testing dates Then, we construct the variables we will pass as arguments. We recall that the data file was ordered first by stocks and then by date (see Section 1.2). train_features_rnn <- array(NN_train_features, # Formats the training data into array dim = c(nb_dates_train, nb_stocks, nb_feats)) %>% # Tricky order aperm(c(2,1,3)) # The order is: stock, date, feature test_features_rnn <- array(NN_test_features, # Formats the testing data into array dim = c(nb_dates_test, nb_stocks, nb_feats)) %>% # Tricky order aperm(c(2,1,3)) # The order is: stock, date, feature train_labels_rnn <- as.matrix(NN_train_labels) %>% array(dim = c(nb_dates_train, nb_stocks, 1)) %>% aperm(c(2,1,3)) test_labels_rnn <- as.matrix(NN_test_labels) %>% array(dim = c(nb_dates_test, nb_stocks, 1)) %>% aperm(c(2,1,3)) Finally, we move towards the training part. For simplicity, we only consider a simple RNN with only one layer. The structure is outlined below. In terms of recurrence structure, we pick a Gated Recurrent Unit (GRU). model_RNN <- keras_model_sequential() %>% layer_gru(units = 16, # Nb units in hidden layer batch_input_shape = c(nb_stocks, # Dimensions = tricky part! nb_dates_train, nb_feats), activation = 'tanh', # Activation function return_sequences = TRUE) %>% # Return all the sequence layer_dense(units = 1) # Final aggregation layer model_RNN %>% compile( loss = 'mean_squared_error', # Loss = quadratic optimizer = optimizer_rmsprop(), # Backprop metrics = c('mean_absolute_error') # Output metric MAE ) There are many options available for recurrent layers. For GRUs, we refer to the Keras documentation https://keras.rstudio.com/reference/layer_gru.html. We comment briefly on the option return_sequences which we activate. In many cases, the output is simply the terminal value of the sequence. If we do not require the entirety of the sequence to be returned, we will face a problem in the dimensionality because the label is indeed a full sequence. 
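Before training, it can be worth double-checking the reshaping step above on a toy example (hypothetical tiny dimensions, unrelated to the dataset): the array/aperm combination produces the (stock, date, feature) ordering as long as the rows of the original matrix are sorted by stock first and then by date.
nb_s <- 2; nb_d <- 3; nb_f <- 2                                   # Toy sizes: 2 stocks, 3 dates, 2 features
toy <- matrix(1:(nb_s * nb_d * nb_f), ncol = nb_f)                # Rows sorted by stock, then by date
toy_rnn <- array(toy, dim = c(nb_d, nb_s, nb_f)) %>%              # Same call as above, toy dimensions
    aperm(c(2,1,3))                                               # Final order: stock, date, feature
toy_rnn[1, , ]                                                    # Dates x features slice of the first stock...
toy[1:nb_d, ]                                                     # ...equals the first stock's block of rows
The first slice of the permuted array matches the first stock's block of rows, which is the layout expected by the recurrent layer.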
Once the structure is determined, we can move forward to the training stage. fit_RNN <- model_RNN %>% fit(train_features_rnn, # Training features train_labels_rnn, # Training labels epochs = 10, # Number of rounds batch_size = nb_stocks, # Nb of sequences per batch (one per stock) verbose = 0) # No comments plot(fit_RNN) FIGURE 7.10: Output from a trained recurrent neural network (regression task). Compared to our previous models, the major difference, both in the output (the graph in Figure 7.10) and in the input (the code), is the absence of validation (or testing) data. One reason is that Keras is very restrictive with RNNs and imposes that the training and testing samples share the same dimensions. In our situation, this is obviously not the case, hence we must bypass this obstacle by duplicating the model. new_model <- keras_model_sequential() %>% layer_gru(units = 16, batch_input_shape = c(nb_stocks, # New dimensions nb_dates_test, nb_feats), activation = 'tanh', # Activation function return_sequences = TRUE) %>% # Return the full sequence layer_dense(units = 1) # Output dimension new_model %>% keras::set_weights(keras::get_weights(model_RNN)) Finally, once the new model is ready with the matching dimensions, we can move on to predicting the test values. We resort to the predict() function and immediately compute the hit ratio obtained by the model. pred_rnn <- predict(new_model, test_features_rnn, batch_size = nb_stocks) # Predictions mean(c(t(as.matrix(pred_rnn))) * test_labels_rnn > 0) # Hit ratio ## [1] 0.4957154 The hit ratio is close to 50%, hence the model hardly does better than coin tossing. Before we close this section on RNNs, we mention a new type of architecture, called \\(\\alpha\\)-RNNs, which are simpler than LSTMs and GRUs. They consist of vanilla RNNs to which a simple autocorrelation term is added to generate long-term memory. We refer to the paper by Matthew F Dixon (2020) for more details on this subject. 7.6 Other common architectures In this section, we present other network structures. Because they are less mainstream and often harder to implement, we do not propose code examples and stick to theoretical introductions. 7.6.1 Generative adversarial networks The idea of Generative Adversarial Networks (GANs) is to improve the accuracy of a classical neural network by trying to fool it. This very popular idea was introduced by Goodfellow et al. (2014). Imagine you are an expert in Picasso paintings and that you boast about being able to easily recognize any piece of work from the painter. One way to refine your skills is to test them against a counterfeiter. A true expert should be able to discriminate between a true original Picasso and one emanating from a forger. This is the principle of GANs. GANs consist of two neural networks: the first one tries to learn and the second one tries to fool the first (lead it into error). Just like in the example above, there are also two sets of data: one (\\(\\textbf{x}\\)) is true (or correct), stemming from a classical training sample, and the other one (\\(\\textbf{z}\\)) is fake and generated by the counterfeiter network. In the GAN nomenclature, the network that learns is \\(D\\) because it is supposed to discriminate, while the forger is \\(G\\) because it generates false data. In their original formulation, GANs are aimed at classification tasks. To ease the presentation, we keep this scope. The discriminant network has a simple (scalar) output: the probability that its input comes from true data (versus fake data). 
The input of \\(G\\) is some arbitrary noise and its output has the same shape/form as the input of \\(D\\). We state the theoretical formulation of a GAN directly and comment on it below. \\(D\\) and \\(G\\) play the following minimax game: \\[\\begin{equation} \\tag{7.7} \\underset{G}{\\min} \\ \\underset{D}{\\max} \\ \\left\\{ \\mathbb{E}[\\log(D(\\textbf{x}))]+\\mathbb{E}[\\log(1-D(G(\\textbf{z})))] \\right\\}. \\end{equation}\\] First, let us decompose this expression into its two parts (the two optimizers). The first part (i.e., the inner max) is the classical one: the algorithm seeks to maximize the probability of assigning the correct label to all the examples it must classify. As is done in economics and finance, the program does not maximize \\(D(\\textbf{x})\\) itself on average, but rather a functional form of it (like a utility function). On the left side, since the expectation is driven by \\(\\textbf{x}\\), the objective must be increasing in the output. On the right side, where the expectation is evaluated over the fake instances, the correct classification corresponds to the opposite quantity, i.e., \\(1-D(G(\\textbf{z}))\\). The second, overarching, part seeks to minimize the performance of the algorithm on the simulated data: it aims at shrinking the odds that \\(D\\) finds out that the data is indeed corrupt. A summarized version of the structure of the network is provided in Equation (7.8) below. \\[\\begin{equation} \\tag{7.8} \\left. \\begin{array}{rlll} \\text{training sample} = \\textbf{x} = \\text{true data} && \\\\ \\text{noise}= \\textbf{z} \\quad \\overset{G}{\\rightarrow} \\quad \\text{fake data} & \\end{array} \\right\\} \\overset{D}{\\rightarrow} \\text{output = probability for label} \\end{equation}\\] In ML-based asset pricing, the most notable application of GANs was introduced in Luyang Chen, Pelger, and Zhu (2020). Their aim is to make use of the method of moments expression \\[\\mathbb{E}[M_{t+1}r_{t+1,n}g(I_t,I_{t,n})]=0,\\] which is an application of Equation (3.7) where the instrumental variables \\(I_{t,n}\\) are firm-dependent (e.g., characteristics and attributes) while the \\(I_t\\) are macro-economic variables (aggregate dividend yield, volatility level, credit spread, term spread, etc.). The function \\(g\\) yields a \\(d\\)-dimensional output, so that the above equation leads to \\(d\\) moment conditions. The trick is to model the SDF as an unknown combination of assets \\(M_{t+1}=1-\\sum_{n=1}^Nw(I_t,I_{t,n})r_{t+1,n}\\). The primary discriminatory network (\\(D\\)) is the one that approximates the SDF via the weights \\(w(I_t,I_{t,n})\\). The secondary generative network is the one that creates the moment conditions through \\(g(I_t,I_{t,n})\\) in the above equation. The full specification of the network is given by the program: \\[\\underset{w}{\\text{min}} \\ \\underset{g}{\\text{max}} \\ \\sum_{j=1}^N \\left\\| \\mathbb{E} \\left[\\left(1-\\sum_{n=1}^Nw(I_t,I_{t,n})r_{t+1,n} \\right)r_{t+1,j}g(I_t,I_{t,j})\\right] \\right\\|^2,\\] where the \\(L^2\\) norm applies to the \\(d\\) values generated via \\(g\\). The asset pricing equations (moments) are not treated as strict equalities but as relationships that are approximated. The network defined by \\(\\textbf{w}\\) is the asset pricing modeler and tries to determine the best possible model, while the network defined by \\(\\textbf{g}\\) seeks to find the worst possible conditions so that the model performs badly. We refer to the original article for the full specification of both networks. 
In their empirical section, Luyang Chen, Pelger, and Zhu (2020) report that adopting a strong structure driven by asset pricing imperatives adds value compared to a pure predictive ‘vanilla’ approach such as the one detailed in Gu, Kelly, and Xiu (2020b). The out-of-sample behavior of decile-sorted portfolios (based on the model’s predictions) displays a monotonic pattern with respect to the order of the deciles. GANs can also be used to generate artificial financial data (see Efimov and Xu (2019), Marti (2019), Wiese et al. (2020), Ni et al. (2020), and, relatedly, Buehler et al. (2020)), but this topic is outside the scope of the book. 7.6.2 Autoencoders In the recent literature, autoencoders (AEs) are used in Huck (2019) (portfolio management) and Gu, Kelly, and Xiu (2020a) (asset pricing). AEs are a peculiar family of neural networks because they are classified among unsupervised algorithms: in the supervised jargon, their label is equal to their input. Like GANs, autoencoders consist of two networks, though the structure is very different: the first network encodes the input into some intermediary output (usually called the code), and the second network decodes the code into a modified version of the input. \\[\\begin{array}{ccccccccc} \\textbf{x} & &\\overset{E}{\\longrightarrow} && \\textbf{z} && \\overset{D}{\\longrightarrow} && \\textbf{x}' \\\\ \\text{input} && \\text{encoder} && \\text{code} && \\text{decoder} && \\text{modified input} \\end{array}\\] Because autoencoders do not belong to the large family of supervised algorithms, we postpone their presentation to Section 15.2.3. The article Gu, Kelly, and Xiu (2020a) resorts to the idea of AEs while at the same time augmenting the complexity of their asset pricing model. From the simple specification \\(r_t=\\boldsymbol{\\beta}_{t-1}\\textbf{f}_t+e_t\\) (we omit asset dependence for notational simplicity), they add the assumptions that the betas depend on firm characteristics, while the factors are possibly nonlinear functions of the returns themselves. The model takes the following form: \\[\\begin{equation} \\tag{7.9} r_{t,i}=\\textbf{NN}_{\\textbf{beta}}(\\textbf{x}_{t-1,i})+\\textbf{NN}_{\\textbf{factor}}(\\textbf{r}_t)+e_{t,i}, \\end{equation}\\] where \\(\\textbf{NN}_{\\textbf{beta}}\\) and \\(\\textbf{NN}_{\\textbf{factor}}\\) are two neural networks. The above equation looks like an autoencoder because the returns are both inputs and outputs. However, the additional complexity comes from the other network, \\(\\textbf{NN}_{\\textbf{beta}}\\), which feeds on firm characteristics. Modern neural network libraries such as Keras allow for customized models like the one above. The coding of this structure is left as an exercise (see below). 7.6.3 A word on convolutional networks Neural networks gained popularity during the 2010s thanks to a series of successes in computer vision competitions. The algorithms behind these advances are convolutional neural networks (CNNs). While they may seem a surprising choice for financial predictions, several teams of researchers in the computer science field have proposed approaches that rely on this variation of neural networks (J.-F. Chen et al. (2016), Loreggia et al. (2016), Dingli and Fournier (2017), Tsantekidis et al. (2017), Hoseinzade and Haratizadeh (2019)). Recently, J. Jiang, Kelly, and Xiu (2020) propose to extract signals from images of price trends. Hence, we briefly present the principle of CNNs in this final section on neural networks. 
We lay out the presentation for CNNs of dimension two, but they can also be used in dimension one or three. The reason why CNNs are useful is that they progressively reduce the dimension of a large dataset while keeping local information. An image is a rectangle of pixels. Each pixel is usually coded via three layers, one for each color: red, blue and green. But to keep things simple, let’s just consider one layer of, say, 1,000 by 1,000 pixels, with one value for each pixel. In order to analyze the content of this image, a convolutional layer will reduce the dimension of the inputs by resorting to some convolution. Visually, this simplification is performed by scanning and altering the values using rectangles with arbitrary weights. Figure 7.11 sketches this process (it is strongly inspired by Hoseinzade and Haratizadeh (2019)). The original data is an \\((I\\times K)\\) matrix \\(x_{i,k}\\) and the weights form a matrix \\(w_{j,l}\\) of size \\((J\\times L)\\) with \\(J<I\\) and \\(L<K\\). The scanning transforms each rectangle of size \\((J\\times L)\\) into one real number. Hence, the output has a smaller size: \\((I-J+1)\\times(K-L+1)\\). If \\(I=K=1,000\\) and \\(J=L=201\\), then the output has dimension \\((800\\times 800)\\), which is already much smaller. The output values are given by \\[o_{i,k}=\\sum_{j=1}^J\\sum_{l=1}^Lw_{j,l}x_{i+j-1,k+l-1}.\\] FIGURE 7.11: Scheme of a convolutional unit. Note: the dimensions are general and do not correspond to the number of squares. Iteratively reducing the dimension of the output via sequences of convolutional layers like the one presented above would be computationally costly and could give rise to overfitting because the number of weights would be very large. In order to efficiently reduce the size of the outputs, pooling layers are often used. The job of pooling units is to simplify matrices by reducing them to a simple metric such as the minimum, maximum or average value of the matrix: \\[o_{i,k}=f(x_{i+j-1,k+l-1}, 1\\le j\\le J, 1 \\le l\\le L),\\] where \\(f\\) is the minimum, maximum or average value. We show examples of pooling in Figure 7.12 below. In order to increase the speed of compression, it is possible to add a stride that omits cells. A stride value of \\(v\\) will perform the operation only every \\(v\\) values and hence bypass intermediate steps. In Figure 7.12, the two cases on the left do not resort to a stride, hence the reduction in dimension is exactly driven by the pooling size. When the stride comes into action (right pane), the reduction is more marked: from a 1,000 by 1,000 input, a 2-by-2 pooling layer with stride 2 will yield a 500-by-500 output; the dimension is shrunk fourfold, as in the right scheme of Figure 7.12. FIGURE 7.12: Scheme of pooling units. With these building blocks in hand, it is possible to construct new predictive tools. In Hoseinzade and Haratizadeh (2019), predictors such as price quotes, technical indicators and macro-economic data are fed to a complex neural network with 6 layers in order to predict the sign of price variations. While this is clearly an interesting computer science exercise, the deep economic motivation behind this choice of architecture remains unclear. Sangadiev et al. (2020) use CNNs to build portfolios relying on limit order book data. 7.6.4 Advanced architectures The superiority of neural networks in tasks related to computer vision and natural language processing is now well established. 
However, in many ML tournaments in the 2010 decade, neural networks have often been surpassed by tree-based models when dealing with tabular data. This puzzle encouraged researchers to construct novel NN structures that are better suited to tabular databases. Examples include Arik and Pfister (2019) and Popov, Morozov, and Babenko (2019), but their ideas lie outside the scope of this book. Surprisingly, the reverse idea also exists: Nuti, Rugama, and Thommen (2019) try to adapt trees and random forests so that they behave more like neural networks. The interested reader can have a look at the original papers. 7.7 Coding exercise The purpose of the exercise is to code the autoencoder model described in Gu, Kelly, and Xiu (2020a) (see Section 7.6.2). When coding NNs, the dimensions must be rigorously reported. This is why we reproduce a diagram of the model in Figure 7.13 which clearly shows the inputs and outputs along with their dimensions. FIGURE 7.13: Scheme of the autoencoder pricing model. In order to harness the full potential of Keras, it is imperative to switch to more general formulations of NNs. This can be done via the so-called functional API: https://keras.rstudio.com/articles/functional_api.html. References "],["svm.html", "Chapter 8 Support vector machines 8.1 SVM for classification 8.2 SVM for regression 8.3 Practice 8.4 Coding exercises", " Chapter 8 Support vector machines While the origins of support vector machines (SVMs) are old (and go back to Vapnik and Lerner (1963)), their modern treatment was initiated in Boser, Guyon, and Vapnik (1992) and Cortes and Vapnik (1995) (binary classification) and Drucker et al. (1997) (regression). We refer to http://www.kernel-machines.org/books for an exhaustive bibliography on their theoretical and empirical properties. SVMs have been very popular since their creation among the machine learning community. Nonetheless, other tools (neural networks especially) have gained popularity and progressively replaced SVMs in many applications like computer vision notably. 8.1 SVM for classification As is often the case in machine learning, it is easier to explain a complex tool through an illustration with binary classification. In fact, sometimes, it is originally how the tool was designed (e.g., for the perceptron). Let us consider a simple example in the plane, that is, with two features. In Figure 8.1, the goal is to find a model that correctly classifies points: filled circles versus empty squares. FIGURE 8.1: Diagram of binary classification with support vectors. A model consists of two weights \\(\\textbf{w}=(w_1,w_2)\\) that load on the variables and create a natural linear separation in the plane. In the example above, we show three separations. The red one is not a good classifier because there are circles and squares above and beneath it. The blue line is a good classifier: all circles are to its left and all squares to its right. Likewise, the green line achieves a perfect classification score. Yet, there is a notable difference between the two. The grey star at the top of the graph is a mystery point and given its location, if the data pattern holds, it should be a circle. The blue model fails to recognize it as such while the green one succeeds. The interesting features of the scheme are those that we have not mentioned yet, that is, the grey dotted lines. These lines represent the no-man’s land in which no observation falls when the green model is enforced. 
In this area, each strip above and below the green line can be viewed as a margin of error for the model. Typically, the grey star is located inside this margin. The two margins are computed as the parallel lines that maximize the distance between the model and the closest points that are correctly classified (on both sides). These points are called support vectors, which justifies the name of the technique. Obviously, the green model has a greater margin than the blue one. The core idea of SVMs is to maximize the margin, under the constraint that the classifier does not make any mistake. Said differently, SVMs try to pick the most robust model among all those that yield a correct classification. More formally, if we numerically define circles as +1 and squares as -1, any ‘good’ linear model is expected to satisfy: \\[\\begin{equation} \\tag{8.1} \\left\\{\\begin{array}{lll} \\sum_{k=1}^Kw_kx_{i,k}+b \\ge +1 & \\text{ when } y_i=+1 \\\\ \\sum_{k=1}^Kw_kx_{i,k}+b \\le -1 & \\text{ when } y_i=-1, \\end{array}\\right. \\end{equation}\\] which can be summarized in compact form \\(y_i \\times \\left(\\sum_{k=1}^K w_kx_{i,k}+b \\right)\\ge 1\\). Now, the margin between the green model and a support vector on the dashed grey line is equal to \\(||\\textbf{w}||^{-1}=\\left(\\sum_{k=1}^Kw_k^2\\right)^{-1/2}\\). This value comes from the fact that the distance between a point \\((x_0,y_0)\\) and a line defined by \\(ax+by+c=0\\) is equal to \\(d=\\frac{|ax_0+by_0+c|}{\\sqrt{a^2+b^2}}\\). In the case of the model defined above (8.1), the numerator is equal to 1 and the norm is that of \\(\\textbf{w}\\). Thus, the final problem is the following: \\[\\begin{equation} \\tag{8.2} \\underset{\\textbf{w}, b}{\\text{argmin}} \\ \\frac{1}{2} ||\\textbf{w}||^2 \\ \\text{ s.t. } y_i\\left(\\sum_{k=1}^Kw_kx_{i,k}+b \\right)\\ge 1. \\end{equation}\\] The Lagrangian of this program (see chapter 5 in Boyd and Vandenberghe (2004)) is \\[\\begin{equation} \\tag{8.3} L(\\textbf{w},b,\\boldsymbol{\\lambda})= \\frac{1}{2}||\\textbf{w}||^2 - \\sum_{i=1}^I\\lambda_i\\left(y_i\\left(\\sum_{k=1}^Kw_kx_{i,k}+b \\right)- 1\\right), \\end{equation}\\] where either \\(\\lambda_i=0\\) or \\(y_i\\left(\\sum_{k=1}^Kw_kx_{i,k}+b \\right)= 1\\). Thus, only some points will matter in the solution (the so-called support vectors). The first order conditions impose that the derivatives of this Lagrangian be null: \\[\\frac{\\partial}{\\partial \\textbf{w}}L(\\textbf{w},b,\\boldsymbol{\\lambda})=\\textbf{0}, \\quad \\frac{\\partial}{\\partial b}L(\\textbf{w},b,\\boldsymbol{\\lambda})=0,\\] where the first condition leads to \\[\\textbf{w}^*=\\sum_{i=1}^I\\lambda_iy_i\\textbf{x}_i.\\] This solution is indeed a linear form of the features, but only some points are taken into account. They are those for which the inequalities (8.1) are equalities. Naturally, this problem becomes infeasible whenever the condition cannot be satisfied, that is, when a simple line cannot perfectly separate the labels, no matter the choice of coefficients. This is the most common configuration, and such datasets are then said to be not linearly separable. This complicates the process but it is possible to resort to a trick. The idea is to introduce some flexibility in (8.1) by adding correction variables that allow the conditions to be met: \\[\\begin{equation} \\tag{8.4} \\left\\{\\begin{array}{lll} \\sum_{k=1}^Kw_kx_{i,k}+b \\ge +1-\\xi_i & \\text{ when } y_i=+1 \\\\ \\sum_{k=1}^Kw_kx_{i,k}+b \\le -1+\\xi_i & \\text{ when } y_i=-1, \\end{array}\\right. \\end{equation}\\] where the novelties, the \\(\\xi_i\\), are positive so-called ‘slack’ variables that make the conditions feasible. They are illustrated in Figure 8.2. In this new configuration, there is no simple linear model that can perfectly discriminate between the two classes. FIGURE 8.2: Diagram of binary classification with SVM - linearly inseparable data. The optimization program then becomes \\[\\begin{equation} \\tag{8.5} \\underset{\\textbf{w},b, \\boldsymbol{\\xi}}{\\text{argmin}} \\ \\frac{1}{2} ||\\textbf{w}||^2+C\\sum_{i=1}^I\\xi_i \\ \\text{ s.t. } \\left\\{ y_i\\left(\\sum_{k=1}^Kw_k\\phi(x_{i,k})+b \\right)\\ge 1-\\xi_i \\ \\text{ and } \\ \\xi_i\\ge 0, \\ \\forall i \\right\\}, \\end{equation}\\] where the parameter \\(C>0\\) tunes the cost of misclassification: as \\(C\\) increases, errors become more penalizing. In addition, the program can be generalized to nonlinear models, via the kernel \\(\\phi\\) which is applied to the input points \\(x_{i,k}\\). Nonlinear kernels can help cope with patterns that are more complex than straight lines (see Figure 8.3). Common kernels can be polynomial, radial or sigmoid. The solution is found using more or less standard techniques for constrained quadratic programs. Once the weights \\(\\textbf{w}\\) and bias \\(b\\) are set via training, a prediction for a new vector \\(\\textbf{x}_j\\) is simply made by computing \\(\\sum_{k=1}^Kw_k\\phi(x_{j,k})+b\\) and choosing the class based on the sign of the expression. FIGURE 8.3: Examples of nonlinear kernels. 8.2 SVM for regression The ideas of classification SVM can be transposed to regression exercises but the role of the margin is different. One general formulation is the following \\[\\begin{align} \\underset{\\textbf{w},b, \\boldsymbol{\\xi}}{\\text{argmin}} \\ & \\frac{1}{2} ||\\textbf{w}||^2+C\\sum_{i=1}^I\\left(\\xi_i+\\xi_i^* \\right)\\\\ \\text{ s.t. }& \\sum_{k=1}^Kw_k\\phi(x_{i,k})+b -y_i\\le \\epsilon+\\xi_i \\\\ \\tag{8.6} & y_i-\\sum_{k=1}^Kw_k\\phi(x_{i,k})-b \\le \\epsilon+\\xi_i^* \\\\ &\\xi_i,\\xi_i^*\\ge 0, \\ \\forall i , \\end{align}\\] and it is illustrated in Figure 8.4. The user specifies a margin \\(\\epsilon\\) and the model will try to find the linear (up to kernel transformation) relationship between the labels \\(y_i\\) and the input \\(\\textbf{x}_i\\). Just as in the classification task, if the data points are inside the strip, the slack variables \\(\\xi_i\\) and \\(\\xi_i^*\\) are set to zero. When the points violate the threshold, the objective function (first line of the program) is penalized. Note that setting a large \\(\\epsilon\\) leaves room for more error. Once the model has been trained, a prediction for \\(\\textbf{x}_j\\) is simply \\(\\sum_{k=1}^Kw_k\\phi(x_{j,k})+b\\). FIGURE 8.4: Diagram of regression SVM. Let us take a step back and simplify what the algorithm does, that is: minimize the sum of squared weights \\(||\\textbf{w}||^2\\) subject to the error being small enough (modulo a slack variable). In spirit, this is somewhat the opposite of penalized linear regressions, which seek to minimize the error subject to the weights being small enough. The models laid out in this section are a preview of the universe of SVM engines and several other formulations have been developed. One reference library, coded in C and C++, is LIBSVM; it is interfaced by many other programming languages. 
The interested reader can have a look at the corresponding article Chang and Lin (2011) for more details on the SVM zoo (a more recent November 2019 version is also available online). 8.3 Practice In R the LIBSVM library is exploited in several packages. One of them, e1071, is a good choice because it also nests many other interesting functions, especially a naive Bayes classifier that we will see later on. In the implementation of LIBSVM, the package requires to specify the label and features separately. For this reason, we recycle the variables used for the boosted trees. Moreover, the training being slow, we perform it on a subsample of these sets (first thousand instances). library(e1071) fit_svm <- svm(y = train_label_xgb[1:1000], # Train label x = train_features_xgb[1:1000,], # Training features type = "eps-regression", # SVM task type (see LIBSVM documentation) kernel = "radial", # SVM kernel (or: linear, polynomial, sigmoid) epsilon = 0.1, # Width of strip for errors gamma = 0.5, # Constant in the radial kernel cost = 0.1) # Slack variable penalisation test_feat_short <- dplyr::select(testing_sample,features_short) mean((predict(fit_svm, test_feat_short) - testing_sample$R1M_Usd)^2) # MSE ## [1] 0.03839085 mean(predict(fit_svm, test_feat_short) * testing_sample$R1M_Usd > 0) # Hit ratio ## [1] 0.5222197 The results are slightly better than those of the boosted trees. All parameters are completely arbitrary, especially the choice of the kernel. We finally turn to a classification example. fit_svm_C <- svm(y = training_sample$R1M_Usd_C[1:1000], # Train label x = training_sample[1:1000,] %>% dplyr::select(features), # Training features type = "C-classification", # SVM task type (see LIBSVM doc.) kernel = "sigmoid", # SVM kernel gamma = 0.5, # Parameter in the sigmoid kernel coef0 = 0.3, # Parameter in the sigmoid kernel cost = 0.2) # Slack variable penalisation mean(predict(fit_svm_C, dplyr::select(testing_sample,features)) == testing_sample$R1M_Usd_C) # Accuracy ## [1] 0.5008973 Both the small training sample and the arbitrariness in our choice of the parameters may explain why the predictive accuracy is so poor. 8.4 Coding exercises From the simple example shown above, extend SVM models to other kernels and discuss the impact on the fit. Train a vanilla SVM model with labels being the 12-month forward (i.e., future) return and evaluate it on the testing sample. Do the same with a simple random forest. Compare. References "],["bayes.html", "Chapter 9 Bayesian methods 9.1 The Bayesian framework 9.2 Bayesian sampling 9.3 Bayesian linear regression 9.4 Naive Bayes classifier 9.5 Bayesian additive trees", " Chapter 9 Bayesian methods This section is dedicated to the subset of machine learning that makes prior assumptions on parameters. Before we explain how Bayes’ theorem can be applied to simple building blocks in machine learning, we introduce some notations and concepts in the subsection below. Good references for Bayesian analysis are Gelman et al. (2013) and Kruschke (2014). The latter, like the present book, illustrates the concepts with many lines of R code. 9.1 The Bayesian framework Up to now, the models that have been presented rely on data only. This approach is often referred to as ‘frequentist’. Given one dataset, a frequentist will extract (i.e., estimate) a unique set of optimal parameters and consider it to be the best model. Bayesians, on the other hand, consider datasets as snapshots of reality and, for them, parameters are thus random! 
Instead of estimating one value for parameters (e.g., a coefficient in a linear model), they are more ambitious and try to determine the whole distribution of the parameter. In order to outline how that can be achieved, we introduce basic notations and results. The foundational concept in Bayesian analysis is the conditional probability. Given two random sets (or events) \\(A\\) and \\(B\\), we define the probability of \\(A\\) knowing \\(B\\) (equivalently, the odds of having \\(A\\), conditionally on having \\(B\\)) as \\[P[A|B]=\\frac{P[A \\cap B]}{P[B]},\\] that is, the probability of the intersection between the two sets divided by the probability of \\(B\\). Likewise, the probability that both events occur is equal to \\(P[A \\cap B] = P[A]P[B|A]\\). Given \\(n\\) disjoint events \\(A_i\\), \\(i=1,...n\\) such that \\(\\sum_{i=1}^nP(A_i)=1\\), then for any event \\(B\\), the law of total probabilities is (or implies) \\[P(B)=\\sum_{i=1}^nP(B \\cap A_i)= \\sum_{i=1}^nP(B|A_i)P(A_i).\\] Given this expression, we can formulate a general version of Bayes’ theorem: \\[\\begin{equation} \\tag{9.1} P(A_i|B)=\\frac{P(A_i)P(B|A_i)}{P(B)}= \\frac{P(A_i)P(B|A_i)}{\\sum_{i=1}^nP(B|A_i)P(A_i)}. \\end{equation}\\] Endowed with this result, we can move forward to the core topic of this section, which is the estimation of some parameter \\(\\boldsymbol{\\theta}\\) (possibly a vector) given a dataset, which we denote with \\(\\textbf{y}\\) thereby following the conventions from Gelman et al. (2013). This notation is suboptimal in this book nonetheless because in all other chapters, \\(\\textbf{y}\\) stands for the label of a dataset. In Bayesian analysis, one sophistication (compared to a frequentist approach) comes from the fact that the data is not almighty. The distribution of the parameter \\(\\boldsymbol{\\theta}\\) will be a mix between some prior distribution set by the statistician (the user, the analyst) and the empirical distribution from the data. More precisely, a simple application of Bayes’ formula yields \\[\\begin{equation} \\tag{9.2} p(\\boldsymbol{\\theta}| \\textbf{y})=\\frac{p(\\boldsymbol{\\theta})p(\\textbf{y} |\\boldsymbol{\\theta})}{p(\\textbf{y})} \\propto p(\\boldsymbol{\\theta})p(\\textbf{y} |\\boldsymbol{\\theta}). \\end{equation}\\] The interpretation is immediate: the distribution of \\(\\boldsymbol{\\theta}\\) knowing the data \\(\\textbf{y}\\) is proportional to the distribution of \\(\\boldsymbol{\\theta}\\) times the distribution of \\(\\textbf{y}\\) knowing \\(\\boldsymbol{\\theta}\\). The term \\(p(\\textbf{y})\\) is often omitted because it is simply a scaling number that ensures that the density sums or integrates to one. We use a slightly different notation between Equation (9.1) and Equation (9.2). In the former, \\(P\\) denotes a true probability, i.e., it is a number. In the latter, \\(p\\) stands for the whole probability density function of \\(\\boldsymbol{\\theta}\\) or \\(\\textbf{y}\\). The whole purpose of Bayesian analysis is to compute the so-called posterior distribution \\(p(\\boldsymbol{\\theta}| \\textbf{y})\\) via the prior distribution \\(p(\\boldsymbol{\\theta})\\) and the likelihood function \\(p(\\textbf{y} |\\boldsymbol{\\theta})\\). Priors are sometimes qualified as informative, weakly informative or uninformative, depending on the degree to which the user is confident on the relevance and robustness of the prior. The simplest way to define a non-informative prior is to set a constant (uniform) distribution over some realistic interval(s). 
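To make Equation (9.2) more concrete, the short sketch below (plain base R, with made-up toy numbers) approximates the posterior of the mean of Gaussian data on a grid: the unnormalized posterior is simply the product of the prior and the likelihood, and its mode lies between the prior mean and the sample mean, a first glimpse of the shrinkage effect discussed in Section 9.3.
set.seed(42)                                               # Toy illustration of Equation (9.2)
y <- rnorm(20, mean = 0.5, sd = 1)                         # Simulated data (known sd, unknown mean)
theta_grid <- seq(-2, 3, by = 0.01)                        # Grid of candidate values for the mean
prior <- dnorm(theta_grid, mean = 0, sd = 1)               # Weakly informative prior, centered at 0
likelihood <- sapply(theta_grid,                           # Likelihood of the data for each candidate
                     function(theta) prod(dnorm(y, mean = theta, sd = 1)))
posterior <- prior * likelihood                            # Equation (9.2): prior times likelihood
posterior <- posterior / sum(posterior * 0.01)             # Normalization (grid step = 0.01)
theta_grid[which.max(posterior)]                           # Posterior mode: between 0 and mean(y)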
The most challenging part is usually the likelihood function. The easiest way to solve the problem is to resort to a specific distribution (possibly a parametric family) for the distribution of the data and then consider that observations are i.i.d., just as in a simple maximum likelihood inference. If we assume that new parameters for the distributions are gathered into \\(\\boldsymbol{\\lambda}\\), then the likelihood can be written as \\[\\begin{equation} \\tag{9.3} p(\\textbf{y} |\\boldsymbol{\\theta}, \\boldsymbol{\\lambda})=\\prod_{i=1}^I f_{\\boldsymbol{\\lambda}}(y_i; \\boldsymbol{\\theta}), \\end{equation}\\] but in this case the problem becomes slightly more complex because adding new parameters changes the posterior distribution to \\(p(\\boldsymbol{\\theta}, \\boldsymbol{\\lambda}|\\textbf{y})\\). The user must find out the joint distribution of \\(\\boldsymbol{\\theta}\\) and \\(\\boldsymbol{\\lambda}\\), given \\(\\textbf{y}\\). Because of their nested structure, these models are often called hierarchical models. Bayesian methods are widely used for portfolio choice. The rationale is that the distribution of asset returns depends on some parameter and the main issue is to determine the posterior distribution. We very briefly review a vast literature below. Bayesian asset allocation is investigated in Lai et al. (2011) (via stochastic optimization), Guidolin and Liu (2016) and Dangl and Weissensteiner (2020). Shrinkage techniques (of means and covariance matrices) are tested in Frost and Savarino (1986), Kan and Zhou (2007) and DeMiguel, Martı́n-Utrera, and Nogales (2015). In a similar vein, Tu and Zhou (2010) build priors that are coherent with asset pricing theories. Finally, Bauder et al. (2020) sample portfolio returns, which allows them to derive a Bayesian optimal frontier. We invite the interested reader to also delve into the references that are cited within these few articles. 9.2 Bayesian sampling 9.2.1 Gibbs sampling One adjacent field of applications of Bayes’ theorem is simulation. Suppose we want to simulate the multivariate distribution of a random vector \\(\\textbf{X}\\) given by its density \\(p=p(x_1,\\dots,x_J)\\). Often, the full distribution is complex, but its (full) conditional distributions are more accessible. Indeed, they are simpler because they depend on only one variable (when all other values are known): \\[p(X_j=x_j|X_1= x_1,\\dots,X_{j-1}=x_{j-1},X_{j+1}=x_{j+1},\\dots,X_J=x_J)=p(X_j=x_j|\\textbf{X}_{-j}=\\textbf{x}_{-j}),\\] where we use the compact notation \\(\\textbf{X}_{-j}\\) for all variables except \\(X_j\\). One way to generate samples with law \\(p\\) relies both on the knowledge of the conditionals \\(p(x_j|\\textbf{x}_{-j})\\) and on the notion of Markov Chain Monte Carlo, which we outline below. The process is iterative and assumes that it is possible to draw samples of the aforementioned conditionals. We write \\(x_j^{m}\\) for the \\(m^{th}\\) sample of the \\(j^{th}\\) variable (\\(X_j\\)). The simulation starts with a prior (or fixed, or random) sample \\(\\textbf{x}^0=(x^0_1,\\dots,x^0_J)\\). Then, for a sufficiently large number of times, say \\(T\\), new samples are drawn according to \\[\\begin{align*} x_1^{m+1} &= p(X_1|X_2=x_2^{m}, \\dots ,X_J=x_J^m) ;\\\\ x_2^{m+1} &=p(X_2|X_1=x_1^{m+1}, X_3=x^{m}_3, \\dots, X_J=x_J^m); \\\\ \\dots& \\\\ x_J^{m+1}&= p(X_J|X_1=x_1^{m+1}, X_2=x_2^{m+1}, \\dots, X_{J-1}=x_{J-1}^{m+1}). \\end{align*}\\] The important detail is that after each line, the value of the variable is updated. 
Hence, in the second line, \\(X_2\\) is sampled with the knowledge of \\(X_1=x_1^{m+1}\\) and in the last line, all variables except \\(X_J\\) have been updated to their \\((m+1)^{th}\\) state. The above algorithm is called Gibbs sampling. It relates to Markov chains because each new iteration depends only on the previous one. Under some technical assumptions, as \\(T\\) increases, the distribution of \\(\\textbf{x}_T\\) converges to that of \\(p\\). The conditions under which the convergence occurs have been widely discussed in a series of articles in the 1990s. The interested reader can have a look for instance at Tierney (1994), Roberts and Smith (1994), as well as at section 11.7 of Gelman et al. (2013). Sometimes, the full distribution is complex and the conditional laws are hard to determine and to sample. Then, a more general method, called Metropolis-Hastings, can be used; it relies on the rejection method for the simulation of random variables. 9.2.2 Metropolis-Hastings sampling The Gibbs algorithm can be considered as a particular case of the Metropolis-Hastings (MH) method, which, in its simplest version, was introduced in Metropolis and Ulam (1949). The premise is similar: the aim is to simulate random variables that follow \\(p(\\textbf{x})\\) with the ability to sample from a simpler form \\(p(\\textbf{x}|\\textbf{y})\\) which gives the probability of the future state \\(\\textbf{x}\\), given the past one \\(\\textbf{y}\\). Once an initial value for \\(\\textbf{x}\\) has been sampled (\\(\\textbf{x}_0\\)), each new iteration (\\(m\\)) of the simulation takes place in three stages: (i) generate a candidate value \\(\\textbf{x}'_{m+1}\\) from \\(p(\\textbf{x}|\\textbf{x}_m)\\); (ii) compute the acceptance ratio \\(\\alpha=\\min\\left(1,\\frac{p(\\textbf{x}'_{m+1})p(\\textbf{x}_{m}|\\textbf{x}'_{m+1})}{p(\\textbf{x}_{m})p(\\textbf{x}'_{m+1}|\\textbf{x}_{m})} \\right)\\); (iii) pick \\(\\textbf{x}_{m+1}=\\textbf{x}'_{m+1}\\) with probability \\(\\alpha\\) or stick with the previous value (\\(\\textbf{x}_{m+1}=\\textbf{x}_{m}\\)) with probability \\(1-\\alpha\\). The interpretation of the acceptance ratio is not straightforward in the general case. When the sampling generator is symmetric (\\(p(\\textbf{x}|\\textbf{y})=p(\\textbf{y}|\\textbf{x})\\)), the candidate is always chosen whenever \\(p(\\textbf{x}'_{m+1})\\ge p(\\textbf{x}_{m})\\). If the reverse condition holds (\\(p(\\textbf{x}'_{m+1})< p(\\textbf{x}_{m})\\)), then the candidate is retained with odds equal to \\(p(\\textbf{x}'_{m+1})/p(\\textbf{x}_{m})\\), which is the ratio of likelihoods. The more likely the new proposal, the higher the odds of retaining it. Often, the first simulations are discarded in order to leave time for the chain to converge to a high-probability region. This procedure (often called ‘burn-in’) ensures that the first retained samples are located in a zone that is likely, i.e., that they are more representative of the law we are trying to simulate. For the sake of brevity, we stick to a succinct presentation here, but some additional details are outlined in section 11.2 of Gelman et al. (2013) and in chapter 7 of Kruschke (2014). 9.3 Bayesian linear regression Because Bayesian concepts are rather abstract, it is useful to illustrate the theoretical notions with a simple example. In a linear model, \\(y_i=\\textbf{x}_i'\\textbf{b}+\\epsilon_i\\) and it is often statistically assumed that the \\(\\epsilon_i\\) are i.i.d. and normally distributed with zero mean and variance \\(\\sigma^2\\). 
Hence, the likelihood of Equation (9.3) translates into \\[p(\\boldsymbol{\\epsilon}|\\textbf{b}, \\sigma)=\\prod_{i=1}^I\\frac{e^{-\\frac{\\epsilon_i^2}{2\\sigma^2}}}{\\sigma\\sqrt{2\\pi}}=(\\sigma\\sqrt{2\\pi})^{-I}e^{-\\sum_{i=1}^I\\frac{\\epsilon_i^2}{2\\sigma^2}}.\\] In a regression analysis, the data is given both by \\(\\textbf{y}\\) and by \\(\\textbf{X}\\), hence both are reported in the notations. Simply acknowledging that \\(\\boldsymbol{\\epsilon}=\\textbf{y}-\\textbf{Xb}\\), we get \\[\\begin{align} p(\\textbf{y},\\textbf{X}|\\textbf{b}, \\sigma)&=\\prod_{i=1}^I\\frac{e^{-\\frac{\\epsilon_i^2}{2\\sigma^2}}}{\\sigma\\sqrt{2\\pi}}\\\\ &=(\\sigma\\sqrt{2\\pi})^{-I}e^{-\\sum_{i=1}^I\\frac{\\left(y_i-\\textbf{x}_i'\\textbf{b}\\right)^2}{2\\sigma^2}}=(\\sigma\\sqrt{2\\pi})^{-I} e^{-\\frac{\\left(\\textbf{y}-\\textbf{X}\\textbf{b}\\right)' \\left(\\textbf{y}-\\textbf{X}\\textbf{b}\\right)}{2\\sigma^2}} \\nonumber \\\\ \\tag{9.4} &=\\underbrace{(\\sigma\\sqrt{2\\pi})^{-I} e^{-\\frac{\\left(\\textbf{y}-\\textbf{X}\\hat{\\textbf{b}}\\right)' \\left(\\textbf{y}-\\textbf{X}\\hat{\\textbf{b}}\\right)}{2\\sigma^2}}}_{\\text{depends on } \\sigma, \\text{ not } \\textbf{b}}\\times \\underbrace{e^{-\\frac{(\\textbf{b}-\\hat{\\textbf{b}})'\\textbf{X}'\\textbf{X}(\\textbf{b}-\\hat{\\textbf{b}})}{2\\sigma^2}}}_{\\text{ depends on both } \\sigma, \\text{ and } \\textbf{b} }. \\end{align}\\] In the last line, the second term is a function of the difference \\(\\textbf{b}-\\hat{\\textbf{b}}\\), where \\(\\hat{\\textbf{b}}=(\\textbf{X}'\\textbf{X})^{-1}\\textbf{X}'\\textbf{y}\\). This is not surprising: \\(\\hat{\\textbf{b}}\\) is a natural benchmark for the mean of \\(\\textbf{b}\\). Moreover, introducing \\(\\hat{\\textbf{b}}\\) yields a relatively simple form for the probability. The above expression is the frequentist (data-based) block of the posterior: the likelihood. If we want to obtain a tractable expression for the posterior, we need to find a prior component that has a form that will combine well with this likelihood. These forms are called conjugate priors. A natural candidate for the right part (that depends on both \\(\\textbf{b}\\) and \\(\\sigma\\)) is the multivariate Gaussian density: \\[\\begin{equation} \\tag{9.5} p[\\textbf{b}|\\sigma]=\\sigma^{-k}e^{-\\frac{(\\textbf{b}-\\textbf{b}_0)'\\boldsymbol{\\Lambda}_0(\\textbf{b}-\\textbf{b}_0)}{2\\sigma^2}}, \\end{equation}\\] where we are obliged to condition with respect to \\(\\sigma\\). The density has prior mean \\(\\textbf{b}_0\\) and prior covariance matrix \\(\\boldsymbol{\\Lambda}_0^{-1}\\). This prior gets us one step closer to the posterior because \\[\\begin{align} p[\\textbf{b},\\sigma|\\textbf{y},\\textbf{X}]& \\propto p[\\textbf{y},\\textbf{X}|\\textbf{b},\\sigma]p[\\textbf{b},\\sigma] \\nonumber \\\\ \\tag{9.6} &\\propto p[\\textbf{y},\\textbf{X}|\\textbf{b},\\sigma]p[\\textbf{b}|\\sigma]p[\\sigma]. \\end{align}\\] In order to fully specify the cascade of probabilities, we need to take care of \\(\\sigma\\) and set a density of the form \\[\\begin{equation} \\tag{9.7} p[\\sigma^2]\\propto (\\sigma^2)^{-1-a_0}e^{-\\frac{b_0}{2\\sigma^2}}, \\end{equation}\\] which is close to that of the left part of (9.4). This corresponds to an inverse gamma distribution for the variance with prior parameters \\(a_0\\) and \\(b_0\\) (this scalar notation is not optimal because it can be confused with the prior mean \\(\\textbf{b}_0\\), so we must pay extra attention). 
Now, we can simplify \\(p[\\textbf{b},\\sigma|\\textbf{y},\\textbf{X}]\\) with (9.4), (9.5) and (9.7): \\[\\begin{align*} p[\\textbf{b},\\sigma|\\textbf{y},\\textbf{X}]& \\propto (\\sigma\\sqrt{2\\pi})^{-I} \\sigma^{-2(1+a_0)} e^{-\\frac{\\left(\\textbf{y}-\\textbf{X}\\hat{\\textbf{b}}\\right)' \\left(\\textbf{y}-\\textbf{X}\\hat{\\textbf{b}}\\right)}{2\\sigma^2}} \\\\ &\\quad \\times e^{-\\frac{(\\textbf{b}-\\hat{\\textbf{b}})'\\textbf{X}'\\textbf{X}(\\textbf{b}-\\hat{\\textbf{b}})}{2\\sigma^2}}\\sigma^{-k}e^{-\\frac{(\\textbf{b}-\\textbf{b}_0)'\\boldsymbol{\\Lambda}_0(\\textbf{b}-\\textbf{b}_0)}{2\\sigma^2}}e^{-\\frac{b_0}{2\\sigma^2}} \\\\ \\end{align*}\\] which can be rewritten \\[\\begin{align*} p[\\textbf{b},\\sigma|\\textbf{y},\\textbf{X}]& \\propto \\sigma^{-I-k-2(1+a_0)} \\\\ &\\times \\exp\\left(-\\frac{\\left(\\textbf{y}-\\textbf{X}\\hat{\\textbf{b}}\\right)' \\left(\\textbf{y}-\\textbf{X}\\hat{\\textbf{b}}\\right) + (\\textbf{b}-\\hat{\\textbf{b}})'\\textbf{X}'\\textbf{X}(\\textbf{b}-\\hat{\\textbf{b}}) + (\\textbf{b}-\\textbf{b}_0)'\\boldsymbol{\\Lambda}_0(\\textbf{b}-\\textbf{b}_0)+b_0}{2\\sigma^2} \\right) . \\end{align*}\\] The above expression is simply a quadratic form in \\(\\textbf{b}\\) and it can be rewritten after burdensome algebra in a much more compact manner: \\[\\begin{equation} \\label{eq:linpost} p(\\textbf{b}|\\textbf{y},\\textbf{X},\\sigma) \\propto \\left[\\sigma^{-k}e^{-\\frac{(\\textbf{b}-\\textbf{b}_*)'\\boldsymbol{\\Lambda}_*(\\textbf{b}-\\textbf{b}_*)}{2\\sigma^2}}\\right] \\times \\left[ (\\sigma^2)^{-1-a_*}e^{-\\frac{b_*}{2\\sigma^2}} \\right], \\end{equation}\\] where \\[\\begin{align*} \\boldsymbol{\\Lambda}_* &= \\textbf{X}'\\textbf{X}+\\boldsymbol{\\Lambda}_0 \\\\ \\textbf{b}_*&= \\boldsymbol{\\Lambda}_*^{-1}(\\boldsymbol{\\Lambda}_0\\textbf{b}_0+\\textbf{X}'\\textbf{X}\\hat{\\textbf{b}}) \\\\ a_* & = a_0 + I/2 \\\\ b_* &=b_0+\\frac{1}{2}\\left(\\textbf{y}'\\textbf{y}+ \\textbf{b}_0'\\boldsymbol{\\Lambda}_0\\textbf{b}_0-\\textbf{b}_*'\\boldsymbol{\\Lambda}_*\\textbf{b}_* \\right).\\\\ \\end{align*}\\] This expression has two parts: the Gaussian component which relates mostly to \\(\\textbf{b}\\), and the inverse gamma component, entirely dedicated to \\(\\sigma\\). The mix between the prior and the data is clear. The posterior precision matrix of the Gaussian part (\\(\\boldsymbol{\\Lambda}_*\\)) is the sum of the prior precision and a quadratic form from the data. The posterior mean \\(\\textbf{b}_*\\) is a weighted average of the prior \\(\\textbf{b}_0\\) and the sample estimator \\(\\hat{\\textbf{b}}\\). Such blends of quantities estimated from data and a user-supplied version are often called shrinkage estimators. For instance, the original matrix of cross-terms \\(\\textbf{X}'\\textbf{X}\\) is shrunk towards the prior \\(\\boldsymbol{\\Lambda}_0\\). This can be viewed as a regularization procedure: the pure fit originating from the data is mixed with some ‘external’ ingredient to give some structure to the final estimation. The interested reader can also have a look at section 16.3 of Greene (2018) (the case of conjugate priors is treated in subsection 16.3.2). The formulae above can be long and risky to implement. Luckily, there is an R package (\\(spBayes\\)) that performs Bayesian inference for linear regression using the conjugate priors. Below, we provide one example of how it works. To simplify the code and curtail computation times, we consider two predictors: market capitalization (size anomaly) and price-to-book ratio (value anomaly). 
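As a complement, the closed-form updates above can also be coded directly in a few lines. The sketch below is a hypothetical helper (its name and argument choices are ours, not the package's); it assumes a design matrix that includes an intercept column and simply mirrors the formulas for \\(\\boldsymbol{\\Lambda}_*\\), \\(\\textbf{b}_*\\), \\(a_*\\) and \\(b_*\\), with the same prior values as those passed to the package further below.
posterior_lm <- function(X, y, b0, Lambda0, a0, b0_ig){                  # Conjugate updates (sketch)
  Lambda_star <- t(X) %*% X + Lambda0                                    # Posterior precision
  b_hat <- solve(t(X) %*% X, t(X) %*% y)                                 # OLS estimate
  b_star <- solve(Lambda_star, Lambda0 %*% b0 + t(X) %*% X %*% b_hat)    # Posterior mean (shrinkage)
  a_star <- a0 + length(y) / 2                                           # Posterior shape
  b_star_ig <- b0_ig + 0.5 * (sum(y^2) + t(b0) %*% Lambda0 %*% b0 -
                                t(b_star) %*% Lambda_star %*% b_star)    # Posterior rate
  list(b_star = b_star, Lambda_star = Lambda_star,
       a_star = a_star, b_star_ig = as.numeric(b_star_ig))
}
X <- cbind(1, testing_sample$Mkt_Cap_3M_Usd, testing_sample$Pb)          # Intercept, size, value
post <- posterior_lm(X, testing_sample$R1M_Usd,
                     b0 = c(0.01, 0.1, 0.1),                             # Same prior means as below
                     Lambda0 = solve(diag(c(0.01, 0.1, 0.1)^2)),         # Same prior precision as below
                     a0 = 0.5, b0_ig = 0.5)                              # Prior shape & rate (as below)
post$b_star                                                              # Posterior mean of coefficients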
In statistics, the precision matrix is the inverse of the covariance matrix. In the parameters, the first two priors relate to the Gaussian law and the last two to the inverse gamma distribution: \\[f_\\text{invgamma}(x, \\alpha, \\beta)=\\frac{\\beta^\\alpha}{\\Gamma(\\alpha)}x^{-1-\\alpha}e^{-\\frac{\\beta}{x}},\\] where \\(\\alpha\\) is the shape and \\(\\beta\\) is the scale. prior_mean <- c(0.01,0.1,0.1) # Average value of parameters (prior) precision_mat <- diag(prior_mean^2) %>% solve() # Inverse cov matrix of parameters (prior) fit_lmBayes <- bayesLMConjugate( R1M_Usd ~ Mkt_Cap_3M_Usd + Pb, # Model: size and value data = testing_sample, # Data source, here, the test sample n.samples = 2000, # Number of samples used beta.prior.mean = prior_mean, # Avg prior: size & value rewarded & unit beta beta.prior.precision = precision_mat, # Precision matrix prior.shape = 0.5, # Shape for prior distribution of sigma prior.rate = 0.5) # Scale for prior distribution of sigma In the above specification, we must also provide a prior for the constant. By default, we set its average value to 0.01, which corresponds to a 1% average monthly return. Once the model has been estimated, we can plot the distribution of coefficient estimates. fit_lmBayes$p.beta.tauSq.samples[,1:3] %>% as_tibble() %>% `colnames<-`(c("Intercept", "Size", "Value")) %>% gather(key = coefficient, value = value) %>% ggplot(aes(x = value, fill = coefficient)) + geom_histogram(alpha = 0.5) FIGURE 9.1: Distribution of linear regression coefficients (betas). The distribution of the constant in Figure 9.1 is firmly to the right with a small dispersion, hence it is solidly positive. For the size coefficient, it is the opposite; it is negative (small firms are more profitable). With regard to value, it is hard to conclude, the distribution is balanced around zero: there is no clear exposition to the price-to-book ratio variable. 9.4 Naive Bayes classifier Bayes’ theorem can also be easily applied to classification. We formulate it with respect to the label and features and write \\[\\begin{equation} \\tag{9.8} P[\\textbf{y} | \\textbf{X}] = \\frac{P[ \\textbf{X} | \\textbf{y}]P[\\textbf{y}]}{P[\\textbf{X}]} \\propto P[ \\textbf{X} | \\textbf{y}]P[\\textbf{y}], \\end{equation}\\] and then split the input matrix into its column vectors \\(\\textbf{X}=(\\textbf{x}_1,\\dots,\\textbf{x}_K)\\). This yields \\[\\begin{equation} \\tag{9.9} P[\\textbf{y} | \\textbf{x}_1,\\dots,\\textbf{x}_K] \\propto P[\\textbf{x}_1,\\dots,\\textbf{x}_K| \\textbf{y}]P[\\textbf{y}]. \\end{equation}\\] The ‘naive’ qualification of the method comes from a simplifying assumption on the features.19 If they are all mutually independent, then the likelihood in the above expression can be expanded into \\[\\begin{equation} \\tag{9.10} P[\\textbf{y} | \\textbf{x}_1,\\dots,\\textbf{x}_K] \\propto P[\\textbf{y}]\\prod_{k=1}^K P[\\textbf{x}_k| \\textbf{y}]. \\end{equation}\\] The next step is to be more specific about the likelihood. This can be done non-parametrically (via kernel estimation) or with common distributions (Gaussian for continuous data, Bernoulli for binary data). In factor investing, the features are continuous, thus the Gaussian law is more adequate: \\[P[x_{i,k}=z|\\textbf{y}_i= c]=\\frac{e^{-\\frac{(z-m_c)^2}{2\\sigma_c^2}}}{\\sigma_c\\sqrt{2\\pi}},\\] where \\(c\\) is the value of the classes taken by \\(y\\) and \\(\\sigma_c\\) and \\(m_c\\) are the standard error and mean of \\(x_{i,k}\\), conditional on \\(y_i\\) being equal to \\(c\\). 
In practice, the classes are spanned one by one: for each class, the training set is filtered accordingly and \\(\\sigma_c\\) and \\(m_c\\) are taken to be the corresponding sample statistics. This Gaussian parametrization is probably ill-suited to our dataset because the features are uniformly distributed. Even after conditioning, it is unlikely that the distribution will be even remotely close to Gaussian. Technically, this can be overcome via a double transformation method. Given a vector of features \\(\\textbf{x}_k\\) with empirical cdf \\(F_{\\textbf{x}_k}\\), the variable \\[\\begin{equation} \\tag{9.11} \\tilde{\\textbf{x}}_k=\\Phi^{-1}\\left(F_{\\textbf{x}_k}(\\textbf{x}_k) \\right), \\end{equation}\\] will have a standard normal law whenever \\(F_{\\textbf{x}_k}\\) is not pathological. Non-pathological cases are when the cdf is continuous and strictly increasing and when the values \\(F_{\\textbf{x}_k}(\\textbf{x}_k)\\) lie in the open interval (0,1). If all features are independent, the transformation should not have any impact on the correlation structure. Otherwise, we refer to the literature on the NORmal-To-Anything (NORTA) method (see, e.g., Chen (2001) and Coqueret (2017)). Lastly, the prior \\(P[\\textbf{y}]\\) in Equation (9.10) is often either taken to be uniform across the classes (one over the number of classes) or equal to the sample distribution. We illustrate the naive Bayes classification tool with a simple example. While the package e1071 embeds such a classifier, the naivebayes library offers more options (Gaussian, Bernoulli, multinomial and nonparametric likelihoods). Below, since the features are uniformly distributed, the transformation in (9.11) amounts to applying the Gaussian quantile function (inverse cdf). For visual clarity, we only use the small set of features. library(naivebayes) # Load package gauss_features_train <- training_sample %>% # Build sample dplyr::select(features_short) %>% as.matrix() %>% `*`(0.999) %>% # Features smaller than 1 + (0.0001) %>% # Features larger than 0 qnorm() %>% # Inverse Gaussian cdf `colnames<-`(features_short) fit_NB_gauss <- naive_bayes(x = gauss_features_train, # Transformed features y = training_sample$R1M_Usd_C) # Label layout(matrix(c(1,1,2,3,4,5,6,7), 4, 2, byrow = TRUE), # Organize graphs widths=c(0.9,0.45)) par(mar=c(1, 1, 1, 1)) plot(fit_NB_gauss, prob = "conditional") FIGURE 9.2: Distributions of predictor variables, conditional on the class of the label. TRUE is when the instance corresponds to an above median return and FALSE to a below median return. The plots in Figure 9.2 show the distributions of the features, conditionally on each value of the label. Essentially, those are the densities \\(P[\\textbf{x}_k| \\textbf{y}]\\). For each feature, both distributions are very similar. As usual, once the model has been trained, the accuracy of predictions can be evaluated. gauss_features_test <- testing_sample %>% dplyr::select(features_short) %>% as.matrix() %>% `*`(0.999) %>% + (0.0001) %>% qnorm() %>% `colnames<-`(features_short) mean(predict(fit_NB_gauss, gauss_features_test) == testing_sample$R1M_Usd_C) # Hit ratio ## [1] 0.4956985 The performance of the classifier is not satisfactory as it underperforms a random guess. 9.5 Bayesian additive trees 9.5.1 General formulation Bayesian additive regression trees (BARTs) are an ensemble technique that mixes Bayesian thinking and regression trees. In spirit, they are close to the tree ensembles seen in Chapter 6, but they differ greatly in their implementation. In BARTs, like in Bayesian regressions, the regularization comes from the prior. 
The original article is Chipman, George, and McCulloch (2010) and the implementation (in R) follows Sparapani, Spanbauer, and McCulloch (2019). Formally, the model is an aggregation of \\(M\\) models, which we write as \\[\\begin{equation} \\tag{9.12} y = \\sum_{m=1}^M\\mathcal{T}_m(q_m,\\textbf{w}_m, \\textbf{x}) + \\epsilon, \\end{equation}\\] where \\(\\epsilon\\) is a Gaussian noise with variance \\(\\sigma^2\\), and the \\(\\mathcal{T}_m=\\mathcal{T}_m(q_m,\\textbf{w}_m, \\textbf{x})\\) are decision trees with structure \\(q_m\\) and weights vectors \\(\\textbf{w}_m\\). This decomposition of the tree is the one we used for boosted trees and is illustrated in Figure 6.5. \\(q_m\\) codes all splits (variables chosen for the splits and levels of the splits) and the vectors \\(\\textbf{w}_m\\) correspond to the leaf values (at the terminal nodes). At the macro-level, BARTs can be viewed as traditional Bayesian objects, where the parameters \\(\\boldsymbol{\\theta}\\) are all of the unknowns coded through \\(q_m\\), \\(\\textbf{w}_m\\) and \\(\\sigma^2\\) and where the focus is set on determining the posterior \\[\\begin{equation} \\tag{9.13} \\left(q_m,\\textbf{w}_m,\\sigma^2\\right) | (\\textbf{X}, \\textbf{Y}). \\end{equation}\\] Given particular forms of priors for \\(\\left(q_m,\\textbf{w}_m,\\sigma^2\\right)\\), the algorithm draws the parameters using a combination of Metropolis-Hastings and Gibbs samplers. 9.5.2 Priors The definition of priors in tree models is delicate and intricate. The first important assumption is independence: independence between \\(\\sigma^2\\) and all other parameters and independence between trees, that is, between couples \\((q_m,\\textbf{w}_m)\\) and \\((q_n,\\textbf{w}_n)\\) for \\(m\\neq n\\). This assumption makes BARTs closer to random forests in spirit and further from boosted trees. This independence entails \\[P(\\left(q_1,\\textbf{w}_1\\right),\\dots,\\left(q_M,\\textbf{w}_M\\right),\\sigma^2)=P(\\sigma^2)\\prod_{m=1}^MP\\left(q_m,\\textbf{w}_m\\right).\\] Moreover, it is customary (for simplicity) to separate the structure of the tree (\\(q_m\\)) and the terminal weights (\\(\\textbf{w}_m\\)), so that by a Bayesian conditioning \\[\\begin{equation} \\tag{9.14} P(\\left(q_1,\\textbf{w}_1\\right),\\dots,\\left(q_M,\\textbf{w}_M\\right),\\sigma^2)=\\underbrace{P(\\sigma^2)}_{\\text{noise term}}\\prod_{m=1}^M\\underbrace{P\\left(\\textbf{w}_m|q_m\\right)}_{\\text{tree weights}}\\underbrace{P(q_m)}_{\\text{tree struct.}} \\end{equation}\\] It remains to formulate the assumptions for each of the three parts. We start with the trees’ structures, \\(q_m\\). Trees are defined by their splits (at nodes) and these splits are characterized by the splitting variable and the splitting level. First, the size of trees is parametrized such that a node at depth \\(d\\) is nonterminal with probability given by \\[\\begin{equation} \\tag{9.15} \\alpha(1+d)^{-\\beta}, \\quad \\alpha \\in (0,1), \\quad \\beta >0. \\end{equation}\\] The authors recommend to set \\(\\alpha = 0.95\\) and \\(\\beta=2\\). This gives a probability of 5% to have 1 node, 55% to have 2 nodes, 28% to have 3 nodes, 9% to have 4 nodes and 3% to have 5 nodes. Thus, the aim is to force relatively shallow structures. Second, the choice of splitting variables is driven by a generalized Bernoulli (categorical) distribution which defines the odds of picking one particular feature. 
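Before detailing the splitting-variable prior further, note that the node-count probabilities quoted above can be verified numerically. The snippet below is a rough simulation sketch (not taken from the book) that draws tree sizes under the depth prior (9.15) with the recommended values \\(\\alpha = 0.95\\) and \\(\\beta=2\\); the last figure (roughly 3%) corresponds to trees with 5 or more leaves.
alpha <- 0.95                                      # Recommended value for alpha
beta <- 2                                          # Recommended value for beta
sim_leaves <- function(d = 0){                     # d is the depth of the current node
  if(runif(1) < alpha * (1 + d)^(-beta)){          # Node is nonterminal: it splits...
    return(sim_leaves(d + 1) + sim_leaves(d + 1))  # ...and spawns two children at depth d+1
  } else {                                         # Node is terminal:
    return(1)                                      # ...it contributes one leaf
  }
}
nb_leaves <- replicate(10^4, sim_leaves())         # Simulate many tree structures
round(table(pmin(nb_leaves, 5)) / 10^4, 2)         # Approx. 5%, 55%, 28%, 9% and 3%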
In the original paper by Chipman, George, and McCulloch (2010), the vector of probabilities was uniform (each predictor has the same odds of being chosen for the split). This vector can also be random and sampled from a more flexible Dirichlet distribution. The level of the split is drawn uniformly on the set of possible values for the chosen predictor. Having determined the prior of the tree structure \\(q_m\\), it remains to fix the terminal values at the leaves (\\(\\textbf{w}_m|q_m\\)). The weights at all leaves are assumed to follow a Gaussian distribution \\(\\mathcal{N}(\\mu_\\mu,\\sigma_\\mu^2)\\), where \\(\\mu_\\mu=(y_\\text{min}+y_\\text{max})/2\\) is the center of the range of the label values. The variance \\(\\sigma_\\mu^2\\) is chosen such that \\(\\mu_\\mu\\) plus or minus two times \\(\\sigma_\\mu\\) covers 95% of the range observed in the training dataset. Those are default values and can be altered by the user. Lastly, for computational purposes similar to those of linear regressions, the parameter \\(\\sigma^2\\) (the variance of \\(\\epsilon\\) in (9.12)) is assumed to follow an inverse Gamma law \\(\\text{IG}(\\nu/2,\\lambda \\nu/2)\\) akin to that used in Bayesian regressions. The parameters are by default computed from the data so that the distribution of \\(\\sigma^2\\) is realistic and prevents overfitting. We refer to the original article, section 2.2.4, for more details on this topic. In sum, in addition to \\(M\\) (number of trees), the prior depends on a small number of parameters: \\(\\alpha\\) and \\(\\beta\\) (for the tree structure), \\(\\mu_\\mu\\) and \\(\\sigma_\\mu^2\\) (for the tree weights) and \\(\\nu\\) and \\(\\lambda\\) (for the noise term). 9.5.3 Sampling and predictions The posterior distribution in (9.13) cannot be obtained analytically but simulations are an efficient shortcut to the model (9.12). Just as in Gibbs and Metropolis-Hastings sampling, the distribution of simulations is expected to converge to the sought posterior. After some burn-in sample, a prediction for a newly observed set \\(\\textbf{x}_*\\) will simply be the average (or median) of the predictions from the simulations. If we assume \\(S\\) simulations after burn-in, then the average is equal to \\[\\tilde{y}(\\textbf{x}_*):=\\frac{1}{S}\\sum_{s=1}^S\\sum_{m=1}^M\\mathcal{T}_m\\left(q_m^{(s)},\\textbf{w}_m^{(s)}, \\textbf{x}_*\\right).\\] The complex part is naturally to generate the simulations. Each tree is sampled using the Metropolis-Hastings method: a tree is proposed, but it replaces the existing one only under some (possibly random) criterion. This procedure is then repeated in a Gibbs-like fashion. Let us start with the MH building block. We seek to simulate the conditional distribution \\[(q_m,\\textbf{w}_m) \\ | \\ (q_{-m},\\textbf{w}_{-m},\\sigma^2, \\textbf{y}, \\textbf{x}),\\] where \\(q_{-m}\\) and \\(\\textbf{w}_{-m}\\) collect the structures and weights of all trees except for tree number \\(m\\). One tour de force in BART is to simplify the above Gibbs draws to \\[(q_m,\\textbf{w}_m) \\ | \\ (\\textbf{R}_{m},\\sigma^2 ),\\] where \\(\\textbf{R}_{m}=\\textbf{y}-\\sum_{l \\neq m}\\mathcal{T}_l(q_l,\\textbf{w}_l, \\textbf{x})\\) is the partial residual on a prediction that excludes the \\(m^{th}\\) tree.
The new MH proposition for \\(q_m\\) is based on the previous tree and there are three possible (and random) alterations to the tree: - growing a terminal node (increase the complexity of the tree by adding a supplementary leaf); - pruning a pair of terminal nodes (the opposite operation: reducing complexity); - changing splitting rules. For simplicity, the third option is often excluded. Once the tree structure is defined (i.e., sampled), the terminal weights are independently drawn according to a Gaussian distribution \\(\\mathcal{N}(\\mu_\\mu, \\sigma_\\mu^2)\\). After the tree is sampled, the MH principle requires that it be accepted or rejected based on some probability. This probability increases with the odds that the new tree increases the likelihood of the model. Its detailed computation is cumbersome and we refer to section 2.2 in Sparapani, Spanbauer, and McCulloch (2019) for details on the matter. Now, we must outline the overarching Gibbs procedure. First, the algorithm starts with trees that are simple nodes. Then, a specified number of loops include the following sequential steps: Step Task 1 sample \\((q_1,\\textbf{w}_1) \\ | \\ (\\textbf{R}_{1},\\sigma^2 )\\); 2 sample \\((q_2,\\textbf{w}_2) \\ | \\ (\\textbf{R}_{2},\\sigma^2 )\\); … …; m sample \\((q_m,\\textbf{w}_m) \\ | \\ (\\textbf{R}_{m},\\sigma^2 )\\); … …; M sample \\((q_M,\\textbf{w}_M) \\ | \\ (\\textbf{R}_{M},\\sigma^2 )\\); (last tree ) M+1 sample \\(\\sigma^2\\) given the full residual \\(\\textbf{R}=\\textbf{y}-\\sum_{l=1}^M\\mathcal{T}_l(q_l,\\textbf{w}_l, \\textbf{x})\\) At each step \\(m\\), the residual \\(\\textbf{R}_{m}\\) is updated with the values from step \\(m-1\\). We illustrate this process in Figure 9.3 in which \\(M=3\\). At step 1, a partition is proposed for the first tree, which is a simple node. In this particular case, the tree is accepted. In this scheme, the terminal weights are omitted for simplicity. At step 2, another partition is proposed for the tree, but it is rejected. In the third step, the proposition for the third is accepted. After the third step, a new value for \\(\\sigma^2\\) is drawn and a new round of Gibbs sampling can commence. FIGURE 9.3: Diagram of the MH/Gibbs sampling of BARTs. At step 2, the proposed tree is not validated. 9.5.4 Code There are several R packages that implement BART methods: BART, bartMachine and an older one (the original), BayesTree. The first one is highly efficient, hence we work with it. We resort to only a few parameters, like the power and base, which are the \\(\\beta\\) and \\(\\alpha\\) defined in (9.15). The program is a bit verbose and delivers a few parametric details. library(BART) # Load package fit_bart <- gbart( # Main function x.train = dplyr::select(training_sample, features_short) %>% # Training features data.frame(), y.train = dplyr::select(training_sample, R1M_Usd) %>% # Training label as.matrix() , x.test = dplyr::select(testing_sample, features_short) %>% # Testing features data.frame(), type = "wbart", # Option: label is continuous ntree = 20, # Number of trees in the model nskip = 100, # Size of burn-in sample ndpost = 200, # Number of posteriors drawn power = 2, # beta in the tree structure prior base = 0.95) # alpha in the tree structure prior ## *****Calling gbart: type=1 ## *****Data: ## data:n,p,np: 198128, 7, 70208 ## y1,yn: -0.049921, 0.024079 ## x1,x[n*p]: 0.010000, 0.810000 ## xp1,xp[np*p]: 0.270000, 0.880000 ## *****Number of Trees: 20 ## *****Number of Cut Points: 100 ... 
100 ## *****burn,nd,thin: 100,200,1 ## *****Prior:beta,alpha,tau,nu,lambda,offset: 2,0.95,1.57391,3,2.84908e-31,0.0139209 ## *****sigma: 0.000000 ## *****w (weights): 1.000000 ... 1.000000 ## *****Dirichlet:sparse,theta,omega,a,b,rho,augment: 0,0,1,0.5,1,7,0 ## *****printevery: 100 ## ## MCMC ## done 0 (out of 300) ## done 100 (out of 300) ## done 200 (out of 300) ## time: 30s ## trcnt,tecnt: 200,200 Once the model is trained,20 we evaluated its performance. We simply compute the hit ratio. The predictions are embedded within the fit variable, under the name ‘yhat.test’. mean(fit_bart$yhat.test * testing_sample$R1M_Usd > 0) ## [1] 0.5433102 The performance seems reasonable but is by no means impressive. The data from all sampled trees is available in the fit_bart variable. It has nonetheless a complex structure (as is often the case with trees). The simplest information we can extract is the value of \\(\\sigma\\) across all 300 simulations (see Figure 9.4). data.frame(simulation = 1:300, sigma = fit_bart$sigma) %>% ggplot(aes(x = simulation, y = sigma)) + geom_point(size = 0.7) FIGURE 9.4: Evolution of sigma across BART simulations. And we see that, as the number of samples increases, \\(\\sigma\\) decreases. References "],["valtune.html", "Chapter 10 Validating and tuning 10.1 Learning metrics 10.2 Validation 10.3 The search for good hyperparameters 10.4 Short discussion on validation in backtests", " Chapter 10 Validating and tuning As is shown in Chapters 5 to 11, ML models require user-specified choices before they can be trained. These choices encompass parameter values (learning rate, penalization intensity, etc.) or architectural choices (e.g., the structure of a network). Alternative designs in ML engines can lead to different predictions, hence selecting a good one can be critical. We refer to the work of Probst, Bischl, and Boulesteix (2018) for a study on the impact of hyperparameter tuning on model performance. For some models (neural networks and boosted trees), the number of degrees of freedom is so large that finding the right parameters can become complicated and challenging. This chapter addresses these issues but the reader must be aware that there is no shortcut to building good models. Crafting an effective model is time-consuming and often the result of many iterations. 10.1 Learning metrics The parameter values that are set before training are called hyperparameters. In order to be able to choose good hyperparameters, it is imperative to define metrics that evaluate the performance of ML models. As is often the case in ML, there is a dichotomy between models that seek to predict numbers (regressions) and those that try to forecast categories (classifications). Before we outline common evaluation benchmarks, we mention the econometric approach of J. Li, Liao, and Quaedvlieg (2020). The authors propose to assess the performance of a forecasting method compared to a given benchmark, conditional on some external variable. This helps monitor under which (economic) conditions the model beats the benchmark. The full implementation of the test is intricate, and we recommend the interested reader have a look at the derivations in the paper. 10.1.1 Regression analysis Errors in regression analyses are usually evaluated in a straightforward way. The \\(L^1\\) and \\(L^2\\) norms are mainstream; they are both easy to interpret and to compute. The second one, the root mean squared error (RMSE) is differentiable everywhere but harder to grasp and gives more weight to outliers. 
The first one, the mean absolute error gives the average distance to the realized value but is not differentiable at zero. Formally, we define them as \\[\\begin{align} \\tag{10.1} \\text{MAE}(\\textbf{y},\\tilde{\\textbf{y}})&=\\frac{1}{I}\\sum_{i=1}^I|y_i-\\tilde{y}_i|, \\\\ \\tag{10.2} \\text{MSE}(\\textbf{y},\\tilde{\\textbf{y}})&=\\frac{1}{I}\\sum_{i=1}^I(y_i-\\tilde{y}_i)^2, \\end{align}\\] and the RMSE is simply the square root of the MSE. It is always possible to generalize these formulae by adding weights \\(w_i\\) to produce heterogeneity in the importance of instances. Let us briefly comment on the MSE. It is by far the most common loss function in machine learning, but it is not necessarily the exact best choice for return prediction in a portfolio allocation task. If we decompose the loss into its 3 terms, we get the sum of squared realized returns, the sum of squared predicted returns and the product between the two (roughly speaking, a covariance term if we assume zero means). The first term does not matter. The second controls the dispersion around zero of the predictions. The third term is the most interesting from the allocator’s standpoint. The negativity of the cross-product \\(-2y_i\\tilde{y}_i\\) is always to the investor’s benefit: either both terms are positive and the model has recognized a profitable asset, or they are negative and it has identified a bad opportunity. It is when \\(y_i\\) and \\(\\tilde{y}_i\\) don’t have the same sign that problems arise. Thus, compared to the \\(\\tilde{y}_i^2\\), the cross-term is more important. Nonetheless, algorithms do not optimize with respect to this indicator.21 These metrics (MSE and RMSE) are widely used outside ML to assess forecasting errors. Below, we present other indicators that are also sometimes used to quantify the quality of a model. In line with the linear regressions, the \\(R^2\\) can be computed in any predictive exercise. \\[\\begin{equation} \\tag{10.3} R^2(\\textbf{y},\\tilde{\\textbf{y}})=1- \\frac{\\sum_{i=1}^I(y_i-\\tilde{y}_i)^2}{\\sum_{i=1}^I(y_i-\\bar{y})^2}, \\end{equation}\\] where \\(\\bar{y}\\) is the sample average of the label. One important difference with the classical \\(R^2\\) is that the above quantity can be computed on the testing sample and not on the training sample. In this case, the \\(R^2\\) can be negative when the mean squared error in the numerator is larger than the (biased) variance of the testing sample. Sometimes, the average value \\(\\bar{y}\\) is omitted in the denominator (as in Gu, Kelly, and Xiu (2020b) for instance). The benefit of removing the average value is that it compares the predictions of the model to a zero prediction. This is particularly relevant with returns because the simplest prediction of all is the constant zero value and the \\(R^2\\) can then measure if the model beats this naive benchmark. A zero prediction is always preferable to a sample average because the latter can be very much period dependent. Also, removing \\(\\bar{y}\\) in the denominator makes the metric more conservative as it mechanically reduces the \\(R^2\\). Beyond the simple indicators detailed above, several exotic extensions exist and they all consist in altering the error before taking the averages. Two notable examples are the Mean Absolute Percentage Error (MAPE) and the Mean Square Percentage Error (MSPE). Instead of looking at the raw error, they compute the error relative to the original value (to be predicted). 
Hence, the error is expressed in a percentage score and the averages are simply equal to: \\[\\begin{align} \\tag{10.4} \\text{MAPE}(\\textbf{y},\\tilde{\\textbf{y}})&=\\frac{1}{I}\\sum_{i=1}^I\\left|\\frac{y_i-\\tilde{y}_i}{y_i}\\right|, \\\\ \\tag{10.5} \\text{MSPE}(\\textbf{y},\\tilde{\\textbf{y}})&=\\frac{1}{I}\\sum_{i=1}^I\\left(\\frac{y_i-\\tilde{y}_i}{y_i}\\right)^2, \\end{align}\\] where the latter can be scaled by a square root if need be. When the label is positive and can take large values, it is possible to dampen the magnitude of errors, which can otherwise be very large. One way to do this is to resort to the Root Mean Squared Logarithmic Error (RMSLE), defined below: \\[\\begin{equation} \\tag{10.6} \\text{RMSLE}(\\textbf{y},\\tilde{\\textbf{y}})=\\sqrt{\\frac{1}{I}\\sum_{i=1}^I\\log^2\\left(\\frac{1+y_i}{1+\\tilde{y}_i}\\right)}, \\end{equation}\\] where it is obvious that when \\(y_i=\\tilde{y}_i\\), the error metric is equal to zero. Before we move on to categorical losses, we briefly comment on one shortcoming of the MSE, which is by far the most widespread metric and objective in regression tasks. A simple decomposition yields: \\[\\text{MSE}(\\textbf{y},\\tilde{\\textbf{y}})=\\frac{1}{I}\\sum_{i=1}^I(y_i^2+\\tilde{y}_i^2-2y_i\\tilde{y}_i).\\] In the sum, the first term is given and there is nothing to be done about it, hence models focus on the minimization of the other two. The second term is the dispersion of model values. The third term is a cross-product. While variations in \\(\\tilde{y}_i\\) do matter, the third term is by far the most important, especially in the cross-section. It is more valuable to reduce the MSE by increasing \\(y_i\\tilde{y}_i\\). This product is indeed positive when the two terms have the same sign, which is exactly what an investor is looking for: correct directions for the bets. For some algorithms (like neural networks), it is possible to manually specify custom losses. Maximizing the sum of \\(y_i\\tilde{y}_i\\) may be a good alternative to vanilla quadratic optimization (see Section 7.4.3 for an example of implementation). 10.1.2 Classification analysis The performance metrics for categorical outcomes are substantially different compared to those of numerical outputs. A large proportion of these metrics are dedicated to binary classes, though some of them can easily be generalized to multiclass models. We present the concepts pertaining to these metrics in an increasing order of complexity and start with the two dichotomies true versus false and positive versus negative. In binary classification, it is convenient to think in terms of true versus false. In an investment setting, true can be related to a positive return, or a return being above that of a benchmark - false being the opposite. There are then 4 types of possible results for a prediction. Two when the prediction is right (predict true with true realization or predict false with false outcome) and two when the prediction is wrong (predict true with false realization and the opposite). We define the corresponding aggregate metrics below: frequency of true positive: \\(TP=I^{-1}\\sum_{i=1}^I1_{\\{y_i=\\tilde{y}_i=1 \\}},\\) frequency of true negative: \\(TN=I^{-1}\\sum_{i=1}^I1_{\\{y_i=\\tilde{y}_i=0 \\}},\\) frequency of false positive: \\(FP=I^{-1}\\sum_{i=1}^I1_{\\{\\tilde{y}_i=1,y_i=0 \\}},\\) frequency of false negative: \\(FN=I^{-1}\\sum_{i=1}^I1_{\\{\\tilde{y}_i=0,y_i=1 \\}},\\) where true is conventionally encoded into 1 and false into 0. The sum of the four figures is equal to one.
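As a quick illustration (a toy sketch with made-up 0/1 vectors, not the book's data), these four frequencies can be computed directly from the realized and predicted class vectors:
y <- c(1, 1, 0, 0, 1, 0)                 # Realized classes (toy example)
y_tilde <- c(1, 0, 0, 1, 1, 0)           # Predicted classes (toy example)
TP <- mean(y_tilde == 1 & y == 1)        # Frequency of true positives
TN <- mean(y_tilde == 0 & y == 0)        # Frequency of true negatives
FP <- mean(y_tilde == 1 & y == 0)        # Frequency of false positives
FN <- mean(y_tilde == 0 & y == 1)        # Frequency of false negatives
TP + TN + FP + FN                        # The four frequencies sum to one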
These four numbers have very different impacts on out-of-sample results, as is shown in Figure 10.1. In this table (also called a confusion matrix), it is assumed that some proxy for future profitability is forecast by the model. Each row stands for the model’s prediction and each column for the realization of the profitability. The most important cases are those in the top row, when the model predicts a positive result because it is likely that assets with positive predicted profitability (possibly relative to some benchmark) will end up in the portfolio. Of course, this is not a problem if the asset does well (left cell), but it becomes penalizing if the model is wrong because the portfolio will suffer. FIGURE 10.1: Confusion matrix: summary of binary outcomes. Of the two types of errors, the type I error (the false positive) is the most daunting for investors because it has a direct effect on the portfolio. The type II error is simply a missed opportunity and is somewhat less impactful. Finally, true negatives are those assets which are correctly excluded from the portfolio. From the four baseline rates, it is possible to derive other interesting metrics: Accuracy = \\(TP+TN\\) is the percentage of correct forecasts; Recall = \\(\\frac{TP}{TP+FN}\\) measures the ability to detect a winning strategy/asset (left column analysis). Also known as sensitivity or true positive rate (TPR); Precision = \\(\\frac{TP}{TP+FP}\\) computes the probability of good investments (top row analysis); Specificity = \\(\\frac{TN}{FP+TN}\\) measures the proportion of actual negatives that are correctly identified as such (right column analysis); Fallout = \\(\\frac{FP}{FP+TN}=1-\\)Specificity is the probability of false alarm (or false positive rate), i.e., the frequency at which the algorithm detects falsely performing assets (right column analysis); F-score, \\(\\mathbf{F}_1=2\\frac{\\text{recall}\\times \\text{precision}}{\\text{recall}+ \\text{precision}}\\) is the harmonic average of recall and precision. All of these items lie in the unit interval and a model is deemed to perform better when they increase (except for fallout for which it is the opposite). Many other indicators also exist, like the false discovery rate or false omission rate, but they are not as mainstream and are less often cited. Moreover, they are often simple functions of the ones mentioned above. A metric that is popular but more complex is the Area Under the (ROC) Curve, often referred to as AUC. The complicated part is the ROC curve where ROC stands for Receiver Operating Characteristic; the name comes from signal theory. We explain how it is built below. As seen in Chapters 6 and 7, classifiers generate outputs that are probabilities that one instance belongs to one class. These probabilities are then translated into a class by choosing the class that has the highest value. In binary classification, the class with a score above 0.5 basically wins. In practice, this 0.5 threshold may not be optimal and the model could very well correctly predict false instances when the probability is below 0.4 and true ones otherwise. Hence, it is a natural idea to test what happens if the decision threshold changes. The ROC curve does just that and plots the recall as a function of the fallout when the threshold increases from zero to one. When the threshold is equal to 0, true positives are equal to zero because the model never forecasts positive values. Thus, both recall and fallout are equal to zero.
When the threshold is equal to one, false negatives shrink to zero and true negatives too, hence recall and fallout are equal to one. The behavior of their relationship in between these two extremes is called the ROC curve. We provide stylized examples below in Figure 10.2. A random classifier would fare equally well on recall and fallout and thus the ROC curve would be a straight line from the point (0,0) to (1,1). To prove this, imagine a sample with a \\(p\\in (0,1)\\) proportion of true instances and a classifier that predicts true randomly with a probability \\(p'\\in (0,1)\\). Then because the sample and predictions are independent, \\(TP=p'p\\), \\(FP = p'(1-p)\\), \\(TN=(1-p')(1-p)\\) and \\(FN=(1-p')p\\). Given the above definition, this yields that both recall and fallout are equal to \\(p'\\). FIGURE 10.2: Stylized ROC curves. An algorithm with a ROC curve above the 45° angle is performing better than an average classifier. Indeed, the curve can be seen as a tradeoff between benefits (probability of detecting good strategies on the \\(y\\) axis) and costs (odds of selecting the wrong assets on the \\(x\\) axis). Hence being above the 45° line is paramount. The best possible classifier has a ROC curve that goes from point (0,0) to point (0,1) to point (1,1). At point (0,1), fallout is null, hence there are no false positives, and recall is equal to one so that there are also no false negatives: the model is always right. The opposite holds at point (1,0): the model is always wrong. Below, we use a particular package (caTools) to compute a ROC curve for a given set of predictions on the testing sample. if(!require(caTools)){install.packages("caTools")} library(caTools) # Package for AUC computation colAUC(X = predict(fit_RF_C, testing_sample, type = "prob"), y = testing_sample$R1M_Usd_C, plotROC = TRUE) FIGURE 10.3: Example of ROC curve. ## FALSE TRUE ## FALSE vs. TRUE 0.5003885 0.5003885 In Figure 10.3, the curve is very close to the 45° angle and the model seems as good (or, rather, as bad) as a random classifier. Finally, having one entire curve is not practical for comparison purposes, hence the information of the whole curve is synthesized into the area below the curve, i.e., the integral of the corresponding function. The 45° angle (quadrant bisector) has an area of 0.5 (it is half the unit square which has a unit area). Thus, any good model is expected to have an area under the curve (AUC) above 0.5. A perfect model has an AUC of one. We end this subsection with a word on multiclass data. When the output (i.e., the label) has more than two categories, things become more complex. It is still possible to compute a confusion matrix, but the dimension is larger and harder to interpret. The simple indicators like \\(TP\\), \\(TN\\), etc., must be generalized in a non-standard way. The simplest metric in this case is the cross-entropy defined in Equation (7.5). We refer to Section 6.1.2 for more details on losses related to categorical labels. 10.2 Validation Validation is the stage at which a model is tested and tuned before it starts to be deployed on real or live data (e.g., for trading purposes). Needless to say, it is critical. 10.2.1 The variance-bias tradeoff: theory The variance-bias tradeoff is one of the core concepts in supervised learning.
To explain it, let us assume that the data is generated by the simple model \\[y_i=f(\\textbf{x}_i)+\\epsilon_i, \\quad \\mathbb{E}[\\boldsymbol{\\epsilon}]=0, \\quad \\mathbb{V}[\\boldsymbol{\\epsilon}]=\\sigma^2,\\] but the model that is estimated yields \\[y_i=\\hat{f}(\\textbf{x}_i)+\\hat{\\epsilon}_i. \\] Given an unknown sample \\(\\textbf{x}\\), the decomposition of the average squared error is \\[\\begin{align} \\tag{10.7} \\mathbb{E}[\\hat{\\epsilon}^2]&=\\mathbb{E}[(y-\\hat{f}(\\textbf{x}))^2]=\\mathbb{E}[(f(\\textbf{x})+\\epsilon-\\hat{f}(\\textbf{x}))^2] \\\\ &= \\underbrace{\\mathbb{E}[(f(\\textbf{x})-\\hat{f}(\\textbf{x}))^2]}_{\\text{total quadratic error}}+\\underbrace{\\mathbb{E}[\\epsilon^2]}_{\\text{irreducible error}} \\nonumber \\\\ &= \\mathbb{E}[\\hat{f}(\\textbf{x})^2]+\\mathbb{E}[f(\\textbf{x})^2]-2\\mathbb{E}[f(\\textbf{x})\\hat{f}(\\textbf{x})]+\\sigma^2\\nonumber\\\\ &=\\mathbb{E}[\\hat{f}(\\textbf{x})^2]+f(\\textbf{x})^2-2f(\\textbf{x})\\mathbb{E}[\\hat{f}(\\textbf{x})]+\\sigma^2\\nonumber\\\\ &=\\left[ \\mathbb{E}[\\hat{f}(\\textbf{x})^2]-\\mathbb{E}[\\hat{f}(\\textbf{x})]^2\\right]+\\left[\\mathbb{E}[\\hat{f}(\\textbf{x})]^2+f(\\textbf{x})^2-2f(\\textbf{x})\\mathbb{E}[\\hat{f}(\\textbf{x})]\\right]+\\sigma^2\\nonumber\\\\ &=\\underbrace{\\mathbb{V}[\\hat{f}(\\textbf{x})]}_{\\text{variance of model}}+ \\quad \\underbrace{\\mathbb{E}[(f(\\textbf{x})-\\hat{f}(\\textbf{x}))]^2}_{\\text{squared bias}}\\quad +\\quad\\sigma^2 \\nonumber \\end{align}\\] In the above derivation, \\(f(x)\\) is not random, but \\(\\hat{f}(x)\\) is. Also, in the second line, we assumed \\(\\mathbb{E}[\\epsilon(f(x)-\\hat{f}(x))]=0\\), which may not always hold (though it is a very common assumption). The average squared error thus has three components: the variance of the model (over its predictions); the squared bias of the model; and one irreducible error (independent from the choice of a particular model). The last one is immune to changes in models, so the challenge is to minimize the sum of the first two. This is known as the variance-bias tradeoff because reducing one often leads to increasing the other. The goal is thus to assess when a small increase in either one can lead to a larger decrease in the other. There are several ways to represent this tradeoff and we display two of them. The first one relates to archery (see Figure 10.4) below. The best case (top left) is when all shots are concentrated in the middle: on average, the archer aims correctly and all the arrows are very close to one another. The worst case (bottom right) is the exact opposite: the average arrow is above the center of the target (the bias is nonzero) and the dispersion of arrows is large. FIGURE 10.4: First representation of the variance-bias tradeoff. The most often encountered cases in ML are the other two configurations: either the arrows (predictions) are concentrated in a small perimeter, but the perimeter is not the center of the target; or the arrows are on average well distributed around the center, but they are, on average, far from it. The second way the variance bias tradeoff is often depicted is via the notion of model complexity. The most simple model of all is a constant one: the prediction is always the same, for instance equal to the average value of the label in the training set. Of course, this prediction will often be far from the realized values of the testing set (its bias will be large), but at least its variance is zero. 
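As a small numerical check (a sketch that assumes the book's training_sample and testing_sample are in memory, not code from the book), the bias and variance of this constant predictor can be computed directly:
const_pred <- rep(mean(training_sample$R1M_Usd),   # Constant forecast: the average label...
                  nrow(testing_sample))            # ...replicated for each test instance
mean(const_pred - testing_sample$R1M_Usd)          # Bias: can be sizeable
var(const_pred)                                    # Variance: exactly zero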
On the other side of the spectrum, a decision tree with as many leaves as there are instances has a very complex structure. It will probably have a smaller bias, but it is not obvious that this will compensate the increase in variance incurred by the intricacy of the model. This facet of the tradeoff is depicted in Figure 10.5 below. To the left of the graph, a simple model has a small variance but a large bias, while to the right it is the opposite for a complex model. Good models often lie somewhere in the middle, but the best mix is hard to find. FIGURE 10.5: Second representation of the variance-bias tradeoff. The most tractable theoretical form of the variance-bias tradeoff is the ridge regression.22 The coefficient estimates in this type of regression are given by \\(\\hat{\\mathbf{b}}_\\lambda=(\\mathbf{X}'\\mathbf{X}+\\lambda \\mathbf{I}_N)^{-1}\\mathbf{X}'\\mathbf{Y}\\) (see Section 5.1.1), where \\(\\lambda\\) is the penalization intensity. Assuming a true linear form for the data generating process (\\(\\textbf{y}=\\textbf{Xb}+\\boldsymbol{\\epsilon}\\) where \\(\\textbf{b}\\) is unknown and \\(\\sigma^2\\) is the variance of errors - which have identity correlation matrix), this yields \\[\\begin{align} \\mathbb{E}[\\hat{\\textbf{b}}_\\lambda]&=\\textbf{b}-\\lambda(\\textbf{X}'\\textbf{X}+\\lambda \\textbf{I}_N)^{-1} \\textbf{b}, \\\\ \\tag{10.8} \\mathbb{V}[\\hat{\\textbf{b}}_\\lambda]&=\\sigma^2(\\textbf{X}'\\textbf{X}+\\lambda \\textbf{I}_N)^{-1}\\textbf{X}'\\textbf{X} (\\textbf{X}'\\textbf{X}+\\lambda \\textbf{I}_N)^{-1}. \\end{align}\\] Basically, this means that the bias of the estimator is equal to \\(-\\lambda(\\textbf{X}'\\textbf{X}+\\lambda \\textbf{I}_N)^{-1} \\textbf{b}\\), which is zero in the absence of penalization (classical regression) and converges to some finite number when \\(\\lambda \\rightarrow \\infty\\), i.e., when the model becomes constant. Note that if the estimator has a zero bias, then predictions will too: \\(\\mathbb{E}[\\textbf{X}(\\textbf{b}-\\hat{\\textbf{b}})]=\\textbf{0}\\). The variance (of estimates) in the case of an unconstrained regression is equal to \\(\\mathbb{V}[\\hat{\\textbf{b}}]=\\sigma^2 (\\textbf{X}'\\textbf{X})^{-1}\\). In Equation (10.8), the \\(\\lambda\\) term reduces the magnitude of the entries of the inverse matrix. The overall effect is that as \\(\\lambda\\) increases, the variance decreases and in the limit \\(\\lambda \\rightarrow \\infty\\), the variance is zero when the model is constant. The variance of predictions is \\[\\begin{align*} \\mathbb{V}[\\textbf{X}\\hat{\\textbf{b}}]&=\\mathbb{E}[(\\textbf{X}\\hat{\\textbf{b}}-\\mathbb{E}[\\textbf{X}\\hat{\\textbf{b}}])(\\textbf{X}\\hat{\\textbf{b}}-\\mathbb{E}[\\textbf{X}\\hat{\\textbf{b}}])'] \\\\ &= \\textbf{X}\\mathbb{E}[(\\hat{\\textbf{b}}-\\mathbb{E}[\\hat{\\textbf{b}}])(\\hat{\\textbf{b}}-\\mathbb{E}[\\hat{\\textbf{b}}])']\\textbf{X}' \\\\ &= \\textbf{X}\\mathbb{V}[\\hat{\\textbf{b}}]\\textbf{X}' \\end{align*}\\] All in all, ridge regressions are very handy because with a single parameter, they are able to provide a cursor that directly tunes the variance-bias tradeoff. It is easy to illustrate this tradeoff with the ridge regression. In the example below, we recycle the ridge model trained in Chapter 5.
ridge_errors <- predict(fit_ridge, x_penalized_test) - # Errors from all models (rep(testing_sample$R1M_Usd, 100) %>% matrix(ncol = 100, byrow = FALSE)) ridge_bias <- ridge_errors %>% apply(2, mean) # Biases ridge_var <- predict(fit_ridge, x_penalized_test) %>% apply(2, var) # Variance tibble(lambda, ridge_bias^2, ridge_var, total = ridge_bias^2+ridge_var) %>% # Plot gather(key = Error_Component, value = Value, -lambda) %>% ggplot(aes(x = lambda, y = Value, color = Error_Component)) + geom_line() FIGURE 10.6: Error decomposition for a ridge regression. In Figure 10.6, the pattern is different from the one depicted in Figure 10.5. In the graph, when the intensity lambda increases, the magnitude of parameters shrinks and the model becomes simpler. Hence, the most simple model seems like the best choice: adding complexity increases variance but does not improve the bias! One possible reason for that is that features don’t actually carry much predictive value and hence a constant model is just as good as more sophisticated ones based on irrelevant variables. 10.2.2 The variance-bias tradeoff: illustration The variance-bias tradeoff is often presented in theoretical terms that are easy to grasp. It is nonetheless useful to demonstrate how it operates on true algorithmic choices. Below, we take the example of trees because their complexity is easy to evaluate. Basically, a tree with many terminal nodes is more complex than a tree with a handful of clusters. We start with the parsimonious model, which we train below. fit_tree_simple <- rpart(formula, data = training_sample, # Data source: training sample cp = 0.0001, # Precision: smaller = more leaves maxdepth = 2 # Maximum depth (i.e. tree levels) ) rpart.plot(fit_tree_simple) FIGURE 10.7: Simple tree. The model depicted in Figure 10.7 only has 4 clusters, which means that the predictions can only take four values. The smallest one is 0.011 and encompasses a large portion of the sample (85%) and the largest one is 0.062 and corresponds to only 4% of the training sample. We are then able to compute the bias and the variance of the predictions on the testing set. mean(predict(fit_tree_simple, testing_sample) - testing_sample$R1M_Usd) # Bias ## [1] 0.004973917 var(predict(fit_tree_simple, testing_sample)) # Variance ## [1] 0.0001398003 On average, the error is slightly positive, with an overall overestimation of 0.005. As expected, the variance is very small (10^{-4}). For the complex model, we take the boosted tree that was obtained in Section 6.4.6 (fit_xgb). The model aggregates 40 trees with a maximum depth of 4, it is thus undoubtedly more complex. mean(predict(fit_xgb, xgb_test) - testing_sample$R1M_Usd) # Bias ## [1] 0.003347665 var(predict(fit_xgb, xgb_test)) # Variance ## [1] 0.003542071 The bias is indeed smaller compared to that of the simple model, but in exchange, the variance increases substantially. The net effect (via the squared bias) is in favor of the simpler model. 10.2.3 The risk of overfitting: principle The notion of overfitting is one of the most important in machine learning. When a model overfits, the accuracy of its predictions will be disappointing, thus it is one major reason why some strategies fail out-of-sample. Therefore, it is important to understand not only what overfitting is, but also how to mitigate its effects. One recent reference on this topic and its impact on portfolio strategies is Hsu et al. (2018), which builds on the work of White (2000). 
Neither of these references deals with ML models, but the principle is the same. When given a dataset, a sufficiently intense level of analysis (by a human or a machine) will always be able to detect some patterns. Whether these patterns are spurious or not is the key question. In Figure 10.8, we illustrate this idea with a simple visual example. We try to find a model that maps x into y. The (training) data points are the small black circles. The simplest model is the constant one (only one parameter), but with two parameters (level and slope), the fit is already quite good. This is shown with the blue line. With a sufficient number of parameters, it is possible to build a model that flows through all the points. One example would be a high-dimensional polynomial. One such model is represented with the red line. Now there seems to be a strange point in the dataset and the complex model fits closely to match this point. FIGURE 10.8: Illustration of overfitting: a model closely matching training data is rarely a good idea. A new point is added in light green. It is fair to say that it follows the general pattern of the other points. The simple model is not perfect and the error is non-negligible. Nevertheless, the error stemming from the complex model (shown with the dotted gray line) is approximately twice as large. This simplified example shows that models that are too close to the training data will catch idiosyncrasies that will not occur in other datasets. A good model would overlook these idiosyncrasies and stick to the enduring structure of the data. 10.2.4 The risk of overfitting: some solutions Obviously, the easiest way to avoid overfitting is to resist the temptation of complicated models (e.g., high-dimensional neural networks or tree ensembles). The complexity of models is often proxied via two measures: the number of parameters of the model and their magnitude (often synthesized through their norm). These proxies are not perfect because some complex models may only require a small number of parameters (or even small parameter values), but at least they are straightforward and easy to handle. There is no universal way of handling overfitting. Below, we detail a few tricks for some families of ML tools. For regressions, there are two simple ways to deal with overfitting. The first is to limit the number of parameters, that is, the number of predictors. Sometimes, it can be better to only select a subsample of features, especially if some of them are highly correlated (often, a threshold of 70% is considered as too high for absolute correlations between features). The second solution is penalization (via LASSO, ridge or elasticnet), which helps reduce the magnitude of estimates and thus of the variance of predictions. For tree-based methods, there are a variety of ways to reduce the risk of overfitting. When dealing with simple trees, the only way to proceed is to limit the number of leaves. This can be done in many ways. First, by imposing a maximum depth. If it is equal to \\(d\\), then the tree can have at most \\(2^d\\) terminal nodes. It is often advised not to go beyond \\(d=6\\). The complexity parameter in rpart (cp) is another way to shrink the size of trees: any new split must lead to a reduction in loss at least equal to cp. If not, the split is not deemed useful and is thus not performed. Thus, when cp is large, the tree is not grown.
The last two parameters available in rpart are the minimum number of instances required in each leaf and the minimum number of instances per cluster requested in order to continue the splitting process. The higher (i.e., the more coercive) these figures are, the harder it is to grow complex trees. In addition to these options, random forests allow the user to control the number of trees in the forest. Theoretically (see Breiman (2001)), this parameter is not supposed to impact the risk of overfitting because new trees only help reduce the total error via diversification. In practice, and for the sake of computation times, it is not recommended to go beyond 1,000 trees. Two other hyperparameters are the subsample size (on which each learner is trained) and the number of features retained for learning. They do not have a straightforward impact on the variance-bias tradeoff, but rather on raw performance. For instance, if subsamples are too small, the trees will not learn enough. The same problem arises if the number of features is too low. On the other hand, choosing a large number of predictors (i.e., close to the total number) may lead to high correlations between the learners’ predictions because the overlap in information contained in the training samples may be high. Boosted trees have other options that can help alleviate the risk of overfitting. The most obvious one is the learning rate, which discounts the impact of each new tree by \\(\\eta \\in (0,1)\\). When the learning rate is high, the algorithm learns too quickly and is prone to sticking close to the training data. When it’s low, the model learns very progressively, which can be efficient if there are sufficiently many trees in the ensemble. Indeed, the learning rate and the number of trees must be chosen synchronously: if both are low, the ensemble will learn nothing and if both are large, it will overfit. The arsenal of boosted tree parameters does not stop there. The penalizations, both of score values and of the number of leaves, are naturally a tool to prevent the model from going too deep in the particularities of the training sample. Finally, constraints of monotonicity like those mentioned in Section 6.4.5 are also an efficient way to impose some structure on the model and force it to detect particular patterns. Lastly, neural networks also have many options aimed at protecting them against overfitting. Just like for boosted trees, some of them are the learning rate and the penalization of weights and biases (via their norm). Constraints, like nonnegativity constraints, can also help when the model theoretically requires positive inputs. Finally, dropout is a direct way to reduce the dimension (number of parameters) of a network. 10.3 The search for good hyperparameters 10.3.1 Methods Let us assume that there are \\(p\\) parameters to be defined before a model is run. The simplest way to proceed is to test different values of these parameters and choose the one that yields the best results. There are mainly two ways to perform these tests: independently and sequentially. Independent tests are easy and come in two families: grid (deterministic) search and random exploration. The advantage of a deterministic approach is that it covers the space uniformly and makes sure that no corners are omitted. The drawback is the computation time. Indeed, for each parameter, it seems reasonable to test at least five values, which makes \\(5^p\\) combinations. If \\(p\\) is small (smaller than 3), this is manageable when the backtests are not too lengthy.
When \\(p\\) is large, the number of combinations may become prohibitive. This is when random exploration can be useful because in this case, the user specifies the number of tests upfront and the parameters are drawn randomly (usually uniformly over a given range for each parameter). The flaw in random search is that some areas in the parameter space may not be covered, which can be problematic if the best choice is located there. It is nonetheless shown in Bergstra and Bengio (2012) that random exploration is preferable to grid search. Both grid and random searches are suboptimal because they are likely to spend time in zones of the parameter space that are irrelevant, thereby wasting computation time. Given a number of parameter points that have been tested, it is preferable to focus the search in areas where the best points are most likely to lie. This is possible via an iterative process that adapts the search after each new point has been tested. In the large field of finance, a few papers dedicated to tuning are Lee (2020) and Nystrup, Lindstrom, and Madsen (2020). One other popular approach in this direction is Bayesian optimization (BO). The central object is the objective function of the learning process. We call this function \\(O\\) and it can be broadly seen as a loss function possibly combined with penalization and constraints. For simplicity here, we will not mention the training/testing samples and they are considered to be fixed. The variable of interest is the vector \\(\\textbf{p}=(p_1,\\dots,p_l)\\) which synthesizes the hyperparameters (learning rate, penalization intensities, number of models, etc.) that have an impact on \\(O\\). The program we are interested in is \\[\\begin{equation} \\tag{10.9} \\textbf{p}_*=\\underset{\\textbf{p}}{\\text{argmin}} \\ O(\\textbf{p}). \\end{equation}\\] The main problem with this optimization is that the computation of \\(O(\\textbf{p})\\) is very costly. Therefore, it is critical to choose each trial for \\(\\textbf{p}\\) wisely. One key assumption of BO is that the distribution of \\(O\\) is Gaussian and that \\(O\\) can be proxied by a linear combination of the \\(p_l\\). Said differently, the aim is to build a Bayesian linear regression between the input \\(\\textbf{p}\\) and the output (dependent variable) \\(O\\). Once a model has been estimated, the information that is concentrated in the posterior density of \\(O\\) is used to make an educated guess at where to look for new values of \\(\\textbf{p}\\). This educated guess is made based on a so-called acquisition function. Suppose we have tested \\(m\\) values for \\(\\textbf{p}\\), which we write \\(\\textbf{p}^{(m)}\\). The current best parameter is written \\(\\textbf{p}_m^*=\\underset{1\\le k\\le m}{\\text{argmin}} \\ O(\\textbf{p}^{(k)})\\). If we test a new point \\(\\textbf{p}\\), then it will lead to an improvement only if \\(O(\\textbf{p})<O(\\textbf{p}_m^*)\\), that is, if the new objective improves the minimum value that we already know. The average value of this improvement is \\[\\begin{equation} \\tag{10.10} \\textbf{EI}_m(\\textbf{p})=\\mathbb{E}_m[[O(\\textbf{p}_m^*)-O(\\textbf{p})]_+], \\end{equation}\\] where the positive part \\([\\cdot]_+\\) emphasizes that when \\(O(\\textbf{p})\\ge O(\\textbf{p}_m^*)\\), the gain is zero. The expectation is indexed by \\(m\\) because it is computed with respect to the posterior distribution of \\(O(\\textbf{p})\\) based on the \\(m\\) samples \\(\\textbf{p}^{(m)}\\).
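For completeness (this closed form is not derived in the book), when the posterior of \\(O(\\textbf{p})\\) is Gaussian with mean \\(\\mu_m(\\textbf{p})\\) and standard deviation \\(s_m(\\textbf{p})\\), the expected improvement in Equation (10.10) has an explicit expression: \\[\\textbf{EI}_m(\\textbf{p})=\\left(O(\\textbf{p}_m^*)-\\mu_m(\\textbf{p})\\right)\\Phi(z_m)+s_m(\\textbf{p})\\phi(z_m), \\quad z_m=\\frac{O(\\textbf{p}_m^*)-\\mu_m(\\textbf{p})}{s_m(\\textbf{p})},\\] where \\(\\Phi\\) and \\(\\phi\\) are the cdf and pdf of the standard normal distribution.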
The best choice for the next sample \\(\\textbf{p}^{m+1}\\) is then \\[\\begin{equation} \\tag{10.11} \\textbf{p}^{m+1}=\\underset{\\textbf{p}}{\\text{argmax}} \\ \\textbf{EI}_m(\\textbf{p}), \\end{equation}\\] which corresponds to the maximum location of the expected improvement. Instead of the EI, the optimization can be performed on other measures, like the probability of improvement, which is \\(\\mathbb{P}_m[O(\\textbf{p})<O(\\textbf{p}_m^*)]\\). In compact form, the iterative process can be outlined as follows: step 1: compute \\(O(\\textbf{p}^{(m)})\\) for \\(m=1,\\dots,M_0\\) values of parameters. step 2a: compute sequentially the posterior density of \\(O\\) on all available points. step 2b: compute the optimal new point to test \\(\\textbf{p}^{(m+1)}\\) given in Equation (10.11). step 2c: compute the new objective value \\(O(\\textbf{p}^{(m+1)})\\). step 3: repeat steps 2a to 2c as much as deemed reasonable and return the \\(\\textbf{p}^{(m)}\\) that yields the smallest objective value. The interested reader can have a look at Snoek, Larochelle, and Adams (2012) and Frazier (2018) for more details on the numerical facets of this method. Finally, for the sake of completeness, we mention a last way to tune hyperparameters. Since the optimization scheme is \\(\\underset{\\textbf{p}}{\\text{argmin}} \\ O(\\textbf{p})\\), a natural way to proceed would be to use the sensitivity of \\(O\\) with respect to \\(\\textbf{p}\\). Indeed, if the gradient \\(\\frac{\\partial O}{\\partial p_l}\\) is known, then a gradient descent will always improve the objective value. The problem is that it is hard to compute a reliable gradient (finite differences can become costly). Nonetheless, some methods (e.g., Maclaurin, Duvenaud, and Adams (2015)) have been applied successfully to optimize over large dimensional parameter spaces. We conclude by mentioning the survey Bouthillier and Varoquaux (2020), which spans 2 major AI conferences that took place in 2019. It shows that most papers resort to hyperparameter tuning. The two most often cited methods are manual tuning (hand-picking) and grid search. 10.3.2 Example: grid search In order to illustrate the process of grid search, we will try to find the best parameters for a boosted tree. We seek to quantify the impact of three parameters: eta, the learning rate, nrounds, the number of trees that are grown, lambda, the weight regularizer which penalizes the objective function through the total sum of squared weights/scores. Below, we create a grid with the values we want to test for these parameters. eta <- c(0.1, 0.3, 0.5, 0.7, 0.9) # Values for eta nrounds <- c(10, 50, 100) # Values for nrounds lambda <- c(0.01, 0.1, 1, 10, 100) # Values for lambda pars <- expand.grid(eta, nrounds, lambda) # Exploring all combinations! head(pars) # Let's see the parameters ## Var1 Var2 Var3 ## 1 0.1 10 0.01 ## 2 0.3 10 0.01 ## 3 0.5 10 0.01 ## 4 0.7 10 0.01 ## 5 0.9 10 0.01 ## 6 0.1 50 0.01 eta <- pars[,1] nrounds <- pars[,2] lambda <- pars[,3] Given the computational cost of grid search, we perform the exploration on the dataset with the small number of features (which we recycle from Chapter 6). In order to avoid the burden of loops, we resort to the functional programming capabilities of R, via the purrr package. This allows us to define a function that will lighten and simplify the code. This function, coded below, takes data and parameter inputs and returns an error metric for the algorithm. 
We choose the mean squared error to evaluate the impact of hyperparameter values. grid_par <- function(train_matrix, test_features, test_label, eta, nrounds, lambda){ fit <- train_matrix %>% xgb.train(data = ., # Data source (pipe input) eta = eta, # Learning rate objective = "reg:linear", # Objective function max_depth = 5, # Maximum depth of trees lambda = lambda, # Penalisation of leaf values gamma = 0.1, # Penalisation of number of leaves nrounds = nrounds, # Number of trees used verbose = 0 # No comment from algo ) pred <- predict(fit, test_features) # Predictions based on model & test values return(mean((pred-test_label)^2)) # Mean squared error } The grid_par function can then be processed by the functional programming tool pmap that is going to perform the loop on parameter values automatically. # grid_par(train_matrix_xgb, xgb_test, testing_sample$R1M_Usd, 0.1, 3, 0.1) # Possible test grd <- pmap(list(eta, nrounds, lambda), # Parameters for the grid search grid_par, # Function on which to apply the search train_matrix = train_matrix_xgb, # Input for function: training data test_features = xgb_test, # Input for function: test features test_label = testing_sample$R1M_Usd # Input for function: test labels (returns) ) ## [17:00:18] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror. ## (the same warning is issued for every parameter combination; the remaining occurrences are omitted)
grd <- data.frame(eta, nrounds, lambda, error = unlist(grd)) # Dataframe with all results Once the mean squared errors have been gathered, it is possible to plot them. We chose to work with 3 parameters on purpose because their influence can be simultaneously plotted on one graph. grd$eta <- as.factor(eta) # Params as categories (for plot) grd %>% ggplot(aes(x = eta, y = error, fill = eta)) + # Plot! geom_bar(stat = "identity") + facet_grid(rows = vars(nrounds), cols = vars(lambda)) + theme(axis.text.x = element_text(size = 6)) FIGURE 10.9: Plot of error metrics (MSEs) for many parameter values. Each row of graphs corresponds to nrounds and each column to lambda. In Figure 10.9, the main information is that a small learning rate (\\(\\eta=0.1\\)) is detrimental to the quality of the forecasts when the number of trees is small (nrounds=10), which means that the algorithm does not learn enough. Grid search can be performed in two stages: the first stage helps locate the zones that are of interest (with the lowest loss/objective values), and the second stage zooms in on these zones with refined values of the parameters on the grid. With the results above, this would mean considering many learners (more than 50, possibly more than 100), and avoiding large learning rates such as \\(\\eta=0.9\\) or \\(\\eta=0.8\\). 10.3.3 Example: Bayesian optimization There are several packages in R that relate to Bayesian optimization. We work with rBayesianOptimization, which is general purpose but requires more coding involvement. Just as for the grid search, we need to code the objective function on which the hyperparameters will be optimized. Under rBayesianOptimization, the output has to have a particular form, with a score and a prediction variable. The function will maximize the score, hence we define it as minus the mean squared error. bayes_par_opt <- function(train_matrix = train_matrix_xgb, # Input for func: train data test_features = xgb_test, # Input for func: test feats test_label = testing_sample$R1M_Usd, # Input for func: test label eta, nrounds, lambda){ # Input for func params fit <- train_matrix %>% xgb.train(data = ., # Data source (pipe input) eta = eta, # Learning rate objective = "reg:linear", # Objective function max_depth = 5, # Maximum depth of trees lambda = lambda, # Penalisation of leaf values gamma = 0.1, # Penalisation of number of leaves nrounds = round(nrounds), # Number of trees used verbose = 0 # No comment from algo ) pred <- predict(fit, test_features) # Forecast based on fitted model & test values list(Score = -mean((pred-test_label)^2), # Minus MSE Pred = pred) # Predictions on test set } Once the objective function is defined, it can be plugged into the Bayesian optimizer.
library(rBayesianOptimization) bayes_opt <- BayesianOptimization(bayes_par_opt, # Function to maximize bounds = list(eta = c(0.2, 0.8), # Bounds for eta lambda = c(0.5, 15), # Bounds for lambda nrounds = c(10, 100)), # Bounds for nrounds init_points = 10, # Nb initial points for first estimation n_iter = 24, # Nb optimization steps/trials acq = "ei", # Acquisition function = expected improvement verbose = FALSE) ## ## Best Parameters Found: ## Round = 4 eta = 0.3718 lambda = 6.2302 nrounds = 12.9980 Value = -0.0375 bayes_opt$Best_Par ## eta lambda nrounds ## 0.3718497 6.2301516 12.9980352 The final parameters indicate that it is advised to resist overfitting: a small number of learners and a large penalization seem to be the best choices. To confirm these results, we plot the relationship between the loss (up to the sign) and two hyperparameters. Each point corresponds to a value tested in the optimization. The best values are clearly to the left of the left graph and to the right of the right graph, and the pattern is reliably pronounced. According to these graphs, it seems indeed wiser to pick a smaller number of trees and a larger penalization factor (to maximize minus the loss). library("ggpubr") # Package for combining plots plot_rounds <- bayes_opt$History %>% ggplot(aes(x = nrounds, y = Value)) + geom_point() + geom_smooth(method = "lm") plot_lambda <- bayes_opt$History %>% ggplot(aes(x = lambda, y = Value)) + geom_point() + geom_smooth(method = "lm") par(mar = c(1,1,1,1)) ggarrange(plot_rounds, plot_lambda, ncol = 2) FIGURE 10.10: Relationship between (minus) the loss and hyperparameter values. 10.4 Short discussion on validation in backtests The topic of validation in backtests is more complex than it seems. There are in fact two scales at which it can operate, depending on whether the forecasting model is dynamic (updated at each rebalancing) or fixed. Let us start with the latter, i.e., a fixed model. In this case, the aim is to build a unique model and to test it on different time periods. There is an ongoing debate on the methods that are suitable to validate a model in that case. Usually, it makes sense to test the model on successive dates that are posterior to the training period, as this replicates what would happen in a live situation. In machine learning, a popular approach is to split the data into \\(K\\) partitions and to test \\(K\\) different models: each one is tested on one of the partitions but trained on the \\(K-1\\) others. This so-called cross-validation (CV) is proscribed by most experts (and common sense) for a simple reason: most of the time, the training set encompasses data from future dates and tests on past values. Nonetheless, some advocate one particular form of CV that aims at making sure that there is no informational overlap between the training and testing set (Sections 7.4 and 12.4 in De Prado (2018)). The premise is that if the structure of the cross-section of returns is constant through time, then training on future points and testing on past data is not problematic as long as there is no overlap. The paper Schnaubelt (2019) provides a comprehensive and exhaustive tour of many validation schemes. One example cited in De Prado (2018) is the reaction of a model to an unseen crisis. Following the market crash of 2008, at least 11 years went by without any major financial shock.
One option to test the reaction of a recent model to a crash would be to train it on recent years (say 2015-2019) and test it on various points (e.g., months) in 2008 to see how it performs. The advantage of a fixed model is that validation is easy: for one set of hyperparameters, test the model on a set of dates, and evaluate the performance of the model. Repeat the process for other parameters and choose the best alternative (or use Bayesian optimization). The second major option is when the model is updated (retrained) at each rebalancing. The underlying idea here is that the structure of returns evolves through time and a dynamic model will capture the most recent trends. The drawback is that validation must (should?) be rerun at each rebalancing date. Let us recall the dimensions of backtests: - number of strategies: possibly dozens or hundreds, or even more; - number of trading dates: hundreds for monthly rebalancing; - number of assets: hundreds or thousands; - number of features: dozens or hundreds. Even with a lot of computational power (GPUs, etc.), training many models over many dates is time-consuming, especially when it comes to hyperparameter tuning when the parameter space is large. Thus, validating models at each trading date of the out-of-sample period is not realistic. One solution is to keep an early portion of the training data and to perform a smaller scale validation on this subsample. Hyperparameters are tested on a limited number of dates and most of the time, they exhibit stability: satisfactory parameters for one date are usually acceptable for the next one and the following one as well. Thus, the full backtest can be carried out with these values when updating the models at each period. The backtest nonetheless remains compute-intensive because the model has to be retrained with the most recent data for each rebalancing date. References "],["ensemble.html", "Chapter 11 Ensemble models 11.1 Linear ensembles 11.2 Stacked ensembles 11.3 Extensions 11.4 Exercise", " Chapter 11 Ensemble models Let us be honest. When facing a prediction task, it is not obvious to determine the best choice between ML tools: penalized regressions, tree methods, neural networks, SVMs, etc. A natural and tempting alternative is to combine several algorithms (or the predictions that result from them) to try to extract value out of each engine (or learner). This intention is not new and contributions towards this goal go back at least to Bates and Granger (1969) (for the purpose of passenger flow forecasting). Below, we outline a few books on the topic of ensembles. The latter have many names and synonyms, such as forecast aggregation, model averaging, mixture of experts or prediction combination. The first four references below are monographs, while the last two are compilations of contributions: Zhou (2012): a very didactic book that covers the main ideas of ensembles; Schapire and Freund (2012): the main reference for boosting (and hence, ensembling) with many theoretical results and thus strong mathematical groundings; Seni and Elder (2010): an introduction dedicated to tree methods mainly; Claeskens and Hjort (2008): an overview of model selection techniques with a few chapters focused on model averaging; Zhang and Ma (2012): a collection of thematic chapters on ensemble learning; Okun, Valentini, and Re (2011): examples of applications of ensembles. In this chapter, we cover the basic ideas and concepts behind the notion of ensembles. 
We refer to the above books for deeper treatments on the topic. We underline that several ensemble methods have already been mentioned and covered earlier, notably in Chapter 6. Indeed, random forests and boosted trees are examples of ensembles. Other early articles on the combination of learners include Schapire (1990), Jacobs et al. (1991) (for neural networks particularly), and Freund and Schapire (1997). Ensembles can for instance be used to aggregate models that are built on different datasets (Pesaran and Pick (2011)), and can be made time-dependent (Sun et al. (2020)). For a theoretical view on ensembles with a Bayesian perspective, we refer to Razin and Levy (2020). Finally, perspectives linked to asset pricing and factor modelling are provided in Gospodinov and Maasoumi (2020) and De Nard, Hediger, and Leippold (2020) (subsampling and forecast aggregation). 11.1 Linear ensembles 11.1.1 Principles In this chapter we adopt the following notations. We work with \\(M\\) models where \\(\\tilde{y}_{i,m}\\) is the prediction of model \\(m\\) for instance \\(i\\) and errors \\(\\epsilon_{i,m}=y_i-\\tilde{y}_{i,m}\\) are stacked into an \\((I\\times M)\\) matrix \\(\\textbf{E}\\). A linear combination of models has sample errors equal to \\(\\textbf{Ew}\\), where \\(\\textbf{w}=(w_m)_{1\\le m\\le M}\\) are the weights assigned to each model and we assume \\(\\textbf{w}'\\textbf{1}_M=1\\). Minimizing the total (squared) error is thus a simple quadratic program with a unique constraint. The Lagrange function is \\(L(\\textbf{w})=\\textbf{w}'\\textbf{E}'\\textbf{E}\\textbf{w}-\\lambda (\\textbf{w}'\\textbf{1}_M-1)\\) and hence \\[\\frac{\\partial}{\\partial \\textbf{w}}L(\\textbf{w})=\\textbf{E}'\\textbf{E}\\textbf{w}-\\lambda \\textbf{1}_M=0 \\quad \\Leftrightarrow \\quad \\textbf{w}=\\lambda(\\textbf{E}'\\textbf{E})^{-1}\\textbf{1}_M,\\] and the constraint imposes \\(\\textbf{w}^*=\\frac{(\\textbf{E}'\\textbf{E})^{-1}\\textbf{1}_M}{\\textbf{1}_M'(\\textbf{E}'\\textbf{E})^{-1}\\textbf{1}_M}\\). This form is similar to that of minimum variance portfolios. If errors are unbiased (\\(\\textbf{1}_I'\\textbf{E}=\\textbf{0}_M'\\)), then \\(\\textbf{E}'\\textbf{E}\\) is (up to a scaling factor) the covariance matrix of errors. This expression shows an important feature of optimized linear ensembles: they can only add value if the models tell different stories. If two models are redundant, \\(\\textbf{E}'\\textbf{E}\\) will be close to singular and \\(\\textbf{w}^*\\) will arbitrage one against the other in a spurious fashion. This is the exact same problem as when mean-variance portfolios are constituted with highly correlated assets: in this case, diversification fails because when things go wrong, all assets go down. Another problem arises when the number of observations is too small compared to the number of assets, so that the covariance matrix of returns is singular. This is not an issue for ensembles because the number of observations will usually be much larger than the number of models (\\(I\\gg M\\)). In the limit when correlations increase to one, the above formulation becomes highly unstable and ensembles cannot be trusted.
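To get a concrete feel for this instability, the short simulation below builds two fictitious models with almost perfectly correlated errors and computes the unconstrained optimal weights. The data is purely synthetic and unrelated to the chapter's sample; it only illustrates the mechanism.
set.seed(42)                                      # Toy example with synthetic errors
e_1 <- rnorm(1000, sd = 0.05)                     # Errors of a first fictitious model
e_2 <- 0.85 * e_1 + rnorm(1000, sd = 0.005)       # Second model: smaller but highly correlated errors
E_toy <- cbind(e_1, e_2)
cor(e_1, e_2)                                     # Correlation very close to one
w_toy <- solve(t(E_toy) %*% E_toy) %*% rep(1, 2)  # Unconstrained optimal weights
w_toy / sum(w_toy)                                # One large positive and one large negative weight
The model with the smaller errors receives a weight far above one while the other is heavily shorted, which is exactly the spurious arbitrage described above.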
One heuristic way to see this is when \\(M=2\\) and \\[\\textbf{E}'\\textbf{E}=\\left[ \\begin{array}{cc} \\sigma_1^2 & \\rho\\sigma_1\\sigma_2 \\\\ \\rho\\sigma_1\\sigma_2 & \\sigma_2^2 \\\\ \\end{array} \\right] \\quad \\Leftrightarrow \\quad (\\textbf{E}'\\textbf{E})^{-1}=\\frac{1}{1-\\rho^2}\\left[ \\begin{array}{cc} \\sigma_1^{-2} & -\\rho(\\sigma_1\\sigma_2)^{-1} \\\\ -\\rho(\\sigma_1\\sigma_2)^{-1} & \\sigma_2^{-2} \\\\ \\end{array} \\right]\\] so that when \\(\\rho \\rightarrow 1\\), the model with the smallest errors (minimum \\(\\sigma_i^2\\)) will see its weight increasing towards infinity while the other model will have a similarly large negative weight: the ensemble arbitrages between two highly correlated variables. This seems like a very bad idea. There is another illustration of the issues caused by correlations. Let’s assume we face \\(M\\) correlated errors \\(\\epsilon_m\\) with pairwise correlation \\(\\rho\\), zero mean and variance \\(\\sigma^2\\). The variance of the average error is \\[\\begin{align*} \\mathbb{E}\\left[\\left(\\frac{1}{M}\\sum_{m=1}^M \\epsilon_m\\right)^2 \\right]&=\\frac{1}{M^2}\\mathbb{E}\\left[\\sum_{m=1}^M\\epsilon_m^2+\\sum_{m\\neq n}\\epsilon_n\\epsilon_m\\right] \\\\ &=\\frac{\\sigma^2}{M}+\\frac{1}{M^2}\\sum_{n\\neq m} \\rho \\sigma^2 \\\\ & =\\rho \\sigma^2 +\\frac{\\sigma^2(1-\\rho)}{M}, \\end{align*}\\] where the second term converges to zero as \\(M\\) increases, while the first term remains and increases linearly with \\(\\rho\\). In passing, because variances are always positive, this result implies that the common pairwise correlation between \\(M\\) variables is bounded below by \\(-(M-1)^{-1}\\). This result is interesting but rarely found in textbooks. One improvement proposed to circumvent the trouble caused by correlations, advocated in a seminal publication (Breiman (1996)), is to enforce positivity constraints on the weights and solve \\[\\underset{\\textbf{w}}{\\text{argmin}} \\ \\textbf{w}'\\textbf{E}'\\textbf{E}\\textbf{w} , \\quad \\text{s.t.} \\quad \\left\\{ \\begin{array}{l} \\textbf{w}'\\textbf{1}_M=1 \\\\ w_m \\ge 0 \\quad \\forall m \\end{array}\\right. .\\] Mechanically, if several models are highly correlated, the constraint will impose that only one of them will have a nonzero weight. If there are many models, then just a few of them will be selected by the minimization program. In the context of portfolio optimization, Jagannathan and Ma (2003) have shown the counter-intuitive benefits of constraints in the construction of mean-variance allocations. In our setting, the constraint will similarly help discriminate wisely among the ‘best’ models. In the literature, forecast combination and model averaging (which are synonyms of ensembles) have been tested on stock markets as early as Von Holstein (1972). Surprisingly, the articles were not published in Finance journals but rather in fields such as Management (Virtanen and Yli-Olli (1987), Wang et al. (2012)), Economics and Econometrics (Donaldson and Kamstra (1996), Clark and McCracken (2009), Mascio, Fabozzi, and Zumwalt (2020)), Operations Research (Huang, Nakamori, and Wang (2005), Leung, Daouk, and Chen (2001), and Bonaccolto and Paterlini (2019)), and Computer Science (Harrald and Kamstra (1997), Hassan, Nath, and Kirley (2007)). In the general forecasting literature, many alternative (refined) methods for combining forecasts have been studied. Trimmed opinion pools (Grushka-Cockayne, Jose, and Lichtendahl Jr (2016)) compute averages over the predictions that are not too extreme.
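As an aside, such a trimmed combination takes only one line of code. A minimal sketch, assuming a hypothetical matrix of forecasts pred_matrix with one row per instance and one column per model:
pred_matrix <- matrix(rnorm(20), nrow = 4)                # Hypothetical forecasts: 4 instances, 5 models
trimmed_pool <- apply(pred_matrix, 1, mean, trim = 0.2)   # Drop the most extreme forecast on each side
trimmed_pool                                              # One aggregated forecast per instance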
Ensembles with weights that depend on past errors are developed in Pike and Vazquez-Grande (2020). We refer to Gaba, Tsetlin, and Winkler (2017) for a more exhaustive list of combinations as well as for an empirical study of their respective efficiency. Overall, findings are mixed and the heuristic simple average is, as usual, hard to beat (see, e.g., Genre et al. (2013)). 11.1.2 Example In order to build an ensemble, we must gather the predictions and the corresponding errors into the \\(\\textbf{E}\\) matrix. We will work with 5 models that were trained in the previous chapters: penalized regression, simple tree, random forest, xgboost and feed-forward neural network. The training errors have zero means, hence \\(\\textbf{E}'\\textbf{E}\\) is the covariance matrix of errors between models. err_pen_train <- predict(fit_pen_pred, x_penalized_train) - training_sample$R1M_Usd # Reg. err_tree_train <- predict(fit_tree, training_sample) - training_sample$R1M_Usd # Tree err_RF_train <- predict(fit_RF, training_sample) - training_sample$R1M_Usd # RF err_XGB_train <- predict(fit_xgb, train_matrix_xgb) - training_sample$R1M_Usd # XGBoost err_NN_train <- predict(model, NN_train_features) - training_sample$R1M_Usd # NN E <- cbind(err_pen_train, err_tree_train, err_RF_train, err_XGB_train, err_NN_train) # E matrix colnames(E) <- c("Pen_reg", "Tree", "RF", "XGB", "NN") # Names cor(E) # Cor. mat. ## Pen_reg Tree RF XGB NN ## Pen_reg 1.0000000 0.9984394 0.9968224 0.9310186 0.9965702 ## Tree 0.9984394 1.0000000 0.9974647 0.9296081 0.9973310 ## RF 0.9968224 0.9974647 1.0000000 0.9281725 0.9972484 ## XGB 0.9310186 0.9296081 0.9281725 1.0000000 0.9279230 ## NN 0.9965702 0.9973310 0.9972484 0.9279230 1.0000000 As is shown by the correlation matrix, the models fail to generate heterogeneity in their predictions. The minimum correlation (though above 90%!) is obtained with the boosted tree model. Below, we compare the training accuracy of models by computing the average absolute value of errors. apply(abs(E), 2, mean) # Mean absolute error of columns of E ## Pen_reg Tree RF XGB NN ## 0.08345916 0.08362133 0.08327121 0.08986993 0.08372445 The best-performing ML engine is the random forest. The boosted tree model is the worst, by far. Below, we compute the optimal (non-constrained) weights for the combination of models. w_ensemble <- solve(t(E) %*% E) %*% rep(1,5) # Optimal weights w_ensemble <- w_ensemble / sum(w_ensemble) w_ensemble ## [,1] ## Pen_reg -0.5781710818 ## Tree -0.1685807693 ## RF 1.3024288196 ## XGB -0.0002405839 ## NN 0.4445636155 Because of the high correlations, the optimal weights are not balanced and diversified: they load heavily on the random forest learner (the best in-sample model) and ‘short’ a few models in order to compensate. As one could expect, the model with the largest negative weight (Pen_reg) has a very high correlation with the random forest algorithm (0.997). Note that the weights are of course computed with training errors. The optimal combination is then tested on the testing sample. Below, we compute out-of-sample (testing) errors and their average absolute value. err_pen_test <- predict(fit_pen_pred, x_penalized_test) - testing_sample$R1M_Usd # Reg.
err_tree_test <- predict(fit_tree, testing_sample) - testing_sample$R1M_Usd # Tree err_RF_test <- predict(fit_RF, testing_sample) - testing_sample$R1M_Usd # RF err_XGB_test <- predict(fit_xgb, xgb_test) - testing_sample$R1M_Usd # XGBoost err_NN_test <- predict(model, NN_test_features) - testing_sample$R1M_Usd # NN E_test <- cbind(err_pen_test, err_tree_test, err_RF_test, err_XGB_test, err_NN_test) # E matrix colnames(E_test) <- c("Pen_reg", "Tree", "RF", "XGB", "NN") apply(abs(E_test), 2, mean) # Mean absolute error of columns of E ## Pen_reg Tree RF XGB NN ## 0.06618181 0.06653527 0.06710349 0.07170802 0.06704251 The boosted tree model is still the worst performing algorithm while the simple models (regression and simple tree) are the ones that fare the best. The most naive combination is the simple average of the models’ predictions. err_EW_test <- apply(E_test, 1, mean) # Equally weighted combination mean(abs(err_EW_test)) ## [1] 0.06690517 Because the errors are highly correlated, the equally weighted combination of forecasts yields an average error which lies ‘in the middle’ of individual errors. The diversification benefits are too small. Let us now test the ‘optimal’ combination \\(\\textbf{w}^*=\\frac{(\\textbf{E}'\\textbf{E})^{-1}\\textbf{1}_M}{\\textbf{1}_M'(\\textbf{E}'\\textbf{E})^{-1}\\textbf{1}_M}\\). err_opt_test <- E_test %*% w_ensemble # Optimal unconstrained combination mean(abs(err_opt_test)) ## [1] 0.06836327 Again, the result is disappointing because of the lack of diversification across models. The correlations between errors are high not only on the training sample, but also on the testing sample, as shown below. cor(E_test) ## Pen_reg Tree RF XGB NN ## Pen_reg 1.0000000 0.9987069 0.9968882 0.9537914 0.9962205 ## Tree 0.9987069 1.0000000 0.9978366 0.9583641 0.9974515 ## RF 0.9968882 0.9978366 1.0000000 0.9606570 0.9975484 ## XGB 0.9537914 0.9583641 0.9606570 1.0000000 0.9612949 ## NN 0.9962205 0.9974515 0.9975484 0.9612949 1.0000000 The leverage from the optimal solution only exacerbates the problem and underperforms the heuristic uniform combination. We end this section with the constrained formulation of Breiman (1996) using the quadprog package. If we write \\(\\mathbf{\\Sigma}\\) for the covariance matrix of errors, we seek \\[\\mathbf{w}^*=\\underset{\\mathbf{w}}{\\text{argmin}} \\ \\mathbf{w}'\\mathbf{\\Sigma}\\mathbf{w}, \\quad \\mathbf{1}'\\mathbf{w}=1, \\quad w_i\\ge 0.\\] The constraints will be handled as (illustrated here with three models for brevity): \\[\\mathbf{A} \\mathbf{w}= \\begin{bmatrix} 1 & 1 & 1 \\\\ 1 & 0 & 0\\\\ 0 & 1 & 0 \\\\ 0 & 0 & 1 \\end{bmatrix} \\mathbf{w} \\hspace{9mm} \\text{ compared to} \\hspace{9mm} \\mathbf{b}=\\begin{bmatrix} 1 \\\\ 0 \\\\ 0 \\\\ 0 \\end{bmatrix}, \\] where the first line will be an equality (weights sum to one) and the last three will be inequalities (weights are all positive). library(quadprog) # Package for quadratic programming Sigma <- t(E) %*% E # Unscaled covariance matrix nb_mods <- nrow(Sigma) # Number of models w_const <- solve.QP(Dmat = Sigma, # D matrix = Sigma dvec = rep(0, nb_mods), # Zero vector Amat = rbind(rep(1, nb_mods), diag(nb_mods)) %>% t(), # A matrix for constraints bvec = c(1,rep(0, nb_mods)), # b vector for constraints meq = 1 # 1 line of equality constraints, others = inequalities ) w_const$solution %>% round(3) # Solution ## [1] 0.000 0.000 0.854 0.000 0.146 Compared to the unconstrained solution, the weights are sparse and concentrated in one or two models, usually those with small training sample errors.
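To complete the comparison, the constrained combination can be evaluated on the testing sample exactly like the previous ones. A short sketch (the numerical output is not reported here):
err_const_test <- E_test %*% w_const$solution   # Out-of-sample errors of the constrained combination
mean(abs(err_const_test))                       # Average absolute error, to compare with the figures above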
11.2 Stacked ensembles 11.2.1 Two-stage training Stacked ensembles are a natural generalization of linear ensembles. The idea of generalizing linear ensembles goes back at least to Wolpert (1992b). In the general case, the training is performed in two stages. The first stage is the simple one, whereby the \\(M\\) models are trained independently, yielding the predictions \\(\\tilde{y}_{i,m}\\) for instance \\(i\\) and model \\(m\\). The second step is to consider the output of the trained models as input for a new level of machine learning optimization. The second level predictions are \\(\\breve{y}_i=h(\\tilde{y}_{i,1},\\dots,\\tilde{y}_{i,M})\\), where \\(h\\) is a new learner (see Figure 11.1). Linear ensembles are of course stacked ensembles in which the second layer is a linear regression. The same techniques are then applied to minimize the error between the true values \\(y_i\\) and the predicted ones \\(\\breve{y}_i\\). FIGURE 11.1: Scheme of stacked ensembles. 11.2.2 Code and results Below, we create a low-dimensional neural network which takes in the individual predictions of each model and compiles them into a synthetic forecast. model_stack <- keras_model_sequential() model_stack %>% # This defines the structure of the network, i.e. how layers are organized layer_dense(units = 8, activation = 'relu', input_shape = nb_mods) %>% layer_dense(units = 4, activation = 'tanh') %>% layer_dense(units = 1) The configuration is very simple. We do not include any optional arguments and hence the model is likely to overfit. As we seek to predict returns, the loss function is the standard \\(L^2\\) norm. model_stack %>% compile( # Model specification loss = 'mean_squared_error', # Loss function optimizer = optimizer_rmsprop(), # Optimisation method (weight updating) metrics = c('mean_absolute_error') # Output metric ) summary(model_stack) # Model architecture ## Model: "sequential_5" ## __________________________________________________________________________________________ ## Layer (type) Output Shape Param # ## ========================================================================================== ## dense_11 (Dense) (None, 8) 48 ## __________________________________________________________________________________________ ## dense_12 (Dense) (None, 4) 36 ## __________________________________________________________________________________________ ## dense_13 (Dense) (None, 1) 5 ## ========================================================================================== ## Total params: 89 ## Trainable params: 89 ## Non-trainable params: 0 ## __________________________________________________________________________________________ y_tilde <- E + matrix(rep(training_sample$R1M_Usd, nb_mods), ncol = nb_mods) # Train preds y_test <- E_test + matrix(rep(testing_sample$R1M_Usd, nb_mods), ncol = nb_mods) # Testing fit_NN_stack <- model_stack %>% fit(y_tilde, # Train features training_sample$R1M_Usd, # Train labels epochs = 12, batch_size = 512, # Train parameters validation_data = list(y_test, # Test features testing_sample$R1M_Usd) # Test labels ) plot(fit_NN_stack) # Plot, evidently! FIGURE 11.2: Training metrics for the ensemble model. The performance of the ensemble is again disappointing: the learning curve is flat in Figure 11.2, hence the rounds of back-propagation are useless. The training adds little value which means that the new overarching layer of ML does not enhance the original predictions. 
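This visual impression can be complemented by the average absolute error of the stacked forecasts on the testing sample, to be put side by side with the individual models’ errors computed earlier. A short sketch (output omitted):
pred_stack <- predict(model_stack, y_test)         # Second-level forecasts on the test set
mean(abs(pred_stack - testing_sample$R1M_Usd))     # Mean absolute error of the stacked ensemble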
Again, this is because all ML engines seem to be capturing the same patterns and both their linear and non-linear combinations fail to improve their performance. 11.3 Extensions 11.3.1 Exogenous variables In a financial context, macro-economic indicators could add value to the process. It is possible that some models perform better under certain conditions and exogenous predictors can help introduce a flavor of economic-driven conditionality in the predictions. Adding macro-variables to the set of predictors (here, predictions) \\(\\tilde{y}_{i,m}\\) could seem like one way to achieve this. However, this would amount to mix predicted values with (possibly scaled) economic indicators and that would not make much sense. One alternative outside the perimeter of ensembles is to train simple trees on a set of macro-economic indicators. If the labels are the (possibly absolute) errors stemming from the original predictions, then the trees will create clusters of homogeneous error values. This will hint towards which conditions lead to the best and worst forecasts. We test this idea below, using aggregate data from the Federal Reserve of Saint Louis. A simple downloading function is available in the quantmod package. We download and format the data in the next chunk. CPIAUCSL is a code for consumer price index and T10Y2YM is a code for the term spread (10Y minus 2Y). library(quantmod) # Package that extracts the data library(lubridate) # Package for date management getSymbols("CPIAUCSL", src = "FRED") # FRED is the Fed of St Louis ## [1] "CPIAUCSL" getSymbols("T10Y2YM", src = "FRED") ## [1] "T10Y2YM" cpi <- fortify(CPIAUCSL) %>% mutate (inflation = CPIAUCSL / lag(CPIAUCSL) - 1) # Inflation via Consumer Price Index ts <- fortify(T10Y2YM) # Term spread (10Y minus 2Y rates) colnames(ts)[2] <- "termspread" # To make things clear ens_data <- testing_sample %>% # Creating aggregate dataset dplyr::select(date) %>% cbind(err_NN_test) %>% mutate(Index = make_date(year = lubridate::year(date), # Change date to first day of month month = lubridate::month(date), day = 1)) %>% left_join(cpi) %>% # Add CPI to the dataset left_join(ts) # Add termspread head(ens_data) # Show first lines ## date err_NN_test Index CPIAUCSL inflation termspread ## 1 2014-01-31 -0.15116310 2014-01-01 235.288 0.002424175 2.47 ## 2 2014-02-28 0.07187722 2014-02-01 235.547 0.001100779 2.38 ## 3 2014-03-31 -0.02526811 2014-03-01 236.028 0.002042055 2.32 ## 4 2014-04-30 -0.09116794 2014-04-01 236.468 0.001864186 2.29 ## 5 2014-05-31 -0.09811382 2014-05-01 236.918 0.001903006 2.17 ## 6 2014-06-30 0.03238936 2014-06-01 237.231 0.001321132 2.15 We can now build a tree that tries to explain the accuracy of models as a function of macro-variables. library(rpart.plot) # Load package for tree plotting fit_ens <- rpart(abs(err_NN_test) ~ inflation + termspread, # Tree model data = ens_data, cp = 0.001) # Complexity param (size of tree) rpart.plot(fit_ens) # Plot tree FIGURE 11.3: Conditional performance of a ML engine. The tree creates clusters which have homogeneous values of absolute errors. One big cluster gathers 92% of predictions (the left one) and is the one with the smallest average. It corresponds to the periods when the term spread is above 0.29 (in percentage points). The other two groups (when the term spread is below 0.29%) are determined according to the level of inflation. If the latter is positive, then the average absolute error is 7%, if not, it is 12%. 
This last number, the highest of the three clusters, indicates that when the term spread is low and the inflation negative, the model’s predictions are not trustworthy because their errors have a magnitude twice as large as in other periods. Under these circumstances (which seem to be linked to a dire economic environment), it may be wiser not to use ML-based forecasts. 11.3.2 Shrinking inter-model correlations As shown earlier in this chapter, one major problem with ensembles arises when the first layer of predictions is highly correlated. In this case, ensembles are pretty much useless. There are several tricks that can help reduce this correlation, but the simplest and best is probably to alter training samples. If algorithms do not see the same data, they will probably infer different patterns. There are several ways to split the training data so as to build different subsets of training samples. The first dichotomy is between random versus deterministic splits. Random splits are easy and require only the target sample size to be fixed. Note that the training samples can be overlapping as long as the overlap is not too large. Hence, if the original training sample has \\(I\\) instances and the ensemble requires \\(M\\) models, then a subsample size of \\(\\lfloor I/M \\rfloor\\) may be too conservative, especially if the training sample is not very large. In this case, \\(\\lfloor I/\\sqrt{M} \\rfloor\\) may be a better alternative. Random forests are one example of ensembles built on random training samples. One advantage of deterministic splits is that they are easy to reproduce and their outcome does not depend on the random seed. By the nature of factor-based training samples, the second splitting dichotomy is between time and assets. A split within assets is straightforward: each model is trained on a different set of stocks. Note that the choices of sets can be random, or dictated by some factor-based criterion: size, momentum, book-to-market ratio, etc. A split in dates requires other decisions: is the data split in large blocks (like years) and each model gets a block, which may stand for one particular kind of market condition? Or are the training dates divided more regularly? For instance, if there are 12 models in the ensemble, each model can be trained on data from a given month (e.g., January for the first model, February for the second, etc.). Below, we train four models on four different years to see if this helps reduce the inter-model correlations. This process is a bit lengthy because the samples and models all need to be redefined. We start by creating the four training samples. The third model works on the small subset of features, hence the sample is smaller. training_sample_2007 <- training_sample %>% filter(date > "2006-12-31", date < "2008-01-01") training_sample_2009 <- training_sample %>% filter(date > "2008-12-31", date < "2010-01-01") training_sample_2011 <- training_sample %>% dplyr::select(c("date",features_short, "R1M_Usd")) %>% filter(date > "2010-12-31", date < "2012-01-01") training_sample_2013 <- training_sample %>% filter(date > "2012-12-31", date < "2014-01-01") Then, we proceed to the training of the models. The syntaxes are those used in the previous chapters, nothing new here. We start with a penalized regression. In all predictions below, the original testing sample is used for all models. y_ens_2007 <- training_sample_2007$R1M_Usd # Dep. var.
x_ens_2007 <- training_sample_2007 %>% # Predictors dplyr::select(features) %>% as.matrix() fit_ens_2007 <- glmnet(x_ens_2007, y_ens_2007, alpha = 0.1, lambda = 0.1) # Model err_ens_2007 <- predict(fit_ens_2007, x_penalized_test) - testing_sample$R1M_Usd # Pred. errs We continue with a random forest. fit_ens_2009 <- randomForest(formula, # Same formula as for simple trees! data = training_sample_2009, # Data source: 2009 training sample sampsize = 4000, # Size of (random) sample for each tree replace = FALSE, # Is the sampling done with replacement? nodesize = 100, # Minimum size of terminal cluster ntree = 40, # Nb of random trees mtry = 30 # Nb of predictive variables for each tree ) err_ens_2009 <- predict(fit_ens_2009, testing_sample) - testing_sample$R1M_Usd # Pred. errs The third model is a boosted tree. train_features_2011 <- training_sample_2011 %>% dplyr::select(features_short) %>% as.matrix() # Independent variable train_label_2011 <- training_sample_2011 %>% dplyr::select(R1M_Usd) %>% as.matrix() # Dependent variable train_matrix_2011 <- xgb.DMatrix(data = train_features_2011, label = train_label_2011) # XGB format! fit_ens_2011 <- xgb.train(data = train_matrix_2011, # Data source eta = 0.4, # Learning rate objective = "reg:linear", # Objective function max_depth = 4, # Maximum depth of trees nrounds = 18 # Number of trees used ) ## [21:30:00] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror. err_ens_2011 <- predict(fit_ens_2011, xgb_test) - testing_sample$R1M_Usd # Prediction errors Finally, the last model is a simple neural network. NN_features_2013 <- dplyr::select(training_sample_2013, features) %>% as.matrix() # Matrix format is important NN_labels_2013 <- training_sample_2013$R1M_Usd model_ens_2013 <- keras_model_sequential() model_ens_2013 %>% # This defines the structure of the network, i.e. how layers are organized layer_dense(units = 16, activation = 'relu', input_shape = ncol(NN_features_2013)) %>% layer_dense(units = 8, activation = 'tanh') %>% layer_dense(units = 1) model_ens_2013 %>% compile( # Model specification loss = 'mean_squared_error', # Loss function optimizer = optimizer_rmsprop(), # Optimisation method (weight updating) metrics = c('mean_absolute_error') # Output metric ) model_ens_2013 %>% fit(NN_features_2013, # Training features NN_labels_2013, # Training labels epochs = 9, batch_size = 128 # Training parameters ) err_ens_2013 <- predict(model_ens_2013, NN_test_features) - testing_sample$R1M_Usd Endowed with the errors of the four models, we can compute their correlation matrix. E_subtraining <- tibble(err_ens_2007, err_ens_2009, err_ens_2011, err_ens_2013) cor(E_subtraining) ## err_ens_2007 err_ens_2009 err_ens_2011 err_ens_2013 ## err_ens_2007 1.0000000 0.9542497 0.6460091 0.9996685 ## err_ens_2009 0.9542497 1.0000000 0.6317006 0.9549044 ## err_ens_2011 0.6460091 0.6317006 1.0000000 0.6464010 ## err_ens_2013 0.9996685 0.9549044 0.6464010 1.0000000 The results are overall disappointing. Only one model manages to extract patterns that are somewhat different from the other ones, resulting in correlations of roughly 65% with the other models. Neural networks (on 2013 data) and penalized regressions (2007) remain highly correlated. One possible explanation could be that the models capture mainly noise and little signal. Working with long-term labels like annual returns could help improve diversification across models.
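One way to check whether this (partial) decorrelation pays off is to average the errors of the four models, which amounts to an equally weighted combination of their forecasts, and to compare the aggregate accuracy with that of each individual model. A quick sketch (output omitted):
err_EW_split <- apply(E_subtraining, 1, mean)   # Errors of the equally weighted combination
mean(abs(err_EW_split))                         # Average absolute error of the mix
apply(abs(E_subtraining), 2, mean)              # Individual models, for comparison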
11.4 Exercise Build an integrated ensemble on top of 3 neural networks trained entirely with Keras. Each network obtains one third of predictors as input. The three networks yield a classification (yes/no or buy/sell). The overarching network aggregates the three outputs into a final decision. Evaluate its performance on the testing sample. Use the functional API. References "],["backtest.html", "Chapter 12 Portfolio backtesting 12.1 Setting the protocol 12.2 Turning signals into portfolio weights 12.3 Performance metrics 12.4 Common errors and issues 12.5 Implication of non-stationarity: forecasting is hard 12.6 First example: a complete backtest 12.7 Second example: backtest overfitting 12.8 Coding exercises", " Chapter 12 Portfolio backtesting In this section, we introduce the notations and framework that will be used when analyzing and comparing investment strategies. Portfolio backtesting is often conceived and perceived as a quest to find the best strategy - or at least a solidly profitable one. When carried out thoroughly, this possibly long endeavor may entice the layman to confuse a fluke for a robust policy. Two papers published back-to-back warn against the perils of data snooping, which is related to \\(p\\)-hacking. In both cases, the researcher will torture the data until the sought result is found. Fabozzi and Prado (2018) acknowledge that only strategies that work make it to the public, while thousands (at least) have been tested. Picking the pleasing outlier (the only strategy that seemed to work) is likely to generate disappointment when switching to real trading. In a similar vein, R. Arnott, Harvey, and Markowitz (2019) provide a list of principles and safeguards that any analyst should follow to avoid any type of error when backtesting strategies. The worst type is arguably false positives whereby strategies are found (often by cherrypicking) to outperform in one very particular setting, but will likely fail in live implementation. In addition to these recommendations on portfolio constructions, R. Arnott et al. (2019) also warn against the hazards of blindly investing in smart beta products related to academic factors. Plainly, expectations should not be set too high or face the risk of being disappointed. Another takeaway from their article is that economic cycles have a strong impact on factor returns: correlations change quickly and drawdowns can be magnified in times of major downturns. Backtesting is more complicated than it seems and it is easy to make small mistakes that lead to apparently good portfolio policies. This chapter lays out a rigorous approach to this exercise, discusses a few caveats, and proposes a lengthy example. 12.1 Setting the protocol We consider a dataset with three dimensions: time \\(t=1,\\dots,T\\), assets \\(n=1,\\dots,N\\) and characteristics \\(k=1,\\dots,K\\). One of these attributes must be the price of asset \\(n\\) at time \\(t\\), which we will denote \\(p_{t,n}\\). From that, the computation of the arithmetic return is straightforward (\\(r_{t,n}=p_{t,n}/p_{t-1,n}-1\\)) and so is any heuristic measure of profitability. For simplicity, we assume that time points are equidistant or uniform, i.e., that \\(t\\) is the index of a trading day or of a month for example. If each point in time \\(t\\) has data available for all assets, then this makes a dataset with \\(I=T\\times N\\) rows. The dataset is first split in two: the out-of-sample period and the initial buffer period. 
The buffer period is required to train the models for the first portfolio composition. This period is determined by the size of the training sample. There are two options for this size: fixed (usually equal to 2 to 10 years) and expanding. In the first case, the training sample will roll over time, taking into account only the most recent data. In the second case, models are built on all of the available data, the size of which increases with time. This last option can create problems because the first dates of the backtest are based on much smaller amounts of information compared to the last dates. Moreover, there is an ongoing debate on whether including the full history of returns and characteristics is advantageous or not. Proponents argue that this allows models to see many different market conditions. Opponents make the case that old data is by definition outdated and thus useless and possibly misleading because it won’t reflect current or future short-term fluctuations. Henceforth, we choose the rolling period option for the training sample, as depicted in Figure 12.1. FIGURE 12.1: Backtesting with rolling windows. The training set of the first period is simply the buffer period. Two crucial design choices are the rebalancing frequency and the horizon at which the label is computed. It is not obvious that they should be equal but their choice should make sense. It can seem right to train on a 12-month forward label (which captures longer trends) and invest monthly or quarterly. However, it seems odd to do the opposite and train on short-term movements (monthly) and invest at a long horizon. These choices have a direct impact on how the backtest is carried out. If we note: \\(\\Delta_h\\) for the holding period between 2 rebalancing dates (in days or months); \\(\\Delta_s\\) for the size of the desired training sample (in days or months - not taking the number of assets into consideration); \\(\\Delta_l\\) for the horizon at which the label is computed (in days or months), then the total length of the training sample should be \\(\\Delta_s+\\Delta_l\\). Indeed, at any moment \\(t\\), the training sample should stop at \\(t-\\Delta_l\\) so that the last point corresponds to a label that is calculated until time \\(t\\). This is highlighted in Figure 12.2 in the form of the red danger zone. We call it the red zone because any observation which has a time index \\(s\\) inside the interval \\((t-\\Delta_l,t]\\) will engender a forward looking bias. Indeed if a feature is indexed by \\(s \\in (t-\\Delta_l,t]\\), then by definition, the label covers the period \\([s,s+\\Delta_l]\\) with \\(s+\\Delta_l>t\\). At time \\(t\\), this requires knowledge of the future and is naturally not realistic. FIGURE 12.2: The subtleties in rolling training samples. 12.2 Turning signals into portfolio weights The predictive tools outlined in Chapters 5 to 11 are only meant to provide a signal that is expected to give some information on the future profitability of assets. There are many ways that this signal can be integrated in an investment decision (see Snow (2020) for ways to integrate ML tools into this task). First and foremost, there are at least two steps in the portfolio construction process and the signal can be used at any of these stages. Relying on the signal for both steps puts a lot of emphasis on the predictions and should only be considered when the level of confidence in the forecasts is high. The first step is selection. 
While a forecasting exercise can be carried out on a large number of assets, it is not compulsory to invest in all of these assets. In fact, for long-only portfolios, it would make sense to take advantage of the signal to exclude those assets that are presumably likely to underperform in the future. Often, portfolio policies have fixed sizes that impose a constant number of assets. One heuristic way to exploit the signal is to select the assets that have the most favorable predictions and to discard the others. This naive idea is often used in the asset pricing literature: portfolios are formed according to the quantiles of underlying characteristics and some characteristics are deemed interesting if the corresponding sorted portfolios exhibit very different profitabilities (e.g., high average return for high quantiles versus low average return for low quantiles). This is for instance an efficient way to test the relevance of the signal. If \\(Q\\) portfolios \\(q=1,\\dots,Q\\) are formed according to the rankings of the assets with respect to the signal, then one would expect that the out-of-sample performance of the portfolios be monotonic with \\(q\\). While a rigorous test of monotonicity would require to account for all portfolios (see, e.g., Romano and Wolf (2013)), it is often only assumed that the extreme portfolios suffice. If the difference between portfolio number 1 and portfolio number \\(Q\\) is substantial, then the signal is valuable. Whenever the investor is able to short assets, this amounts to a dollar neutral strategy. The second step is weighting. If the selection process relied on the signal, then a simple weighting scheme is often a good idea. Equally weighted portfolios are known to be hard to beat (see DeMiguel, Garlappi, and Uppal (2009)), especially compared to their cap-weighted alternative, as is shown in Plyakha, Uppal, and Vilkov (2016). More advanced schemes include equal risk contributions (Maillard, Roncalli, and Teiletche (2010)) and constrained minimum variance (Coqueret (2015)). Both only rely on the covariance matrix of the assets and not on any proxy for the vector of expected returns. For the sake of completeness, we explicitize a generalization of Coqueret (2015) which is a generic constrained quadratic program: \\[\\begin{equation} \\tag{12.1} \\underset{\\textbf{w}}{\\text{min}} \\ \\frac{\\lambda}{2} \\textbf{w}'\\boldsymbol{\\Sigma}\\textbf{w}-\\textbf{w}'\\boldsymbol{\\mu} , \\quad \\text{s.t.} \\quad \\begin{array}{ll} \\textbf{w}'\\textbf{1}=1, \\\\ (\\textbf{w}-\\textbf{w}_-)'\\boldsymbol{\\Lambda}(\\textbf{w}-\\textbf{w}_-) \\le \\delta_R,\\\\ \\textbf{w}'\\textbf{w} \\le \\delta_D, \\end{array} \\end{equation}\\] where it is easy to recognize the usual mean-variance optimization in the left-hand side. We impose three constraints on the right-hand side.23 The first one is the budget constraint (weights sum to one). The second one penalizes variations in weights (compared to the current allocation, \\(\\textbf{w}_-\\)) via a diagonal matrix \\(\\boldsymbol{\\Lambda}\\) that penalizes trading costs. This is a crucial point. Portfolios are rarely constructed from scratch and are most of the time adjustments from existing positions. In order to reduce the orders and the corresponding transaction costs, it is possible to penalize large variations from the existing portfolio. 
In the above program, the current weights are written \\(\\textbf{w}_-\\) and the desired ones \\(\\textbf{w}\\) so that \\(\\textbf{w}-\\textbf{w}_-\\) is the vector of deviations from the current positions. The term \\((\\textbf{w}-\\textbf{w}_-)'\\boldsymbol{\\Lambda}(\\textbf{w}-\\textbf{w}_-)\\) is an expression that characterizes the sum of squared deviations, weighted by the diagonal coefficients \\(\\Lambda_{n,n}\\). This can be helpful because some assets may be more costly to trade due to liquidity (large cap stocks are more liquid and their trading costs are lower). When \\(\\delta_R\\) decreases, the rotation is reduced because weights are not allowed to deviate too much from \\(\\textbf{w}_-\\). The last constraint enforces diversification via the Herfindahl-Hirschman index of the portfolio: the smaller \\(\\delta_D\\), the more diversified the portfolio. Recalling that there are \\(N\\) assets in the universe, the Lagrange form of (12.1) is: \\[\\begin{equation} \\tag{12.2} L(\\textbf{w})= \\frac{\\lambda}{2} \\textbf{w}'\\boldsymbol{\\Sigma}\\textbf{w}-\\textbf{w}'\\boldsymbol{\\mu}-\\eta (\\textbf{w}'\\textbf{1}_N-1)+\\kappa_R ( (\\textbf{w}-\\textbf{w}_-)'\\boldsymbol{\\Lambda}(\\textbf{w}-\\textbf{w}_-) - \\delta_R)+\\kappa_D(\\textbf{w}'\\textbf{w}-\\delta_D), \\end{equation}\\] and the first order condition \\[\\frac{\\partial}{\\partial \\textbf{w}}L(\\textbf{w})= \\lambda \\boldsymbol{\\Sigma}\\textbf{w}-\\boldsymbol{\\mu}-\\eta\\textbf{1}_N+2\\kappa_R \\boldsymbol{\\Lambda}(\\textbf{w}-\\textbf{w}_-)+2\\kappa_D\\textbf{w}=0,\\] yields \\[\\begin{equation} \\tag{12.3} \\textbf{w}^*_\\kappa= (\\lambda \\boldsymbol{\\Sigma}+2\\kappa_R \\boldsymbol{\\Lambda} +2\\kappa_D\\textbf{I}_N)^{-1} \\left(\\boldsymbol{\\mu} + \\eta_{\\lambda,\\kappa_R,\\kappa_D} \\textbf{1}_N+2\\kappa_R \\boldsymbol{\\Lambda}\\textbf{w}_-\\right), \\end{equation}\\] with \\[\\eta_{\\lambda,\\kappa_R,\\kappa_D}=\\frac{1- \\textbf{1}_N'(\\lambda\\boldsymbol{\\Sigma}+2\\kappa_R \\boldsymbol{\\Lambda}+2\\kappa_D\\textbf{I}_N)^{-1}(\\boldsymbol{\\mu}+2\\kappa_R\\boldsymbol{\\Lambda}\\textbf{w}_-)}{\\textbf{1}'_N(\\lambda \\boldsymbol{\\Sigma}+2\\kappa_R \\boldsymbol{\\Lambda}+2\\kappa_D\\textbf{I}_N)^{-1}\\textbf{1}_N}.\\] This parameter ensures that the budget constraint is satisfied. The optimal weights in (12.3) depend on three tuning parameters: \\(\\lambda\\), \\(\\kappa_R\\) and \\(\\kappa_D\\). - When \\(\\lambda\\) is large, the focus is set more on risk reduction than on profit maximization (which is often a good idea given that risk is easier to predict); - When \\(\\kappa_R\\) is large, the importance of transaction costs in (12.2) is high and thus, in the limit when \\(\\kappa_R \\rightarrow \\infty\\), the optimal weights are equal to the old ones \\(\\textbf{w}_-\\) (for finite values of the other parameters). - When \\(\\kappa_D\\) is large, the portfolio is more diversified and (all other things equal) when \\(\\kappa_D \\rightarrow \\infty\\), the weights are all equal (to \\(1/N\\)). - When \\(\\kappa_R=\\kappa_D=0\\), we recover the classical mean-variance weights which are a mix between the maximum Sharpe ratio portfolio proportional to \\((\\boldsymbol{\\Sigma})^{-1} \\boldsymbol{\\mu}\\) and the minimum variance portfolio proportional to \\((\\boldsymbol{\\Sigma})^{-1} \\textbf{1}_N\\). This seemingly complex formula is in fact very flexible and tractable.
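To illustrate this tractability, below is a minimal R sketch of Equation (12.3); the inputs (Sigma, mu, Lambda, w_prev) and the parameter values are hypothetical placeholders rather than calibrated choices.
qp_weights <- function(Sigma, mu, Lambda, w_prev, lambda, kappa_R, kappa_D){
  N <- length(mu)
  M <- lambda * Sigma + 2 * kappa_R * Lambda + 2 * kappa_D * diag(N)   # Matrix inverted in (12.3)
  M_inv <- solve(M)
  eta <- (1 - sum(M_inv %*% (mu + 2 * kappa_R * Lambda %*% w_prev))) /
    sum(M_inv %*% rep(1, N))                                           # Budget-constraint multiplier
  M_inv %*% (mu + eta * rep(1, N) + 2 * kappa_R * Lambda %*% w_prev)   # Optimal weights of (12.3)
}
set.seed(42)                                                # Toy inputs, purely illustrative
N <- 10
Sigma_toy <- crossprod(matrix(rnorm(N * N), N)) / N         # Random positive-definite covariance matrix
mu_toy <- rnorm(N, mean = 0.005, sd = 0.01)                 # Hypothetical expected returns
w_new <- qp_weights(Sigma_toy, mu_toy, Lambda = diag(N), w_prev = rep(1/N, N),
                    lambda = 1, kappa_R = 0.1, kappa_D = 0.1)
sum(w_new)                                                  # Equal to 1: the budget constraint holds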
It requires some tests and adjustments before finding realistic values for \\(\\lambda\\), \\(\\kappa_R\\) and \\(\\kappa_D\\) (see exercise at the end of the chapter). In Pedersen, Babu, and Levine (2020), the authors recommend a similar form, except that the covariance matrix is shrunk towards the diagonal matrix of sample variances and the expected returns are mix between a signal and an anchor portfolio. The authors argue that their general formulation has links with robust optimization (see also Kim, Kim, and Fabozzi (2014)), Bayesian inference (Lai et al. (2011)), matrix denoising via random matrix theory, and, naturally, shrinkage. In fact, shrunk expected returns have been around for quite some time (Jorion (1985), Kan and Zhou (2007) and Bodnar, Parolya, and Schmid (2013)) and simply seek to diversify and reduce estimation risk. 12.3 Performance metrics The evaluation of performance is a key stage in a backtest. This section, while not exhaustive, is intended to cover the most important facets of portfolio assessment. 12.3.1 Discussion While the evaluation of the accuracy of ML tools (See Section 10.1) is of course valuable (and imperative!), the portfolio returns are the ultimate yardstick during a backtest. One essential element in such an exercise is a benchmark because raw and absolute metrics don’t mean much on their own. This is not only true at the portfolio level, but also at the ML engine level. In most of the trials of the previous chapters, the MSE of the models on the testing set revolves around 0.037. An interesting figure is the variance of one-month returns on this set, which corresponds to the error made by a constant prediction of 0 all the time. This figure is equal to 0.037, which means that the sophisticated algorithms don’t really improve on a naive heuristic. This benchmark is the one used in the out-of-sample \\(R^2\\) of Gu, Kelly, and Xiu (2020b). In portfolio choice, the most elementary allocation is the uniform one, whereby each asset receives the same weight. This seemingly simplistic solution is in fact an incredible benchmark, one that is hard to beat consistently (see DeMiguel, Garlappi, and Uppal (2009) and Plyakha, Uppal, and Vilkov (2016)). Theoretically, uniform portfolios are optimal when uncertainty, ambiguity or estimation risk is high (Pflug, Pichler, and Wozabal (2012), Maillet, Tokpavi, and Vaucher (2015)) and empirically, it cannot be outperformed even at the factor level (Dichtl, Drobetz, and Wendt (2020)). Below, we will pick an equally weighted (EW) portfolio of all stocks as our benchmark. 12.3.2 Pure performance and risk indicators We then turn to the definition of the usual metrics used both by practitioners and academics alike. Henceforth, we write \\(r^P=(r_t^P)_{1\\le t\\le T}\\) and \\(r^B=(r_t^B)_{1\\le t\\le T}\\) for the returns of the portfolio and those of the benchmark, respectively. When referring to some generic returns, we simply write \\(r_t\\). There are many ways to analyze them and most of them rely on their distribution. The simplest indicator is the average return: \\[\\bar{r}_P=\\mu_P=\\mathbb{E}[r^P]\\approx \\frac{1}{T}\\sum_{t=1}^T r_t^P, \\quad \\bar{r}_B=\\mu_B=\\mathbb{E}[r^B]\\approx \\frac{1}{T}\\sum_{t=1}^T r_t^B,\\] where, obviously, the portfolio is noteworthy if \\(\\mathbb{E}[r^P]>\\mathbb{E}[r^B]\\). 
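As a sketch (assuming, as in the previous chapters, that testing_sample holds the out-of-sample data with a date column and the R1M_Usd label), the returns of this EW benchmark can be proxied by the cross-sectional average return at each date:
library(dplyr)
r_B <- testing_sample %>%
  group_by(date) %>%                   # One rebalancing date at a time
  summarise(r = mean(R1M_Usd))         # Equally weighted (cross-sectional average) return
head(r_B)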
Note that we use the arithmetic average above but the geometric one is also an option, in which case: \\[\\tilde{\\mu}_P\\approx \\left(\\prod_{t=1}^T(1+r^P_t) \\right)^{1/T}-1 , \\quad \\tilde{\\mu}_B \\approx \\left(\\prod_{t=1}^T(1+r^B_t) \\right)^{1/T}-1.\\] The benefit of this second definition is that it takes the compounding of returns into account and hence compensates for volatility pumping. To see this, consider a very simple two-period model with returns \\(-r\\) and \\(+r\\). The arithmetic average is zero, but the geometric one \\(\\sqrt{1-r^2}-1\\) is negative. Another simple indicator is the hit ratio. Akin to accuracy, hit ratios evaluate the proportion of times when the position is in the right direction (long when the realized return is positive and short when it is negative). Hence hit ratios evaluate the propensity to make good guesses. This can be computed at the asset level (the proportion of positions in the correct direction24) or at the portfolio level. In all cases, the computation can be performed on raw returns or on relative returns (e.g., compared to a benchmark). A meaningful hit ratio is the proportion of times that a strategy beats its benchmark. This is of course not sufficient, as many small gains can be offset by a few large losses. Lastly, one important caveat. In all examples of supervised learning tools in the book, we compared the hit ratios to 0.5. This is in fact wrong because if an investor is bullish, he or she may always bet on upward moves. In this case, the hit ratio is the percentage of time that returns are positive. Over the long run, this probability is above 0.5. In our sample, it is equal to 0.556, which is well above 0.5. This could be viewed as a benchmark to be surpassed. Pure performance measures are almost always accompanied by risk measures. The second moment of returns is usually used to quantify the magnitude of fluctuations of the portfolio. A large variance implies sizable movements in returns, and hence in portfolio values. This is why the standard deviation of returns is called the volatility of the portfolio. \\[\\sigma^2_P=\\mathbb{V}[r^P]\\approx \\frac{1}{T-1}\\sum_{t=1}^T (r_t^P-\\mu_P)^2, \\quad \\sigma^2_B=\\mathbb{V}[r^B]\\approx \\frac{1}{T-1}\\sum_{t=1}^T (r_t^B-\\mu_B)^2.\\] In this case, the portfolio can be preferred if it is less risky compared to the benchmark, i.e., when \\(\\sigma_P^2<\\sigma_B^2\\) and when average returns are equal (or comparable). Higher order moments of returns are sometimes used (skewness and kurtosis), but they are far less common. We refer for instance to Harvey et al. (2010) for one method that takes them into account in the portfolio construction process. For some people, the volatility is an incomplete measure of risk. It can be argued that it should be decomposed into ‘good’ volatility (when prices go up) versus ‘bad’ volatility (when they go down). The downward semi-variance is computed as the variance taken over the negative returns: \\[\\sigma^2_-\\approx \\frac{1}{\\text{card}(r_t<0)}\\sum_{t=1}^T (r_t-\\mu_P)^2 1_{\\{r_t<0\\}}.\\] The average return and the volatility are the typical moment-based metrics used by practitioners. Other indicators rely on different aspects of the distribution of returns with a focus on tails and extreme events. The Value-at-Risk (VaR) is one such example.
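Before turning to these tail-based measures, the moment-based indicators above can be gathered in a short helper. The sketch below is ours and purely illustrative (the function name and arguments are not from the book); r is assumed to be a vector of periodic (e.g., monthly) portfolio returns.
perf_basic <- function(r){
    avg_arith <- mean(r)                                              # Arithmetic average
    avg_geom <- prod(1 + r)^(1/length(r)) - 1                         # Geometric average
    vol <- sd(r)                                                      # Volatility
    semi_vol <- sqrt(sum((r - mean(r))^2 * (r < 0)) / sum(r < 0))     # Downward semi-deviation
    hit <- mean(r > 0)                                                # Proportion of positive returns
    c(avg_arith = avg_arith, avg_geom = avg_geom, vol = vol, semi_vol = semi_vol, hit_ratio = hit)
}
Applied, for instance, to the return series of the portfolios built later in this chapter, it summarizes their profile in one line.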
If \\(F_r\\) is the empirical cdf of returns, the VaR at a level of confidence \\(\\alpha\\) (often taken to be 95%) is \\[\\text{VaR}_\\alpha(\\textbf{r}_t)=F_r^{-1}(1-\\alpha),\\] i.e., the \\((1-\\alpha)\\) quantile of the return distribution. It is equal to the realization of a bad scenario (of return) that is expected to occur with frequency \\(1-\\alpha\\) on average. An even more conservative measure is the so-called Conditional Value at Risk (CVaR), also known as expected shortfall, which computes the average loss over the \\((1-\\alpha)\\) proportion of worst scenarios. Its empirical evaluation is \\[\\text{CVaR}_\\alpha(\\textbf{r}_t)=\\frac{1}{\\text{Card}(r_t < \\text{VaR}_\\alpha(\\textbf{r}_t))}\\sum_{r_t < \\text{VaR}_\\alpha(\\textbf{r}_t)}r_t.\\] Going crescendo in the severity of risk measures, the ultimate evaluation of loss is the maximum drawdown. It is equal to the maximum loss suffered from the peak value of the strategy. If we write \\(P_t\\) for the time-\\(t\\) value of a portfolio, the drawdown is \\[D_T^P=\\underset{0 \\le t \\le T}{\\text{max}} P_t-P_T ,\\] and the maximum drawdown is \\[MD_T^P=\\underset{0 \\le s \\le T}{\\text{max}} \\left(\\underset{0 \\le t \\le s}{\\text{max}} P_t-P_s, 0\\right) .\\] This quantity evaluates the greatest loss over the time frame \\([0,T]\\) and is thus the most conservative risk measure of all. 12.3.3 Factor-based evaluation In the spirit of factor models, performance can also be assessed through the lens of exposures. If we recall the original formulation from Equation (3.1): \\[r_{t,n}= \\alpha_n+\\sum_{k=1}^K\\beta_{t,k,n}f_{t,k}+\\epsilon_{t,n}, \\] then the estimated \\(\\hat{\\alpha}_n\\) is the performance that cannot be explained by the other factors. When returns are excess returns (over the risk-free rate) and when there is only one factor, the market factor, then this quantity is called Jensen’s alpha (Jensen (1968)). Often, it is simply referred to as alpha. The other estimate, \\(\\hat{\\beta}_{t,M,n}\\) (\\(M\\) for market), is the market beta. Because of the rise of factor investing, it has become customary to also report the alpha of more exhaustive regressions. Adding the size and value premium (as in Fama and French (1993)) and even momentum (Carhart (1997)) helps understand if a strategy generates value beyond that which can be obtained through the usual factors. 12.3.4 Risk-adjusted measures The tradeoff between average return and volatility has been a cornerstone of modern finance since Markowitz (1952). The simplest way to synthesize both metrics is via the information ratio: \\[IR(P,B)=\\frac{\\mu_{P-B}}{\\sigma_{P-B}},\\] where the index \\(P-B\\) implies that the mean and standard deviations are computed on the long-short portfolio with returns \\(r_t^P-r_t^B\\). The denominator \\(\\sigma_{P-B}\\) is sometimes called the tracking error. The most widespread information ratio is the Sharpe ratio (Sharpe (1966)) for which the benchmark is some riskless asset. Instead of directly computing the information ratio between two portfolios or strategies, it is often customary to compare their Sharpe ratios. Simple comparisons can benefit from statistical tests (see, e.g., Ledoit and Wolf (2008)). More extreme risk measures can serve as denominators in risk-adjusted indicators. The Managed Account Report (MAR) ratio is, for example, computed as \\[MAR^P = \\frac{\\tilde{\\mu}_P}{MD^P},\\] while the Treynor ratio is equal to \\[\\text{Treynor}=\\frac{\\mu_P}{\\hat{\\beta}_M},\\] i.e., the (excess) return divided by the market beta (see Treynor (1965)).
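Most of the indicators of this section can be computed in a handful of lines. The sketch below is ours and merely illustrative (names and arguments are assumptions, not the book's code): it takes a vector of portfolio returns r and of benchmark returns r_b and outputs the VaR, CVaR, maximum drawdown, information ratio and MAR ratio; annualization conventions are left aside and the portfolio value path starts at one.
risk_metrics <- function(r, r_b, alpha = 0.95){
    var_a <- as.numeric(quantile(r, 1 - alpha))            # VaR: empirical (1-alpha) quantile
    cvar_a <- mean(r[r < var_a])                           # CVaR: average of the worst returns
    port_value <- cumprod(1 + r)                           # Portfolio value path
    max_dd <- max(cummax(port_value) - port_value)         # Maximum drawdown (in value terms)
    ir <- mean(r - r_b) / sd(r - r_b)                      # Information ratio versus the benchmark
    mar <- (prod(1 + r)^(1/length(r)) - 1) / max_dd        # MAR: per-period geometric mean / max drawdown
    c(VaR = var_a, CVaR = cvar_a, MaxDD = max_dd, IR = ir, MAR = mar)
}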
The Treynor ratio was generalized to multifactor expositions by Hübner (2005) into the generalized Treynor ratio: \\[\\text{GT}=\\mu_P\\frac{\\sum_{k=1}^K\\bar{f}_k}{\\sum_{k=1}^K\\hat{\\beta}_k\\bar{f}_k},\\] where the \\(\\bar{f}_k\\) are the sample averages of the factors \\(f_{t,k}\\). We refer to the original article for a detailed account of the analytical properties of this ratio. 12.3.5 Transaction costs and turnover Updating portfolio composition is not free. In all generality, the total cost of one rebalancing at time \\(t\\) is proportional to \\(C_t=\\sum_{n=1}^N | \\Delta w_{t,n}|c_{t,n}\\), where \\(\\Delta w_{t,n}\\) is the change in position for asset \\(n\\) and \\(c_{t,n}\\) the corresponding fee. This last quantity is often hard to predict; thus, it is customary to use a proxy that depends for instance on market capitalization (large stocks have more liquid shares and thus require smaller fees) or bid-ask spreads (smaller spreads mean smaller fees). As a first order approximation, it is often useful to compute the average turnover: \\[\\text{Turnover}=\\frac{1}{T-1}\\sum_{t=2}^T\\sum_{n=1}^N|w_{t,n}-w_{t-,n}|,\\] where \\(w_{t,n}\\) are the desired time-\\(t\\) weights in the portfolio and \\(w_{t-,n}\\) are the weights just before the rebalancing. The positions of the first period (launching weights) are excluded from the computation by convention. Transaction costs can then be proxied as a multiple of turnover (times some average or median cost in the cross-section of firms). This is a first order estimate of realized costs that does not take into consideration the evolution of the scale of the portfolio. Nonetheless, a rough figure is much better than none at all. Once transaction costs (TCs) have been annualized, they can be deducted from average returns to yield a more realistic picture of profitability. In the same vein, the transaction cost-adjusted Sharpe ratio of a portfolio \\(P\\) is given by \\[\\begin{equation} \\tag{12.4} SR_{TC}=\\frac{\\mu_P-TC}{\\sigma_P}. \\end{equation}\\] Transaction costs are often overlooked in academic articles but can have a sizable impact in real-life trading (see, e.g., Novy-Marx and Velikov (2015)). DeMiguel et al. (2020) show how to use factor investing (and exposures) to combine and offset positions and reduce overall fees. 12.4 Common errors and issues 12.4.1 Forward looking data One of the most common mistakes in portfolio backtesting is the use of forward looking data. It is for instance easy to fall into the trap of the danger zone depicted in Figure 12.2. In this case, the labels used at time \\(t\\) are computed with knowledge of what happens at times \\(t+1\\), \\(t+2\\), etc. It is worth triple checking every step in the code to make sure that strategies are not built on prescient data. 12.4.2 Backtest overfitting The second major problem is backtest overfitting. The analogy with training set overfitting is easy to grasp. It is a well-known issue and was formalized for instance in White (2000) and Romano and Wolf (2005). In portfolio choice, we refer to Bajgrowicz and Scaillet (2012), Bailey and Prado (2014) and Lopez de Prado and Bailey (2020), and the references therein. At any given moment, a backtest depends on only one particular dataset. Often, the result of the first backtest will not be satisfactory - for many possible reasons. Hence, it is tempting to have another try after altering some parameters that were probably not optimal. This second test may be better, but not quite good enough - yet.
Thus, in a third trial, a new weighting scheme can be tested, along with a new forecasting engine (more sophisticated). Iteratively, the backtester will eventually end up with a strategy that performs well enough; it is just a matter of time and trials. One consequence of backtest overfitting is that it is illusory to hope for the same Sharpe ratios in live trading as those obtained in the backtest. Reasonable professionals divide the Sharpe ratio by two at least (Harvey and Liu (2015), Suhonen, Lennkh, and Perez (2017)). In Bailey and Prado (2014), the authors even propose a statistical test for Sharpe ratios, provided that some metrics of all tested strategies are stored in memory. The formula for deflated Sharpe ratios is: \\[\\begin{equation} \\tag{12.5} t = \\phi\\left((SR-SR^*)\\sqrt{\\frac{T-1}{1-\\gamma_3SR+\\frac{\\gamma_4-1}{4}SR^2}} \\right), \\end{equation}\\] where \\(SR\\) is the Sharpe Ratio obtained by the best strategy among all that were tested, and \\[SR^*=\\mathbb{E}[SR]+\\sqrt{\\mathbb{V}[SR]}\\left((1-\\gamma)\\phi^{-1}\\left(1-\\frac{1}{N}\\right)+\\gamma \\phi^{-1}\\left(1-\\frac{1}{Ne}\\right) \\right),\\] is the theoretical average maximum SR. Moreover, \\(T\\) is the number of trading dates; \\(\\gamma_3\\) and \\(\\gamma_4\\) are the skewness and kurtosis of the returns of the chosen (best) strategy; \\(\\phi\\) is the cdf of the standard Gaussian law and \\(\\gamma\\approx 0.577\\) is the Euler-Mascheroni constant; \\(N\\) refers to the number of strategy trials. If \\(t\\) defined above is below a certain threshold (e.g., 0.95), then the \\(SR\\) cannot be deemed significant compared to all of those that were tested. Most of the time, sadly, that is the case. In Equation (12.5), the realized SR must be above the theoretical maximum \\(SR^*\\) and the scaling factor must be sufficiently large to push the argument inside \\(\\phi\\) close enough to two, so that \\(t\\) surpasses 0.95. In the scientific community, test overfitting is also known as p-hacking. It is rather common in financial economics and the reading of Harvey (2017) is strongly advised to grasp the magnitude of the phenomenon. p-hacking is also present in most fields that use statistical tests (see, e.g., Head et al. (2015) to cite but one reference). There are several ways to cope with p-hacking: don’t rely on p-values (Amrhein, Greenland, and McShane (2019)); use detection tools (Elliott, Kudrin, and Wuthrich (2019)); or, finally, use advanced methods that process arrays of statistics (e.g., the Bayesianized versions of p-values to include some prior assessment from Harvey (2017), or other tests such as those proposed in Romano and Wolf (2005) and Simonsohn, Nelson, and Simmons (2014)). The first option is wise, but the drawback is that the decision process is then left to another arbitrary yardstick.
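A small simulation helps visualize why such a deflation is needed. The code below is ours and purely illustrative: one hundred strategies with zero true skill are generated, yet the best of them displays a seemingly attractive Sharpe ratio simply because of multiple testing.
set.seed(42)                                    # Reproducibility
n_strats <- 100                                 # Number of (useless) strategies
T_obs <- 120                                    # Ten years of monthly returns
sharpe <- replicate(n_strats, {
    r <- rnorm(T_obs, mean = 0, sd = 0.05)      # Pure noise: zero expected return
    mean(r) / sd(r)                             # Monthly Sharpe ratio of the noise strategy
})
max(sharpe) * sqrt(12)                          # Annualized SR of the "best" strategy: spuriously positive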
If the \\(SR_{TC}^P\\) of the best strategy is above \\(2\\times SR_{TC}^B\\), then there is probably a glitch somewhere in the backtest. This criterion holds under two assumptions: a sufficiently long out-of-sample period and long-only portfolios. It is unlikely that any realistic strategy can outperform a solid benchmark by a very wide margin over the long term. Being able to improve the benchmark’s annualized return by 150 basis points (with comparable volatility) is already a great achievement. Backtests that deliver returns more than 5% above those of the benchmark are dubious. 12.5 Implication of non-stationarity: forecasting is hard This subsection is split into two parts: in the first, we discuss the reasons that make forecasting such a difficult task and in the second we present an important theoretical result originally developed for machine learning but that sheds light on any discipline confronted with out-of-sample tests. An interesting contribution related to this topic is the study from Farmer, Schmidt, and Timmermann (2019). The authors assess the predictive fit of linear models through time: they show that the fit varies strongly over time: sometimes the model performs very well, sometimes not so much. There is no reason why this should not be the case for ML algorithms as well. 12.5.1 General comments The careful reader must have noticed that throughout Chapters 5 to 11, the performance of ML engines is underwhelming. These disappointing results are there on purpose and highlight the crucial truth that machine learning is no panacea, no magic wand, no philosopher’s stone that can transform data into golden predictions. Most ML-based forecasts fail. This is in fact not only true for highly sophisticated techniques, but also for simpler econometric approaches (Dichtl et al. (2020)), which again underlines the need to replicate results to challenge their validity. One reason is that datasets are full of noise and extracting the slightest amount of signal is a tough challenge (we recommend a careful reading of the introduction of Timmermann (2018) for more details on this topic). A second reason is the perpetually time-varying nature of factor premia in the equity space. Some factors can perform very well during one year and then poorly the next year and these reversals can be costly in the context of fully automated data-based allocation processes. In fact, this is one major difference with many fields for which ML has made huge advances. In image recognition, numbers will always have the same shape, and so will cats, buses, etc. Likewise, a verb will always be a verb and syntaxes in languages do not change. This invariance, though sometimes hard to grasp,25 is nonetheless key to the great improvements in both computer vision and natural language processing. In factor investing, there does not seem to be such invariance (see Cornell (2020)). There is no factor and no (possibly nonlinear) combination of factors that can explain and accurately forecast returns over long periods of several decades.26 The academic literature has yet to find such a model; but even if it did, a simple arbitrage reasoning would logically invalidate its conclusions in future datasets. 12.5.2 The no free lunch theorem We start by underlining that the no free lunch theorem in machine learning has nothing to do with the asset pricing condition with the same name (see, e.g., Delbaen and Schachermayer (1994), or, more recently, Cuchiero, Klein, and Teichmann (2016)).
The original formulation was given by Wolpert (1992a) but we also recommend a look at the more recent reference Ho and Pepyne (2002). There are in fact several theorems and two of them can be found in Wolpert and Macready (1997). The statement of the theorem is very abstract and requires some notational conventions. We assume that any training sample \\(S=(\\{\\textbf{x}_1,y_1\\}, \\dots, \\{\\textbf{x}_I,y_I\\})\\) is such that there exists an oracle function \\(f\\) that perfectly maps the features to the labels: \\(y_i=f(\\textbf{x}_i)\\). The oracle function \\(f\\) belongs to a very large set of functions \\(\\mathcal{F}\\). In addition, we write \\(\\mathcal{H}\\) for the set of functions to which the forecaster will resort to approximate \\(f\\). For instance, \\(\\mathcal{H}\\) can be the space of feed-forward neural networks, or the space of decision trees, or the reunion of both. Elements of \\(\\mathcal{H}\\) are written \\(h\\) and \\(\\mathbb{P}[h|S]\\) stands for the (largely unknown) distribution of \\(h\\) knowing the sample \\(S\\). Similarly, \\(\\mathbb{P}[f|S]\\) is the distribution of oracle functions knowing \\(S\\). Finally, the features have a given law, \\(\\mathbb{P}[\\textbf{x}]\\). Let us now consider two models, say \\(h_1\\) and \\(h_2\\). The statement of the theorem is usually formulated with respect to a classification task. Knowing \\(S\\), the error when choosing \\(h_k\\) induced by samples outside of the training sample \\(S\\) can be quantified as: \\[\\begin{equation} \\tag{12.6} E_k(S)= \\int_{f,h}\\int_{\\textbf{x}\\notin S} \\underbrace{ (1-\\delta(f(\\textbf{x}),h_k(\\textbf{x})))}_{\\text{error term}} \\underbrace{\\mathbb{P}[f|S]\\mathbb{P}[h|S]\\mathbb{P}[\\textbf{x}]}_{\\text{distributional terms}}, \\end{equation}\\] where \\(\\delta(\\cdot,\\cdot)\\) is the delta Kronecker function: \\[\\begin{equation} \\tag{12.7} \\delta(x,y)=\\left\\{\\begin{array}{ll} 0 & \\text{if } x\\neq y \\\\ 1 & \\text{if } x = y \\end{array} .\\right. \\end{equation}\\] One of the no free lunch theorems states that \\(E_1(S)=E_2(S)\\), that is, that with the sole knowledge of \\(S\\), there can be no superior algorithm, on average. In order to build a performing algorithm, the analyst or econometrician must have prior views on the structure of the relationship between \\(y\\) and \\(\\textbf{x}\\) and integrate these views in the construction of the model. Unfortunately, this can also yield underperforming models if the views are incorrect. 12.6 First example: a complete backtest We finally propose a full detailed example of one implementation of a ML-based strategy run on a careful backtest. What follows is a generalization of the content of Section 5.2.2. In the same spirit, we split the backtest in four parts: the creation/initialization of variables; the definition of the strategies in one main function; the backtesting loop itself; the performance indicators. Accordingly, we start with initializations. 
sep_oos <- as.Date("2007-01-01") # Starting point for backtest ticks <- data_ml$stock_id %>% # List of all asset ids as.factor() %>% levels() N <- length(ticks) # Max number of assets t_oos <- returns$date[returns$date > sep_oos] %>% # Out-of-sample dates unique() %>% # Remove duplicates as.Date(origin = "1970-01-01") # Transform in date format Tt <- length(t_oos) # Nb of dates, avoid T = TRUE nb_port <- 2 # Nb of portfolios/stragegies portf_weights <- array(0, dim = c(Tt, nb_port, N)) # Initialize portfolio weights portf_returns <- matrix(0, nrow = Tt, ncol = nb_port) # Initialize portfolio returns This first step is crucial, it lays the groundwork for the core of the backtest. We consider only two strategies: one ML-based and the EW (1/N) benchmark. The main (weighting) function will consist of these two components, but we define the sophisticated one in a dedicated wrapper. The ML-based weights are derived from XGBoost predictions with 80 trees, a learning rate of 0.3 and a maximum tree depth of 4. This makes the model complex but not exceedingly so. Once the predictions are obtained, the weighting scheme is simple: it is an EW portfolio over the best half of the stocks (those with above median prediction). In the function below, all parameters (e.g., the learning rate, eta or the number of trees nrounds) are hard-coded. They can easily be passed in arguments next to the data inputs. One very important detail is that in contrast to the rest of the book, the label is the 12-month future return. The main reason for this is rooted in the discussion from Section 4.6. Also, to speed up the computations, we remove the bulk of the distribution of the labels and keep only the top 20% and bottom 20%, as is advised in Coqueret and Guida (2020). The filtering levels could also be passed as arguments. weights_xgb <- function(train_data, test_data, features){ train_features <- train_data %>% dplyr::select(features) %>% as.matrix() # Indep. variable train_label <- train_data$R12M_Usd / exp(train_data$Vol1Y_Usd) # Dep. variable ind <- which(train_label < quantile(train_label,0.2)| # Filter train_label > quantile(train_label, 0.8)) train_features <- train_features[ind, ] # Filt'd features train_label <- train_label[ind] # Filtered label train_matrix <- xgb.DMatrix(data = train_features, label = train_label) # XGB format fit <- train_matrix %>% xgb.train(data = ., # Data source (pipe input) eta = 0.3, # Learning rate objective = "reg:squarederror", # Number of random trees max_depth = 4, # Maximum depth of trees nrounds = 80, # Number of trees used verbose = 0 # No comments ) xgb_test <- test_data %>% # Test sample => XGB format dplyr::select(features) %>% as.matrix() %>% xgb.DMatrix() pred <- predict(fit, xgb_test) # Single prediction w <- pred > median(pred) # Keep only the 50% best predictions w$weights <- w / sum(w) w$names <- unique(test_data$stock_id) return(w) # Best predictions, equally-weighted } Compared to the structure proposed in Section 6.4.6, the differences are that the label is not only based on long-term returns, but it also relies on a volatility component. Even though the denominator in the label is the exponential quantile of the volatility, it seems fair to say that it is inspired by the Sharpe ratio and that the model seeks to explain and forecast a risk-adjusted return instead of a raw return. A stock with very low volatility will have its return unchanged in the label, while a stock with very high volatility will see its return divided by a factor close to three (exp(1)=2.718). 
This function is then embedded in the global weighting function which only wraps two schemes: the EW benchmark and the ML-based policy. portf_compo <- function(train_data, test_data, features, j){ if(j == 1){ # This is the benchmark N <- test_data$stock_id %>% # Test data dictates allocation factor() %>% nlevels() w <- 1/N # EW portfolio w$weights <- rep(w,N) w$names <- unique(test_data$stock_id) # Asset names return(w) } if(j == 2){ # This is the ML strategy. return(weights_xgb(train_data, test_data, features)) } } Equipped with this function, we can turn to the main backtesting loop. Given the fact that we use a large-scale model, the computation time for the loop is large (possibly a few hours on a slow machine with CPU). Resorting to functional programming can speed up the loop (see exercise at the end of the chapter). Also, a simple benchmark equally weighted portfolio can be coded with tidyverse functions only. m_offset <- 12 # Offset in months for buffer period train_size <- 5 # Size of training set in years for(t in 1:(length(t_oos)-1)){ # Stop before last date: no fwd ret.! if(t%%12==0){print(t_oos[t])} # Just checking the date status train_data <- data_ml %>% filter(date < t_oos[t] - m_offset * 30, # Roll window w. buffer date > t_oos[t] - m_offset * 30 - 365 * train_size) test_data <- data_ml %>% filter(date == t_oos[t]) # Test sample realized_returns <- test_data %>% # Computing returns via: dplyr::select(R1M_Usd) # 1M holding period! for(j in 1:nb_port){ temp_weights <- portf_compo(train_data, test_data, features, j) # Weights ind <- match(temp_weights$names, ticks) %>% na.omit() # Index: test vs all portf_weights[t,j,ind] <- temp_weights$weights # Allocate weights portf_returns[t,j] <- sum(temp_weights$weights * realized_returns) # Compute returns } } ## [1] "2007-12-31" ## [1] "2008-12-31" ## [1] "2009-12-31" ## [1] "2010-12-31" ## [1] "2011-12-31" ## [1] "2012-12-31" ## [1] "2013-12-31" ## [1] "2014-12-31" ## [1] "2015-12-31" ## [1] "2016-12-31" ## [1] "2017-12-31" There are two important comments to be made on the above code. The first comment pertains to the two parameters that are defined in the first lines. They refer to the size of the training sample (5 years) and the length of the buffer period shown in Figure 12.2. This buffer period is imperative because the label is based on a long-term (12-month) return. This lag is compulsory to avoid any forward-looking bias in the backtest. Below, we create a function that computes the turnover (variation in weights). It requires both the weight values as well as the returns of all assets because the weights just before a rebalancing depend on the weights assigned in the previous period, as well as on the returns of the assets that have altered these original weights during the holding period. turnover <- function(weights, asset_returns, t_oos){ turn <- 0 for(t in 2:length(t_oos)){ realised_returns <- returns %>% filter(date == t_oos[t]) %>% dplyr::select(-date) prior_weights <- weights[t-1,] * (1 + realised_returns) # Before rebalancing turn <- turn + apply(abs(weights[t,] - prior_weights/sum(prior_weights)),1,sum) } return(turn/(length(t_oos)-1)) } Once turnover is defined, we embed it into a function that computes several key indicators. 
perf_met <- function(portf_returns, weights, asset_returns, t_oos){ avg_ret <- mean(portf_returns, na.rm = T) # Arithmetic mean vol <- sd(portf_returns, na.rm = T) # Volatility Sharpe_ratio <- avg_ret / vol # Sharpe ratio VaR_5 <- quantile(portf_returns, 0.05) # Value-at-risk turn <- 0 # Initialisation of turnover for(t in 2:dim(weights)[1]){ realized_returns <- asset_returns %>% filter(date == t_oos[t]) %>% dplyr::select(-date) prior_weights <- weights[t-1,] * (1 + realized_returns) turn <- turn + apply(abs(weights[t,] - prior_weights/sum(prior_weights)),1,sum) } turn <- turn/(length(t_oos)-1) # Average over time met <- data.frame(avg_ret, vol, Sharpe_ratio, VaR_5, turn) # Aggregation of all of this rownames(met) <- "metrics" return(met) } Lastly, we build a function that loops on the various strategies. perf_met_multi <- function(portf_returns, weights, asset_returns, t_oos, strat_name){ J <- dim(weights)[2] # Number of strategies met <- c() # Initialization of metrics for(j in 1:J){ # One very ugly loop temp_met <- perf_met(portf_returns[, j], weights[, j, ], asset_returns, t_oos) met <- rbind(met, temp_met) } row.names(met) <- strat_name # Stores the name of the strat return(met) } Given the weights and returns of the portfolios, it remains to compute the returns of the assets to plug them in the aggregate metrics function. asset_returns <- data_ml %>% # Compute return matrix: start from data dplyr::select(date, stock_id, R1M_Usd) %>% # Keep 3 attributes spread(key = stock_id, value = R1M_Usd) # Shape in matrix format asset_returns[is.na(asset_returns)] <- 0 # Zero returns for missing points met <- perf_met_multi(portf_returns = portf_returns, # Computes performance metrics weights = portf_weights, asset_returns = asset_returns, t_oos = t_oos, strat_name = c("EW", "XGB_SR")) met # Displays perf metrics ## avg_ret vol Sharpe_ratio VaR_5 turn ## EW 0.009697248 0.05642917 0.1718481 -0.07712509 0.0714512 ## XGB_SR 0.012602882 0.06376845 0.1976351 -0.08335864 0.5679932 The ML-based strategy performs finally well! The gain is mostly obtained by the average return, while the volatility is higher than that of the benchmark. The net effect is that the Sharpe ratio is improved compared to the benchmark. The augmentation is not breathtaking, but (hence?) it seems reasonable. It is noteworthy to underline that turnover is substantially higher for the sophisticated strategy. Removing costs in the numerator (say, 0.005 times the turnover, as in Goto and Xu (2015), which is a conservative figure) only mildly reduces the superiority in Sharpe ratio of the ML-based strategy. Finally, it is always tempting to plot the corresponding portfolio values and we display two related graphs in Figure 12.3. library(lubridate) # Date management library(cowplot) # Plot grid management g1 <- tibble(date = t_oos, benchmark = cumprod(1+portf_returns[,1]), ml_based = cumprod(1+portf_returns[,2])) %>% gather(key = strat, value = value, -date) %>% ggplot(aes(x = date, y = value, color = strat)) + geom_line() +theme_grey() g2 <- tibble(year = lubridate::year(t_oos), benchmark = portf_returns[,1], ml_based = portf_returns[,2]) %>% gather(key = strat, value = value, -year) %>% group_by(year, strat) %>% summarise(avg_return = mean(value)) %>% ggplot(aes(x = year, y = avg_return, fill = strat)) + geom_col(position = "dodge") + theme_grey() plot_grid(g1,g2, nrow = 2) FIGURE 12.3: Graphical representation of the performance of the portfolios. 
Out of the 12 years of the backtest, the advanced strategy outperforms the benchmark during 10 years. It also loses less than the benchmark in two of the four years of aggregate losses (2015 and 2018). This is a satisfactory improvement because the EW benchmark is tough to beat! 12.7 Second example: backtest overfitting To end this chapter, we quantify the concepts of Section 12.4.2. First, we build a function that generates performance metrics for simple strategies, so that they can be evaluated in batches. The strategies are pure factor bets and depend on three inputs: the chosen characteristic (e.g., market capitalization), a threshold level (quantile of the characteristic) and a direction (long position in the top or bottom of the distribution). strat <- function(data, feature, thresh, direction){ data_tmp <- dplyr::select(data, feature, date, R1M_Usd) # Data colnames(data_tmp)[1] <- "feature" # Colname data_tmp %>% mutate(decision = direction * feature > direction * thresh) %>% # Investment decision group_by(date) %>% # Date-by-date analysis mutate(nb = sum(decision), # Nb assets in portfolio w = decision / nb, # Weights of assets return = w * R1M_Usd) %>% # Asset contribution summarise(p_return = sum(return)) %>% # Portfolio return summarise(avg = mean(p_return), sd = sd(p_return), SR = avg/sd) %>% # Perf. metrics return() } Then, we test the function on a triplet of arguments. We pick the price-to-book (Pb) ratio. The direction is positive (+1) and the threshold is 0.3, which means that the strategy buys the stocks that have a Pb value above the 0.3 quantile of the distribution. strat(data_ml, "Pb", 0.3, 1) # Stocks with Pb above the 0.3 quantile ## # A tibble: 1 x 3 ## avg sd SR ## <dbl> <dbl> <dbl> ## 1 0.0102 0.0496 0.207 The output keeps three quantities that will be useful to compute the statistic (12.5). We must now generate these indicators for many strategies. We start by creating the grid of parameters. feature <- c("Div_Yld", "Ebit_Bv", "Mkt_Cap_6M_Usd", "Mom_11M_Usd", "Pb", "Vol1Y_Usd") thresh <- seq(0.2,0.8, by = 0.1) # Threshold values direction <- c(1,-1) # Decision direction pars <- expand.grid(feature, thresh, direction) # The grid feature <- pars[,1] %>% as.character() # re-features thresh <- pars[,2] # re-thresholds direction <- pars[,3] # re-directions This makes 84 strategies in total. We can proceed to see how they fare. We plot the corresponding Sharpe ratios below in Figure 12.4. The top plot shows the strategies that invest in the bottoms of the distributions of characteristics (direction equal to -1), while the bottom plot pertains to the portfolios that are long in the upper parts of these distributions (direction equal to 1). grd <- pmap(list(feature, thresh, direction), # Parameters for the grid search strat, # Function on which to apply the grid search data = data_ml # Data source/input ) %>% unlist() %>% matrix(ncol = 3, byrow = T) grd <- data.frame(feature, thresh, direction, grd) # Gather & reformat results colnames(grd)[4:6] <- c("mean", "sd", "SR") # Change colnames grd <- grd %>% mutate_at(vars(direction), as.factor) # Change type: factor (for plot) grd %>% ggplot(aes(x = thresh, y = SR, color = feature)) + # Plot! geom_point() + geom_line() + facet_grid(direction~.) FIGURE 12.4: Sharpe ratios of all backtested strategies. The last step is to compute the statistic (12.5).
We code it here: DSR <- function(SR, Tt, M, g3, g4, SR_m, SR_v){ # First, we build the function gamma <- -digamma(1) # Euler-Mascheroni constant SR_star <- SR_m + sqrt(SR_v)*((1-gamma)*qnorm(1-1/M) + gamma*qnorm(1-1/M/exp(1))) # SR* num <- (SR-SR_star) * sqrt(Tt-1) # Numerator den <- sqrt(1 - g3*SR + (g4-1)/4*SR^2) # Denominator return(pnorm(num/den)) } All that remains to do is to evaluate the arguments of the function. The “best” strategy is the one on the top left corner of Figure 12.4 and it is based on market capitalization. M <- nrow(pars) # Number of strategies we tested SR <- max(grd$SR) # The SR we want to test SR_m <- mean(grd$SR) # Average SR across all strategies SR_v <- var(grd$SR) # Std dev of SR # Below, we compute the returns of the strategy by recycling the code of the strat() function data_tmp <- dplyr::select(data_ml, "Mkt_Cap_6M_Usd", date, R1M_Usd) # feature = Mkt_Cap colnames(data_tmp)[1] <- "feature" returns_DSR <- data_tmp %>% mutate(decision = feature < 0.2) %>% # Investment decision: 0.2 is the best threshold group_by(date) %>% # Date-by-date computations mutate(nb = sum(decision), # Nb assets in portfolio w = decision / nb, # Portfolio weights return = w * R1M_Usd) %>% # Asset contribution to return summarise(p_return = sum(return)) # Portfolio return g3 <- skewness(returns_DSR$p_return) # Function from the e1071 package g4 <- kurtosis(returns_DSR$p_return) + 3 # Function from the e1071 package Tt <- nrow(returns_DSR) # Number of dates DSR(SR, Tt, M, g3, g4, SR_m, SR_v) # The sought value! ## [1] 0.6676416 The value 0.6676416 is not high enough (it does not reach the 90% or 95% threshold) to make the strategy significantly superior to the other ones that were considered in the batch of tests. 12.8 Coding exercises Code the returns of the EW portfolio with tidyverse functions only (no loop). Code the advanced weighting function defined in Equation (12.3). Test it in a small backtest and check its sensitivity to the parameters. Using the functional programming package purrr, avoid the loop in the backtest. References "],["interp.html", "Chapter 13 Interpretability 13.1 Global interpretations 13.2 Local interpretations", " Chapter 13 Interpretability This chapter is dedicated to the techniques that help understand the way models process inputs into outputs. A recent book (Molnar (2019) available at https://christophm.github.io/interpretable-ml-book/) is entirely devoted to this topic and we highly recommend to have a look at it. The survey of Belle and Papantonis (2020) is also worthwhile. Another more introductory and less technical reference is Hall and Gill (2019). Obviously, in this chapter, we will adopt a tone which is factor-investing orientated and discuss examples related to ML models trained on a financial dataset. Quantitative tools that aim for interpretability of ML models are required to satisfy two simple conditions: That they provide information about the model. That they are highly comprehensible. Often, these tools generate graphical outputs which are easy to read and yield immediate conclusions. In attempts to white-box complex machine learning models, one dichotomy stands out: Global models seek to determine the relative role of features in the construction of the predictions once the model has been trained. This is done at the global level, so that the patterns that are shown in the interpretation hold on average over the whole training set. 
Local models aim to characterize how the model behaves around one particular instance by considering small variations around this instance. The way these variations are processed by the original model allows to simplify it by approximating it, e.g., in a linear fashion. This approximation can for example determine the sign and magnitude of the impact of each relevant feature in the vicinity of the original instance. Molnar (2019) proposes another classification of interpretability solutions by splitting interpretations that depend on one particular model (e.g., linear regression or decision tree) versus the interpretations that can be obtained for any kind of model. In the sequel, we present the methods according to the global versus local dichotomy. 13.1 Global interpretations 13.1.1 Simple models as surrogates Let us start with the simplest example of all. In a linear model, \\[y_i=\\alpha+\\sum_{k=1}^K\\beta_kx_i^k+\\epsilon_i,\\] the following elements are usually extracted from the estimation of the \\(\\beta_k\\): the \\(R^2\\), which appreciates the global fit of the model (possibly penalized to prevent overfitting with many regressors). The \\(R^2\\) is usually computed in-sample; the sign of the estimates \\(\\hat{\\beta}_k\\), which indicates the direction of the impact of each feature \\(x^k\\) on \\(y\\); the \\(t\\)-statistics \\(t_{\\hat{\\beta_k}}\\), which evaluate the magnitude of this impact: regardless of its direction, large statistics in absolute value reveal prominent variables. Often, the \\(t\\)-statistics are translated into \\(p\\)-values which are computed under some suitable distributional assumptions. The last two indicators are useful because they inform the user on which features matter the most and on the sign of the effect of each predictor. This gives a simplified view of how the model processes the features into the output. Most tools that aim to explain black boxes follow the same principles. Decision trees, because they are easy to picture, are also great models for interpretability. Thanks to this favorable feature, they are target benchmarks for simple models. Recently, Vidal, Pacheco, and Schiffer (2020) propose a method to reduce an ensemble of trees into a unique tree. The aim is to propose a simpler model that behaves exactly like the complex one. More generally, it is an intuitive idea to resort to simple models to proxy more complex algorithms. One simple way to do so is to build so-called surrogate models. The process is simple: train the original model \\(f\\) on features \\(\\textbf{X}\\) and labels \\(\\textbf{y}\\); train a simpler model \\(g\\) to explain the predictions of the trained model \\(\\hat{f}\\) given the features \\(\\textbf{X}\\): \\[\\hat{f}(\\textbf{X})=g(\\textbf{X})+\\textbf{error}\\] The estimated model \\(\\hat{g}\\) explains how the initial model \\(\\hat{f}\\) maps the features into the labels. To illustrate this, we use the iml package (see Molnar, Casalicchio, and Bischl (2018)). The simpler model is a tree with a depth of two. library(iml) mod <- Predictor$new(fit_RF, data = training_sample %>% dplyr::select(features)) dt <- TreeSurrogate$new(mod, maxdepth = 2) plot(dt) FIGURE 13.1: Example of surrogate tree. The representation of the tree is different, compared to those seen in Chapter 6. 
Indeed, the four possible outcomes (determined by the conditions in the top lines) no longer yield a simple value (average of the label), but more information is given, in the form of a box plot (including the interquartile range and outliers). In the above representation, it is the top right cluster that seems to have the highest rewards, with especially many upward outliers. This cluster consists of small firms with volatile past returns. 13.1.2 Variable importance (tree-based) One incredibly favorable feature of simple decision trees is their interpretability. Their visual representation is clear and straightforward. Just like regressions (which are another building block in ML), simple trees are easy to comprehend and do not suffer from the black-box rebuke that is often associated with more sophisticated tools. Indeed, both random forests and boosted trees fail to provide perfectly accurate accounts of what is happening inside the engine. In contrast, it is possible to compute the aggregate share (or importance) of each feature in the determination of the structure of the tree once it has been trained. After training, it is possible to compute, at each node \\(n\\) the gain \\(G(n)\\) obtained by the subsequent split if there are any, i.e., if the node is not a terminal leaf. It is also easy to determine which variable is chosen to perform the split, hence we write \\(\\mathcal{N}_k\\) the set of nodes for which feature \\(k\\) is chosen for the partition. Then, the global importance of each feature is given by \\[I(k)=\\sum_{n\\in \\mathcal{N}_k}G(n),\\] and it is often rescaled so that the sum of \\(I(k)\\) across all \\(k\\) is equal to one. In this case, \\(I(k)\\) measures the relative contribution of feature \\(k\\) in the reduction of loss during the training. A variable with high importance will have a greater impact on predictions. Generally, these variables are those that are located close to the root of the tree. Below, we take a look at the results obtained from the tree-based models trained in Chapter 6. We start by recycling the output from the three regression models we used. Notice that each fitted output has its own structure and importance vectors have different names. tree_VI <- fit_tree$variable.importance %>% # VI from tree model as_tibble(rownames = NA) %>% # Transform in tibble rownames_to_column("Feature") # Add feature column RF_VI <- fit_RF$importance %>% # VI from random forest as_tibble(rownames = NA) %>% # Transform in tibble rownames_to_column("Feature") # Add feature column XGB_VI <- xgb.importance(model = fit_xgb)[,1:2] # VI from boosted trees VI_trees <- tree_VI %>% left_join(RF_VI) %>% left_join(XGB_VI) # Aggregate the VIs colnames(VI_trees)[2:4] <- c("Tree", "RF", "XGB") # New column names norm_1 <- function(x){return(x / sum(x))} # Normalizing function VI_trees %>% na.omit %>% mutate_if(is.numeric, norm_1) %>% # Plotting sequence gather(key = model, value = value, -Feature) %>% ggplot(aes(x = Feature, y = value, fill = model)) + geom_col(position = "dodge") + theme(axis.text.x = element_text(angle = 35, hjust = 1)) FIGURE 13.2: Variable importance for tree-based models. In the above code, tibbles are like dataframes (they are the v2.0 of dataframes, so to speak). Given the way the graph is coded, Figure 13.2 is in fact misleading. Indeed, by construction, the simple tree model only has a small number of features with nonzero importance: in the above graph, there are only 3: capitalization, price-to-book and volatility. 
In contrast, because random forest and boosted trees are much more complex, they give some importance to many predictors. The graph shows the variables related to the simple tree model only. For scale reasons, the normalization is performed after the subset of features is chosen. We preferred to limit the number of features shown on the graph for obvious readability concerns. There are differences in the way the models rely on the features. For instance, the most important feature changes from a model to the other: the simple tree model gives the most importance to the price-to-book ratio, while the random forest bets more on volatility and boosted trees give more weight to capitalization. One defining property of random forests is that they give a chance to all features. Indeed, by randomizing the choice of predictors, each individual exogenous variable has a shot at explaining the label. Along with boosted trees, the allocation of importance is more balanced across predictors, compared to the simple tree which puts most of its eggs in just a few baskets. 13.1.3 Variable importance (agnostic) The idea of quantifying the importance of each feature in the learning process can be extended to nontree-based models. We refer to the papers mentioned in the study by Fisher, Rudin, and Dominici (2019) for more information on this stream of the literature. The premise is the same as above: the aim is to quantify to what extent one feature contributes to the learning process. One way to track the added value of one particular feature is to look at what happens if its values inside the training set are entirely shuffled. If the original feature plays an important role in the explanation of the dependent variable, then the shuffled version of the feature will lead to a much higher loss. The baseline method to assess feature importance in the general case is the following: Train the model on the original data and compute the associated loss \\(l^*\\). For each feature \\(k\\), create a new training dataset in which the feature’s values are randomly permuted. Then, evaluate the loss \\(l_k\\) of the model based on this altered sample. Rank the variable importance of each feature, computed as a difference \\(\\text{VI}_k=l_k-l^*\\) or a ratio \\(\\text{VI}_k=l_k/l^*\\). Whether to compute the losses on the training set or the testing set is an open question and remains to the appreciation of the analyst. The above procedure is of course random and can be repeated so that the importances are averaged over several trials: this improves the stability of the results. This algorithm is implemented in the FeatureImp() function of the iml R package developed by the author of Molnar (2019). We also recommend the vip package, see Greenwell and Boehmke (n.d.). Below, we implement this algorithm manually so to speak for the features appearing in Figure 13.2. We test this approach on ridge regressions and recycle the variables used in Chapter 5. We start by the first step: computing the loss on the original training sample. fit_ridge_0 <- glmnet(x_penalized_train, y_penalized_train, # Trained model alpha = 0, lambda = 0.01) l_star <- mean((y_penalized_train-predict(fit_ridge_0, x_penalized_train))^2) # Loss Next, we evaluate the loss when each of the predictors has been sequentially shuffled. To reduce computation time, we only make one round of shuffling. 
l <- c() # Initialize for(i in 1:nrow(VI_trees)){ # Loop on the features feat_name <- as.character(VI_trees[i,1]) temp_data <- training_sample %>% dplyr::select(features) # Temp feature matrix temp_data[, which(colnames(temp_data) == feat_name)] <- # Shuffles the values sample(temp_data[, which(colnames(temp_data) == feat_name)] %>% pull(1), replace = FALSE) x_penalized_temp <- temp_data %>% as.matrix() # Predictors into matrix l[i] <- mean((y_penalized_train-predict(fit_ridge_0, x_penalized_temp))^2) # = Loss } Finally, we plot the results. data.frame(Feature = VI_trees[,1], loss = l - l_star) %>% ggplot(aes(x = Feature, y = loss)) + geom_col() + theme(axis.text.x = element_text(angle = 35, hjust = 1)) FIGURE 13.3: Variable importance for a ridge regression model. The resulting importances are in line with thoses of the tree-based models: the most prominent variables are volatility-based, market capitalization-based, and the price-to-book ratio; these closely match the variables from Figure 13.2. Note that in some cases (e.g., the share turnover), the score can even be negative, which means that the predictions are more accurate than the baseline model when the values of the predictor are shuffled! 13.1.4 Partial dependence plot Partial dependence plots (PDPs) aim at showing the relationship between the output of a model and the value of a feature (we refer to section 8.2 of Friedman (2001) for an early treatment of this subject). Let us fix a feature \\(k\\). We want to understand the average impact of \\(k\\) on the predictions of the trained model \\(\\hat{f}\\). In order to do so, we assume that the feature space is random and we split it in two: \\(k\\) versus \\(-k\\), which stands for all features except for \\(k\\). The partial dependence plot is defined as \\[\\begin{equation} \\tag{13.1} \\bar{f}_k(x_k)=\\mathbb{E}[\\hat{f}(\\textbf{x}_{-k},x_k)]=\\int \\hat{f}(\\textbf{x}_{-k},x_k)d\\mathbb{P}_{-k}(\\textbf{x}_{-k}), \\end{equation}\\] where \\(d\\mathbb{P}_{-k}(\\cdot)\\) is the (multivariate) distribution of the non-\\(k\\) features \\(\\textbf{x}_{-k}\\). The above function takes the feature values \\(x_k\\) as argument and keeps all other features frozen via their sample distributions: this shows the impact of feature \\(k\\) solely. In practice, the average is evaluated using Monte-Carlo simulations: \\[\\begin{equation} \\tag{13.2} \\bar{f}_k(x_k)\\approx \\frac{1}{M}\\sum_{m=1}^M\\hat{f}\\left(x_k,\\textbf{x}_{-k}^{(m)}\\right), \\end{equation}\\] where \\(\\textbf{x}_{-k}^{(m)}\\) are independent samples of the non-\\(k\\) features. Theoretically, PDPs could be computed for more than one feature at a time. In practice, this is only possible for two features (yielding a 3D surface) and is more computationally intense. We illustrate this concept below, using the dedicated package iml (interpretable machine learning); see also the pdp package documented in Greenwell (2017). The model we seek to explain is the random forest built in Section 6.2. We recycle some variables used therein. We choose to test the impact of the price-to-book ratio on the outcome of the model. library(iml) # One package for interpretability mod_iml <- Predictor$new(fit_RF, # This line encapsulates the objects data = training_sample %>% dplyr::select(features)) pdp_PB = FeatureEffect$new(mod_iml, feature = "Pb") # This line computes the PDP for p/b ratio plot(pdp_PB) # Plot the partial dependence. FIGURE 13.4: Partial dependence plot for the price-to-book ratio on the random forest model. 
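As a complement to the packaged implementation, the Monte-Carlo average of Equation (13.2) can be approximated manually. The sketch below is ours and only indicative (the function name and its arguments are assumptions): for a given value x_k of the chosen feature, the other features are sampled from the training data and the model's predictions are averaged.
pdp_point <- function(model, data, feat, x_k, M = 200){
    sims <- data[sample(nrow(data), M, replace = TRUE), ]   # Sample the non-k features from the data
    sims[[feat]] <- x_k                                     # Freeze feature k at the value x_k
    mean(predict(model, sims))                              # Monte-Carlo average of Equation (13.2)
}
# For instance, sapply(seq(0, 1, by = 0.1), pdp_point, model = fit_RF,
#                      data = training_sample %>% dplyr::select(features), feat = "Pb")
# would trace an approximate version of the curve shown in Figure 13.4.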
The average impact of the price-to-book ratio on the predictions is decreasing. This was somewhat expected, given the conditional average of the dependent variable given the price-to-book ratio. This latter function is depicted in Figure 6.3 and shows a behavior comparable to the above curve: strongly decreasing for small value of P/B and then relatively flat. When the price-to-book ratio is low, firms are undervalued. Hence, their higher returns are in line with the value premium. Finally, we refer to Zhao and Hastie (2020) for a theoretical discussion on the causality property of PDPs. Indeed, a deep look at the construction of the PDPs suggests that they could be interpreted as a causal representation of the feature on the model’s output. 13.2 Local interpretations Whereas global interpretations seek to assess the impact of features on the output \\(overall\\), local methods try to quantify the behavior of the model on particular instances or the neighborhood thereof. Local interpretability has recently gained traction and many papers have been published on this topic. Below, we outline the most widespread methods.27 13.2.1 LIME LIME (Local Interpretable Model-Agnostic Explanations) is a methodology originally proposed by Ribeiro, Singh, and Guestrin (2016). Their aim is to provide a faithful account of the model under two constraints: simple interpretability, which implies a limited number of variables with visual or textual representation. This is to make sure any human can easily understand the outcome of the tool; local faithfulness: the explanation holds for the vicinity of the instance. The original (black-box) model is \\(f\\) and we assume we want to approximate its behavior around instance \\(x\\) with the interpretable model \\(g\\). The simple function \\(g\\) belongs to a larger class \\(G\\). The vicinity of \\(x\\) is denoted \\(\\pi_x\\) and the complexity of \\(g\\) is written \\(\\Omega(g)\\). LIME seeks an interpretation of the form \\[\\xi(x)=\\underset{g \\in G}{\\text{argmin}} \\, \\mathcal{L}(f,g,\\pi_x)+\\Omega(g),\\] where \\(\\mathcal{L}(f,g,\\pi_x)\\) is the loss function (error/imprecision) induced by \\(g\\) in the vicinity \\(\\pi_x\\) of \\(x\\). The penalization \\(\\Omega(g)\\) is for instance the number of leaves or depth of a tree, or the number of predictors in a linear regression. It now remains to define some of the above terms. The vicinity of \\(x\\) is defined by \\(\\pi_x(z)=e^{-D(x,z)^2/\\sigma^2},\\) where \\(D\\) is some distance measure and \\(\\sigma^2\\) some scaling constant. We underline that this function decreases when \\(z\\) shifts away from \\(x\\). The tricky part is the loss function. In order to minimize it, LIME generates artificial samples close to \\(x\\) and averages/sums the error on the label that the simple representation makes. For simplicity, we assume a scalar output for \\(f\\), hence the formulation is the following: \\[\\mathcal{L}(f,g,\\pi_x)=\\sum_z \\pi_x(z)(f(z)-g(z))^2\\] and the errors are weighted according to their distance from the initial instance \\(x\\): the closest points get the largest weights. In its most basic implementation, the set of models \\(G\\) consists of all linear models. In Figure 13.5, we provide a simplified diagram of how LIME works. FIGURE 13.5: Simplistic explanation of LIME: the explained instance is surrounded by a red square. Five points are generated (the triangles) and a weighted linear model is fitted accordingly (dashed grey line). 
For expositional clarity, we work with only one dependent variable. The original training sample is shown with the black points. The fitted (trained) model is represented with the blue line (smoothed conditional average) and we want to approximate how the model works around one particular instance which is highlighted by the red square around it. In order to build the approximation, we sample 5 new points around the instance (the 5 red triangles). Each triangle lies on the blue line (they are model predictions) and has a weight proportional to its size: the triangle closest to the instance has a bigger weight. Using weighted least-squares, we build a linear model that fits to these 5 points (the dashed grey line). This is the outcome of the approximation. It gives the two parameters of the model: the intercept and the slope. Both can be evaluated with standard statistical tests. The sign of the slope is important. It is fairly clear that if the instance had been taken closer to \\(x=0\\), the slope would have probably been almost flat and hence the predictor could be locally discarded. Another important detail is the number of sample points. In our explanation, we take only five, but in practice, a robust estimation usually requires around one thousand points or more. Indeed, when too few neighbors are sampled, the estimation risk is high and the approximation may be rough. We proceed with an example of implementation. There are several steps: Fit a model on some training data. Wrap everything using the lime() function. Focus on a few predictors and see their impact over a few particular instances (via the explain() function). We start with the first step. This time, we work with a boosted tree model. library(lime) # Package for LIME interpretation params_xgb <- list( # Parameters of the boosted tree max_depth = 5, # Max depth of each tree eta = 0.5, # Learning rate gamma = 0.1, # Penalization colsample_bytree = 1, # Proportion of predictors to be sampled (1 = all) min_child_weight = 10, # Min number of instances in each node subsample = 1) # Proportion of instance to be sampled (1 = all) xgb_model <- xgb.train(params_xgb, # Training of the model train_matrix_xgb, # Training data nrounds = 10) # Number of trees Then, we head on to steps two and three. As underlined above, we resort to the lime() and explain() functions. explainer <- lime(training_sample %>% dplyr::select(features_short), xgb_model) # Step 2. explanation <- explain(x = training_sample %>% # Step 3. dplyr::select(features_short) %>% dplyr::slice(1:2), # First two instances in train_sample explainer = explainer, # Explainer variable created above n_permutations = 900, # Nb samples for loss function dist_fun = "euclidean", # Dist.func. "gower" is one alternative n_features = 6 # Nb of features shown (important ones) ) plot_features(explanation, ncol = 1) # Visual display In each graph (one graph corresponds to the explanation around one instance), there are two types of information: the sign of the impact and the magnitude of the impact. The sign is revealed with the color (positive in blue, negative in red) and the magnitude is shown with the size of the rectangles. The values to the left of the graphs show the ranges of the features with which the local approximations were computed. Lastly, we briefly discuss the choice of distance function chosen in the code. It is used to evaluate the discrepancy between the true instance and a simulated one to give more or less weight to the prediction of the sampled instance. 
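In the notation introduced above, the weight attached to a simulated point \\(z\\) is \\(\\pi_x(z)=e^{-D(x,z)^2/\\sigma^2}\\). A two-line sketch (ours, with an assumed bandwidth argument) makes this explicit for numerical features:
lime_weight <- function(x, z, sigma = 1){        # x: explained instance, z: simulated point
    D <- sqrt(sum((x - z)^2))                    # Euclidean distance between x and z
    exp(-D^2 / sigma^2)                          # Kernel weight: closer points count more in the local fit
}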
Our dataset comprises only numerical data; hence, the Euclidean distance is a natural choice: \\[\\text{Euclidean}(\\textbf{x}, \\textbf{y})=\\sqrt{\\sum_{n=1}^N(x_n-y_n)^2}.\\] Another possible choice would be the Manhattan distance: \\[\\text{Manhattan}(\\textbf{x}, \\textbf{y})=\\sum_{n=1}^N|x_n-y_n|.\\] The problem with these two distances is that they fail to handle categorical variables. This is where the Gower distance steps in (Gower (1971)). The distance imposes a different treatment on features of different types (classes versus numbers essentially, but it can also handle missing data!). For categorical features, the Gower distance applies a binary treatment: the value is equal to 1 if the features are equal, and to zero if not (i.e., \\(1_{\\{x_n=y_n\\}}\\)). For numerical features, the spread is quantified as \\(1-\\frac{|x_n-y_n|}{R_n}\\), where \\(R_n\\) is the maximum absolute value the feature can take. All similarity measurements are then aggregated to yield the final score. Note that in this case, the logic is reversed (the Gower measure is in fact a similarity): \\(\\textbf{x}\\) and \\(\\textbf{y}\\) are very close if the Gower distance is close to one, and they are far away if the distance is close to zero. 13.2.2 Shapley values The approach of Shapley values is somewhat different from that of LIME and closer in spirit to PDPs. It originates from cooperative game theory (Shapley (1953)). The rationale is the following. One way to assess the impact (or usefulness) of a variable is to look at what happens if we remove this variable from the dataset. If this is very detrimental to the quality of the model (i.e., to the accuracy of its predictions), then it means that the variable is substantially valuable. The simplest way to proceed is to take all variables and remove one to evaluate its predictive ability. Shapley values are computed on a larger scale because they consider all possible combinations of variables to which they add the target predictor. Formally, this gives: \\[\\begin{equation} \\tag{13.3} \\phi_k=\\sum_{S \\subseteq \\{x_1,\\dots,x_K \\} \\backslash x_k}\\underbrace{\\frac{\\text{Card}(S)!(K-\\text{Card}(S)-1)!}{K!}}_{\\text{weight of coalition}}\\underbrace{\\left(\\hat{f}_{S \\cup \\{x_k\\}}(S \\cup \\{x_k\\})-\\hat{f}_S(S)\\right)}_{\\text{gain when adding } x_k} \\end{equation}\\] \\(S\\) is any subset of the set of features that doesn’t include feature \\(k\\) and its size is Card(\\(S\\)). In the equation above, the model \\(f\\) must be altered because it’s impossible to evaluate \\(f\\) when features are missing. In this case, there are several possible options (the most common one being to average the predictions over sampled values of the missing features). Obviously, Shapley values can take a lot of time to compute if the number of predictors is large. We refer to Chen et al. (2018) for a discussion on a simplifying method that reduces computation times in this case. Extensions of Shapley values for interpretability are studied in Lundberg and Lee (2017). Shapley values are available in R via the iml package. There are two restrictions compared to LIME. First, the features must be filtered upfront because all features are shown on the graph (which becomes illegible beyond 20 features). This is why in the code below, we use the short list of predictors (from Section 1.2). Second, instances are analyzed one at a time. We start by fitting a random forest model. fit_RF_short <- randomForest(R1M_Usd ~., # Same formula as for simple trees! 
data = training_sample %>% dplyr::select(c(features_short), "R1M_Usd"), sampsize = 10000, # Size of (random) sample for each tree replace = FALSE, # Is the sampling done with replacement? nodesize = 250, # Minimum size of terminal cluster ntree = 40, # Nb of random trees mtry = 4 # Nb of predictive variables for each tree ) We can then analyze the behavior of the model around the first instance of the training sample. predictor <- Predictor$new(fit_RF_short, # This wraps the model & data data = training_sample %>% dplyr::select(features_short), y = training_sample$R1M_Usd) shapley <- Shapley$new(predictor, # Compute the Shapley values... x.interest = training_sample %>% dplyr::select(features_short) %>% dplyr::slice(1)) # On the first instance plot(shapley) + coord_fixed(1500) + # Plot theme(axis.text.x = element_text(angle = 35, hjust = 1)) + coord_flip() FIGURE 13.6: Illustration of the Shapley method. In the output shown in Figure 13.6, we again obtain the two crucial insights: sign of the impact of the feature and relative importance (compared to other features). 13.2.3 Breakdown Breakdown (see, e.g., Staniak and Biecek (2018)) is a mixture of ideas from PDPs and Shapley values. The core of breakdown is the so-called relaxed model prediction defined in Equation (13.4). It is close in spirit to Equation (13.1). The difference is that we are working at the local level, i.e., on one particular observation, say \\(x^*\\). We want to measure the impact of a set of predictors on the prediction associated to \\(x^*\\); hence, we fix two sets \\(\\textbf{k}\\) (fixed features) and \\(-\\textbf{k}\\) (free features) and evaluate a proxy for the average prediction of the estimated model \\(\\hat{f}\\) when the set \\(\\textbf{k}\\) of predictors is fixed at the values of \\(x^*\\), that is, equal to \\(x^*_{\\textbf{k}}\\) in the expression below: \\[\\begin{equation} \\tag{13.4} \\tilde{f}_{\\textbf{k}}(x^*)=\\frac{1}{M}\\sum_{m=1}^M \\hat{f}\\left(x^{(m)}_{-\\textbf{k}},x^*_{\\textbf{k}} \\right). \\end{equation}\\] The \\(x^{(m)}\\) in the above expression are either simulated values of instances or simply sampled values from the dataset. The notation implies that the instance has some values replaced by those of \\(x^*\\), namely those that correspond to the indices \\(\\textbf{k}\\). When \\(\\textbf{k}\\) consists of all features, then \\(\\tilde{f}_{\\textbf{k}}(x^*)\\) is equal to the raw model prediction \\(\\hat{f}(x^*)\\) and when \\(\\textbf{k}\\) is empty, it is equal to the average sample value of the label (constant prediction). The quantity of interest is the so-called contribution of feature \\(j\\notin \\textbf{k}\\) with respect to data point \\(x^*\\) and set \\(\\textbf{k}\\): \\[\\phi_{\\textbf{k}}^j(x^*)=\\tilde{f}_{\\textbf{k} \\cup j}(x^*)-\\tilde{f}_{\\textbf{k}}(x^*).\\] Just as for Shapley values, the above indicator computes an average impact when augmenting the set of predictors with feature \\(j\\). By definition, it depends on the set \\(\\textbf{k}\\), so this is one notable difference with Shapley values (that span all permutations). In Staniak and Biecek (2018), the authors devise a procedure that incrementally increases or decreases the set \\(\\textbf{k}\\). This greedy idea helps alleviate the burden of computing all possible combinations of features. 
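To make Equation (13.4) more concrete, here is a minimal Monte-Carlo sketch of the relaxed prediction; it is not the breakDown internals, it reuses the fit_RF_short model, training_sample and features_short defined above, and the relaxed_pred helper is a hypothetical name of ours.
relaxed_pred <- function(model, data, x_star, fixed_vars, M = 100){  # Hypothetical helper, Equation (13.4)
    sampled <- data[sample(nrow(data), M, replace = TRUE), ]         # Draw M rows from the sample: the x^(m)
    sampled[, fixed_vars] <- x_star[rep(1, M), fixed_vars]           # Freeze the fixed features at their x* values
    mean(predict(model, sampled))                                     # Average prediction over the free features
}
x_star <- training_sample[1, features_short]                          # Instance under scrutiny (x*)
f_k  <- relaxed_pred(fit_RF_short, training_sample[, features_short], x_star, features_short[1])
f_kj <- relaxed_pred(fit_RF_short, training_sample[, features_short], x_star, features_short[1:2])
f_kj - f_k                                                            # Contribution of the second feature, given the first one is fixed
As a sanity check, fixing all features should (approximately) return the raw prediction for x*, while fixing none of them should return the average model prediction, in line with the description above.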
Moreover, a very convenient property of their algorithm is that the sum of all contributions is equal to the predicted value: \\[\\sum_j \\phi_{\\textbf{k}}^j(x^*)=f(x^*).\\] The visualization makes that very easy to see (as in Figure 13.7 below). In order to illustrate one implementation of breakdown, we train a random forest on a limited number of features, as shown below. This will increase the readability of the output of the breakdown. formula_short <- paste("R1M_Usd ~", paste(features_short, collapse = " + ")) # Model formula_short <- as.formula(formula_short) # Formula format fit_RF_short <- randomForest(formula_short, # Same formula as before data = dplyr::select(training_sample, c(features_short, "R1M_Usd")), sampsize = 10000, # Size of (random) sample for each tree replace = FALSE, # Is the sampling done with replacement? nodesize = 250, # Minimum size of terminal cluster ntree = 12, # Nb of random trees mtry = 5 # Nb of predictive variables for each tree ) Once the model is trained, the syntax for the breakdown of predictions is very simple. library(breakDown) explain_break <- broken(fit_RF_short, data_ml[6,] %>% dplyr::select(features_short), data = data_ml %>% dplyr::select(features_short)) plot(explain_break) FIGURE 13.7: Example of a breakdown output. The graphical output is intuitively interpreted. The grey bar is the prediction of the model at the chosen instance. Green bars signal a positive contribution and the yellowish rectangles show the variables with negative impact. The relative sizes indicate the importance of each feature. References "],["causality.html", "Chapter 14 Two key concepts: causality and non-stationarity 14.1 Causality 14.2 Dealing with changing environments", " Chapter 14 Two key concepts: causality and non-stationarity A prominent point of criticism faced by ML tools is their inability to uncover causality relationships between features and labels because they are mostly focused (by design) to capture correlations. Correlations are much weaker than causality because they characterize a two-way relationship (\\(\\textbf{X}\\leftrightarrow \\textbf{y}\\)), while causality specifies a direction \\(\\textbf{X}\\rightarrow \\textbf{y}\\) or \\(\\textbf{X}\\leftarrow \\textbf{y}\\). One fashionable example is sentiment. Many academic articles seem to find that sentiment (irrespectively of its definition) is a significant driver of future returns. A high sentiment for a particular stock may increase the demand for this stock and push its price up (though contrarian reasonings may also apply: if sentiment is high, it is a sign that mean-reversion is possibly about to happen). The reverse causation is also plausible: returns may well cause sentiment. If a stock experiences a long period of market growth, people become bullish about this stock and sentiment increases (this notably comes from extrapolation, see Barberis et al. (2015) for a theoretical model). In Coqueret (2020), it is found (in opposition to most findings in this field), that the latter relationship (returns \\(\\rightarrow\\) sentiment) is more likely. This result is backed by causality driven tests (see Section 14.1.1). Statistical causality is a large field and we refer to Pearl (2009) for a deep dive into this topic. Recently, researchers have sought to link causality with ML approaches (see, e.g., Peters, Janzing, and Schölkopf (2017), Heinze-Deml, Peters, and Meinshausen (2018), Arjovsky et al. (2019)). The key notion in their work is invariance. 
Often, data is collected not at once, but from different sources at different moments. Some relationships found in these different sources will change, while others may remain the same. The relationships that are invariant to changing environments are likely to stem from (and signal) causality. One counter-example is the following (related in Beery, Van Horn, and Perona (2018)): training a computer vision algorithm to discriminate between cows and camels will lead the algorithm to focus on grass versus sand! This is because most camels are pictured in the desert while cows are shown in green fields of grass. Thus, a picture of a camel on grass will be classified as cow, while a cow on sand would be labelled “camel”. It is only with pictures of these two animals in different contexts (environments) that the learner will end up truly finding what makes a cow and a camel. A camel will remain a camel no matter where it is pictured: it should be recognized as such by the learner. If so, the representation of the camel becomes invariant over all datasets and the learner has discovered causality, i.e., the true attributes that make the camel a camel (overall silhouette, shape of the back, face, color (possibly misleading!), etc.). This search for invariance makes sense for many disciplines like computer vision or natural language processing (cats will always look like cats and languages don’t change much). In finance, it is not obvious that invariance may exist. Market conditions are known to be time-varying and the relationships between firm characteristics and returns also change from year to year. One solution to this issue may simply be to embrace non-stationarity (see Section 1.1 for a definition of stationarity). In Chapter 12, we advocate to do that by updating models as frequently as possible with rolling training sets: this allows the predictions to be based on the most recent trends. In Section 14.2 below, we introduce other theoretical and practical options. 14.1 Causality Traditional machine learning models aim to uncover relationships between variables but do not usually specify directions for these relationships. One typical example is the linear regression. If we write \\(y=a+bx+\\epsilon\\), then it is also true that \\(x=b^{-1}(y-a-\\epsilon)\\), which is of course also a linear relationship (with respect to \\(y\\)). These equations do not define causation whereby \\(x\\) would be a clear determinant of \\(y\\) (\\(x \\rightarrow y\\), but the opposite could be false). 14.1.1 Granger causality The most notable tool first proposed by Granger (1969) is probably the simplest. For simplicity, we consider only two stationary processes, \\(X_t\\) and \\(Y_t\\). A strict definition of causality could be the following. \\(X\\) can be said to cause \\(Y\\), whenever, for some integer \\(k\\), \\[(Y_{t+1},\\dots,Y_{t+k})|(\\mathcal{F}_{Y,t}\\cup \\mathcal{F}_{X,t}) \\quad \\overset{d}{\\neq} \\quad (Y_{t+1},\\dots,Y_{t+k})|\\mathcal{F}_{Y,t},\\] that is, when the distribution of future values of \\(Y_t\\), conditionally on the knowledge of both processes is not the same as the distribution with the sole knowledge of the filtration \\(\\mathcal{F}_{Y,t}\\). Hence \\(X\\) does have an impact on \\(Y\\) because its trajectory alters that of \\(Y\\). Now, this formulation is too vague and impossible to handle numerically, thus we simplify the setting via a linear formulation. We keep the same notations as section 5 of the original paper by Granger (1969). 
The test consists of two regressions: \\[\\begin{align*} X_t&=\\sum_{j=1}^ma_jX_{t-j}+\\sum_{j=1}^mb_jY_{t-j} + \\epsilon_t \\\\ Y_t&=\\sum_{j=1}^mc_jX_{t-j}+\\sum_{j=1}^md_jY_{t-j} + \\nu_t \\end{align*}\\] where for simplicity, it is assumed that both processes have zero mean. The usual assumptions apply: the Gaussian noises \\(\\epsilon_t\\) and \\(\\nu_t\\) are uncorrelated in every possible way (mutually and through time). The test is the following: if one \\(b_j\\) is nonzero, then it is said that \\(Y\\) Granger-causes \\(X\\) and if one \\(c_j\\) is nonzero, \\(X\\) Granger-causes \\(Y\\). The two are not mutually exclusive and it is widely accepted that feedback loops can very well occur. Statistically, under the null hypothesis, \\(b_1=\\dots=b_m=0\\) (resp. \\(c_1=\\dots=c_m=0\\)), which can be tested using the usual Fisher (F) distribution. Obviously, the linear restriction can be dismissed but the tests are then much more complex. The main financial article in this direction is Hiemstra and Jones (1994). There are many R packages that embed Granger causality functionalities. One of the most widespread is lmtest, so we work with it below. The syntax is incredibly simple. The order is the maximum lag \\(m\\) in the above equation. We test if market capitalization averaged over the past 6 months Granger-causes 1 month ahead returns for one particular stock (the first in the sample). library(lmtest) x_granger <- training_sample %>% # X variable =... filter(stock_id ==1) %>% # ... stock nb 1 pull(Mkt_Cap_6M_Usd) # ... & Market cap y_granger <- training_sample %>% # Y variable = ... filter(stock_id ==1) %>% # ... stock nb 1 pull(R1M_Usd) # ... & 1M return fit_granger <- grangertest(x_granger, # X variable y_granger, # Y variable order = 6, # Maximum lag na.action = na.omit) # What to do with missing data fit_granger ## Granger causality test ## ## Model 1: y_granger ~ Lags(y_granger, 1:6) + Lags(x_granger, 1:6) ## Model 2: y_granger ~ Lags(y_granger, 1:6) ## Res.Df Df F Pr(>F) ## 1 149 ## 2 155 -6 4.111 0.0007554 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 The test is directional and only tests if \\(X\\) Granger-causes \\(Y\\). In order to test the reverse effect, it is required to swap the arguments in the function. In the output above, the \\(p\\)-value is very low, hence the probability of observing samples similar to ours knowing that \\(H_0\\) holds is negligible. Thus it seems that market capitalization does Granger-cause one-month returns. We nonetheless underline that Granger causality is arguably weaker than the notion defined in the next subsection. A process that Granger-causes another one simply contains useful predictive information, which is not proof of causality in a strict sense. Moreover, our test is limited to a linear model and including nonlinearities may alter the conclusion. Lastly, including other regressors (possibly omitted variables) could also change the results (see, e.g., Chow, Cotsomitis, and Kwan (2002)). 14.1.2 Causal additive models The zoo of causal models encompasses a variety of beasts (even BARTs from Section 9.5 are used for this purpose in Hahn, Murray, and Carvalho (2019)). The interested reader can have a peek at Pearl (2009), Peters, Janzing, and Schölkopf (2017), Maathuis et al. (2018) and Hünermund and Bareinboim (2019) and the references therein. One central tool in causal models is the do-calculus developed by Pearl. 
Whereas traditional probabilities \\(P[Y|X]\\) link the odds of \\(Y\\) conditionally on observing \\(X\\) take some value \\(x\\), the do(\\(\\cdot\\)) forces \\(X\\) to take value \\(x\\). This is a looking versus doing dichotomy. One classical example is the following. Observing a barometer gives a clue what the weather will be because high pressures are more often associated with sunny days: \\[P[\\text{sunny weather}|\\text{barometer says ``high''} ]>P[\\text{sunny weather}|\\text{barometer says ``low''} ],\\] but if you hack the barometer (force it to display some value), \\[P[\\text{sunny weather}|\\text{barometer hacked to ``high''} ]=P[\\text{sunny weather}|\\text{barometer hacked ``low''} ],\\] because hacking the barometer will have no impact on the weather. In short notation, when there is an intervention on the barometer, \\(P[\\text{weather}|\\text{do(barometer)}]=P[\\text{weather}]\\). This is an interesting example related to causality. The overarching variable is pressure. Pressure impacts both the weather and the barometer and this joint effect is called confounding. However, it may not be true that the barometer impacts the weather. The interested reader who wants to dive deeper into these concepts should have a closer look at the work of Judea Pearl. Do-calculus is a very powerful theoretical framework, but it is not easy to apply it to any situation or dataset (see for instance the book review Aronow and Sävje (2019)). While we do not formally present an exhaustive tour of the theory behind causal inference, we wish to show some practical implementations because they are easy to interpret. It is always hard to single out one type of model in particular so we choose one that can be explained with simple mathematical tools. We start with the simplest definition of a structural causal model (SCM), where we follow here chapter 3 of Peters, Janzing, and Schölkopf (2017). The idea behind these models is to introduce some hierarchy (i.e., some additional structure) in the model. Formally, this gives \\[\\begin{align*} X&=\\epsilon_X \\\\ Y&=f(X,\\epsilon_Y), \\end{align*}\\] where the \\(\\epsilon_X\\) and \\(\\epsilon_Y\\) are independent noise variables. Plainly, a realization of \\(X\\) is drawn randomly and has then an impact on the realization of \\(Y\\) via \\(f\\). Now this scheme could be more complex if the number of observed variables was larger. Imagine a third variable comes in so that \\[\\begin{align*} X&=\\epsilon_X \\\\ Y&=f(X,\\epsilon_Y),\\\\ Z&=g(Y,\\epsilon_Z) \\end{align*}\\] In this case, \\(X\\) has a causation effect on \\(Y\\) and then \\(Y\\) has a causation effect on \\(Z\\). We thus have the following connections: \\[\\begin{array}{ccccccc} X & &&&\\\\ &\\searrow & &&\\\\ &&Y&\\rightarrow&Z. \\\\ &\\nearrow &&\\nearrow& \\\\ \\epsilon_Y & &\\epsilon_Z \\end{array}\\] The above representation is called a graph and graph theory has its own nomenclature, which we very briefly summarize. The variables are often referred to as vertices (or nodes) and the arrows as edges. Because arrows have a direction, they are called directed edges. When two vertices are connected via an edge, they are called adjacent. A sequence of adjacent vertices is called a path, and it is directed if all edges are arrows. Within a directed path, a vertex that comes first is a parent node and the one just after is a child node. Graphs can be summarized by adjacency matrices. An adjacency matrix \\(\\textbf{A}=A_{ij}\\) is a matrix filled with zeros and ones. 
\\(A_{ij}=1\\) whenever there is an edge from vertex \\(i\\) to vertex \\(j\\). Usually, self-loops (\\(X \\rightarrow X\\)) are prohibited so that adjacency matrices have zeros on the diagonal. If we consider a simplified version of the above graph like \\(X \\rightarrow Y \\rightarrow Z\\), the corresponding adjacency matrix is \\[\\textbf{A}=\\begin{bmatrix} 0 & 1 & 0 \\\\ 0 & 0 & 1 \\\\ 0 & 0 & 0 \\end{bmatrix},\\] where letters \\(X\\), \\(Y\\), and \\(Z\\) are naturally ordered alphabetically. There are only two arrows: from \\(X\\) to \\(Y\\) (first row, second column) and from \\(Y\\) to \\(Z\\) (second row, third column). A cycle is a particular type of path that creates a loop, i.e., when the first vertex is also the last. The sequence \\(X \\rightarrow Y \\rightarrow Z \\rightarrow X\\) is a cycle. Technically, cycles pose problems. To illustrate this, consider the simple sequence \\(X \\rightarrow Y \\rightarrow X\\). This would imply that a realization of \\(X\\) causes \\(Y\\) which in turn would cause the realization of \\(X\\). While Granger causality can be viewed as allowing this kind of connection, general causal models usually avoid cycles and work with directed acyclic graphs (DAGs). Formal graph manipulations (possibly linked to do-calculus) can be computed via the causaleffect package (Tikka and Karvanen (2017)). Directed acyclic graphs can also be created and manipulated with the dagitty (Textor et al. (2016)) and ggdag packages. Equipped with these tools, we can write a very general form of models explicitly: \\[\\begin{equation} \\tag{14.1} X_j=f_j\\left(\\textbf{X}_{\\text{pa}_D(j)},\\epsilon_j \\right), \\end{equation}\\] where the noise variables are mutually independent. The notation \\(\\text{pa}_D(j)\\) refers to the set of parent nodes of vertex \\(j\\) within the graph structure \\(D\\). Hence, \\(X_j\\) is a function of all of its parents and some noise term \\(\\epsilon_j\\). An additive causal model is a mild simplification of the above specification: \\[\\begin{equation} \\tag{14.2} X_j=\\sum_{k\\in \\text{pa}_D(j)}f_{j,k}\\left(\\textbf{X}_{k} \\right)+\\epsilon_j, \\end{equation}\\] where the nonlinear effect of each variable is cumulative, hence the term ‘additive’. Note that there is no time index there. In contrast to Granger causality, there is no natural ordering. Such models are very complex and hard to estimate. The details can be found in Bühlmann et al. (2014). Fortunately, the authors have developed an R package that determines the DAG \\(D\\). Below, we build the adjacency matrix pertaining to the small set of predictor variables plus the 1-month ahead return (on the training sample). The original version of the book used the CAM package which has a very simple syntax.28 Below, we test the more recent InvariantCausalPrediction package. 
[[NOTE: the remainder of the subsection is under revision.]] # library(CAM) # Activate the package data_caus <- training_sample %>% dplyr::select(c("R1M_Usd", features_short)) # fit_cam <- CAM(data_caus) # The main function # fit_cam$Adj # Showing the adjacency matrix library(InvariantCausalPrediction) ICP(X = training_sample %>% dplyr::select(all_of(features_short)) %>% as.matrix(), Y = training_sample %>% dplyr::pull("R1M_Usd"), ExpInd = round(runif(nrow(training_sample))), alpha = 0.05) ## ## *** 2% complete: tested 2 of 128 sets of variables ## ## Invariant Linear Causal Regression at level 0.05 (including multiplicity correction for the number of variables) ## Model has been rejected at the chosen level 0.05, that is no subset of variables leads to invariance across the environments. This can be for example due to presence of ## (a) non-linearities or ## (b) hidden variables or ## (c) interventions on the target variable. ## ## We will try to extend the functionality soon to allow non-linear models and address issue (a) [non-linearity], which currently leads to rejection of the linear model. ## If the reason might be related to issue (b) [presence of hidden variables], one can use function hiddenICP which allows for hidden variables. The matrix is not too sparse, which means that the model has uncovered many relationships between the variables within the sample. Sadly, none are in the direction that is of interest for the prediction task that we seek. Indeed, the first variable is the one we want to predict and its column is empty. However, its row is full, which indicates the reverse effect: future returns cause the predictor values, which may seem rather counter-intuitive, given the nature of features. For the sake of completeness, we also provide an implementation of the pcalg package (Kalisch et al. (2012)).29 Below, an estimation via the so-called PC (named after its authors Peter Spirtes and Clark Glymour) is performed. The details of the algorithm are out of the scope of the book, and the interested reader can have a look at section 5.4 of Spirtes et al. (2000) or section 2 from Kalisch et al. (2012) for more information on this subject. We use the Rgraphviz package available at https://www.bioconductor.org/packages/release/bioc/html/Rgraphviz.html. library(pcalg) # Load packages library(Rgraphviz) est_caus <- list(C = cor(data_caus), n = nrow(data_caus)) # Compute correlations pc.fit <- pc(est_caus, indepTest = gaussCItest, # Estimate model p = ncol(data_caus),alpha = 0.01) iplotPC(pc.fit) # Plot model FIGURE 14.1: Representation of a directed graph. A bidirectional arrow is shown when the model was unable to determine the edge orientation. While the adjacency matrix is different compared to the first model, there are still no predictors that seem to have a clear causal effect on the dependent variable (first circle). 14.1.3 Structural time series models We end the topic of causality by mentioning a particular type of structural models: structural time series. Because we illustrate their relevance for a particular kind of causal inference, we closely follow the notations of Brodersen et al. (2015). The model is driven by two equations: \\[\\begin{align*} y_t&=\\textbf{Z}_t'\\boldsymbol{\\alpha}_t+\\epsilon_t \\\\ \\boldsymbol{\\alpha}_{t+1}& =\\textbf{T}_t\\boldsymbol{\\alpha}_{t}+\\textbf{R}_t\\boldsymbol{\\eta}_t. \\end{align*}\\] The dependent variable is expressed as a linear function of state variables \\(\\boldsymbol{\\alpha}_t\\) plus an error term. 
These variables are in turn linear functions of their past values plus another error term which can have a complex structure (it’s a product of a matrix \\(\\textbf{R}_t\\) with a centered Gaussian term \\(\\boldsymbol{\\eta}_t\\)). This specification nests many models as special cases, like ARIMA for instance. The goal of Brodersen et al. (2015) is to detect causal impacts via regime changes. They estimate the above model over a given training period and then predict the model’s response on some test set. If the aggregate (summed/integrated) error between the realized versus predicted values is significant (based on some statistical test), then the authors conclude that the breaking point is relevant. Originally, the aim of the approach is to quantify the effect of an intervention by looking at how a model trained before the intervention behaves after the intervention. Below, we test if the 100\\(^{th}\\) date point in the sample (April 2008) is a turning point. Arguably, this date belongs to the time span of the subprime financial crisis. We use the CausalImpact package which uses the bsts library (Bayesian structural time series). library(CausalImpact) stock1_data <- data_ml %>% filter(stock_id == 1) # Data of first stock struct_data <- data.frame(y = stock1_data$R1M_Usd) %>% # Combine label... cbind(stock1_data %>% dplyr::select(features_short)) # ... and features pre.period <- c(1,100) # Pre-break period (pre-2008) post.period <- c(101,200) # Post-break period impact <- CausalImpact(zoo(struct_data), pre.period, post.period) summary(impact) ## Posterior inference {CausalImpact} ## ## Average Cumulative ## Actual 0.016 1.638 ## Prediction (s.d.) 0.031 (0.017) 3.091 (1.712) ## 95% CI [-0.0023, 0.064] [-0.2331, 6.430] ## ## Absolute effect (s.d.) -0.015 (0.017) -1.453 (1.712) ## 95% CI [-0.048, 0.019] [-4.792, 1.871] ## ## Relative effect (s.d.) -47% (55%) -47% (55%) ## 95% CI [-155%, 61%] [-155%, 61%] ## ## Posterior tail-area probability p: 0.19309 ## Posterior prob. of a causal effect: 81% ## ## For more details, type: summary(impact, "report") #summary(impact, "report") # Get the full report (see below) The time series associated with the model are shown in Figure 14.2. plot(impact) FIGURE 14.2: Output of the causal impact study. Below, we copy and paste the report generated by the function (obtained by the commented line in the above code). The conclusions do not support a marked effect of the crisis on the model probably because the signs of the error in the post period constantly change sign. During the post-intervention period, the response variable had an average value of approx. 0.016. In the absence of an intervention, we would have expected an average response of 0.031. The 95% interval of this counterfactual prediction is [-0.0059, 0.063]. Subtracting this prediction from the observed response yields an estimate of the causal effect the intervention had on the response variable. This effect is -0.015 with a 95% interval of [-0.047, 0.022]. Summing up the individual data points during the post-intervention period (which can only sometimes be meaningfully interpreted), the response variable had an overall value of 1.64. Had the intervention not taken place, we would have expected a sum of 3.09. The 95% interval of this prediction is [-0.59, 6.34]. The above results are given in terms of absolute numbers. In relative terms, the response variable showed a decrease of -47%. The 95% interval of this percentage is [-152%, +72%]. 
This means that, although it may look as though the intervention has exerted a negative effect on the response variable when considering the intervention period as a whole, this effect is not statistically significant, and so cannot be meaningfully interpreted. The apparent effect could be the result of random fluctuations that are unrelated to the intervention. This is often the case when the intervention period is very long and includes much of the time when the effect has already worn off. It can also be the case when the intervention period is too short to distinguish the signal from the noise. Finally, failing to find a significant effect can happen when there are not enough control variables or when these variables do not correlate well with the response variable during the learning period. The probability of obtaining this effect by chance is p = 0.199. This means the effect may be spurious and would generally not be considered statistically significant. 14.2 Dealing with changing environments The most common assumption in machine learning contributions is that the samples that are studied are i.i.d. realizations of a phenomenon that we are trying to characterize. This constraint is natural because if the relationship between \\(X\\) and \\(y\\) always changes, then it is very hard to infer anything from observations. One major problem in Finance is that this is often the case: markets, behaviors, policies, etc., evolve all the time. This is at least partly related to the notion of absence of arbitrage: if a trading strategy worked all the time, all agents would eventually adopt it via herding, which would annihilate the corresponding gains.30 If the strategy is kept private, its holder would become infinitely rich, which obviously has never happened. There are several ways to define changes in environments. If we denote with \\(\\mathbb{P}_{XY}\\) the multivariate distribution of all variables (features and label), with \\(\\mathbb{P}_{XY}=\\mathbb{P}_{X}\\mathbb{P}_{Y|X}\\), then two simple changes are possible: covariate shift: \\(\\mathbb{P}_{X}\\) changes but \\(\\mathbb{P}_{Y|X}\\) does not: the features have a fluctuating distribution, but their relationship with \\(Y\\) still holds; concept drift: \\(\\mathbb{P}_{Y|X}\\) changes but \\(\\mathbb{P}_{X}\\) does not: feature distributions are stable, but their relation to \\(Y\\) is altered. Obviously, we omit the case when both items change, as it is too complex to handle. In factor investing, the feature engineering process (see Section 4.4) is partly designed to bypass the risk of covariate shift. Uniformization guarantees that the marginals stay the same but correlations between features may of course change. The main issue is probably concept drift, when the way features explain the label changes through time. In Cornuejols, Miclet, and Barra (2018),31 the authors distinguish four types of drift, which we reproduce in Figure 14.3. In factor models, changes are presumably a combination of all four types: they can be abrupt during crashes, but most of the time they are progressive (gradual or incremental) and never-ending (continuously recurring). FIGURE 14.3: Different flavors of concept change. Naturally, if we acknowledge that the environment changes, it appears logical to adapt models accordingly, i.e., dynamically. This gives rise to the so-called stability-plasticity dilemma. 
This dilemma is a trade-off between model reactiveness (new instances have an important impact on updates) and stability (these instances may not be representative of a slower trend and they may thus shift the model in a suboptimal direction). Practically, there are two ways to shift the cursor with respect to this dilemma: alter the chronological depth of the training sample (e.g., go further back in time) or, when it’s possible, allocate more weight to recent instances. We discuss the first option in Section 12.1 and the second is mentioned in Section 6.3 (though the purpose in Adaboost is precisely to let the algorithm handle the weights). In neural networks, it is possible, in all generality, to introduce instance-based weights in the computation of the loss function, though this option is not (yet) available in Keras (to the best of our knowledge: the framework evolves rapidly). For simple regressions, this idea is known as weighted least squares, wherein errors are weighted inside the loss: \\[L=\\sum_{i=1}^Iw_i(y_i-\\textbf{x}_i\\textbf{b})^2.\\] In matrix terms, \\(L=(\\textbf{y}-\\textbf{Xb})'\\textbf{W}(\\textbf{y}-\\textbf{Xb})\\), where \\(\\textbf{W}\\) is a diagonal matrix of weights. The gradient with respect to \\(\\textbf{b}\\) is equal to \\(2\\textbf{X}'\\textbf{WX}\\textbf{b}-2\\textbf{X}'\\textbf{Wy}\\) so that the loss is minimized for \\(\\textbf{b}^*=(\\textbf{X}'\\textbf{WX})^{-1}\\textbf{X}'\\textbf{Wy}\\). The standard least-squares solution is recovered for \\(\\textbf{W}=\\textbf{I}\\). In order to fine-tune the reactiveness of the model, the weights must be a function that decreases as instances become older in the sample. There is of course no perfect solution to changing financial environments. Below, we mention two routes that are taken in the ML literature to overcome the problem of non-stationarity in the data generating process. But first, we propose yet another clear verification that markets do experience time-varying distributions. 14.2.1 Non-stationarity: yet another illustration One of the most basic practices in (financial) econometrics is to work with returns (relative price changes). The simple reason is that returns seem to behave consistently through time (monthly returns are bounded, they usually lie between -1 and +1). Prices, on the other hand, shift and, often, some prices never come back to past values. This makes prices harder to study. Stationarity is a key notion in financial econometrics: it is much easier to characterize a phenomenon with distributional properties that remain the same through time (this makes them possible to capture). Sadly, the distribution of returns is not stationary: both the mean and the variance of returns change along cycles. Below, in Figure 14.4, we illustrate this fact by computing the average monthly return for all calendar years in the whole dataset. data_ml %>% mutate(year = year(date)) %>% # Create a year variable group_by(year) %>% # Group by year summarize(avg_ret = mean(R1M_Usd)) %>% # Compute average return ggplot(aes(x = year, y = avg_ret)) + geom_col() + theme_grey() FIGURE 14.4: Average monthly return on a yearly basis. These changes in the mean are also accompanied by variations in the second moment (variance/volatility). This effect, known as volatility clustering, has been widely documented ever since the theoretical breakthrough of Engle (1982) (and even well before). We refer for instance to Cont (2007) for more details on this topic; a quick verification on our sample is sketched below. 
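The same kind of one-liner can be used to eyeball the second moment: the sketch below simply replaces the average by the standard deviation of monthly returns within each calendar year (same data_ml sample and same dplyr/ggplot syntax as above).
data_ml %>%
    mutate(year = year(date)) %>%                     # Create a year variable
    group_by(year) %>%                                # Group by year
    summarize(sd_ret = sd(R1M_Usd)) %>%               # Dispersion of monthly returns (all stocks & months of the year)
    ggplot(aes(x = year, y = sd_ret)) + geom_col() +  # Plot
    theme_grey()
If volatility clustering is present, the bars should be far from flat, with crisis years standing out.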
For the computation of realized volatility in R, we strongly recommend chapter 4 in Regenstein (2018). This non-stationarity also shows up in machine learning models. Below, we estimate a pure characteristic regression with one predictor, the market capitalization averaged over the past 6 months (\\(r_{t+1,n}=\\alpha+\\beta x_{t,n}^{\\text{cap}}+\\epsilon_{t+1,n}\\)). The label is the 6-month forward return and the estimation is performed over every calendar year. data_ml %>% mutate(year = year(date)) %>% # Create a year variable group_by(year) %>% # Group by year summarize(beta_cap = lm(R6M_Usd ~ Mkt_Cap_6M_Usd) %>% # Perform regression coef() %>% # Extract coefs t() %>% # Transpose data.frame() %>% # Format into df pull(Mkt_Cap_6M_Usd)) %>% # Pull coef (remove intercept) ggplot(aes(x = year, y = beta_cap)) + geom_col() + # Plot theme_grey() FIGURE 14.5: Variations in betas with respect to 6-month market capitalization. The bars in Figure 14.5 highlight the concept drift: overall, the relationship between capitalization and returns is negative (the size effect again). Sometimes it is markedly negative, sometimes not so much. The ability of capitalization to explain returns is time-varying and models must adapt accordingly. 14.2.2 Online learning Online learning refers to a subset of machine learning in which new information arrives progressively and the integration of this flow is performed iteratively (the term ‘online’ is not linked to the internet). In order to take the latest data updates into account, it is imperative to update the model (stating the obvious). This is clearly the case in finance and this topic is closely related to the discussion on learning windows in Section 12.1. The problem is that if a 2019 model is trained on data from 2010 to 2019, the (dynamic) 2020 model will have to be re-trained with the whole dataset including the latest points from 2020. This can be heavy and including just the latest points in the learning process would substantially decrease its computational cost. In neural networks, the sequential batch updating of weights can allow a progressive change in the model. Nonetheless, this is typically impossible for decision trees because the splits are decided once and for all. One notable exception is Basak (2004), but, in that case, the construction of the trees differs strongly from the original algorithm. The simplest example of online learning is the Widrow-Hoff algorithm (originally from Widrow and Hoff (1960)). Originally, the idea comes from the so-called ADALINE (ADAptive LInear NEuron) model, which is a neural network with one hidden layer with linear activation function (i.e., like a perceptron, but with a different activation). Suppose the model is linear, that is \\(\\textbf{y}=\\textbf{Xb}+\\textbf{e}\\) (a constant can be added to the list of predictors) and that the amount of data is both massive and coming in at a high frequency so that updating the model on the full sample is proscribed because it is technically intractable. A simple and heuristic way to update the values of \\(\\textbf{b}\\) is to compute \\[\\textbf{b}_{t+1} \\longleftarrow \\textbf{b}_t-\\eta (\\textbf{x}_t\\textbf{b}_t-y_t)\\textbf{x}_t',\\] where \\(\\textbf{x}_t\\) is the row vector of instance \\(t\\). The justification is simple. The quadratic error \\((\\textbf{x}_t\\textbf{b}_t-y_t)^2\\) has a gradient with respect to \\(\\textbf{b}\\) equal to \\(2(\\textbf{x}_t\\textbf{b}_t-y_t)\\textbf{x}_t'\\); therefore, the above update is a simple example of gradient descent. 
\\(\\eta\\) must of course be quite small: if not, each new point will considerably alter \\(\\textbf{b}\\), thereby resulting in a volatile model. An exhaustive review of techniques pertaining to online learning is presented in Hoi et al. (2018) (section 4.11 is even dedicated to portfolio selection). The book Hazan and others (2016) covers online convex optimization which is a very close domain with a large overlap with online learning. The presentation below is adapted from the second and third parts of the first survey. Datasets are indexed by time: we write \\(\\textbf{X}_t\\) and \\(\\textbf{y}_t\\) for features and labels (the usual column index (\\(k\\)) and row index (\\(i\\)) will not be used in this section). Time has a bounded horizon \\(T\\). The machine learning model depends on some parameters \\(\\boldsymbol{\\theta}\\) and we denote it with \\(f_{\\boldsymbol{\\theta}}\\). At time \\(t\\) (when dataset (\\(\\textbf{X}_t\\), \\(\\textbf{y}_t\\)) is gathered), the loss function \\(L\\) of the trained model naturally depends on the data (\\(\\textbf{X}_t\\), \\(\\textbf{y}_t\\)) and on the model via \\(\\boldsymbol{\\theta}_t\\) which are the parameter values fitted to the time-\\(t\\) data. For notational simplicity, we henceforth write \\(L_t(\\boldsymbol{\\theta}_t)=L(\\textbf{X}_t,\\textbf{y}_t,\\boldsymbol{\\theta}_t )\\). The key quantity in online learning is the regret over the whole time sequence: \\[\\begin{equation} \\tag{14.3} R_T=\\sum_{t=1}^TL_t(\\boldsymbol{\\theta}_t)-\\underset{\\boldsymbol{\\theta}^*\\in \\boldsymbol{\\Theta}}{\\inf} \\ \\sum_{t=1}^TL_t(\\boldsymbol{\\theta}^*). \\end{equation}\\] The regret is the total loss incurred by the models \\(\\boldsymbol{\\theta}_t\\) minus the minimal loss that could have been obtained with full knowledge of the data sequence (hence computed in hindsight). The basic methods in online learning are in fact quite similar to the batch-training of neural networks. The updating of the parameter is based on \\[\\begin{equation} \\tag{14.4} \\textbf{z}_{t+1}=\\boldsymbol{\\theta}_t-\\eta_t\\nabla L_t(\\boldsymbol{\\theta}_t), \\end{equation}\\] where \\(\\nabla L_t(\\boldsymbol{\\theta}_t)\\) denotes the gradient of the current loss \\(L_t\\). One problem that can arise is when \\(\\textbf{z}_{t+1}\\) falls out of the bounds that are prescribed for \\(\\boldsymbol{\\theta}_t\\). Thus, the candidate vector for the new parameters, \\(\\textbf{z}_{t+1}\\), is projected onto the feasible domain which we call \\(S\\) here: \\[\\begin{equation} \\tag{14.5} \\boldsymbol{\\theta}_{t+1}=\\Pi_S(\\textbf{z}_{t+1}), \\quad \\text{with} \\quad \\Pi_S(\\textbf{u}) = \\underset{\\boldsymbol{\\theta}\\in S}{\\text{argmin}} \\ ||\\boldsymbol{\\theta}-\\textbf{u}||_2. \\end{equation}\\] Hence \\(\\boldsymbol{\\theta}_{t+1}\\) is as close as possible to the intermediate choice \\(\\textbf{z}_{t+1}\\). In Hazan, Agarwal, and Kale (2007), it is shown that under suitable assumptions (e.g., \\(L_t\\) being strictly convex with bounded gradient \\(\\left|\\left|\\underset{\\boldsymbol{\\theta}}{\\sup} \\, \\nabla L_t(\\boldsymbol{\\theta})\\right|\\right|\\le G\\)), the regret \\(R_T\\) satisfies \\[R_T \\le \\frac{G^2}{2H}(1+\\log(T)),\\] where \\(H\\) is a scaling factor for the learning rate (also called step sizes): \\(\\eta_t=(Ht)^{-1}\\). 
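The updating scheme (14.4)-(14.5) can be illustrated with a few lines on simulated data. The sketch below is only a toy example (the box constraint, the simulated data and all numerical values are arbitrary choices of ours), not a full-fledged online learner.
set.seed(42)                                           # Toy illustration of updates (14.4)-(14.5)
K <- 3
T_max <- 500
X <- matrix(rnorm(T_max * K), ncol = K)                # Simulated stream of features
b_true <- c(0.5, -0.3, 0.1)                            # True (unknown) linear parameters
y <- X %*% b_true + rnorm(T_max, sd = 0.1)             # Simulated labels
theta <- rep(0, K)                                     # Initial parameter values
H <- 1                                                 # Scaling factor for the step sizes
for(t in 1:T_max){
    eta_t <- 1 / (H * t)                               # Decreasing learning rate
    grad <- 2 * (sum(X[t, ] * theta) - y[t]) * X[t, ]  # Gradient of the time-t squared loss
    z <- theta - eta_t * grad                          # Candidate parameters, as in (14.4)
    theta <- pmin(pmax(z, -1), 1)                      # Projection on the box [-1,1]^K, as in (14.5)
}
theta                                                  # Final parameters: they should approach b_true
In a portfolio setting, the projection step would typically enforce the budget (and possibly long-only) constraints rather than a simple box.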
More sophisticated online algorithms generalize (14.4) and (14.5) by integrating the Hessian matrix \\(\\nabla^2 L_t(\\boldsymbol{\\theta}):=[\\nabla^2 L_t]_{i,j}=\\frac{\\partial}{\\partial \\boldsymbol{\\theta}_i \\partial \\boldsymbol{\\theta}_j}L_t( \\boldsymbol{\\theta})\\) and/or by including penalizations to reduce instability in \\(\\boldsymbol{\\theta}_t\\). We refer to section 2 in Hoi et al. (2018) for more details on these extensions. An interesting stream of parameter updating is that of the passive-aggressive algorithms (PAAs) formalized in Crammer et al. (2006). The base case involves classification tasks, but we stick to the regression setting below (section 5 in Crammer et al. (2006)). One strong limitation with PAAs is that they rely on the set of parameters where the loss is either zero or negligible: \\(\\boldsymbol{\\Theta}^*_\\epsilon=\\{\\boldsymbol{\\theta}, L_t(\\boldsymbol{\\theta})< \\epsilon\\}\\). For general loss functions and learner \\(f\\), this set is largely inaccessible. Thus, the algorithms in Crammer et al. (2006) are restricted to a particular case, namely linear \\(f\\) and \\(\\epsilon\\)-insensitive hinge loss: \\[L_\\epsilon(\\boldsymbol{\\theta})=\\left\\{ \\begin{array}{ll} 0 & \\text{if } \\ |\\boldsymbol{\\theta}'\\textbf{x}-y|\\le \\epsilon \\quad (\\text{close enough prediction}) \\\\ |\\boldsymbol{\\theta}'\\textbf{x}-y|- \\epsilon & \\text{if } \\ |\\boldsymbol{\\theta}'\\textbf{x}-y| > \\epsilon \\quad (\\text{prediction too far}) \\end{array}\\right.,\\] for some parameter \\(\\epsilon>0\\). If the weight \\(\\boldsymbol{\\theta}\\) is such that the model is close enough to the true value, then the loss is zero; if not, it is equal to the absolute value of the error minus \\(\\epsilon\\). In PAA, the update of the parameter is given by \\[\\boldsymbol{\\theta}_{t+1}= \\underset{\\boldsymbol{\\theta}}{\\text{argmin}} ||\\boldsymbol{\\theta}-\\boldsymbol{\\theta}_t||_2^2, \\quad \\text{subject to} \\quad L_\\epsilon(\\boldsymbol{\\theta})=0,\\] hence the new parameter values are chosen such that two conditions are satisfied: - the loss is zero (by the definition of the loss, this means that the model is close enough to the true value); - and, the parameter is as close as possible to the previous parameter values. By construction, if the model is good enough, the model does not move (passive phase), but if not, it is rapidly shifted towards values that yield satisfactory results (aggressive phase). We end this section with a historical note. Some of the ideas from online learning stem from the financial literature and from the concept of universal portfolios originally coined by Cover (1991) in particular. The setting is the following. The function \\(f\\) is assumed to be linear \\(f(\\textbf{x}_t)=\\boldsymbol{\\theta}'\\textbf{x}_t\\) and the data \\(\\textbf{x}_t\\) consists of asset returns, thus, the values are portfolio returns as long as \\(\\boldsymbol{\\theta}'\\textbf{1}_N=1\\) (the budget constraint). The loss functions \\(L_t\\) correspond to a concave utility function (e.g., logarithmic) and the regret is reversed: \\[R_T=\\underset{\\boldsymbol{\\theta}^*\\in \\boldsymbol{\\Theta}}{\\sup} \\ \\sum_{t=1}^TL_t(\\textbf{r}_t'\\boldsymbol{\\theta}^*)-\\sum_{t=1}^TL_t(\\textbf{r}_t'\\boldsymbol{\\theta}_t),\\] where \\(\\textbf{r}_t'\\) are the returns. Thus, the program is transformed to maximize a concave function. 
Several articles (often from the Computer Science or ML communities) have proposed solutions to this type of problems: Blum and Kalai (1999), Agarwal et al. (2006) and Hazan, Agarwal, and Kale (2007). Most contributions work with price data only, with the notable exception of Cover and Ordentlich (1996), which mentions external data (‘side information’). In the latter article, it is proven that constantly rebalanced portfolios distributed according to two random distributions achieve growth rates that are close to the unattainable optimal rates. The two distributions are the uniform law (equally weighting, once again) and the Dirichlet distribution with constant parameters equal to 1/2. Under this universal distribution, Cover and Ordentlich (1996) show that the wealth obtained is bounded below by: \\[\\text{wealth universal} \\ge \\frac{\\text{wealth from optimal strategy}}{2(n+1)^{(m-1)/2}}, \\] where \\(m\\) is the number of assets and \\(n\\) is the number of periods. The literature on online portfolio allocation is reviewed in Li and Hoi (2014) and outlined in more details in Li and Hoi (2018). Online learning, combined to early stopping for neural networks, is applied to factor investing in Wong et al. (2020). Finally, online learning is associated to clustering methods for portfolio choice in Khedmati and Azin (2020). 14.2.3 Homogeneous transfer learning This subsection is mostly conceptual and will not be illustrated by coded applications. The ideas behind transfer learning can be valuable in that they can foster novel ideas, which is why we briefly present them below. Transfer learning has been surveyed numerous times. One classical reference is Pan and Yang (2009), but Weiss, Khoshgoftaar, and Wang (2016) is more recent and more exhaustive. Suppose we are given two datasets \\(D_S\\) (source) and \\(D_T\\) (target). Each dataset has its own features \\(\\textbf{X}^S\\) and \\(\\textbf{X}^T\\) and labels \\(\\textbf{y}^S\\) and \\(\\textbf{y}^T\\). In classical supervised learning, the patterns of the target set are learned only through \\(\\textbf{X}^T\\) and \\(\\textbf{y}^T\\). Transfer learning proposes to improve the function \\(f^T\\) (obtained by minimizing the fit \\(y_i^T=f^T(\\textbf{x}_i^T)+\\epsilon^T_i\\) on the target data) via the function \\(f^S\\) (from \\(y_i^S=f^S(\\textbf{x}_i^S)+\\varepsilon^S_i\\) on the source data). Homogeneous transfer learning is when the feature space does not change, which is the case in our setting. In asset management, this may not always be the case if for instance new predictors are included (e.g., based on alternative data like sentiment, satellite imagery, credit card logs, etc.). There are many subcategories in transfer learning depending on what changes between the source \\(S\\) and the target \\(T\\): is it the feature space, the distribution of the labels, and/or the relationship between the two? These are the same questions as in Section 14.2. The latter case is of interest in finance because the link with non-stationarity is evident: it is when the model \\(f\\) in \\(\\textbf{y}=f(\\textbf{X})\\) changes through time. In transfer learning jargon, it is written as \\(P[\\textbf{y}^S|\\textbf{X}^S]\\neq P[\\textbf{y}^T|\\textbf{X}^T]\\): the conditional law of the label knowing the features is not the same when switching from the source to the target. Often, the term ‘domain adaptation’ is used as synonym to transfer learning. Because of a data shift, we must adapt the model to increase its accuracy. 
These topics are reviewed in a series of chapters in the collection by Quionero-Candela et al. (2009). An important and elegant result in the theory was proven by Ben-David et al. (2010) in the case of binary classification. We state it below. We consider two classifiers \\(f\\) and \\(h\\) with values in \\(\\{0,1 \\}\\). The average error between the two over the domain \\(S\\) is defined by \\[\\epsilon_S(f,h)=\\mathbb{E}_S[|f(\\textbf{x})-h(\\textbf{x})|].\\] Then, \\[\\begin{equation} \\small \\epsilon_T(f_T,h)\\le \\epsilon_S(f_S,h)+\\underbrace{2 \\sup_B|P_S(B)-P_T(B)|}_{\\text{ difference between domains }} + \\underbrace{ \\min\\left(\\mathbb{E}_S[|f_S(\\textbf{x})-f_T(\\textbf{x})|],\\mathbb{E}_T[|f_S(\\textbf{x})-f_T(\\textbf{x})|]\\right)}_{\\text{difference between the two learning tasks}}, \\nonumber \\end{equation}\\] where \\(P_S\\) and \\(P_T\\) denote the distributions of the two domains. The above inequality is a bound on the generalization performance of \\(h\\). If we take \\(f_S\\) to be the best possible classifier for \\(S\\) and \\(f_T\\) the best for \\(T\\), then the error generated by \\(h\\) in \\(T\\) is smaller than the sum of three components: - the error in the \\(S\\) space; - the distance between the two domains (by how much the data space has shifted); - the distance between the two best models (generators). One solution that is often mentioned in transfer learning is instance weighting. We present it here in a general setting. In machine learning, we seek to minimize \\[\\begin{align*} \\epsilon_T(f)=\\mathbb{E}_T\\left[L(\\text{y},f(\\textbf{X})) \\right], \\end{align*}\\] where \\(L\\) is some loss function that depends on the task (regression versus classification). This can be rearranged as \\[\\begin{align*} \\epsilon_T(f)&=\\sum_{\\textbf{y},\\textbf{X}}P_T(\\textbf{y},\\textbf{X}) L(\\text{y},f(\\textbf{X})) \\\\ &=\\sum_{\\textbf{y},\\textbf{X}}P_S(\\textbf{y},\\textbf{X})\\frac{P_T(\\textbf{y},\\textbf{X})}{P_S(\\textbf{y},\\textbf{X})} L(\\text{y},f(\\textbf{X})) \\\\ &=\\mathbb{E}_S \\left[\\frac{P_T(\\textbf{y},\\textbf{X})}{P_S(\\textbf{y},\\textbf{X})} L(\\text{y},f(\\textbf{X})) \\right] \\end{align*}\\] The key quantity is thus the transition ratio \\(\\frac{P_T(\\textbf{y},\\textbf{X})}{P_S(\\textbf{y},\\textbf{X})}\\) (Radon–Nikodym derivative under some assumptions). Of course this ratio is largely inaccessible in practice, but it is possible to find a weighting scheme (over the instances) that yields improvements over the error in the target space. The weighting scheme, just as in Coqueret and Guida (2020), can be binary, thereby simply excluding some observations in the computation of the error. Simply removing observations from the training sample can have beneficial effects. More generally, the above expression can be viewed as a theoretical invitation for user-specified instance weighting (as in Section 6.4.7). In the asset allocation parlance, this can be viewed as introducing views as to which observations are the most interesting, e.g., value stocks can be allowed to have a larger weight in the computation of the loss if the user believes they carry more relevant information. Naturally, it then always remains to minimize this loss. We close this topic by mentioning a practical application of transfer learning developed in Koshiyama et al. (2020). The authors propose a neural network architecture that allows the learning process from different strategies to be shared across several markets. 
This method is, among other things, aimed at alleviating the backtest overfitting problem. References "],["unsup.html", "Chapter 15 Unsupervised learning 15.1 The problem with correlated predictors 15.2 Principal component analysis and autoencoders 15.3 Clustering via k-means 15.4 Nearest neighbors 15.5 Coding exercise", " Chapter 15 Unsupervised learning All algorithms presented in Chapters 5 to 9 belong to the larger class of supervised learning tools. Such tools seek to unveil a mapping between predictors \\(\\textbf{X}\\) and a label \\(\\textbf{Z}\\). The supervision comes from the fact that the model is explicitly asked to explain this particular variable \\(\\textbf{Z}\\) with the data. Another important part of machine learning consists of unsupervised tasks, that is, when \\(\\textbf{Z}\\) is not specified and the algorithm tries to make sense of \\(\\textbf{X}\\) on its own. Often, relationships between the components of \\(\\textbf{X}\\) are identified. This field is much too vast to be summarized in one book, let alone one chapter. The purpose here is to briefly explain in what ways unsupervised learning can be used, especially in the data pre-processing phase. 15.1 The problem with correlated predictors Often, it is tempting to supply all predictors to an ML-fueled predictive engine. That may not be a good idea when some predictors are highly correlated. To illustrate this, the simplest example is a regression on two variables with zero mean and the following covariance and precision matrices: \\[\\boldsymbol{\\Sigma}=\\textbf{X}'\\textbf{X}=\\begin{bmatrix} 1 & \\rho \\\\ \\rho & 1 \\end{bmatrix}, \\quad \\boldsymbol{\\Sigma}^{-1}=\\frac{1}{1-\\rho^2}\\begin{bmatrix} 1 & -\\rho \\\\ -\\rho & 1 \\end{bmatrix}.\\] When the covariance/correlation \\(\\rho\\) increases towards 1 (the two variables become collinear), the scaling denominator in \\(\\boldsymbol{\\Sigma}^{-1}\\) goes to zero and the formula \\(\\hat{\\boldsymbol{\\beta}}=\\boldsymbol{\\Sigma}^{-1}\\textbf{X}'\\textbf{Z}\\) implies that one coefficient will be highly positive and one highly negative. The regression creates a spurious arbitrage between the two variables. Of course, this is very inefficient and yields disastrous results out-of-sample. We illustrate what happens when many variables are used in the regression below (Table 15.1). One illustration of the aforementioned phenomenon comes from the variables Mkt_Cap_12M_Usd and Mkt_Cap_6M_Usd, which have a correlation of 99.6% in the training sample. Both are singled out as highly significant but their signs are contradictory. Moreover, the magnitudes of their coefficients are very close (0.21 versus 0.18) so that their net effect cancels out. Naturally, providing the regression with only one of these two inputs would have been wiser. library(broom) # Package for clean regression output training_sample %>% dplyr::select(c(features, "R1M_Usd")) %>% # List of variables lm(R1M_Usd ~ . , data = .) %>% # Model: predict R1M_Usd tidy() %>% # Put output in clean format filter(abs(statistic) > 3) %>% # Keep significant predictors only knitr::kable(booktabs = TRUE, caption = "Significant predictors in the training sample.") TABLE 15.1: Significant predictors in the training sample. 
term estimate std.error statistic p.value (Intercept) 0.0405741 0.0053427 7.594323 0.0000000 Ebitda_Margin 0.0132374 0.0034927 3.789999 0.0001507 Ev_Ebitda 0.0068144 0.0022563 3.020213 0.0025263 Fa_Ci 0.0072308 0.0023465 3.081471 0.0020601 Fcf_Bv 0.0250538 0.0051314 4.882465 0.0000010 Fcf_Yld -0.0158930 0.0037359 -4.254126 0.0000210 Mkt_Cap_12M_Usd 0.2047383 0.0274320 7.463476 0.0000000 Mkt_Cap_6M_Usd -0.1797795 0.0459390 -3.913443 0.0000910 Mom_5M_Usd -0.0186690 0.0044313 -4.212972 0.0000252 Mom_Sharp_11M_Usd 0.0178174 0.0046948 3.795131 0.0001476 Ni 0.0154609 0.0044966 3.438361 0.0005854 Ni_Avail_Margin 0.0118135 0.0038614 3.059359 0.0022184 Ocf_Bv -0.0198113 0.0052939 -3.742277 0.0001824 Pb -0.0178971 0.0031285 -5.720637 0.0000000 Pe -0.0089908 0.0023539 -3.819565 0.0001337 Sales_Ps -0.0157856 0.0046278 -3.411062 0.0006472 Vol1Y_Usd 0.0114250 0.0027923 4.091628 0.0000429 Vol3Y_Usd 0.0084587 0.0027952 3.026169 0.0024771 In fact, there are several indicators for the market capitalization and maybe only one would suffice, but it is not obvious to tell which one is the best choice. To further depict correlation issues, we compute the correlation matrix of the predictors below (on the training sample). Because of its dimension, we show it graphically. As there are too many labels, we remove them. library(corrplot) # Package for plots of correlation matrices C <- cor(training_sample %>% dplyr::select(features)) # Correlation matrix corrplot(C, tl.pos='n') # Plot FIGURE 15.1: Correlation matrix of predictors. The graph of Figure 15.1 reveals several blue squares around the diagonal. For instance, the biggest square around the first third of features relates to all accounting ratios based on free cash flows. Because of this common term in their calculation, the features are naturally highly correlated. These local correlation patterns occur several times in the dataset and explain why it is not a good idea to use simple regression with this set of features. In full disclosure, multicollinearity (when predictors are correlated) can be much less a problem for ML tools than it is for pure statistical inference. In statistics, one central goal is to study the properties of \\(\\beta\\) coefficients. Collinearity perturbs this kind of analysis. In machine learning, the aim is to maximize out-of-sample accuracy. If having many predictors can be helpful, then so be it. One simple example can help clarify this matter. When building a regression tree, having many predictors will give more options for the splits. If the features make sense, then they can be useful. The same reasoning applies to random forests and boosted trees. What does matter is that the large spectrum of features helps improve the generalization ability of the model. Their collinearity is irrelevant. In the remainder of the chapter, we present two approaches that help reduce the number of predictors: the first one aims at creating new variables that are uncorrelated with each other. Low correlation is favorable from an algorithmic point of view, but the new variables lack interpretability; the second one gathers predictors into homogeneous clusters and only one feature should be chosen out of this cluster. Here the rationale is reversed: interpretability is favored over statistical properties because the resulting set of features may still include high correlations, albeit to a lesser point compared to the original one. 15.2 Principal component analysis and autoencoders The first method is a cornerstone in dimensionality reduction. 
It seeks to determine a smaller number of factors (\\(K'<K\\)) such that: - i) the level of explanatory power remains as high as possible; - ii) the resulting factors are linear combinations of the original variables; - iii) the resulting factors are orthogonal. 15.2.1 A bit of algebra In this short subsection, we define some key concepts that are required to fully understand the derivation of principal component analysis (PCA). Henceforth, we work with matrices (in bold fonts). An \\(I \\times K\\) matrix \\(\\textbf{X}\\) is orthonormal if \\(I> K\\) and \\(\\textbf{X}'\\textbf{X}=\\textbf{I}_K\\). When \\(I=K\\), the (square) matrix is called orthogonal and \\(\\textbf{X}'\\textbf{X}=\\textbf{X}\\textbf{X}'=\\textbf{I}_K\\), i.e., \\(\\textbf{X}^{-1}=\\textbf{X}'\\). One foundational result in matrix theory is the Singular Value Decomposition (SVD, see, e.g., chapter 5 in Meyer (2000)). The SVD is formulated as follows: any \\(I \\times K\\) matrix \\(\\textbf{X}\\) can be decomposed into \\[\\begin{equation} \\tag{15.1} \\textbf{X}=\\textbf{U} \\boldsymbol{\\Delta} \\textbf{V}', \\end{equation}\\] where \\(\\textbf{U}\\) (\\(I\\times I\\)) and \\(\\textbf{V}\\) (\\(K \\times K\\)) are orthogonal and \\(\\boldsymbol{\\Delta}\\) (with dimensions \\(I\\times K\\)) is diagonal, i.e., \\(\\Delta_{i,k}=0\\) whenever \\(i\\neq k\\). In addition, \\(\\Delta_{i,i}\\ge 0\\): the diagonal terms of \\(\\boldsymbol{\\Delta}\\) are nonnegative. For simplicity, we assume below that \\(\\textbf{1}_I'\\textbf{X}=\\textbf{0}_K'\\), i.e., that all columns have zero sum (and hence zero mean).32 This allows us to write that the covariance matrix is equal to its sample estimate \\(\\boldsymbol{\\Sigma}_X= \\frac{1}{I-1}\\textbf{X}'\\textbf{X}\\). One crucial feature of covariance matrices is their symmetry. Indeed, real-valued symmetric (square) matrices enjoy an SVD which is much more powerful: when \\(\\textbf{X}\\) is symmetric, there exist an orthogonal matrix \\(\\textbf{Q}\\) and a diagonal matrix \\(\\textbf{D}\\) such that \\[\\begin{equation} \\tag{15.2} \\textbf{X}=\\textbf{Q}\\textbf{DQ}'. \\end{equation}\\] This process is called diagonalization (see chapter 7 in Meyer (2000)) and conveniently applies to covariance matrices. 15.2.2 PCA The goal of PCA is to build a dataset \\(\\tilde{\\textbf{X}}\\) that has fewer columns but that keeps as much information as possible when compressing the original one, \\(\\textbf{X}\\). The key notion is the change of base, which is a linear transformation of \\(\\textbf{X}\\) into \\(\\textbf{Z}\\), a matrix with identical dimension, via \\[\\begin{equation} \\tag{15.3} \\textbf{Z}=\\textbf{XP}, \\end{equation}\\] where \\(\\textbf{P}\\) is a \\(K \\times K\\) matrix. There are of course an infinite number of ways to transform \\(\\textbf{X}\\) into \\(\\textbf{Z}\\), but two fundamental constraints help reduce the possibilities. The first constraint is that the columns of \\(\\textbf{Z}\\) be uncorrelated. Having uncorrelated features is desirable because they then all tell different stories and have zero redundancy. The second constraint is that the variance of the columns of \\(\\textbf{Z}\\) is highly concentrated. This means that a few factors (columns) will capture most of the explanatory power (signal), while most (the others) will consist predominantly of noise. 
All of this is coded in the covariance matrix of \\(\\textbf{Y}\\): the first condition imposes that the covariance matrix be diagonal; the second condition imposes that the diagonal elements, when ranked in decreasing magnitude, see their value decline (sharply if possible). The covariance matrix of \\(\\textbf{Z}\\) is \\[\\begin{equation} \\tag{15.4} \\boldsymbol{\\Sigma}_Y=\\frac{1}{I-1}\\textbf{Z}'\\textbf{Z}=\\frac{1}{I-1}\\textbf{P}'\\textbf{X}'\\textbf{XP}=\\frac{1}{I-1}\\textbf{P}'\\boldsymbol{\\Sigma}_X\\textbf{P}. \\end{equation}\\] In this expression, we plug the decomposition (15.2) of \\(\\boldsymbol{\\Sigma}_X\\): \\[\\boldsymbol{\\Sigma}_Y=\\frac{1}{I-1}\\textbf{P}'\\textbf{Q}\\textbf{DQ}'\\textbf{P},\\] thus picking \\(\\textbf{P}=\\textbf{Q}\\), we get, by orthogonality, \\(\\boldsymbol{\\Sigma}_Y=\\frac{1}{I-1}\\textbf{D}\\), that is, a diagonal covariance matrix for \\(\\textbf{Z}\\). The columns of \\(\\textbf{Z}\\) can then be re-shuffled in decreasing order of variance so that the diagonal elements of \\(\\boldsymbol{\\Sigma}_Y\\) progressively shrink. This is useful because it helps locate the factors with most informational content (the first factors). In the limit, a constant vector (with zero variance) carries no signal. The matrix \\(\\textbf{Z}\\) is a linear transformation of \\(\\textbf{X}\\), thus, it is expected to carry the same information, even though this information is coded differently. Since the columns are ordered according to their relative importance, it is simple to omit some of them. The new set of features \\(\\tilde{\\textbf{X}}\\) consists in the first \\(K'\\) (with \\(K'<K\\)) columns of \\(\\textbf{Z}\\). Below, we show how to perform PCA and visualize the output with the factoextra package. To ease readability, we use the smaller sample with few predictors. pca <- training_sample %>% dplyr::select(features_short) %>% # Smaller number of predictors prcomp() # Performs PCA pca # Show the result ## Standard deviations (1, .., p=7): ## [1] 0.4536601 0.3344080 0.2994393 0.2452000 0.2352087 0.2010782 0.1140988 ## ## Rotation (n x k) = (7 x 7): ## PC1 PC2 PC3 PC4 PC5 PC6 ## Div_Yld 0.27159946 -0.57909866 0.04572501 -0.52895604 -0.22662581 -0.506566090 ## Eps 0.42040708 -0.15008243 -0.02476659 0.33737265 0.77137719 -0.301883295 ## Mkt_Cap_12M_Usd 0.52386846 0.34323935 0.17228893 0.06249528 -0.25278113 -0.002987057 ## Mom_11M_Usd 0.04723846 0.05771359 -0.89715955 0.24101481 -0.25055884 -0.258476580 ## Ocf 0.53294744 0.19588990 0.18503939 0.23437100 -0.35759553 -0.049015486 ## Pb 0.15241340 0.58080620 -0.22104807 -0.68213576 0.30866476 -0.038674594 ## Vol1Y_Usd -0.40688963 0.38113933 0.28216181 0.15541056 -0.06157461 -0.762587677 ## PC7 ## Div_Yld 0.032011635 ## Eps 0.011965041 ## Mkt_Cap_12M_Usd 0.714319417 ## Mom_11M_Usd 0.043178747 ## Ocf -0.676866120 ## Pb -0.168799297 ## Vol1Y_Usd 0.008632062 The rotation gives the matrix \\(\\textbf{P}\\): it’s the tool that changes the base. The first row of the output indicates the standard deviation of each new factor (column). Each factor is indicated via a PC index (principal component). Often, the first PC (first column PC1 in the output) loads positively on all initial features: a convex weighted average of all predictors is expected to carry a lot of information. In the above example, it is almost the case, with the exception of volatility, which has a negative coefficient in the first PC. The second PC is an arbitrage between price-to-book (long) and dividend yield (short). 
The third PC is contrarian, as it loads heavily and negatively on momentum. Not all principal components are easy to interpret. Sometimes, it can be useful to visualize the way the principal components are built. In Figure 15.2, we show one popular representation that is used for two factors (usually the first two). library(factoextra) # Package for PCA visualization fviz_pca_var(pca, # Source of PCA decomposition col.var="contrib", gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), repel = TRUE # Avoid text overlapping ) FIGURE 15.2: Visual representation of PCA with two dimensions. The plot shows that most initial factors load positively on the first two principal components, with a few exceptions: volatility is negative for the first one, while earnings per share and dividend yield are negative for the second. The numbers indicated along the axes are the proportion of explained variance of each PC. Compared to the figures in the first line of the output, the numbers are squared and then divided by the total sum of squares. Once the rotation is known, it is possible to select a subsample of the transformed data. From the original 7 features, it is easy to pick just 4. training_sample %>% # Start from large sample dplyr::select(features_short) %>% # Keep only 7 features as.matrix() %>% # Transform in matrix multiply_by_matrix(pca$rotation[,1:4]) %>% # Rotate via PCA (first 4 columns of P) `colnames<-`(c("PC1", "PC2", "PC3", "PC4")) %>% # Change column names head() # Show first 6 lines ## PC1 PC2 PC3 PC4 ## [1,] 0.3989674 0.7578132 -0.13915223 0.3132578 ## [2,] 0.4284697 0.7587274 -0.40164338 0.3745255 ## [3,] 0.5215295 0.5679119 -0.10533870 0.2574949 ## [4,] 0.5445359 0.5335619 -0.08833864 0.2281793 ## [5,] 0.5672644 0.5339749 -0.06092424 0.2320938 ## [6,] 0.5871306 0.6420126 -0.44566482 0.3075399 These 4 factors can then be used as orthogonal features in any ML engine. The fact that the features are uncorrelated is undoubtedly an asset. But the price of this convenience is high: the features are no longer immediately interpretable. De-correlating the predictors adds yet another layer of “blackbox-ing” in the algorithm. PCA can also be used to estimate factor models. In Equation (15.3), it suffices to replace \\(\\textbf{Z}\\) with returns, \\(\\textbf{X}\\) with factor values and \\(\\textbf{P}\\) with factor loadings (see, e.g., Connor and Korajczyk (1988) for an early reference). More recently, Lettau and Pelger (2020a) and Lettau and Pelger (2020b) propose a thorough analysis of PCA estimation techniques. They notably argue that first moments of returns are important and should be included in the objective function, alongside the optimization on the second moments. We end this subsection with a technical note. Usually, PCA is performed on the covariance matrix of returns. Sometimes, it may be preferable to decompose the correlation matrix. The result may adjust substantially if the variables have very different variances (which is not really the case in the equity space). If the investment universe encompasses several asset classes, then a correlation-based PCA will reduce the importance of the most volatile class. In this case, it is as if all returns are scaled by their respective volatilities. 15.2.3 Autoencoders In a PCA, the coding from \\(\\textbf{X}\\) to \\(\\textbf{Z}\\) is straightforward, linear and works both ways: \\[\\textbf{Z}=\\textbf{X}\\textbf{P} \\quad \\text{and} \\quad \\textbf{X}=\\textbf{ZP}',\\] so that we recover \\(\\textbf{X}\\) from \\(\\textbf{Z}\\). 
This can be written differently: \\[\\begin{equation} \\tag{15.5} \\textbf{X} \\quad \\overset{\\text{encode via }\\textbf{P}}{\\longrightarrow} \\quad \\textbf{Z} \\quad \\overset{\\text{decode via } \\textbf{P}'}{\\longrightarrow} \\quad \\textbf{X} \\end{equation}\\] If we take the truncated version and seek a smaller output (with only \\(K'\\) columns), this gives: \\[\\begin{equation} \\tag{15.6} \\textbf{X}, \\ (I\\times K) \\quad \\overset{\\text{encode via }\\textbf{P}_{K'}}{\\longrightarrow} \\quad \\tilde{\\textbf{X}}, \\ (I \\times K') \\quad \\overset{\\text{decode via } \\textbf{P}'_{K'}}{\\longrightarrow} \\quad \\breve{\\textbf{X}},\\ (I \\times K), \\end{equation}\\] where \\(\\textbf{P}_{K'}\\) is the restriction of \\(\\textbf{P}\\) to the \\(K'\\) columns that correspond to the factors with the largest variances. The dimensions of matrices are indicated inside the brackets. In this case, the recoding cannot recover \\(\\textbf{X}\\) exactly but only an approximation, which we write \\(\\breve{\\textbf{X}}\\). This approximation is coded with less information, hence this new data \\(\\breve{\\textbf{X}}\\) is compressed and provides a parsimonious representation of the original sample \\(\\textbf{X}\\). An autoencoder generalizes this concept to nonlinear coding functions. Simple linear autoencoders are linked to latent factor models (see Proposition 1 in for the case of single layer autoencoders.) The scheme is the following: \\[\\begin{equation} \\tag{15.7} \\textbf{X},\\ (I\\times K) \\quad \\overset{\\text{encode via } N} {\\longrightarrow} \\quad \\tilde{\\textbf{X}}=N(\\textbf{X}), \\ (I \\times K') \\quad \\overset{\\text{decode via } N'}{\\longrightarrow} \\quad \\breve{\\textbf{X}}=N'(\\tilde{\\textbf{X}}), \\ (I \\times K), \\end{equation}\\] where the encoding and decoding functions \\(N\\) and \\(N'\\) are often taken to be neural networks. The term autoencoder comes from the fact that the target output, which we often write \\(\\textbf{Z}\\), is the original sample \\(\\textbf{X}\\). Thus, the algorithm seeks to determine the function \\(N\\) that minimizes the distance (to be defined) between \\(\\textbf{X}\\) and the output value \\(\\breve{\\textbf{X}}\\). The encoder generates an alternative representation of \\(\\textbf{X}\\), whereas the decoder tries to recode it back to its original values. Naturally, the intermediate (coded) version \\(\\tilde{\\textbf{X}}\\) is targeted to have a smaller dimension compared to \\(\\textbf{X}\\). 15.2.4 Application Autoencoders are easy to code in Keras (see Chapter 7 for more details on Keras). To underline the power of the framework, we resort to another way of coding a NN: the so-called functional API. For simplicity, we work with the small number of predictors (7). The structure of the network consists of two symmetric networks with only one intermediate layer containing 32 units. The activation function is sigmoid; this makes sense since the input has values in the unit interval. input_layer <- layer_input(shape = c(7)) # features_short has 7 columns encoder <- input_layer %>% # First, encode layer_dense(units = 32, activation = "sigmoid") %>% layer_dense(units = 4) # 4 dimensions for the output layer (same as PCA example) decoder <- encoder %>% # Then, from encoder, decode layer_dense(units = 32, activation = "sigmoid") %>% layer_dense(units = 7) # the original sample has 7 features In the training part, we optimize the MSE and use an Adam update of the weights (see Section 7.2.3). 
ae_model <- keras_model(inputs = input_layer, outputs = decoder) # Builds the model ae_model %>% compile( # Learning parameters loss = 'mean_squared_error', optimizer = 'adam', metrics = c('mean_absolute_error') ) Finally, we are ready to train the data onto itself! The evolution of loss on the training and testing samples is depicted in Figure 15.3. The decreasing pattern shows the progress of the quality in compression. fit_ae <- ae_model %>% fit(training_sample %>% dplyr::select(features_short) %>% as.matrix(), # Input training_sample %>% dplyr::select(features_short) %>% as.matrix(), # Output epochs = 15, batch_size = 512, validation_data = list(testing_sample %>% dplyr::select(features_short) %>% as.matrix(), testing_sample %>% dplyr::select(features_short) %>% as.matrix()) ) plot(fit_ae) + theme_grey() FIGURE 15.3: Output from the training of an autoencoder. In order to get the details of all weights and biases, the syntax is the following. ae_weights <- ae_model %>% get_weights() Retrieving the encoder and processing the data into the compressed format is just a matter of matrix manipulation. In practice, it is possible to build a submodel by loading the weights from the encoder (see exercise below). 15.3 Clustering via k-means The second family of unsupervised tools pertains to clustering. Features are grouped into homogeneous families of predictors. It is then possible to single out one among the group (or to create a synthetic average of all of them). Mechanically, the number of predictors is reduced. The principle is simple: among a group of variables (the reasoning would be the same for observations in the other dimension) \\(\\textbf{x}_{\\{1 \\le j \\le J\\}}\\), find the combination of \\(k<J\\) groups that minimize \\[\\begin{equation} \\tag{15.8} \\sum_{i=1}^k\\sum_{\\textbf{x}\\in S_i}||\\textbf{x}-\\textbf{m}_i||^2, \\end{equation}\\] where \\(||\\cdot ||\\) is some norm which is usually taken to be the Euclidean \\(l^2\\)-norm. The \\(S_i\\) are the groups and the minimization is run on the whole set of groups \\(\\textbf{S}\\). The \\(\\textbf{m}_i\\) are the group means (also called centroids or barycenters): \\(\\textbf{m}_i=(\\text{card}(S_i))^{-1}\\sum_{\\textbf{x}\\in S_i}\\textbf{x}\\). In order to ensure optimality, all possible arrangements must be tested, which is prohibitively long when \\(k\\) and \\(J\\) are large. Therefore, the problem is usually solved with greedy algorithms that seek (and find) solutions that are not optimal but ‘good enough’. One heuristic way to proceed is the following: Start with a (possibly random) partition of \\(k\\) clusters. For each cluster, compute the optimal mean values \\(\\textbf{m}_i^*\\) that minimizes expression (15.8). This is a simple quadratic program. Given the optimal centers \\(\\textbf{m}_i^*\\), reassign the points \\(\\textbf{x}_i\\) so that they are all the closest to their center. Repeat steps 1. and 2. until the points do not change cluster at step 2. Below, we illustrate this process with an example. From all 93 features, we build 10 clusters. set.seed(42) # Setting the random seed (the optim. 
is random) k_means <- training_sample %>% # Performs the k-means clustering dplyr::select(features) %>% as.matrix() %>% t() %>% kmeans(10) clusters <- tibble(factor = names(k_means$cluster), # Organize the cluster data cluster = k_means$cluster) %>% arrange(cluster) clusters %>% filter(cluster == 4) # Shows one particular group ## # A tibble: 4 x 2 ## factor cluster ## <chr> <int> ## 1 Asset_Turnover 4 ## 2 Bb_Yld 4 ## 3 Recurring_Earning_Total_Assets 4 ## 4 Sales_Ps 4 We single out the fourth cluster which is composed mainly of accounting ratios related to the profitability of firms. Given these 10 clusters, we can build a much smaller group of features that can then be fed to the predictive engines described in Chapters 5 to 9. The representative of a cluster can be the member that is closest to the center, or simply the center itself. This pre-processing step can nonetheless cause problems in the forecasting phase. Typically, it requires that the training data be also clustered. The extension to the testing data is not straightforward (the clusters may not be the same). 15.4 Nearest neighbors To the best of our knowledge, nearest neighbors are not used in large-scale portfolio choice applications. The reason is simple: computational cost. Nonetheless, the concept of neighbors is widespread in unsupervised learning and can be used locally as a complement to interpretability tools. Theoretical results on k-NN relating to bounds for error rates on classification tasks can be found in section 6.2 of Ripley (2007). The rationale is the following. If: the training sample is able to accurately span the distribution of \\((\\textbf{y}, \\textbf{X})\\); and the testing sample follows the same distribution as the training sample (or close enough); then the neighborhood of one instance \\(\\textbf{x}_i\\) from the testing features computed on the training sample will yield valuable information on \\(y_i\\). In what follows, we thus seek to find neighbors of one particular instance \\(\\textbf{x}_i\\) (a \\(K\\)-dimensional row vector). Note that there is a major difference with the previous section: the clustering is intended at the observation level (row) and not at the predictor level (column). Given a dataset with the same (corresponding) columns \\(\\textbf{X}_{i,k}\\), the neighbors are defined via a similarity measure (or distance) \\[\\begin{equation} \\tag{15.9} D(\\textbf{x}_j,\\textbf{x}_i)=\\sum_{k=1}^Kc_k d_k(x_{j,k},x_{i,k}), \\end{equation}\\] where the distance functions \\(d_k\\) can operate on various data types (numerical, categorical, etc.). For numerical values, \\(d_k(x_{j,k},x_{i,k})=(x_{j,k}-x_{i,k})^2\\) or \\(d_k(x_{j,k},x_{i,k})=|x_{j,k}-x_{i,k}|\\). For categorical values, we refer to the exhaustive survey by Boriah, Chandola, and Kumar (2008) which lists 14 possible measures. Finally, the \\(c_k\\) in Equation (15.9) allow some flexibility by weighting features. This is useful because both the raw values (\\(x_{i,k}\\) versus \\(x_{i,k'}\\)) and the measure outputs (\\(d_k\\) versus \\(d_{k'}\\)) can have different scales. Once the distances are computed over the whole sample, they are ranked using indices \\(l_1^i, \\dots, l_I^i\\): \\[D\\left(\\textbf{x}_{l_1^i},\\textbf{x}_i\\right) \\le D\\left(\\textbf{x}_{l_2^i},\\textbf{x}_i\\right) \\le \\dots \\le D\\left(\\textbf{x}_{l_I^i},\\textbf{x}_i\\right).\\] The nearest neighbors are those indexed by \\(l_m^i\\) for \\(m=1,\\dots,k\\). 
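To make Equation (15.9) more concrete, here is a minimal sketch on toy data (not the book's dataset), assuming purely numerical features, squared differences for the \\(d_k\\), and user-chosen weights \\(c_k\\).
set.seed(42)                                           # Reproducibility of the toy example
X_toy <- matrix(runif(5 * 3), nrow = 5, ncol = 3)      # 5 instances, 3 numerical features (illustrative)
x_target <- runif(3)                                   # The instance whose neighbors we seek
c_k <- c(1, 0.5, 2)                                    # Hypothetical feature weights
D <- apply(X_toy, 1, function(x_j) sum(c_k * (x_j - x_target)^2))  # Weighted squared distances
order(D)                                               # Ranked indices l_1, l_2, ... (closest first)
Note that the same ranking can be obtained from plain Euclidean distances (as computed by the FNN package used below) by rescaling each column by the square root of its weight beforehand.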
We leave out the case when there are problematic equalities of the type \\(D\\left(\\textbf{x}_{l_m^i},\\textbf{x}_i\\right)=D\\left(\\textbf{x}_{l_{m+1}^i},\\textbf{x}_i\\right)\\) for the sake of simplicity and because they rarely occur in practice as long as there are sufficiently many numerical predictors. Given these neighbors, it is now possible to build a prediction for the label side \\(y_i\\). The rationale is straightforward: if \\(\\textbf{x}_i\\) is close to other instances \\(\\textbf{x}_j\\), then the label value \\(y_i\\) should also be close to \\(y_j\\) (under the assumption that the features carry some predictive information over the label \\(y\\)). An intuitive prediction for \\(y_i\\) is the following weighted average: \\[\\hat{y}_i=\\frac{\\sum_{j\\neq i} h(D(\\textbf{x}_j,\\textbf{x}_i)) y_j}{\\sum_{j\\neq i} h(D(\\textbf{x}_j,\\textbf{x}_i))},\\] where \\(h\\) is a decreasing function. Thus, the further \\(\\textbf{x}_j\\) is from \\(\\textbf{x}_i\\), the smaller the weight in the average. A typical choice for \\(h\\) is \\(h(z)=e^{-az}\\) for some parameter \\(a>0\\) that determines how penalizing the distance \\(D(\\textbf{x}_j,\\textbf{x}_i)\\) is. Of course, the average can be taken in the set of \\(k\\) nearest neighbors, in which case the \\(h\\) is equal to zero beyond a particular distance threshold: \\[\\hat{y}_i=\\frac{\\sum_{j \\text{ neighbor}} h(D(\\textbf{x}_j,\\textbf{x}_i)) y_j}{\\sum_{j \\text{ neighbor}} h(D(\\textbf{x}_j,\\textbf{x}_i))}.\\] A more agnostic rule is to take \\(h:=1\\) over the set of neighbors and in this case, all neighbors have the same weight (see the old discussion by Bailey and Jain (1978) in the case of classification). For classification tasks, the procedure involves a voting rule whereby the class with the most votes wins the contest, with possible tie-breaking methods. The interested reader can have a look at the short survey in Bhatia and others (2010). For the choice of optimal \\(k\\), several complicated techniques and criteria exist (see, e.g., Ghosh (2006) and Hall et al. (2008)). Heuristic values often do the job pretty well. A rule of thumb is that \\(k=\\sqrt{I}\\) (\\(I\\) being the total number of instances) is not too far from the optimal value, unless \\(I\\) is exceedingly large. Below, we illustrate this concept. We pick one date (31th of December 2006) and single out one asset (with stock_id equal to 13). We then seek to find the \\(k=30\\) stocks that are the closest to this asset at this particular date. We resort to the FNN package that proposes an efficient computation of Euclidean distances (and their ordering). library(FNN) # Package for Fast Nearest Neighbors detection knn_data <- filter(data_ml, date == "2006-12-31") # Dataset for k-NN exercise knn_target <- filter(knn_data, stock_id == 13) %>% # Target observation dplyr::select(features) knn_sample <- filter(knn_data, stock_id != 13) %>% # All other observations dplyr::select(features) neighbors <- get.knnx(data = knn_sample, query = knn_target, k = 30) neighbors$nn.index # Indices of the k nearest neighbors ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] ## [1,] 905 876 730 548 1036 501 335 117 789 54 618 130 342 360 673 ## [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] ## [1,] 153 265 858 830 286 1150 166 946 192 340 162 951 376 785 ## [,30] ## [1,] 2 Once the neighbors and distances are known, we can compute a prediction for the return of the target stock. 
We use the function \\(h(z)=e^{-z}\\) for the weighting of instances (via the distances). knn_labels <- knn_data[as.vector(neighbors$nn.index),] %>% # y values for neighb. dplyr::select(R1M_Usd) sum(knn_labels * exp(-neighbors$nn.dist) / sum(exp(-neighbors$nn.dist))) # Pred w. k(z)=e^(-z) ## [1] 0.003042282 filter(knn_data, stock_id == 13) %>% # True y dplyr::select(R1M_Usd) ## # A tibble: 1 x 1 ## R1M_Usd ## <dbl> ## 1 0.089 The prediction is neither very good, nor very bad (the sign is correct!). However, note that this example cannot be used for predictive purposes because we use data from 2006-12-31 to predict a return at the same date. In order to avoid the forward-looking bias, the knn_sample variable should be chosen from a prior point in time. The above computations are fast (a handful of seconds at most), but hold for only one asset. In a \\(k\\)-NN exercise, each stock gets a customed prediction and the set of neighbors must be re-assessed each time. For \\(N\\) assets, \\(N(N-1)/2\\) distances must be evaluated. This is particularly costly in a backtest, especially when several parameters can be tested (the number of neighbors, \\(k\\), or \\(a\\) in the weighting function \\(h(z)=e^{-az}\\)). When the investment universe is small (when trading indices for instance), k-NN methods become computationally attractive (see for instance Chen and Hao (2017)). 15.5 Coding exercise Code the compressed version of the data (narrow training sample) via the encoder part of the autoencoder. References "],["RL.html", "Chapter 16 Reinforcement learning 16.1 Theoretical layout 16.2 The curse of dimensionality 16.3 Policy gradient 16.4 Simple examples 16.5 Concluding remarks 16.6 Exercises", " Chapter 16 Reinforcement learning Due to its increasing popularity within the Machine Learning community, we dedicate a chapter to reinforcement learning (RL). In 2019 only, more than 25 papers dedicated to RL have been submitted to (or updated on) arXiv under the q:fin (quantitative finance) classification. Applications to trading include Xiong et al. (2018) and Théate and Ernst (2020). Market microstructure is a focal framework (Wei et al. (2019), ferreira2020reinforced, karpe2020multi). Moreover, an early survey of RL-based portfolios is compiled in Sato (2019) (see also Zhang, Zohren, and Roberts (2020)) and general financial applications are discussed in Kolm and Ritter (2019b), Meng and Khushi (2019), Charpentier, Elie, and Remlinger (2020) and Mosavi et al. (2020). This shows again that RL has recently gained traction among the quantitative finance community.33 While RL is a framework much more than a particular algorithm, its efficient application in portfolio management is not straightforward, as we will show. 16.1 Theoretical layout 16.1.1 General framework In this section, we introduce the core concepts of RL and follow relatively closely the notations (and layout) of Sutton and Barto (2018), which is widely considered as a solid reference in the field, along with Bertsekas (2017). One central tool in the field is called the Markov Decision Process (MDP, see Chapter 3 in Sutton and Barto (2018)). MDPs, like all RL frameworks, involve the interaction between an agent (e.g., a trader or portfolio manager) and an environment (e.g., a financial market). The agent performs actions that may alter the state of environment and gets a reward (possibly negative) for each action. This short sequence can be repeated an arbitrary number of times, as is shown in Figure 16.1. 
FIGURE 16.1: Scheme of Markov Decision Process. R, S and A stand for reward, state and action, respectively. Given initialized values for the state of the environment (\\(S_0\\)) and reward (usually \\(R_0=0\\)), the agent performs an action (e.g., invests in some assets). This generates a reward \\(R_1\\) (e.g., returns, profits, Sharpe ratio) and also a future state of the environment (\\(S_1\\)). Based on that, the agent performs a new action and the sequence continues. When the sets of states, actions and rewards are finite, the MDP is logically called finite. In a financial framework, this is somewhat unrealistic and we discuss this issue later on. Nevertheless, it is not hard to think of simplified and discretized financial problems. For instance, the reward can be binary: win money versus lose money. In the case of only one asset, the action can also be dual: investing versus not investing. When the number of assets is sufficiently small, it is possible to set fixed proportions that lead to a reasonable number of combinations of portfolio choices, etc. We pursue our exposé with finite MDPs; they are the most common in the literature and their formal treatment is simpler. The relative simplicity of MDPs helps grasp the concepts that are common to other RL techniques. As is often the case with Markovian objects, the key notion is that of transition probability: \\[\\begin{equation} \\tag{16.1} p(s',r|s,a)=\\mathbb{P}\\left[S_t=s',R_t=r | S_{t-1}=s,A_{t-1}=a \\right], \\end{equation}\\] which is the probability of reaching state \\(s'\\) and reward \\(r\\) at time \\(t\\), conditionally on being in state \\(s\\) and performing action \\(a\\) at time \\(t-1\\). The finite sets of states and actions will be denoted with \\(\\mathcal{S}\\) and \\(\\mathcal{A}\\) henceforth. Sometimes, this probability is summed over the set of rewards, which gives the following decomposition into the transition probability between states: \\[\\begin{equation} \\tag{16.2} p(s'|s,a)=\\mathbb{P}\\left[S_t=s' | S_{t-1}=s,A_{t-1}=a \\right]=\\sum_{r}p(s',r|s,a). \\end{equation}\\] The goal of the agent is to maximize some function of the stream of rewards. This gain is usually defined as \\[\\begin{align} G_t&=\\sum_{k=0}^T\\gamma^kR_{t+k+1} \\nonumber \\\\ \\tag{16.3} &=R_{t+1} +\\gamma G_{t+1}, \\end{align}\\] i.e., it is a discounted version of the reward, where the discount factor is \\(\\gamma \\in (0,1]\\). The horizon \\(T\\) may be infinite, which is why \\(\\gamma\\) was originally introduced. Even when the rewards are bounded, the infinite sum may diverge for \\(\\gamma=1\\). That is the case if rewards don't decrease with time and there is no reason why they should. When \\(\\gamma <1\\) and rewards are bounded, convergence is assured. When \\(T\\) is finite, the task is called episodic and, otherwise, it is said to be continuous. In RL, the focal unknown to be optimized or learned is the policy \\(\\pi\\), which drives the actions of the agent. More precisely, \\(\\pi(a,s)=\\mathbb{P}[A_t=a|S_t=s]\\), that is, \\(\\pi\\) equals the probability of taking action \\(a\\) if the state of the environment is \\(s\\). This means that actions are subject to randomness, just like for mixed strategies in game theory. While this may seem disappointing because an investor would want to be sure to take the best action, it is also a good reminder that the best way to face random outcomes may well be to randomize actions as well. 
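As a quick illustration of the gain in Equation (16.3), the snippet below (with purely illustrative reward values) computes \\(G_t\\) for a finite stream of rewards and a given discount factor, and checks the recursive form.
gamma <- 0.9                                             # Discount factor
rewards <- c(0.02, -0.01, 0.03, 0.01)                    # R_{t+1}, R_{t+2}, ... (toy values)
G_t <- sum(gamma^(0:(length(rewards) - 1)) * rewards)    # G_t = sum_k gamma^k * R_{t+k+1}
G_t
G_t1 <- sum(gamma^(0:(length(rewards) - 2)) * rewards[-1])  # G_{t+1}, computed from the next rewards
rewards[1] + gamma * G_t1                                # Recursive form of (16.3): same value as G_t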
Finally, in order to try to determine the best policy, one key indicator is the so-called value function: \\[\\begin{equation} \\tag{16.4} v_\\pi(s)=\\mathbb{E}_\\pi\\left[ G_t | S_t=s \\right], \\end{equation}\\] where the time index \\(t\\) is not very relevant and omitted in the notation of the function. The index \\(\\pi\\) under the expectation operator \\(\\mathbb{E}[\\cdot]\\) simply indicates that the average is taken when the policy \\(\\pi\\) is enforced. The value function is simply equal to the average gain conditionally on the state being equal to \\(s\\). In financial terms, this is equivalent to the average profit if the agent takes actions driven by \\(\\pi\\) when the market environment is \\(s\\). More generally, it is also possible to condition not only on the state, but also on the action taken. We thus introduce the \\(q_\\pi\\) action-value function: \\[\\begin{equation} \\tag{16.5} q_\\pi(s,a)=\\mathbb{E}_\\pi\\left[ G_t | S_t=s, \\ A_t=a \\right]. \\end{equation}\\] The \\(q_\\pi\\) function is highly important because it gives the average gain when the state and action are fixed. Hence, if the current state is known, then one obvious choice is to select the action for which \\(q_\\pi(s,\\cdot)\\) is the highest. Of course, this is the best solution if the optimal value of \\(q_\\pi\\) is known, which is not always the case in practice. The value function can easily be accessed via \\(q_\\pi\\): \\(v_\\pi(s)=\\sum_a \\pi(a,s)q_\\pi(s,a)\\). The optimal \\(v_\\pi\\) and \\(q_\\pi\\) are straightforwardly defined as \\[v_*(s)=\\underset{\\pi}{\\max} \\, v_\\pi(s), \\ \\forall s\\in \\mathcal{S}, \\quad \\text{ and } \\quad q_*(s,a) =\\underset{\\pi}{\\max} \\, q_\\pi(s,a), \\ \\forall (s,a)\\in \\mathcal{S}\\times \\mathcal{A}.\\] If only \\(v_*(s)\\) is known, then the agent must span the set of actions and find those that yield the maximum value for any given state \\(s\\). Finding these optimal values is a very complicated task and many articles are dedicated to solving this challenge. One reason why finding the best \\(q_\\pi(s,a)\\) is difficult is because it depends on two elements (\\(s\\) and \\(a\\)) on one side and \\(\\pi\\) on the other. Usually, for a fixed policy \\(\\pi\\), it can be time consuming to evaluate \\(q_\\pi(s,a)\\) for a given stream of actions, states and rewards. Once \\(q_\\pi(s,a)\\) is estimated, then a new policy \\(\\pi'\\) must be tested and evaluated to determine if it is better than the original one. Thus, this iterative search for a good policy can take long. For more details on policy improvement and value function updating, we recommend chapter 4 of Sutton and Barto (2018) which is dedicated to dynamic programming. 16.1.2 Q-learning An interesting shortcut to the problem of finding \\(v_*(s)\\) and \\(q_*(s,a)\\) is to remove the dependence on the policy. Consequently, there is then of course no need to iteratively improve it. The central relationship that is required to do this is the so-called Bellman equation that is satisfied by \\(q_\\pi(s,a)\\). We detail its derivation below. First of all, we recall that \\[\\begin{align*} q_\\pi(s,a) &= \\mathbb{E}_\\pi[G_t|S_t=s,A_t=a] \\\\ &= \\mathbb{E}_\\pi[R_{t+1}+ \\gamma G_{t+1}|S_t=s,A_t=a], \\end{align*}\\] where the second equality stems from (16.3). The expression \\(\\mathbb{E}_\\pi[R_{t+1}|S_t=s,A_t=a]\\) can be further decomposed. 
Since the expectation runs over \\(\\pi\\), we need to sum over all possible actions \\(a'\\) and states \\(s'\\) and resort to \\(\\pi(a',s')\\). In addition, the sum on the \\(s'\\) and \\(r\\) arguments of the probability \\(p(s',r|s,a)=\\mathbb{P}\\left[S_{t+1}=s',R_{t+1}=r | S_t=s,A_t=a \\right]\\) gives access to the distribution of the random couple \\((S_{t+1},R_{t+1})\\) so that in the end \\(\\mathbb{E}_\\pi[R_{t+1}|S_t=s,A_t=a]=\\sum_{a', r,s'}\\pi(a',s')p(s',r|s,a) r\\). A similar reasoning applies to the second portion of \\(q_\\pi\\) and: \\[\\begin{align} q_\\pi(s,a) &=\\sum_{a',r, s'}\\pi(a',s')p(s',r|s,a) \\left[ r+\\gamma \\mathbb{E}_\\pi[ G_{t+1}|S_{t+1}=s',A_{t+1}=a']\\right] \\nonumber \\\\ \\tag{16.6} &=\\sum_{a',r,s'}\\pi(a',s')p(s',r|s,a) \\left[ r+\\gamma q_\\pi(s',a')\\right]. \\end{align}\\] This equation links \\(q_\\pi(s,a)\\) to the future \\(q_\\pi(s',a')\\) from the states and actions \\((s',a')\\) that are accessible from \\((s,a)\\). Notably, Equation (16.6) is also true for the optimal action-value function \\(q_*=\\underset{\\pi}{\\max} \\, q_\\pi(s,a)\\): \\[\\begin{align} q_*(s,a) &= \\underset{a'}{\\max} \\sum_{r,s'}p(s',r|s,a) \\left[ r+\\gamma q_*(s',a')\\right], \\\\ &= \\mathbb{E}_{\\pi^*}[r|s,a]+ \\gamma \\, \\sum_{r,s'}p(s',r|s,a) \\left( \\underset{a'}{\\max} q_*(s',a') \\right) \\tag{16.7} \\end{align}\\] because one optimal policy is one that maximizes \\(q_\\pi(s,a)\\), for a given state \\(s\\) and over all possible actions \\(a\\). This expression is central to a cornerstone algorithm in reinforcement learning called \\(Q\\)-learning (the formal proof of convergence is outlined in Watkins and Dayan (1992)). In \\(Q\\)-learning, the state-action function no longer depends on policy and is written with capital \\(Q\\). The process is the following: Initialize values \\(Q(s,a)\\) for all states \\(s\\) and actions \\(a\\). For each episode: \\[ (\\textbf{QL}) \\quad \\left\\{ \\begin{array}{l} \\text{0. Initialize state } S_0 \\text{ and for each iteration } i \\text{ until the end of the episode;} \\\\ \\text{1. observe state } s_i; \\\\ \\text{2. perform action } a_i \\text{ (depending on } Q); \\\\ \\text{3. receive reward }r_{i+1} \\text{ and observe state } s_{i+1}; \\\\ \\text{4. Update } Q \\text{ as follows: } \\end{array} \\right.\\] \\[\\begin{equation} \\tag{16.8} Q_{i+1}(s_i,a_i) \\longleftarrow Q_i(s_i,a_i) + \\eta \\left(\\underbrace{r_{i+1}+\\gamma \\, \\underset{a}{\\max} \\, Q_i(s_{i+1},a)}_{\\text{echo of } (16.7)}-Q_i(s_i,a_i) \\right) \\end{equation}\\] The underlying reason this update rule works can be linked to fixed point theorems of contraction mappings. If a function \\(f\\) satisfies \\(|f(x)-f(y)|< \\delta |x-y|\\) with \\(\\delta<1\\) (i.e., \\(f\\) is a contraction), then a fixed point \\(z\\) satisfying \\(f(z)=z\\) can be iteratively obtained via \\(z \\leftarrow f(z)\\). This updating rule converges to the fixed point. Equation (16.7) can be solved using a similar principle, except that a learning rate \\(\\eta\\) slows the learning process but also ensures convergence under some technical assumptions. More generally, (16.8) has a form that is widespread in reinforcement learning and that is summarized in Equation (2.4) of Sutton and Barto (2018): \\[\\begin{equation} \\tag{16.9} \\text{New estimate} \\leftarrow \\text{Old estimate + Step size (}i.e., \\text{ learning rate)} \\times (\\text{Target - Old estimate}), \\end{equation}\\] where the last part can be viewed as an error term. 
Starting from the old estimate, the new estimate therefore goes in the ‘right’ (or sought) direction, scaled by a step size that makes sure that the magnitude of the move is not too large. The update rule in (16.8) is often referred to as ‘temporal difference’ learning because it is driven by the improvement yielded by estimates that are known at time \\(t+1\\) (target) versus those known at time \\(t\\). One important step of the Q-learning sequence (QL) is the second one where the action \\(a_i\\) is picked. In RL, the best algorithms combine two features: exploitation and exploration. Exploitation is when the machine uses the current information at its disposal to choose the next action. In this case, for a given state \\(s_i\\), it chooses the action \\(a_i\\) that maximizes the expected reward \\(Q_i(s_i,a_i)\\). While obvious, this choice is not optimal if the current function \\(Q_i\\) is relatively far from the true \\(Q\\). Repeating the locally optimal strategy is likely to favor a limited number of actions, which will narrowly improve the accuracy of the \\(Q\\) function. In order to gather new information stemming from actions that have not been tested much (but that can potentially generate higher rewards), exploration is needed. This is when an action \\(a_i\\) is chosen randomly. The most common way to combine these two concepts is called \\(\\epsilon\\)-greedy exploration. The action \\(a_i\\) is assigned according to: \\[\\begin{equation} \\tag{16.10} a_i=\\left\\{ \\begin{array}{c l} \\underset{a}{\\text{argmax}} \\ Q_i(s_i,a) & \\text{ with probability } 1-\\epsilon \\\\ \\text{randomly (uniformly) over } \\mathcal{A} & \\text{ with probability } \\epsilon \\end{array}\\right. . \\end{equation}\\] Thus, with probability \\(\\epsilon\\), the algorithm explores and with probability \\(1-\\epsilon\\), it exploits the current knowledge of the expected reward and picks the best action. Because all actions have a non-zero probability of being chosen, the policy is called “soft”. Indeed, the best action has a probability of selection equal to \\(1-\\epsilon(1-\\text{card}(\\mathcal{A})^{-1})\\), while all other actions are picked with probability \\(\\epsilon/\\text{card}(\\mathcal{A})\\). 16.1.3 SARSA In \\(Q\\)-learning, the algorithm seeks to find the action-value function of the optimal policy. Thus, the policy that is followed to pick actions is different from the one that is learned (via \\(Q\\)). Such algorithms are called off-policy. On-policy algorithms seek to improve the estimation of the action-value function \\(q_\\pi\\) by continuously acting according to the policy \\(\\pi\\). One canonical example of on-policy learning is the SARSA method, which requires two consecutive states and actions (hence the acronym SARSA: State-Action-Reward-State-Action). The way the quintuple \\((S_t,A_t,R_{t+1}, S_{t+1}, A_{t+1})\\) is processed is presented below. The main difference between \\(Q\\)-learning and SARSA is the update rule. In SARSA, it is given by \\[\\begin{equation} \\tag{16.11} Q_{i+1}(s_i,a_i) \\longleftarrow Q_i(s_i,a_i) + \\eta \\left(r_{i+1}+\\gamma \\, Q_i(s_{i+1},a_{i+1})-Q_i(s_i,a_i) \\right) \\end{equation}\\] The improvement comes only from the local point \\(Q_i(s_{i+1},a_{i+1})\\) that is based on the new states and actions (\\(s_{i+1},a_{i+1}\\)), whereas in \\(Q\\)-learning, it comes from all possible actions of which only the best is retained \\(\\underset{a}{\\max} \\, Q_i(s_{i+1},a)\\). 
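To contrast the update rules (16.8) and (16.11) on a single transition, here is a minimal sketch with a toy \\(Q\\) table; the states, actions and numerical values are purely illustrative.
eta <- 0.1                                        # Learning rate
gamma <- 0.9                                      # Discount factor
Q <- matrix(0.05, nrow = 2, ncol = 3,             # Toy Q table: 2 states x 3 actions
            dimnames = list(c("neg", "pos"), c("sell", "hold", "buy")))
s <- "neg"; a <- "buy"                            # Current state and chosen action
r <- -0.02; s_new <- "pos"; a_new <- "hold"       # Observed reward, next state, next action
Q_qlearning <- Q[s, a] + eta * (r + gamma * max(Q[s_new, ]) - Q[s, a])   # (16.8): uses the best next action
Q_sarsa     <- Q[s, a] + eta * (r + gamma * Q[s_new, a_new] - Q[s, a])   # (16.11): uses the action actually taken
c(Q_learning = Q_qlearning, SARSA = Q_sarsa)      # The two candidate updates for Q(s, a)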
A more robust but also more computationally demanding version of SARSA is expected SARSA in which the target \\(Q\\) function is averaged over all actions: \\[\\begin{equation} \\tag{16.12} Q_{i+1}(s_i,a_i) \\longleftarrow Q_i(s_i,a_i) + \\eta \\left(r_{i+1}+\\gamma \\, \\sum_a \\pi(a,s_{i+1}) Q_i(s_{i+1},a) -Q_i(s_i,a_i) \\right) \\end{equation}\\] Expected SARSA is less volatile than SARSA because the latter is strongly impacted by the random choice of \\(a_{i+1}\\). In expected SARSA, the average smoothes the learning process. 16.2 The curse of dimensionality Let us first recall that reinforcement learning is a framework that is not linked to a particular algorithm. In fact, different tools can very well co-exist in an RL task (AlphaGo combined both tree methods and neural networks, see Silver et al. (2016)). Nonetheless, any RL attempt will always rely on the three key concepts: the states, actions and rewards. In factor investing, they are fairly easy to identify, though there is always room for interpretation. Actions are evidently defined by portfolio compositions. The states can be viewed as the current values that describe the economy: as a first-order approximation, it can be assumed that the feature levels fulfill this role (possibly conditioned or complemented with macro-economic data). The rewards are even more straightforward. Returns or any relevant performance metric34 can account for rewards. A major problem lies in the dimensionality of both states and actions. Assuming an absence of leverage (no negative weights), the actions take values on the simplex \\[\\begin{equation} \\tag{16.13} \\mathbb{S}_N=\\left\\{ \\mathbf{x} \\in \\mathbb{R}^N\\left|\\sum_{n=1}^Nx_n=1, \\ x_n\\ge 0, \\ \\forall n=1,\\dots,N \\right.\\right\\} \\end{equation}\\] and assuming that all features have been uniformized, their space is \\([0,1]^{NK}\\). Needless to say, the dimensions of both spaces are numerically impractical. A simple solution to this problem is discretization: each space is divided into a small number of categories. Some authors do take this route. In Yang, Yu, and Almahdi (2018), the state space is discretized into three values depending on volatility, and actions are also split into three categories. Bertoluzzo and Corazza (2012), Xiong et al. (2018) and Taghian, Asadi, and Safabakhsh (2020) also choose three possible actions (buy, hold, sell). In Almahdi and Yang (2019), the learner is expected to yield binary signals for buying or shorting. García-Galicia, Carsteanu, and Clempner (2019) consider a larger state space (8 elements) but restrict the action set to 3 options.35 In terms of the state space, all articles assume that the state of the economy is determined by prices (or returns). One strong limitation of these approaches is the marked simplification they imply. Realistic discretizations are numerically intractable when investing in multiple assets. Indeed, splitting the unit interval into \\(h\\) points yields \\(h^{NK}\\) possibilities for feature values. The number of options for weight combinations also increases exponentially with \\(N\\). As an example: just 10 possible values for 10 features of 10 stocks yield \\(10^{100}\\) combinations (see the back-of-the-envelope sketch below). The problems mentioned above are of course not restricted to portfolio construction. Many solutions have been proposed to solve Markov Decision Processes in continuous spaces. We refer for instance to Section 4 in Powell and Ma (2011) for a review of early methods (outside finance). 
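A back-of-the-envelope calculation makes this combinatorial explosion explicit; the grid sizes below are hypothetical.
h <- 10; N <- 10; K <- 10               # Hypothetical grid size, number of stocks, number of features
N * K * log10(h)                        # log10 of h^(NK): 100, i.e., 10^100 feature configurations
g <- 10                                 # Long-only weights on a grid 0, 1/g, ..., 1
choose(N + g - 1, N - 1)                # Number of admissible weight vectors (stars and bars): 92378
Even this modest weight grid yields close to a hundred thousand candidate portfolios for only 10 assets, and the count grows very quickly with the universe size.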
This curse of dimensionality is accompanied by the fundamental question of training data. Two options are conceivable: market data versus simulations. Under a given controlled generator of samples, it is hard to imagine that the algorithm will beat the solution that maximizes a given utility function. If anything, it should converge towards the static optimal solution under a stationary data generating process (see, e.g., Chaouki et al. (2020) for trading tasks), which is by the way a very strong modelling assumption. This leaves market data as a preferred solution but even with large datasets, there is little chance to cover all the (actions, states) combinations mentioned above. Characteristics-based datasets have depths that run through a few decades of monthly data, which means several hundreds of time-stamps at most. This is by far too limited to allow for a reliable learning process. It is always possible to generate synthetic data (as in Yu et al. (2019)), but it is unclear that this will solidly improve the performance of the algorithm. 16.3 Policy gradient 16.3.1 Principle Beyond the discretization of action and state spaces, a powerful trick is parametrization. When \\(a\\) and \\(s\\) can take discrete values, action-value functions must be computed for all pairs \\((a,s)\\), which can be prohibitively cumbersome. An elegant way to circumvent this problem is to assume that the policy is driven by a relatively modest number of parameters. The learning process is then focused on optimizing this set of parameters \\(\\boldsymbol{\\theta}\\). We then write \\(\\pi_{\\boldsymbol{\\theta}}(a,s)\\) for the probability of choosing action \\(a\\) in state \\(s\\). One intuitive way to define \\(\\pi_{\\boldsymbol{\\theta}}(a,s)\\) is to resort to a soft-max form: \\[\\begin{equation} \\tag{16.14} \\pi_{\\boldsymbol{\\theta}}(a,s) = \\frac{e^{\\boldsymbol{\\theta}'\\textbf{h}(a,s)}}{\\sum_{b}e^{\\boldsymbol{\\theta}'\\textbf{h}(b,s)}}, \\end{equation}\\] where the output of function \\(\\textbf{h}(a,s)\\), which has the same dimension as \\(\\boldsymbol{\\theta}\\) is called a feature vector representing the pair \\((a,s)\\). Typically, \\(\\textbf{h}\\) can very well be a simple neural network with two input units and an output dimension equal to the length of \\(\\boldsymbol{\\theta}\\). One desired property for \\(\\pi_{\\boldsymbol{\\theta}}\\) is that it be differentiable with respect to \\(\\boldsymbol{\\theta}\\) so that \\(\\boldsymbol{\\theta}\\) can be improved via some gradient method. The most simple and intuitive results about policy gradients are known in the case of episodic tasks (finite horizon) for which it is sought to maximize the average gain \\(\\mathbb{E}_{\\boldsymbol{\\theta}}[G_t]\\) where the gain is defined in Equation (16.3). The expectation is computed according to a particular policy that depends on \\(\\boldsymbol{\\theta}\\), this is why we use a simple subscript. One central result is the so-called policy gradient theorem which states that \\[\\begin{equation} \\tag{16.15} \\nabla \\mathbb{E}_{\\boldsymbol{\\theta}}[G_t]=\\mathbb{E}_{\\boldsymbol{\\theta}} \\left[G_t\\frac{\\nabla \\pi_{\\boldsymbol{\\theta}}}{\\pi_{\\boldsymbol{\\theta}}} \\right]. \\end{equation}\\] This result can then be used for gradient ascent: when seeking to maximize a quantity, the parameter change must go in the upward direction: \\[\\begin{equation} \\tag{16.16} \\boldsymbol{\\theta} \\leftarrow \\boldsymbol{\\theta} + \\eta \\nabla \\mathbb{E}_{\\boldsymbol{\\theta}}[G_t]. 
\\end{equation}\\] This simple update rule is known as the REINFORCE algorithm. One improvement of this simple idea is to add a baseline, and we refer to section 13.4 of Sutton and Barto (2018) for a detailed account on this topic. 16.3.2 Extensions A popular extension of REINFORCE is the so-called actor-critic (AC) method which combines policy gradient with \\(Q\\)- or \\(v\\)-learning. The AC algorithm can be viewed as some kind of mix between policy gradient and SARSA. A central requirement is that the state-value function \\(v(\\cdot)\\) be a differentiable function of some parameter vector \\(\\textbf{w}\\) (it is often taken to be a neural network). The update rule is then \\[\\begin{equation} \\tag{16.17} \\boldsymbol{\\theta} \\leftarrow \\boldsymbol{\\theta} + \\eta \\left(R_{t+1}+\\gamma v(S_{t+1},\\textbf{w})-v(S_t,\\textbf{w}) \\right)\\frac{\\nabla \\pi_{\\boldsymbol{\\theta}}}{\\pi_{\\boldsymbol{\\theta}}}, \\end{equation}\\] but the trick is that the vector \\(\\textbf{w}\\) must also be updated. The actor is the policy side which is what drives decision making. The critic side is the value function that evaluates the actor’s performance. As learning progresses (each time both sets of parameters are updated), both sides improve. The exact algorithmic formulation is a bit long and we refer to Section 13.5 in Sutton and Barto (2018) for the precise sequence of steps of AC. Another interesting application of parametric policies is outlined in Aboussalah and Lee (2020). In their article, the authors define a trading policy that is based on a recurrent neural network. Thus, the parameter \\(\\boldsymbol{\\theta}\\) in this case encompasses all weights and biases in the network. Another favorable feature of parametric policies is that they are compatible with continuous sets of actions. Beyond the form (16.14), there are other ways to shape \\(\\pi_{\\boldsymbol{\\theta}}\\). If \\(\\mathcal{A}\\) is a subset of \\(\\mathbb{R}\\), and \\(f_{\\boldsymbol{\\Omega}}\\) is a density function with parameters \\(\\boldsymbol{\\Omega}\\), then a candidate form for \\(\\pi_{\\boldsymbol{\\theta}}\\) is \\[\\begin{equation} \\tag{16.18} \\pi_{\\boldsymbol{\\theta}} = f_{\\boldsymbol{\\Omega}(s,\\boldsymbol{\\theta})}(a), \\end{equation}\\] in which the parameters \\(\\boldsymbol{\\Omega}\\) are in turn functions of the states and of the underlying (second order) parameters \\(\\boldsymbol{\\theta}\\). While the Gaussian distribution (see section 13.7 in Sutton and Barto (2018)) is often a preferred choice, they would require some processing to lie inside the unit interval. One easy way to obtain such values is to apply the normal cumulative distribution function to the output. In Wang and Zhou (2019), the multivariate Gaussian policy is theoretically explored, but it assumes no constraint on weights. Some natural parametric distributions emerge as alternatives. If only one asset is traded, then the Bernoulli distribution can be used to determine whether or not to buy the asset. If a riskless asset is available, the beta distribution offers more flexibility because the values for the proportion invested in the risky asset span the whole interval; the remainder can be invested into the safe asset. When many assets are traded, things become more complicated because of the budget constraint. 
One ideal candidate is the Dirichlet distribution because it is defined on a simplex (see Equation (16.13)): \\[f_{\\boldsymbol{\\alpha}}(w_1,\\dots,w_n)=\\frac{1}{B(\\boldsymbol{\\alpha})}\\prod_{n=1}^Nw_n^{\\alpha_n-1},\\] where \\(B(\\boldsymbol{\\alpha})\\) is the multinomial beta function: \\[B(\\boldsymbol{\\alpha})=\\frac{\\prod_{n=1}^N\\Gamma(\\alpha_n)}{\\Gamma\\left(\\sum_{n=1}^N\\alpha_n \\right)}.\\] If we set \\(\\pi=\\pi_{\\boldsymbol{\\alpha}}=f_{\\boldsymbol{\\alpha}}\\), the link with factors or characteristics can be coded through \\({\\boldsymbol{\\alpha}}\\) via a linear form: \\[\\begin{equation} (\\textbf{F1}) \\quad \\alpha_{n,t}=\\theta_{0,t} + \\sum_{k=1}^K \\theta_{t}^{(k)}x_{t,n}^{(k)}, \\end{equation}\\] which is highly tractable, but may violate the condition that \\(\\alpha_{n,t}>0\\) for some values of \\(\\theta_{k,t}\\). Indeed, during the learning process, an update in \\(\\boldsymbol{\\theta}\\) might yield values that are out of the feasible set of \\(\\boldsymbol{\\alpha}_t\\). In this case, it is possible to resort to a trick that is widely used in online learning (see, e.g., section 2.3.1 in ). The idea is simply to find the acceptable solution that is closest to the suggestion from the algorithm. If we call \\(\\boldsymbol{\\theta}^*\\) the result of an update rule from a given algorithm, then the closest feasible vector is \\[\\begin{equation} \\boldsymbol{\\theta}= \\underset{\\textbf{z} \\in \\Theta(\\textbf{x}_t)}{\\min} ||\\boldsymbol{\\theta}^*-\\textbf{z}||^2, \\end{equation}\\] where \\(||\\cdot||\\) is the Euclidean norm and \\(\\Theta(\\textbf{x}_t)\\) is the feasible set, that is, the set of vectors \\(\\boldsymbol{\\theta}\\) such that the \\(\\alpha_{n,t}=\\theta_{0,t} + \\sum_{k=1}^K \\theta_{t}^{(k)}x_{t,n}^{(k)}\\) are all non-negative. A second option for the form of the policy, \\(\\pi^2_{\\boldsymbol{\\theta}_t}\\), is slightly more complex but remains always valid (i.e., has positive \\(\\alpha_{n,t}\\) values): \\[\\begin{equation} (\\textbf{F2}) \\quad \\alpha_{n,t}=\\exp \\left(\\theta_{0,t} + \\sum_{k=1}^K \\theta_{t}^{(k)}x_{t,n}^{(k)}\\right), \\end{equation}\\] which is simply the exponential of the first version. With some algebra, it is possible to derive the policy gradients. The policies \\(\\pi^j_{\\boldsymbol{\\theta}_t}\\) are defined by the Equations \\((\\textbf{Fj})\\) above. Let \\(\\digamma\\) denote the digamma function. Let \\(\\textbf{1}\\) denote the \\(\\mathbb{R}^N\\) vector of all ones. We have \\[\\begin{align*} \\frac{\\nabla_{\\boldsymbol{\\theta}_t} \\pi^1_{\\boldsymbol{\\theta}_t}}{\\pi^1_{\\boldsymbol{\\theta}_t}}&= \\sum_{n=1}^N \\left( \\digamma \\left( \\textbf{1}'\\textbf{X}_t\\boldsymbol{\\theta}_t \\right) - \\digamma(\\textbf{x}_{t,n}\\boldsymbol{\\theta}_t) + \\ln w_n \\right) \\textbf{x}_{t,n}' \\\\ \\frac{\\nabla_{\\boldsymbol{\\theta}_t} \\pi^2_{\\boldsymbol{\\theta}_t}}{\\pi^2_{\\boldsymbol{\\theta}_t}}&= \\sum_{n=1}^N \\left( \\digamma \\left( \\textbf{1}'e^{\\textbf{X}_{t}\\boldsymbol{\\theta}_t} \\right) - \\digamma(e^{\\textbf{x}_{t,n}\\boldsymbol{\\theta}_t}) + \\ln w_n \\right) e^{\\textbf{x}_{t,n}\\boldsymbol{\\theta}_t} \\textbf{x}_{t,n}' \\end{align*}\\] where \\(e^{\\textbf{X}}\\) is the element-wise exponential of a matrix \\(\\textbf{X}\\). The allocation can then either be made by direct sampling, or using the mean of the distribution \\((\\textbf{1}'\\boldsymbol{\\alpha})^{-1}\\boldsymbol{\\alpha}\\). 
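Here is a minimal sketch of policy (F2) with hypothetical characteristics and parameters: the \\(\\alpha_{n,t}\\) are obtained via the exponential form and a portfolio is drawn from the corresponding Dirichlet distribution (sampled via normalized Gamma draws to avoid any additional package).
set.seed(42)                                         # Reproducibility of the toy example
N <- 5; K <- 3                                       # Toy sizes: 5 assets, 3 characteristics
X_t <- matrix(runif(N * K), nrow = N, ncol = K)      # Hypothetical characteristics at date t
theta_0 <- 0.1                                       # Hypothetical intercept
theta <- c(0.5, -0.2, 0.3)                           # Hypothetical slope parameters
alpha <- as.vector(exp(theta_0 + X_t %*% theta))     # (F2): alpha is always positive
w_mean <- alpha / sum(alpha)                         # Allocation option 1: mean of the Dirichlet
g_draw <- rgamma(N, shape = alpha, rate = 1)         # Dirichlet sampling via normalized Gamma draws
w_sample <- g_draw / sum(g_draw)                     # Allocation option 2: one sampled portfolio
round(cbind(mean_alloc = w_mean, sampled_alloc = w_sample), 3)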
Lastly, a technical note: Dirichlet distributions can only be used for small portfolios because the scaling constant in the density becomes numerically intractable for large values of \\(N\\) (e.g., above 50). 16.4 Simple examples 16.4.1 Q-learning with simulations To illustrate the gist of the problems mentioned above, we propose two implementations of \\(Q\\)-learning. For simplicity, the first one is based on simulations. This helps understand the learning process in a simplified framework. We consider two assets: one risky and one riskless, with return equal to zero. The returns for the risky process follow an autoregressive model of order one (AR(1)): \\(r_{t+1}=a+\\rho r_t+\\epsilon_{t+1}\\) with \\(|\\rho|<1\\) and \\(\\epsilon\\) following a standard white noise with variance \\(\\sigma^2\\). In practice, individual (monthly) returns are seldom autocorrelated, but adjusting the autocorrelation helps understand if the algorithm learns correctly (see exercise below). The environment consists only in observing the past return \\(r_t\\). Since we seek to estimate the \\(Q\\) function, we need to discretize this state variable. The simplest choice is to resort to a binary variable: equal to -1 (negative) if \\(r_t<0\\) and to +1 (positive) if \\(r_t\\ge 0\\). The actions are summarized by the quantity invested in the risky asset. It can take 5 values: 0 (risk-free portfolio), 0.25, 0.5, 0.75 and 1 (fully invested in the risky asset). This is for instance the same choice as in Pendharkar and Cusatis (2018). The landscape of R libraries for RL is surprisingly sparse. We resort to the package ReinforcementLearning which has an intuitive implementation of \\(Q\\)-learning (another option would be the reinforcelearn package). It requires a dataset with the usual inputs: state, action, reward and subsequent state. We start by simulating the returns: they drive the states and the rewards (portfolio returns). The actions are sampled randomly. Technically, the main function of the package requires that states and actions be of character type. The data is built in the chunk below. library(ReinforcementLearning) # Package for RL set.seed(42) # Fixing the random seed n_sample <- 10^5 # Number of samples to be generated rho <- 0.8 # Autoregressive parameter sd <- 0.4 # Std. dev. of noise a <- 0.06 * rho # Scaled mean of returns data_RL <- tibble(returns = a/rho + arima.sim(n = n_sample, # Returns via AR(1) simulation list(ar = rho), sd = sd), action = round(runif(n_sample)*4)/4) %>% # Random action (portfolio) mutate(new_state = if_else(returns < 0, "neg", "pos"), # Coding of state reward = returns * action, # Reward = portfolio return state = lag(new_state), # Next state action = as.character(action)) %>% na.omit() # Remove one missing state data_RL %>% head() # Show first lines ## # A tibble: 6 x 5 ## returns action new_state reward state ## <dbl> <chr> <chr> <dbl> <chr> ## 1 -0.474 0.5 neg -0.237 neg ## 2 -0.185 0.25 neg -0.0463 neg ## 3 0.146 0.25 pos 0.0364 neg ## 4 0.543 0.75 pos 0.407 pos ## 5 0.202 0.75 pos 0.152 pos ## 6 0.376 0.25 pos 0.0940 pos There are 3 parameters in the implementation of the Q-learning algorithm: \\(\\eta\\), which is the learning rate in the updating Equation (16.8). In ReinforcementLearning, this is coded as alpha; \\(\\gamma\\), the discounting rate for the rewards (also shown in Equation (16.8)); and \\(\\epsilon\\), which controls the rate of exploration versus exploitation (see Equation (16.10)). 
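Before calling the package, it may help to see what the update of Equation (16.8) looks like when coded by hand. The sketch below sweeps once over the first 10,000 simulated transitions and fills a small Q table; it is only meant to expose the mechanics, so the values will not exactly match the package's output (which handles iterations and exploration differently).
states <- c("neg", "pos")                                  # Discretized states
actions <- sort(unique(data_RL$action))                    # "0", "0.25", "0.5", "0.75", "1"
Q <- matrix(0, nrow = length(states), ncol = length(actions),
            dimnames = list(states, actions))              # Q table initialized at zero
eta <- 0.1                                                 # Learning rate
gamma <- 0.7                                               # Discount factor
for (i in 1:10^4) {                                        # One sweep over the first 10,000 transitions
  s  <- data_RL$state[i]                                   # Current state
  a  <- data_RL$action[i]                                  # Action taken
  r  <- data_RL$reward[i]                                  # Reward received
  s2 <- data_RL$new_state[i]                               # Subsequent state
  Q[s, a] <- Q[s, a] + eta * (r + gamma * max(Q[s2, ]) - Q[s, a])   # Update of Equation (16.8)
}
round(Q, 3)                                                # Hand-made Q table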
control <- list(alpha = 0.1, # Learning rate gamma = 0.7, # Discount factor for rewards epsilon = 0.1) # Exploration rate fit_RL <- ReinforcementLearning(data_RL, # Main RL function s = "state", a = "action", r = "reward", s_new = "new_state", control = control) print(fit_RL) # Show the output ## State-Action function Q ## 0.25 0 1 0.75 0.5 ## neg 0.2473169 0.4216894 0.1509653 0.1734538 0.229004 ## pos 1.0721669 0.7561417 1.4739050 1.1214795 1.045047 ## ## Policy ## neg pos ## "0" "1" ## ## Reward (last iteration) ## [1] 2588.659 The output shows the Q function, which depends naturally both on states and actions. When the state is negative, large risky positions (action equal to 0.75 or 1.00) are associated with the smallest average rewards, whereas small positions yield the highest average rewards. When the state is positive, the average rewards are the highest for the largest allocations. The rewards in both cases are almost a monotonic function of the proportion invested in the risky asset. Thus, the recommendation of the algorithm (i.e., the policy) is to be fully invested in a positive state and to refrain from investing in a negative state. Given the positive autocorrelation of the underlying process, this does make sense. Basically, the algorithm has simply learned that positive (resp. negative) returns are more likely to follow positive (resp. negative) returns. While this is somewhat reassuring, it is by no means impressive, and much simpler tools would yield similar conclusions and guidance. 16.4.2 Q-learning with market data The second application is based on the financial dataset. To reduce the dimensionality of the problem, we will assume that: - only one feature (price-to-book ratio) captures the state of the environment. This feature is processed so that is has only a limited number of possible values; - actions take values over a discrete set consisting of three positions: +1 (buy the market), -1 (sell the market) and 0 (hold no risky positions); - only two assets are traded: those with stock_id equal to 3 and 4 - they both have 245 days of trading data. The construction of the dataset is unelegantly coded below. return_3 <- data_ml %>% filter(stock_id == 3) %>% pull(R1M_Usd) # Return of asset 3 return_4 <- data_ml %>% filter(stock_id == 4) %>% pull(R1M_Usd) # Return of asset 4 pb_3 <- data_ml %>% filter(stock_id == 3) %>% pull(Pb) # P/B ratio of asset 3 pb_4 <- data_ml %>% filter(stock_id == 4) %>% pull(Pb) # P/B ratio of asset 4 action_3 <- floor(runif(length(pb_3))*3) - 1 # Action for asset 3 (random) action_4 <- floor(runif(length(pb_4))*3) - 1 # Action for asset 4 (random) RL_data <- tibble(return_3, return_4, # Building the dataset pb_3, pb_4, action_3, action_4) %>% mutate(action = paste(action_3, action_4), # Uniting actions pb_3 = round(5 * pb_3), # Simplifying states (P/B) pb_4 = round(5 * pb_4), # Simplifying states (P/B) state = paste(pb_3, pb_4), # Uniting states reward = action_3*return_3 + action_4*return_4, # Computing rewards new_state = lead(state)) %>% # Infer new state dplyr::select(-pb_3, -pb_4, -action_3, # Remove superfluous vars. -action_4, -return_3, -return_4) head(RL_data) # Showing the result ## # A tibble: 6 x 4 ## action state reward new_state ## <chr> <chr> <dbl> <chr> ## 1 -1 -1 1 1 -0.061 1 1 ## 2 0 1 1 1 0 1 1 ## 3 -1 0 1 1 -0.018 1 1 ## 4 0 -1 1 1 0.011 1 1 ## 5 -1 1 1 1 -0.036 1 1 ## 6 -1 -1 1 1 -0.056 1 1 Actions and states have to be merged to yield all possible combinations. 
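As a toy illustration of this merging step (the values below are made up), paste() simply concatenates the two individual actions, or the two discretized P/B values, into a single composite label:
paste(c(-1, 0, 1), c(1, -1, 0))           # Composite actions: "-1 1" "0 -1" "1 0"
paste(round(5 * 0.44), round(5 * 0.63))   # Composite state from two scaled P/B ratios: "2 3"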
To simplify the states, we multiply the price-to-book ratios by 5 and round the result. We keep the same hyperparameters as in the previous example. Columns below stand for actions: the first (\\(resp.\\) second) number denotes the position in the first (\\(resp.\\) second) asset. The rows correspond to states. The scaled P/B ratios are separated by a point (e.g., “X2.3” means that the first (\\(resp.\\) second) asset has a scaled P/B of 2 (\\(resp.\\) 3)). fit_RL2 <- ReinforcementLearning(RL_data, # Main RL function s = "state", a = "action", r = "reward", s_new = "new_state", control = control) fit_RL2$Q <- round(fit_RL2$Q, 3) # Round the Q-matrix print(fit_RL2) # Show the output ## State-Action function Q ## 0 0 0 1 0 -1 -1 -1 -1 0 -1 1 1 -1 1 0 1 1 ## 0 2 0.000 0.000 0.000 -0.017 0.000 0.000 0.000 0.002 0.000 ## 0 3 0.000 0.000 0.003 0.000 0.000 0.000 0.030 0.000 0.000 ## 3 1 0.002 0.000 0.005 0.000 -0.002 0.000 0.000 0.000 0.000 ## 2 1 0.005 0.018 0.009 -0.028 0.010 -0.003 0.021 0.008 -0.004 ## 2 2 0.000 0.010 0.000 0.014 0.000 0.000 -0.013 0.006 0.000 ## 2 3 0.000 0.000 0.000 0.000 0.000 0.020 0.000 -0.034 0.000 ## 1 1 0.002 -0.005 -0.022 -0.011 -0.002 -0.009 -0.020 -0.014 -0.023 ## 1 2 0.006 0.016 0.006 0.028 -0.001 0.001 0.020 0.020 -0.001 ## 1 3 0.001 0.004 0.004 -0.011 0.000 0.003 0.005 0.003 0.010 ## ## Policy ## 0 2 0 3 3 1 2 1 2 2 2 3 1 1 1 2 1 3 ## "1 0" "1 -1" "0 -1" "1 -1" "-1 -1" "-1 1" "0 0" "-1 -1" "1 1" ## ## Reward (last iteration) ## [1] -1.296 The output shows that there are many combinations of states and actions that are not spanned by the data: whenever the \\(Q\\) function displays a zero, it is likely that the corresponding combination has not been explored. Some states seem to be more often represented (“X1.1”, “X1.2” and “X2.1”), others less so (“X3.1” and “X3.2”). It is hard to make any sense of the recommendations. Some states are close (e.g., “X0.1” and “X1.1”), but the outcomes related to them are very different (buy and short versus hold and buy). Moreover, there is no coherence and no monotonicity in actions with respect to individual state values: low values of states can be associated with very different actions. One reason why these conclusions do not appear trustworthy pertains to the data size. With only 200+ time points and 99 state-action pairs (11 times 9), this leaves on average only about two data points per pair to estimate the \\(Q\\) function. This could be improved by testing more random actions, but the limits of the sample size would eventually (rapidly) be reached anyway. This is left as an exercise (see below). 16.5 Concluding remarks Reinforcement learning has been applied to financial problems for a long time. Early contributions in the late 1990s include Neuneier (1996), Moody and Wu (1997), Moody et al. (1998) and Neuneier (1998). Since then, many researchers in the computer science field have sought to apply RL techniques to portfolio problems. The advent of massive datasets and the increase in dimensionality make it hard for RL tools to adapt well to the very rich environments that are encountered in factor investing. Recently, some approaches have sought to adapt RL to continuous action spaces (Wang and Zhou (2019), Aboussalah and Lee (2020)), but not to high-dimensional state spaces. These spaces are those required in factor investing because all firms yield hundreds of data points characterizing their economic situation.
In addition, applications of RL in financial frameworks have a particularity compared to many typical RL tasks: in financial markets, the actions of agents have no impact on the environment (unless the agent is able to perform massive trades, which is rare and ill-advised because it pushes prices in the wrong direction). This lack of impact of actions may reduce the efficiency of traditional RL approaches. These are challenges that will need to be solved in order for RL to become competitive with alternative (supervised) methods. Nevertheless, the progressive (online-like) way RL works seems suitable for non-stationary environments: the algorithm slowly shifts paradigms as new data arrives. In stationary environments, it has been shown that RL manages to converge to optimal solutions (Kong et al. (2019), Chaouki et al. (2020)). Therefore, in non-stationary markets, RL could be a recourse to build dynamic predictions that adapt to changing macroeconomic conditions. More research needs to be carried out in this field on large-dimensional datasets. We end this chapter by underlining that reinforcement learning has also been used to estimate complex theoretical models (Halperin and Feldshteyn (2018), García-Galicia, Carsteanu, and Clempner (2019)). The research in the field is incredibly diversified and is oriented in many directions. It is likely that captivating work will be published in the near future. 16.6 Exercises Test what happens if the process for generating returns has a negative autocorrelation. What is the impact on the \\(Q\\) function and the policy? Keeping the same two assets as in Section 16.4.2, increase the size of RL_data by testing all possible action combinations for each original data point. Re-run the \\(Q\\)-learning function and see what happens.
References "],["data-description.html", "Chapter 17 Data description", " Chapter 17 Data description TABLE 17.1: List of all variables (features and labels) in the dataset Column Name Short Description stock_id security id date date of the data Advt_12M_Usd average daily volume in amount in USD over 12 months Advt_3M_Usd average daily volume in amount in USD over 3 months Advt_6M_Usd average daily volume in amount in USD over 6 months Asset_Turnover total sales on average assets Bb_Yld buyback yield Bv book value Capex_Ps_Cf capital expenditure on price to sale cash flow Capex_Sales capital expenditure on sales Cash_Div_Cf cash dividends cash flow Cash_Per_Share cash per share Cf_Sales cash flow per share Debtequity debt to equity Div_Yld dividend yield Dps dividend per share Ebit_Bv EBIT on book value Ebit_Noa EBIT on non operating asset Ebit_Oa EBIT on operating asset Ebit_Ta EBIT on total asset Ebitda_Margin EBITDA margin Eps earnings per share Eps_Basic earnings per share basic Eps_Basic_Gr earnings per share growth Eps_Contin_Oper earnings per share continuing operations Eps_Dil earnings per share diluted Ev enterprise value Ev_Ebitda enterprise value on EBITDA Fa_Ci fixed assets on common equity Fcf free cash flow Fcf_Bv free cash flow on book value Fcf_Ce free cash flow on capital employed Fcf_Margin free cash flow margin Fcf_Noa free cash flow on net operating assets Fcf_Oa free cash flow on operating assets Fcf_Ta free cash flow on total assets Fcf_Tbv free cash flow on tangible book value Fcf_Toa free cash flow on total operating assets Fcf_Yld free cash flow yield Free_Ps_Cf free cash flow on price sales Int_Rev intangibles on revenues Interest_Expense interest expense coverage Mkt_Cap_12M_Usd average market capitalization over 12 months in USD Mkt_Cap_3M_Usd average market capitalization over 3 months in USD Mkt_Cap_6M_Usd average market capitalization over 6 months in USD Mom_11M_Usd price momentum 12 - 1 months in USD Mom_5M_Usd price momentum 6 - 1 months in USD Mom_Sharp_11M_Usd price momentum 12 - 1 months in USD divided by volatility Mom_Sharp_5M_Usd price momentum 6 - 1 months in USD divided by volatility Nd_Ebitda net debt on EBITDA Net_Debt net debt Net_Debt_Cf net debt on cash flow Net_Margin net margin Netdebtyield net debt yield Ni net income Ni_Avail_Margin net income available margin Ni_Oa net income on operating asset Ni_Toa net income on total operating asset Noa net operating asset Oa operating asset Ocf operating cash flow Ocf_Bv operating cash flow on book value Ocf_Ce operating cash flow on capital employed Ocf_Margin operating cash flow margin Ocf_Noa operating cash flow on net operating assets Ocf_Oa operating cash flow on operating assets Ocf_Ta operating cash flow on total assets Ocf_Tbv operating cash flow on tangible book value Ocf_Toa operating cash flow on total operating assets Op_Margin operating margin Op_Prt_Margin net margin 1Y growth Oper_Ps_Net_Cf cash flow from operations per share net Pb price to book Pe price earnings Ptx_Mgn margin pretax Recurring_Earning_Total_Assets reccuring earnings on total assets Return_On_Capital return on capital Rev revenue Roa return on assets Roc return on capital Roce return on capital employed Roe return on equity Sales_Ps price to sales Share_Turn_12M average share turnover 12 months Share_Turn_3M average share turnover 3 months Share_Turn_6M average share turnover 6 months Ta total assets Tev_Less_Mktcap total enterprise value less market capitalization Tot_Debt_Rev total debt on revenue Total_Capital total 
capital Total_Debt total debt Total_Debt_Capital total debt on capital Total_Liabilities_Total_Assets total liabilities on total assets Vol1Y_Usd volatility of returns over one year Vol3Y_Usd volatility of returns over 3 years R1M_Usd return forward 1 month (LABEL) R3M_Usd return forward 3 months (LABEL) R6M_Usd return forward 6 months (LABEL) R12M_Usd return forward 12 months (LABEL) "],["solutions-to-exercises.html", "Chapter 18 Solutions to exercises 18.1 Chapter 3 18.2 Chapter 4 18.3 Chapter 5 18.4 Chapter 6 18.5 Chapter 7: the autoencoder model 18.6 Chapter 8 18.7 Chapter 11: ensemble neural network 18.8 Chapter 12 18.9 Chapter 15 18.10 Chapter 16", " Chapter 18 Solutions to exercises 18.1 Chapter 3 For annual values, see 18.1: data_ml %>% group_by(date) %>% mutate(growth = Pb > median(Pb)) %>% # Creates the sort ungroup() %>% # Ungroup mutate(year = lubridate::year(date)) %>% # Creates a year variable group_by(year, growth) %>% # Analyze by year & sort summarize(ret = mean(R1M_Usd)) %>% # Compute average return ggplot(aes(x = year, y = ret, fill = growth)) + geom_col(position = "dodge") + # Plot! theme(legend.position = c(0.7, 0.8)) FIGURE 18.1: The value factor: annual returns. For monthly values, see 18.2: returns_m <- data_ml %>% group_by(date) %>% mutate(growth = Pb > median(Pb)) %>% # Creates the sort group_by(date, growth) %>% # Analyze by date & sort summarize(ret = mean(R1M_Usd)) %>% # Compute average return spread(key = growth, value = ret) %>% # Pivot to wide matrix format ungroup() colnames(returns_m)[2:3] <- c("value", "growth") # Changing column names returns_m %>% mutate(value = cumprod(1 + value), # From returns to portf. values growth = cumprod(1 + growth)) %>% gather(key = portfolio, value = value, -date) %>% # Back in tidy format ggplot(aes(x = date, y = value, color = portfolio)) + geom_line() + # Plot! theme(legend.position = c(0.7, 0.8)) FIGURE 18.2: The value factor: portfolio values. Portfolios based on quartiles, using the tidyverse only. We rely heavily on the fact that features are uniformized, i.e., that their distribution is uniform for each given date. Overall, small firms outperform heavily (see Figure 18.3). data_ml %>% mutate(small = Mkt_Cap_6M_Usd <= 0.25, # Small firms... medium = Mkt_Cap_6M_Usd > 0.25 & Mkt_Cap_6M_Usd <= 0.5, large = Mkt_Cap_6M_Usd > 0.5 & Mkt_Cap_6M_Usd <= 0.75, xl = Mkt_Cap_6M_Usd > 0.75, # ...Xlarge firms year = year(date)) %>% group_by(year) %>% summarize(small = mean(small * R1M_Usd), # Compute avg returns medium = mean(medium * R1M_Usd), large = mean(large * R1M_Usd), xl = mean(xl * R1M_Usd)) %>% gather(key = size, value = return, -year) %>% ggplot(aes(x = year, y = return, fill = size)) + geom_col(position = "dodge") FIGURE 18.3: The value factor: portfolio values. 18.2 Chapter 4 Below, we import a credit spread supplied by Bank of America. Its symbol/ticker is “BAMLC0A0CM”. We apply the data expansion on the small number of predictors to save memory space. One important trick that should not be overlooked is the uniformization step after the product (4.3) is computed. Indeed, we want the new features to have the same properties as the old ones. If we skip this step, distributions will be altered, as we show in one example below. We start with the data extraction and joining. It’s important to join early so as to keep the highest data frequency (daily) in order to replace missing points with close values. Joining with monthly data before replacing creates unnecessary lags. 
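As a quick reminder of how the forward-fill used below behaves (na.locf comes from the zoo package, which is loaded alongside quantmod), here is a toy vector with made-up values:
library(zoo)                          # na.locf = last observation carried forward
x <- c(1.2, NA, NA, 1.5, NA, 1.7)     # Toy series with missing points
na.locf(x)                            # Returns 1.2 1.2 1.2 1.5 1.5 1.7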
getSymbols.FRED("BAMLC0A0CM", # Extract data env = ".GlobalEnv", return.class = "xts") ## [1] "BAMLC0A0CM" cred_spread <- fortify(BAMLC0A0CM) # Transform to dataframe colnames(cred_spread) <- c("date", "spread") # Change column name cred_spread <- cred_spread %>% # Take extraction and... full_join(data_ml %>% dplyr::select(date), by = "date") %>% # Join! mutate(spread = na.locf(spread)) # Replace NA by previous cred_spread <- cred_spread[!duplicated(cred_spread),] # Remove duplicates The creation of the augmented dataset requires some manipulation. Features are no longer uniform as is shown in Figure 18.4. data_cond <- data_ml %>% # Create new dataset dplyr::select(c("stock_id", "date", features_short)) names_cred_spread <- paste0(features_short, "_cred_spread") # New column names feat_cred_spread <- data_cond %>% # Old values dplyr::select(features_short) cred_spread <- data_ml %>% # Create vector of spreads dplyr::select(date) %>% left_join(cred_spread, by = "date") feat_cred_spread <- feat_cred_spread * # This product creates... matrix(cred_spread$spread, # the new values... length(cred_spread$spread), # using duplicated... length(features_short)) # columns colnames(feat_cred_spread) <- names_cred_spread # New column names data_cond <- bind_cols(data_cond, feat_cred_spread) # Aggregate old & new data_cond %>% ggplot(aes(x = Eps_cred_spread)) + geom_histogram() # Plot example FIGURE 18.4: Distribution of Eps after conditioning. To prevent this issue, uniformization is required and is verified in Figure 18.5. data_cond <- data_cond %>% # From new dataset group_by(date) %>% # Group by date and... mutate_at(names_cred_spread, norm_unif) # Uniformize the new features data_cond %>% ggplot(aes(x = Eps_cred_spread)) + geom_histogram(bins = 100) # Verification FIGURE 18.5: Distribution of uniformized conditioned feature values. The second question naturally requires the downloading of VIX series first and the joining with the original data. getSymbols.FRED("VIXCLS", # Extract data env = ".GlobalEnv", return.class = "xts") ## [1] "VIXCLS" vix <- fortify(VIXCLS) # Transform to dataframe colnames(vix) <- c("date", "vix") # Change column name vix <- vix %>% # Take extraction and... full_join(data_ml %>% dplyr::select(date), by = "date") %>% # Join! mutate(vix = na.locf(vix)) # Replace NA by previous vix <- vix[!duplicated(vix),] # Remove duplicates vix <- data_ml %>% # Keep original data format dplyr::select(date) %>% # ... left_join(vix, by = "date") # Via left_join() We can then proceed with the categorization. We create the vector label in a new (smaller) dataset but not attached to the large data_ml variable. Also, we check the balance of labels and its evolution through time (see Figure 18.6). delta <- 0.5 # Magnitude of vix correction vix_bar <- median(vix$vix) # Median of vix data_vix <- data_ml %>% # Smaller dataset dplyr::select(stock_id, date, R1M_Usd) %>% mutate(r_minus = (-0.02) * exp(-delta*(vix$vix-vix_bar)), # r_- r_plus = 0.02 * exp(delta*(vix$vix-vix_bar))) # r_+ data_vix <- data_vix %>% mutate(R1M_Usd_Cvix = if_else(R1M_Usd < r_minus, -1, # New label! if_else(R1M_Usd > r_plus, 1,0)), R1M_Usd_Cvix = as.factor(R1M_Usd_Cvix)) data_vix %>% mutate(year = year(date)) %>% group_by(year, R1M_Usd_Cvix) %>% summarize(nb = n()) %>% ggplot(aes(x = year, y = nb, fill = R1M_Usd_Cvix)) + geom_col() FIGURE 18.6: Evolution of categories through time. Finally, we switch to the outliers (Figure 18.7). 
data_ml %>% ggplot(aes(x = R12M_Usd)) + geom_histogram() FIGURE 18.7: Outliers in the dependent variable. Returns above 50 should indeed be rare. data_ml %>% filter(R12M_Usd > 50) %>% dplyr::select(stock_id, date, R12M_Usd) ## # A tibble: 8 x 3 ## stock_id date R12M_Usd ## <int> <date> <dbl> ## 1 212 2000-12-31 53.0 ## 2 221 2008-12-31 53.5 ## 3 221 2009-01-31 55.2 ## 4 221 2009-02-28 54.8 ## 5 296 2002-06-30 72.2 ## 6 683 2009-02-28 96.0 ## 7 683 2009-03-31 64.8 ## 8 862 2009-02-28 58.0 The largest return comes from stock #683. Let’s have a look at the stream of monthly returns in 2009. data_ml %>% filter(stock_id == 683, year(date) == 2009) %>% dplyr::select(date, R1M_Usd) ## # A tibble: 12 x 2 ## date R1M_Usd ## <date> <dbl> ## 1 2009-01-31 -0.625 ## 2 2009-02-28 0.472 ## 3 2009-03-31 1.44 ## 4 2009-04-30 0.139 ## 5 2009-05-31 0.086 ## 6 2009-06-30 0.185 ## 7 2009-07-31 0.363 ## 8 2009-08-31 0.103 ## 9 2009-09-30 9.91 ## 10 2009-10-31 0.101 ## 11 2009-11-30 0.202 ## 12 2009-12-31 -0.251 The returns are all very high. The annual value is plausible. In addition, a quick glance at the Vol1Y values shows that the stock is the most volatile of the dataset. 18.3 Chapter 5 We recycle the training and testing data variables created in the chapter (coding section notably). In addition, we create a dedicated function and resort to the map2() function from the purrr package. alpha_seq <- (0:10)/10 # Sequence of alpha values lambda_seq <- 0.1^(0:5) # Sequence of lambda values pars <- expand.grid(alpha_seq, lambda_seq) # Exploring all combinations! alpha_seq <- pars[,1] lambda_seq <- pars[,2] lasso_sens <- function(alpha, lambda, x_train, y_train, x_test, y_test){ # Function fit_temp <- glmnet(x_train, y_train, # Model alpha = alpha, lambda = lambda) return(sqrt(mean((predict(fit_temp, x_test) - y_test)^2))) # Output } rmse_elas <- map2(alpha_seq, lambda_seq, lasso_sens, # Automation x_train = x_penalized_train, y_train = y_penalized_train, x_test = x_penalized_test, y_test = testing_sample$R1M_Usd) bind_cols(alpha = alpha_seq, lambda = as.factor(lambda_seq), rmse = unlist(rmse_elas)) %>% ggplot(aes(x = alpha, y = rmse, fill = lambda)) + geom_col() + facet_grid(lambda ~.) + coord_cartesian(ylim = c(0.19,0.193)) FIGURE 18.8: Performance of elasticnet across parameter values. As is outlined in Figure 18.8, the parameters have a very marginal impact. Maybe the model is not a good fit for the task. 18.4 Chapter 6 fit1 <- rpart(formula, data = training_sample, # Data source: full sample cp = 0.001) # Precision: smaller = more leaves mean((predict(fit1, testing_sample) - testing_sample$R1M_Usd)^2) ## [1] 0.04018973 fit2 <- rpart(formula, data = training_sample, # Data source: full sample cp = 0.01) # Precision: smaller = more leaves mean((predict(fit2, testing_sample) - testing_sample$R1M_Usd)^2) # Test! ## [1] 0.03699696 rpart.plot(fit1) # Plot the first tree FIGURE 18.9: Sample (complex) tree. The first model (Figure 18.9) is too precise: going into the details of the training sample does not translate to good performance out-of-sample. The second, simpler model, yields better results. n_trees <- c(10, 20, 40, 80, 160) mse_RF <- 0 for(j in 1:length(n_trees)){ # No need for functional programming here... fit_temp <- randomForest( as.formula(paste("R1M_Usd ~", paste(features_short, collapse = " + "))), # New formula! data = training_sample, # Data source: training sample sampsize = 30000, # Size of (random) sample for each tree replace = TRUE, # Is the sampling done with replacement? 
ntree = n_trees[j], # Nb of random trees mtry = 5) # Nb of predictors for each tree mse_RF[j] <- mean((predict(fit_temp, testing_sample) - testing_sample$R1M_Usd)^2) } mse_RF ## [1] 0.03967754 0.03885924 0.03766900 0.03696370 0.03699772 Trees are by definition random so results can vary from test to test. Overall, large numbers of trees are preferable and the reason is that each new tree tells a new story and diversifies the risk of the whole forest. Some more technical details of why that may be the case are outlined in the original paper by Breiman (2001). For the last exercises, we recycle the formula used in Chapter 6. tree_2008 <- rpart(formula, data = data_ml %>% filter(year(date) == 2008), # Data source: 2008 cp = 0.001, maxdepth = 2) rpart.plot(tree_2008) FIGURE 18.10: Tree for 2008. The first splitting criterion in Figure 18.10 is enterprise value (EV). EV is an indicator that adjusts market capitalization by substracting debt and adding cash. It is a more faithful account of the true value of a company. In 2008, the companies that fared the least poorly were those with the highest EV (i.e., large, robust firms). tree_2009 <- rpart(formula, data = data_ml %>% filter(year(date) == 2009), # Data source: 2009 cp = 0.001, maxdepth = 2) rpart.plot(tree_2009) FIGURE 18.11: Tree for 2009. In 2009 (Figure 18.11), the firms that recovered the fastest were those that experienced high volatility in the past (likely, downwards volatility). Momentum is also very important: the firms with the lowest past returns are those that rebound the fastest. This is a typical example of the momentum crash phenomenon studied in Barroso and Santa-Clara (2015) and Daniel and Moskowitz (2016). The rationale is the following: after a market downturn, the stocks with the most potential for growth are those that have suffered the largest losses. Consequently, the negative (short) leg of the momentum factor performs very well, often better than the long leg. And indeed, being long in the momentum factor in 2009 would have generated negative profits. 18.5 Chapter 7: the autoencoder model First, it is imperative to format the inputs properly. To avoid any issues, we work with perfectly rectangular data and hence restrict the investment set to the stocks with no missing points. Dimensions must also be in the correct order. data_short <- data_ml %>% # Shorter dataset filter(stock_id %in% stock_ids_short) %>% dplyr::select(c("stock_id", "date",features_short, "R1M_Usd")) dates <- unique(data_short$date) # Vector of dates N <- length(stock_ids_short) # Dimension for assets Tt <- length(dates) # Dimension for dates K <- length(features_short) # Dimension for features factor_data <- data_short %>% # Factor side date dplyr::select(date, stock_id, R1M_Usd) %>% spread(key = stock_id, value = R1M_Usd) %>% dplyr::select(-date) %>% as.matrix() beta_data <- array(unlist(data_short %>% # Beta side data: beware the permutation below! dplyr::select(-stock_id, -date, -R1M_Usd)), dim = c(N, Tt, K)) beta_data <- aperm(beta_data, c(2,1,3)) # Permutation Next, we turn to the specification of the network, using a functional API form. 
main_input <- layer_input(shape = c(N), name = "main_input") # Main input: returns factor_network <- main_input %>% # Def of factor side network layer_dense(units = 8, activation = "relu", name = "layer_1_r") %>% layer_dense(units = 4, activation = "tanh", name = "layer_2_r") aux_input <- layer_input(shape = c(N,K), name = "aux_input") # Aux input: characteristics beta_network <- aux_input %>% # Def of beta side network layer_dense(units = 8, activation = "relu", name = "layer_1_l") %>% layer_dense(units = 4, activation = "tanh", name = "layer_2_l") %>% layer_permute(dims = c(2,1), name = "layer_3_l") # Permutation! main_output <- layer_dot(c(beta_network, factor_network), # Product of 2 networks axes = 1, name = "main_output") model_ae <- keras_model( # AE Model specs inputs = c(main_input, aux_input), outputs = c(main_output) ) Finally, we ask for the structure of the model, and train it. summary(model_ae) # See model details / architecture ## Model: "model_1" ## __________________________________________________________________________________________ ## Layer (type) Output Shape Param # Connected to ## ========================================================================================== ## aux_input (InputLayer) [(None, 793, 7)] 0 ## __________________________________________________________________________________________ ## layer_1_l (Dense) (None, 793, 8) 64 aux_input[0][0] ## __________________________________________________________________________________________ ## main_input (InputLayer) [(None, 793)] 0 ## __________________________________________________________________________________________ ## layer_2_l (Dense) (None, 793, 4) 36 layer_1_l[0][0] ## __________________________________________________________________________________________ ## layer_1_r (Dense) (None, 8) 6352 main_input[0][0] ## __________________________________________________________________________________________ ## layer_3_l (Permute) (None, 4, 793) 0 layer_2_l[0][0] ## __________________________________________________________________________________________ ## layer_2_r (Dense) (None, 4) 36 layer_1_r[0][0] ## __________________________________________________________________________________________ ## main_output (Dot) (None, 793) 0 layer_3_l[0][0] ## layer_2_r[0][0] ## ========================================================================================== ## Total params: 6,488 ## Trainable params: 6,488 ## Non-trainable params: 0 ## __________________________________________________________________________________________ model_ae %>% compile( # Learning parameters optimizer = "rmsprop", loss = "mse" ) model_ae %>% fit( # Learning function x = list(main_input = factor_data, aux_input = beta_data), y = list(main_output = factor_data), epochs = 20, # Nb rounds batch_size = 49 # Nb obs. per round ) 18.6 Chapter 8 Since we are going to reproduce a similar analysis several times, let’s simplify the task with 2 tips. First, by using default parameter values that will be passed as common arguments to the svm function. Second, by creating a custom function that computes the MSE. Third, by resorting to functional calculus via the map function from the purrr package. Below, we recycle datasets created in Chapter 6. 
mse <- function(fit, features, label){ # MSE function return(mean((predict(fit, features)-label)^2)) } par_list <- list(y = train_label_xgb[1:10000], # From Tree chapter x = train_features_xgb[1:10000,], type = "eps-regression", epsilon = 0.1, # Width of strip for errors gamma = 0.5, # Constant in the radial kernel cost = 0.1) svm_par <- function(kernel, par_list){ # Function for SVM fit automation require(e1071) return(do.call(svm, c(kernel = kernel, par_list))) } kernels <- c("linear", "radial", "polynomial", "sigmoid") # Kernels fit_svm_par <- map(kernels, svm_par, par_list = par_list) # SVM models map(fit_svm_par, mse, # MSEs features = test_feat_short, # From SVM chapter label = testing_sample$R1M_Usd) ## [[1]] ## [1] 0.03849786 ## ## [[2]] ## [1] 0.03924576 ## ## [[3]] ## [1] 0.03951328 ## ## [[4]] ## [1] 334.8173 The first two kernels yield the best fit, while the last one should be avoided. Note that apart from the linear kernel, all other options require parameters. We have used the default ones, which may explain the poor performance of some nonlinear kernels. Below, we train an SVM model on a training sample with all observations but that is limited to the 7 major predictors. Even with a smaller number of features, the training is time consuming. svm_full <- svm(y = train_label_xgb, # Train label x = train_features_xgb, # Training features type = "eps-regression", # SVM task type (see LIBSVM documentation) kernel = "linear", # SVM kernel epsilon = 0.1, # Width of strip for errors cost = 0.1) # Slack variable penalisation test_feat_short <- dplyr::select(testing_sample,features_short) # Test set mean(predict(svm_full, test_feat_short) * testing_sample$R1M_Usd > 0) # Hit ratio ## [1] 0.490343 This figure is very low. Below, we test a very simple form of boosted trees, for comparison purposes. xgb_full <- xgb.train(data = train_matrix_xgb, # Data source eta = 0.3, # Learning rate objective = "reg:linear", # Objective function max_depth = 4, # Maximum depth of trees nrounds = 60 # Number of trees used (bit low here) ) ## [14:43:24] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror. mean(predict(xgb_full, xgb_test) * testing_sample$R1M_Usd > 0) # Hit ratio ## [1] 0.5017377 The forecasts are slightly better, but the computation time is lower. Two reasons why the models perform poorly: there are not enough predictors; the models are static: they do not adjust dynamically to macro-conditions. 18.7 Chapter 11: ensemble neural network First, we create the three feature sets. The first one gets all multiples of 3 between 3 and 93. The second one gets the same indices, minus one, and the third one, the initial indices minus two. feat_train_1 <- training_sample %>% dplyr::select(features[3*(1:31)]) %>% # First set of feats as.matrix() feat_train_2 <- training_sample %>% dplyr::select(features[3*(1:31)-1]) %>% # Second set of feats as.matrix() feat_train_3 <- training_sample %>% dplyr::select(features[3*(1:31)-2]) %>% # Third set of feats as.matrix() feat_test_1 <- testing_sample %>% dplyr::select(features[3*(1:31)]) %>% # Test features 1 as.matrix() feat_test_2 <- testing_sample %>% dplyr::select(features[3*(1:31)-1]) %>% # Test features 2 as.matrix() feat_test_3 <- testing_sample %>% dplyr::select(features[3*(1:31)-2]) %>% # Test features 3 as.matrix() Then, we specify the network structure. First, the 3 independent networks, then the aggregation. 
first_input <- layer_input(shape = c(31), name = "first_input") # First input first_network <- first_input %>% # Def of 1st network layer_dense(units = 8, activation = "relu", name = "layer_1") %>% layer_dense(units = 2, activation = 'softmax') # Softmax for categ. output second_input <- layer_input(shape = c(31), name = "second_input") # Second input second_network <- second_input %>% # Def of 2nd network layer_dense(units = 8, activation = "relu", name = "layer_2") %>% layer_dense(units = 2, activation = 'softmax') # Softmax for categ. output third_input <- layer_input(shape = c(31), name = "third_input") # Third input third_network <- third_input %>% # Def of 3rd network layer_dense(units = 8, activation = "relu", name = "layer_3") %>% layer_dense(units = 2, activation = 'softmax') # Softmax for categ. output main_output <- layer_concatenate(c(first_network, second_network, third_network)) %>% # Combination layer_dense(units = 2, activation = 'softmax', name = 'main_output') model_ens <- keras_model( # Agg. Model specs inputs = c(first_input, second_input, third_input), outputs = c(main_output) ) Lastly, we can train and evaluate (see Figure 18.12). summary(model_ens) # See model details / architecture ## Model: "model_2" ## __________________________________________________________________________________________ ## Layer (type) Output Shape Param # Connected to ## ========================================================================================== ## first_input (InputLayer) [(None, 31)] 0 ## __________________________________________________________________________________________ ## second_input (InputLayer) [(None, 31)] 0 ## __________________________________________________________________________________________ ## third_input (InputLayer) [(None, 31)] 0 ## __________________________________________________________________________________________ ## layer_1 (Dense) (None, 8) 256 first_input[0][0] ## __________________________________________________________________________________________ ## layer_2 (Dense) (None, 8) 256 second_input[0][0] ## __________________________________________________________________________________________ ## layer_3 (Dense) (None, 8) 256 third_input[0][0] ## __________________________________________________________________________________________ ## dense_21 (Dense) (None, 2) 18 layer_1[0][0] ## __________________________________________________________________________________________ ## dense_22 (Dense) (None, 2) 18 layer_2[0][0] ## __________________________________________________________________________________________ ## dense_23 (Dense) (None, 2) 18 layer_3[0][0] ## __________________________________________________________________________________________ ## concatenate (Concatenate) (None, 6) 0 dense_21[0][0] ## dense_22[0][0] ## dense_23[0][0] ## __________________________________________________________________________________________ ## main_output (Dense) (None, 2) 14 concatenate[0][0] ## ========================================================================================== ## Total params: 836 ## Trainable params: 836 ## Non-trainable params: 0 ## __________________________________________________________________________________________ model_ens %>% compile( # Learning parameters optimizer = optimizer_adam(), loss = "binary_crossentropy", metrics = "categorical_accuracy" ) fit_NN_ens <- model_ens %>% fit( # Learning function x = list(first_input = feat_train_1, second_input = feat_train_2, third_input = feat_train_3), y = 
list(main_output = NN_train_labels_C), # Recycled from NN Chapter epochs = 12, # Nb rounds batch_size = 512, # Nb obs. per round validation_data = list(list(feat_test_1, feat_test_2, feat_test_3), NN_test_labels_C) ) plot(fit_NN_ens) FIGURE 18.12: Learning an integrated ensemble. 18.8 Chapter 12 18.8.1 EW portfolios with the tidyverse This one is incredibly easy; it's simpler and more compact but close in spirit to the code that generates Figure 3.1. The returns are plotted in Figure 18.13. data_ml %>% group_by(date) %>% # Group by date summarize(return = mean(R1M_Usd)) %>% # Compute return ggplot(aes(x = date, y = return)) + geom_point() + geom_line() # Plot FIGURE 18.13: Time series of returns. 18.8.2 Advanced weighting function First, we code the function with all inputs. weights <- function(Sigma, mu, Lambda, lambda, k_D, k_R, w_old){ N <- nrow(Sigma) M <- solve(lambda*Sigma + 2*k_R*Lambda + 2*k_D*diag(N)) # Inverse matrix num <- 1-sum(M %*% (mu + 2*k_R*Lambda %*% w_old)) # eta numerator den <- sum(M %*% rep(1,N)) # eta denominator eta <- num / den # eta vec <- mu + eta * rep(1,N) + 2*k_R*Lambda %*% w_old # Vector in weight return(M %*% vec) } Second, we test it on the dataset of returns created at the end of Chapter 1 and used for the Lasso allocation in Section 5.2.2. For \\(\\boldsymbol{\\mu}\\), we use the sample average, which is rarely a good idea in practice. It serves as illustration only. Sigma <- returns %>% dplyr::select(-date) %>% as.matrix() %>% cov() # Covariance matrix mu <- returns %>% dplyr::select(-date) %>% apply(2,mean) # Vector of exp. returns Lambda <- diag(nrow(Sigma)) # Trans. Cost matrix lambda <- 1 # Risk aversion k_D <- 1 k_R <- 1 w_old <- rep(1, nrow(Sigma)) / nrow(Sigma) # Prev. weights: EW weights(Sigma, mu, Lambda, lambda, k_D, k_R, w_old) %>% head() # First weights ## [,1] ## 1 0.0031339308 ## 3 -0.0003243527 ## 4 0.0011944677 ## 7 0.0014194215 ## 9 0.0015086240 ## 11 -0.0005015207 Some weights can of course be negative. Finally, we use the map2() function to test some sensitivity. We examine 3 key indicators: - diversification, which we measure via the inverse of the sum of squared weights (inverse Herfindahl-Hirschman index); - leverage, which we assess via the absolute sum of negative weights; - in-sample volatility, which we compute as \\(\\textbf{w}' \\boldsymbol{\\Sigma} \\textbf{w}\\). To do so, we create a dedicated function below. sensi <- function(lambda, k_D, Sigma, mu, Lambda, k_R, w_old){ w <- weights(Sigma, mu, Lambda, lambda, k_D, k_R, w_old) out <- c() out$div <- 1/sum(w^2) # Diversification out$lev <- sum(abs(w[w<0])) # Leverage out$vol <- t(w) %*% Sigma %*% w # In-sample vol return(out) } Instead of using the baseline map2 function, we rely on a version thereof that concatenates results into a dataframe directly. lambda <- 10^(-3:2) # parameter values k_D <- 2*10^(-3:2) # parameter values pars <- expand_grid(lambda, k_D) # parameter grid lambda <- pars$lambda k_D <- pars$k_D res <- map2_dfr(lambda, k_D, sensi, Sigma = Sigma, mu = mu, Lambda = Lambda, k_R = k_R, w_old = w_old) bind_cols(lambda = as.factor(lambda), k_D = as.factor(k_D), res) %>% gather(key = indicator, value = value, -lambda, -k_D) %>% ggplot(aes(x = lambda, y = value, fill = k_D)) + geom_col(position = "dodge") + facet_grid(indicator ~. , scales = "free") FIGURE 18.14: Indicators related to portfolio weights. In Figure 18.14, each panel displays an indicator.
In the first panel, we see that diversification increases with \\(k_D\\): indeed, as this number increases, the portfolio converges to uniform (EW) values. The parameter \\(\\lambda\\) has a minor impact. The second panel naturally shows the inverse effect for leverage: as diversification increases with \\(k_D\\), leverage (i.e., the total of negative positions, or short sales) decreases. Finally, the last panel shows that in-sample volatility is largely driven by the risk aversion parameter. As \\(\\lambda\\) increases, volatility logically decreases. For small values of \\(\\lambda\\), \\(k_D\\) is negatively related to volatility, but the pattern reverses for large values of \\(\\lambda\\). This is because the equally weighted portfolio is less risky than very leveraged mean-variance policies, but more risky than the minimum-variance portfolio. 18.8.3 Functional programming in the backtest Often, programmers prefer to avoid loops. In order to avoid a loop in the backtest, we need to code what happens for one given date. This is encapsulated in the following function. For simplicity, we code it for only one strategy. Also, the function will assume the structure of the data is known, but the columns (features & labels) could also be passed as arguments. We recycle the function weights_xgb from Chapter 12. portf_map <- function(t, data_ml, ticks, t_oos, m_offset, train_size, weight_func){ train_data <- data_ml %>% filter(date < t_oos[t] - m_offset * 30, # Roll. window w. buffer date > t_oos[t] - m_offset * 30 - 365 * train_size) test_data <- data_ml %>% filter(date == t_oos[t]) # Test set realized_returns <- test_data %>% # Computing returns via: dplyr::select(R1M_Usd) # 1M holding period! temp_weights <- weight_func(train_data, test_data, features) # Weights => recycled! ind <- match(temp_weights$names, ticks) %>% na.omit() # Index of test assets x <- c() x$weights <- rep(0, length(ticks)) # Empty weights x$weights[ind] <- temp_weights$weights # Locate weights correctly x$returns <- sum(temp_weights$weights * realized_returns) # Compute returns return(x) } Next, we combine this function with map(). We only test the first three dates: this reduces the computation time. back_test <- 1:3 %>% # Test on the first three out-of-sample dates map(portf_map, data_ml = data_ml, ticks = ticks, t_oos = t_oos, m_offset = 1, train_size = 5, weight_func = weights_xgb) ## [14:43:55] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror. ## [14:44:04] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror. ## [14:44:14] WARNING: amalgamation/../src/objective/regression_obj.cu:174: reg:linear is now deprecated in favor of reg:squarederror. head(back_test[[1]]$weights) # Sample weights ## [1] 0.001675042 0.000000000 0.000000000 0.001675042 0.000000000 0.001675042 back_test[[1]]$returns # Return of first period ## [1] 0.0189129 Each element of back_test is a list with two components: the portfolio weights and the returns. To access the data easily, functions like melt from the package reshape2 are useful. 18.9 Chapter 15 We recycle the AE model trained in Chapter 15. Strangely, building smaller models (the encoder) from larger ones (the AE) requires saving and then reloading the weights. This creates an external file, which we call “ae_weights”. We can check that the output does have 4 columns (compressed) instead of 7 (original data).
save_model_weights_hdf5(object = ae_model, filepath = "ae_weights.hdf5", overwrite = TRUE) encoder_model <- keras_model(inputs = input_layer, outputs = encoder) encoder_model %>% load_model_weights_hdf5(filepath = "ae_weights.hdf5", skip_mismatch = TRUE, by_name = TRUE) encoder_model %>% compile( loss = 'mean_squared_error', optimizer = 'adam', metrics = c('mean_absolute_error') ) encoder_model %>% keras::predict_on_batch(x = training_sample %>% dplyr::select(features_short) %>% as.matrix()) %>% head(5) ## [,1] [,2] [,3] [,4] ## [1,] -0.9409834 0.04998749 -0.9673178 -0.3220703 ## [2,] -0.9425045 0.06579173 -0.9573155 -0.2911604 ## [3,] -0.9664346 0.02962989 -0.9694159 -0.3387784 ## [4,] -0.9694507 0.02313471 -0.9734578 -0.3428233 ## [5,] -0.9723647 0.01510313 -0.9802681 -0.3430010 18.10 Chapter 16 All we need to do is change the rho coefficient in the code of Chapter 16. set.seed(42) # Fixing the random seed n_sample <- 10^5 # Number of samples generated rho <- (-0.8) # Autoregressive parameter sd <- 0.4 # Std. dev. of noise a <- 0.06 * rho # Scaled mean of returns data_RL3 <- tibble(returns = a/rho + arima.sim(n = n_sample, # Returns via AR(1) simulation list(ar = rho), sd = sd), action = round(runif(n_sample)*4)/4) %>% # Random action (portfolio) mutate(new_state = if_else(returns < 0, "neg", "pos"), # Coding of state reward = returns * action, # Reward = portfolio return state = lag(new_state), # Next state action = as.character(action)) %>% na.omit() # Remove one missing state The learning can then proceed. control <- list(alpha = 0.1, # Learning rate gamma = 0.7, # Discount factor for rewards epsilon = 0.1) # Exploration rate fit_RL3 <- ReinforcementLearning(data_RL3, # Main RL function s = "state", a = "action", r = "reward", s_new = "new_state", control = control) print(fit_RL3) # Show the output ## State-Action function Q ## 0.25 0 1 0.75 0.5 ## neg 0.7107268 0.5971710 1.4662416 0.9535698 0.8069591 ## pos 0.7730842 0.7869229 0.4734467 0.4258593 0.6257039 ## ## Policy ## neg pos ## "1" "0" ## ## Reward (last iteration) ## [1] 3013.162 In this case, the constantly switching feature of the return process changes the outcome. The negative state is associated with large profits when the portfolio is fully invested, while the positive state has the best average reward when the agent refrains from investing. For the second exercise, the trick is to define all possible actions, that is, all combinations of (+1, 0, -1) for the two assets on all dates. We recycle the data from Chapter 16. pos_3 <- c(-1,0,1) # Possible alloc. to asset 3 pos_4 <- c(-1,0,1) # Possible alloc. to asset 4 pos <- expand_grid(pos_3, pos_4) # All combinations pos <- bind_cols(pos, id = 1:nrow(pos)) # Adding combination id ret_pb_RL <- bind_cols(r3 = return_3, r4 = return_4, # Returns & P/B dataframe pb3 = pb_3, pb4 = pb_4) data_RL4 <- sapply(ret_pb_RL, # Combining return & positions rep.int, times = nrow(pos)) %>% data.frame() %>% bind_cols(id = rep(1:nrow(pos), 1, each = length(return_3))) %>% left_join(pos) %>% dplyr::select(-id) %>% mutate(action = paste(pos_3, pos_4), # Uniting actions pb3 = round(5 * pb3), # Simplifying states pb4 = round(5 * pb4), # Simplifying states state = paste(pb3, pb4), # Uniting states reward = pos_3*r3 + pos_4*r4, # Computing rewards new_state = lead(state)) %>% # Infer new state dplyr::select(-pb3, -pb4, -pos_3, # Remove superfluous vars. -pos_4, -r3, -r4) We can then plug this data into the RL function.
fit_RL4 <- ReinforcementLearning(data_RL4, # Main RL function s = "state", a = "action", r = "reward", s_new = "new_state", control = control) fit_RL4$Q <- round(fit_RL4$Q, 3) # Round the Q-matrix print(fit_RL4) # Show the output ## State-Action function Q ## 0 0 0 1 0 -1 -1 -1 -1 0 -1 1 1 -1 1 0 1 1 ## 0 2 0.000 0.000 0.002 -0.017 -0.018 -0.020 0.023 0.025 0.024 ## 0 3 0.001 -0.005 0.007 -0.013 -0.019 -0.026 0.031 0.027 0.021 ## 3 1 0.003 0.003 0.003 0.002 0.002 0.003 0.002 0.002 0.003 ## 2 1 0.027 0.038 0.020 0.004 0.015 0.039 0.013 0.021 0.041 ## 2 2 0.021 0.014 0.027 0.038 0.047 0.045 -0.004 -0.011 -0.016 ## 2 3 0.007 0.006 0.008 0.054 0.057 0.056 -0.041 -0.041 -0.041 ## 1 1 0.027 0.054 0.005 -0.031 -0.005 0.041 0.025 0.046 0.072 ## 1 2 0.019 0.020 0.020 0.015 0.023 0.029 0.012 0.014 0.023 ## 1 3 0.008 0.019 0.000 -0.036 -0.027 -0.016 0.042 0.053 0.060 ## ## Policy ## 0 2 0 3 3 1 2 1 2 2 2 3 1 1 1 2 1 3 ## "1 0" "1 -1" "0 -1" "1 1" "-1 0" "-1 0" "1 1" "-1 1" "1 1" ## ## Reward (last iteration) ## [1] 0 The matrix is less sparse compared to the one of Chapter 16; we have covered much more ground! Some policy recommendations have not changed compared to the smaller sample, but some have! The change occurs for the states for which only a few points were available in the first trial. With more data, the decision is altered. "]]
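To substantiate the claim that much more ground is covered, a quick check (assuming RL_data from Section 16.4.2 and data_RL4 above are still in memory) is to count the distinct state-action pairs observed in each dataset:
library(dplyr)
RL_data  %>% distinct(state, action) %>% nrow()   # Pairs observed with random actions only
data_RL4 %>% distinct(state, action) %>% nrow()   # Pairs observed once all 9 action combinations are added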