#install.packages("tidyverse")
#install.packages("psych")
#install.packages("lightgbm")
#install.packages("iml")
Stock Market Prediction
Project :
Regression task, where we want to predict stock returns (the last column) from panel-type data.
Import library
Data wrangling
First, let’s load the data & the packages.
library(tidyverse)
library(dplyr) # for data manipulation
library(ggplot2) # For the plots
library(gridExtra) # to arrange 2 graphs in one row
library(psych) # for comprehensive summary
library(rlang) # for in-built functions
library(forecast) # for forecasting future values
library(lightgbm) #for the model lightgbm
Let’s load the dataset that comes in RData format.
load('stocks_clean.RData') # loading the RData file
return <- stocks_clean # and assigning it to a variable named return
rm(stocks_clean) # now that return is assigned, remove stocks_clean to save memory
dim(return) # Dimension of the dataframe
[1] 289271 13
head(return) # Look at the first and last observations of the dataframe
ticker | date | price | market_cap | price_to_book | debt_to_equity | profitability | volatility | revenue | ghg_s1 | ghg_s2 | ghg_s3 | return |
---|---|---|---|---|---|---|---|---|---|---|---|---|
AAON US Equity | 1995-12-31 | 0.5048 | 35.1440 | 2.5948 | 85.9073 | 0.8628 | 71.728 | 14.720 | NA | NA | NA | -0.0980883 |
AAON US Equity | 1996-01-31 | 0.4719 | 32.8520 | 2.4256 | 85.9073 | 3.0722 | 63.087 | 67.346 | NA | NA | NA | -0.0651743 |
AAON US Equity | 1996-02-29 | 0.5048 | 35.1440 | 2.5948 | 85.9073 | 3.0722 | 97.639 | 67.346 | NA | NA | NA | 0.0697182 |
AAON US Equity | 1996-03-31 | 0.4170 | 29.0367 | 2.0805 | 65.1878 | 3.1180 | 100.450 | 13.438 | NA | NA | NA | -0.1739303 |
AAON US Equity | 1996-04-30 | 0.3841 | 26.7444 | 1.9162 | 65.1878 | 3.1180 | 76.133 | 13.438 | NA | NA | NA | -0.0788969 |
AAON US Equity | 1996-05-31 | 0.3951 | 27.5445 | 1.9710 | 65.1878 | 3.1180 | 88.304 | 13.438 | NA | NA | NA | 0.0286384 |
tail(return)
ticker | date | price | market_cap | price_to_book | debt_to_equity | profitability | volatility | revenue | ghg_s1 | ghg_s2 | ghg_s3 | return |
---|---|---|---|---|---|---|---|---|---|---|---|---|
ZIXI US Equity | 2022-07-29 | 8.485 | 481.8671 | 14.1059 | 146.8623 | -3.7101 | 4.118 | 64.85 | NA | NA | NA | 0 |
ZIXI US Equity | 2022-08-31 | 8.485 | 481.8671 | 14.1059 | 146.8623 | -3.7101 | 4.118 | 64.85 | NA | NA | NA | 0 |
ZIXI US Equity | 2022-09-30 | 8.485 | 481.8671 | 14.1059 | NA | NA | 4.118 | NA | NA | NA | NA | 0 |
ZIXI US Equity | 2022-10-31 | 8.485 | 481.8671 | 14.1059 | NA | NA | 4.118 | NA | NA | NA | NA | 0 |
ZIXI US Equity | 2022-11-30 | 8.485 | 481.8671 | 14.1059 | NA | NA | 4.118 | NA | NA | NA | NA | 0 |
ZIXI US Equity | 2022-12-30 | 8.485 | 481.8671 | 14.1059 | NA | NA | 4.118 | NA | NA | NA | NA | 0 |
We can see that there are 289271 observations (rows) and 13 variables (columns), so the dataset is quite large. The last variable, “return”, is the target variable and is conveniently positioned in the last column.
Structure and statistics of the dataset
Structure
str(return[,1:13]) # Provides the structure , datatype
tibble [289,271 × 13] (S3: tbl_df/tbl/data.frame)
$ ticker : chr [1:289271] "AAON US Equity" "AAON US Equity" "AAON US Equity" "AAON US Equity" ...
$ date : Date[1:289271], format: "1995-12-31" "1996-01-31" ...
$ price : num [1:289271] 0.505 0.472 0.505 0.417 0.384 ...
$ market_cap : num [1:289271] 35.1 32.9 35.1 29 26.7 ...
$ price_to_book : num [1:289271] 2.59 2.43 2.59 2.08 1.92 ...
$ debt_to_equity: num [1:289271] 85.9 85.9 85.9 65.2 65.2 ...
$ profitability : num [1:289271] 0.863 3.072 3.072 3.118 3.118 ...
$ volatility : num [1:289271] 71.7 63.1 97.6 100.5 76.1 ...
$ revenue : num [1:289271] 14.7 67.3 67.3 13.4 13.4 ...
$ ghg_s1 : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
$ ghg_s2 : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
$ ghg_s3 : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
$ return : num [1:289271] -0.0981 -0.0652 0.0697 -0.1739 -0.0789 ...
The data contains character, Date and numeric columns. Let’s get a precise count using the class function.
sapply(return, class) |> table() |> head()
character Date numeric
1 1 11
So the dataset is mainly composed of numeric variables (11 columns), plus one character column (ticker) and one Date column (date). We would therefore need one-hot encoding to turn the character column into numbers if we want to use it as an input in our ML algorithm.
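As a rough illustration of one-hot encoding, the sketch below uses base R's model.matrix() on a small made-up data frame. The toy tickers and prices are assumptions for illustration only and do not come from stocks_clean.RData.

```r
# Toy data frame standing in for the real panel (values are made up)
toy <- data.frame(
  ticker = c("AAA", "BBB", "CCC", "AAA"),
  price  = c(10.0, 20.5, 5.2, 10.4)
)

# model.matrix() expands a factor into indicator columns;
# "- 1" drops the intercept so every level gets its own column
toy$ticker <- factor(toy$ticker)
one_hot <- model.matrix(~ ticker - 1, data = toy)

# Bind the indicator columns to the numeric feature
encoded <- cbind(one_hot, price = toy$price)
print(encoded)
```

On the real data this would create one column per distinct ticker, which may be impractical for a panel with hundreds of tickers; it is shown here only to make the idea concrete.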
Statistics
statistics <- summary(return) # provides a summary of statistics for each column
statistics_df <- as.data.frame(statistics) # converting to a data frame for a better view
# Then use the describe function
describe_stats <- describe(statistics_df)
describe_stats
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Var1* | 1 | 91 | 1.00000 | 0.000000 | 1.0 | 1.00000 | 0.0000 | 1 | 1 | 0 | NaN | NaN | 0.0000000 |
Var2* | 2 | 91 | 7.00000 | 3.762387 | 7.0 | 7.00000 | 4.4478 | 1 | 13 | 12 | 0.0000000 | -1.253316 | 0.3944053 |
Freq* | 3 | 84 | 41.82143 | 23.494763 | 42.5 | 42.01471 | 30.3933 | 1 | 81 | 80 | -0.0697899 | -1.261956 | 2.5634888 |
summary(return$return) # descriptive summary of the predictive variable
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.999900 -0.046077 0.007165 0.011707 0.061672 4.140351
psych::describe(return) # more details
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
variable | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ticker* | 1 | 289271 | 442.8751862 | 2.554853e+02 | 443.0000000 | 442.8453398 | 327.6546000 | 1.0000e+00 | 8.850000e+02 | 8.840000e+02 | -0.0003507 | -1.198849 | 0.4750218 |
date | 2 | 289271 | NaN | NA | NA | NaN | NA | Inf | -Inf | -Inf | NA | NA | NA |
price | 3 | 289271 | 47.5231056 | 1.358301e+02 | 24.6000000 | 29.2883421 | 21.9721320 | 1.0000e-06 | 5.908870e+03 | 5.908870e+03 | 18.4437518 | 485.375142 | 0.2525477 |
market_cap | 4 | 285271 | 14358.8671009 | 5.284050e+04 | 1783.9113000 | 4689.8198910 | 2431.9996634 | 1.2000e-02 | 2.913284e+06 | 2.913284e+06 | 20.4232946 | 762.572413 | 98.9322942 |
price_to_book | 5 | 269118 | 6.9235647 | 3.650293e+02 | 2.0421000 | 2.3796616 | 1.3192175 | 4.0000e-04 | 1.239516e+05 | 1.239516e+05 | 292.0519380 | 88943.404460 | 0.7036495 |
debt_to_equity | 6 | 279183 | 202.9767215 | 4.641365e+03 | 52.8601000 | 63.9115311 | 60.8140281 | 0.0000e+00 | 1.188137e+06 | 1.188137e+06 | 187.9389783 | 46359.668753 | 8.7841811 |
profitability | 7 | 283169 | -238.4942549 | 2.270789e+04 | 7.3622000 | 8.8191991 | 8.0604514 | -3.3007e+06 | 3.703000e+05 | 3.671000e+06 | -133.5345377 | 18751.861019 | 42.6730831 |
volatility | 8 | 289084 | 39.1337053 | 3.991481e+01 | 30.7070000 | 33.5309748 | 15.8830938 | 0.0000e+00 | 5.137323e+03 | 5.137323e+03 | 29.3463638 | 2549.313940 | 0.0742373 |
revenue | 9 | 284166 | 19533.5954801 | 4.259334e+05 | 443.5000000 | 1153.5807626 | 611.3115624 | -2.7966e+04 | 3.137951e+07 | 3.140747e+07 | 40.9380333 | 2104.805636 | 799.0163901 |
ghg_s1 | 10 | 37575 | 5209.6830526 | 1.565725e+04 | 212.1960000 | 1376.6898118 | 305.4215304 | 0.0000e+00 | 1.455000e+05 | 1.455000e+05 | 5.0702658 | 30.350285 | 80.7729651 |
ghg_s2 | 11 | 35717 | 1138.8577400 | 2.507859e+03 | 279.6060000 | 534.8833402 | 373.5054876 | 0.0000e+00 | 2.900000e+04 | 2.900000e+04 | 4.6730604 | 28.919859 | 13.2698360 |
ghg_s3 | 12 | 24404 | 30353.2335338 | 1.028987e+05 | 549.5130000 | 5531.2176450 | 805.1718906 | 0.0000e+00 | 1.169970e+06 | 1.169970e+06 | 5.0776069 | 29.746759 | 658.6875273 |
return | 13 | 289271 | 0.0117065 | 1.241597e-01 | 0.0071651 | 0.0079437 | 0.0798772 | -9.9990e-01 | 4.140351e+00 | 5.140251e+00 | 3.2588582 | 61.246547 | 0.0002308 |
# Calculate missing values
sapply(return, function(x) sum(is.na(x)))
ticker date price market_cap price_to_book
0 0 0 4000 20153
debt_to_equity profitability volatility revenue ghg_s1
10088 6102 187 5105 251696
ghg_s2 ghg_s3 return
253554 264867 0
# Calculate the threshold for missing values (85% of the rows)
threshold <- nrow(return) * 0.85
# Keep only the columns whose number of missing values is below the threshold
return <- return |>
  select_if(~sum(is.na(.)) < threshold)
# Replace NA values with the mean of each column with missing values
return <- return |>
  mutate(market_cap = ifelse(is.na(market_cap), mean(market_cap, na.rm = TRUE), market_cap),
         price_to_book = ifelse(is.na(price_to_book), mean(price_to_book, na.rm = TRUE), price_to_book),
         debt_to_equity = ifelse(is.na(debt_to_equity), mean(debt_to_equity, na.rm = TRUE), debt_to_equity),
         profitability = ifelse(is.na(profitability), mean(profitability, na.rm = TRUE), profitability),
         volatility = ifelse(is.na(volatility), mean(volatility, na.rm = TRUE), volatility),
         revenue = ifelse(is.na(revenue), mean(revenue, na.rm = TRUE), revenue))
sapply(return, function(x) sum(is.na(x)))
ticker date price market_cap price_to_book
0 0 0 0 0
debt_to_equity profitability volatility revenue return
0 0 0 0 0
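The two cleaning steps above (dropping sparse columns, then mean imputation) can also be written more compactly. The following is only a sketch on made-up data: the toy tibble and the 50% cut-off are assumptions chosen so that the example is easy to follow.

```r
library(dplyr)

# Toy data with missing values (made-up numbers, not the real dataset)
toy <- tibble(
  market_cap = c(100, NA, 300, 400),
  revenue    = c(NA, 20, 30, NA),
  ghg_s1     = c(NA, NA, NA, 5)
)

# Share of missing values per column
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Keep columns with at most 50% missing values (cut-off chosen for this toy example)
kept <- toy[, missing_share <= 0.5]

# Mean-impute every remaining numeric column in one step
imputed <- kept |>
  mutate(across(where(is.numeric), ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))
print(imputed)
```

Using across() avoids repeating the same ifelse() expression for each column, at the cost of imputing every numeric column rather than a hand-picked list.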
Couple of points to note :
- The market_cap variable has a very high standard deviation relative to the mean (about 52,840 versus 14,359), indicating that the dataset includes companies of vastly different sizes, from small caps to large caps.
- The price_to_book ratio shows significant variability as well, with a maximum of about 123,952 against a median of about 2.04, suggesting that a few companies are valued extremely highly relative to their book value.
- debt_to_equity ranges from 0 to more than 1.18 million, with a median of about 52.9. The extreme ratios might indicate more complex financial structures or situations where shareholders’ equity is close to zero.
- The profitability variable has an exceptionally wide range, with the minimum being a large negative number and the maximum being very high, indicating some companies are highly profitable while others are incurring substantial losses.
- revenue has a high skewness value (about 40.9), suggesting the inclusion of companies with massive differences in sales figures, from negative values (losses/revenue deductions) to substantial sales.
- The greenhouse gas variables (ghg_s1, ghg_s2 and ghg_s3) have far fewer non-missing values than the other columns (between 87% and 92% of their values are missing), which could suggest data collection challenges or varying reporting standards across companies. We decided to remove these columns because this level of sparsity (above the 85% threshold) provides little to no added value for our predictive accuracy and could potentially distort the outcome, leading to unreliable predictions.
- The return variable has a mean close to zero (about 0.012) and a standard deviation of about 0.124, but it is strongly right-skewed, with values ranging from -0.9999 to 4.14.
- The remaining missing values were replaced by the mean of each column.
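The columns ghg_s1, ghg_s2 and ghg_s3 have been removed by the threshold filter above. The columns ticker and date are identifiers rather than predictive features; they are kept for now because the per-ticker analysis below relies on them.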
Uni-variate Analysis
Categorical Values
The only categorical variable is “ticker”, and we will analyse the largest tickers by total market cap.
# Aggregated market_cap by ticker
aggregated_data <- return |>
  group_by(ticker) |>
  summarise(total_market_cap = sum(market_cap, na.rm = TRUE))
# Filter for the top 20 tickers by market cap
n_top <- 20
aggregated_data_top_n <- aggregated_data |>
  top_n(n = n_top, wt = total_market_cap)
# Plot only the top N tickers
ggplot(aggregated_data_top_n, aes(x = reorder(ticker, -total_market_cap), y = total_market_cap)) +
geom_bar(stat = "identity", fill = "skyblue") +
theme_minimal() +
labs(title = "Total Market Cap by Ticker", x = "Ticker", y = "Total Market Cap") +
coord_flip() # Flip coordinates for horizontal bars
For positive variables with very heavy tails, a log scale helps: use the scale_x_log10() layer for histograms, or scale_y_log10() for the bar chart below :
# Plot only the top N tickers with log-transformed y-axis
ggplot(aggregated_data_top_n, aes(x = reorder(ticker, -total_market_cap), y = total_market_cap)) +
geom_bar(stat = "identity", fill = "skyblue") +
scale_y_log10() + # Log-transform the y-axis
theme_minimal() +
labs(title = "Total Market Cap by Ticker (Log Scale)", x = "Ticker", y = "Log of Total Market Cap") +
coord_flip() # Flip coordinates for horizontal bars
# Using the 10 largest tickers by total market cap
top_tickers_data <- aggregated_data |>
  slice_max(total_market_cap, n = 10)
# Creating a pie chart
ggplot(top_tickers_data, aes(x = "", y = total_market_cap, fill = ticker)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y") + # Convert the bar chart to a pie chart
theme_void() + # Remove background and axes
labs(title = "Market Cap Distribution Among Top Tickers") +
theme(legend.title = element_blank()) # Hide the legend title
So we observe that :
- In the bar charts, coord_flip() is used to flip the chart for better readability, especially when dealing with many tickers.
- Pie charts are not ideal for datasets with many categories because they can become cluttered and difficult to interpret, so we use only the top 10 tickers for simplicity.
Numerical Values
plot_histogram_boxplot <- function(data, var_name) {
  # Convert the variable name to a symbol for ggplot
  var <- sym(var_name)
  # Create a histogram plot
  p1 <- ggplot(data, aes(x = !!var)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    theme_minimal() +
    labs(title = paste("Histogram of", var_name), x = var_name, y = "Count")
  # Create a boxplot
  p2 <- ggplot(data, aes(x = "", y = !!var)) +
    geom_boxplot(fill = "tomato", color = "black") +
    theme_minimal() +
    labs(title = paste("Boxplot of", var_name), x = "", y = var_name)
  # Combine plots using grid.arrange
  grid.arrange(p1, p2, ncol = 2)
}
# List of quantitative variable names to plot
quantitative_vars <- c("price", "market_cap", "price_to_book", "debt_to_equity",
                       "profitability", "volatility", "revenue", "return")
# Loop through each variable and plot
for(var_name in quantitative_vars) {
print(paste("Plotting for:", var_name)) # Print the variable name being plotted
plot_histogram_boxplot(return, var_name)
}
[1] "Plotting for: price"
[1] "Plotting for: market_cap"
[1] "Plotting for: price_to_book"
[1] "Plotting for: debt_to_equity"
[1] "Plotting for: profitability"
[1] "Plotting for: volatility"
[1] "Plotting for: revenue"
[1] "Plotting for: return"
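For heavy-tailed positive variables such as market_cap, a histogram on a log scale is often easier to read than the raw one. The sketch below shows the scale_x_log10() layer on simulated log-normal data; the simulated values are an assumption standing in for the real column.

```r
library(ggplot2)

set.seed(1)
# Simulated heavy-tailed positive variable standing in for market_cap (assumption)
toy <- data.frame(market_cap = rlnorm(1000, meanlog = 7, sdlog = 2))

# Same histogram style as above, with the x-axis on a log10 scale
p_log <- ggplot(toy, aes(x = market_cap)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "black") +
  scale_x_log10() +
  theme_minimal() +
  labs(title = "Histogram of market_cap (log scale)",
       x = "market_cap (log10 scale)", y = "Count")
print(p_log)
```

Note that a log scale only applies to strictly positive values, so it would not suit variables such as profitability or return, which take negative values.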
Modeling using ARIMA
ARIMA models are well-suited for time series data that show patterns over time and can be made stationary through differencing. They are useful for forecasting future points in a series based on its own past values (autoregression) and a moving average of past errors.
- Use case for ARIMA: the return variable shows autocorrelation over time, which means past returns are predictive of future returns.
Due to the large number of stocks, we will explore only a few tickers to keep things simple.
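Whether a return series is autocorrelated can be checked before fitting anything. The sketch below uses base R's acf() and the Ljung-Box test from Box.test() on a simulated AR(1) series; the simulated series is an assumption standing in for one ticker's returns.

```r
set.seed(42)
# Simulated AR(1) series standing in for one ticker's returns (assumption)
sim_returns <- as.numeric(arima.sim(model = list(ar = 0.6), n = 300)) * 0.05

# Sample autocorrelations up to lag 12 (plot = FALSE returns the numbers only)
acf_values <- acf(sim_returns, lag.max = 12, plot = FALSE)
print(round(acf_values$acf[2:4], 3))   # lags 1, 2 and 3 (element 1 is lag 0)

# Ljung-Box test: the null hypothesis is "no autocorrelation up to lag 12"
lb_test <- Box.test(sim_returns, lag = 12, type = "Ljung-Box")
print(lb_test$p.value)
```

A small p-value rejects the hypothesis of no autocorrelation, which is the situation where an ARIMA model has something to exploit. For a series with no autocorrelation, auto.arima() would typically fall back to a model with no AR or MA terms.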
Example 1 - “AAPL US Equity” ticker
Data Preprocessing
ticker_name <- "AAPL US Equity" # Replace with the ticker you want to predict
# Filter the data for the selected ticker
ticker_data <- return |> filter(ticker == ticker_name)
# Order the data by date
ticker_data <- ticker_data |> arrange(date)
Split the Data
# Calculate the splitting index
split_index <- floor(0.8 * nrow(ticker_data))
# Create training and testing datasets (chronological split: first 80% for training)
train_data <- ticker_data[1:split_index, ]
test_data <- ticker_data[(split_index + 1):nrow(ticker_data), ]
Fit the ARIMA Model
# Fit an ARIMA model
return_arima <- auto.arima(train_data$return)
summary(return_arima)
Series: train_data$return
ARIMA(1,0,1) with non-zero mean
Coefficients:
ar1 ma1 mean
0.6134 -0.5783 0.0265
s.e. 1.1523 1.1973 0.0085
sigma^2 = 0.01625: log likelihood = 169.39
AIC=-330.77 AICc=-330.62 BIC=-316.5
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.0001031105 0.1267608 0.09507222 -Inf Inf 0.7166279 0.0004694282
Predicting Future Returns
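# Make predictions
predictions <- forecast(return_arima, h = nrow(test_data)) # 'h' is the number of periods to predict
# Plot the predictions against the actual test data
plot(predictions)
# The forecast is indexed by observation number, so the test returns are drawn on the same index
lines(nrow(train_data) + seq_len(nrow(test_data)), test_data$return, col = "red")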
The blue line in the graph represents the forecasted values generated by the ARIMA model for the specified future periods (as defined by h = nrow(test_data)). It is the model’s best estimate of the time series’ central tendency moving forward.
Representation :
- Black Line : This is the historical time series data upon which the model was trained.
- Blue Line : This line is the point forecast from the ARIMA model, indicating the expected value of the series at each future point.
- Grey Shaded Area : This depicts the prediction intervals around the point forecasts (80% and 95% by default in the forecast package), representing the uncertainty in the forecasts. The lighter, wider band is the 95% interval and the darker, narrower band is the 80% interval.
In the graph, the blue line shows where the model predicts the return will be, on average, for the next h periods. The shaded area around it indicates the level of confidence the model has in its predictions; the actual future values are expected to fall within this range most of the time, given the model assumptions hold true.
The range of 0 to 300 on the x-axis of the ARIMA forecast plot represents the index of the observations in the time series data. This is a common default in time series plots when the time series object doesn’t have an associated time/date attribute or when the plotting function isn’t explicitly told to use a date variable for the x-axis. The y-axis, labeled return, represents the values of the variable being forecast by the ARIMA(1,0,1) model.
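One way to get calendar dates on the x-axis is to wrap the returns in a ts object with a start date and a frequency before fitting. The sketch below uses simulated monthly returns; the start date of December 1995 and the series itself are assumptions for illustration.

```r
set.seed(7)
# Simulated monthly returns standing in for train_data$return (assumption)
monthly_returns <- rnorm(36, mean = 0.01, sd = 0.1)

# Attach a monthly calendar: the first observation is December 1995 (hypothetical start)
returns_ts <- ts(monthly_returns, start = c(1995, 12), frequency = 12)

print(start(returns_ts))   # year and month of the first observation
print(end(returns_ts))     # year and month of the last observation

# The x-axis now shows years instead of an observation index
plot(returns_ts, ylab = "return", xlab = "date")
```

Passing such a ts object to auto.arima() would make the forecast plot use the same time axis. With frequency = 12, auto.arima() may also consider seasonal terms, so the selected model could differ from the one fitted on the plain vector.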
Performance Evaluation
# Calculate accuracy metrics
accuracy(predictions, test_data$return)
ME RMSE MAE MPE MAPE MASE
Training set 0.0001031105 0.12676080 0.09507222 -Inf Inf 0.7166279
Test set -0.0001558202 0.09022325 0.07685722 114.7769 124.9835 0.5793283
ACF1
Training set 0.0004694282
Test set NA
In the evaluation, the -Inf and Inf values for MPE and MAPE indicate that some actual values are zero or very close to zero. In such cases, these percentage errors become infinite or undefined, which makes these metrics unsuitable for this dataset.
The RMSE value of 0.0902 for the test set indicates that the forecast errors are moderate, and ideally, we would look for a lower RMSE for better forecast accuracy. The MAE is measured at 0.0769 for the test set, which similarly points to moderate errors; as with RMSE, a lower MAE would mean the predictions are closer to the actual values. The MASE stands at 0.579 for the test set, which is less than 1, suggesting that the forecasting model is performing better than a naive benchmark model.
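To make these metrics concrete, the sketch below computes RMSE, MAE, MASE and MAPE by hand on a few made-up numbers. It assumes the non-seasonal definition of MASE, where the test MAE is scaled by the in-sample MAE of the one-step naive forecast.

```r
# Made-up numbers for illustration only
train     <- c(0.02, -0.01, 0.03, 0.00, 0.01)  # in-sample returns
actual    <- c(0.01, -0.02, 0.04)              # test-period returns
predicted <- c(0.01,  0.01, 0.01)              # constant forecast

errors <- actual - predicted

rmse <- sqrt(mean(errors^2))      # penalises large errors more heavily
mae  <- mean(abs(errors))         # average absolute error

# MASE: test MAE divided by the in-sample MAE of the naive "last value" forecast
naive_mae <- mean(abs(diff(train)))
mase <- mae / naive_mae

# MAPE divides by the actual values, so a single zero return makes it infinite
actual_with_zero <- c(0.01, 0, 0.04)
mape_zero <- 100 * mean(abs((actual_with_zero - predicted) / actual_with_zero))

cat("RMSE:", rmse, " MAE:", mae, " MASE:", mase, " MAPE:", mape_zero, "\n")
```

A MASE below 1 means the forecast errors are smaller, on average, than those of the naive forecast measured on the training data.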
Example 2 - “ZIXI US Equity” ticker
Data Preprocessing
ticker_name <- "ZIXI US Equity" # Replace with the ticker you want to predict
# Filter the data for the selected ticker
ticker_data <- return |> filter(ticker == ticker_name)
# Order the data by date
ticker_data <- ticker_data |> arrange(date)
Split the Data
# Calculate the splitting index
split_index <- floor(0.8 * nrow(ticker_data))
# Create training and testing datasets (chronological split: first 80% for training)
train_data <- ticker_data[1:split_index, ]
test_data <- ticker_data[(split_index + 1):nrow(ticker_data), ]
Fit the ARIMA Model
# Fit an ARIMA model
return_arima <- auto.arima(train_data$return)
summary(return_arima)
Series: train_data$return
ARIMA(0,0,0) with non-zero mean
Coefficients:
mean
0.0278
s.e. 0.0162
sigma^2 = 0.06814: log likelihood = -19.22
AIC=42.44 AICc=42.49 BIC=49.56
Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 3.956344e-14 0.260537 0.1697435 -Inf Inf 0.6678411 -0.0007031951
Predicting Future Returns
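# Make predictions
predictions <- forecast(return_arima, h = nrow(test_data)) # 'h' is the number of periods to predict
# Plot the predictions against the actual test data
plot(predictions)
# The forecast is indexed by observation number, so the test returns are drawn on the same index
lines(nrow(train_data) + seq_len(nrow(test_data)), test_data$return, col = "red")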
Performance Evaluation
# Calculate accuracy metrics
accuracy(predictions, test_data$return)
ME RMSE MAE MPE MAPE MASE
Training set 3.956344e-14 0.2605370 0.16974349 -Inf Inf 0.6678411
Test set -1.277648e-02 0.1217997 0.08573132 -Inf Inf 0.3373024
ACF1
Training set -0.0007031951
Test set NA
The selected model, ARIMA(0,0,0) with non-zero mean, forecasts a constant value (the training mean of 0.0278) for every period. The RMSE on the test set is 0.122, which gives an idea of the typical size of the forecast errors. The MASE being less than one (0.337) indicates that the model performs better than a naive model on the test set.
Modeling using lightGBM
Data Splitting
In this section, we fit a boosted tree from the lightGBM package. We split the data into a training set and a test set; for simplicity, the separate validation set created below is set aside and not used here.
# Setting seed for reproducibility
set.seed(123)
# Data preparation
train_indices <- sample(1:nrow(return), 0.8 * nrow(return)) # 80% for training
train_data <- return[train_indices, ]
remaining_indices <- setdiff(1:nrow(return), train_indices) # Indices not in training set
# Test set
test_size <- 0.2 # 20% of the remaining rows for testing
test_indices <- sample(remaining_indices, test_size * length(remaining_indices))
test_data <- return[test_indices, ]
# Future validation set
remaining_indices <- setdiff(remaining_indices, test_indices) # Remove test indices
validation_indices <- sample(remaining_indices, 0.2 * length(remaining_indices)) # 20% of remaining data for validation
validation_data <- return[validation_indices, -which(names(return) == "return")] # Remove 'return' column
# Convert data to LightGBM format
train_data_lgb <- lgb.Dataset(data = as.matrix(train_data[, -which(names(train_data) == "return")]), label = train_data$return)
Warning in storage.mode(data) <- "double": NAs introduced by coercion
# Use the same feature columns as the training matrix (drop the target, not the first column)
test_data_lgb <- lgb.Dataset(data = as.matrix(test_data[, -which(names(test_data) == "return")]), label = test_data$return, reference = train_data_lgb)
Warning in storage.mode(data) <- "double": NAs introduced by coercion
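The coercion warnings above arise because as.matrix() on a data frame that mixes character, Date and numeric columns returns a character matrix, and the non-numeric entries then become NA when converted to double. A possible way to avoid this, sketched on a made-up data frame, is to keep only the numeric feature columns before building the matrix.

```r
# Toy data frame mixing column types, as in the real panel (values are made up)
toy <- data.frame(
  ticker = c("AAA", "BBB", "CCC"),
  date   = as.Date(c("2020-01-31", "2020-02-29", "2020-03-31")),
  price  = c(10, 20, 30),
  return = c(0.01, -0.02, 0.03)
)

# as.matrix() on mixed types yields a character matrix, hence the coercion warnings
print(typeof(as.matrix(toy)))

# Keep numeric columns only and drop the target before building the feature matrix
is_num   <- vapply(toy, is.numeric, logical(1))
features <- as.matrix(toy[, is_num & names(toy) != "return", drop = FALSE])
print(typeof(features))
print(colnames(features))
```

Selecting columns by name or type, rather than by position, also keeps the training and test matrices aligned on the same features.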
# Define LightGBM parameters
train_params <- list(
  objective = "regression", # Regression task
  metric = "rmse", # Root Mean Square Error as the evaluation metric
  num_leaves = 31, # Maximum number of leaves in one tree
  learning_rate = 0.1, # Learning rate
  feature_fraction = 0.8, # Fraction of features sampled for each tree
  bagging_fraction = 0.8, # Fraction of rows sampled at each bagging round
  bagging_freq = 5, # Perform bagging every 5 iterations
  verbose = -1 # No print updates
)
Training
lgb_model <- lgb.train(params = train_params,
                       data = train_data_lgb,
                       nrounds = 100, # Number of boosting iterations (trees)
                       valids = list(validation = test_data_lgb),
                       early_stopping_rounds = 10) # Early stopping
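In the call above, early stopping monitors the test set. A common alternative is to monitor a separate validation set so that the test set is only used once, for the final evaluation. The sketch below illustrates that variant on simulated data; the features, the target and the 70/15/15 split are assumptions, not the setup used in this project.

```r
library(lightgbm)

set.seed(123)
# Simulated features and target (assumption: not the real stock data)
n <- 1500
X <- matrix(rnorm(n * 4), ncol = 4, dimnames = list(NULL, c("f1", "f2", "f3", "f4")))
y <- 0.5 * X[, 1] - 0.3 * X[, 2] + rnorm(n, sd = 0.5)

# Three disjoint index sets: 70% train, 15% validation, 15% test
idx   <- sample(seq_len(n))
train <- idx[1:1050]
valid <- idx[1051:1275]
test  <- idx[1276:1500]

dtrain <- lgb.Dataset(data = X[train, ], label = y[train])
dvalid <- lgb.Dataset(data = X[valid, ], label = y[valid], reference = dtrain)

model <- lgb.train(
  params = list(objective = "regression", metric = "rmse",
                learning_rate = 0.1, num_leaves = 31, verbose = -1),
  data = dtrain,
  nrounds = 200,
  valids = list(validation = dvalid),   # early stopping monitors this set only
  early_stopping_rounds = 10
)

# The test set is touched once, after training has finished
pred_test <- predict(model, X[test, ])
rmse_test <- sqrt(mean((pred_test - y[test])^2))
cat("Best iteration:", model$best_iter, " Test RMSE:", rmse_test, "\n")
```

With this layout, the reported test error is not influenced by the choice of the stopping iteration.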
Prediction & evaluation
predictions <- predict(lgb_model, newdata = as.matrix(test_data[, -which(names(test_data) == "return")])) # same feature columns as in training
Warning in storage.mode(data) <- "double": NAs introduced by coercion
rmse <- sqrt(mean((predictions - test_data$return)^2))
cat("RMSE:", rmse, "\n")
RMSE: 0.1255344
mae <- mean(abs(predictions - test_data$return))
cat("MAE:", mae, "\n")
MAE: 0.08153906
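We obtained an MAE of 0.082 and an RMSE of 0.126. In other words, predicted returns are off by about 8 percentage points on average, which is not negligible. The RMSE is roughly the same as the standard deviation of returns over the whole sample (0.124), which suggests that the model does little better than predicting the average return.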
Interpretability
In the realm of finance, where money management is pivotal, there is a significant preference for transparent algorithms that support decision-making. Stakeholders demand to understand the underlying reasoning of predictive models, which leads to the crucial distinction between global and local interpretability.
Global interpretability describes how the model behaves over the whole dataset, whereas local interpretability explains individual predictions. The lightGBM model we trained offers a global view through its feature importance chart, which displays the gains aggregated across the ensemble of trees. Such clarity is practical as well as academic: it helps refine models so that their predictive logic stays transparent.
Feature Importance
Feature importance can help in understanding the model’s decision-making process and in the reduction of the feature space for more efficient models.
lgb_importance <- lgb.importance(model = lgb_model, percentage = TRUE)
print(lgb_importance)
Feature Gain Cover Frequency
<char> <num> <num> <num>
1: volatility 0.32614683 0.4703797605 0.2000000
2: price_to_book 0.25451039 0.4439211563 0.2000000
3: market_cap 0.18233986 0.0500271586 0.2666667
4: price 0.10241046 0.0008298006 0.1000000
5: revenue 0.08431652 0.0343258741 0.1333333
6: debt_to_equity 0.05027594 0.0005162500 0.1000000
lgb.plot.importance(lgb_importance)
volatility is the most important feature, accounting for about 33% of the total gain, followed by price_to_book (25%) and market_cap (18%). The features price, revenue, and debt_to_equity have lower importance scores in this model, and profitability does not appear in the table at all.
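Because the gains are expressed as shares of the total, they can be accumulated to see how many features are needed to cover most of the model's gain. The sketch below copies the Gain column from the printed table; the 75% cut-off is an arbitrary choice for illustration.

```r
# Gain values copied from the printed importance table
importance <- data.frame(
  Feature = c("volatility", "price_to_book", "market_cap",
              "price", "revenue", "debt_to_equity"),
  Gain    = c(0.32614683, 0.25451039, 0.18233986,
              0.10241046, 0.08431652, 0.05027594)
)

# Running total of the gain, from the most to the least important feature
importance$CumulativeGain <- cumsum(importance$Gain)
print(importance)

# Smallest number of top features covering at least 75% of the total gain
n_needed <- which(importance$CumulativeGain >= 0.75)[1]
cat("Features needed for 75% of the gain:", n_needed, "\n")
```

This kind of summary can guide a reduction of the feature space, although gain-based importance only reflects how the trees used the features, not whether they improve out-of-sample accuracy.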