Stock Market Prediction

Author

Jia Xin Tang Zhi

Project :

Regression task: we want to predict stock returns (the last column) from panel-type data.

Import library

#install.packages("tidyverse") 
#install.packages("psych")
#install.packages("lightgbm") 
#install.packages("iml")

Data wrangling

First, let’s load the data & the packages.

library(tidyverse)
library(dplyr)              # for data manipulation
library(ggplot2)            # For the plots
library(gridExtra)          # to arrange 2 graphs in one row
library(psych)              # for comprehensive summary
library(rlang)              # for sym(), used when building plots programmatically
library(forecast)           # for forecasting future values
library(lightgbm)           # for the lightGBM model

Let’s load the dataset that comes in RData format.

load('stocks_clean.RData')  # loading the RData file
return <- stocks_clean      # assign it to a variable called return
rm(stocks_clean)            # now that return is assigned, remove stocks_clean to save memory

dim(return)   # Dimension of the dataframe

[1] 289271     13

head(return)  # Look at the first observations of the dataframe

ticker	date	price	market_cap	price_to_book	debt_to_equity	profitability	volatility	revenue	ghg_s1	ghg_s2	ghg_s3	return
AAON US Equity	1995-12-31	0.5048	35.1440	2.5948	85.9073	0.8628	71.728	14.720	NA	NA	NA	-0.0980883
AAON US Equity	1996-01-31	0.4719	32.8520	2.4256	85.9073	3.0722	63.087	67.346	NA	NA	NA	-0.0651743
AAON US Equity	1996-02-29	0.5048	35.1440	2.5948	85.9073	3.0722	97.639	67.346	NA	NA	NA	0.0697182
AAON US Equity	1996-03-31	0.4170	29.0367	2.0805	65.1878	3.1180	100.450	13.438	NA	NA	NA	-0.1739303
AAON US Equity	1996-04-30	0.3841	26.7444	1.9162	65.1878	3.1180	76.133	13.438	NA	NA	NA	-0.0788969
AAON US Equity	1996-05-31	0.3951	27.5445	1.9710	65.1878	3.1180	88.304	13.438	NA	NA	NA	0.0286384

tail(return)

ticker	date	price	market_cap	price_to_book	debt_to_equity	profitability	volatility	revenue	ghg_s1	ghg_s2	ghg_s3
ZIXI US Equity	2022-07-29	8.485	481.8671	14.1059	146.8623	-3.7101	4.118	64.85	NA	NA	NA
ZIXI US Equity	2022-08-31	8.485	481.8671	14.1059	146.8623	-3.7101	4.118	64.85	NA	NA	NA
ZIXI US Equity	2022-09-30	8.485	481.8671	14.1059	NA	NA	4.118	NA	NA	NA	NA
ZIXI US Equity	2022-10-31	8.485	481.8671	14.1059	NA	NA	4.118	NA	NA	NA	NA
ZIXI US Equity	2022-11-30	8.485	481.8671	14.1059	NA	NA	4.118	NA	NA	NA	NA
ZIXI US Equity	2022-12-30	8.485	481.8671	14.1059	NA	NA	4.118	NA	NA	NA	NA

We can see that there are 289271 observations (rows) and 13 variables (columns), so the dataset is quite large. The last variable, called “return”, is the target variable and is conveniently positioned in the last column.

Structure and statistics of the dataset

Structure

str(return[,1:13])    # Provides the structure , datatype

tibble [289,271 × 13] (S3: tbl_df/tbl/data.frame)
 $ ticker        : chr [1:289271] "AAON US Equity" "AAON US Equity" "AAON US Equity" "AAON US Equity" ...
 $ date          : Date[1:289271], format: "1995-12-31" "1996-01-31" ...
 $ price         : num [1:289271] 0.505 0.472 0.505 0.417 0.384 ...
 $ market_cap    : num [1:289271] 35.1 32.9 35.1 29 26.7 ...
 $ price_to_book : num [1:289271] 2.59 2.43 2.59 2.08 1.92 ...
 $ debt_to_equity: num [1:289271] 85.9 85.9 85.9 65.2 65.2 ...
 $ profitability : num [1:289271] 0.863 3.072 3.072 3.118 3.118 ...
 $ volatility    : num [1:289271] 71.7 63.1 97.6 100.5 76.1 ...
 $ revenue       : num [1:289271] 14.7 67.3 67.3 13.4 13.4 ...
 $ ghg_s1        : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
 $ ghg_s2        : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
 $ ghg_s3        : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
 $ return        : num [1:289271] -0.0981 -0.0652 0.0697 -0.1739 -0.0789 ...

There seem to be character, Date and numeric types of data. Let’s get a precise view using the class function.

sapply(return, class) |> table() |> head()


character      Date   numeric 
        1         1        11

So the dataset is mainly composed of numeric variables (11 columns), plus one character column (ticker) and one Date column (date). We would therefore need to use one-hot encoding to convert the character column to numbers if we want to use it as an input in our ML algorithm.
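As an illustration of what one-hot encoding does, here is a minimal sketch in base R on a made-up data frame (the `toy` object and its values are assumptions, not part of the dataset). Each level of the character column becomes its own 0/1 indicator column.

```r
# Toy data frame standing in for the real panel (values are made up)
toy <- data.frame(
  ticker = c("AAA", "BBB", "AAA", "CCC"),
  price  = c(10.0, 20.5, 11.2, 5.3),
  stringsAsFactors = TRUE
)

# model.matrix() expands the factor into 0/1 indicator columns;
# "- 1" drops the intercept so that every level gets its own column
encoded <- model.matrix(~ ticker + price - 1, data = toy)
print(encoded)
```

With 885 distinct tickers in the real data, this would add 885 columns, so it is worth weighing whether the ticker is needed as a feature at all.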

Statistics

statistics <- summary(return) # provides summary statistics for each column
statistics_df <- as.data.frame(statistics) # convert the summary table to a data frame (long format: Var1, Var2, Freq)

# Then use the describe function
describe_stats <- describe(statistics_df)
describe_stats

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
Var1*	1	91	1.00000	0.000000	1.0	1.00000	0.0000	1	1	0	NaN	NaN	0.0000000
Var2*	2	91	7.00000	3.762387	7.0	7.00000	4.4478	1	13	12	0.0000000	-1.253316	0.3944053
Freq*	3	84	41.82143	23.494763	42.5	42.01471	30.3933	1	81	80	-0.0697899	-1.261956	2.5634888

summary(return$return)  # descriptive summary of the target variable

     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.999900 -0.046077  0.007165  0.011707  0.061672  4.140351

psych::describe(return)  # more details

Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
ticker*	1	289271	442.8751862	2.554853e+02	443.0000000	442.8453398	327.6546000	1.0000e+00	8.850000e+02	8.840000e+02	-0.0003507	-1.198849	0.4750218
date	2	289271	NaN	NA	NA	NaN	NA	Inf	-Inf	-Inf	NA	NA	NA
price	3	289271	47.5231056	1.358301e+02	24.6000000	29.2883421	21.9721320	1.0000e-06	5.908870e+03	5.908870e+03	18.4437518	485.375142	0.2525477
market_cap	4	285271	14358.8671009	5.284050e+04	1783.9113000	4689.8198910	2431.9996634	1.2000e-02	2.913284e+06	2.913284e+06	20.4232946	762.572413	98.9322942
price_to_book	5	269118	6.9235647	3.650293e+02	2.0421000	2.3796616	1.3192175	4.0000e-04	1.239516e+05	1.239516e+05	292.0519380	88943.404460	0.7036495
debt_to_equity	6	279183	202.9767215	4.641365e+03	52.8601000	63.9115311	60.8140281	0.0000e+00	1.188137e+06	1.188137e+06	187.9389783	46359.668753	8.7841811
profitability	7	283169	-238.4942549	2.270789e+04	7.3622000	8.8191991	8.0604514	-3.3007e+06	3.703000e+05	3.671000e+06	-133.5345377	18751.861019	42.6730831
volatility	8	289084	39.1337053	3.991481e+01	30.7070000	33.5309748	15.8830938	0.0000e+00	5.137323e+03	5.137323e+03	29.3463638	2549.313940	0.0742373
revenue	9	284166	19533.5954801	4.259334e+05	443.5000000	1153.5807626	611.3115624	-2.7966e+04	3.137951e+07	3.140747e+07	40.9380333	2104.805636	799.0163901
ghg_s1	10	37575	5209.6830526	1.565725e+04	212.1960000	1376.6898118	305.4215304	0.0000e+00	1.455000e+05	1.455000e+05	5.0702658	30.350285	80.7729651
ghg_s2	11	35717	1138.8577400	2.507859e+03	279.6060000	534.8833402	373.5054876	0.0000e+00	2.900000e+04	2.900000e+04	4.6730604	28.919859	13.2698360
ghg_s3	12	24404	30353.2335338	1.028987e+05	549.5130000	5531.2176450	805.1718906	0.0000e+00	1.169970e+06	1.169970e+06	5.0776069	29.746759	658.6875273
return	13	289271	0.0117065	1.241597e-01	0.0071651	0.0079437	0.0798772	-9.9990e-01	4.140351e+00	5.140251e+00	3.2588582	61.246547	0.0002308

# Calculate missing values
sapply(return, function(x) sum(is.na(x)))

        ticker           date          price     market_cap  price_to_book 
             0              0              0           4000          20153 
debt_to_equity  profitability     volatility        revenue         ghg_s1 
         10088           6102            187           5105         251696 
        ghg_s2         ghg_s3         return 
        253554         264867              0

# Threshold: maximum number of missing values allowed per column (85% of the rows)
threshold <- nrow(return) * 0.85

# Keep only the columns whose number of missing values is below the threshold
return <- return |>
  select(where(~ sum(is.na(.x)) < threshold))

# Replace NA values with the mean of each column with missing values
return <- return |>
  mutate(market_cap = ifelse(is.na(market_cap), mean(market_cap, na.rm = TRUE), market_cap),
         price_to_book = ifelse(is.na(price_to_book), mean(price_to_book, na.rm = TRUE), price_to_book),
         debt_to_equity = ifelse(is.na(debt_to_equity), mean(debt_to_equity, na.rm = TRUE), debt_to_equity),
         profitability = ifelse(is.na(profitability), mean(profitability, na.rm = TRUE), profitability),
         volatility = ifelse(is.na(volatility), mean(volatility, na.rm = TRUE), volatility),
         revenue = ifelse(is.na(revenue), mean(revenue, na.rm = TRUE), revenue))

sapply(return, function(x) sum(is.na(x)))

        ticker           date          price     market_cap  price_to_book 
             0              0              0              0              0 
debt_to_equity  profitability     volatility        revenue         return 
             0              0              0              0              0
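The two cleaning steps above (dropping very sparse columns, then mean imputation) can be sketched on a small made-up data frame in base R. The `toy` data and the 50% cutoff are assumptions chosen so that the effect is visible on four rows.

```r
# Made-up data: column b is 75% missing, a and c have one NA each
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, NA, 8),
  c = c(2, 2, NA, 2)
)
max_missing_share <- 0.5   # hypothetical cutoff for this toy example

# Share of missing values per column
missing_share <- colMeans(is.na(toy))

# Keep only the columns below the cutoff
kept <- toy[, missing_share < max_missing_share, drop = FALSE]

# Replace the remaining NAs by the mean of their column
kept[] <- lapply(kept, function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
})
print(kept)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind for heavily skewed variables such as those in this dataset.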

Couple of points to note :

The market_cap variable has a very high standard deviation relative to the mean, indicating that the dataset includes companies of vastly different sizes, from small caps to large caps.
The price_to_book ratio shows significant variability as well, with some values as high as 123,951.6, suggesting there might be some highly valued companies compared to their book value.
The debt_to_equity variable ranges from 0 to about 1.19 million and is extremely skewed (skewness of about 188), which might indicate a few companies with very high leverage or with shareholders’ equity close to zero.
The profitability variable has an exceptionally wide range, with the minimum being a large negative number and the maximum being very high, indicating some companies are highly profitable while others are incurring substantial losses.
revenue has a high skewness value, suggesting the inclusion of companies with massive differences in sales figures, from losses/revenue deductions to substantial earnings.
The greenhouse gas variables (ghg_s1, ghg_s2 and ghg_s3) have very few non-missing values (37,575, 35,717 and 24,404 out of 289,271 rows), which could suggest data collection challenges or varying reporting standards across companies. We decided to remove these columns because this level of sparsity (more than 85% missing) provides little to no added value for our predictive accuracy and could potentially distort the outcome, leading to unreliable predictions.
The return variable has a mean close to zero (about 0.012) and a standard deviation of about 0.124, but its distribution is right-skewed and heavy-tailed (kurtosis of about 61), with values ranging from -0.9999 to 4.14.
The remaining missing values were replaced by the mean of each column.
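The columns “ghg_s1”, “ghg_s2” and “ghg_s3” have already been dropped above. The columns ticker and date are not numeric, so they cannot be used directly as inputs for lightGBM; they are still needed to select and order the series in the ARIMA section.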

Uni-variate Analysis

Categorical Values

The only categorical variable is “ticker”. We look at the top 20 tickers by aggregated market cap in bar charts, and at a pie chart restricted to 10 tickers.

# Aggregated market_cap by ticker
aggregated_data <- return |>
  group_by(ticker) |>
  summarise(total_market_cap = sum(market_cap, na.rm = TRUE))

# Filter for the top 20 tickers by market cap
n_top <- 20
aggregated_data_top_n <- aggregated_data |>
  slice_max(total_market_cap, n = n_top)

# Plot only the top N tickers
ggplot(aggregated_data_top_n, aes(x = reorder(ticker, -total_market_cap), y = total_market_cap)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Total Market Cap by Ticker", x = "Ticker", y = "Total Market Cap") +
  coord_flip() # Flip coordinates for horizontal bars

For positive variables with very heavy tails, a log scale makes the plot easier to read: use the scale_x_log10() layer for histograms, or scale_y_log10() for the bar chart below :

# Plot only the top N tickers with log-transformed y-axis
ggplot(aggregated_data_top_n, aes(x = reorder(ticker, -total_market_cap), y = total_market_cap)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  scale_y_log10() + # Log-transform the y-axis
  theme_minimal() +
  labs(title = "Total Market Cap by Ticker (Log Scale)", x = "Ticker", y = "Log of Total Market Cap") +
  coord_flip() # Flip coordinates for horizontal bars

# Using the top 10 tickers by total market cap
top_tickers_data <- aggregated_data |>
  slice_max(total_market_cap, n = 10)

# Creating a pie chart
ggplot(top_tickers_data, aes(x = "", y = total_market_cap, fill = ticker)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") + # Convert the bar chart to a pie chart
  theme_void() + # Remove background and axes
  labs(title = "Market Cap Distribution Among Top Tickers") +
  theme(legend.title = element_blank()) # Hide the legend title

So we observe that :

In the bar charts, coord_flip() is used to flip the chart for better readability, especially when dealing with many tickers.
Pie charts are not ideal for datasets with many categories because they can become cluttered and difficult to interpret, so we use only the top 10 tickers for simplicity.

Numerical Values

plot_histogram_boxplot <- function(data, var_name) {
  # Convert the variable name to a symbol for ggplot
  var <- sym(var_name)
  
  # Create a histogram plot
  p1 <- ggplot(data, aes(x = !!var)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    theme_minimal() +
    labs(title = paste("Histogram of", var_name), x = var_name, y = "Count")
  
  # Create a boxplot
  p2 <- ggplot(data, aes(x = "", y = !!var)) +
    geom_boxplot(fill = "tomato", color = "black") +
    theme_minimal() +
    labs(title = paste("Boxplot of", var_name), x = "", y = var_name)
  
  # Combine plots using grid.arrange
  grid.arrange(p1, p2, ncol = 2)
}

# List of quantitative variable names to plot
quantitative_vars <- c("price", "market_cap", "price_to_book", "debt_to_equity", 
                       "profitability", "volatility", "revenue", "return")

# Loop through each variable and plot
for(var_name in quantitative_vars) {
  print(paste("Plotting for:", var_name)) # Print the variable name being plotted
  plot_histogram_boxplot(return, var_name)
}

[1] "Plotting for: price"

[1] "Plotting for: market_cap"

[1] "Plotting for: price_to_book"

[1] "Plotting for: debt_to_equity"

[1] "Plotting for: profitability"

[1] "Plotting for: volatility"

[1] "Plotting for: revenue"

[1] "Plotting for: return"

Modeling using ARIMA

ARIMA models are well-suited for time series data that show patterns over time and can be made stationary through differencing. They are useful for forecasting future points in a series based on its own past values (autoregression) and a moving average of past errors. Use case for ARIMA: the return variable shows autocorrelation over time (which means past returns are predictive of future returns).
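Whether a series shows autocorrelation can be checked before fitting anything. The sketch below uses simulated data only (an AR(1) series and white noise, both assumptions for illustration) with the Ljung-Box test and the sample autocorrelation from base R.

```r
set.seed(1)

# Simulated series: one with AR(1) dependence, one pure white noise
ar_series    <- arima.sim(model = list(ar = 0.6), n = 300)
noise_series <- rnorm(300)

# Ljung-Box test: a small p-value suggests autocorrelation is present
lb_ar    <- Box.test(ar_series, lag = 12, type = "Ljung-Box")
lb_noise <- Box.test(noise_series, lag = 12, type = "Ljung-Box")

cat("AR(1) series p-value:", lb_ar$p.value, "\n")
cat("White noise p-value: ", lb_noise$p.value, "\n")

# Lag-1 sample autocorrelation of each series
print(acf(ar_series, plot = FALSE)$acf[2])
print(acf(noise_series, plot = FALSE)$acf[2])
```

The same two calls could be applied to the return series of a single ticker; if the test does not reject, an ARIMA model is unlikely to do much better than forecasting the mean.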

Due to the large number of stocks, we explore only a few tickers to keep things simple.

Example 1 - “AAPL US Equity” ticker

Data Preprocessing

ticker_name <- "AAPL US Equity"  # Replace with the ticker you want to predict

# Filter the data for the selected ticker
ticker_data <- return |> filter(ticker == ticker_name)

# Order the data by date
ticker_data <- ticker_data |> arrange(date)

Split the Data

# Calculate the splitting index
split_index <- floor(0.8 * nrow(ticker_data))

# Create training and testing datasets
train_data <- ticker_data[1:split_index, ]
test_data <- ticker_data[(split_index + 1):nrow(ticker_data), ]

Fit the ARIMA Model

# Fit an ARIMA model
return_arima <- auto.arima(train_data$return)
summary(return_arima)

Series: train_data$return 
ARIMA(1,0,1) with non-zero mean 

Coefficients:
         ar1      ma1    mean
      0.6134  -0.5783  0.0265
s.e.  1.1523   1.1973  0.0085

sigma^2 = 0.01625:  log likelihood = 169.39
AIC=-330.77   AICc=-330.62   BIC=-316.5

Training set error measures:
                       ME      RMSE        MAE  MPE MAPE      MASE         ACF1
Training set 0.0001031105 0.1267608 0.09507222 -Inf  Inf 0.7166279 0.0004694282
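For reference, the fitted ARIMA(1,0,1) with non-zero mean can be written out as below. The symbols are our notation, and we assume the usual R sign convention in which the MA term enters with a plus sign.

```latex
(y_t - \mu) = \phi_1 \,(y_{t-1} - \mu) + \varepsilon_t + \theta_1 \,\varepsilon_{t-1},
\qquad \hat{\mu} = 0.0265,\quad \hat{\phi}_1 = 0.6134,\quad \hat{\theta}_1 = -0.5783
```

The AR and MA estimates have opposite signs and similar magnitudes, and their standard errors (1.15 and 1.20) are larger than the estimates themselves, so the fitted series may be close to white noise around the mean.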

Predicting Future Returns

# Make predictions
predictions <- forecast(return_arima, h = nrow(test_data))  # 'h' is the number of periods to predict

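# Plot the predictions against the actual test data
plot(predictions)
# The forecast is indexed by observation number, so the test returns are placed at the same indices
lines(seq(split_index + 1, nrow(ticker_data)), test_data$return, col = "red")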

The blue line in the graph likely represents the forecasted values generated by the ARIMA model for the specified future periods (as defined by h = nrow(test_data)). It is the model’s best estimate of the time series’ central tendency moving forward.

Representation :

Black Line : This is the historical time series data upon which the model was trained.
Blue Line : This line is the point forecast from the ARIMA model, indicating the expected value of the series at each future point.
Grey Shaded Area : This depicts the prediction intervals around the point forecasts (80% and 95% by default in the forecast package), representing the uncertainty in the forecasts. The darker band is the narrower 80% interval and the lighter band is the wider 95% interval.

In the graph, the blue line shows where the model predicts the return will be, on average, for the next h periods. The shaded area around it indicates the level of confidence the model has in its predictions; the actual future values are expected to fall within this range most of the time, given the model assumptions hold true.

The range of 0 to 300 on the x-axis of the ARIMA forecast plot represents the index of the observations in the time series data. This is a common default in time series plots when the time series object doesn’t have an associated time/date attribute or when the plotting function isn’t explicitly told to use a date variable for the x-axis. The y-axis, labeled return, represents the values of the variable being forecast by the ARIMA(1,0,1) model.
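One way to get calendar time on the x-axis is to give the series a time attribute before modelling. The sketch below uses made-up monthly returns; the start date and the values are assumptions for illustration.

```r
set.seed(42)
# Hypothetical monthly returns starting in December 1995 (made-up values)
monthly_returns <- rnorm(36, mean = 0.01, sd = 0.1)

# frequency = 12 marks the series as monthly; start = c(year, month)
returns_ts <- ts(monthly_returns, start = c(1995, 12), frequency = 12)

# time() now returns calendar time rather than 1, 2, 3, ...
print(head(time(returns_ts)))
print(start(returns_ts))
print(end(returns_ts))
```

Note that once the frequency is set to 12, auto.arima may also consider seasonal terms, so the selected model can differ from the one fitted on a plain numeric vector.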

Performance Evaluation

# Calculate accuracy metrics
accuracy(predictions, test_data$return)

                        ME       RMSE        MAE      MPE     MAPE      MASE
Training set  0.0001031105 0.12676080 0.09507222     -Inf      Inf 0.7166279
Test set     -0.0001558202 0.09022325 0.07685722 114.7769 124.9835 0.5793283
                     ACF1
Training set 0.0004694282
Test set               NA

In the evaluation, the -Inf and Inf values for MPE and MAPE indicate that some actual returns are exactly zero: these percentage errors divide by the actual value, so they become infinite. These two metrics are therefore unsuitable for this dataset.

The RMSE value of 0.0902 for the test set indicates that the forecast errors are moderate, and ideally, we would look for a lower RMSE for better forecast accuracy. The MAE is measured at 0.0769 for the test set, which similarly points to moderate errors; as with RMSE, a lower MAE would mean the predictions are closer to the actual values. The MASE stands at 0.579 for the test set, which is less than 1, suggesting that the forecasting model is performing better than a naive benchmark model.
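To make these metrics concrete, here is a minimal sketch that computes them by hand on made-up numbers (all vectors below are assumptions). It also shows why MAPE becomes infinite as soon as one actual value is zero. The MASE here is scaled by the in-sample error of a naive "previous value" forecast, which is our understanding of the convention for non-seasonal series.

```r
actual          <- c(0.05, -0.02, 0.00, 0.10, -0.04)  # note the exact zero
forecast_values <- c(0.03, 0.01, 0.02, 0.04, -0.01)
train           <- c(0.02, -0.03, 0.06, 0.01, -0.05, 0.04)  # in-sample series

errors <- actual - forecast_values

rmse <- sqrt(mean(errors^2))
mae  <- mean(abs(errors))
mape <- mean(abs(errors / actual)) * 100   # division by zero gives Inf

# MASE: test MAE divided by the in-sample MAE of the naive forecast
naive_mae <- mean(abs(diff(train)))
mase <- mae / naive_mae

cat("RMSE:", rmse, " MAE:", mae, " MAPE:", mape, " MASE:", mase, "\n")
```

A MASE below 1 means the forecast errors are smaller, on average, than those of the naive forecast on the training data.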

Example 2 - “ZIXI US Equity” ticker

Data Preprocessing

ticker_name <- "ZIXI US Equity"  # Replace with the ticker you want to predict

# Filter the data for the selected ticker
ticker_data <- return |> filter(ticker == ticker_name)

# Order the data by date
ticker_data <- ticker_data |> arrange(date)

Split the Data

# Calculate the splitting index
split_index <- floor(0.8 * nrow(ticker_data))

# Create training and testing datasets
train_data <- ticker_data[1:split_index, ]
test_data <- ticker_data[(split_index + 1):nrow(ticker_data), ]

Fit the ARIMA Model

# Fit an ARIMA model
return_arima <- auto.arima(train_data$return)
summary(return_arima)

Series: train_data$return 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
        mean
      0.0278
s.e.  0.0162

sigma^2 = 0.06814:  log likelihood = -19.22
AIC=42.44   AICc=42.49   BIC=49.56

Training set error measures:
                       ME     RMSE       MAE  MPE MAPE      MASE          ACF1
Training set 3.956344e-14 0.260537 0.1697435 -Inf  Inf 0.6678411 -0.0007031951

Predicting Future Returns

# Make predictions
predictions <- forecast(return_arima, h = nrow(test_data))  # 'h' is the number of periods to predict

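# Plot the predictions against the actual test data
plot(predictions)
# The forecast is indexed by observation number, so the test returns are placed at the same indices
lines(seq(split_index + 1, nrow(ticker_data)), test_data$return, col = "red")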

Performance Evaluation

# Calculate accuracy metrics
accuracy(predictions, test_data$return)

                        ME      RMSE        MAE  MPE MAPE      MASE
Training set  3.956344e-14 0.2605370 0.16974349 -Inf  Inf 0.6678411
Test set     -1.277648e-02 0.1217997 0.08573132 -Inf  Inf 0.3373024
                      ACF1
Training set -0.0007031951
Test set                NA

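The RMSE on the test set is 0.122, which gives an idea of the typical size of the forecast errors. The MASE being less than one (0.337) indicates that the model is performing better than a naive model on the test set. Note that auto.arima selected ARIMA(0,0,0) with non-zero mean, so the forecast is simply the constant training mean (0.0278): no autocorrelation was exploited for this ticker.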

Modeling using lightGBM

Data Splitting

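In this section, we fit a boosted tree from the lightGBM package. For simplicity, we split the data randomly: 80% of the rows are used for training and a test sample is drawn from the remaining rows. A further sample without the target column is set aside as a future validation set, but it is not used in this section.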

# Setting seed for reproducibility
set.seed(123)

# Data preparation
train_indices <- sample(1:nrow(return), 0.8 * nrow(return))  # 80% for training
train_data <- return[train_indices, ]
remaining_indices <- setdiff(1:nrow(return), train_indices)  # Indices not in training set

# Test set
test_size <- 0.2  # 20% of the held-out rows (about 4% of all rows)
test_indices <- sample(remaining_indices, test_size * length(remaining_indices))
test_data <- return[test_indices, ]

# Future validation set
remaining_indices <- setdiff(remaining_indices, test_indices)  # Remove test indices
validation_indices <- sample(remaining_indices, 0.2 * length(remaining_indices))  # 20% of remaining data for validation
validation_data <- return[validation_indices, -which(names(return) == "return")]  # Remove 'return' column

# Convert data to LightGBM format
train_data_lgb <- lgb.Dataset(data = as.matrix(train_data[, -which(names(train_data) == "return")]), label = train_data$return)

Warning in storage.mode(data) <- "double": NAs introduced by coercion

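# Use the same feature columns as the training matrix (drop only the target)
test_data_lgb <- lgb.Dataset(data = as.matrix(test_data[, -which(names(test_data) == "return")]), label = test_data$return, reference = train_data_lgb)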

Warning in storage.mode(data) <- "double": NAs introduced by coercion

# Define LightGBM parameters
train_params <- list(
  objective = "regression",  # Regression task
  metric = "rmse",  # Root Mean Square Error as the evaluation metric
  num_leaves = 31,  # Maximum number of leaves in one tree
  learning_rate = 0.1,  # Learning rate
  feature_fraction = 0.8,  # Percentage of features used per iteration
  bagging_fraction = 0.8,  # Percentage of data used per iteration
  bagging_freq = 5,  # Frequency for bagging
  verbose = -1  # No print updates
)
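The sizes implied by the split above can be worked out with a few lines of arithmetic. This is a sketch that assumes sample() truncates fractional sizes, which is why floor() is used here.

```r
n_rows <- 289271   # number of rows reported by dim(return)

n_train      <- floor(0.8 * n_rows)
n_remaining  <- n_rows - n_train
n_test       <- floor(0.2 * n_remaining)
n_validation <- floor(0.2 * (n_remaining - n_test))

sizes <- c(train = n_train, test = n_test, validation = n_validation)
print(sizes)

# Shares of the full dataset
print(round(sizes / n_rows, 3))
```

The test set is therefore roughly 4% of all rows rather than 20%, and most of the held-out rows are not used at all.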

Training

lgb_model <- lgb.train(params = train_params,
                       data = train_data_lgb,
                       nrounds = 100,  # Number of boosting iterations (trees)
                       valids = list(validation = test_data_lgb),
                       early_stopping_rounds = 10)  # Early stopping

Prediction & evaluation

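# Predict on the same feature columns as in training (drop only the target)
predictions <- predict(lgb_model, newdata = as.matrix(test_data[, -which(names(test_data) == "return")]))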

Warning in storage.mode(data) <- "double": NAs introduced by coercion

rmse <- sqrt(mean((predictions - test_data$return)^2))
cat("RMSE:", rmse, "\n")

RMSE: 0.1255344

mae <- mean(abs(predictions - test_data$return))
cat("MAE:", mae, "\n")

MAE: 0.08153906

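We obtained a MAE of 0.082 and a RMSE of 0.126, i.e. an average absolute error of about 8 percentage points on returns, which is not negligible: the RMSE is close to the standard deviation of returns in the full dataset (about 0.124).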
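A simple way to put an RMSE in context is to compare it with a baseline that always predicts the mean return. The sketch below uses simulated returns and deliberately weak simulated predictions (both are assumptions, not the lightGBM output).

```r
set.seed(7)
# Made-up returns and deliberately weak "model" predictions
actual      <- rnorm(1000, mean = 0.01, sd = 0.12)
predictions <- 0.01 + 0.1 * (actual - 0.01) + rnorm(1000, sd = 0.01)

rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

model_rmse    <- rmse(predictions, actual)
baseline_rmse <- rmse(rep(mean(actual), length(actual)), actual)

# Share of the baseline's squared error removed by the model
r2_oos <- 1 - (model_rmse / baseline_rmse)^2

cat("Model RMSE:   ", model_rmse, "\n")
cat("Baseline RMSE:", baseline_rmse, "\n")
cat("R2 vs mean:   ", r2_oos, "\n")
```

In practice the baseline mean would be taken from the training set rather than from the test returns; the test mean is used here only to keep the example short.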

Interpretability

In the realm of finance, where money management is pivotal, there is a significant preference for transparent algorithms that support decision-making. Stakeholders demand to understand the underlying reasoning of predictive models, which leads to the crucial distinction between global and local interpretability.

Global interpretability allows us to comprehend the model’s mechanisms over the aggregate data, offering insights on how the model makes decisions across the broad spectrum of data. An exemplar of this is the lightGBM model we trained, whose feature importance chart, displaying the aggregated gains across the ensemble of trees, exemplifies global interpretability. Such clarity is not just academic but practical, allowing for refined models that prioritize transparency in their predictive logic.

Feature Importance

Feature importance can help in understanding the model’s decision-making process and in the reduction of the feature space for more efficient models.

lgb_importance <- lgb.importance(model = lgb_model, percentage = TRUE)
print(lgb_importance)

          Feature       Gain        Cover Frequency
           <char>      <num>        <num>     <num>
1:     volatility 0.32614683 0.4703797605 0.2000000
2:  price_to_book 0.25451039 0.4439211563 0.2000000
3:     market_cap 0.18233986 0.0500271586 0.2666667
4:          price 0.10241046 0.0008298006 0.1000000
5:        revenue 0.08431652 0.0343258741 0.1333333
6: debt_to_equity 0.05027594 0.0005162500 0.1000000

lgb.plot.importance(lgb_importance)

volatility seems to be the most important feature, contributing the most to the model’s predictions (about 33% of the total gain), followed by price_to_book and market_cap. The features price, revenue, and debt_to_equity have lower importance scores in this model. profitability does not appear in the table, which means it was not used in any split.
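The gain shares reported above can also be used to reduce the feature space. The sketch below keeps the smallest set of features whose cumulative gain reaches a cutoff; the 90% cutoff is a hypothetical choice, while the gain values are copied from the table.

```r
# Gain shares copied from the importance table above
gain <- c(volatility = 0.32614683, price_to_book = 0.25451039,
          market_cap = 0.18233986, price = 0.10241046,
          revenue = 0.08431652, debt_to_equity = 0.05027594)

gain <- sort(gain, decreasing = TRUE)
cumulative_gain <- cumsum(gain)

target_share <- 0.90   # hypothetical cutoff
n_keep <- unname(which(cumulative_gain >= target_share)[1])
selected <- names(gain)[seq_len(n_keep)]

print(round(cumulative_gain, 3))
print(selected)
```

Gain-based importance only reflects how the trees used each feature during training, so a refit on the reduced set would be needed to confirm that accuracy is preserved.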