Stock Market Prediction

Author

Jia Xin Tang Zhi

Project :

Regression task: we want to predict stock returns from panel-type data (the target is the last column).

Import library

#install.packages("tidyverse") 
#install.packages("psych")
#install.packages("lightgbm") 
#install.packages("iml") 

Data wrangling

First, let’s load the data & the packages.

library(tidyverse)
library(dplyr)              # for data manipulation
library(ggplot2)            # For the plots
library(gridExtra)          # to arrange 2 graphs in one row
library(psych)              # for comprehensive summary
library(rlang)              # for in-built functions
library(forecast)           # for forecasting future values
library(lightgbm)           # for the LightGBM model

Let’s load the dataset that comes in RData format.

load('stocks_clean.RData')  # loading the RData file
return <- stocks_clean      # and assigning it to the variable return
rm(stocks_clean)            # now that return holds the data, remove stocks_clean to save memory
dim(return)   # Dimension of the dataframe
[1] 289271     13
head(return)  # Look at the first and last observations of the dataframe
ticker date price market_cap price_to_book debt_to_equity profitability volatility revenue ghg_s1 ghg_s2 ghg_s3 return
AAON US Equity 1995-12-31 0.5048 35.1440 2.5948 85.9073 0.8628 71.728 14.720 NA NA NA -0.0980883
AAON US Equity 1996-01-31 0.4719 32.8520 2.4256 85.9073 3.0722 63.087 67.346 NA NA NA -0.0651743
AAON US Equity 1996-02-29 0.5048 35.1440 2.5948 85.9073 3.0722 97.639 67.346 NA NA NA 0.0697182
AAON US Equity 1996-03-31 0.4170 29.0367 2.0805 65.1878 3.1180 100.450 13.438 NA NA NA -0.1739303
AAON US Equity 1996-04-30 0.3841 26.7444 1.9162 65.1878 3.1180 76.133 13.438 NA NA NA -0.0788969
AAON US Equity 1996-05-31 0.3951 27.5445 1.9710 65.1878 3.1180 88.304 13.438 NA NA NA 0.0286384
tail(return)
ticker date price market_cap price_to_book debt_to_equity profitability volatility revenue ghg_s1 ghg_s2 ghg_s3 return
ZIXI US Equity 2022-07-29 8.485 481.8671 14.1059 146.8623 -3.7101 4.118 64.85 NA NA NA 0
ZIXI US Equity 2022-08-31 8.485 481.8671 14.1059 146.8623 -3.7101 4.118 64.85 NA NA NA 0
ZIXI US Equity 2022-09-30 8.485 481.8671 14.1059 NA NA 4.118 NA NA NA NA 0
ZIXI US Equity 2022-10-31 8.485 481.8671 14.1059 NA NA 4.118 NA NA NA NA 0
ZIXI US Equity 2022-11-30 8.485 481.8671 14.1059 NA NA 4.118 NA NA NA NA 0
ZIXI US Equity 2022-12-30 8.485 481.8671 14.1059 NA NA 4.118 NA NA NA NA 0

We can see that there are 289,271 observations (rows) and 13 variables (features), so the dataset is quite large. The last variable, called "return", is the target variable and is conveniently positioned in the last column.

Structure and statistics of the dataset

Structure

str(return[,1:13])    # Provides the structure and data type of each column
tibble [289,271 × 13] (S3: tbl_df/tbl/data.frame)
 $ ticker        : chr [1:289271] "AAON US Equity" "AAON US Equity" "AAON US Equity" "AAON US Equity" ...
 $ date          : Date[1:289271], format: "1995-12-31" "1996-01-31" ...
 $ price         : num [1:289271] 0.505 0.472 0.505 0.417 0.384 ...
 $ market_cap    : num [1:289271] 35.1 32.9 35.1 29 26.7 ...
 $ price_to_book : num [1:289271] 2.59 2.43 2.59 2.08 1.92 ...
 $ debt_to_equity: num [1:289271] 85.9 85.9 85.9 65.2 65.2 ...
 $ profitability : num [1:289271] 0.863 3.072 3.072 3.118 3.118 ...
 $ volatility    : num [1:289271] 71.7 63.1 97.6 100.5 76.1 ...
 $ revenue       : num [1:289271] 14.7 67.3 67.3 13.4 13.4 ...
 $ ghg_s1        : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
 $ ghg_s2        : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
 $ ghg_s3        : num [1:289271] NA NA NA NA NA NA NA NA NA NA ...
 $ return        : num [1:289271] -0.0981 -0.0652 0.0697 -0.1739 -0.0789 ...

There are character, Date and numeric types of data. Let's get a precise count using the class function.

sapply(return, class) |> table() |> head()

character      Date   numeric 
        1         1        11 

So the dataset is mainly composed of numeric variables, with one character column (ticker) and one Date column. If we wanted to use the character column as an input to our ML algorithm, we would need to one-hot encode it.
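As an illustration only (not used in the rest of this notebook), such an encoding could be built as a sparse design matrix, since roughly 885 dummy columns would otherwise be very memory-hungry; this sketch assumes the Matrix package is available:

# Hypothetical sketch: one-hot encode the ticker column (not used later in this notebook)
library(Matrix)                                                     # sparse matrices
ticker_dummies <- sparse.model.matrix(~ ticker - 1, data = return)  # one 0/1 column per ticker
dim(ticker_dummies)                                                 # 289271 rows, one column per distinct ticker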

Statistics

statistics <- summary(return) # provides a summary of statistics for each column
statistics_df <- as.data.frame(statistics) # convert to a data frame for a cleaner view

# Then use the describe function
describe_stats <- describe(statistics_df)
describe_stats 
vars n mean sd median trimmed mad min max range skew kurtosis se
Var1* 1 91 1.00000 0.000000 1.0 1.00000 0.0000 1 1 0 NaN NaN 0.0000000
Var2* 2 91 7.00000 3.762387 7.0 7.00000 4.4478 1 13 12 0.0000000 -1.253316 0.3944053
Freq* 3 84 41.82143 23.494763 42.5 42.01471 30.3933 1 81 80 -0.0697899 -1.261956 2.5634888
summary(return$return)  # descriptive summary of the predictive variable
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-0.999900 -0.046077  0.007165  0.011707  0.061672  4.140351 
psych::describe(return)  # more details 
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
vars n mean sd median trimmed mad min max range skew kurtosis se
ticker* 1 289271 442.8751862 2.554853e+02 443.0000000 442.8453398 327.6546000 1.0000e+00 8.850000e+02 8.840000e+02 -0.0003507 -1.198849 0.4750218
date 2 289271 NaN NA NA NaN NA Inf -Inf -Inf NA NA NA
price 3 289271 47.5231056 1.358301e+02 24.6000000 29.2883421 21.9721320 1.0000e-06 5.908870e+03 5.908870e+03 18.4437518 485.375142 0.2525477
market_cap 4 285271 14358.8671009 5.284050e+04 1783.9113000 4689.8198910 2431.9996634 1.2000e-02 2.913284e+06 2.913284e+06 20.4232946 762.572413 98.9322942
price_to_book 5 269118 6.9235647 3.650293e+02 2.0421000 2.3796616 1.3192175 4.0000e-04 1.239516e+05 1.239516e+05 292.0519380 88943.404460 0.7036495
debt_to_equity 6 279183 202.9767215 4.641365e+03 52.8601000 63.9115311 60.8140281 0.0000e+00 1.188137e+06 1.188137e+06 187.9389783 46359.668753 8.7841811
profitability 7 283169 -238.4942549 2.270789e+04 7.3622000 8.8191991 8.0604514 -3.3007e+06 3.703000e+05 3.671000e+06 -133.5345377 18751.861019 42.6730831
volatility 8 289084 39.1337053 3.991481e+01 30.7070000 33.5309748 15.8830938 0.0000e+00 5.137323e+03 5.137323e+03 29.3463638 2549.313940 0.0742373
revenue 9 284166 19533.5954801 4.259334e+05 443.5000000 1153.5807626 611.3115624 -2.7966e+04 3.137951e+07 3.140747e+07 40.9380333 2104.805636 799.0163901
ghg_s1 10 37575 5209.6830526 1.565725e+04 212.1960000 1376.6898118 305.4215304 0.0000e+00 1.455000e+05 1.455000e+05 5.0702658 30.350285 80.7729651
ghg_s2 11 35717 1138.8577400 2.507859e+03 279.6060000 534.8833402 373.5054876 0.0000e+00 2.900000e+04 2.900000e+04 4.6730604 28.919859 13.2698360
ghg_s3 12 24404 30353.2335338 1.028987e+05 549.5130000 5531.2176450 805.1718906 0.0000e+00 1.169970e+06 1.169970e+06 5.0776069 29.746759 658.6875273
return 13 289271 0.0117065 1.241597e-01 0.0071651 0.0079437 0.0798772 -9.9990e-01 4.140351e+00 5.140251e+00 3.2588582 61.246547 0.0002308
# Calculate missing values
sapply(return, function(x) sum(is.na(x)))
        ticker           date          price     market_cap  price_to_book 
             0              0              0           4000          20153 
debt_to_equity  profitability     volatility        revenue         ghg_s1 
         10088           6102            187           5105         251696 
        ghg_s2         ghg_s3         return 
        253554         264867              0 
# Calculate the threshold for missing values
threshold <- nrow(return) * 0.85

# Remove columns where the number of missing values is greater than the threshold
return <- return |>
  select_if(~sum(is.na(.)) < threshold)

# Replace NA values with the mean of each column with missing values
return <- return |>
  mutate(market_cap = ifelse(is.na(market_cap), mean(market_cap, na.rm = TRUE), market_cap),
         price_to_book = ifelse(is.na(price_to_book), mean(price_to_book, na.rm = TRUE), price_to_book),
         debt_to_equity = ifelse(is.na(debt_to_equity), mean(debt_to_equity, na.rm = TRUE), debt_to_equity),
         profitability = ifelse(is.na(profitability), mean(profitability, na.rm = TRUE), profitability),
         volatility = ifelse(is.na(volatility), mean(volatility, na.rm = TRUE), volatility),
         revenue = ifelse(is.na(revenue), mean(revenue, na.rm = TRUE), revenue))

sapply(return, function(x) sum(is.na(x)))
        ticker           date          price     market_cap  price_to_book 
             0              0              0              0              0 
debt_to_equity  profitability     volatility        revenue         return 
             0              0              0              0              0 
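The same mean imputation can be written more compactly with dplyr::across(); the sketch below is equivalent to the mutate() call above and is shown only as an alternative:

# Equivalent, more compact mean imputation using across()
return <- return |>
  mutate(across(c(market_cap, price_to_book, debt_to_equity,
                  profitability, volatility, revenue),
                ~ ifelse(is.na(.x), mean(.x, na.rm = TRUE), .x)))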

A couple of points to note:

  1. The market_cap variable has a very high standard deviation relative to the mean, indicating that the dataset includes companies of vastly different sizes, from small caps to large caps.

  2. The price_to_book ratio shows significant variability as well, with a maximum of roughly 124,000, suggesting that some companies are valued extremely highly relative to their book value.

  3. debt_to_equity ranges from zero to over one million, which might indicate more complex financial structures or situations where shareholders' equity is very small or negative.

  4. The profitability variable has an exceptionally wide range, with the minimum being a large negative number and the maximum being very high, indicating some companies are highly profitable while others are incurring substantial losses.

  5. revenue has a high skewness value, reflecting companies with massive differences in sales figures, from small (or even negative, e.g. revenue deductions) to very large.

  6. The greenhouse gas variables (ghg_s1, ghg_s2 and ghg_s3) have very different counts of non-missing values, which could suggest data collection challenges or varying reporting standards across companies. We decided to remove these columns because this level of sparsity (more than 85% missing) provides little to no added value for our predictive accuracy and could distort the outcome, leading to unreliable predictions.

  7. The return variable is comparatively well behaved, with a mean close to zero (about 1.2% per month) and a standard deviation of about 12%, suggesting that the dataset represents a balanced view of stock performance over time.

  8. The remaining missing values were replaced by the mean of each column.

  9. The ghg_s1, ghg_s2 and ghg_s3 columns have already been dropped by the threshold rule above; we will also need to set aside ticker and date before modelling, since they are not directly usable as numeric predictors.

Uni-variate Analysis

Categorical Values

The only categorical variable is ticker. We aggregate the market cap by ticker and look at the largest ones (the top 20 in the bar charts) and at a small subset of tickers in the pie chart.

# Aggregated market_cap by ticker
aggregated_data <- return |>
  group_by(ticker) |>
  summarise(total_market_cap = sum(market_cap, na.rm = TRUE))
# Filter for the top 20 tickers by market cap
top_n <- 20 
aggregated_data_top_n <- aggregated_data |>
  top_n(n = top_n, wt = total_market_cap)
# Plot only the top N tickers
ggplot(aggregated_data_top_n, aes(x = reorder(ticker, -total_market_cap), y = total_market_cap)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  theme_minimal() +
  labs(title = "Total Market Cap by Ticker", x = "Ticker", y = "Total Market Cap") +
  coord_flip() # Flip coordinates for horizontal bars

For positive variables with very heavy tails, a log scale helps: use the +scale_x_log10() layer for histograms, or, as below, +scale_y_log10() since the market cap sits on the y-axis of the bar chart:

# Plot only the top N tickers with log-transformed y-axis
ggplot(aggregated_data_top_n, aes(x = reorder(ticker, -total_market_cap), y = total_market_cap)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  scale_y_log10() + # Log-transform the y-axis
  theme_minimal() +
  labs(title = "Total Market Cap by Ticker (Log Scale)", x = "Ticker", y = "Log of Total Market Cap") +
  coord_flip() # Flip coordinates for horizontal bars

# Using the first 10 tickers of the aggregated table (alphabetical order, not the top 10 by market cap)
top_tickers_data <- head(aggregated_data, 10)

# Creating a pie chart
ggplot(top_tickers_data, aes(x = "", y = total_market_cap, fill = ticker)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y") + # Convert the bar chart to a pie chart
  theme_void() + # Remove background and axes
  labs(title = "Market Cap Distribution Among Top Tickers") +
  theme(legend.title = element_blank()) # Hide the legend title

A few remarks on these plots:

  • In the bar charts, coord_flip() is used to flip the chart for better readability, especially when dealing with many tickers.
  • Pie charts are not ideal for datasets with many categories because they become cluttered and difficult to interpret, so we plot only a small subset of tickers (here the first 10 of the aggregated table) for simplicity.

Numerical Values

plot_histogram_boxplot <- function(data, var_name) {
  # Convert the variable name to a symbol for ggplot
  var <- sym(var_name)
  
  # Create a histogram plot
  p1 <- ggplot(data, aes(x = !!var)) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    theme_minimal() +
    labs(title = paste("Histogram of", var_name), x = var_name, y = "Count")
  
  # Create a boxplot
  p2 <- ggplot(data, aes(x = "", y = !!var)) +
    geom_boxplot(fill = "tomato", color = "black") +
    theme_minimal() +
    labs(title = paste("Boxplot of", var_name), x = "", y = var_name)
  
  # Combine plots using grid.arrange
  grid.arrange(p1, p2, ncol = 2)
}
# List of quantitative variable names to plot
quantitative_vars <- c("price", "market_cap", "price_to_book", "debt_to_equity", 
                       "profitability", "volatility", "revenue", "return")

# Loop through each variable and plot
for(var_name in quantitative_vars) {
  print(paste("Plotting for:", var_name)) # Print the variable name being plotted
  plot_histogram_boxplot(return, var_name)
}
[1] "Plotting for: price"

[1] "Plotting for: market_cap"

[1] "Plotting for: price_to_book"

[1] "Plotting for: debt_to_equity"

[1] "Plotting for: profitability"

[1] "Plotting for: volatility"

[1] "Plotting for: revenue"

[1] "Plotting for: return"

Modeling using ARIMA

ARIMA models are well suited for time series data that show patterns over time and can be made stationary through differencing. They are useful for forecasting future points in a series based on its own past values (autoregression) and a moving average of past errors. ARIMA is a natural choice here if the return variable shows autocorrelation over time, i.e. if past returns are predictive of future returns.

Because of the large number of stocks, we will explore a few individual tickers to keep things simple.

Example 1 - “AAPL US Equity” ticker

Data Preprocessing

ticker_name <- "AAPL US Equity"  # Replace with the ticker you want to predict

# Filter the data for the selected ticker
ticker_data <- return |> filter(ticker == ticker_name)

# Order the data by date
ticker_data <- ticker_data |> arrange(date)
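Before fitting, we can quickly check whether past returns carry information about future returns for this ticker; a minimal sketch using the forecast package's ACF plot:

# Optional check: autocorrelation of this ticker's monthly returns
ggAcf(ticker_data$return) +
  labs(title = paste("ACF of monthly returns -", ticker_name))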

Split the Data

# Calculate the splitting index
split_index <- floor(0.8 * nrow(ticker_data))

# Create training and testing datasets
train_data <- ticker_data[1:split_index, ]
test_data <- ticker_data[(split_index + 1):nrow(ticker_data), ]

Fit the ARIMA Model

# Fit an ARIMA model
return_arima <- auto.arima(train_data$return)
summary(return_arima)
Series: train_data$return 
ARIMA(1,0,1) with non-zero mean 

Coefficients:
         ar1      ma1    mean
      0.6134  -0.5783  0.0265
s.e.  1.1523   1.1973  0.0085

sigma^2 = 0.01625:  log likelihood = 169.39
AIC=-330.77   AICc=-330.62   BIC=-316.5

Training set error measures:
                       ME      RMSE        MAE  MPE MAPE      MASE         ACF1
Training set 0.0001031105 0.1267608 0.09507222 -Inf  Inf 0.7166279 0.0004694282

Predicting Future Returns

# Make predictions
predictions <- forecast(return_arima, h = nrow(test_data))  # 'h' is the number of periods to predict

# Plot the predictions against the actual test data
plot(predictions)
lines(test_data$date, test_data$return, col = "red")

The blue line in the graph likely represents the forecasted values generated by the ARIMA model for the specified future periods (as defined by h = nrow(test_data)). It is the model’s best estimate of the time series’ central tendency moving forward.

Representation :

  • Black Line : This is the historical time series data on which the model was trained.
  • Blue Line : This line is the point forecast from the ARIMA model, indicating the expected value of the series at each future point.
  • Grey Shaded Area : This depicts the prediction intervals (often 80% and 95%) around the point forecasts, representing the uncertainty in the forecasts. The lighter, wider band corresponds to the higher nominal confidence level (typically 95%), the darker, narrower band to the lower one (typically 80%).

In the graph, the blue line shows where the model predicts the return will be, on average, for the next h periods. The shaded area around it indicates the level of confidence the model has in its predictions; the actual future values are expected to fall within this range most of the time, given the model assumptions hold true.

The range of 0 to 300 on the x-axis of the ARIMA forecast plot represents the index of the observations in the time series. This is the common default when the series object has no associated time/date attribute and the plotting function is not explicitly told to use a date variable for the x-axis. The y-axis, labelled return, shows the values being forecast by the ARIMA(1,0,1) model.
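If we prefer calendar dates on the x-axis, one option is to rebuild the comparison manually from the forecast object; a minimal sketch over the test window only (assuming the predictions and test_data objects created above):

# Sketch: point forecasts vs actual returns on a calendar-date axis (test window only)
plot(test_data$date, test_data$return, type = "l", col = "red",
     xlab = "Date", ylab = "Return",
     main = paste("ARIMA forecast vs actual returns -", ticker_name))
lines(test_data$date, as.numeric(predictions$mean), col = "blue")  # point forecasts
legend("topleft", legend = c("Actual", "Forecast"), col = c("red", "blue"), lty = 1)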

Performance Evaluation

# Calculate accuracy metrics
accuracy(predictions, test_data$return)
                        ME       RMSE        MAE      MPE     MAPE      MASE
Training set  0.0001031105 0.12676080 0.09507222     -Inf      Inf 0.7166279
Test set     -0.0001558202 0.09022325 0.07685722 114.7769 124.9835 0.5793283
                     ACF1
Training set 0.0004694282
Test set               NA

In the evaluation, the -Inf and Inf values for MPE and MAPE indicate that some actual values are zero or very close to zero; in such cases the percentage errors become infinite or undefined, which renders these metrics unsuitable for this dataset.

The RMSE value of 0.0902 for the test set indicates that the forecast errors are moderate, and ideally, we would look for a lower RMSE for better forecast accuracy. The MAE is measured at 0.0769 for the test set, which similarly points to moderate errors; as with RMSE, a lower MAE would mean the predictions are closer to the actual values. The MASE stands at 0.579 for the test set, which is less than 1, suggesting that the forecasting model is performing better than a naive benchmark model.
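As a sanity check, the test-set RMSE and MAE reported by accuracy() can be recomputed directly from the point forecasts:

# Recompute the test-set errors by hand from the point forecasts
point_fc <- as.numeric(predictions$mean)
errors   <- test_data$return - point_fc
sqrt(mean(errors^2))   # test-set RMSE, ~0.090
mean(abs(errors))      # test-set MAE,  ~0.077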

Example 2 - “ZIXI US Equity” ticker

Data Preprocessing

ticker_name <- "ZIXI US Equity"  # Replace with the ticker you want to predict

# Filter the data for the selected ticker
ticker_data <- return |> filter(ticker == ticker_name)

# Order the data by date
ticker_data <- ticker_data |> arrange(date)

Split the Data

# Calculate the splitting index
split_index <- floor(0.8 * nrow(ticker_data))

# Create training and testing datasets
train_data <- ticker_data[1:split_index, ]
test_data <- ticker_data[(split_index + 1):nrow(ticker_data), ]

Fit the ARIMA Model

# Fit an ARIMA model
return_arima <- auto.arima(train_data$return)
summary(return_arima)
Series: train_data$return 
ARIMA(0,0,0) with non-zero mean 

Coefficients:
        mean
      0.0278
s.e.  0.0162

sigma^2 = 0.06814:  log likelihood = -19.22
AIC=42.44   AICc=42.49   BIC=49.56

Training set error measures:
                       ME     RMSE       MAE  MPE MAPE      MASE          ACF1
Training set 3.956344e-14 0.260537 0.1697435 -Inf  Inf 0.6678411 -0.0007031951

Predicting Future Returns

# Make predictions
predictions <- forecast(return_arima, h = nrow(test_data))  # 'h' is the number of periods to predict

# Plot the predictions against the actual test data
plot(predictions)
lines(test_data$date, test_data$return, col = "red")

Performance Evaluation

# Calculate accuracy metrics
accuracy(predictions, test_data$return)
                        ME      RMSE        MAE  MPE MAPE      MASE
Training set  3.956344e-14 0.2605370 0.16974349 -Inf  Inf 0.6678411
Test set     -1.277648e-02 0.1217997 0.08573132 -Inf  Inf 0.3373024
                      ACF1
Training set -0.0007031951
Test set                NA

The RMSE on the test set is 0.122, which indicates the typical size of the forecast errors. The MASE being below one (0.337) indicates that the model performs better than a naive benchmark on the test data set.

Modeling using lightGBM

Data Splitting

In this section, we fit a boosted tree from the lightGBM package. We keep the splitting simple: 80% of the rows go to training, a random sample of the remaining rows forms the test set (which is also passed to lgb.train() for early stopping), and a further slice is set aside for possible future validation.

# Setting seed for reproducibility
set.seed(123)

# Data preparation
train_indices <- sample(1:nrow(return), 0.8 * nrow(return))  # 80% for training
train_data <- return[train_indices, ]
remaining_indices <- setdiff(1:nrow(return), train_indices)  # Indices not in training set

# Test set
test_size <- 0.2  # fraction of the remaining (non-training) rows used for testing
test_indices <- sample(remaining_indices, test_size * length(remaining_indices))
test_data <- return[test_indices, ]

# Future validation set
remaining_indices <- setdiff(remaining_indices, test_indices)  # Remove test indices
validation_indices <- sample(remaining_indices, 0.2 * length(remaining_indices))  # 20% of remaining data for validation
validation_data <- return[validation_indices, -which(names(return) == "return")]  # Remove 'return' column
# Convert data to LightGBM format
train_data_lgb <- lgb.Dataset(data = as.matrix(train_data[, -which(names(train_data) == "return")]), label = train_data$return)
Warning in storage.mode(data) <- "double": NAs introduced by coercion
test_data_lgb <- lgb.Dataset(data = as.matrix(test_data[, -1]), label = test_data$return, reference = train_data_lgb)  # [, -1] drops only the ticker column
Warning in storage.mode(data) <- "double": NAs introduced by coercion
# Define LightGBM parameters
train_params <- list(
  objective = "regression",  # Regression task
  metric = "rmse",  # Root Mean Square Error as the evaluation metric
  num_leaves = 31,  # Maximum number of leaves in one tree
  learning_rate = 0.1,  # Learning rate
  feature_fraction = 0.8,  # Percentage of features used per iteration
  bagging_fraction = 0.8,  # Percentage of data used per iteration
  bagging_freq = 5,  # Frequency for bagging
  verbose = -1  # No print updates
)
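The "NAs introduced by coercion" warnings above come from the ticker (character) and date (Date) columns being forced into a numeric matrix; note also that the training matrix drops return while the test matrix drops only the first column, so the two matrices are not aligned column for column. A hedged alternative, not the run used for the results below, is to build both datasets from an explicit list of numeric predictors:

# Sketch: numeric-only feature matrices, aligned between train and test (not used for the run below)
feature_cols    <- setdiff(names(return), c("ticker", "date", "return"))
train_lgb_clean <- lgb.Dataset(data = as.matrix(train_data[, feature_cols]),
                               label = train_data$return)
test_lgb_clean  <- lgb.Dataset(data = as.matrix(test_data[, feature_cols]),
                               label = test_data$return, reference = train_lgb_clean)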

Training

lgb_model <- lgb.train(params = train_params,
                       data = train_data_lgb,
                       nrounds = 100,  # Number of boosting iterations (trees)
                       valids = list(validation = test_data_lgb),
                       early_stopping_rounds = 10)  # Early stopping

Prediction & evaluation

predictions <- predict(lgb_model, newdata = as.matrix(test_data[, -1]))
Warning in storage.mode(data) <- "double": NAs introduced by coercion
rmse <- sqrt(mean((predictions - test_data$return)^2))
cat("RMSE:", rmse, "\n")
RMSE: 0.1255344 
mae <- mean(abs(predictions - test_data$return))
cat("MAE:", mae, "\n")
MAE: 0.08153906 

We obtain a MAE of about 0.082 (roughly 8 percentage points of monthly return) and an RMSE of about 0.126 on the test set, which is not negligible for this task.
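Another metric sometimes reported for return prediction is the out-of-sample R², which compares the model against a constant forecast equal to the training-set mean return; a small sketch using the objects above:

# Out-of-sample R^2 versus a naive forecast of the training-set mean return
naive_fc <- mean(train_data$return)
r2_oos   <- 1 - sum((test_data$return - predictions)^2) /
                sum((test_data$return - naive_fc)^2)
cat("Out-of-sample R^2:", round(r2_oos, 4), "\n")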

Interpretability

In the realm of finance, where money management is pivotal, there is a significant preference for transparent algorithms that support decision-making. Stakeholders need to understand the underlying reasoning of predictive models, which leads to the crucial distinction between global and local interpretability.

Global interpretability allows us to comprehend the model's mechanisms over the aggregate data, offering insight into how the model makes decisions across the broad spectrum of inputs. The feature importance chart of the lightGBM model we trained, which displays the gains aggregated across the ensemble of trees, is an example of global interpretability. Such clarity is not just academic but practical, allowing for refined models that keep their predictive logic transparent.

Feature Importance

Feature importance helps in understanding the model's decision-making process and in reducing the feature space for more efficient models.

lgb_importance <- lgb.importance(model = lgb_model, percentage = TRUE)
print(lgb_importance)
          Feature       Gain        Cover Frequency
           <char>      <num>        <num>     <num>
1:     volatility 0.32614683 0.4703797605 0.2000000
2:  price_to_book 0.25451039 0.4439211563 0.2000000
3:     market_cap 0.18233986 0.0500271586 0.2666667
4:          price 0.10241046 0.0008298006 0.1000000
5:        revenue 0.08431652 0.0343258741 0.1333333
6: debt_to_equity 0.05027594 0.0005162500 0.1000000
lgb.plot.importance(lgb_importance)

volatility seems to be the most important feature, contributing the most to the model’s predictions, followed by price_to_book and market_cap. The features price, revenue, and debt_to_equity have lower importance scores in this model.
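Local interpretability, explaining a single prediction, can be explored with the iml package installed at the top of the notebook. The sketch below is only illustrative: it assumes a booster trained on the numeric-only, aligned feature matrices from the earlier sketch, so that the columns passed to the predict function match what the model was trained on.

# Sketch: Shapley decomposition of one prediction with iml (assumes a numeric-only model)
library(iml)

num_feats <- setdiff(names(test_data), c("ticker", "date", "return"))
X_test    <- as.data.frame(test_data[, num_feats])

# Wrap the booster with a predict function that iml can call
pred_fun  <- function(model, newdata) predict(model, newdata = as.matrix(newdata))
predictor <- Predictor$new(lgb_model, data = X_test, y = test_data$return,
                           predict.function = pred_fun)

# Shapley values for the first test observation
shap <- Shapley$new(predictor, x.interest = X_test[1, ])
plot(shap)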