According to historical records, the S&P 500 has returned approximately 10% to 11% per year on average since its inception in 1926. However, is it possible to build a machine learning model that predicts whether a specific stock’s price will rise or fall in the next year based on last year’s financial indicators? In addition, is there a method that can predict how much the stock price will change in the following year?
In this research, more than 200 financial indicators of US stocks from 2014 to 2018 were collected to predict each stock’s percent price variation and price trend in the following year.
This dataset is from Kaggle: https://www.kaggle.com/cnic92/200-financial-indicators-of-us-stocks-20142018. In this dataset, the last column represents the class of each stock: if the value of a stock increases during the next year, the class equals 1; if it decreases, the class equals 0. The second-to-last column, PRICE VAR [%], lists the percent price variation of each stock for the next year. For example, in the 2015 dataset it is the percent price variation for the year 2016 (meaning from the first trading day of January 2016 to the last trading day of December 2016). The PRICE VAR [%] and class columns make it possible to use the datasets for both classification and regression tasks.
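A minimal sketch of how the two target columns relate, using a tiny made-up frame rather than the Kaggle files. The column name `PRICE VAR [%]` follows the description above; the tickers and numbers are invented, and treating a variation of exactly zero as class 0 is an assumption.

```python
import pandas as pd

# Toy rows standing in for one yearly file; all values are invented.
toy = pd.DataFrame(
    {"PRICE VAR [%]": [12.5, -3.2, 40.1, -18.7]},
    index=["AAA", "BBB", "CCC", "DDD"],  # hypothetical tickers
)

# Classification target: 1 if the price rose over the next year, else 0.
toy["Class"] = (toy["PRICE VAR [%]"] > 0).astype(int)

# The regression target is the percent variation itself.
print(toy)
```

Because the class is just the sign of the percent variation, the two columns carry overlapping information, which matters later when choosing features.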
1. Data Exploration
After loading the datasets, we do some feature engineering and feature selection. Features were calculated and selected from five dimensions: liquidity (solvency) ratios, leverage (debt) ratios, operation (asset efficiency) ratios, profitability ratios and valuation (market value) ratios.
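As an illustration of the five dimensions, here is a hedged sketch that computes one textbook ratio per dimension. The function name `add_ratio_features`, the input column names, and the numbers are all hypothetical; the original notebook may compute different ratios from differently named columns.

```python
import numpy as np
import pandas as pd

def add_ratio_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Return one illustrative ratio per dimension (column names are assumed)."""
    out = pd.DataFrame(index=raw.index)
    out["current_ratio"] = raw["current_assets"] / raw["current_liabilities"]  # liquidity
    out["debt_to_equity"] = raw["total_debt"] / raw["total_equity"]            # leverage
    out["asset_turnover"] = raw["revenue"] / raw["total_assets"]               # operation
    out["net_margin"] = raw["net_income"] / raw["revenue"]                     # profitability
    out["pe_ratio"] = raw["price"] / raw["eps"]                                # valuation
    # Division by zero yields +/-inf; treat those values as missing.
    return out.replace([np.inf, -np.inf], np.nan)

# Invented balance-sheet style inputs for two hypothetical companies.
raw = pd.DataFrame(
    {
        "current_assets": [200.0, 50.0],
        "current_liabilities": [100.0, 0.0],
        "total_debt": [300.0, 20.0],
        "total_equity": [150.0, 80.0],
        "revenue": [1000.0, 400.0],
        "total_assets": [500.0, 200.0],
        "net_income": [100.0, -40.0],
        "price": [30.0, 12.0],
        "eps": [2.0, -0.5],
    },
    index=["AAA", "BBB"],
)
features = add_ratio_features(raw)
print(features)
```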
Peek at 2018 Financial Indicators Dataset.
There are several takeaways from the 2018 dataset: 4194 observations and 21 columns (18 numeric financial indicators, 1 integer class column, and 2 object columns for company name and sector).
Now, let’s get more information about this dataset. Datasets of other years can be explored in the same way.
First, take a look at the distribution of the stock price variation overall and by sector.
Takeaways: the median stock price variation is approximately 20% overall, but it differs from sector to sector.
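The overall and per-sector medians can be computed with a `groupby`. This is a sketch on invented numbers: the `Sector` column name is an assumption, and `2019StockPriceVar` is the target name the post uses for the 2018 data.

```python
import pandas as pd

# Invented values for illustration only.
df = pd.DataFrame(
    {
        "Sector": ["Technology", "Technology", "Energy", "Energy", "Energy"],
        "2019StockPriceVar": [30.0, 10.0, -5.0, 15.0, 40.0],
    }
)

# Median across all stocks, then the median within each sector.
overall_median = df["2019StockPriceVar"].median()
sector_median = df.groupby("Sector")["2019StockPriceVar"].median()

print(overall_median)
print(sector_median)
```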
Next, let us visualize the relationship between next year’s stock price variation and the financial indicators (in groups) by sector, with outliers removed.
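The post does not say how outliers were removed. One common option, shown here purely as an assumption, is the interquartile-range (IQR) rule; the helper name `drop_iqr_outliers`, the column name, and the values are hypothetical.

```python
import pandas as pd

def drop_iqr_outliers(frame: pd.DataFrame, column: str, k: float = 1.5) -> pd.DataFrame:
    """Keep rows whose value lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = frame[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = frame[column].between(q1 - k * iqr, q3 + k * iqr)
    return frame[mask]

# Invented ratios; the last value is an obvious outlier.
df = pd.DataFrame({"debt_to_equity": [0.5, 0.8, 1.0, 1.2, 1.5, 250.0]})
clean = drop_iqr_outliers(df, "debt_to_equity")
print(clean)
```

Financial ratios often have extreme tails (for example when equity is close to zero), so without some filter of this kind a scatter plot tends to be dominated by a handful of points.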
Leverage Financial Ratios
In addition, let us do a financial analysis of one company based on the 2014 to 2018 datasets. Other companies can be analyzed in the same manner.
Combine the yearly datasets for Apple into one dataset.
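A sketch of the combination step, assuming each yearly frame has a `Name` column holding the ticker. The column names are hypothetical and the numbers are placeholders, not real financials.

```python
import pandas as pd

# One small frame per year; placeholder values only.
yearly = {
    2017: pd.DataFrame({"Name": ["AAPL", "MSFT"], "net_margin": [0.1, 0.2]}),
    2018: pd.DataFrame({"Name": ["AAPL", "MSFT"], "net_margin": [0.3, 0.4]}),
}

frames = []
for year, frame in yearly.items():
    rows = frame[frame["Name"] == "AAPL"].copy()  # keep only the Apple row
    rows["Year"] = year                           # remember which file it came from
    frames.append(rows)

apple = pd.concat(frames, ignore_index=True).sort_values("Year")
print(apple)
```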
Visualize financial indicators of Apple from 2014 to 2018.
2. Prediction Modeling
Concatenate all five datasets and drop the stock price variation columns. This combined dataset will be used in the classification problem below.
In the regression problem, df2018 will be used as an example. Column ‘2019StockPriceVar’ is the target.
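A sketch of the concatenation, assuming each yearly frame names its price variation column after the following year (as with `2019StockPriceVar` for the 2018 data). The other column names and all values are invented.

```python
import pandas as pd

# Two toy yearly frames instead of five, to keep the example short.
yearly = {
    2017: pd.DataFrame(
        {"current_ratio": [1.2, 0.9], "2018StockPriceVar": [10.0, -5.0], "Class": [1, 0]}
    ),
    2018: pd.DataFrame(
        {"current_ratio": [1.5, 0.7], "2019StockPriceVar": [-2.0, 8.0], "Class": [0, 1]}
    ),
}

cleaned = []
for year, frame in yearly.items():
    # The price variation column has a different name in each file, so find it by suffix.
    price_var_cols = [c for c in frame.columns if c.endswith("StockPriceVar")]
    cleaned.append(frame.drop(columns=price_var_cols).assign(Year=year))

df = pd.concat(cleaned, ignore_index=True)
print(df)
```

Dropping these columns before classification matters because `Class` is the sign of the price variation, so leaving them in would hand the answer to the model.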
Identify the baseline
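The post does not state which regression baseline it used. A common choice, assumed here, is to always predict the mean of the training target and measure the mean absolute error; the numbers are invented.

```python
import numpy as np

# Invented percent price variations.
y_train = np.array([10.0, -20.0, 35.0, 5.0])
y_test = np.array([0.0, 30.0])

# Baseline: predict the training mean for every test stock.
baseline_pred = np.full_like(y_test, y_train.mean())
baseline_mae = np.mean(np.abs(y_test - baseline_pred))

print(baseline_mae)
```

Any regression model worth keeping should beat this error on held-out data.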
Linear regression modelling
Looks normal, right? The R-squared score is about 0.30, which means that this linear regression model explains 30% of the variance in the target on the train set. Wait a second, why is that ‘Class’ column still there? What will happen if I drop that column? Let’s see!
After removing the ‘Class’ column, we get an R-squared score of only 0.01, which means that the linear regression model explains only 1% of the variance in the target. Such a large drop signals data leakage: ‘Class’ is derived from the sign of the price variation we are trying to predict, so it would not be available at prediction time. Therefore, the ‘Class’ feature should not be included in a real model. It also shows that a linear regression model is not a good fit for this dataset.
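The leakage effect is easy to reproduce on synthetic data. In this sketch the features are pure noise, so the only useful column is a class label computed from the target itself. The data are random, so the scores will not match the 0.30 and 0.01 reported above; only the gap between them is the point.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 3))          # features unrelated to the target
y = rng.normal(scale=30.0, size=n)   # synthetic "price variation"
leaky = (y > 0).astype(float)        # 'Class' computed from the target itself

# Fit once with the leaky column and once without, scoring on the train set.
X_with_class = np.column_stack([X, leaky])
r2_with = LinearRegression().fit(X_with_class, y).score(X_with_class, y)
r2_without = LinearRegression().fit(X, y).score(X, y)

print(f"R^2 with Class:    {r2_with:.2f}")
print(f"R^2 without Class: {r2_without:.2f}")
```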
Next, let us use a visualization to see whether the features are related to the target.
The heat map shows the correlations between the features and our target ‘2019StockPriceVar’. Light blue means no correlation or a very weak one. Dark blue indicates a fairly strong correlation. Based on the correlation matrix, these features have only a very weak relationship with the target.
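The numbers behind such a heat map come from a correlation matrix. This sketch builds one on random data, with hypothetical feature names, and ranks the features by the absolute value of their correlation with the target.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame(
    rng.normal(size=(500, 3)),
    columns=["current_ratio", "debt_to_equity", "net_margin"],  # hypothetical features
)
df["2019StockPriceVar"] = rng.normal(size=500)  # target, independent of the features here

# Pairwise Pearson correlations between all columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first.
target_corr = (
    corr["2019StockPriceVar"].drop("2019StockPriceVar").abs().sort_values(ascending=False)
)
print(target_corr)
```

Passing `corr` to a plotting call such as seaborn's `heatmap` would give a picture like the one described. Note that Pearson correlation only captures linear relationships, so a weak value does not rule out a nonlinear one.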
For the classification problem, let us use the combined data frame ‘df’, with the ‘Class’ column as the target. ‘Class=0’ means that the stock price will decrease in the following year. ‘Class=1’ means that the stock price will increase in the following year.
Identify the baseline
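For classification, the usual baseline is the share of the majority class, which is the accuracy of always predicting that class. This sketch assumes that is what the 55% figure quoted in the conclusions refers to; the labels below are invented.

```python
import pandas as pd

# Invented labels: six stocks went up, four went down.
y = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 1, 0], name="Class")

# Share of each class; the largest share is the majority-class baseline.
class_share = y.value_counts(normalize=True)
baseline_accuracy = class_share.max()

print(class_share)
print(baseline_accuracy)
```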
In this case, an XGBoost classifier was chosen. XGBoost is a decision-tree-based ensemble machine learning algorithm that uses a gradient boosting framework.
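The following sketch shows the general fit-and-score workflow for a gradient-boosted tree classifier. To keep it dependent only on scikit-learn, it uses `GradientBoostingClassifier` as a stand-in rather than XGBoost itself; `xgboost.XGBClassifier` offers the same `fit` / `predict` / `predict_proba` interface. The data are synthetic and the hyperparameters are illustrative, not the ones used in the post.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(600, 4))
# Synthetic label: a weak signal in the first feature plus noise.
y = (0.8 * X[:, 0] + rng.normal(size=600) > 0).astype(int)

# Hold out a test set, keeping the class proportions the same in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, learning_rate=0.1, random_state=42
)
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)   # fraction of correct test predictions
proba = model.predict_proba(X_test)[:, 1]     # predicted probability of Class=1
print(f"test accuracy: {test_accuracy:.2f}")
```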
OK, different metrics will be used in the following part to evaluate our model.
ROC AUC score and ROC curve.
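A small worked example of both quantities with scikit-learn, on four hand-picked points rather than the model's real predictions. Three of the four positive-negative pairs are ranked correctly, so the area under the curve is 0.75.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])              # actual classes
y_score = np.array([0.1, 0.4, 0.35, 0.8])    # predicted probability of Class=1

# Area under the ROC curve: the chance a random positive outranks a random negative.
auc = roc_auc_score(y_true, y_score)

# Points of the ROC curve: false positive rate vs. true positive rate per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

print(auc)
print(list(zip(fpr, tpr)))
```

An AUC of 0.5 corresponds to random ranking, so for a model that barely beats the majority-class baseline one would expect a value only modestly above 0.5.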
SHAP Value Analysis
SHAP values (an acronym for SHapley Additive exPlanations) break down a prediction to show the impact of each feature.
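The additive breakdown can be checked by hand on a model simple enough to need no library. For a linear model with independent features, the Shapley value of feature i is its weight times the distance of the feature from its average. The model, weights, and data below are hypothetical; for the tree-based classifier in the post, the `shap` package (for example `shap.TreeExplainer`) would be the usual tool.

```python
import numpy as np

# Hypothetical linear model f(x) = w . x + b.
weights = np.array([0.5, -1.0, 2.0])
bias = 0.1

def model(X):
    return X @ weights + bias

# Background data defines the "average" prediction; x is the row to explain.
background = np.array([[1.0, 0.0, 2.0], [3.0, 2.0, 0.0], [2.0, 1.0, 1.0]])
x = np.array([4.0, 1.0, 3.0])

base_value = model(background).mean()           # expected model output
phi = weights * (x - background.mean(axis=0))   # per-feature contributions
prediction = model(x)

# The contributions add up to the gap between this prediction and the average one.
print(base_value, phi, prediction)
```

Here the second feature sits exactly at its average, so its contribution is zero, while the other two features push the prediction above the base value.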
From the above analysis and modeling, a few points stand out:
- For the regression problem, financial indicators from five dimensions were selected as features. These features are very important for financial analysis; however, they don’t seem to give much insight into companies’ stock price variation in the following year.
- The baseline of the classification problem is 55%, and the prediction accuracy of our model on the test set is 59%, only a few percentage points higher than the baseline. Relying on last year’s financial indicators to predict next year’s stock trend is not trustworthy. Past performance is not an entirely reliable metric for the future.
- The machine learning models chosen in the above analysis may not be appropriate for this scenario. The above prediction is cross-sector. If we do the analysis for a specific industry, the prediction result may vary.
Source code can be found at: https://github.com/yuanjinren/DS/blob/master/YuanjinRen_Unit2_Project.ipynb