Within finance, neural networks can be used for many tasks: feature extraction, data generation, classification, regression, ranking, survival analysis, time-series forecasting, and optimization.
Representation learning (Neural networks)
Unsupervised Learning (Autoencoding, Word2Vec)
Synthetic Data Generation (VAEs, GANs)
Supervised Learning (MLP, CNN)
Time Series Forecasting (RNN, CNN, Attention, N-Beats)
The idea is simple: the model takes inputs x, multiplies each by its weight w, and sums the products. The weighted sum ∑ᵢ₌₁ᵐ xᵢwᵢ is then passed through a non-linear activation function g. The output of this function gives the prediction ŷ.
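A minimal sketch of this single neuron in NumPy (the input and weight values are made up for illustration):

```python
import numpy as np

# A single artificial neuron: weighted sum of inputs passed through
# a non-linear activation function (here the sigmoid).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.2, 3.0])   # inputs x_1..x_m
w = np.array([0.4, 0.1, -0.2])   # weights w_1..w_m

z = np.sum(x * w)                # weighted sum: sum_i x_i * w_i
y_hat = sigmoid(z)               # activation g(z) gives the prediction
```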
Backpropagation implies that if the network produces an unsatisfactory outcome, we go back and adjust the weights of the neurons and their connections.
This is how the network learns from its mistakes:
A training cycle consists of a forward and backward pass.
To repeat: every neural network computation can be divided into two parts: a forward propagation phase, where information "flows" forward to compute the predictions and the error;
And a backward propagation phase, where the backpropagation algorithm computes the error derivatives and updates the network weights.
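One full training cycle for a single sigmoid neuron can be sketched as follows (squared-error loss and the learning rate are illustrative choices, not the only options):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])   # inputs
w = np.array([0.5, -0.3])  # initial weights
y = 1.0                    # target
lr = 0.1                   # learning rate

# Forward pass: prediction and error
z = x @ w
y_hat = sigmoid(z)
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: chain rule dL/dw = (y_hat - y) * g'(z) * x,
# where g'(z) = y_hat * (1 - y_hat) for the sigmoid
grad_w = (y_hat - y) * y_hat * (1 - y_hat) * x
w_new = w - lr * grad_w    # gradient-descent weight update

# After the update, the same input produces a smaller error
loss_new = 0.5 * (sigmoid(x @ w_new) - y) ** 2
```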
Quick recap of matrix notation:
First, let's look back at our previous multilayer perceptron; we will now label the linear function as λ(), the sigmoid function as σ(), and the probability threshold function as τ().
We will implement a Multilayer Perceptron with one hidden layer by translating all our equations into code.
For one, it is important to know that we don't perform the loops that the summation notation might imply.
Loops are known to be highly inefficient computationally, so we will want to avoid them.
Fortunately, we can use matrix operations (i.e. vectorization) to achieve the exact same result much faster.
In the example below we don't perform 12 separate scalar operations, but a single matrix operation.
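The point can be checked directly: a loop over 4 samples × 3 weights performs 12 scalar multiply-adds, while one matrix product gives the identical result (the shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))   # 4 samples, 3 features
W = rng.normal(size=(3,))     # 3 weights -> 4 * 3 = 12 multiplications

# Loop version: the 12 multiply-adds the summation notation implies
z_loop = np.array([sum(X[i, j] * W[j] for j in range(3)) for i in range(4)])

# Vectorized version: one matrix operation does the same work
z_vec = X @ W
```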
Neural Network Steps
Remember that we need to compute the following operations in order:
Linear function aggregation z
Sigmoid function activation a
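A sketch of the full forward pass for a one-hidden-layer MLP, using the λ/σ/τ labels from above; the weights are random placeholders rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(1)

def lam(X, W, b):            # λ: linear aggregation z = XW + b
    return X @ W + b

def sig(z):                  # σ: sigmoid activation
    return 1.0 / (1.0 + np.exp(-z))

def tau(p, threshold=0.5):   # τ: threshold probabilities into class labels
    return (p >= threshold).astype(int)

X = rng.normal(size=(5, 3))                      # 5 samples, 3 features
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)    # hidden layer weights
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)    # output layer weights

a1 = sig(lam(X, W1, b1))     # hidden activations
p = sig(lam(a1, W2, b2))     # output probabilities
y_hat = tau(p)               # class predictions (0 or 1)
```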
C. NNs from Scratch
There are two ways to think of dimensionality reduction, (1) the first is as a method to represent features in a lower dimensional space within the same tensor rank, (2) the second is a method to move from a high to a low tensor rank.
Rank to Rank
The standard dimensionality reduction technique we might use is PCA, shrinking a 2D array (matrix) of 100 features down to 5 features by forming an efficient representation in a new 2D array (represented on the bottom right).
The equivalent techniques for 3D and 4D arrays are not widely used; one is called Matrix Product State, and these would allow you to go from 3D→3D and 4D→4D.
Since the development of neural networks, a model known as the Autoencoder has become popular for learning an efficient lower-dimensional representation of the data within the same rank.
Autoencoders are very flexible non-linear decomposition methods and can work with tensors of any rank, including rank two (2D), three (3D), and four (4D) tensors.
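A minimal sketch of the autoencoder idea in NumPy: a linear encoder compresses 10 features into a 3-dimensional bottleneck, a linear decoder reconstructs them, and both are trained by gradient descent on the reconstruction error. All shapes, the learning rate, and the iteration count are illustrative; real autoencoders add non-linear activations and use a deep learning framework.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))               # 200 samples, 10 features

k = 3                                        # latent (bottleneck) dimension
W_enc = rng.normal(scale=0.1, size=(10, k))  # encoder weights
W_dec = rng.normal(scale=0.1, size=(k, 10))  # decoder weights
lr = 0.05

for _ in range(1000):
    Z = X @ W_enc                # encode: lower-dimensional representation
    X_hat = Z @ W_dec            # decode: reconstruction of the input
    err = X_hat - X              # reconstruction error
    # Gradients of the mean squared reconstruction loss
    grad_dec = Z.T @ err / len(X)
    grad_enc = X.T @ (err @ W_dec.T) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

loss = np.mean((X @ W_enc @ W_dec - X) ** 2)
```

A purely linear autoencoder like this recovers roughly the same subspace as PCA; the non-linear activations of a real autoencoder are what make it a more flexible decomposition.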
High to Low Rank
You can move from a 4D tensor (tesseract) to a 3D tensor using Matrix Product State.
You can move from 3D (tensor) to 2D (matrix) using the Tucker and CANDECOMP decompositions.
We can also use PCA to go from a 2D array to a 1D array, by only choosing the first principal component.
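The 2D→1D case can be sketched with PCA computed via the SVD (random data stands in for real features here):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))        # 2D array: 100 samples, 5 features
Xc = X - X.mean(axis=0)              # center the data

# PCA via SVD: keep only the first principal component
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Vt[0]                          # direction of maximum variance
X_1d = Xc @ pc1                      # project: 2D array -> 1D array
```

By construction, the projection onto the first principal component has at least as much variance as any single original feature.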
D. Dimensionality Reduction (Autoencoder)
Convolutional Neural Networks (CNNs)
CNNs were originally used for pattern and object recognition on image data. The image data is converted into a 3D array of RGB (Red, Green, Blue) color channels. In the example on the bottom right, the 3 channels for 4 pixels are captured.
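The 4-pixel, 3-channel layout can be written out directly (the pixel colors are arbitrary examples):

```python
import numpy as np

# A tiny 2x2 image: 4 pixels, each with 3 RGB channel values,
# stored as a 3D array of shape (height, width, channels).
image = np.array([
    [[255,   0,   0], [  0, 255,   0]],   # red pixel,  green pixel
    [[  0,   0, 255], [255, 255, 255]],   # blue pixel, white pixel
], dtype=np.uint8)

height, width, channels = image.shape     # (2, 2, 3)
```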
Stock Trading Via Images
It wasn’t long until developers from finance realized that you could use the same format above to solve problems in finance.
Tucker Balch (JP Morgan AI Research) created a large sample of financial time-series images encoded as candlestick (box-and-whisker style) charts.
They labeled the samples following three algebraically-defined binary trade strategies (Murphy, 1999).
The authors realized that there are various ways in which one can “visualize” stock market data.
We can also add many technical features and let the CNN learn to identify the direction signaled by each technical indicator.
Moreover, we can vary the image resolution (pixels) as a robustness test to see the variation in performance.
Here the purpose was not to predict the direction of the stock, but instead see if they can capture the labels of technical indicator methods.
E. Prediction (CNNs)
Sequence (order of data) matters in many domains, like reading and speech, where each word is seen in a context based on our understanding of the previous words.
The non-linear equivalent of the ARIMA model is called a Recurrent Neural Network (RNN).
Below is how you can convert a Feed-Forward (Multilayer Perceptron) Neural Network into a Recurrent Neural Network.
FFN versus RNN
An RNN has two desirable properties: (1) it can process and output sequences of arbitrary length, and (2) it incorporates information from previous steps into each prediction.
The difference between feed-forward models (left) and recurrent neural network models (right) is minimal.
When an RNN model has to calculate the new hidden state it uses feature inputs like FFN, but also the previous hidden state.
The FFN (MLP) model is the standard NN that relies purely on the current input, without taking previous information into account for subsequent predictions.
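The recurrent update can be sketched in a few lines of NumPy: the new hidden state depends on the current input and the previous hidden state, which is exactly the term a feed-forward network lacks (the dimensions and tanh activation are conventional illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 3, 4

W_x = rng.normal(scale=0.1, size=(n_in, n_hid))   # input -> hidden weights
W_h = rng.normal(scale=0.1, size=(n_hid, n_hid))  # hidden -> hidden (recurrence)
b = np.zeros(n_hid)

def rnn_step(x_t, h_prev):
    # New hidden state uses the current input AND the previous hidden state
    return np.tanh(x_t @ W_x + h_prev @ W_h + b)

# Unroll the recurrence over a short sequence of 5 time steps
h = np.zeros(n_hid)
sequence = rng.normal(size=(5, n_in))
for x_t in sequence:
    h = rnn_step(x_t, h)
```

Setting W_h to zero would reduce each step to an ordinary feed-forward layer, which is the sense in which the difference between the two architectures is minimal.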
F. Prediction (RNNs)
Recall that internal nodes (the circles ∘ in the middle) simply represent scalar values z, namely the linearly weighted input values (inputs × weights, e.g., x×w).
Tip: there are different ways in which to present the activated nodes (input z with an activation function g(z) applied over it).
For autoencoders, these values z (all ten internal z circles) are obtained by the choice of weights W that minimizes the final reconstruction error.
However, we can do something more interesting, we can force several z values to become parameters on a distribution N(μ,σ2) instead of simple unconstrained scalar values.
The mean μ and variance σ² together form a normal probability distribution from which we can sample a new latent space.
Sampling from the distribution creates samples that can be decoded and turned into new outputs.
A single node would have a latent sample zᵢ = μᵢ + σᵢ×εᵢ; the mean and standard deviation in the equation define the shape of the distribution, while the Gaussian noise εᵢ ∼ N(0,1) defines the position of the tiny red dot on the distribution.
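This sampling step (the "reparameterization trick" used in VAEs) is easy to sketch; the μ and σ values below are illustrative stand-ins for what a trained encoder would output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent sample z_i = mu_i + sigma_i * eps_i, with eps_i ~ N(0, 1).
mu = np.array([0.0, 1.5])     # means of two latent nodes
sigma = np.array([1.0, 0.5])  # standard deviations of two latent nodes

eps = rng.standard_normal(size=(10_000, 2))  # standard Gaussian noise
z = mu + sigma * eps                         # samples from N(mu, sigma^2)
```

Writing the sample as a deterministic function of μ, σ, and external noise ε is what lets gradients flow back through the sampling step during training.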
G. Data Generation (VAEs)
Many (supervised) machine learning systems look at some kind of complicated input (say, an image) and produce a simple output (a label like "cat", "dog", "female", or "male").
By contrast, the goal of a generative model is quite the opposite: it takes a small piece of input—perhaps a few random numbers—and produces a complex output, like an image of a realistic-looking face.
GANs are only one of many types of generative model; they are known to produce state-of-the-art results in data generation, including for images and video.
Discriminative models (nets) like CNNs or MLPs attempt to learn the probability of Y given X, whereas generative models (nets) like VAEs and GANs attempt to learn a simulator of the data X.
Probabilistically, we can say generative models attempt to model the full joint distribution P(X,Y), whereas discriminative models only want the posterior probability P(Y|X).
In a simple classification task (discriminative model), you might be provided with the foot size X=243, and the plot above shows how you could estimate P(Y|X) → female.
Whereas in a data-generating task (generative model), you would be provided with the sex Y=female.
With this you would sample fake data from the class-conditional likelihood P(X|Y), and these samples would ideally be indistinguishable from the true values.
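Both uses can be sketched with a toy model of foot size: two class-conditional Gaussians stand in for the data simulator (the means and standard deviations below are invented for illustration; sizes are in mm):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy generative model of foot size X given sex Y: (mean, std) per class
params = {"female": (235.0, 10.0), "male": (265.0, 12.0)}

# Generative use: given Y = "female", sample fake foot sizes from P(X|Y)
fake_female = rng.normal(*params["female"], size=5)

# Discriminative use: given X = 243, compute P(Y|X) via Bayes' rule
def gauss_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = 243.0
lik_f = gauss_pdf(x, *params["female"])   # likelihood P(X|Y=female)
lik_m = gauss_pdf(x, *params["male"])     # likelihood P(X|Y=male)
p_female = lik_f / (lik_f + lik_m)        # posterior, assuming equal priors
```

Knowing the joint (here via P(X|Y) and a prior) lets the generative model answer the discriminative question too, which is why the Y→X→Y→X chaining below is possible.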
Generative models can therefore do cool stuff like this Y→X→Y→X
Given training data, you can generate new samples from the same distribution:
H. Data Generation (GANs)
Feature Generation and Deep Learning
Assignment on Colab
Submission: May 9, 2023 11:59 PM
You can open the notebook and make your own copy.
Once you are done, you can download the .ipynb file and upload it to Brightspace.
There will be no extensions for this assignment; please submit on time.
The fourth project is the development of a notebook (code + explanation) that successfully engineers 12 unique features → 3 for each type of feature engineering: transforming, interacting, mapping, and extracting.
The second part of the assignment is the development of a deep learning classification model to predict the direction of the S&P500 for the dates 2018-01-01—2018-07-12 (test set).
The feature engineering section is unrelated to the model section: you can develop any feature, not just features that would work for deep learning models (later on you can decide which features to use in your model).
You also have to uncomment (remove the #) all the example features and make them run successfully → every feature example has one or more errors that you have to fix. Please also describe the error/s you fixed!
Note that we won't be attempting to measure the quality of every feature (i.e., how much it improves the model), that is slightly too advanced for this course.
A. New features:
I. Features & Deep Learning