I. Deep Learning
Lectures: 10–11
Difficulty: Hard
Homework: No
Description: Deep neural networks.

Deep Learning
Within finance we can use neural networks for so many things: feature extraction, data generation, classification, regression, ranking, survival analysis, time series forecasting, optimization…
Financial Use-case
Representation learning (Neural networks)
Unsupervised Learning (Autoencoding, Word2Vec)
Synthetic Data Generation (VAEs, GANs)
Supervised Learning (MLP, CNN)
Time Series Forecasting (RNN, CNN, Attention, N-Beats)
The Perceptron:
The idea is simple: the model takes inputs $x$, multiplies them by weights $w$, and adds them all together. The sum of products obtained, $\sum_{i=1}^{m} x_{i} w_{i}$, is then passed through a non-linear activation function $g$. The output of the function gives the prediction $\hat{y}$.
$\hat{y}=g\left(\sum_{i=1}^{m} x_{i} w_{i}\right)$
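As a quick sketch, the perceptron equation above can be written in a few lines of NumPy (the example inputs, weights, and the choice of sigmoid for $g$ are illustrative, not from the notes):

```python
import numpy as np

def perceptron(x, w, g):
    """Weighted sum of inputs x and weights w, passed through activation g."""
    z = np.dot(x, w)  # sum_i x_i * w_i
    return g(z)

# A common choice for the non-linear activation g
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0, -1.0])   # example inputs
w = np.array([0.5, -0.25, 0.1])  # example weights
y_hat = perceptron(x, w, sigmoid)
```

Here `np.dot` computes the summation in one vectorized call instead of an explicit loop.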
A. Introduction
Computing Gradients
Backpropagation implies that if the network produces an unsatisfactory outcome, we go back and adjust the weights of the neurons and their connections.
This is how the network learns from its mistakes:
A training cycle consists of a forward and a backward pass.
Forward propagation
• To repeat: all neural networks can be divided into two parts: a forward propagation phase, where the information "flows" forward to compute predictions and the error;
• and the backward propagation phase, where the backpropagation algorithm computes the error derivatives and updates the network weights.
Quick recap of matrix notation:
First let's reflect back on our previous multilayered perceptron; we will now label the linear function as $\lambda()$, the sigmoid function as $\sigma()$, and the probability threshold function as $\tau()$.
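Under this notation, a minimal sketch of the three functions (the example input, weights, and bias values are made up for illustration):

```python
import numpy as np

def lam(x, W, b):
    """Linear function lambda(): weighted sum plus bias."""
    return W @ x + b

def sigma(z):
    """Sigmoid function sigma(): squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tau(p, threshold=0.5):
    """Probability threshold function tau(): converts probabilities to 0/1 labels."""
    return (p >= threshold).astype(int)

x = np.array([0.2, 0.8])          # example input
W = np.array([[0.5, -0.5]])       # example weights
b = np.array([0.1])               # example bias
y_hat = tau(sigma(lam(x, W, b)))  # compose: threshold(sigmoid(linear(x)))
```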
B. Backpropagation
Practical Example
We will implement a Multilayer Perceptron with one hidden layer by translating all our equations into code.
Vectorization
• For one, it is important to know we don't perform the loops that the summation notation might imply.
• Loops are known to be highly inefficient computationally, so we will want to avoid them.
• Fortunately, we can use matrix operations (i.e. vectorization) to achieve the exact same result much faster.
• In the example below we don't perform 12 different operations, but simply 1 operation.
Instead of looping over each row in our training dataset we compute the outcome for each row all at once using linear algebra operations. NumPy is the most popular library for matrix operations and linear algebra in Python.
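A small illustration of the idea (the 12-row array and weight values are arbitrary): the explicit double loop and the single matrix-vector product give the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 3))  # 12 rows (samples), 3 features
w = rng.normal(size=3)        # one weight per feature

# Loop version: one summation per row, as the notation might suggest (slow)
loop_out = np.array(
    [sum(X[i, j] * w[j] for j in range(3)) for i in range(12)]
)

# Vectorized version: a single matrix-vector product (fast)
vec_out = X @ w

# Both give identical results for all 12 rows at once
assert np.allclose(loop_out, vec_out)
```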
Neural Network Steps
Remember that we need to compute the following operations in order:
1. Linear function aggregation $z$
2. Sigmoid function activation $a$
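Putting the two steps together, here is a from-scratch sketch of a one-hidden-layer MLP trained with forward and backward passes on a toy dataset (the layer sizes, learning rate, squared-error loss, and the synthetic data are illustrative choices, not prescribed by the notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 3))                        # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float)[:, None]   # toy binary target

W1 = rng.normal(scale=0.5, size=(3, 4)); b1 = np.zeros(4)  # hidden layer
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)  # output layer
lr = 0.5
losses = []

for _ in range(500):
    # Forward pass: linear aggregation z, then sigmoid activation a
    z1 = X @ W1 + b1; a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2; a2 = sigmoid(z2)
    losses.append(((a2 - y) ** 2).mean())

    # Backward pass: error derivatives (squared-error loss), then weight updates
    d2 = (a2 - y) * a2 * (1 - a2)
    d1 = (d2 @ W2.T) * a1 * (1 - a1)
    W2 -= lr * a1.T @ d2 / len(X); b2 -= lr * d2.mean(axis=0)
    W1 -= lr * X.T @ d1 / len(X); b1 -= lr * d1.mean(axis=0)
```

Each iteration is one full training cycle: a forward pass that computes predictions and the error, and a backward pass that propagates the error derivatives back through both layers.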
C. NNs from Scratch
Dimensionality Reduction
There are two ways to think of dimensionality reduction: (1) as a method to represent features in a lower-dimensional space within the same tensor rank, or (2) as a method to move from a high to a low tensor rank.
Expressed differently, one type reduces the quantity of features, and the other the quantity of array axes.
Rank to Rank
• The standard form of dimensionality reduction we might use is PCA, to shrink a 2D array (matrix) from 100 features to 5 features by forming an efficient representation in a new 2D array (represented on the bottom right).
• The techniques for 3D and 4D arrays are not widely used; one is called Matrix Product State, and these allow you to go from 3D→3D and 4D→4D.
• Since the development of neural networks, a model known as an Autoencoder has become popular for learning an efficient lower-dimensional representation of the data within the same rank.
• Autoencoders are very flexible non-linear decomposition methods and can work with tensors of any rank, including rank two (2D), three (3D), and four (4D) tensors.
High to Low Rank
• You can move from a 4D tensor (tesseract) to a 3D tensor using Matrix Product State.
• You can move from a 3D array (tensor) to a 2D array (matrix) using the Tucker and CANDECOMP decompositions.
• We can also use PCA to go from a 2D array to a 1D array, by only choosing the first principal component.
D. Dimensionality Reduction (Autoencoder)
Convolutional Neural Networks (CNNs)
CNNs were originally used for pattern and object recognition on image data. The image data is converted into a 3D array of RGB (Red, Green, Blue) color channels. In the example on the bottom right, the 3 channels for 4 pixels are captured.
Stock Trading Via Images
Paper: arXiv:1907.10046
• It wasn't long until developers from finance realized that you could use the same format above to solve problems in finance.
• Tucker Balch (JP Morgan AI Research) created a large sample of financial time-series images encoded as candlestick (box-and-whisker) charts.
• They labeled the samples following three algebraically-defined binary trade strategies (Murphy, 1999).
We suggest that the transformation of a continuous numeric time-series classification problem to a vision problem is useful for recovering signals typical of technical analysis.
• The authors realized that there are various ways in which one can "visualize" stock market data.
• We can also add a lot of technical features and let the CNN find the appropriate direction that the technical indicator identified.
• Moreover, we can vary the image resolution (pixels) as a robustness test to see the variation in performance.
• Here the purpose was not to predict the direction of the stock, but instead to see if they can capture the labels of technical indicator methods.
E. Prediction (CNNs)
Forecasting
Sequence (order of data) matters in many domains, like reading and speech, where each word is seen in a context based on our understanding of the previous words.
The three big disadvantages of ARIMA models are that (1) you can only have one time series as an input, (2) you can only have one scalar value as an output per prediction, and (3) it can't model non-linear patterns because there are no interaction effects between lags.
• The non-linear equivalent of the ARIMA model is called a Recurrent Neural Network (RNN).
• Below is how you can convert a Feed-Forward (Multilayer Perceptron) Neural Network into a Recurrent Neural Network.
FFN versus RNN
An RNN has two desirable properties: (1) it can output an arbitrary number of samples, and (2) it incorporates previous prediction information.
The difference between feed forward models (left) and recurrent neural network models (right) is minimal.
• When an RNN model has to calculate the new hidden state it uses feature inputs like an FFN, but also the previous hidden state.
• The FFN (MLP) model is the standard NN that purely relies on the input without taking previous information into account for subsequent predictions.
Remember that the Multilayer Perceptron and Feed-Forward network models are synonyms for neural networks that have fully connected (dense) layers.
You can use FFN models for time-series (TS) prediction, but you can't use RNNs for cross-sectional (CS) prediction.
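The hidden-state update described above can be sketched as a single recurrent step in NumPy (the layer sizes, weight scales, and random sequence are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W_x = rng.normal(scale=0.1, size=(4, 3))  # feature inputs -> hidden
W_h = rng.normal(scale=0.1, size=(4, 4))  # previous hidden -> hidden
b = np.zeros(4)

def rnn_step(x_t, h_prev):
    # The new hidden state uses the feature inputs (like an FFN)
    # AND the previous hidden state (the recurrent part).
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

sequence = rng.normal(size=(6, 3))  # 6 time steps, 3 features each
h = np.zeros(4)                     # initial hidden state
for x_t in sequence:
    h = rnn_step(x_t, h)
```

Dropping the `W_h @ h_prev` term recovers an ordinary feed-forward layer applied to each step independently.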
F. Prediction (RNNs)
Variational Autoencoder
Recall that internal nodes (the circles $\circ$ in the middle) simply represent scalar values $z$, and the values represent the linearly weighted input values (inputs × weights, e.g., $x \times w$).
Tip: there are different ways in which to present the activated nodes (input $z$ with an activation function $g(z)$ applied over it).
For autoencoders these values $z$ (all ten internal $z$ circles) were obtained by the choice of weights $W$ that minimizes the final reconstruction error.
• However, we can do something more interesting: we can force several $z$ values to become parameters of a distribution $\mathcal{N}\left(\mu, \sigma^{2}\right)$ instead of simple unconstrained scalar values.
• The mean $\mu$ and variance $\sigma^{2}$ together form their own normal probability distribution that we can use to sample a new latent space from.
• Sampling from the distribution creates samples that can be decoded and turned into new outputs.
The purpose of the VAE model is not dimensionality reduction, but instead synthetic data generation. This means that at the end, we can throw the encoder away and generate new outputs by putting random (normally distributed) noise through the decoder.
Instead of passing the output of the encoder directly to the decoder, we use it to form the parameters mean $\mu$ and variance $\sigma^2$ of a normal distribution $\mathcal{N}\left(\mu, \sigma^{2}\right)$. The latent activation space is then generated by sampling from this distribution.
A single node would have a latent sample $z_i=\mu_i+\sigma_i\times\epsilon_{i}$ (note that the standard deviation $\sigma_i$, not the variance, scales the noise). The mean and variance in the equation define the shape of the distribution; the Gaussian errors $\epsilon_{1}, \epsilon_{2} \sim N(0,1)$ define the position of the sampled point (the tiny red dot) on the distribution.
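This sampling step is known as the reparameterization trick; a minimal NumPy sketch (the "encoder outputs" $\mu$ and log-variance below are made-up values standing in for a real encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these came out of the encoder for one input
mu = np.array([0.5, -1.0])      # means of the latent distribution
log_var = np.array([0.0, 0.5])  # log-variances, log(sigma^2)

def sample_latent(mu, log_var, rng):
    sigma = np.exp(0.5 * log_var)        # standard deviation from log-variance
    eps = rng.standard_normal(mu.shape)  # epsilon ~ N(0, 1)
    return mu + sigma * eps              # z = mu + sigma * epsilon

z = sample_latent(mu, log_var, rng)

# Repeated sampling traces out the latent distribution N(mu, sigma^2)
samples = np.array([sample_latent(mu, log_var, rng) for _ in range(10000)])
```

Because the randomness is isolated in $\epsilon$, gradients can flow through $\mu$ and $\sigma$ during training.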
G. Data Generation (VAEs)
Generative Models
Many (supervised) machine learning systems look at some kind of complicated input (say, an image) and produce a simple output (a label like "cat", "dog", "female", "male").
By contrast, the goal of a generative model is quite the opposite: it takes a small piece of input, perhaps a few random numbers, and produces a complex output, like an image of a realistic-looking face.
GANs are only one of many types of generative models; they are known to produce state-of-the-art data generation, including for images and video.
• Discriminative models (nets) like CNNs or MLPs attempt to learn the probability of $Y$, whereas generative models (nets) like VAEs and GANs attempt to learn a simulator of the data $X$.
• Probabilistically, we can say generative models attempt to model the full joint distribution $P(X,Y)$, whereas discriminative models want the posterior probability $P(Y \mid X)$.
So what does generative mean? Simply that you can sample from the model and that the distribution of samples approximates the distribution of true data points.
In a simple classification task (discriminative model), you might be provided with the foot size $X=243$, and the plot above shows how you could estimate $P(Y \mid X)$ → female.
Whereas in a data generating task (generative model), you would be provided with the sex $Y=\text{female}$.
With this you would sample fake data from the likelihood distribution $P(X \mid Y)$, and these samples would be indistinguishable from the true values.
Generative models can therefore do cool stuff like this:
$Y \rightarrow X \rightarrow Y \rightarrow X$
Training Objective
Given training data, you can generate new samples from the same distribution:
H. Data Generation (GANs)
Feature Generation and Deep Learning
Assignment on Colab
Submission:
May 9, 2023 11:59 PM
1. Please read all the instructions here and on the Colab notebook.
2. You can open the notebook and make your own copy.
3. Once you are done, you can download the ipynb file and upload it to Brightspace.
4. There will be no extensions for this assignment; please submit on time.
• The fourth project is the development of a notebook (code + explanation) that successfully engineers 12 unique types of features (3 for each type of feature engineering): transforming, interacting, mapping, and extracting.
• The second part of the assignment is the development of a deep learning classification model to predict the direction of the S&P500 for the dates 2018-01-01 to 2018-07-12 (test set).
• The feature engineering section is unrelated to the model section; you can develop any feature, not just features that would work for deep learning models (later on you can decide which features to use in your model).
• You also have to uncomment (remove #) all the example features and make them run successfully: every feature example has some error(s) that you have to fix. Please also describe the error you fixed!
• Note that we won't be attempting to measure the quality of every feature (i.e., how much it improves the model); that is slightly too advanced for this course.
Grading
A. New features:
I. Features & Deep Learning