Reinforcement learning has a very dense vocabulary and we will probably explore 30 new terms in this lecture!
🇷🇱 Learns (3) Sequential (2) Optimal Decisions (1) under Uncertainty; let’s start backwards and work our way forward.
(1) means that the problems under consideration involve random variables that evolve over time; the technical term for this is Stochastic Processes.
Random variables that evolve over time can be found in nature (e.g., weather) and in business (e.g., customer demand or stock prices).
(2) refers to using optimization methods to maximize a well-defined goal.
(3) refers to the fact that as we move forward in time and the random variables evolve, the optimal decisions have to be adjusted to the "changing circumstances."
Putting this all together, RL seeks to fight against temporal uncertainty by persistently steering an agent towards the goal.
This brings us to the term Control (in reference to persistent steering), which gave us the term Stochastic Control, an older branch of sequential decision making.
Markov Decision Process
All problems in Reinforcement Learning are theoretically modeled as maximizing the return in a Markov Decision Process, or simply, an MDP. An MDP is characterized by 4 things:
S: The set of states that the agent experiences when interacting with the environment. The states are assumed to have the Markov property.
A: The set of legitimate actions that the agent can execute in the environment.
P: The transition probability function, specifying the probability that the environment will transition to state s′ if the agent takes action a in state s.
R: The reward function, returning the reward received for taking action a in state s.
A Markov decision process is often denoted as M=⟨S,A,P,R⟩ (sometimes T is used for the transition function instead of P). Let us now look at these components in a bit more detail.
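To make the tuple concrete, here is a minimal sketch of a toy MDP in Python; the two-state "market" environment, its transition probabilities, and its rewards are made-up illustrative numbers, not part of the lecture material.

```python
# A toy MDP M = <S, A, P, R> with hypothetical, made-up numbers.
# States: "bull" and "bear"; actions: "hold" and "trade".
S = ["bull", "bear"]
A = ["hold", "trade"]

# P[s][a] maps each next state s' to Pr(s' | s, a).
P = {
    "bull": {"hold": {"bull": 0.9, "bear": 0.1},
             "trade": {"bull": 0.7, "bear": 0.3}},
    "bear": {"hold": {"bull": 0.2, "bear": 0.8},
             "trade": {"bull": 0.5, "bear": 0.5}},
}

# R[s][a] is the expected reward for taking action a in state s.
R = {
    "bull": {"hold": 1.0, "trade": 1.5},
    "bear": {"hold": -0.5, "trade": 0.2},
}
```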
Baby intuition: it's worth pointing out that the MDP framework was inspired by how babies (Agents) learn to perform tasks (i.e., take Actions) in response to the random activities and events (States and Rewards) they observe being served up by the world (Environment) around them.
The state at time t is denoted St∈S, where S is referred to as the state space; S describes the legitimate values the state can take.
The set of states, S, can be either discrete or continuous in value.
An example of a finite discrete state space is the state we used for the Atari game, i.e. the past 4 frames.
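As a rough illustration of such a state, the sketch below stacks the last 4 grayscale frames into a single observation; the 84x84 frame size is a common Atari preprocessing choice assumed here purely for illustration.

```python
import numpy as np
from collections import deque

# Keep only the 4 most recent 84x84 grayscale frames (assumed sizes).
frames = deque(maxlen=4)
for _ in range(4):
    frames.append(np.zeros((84, 84), dtype=np.uint8))  # placeholder frames

state = np.stack(frames, axis=0)   # shape (4, 84, 84): the agent's state
print(state.shape)
```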
B. Markov Decision Process
Difficulty in solving MDPs
There are three reasons why solving an MDP is fundamentally a hard problem.
First, the State Space or Action Space may simply be very large (involving many variables) and hence computationally intractable.
Second, actions have delayed rewards, and there is no direct feedback on what the "correct" Action is for a given State.
Third, in reality the agent doesn't know the model of the environment, i.e., the probability of state transitions given by the function P we defined before.
Recall that our ultimate goal is to find a policy π that maximizes the expected future (discounted) reward.
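In standard notation, this objective is the expected discounted return (the discount factor γ is not introduced elsewhere in this section, so the usual RL convention 0 ≤ γ < 1 is assumed here):

$$
G_t = \sum_{k=0}^{\infty} \gamma^{k}\, R_{t+k+1},
\qquad
\pi^{*} = \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\, G_t \,\right]
$$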
If we know all the elements of an MDP (S,A,P,R), we can plan out the solution without ever actually taking an action in the environment.
In AI we use the word Planning to describe computing a solution to a decision-making problem without actually acting in the environment (no additional learning takes place).
The most common planning algorithm is Dynamic Programming, which takes advantage of full knowledge of the transition probabilities P.
It leverages the Bellman equation to efficiently reason about future probabilistic outcomes and perform the requisite optimization to infer the optimal policy π.
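As a concrete instance of planning with dynamic programming, here is a minimal value-iteration sketch that applies the Bellman optimality backup to the toy S, A, P, R dictionaries defined earlier (it assumes those are in scope); the discount factor and tolerance are arbitrary illustrative choices.

```python
# Value iteration on the toy M = <S, A, P, R> sketched above.
gamma, tol = 0.95, 1e-6
V = {s: 0.0 for s in S}

while True:
    delta = 0.0
    for s in S:
        # Bellman optimality backup: best one-step lookahead value.
        q = {a: R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S) for a in A}
        best = max(q.values())
        delta = max(delta, abs(best - V[s]))
        V[s] = best
    if delta < tol:
        break

# Greedy (optimal) policy read off from the converged values.
pi = {s: max(A, key=lambda a: R[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in S))
      for s in S}
print(V, pi)
```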
C. Reinforcement Learning
Reinforcement Learning for Trading
SL and RL algorithms indirectly pick up on well-known trading strategies without having to predefine and identify them.
For example, a gradient-step parameter update that leads the machine agent to buy more of what did best yesterday is indirectly creating a momentum trading strategy.
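As a loose, purely illustrative sketch (not the lecture's actual model), "buy more of what did best yesterday" can be expressed as a crude cross-sectional momentum allocation over made-up returns:

```python
import numpy as np

# Hypothetical yesterday's returns for 4 assets (made-up numbers).
yesterday = np.array([0.012, -0.004, 0.025, 0.001])

# Allocate weight in proportion to positive past return: a crude momentum rule.
raw = np.clip(yesterday, 0.0, None)
weights = raw / raw.sum() if raw.sum() > 0 else np.ones_like(raw) / len(raw)
print(weights)  # larger weight on the assets that did best yesterday
```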
Many professional Go players now train with an RL AI to identify new moves that humans have never played before, so we can also use AI to uncover new trading strategies.
“But it also contained several moves made by both man and machine that were outlandish, brilliant, creative, foolish, and even beautiful.”
(A.1) Traditional Trading Process
(A.2) Rule-based (heuristic programming/expert systems)
Steven Cohen at Point72 is also testing models that mimic the trades of its portfolio managers, and Paul Tudor Jones has also assigned a team of coders to create "Paul in a Box".
D. Trading Strategies
Trading Showcase RL
Reinforcement Learning for Trading
It is also easy to see how trading slides naturally into a reinforcement learning framework; the two fields even use the same vocabulary.
Just like in RL, in trading return maximization is the goal, and we can define the reward function as the change in portfolio value over time.
In trading we are also constantly weighing whether we should 'tweak' our existing strategies or search for new alpha; this is similar to the exploration-exploitation dilemma in RL.
By taking many complex financial factors into account, DRL trading agents build a multi-factor model and provide algorithmic trading strategies, which are difficult for human traders to execute and understand.
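As a minimal sketch of the reward just described (the change in portfolio value between steps), here is a toy gym-style step function; the price series, action encoding, and class name are illustrative assumptions, not a reference implementation.

```python
import numpy as np

class TradingEnvSketch:
    """Toy illustration: reward = change in portfolio value between steps."""

    def __init__(self, prices):
        self.prices = np.asarray(prices, dtype=float)
        self.t = 0
        self.cash, self.shares = 1_000.0, 0.0

    def _value(self):
        return self.cash + self.shares * self.prices[self.t]

    def step(self, action):  # action: -1 = sell all, 0 = hold, +1 = buy all-in
        v_before = self._value()
        price = self.prices[self.t]
        if action == 1 and self.cash > 0:
            self.shares += self.cash / price
            self.cash = 0.0
        elif action == -1 and self.shares > 0:
            self.cash += self.shares * price
            self.shares = 0.0
        self.t += 1
        reward = self._value() - v_before      # change in portfolio value
        done = self.t >= len(self.prices) - 1
        return self.t, reward, done

env = TradingEnvSketch([100, 101, 99, 103, 102])
print(env.step(1))  # buy, then observe the reward from the price move
```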
The Training Process
The right-hand side of the graphic shows the process of developing a reinforcement learning strategy.
Reinforcement Learning versus Supervised Learning
E. Trading Showcase
By now you should know that there are five elements to any sequential decision problem. Before you start developing a mathematical model, you need to identify three of these (no analytics needed):
Metrics - What determines the performance of your system?
Decisions - To improve the performance of any system, you have to make better decisions, but what types of decisions are being made (and who makes them)?
Uncertainties - What aspects of the system are you uncertain about, and what types of information will arrive after you make a decision?
Widely overlooked in the stochastic optimization community is the power of the methods that are most widely used in practice:
PFAs (policy function approximations) - some sort of parameterized rule (e.g., order-up-to, buy low, sell high).
CFAs (cost function approximations) - some form of parameterized optimization problem.
Deterministic DLAs (direct lookahead approximations) - such as Google Maps. Note that deterministic lookaheads can be useful in some settings, but you can almost always make them even better by introducing tunable parameters.
In effect, all of these involve some form of tunable parameter. Tuning can be hard. Normally we prefer to do this in a simulator, but simulators can be hard to build. Without a simulator, most of the time we are using the intuition of someone who understands the problem.
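As a minimal illustration of tuning such parameters in a simulator, the sketch below grid-searches the two thresholds of a "buy low, sell high" rule on a simulated mean-reverting price path; the simulator, thresholds, and numbers are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_prices(n=500, mu=100.0, kappa=0.1, sigma=2.0):
    """Simple mean-reverting price simulator (illustrative, not calibrated)."""
    p = np.empty(n)
    p[0] = mu
    for t in range(1, n):
        p[t] = p[t - 1] + kappa * (mu - p[t - 1]) + sigma * rng.standard_normal()
    return p

def profit(prices, buy_below, sell_above):
    """P&L of a 'buy low, sell high' policy with two tunable thresholds."""
    cash, shares = 0.0, 0
    for price in prices:
        if shares == 0 and price < buy_below:
            shares, cash = 1, cash - price
        elif shares == 1 and price > sell_above:
            shares, cash = 0, cash + price
    return cash + shares * prices[-1]

prices = simulate_prices()
# Crude grid search over the tunable parameters.
grid = [(b, s) for b in range(95, 100) for s in range(101, 106)]
best = max(grid, key=lambda theta: profit(prices, *theta))
print("best (buy_below, sell_above):", best)
```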
I first highlighted the importance of modeling uncertainty as the foundation for developing flexible supply chains that can respond to uncertainty. Then, after listing the five dimensions of every sequential decision problem, I focused on the first four:
What decisions are being made (and who makes them).
Sources of uncertainty.
The information needed to make decisions (the state variable).