ML 1: Reinforcement Learning, So Hot Right Now
Introduction to Reinforcement Learning
This page is a summary of the cool symbols and my layman's understanding of what they mean in the context of Reinforcement Learning. Praise be to Sergey and BHoag. A more complete, coherent glossary is given in the glossary section of this post; the rest remains on the more shit-posty end of the spectrum.
Glossary
symbol | meaning |
---|---|
$s$ | state |
$a$ | action |
$t$ | time step |
$r_t$ | reward that scales with the state, provided to the agent at each time step; this process of incrementally nudging the policy toward something better and better is repeated until the optimal policy is achieved by maximizing the accumulated reward |
$\mathcal{S}$, $p(s_0)$ | the set of states and the distribution of starting states |
$\mathcal{A}$ | the set of actions that can be taken in a given state |
$\mathcal{T}(s_{t+1} \mid s_t, a_t)$ | transition dynamics function, which maps a state-action pair at time $t$ to a distribution over states at time $t+1$, i.e. the probability of $s_{t+1}$ given $(s_t, a_t)$. We can't make any assumptions about future states beyond this: it's just a function that determines the probability of the next state |
$\mathcal{R}(s_t, a_t, s_{t+1})$ | immediate reward function |
$\gamma \in [0, 1]$ | discount factor, where a lower $\gamma$ places emphasis on the immediacy of the reward |
$\pi$ | policy: a mapping from the set of states to a probability distribution over the set of actions |
$T$ | the length of an episode. A Markov Decision Process is said to be "episodic" if, after length $T$, the state resets: if $T = \infty$, the MDP is not episodic; if $T < \infty$, the MDP is episodic |
$R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$ | the return: the discounted, accumulated reward. Maximizing its expectation brings us closer to the goal of Reinforcement Learning: $\pi^* = \operatorname{argmax}_\pi \mathbb{E}[R \mid \pi]$ (a toy return calculation follows this table) |
$\pi^*$ | the optimal policy/control strategy given the transition probabilities and rewards |
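To make the return row concrete, here's a minimal sketch of the discounted sum $R = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$. The reward values and $\gamma$ below are made-up illustration numbers, not anything from the glossary.

```python
# Minimal sketch: the return R = sum_t gamma^t * r_{t+1} for one episode.
# The rewards and gamma below are made-up illustration values.

def discounted_return(rewards, gamma=0.99):
    """Accumulate gamma^t * r over one episode's reward sequence."""
    total = 0.0
    for t, r in enumerate(rewards):
        total += (gamma ** t) * r
    return total

episode_rewards = [0.0, 0.0, 1.0, -0.5, 2.0]  # hypothetical rollout
print(discounted_return(episode_rewards))      # ~2.42 with gamma = 0.99
```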
Problem Solving in Reinforcement Learning: Evaluative Functions
function | purpose |
---|---|
$V^\pi(s) = \mathbb{E}[R \mid s, \pi]$ | expected return, given the state $s$ and policy $\pi$ |
$V^*(s) = \max_\pi V^\pi(s) \ \forall s \in \mathcal{S}$ | the value of the optimal policy for all states in the set of states. sidebar: saying that something holds true for all states in the set of all possible states is really stupid and redundant for a generalized solution like RL. Why not just assume $\forall s \in \mathcal{S}$ whenever you mention $s$? |
勿 | the set of all possible sets, policies, models, gradients, loss functions, backup operations, transition dynamics, actions, Turing Machines, and symbols. To be used: |
$Q^\pi(s, a) = \mathbb{E}[R \mid s, a, \pi]$ | quality function, which is like $V^\pi$, except the initial action $a$ is provided |
$Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}[r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1}))]$ | the expected return bootstrapped from the next state: the anticipated reward plus the discount factor times the quality of the next state-action pair, given the next state and the action the policy takes there |
$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha \delta$ | the value of Q is updated according to the learning rate times the Temporal Difference error (see the sketch after this table) |
$\alpha$ | the learning rate |
$\delta$ | the Temporal Difference error |
$Y = r_t + \gamma \max_a Q(s_{t+1}, a)$ | the target of standard regression, maximized with respect to the action $a$ |
$Q^*(s, a)$ | the optimal Quality Function, approximated by regressing $Q$ toward the target $Y$ |
Monte Carlo methods | a policy is rolled out and followed for some number of episodes, after which the mean return of the policy is reported, i.e. looking into the possible future based on the current state and a given policy, and choosing the best policy based on the mean |
$\text{TD}(\lambda)$ | a combination of Temporal Difference bootstrapping and Monte Carlo methods, where $\lambda \in [0, 1]$ is an interpolation factor between TD and MC |
$A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$ | the Advantage Function, which produces relative state-action values instead of absolute state-action values like $Q^\pi$. This is useful because it's easier to make comparative evaluations than exact return calculations, e.g. we can tell that $Q^\pi(s, a_1) > Q^\pi(s, a_2)$ without knowing whether $Q^\pi(s, a_1)$ or $Q^\pi(s, a_2)$ are actually any good. The Advantage Function asks/answers the question: "How much better is this action compared to the average action taken in this state?" |
Generalized Policy Improvement | using value estimates of the current policy to derive a better policy; alternating this with policy evaluation drives the policy toward $\pi^*$ |
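For the $Q \leftarrow Q + \alpha\delta$ row above, here's a minimal tabular sketch of one Q-learning step. The dict-based Q-table, the 4-action space, and the example transition are assumptions for illustration only.

```python
from collections import defaultdict

# Minimal tabular Q-learning step: Q(s,a) <- Q(s,a) + alpha * delta,
# where delta = Y - Q(s,a) and Y = r + gamma * max_a' Q(s', a').
# The dict-based Q-table and 4-action space are illustration-only assumptions.

Q = defaultdict(float)  # unseen (state, action) pairs default to 0.0

def q_learning_step(state, action, reward, next_state, done,
                    alpha=0.1, gamma=0.99, n_actions=4):
    # Y: the regression target -- reward plus discounted best next-state value
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in range(n_actions))
    target = reward + gamma * best_next
    td_error = target - Q[(state, action)]   # delta, the Temporal Difference error
    Q[(state, action)] += alpha * td_error   # learning rate times TD error
    return td_error

# Made-up transition: in state 0, took action 2, got reward 1.0, landed in state 1
print(q_learning_step(state=0, action=2, reward=1.0, next_state=1, done=False))
```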
Problem Solving in Reinforcement Learning: Policy Search
function | purpose |
---|---|
$\pi_\theta$ | means of policy search: a parametrized policy, typically chosen so that its parameters $\theta$ maximize the expected return $\mathbb{E}[R \mid \theta]$ using either gradient-based or gradient-free optimization |
gradient-based optimization | applicable to Deep Reinforcement Learning scenarios characterized by high-dimensional parameter spaces (e.g. neural network policies) |
gradient-free optimization | good for low-dimensional problems |
$\nabla_\theta \mathbb{E}_X[f(X; \theta)] = \mathbb{E}_X[f(X; \theta) \nabla_\theta \log p(X)]$ | REINFORCE rule: a gradient estimator (aka "score function", "likelihood-ratio estimator") that computes the gradient of the expectation over a function $f$ of a random variable $X$ with respect to parameters $\theta$ (a toy sketch follows this table) |
whether the REINFORCE rule applies in RL | 勿 hint: it does |
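Here's the toy sketch promised in the REINFORCE row: a score-function gradient estimate for a tiny tabular softmax policy. The 3-state / 2-action sizes, the policy form, and the sample rollout are all assumptions for illustration, not the setup from any particular paper.

```python
import numpy as np

# Toy REINFORCE sketch: estimate grad_theta E[R] as R * grad_theta log pi_theta(a|s),
# summed over a sampled trajectory. Sizes and policy form are illustration-only.

n_states, n_actions = 3, 2
theta = np.zeros((n_states, n_actions))  # policy parameters

def policy_probs(state):
    """Softmax over the action preferences theta[state]."""
    prefs = theta[state]
    exps = np.exp(prefs - prefs.max())
    return exps / exps.sum()

def grad_log_pi(state, action):
    """grad_theta log pi(a|s) for a tabular softmax policy."""
    grad = np.zeros_like(theta)
    grad[state] = -policy_probs(state)
    grad[state, action] += 1.0
    return grad

def reinforce_update(trajectory, episode_return, lr=0.01):
    """trajectory: list of (state, action) pairs; episode_return: the rollout's R."""
    global theta
    for state, action in trajectory:
        theta += lr * episode_return * grad_log_pi(state, action)

# Made-up rollout: three visited (state, action) pairs and a return of 1.5
reinforce_update([(0, 1), (2, 0), (1, 1)], episode_return=1.5)
```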
A Brief Note on Gradients
Closing Big Brain Ideas
term | definition |
---|---|
The Markov Property | only the current state affects the next state |
Actor-Critic Methods | use the value function $V^\pi$ as a baseline for policy gradients |
Reinforcement Learning | focuses on learning without knowing the underlying model of the environment; agents can, however, learn from interacting with it |
Convolutional Neural Networks | commonly used as components of RL agents, allowing them to learn directly from raw, high-dimensional visual inputs |
The Goal of RL | to train Deep Neural Networks to approximate the optimal policy $\pi^*$ and/or the optimal value functions $V^*$, $Q^*$, and $A^*$ |
Experience Replay Memory | stores transitions $(s_t, a_t, r_{t+1}, s_{t+1})$ in a cyclic buffer, enabling RL agents to sample from and train on previously observed data (batches) offline (a minimal buffer sketch follows this table) |
Target Network | correctly-weighted, but frozen, neural nets. The "active" policy network pulls TD error values from the cached and comparatively stable target network, rather than having to calculate the TD error based on its own rapidly fluctuating estimates of Q-values |
Hard Attention | using RL to make discrete stochastic decisions over inputs via back propagation and re-parameterization |
Re-parameterization | allows neural networks to be treated as stochastic computational graphs -- a key concept in algorithms involving Stochastic Value Gradients |
Dynamic Programming | using current value estimates to improve subsequent estimates via "bootstrapping" |
Monte Carlo Methods | estimating the expected return from a state by averaging returns from multiple rollouts of a policy. Used to find optimal trajectories. Limited to episodic Markov Decision Processes |
Policy Search | doesn't maintain a value-function model; instead, this approach directly searches for an optimal policy |
Actor-Critic Model | the Actor (policy) receives a state from the environment and chooses an action to perform. At the same time, the Critic (the value function) receives the state and reward resulting from the previous interaction. The Critic uses the TD error calculated from this information to update itself and the Actor. It's trivial, really. |
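And the replay-memory sketch mentioned in the Experience Replay row: a cyclic buffer of transitions that an agent can sample mini-batches from offline. The capacity and batch size are arbitrary illustration values.

```python
import random
from collections import deque

# Minimal experience replay sketch: a cyclic buffer of (s, a, r, s', done)
# transitions. Capacity and batch size are arbitrary illustration values.

class ReplayMemory:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # uniformly sample a mini-batch of previously observed transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```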