Kristin's Blog


Published on January 17, 2025 by Kristin Wei

This blog summarizes the basic concepts of reinforcement learning based on Sutton’s book.

Reinforcement Learning

Reinforcement learning (RL) is an interdisciplinary area of machine learning and optimal control concerned with how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. Reinforcement learning is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning. [From Wikipedia]

Simply put, an agent (a robot, say) interacts with an environment (the world) and learns to make decisions that maximize its long-term goal.
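As a rough sketch in code, the interaction loop looks like this (the `env` object with `reset()` and `step(action)` methods and the `policy` function are assumptions for illustration, not part of the post):

```python
def run_episode(env, policy):
    """Run one episode of the agent-environment interaction loop."""
    state = env.reset()                          # the environment provides a starting state
    total_reward = 0.0
    done = False
    while not done:
        action = policy(state)                   # the agent decides what to do
        state, reward, done = env.step(action)   # the environment responds with reward and next state
        total_reward += reward                   # the goal: maximize the accumulated reward
    return total_reward
```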

There are some famous algorithms in RL, such as Q-learning.

Deep reinforcement learning (DRL) will be covered in another post.

Fundamentals

Basic Concepts

Reinforcement learning is built on a handful of quantities: the state value V(s), the action value Q(s,a), the policy \pi(a|s), the reward R, and the return G.
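For reference, following Sutton's notation, the return is the discounted sum of future rewards, and the value functions are its expectations under the policy:

\begin{aligned}
G_t&=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots=\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}\\
v_{\pi}(s)&=E_{\pi}[G_t|S_t=s]\\
q_{\pi}(s,a)&=E_{\pi}[G_t|S_t=s,A_t=a]
\end{aligned}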

An Example: Hero in a game

A hero in a game collects coins (rewards) along a path in a 2D grid map to gain experience.

[Image: a hero in a 2D map] [1]

The FAMOUS Bellman Equation

The Bellman equation basically connects v_{\pi}(s) with v_{\pi}(s'), or q_{\pi}(s,a) with q_{\pi}(s',a'):

[Image: backup diagram] [2]

\begin{aligned}
v_{\pi}(s)&=E_{\pi}[G_t|S_t=s]\\
&=E_{\pi}[R_{t+1}+\gamma G_{t+1}|S_t=s]\\
&=\sum_{a\in A}\pi(a|s)\sum_{s'\in S}p(s',r|s,a)\,E_{\pi}[R_{t+1}+\gamma G_{t+1}|S_{t+1}=s']\\
&=\sum_{a\in A}\pi(a|s)\sum_{s'\in S}p(s',r|s,a)\,[r+\gamma E_{\pi}[G_{t+1}|S_{t+1}=s']]\\
&=\sum_{a\in A}\pi(a|s)\sum_{s'\in S}p(s',r|s,a)\,[r+\gamma v_{\pi}(s')]
\end{aligned}

Similarly to the above, E(G) becomes E_{\pi}(G_t|S_t=s,A_t=a) for q_{\pi}(s,a), and the Bellman equation becomes:

\begin{aligned}
q_{\pi}(s,a)&=E_{\pi}[G_t|S_t=s,A_t=a]\\
&=E_{\pi}[R_{t+1}+\gamma G_{t+1}|S_t=s,A_t=a]\\
&=\sum_{s'\in S}p(s',r|s,a)\,E_{\pi}[R_{t+1}+\gamma G_{t+1}|S_{t+1}=s']\\
&=\sum_{s'\in S}p(s',r|s,a)\sum_{a'\in A}\pi(a'|s')\,E_{\pi}[R_{t+1}+\gamma G_{t+1}|S_{t+1}=s',A_{t+1}=a']\\
&=\sum_{s'\in S}p(s',r|s,a)\sum_{a'\in A}\pi(a'|s')\,[r+\gamma E_{\pi}[G_{t+1}|S_{t+1}=s',A_{t+1}=a']]\\
&=\sum_{s'\in S}p(s',r|s,a)\sum_{a'\in A}\pi(a'|s')\,[r+\gamma q_{\pi}(s',a')]
\end{aligned}
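To make the expectation form concrete, here is a minimal iterative policy evaluation sketch (not from the post): the model `P[s][a]`, a list of `(prob, next_state, reward, done)` tuples, and the `policy[s]` probability vector over actions are assumptions for illustration.

```python
import numpy as np

def policy_evaluation(P, policy, n_states, gamma=0.9, tol=1e-6):
    """Sweep the Bellman expectation equation until V stops changing."""
    V = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_pi(s) = sum_a pi(a|s) sum_{s'} p(s',r|s,a) [r + gamma * v_pi(s')]
            v_new = sum(
                pi_a * sum(prob * (r + gamma * V[s2] * (not done))
                           for prob, s2, r, done in P[s][a])
                for a, pi_a in enumerate(policy[s])
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V
```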

Choose Path based on Bellman Equation

When the hero stands at state s, it sees all the v_{\pi}(s'), but only one step will be chosen in reality, which means \pi(a|s)=1 for that action a. This decision makes v_{\pi}(s) as large as possible; the policy is updated accordingly, and v_*(s) is defined as:

\begin{aligned}
v_*(s)&=\max_a \pi(a|s)\sum_{s'\in S}p(s',r|s,a)[r+\gamma v_{\pi}(s')]\\
&=\sum_{s'\in S}p(s',r|s,a_{\max})[r+\gamma v_{\pi}(s')]\\
&=q_{\pi}(s,a_{\max})
\end{aligned}

p(s',r|s,a) is of course not controlled by the hero; thus the policy's only option at the next step is to choose, at s', the action a'_{\max} for which q(s',a') is largest over all a' \in A. Using the same logic,

\begin{aligned}
q_*(s,a)&=\sum_{s'\in S}p(s',r|s,a)[r+\gamma q(s',a'_{\max})]
\end{aligned}
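Under the same hypothetical model `P[s][a]` as above, the optimality backup turns into a short value iteration sketch (again an assumption-laden illustration, not the post's own code):

```python
import numpy as np

def value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-6):
    """Apply v(s) = max_a sum_{s'} p(s',r|s,a)[r + gamma v(s')] until convergence."""
    V = np.zeros(n_states)
    while True:
        # q(s,a) for every state-action pair under the current V
        q = np.array([[sum(prob * (r + gamma * V[s2] * (not done))
                           for prob, s2, r, done in P[s][a])
                       for a in range(n_actions)]
                      for s in range(n_states)])
        V_new = q.max(axis=1)                   # v(s) = q(s, a_max)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, q.argmax(axis=1)      # greedy policy: pick a_max in every state
        V = V_new
```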

v(s) vs q(s,a)

q_{\pi}(s,a) seems to have chosen the a without a policy. But thinking more deeply, the policy \pi(a|s) still controls the choice when the hero finally acts in the environment. The \pi subscript on v(s) and q(s,a) just indicates that the policy is updated according to v(s) and q(s,a).

No matter which one is used in the policy update, what really matters is the next state s', i.e. v(s') or \sum\limits_{s'\in S} p(s',r|s,a)[r+\gamma v(s')], since again p(s',r|s,a) is not controllable.

Once the next step is determined, the action a at the current state s is also fixed; q(s,a) is simply more directly tied to that next state.

v(s) chooses the path by comparing the candidate v(s') values, while q(s,a) indicates the path by comparing its companions q(s,a_1), q(s,a_2), q(s,a_3), \dots

RL Methods

Monte Carlo, temporal-difference and Q-learning are all model-free methods, which means the transition probabilities between states are unknown. The optimal policy above is used in dynamic programming, since p(s',r|s,a) is known; that is also why DP is used in model-based environments. In a model-free environment, the value is estimated by exploring and updating, and MC, TD and Q-learning differ only in these two processes.

Monte Carlo

The basic idea of Monte Carlo is to estimate the value by:

V(s)=\frac{G}{N}

or, in step-by-step update form:

V(s)\leftarrow V(s)+\frac{G-V(s)}{N}

Starting from N=1, the first Monte Carlo estimate is simply V(s)=G.
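A minimal first-visit Monte Carlo sketch of this averaging (the `episodes` input, a list of episodes given as `(state, reward)` pairs, is an assumed format for illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V(s) as the running average of first-visit returns G."""
    V = defaultdict(float)
    N = defaultdict(int)
    for episode in episodes:                    # episode: [(state, reward), ...]
        G = 0.0
        returns = []
        for state, reward in reversed(episode):
            G = reward + gamma * G              # discounted return, computed back from the terminal
            returns.append((state, G))
        seen = set()
        for state, G in reversed(returns):      # chronological order, keep the first visit only
            if state not in seen:
                seen.add(state)
                N[state] += 1
                V[state] += (G - V[state]) / N[state]   # V(s) <- V(s) + (G - V(s)) / N
    return V
```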

With this setting, Monte Carlo performs best with full exploration, which means \epsilon=1 for on-policy MC control with an \epsilon-soft policy, and it must run enough episodes, which is very slow!

Using this idea, of course, exploring starts combined with an argmax policy update will fail in most environments.

Nevertheless, G is derived from a trajectory \left[s_0,a_0,r_1,s_1,a_1,r_2,...,s_{T-1},a_{T-1},r_T\right] and computed backwards via G_{s_{t-1}}=r_t+\gamma G_{s_t}, with the terminal value G_{s_T}=0. Without reaching a terminal state there is no value update and no policy update; however, a random walk cannot guarantee reaching the terminal.

α-constant Monte Carlo

The \alpha-constant Monte Carlo method updates the value by:

\begin{aligned}
V(s)&=V(s)+\alpha \left[ G-V(s) \right]\\
&=V(s)+\frac{G-V(s)}{\frac{1}{\alpha}}\\
&=V(s)(1-\alpha)+\alpha G
\end{aligned}

The \alpha-constant update always retains part of the original V(s): [3]

\begin{aligned}
V_{ep+1}(s)&=\left[V_{ep-1}(s)(1-\alpha)+\alpha G_{ep-1}\right](1-\alpha)+\alpha G_{ep}\\
&=V_{ep-1}(s)(1-\alpha)^2+\alpha(1-\alpha)G_{ep-1}+\alpha G_{ep}\\
&=V_1(s)(1-\alpha)^{ep}+\sum_{i=1}^{ep}\alpha(1-\alpha)^{ep-i}G_i
\end{aligned}

For \alpha<1, as ep\rightarrow\infty, V depends more and more on the returns G, and especially on the recent G.
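A tiny numeric check of this recency weighting (purely illustrative numbers):

```python
def constant_alpha_mc(returns, alpha=0.1, v0=0.0):
    """V <- V + alpha (G - V), i.e. V <- (1 - alpha) V + alpha G."""
    v = v0
    for G in returns:
        v += alpha * (G - v)
    return v

# 50 episodes with return 0 followed by 10 episodes with return 1:
# the estimate is already about 0.65, dominated by the recent returns.
print(constant_alpha_mc([0.0] * 50 + [1.0] * 10))
```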

What's more, each update moves V(s) towards the actual value, no matter whether it is updated by the Monte Carlo average, TD or Q-learning, so we can partly trust the new V(s).

Temporal Difference

TD is a bootstrapping method, whose update is largely determined by the old value estimates.

V_{ep+1}(s)=V_{ep}(s)+\alpha[R+\gamma V_{ep}(s')-V_{ep}(s)]

Compared with the \alpha-constant Monte Carlo update V_{ep+1}(s)=V_{ep}(s)+\alpha[R_{ep}+\gamma G_{ep-1}-V_{ep}(s)], \alpha is the step size and also determines how much V(s) is updated. Once V(s') is estimated close to its real value, V(s) is moved one step closer to the real V(s). Digging to the end, the terminal value is V(s_T)=0, so the V(s_{T-1}) estimates are all updated exactly one step closer to their real values, unlike Monte Carlo, which always needs a trajectory to end before updating the value.

For TD, the update does not have to wait for the trajectory to reach the terminal. The first run to the terminal only produces a meaningful update of V(s_{T-1}), the next run of V(s_{T-2}), and so on…

On one side, V(s) is genuinely updated along the way to the terminal; along this chosen path the value is updated faster, since the agent prefers to follow it under an \epsilon-greedy policy. On the other side, with random exploration the agent searches for a better way to the terminal; once found, the new path is compared with the old one, and V(s) determines the optimal path.
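A minimal TD(0) prediction sketch (the episodic `env` with `reset()`/`step(a)` returning `(next_state, reward, done)` and the `policy(state)` function are assumptions, as before):

```python
import numpy as np

def td0_prediction(env, policy, n_states, alpha=0.1, gamma=0.9, n_episodes=1000):
    """Bootstrap V(s) from the current estimate of V(s') after every single step."""
    V = np.zeros(n_states)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + gamma * V[s_next] * (not done)   # V(terminal) = 0
            V[s] += alpha * (target - V[s])               # V(s) <- V(s) + alpha [R + gamma V(s') - V(s)]
            s = s_next
    return V
```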

If we use Q(s,a) in TD, the algorithm is the famous SARSA.

Q_{ep+1}(s,a)=Q_{ep}(s,a)+\alpha\left[R+\gamma Q_{ep}(s',a')-Q_{ep}(s,a)\right]

Similarly, Q(s,a) is updated starting from Q(s_T,a_T) once the agent reaches the terminal.
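A SARSA sketch with \epsilon-greedy action selection, under the same assumed `env` interface:

```python
import numpy as np

def sarsa(env, n_states, n_actions, alpha=0.1, gamma=0.9, eps=0.1, n_episodes=1000):
    """On-policy TD control: bootstrap Q(s,a) from the action a' actually taken next."""
    Q = np.zeros((n_states, n_actions))

    def eps_greedy(s):
        return np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())

    for _ in range(n_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = r + gamma * Q[s_next, a_next] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])   # Q <- Q + alpha [R + gamma Q(s',a') - Q]
            s, a = s_next, a_next
    return Q
```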

Q-learning

While the agent is still walking randomly in the environment without arriving at the terminal, the updated values are equivalent to the randomly initialized Q(s,a). As in TD, the meaningful values start from Q(s_T,a_T); the difference is that, thanks to the continued exploration, we can safely choose the best way quickly. This means we can determine the best step from state s by looking at the Q(s',a') values, and Q(s,a) in turn gives the previous state s-1 a signal that s is the best among its company:

Q_{ep+1}(s,a)=Q_{ep}(s,a)+\alpha \left[R+\gamma Q_{ep}(s',a'_{\max})-Q_{ep}(s,a)\right]

Gradually, from the starting state, the agent finds the fastest path by following the biggest Q(s,a) at each state.
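The corresponding Q-learning sketch differs only in the bootstrap target, which uses the greedy a'_{\max} rather than the action actually taken next (same assumed `env` interface):

```python
import numpy as np

def q_learning(env, n_states, n_actions, alpha=0.1, gamma=0.9, eps=0.1, n_episodes=1000):
    """Off-policy TD control: bootstrap from max_a' Q(s',a') regardless of the action taken."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # behave epsilon-greedily, but learn from the greedy target
            a = np.random.randint(n_actions) if np.random.rand() < eps else int(Q[s].argmax())
            s_next, r, done = env.step(a)
            target = r + gamma * Q[s_next].max() * (not done)
            Q[s, a] += alpha * (target - Q[s, a])   # Q <- Q + alpha [R + gamma Q(s',a'_max) - Q]
            s = s_next
    return Q
```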

Thoughts

Arbitrary Initial Q or V

Even if Q(s,a) or V(s) starts with a positive value, through updating, a negative Q(s',a') or V(s') will contribute part of it; at the very least, a negative R will definitely pull it down. After that, a merely positive initial Q(s,a) or V(s) cannot compete with a Q(s,a) that is derived from the positive value given by the terminal.

Where does p(s',r|s,a) go?

When we have the model, p(s',r|s,a) helps us compare the V(s) values by avoiding low-value states and passing more often through high-value ones, or by directly collecting more reward. In the model-free case there is no p(s',r|s,a) on offer. But whether we use p(s',r|s,a), Q(s,a) or V(s), the goal is just to find the best way. With enough exploration the value already reflects the probability of getting the best reward, so there is no need to specify p(s',r|s,a) in a model-free environment.

p(s',r|s,a) is of course not controlled by the hero.

Epsilon Greedy vs Epsilon Soft

In reinforcement learning, we cannot run infinitely many times to update the whole Q-value or V-value table; efficient update choices must be made.

Generally speaking, whichever s or (s,a) has more opportunity to obtain R (a high value) should be updated more often to converge to the optimum. But stochastic exploration is also required to jump out of sub-optimal solutions. The simplest idea to implement this is \epsilon-greedy.

Tips:

To generate a list of action probabilities that sum to 1, we can use the Dirichlet distribution, e.g.

import numpy as np
p = np.random.dirichlet(np.ones(n_actions))

[Image: Dirichlet distribution]

epsilon-Greedy

\begin{aligned}
\max &: p=1-\epsilon+\frac{\epsilon}{|A|}\\
\mathrm{others} &: p=\frac{\epsilon}{|A|}
\end{aligned}

i.e., choose randomly with probability \epsilon and choose the best action with probability 1-\epsilon.

[Image: epsilon-greedy action probabilities]

In code:

p = np.zeros(n)                   # start from all zeros
p[max_value_index] = 1 - epsilon
p += epsilon / n                  # every action gets epsilon/n, the best gets 1-epsilon+epsilon/n

epsilon-Soft

The only requirement of \epsilon-soft is that each action probability is at least \epsilon/|A|; the Dirichlet distribution is useful here.

[Image: epsilon-soft action probabilities]

p = np.random.dirichlet(np.ones(n)) * (1 - epsilon)   # random split of the remaining 1-epsilon mass
p += epsilon / n                                       # ensures every probability is at least epsilon/n

Footnotes

  1. Online image from here

  2. Sutton and Barto, Reinforcement Learning: An Introduction

  3. ep represents the episode number; here we use the first-visit Monte Carlo method.

Written by Kristin Wei
