
Search and Optimization

  1. Numerical Optimization
    1. Second-order Taylor Expansion
    2. Newton Direction
  2. Stochastic Search
    1. Simulated Annealing
    2. Cross Entropy Methods
    3. Search Gradient
  3. Reinforcement Learning
    1. Values
    2. Bellman Expectation Equation
    3. Bellman Optimality Equation
    4. Value Iteration
    5. Monte Carlo Policy Evaluation
    6. Temporal-Difference (TD) Prediction
    7. Tabular Q-Learning
    8. Deep Q-Learning
    9. Policy Gradient
  4. Bandits and MCTS
    1. Regret
    2. Concentration Bounds
    3. Explore-Then-Commit
    4. Epsilon-Greedy
    5. Upper Confidence Bound

Numerical Optimization

Second-order Taylor Expansion

$$f(x+p)=f(x)+\nabla f(x)^T p+\frac{1}{2}p^T\nabla^2 f(x+tp)\,p, \quad t\in(0,1)$$

Newton Direction

$$p=-\left(\nabla^2 f(x)\right)^{-1}\nabla f(x)$$
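
Below is a quick Python/NumPy sketch built from the two formulas above: minimize the second-order model and step along the Newton direction. The quadratic test function, the fixed iteration budget, and the absence of a line search are illustrative choices, not something the notes prescribe.

```python
import numpy as np

def newton_minimize(grad, hess, x0, steps=20, tol=1e-8):
    """Minimize f by repeatedly moving along the Newton direction
    p = -(∇²f(x))⁻¹ ∇f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -np.linalg.solve(hess(x), g)   # Newton direction
        x = x + p                          # full Newton step (no line search)
    return x

# Example: f(x) = x0^4 + x1^2 has its minimum at the origin.
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton_minimize(grad, hess, [1.5, -2.0]))
```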

Stochastic Search

Simulated Annealing

Repeat: Sample a step $p \sim P$. If $f(x+p) \ge f(x)$ (the step does not improve the objective under minimization), accept it with probability $\exp\left(\frac{f(x)-f(x+p)}{T}\right)$, i.e. $x \leftarrow x+p$; otherwise accept it unconditionally, $x \leftarrow x+p$. Decrease the temperature $T$ over time.
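
A minimal Python sketch of this loop, assuming a Gaussian proposal for the step distribution $P$ and a geometric cooling schedule; both choices, and the toy objective, are illustrative rather than prescribed above.

```python
import numpy as np

def simulated_annealing(f, x0, T0=2.0, cooling=0.995, steps=5000, step_scale=0.2, rng=None):
    """Minimize f: downhill steps are always accepted, uphill steps with
    probability exp((f(x) - f(x+p)) / T)."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    fx, T = f(x), T0
    for _ in range(steps):
        p = rng.normal(scale=step_scale, size=x.shape)       # sample a step p ~ P (Gaussian here)
        fxp = f(x + p)
        if fxp <= fx or rng.random() < np.exp((fx - fxp) / T):
            x, fx = x + p, fxp                               # x ← x + p
        T *= cooling                                         # anneal the temperature
    return x

# Example: a rippled 1-D objective whose global minimum sits near x ≈ -0.3.
f = lambda x: float(x[0]**2 + 2 * np.sin(5 * x[0]))
print(simulated_annealing(f, [3.0]))
```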

Cross Entropy Methods

Repeat: Collect a set $A$ of samples drawn from $p(x)$. Select the top-$k$ elite samples $E \subseteq A$. Update $p(x)$ to best fit $E$.
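
A sketch with a diagonal Gaussian as the sampling distribution $p(x)$; the Gaussian family, the sample/elite sizes, and the test objective are all assumptions made for illustration.

```python
import numpy as np

def cross_entropy_method(f, mu, sigma, n_samples=100, n_elite=10, iters=50, rng=None):
    """Maximize f by repeatedly refitting a diagonal Gaussian p(x) to the elite samples."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    for _ in range(iters):
        A = rng.normal(mu, sigma, size=(n_samples, mu.size))       # collect samples A ~ p(x)
        elite = A[np.argsort([f(x) for x in A])[-n_elite:]]        # top-k elite set E ⊆ A
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6   # update p(x) to best fit E
    return mu

# Example: maximize a concave quadratic with optimum at (1, -2).
f = lambda x: -np.sum((x - np.array([1.0, -2.0]))**2)
print(cross_entropy_method(f, mu=[0.0, 0.0], sigma=[2.0, 2.0]))
```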

Search Gradient

Repeat: Draw $\lambda$ samples $\mathbf{z}_k \sim \pi(\cdot|\theta)$. Evaluate the fitnesses $f(\mathbf{z}_k)$. Calculate the log-derivatives $\nabla_\theta\log\pi(\mathbf{z}_k|\theta)$. Estimate $\nabla_\theta J \leftarrow \frac{1}{\lambda}\sum_{k=1}^{\lambda}\nabla_\theta\log\pi(\mathbf{z}_k|\theta)\cdot f(\mathbf{z}_k)$. Update $\theta \leftarrow \theta + \eta\cdot\nabla_\theta J$.
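
A sketch assuming the search distribution $\pi(\cdot|\theta)$ is an isotropic Gaussian with $\theta=\mu$ and a fixed $\sigma$, so that $\nabla_\theta\log\pi(\mathbf{z}|\theta)=(\mathbf{z}-\mu)/\sigma^2$; the step size, sample count, and test fitness are illustrative.

```python
import numpy as np

def search_gradient(f, mu, sigma=0.5, lam=50, eta=0.05, iters=200, rng=None):
    """Ascend J(θ) = E_{z~N(μ,σ²I)}[f(z)] with θ = μ and σ held fixed."""
    rng = rng or np.random.default_rng(0)
    mu = np.asarray(mu, dtype=float)
    for _ in range(iters):
        z = mu + sigma * rng.standard_normal((lam, mu.size))    # draw λ samples z_k ~ π(·|θ)
        fitness = np.array([f(zk) for zk in z])                 # evaluate the fitnesses f(z_k)
        grad_log = (z - mu) / sigma**2                          # ∇_θ log π(z_k|θ) for a Gaussian mean
        grad_J = (grad_log * fitness[:, None]).mean(axis=0)     # Monte Carlo estimate of ∇_θ J
        mu = mu + eta * grad_J                                  # θ ← θ + η ∇_θ J
    return mu

# Example: the fitness peaks at (2, 3).
f = lambda x: -np.sum((x - np.array([2.0, 3.0]))**2)
print(search_gradient(f, mu=[0.0, 0.0]))
```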

Reinforcement Learning

Values

$$V^\pi(s_0)=\mathbb{E}_\pi\left[\sum_{i=0}^\infty\gamma^i R(s_i)\right]$$

Bellman Expectation Equation

$$V^\pi(s_0)=R(s_0)+\gamma\,\mathbb{E}_\pi\left[V^\pi(s_1)\right]=R(s_0)+\gamma\sum_i P\left(s_1^i\,\middle|\,s_0,\pi(s_0)\right)\cdot V^\pi\left(s_1^i\right)$$

Bellman Optimality Equation

$$V^*(s_0)=\max_{\pi}V^\pi(s_0)=R(s_0)+\gamma\max_{a\in A(s_0)}\sum_i P\left(s_1^i\,\middle|\,s_0,a\right)\cdot V^*\left(s_1^i\right)$$

Value Iteration

Repeat until $\delta < \epsilon(1-\gamma)/\gamma$: Set $U \leftarrow U'$, $\delta \leftarrow 0$. For each state $s \in S$, $U'[s] \leftarrow R(s)+\gamma\max_{a\in A(s)}\sum_{s'}P(s'|s,a)\,U[s']$. If $\left|U'[s]-U[s]\right| > \delta$, then $\delta \leftarrow \left|U'[s]-U[s]\right|$.
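
A sketch of this loop on a tiny made-up two-state MDP; the transition table `P`, the rewards `R`, and the state/action names are illustrative.

```python
def value_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
    """P[s][a] is a list of (probability, next_state) pairs; R[s] is the state reward."""
    U = {s: 0.0 for s in S}
    while True:
        U_new, delta = {}, 0.0
        for s in S:
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in P[s][a]) for a in A(s))
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps * (1 - gamma) / gamma:
            return U

# Toy MDP: "jump" may reach the rewarding "high" state, "stay" is safe.
S = ["low", "high"]
A = lambda s: ["stay", "jump"]
P = {"low":  {"stay": [(1.0, "low")],                "jump": [(0.6, "high"), (0.4, "low")]},
     "high": {"stay": [(0.9, "high"), (0.1, "low")], "jump": [(1.0, "low")]}}
R = {"low": 0.0, "high": 1.0}
print(value_iteration(S, A, P, R))
```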

Monte Carlo Policy Evaluation

$$V^\pi(s_0)=\mathbb{E}_\pi\left[G(s_0)\right]=\mathbb{E}_\pi\left[\sum_{i=0}^T\gamma^{i}R(s_i)\right]$$
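
A first-visit Monte Carlo sketch; the episodes, written as lists of (state, reward) pairs, are assumed to have been collected under $\pi$ beforehand, and the two hand-written episodes below are placeholders.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    """First-visit Monte Carlo: V(s) is the average discounted return G
    observed from the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:                    # episode = [(s_0, r_0), (s_1, r_1), ...]
        G, first_visit_G = 0.0, {}
        for s, r in reversed(episode):          # accumulate returns backwards
            G = r + gamma * G
            first_visit_G[s] = G                # keeps overwriting until the earliest visit remains
        for s, G_s in first_visit_G.items():
            returns[s].append(G_s)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two illustrative episodes gathered under some fixed policy π.
episodes = [[("A", 0.0), ("B", 1.0), ("T", 0.0)],
            [("A", 0.0), ("A", 0.0), ("B", 1.0)]]
print(mc_policy_evaluation(episodes))
```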

Temporal-Difference (TD) Prediction

$$V^\pi(s_0)\leftarrow V^\pi(s_0)+\alpha\left(R(s_0)+\gamma V^\pi(s_1)-V^\pi(s_0)\right)$$
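
A TD(0) sketch that repeatedly sweeps a fixed batch of observed transitions $(s, r, s')$; the tiny hand-written chain stands in for online interaction under $\pi$.

```python
from collections import defaultdict

def td0_prediction(transitions, alpha=0.1, gamma=0.9, sweeps=100):
    """TD(0): move V(s) toward the bootstrapped target r + γ V(s')."""
    V = defaultdict(float)
    for _ in range(sweeps):
        for s, r, s_next in transitions:        # transitions observed under π
            target = r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])     # V(s) ← V(s) + α(r + γV(s') − V(s))
    return dict(V)

# Illustrative chain A → B → T with reward 1 received in state B.
transitions = [("A", 0.0, "B"), ("B", 1.0, "T"), ("T", 0.0, "T")]
print(td0_prediction(transitions))
```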

Tabular Q-Learning

$$Q(s,a)\leftarrow Q(s,a)+\alpha\left(R(s)+\gamma\max_{a'}Q(s',a')-Q(s,a)\right)$$
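
A sketch on a hypothetical five-state corridor environment (the `Corridor` class is made up for illustration). Because Q-learning is off-policy, a uniformly random behavior policy is enough here; ε-greedy behavior is the more common choice in practice.

```python
import random
from collections import defaultdict

class Corridor:
    """Tiny illustrative MDP: states 0..4, start at 0, reward 1 for reaching state 4."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                                     # a ∈ {-1, +1}
        self.s = min(4, max(0, self.s + a))
        done = (self.s == 4)
        return self.s, (1.0 if done else 0.0), done

def q_learning(env, actions, episodes=300, alpha=0.1, gamma=0.9, rng=None):
    """Tabular Q-learning with an exploratory (here: uniformly random) behavior policy."""
    rng = rng or random.Random(0)
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.choice(actions)
            s_next, r, done = env.step(a)
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])      # the tabular update above
            s = s_next
    return Q

Q = q_learning(Corridor(), actions=[-1, +1])
print([max([-1, +1], key=lambda a: Q[(s, a)]) for s in range(4)])  # greedy policy; expect +1 everywhere
```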

Deep Q-Learning

$$\theta\leftarrow\theta-\alpha\nabla_\theta\left[R(s)+\gamma\max_{a'}Q_\theta(s',a')-Q_\theta(s,a)\right]^2$$
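
A single-gradient-step sketch in PyTorch (the framework choice is an assumption, and the network sizes and random batch are placeholders). Following common practice, the bootstrapped target is held fixed with `no_grad`, so no gradient flows through the $\max$ term; a real DQN would also add a replay buffer and a target network.

```python
import torch
import torch.nn as nn

# Small Q-network: state vector in, one Q-value per action out (sizes are illustrative).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(batch, gamma=0.99):
    """One gradient step on [r + γ max_a' Q_θ(s',a') − Q_θ(s,a)]²."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)              # Q_θ(s, a)
    with torch.no_grad():                                             # target treated as a constant
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative random batch of 32 transitions (s, a, r, s', done).
batch = (torch.randn(32, 4), torch.randint(0, 2, (32,)), torch.randn(32),
         torch.randn(32, 4), torch.zeros(32))
print(dqn_step(batch))
```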

Policy Gradient

$$\theta\leftarrow\theta+\alpha\,R(\tau)\sum_{i=0}^{|\tau|-1}\nabla_\theta\log\pi_\theta(a_i|s_i)$$
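
A REINFORCE-style sketch assuming a linear-softmax policy $\pi_\theta(a|s)=\mathrm{softmax}(\theta s)_a$, for which $\nabla_\theta\log\pi_\theta(a|s)=(e_a-\pi_\theta(\cdot|s))\,s^T$; the trajectory and return below are placeholders rather than data from a real environment.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, trajectory, R_tau, alpha=0.01):
    """θ ← θ + α R(τ) Σ_i ∇_θ log π_θ(a_i|s_i) for a linear-softmax policy.
    trajectory is a list of (state_vector, action_index) pairs."""
    grad = np.zeros_like(theta)
    for s, a in trajectory:
        probs = softmax(theta @ s)                  # π_θ(·|s) with logits θs
        one_hot = np.zeros(theta.shape[0])
        one_hot[a] = 1.0
        grad += np.outer(one_hot - probs, s)        # ∇_θ log π_θ(a|s)
    return theta + alpha * R_tau * grad             # the update above

# Illustrative: 2 actions, 3-dimensional state features, return R(τ) = 1.
theta = np.zeros((2, 3))
trajectory = [(np.array([1.0, 0.0, 0.5]), 0), (np.array([0.0, 1.0, 0.2]), 1)]
print(reinforce_update(theta, trajectory, R_tau=1.0))
```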

Bandits and MCTS

Regret

$$R_\sigma(t)=\mathbb{E}_{c_i\sim\sigma}\left[\sum_{i=1}^t X_{c^*}-\sum_{i=1}^t X_{c_i}\right]=\mu^* t-\mathbb{E}_{c_i\sim\sigma}\left[\sum_{i=1}^t\mu(c_i)\right]$$

Concentration Bounds

$$P\left(|\bar{X}-\mu|\geq\varepsilon\right)\leq 2e^{-2N\varepsilon^2} \quad \text{with i.i.d. } X_i\sim P_\mu \in [0,1]$$

$$P\left(|\bar{X}-\mu|\geq\varepsilon\right)\leq\frac{2}{T^{2c}} \quad \text{where } \varepsilon=\sqrt{\frac{c\log T}{N}}$$
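
A quick simulation check of the Hoeffding bound; Bernoulli$(0.5)$ samples with $N=100$ and $\varepsilon=0.1$ are an illustrative choice. The empirical tail frequency should land below (in fact well below) $2e^{-2N\varepsilon^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps, mu, trials = 100, 0.1, 0.5, 100_000

# Empirical frequency of |X̄ − μ| ≥ ε over many repetitions of N Bernoulli(μ) draws.
means = rng.binomial(N, mu, size=trials) / N
empirical = np.mean(np.abs(means - mu) >= eps)

print(f"empirical tail ≈ {empirical:.4f}, Hoeffding bound = {2 * np.exp(-2 * N * eps**2):.4f}")
```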

Explore-Then-Commit

First, play each coin $N=c'\cdot T^{\frac{2}{3}}(\log T)^{\frac{1}{3}}$ times. Compare the average rewards $\bar{\mu}_1$ and $\bar{\mu}_2$. Afterwards, play the coin with the higher $\bar{\mu}_i$.

$$R(T)\leq N+c\sqrt{\frac{\log T}{N}}\cdot T, \qquad R(T)\sim O\left(T^{\frac{2}{3}}(\log T)^{\frac{1}{3}}\right)$$
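
A two-armed Bernoulli sketch of explore-then-commit, with the exploration length set by the rule above and $c'=1$; the arm means and the constant are illustrative, and the function returns the pseudo-regret $\sum_t(\mu^*-\mu_{c_t})$ rather than a noisy realized regret.

```python
import numpy as np

def explore_then_commit(means, T, rng=None):
    """Pull each arm N times, then commit to the arm with the higher empirical mean."""
    rng = rng or np.random.default_rng(0)
    N = int(T ** (2 / 3) * np.log(T) ** (1 / 3))           # N = c'·T^(2/3)(log T)^(1/3), c' = 1
    mu_bar = [rng.binomial(1, means[i], size=N).mean() for i in (0, 1)]
    best = int(mu_bar[1] > mu_bar[0])                      # commit to the higher μ̄_i
    mu_star = max(means)
    regret = N * sum(mu_star - means[i] for i in (0, 1))   # exploration phase
    regret += (T - 2 * N) * (mu_star - means[best])        # commit phase
    return regret

print(explore_then_commit(means=[0.5, 0.6], T=100_000))
```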

Epsilon-Greedy

Compare the average rewards $\bar{\mu}_1$ and $\bar{\mu}_2$. Choose $\varepsilon=c'\cdot t^{-\frac{1}{3}}(\log t)^{\frac{1}{3}}$. With probability $1-\varepsilon$, play the coin with the better empirical mean; with probability $\varepsilon$, play the other one.

$$R(t)\leq\varepsilon t+c\sqrt{\frac{\log t}{\varepsilon t}}\cdot t, \qquad R(t)\sim O\left(t^{\frac{2}{3}}(\log t)^{\frac{1}{3}}\right)$$
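
The same two-armed Bernoulli setup with the decaying $\varepsilon_t$ above, again taking $c'=1$; the means and constant are illustrative.

```python
import numpy as np

def epsilon_greedy(means, T, rng=None):
    """Decaying-ε greedy on two Bernoulli arms; returns the pseudo-regret."""
    rng = rng or np.random.default_rng(0)
    counts, sums = np.zeros(2), np.zeros(2)
    mu_star, regret = max(means), 0.0
    for t in range(1, T + 1):
        eps = min(1.0, t ** (-1 / 3) * np.log(t + 1) ** (1 / 3))  # ε_t ≈ c'·t^(-1/3)(log t)^(1/3); log(t+1) avoids ε_1 = 0
        greedy = int(sums[1] / max(counts[1], 1) > sums[0] / max(counts[0], 1))
        arm = rng.integers(2) if rng.random() < eps else greedy   # explore w.p. ε, else exploit
        sums[arm] += rng.binomial(1, means[arm])
        counts[arm] += 1
        regret += mu_star - means[arm]
    return regret

print(epsilon_greedy(means=[0.5, 0.6], T=100_000))
```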

Upper Confidence Bound

Compute the UCB value of each arm $i\in[K]$. Play the arm attaining $\max_i B(i,t,n_i(t))$.

$$B(i,t,n_i(t))=\bar{X}_i+\sqrt{\frac{2\log t}{n_i(t)}}$$

$$R_{\mathrm{UCB}}(t)=\sum_{i=1}^K(\mu_{i^*}-\mu_i)\cdot\mathbb{E}[n_i(t)]\leq\sum_{i\neq i^*}\frac{8\log t}{\mu_{i^*}-\mu_i}+O(1)$$
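
A UCB1 sketch on Bernoulli arms using the bonus $\sqrt{2\log t/n_i(t)}$ from the formula above; the arm means and horizon are illustrative.

```python
import numpy as np

def ucb1(means, T, rng=None):
    """Play each arm once, then always play argmax_i  X̄_i + sqrt(2 log t / n_i(t))."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    counts, sums = np.zeros(K), np.zeros(K)
    mu_star, regret = max(means), 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                                           # initialization: each arm once
        else:
            B = sums / counts + np.sqrt(2 * np.log(t) / counts)   # B(i, t, n_i(t))
            arm = int(np.argmax(B))
        sums[arm] += rng.binomial(1, means[arm])
        counts[arm] += 1
        regret += mu_star - means[arm]
    return regret

print(ucb1(means=[0.5, 0.6, 0.7], T=100_000))
```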

— Mar 13, 2025

Search and Optimization by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.