
Search and Optimization

  1. Numerical Optimization
    1. Second-order Taylor Expansion
    2. Newton Direction
  2. Stochastic Search
    1. Simulated Annealing
    2. Cross Entropy Methods
    3. Search Gradient
  3. Reinforcement Learning
    1. Values
    2. Bellman Expectation Equation
    3. Bellman Optimality Equation
    4. Value Iteration
    5. Monte Carlo Policy Evaluation
    6. Temporal-Difference (TD) Prediction
    7. Tabular Q-Learning
    8. Deep Q-Learning
    9. Policy Gradient
  4. Bandits and MCTS
    1. Regret
    2. Concentration Bounds
    3. Explore-Then-Commit
    4. Epsilon-Greedy
    5. Upper Confidence Bound

Numerical Optimization

Second-order Taylor Expansion

$$f(x+p)=f(x)+\nabla f(x)^T p+\frac{1}{2}p^T\nabla^2 f(x+tp)\,p, \quad t\in(0,1)$$

Newton Direction

$$p=-\left(\nabla^2 f(x)\right)^{-1}\nabla f(x)$$
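
Below is a quick Python/NumPy sketch built from the two formulas above: minimize the second-order model and step along the Newton direction. The quadratic test function, the fixed iteration budget, and the absence of a line search are illustrative choices, not something the notes prescribe.

```python
import numpy as np

def newton_minimize(grad, hess, x0, steps=20, tol=1e-8):
    """Minimize f by repeatedly moving along the Newton direction
    p = -(∇²f(x))⁻¹ ∇f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        p = -np.linalg.solve(hess(x), g)   # Newton direction
        x = x + p                          # full Newton step (no line search)
    return x

# Example: f(x) = x0^4 + x1^2 has its minimum at the origin.
grad = lambda x: np.array([4 * x[0]**3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0]**2, 0.0], [0.0, 2.0]])
print(newton_minimize(grad, hess, [1.5, -2.0]))
```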

Stochastic Search

Simulated Annealing

Repeat: Sample a step $p \sim P$. If $f(x+p) \ge f(x)$ (the step does not improve the objective under minimization), accept it with probability $\exp\left(\frac{f(x)-f(x+p)}{T}\right)$, i.e. $x \leftarrow x+p$; otherwise accept it unconditionally, $x \leftarrow x+p$. Decrease the temperature $T$ over time.
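
A minimal Python sketch of this loop, assuming a Gaussian proposal for the step distribution $P$ and a geometric cooling schedule; both choices, and the toy objective, are illustrative rather than prescribed above.

```python
import numpy as np

def simulated_annealing(f, x0, T0=2.0, cooling=0.995, steps=5000, step_scale=0.2, rng=None):
    """Minimize f: downhill steps are always accepted, uphill steps with
    probability exp((f(x) - f(x+p)) / T)."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float)
    fx, T = f(x), T0
    for _ in range(steps):
        p = rng.normal(scale=step_scale, size=x.shape)       # sample a step p ~ P (Gaussian here)
        fxp = f(x + p)
        if fxp <= fx or rng.random() < np.exp((fx - fxp) / T):
            x, fx = x + p, fxp                               # x ← x + p
        T *= cooling                                         # anneal the temperature
    return x

# Example: a rippled 1-D objective whose global minimum sits near x ≈ -0.3.
f = lambda x: float(x[0]**2 + 2 * np.sin(5 * x[0]))
print(simulated_annealing(f, [3.0]))
```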

Cross Entropy Methods

Repeat: Collect a set $A$ of samples drawn from $p(x)$. Select the top-$k$ elite samples $E \subseteq A$. Update $p(x)$ to best fit $E$.
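
A sketch with a diagonal Gaussian as the sampling distribution $p(x)$; the Gaussian family, the sample/elite sizes, and the test objective are all assumptions made for illustration.

```python
import numpy as np

def cross_entropy_method(f, mu, sigma, n_samples=100, n_elite=10, iters=50, rng=None):
    """Maximize f by repeatedly refitting a diagonal Gaussian p(x) to the elite samples."""
    rng = rng or np.random.default_rng(0)
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    for _ in range(iters):
        A = rng.normal(mu, sigma, size=(n_samples, mu.size))       # collect samples A ~ p(x)
        elite = A[np.argsort([f(x) for x in A])[-n_elite:]]        # top-k elite set E ⊆ A
        mu, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-6   # update p(x) to best fit E
    return mu

# Example: maximize a concave quadratic with optimum at (1, -2).
f = lambda x: -np.sum((x - np.array([1.0, -2.0]))**2)
print(cross_entropy_method(f, mu=[0.0, 0.0], sigma=[2.0, 2.0]))
```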

Search Gradient

Repeat: Draw $\lambda$ samples $\mathbf{z}_k \sim \pi(\cdot|\theta)$. Evaluate the fitnesses $f(\mathbf{z}_k)$. Calculate the log-derivatives $\nabla_\theta\log\pi(\mathbf{z}_k|\theta)$. Estimate $\nabla_\theta J \leftarrow \frac{1}{\lambda}\sum_{k=1}^{\lambda}\nabla_\theta\log\pi(\mathbf{z}_k|\theta)\cdot f(\mathbf{z}_k)$. Update $\theta \leftarrow \theta + \eta\cdot\nabla_\theta J$.
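
A sketch assuming the search distribution $\pi(\cdot|\theta)$ is an isotropic Gaussian with $\theta=\mu$ and a fixed $\sigma$, so that $\nabla_\theta\log\pi(\mathbf{z}|\theta)=(\mathbf{z}-\mu)/\sigma^2$; the step size, sample count, and test fitness are illustrative.

```python
import numpy as np

def search_gradient(f, mu, sigma=0.5, lam=50, eta=0.05, iters=200, rng=None):
    """Ascend J(θ) = E_{z~N(μ,σ²I)}[f(z)] with θ = μ and σ held fixed."""
    rng = rng or np.random.default_rng(0)
    mu = np.asarray(mu, dtype=float)
    for _ in range(iters):
        z = mu + sigma * rng.standard_normal((lam, mu.size))    # draw λ samples z_k ~ π(·|θ)
        fitness = np.array([f(zk) for zk in z])                 # evaluate the fitnesses f(z_k)
        grad_log = (z - mu) / sigma**2                          # ∇_θ log π(z_k|θ) for a Gaussian mean
        grad_J = (grad_log * fitness[:, None]).mean(axis=0)     # Monte Carlo estimate of ∇_θ J
        mu = mu + eta * grad_J                                  # θ ← θ + η ∇_θ J
    return mu

# Example: the fitness peaks at (2, 3).
f = lambda x: -np.sum((x - np.array([2.0, 3.0]))**2)
print(search_gradient(f, mu=[0.0, 0.0]))
```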

Reinforcement Learning

Values

$$V^\pi(s_0)=\mathbb{E}_\pi\left[\sum_{i=0}^\infty\gamma^i R(s_i)\right]$$

Bellman Expectation Equation

$$V^\pi(s_0)=R(s_0)+\gamma\,\mathbb{E}_\pi\left[V^\pi(s_1)\right]=R(s_0)+\gamma\sum_i P\left(s_1^i\,\middle|\,s_0,\pi(s_0)\right)\cdot V^\pi\left(s_1^i\right)$$

Bellman Optimality Equation

$$V^*(s_0)=\max_{\pi}V^\pi(s_0)=R(s_0)+\gamma\max_{a\in A(s_0)}\sum_i P\left(s_1^i\,\middle|\,s_0,a\right)\cdot V^*\left(s_1^i\right)$$

Value Iteration

Repeat until $\delta < \epsilon(1-\gamma)/\gamma$: Set $U \leftarrow U'$, $\delta \leftarrow 0$. For each state $s \in S$, $U'[s] \leftarrow R(s)+\gamma\max_{a\in A(s)}\sum_{s'}P(s'|s,a)\,U[s']$. If $\left|U'[s]-U[s]\right| > \delta$, then $\delta \leftarrow \left|U'[s]-U[s]\right|$.
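
A sketch of this loop on a tiny made-up two-state MDP; the transition table `P`, the rewards `R`, and the state/action names are illustrative.

```python
def value_iteration(S, A, P, R, gamma=0.9, eps=1e-6):
    """P[s][a] is a list of (probability, next_state) pairs; R[s] is the state reward."""
    U = {s: 0.0 for s in S}
    while True:
        U_new, delta = {}, 0.0
        for s in S:
            U_new[s] = R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in P[s][a]) for a in A(s))
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps * (1 - gamma) / gamma:
            return U

# Toy MDP: "jump" may reach the rewarding "high" state, "stay" is safe.
S = ["low", "high"]
A = lambda s: ["stay", "jump"]
P = {"low":  {"stay": [(1.0, "low")],                "jump": [(0.6, "high"), (0.4, "low")]},
     "high": {"stay": [(0.9, "high"), (0.1, "low")], "jump": [(1.0, "low")]}}
R = {"low": 0.0, "high": 1.0}
print(value_iteration(S, A, P, R))
```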

Monte Carlo Policy Evaluation

$$V^\pi(s_0)=\mathbb{E}_\pi\left[G(s_0)\right]=\mathbb{E}_\pi\left[\sum_{i=0}^T\gamma^{i}R(s_i)\right]$$
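
A first-visit Monte Carlo sketch; the episodes, written as lists of (state, reward) pairs, are assumed to have been collected under $\pi$ beforehand, and the two hand-written episodes below are placeholders.

```python
from collections import defaultdict

def mc_policy_evaluation(episodes, gamma=0.9):
    """First-visit Monte Carlo: V(s) is the average discounted return G
    observed from the first visit to s in each episode."""
    returns = defaultdict(list)
    for episode in episodes:                    # episode = [(s_0, r_0), (s_1, r_1), ...]
        G, first_visit_G = 0.0, {}
        for s, r in reversed(episode):          # accumulate returns backwards
            G = r + gamma * G
            first_visit_G[s] = G                # keeps overwriting until the earliest visit remains
        for s, G_s in first_visit_G.items():
            returns[s].append(G_s)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Two illustrative episodes gathered under some fixed policy π.
episodes = [[("A", 0.0), ("B", 1.0), ("T", 0.0)],
            [("A", 0.0), ("A", 0.0), ("B", 1.0)]]
print(mc_policy_evaluation(episodes))
```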

Temporal-Difference (TD) Prediction

$$V^\pi(s_0)\leftarrow V^\pi(s_0)+\alpha\left(R(s_0)+\gamma V^\pi(s_1)-V^\pi(s_0)\right)$$
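
A TD(0) sketch that repeatedly sweeps a fixed batch of observed transitions $(s, r, s')$; the tiny hand-written chain stands in for online interaction under $\pi$.

```python
from collections import defaultdict

def td0_prediction(transitions, alpha=0.1, gamma=0.9, sweeps=100):
    """TD(0): move V(s) toward the bootstrapped target r + γ V(s')."""
    V = defaultdict(float)
    for _ in range(sweeps):
        for s, r, s_next in transitions:        # transitions observed under π
            target = r + gamma * V[s_next]
            V[s] += alpha * (target - V[s])     # V(s) ← V(s) + α(r + γV(s') − V(s))
    return dict(V)

# Illustrative chain A → B → T with reward 1 received in state B.
transitions = [("A", 0.0, "B"), ("B", 1.0, "T"), ("T", 0.0, "T")]
print(td0_prediction(transitions))
```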

Tabular Q-Learning

$$Q(s,a)\leftarrow Q(s,a)+\alpha\left(R(s)+\gamma\max_{a'}Q(s',a')-Q(s,a)\right)$$
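
A sketch on a hypothetical five-state corridor environment (the `Corridor` class is made up for illustration). Because Q-learning is off-policy, a uniformly random behavior policy is enough here; ε-greedy behavior is the more common choice in practice.

```python
import random
from collections import defaultdict

class Corridor:
    """Tiny illustrative MDP: states 0..4, start at 0, reward 1 for reaching state 4."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                                     # a ∈ {-1, +1}
        self.s = min(4, max(0, self.s + a))
        done = (self.s == 4)
        return self.s, (1.0 if done else 0.0), done

def q_learning(env, actions, episodes=300, alpha=0.1, gamma=0.9, rng=None):
    """Tabular Q-learning with an exploratory (here: uniformly random) behavior policy."""
    rng = rng or random.Random(0)
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = rng.choice(actions)
            s_next, r, done = env.step(a)
            target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])      # the tabular update above
            s = s_next
    return Q

Q = q_learning(Corridor(), actions=[-1, +1])
print([max([-1, +1], key=lambda a: Q[(s, a)]) for s in range(4)])  # greedy policy; expect +1 everywhere
```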

Deep Q-Learning

$$\theta\leftarrow\theta-\alpha\nabla_\theta\left[R(s)+\gamma\max_{a'}Q_\theta(s',a')-Q_\theta(s,a)\right]^2$$
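
A single-gradient-step sketch in PyTorch (the framework choice is an assumption, and the network sizes and random batch are placeholders). Following common practice, the bootstrapped target is held fixed with `no_grad`, so no gradient flows through the $\max$ term; a real DQN would also add a replay buffer and a target network.

```python
import torch
import torch.nn as nn

# Small Q-network: state vector in, one Q-value per action out (sizes are illustrative).
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_step(batch, gamma=0.99):
    """One gradient step on [r + γ max_a' Q_θ(s',a') − Q_θ(s,a)]²."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)              # Q_θ(s, a)
    with torch.no_grad():                                             # target treated as a constant
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Illustrative random batch of 32 transitions (s, a, r, s', done).
batch = (torch.randn(32, 4), torch.randint(0, 2, (32,)), torch.randn(32),
         torch.randn(32, 4), torch.zeros(32))
print(dqn_step(batch))
```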

Policy Gradient

$$\theta\leftarrow\theta+\alpha\,R(\tau)\sum_{i=0}^{|\tau|-1}\nabla_\theta\log\pi_\theta(a_i|s_i)$$
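
A REINFORCE-style sketch assuming a linear-softmax policy $\pi_\theta(a|s)=\mathrm{softmax}(\theta s)_a$, for which $\nabla_\theta\log\pi_\theta(a|s)=(e_a-\pi_\theta(\cdot|s))\,s^T$; the trajectory and return below are placeholders rather than data from a real environment.

```python
import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_update(theta, trajectory, R_tau, alpha=0.01):
    """θ ← θ + α R(τ) Σ_i ∇_θ log π_θ(a_i|s_i) for a linear-softmax policy.
    trajectory is a list of (state_vector, action_index) pairs."""
    grad = np.zeros_like(theta)
    for s, a in trajectory:
        probs = softmax(theta @ s)                  # π_θ(·|s) with logits θs
        one_hot = np.zeros(theta.shape[0])
        one_hot[a] = 1.0
        grad += np.outer(one_hot - probs, s)        # ∇_θ log π_θ(a|s)
    return theta + alpha * R_tau * grad             # the update above

# Illustrative: 2 actions, 3-dimensional state features, return R(τ) = 1.
theta = np.zeros((2, 3))
trajectory = [(np.array([1.0, 0.0, 0.5]), 0), (np.array([0.0, 1.0, 0.2]), 1)]
print(reinforce_update(theta, trajectory, R_tau=1.0))
```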

Bandits and MCTS

Regret

$$R_\sigma(t)=\mathbb{E}_{c_i\sim\sigma}\left[\sum_{i=1}^t X_{c^*}-\sum_{i=1}^t X_{c_i}\right]=\mu^* t-\mathbb{E}_{c_i\sim\sigma}\left[\sum_{i=1}^t\mu(c_i)\right]$$

Concentration Bounds

$$P\left(|\bar{X}-\mu|\geq\varepsilon\right)\leq 2e^{-2N\varepsilon^2} \quad \text{with i.i.d. } X_i\sim P_\mu \in [0,1]$$

$$P\left(|\bar{X}-\mu|\geq\varepsilon\right)\leq\frac{2}{T^{2c}} \quad \text{where } \varepsilon=\sqrt{\frac{c\log T}{N}}$$
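
A quick simulation check of the Hoeffding bound; Bernoulli$(0.5)$ samples with $N=100$ and $\varepsilon=0.1$ are an illustrative choice. The empirical tail frequency should land below (in fact well below) $2e^{-2N\varepsilon^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, eps, mu, trials = 100, 0.1, 0.5, 100_000

# Empirical frequency of |X̄ − μ| ≥ ε over many repetitions of N Bernoulli(μ) draws.
means = rng.binomial(N, mu, size=trials) / N
empirical = np.mean(np.abs(means - mu) >= eps)

print(f"empirical tail ≈ {empirical:.4f}, Hoeffding bound = {2 * np.exp(-2 * N * eps**2):.4f}")
```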

Explore-Then-Commit

First, play each coin $N=c'\cdot T^{\frac{2}{3}}(\log T)^{\frac{1}{3}}$ times. Compare the average rewards $\bar{\mu}_1$ and $\bar{\mu}_2$. Afterwards, play the coin with the higher $\bar{\mu}_i$.

$$R(T)\leq N+c\sqrt{\frac{\log T}{N}}\cdot T, \qquad R(T)\sim O\left(T^{\frac{2}{3}}(\log T)^{\frac{1}{3}}\right)$$
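
A two-armed Bernoulli sketch of explore-then-commit, with the exploration length set by the rule above and $c'=1$; the arm means and the constant are illustrative, and the function returns the pseudo-regret $\sum_t(\mu^*-\mu_{c_t})$ rather than a noisy realized regret.

```python
import numpy as np

def explore_then_commit(means, T, rng=None):
    """Pull each arm N times, then commit to the arm with the higher empirical mean."""
    rng = rng or np.random.default_rng(0)
    N = int(T ** (2 / 3) * np.log(T) ** (1 / 3))           # N = c'·T^(2/3)(log T)^(1/3), c' = 1
    mu_bar = [rng.binomial(1, means[i], size=N).mean() for i in (0, 1)]
    best = int(mu_bar[1] > mu_bar[0])                      # commit to the higher μ̄_i
    mu_star = max(means)
    regret = N * sum(mu_star - means[i] for i in (0, 1))   # exploration phase
    regret += (T - 2 * N) * (mu_star - means[best])        # commit phase
    return regret

print(explore_then_commit(means=[0.5, 0.6], T=100_000))
```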

Epsilon-Greedy

Compare the average rewards $\bar{\mu}_1$ and $\bar{\mu}_2$. Choose $\varepsilon=c'\cdot t^{-\frac{1}{3}}(\log t)^{\frac{1}{3}}$. With probability $1-\varepsilon$, play the coin with the better empirical mean; with probability $\varepsilon$, play the other one.

$$R(t)\leq\varepsilon t+c\sqrt{\frac{\log t}{\varepsilon t}}\cdot t, \qquad R(t)\sim O\left(t^{\frac{2}{3}}(\log t)^{\frac{1}{3}}\right)$$
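
The same two-armed Bernoulli setup with the decaying $\varepsilon_t$ above, again taking $c'=1$; the means and constant are illustrative.

```python
import numpy as np

def epsilon_greedy(means, T, rng=None):
    """Decaying-ε greedy on two Bernoulli arms; returns the pseudo-regret."""
    rng = rng or np.random.default_rng(0)
    counts, sums = np.zeros(2), np.zeros(2)
    mu_star, regret = max(means), 0.0
    for t in range(1, T + 1):
        eps = min(1.0, t ** (-1 / 3) * np.log(t + 1) ** (1 / 3))  # ε_t ≈ c'·t^(-1/3)(log t)^(1/3); log(t+1) avoids ε_1 = 0
        greedy = int(sums[1] / max(counts[1], 1) > sums[0] / max(counts[0], 1))
        arm = rng.integers(2) if rng.random() < eps else greedy   # explore w.p. ε, else exploit
        sums[arm] += rng.binomial(1, means[arm])
        counts[arm] += 1
        regret += mu_star - means[arm]
    return regret

print(epsilon_greedy(means=[0.5, 0.6], T=100_000))
```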

Upper Confidence Bound

Compute the UCB value of each arm $i\in[K]$. Play the arm attaining $\max_i B(i,t,n_i(t))$.

$$B(i,t,n_i(t))=\bar{X}_i+\sqrt{\frac{2\log t}{n_i(t)}}$$

$$R_{\mathrm{UCB}}(t)=\sum_{i=1}^K(\mu_{i^*}-\mu_i)\cdot\mathbb{E}[n_i(t)]\leq\sum_{i\neq i^*}\frac{8\log t}{\mu_{i^*}-\mu_i}+O(1)$$
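
A UCB1 sketch on Bernoulli arms using the bonus $\sqrt{2\log t/n_i(t)}$ from the formula above; the arm means and horizon are illustrative.

```python
import numpy as np

def ucb1(means, T, rng=None):
    """Play each arm once, then always play argmax_i  X̄_i + sqrt(2 log t / n_i(t))."""
    rng = rng or np.random.default_rng(0)
    K = len(means)
    counts, sums = np.zeros(K), np.zeros(K)
    mu_star, regret = max(means), 0.0
    for t in range(1, T + 1):
        if t <= K:
            arm = t - 1                                           # initialization: each arm once
        else:
            B = sums / counts + np.sqrt(2 * np.log(t) / counts)   # B(i, t, n_i(t))
            arm = int(np.argmax(B))
        sums[arm] += rng.binomial(1, means[arm])
        counts[arm] += 1
        regret += mu_star - means[arm]
    return regret

print(ucb1(means=[0.5, 0.6, 0.7], T=100_000))
```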

— Mar 13, 2025

Search and Optimization by Lu Meng is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. Permissions beyond the scope of this license may be available at About.