=== The Q-learning algorithm ===
[[File:Q-Learning.png|thumb|313px|The Q-learning process]]
 
The weight for a step from a state <math>\Delta t</math> steps into the future is calculated as <math>\gamma^{\Delta t}</math>. <math>\gamma</math> (the ''discount factor'') is a number between 0 and 1 (<math>0 \le \gamma \le 1</math>) and has the effect of valuing rewards received earlier higher than those received later (reflecting the value of a "good start"). <math> \gamma </math> may also be interpreted as the probability to succeed (or survive) at every step <math>\Delta t</math>.
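For example, with <math>\gamma = 0.9</math> a reward received three steps in the future is weighted by

:<math>\gamma^{\Delta t} = 0.9^{3} = 0.729 ,</math>

i.e. it contributes roughly 73% of what the same reward would contribute if received immediately.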
 
The algorithm, therefore, has a function that calculates the quality of a state-action combination:
 
:<math>Q: S \times A \to \mathbb{R}</math> .
 
Before learning begins, <math>Q</math> is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time <math>t</math> the agent selects an action <math>a_t</math>, observes a reward <math>r_t</math>, enters a new state <math>s_{t+1}</math> (that may depend on both the previous state <math>s_t</math> and the selected action), and <math>Q</math> is updated. The core of the algorithm is a simple [[Markov decision process#Value iteration|value iteration update]], using the weighted average of the old value and the new information:
 
:<math>Q^{new}(s_{t},a_{t}) \leftarrow (1-\alpha) \cdot \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \overbrace{\bigg( \underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} \bigg) }^{\text{learned value}} </math>
 
where <math>r_{t}</math> is the reward received when moving from the state <math>s_{t}</math> to the state <math>s_{t+1}</math>, and <math>\alpha</math> is the learning rate (<math>0 < \alpha \le 1</math>).
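A minimal sketch of this update step in Python, assuming a tabular setting where <code>Q</code> is a NumPy array indexed by (state, action); the names <code>s</code>, <code>a</code>, <code>r</code>, <code>s_next</code>, <code>alpha</code>, <code>gamma</code> are illustrative:

<syntaxhighlight lang="python">
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update; Q is a 2-D NumPy array indexed by (state, action)."""
    # Learned value: immediate reward plus discounted estimate of the optimal future value.
    learned_value = r + gamma * np.max(Q[s_next])
    # Weighted average of the old value and the new information.
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * learned_value
</syntaxhighlight>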
 
An episode of the algorithm ends when state <math>s_{t+1}</math> is a final or ''terminal state''. However, ''Q''-learning can also learn in non-episodic tasks.{{citation needed|date=December 2017}} If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.
 
For all final states <math>s_f</math>, <math>Q(s_f, a)</math> is never updated, but is set to the reward value <math>r</math> observed for state <math>s_f</math>. In most cases, <math>Q(s_f,a)</math> can be taken to equal zero.
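In code this is typically handled by dropping the discounted future term at terminal states. A small variant of the update sketch above, where the <code>done</code> flag signalling a terminal transition is an assumed environment signal rather than part of the formula:

<syntaxhighlight lang="python">
import numpy as np

def q_update_terminal_aware(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """Q-learning update that uses only the observed reward at terminal states."""
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
</syntaxhighlight>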
 
# '''Initialization''':
## for each s and a do Q[s, a] = RND // initialize the value function Q of action a in state s with random values for all inputs
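The pseudocode above covers only the initialization step. A hedged sketch of the rest of the tabular algorithm in Python follows; the environment interface (<code>env.reset()</code> and <code>env.step(a)</code> returning the next state, the reward and a terminal flag) and the ε-greedy action selection are illustrative assumptions, not part of the original text:

<syntaxhighlight lang="python">
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning sketch; env is assumed to expose reset() and step(a)."""
    rng = np.random.default_rng(seed)
    # Initialization: random Q-values for every (state, action) pair.
    Q = rng.random((n_states, n_actions))

    for _ in range(episodes):
        s = env.reset()          # assumed to return an integer state index
        done = False
        while not done:
            # Epsilon-greedy action selection (an illustrative choice; the
            # text above does not prescribe a particular selection rule).
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))

            s_next, r, done = env.step(a)   # assumed interface

            # Value-iteration style update from the formula above;
            # the discounted future term is dropped at terminal states.
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
            s = s_next

    return Q
</syntaxhighlight>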