Changes

Reinforcement learning

1,280 bytes removed, 23:06, 13 January 2019

No edit summary

== Q-learning ==

'''Q-learning''' is a method used in [[Искусственный интеллект|artificial intelligence]] within the [[Агентный подход|agent-based approach]]. It belongs to the family of [[Обучение с подкреплением|reinforcement learning]] methods. Based on the reward it receives from the environment, the agent builds a utility function <tex>Q</tex>, which later lets it choose its behavior strategy not at random but by drawing on its previous interaction with the environment. One advantage of <tex>Q</tex>-learning is that it can compare the expected utility of the available actions without building a model of the environment. It applies to situations that can be represented as a Markov decision process (MDP).

At its core, the algorithm maintains a function that scores the quality of a state-action pair:

:<tex>Q: S \times A \to \mathbb{R}</tex>

Before learning begins, <tex>Q</tex> is initialized with random values. Then, at each time step <tex>t</tex>, the agent selects an action <tex>a_t</tex>, receives a reward <tex>r_t</tex>, moves to a new state <tex>s_{t+1}</tex> (which may depend on the previous state <tex>s_t</tex> and the selected action), and updates the function <tex>Q</tex>. The update is a weighted average of the old value and the newly learned value:

:<tex>Q^{new}(s_{t},a_{t}) \leftarrow (1-\alpha) \cdot \underbrace{Q(s_{t},a_{t})}_{\text{old value}} + \underbrace{\alpha}_{\text{learning rate}} \cdot \overbrace{\bigg( \underbrace{r_{t}}_{\text{reward}} + \underbrace{\gamma}_{\text{discount factor}} \cdot \underbrace{\max_{a}Q(s_{t+1}, a)}_{\text{estimate of optimal future value}} \bigg) }^{\text{learned value}}</tex>,

where <tex>r_{t}</tex> is the reward received when moving from state <tex>s_{t}</tex> to state <tex>s_{t+1}</tex>, and <tex>\alpha</tex> is the learning rate (<tex>0 < \alpha \le 1</tex>).
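The update rule can be sketched as a single Python function. This is a minimal illustration, not the article's own code: the name `q_update`, the dictionary keyed by `(state, action)` pairs, and the default of 0.0 for unseen pairs are assumptions made for the example (the text itself initializes <tex>Q</tex> randomly).

```python
from collections import defaultdict


def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new value.

    Q is a mapping from (state, action) pairs to floats (hypothetical layout).
    """
    # Estimate of the best value reachable from the next state.
    best_next = max(Q[(s_next, b)] for b in actions)
    # Weighted average of the old value and the learned value.
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
    return Q[(s, a)]


# Assumption of this sketch: unseen pairs start at 0.0 instead of random values.
Q = defaultdict(float)
Q[("s1", "left")] = 2.0
Q[("s1", "right")] = 4.0

new_value = q_update(Q, "s0", "right", r=1.0, s_next="s1",
                     actions=["left", "right"], alpha=0.5, gamma=0.5)
```

With these numbers the learned value is 1 + 0.5 · 4 = 3, and averaging it with the old value 0 at <tex>\alpha = 0.5</tex> gives 1.5.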

The algorithm stops when the state <tex>s_{t+1}</tex> that the agent moves into is terminal.

=== Q-learning algorithm ===

[[File:Q-Learning.png|thumb|313px|The Q-learning process]]

* <tex>S</tex>: the set of states
* <tex>A</tex>: the set of actions
* <tex>R: S \times A \rightarrow \mathbb{R}</tex>: the reward function
* <tex>T: S \times A \rightarrow S</tex>: the transition function
* <tex>\alpha \in (0, 1]</tex>: the learning rate (usually 0.1); the higher it is, the more the agent trusts new information
* <tex>\gamma \in [0, 1]</tex>: the discount factor; the lower it is, the less the agent cares about the payoff of its future actions
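The effect of the two numeric parameters can be illustrated with a short Python sketch. The helper names `discounted_return` and `blend` are hypothetical and exist only for this example; the first uses the standard discounted sum of rewards, the second is the weighted average from the update rule.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward k steps ahead is weighted by gamma**k."""
    return sum(r * gamma**k for k, r in enumerate(rewards))


def blend(old, target, alpha):
    """Move the old estimate toward the target by a fraction alpha."""
    return (1 - alpha) * old + alpha * target


rewards = [1.0, 1.0, 1.0]
myopic = discounted_return(rewards, 0.0)      # only the immediate reward counts
balanced = discounted_return(rewards, 0.5)    # 1 + 0.5 + 0.25
farsighted = discounted_return(rewards, 1.0)  # all rewards count equally

cautious = blend(0.0, 10.0, 0.1)  # small step toward the new information
trusting = blend(0.0, 10.0, 1.0)  # the old value is overwritten
```

A small <tex>\gamma</tex> shrinks the contribution of later rewards, while a large <tex>\alpha</tex> lets a single observation replace most of the previous estimate.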

'''fun''' Q-learning(<tex>S, A, R, T, \alpha, \gamma</tex>):
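A minimal Python sketch of a procedure with the signature above, under stated assumptions. The tabular dictionary, the epsilon-greedy action choice, and the extra parameters `terminal`, `start`, `episodes`, `epsilon`, `max_steps` and `seed` are not part of the article; they are hypothetical additions needed to make the example runnable. `R` and `T` are taken to be deterministic Python functions of `(state, action)`.

```python
import random


def q_learning(S, A, R, T, alpha, gamma, terminal, start,
               episodes=500, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning sketch; returns a dict keyed by (state, action)."""
    rng = random.Random(seed)
    # Random initialization as in the text; terminal states are kept at 0
    # (an assumption of this sketch) so they add nothing to the target.
    Q = {(s, a): (0.0 if s in terminal else rng.random())
         for s in S for a in A}
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if s in terminal:
                break
            # Epsilon-greedy choice (assumed exploration scheme).
            if rng.random() < epsilon:
                a = rng.choice(A)
            else:
                a = max(A, key=lambda b: Q[(s, b)])
            r = R(s, a)
            s_next = T(s, a)
            # Weighted average of the old value and the learned value.
            best_next = max(Q[(s_next, b)] for b in A)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next
    return Q


# Hypothetical toy environment: a corridor of states 0..3, goal at state 3.
S = [0, 1, 2, 3]
A = [-1, 1]


def T(s, a):
    """Move left or right, staying inside the corridor."""
    return min(max(s + a, 0), 3)


def R(s, a):
    """Reward 1 for stepping into the goal state, 0 otherwise."""
    return 1.0 if T(s, a) == 3 else 0.0


Q = q_learning(S, A, R, T, alpha=0.5, gamma=0.5, terminal={3}, start=0)
policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in [0, 1, 2]}
```

In this toy corridor the greedy policy read from the learned table should step right in every non-terminal state. The environment is deterministic, so a fairly large <tex>\alpha</tex> is used here; in noisy environments a smaller value is the usual choice.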

Dariyakovleva

77 edits
