Many real-world problems are inherently hierarchically structured. The
use of this structure in an agent's policy may well be the key to
improved scalability and higher performance on motor skill tasks.
However, such hierarchical structures cannot be exploited by current
policy search algorithms. We concentrate on a basic but highly relevant
hierarchy: the `mixed option' policy. Here, a gating network first
decides which of the options to execute and, subsequently, the selected
option's policy determines the action. Using a hierarchical setup for our
learning method allows us to learn not only one solution to a problem
but many. We base our algorithm on a recently proposed
information-theoretic policy search method, which addresses the
exploitation-exploration trade-off by limiting the loss of information
between policy updates.
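
The two-level structure described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: it assumes a linear softmax gating network and linear-Gaussian option policies, and all parameter names (`gate_w`, `option_K`, `noise_std`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class MixedOptionPolicy:
    """Sketch of a `mixed option' policy: a softmax gating network
    first picks an option o given the state s, then that option's
    sub-policy produces the action a (assumed linear-Gaussian here)."""

    def __init__(self, n_options, state_dim, action_dim):
        # Gating parameters: one score weight vector per option (assumption).
        self.gate_w = rng.normal(size=(n_options, state_dim))
        # Per-option linear policy gains and shared exploration noise (assumption).
        self.option_K = rng.normal(size=(n_options, action_dim, state_dim))
        self.noise_std = 0.1

    def gate_probs(self, s):
        # Softmax over linear scores: p(o | s).
        scores = self.gate_w @ s
        scores -= scores.max()          # for numerical stability
        p = np.exp(scores)
        return p / p.sum()

    def act(self, s):
        p = self.gate_probs(s)
        o = rng.choice(len(p), p=p)     # gating network: which option to run
        mean = self.option_K[o] @ s     # option-policy: action mean
        a = mean + self.noise_std * rng.normal(size=mean.shape)
        return o, a

policy = MixedOptionPolicy(n_options=3, state_dim=4, action_dim=2)
s = rng.normal(size=4)
o, a = policy.act(s)
```

Because each option maintains its own sub-policy, distinct options can converge to distinct solutions of the same task, which is what allows the hierarchical learner to represent many solutions rather than one.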