Hierarchical Programmatic Reinforcement Learning
via Learning to Compose Programs
We reformulate solving a reinforcement learning task as synthesizing
a task-solving program that can be executed to interact with the environment
and maximize the return.
and maximize the return. We first learn a program embedding space that
continuously parameterizes a diverse set of programs sampled from a program dataset.
Then, we train a meta-policy, whose action space is the learned program embedding space,
to sequentially produce programs (i.e., predict a series of actions) that
compose into a task-solving program.
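The two-stage pipeline above can be sketched as follows. This is a minimal, illustrative sketch, not the paper's implementation: `decode_program`, `MetaPolicy`, and the toy primitive set are all hypothetical stand-ins for the learned decoder, the trained meta-policy, and the program DSL.

```python
import random

def decode_program(z):
    """Stand-in decoder: maps a latent vector to a small DSL program (a string).
    In the actual method this would be the learned program decoder."""
    primitives = ["move()", "turnLeft()", "pickMarker()", "putMarker()"]
    # Deterministically pick a primitive from the latent code, for illustration only.
    idx = int(sum(z)) % len(primitives)
    return primitives[idx]

class MetaPolicy:
    """Toy meta-policy whose 'action' is a point in the program embedding space."""
    def __init__(self, dim=4, horizon=3, seed=0):
        self.dim = dim
        self.horizon = horizon
        self.rng = random.Random(seed)

    def act(self, state):
        # A trained policy would condition on the environment state; here we
        # sample a latent vector as a placeholder action.
        return [self.rng.random() for _ in range(self.dim)]

def compose_task_program(policy, initial_state=None):
    """Roll out the meta-policy for `horizon` steps and concatenate the
    decoded sub-programs into one composed task-solving program."""
    subprograms = []
    state = initial_state
    for _ in range(policy.horizon):
        z = policy.act(state)          # the action IS a program embedding
        subprograms.append(decode_program(z))
    return "; ".join(subprograms)

program = compose_task_program(MetaPolicy())
print(program)
```

Each meta-policy step selects an embedding, the decoder turns it into an executable sub-program, and the concatenation of the decoded sub-programs forms the final program returned to the environment.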