[ot][spam][crazy] lab1 docs was: lab1 was: draft: learning RL

Mon May 9 14:41:09 PDT 2022

Ok, that page was so short!!! Back to:

Lunar Lander Documentation:
https://www.gymlibrary.ml/environments/box2d/lunar_lander

> Action Space: Discrete(4)
Action is 1 of 4 integers
> Observation Shape: (8,)
Observation space is an unbounded 8-vector
> Observation High: [inf inf inf inf inf inf inf inf]
> Observation Low: [-inf -inf -inf -inf -inf -inf -inf -inf]
> Import: gym.make("LunarLander-v2")

> Description
> This environment is a classic rocket trajectory optimization problem. According to
> Pontryagin’s maximum principle, it is optimal to fire the engine at full throttle or turn it off.
> This is the reason why this environment has discrete actions: engine on or off.
Aww shouldn't the model learn this?

> There are two environment versions: discrete or continuous. The landing pad is always at
> coordinates (0,0). The coordinates are the first two numbers in the state vector. Landing
> outside of the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then
> land on its first attempt.
>
> To see a heuristic landing, run:
>
> python gym/envs/box2d/lunar_lander.py
Otherwise known as:
pip3 install 'gym[box2d]' && python3 -m gym.envs.box2d.lunar_lander # I think (quoted so shells like zsh don't glob the brackets)

> Action Space
> There are four discrete actions available: do nothing, fire left orientation engine, fire main
> engine, fire right orientation engine.
>
> Observation Space
> There are 8 states: the coordinates of the lander in x & y, its linear velocities in x & y, its
> angle, its angular velocity, and two booleans that represent whether each leg is in contact
> with the ground or not.
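
To keep the two spaces straight, here is a rough sketch of how I might
name the pieces. The names (Action, LanderState, decode_observation) are
my own and not part of gym. I am also assuming the four integers follow
the order the actions are listed in above, and I don't know which leg
comes first, so the last two fields are just leg1/leg2.

```python
from enum import IntEnum
from typing import NamedTuple, Sequence


class Action(IntEnum):
    # Assumption: 0..3 follow the order the docs list the actions in.
    NOTHING = 0
    FIRE_LEFT = 1
    FIRE_MAIN = 2
    FIRE_RIGHT = 3


class LanderState(NamedTuple):
    # Field names are mine; the docs only describe the 8 values in prose.
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    leg1_contact: bool  # which physical leg this is: unknown to me
    leg2_contact: bool


def decode_observation(obs: Sequence[float]) -> LanderState:
    """Unpack the 8-vector observation into named fields."""
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, angular_velocity, leg1, leg2 = obs
    return LanderState(
        float(x), float(y), float(vx), float(vy),
        float(angle), float(angular_velocity),
        bool(leg1), bool(leg2),
    )


if __name__ == "__main__":
    # A made-up observation, not one sampled from the real environment.
    state = decode_observation([0.0, 1.4, 0.1, -0.2, 0.05, 0.0, 1.0, 0.0])
    print(state)
    print(Action.FIRE_MAIN, int(Action.FIRE_MAIN))
```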

> Rewards
> Reward for moving from the top of the screen to the landing pad and coming to rest is
> about 100-140 points. If the lander moves away from the landing pad, it loses reward. If the
> lander crashes, it receives an additional -100 points. If it comes to rest, it receives an
> additional +100 points. Each leg with ground contact is +10 points. Firing the main engine
> is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200
> points.
This is very very similar to the text from huggingface's lab.
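
To get a feel for the numbers, here is a back-of-the-envelope tally using
only the constants quoted above. It is not how the environment computes
reward internally (I haven't checked that); the function names and the
example episode (120 points for the approach, 150 main-engine frames,
100 side-engine frames) are made up for illustration.

```python
MAIN_ENGINE_COST = 0.3    # points per frame, from the quoted docs
SIDE_ENGINE_COST = 0.03   # points per frame, from the quoted docs
LEG_CONTACT_BONUS = 10.0  # points per leg touching the ground
REST_BONUS = 100.0        # lander comes to rest
CRASH_PENALTY = -100.0    # lander crashes


def fuel_penalty(main_frames: int, side_frames: int) -> float:
    """Total (negative) reward spent on firing engines."""
    return -(MAIN_ENGINE_COST * main_frames + SIDE_ENGINE_COST * side_frames)


def rough_episode_score(approach_reward: float, main_frames: int,
                        side_frames: int, legs_down: int,
                        crashed: bool) -> float:
    """Very rough episode total.

    approach_reward is the "about 100-140" term, supplied by the caller.
    Assumption: an episode that does not crash ends at rest.
    """
    ending = CRASH_PENALTY if crashed else REST_BONUS
    return (approach_reward
            + fuel_penalty(main_frames, side_frames)
            + LEG_CONTACT_BONUS * legs_down
            + ending)


if __name__ == "__main__":
    score = rough_episode_score(120.0, main_frames=150, side_frames=100,
                                legs_down=2, crashed=False)
    print(f"rough score: {score:.1f} (solved threshold is 200)")
```

With these made-up numbers the fuel alone costs about 48 points, which
is enough to drop an otherwise clean landing below the 200 mark.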

> Starting State
> The lander starts at the top center of the viewport with a random initial force applied to its
> center of mass.

> Episode Termination
> The episode finishes if:
> 1. the lander crashes (the lander body gets in contact with the moon);
> 2. the lander gets outside of the viewport (x coordinate is greater than 1);
> 3. the lander is not awake. From the Box2D docs, a body which is not awake is a
> body which doesn’t move and doesn’t collide with any other body:

> > When Box2D determines that a body (or group of bodies) has come to rest, the body
> > enters a sleep state which has very little CPU overhead. If a body is awake and collides
> > with a sleeping body, then the sleeping body wakes up. Bodies will also wake up if a joint
> > or contact attached to them is destroyed.

> Arguments
> To use the continuous environment, you need to specify the continuous=True argument
> like below:
>
> > import gym
> > env = gym.make("LunarLander-v2", continuous=True)

The web page doesn't say what the continuous environment actually is
(what the actions look like, for instance). It seems the source code is
still a better resource than the documentation.

When installed with pip on Linux, the environment source is at
~/.local/lib/python3.*/site-packages/gym/envs/box2d/lunar_lander.py
for me. On the web, that's
https://github.com/openai/gym/blob/master/gym/envs/box2d/lunar_lander.py
.

It looks like the documentation on the web is not up to date or is
truncated for some reason. The documentation in the source code does
indeed continue:

>    If `continuous=True` is passed, continuous actions (corresponding to the throttle of the engines) will be used and the
>    action space will be `Box(-1, +1, (2,), dtype=np.float32)`.
>    The first coordinate of an action determines the throttle of the main engine, while the second
>    coordinate specifies the throttle of the lateral boosters.
>    Given an action `np.array([main, lateral])`, the main engine will be turned off completely if
>    `main < 0` and the throttle scales affinely from 50% to 100% for `0 <= main <= 1` (in particular, the
>    main engine doesn't work  with less than 50% power).
>    Similarly, if `-0.5 < lateral < 0.5`, the lateral boosters will not fire at all. If `lateral < -0.5`, the left
>    booster will fire, and if `lateral > 0.5`, the right booster will fire. Again, the throttle scales affinely
>    from 50% to 100% between -1 and -0.5 (and 0.5 and 1, respectively).
>    `gravity` dictates the gravitational constant, this is bounded to be within 0 and -12.
>    If `enable_wind=True` is passed, there will be wind effects applied to the lander.
>    The wind is generated using the function `tanh(sin(2 k (t+C)) + sin(pi k (t+C)))`.
>    `k` is set to 0.01.
>    `C` is sampled randomly between -9999 and 9999.
>    `wind_power` dictates the maximum magnitude of wind.
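
The throttle rules in that docstring are easier for me to follow as code.
This is only my reading of the quoted text, not the environment's actual
implementation; the function names are mine. The docstring doesn't say
what happens at exactly lateral = -0.5 or 0.5, so treating those as
"no fire" is my own choice.

```python
from typing import Optional, Tuple


def main_throttle(main: float) -> float:
    """Fraction of full power for the main engine.

    Off for main < 0, otherwise affine from 50% (main=0) to 100% (main=1).
    """
    if main < 0:
        return 0.0
    main = min(main, 1.0)  # the action space is Box(-1, +1)
    return 0.5 + 0.5 * main


def lateral_throttle(lateral: float) -> Tuple[Optional[str], float]:
    """Return (side, power) where side is 'left', 'right', or None."""
    lateral = max(-1.0, min(1.0, lateral))
    if abs(lateral) <= 0.5:  # boundary handling is my assumption
        return None, 0.0
    side = "left" if lateral < 0 else "right"
    # Affine from 50% at |lateral| = 0.5 to 100% at |lateral| = 1.
    return side, abs(lateral)


if __name__ == "__main__":
    for action in [(-0.3, 0.0), (0.0, 0.6), (0.5, -0.75), (1.0, 1.0)]:
        main, lateral = action
        print(action, main_throttle(main), lateral_throttle(lateral))
```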

So you can indeed give the agent a harder challenge by passing
continuous=True and/or enable_wind=True . As usual, they thought of
my concern.
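
To see what that wind formula looks like, here is a small sketch that
just evaluates tanh(sin(2 k (t+C)) + sin(pi k (t+C))) with k = 0.01.
Several things are my assumptions: that t counts steps, that C can be
drawn uniformly as a float, that wind_power simply multiplies the
formula (my reading of "maximum magnitude"), and the value 15.0, which
is only an illustrative number.

```python
import math
import random

K = 0.01  # from the quoted docstring


def wind_factor(t: float, c: float, k: float = K) -> float:
    """tanh(sin(2 k (t+C)) + sin(pi k (t+C))); always strictly inside (-1, 1)."""
    phase = k * (t + c)
    return math.tanh(math.sin(2 * phase) + math.sin(math.pi * phase))


def wind_force(t: float, c: float, wind_power: float) -> float:
    """Assumption: wind_power scales the factor, so |force| < wind_power."""
    return wind_power * wind_factor(t, c)


if __name__ == "__main__":
    rng = random.Random(0)        # fixed seed so reruns match
    c = rng.uniform(-9999, 9999)  # "C is sampled randomly between -9999 and 9999"
    forces = [wind_force(t, c, wind_power=15.0) for t in range(1000)]
    print(f"min {min(forces):.2f}, max {max(forces):.2f}")
```

Since 2 and pi are incommensurate, the two sines never line up
periodically, so the wind drifts back and forth without repeating.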

This appears to roughly be the full documentation of the LunarLander
environment (v2).