[ot][spam][crazy] lab1 docs was: lab1 was: draft: learning RL

Undiscussed Horrific Abuse, One Victim of Many gmkarl at gmail.com
Mon May 9 14:41:09 PDT 2022

Ok, that page was so short!!! Back to:

Lunar Lander Documentation:

> Action Space: Discrete(4)
The action is one of 4 integers.
> Observation Shape: (8,)
The observation is an unbounded 8-vector.
> Observation High: [inf inf inf inf inf inf inf inf]
> Observation Low: [-inf -inf -inf -inf -inf -inf -inf -inf]
> Import: gym.make("LunarLander-v2")

> Description
> This environment is a classic rocket trajectory optimization problem. According to
> Pontryagin’s maximum principle, it is optimal to fire the engine at full throttle or turn it off.
> This is the reason why this environment has discrete actions: engine on or off.
Aww, shouldn't the model learn this?

> There are two environment versions: discrete or continuous. The landing pad is always at
> coordinates (0,0). The coordinates are the first two numbers in the state vector. Landing
> outside of the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then
> land on its first attempt.
> To see a heuristic landing, run:
> python gym/envs/box2d/lunar_lander.py
Otherwise known as:
pip3 install "gym[box2d]" && python3 -m gym.envs.box2d.lunar_lander  # I think

> Action Space
> There are four discrete actions available: do nothing, fire left orientation engine, fire main
> engine, fire right orientation engine.
> Observation Space
> There are 8 states: the coordinates of the lander in x & y, its linear velocities in x & y, its
> angle, its angular velocity, and two booleans that represent whether each leg is in contact
> with the ground or not.
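To make the action and observation layout concrete, here is a minimal plain-Python sketch that names the four actions and unpacks the 8-vector. It assumes the action integers follow the order the docs list them in (0 = do nothing, 1 = left engine, 2 = main engine, 3 = right engine). It also assumes the observation fields appear in the order described above. `ACTION_NAMES`, `LanderObservation` and `describe` are hypothetical helpers, not part of gym.

```python
from typing import NamedTuple, Sequence

# Assumption: integer actions follow the order the docs list them in.
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left orientation engine",
    2: "fire main engine",
    3: "fire right orientation engine",
}


class LanderObservation(NamedTuple):
    """Hypothetical named view of the 8-vector (assumed field order)."""
    x: float
    y: float
    vx: float
    vy: float
    angle: float
    angular_velocity: float
    left_leg_contact: bool
    right_leg_contact: bool

    @classmethod
    def from_vector(cls, obs: Sequence[float]) -> "LanderObservation":
        if len(obs) != 8:
            raise ValueError(f"expected 8 values, got {len(obs)}")
        x, y, vx, vy, angle, ang_vel, left, right = obs
        # Assumption: leg contacts are encoded as 0.0 / 1.0 inside the vector.
        return cls(x, y, vx, vy, angle, ang_vel, bool(left), bool(right))


def describe(action: int) -> str:
    """Return a readable name for a discrete action, rejecting bad values."""
    if action not in ACTION_NAMES:
        raise ValueError(f"action must be one of {sorted(ACTION_NAMES)}")
    return ACTION_NAMES[action]
```

For example, `LanderObservation.from_vector([0.0, 1.4, 0.1, -0.2, 0.05, 0.0, 0.0, 1.0]).right_leg_contact` is `True`, and `describe(2)` gives `"fire main engine"`.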

> Rewards
> Reward for moving from the top of the screen to the landing pad and coming to rest is
> about 100-140 points. If the lander moves away from the landing pad, it loses reward. If the
> lander crashes, it receives an additional -100 points. If it comes to rest, it receives an
> additional +100 points. Each leg with ground contact is +10 points. Firing the main engine
> is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200
> points.
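The fixed bonuses and penalties above lend themselves to a quick tally. This is a simplified sketch that counts only the listed terms. It ignores the distance, velocity and angle shaping implied by "if the lander moves away from the landing pad, it loses reward", so it will not reproduce gym's actual episode returns. `fixed_reward_terms` and the constant names are hypothetical.

```python
CRASH_PENALTY = -100.0
REST_BONUS = 100.0
LEG_CONTACT_BONUS = 10.0   # per leg touching the ground
MAIN_ENGINE_COST = -0.3    # per frame the main engine fires
SIDE_ENGINE_COST = -0.03   # per frame a side engine fires
SOLVED_THRESHOLD = 200.0   # "Solved is 200 points"


def fixed_reward_terms(main_frames: int, side_frames: int, legs_down: int,
                       crashed: bool, came_to_rest: bool) -> float:
    """Sum only the fixed reward terms listed in the docs (no shaping)."""
    if not 0 <= legs_down <= 2:
        raise ValueError("the lander has two legs")
    total = main_frames * MAIN_ENGINE_COST + side_frames * SIDE_ENGINE_COST
    total += legs_down * LEG_CONTACT_BONUS
    if crashed:
        total += CRASH_PENALTY
    if came_to_rest:
        total += REST_BONUS
    return total
```

For instance, 100 frames of main engine, 50 frames of side engine, both legs down and coming to rest gives about -30 - 1.5 + 20 + 100 = 88.5. Fuel is cheap compared with the crash and rest terms.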
This is very similar to the text from Hugging Face's lab.

> Starting State
> The lander starts at the top center of the viewport with a random initial force applied to its
> center of mass.

> Episode Termination
> The episode finishes if:
> 1. the lander crashes (the lander body gets in contact with the moon);
> 2. the lander gets outside of the viewport (x coordinate is greater than 1);
> 3. the lander is not awake. From the Box2D docs, a body which is not awake is a
> body which doesn’t move and doesn’t collide with any other body:

> > When Box2D determines that a body (or group of bodies) has come to rest, the body
> > enters a sleep state which has very little CPU overhead. If a body is awake and collides
> > with a sleeping body, then the sleeping body wakes up. Bodies will also wake up if a joint
> > or contact attached to them is destroyed.
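The three termination conditions can be read as one predicate. This sketch passes the "crashed" and "awake" states in as plain booleans, whereas in the real environment they come from Box2D. It also assumes the viewport test is symmetric (|x| >= 1), since the pad sits at x = 0 and the lander can leave through either side. The quoted text only mentions x greater than 1. `termination_reason` is a hypothetical helper.

```python
from typing import Optional


def termination_reason(x: float, crashed: bool, awake: bool) -> Optional[str]:
    """Return why the episode would end, or None if it continues.

    Assumption: leaving the viewport means |x| >= 1 (either side).
    """
    if crashed:
        return "crashed"
    if abs(x) >= 1.0:
        return "left viewport"
    if not awake:
        # Box2D put the body to sleep: it has come to rest.
        return "at rest (body asleep)"
    return None
```

The checks are ordered so that a crash is reported ahead of the other two reasons when several hold at once.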

> Arguments
> To use the continuous environment, you need to specify the continuous=True argument
> like below:
> > import gym
> > env = gym.make("LunarLander-v2", continuous=True)

They don't say what the continuous environment is. It seems like
source code is still a better resource than documentation.

When installed with pip on Linux, the environment source is at
for me. On the web, that's

It looks like the documentation on the web is not up to date or is
truncated for some reason. The documentation in the source code does
indeed continue:

>    If `continuous=True` is passed, continuous actions (corresponding to the throttle of the engines) will be used and the
>    action space will be `Box(-1, +1, (2,), dtype=np.float32)`.
>    The first coordinate of an action determines the throttle of the main engine, while the second
>    coordinate specifies the throttle of the lateral boosters.
>    Given an action `np.array([main, lateral])`, the main engine will be turned off completely if
>    `main < 0` and the throttle scales affinely from 50% to 100% for `0 <= main <= 1` (in particular, the
>    main engine doesn't work  with less than 50% power).
>    Similarly, if `-0.5 < lateral < 0.5`, the lateral boosters will not fire at all. If `lateral < -0.5`, the left
>    booster will fire, and if `lateral > 0.5`, the right booster will fire. Again, the throttle scales affinely
>    from 50% to 100% between -1 and -0.5 (and 0.5 and 1, respectively).
>    `gravity` dictates the gravitational constant, this is bounded to be within 0 and -12.
>    If `enable_wind=True` is passed, there will be wind effects applied to the lander.
>    The wind is generated using the function `tanh(sin(2 k (t+C)) + sin(pi k (t+C)))`.
>    `k` is set to 0.01.
>    `C` is sampled randomly between -9999 and 9999.
>    `wind_power` dictates the maximum magnitude of wind.
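The throttle rules in that docstring can be written out directly. This is a sketch derived only from the quoted text, not from gym's code. Inputs are clipped to the Box(-1, +1) bounds. Throttles are returned as fractions of full power, with 0.0 meaning off. The exact boundaries are assumptions: main == 0 counts as 50% because the text says `0 <= main`, and lateral == ±0.5 counts as off. The real implementation may treat these edges differently. `decode_continuous_action` is hypothetical.

```python
from typing import Optional, Tuple


def _clip(value: float, low: float = -1.0, high: float = 1.0) -> float:
    return max(low, min(high, value))


def decode_continuous_action(
        main: float, lateral: float) -> Tuple[float, Optional[str], float]:
    """Map [main, lateral] to (main_throttle, booster_side, booster_throttle)."""
    main, lateral = _clip(main), _clip(lateral)

    # Main engine: off below 0, else affine from 50% (main=0) to 100% (main=1).
    main_throttle = 0.0 if main < 0.0 else 0.5 + 0.5 * main

    # Lateral boosters: affine from 50% at |lateral|=0.5 to 100% at |lateral|=1,
    # which simplifies to abs(lateral).
    if lateral < -0.5:
        return main_throttle, "left", abs(lateral)
    if lateral > 0.5:
        return main_throttle, "right", abs(lateral)
    # Assumption: the dead zone includes exactly +/-0.5.
    return main_throttle, None, 0.0
```

For example, `decode_continuous_action(0.5, 0.75)` runs the main engine at 75% and the right booster at 75%. The minimum-power rule means a "gentle" action near 0 still fires at half power.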

So you can indeed give the agent a harder challenge by using
continuous=True and/or enable_wind=True. As usual, they thought of
my concern.
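The wind formula from the docstring can be evaluated on its own. This sketch uses k = 0.01 and draws C uniformly from [-9999, 9999] as quoted. Two parts are assumptions. First, t is treated as a simple step counter. Second, the tanh term, which lies in (-1, 1), is multiplied by `wind_power`, reading "wind_power dictates the maximum magnitude of wind" literally. `make_wind` is hypothetical.

```python
import math
import random
from typing import Callable, Optional


def make_wind(wind_power: float, k: float = 0.01,
              seed: Optional[int] = None) -> Callable[[float], float]:
    """Return a function t -> wind following tanh(sin(2k(t+C)) + sin(pi k(t+C))).

    Assumption: the result is scaled by wind_power, so |wind| < wind_power.
    """
    rng = random.Random(seed)
    c = rng.uniform(-9999, 9999)  # random phase offset C

    def wind(t: float) -> float:
        phase = k * (t + c)
        return wind_power * math.tanh(math.sin(2 * phase) + math.sin(math.pi * phase))

    return wind
```

Because the two frequencies 2k and πk have an irrational ratio, their sum never repeats exactly. The random C also makes each episode start at a different point of the pattern.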

This appears to be roughly the full documentation of the LunarLander
environment (v2).
