I think it is pretty good now.
In the file ...\applications\ndSandbox\demos\ndBipedTest_2.cpp, at the top, there is a define:
#define _TEST_ONE_FUTURE_STE_ACTION
that shows how the AI controller starts training.
If you run the test, it will populate the replay buffer with random actions. There you can see that since each action has an equal chance of going forward or backward, the max angle is never reached.
But the model is not balanced, and it simply topples to the floor.
This part is how the training starts and how the actions are generated.
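To make that concrete, here is a minimal sketch (not the actual demo code; the struct and names are hypothetical) of how the exploration phase fills a replay buffer with uniformly random actions:

#include <vector>
#include <random>

struct Transition
{
    std::vector<float> state;      // joint angles, velocities, etc.
    int action;                    // discrete action index
    float reward;                  // e.g. +1 for every step survived
    std::vector<float> nextState;
    bool terminal;                 // true when the model topples
};

int main()
{
    std::mt19937 rng(42);
    // back, do nothing, forward: equal probability for each, which is
    // why the max angle is never reached during pure exploration
    std::uniform_int_distribution<int> pick(-1, 1);

    std::vector<Transition> replayBuffer;
    for (int step = 0; step < 1000; ++step)
    {
        Transition t;
        t.action = pick(rng);
        t.reward = 1.0f;           // survival reward for this step
        t.terminal = false;        // would be set when the biped falls
        replayBuffer.push_back(t);
    }
    return 0;
}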
Then, after the replay buffer has enough actions, it tries to take actions from the replay buffer more often than picking randomly.
What happens is that the reward function is the sum of rewards over the future actions for as long as the player survives. Since short action sequences kill the model very soon, the replay buffer will have more long sequences of good actions than short ones, so it will optimize those.
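As a rough sketch of that idea (the function name and discount factor are mine, not from the demo), the score of a stored sequence can be computed as the discounted sum of its per-step survival rewards:

#include <cstdio>
#include <vector>

// Sum of per-step survival rewards, discounted by gamma; longer
// sequences accumulate more reward, so the optimizer favors them.
float SequenceReturn(const std::vector<float>& rewards, float gamma)
{
    float ret = 0.0f;
    float discount = 1.0f;
    for (float r : rewards)
    {
        ret += discount * r;
        discount *= gamma;
    }
    return ret;
}

int main()
{
    std::vector<float> shortRun(10, 1.0f);   // toppled after 10 steps
    std::vector<float> longRun(200, 1.0f);   // balanced for 200 steps
    std::printf("short: %f  long: %f\n",
        SequenceReturn(shortRun, 0.99f), SequenceReturn(longRun, 0.99f));
    return 0;
}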
This part is not in yet, but to emulate it, uncomment the define. The controller will simply select the action with the largest reward, and what it does is make the model oscillate around the equilibrium, each time making the angle grow larger.
This is a very remarkable result.
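Here is a hedged sketch of that one-step greedy emulation (the reward stub is invented for illustration; the real demo evaluates the physics state):

#include <array>

// Hypothetical one-step reward: predict the lean angle after the action
// and penalize any distance from upright.
static float g_leanAngle = 0.1f;

float EvaluateImmediateReward(int action)
{
    float predicted = g_leanAngle + 0.05f * float(action);
    return -predicted * predicted;
}

// Pick the action whose immediate reward is largest, with no look-ahead.
// Always chasing the best immediate correction is what produces the
// growing oscillation around the equilibrium.
int SelectGreedyAction()
{
    const std::array<int, 3> actions = { -1, 0, 1 }; // back, nothing, forward
    int best = actions[0];
    float bestReward = EvaluateImmediateReward(best);
    for (int a : actions)
    {
        float r = EvaluateImmediateReward(a);
        if (r > bestReward)
        {
            bestReward = r;
            best = a;
        }
    }
    return best;
}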
If we were using control theory, there are equations that can be used to determine if the system is stable. But for almost 20 years I tried that, and even when getting a model to balance, it is so complex that the slightest change causes lots of code rewrites and convoluted logic, and after a while it is difficult to understand.
What the ML approach is expected to do is figure out that the action with the best immediate reward is not always the best one; it is possible to take a few steps in the wrong direction and then, after some time in the future, make a turn. This is what the Bellman equation, which is at the root of all reinforcement learning, shows.
https://en.wikipedia.org/wiki/Bellman_equation
Basically, the neural net is a way to encode the Bellman equation so that, given a present state, the machine selects the action with the maximum expected reward.
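For reference, in standard textbook notation (the notation is mine, not from the demo code), the Bellman optimality equation for the action-value function is:

Q(s, a) = r(s, a) + gamma * max over a' of Q(s', a')

where s' is the state reached by taking action a in state s, and gamma is the discount factor. The net approximates Q, and the controller simply picks the action a that maximizes Q(s, a) for the present state s.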
An example would be: say the model is really close to balance, and the COM is aligned with the ZMP.
But to get there, in the previous step it had to give a push to the hip, so now the hip has a large velocity.
If we evaluate the present state, the action with the largest immediate reward will be to move in the opposite direction, but from previous experience the neural net now knows that if it does that, the model will overshoot. So the evaluation from the net will come up with "do nothing" as the action with the highest expected reward.
It is very difficult to do that with control logic.
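A toy illustration of that situation (all numbers invented):

#include <cstdio>

int main()
{
    // hypothetical learned Q-values for the state
    // { lean angle near zero, hip velocity toward upright }
    float qPushBack  = 0.40f; // best immediate reward, but it overshoots
    float qDoNothing = 0.85f; // momentum already carries the COM over the ZMP
    float qPushFwd   = 0.10f;
    const char* best = (qDoNothing > qPushBack && qDoNothing > qPushFwd)
        ? "do nothing" : "push";
    std::printf("highest expected reward: %s\n", best);
    return 0;
}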
Anyway, if you play with and without the define, you can see the difference. But the trained model only searches one step ahead; it is just a proof of concept for continuing to the rest of the DQN algorithm.