Script of the talk:

Online robot learning by reward and punishment for a mobile robot


D. Suwimonteerabuth and P. Chongstitvatana

presented at Int. Conf. on Intelligent Robots and Systems, Switzerland, 1-4 October 2002.

by P. Chongstitvatana

IROS 2002

<intro>
Good morning, ladies and gentlemen.  I come from Chulalongkorn University, situated in Bangkok, the capital city of Thailand.

<page 2>
I am going to describe my work in robot learning.  In this work, we try to find a flexible way for a robot to learn a new task without using "built-in" knowledge.  What I mean by "built-in" knowledge is some form of explicit representation of the task that is available to the robot.  Without knowledge about the task, the robot is said to have no goal.  This is a very important distinction of our work, because when the robot does not know its goal "a priori", it can be taught to do many tasks.  This is the flexibility we want.

So, how does the robot learn to do a task?  We use a human to teach the robot the desired behaviour.  The robot is situated in the real world; a human trainer observes its behaviour and gives reward or punishment to direct the robot towards the desired behaviour.  That is our scheme.

<page 3>
We have done a number of experiments to test the idea.  The learning mechanism is as follows:  the robot is controlled by Finite State Machines (FSMs), and a Genetic Algorithm (GA) is used to evolve them.  Evolution is suitable for this task because it is a continuous process and it can integrate a human trainer into the learning process quite naturally.  The human trainer gives reward and punishment that affect the evolution process directly.
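To make this concrete, here is a minimal sketch, in Python, of such an online evolutionary loop.  It is an illustration, not our actual implementation: the pool size, the evaluation time, the +1/-1 fitness signals, the placeholder genome and genetic operators, and the robot and radio interfaces are all assumptions; the actual FSM encoding is described on a later slide.

    import random, time

    POOL_SIZE = 20      # controllers in the gene pool (assumed here)
    EVAL_SECONDS = 30   # each controller drives the robot for this long (assumed here)

    def random_fsm():
        # Placeholder genome: a flat list of small integers.
        # The actual FSM encoding is described on the controller slide.
        return [random.randrange(32) for _ in range(16)]

    def crossover(a, b):
        # One-point crossover of two genomes.
        cut = random.randrange(1, len(a))
        return a[:cut] + b[cut:]

    def mutate(genome):
        # Replace one randomly chosen entry with a new random value.
        genome = list(genome)
        genome[random.randrange(len(genome))] = random.randrange(32)
        return genome

    def evaluate(fsm, robot, radio):
        # Run one FSM on the robot and accumulate the trainer's feedback.
        fitness = 0
        deadline = time.time() + EVAL_SECONDS
        while time.time() < deadline:
            robot.step(fsm)           # FSM reads the sensors and drives the motors
            signal = radio.poll()     # +1 for reward, -1 for punishment, or None
            if signal is not None:
                fitness += signal
        return fitness

    def learn(robot, radio):
        # Online GA: evolution never stops, so the behaviour keeps being shaped.
        pool = [random_fsm() for _ in range(POOL_SIZE)]
        while True:
            scored = sorted(((evaluate(fsm, robot, radio), fsm) for fsm in pool),
                            key=lambda pair: pair[0], reverse=True)
            survivors = [fsm for _, fsm in scored[:POOL_SIZE // 2]]
            offspring = [mutate(crossover(*random.sample(survivors, 2)))
                         for _ in range(POOL_SIZE - len(survivors))]
            pool = survivors + offspring

The point of the sketch is that the trainer's signals are the only fitness the GA ever sees, so the goal exists only in the trainer's head and never inside the robot.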

<page 4>
The method of the experiment is as follows.  This is the mobile robot used in the experiment.  It has infrared and color sensors on its belly to distinguish the color of the floor.  The arena is a rectangular floor, about 2 x 2 meters, surrounded by walls.  The robot can move around and stays within this area.  The floor is divided equally into two areas differentiated by color:  one half is black and the other is white.  The task we want the robot to learn is to stay in the designated color specified by the trainer.

<page 5>
Each controller is an FSM with eight states.  The state transitions and outputs are not specified in advance and will be evolved by the GA.  The fitness function is controlled by the trainer, who sends signals of reward and punishment via a radio transmitter to the robot.  There is a two-bit output to control four motions: moving forward, moving backward, turning left, and turning right.
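As an illustration, this is one way such a controller can be encoded.  It is a sketch that assumes a single one-bit input (the floor color under the robot), which is a simplification of the actual sensor set; the names here are illustrative, not taken from our code.

    import random

    N_STATES = 8
    ACTIONS = ["forward", "backward", "turn_left", "turn_right"]   # the two-bit output

    def random_controller():
        # For every (state, input) pair the genome holds a next state and an action.
        # Both are unspecified at the start, so we fill them in at random for the GA to evolve.
        return [[(random.randrange(N_STATES), random.randrange(len(ACTIONS)))
                 for _input_bit in range(2)]          # 0 = black floor, 1 = white floor
                for _state in range(N_STATES)]

    def control_step(controller, state, on_white):
        # One control step: look up the transition for the current state and sensor reading.
        next_state, action = controller[state][int(on_white)]
        return next_state, ACTIONS[action]

So a genome is just an 8-by-2 transition table, and crossover and mutation only rearrange and perturb its entries.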

<page 6>
We studied the factors that are relevant to the quality of learning.

One, what is the effect of giving reward and punishment?
Two, the different initial conditions: the starting position, whether the robot starts in the white or the black area.
Three, the number of reward and punishment signals given.
Four, the size of the gene pool.
Finally, do different trainers give different results?

We let the robot run continuously in the arena.  The gene pool size is 20.  Initially, all FSMs are randomly generated.  Each FSM runs for 30 sec.  The fitness function is controlled by the trainer sending signals of reward and punishment  to the robot. Each experiment is repeated a couple of times, 60 minutes each.
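As a quick check on the real-time cost, and assuming the whole pool is evaluated once per generation, 20 FSMs at 30 seconds each is about 10 minutes per generation, so a 60-minute experiment gives the GA only about six generations to work with.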

<page 7>
The measure of performance is the time that the robot stays in the designated color.  In the first experiment, we establish the baseline result by letting the robot run without giving reward or punishment.  There are 3 runs.  You can see the random drift of the performance.

<page 8>
In the second experiment, we start giving the reward and punishment.  At the start, the robot is placed on the white side.  At first, the robot stays mostly on the white side.  Around 15 minutes, it drifts to the black side.  By 25 minutes, it learns to move to the white side, and by 45 minutes it stays on the white side almost all the time.  The red line shows the average of 3 runs.

What happened?  Initially, by the randomness of the FSMs, the robot moves to the black side.  Some FSMs exhibit "jittering" motions, that is, moving back and forth or moving in a small circle; hence, the robot stays on the black side.  The trainer starts giving punishment; around 25 minutes the robot moves and the reward is given.  Once it is on the white side, the reward is given to the behaviours that move less, hence the robot stays on the white side.

In the third experiment, the starting position is in the black area.  The robot is able to learn the task of moving to and staying in the white area.

<page 9>
In the fourth experiment, the number of rewards and punishments is limited, to 3 and to 6 per 30 seconds.  The rate of learning is reduced compared to the unlimited version.

In the fifth experiment, we want to find the effect of the size of the gene pool.  The size is varied among 20, 10, and 5.  The size of the gene pool affects the convergence rate of the GA: a larger pool has more diverse genes, but in the real-time context it also takes longer to evolve a solution.  For the size of 5, the learning rate is low.  This can be interpreted as the gene pool being too small for the GA to find good solutions.  The size of 10 exhibits the best learning rate, as in real time the smaller pool is faster to adapt.

<page 10>
And in the last experiment, we use 3 different trainers.  The result shows that two of them are good trainers; the third person is not.  This shows clearly that the ability of the human trainer affects the success.

<page 11>
In conclusion, we have made an empirical study of learning by reward and punishment.  The results show that the robot can learn quite successfully.  The robot learns flexibly: its behaviour can be shaped in real time and continuously.  And lastly, we find that it is a tedious and non-trivial job to train the robot this way.

Thank you.