Building a Light-Seeking Robot with Q-Learning
Date: Apr 19, 2002
Article is provided courtesy of Prentice Hall.
Q-Learning is a well-known algorithm that allows machines to learn while unsupervised. The Lego Mindstorms kit, along with leJOS, contains everything needed to implement this fascinating algorithm. This article demonstrates how to build a robot that will learn to seek out a bright light.
One of the most powerful aspects of Lego Mindstorms is that it can be programmed to do whatever we want it to do. This can be interesting, but often these types of projects are very predictable. Rather than doing what we tell it to do, an even more fascinating robot would have the ability to learn on its own. The field of AI has produced numerous algorithms for learning. There are essentially two large subdivisions of learning (which apply to animals as well as robots): supervised learning and unsupervised learning.
Supervised learning is often accomplished using neural networks. In a supervised learning situation, a robot is given input/output pairs and, after many examples, it develops its own function that can decide what to do with a given input. For example, a computer hooked up to a camera could be shown a series of satellite photographs of a forest. Some of the pictures could contain tanks hiding in the trees, and others could be regular unoccupied forest. One photo after another, the robot is shown a picture and told whether or not tanks are present in the scene. Once the teaching process is done, a new picture is shown to the robot and it tries to identify whether or not a tank is present. This type of problem is ideal for neural networks. The robot in this situation is learning passively; that is, after each photograph is shown, it doesn't take an action or make any statements. It just sits back and learns.
Even more interesting than supervised learning is unsupervised learning. This type of robot receives feedback from each action it performs, which allows it to judge how effective the action was. The feedback is extracted from the environment, either through sensors or internal states such as counting. This feedback is then classified as a reward (or reinforcement). An algorithm decides the value of the reward, which can be either positive or negative. These built-in rewards are very similar to the instincts and feelings that guide humans and other animals. A small sampling of reinforcements that guide your typical day are hunger, pain, enjoyment of food, and sensing cold temperatures.
There are three main advantages of reinforcement learning:
Very little programming is required because the robot figures out the algorithm itself.
If the environment changes, it doesn't need to be reprogrammed. Even if the robot design is altered, it will relearn the optimal algorithm.
If the learning algorithm is properly designed, the robot is guaranteed to find the most efficient policy.
Reinforcement learning shines when given a complex problem. Any problem with many different states and actionsso many that it is complicated for humans to fathomis ideal for reinforcement learning. In robotics, if you want to program a six-legged walking robot, you need to understand which direction each of the motors turns, you need to pay attention to the sensors that indicate leg position relative to the others, and you need to pay attention to a myriad of physical conditions such as balance. This can be downright complex because a simple pair of wires that are reversed could throw everything off. With reinforcement learning, the robot can sit there experimenting with different walking gaits, measure how far a gait has caused it to move, and the best gait will reveal itself with enough reinforcement. The user could then change the length of the robot legs, change motor sizes, and reverse wires; and the robot will readapt to the new hardware. If the walking-algorithm were manually programmed everything would need to be reprogrammed.
There are two types of unsupervised reinforcement learning. The first requires a model of the world so it can make proper decisions. For example, a self-learning chess program would need to know the position of all the pieces and all the available moves to both players in order to make an informed decision. This can be complex because it needs to keep many statistics. The second type uses an action-value model, which creates a function to deal with different states. This is known as Q-Learning.
The rest of this article will reveal more about Q-Learning, including the algorithm and the parts that make up that algorithm. This includes building and programming a real Lego Mindstorms robot with Java. The result will be a light-seeking robot that uses Q-Learning to produce a light-seeking algorithm.
The Q-Learning Algorithm
A Q-Learning robot can determine the value of an action right after the action is performed, and doesn't need to know about the larger world model. It just needs to know the available actions for each step. Because it requires no model, it is much simpler to program than other learning algorithms.
Q-Learning values are built on a reward scheme. We need to design a reward algorithm that will motivate our robot to perform a goal-oriented behavior. For this project, we'll create a goal-based robot that is rewarded for finding brighter areas of light. This turns out to be very easy to do, using the following criteria:
Goal: Approach Light. The value of the current light reading minus the last light reading determines the reward (greater increase = greater reward). So if the current light reading is 56 and the previous light reading was 53, it receives a reward of +3.
Goal: Avoid Obstacles. If one of the bumpers is pressed, the reward is -2.
Goal: Avoid Staying Still. If the light reading hasn't changed in the last five steps, it receives a negative reward of -2. Presumably, if the robot is receiving identical light readings for five or more steps in a row, it is hung up or not moving.
So how are the actual Q-Values calculated? Basically we just need an equation that increases the Q-Value when a reward is positive, decreases the value when it is negative, and holds the value at equilibrium when the Q-Values are optimal. The equation is as follows:
where the following is true:
NOTE
This calculation must occur after an action has taken place, so the robot can determine how successful the action was (hence, why previous action and previous state are used).
In order to implement this algorithm, all movement by the robot must be segregated into steps. Each step consists of reading the percepts, choosing an action, and evaluating how well the action performed. All Q-values will presumably be equal to zero for the first step, but during the next step (when the algorithm is invoked), it will set a Q-Value for Q(a,i) that will be a product of the reward it received for the last action. So as the robot moves along, the Q-values are calculated repeatedly, gradually becoming more refined (and more accurate).
In order to better understand the overall flow of the program, it would be useful to examine this in abstract. The abstract algorithm would look something like this:
repeat(forever): input = getInputs() action = chooseAction(inputs, qvalues) apply(action) qvalues=updateQvalues(qvalues, getFeedback() )
Light Seeking
Let's try to design a Q-Learning project for the Lego Robotics Invention System (RIS). In order to take useful measurements, the robot will need to be able to sample two light value readings, one to the left and one to the right, and then contrast them. The robot will also have two bumpers, both front and rear, in order to detect when it is hung up on something. The total input this robot will have to work with is summarized in the following table.
|
Sensor |
Function |
|
1 |
Front touch sensor |
|
2 |
Light sensor |
|
3 |
Rear touch sensor |
Normally, light-seeking robots contain two light sensors, each pointing in slightly different directions so it can compare the two values and head toward the strongest value. The RIS kit only comes with a single light sensor, so we will have to simulate two sensors. We can "fake" two sensors by making the entire robot rotate left, take a reading, and then rotate right and take another.
NOTE
Lego has other options for comparing two light readings.
Now that we know the function of the robot, we can proceed with the design. We'll make a simple tank called the Q-Tank ("Cute-Tank" because it's pocket-sized). The complete directions can be found here.
Figure 1 The Q-Tank.
States and Actions
The sensors on the robot are used to describe the state of the robot. The state is just a current summary of all the perceptions (known as percepts) that a robot is monitoring. There can be internal state information as well, such as the robot keeping track of the number of collisions it has had. Q-Tank will keep just one internal valuethe number of times in a row that the light sensor value has remained unchanged. The purpose of this value is to determine if the robot is just sitting still. Obviously, we want to motivate the robot to keep moving. If the value is not changing, then it is likely not doing enough to home in on the target; or it might be stuck in a corner and the bumper failed to activate.
Q-Learning is a little bit like statistics-gathering. In a nutshell, the robot keeps a table of statistics that describe how successful an action is when presented with a state. All actions and all states are identified by a single number. For example, when state 4 occurs (which might mean the left value is greater than the right, and the forward bumper is pressed) the robot might try action 6 (which happens to be rotating left). The Q-values are all stored in a table, accessible by Q(action, state). So, if we looked up our example in the table, we might see that Q(4,6) has a Q-Value of 2.397.
Q-Tank is an active learner, so every time it is given a state it tries out an action; then, after the action is performed, it evaluates how successful it was by using the reward function. Let's take a simplified example: a robot that can only discern 2 states: bumper pressed or not pressed. It only has one motor that can do two things: forward or backward. The entire Q-table for this robot would be as follows (the numbers I used to fill in the table were taken at random):
|
|
Action 1 |
Action 2 |
|
State 1 |
-0.5 |
1.5 |
|
State 2 |
2.99 |
-2 |
For every step the robot takes, it can look up the state in the table. In order for it to pick the best action available, it should pick the action that will give it the highest Q-value (that is, the one that in the past has produced the highest reward). In the above table, given state 2 it makes the most sense to perform action 1.
So, what are the actions and states for this project? Q-Tank has two motors that are hooked up to output port A and C. These motors in turn spin two wheels. Each motor is capable of three functions: forward, backward, and stop. These functions create nine unique actions that Q-Tank can perform, as shown in the following table.
|
Action ID |
Motors A, C |
Action Description |
|
1 |
stop, stop |
Stop |
|
2 |
stop, fwd |
Lazy left turn fwd |
|
3 |
stop, rvs |
Lazy right turn rvs |
|
4 |
fwd, stop |
Lazy right turn fwd |
|
5 |
fwd, fwd |
Forward |
|
6 |
fwd, rvs |
Sharp right turn |
|
7 |
rvs, stop |
Lazy left turn rvs |
|
8 |
rvs, fwd |
Sharp left turn |
|
9 |
rvs, rvs |
Back up |
Now what about states? There are four separate criteria for the states, as mentioned above: the contrasted light value, front bumper, rear bumper, and whether or not the readings have repeated five times or more. More than five identical readings in a row likely indicate that the robot is stuck, so a negative reinforcement will cause it to attempt to become unstuck.
|
Percept |
1 |
2 |
3 |
|
Difference of Light |
L > R |
R > L |
L = = R |
|
Front Bumper |
pressed |
unpressed |
|
|
Rear Bumper |
pressed |
unpressed |
|
|
Repeating > 5 times |
no |
yes |
|
So there are 3 x 2 x 2 x 2 combinations of percepts, or 24 unique states Q-Tank can identify. For programming purposes, we will encode each percept with a value: 0 or 1 for bumpers; 0,1,2 for light comparisons; and 0 or 1 for repeating. I won't list the entire table, but a partial list of states is as follows:
|
State |
Percepts |
|
1 |
0,0,0,0 |
|
2 |
0,0,0,1 |
|
3 |
0,0,1,0 |
|
4 |
0,0,1,1 |
|
5 |
etc. |
With 24 states and nine actions, our Q-table will be quite large, with a grand total of 216 Q-Values. I should be clear that the actual Q-Learning algorithm does not need to know the details of the state or action; it just needs an identification number that represents a state or action. For example, if it knows it is in state 7, it might decide to try action 5. This is like ordering food by number from a menu just give the number and judge how good it was.
Notice Q(a1 , j) in the algorithm above. This will choose the maximum Q-Value out of all available Q-Values for the state. The implications of this can be hard to understand, so stay with me. Let's say the robot is in state 8. It looks up the Q-values in the table, and sees that there are two actions that have the same Q-Value (action 1 and action 2), so it is likely they produce the same immediate reward. Action 1 delivers it to state 5 (see Figure 2), whereas action 2 delivers it to state 10. Once in state 5, the best action (a3) will deliver a reward of only 0.5. In contrast, if it had gone to state 10, the best action would deliver a reward of +2.5! So as we can see, it is better to go to state 10 than state 5, even though the immediate rewards for both are equal. By including Q(a1 , j) in the Q-Learning equation above, the algorithm becomes aware of what lies ahead (given enough practice). This shows that the Q-Learning algorithm is much more powerful than it first appears. It actually has a bit of a capability to look ahead at the possible rewards. Given enough steps, this algorithm will converge on the optimal action for every state in the state space.
Figure 2 The powerful Q-Learning algorithm.
Another interesting aspect of the Q-Learning equation is the learning rate (ß in the equation above). This constant determines the rate of convergence on the optimal algorithm. It's important to strike the right balance between speedy learning and giving the Q-Value history more weight. If we use a value that is too high, each new step it takes will overshadow the previous Q-Value history and make the robot seem like it has no memory of previous situations. If the learning rate is too small, then our robot will be a slow learner.
If we accepted this algorithm as described up to this point, it would not perform optimally. The problem is, when the robot starts, all the Q values are set to zero. As soon as it performed an action that gave it any reward, it would be satisfied with that action as the maximum action for a given state. So even if the reward was a paltry 0.1, it is still larger than 0, and hence this action would be picked over all the others because the robot has not explored any others yet. So how do we get the robot to explore other options it hasn't yet tried? By choosing a random action once in a while. We assign the robot an exploration constant of between 0 and 1. If the exploration constant is 0.1, then 10% of the time it tries a random action rather than the maximum action in the table.
Another way to make the algorithm less predictable is to seed the table with random values at the start, rather than leaving all the values at 0. The values will all eventually migrate to their proper values if given enough steps, so this does not adversely affect the performance of Q-Learning.
Evaluating the Project
Q-Tank seems to work quite well, but it is not as interesting to watch as I had hoped. The robot takes about 60 steps to build up reasonable Q-Values, but after awhile it begins acting rationally and attempts to face the brightest light in the room. It's not perfect, though. The reality is that it has a hard time actually driving toward a light because the light reading doesn't tend to increase much when moving toward the light source. We could make the light sensor more light-sensitive. The Lego light sensor has the option of reading the raw value, which produces a range of 0 to 1023 (rather than 0 to 100), but this also causes problems. When it is this sensitive, someone shifting their position in the room is enough to change the ambient light value, which fools our robot into thinking it has moved (even though the robot might be hung up in a corner). I've monitored the raw light value of the sensor, and it jumps around all over the place, so using values of 0 to 100 acts to stabilize readings.
In a typical rectangular room, with many objects of different colors creating lots of local shadows and bright spots, it doesn't have as much of a chance to find the strongest light. If the room is all white with a half-sphere ceiling, and the light sensor pointed roughly at the ceiling, then it would be able to move toward the light much more easily. Also, if the brightest direction in the room produces a light value of 68, once it is facing that direction, it doesn't really have anywhere else to go. So the best thing for it to do is sit still and point at the light (which is one reason why I added a negative reinforcement if the light value remains static for more than five moves).
In terms of observing the behavior, Q-Tank would be easier to understand if there were two light sensors. The quick left-right movements that begin each step tend to obscure observations. Also, the movement left and right may throw off the robot because the start and end position may not match exactly, which could contaminate the sensor data the robot is receiving.
Ironically, the most powerful part of the Q-Learning equation, the ability to take into account the state it is delivered to, does not get much use with such a simple problem as light-seeking. The main problem is that Q-Tank is not given a very rich description of the state that it is in. Q-Tank is only given a few scant details from the light sensor. If you, as an intelligent person, were given such scant data about your surroundings, you wouldn't be able to do very interesting actions, either. The Q-Learning algorithm would really shine if the robot kept track of Cartesian coordinates (x, y, and direction). Then it would learn the best route in a particular room layout from any point in the room to a target area, such as a bright light. This would increase the number of states the Q-Table would need to keep track of, however.
Q-Learning is a wonderful technique. It allows a programmer to state the high-level goals of the robot without having to rely on assumptions about the environment. The model requires just two things on the part of the programmer: perceptions and actions. The overall program can be very generic, with no hard-coded behavior. The real challenge is to find a project worthy of the algorithm.
Download the source code.
References
Arkin, Ronald C. Behavior-Based Robotics. MIT Press, 1998.
Russell, Stuart and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1994.