# Building a Light-Seeking Robot with Q-Learning

• Print

## States and Actions

The sensors on the robot are used to describe the state of the robot. The state is just a current summary of all the perceptions (known as percepts) that a robot is monitoring. There can be internal state information as well, such as the robot keeping track of the number of collisions it has had. Q-Tank will keep just one internal value—the number of times in a row that the light sensor value has remained unchanged. The purpose of this value is to determine if the robot is just sitting still. Obviously, we want to motivate the robot to keep moving. If the value is not changing, then it is likely not doing enough to home in on the target; or it might be stuck in a corner and the bumper failed to activate.

Q-Learning is a little bit like statistics-gathering. In a nutshell, the robot keeps a table of statistics that describe how successful an action is when presented with a state. All actions and all states are identified by a single number. For example, when state 4 occurs (which might mean the left value is greater than the right, and the forward bumper is pressed) the robot might try action 6 (which happens to be rotating left). The Q-values are all stored in a table, accessible by Q(action, state). So, if we looked up our example in the table, we might see that Q(4,6) has a Q-Value of 2.397.

Q-Tank is an active learner, so every time it is given a state it tries out an action; then, after the action is performed, it evaluates how successful it was by using the reward function. Let's take a simplified example: a robot that can only discern 2 states: bumper pressed or not pressed. It only has one motor that can do two things: forward or backward. The entire Q-table for this robot would be as follows (the numbers I used to fill in the table were taken at random):

 Action 1 Action 2 State 1 -0.5 1.5 State 2 2.99 -2

For every step the robot takes, it can look up the state in the table. In order for it to pick the best action available, it should pick the action that will give it the highest Q-value (that is, the one that in the past has produced the highest reward). In the above table, given state 2 it makes the most sense to perform action 1.

So, what are the actions and states for this project? Q-Tank has two motors that are hooked up to output port A and C. These motors in turn spin two wheels. Each motor is capable of three functions: forward, backward, and stop. These functions create nine unique actions that Q-Tank can perform, as shown in the following table.

 Action ID Motors A, C Action Description 1 stop, stop Stop 2 stop, fwd Lazy left turn fwd 3 stop, rvs Lazy right turn rvs 4 fwd, stop Lazy right turn fwd 5 fwd, fwd Forward 6 fwd, rvs Sharp right turn 7 rvs, stop Lazy left turn rvs 8 rvs, fwd Sharp left turn 9 rvs, rvs Back up

Now what about states? There are four separate criteria for the states, as mentioned above: the contrasted light value, front bumper, rear bumper, and whether or not the readings have repeated five times or more. More than five identical readings in a row likely indicate that the robot is stuck, so a negative reinforcement will cause it to attempt to become unstuck.

 Percept 1 2 3 Difference of Light L > R R > L L = = R Front Bumper pressed unpressed Rear Bumper pressed unpressed Repeating > 5 times no yes

So there are 3 x 2 x 2 x 2 combinations of percepts, or 24 unique states Q-Tank can identify. For programming purposes, we will encode each percept with a value: 0 or 1 for bumpers; 0,1,2 for light comparisons; and 0 or 1 for repeating. I won't list the entire table, but a partial list of states is as follows:

 State Percepts 1 0,0,0,0 2 0,0,0,1 3 0,0,1,0 4 0,0,1,1 5 etc.

With 24 states and nine actions, our Q-table will be quite large, with a grand total of 216 Q-Values. I should be clear that the actual Q-Learning algorithm does not need to know the details of the state or action; it just needs an identification number that represents a state or action. For example, if it knows it is in state 7, it might decide to try action 5. This is like ordering food by number from a menu— just give the number and judge how good it was.

Notice Q(a1 , j) in the algorithm above. This will choose the maximum Q-Value out of all available Q-Values for the state. The implications of this can be hard to understand, so stay with me. Let's say the robot is in state 8. It looks up the Q-values in the table, and sees that there are two actions that have the same Q-Value (action 1 and action 2), so it is likely they produce the same immediate reward. Action 1 delivers it to state 5 (see Figure 2), whereas action 2 delivers it to state 10. Once in state 5, the best action (a3) will deliver a reward of only 0.5. In contrast, if it had gone to state 10, the best action would deliver a reward of +2.5! So as we can see, it is better to go to state 10 than state 5, even though the immediate rewards for both are equal. By including Q(a1 , j) in the Q-Learning equation above, the algorithm becomes aware of what lies ahead (given enough practice). This shows that the Q-Learning algorithm is much more powerful than it first appears. It actually has a bit of a capability to look ahead at the possible rewards. Given enough steps, this algorithm will converge on the optimal action for every state in the state space.

Figure 2 The powerful Q-Learning algorithm.

Another interesting aspect of the Q-Learning equation is the learning rate (ß in the equation above). This constant determines the rate of convergence on the optimal algorithm. It's important to strike the right balance between speedy learning and giving the Q-Value history more weight. If we use a value that is too high, each new step it takes will overshadow the previous Q-Value history and make the robot seem like it has no memory of previous situations. If the learning rate is too small, then our robot will be a slow learner.

If we accepted this algorithm as described up to this point, it would not perform optimally. The problem is, when the robot starts, all the Q values are set to zero. As soon as it performed an action that gave it any reward, it would be satisfied with that action as the maximum action for a given state. So even if the reward was a paltry 0.1, it is still larger than 0, and hence this action would be picked over all the others because the robot has not explored any others yet. So how do we get the robot to explore other options it hasn't yet tried? By choosing a random action once in a while. We assign the robot an exploration constant of between 0 and 1. If the exploration constant is 0.1, then 10% of the time it tries a random action rather than the maximum action in the table.

Another way to make the algorithm less predictable is to seed the table with random values at the start, rather than leaving all the values at 0. The values will all eventually migrate to their proper values if given enough steps, so this does not adversely affect the performance of Q-Learning.

• 🔖 Save To Your Account

### InformIT Promotional Mailings & Special Offers

I would like to receive exclusive offers and hear about products from InformIT and its family of brands. I can unsubscribe at any time.

## Overview

Pearson Education, Inc., 221 River Street, Hoboken, New Jersey 07030, (Pearson) presents this site to provide information about products and services that can be purchased through this site.

This privacy notice provides an overview of our commitment to privacy and describes how we collect, protect, use and share personal information collected through this site. Please note that other Pearson websites and online products and services have their own separate privacy policies.

## Collection and Use of Information

To conduct business and deliver products and services, Pearson collects and uses personal information in several ways in connection with this site, including:

### Questions and Inquiries

For inquiries and questions, we collect the inquiry or question, together with name, contact details (email address, phone number and mailing address) and any other additional information voluntarily submitted to us through a Contact Us form or an email. We use this information to address the inquiry and respond to the question.

### Online Store

For orders and purchases placed through our online store on this site, we collect order details, name, institution name and address (if applicable), email address, phone number, shipping and billing addresses, credit/debit card information, shipping options and any instructions. We use this information to complete transactions, fulfill orders, communicate with individuals placing orders or visiting the online store, and for related purposes.

### Surveys

Pearson may offer opportunities to provide feedback or participate in surveys, including surveys evaluating Pearson products, services or sites. Participation is voluntary. Pearson collects information requested in the survey questions and uses the information to evaluate, support, maintain and improve products, services or sites, develop new products and services, conduct educational research and for other purposes specified in the survey.

### Contests and Drawings

Occasionally, we may sponsor a contest or drawing. Participation is optional. Pearson collects name, contact information and other information specified on the entry form for the contest or drawing to conduct the contest or drawing. Pearson may collect additional personal information from the winners of a contest or drawing in order to award the prize and for tax reporting purposes, as required by law.

If you have elected to receive email newsletters or promotional mailings and special offers but want to unsubscribe, simply email information@informit.com.

### Service Announcements

On rare occasions it is necessary to send out a strictly service related announcement. For instance, if our service is temporarily suspended for maintenance we might send users an email. Generally, users may not opt-out of these communications, though they can deactivate their account information. However, these communications are not promotional in nature.

### Customer Service

We communicate with users on a regular basis to provide requested services and in regard to issues relating to their account we reply via email or phone in accordance with the users' wishes when a user submits their information through our Contact Us form.

## Other Collection and Use of Information

### Application and System Logs

Pearson automatically collects log data to help ensure the delivery, availability and security of this site. Log data may include technical information about how a user or visitor connected to this site, such as browser type, type of computer/device, operating system, internet service provider and IP address. We use this information for support purposes and to monitor the health of the site, identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents and appropriately scale computing resources.

### Web Analytics

Pearson may use third party web trend analytical services, including Google Analytics, to collect visitor information, such as IP addresses, browser types, referring pages, pages visited and time spent on a particular site. While these analytical services collect and report information on an anonymous basis, they may use cookies to gather web trend information. The information gathered may enable Pearson (but not the third party web trend services) to link information with application and system log data. Pearson uses this information for system administration and to identify problems, improve service, detect unauthorized access and fraudulent activity, prevent and respond to security incidents, appropriately scale computing resources and otherwise support and deliver this site and its services.

This site uses cookies and similar technologies to personalize content, measure traffic patterns, control security, track use and access of information on this site, and provide interest-based messages and advertising. Users can manage and block the use of cookies through their browser. Disabling or blocking certain cookies may limit the functionality of this site.

### Do Not Track

This site currently does not respond to Do Not Track signals.

## Security

Pearson uses appropriate physical, administrative and technical security measures to protect personal information from unauthorized access, use and disclosure.

## Children

This site is not directed to children under the age of 13.

## Marketing

Pearson may send or direct marketing communications to users, provided that

• Pearson will not use personal information collected or processed as a K-12 school service provider for the purpose of directed or targeted advertising.
• Such marketing is consistent with applicable law and Pearson's legal obligations.
• Pearson will not knowingly direct or send marketing communications to an individual who has expressed a preference not to receive marketing.
• Where required by applicable law, express or implied consent to marketing exists and has not been withdrawn.

Pearson may provide personal information to a third party service provider on a restricted basis to provide marketing solely on behalf of Pearson or an affiliate or customer for whom Pearson is a service provider. Marketing preferences may be changed at any time.

## Correcting/Updating Personal Information

If a user's personally identifiable information changes (such as your postal address or email address), we provide a way to correct or update that user's personal data provided to us. This can be done on the Account page. If a user no longer desires our service and desires to delete his or her account, please contact us at customer-service@informit.com and we will process the deletion of a user's account.

## Choice/Opt-out

Users can always make an informed choice as to whether they should proceed with certain services offered by InformIT. If you choose to remove yourself from our mailing list(s) simply visit the following page and uncheck any communication you no longer want to receive: www.informit.com/u.aspx.

## Sale of Personal Information

Pearson does not rent or sell personal information in exchange for any payment of money.

While Pearson does not sell personal information, as defined in Nevada law, Nevada residents may email a request for no sale of their personal information to NevadaDesignatedRequest@pearson.com.

## Supplemental Privacy Statement for California Residents

California residents should read our Supplemental privacy statement for California residents in conjunction with this Privacy Notice. The Supplemental privacy statement for California residents explains Pearson's commitment to comply with California law and applies to personal information of California residents collected in connection with this site and the Services.

## Sharing and Disclosure

Pearson may disclose personal information, as follows:

• As required by law.
• With the consent of the individual (or their parent, if the individual is a minor)
• In response to a subpoena, court order or legal process, to the extent permitted or required by law
• To protect the security and safety of individuals, data, assets and systems, consistent with applicable law
• In connection the sale, joint venture or other transfer of some or all of its company or assets, subject to the provisions of this Privacy Notice
• To investigate or address actual or suspected fraud or other illegal activities
• To exercise its legal rights, including enforcement of the Terms of Use for this site or another contract
• To affiliated Pearson companies and other companies and organizations who perform work for Pearson and are obligated to protect the privacy of personal information consistent with this Privacy Notice
• To a school, organization, company or government agency, where Pearson collects or processes the personal information in a school setting or on behalf of such organization, company or government agency.

This web site contains links to other sites. Please be aware that we are not responsible for the privacy practices of such other sites. We encourage our users to be aware when they leave our site and to read the privacy statements of each and every web site that collects Personal Information. This privacy statement applies solely to information collected by this web site.