UltraSPARC III Cu Memory Hierarchy

TABLE 1 presents the memory hierarchy of the UltraSPARC™ III Cu microprocessor. For more details, refer to [ArchManual].

TABLE 1 Cache Characteristics of the UltraSPARC III Cu Microprocessor

Cache     Function       Size          Line size      Subblock   Organization
I-cache   Instructions   32 Kbytes     32 bytes       none       4-way set associative
D-cache   Data           64 Kbytes     32 bytes       none       4-way set associative
I-TLB     Address        16 entries    -              -          Fully associative
          Address        128 entries   -              -          2-way set associative
D-TLB     Address        16 entries    -              -          Fully associative
          Address        512 entries   -              -          2-way set associative
          Address        512 entries   -              -          2-way set associative
P-cache   Prefetch       2 Kbytes      64 bytes       32 bytes   4-way set associative
W-cache   Stores         2 Kbytes      64 bytes       32 bytes   4-way set associative
E-cache   Unified        8 Mbytes      64-512 bytes   64 bytes   2-way set associative

As TABLE 1 shows, the UltraSPARC III Cu processor has two caches that have not yet been mentioned.

The P-cache is a prefetch cache, used to store data that has been brought in as a result of a prefetch operation (instruction or hardware initiated, if supported). Only floating point loads can get data from the P-cache.

The W-cache acts as a holding station for stored data. This cache reduces bandwidth to the E-cache by coalescing and bursting stores to the E-cache.

FIGURE 7 is a block diagram with the main components of the memory hierarchy on the UltraSPARC III Cu microprocessor. The arrows indicate the various prefetch possibilities. For more details, refer to [ArchManual].

FIGURE 7 Main Components of the Memory Hierarchy

As noted earlier, the UltraSPARC has two separate TLBs for instructions and data (the I-TLB and D-TLB, respectively). In FIGURE 7, these TLBs are not shown as separate blocks, but are labeled TLBs instead.

Performance Aspects

This section uses a simple example to demonstrate the way the memory hierarchy should be used and how performance can degrade if data is not accessed in the intended way.

Consider the following loop that sums up all elements in a two-dimensional array.

float x[m][n];

for (i=0; i<m; i++)
  for (j=0; j<n; j++)  .......... (1)
   sum += x[i][j];

To simplify the discussion, assume the following:

  • Matrix x, or a subset of it, is not yet present in any of the caches in the system

  • The first element, x[0][0], is at a data cache line boundary and is the first element in a virtual memory page

  • The dimensions of x are a multiple of four, such that four rows span one virtual memory page

  • The system has only one data cache

  • A data cache line contains four elements

None of these assumptions are crucial in the remainder of the discussion. They simply make the explanation a little easier.

FIGURE 8 uses shading to indicate in what way the matrix is built up, as seen from a memory access and storage perspective. In this figure, the matrix fits in two pages. Each row consists of four data cache lines.

FIGURE 8 Matrix x

The nested loop in (1) is ordered such that element x[0][0] is the first one to be read. Because (by assumption) none of the data is cached anywhere, this memory access results in both a data cache miss and a TLB miss. The latter is because the page that this element is part of has not been mapped in the TLB cache yet. Therefore, this is a very expensive reference.

The next reference is to x[0][1]. This element is included in the cache line that was brought in as a result of the reference to x[0][0]. It is also part of the page that was just mapped into the TLB. As a result, this is a very fast memory access. Elements x[0][2] and x[0][3] will have similar fast memory access.

However, the reference to x[0][4] will be slower, because a new cache line must be brought in. Reading x[0][5] will be fast again, because it is part of the second cache line that was just loaded into the data cache.

This pattern repeats itself until all the elements in the first page are exhausted. In the example, the last element is x[3][n-1].

The reference to x[4][0] results in both a data cache and a TLB miss again. The TLB miss is because this element is part of the second page and no data in this page has been accessed yet.

It is easy to see that this kind of memory access pattern produces (m * n)/4 data cache misses and m/4 TLB misses.

Other than potentially reusing previously cached elements, little can be done to avoid these transfers (referred to as the natural cache traffic).

The preceding example demonstrates what happens if memory access follows the storage order. The following paragraphs discuss what happens if such a favorable memory access pattern does not exist. The same nested loop is discussed, but with the order of the loops interchanged:

float x[m][n];

for (j=0; j<n; j++)
  for (i=0; i<m; i++)  .......... (2)
   sum += x[i][j];

Similar to the previous example, the first reference (to x[0][0]) results in a data cache and TLB miss.

The next reference is to x[1][0]. This element is still part of the same page (hence no TLB miss occurs), but it is not part of the cache line just brought in. Therefore, this reference results in a data cache miss. Likewise, the references to x[2][0] and x[3][0] each result in a data cache miss.

On the next reference, to x[4][0], a TLB miss occurs because a page boundary is crossed. The references to x[5][0], x[6][0], and x[7][0] each cause a data cache miss again.

While progressing through the first column of the matrix, the TLB and data cache are filled up. What happens next depends on the value of m.

If m is sufficiently small, the cache lines and TLB entries are still present when the program proceeds to the next column (j=1 in the loop above). In that case, all is well: no new TLB entries need to be set up, and the values of x[0..m-1][1] are already in the data cache. Because a cache line spans four elements, the references to x[0..m-1][4] again result in cache misses, but those misses are part of the natural traffic between main memory and the cache subsystem.

Performance problems arise if m and n are large. To make this more specific, consider an imaginary system with the following memory characteristics:

  • A data cache with a capacity of 32 kilobytes, LRU replacement, and a line size of 16 bytes

  • A fully associative TLB with 256 entries

  • A virtual memory page size of 4 kilobytes

One cache line contains 16/4 = 4 elements. The system can store 32 Kbytes/16 bytes = 2048 cache lines in the data cache and map 256 * 4 kilobytes = 1 megabyte of data in the TLB. If the data requirement exceeds one or both of these thresholds, performance degrades.

Assume that m = 4096 and n = 1024.

By the time 2048 elements of the first column of matrix x are read, the data cache is full. It contains the first four columns of the matrix.

The next reference (i=2048) causes one of the lines to be removed from the cache (evicted). Because of the LRU replacement policy, the very first cache line loaded (x[0][0]...x[0][3]) is the one evicted.

For i = 2049, the second cache line will be replaced, and so forth.

After the last inner loop iteration (i=4095) is executed, the top half of the first four columns of matrix x has been replaced by the bottom half of the first four columns.

TABLE 2 shows a snapshot of the data cache after it has finished processing the iteration for the indicated values of j and i.

TABLE 2 Three Snapshots of the Data Cache

j=0, i=2047                 j=0, i=2048                 j=0, i=4095

x[0][0] ... x[0][3]         x[2048][0] ... x[2048][3]   x[2048][0] ... x[2048][3]
x[1][0] ... x[1][3]         x[1][0] ... x[1][3]         x[2049][0] ... x[2049][3]
x[2][0] ... x[2][3]         x[2][0] ... x[2][3]         x[2050][0] ... x[2050][3]
x[3][0] ... x[3][3]         x[3][0] ... x[3][3]         x[2051][0] ... x[2051][3]
...                         ...                         ...
x[2046][0] ... x[2046][3]   x[2046][0] ... x[2046][3]   x[4094][0] ... x[4094][3]
x[2047][0] ... x[2047][3]   x[2047][0] ... x[2047][3]   x[4095][0] ... x[4095][3]

When the program processes the next column of the matrix, j=1, all cache lines must be reloaded again. For example, the first element needed is x[0][1]. Initially this was brought into the cache for j = 0 and i = 0, but it was evicted when loading subsequent elements of the matrix.

Therefore, all references to the second column of the matrix result in a cache miss again. There is no spatial locality (that is, not all of the elements in one cache line are used). In a similar manner, it can be seen that all references to the subsequent columns of the matrix are cache misses.

In other words, all references to x[i][j] result in a cache miss. As a result, performance is very poor.

Things are even worse than this, however. Due to the poor memory access pattern, no spatial or temporal locality (that is, reuse of entries) is obtained in the TLB either. Every reference to a matrix element results in a TLB miss as well.

The explanation for this is similar to that seen for the data cache.

The TLB has a capacity of 256 entries. One row of the matrix is 1024 * 4 bytes = 4 kilobytes, which is exactly the size of one virtual memory page.

The program starts reading the first column of the matrix (j = 0). A TLB entry must be set up for the page that contains x[0][0]. The next reference is to x[1][0]. This element has not yet been mapped into the TLB either, giving rise to another TLB miss. This pattern is repeated for the next 254 iterations: they all cause a TLB entry to be set up.

When i = 256, the TLB is full, so an older entry must be replaced instead of the cached entries being reused. Similar to what was seen with the data cache, subsequent loop iterations evict older entries from the TLB. By i = 511, the first 256 entries have all been flushed out of the TLB.

By the time the program has processed all elements in the first column of the matrix, the TLB entries map the last 256 rows of the matrix.

Unfortunately, the program does not access these rows next, but starts with the top part of the matrix again. None of the pages corresponding to those elements are in the TLB, so a TLB miss occurs again on every matrix reference.

This pattern repeats itself over and over again, resulting in a TLB miss on every reference to matrix x.

It is clear that this loop performs very poorly under these circumstances.

For larger matrices, the behavior is similar. If the values of m and/or n are reduced, some or all of these effects become less pronounced. For small matrices, they disappear entirely.

Performance Results

To illustrate the preceding analysis, this example was run on an UltraSPARC III Cu processor at 900 MHz in a Sun Fire™ 6800 system.

In the row-oriented version, the inner loop is over the rows of the matrix:

for (i=0; i<m; i++)
  for (j=0; j<n; j++)
   sum += x[i][j];

In the column-oriented version, the inner loop is over the columns of the matrix:

for (j=0; j<n; j++)
  for (i=0; i<m; i++)
   sum += x[i][j];

Without special precautions, the C compiler from the Sun ONE™ Studio 7 Compiler Collection suite transforms the column version with the bad memory access pattern into the row version. This is, of course, the right thing to do, but it defeats the purpose of our experiment. We therefore crafted the example so that the compiler could not apply the loop interchange.

FIGURE 9 shows the performance (in Mflop/s) as a function of the memory footprint (in kilobytes).

FIGURE 9 Performance of the Matrix Summation Operation

Only square n×n matrices are used. The footprint is then given by n * n * 4/1024 kilobytes. Note the log-log scale in the graph.

To filter out timing anomalies, each performance experiment was repeated three times. The average over these three results is the value that was plotted.

The shape of these curves is not surprising. With both versions, the performance is highest for a footprint of about 64 kilobytes, the size of the L1 D-cache on the UltraSPARC III Cu processor.

Because the entire matrix fits in this cache, it is also fully mapped by the D-TLB, and the memory access pattern is irrelevant to performance.

However, there is a slight difference between the two versions. The column-oriented version is slightly slower. Additional cache conflicts cause this difference. The column version has a non-unit stride access pattern, increasing the chance of premature cache line evictions.

In general, the curve for the row version is smoother than for the column version. This is because the non-unit stride access pattern in the latter gives rise to more cache line collisions.

Once the matrix no longer fits in the D-cache but still fits in the L2 E-cache, performance degrades, but the performance curves are still similar. This is because all of the matrix can be mapped into the E-cache and D-TLB.

As soon as the matrix no longer fits in the E-cache, performance degrades again, but the curves are no longer identical. The row version outperforms the column version in a rather dramatic way. It is about four to five times faster.

Software Prefetch Effect

It is interesting to study the effect of software prefetch. The results shown so far were obtained with software prefetch disabled (the -xprefetch=no compiler option).

With the -xprefetch=yes option, the compiler applies heuristics to decide whether to insert prefetch instructions and, if so, which ones.

On small problems, prefetching data is not meaningful, as the matrix will be in one of the caches anyway. There is not much latency to hide and the additional cost of the prefetch instruction only slows down the program.

A prefetch instruction on data that is not mapped into the D-TLB will be dropped. The poor memory access pattern in the column version generates many D-TLB misses on large matrices. Therefore, there is no practical reason to insert prefetch instructions in the column version.

This is different for the row version. Thanks to the favorable memory access pattern, this algorithm generates only a modest number of D-TLB misses compared to the total number of cache references. When accessing a large matrix, prefetch should therefore help hide the memory latency.

FIGURE 10 is a plot of the performance for the row version, with and without software prefetch. The speed-up (performance ratio between the two versions) is given in FIGURE 11.

Comparing the performances shown in FIGURE 9 and FIGURE 10, it is seen that prefetch not only improves performance for large matrices that do not fit in the level 2 E-cache, but also helps for data sets that are E-cache resident. For a memory footprint between 64 kilobytes and 8 megabytes, the row version with prefetch does not degrade in performance (FIGURE 10), whereas the original version without prefetch does (FIGURE 9).

FIGURE 10 Effect of Software Prefetch on the Row-Oriented Version


FIGURE 11 Performance Ratio for the Row Version With and Without Prefetch

For matrices that fit in the D-cache, prefetch slows down the operation, but the degradation is limited to approximately 20 percent.

For larger matrices, prefetch has a significant effect on performance. The speed-up for E-cache resident matrices is close to 4. For larger matrices, the speed-up is almost a factor of 5.
