Completing the Contract
Listing 5 shows a test that links equality to hash codes.
Listing 5 Linking equality to hash codes.
public void testHashCode() {
   assertHashCodeConsistency();
   assertHashCodeEquality();
}

private void assertHashCodeEquality() {
   assertEquals(card1.hashCode(), card1copy.hashCode());
}
The final part of the hashCode contract isn’t something we can directly test. It suggests that hashCode should return different values for two unequal objects. But there’s no way we can guarantee that—the only way would be to exhaustively compare the hashCode values for all possible objects.
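While we can't verify dispersion for all possible objects, we can spot-check it over a bounded domain such as the 52-card deck. The sketch below is hypothetical: it declares its own Rank and Suit enums and uses an ordinal-based stand-in for the card hash so the result is deterministic (real enum hashCode values are identity-based and vary from run to run):

```java
import java.util.HashSet;
import java.util.Set;

public class HashSpreadCheck {
    enum Suit { CLUBS, DIAMONDS, HEARTS, SPADES }
    enum Rank { TWO, THREE, FOUR, FIVE, SIX, SEVEN, EIGHT,
                NINE, TEN, JACK, QUEEN, KING, ACE }

    // Stand-in hash using ordinals for determinism; the real Card
    // combines the enums' own hashCode values.
    static int cardHash(Rank rank, Suit suit) {
        return (7 * 41 + rank.ordinal()) * 41 + suit.ordinal();
    }

    // Count distinct hash codes across the whole deck.
    static int distinctCodes() {
        Set<Integer> codes = new HashSet<>();
        for (Suit suit : Suit.values())
            for (Rank rank : Rank.values())
                codes.add(cardHash(rank, suit));
        return codes.size();
    }

    public static void main(String[] args) {
        // We can't prove the contract for all objects, but for 52
        // cards we can check dispersion directly.
        System.out.println("distinct codes: " + distinctCodes());
    }
}
```

For this bounded domain, a test could simply assert that all 52 codes are distinct.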
A constant hashCode, one that returns the same value for every Card object, will cause performance problems, though probably not in a 52-card deck. Were we to concern ourselves with much larger sets of cards, we could be in trouble. Let's experiment again.
The test code shown in Listing 6 follows a simple idiom for an ad hoc performance test. It’s something that you can quickly throw together. It’s also something that can grow into a simple performance-testing framework as you add similar tests and refactor out the duplication between them.
Listing 6 Experimenting with large numbers of cards.
public void testHashCode() {
   assertHashCodeConsistency();
   assertHashCodeEquality();
   assertHashCodePerformance();
}

private void assertHashCodePerformance() {
   final int capacity = 100000;
   final int iterations = capacity * 8 / 10 / 52;
   Set<Card> set = new HashSet<Card>(capacity);
   long start = System.currentTimeMillis();
   for (int i = 0; i < iterations; i++)
      for (Suit suit: Suit.values())
         for (Rank rank: Rank.values())
            set.add(new Card(rank, suit));
   long stop = System.currentTimeMillis();
   long elapsed = stop - start;
   assertTrue("elapsed: " + elapsed, elapsed < 50);
}
For the performance comparison itself, I used an arbitrary threshold of 50ms. Generally, once operations start taking hundreds of milliseconds, users begin to perceive the application as slow. The thresholds and iteration counts in the test are arbitrary absolutes. What's important is the relative difference in performance between a bad hashCode algorithm and a better one.
Granted, the test is a bit artificial. I can't think of any application where I'd need to work with 100,000 cards at once; Texas Hold 'Em certainly doesn't fall into that category. But when we get to the point of writing simulations that run through millions of combinations, we may need to revisit our performance needs.
The test does fail (on my machine, at least), and it’s something we can get to pass by improving on the hashCode algorithm:
@Override
public int hashCode() {
   final int hashMultiplier = 41;
   int result = 7;
   result = result * hashMultiplier + getRank().hashCode();
   result = result * hashMultiplier + getSuit().hashCode();
   return result;
}
This is a fairly standard hashCode algorithm. It combines the hashCode values returned by what we consider the "key" elements of each card: its rank and its suit. We choose a prime number for the multiplier to help disperse the results; a prime multiplier makes it less likely that distinct rank-and-suit combinations collapse to the same value. (It can't guarantee uniqueness, as the first part of this section discussed.)
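To see the combination concretely, here's a worked sketch of the same multiply-and-add steps, factored into a helper so we can feed it numbers. The inputs 12 and 3 are made-up stand-ins for a rank hash and a suit hash:

```java
public class CombineSketch {
    // Combine a rank hash and a suit hash the same way the
    // hashCode method does, step by step.
    static int combine(int rankHash, int suitHash) {
        final int hashMultiplier = 41;
        int result = 7;
        result = result * hashMultiplier + rankHash;  // 7 * 41 + rankHash = 299 for rankHash 12
        result = result * hashMultiplier + suitHash;  // 299 * 41 + suitHash = 12262 for suitHash 3
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: a rank hashing to 12, a suit to 3.
        System.out.println(combine(12, 3));
        // Because the multiplier (41) exceeds the range of the second
        // operand here, no two (rank, suit) pairs in that small range
        // combine to the same value.
    }
}
```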
So how does the improved algorithm help performance? Hash codes should be dispersed fairly well from each other. The more that they "collide," or result in the same number, the worse your hash table performance will be: objects that collide in a HashSet are placed into a list that must be traversed serially, defeating the point of a rapid hash-based lookup. (Note that a hash table maps each code into a fixed number of buckets, typically with a modulus-style operation, so even distinct hash codes can collide at the bucket level.)
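To make the collision effect concrete, here's a standalone sketch. The 16-bucket table and the two hash formulas are made up for illustration; the point is the difference in chain length between a constant hash code and a dispersed one:

```java
public class CollisionSketch {
    // Longest chain formed when the given hash codes are mapped
    // into a fixed number of buckets by a modulus operation.
    static int longestChain(int[] codes, int buckets) {
        int[] counts = new int[buckets];
        int max = 0;
        for (int code : codes) {
            int bucket = Math.floorMod(code, buckets);
            max = Math.max(max, ++counts[bucket]);
        }
        return max;
    }

    public static void main(String[] args) {
        int[] constant = new int[52];
        int[] dispersed = new int[52];
        for (int i = 0; i < 52; i++) {
            constant[i] = 42;           // every card returns the same code
            dispersed[i] = i * 41 + 7;  // codes spread across buckets
        }
        // Constant codes pile all 52 entries into one bucket (chain
        // length 52); dispersed codes keep the chains short.
        System.out.println("constant:  " + longestChain(constant, 16));
        System.out.println("dispersed: " + longestChain(dispersed, 16));
    }
}
```

Every lookup in the constant case degenerates into a serial scan of that one 52-element chain, which is exactly the behavior the performance test punishes.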
We used a performance test as one way to write tests against hashCode. This approach ties the test most directly to your application's needs, particularly when a deficient hashCode algorithm would show up as poor performance. But I don't really like it for this example, because I don't have any specific performance requirements.
Another technique we could choose is mathematical: compute the hashCode of every possible element, then calculate the variance of those values. It too would be an arbitrary measure; we'd need to come up with some reasonable threshold for the variance.
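A sketch of that variance calculation might look like the following. It's hypothetical: it uses enum ordinals as stand-ins for the real rank and suit hash values (which are identity-based and vary per run), and the "reasonable threshold" is left as a comment rather than a committed number:

```java
public class VarianceSketch {
    // Population variance of the hash codes for all 52 cards,
    // using ordinals as deterministic stand-ins for the enums'
    // real hashCode values.
    static double deckVariance() {
        final int n = 52;
        long sum = 0, sumSq = 0;
        for (int rank = 0; rank < 13; rank++)
            for (int suit = 0; suit < 4; suit++) {
                long code = (7 * 41 + rank) * 41 + suit;
                sum += code;
                sumSq += code * code;
            }
        // var = E[X^2] - (E[X])^2, computed from exact integer sums
        return (n * sumSq - sum * sum) / ((double) n * n);
    }

    public static void main(String[] args) {
        // A constant hashCode would produce variance 0; a test would
        // assert that the variance clears some chosen threshold.
        System.out.println("variance: " + deckVariance());
    }
}
```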