A. Intro
Download the code for Lab 12 and create a new Eclipse project out of it.
B. Get Acquainted with Maps
Tables
Before this class you probably worked with tables that stored collections of associations between symbols and other objects. For example, a function to translates a Roman numeral to decimal might use a table that stores the following associations:
Roman digit | Corresponding Decimal Value |
---|---|
I |
1 |
V |
5 |
X |
10 |
L |
50 |
C |
100 |
D |
500 |
M |
1000 |
In CS 61A, you first encountered tables in the context of dictionaries. Other applications of tables in that course included memoization of recursive processes (we'll revisit that application shortly).
A table has two important features.
- It's a collection of pairs of information. Each pair consists of a key and an associated value.
- Access to table elements is generally by key: we ask for the value that's associated with a given key. There can't be any more than one value associated with a given key. For example, if the key "mike" was associated with the value 27, it would not also be associated with the value 45.
The Map Interface
Java's name for an association table is a Map
. Map
is an interface in java.util
, not an actual class; an actual class that stores associations would implement Map
and provide definitions for the methods it prescribes. Among these methods, the most interesting are
Object put(Object key, Object value)
which adds the association between key
and value
to the table (and returns the previously existing one, if there was one), and
Object get(Object key)
which retrieves the value associated with key
The Object
returned by put
is the value previously associated with the given key, or null
if the key had no previously associated value. (Recall that tables contain at most one associated value for each key.) The example below, in which table is assumed to be an instance of a class implementing Map
, demonstrates how put
works. (The table stores associations between people and blood types.)
Statement | Effect |
---|---|
Object value; value = table.put ("mike", "a-"); |
value contains null ; the set of associations represented by table is {("mike", "a-")}. |
value = table.put ("sara", "a+"); |
value contains null ; table represents a set of two associations, {("mike", "a-"), ("sara", "a+")}. |
value = table.put ("mike", "ab-"); |
value originally contains "a-" since that was the value that was originally associated with "mike"; the set of associations represented by the resulting table is {("mike", "ab-"), ("sara", "a+")}. |
value = table.put ("david", "o-"); |
value contains null ; table represents a set of three associations, {("mike", "ab-"), ("sara", "a+"), ("david", "o-")}. |
Discussion: A Class that Implements Map
Link to the discussion
Try to imagine what a class that implements
Map
would look like. For example, what would its private instance variable(s) be? Explain how the
get
and
put
operations would work with those instance variables. (There are several reasonable answers.)
Finally, what are the runtimes of your
get
and
put
methods with respect to the total number of keys in the map?
java.util Classes that Implement Map
Several classes in java.util
implement the Map
interface. One involves a binary search tree, which we'll discuss in a later lab. Two others involve hash tables, the subject of the rest of this lab.
C. The Hash Map
A Faster Map
In the previous discussion, did you come up with run times for put
and get
that were had constant runtime with respect to the number of keys in the map? Likely not! If your method called something like ArrayList
's indexOf(Object o)
method, then it isn't constant time. Finding an object in an ArrayList
requires iterating through the whole thing, which is linear time.
But one observation that you may have come up with is that if the keys in our map are integers, then we can use just a normal array as a map with constant time access. All we do is store the values in an array and access them immediately using array indexing. For example, if the table were an array of 100 elements that stores the entries (5, "Clancy")
, (17, "Wei")
, (42, "Waliany")
, and (83, "Hale")
, the value associated with 5 is just table[5]
, and we can access that value in constant time.
Cool, right? Too bad it only works for integer keys. But hashing is a technique for extending this constant-time access to non-integer keys.
Hashing
The idea underlying hashing is to add a step to the getting process: namely the transformation of the key into an int that can be used to index an array. If this transformation is fast and different keys get transformed into different index values, then we can approximate the direct access that an array provides.
The transforming function is called a hash function, and the int
it returns is the hash value. (The reason for the term "hash" will appear shortly.)
Here's an example with first and last names. Let's say there are the following key-value pairs: ("Alice", "Sheng")
, ("Amit", "Akula")
, ("Amruta", "Yelamanchili")
, ("Armani", "Ferrante")
, ("Daniel", "Nguyen")
, ("Gilbert", "Ghang")
, ("Ross", "Teixeria")
, ("Rudy", "Laprade")
, and ("Yeseon", "Lee")
. The keys start with a variety of different first letters. Thus, a reasonable hash function for this set of keys could be based on the index of the first letter of the key in the alphabet. Applying this hash function to the four keys gives us the following results:
key | hash value |
---|---|
"Alice" | 0 |
"Amit" | 0 |
"Amruta" | 0 |
"Armani" | 0 |
"Daniel" | 3 |
"Gilbert" | 6 |
"Ross" | 17 |
"Rudy" | 17 |
"Yeseon" | 24 |
With a 26-element array named table
, we could go directly to Daniel's associated value by accessing table[3]
.
An array that's not very full uses a lot of extra space, but that is okay. We are trading memory efficiency for search efficiency.
Collisions
In the previous example, the keys had mostly different values, but there were four collisions (keys with the same hash value). "Alice", "Amit", "Amruta", and "Armani" all had a hash value of 0, and "Ross" and "Rudy" had hash values of 17. Other hash functions, like one depending on the length of the first name, would produce even more collisions.
There are two common methods of dealing with collisions:
- Linear Probing: Store the colliding keys elsewhere in the array, potentially in the next open array space. We will encounter it later in this lab.
- Chaining: An easier, more common solution is to store all the keys with a given hash value together in a collection of their own, such as a linked list. This collection is often called a bucket.
Here are the example entries from the previous step, hashed into a 26-element table of linked lists using the function hash(key) = ((int) key.charAt(0)) - 65
.
Assume for a moment that all of the keys in the hash table have different hash codes and end up in different places in the array. What is the runtime of getting a key with respect to the total number of keys in the map?
O(1). We can index into the array in constant time. Then we search for the key in the bucket. But there is only one item in the bucket, so this search also runs in O(1) time.
Now assume instead that all the keys in the hash table happen to have the same hash code (despite being different keys), and they end up in the same bucket. What is the runtime of getting a key with respect to the total number of keys in the map?
O(n) in the worst case, where there are n keys in the map. To find a key, we first index into the array at the proper location in constant time. But then we have to search through all of the items in the bucket to find the appropriate key. If the bucket is a linked list, we just have to iterate through the linked list until we find the key. In the worst case, all the keys end up in the same bucket, so this could take up to n steps.
The Meaning of 'Hashing'
Clearly we are better off if different keys have different hash codes, and collisions are low, since this produces the best average running time.
To determine the average case in any problem, you would need to determine or approximate the probability distribution of possible inputs; in the case of hashing, the possible inputs are the keys. Given this distribution, you want to design your hash function to scatter the keys throughout the table as much as possible to minimize clumps produced by keys that are similar in some way to one another. One definition of "hash" is "to muddle or confuse", and a good hash function will "muddle" whatever ordering exists in a set of keys. This is where the hash function got its name.
Note also that the performance of hashing depends very much on average case behavior. No matter what the hash function, we can always produce a set of keys that make it perform as badly as possible.
Load Factor
No matter what, if the underlying array of our hash table is small, and we add a lot of keys to it, then we will start getting more and more collisions. Because of this, a hash tables should expand its underlying array once it starts to get filled (much like how an ArrayList
expands once it fills up).
A hash table should have an associated load factor, which is a fraction between 0 and 1. Once the number of keys / total size of the underlying array is greater than the load factor, then the hash table should resize to be bigger. The load factor is commonly around 0.7.
An Animation of Hashing
Phew! That's a lot of text describing hashing. Visualizing the process will probably be more helpful for really understanding what's going on.
This demo illustrates the hashing process, using the hashCode
method of the String
class. Collisions are stored in a linked list as just described.
You can try inserting the strings "h", "12", "z", "m", "a", "14", "1", and "za" to yield interesting behaviour. In addition to insert, don't forget to experiment with delete and find!
java.util.HashMap
The class java.util.HashMap
is implemented as a hash table, that is, an array of collections as just diagrammed. The put
method adds an entry to the HashMap
:
Object put (Object key, Object value);
put
returns either the value previously associated with key
, or null
if there was no previous association with key
.
The put
method requires that keys support two operations:
public boolean equals(Object obj)
and
public int hashCode()
Here is how it uses them:
- First, put calls
key.hashCode()
automatically to find out which collection to search forkey
. The programmer does not callhashCode
themselves. The relevant array index in the table maintained by theHashMap
class iskey.hashCode() % the length of the bucket array
. - Then, it searches the bucket, the collection of entries whose keys all have the same remainder when dividing the hash value by the table size. It does this first by comparing hash values, then using
equals
for keys with the same hash code. If it finds an entry with the given key, it replaces the associated value. Otherwise it adds the key/value entry to the collection.
The get
method, which returns the value associated with a given key, works similarly. It (not the programmer) calls the key's hashCode
method to find out which collection to search, then compares the key with the keys in the collection using equals
.
A full description of this class can be found on the Java API here.
Note: A possible ambiguity arises when get
returns null
: either the key is not in the table, or it is in the table but is associated with the value null
. We resolve the ambiguity with the containsKey
method, which returns true
if its key argument is associated with something and returns false
otherwise.
Constructor
As we have seen in other Java collection classes, we are allowed to restrict the keys and values in the table to certain types. To declare a java.util.HashMap
whose keys are all Strings
and whose associated values are all Integers
, we provide the relevant types in brackets:
HashMap<String, Integer> table = new HashMap <String, Integer>();
Some Additional (not super important) Points about the Constructor
The HashMap
constructor has two optional arguments. One is the initial size of the bucket array. Normally, the array is expanded (doubling its size, similar to ArrayList
) when the table gets too full. If we know approximately how many entries the table will contain, however, we can avoid the intermediate expansions. The other argument is the maximum load factor , which is a percentage specifying how full the HashMap
can be before it expands its internal array. The load factor is defined as the number of entries in the map divided by the number of buckets (i.e. the size of the internal array).
Self-Test: Relation of .equals and .hashCode
True or false: If for two objects a
and b
, the expression a.equals(b)
is true
, then it must be the case that a.hashCode() == b.hashCode()
is also true.
True
|
Correct!
|
|
False
|
Incorrect.
|
True or false: If for two objects a
and b
, the expression a.hashCode() == b.hashCode()
is true
, then it must be the case that a.equals(b)
is also true.
True
|
Incorrect.
|
|
False
|
Correct!
|
Exercise: PhoneBook with HashMap
Add code to complete the PhoneBook
class and possibly to the Person
and PhoneNumber
class. You must use HashMap
from java.util
! The point of this exercise is to give you practice with it.
Run the test cases in PhoneBookTest.java
to check that the code works. Then answer the following questions to check your understanding of the desired behavior.
What type was the Key
in the HashMap
book used in the PhoneBook
?
PhoneNumber Object
|
Incorrect.
|
|
String representing the name
|
Incorrect.
|
|
String representing the phone number
|
Incorrect.
|
|
PhoneBook Object
|
Incorrect.
|
|
Person Object
|
Correct!
|
What type was the Value
in the HashMap
book used in the PhoneBook
?
PhoneNumber Object
|
Correct!
|
|
String representing the name
|
Incorrect.
|
|
String representing the phone number
|
Incorrect.
|
|
PhoneBook Object
|
Incorrect.
|
|
Person Object
|
Incorrect.
|
What happens when you try to add the same person to the phone book twice?
An exception is thrown
|
Incorrect.
|
|
The second entry is ignored because an entry already exists
|
Incorrect.
|
|
Duplicate entries are added to the HashMap
|
Incorrect.
|
|
The second entry's value replaces the first's
|
Correct!
|
What happens if you modify a PhoneNumber
that is already stored in the Book?
You are now no longer able to look up the PhoneNumber for that Person object.
|
Incorrect.
|
|
The PhoneBook remembers the old entry and is not changed because they reference different objects
|
Incorrect.
|
|
An exception is thrown
|
Incorrect.
|
|
The entry is reflected in the Phone book because they reference the same object
|
Correct!
|
What happens if you modify a Person
that is already stored in the Book?
The entry is reflected in the Phone book because they reference the same object and you can still look up the PhoneNumber
|
Incorrect.
|
|
The entry is reflected in the Phone book because they reference the same object, but you can no longer look up the PhoneNumber
|
Correct!
|
|
The PhoneBook creates a second entry for the second person and still references the old Object as well
|
Incorrect.
|
|
An exception is thrown
|
Incorrect.
|
Note: For each question ensure that you understand why the answer is so. Consult with your neighbors and TA, especially for the forth and fifth questions.
Keeping Track of Hash Table Contents
Consider the following program. We think drawing a rough sketch of the Hash Table contents will be helpful. Note: a HashSet
works just like a HashMap
, except instead of storing key-value pairs, it just stores items with no values associated with them.
import java.awt.*;
import java.util.*;
public class TestPoint {
public static void main (String [ ] args) {
HashSet<Point> points = new HashSet<Point> ( );
Point p = new Point (0, 5);
for (int k = 0; k < 3; k++) {
p.x = k;
points.add (p);
}
for (int k = 0; k < 3; k++) {
System.out.println ("Does the hash table contain (" + k + ",5)? "
+ points.contains (new Point(k,5)));
}
}
}
When run, it produces the following output:
Does the hash table contain (0,5)? false
Does the hash table contain (1,5)? false
Does the hash table contain (2,5)? true
Think about why for the next question.
Discussion: What Happened?
Link to the discussion
Why did the code return false when asked whether it contained the points (0,5) and (1,5)? Earlier in the code we put added those points to the
HashSet
.
Linear Probing
Hashing into "buckets" is not the only method for dealing with collisions. In the "linear probing" strategy of handling collisions, keys are stored in the array rather than in a separate collection. A key's hash value is found, and then the array is searched linearly (wrapping around to the beginning of the array when it reaches the end) for the key. The search stops either when the table is full, when an empty space is found (meaning that the key is not yet in the table), or when the key is found.
This Demo lets you play with the linear probing algorithm with insertion, deleting, and find.
Discussion: Linear Probing
Link to the discussionThink about the kinds of instance variables that a linear probing hash map would have in comparison with those of a bucketing hash map. What's one major pro and one major con of using a linear probing hash map instead of a bucketing hash map?
Let's Make our own Hash Map!
If you haven't yet switched which partner is coding in this lab, please do so now.
By this point in the lab, you should have a strong understanding of the following concepts:
- the structure of a hash map and how items are stored
- how items are accessed in a hash map
- how the
hashCode
is used in a hash map - basic operations of all maps
This should be enough to make our own hash map! We have created a template class for you, MyHashMap.java
. Implement the core functionality of your hash map. You can decide how to implement the hash map (how to handle collisions, etc.) so add what instance variables you need. This means you can either use a bucketing or a linear probing approach.
Discussion: Inventing a Hash Function
Link to the discussionHere are eleven keys.
10 100 32 45 58 126 1 29 200 400 15
Find a hash function that, when used with an empty chained table of size 11, produces three or fewer collisions when the eleven keys are added to the table. Make a post describing your function. Then check your hash function with your neighbors and trade tactics for handling these sorts of exercises.
hashCode
Methods of Builtin Objects
Print the hash values for several instances of builtin classes such as java.awt.Point
, java.lang.String
, and java.lang.Integer
, just to get a feel for what kind of values a hashCode
method typically produces.
Also, produce evidence for whether or not the ArrayList.hashCode
method uses the hashCode
methods of its elements.
Discussion: ArrayList Hash Code
Link to the discussion
Does the
hashCode
of an
ArrayList
depend upon
hashCode
s of the objects in the
ArrayList
? How does your evidence confirm of disconfirm this?
D. Some real-world hash functions
Information presented here is taken from the article "Selecting a Hashing Algorithm", B.J. McKenzie et al., Software Practice & Experience, vol. 20, no. 2, February 1990. The authors experimented with eight hash functions implemented in a variety of widely distributed software.
Hashing Algorithms
All of the algorithms compute a hash value H for a string of length n
whose characters are c1, c2, ..., cn
. The hash value is determined from successive partial results h0, h1, h2, ..., hn
, with each hk
computed from hk?1
as given in the formulas below. The hash table size is the value used in the mod operation at the end of each algorithm.
Amsterdam Compiler Kit (ACK)
There is a "mask" for characters, built as follows:
`m1 = 171; mk = rightmost 8 bits of 77mk?1+153`
The hash value
H
is then the last 8 bits ofhn
, whereh0 = 0 and hk = hk?1 + XOR(ck, mk)
.Eidgenossische Technische Hochschule Modula-2 Cross Compiler (ETH)
h0 = 1; hk = ck*((hk?1 mod 257)+1); H = hn mod 1699
GNU C preprocessor (GNU-cpp)
h0 = 0; hk = 4hk?1+ck; H = last 31 bits of hn, mod 1403
GNU compiler front end (GNU-cc1)
h0 = n; hk = 613hk?1+ck; H = last 30 bits of hn, mod 1008
Portable C Compiler front end (PCC)
h0 = 0; hk = 2hk?1+ck; H = last 15 bits of hn, mod 1013
Unix 4.3 BSD C preprocessor (CPP)
h0 = 0; hk = 2hk?1+ck; H = hn mod 2000
AT&T C++ compiler (C++)
h0 = 0; hk = 2hk?1+ck; H = hn mod 257
Icon translator (Icon)
h0 = 0; hk = hk?1+ck; H = hn mod 128
Performance
Algorithms were tested on 36,376 identifiers from a large bunch of C programs, and 24,473 words from a UNIX dictionary.
ACK is a loser (U-shaped distribution). Icon, C++, GNU-cc1, and GNU-cpp seem to distribute the words well. Theoretical results suggest that an algorithm of the form hk = A*hk?1+ck
; H = hn mod N
will be good, with A
a power of 2 for speed and N
chosen appropriately. The authors note:
"[A] and N need to be selected with care. Although it may seem unlikely that anyone would choose one of the really bad combinations, the facts ... indicate that far-from-optimal choices are made and persisted with. The experiments have shown that very small variations in N can produce large variations in the efficiency of the hash table lookup, and that the popular view, that choice of a prime number will automatically ensure a good result, is not well founded."
E. Conclusion
Please turn in:
- PhoneBook.java, Person.java, and PhoneNumber.java
- MyHashMap.java
Submit these files as lab12.