Setup

Please do not skip this section. Partnerships are required and you must complete your work in the shared repository. See the Collaboration Guide for more information.

Today, you should be working with the partner you intend to collaborate with for project 3. Starting today, you two will be required to work together through the due date of the project on both the labs and the project. If you are looking for a partner, please let your TA know; they’d love to help!

The two of you will be working from a shared repository. Please go to the Partnerships page on Beacon, and invite your Project 3 partner (Only one partner needs to do this!). Generate the magic link and have your partner click on it to confirm.

Once confirmed, your partnership page on beacon will contain a link to view the repository on GitHub, which should have a name of the form su19-proj3-g***.

Both of you should clone this repository. Make sure you do not clone this repository inside of another repository–you will experience git problems if you do so.

$ git clone https://github.com/Berkeley-CS61B-Student/su19-proj3-g***.git

Change directory into it:

$ cd su19-proj3-g***

And add the skeleton remote:

$ git remote add skeleton https://github.com/cs61bl/skeleton-su19.git

You’ll then be able to pull as usual:

$ git pull skeleton master

If you get an error similar to “fatal: refusing to merge unrelated histories”, you can fix this by, for this time only, running:

$ git pull --rebase --allow-unrelated-histories skeleton master

You should be using this shared repository for this week’s labs and the project. Both you and your partner will work together on the same code base. After adding and commiting your own work, you can receive your partner’s work:

$ git pull origin master

Introduction

As usual, pull the files from the skeleton and make a new IntelliJ project.

This lab will be due two hours before your Tuesday (8/6) lab and be worth a total of 6 points (1 point for the worksheet and 5 points for the coding exercises). There will be no new lab assignment on Monday (8/5). This is to give you more time to digest and implement all the parts of this lab!

The worksheet is due by end of Monday’s lab (though you can turn it in to your TA whenever you are finished with it).

The first part of this lab will be focused on implementing the \(k\)-d tree, which will be useful for Project 3 Part II (Routing). The second part of this lab will be focused on implementing the trie data structure.

\(k\)-d Trees

As mentioned above, this data structure will be useful when you implement Part II of BearMaps. Specifically, when a user clicks on a point on the map, it will be your \(k\)-d tree’s job to find the nearest intersection to that click.

Before we implement our KDTree, we’ll go over the provided files and implement a more naive data structure called the NaivePointSet, which will help us test and make sure our KDTree is behaving properly.

Point and PointSet

For this lab, we have provided Point.java, which represents a location with an \(x\) and \(y\) coordinate, and PointSet.java, which is an interface representing a set of Points. Make sure you do not modify any of these provided classes.

Point.java has the following methods:

  • public double getX()
  • public double getY()
  • public double distance(Point o): Returns the great-circle distance.

PointSet.java has only one method, listed below:

  • public Point nearest(double x, double y): Returns the Point in this set nearest to x, y.

Exercise: NaivePointSet

Before we implement our KDTree class, we will first create a NaivePointSet. Once it’s completed, the NaivePointSet will find the nearest point to a given coordinate using a naive linear-time solution. The goal of creating this class is that we will have an alternative, though slower, solution that we can use to verify if the result of our \(k\)-d tree’s nearest method is correct.

The NaivePointSet will implement the PointSet interface, showing that NaivePointSet is a type of PointSet. NaivePointSet will have the following class declaration:

public class NaivePointSet implements PointSet {
    ...
}

Your job will be to implement the following two parts of the NaivePointSet:

/* Constructs a NaivePointSet using POINTS. You can assume POINTS contains at
   least one Point object. */
public NaivePointSet(List<Point> points);

/* Returns the closest Point to the inputted X and Y coordinates. This method
   should run in Theta(N) time, where N is the number of POINTS. */
public Point nearest(double x, double y);

If you ever need to calculate the distance to a Point, use the distance method of the Point class (ex. Point.distance(p1, p2)). You may receive wrong results if you calculate distance in any other way.

Additionally, the NaivePointSet class should be immutable, meaning that we cannot add or or remove Points from it once it’s been created. You will not need to do anything special to guarantee this. You can find an example usage of this class in the main method of NaivePointSet.

As mentioned, we want to implement nearest such that it runs in linear time with respect to the number of Points we have.

Tip: Try not to overthink this! This part shouldn’t take too long. Think of a simple linear time solution that will allow us to find the nearest point to some \(x\) and \(y\) coordinate.

Worksheet 1: K-D Trees

This linear-time solution to finding the closest point is not optimal, so we’ll now look at another structure called the \(k\)-d tree which will allow us to find the closest point in \(O(\log N)\) time.

Before we dive into coding this, complete this problem on the worksheet. Make sure you understand everything on this part of the worksheet before moving on to the next section. If there’s anything you’re unsure about, be sure to talk to your TA.

If you’re not familiar with how \(k\)-d trees work, please see Lecture 6 before proceeding (make sure you understand the demos in the lecture completely). It is absolutely imperative that you do this before starting to code KDTrees.

Exercise: KDTree

Now, we’ll implement the KDTree class! Implement the following constructor and method in KDTree:

/* Constructs a KDTree using POINTS. You can assume POINTS contains at least one
   Point object. */
public KDTree(List<Point> points);

/* Returns the closest Point to the inputted X and Y coordinates. This method
   should run in O(log N) time on average, where N is the number of POINTS. */
public Point nearest(double x, double y)

If you ever need to calculate the distance to a Point, use the distance method of the Point class (ex. Point.distance(p1, p2)). You may receive wrong results if you calculate distance in any other way.

As with NaivePointSet, the KDTree class should be immutable. Also note that while \(k\)-d trees can theoretically work for any number of dimensions, your implementation only has to work for the 2-dimensional case, i.e. when our points have only x and y coordinates.

This part of the lab is conceptually tough, but if you’d like some guidance on how to proceed, take a look at the slides here!

Defining Comparators Using Lambda Functions (Optional)

You may need to use comparators in this lab to compare objects. You can construct a concrete class that implements Comparator or you can define a lambda function in Java. As a reminder, the syntax for declaring a lambda for a comparator is as follows:

Comparator<Type> function = (Type arg1, Type arg2) -> (...);
Comparator<Type> functionEquivalent = (arg1, arg2) -> (...);

where (...) is your desired return value that can use arguments arg1 and arg2.

Examples:

Comparator<Integer> intComparator = (i, j) -> i - j;
Comparator<Integer> intComparatorEquivalent = (i, j) -> {
  int diff = i - j;
  return diff;
};

Another handy tip: If you want to compare two doubles, you can use the Double.compare(double d1, double d2) method.

Testing NaivePointSet and KDTree

There are a number of different ways that you can construct tests for this part of the project, but we will go over our recommended approach.

Randomized Testing

Creating test cases that span all of the different edge cases of \(k\)-d trees will be difficult due to the complexity of the problem we are solving. To avoid thinking about all possible strange edge cases, we can turn towards techniques other than creating specific, deterministic tests to cover all of the possible errors.

One suggestion is to use randomized testing which will allow you to test your code on a large sample of points which should encompass most, if not all, of the edge cases you might run into. By using randomized tests you can generate a large number of random points to be in your tree, as well as a large number of points to be query points to the nearest function. To verify the correctness of the results you should be able to compare the results of your KDTree’s nearest function to the results to the NaivePointSet’s nearest function. If we test a large number of queries without error we can build confidence in the correctness of our data structure and algorithm.

An issue is that randomized tests are not deterministic. This mean if we run the tests again and again, different points will be generated each time which will make debugging nearly impossible because we do not have any ability to replicate the failure cases. However, randomness in computers is almost never true randomness and is instead generated by pseudorandom number generators (PRNGs). Tests can be made deterministic by seeding the PRNG, where we are essentially initializing the PRNG to start in a specific state such that the random numbers generated will be the same each time. We suggest you use the class Random which will allow you to generate random coordinate values as well as provide the seed for the PRNG. More can be found about the specifications of this class in the online documentation.

Normally, we’d like you to write these randomized tests yourself, but an example of this test has been provided to you in the skeleton files. Please put any tests you write in KDTreeTest.java. Feel free to discuss testing strategies with other students or your TA.

External Resources

Wikipedia is a pretty good resource that goes through nearest neighbor search in depth. You are free to reference pseudocode from course slides and other websites (use the @source tag to annotate your sources), but all code you write must be your own. We strongly recommend that you use the approach described in 61BL, as it is simpler than described elsewhere.

Grading

When you submit to Gradescope, you’ll see that there is a suite of tests associated with the \(k\)-d tree. We have tried to make this suite of tests exhaustive, so you will know how accurate your \(k\)-d tree is, though we have relaxed the correctness requirement to get full credit for this portion of the lab. As long as you correctly return the nearest point at least 57% of the time, you will receive full credit for this part. However, we have included a test that checks for 96.6% correctness on nearest (worth 0 points) that you can reference to see if your solution is working completely.

Tries

For the second part of this lab, you’ll create TrieSet, a Trie-based implementation of the TrieSet61BL interface, which represents a basic Trie.

Up to this point in the course, we have learned about a number of Abstract Data Types, such as Lists, Sets, Maps, and others. Today, we will be exploring another way to implement the Set and Map ADTs, specific to storing Strings.

Suppose you were tasked with creating a Set which is only going to be used to store Strings. In the beginning of the class, we might have tried to just put each String into a link of a LinkedList. If this were the representation we used, how would we implement add(), remove(), and contains()? Even if we took the time to order our LinkedList, we would still have a linear runtime for each of these operations.

Now we should be wondering how we can speed up these runtimes. Recently, we have learned about structures which might help us in our quest for efficiency. Perhaps we could store our Strings in a BST, where each node contains a single String and uses the String compareTo() method when determining which direction to go. This will offer an improvement, since a well-balanced BST will make add() and contains() considerably faster, but repeatedly using add() and remove() might end up causing the BST to lose its balance. Next, we might consider the use of a HashSet to store our Strings. This method should work reasonably well, under the assumption that String’s hashcode() method causes them to be distributed in a relatively uniform manner (which should occur, given how it is defined). However, there is still a chance that the Strings we are storing will not distribute themselves well throughout our HashSet and our runtime might suffer.

Let us now consider another way to implement this Set of Strings, the Trie. This structure, whose name is short for Retrieval Tree, is designed to give a runtime that is always constant with respect to the number of items contained (even better than our HashSet, since we will not need to worry about a good hashcode or resizing!). This may seem difficult to believe at first, but as we explore the structure of this data structure, we will see that we are simply shifting the runtime dependency to a different aspect of our data, in order to prevent the number of items contained from afecting our runtime.

To create a Trie, we must first recognize a few things about the data type we are working with. Firstly, we know that Strings are comprised of individual characters, which can be extracted. Secondly, we know that there are only so many characters available to choose from (after all, our alphabet does not have infinitely many letters). Given these two facts, we can begin our construction of this data structure.

Suppose we begin with a tree structure where we store a single character at each node. Think about how a single String might be represented; for every letter in given String, we have a node whose child is a node which represents the next letter. This is depicted below, where the word “sam” has been inserted:

Now, how do we insert another word? If this new word is similar to our initial word, this should be easy, since we can just share nodes that we have in common! Once we reach a letter that is not shared between the two words, we can just add another child to the last node that was shared and continue creating nodes for our new word. Below, we insert “sad” into our tree:

However, if we want to insert a String that does not start with the same letter as our current root, it seems that we are out of luck and need to create an entire additional tree. However, if we instead add a dummy node as our root, we can pretend that each String inserted “begins” with this empty node and go from there. This is what our Trie looks like if we use our dummy root idea and insert the String “a”:

From here, we can continue inserting additional Strings, building up our Set. Below, we have inserted the Strings “same” and “sap”:

But wait! How do we know which words are actually contained in our set? After all, if we only think of words as being in the Trie if they end with a node with no children, then we have lost our String “sam”. If we solve this issue by considering each node in the Trie to designate a String, we suddenly have unexpected Strings in our Set (“s” and “sa”). To fix this issue, we add another piece of information to each node: a flag which lets us know whether we have reached the end of a word. Now, our Trie will look something like this:

At this point, if we want to search for Strings which are contained within the Trie, we can simply look for the appropriate node for each letter as we trace through the Trie. If we ever cannot find the next node, we know that the word is not contained. We also know that if we have gone through all of our letters and arrived at a node that exists, we should still check this “flag” to be sure that the word was actually stored in the Trie.

Now that we have covered the basic idea of a Trie used as a Set, think about how we could turn this Set (of Strings) into a Map (which maps a String to another Object) with a simple change.

One question we have not yet answered is how to properly store the children for each of our nodes. If we know the size of our alphabet, it seems reasonable that we can just store an array that is as long as our alphabet in each node, allowing us to enumerate each letter in our alphabet and store a link to the appropriate child. However, you might see why this would be inefficient; most of our array will likely be empty, as most Strings obey some set of “spelling rules” and certain combinations of letters are much more likely than others (think about how many words start with “an” vs how many words start with “qx”). To account for this, we might want to be more clever with our child-tracking in our nodes. Consider using a BST or HashTable as potential ways of keeping track of a node’s children, while considering the pros and cons for each (compare to each other, as well as to our initial array idea).

Worksheet 2: Tries

Now that you have an idea of how Tries are structured, take the time to complete this worksheet exercise. After you’re done, it’s your turn to implement your own Trie from scratch!

Exercise: MyTrieSet

Create a class MyTrieSet that implements the TrieSet61BL interface using a Trie as its core data structure. You must do this in a file named MyTrieSet.java. Your implementation is required to implement all of the methods given in TrieSet61BL except for longestPrefixOf and add. For longestPrefixOf you should throw an UnsupportedOperationException. For add please copy and paste the following code if you choose not to implement it yourself:

@Override public void add(String key) { if (key == null || key.length() < 1) { return; } Node curr = root; for (int i = 0, n = key.length(); i < n; i++) { char c = key.charAt(i); if (!curr.map.containsKey(c)) { curr.map.put(c, new Node(c, false)); } curr = curr.map.get(c); } curr.isKey = true; }

The following resources might prove useful:

You can test your implementation using the TestMyTrieSet class. If you fail tests, we recommend creating a very short main method and using the visualizer, e.g.

public static void main(String[] args) {
        MyTrieSet t = new MyTrieSet();
        t.add("hello");
        t.add("hi");
        t.add("help");
        t.add("zebra");
}

Deliverables

  • Complete the NaivePointSet, KDTree, and TrieSet.
  • Turn in the worksheet to your TA by the end of Monday’s lab.
  • Fill out the week 6 reflection to get the secret word and put it in magic_word.txt.