Math 176 - Advanced Data Structures
Programming Assignment #4
"Make Like a Search Engine"
(Please watch for updates to the programming assignment.)
New Due Date: Thursday, November 30, 4:00PM. (16-hour extension due to problems with telnetting into ieng9.)
This assignment is covered by the usual Academic Integrity guidelines for programming assignments.
For this assignment, you will write the core part of a program which performs searches on text documents. This will include much of the core functionality of search engines like Melvyl or web search engines.
Your program will read in several thousand text files, then create an inverted list of all the (non-common) words that appear in the documents. Your program will then read, from the terminal input, search queries which consist of two words (implicitly combined by "AND"); your program will locate documents that contain one or both words, rank documents by how well they match the search query, and/or print extracts from the text documents where the two words appear close together. You will be supplied with code that reads the files, parses the words, discards common words, and provides an easy way to print extracts from the files. Your program must fully read the documents and create the inverted lists before reading search queries from the terminal. This is because search engines typically preprocess the documents once (perhaps in a time-consuming, expensive process) and then are able to quickly reply to a wide variety of queries.
The programming assignment is split into four stages or parts. It is strongly suggested that you implement the program stage-by-stage, completing one stage before attempting the next. It is recommended, although not required, that you do the stages in the order listed below:
For informational purposes and for comparison with your own program, you are provided with the .class files for a sample implementation of the programming assignment. This is the program MainHw4Demo which can be found in the directory ../public/ProgHomework4. You are provided with source code for some of the helper classes and for a skeletal program which shows how to read words from files with the "WordGrabber" class. The skeletal program is MainHw4.java and it also contains code for reading in search queries and search commands. In addition, there is javadoc documentation provided for the WordGrabber class and the FileIterator class which are used by MainHw4.
The Demo Program and the Functionality Required in Your Programming Solution
You should start by trying out the MainHw4Demo
program to understand the programming assignment, and to see how your program should act.
In general, your program should mimic the behavior of MainHw4Demo very
closely since, for grading, your program's output will be compared with the output of MainHw4Demo
(compared by hand, not by computer).
Step 1: Copy all the .class and .java files from ../public/ProgHomework4. Do not try to copy the directory TextFiles, as it is much too large (about 120 MB of disk space, although less than half of that is actual data).
Step 2: Run the java program MainHw4Demo. When it
asks you for a directory, enter MidiData. This is a collection of 82
files, Aesop's fables in fact. The program will then ask you to enter two search words:
for your first trial, try entering "lion hunters". This means
we wish to search for documents which contain both the words "lion" and
"hunters". (Please be sure to use "hunters", not "hunter"!)
Then enter the command option "d". This means
that we are looking for the documents that contain both of the search words. The program
will tell you that document numbers 10 and 48 contain both words.
Without quitting the program, continue to step #3.
Step 3: Use the same two search words again, but now use command option "r" ("r" is for "rank"). The program lists the 10 highest ranked documents with the words "lion" or "hunters" in them. You will see that document 10 is the highest ranked: it has ten occurrences of the word "lion", two occurrences of the word "hunters", and four pairs of the words "lion" and "hunters" which are close together.
The definition of a pair of word occurrences being close together is that the starting positions of the two occurrences are less than 144 symbols apart. (The number 144 is rather arbitrary, but you must use the same convention in your program.) We do not allow pairs that occur across document boundaries.
The formula for the ranking of a document is as follows: let a be the number of occurrences of the first search word in a given document, let b be the number of occurrences of the second search word, and finally let p be the number of occurrences of pairs of the two words which are less than 144 symbols apart from each other. Then the rank or score of the document is defined to equal
(1+a+10p)*(1+b+10p).
(This formula is not fully optimal and was obtained with a little trial
and error.) For example, in the search you just performed, document 10 is the highest
ranked: it has 10 occurrences of "lion" and 2 of "hunters", with 4
occurrences that are near each other. It has a ranking of 51*43 = 2193. Note that the two
occurrences of "hunters" form four close pairs with occurrences of "lion".
Thus there is duplication or overlap in the counting of close occurrences.
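As a sanity check on the arithmetic, the scoring formula can be written as a one-line helper (a minimal sketch; the class and method names are hypothetical):

```java
// Rank of a document: a = occurrences of the first search word, b = occurrences
// of the second, p = close pairs (starting positions under 144 symbols apart).
public class RankSketch {
    static int rank(int a, int b, int p) {
        return (1 + a + 10 * p) * (1 + b + 10 * p);
    }
    public static void main(String[] args) {
        // Document 10 above: a = 10, b = 2, p = 4 gives 51 * 43.
        System.out.println(rank(10, 2, 4));  // prints 2193
    }
}
```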
Continue to step 4, without quitting the program.
Step 4: Use the same two search words, but now choose
command option "x" to print out extracts. Extracts are excerpts from files where
the two search words appear close to each other in the documents. You should see two
extracts from each of documents 10 and 48. Extracts are printed from up to five files, at
most two extracts per document.
Continue to the next step without quitting the program.
Step 5: A limited wild card search capability is provided, namely, by suffixing a word with "*" you can search for words that start with a given prefix. For example, try entering "lion hunt*" as the two search words. This will search for places where "lion" occurs and where any word beginning with "hunt" occurs. In this case, if you choose the "x" option, you will see that the word "hunting" also occurs a number of times near the word "lion".
One thing to notice on printing extracts is that sometimes the extracts overlap badly. For instance, if the text "Lion ... Lion ... hunting" occurs, as in document #48, the two excerpts will very much duplicate each other. In a commercial search engine, this kind of thing would need to be avoided, but for our purposes, it is not a problem that needs to be fixed.
The MidiData files (and other data files) are accessible for you to read, and you may want to look at them to see the kind of text files you are searching.
The Skeletal MainHw4 Program
A fair amount of material and software is available in the directory ../public/ProgHomework4.
You should look at the program MainHw4.java closely, as much
of the code in the program will be used in your program or adapted for use in your
program. Start by looking at the routine readWordsFromFiles. This routine
uses a WordGrabber object to open files one at a time and to extract
alphabetic words one at a time from the files. (The source code for WordGrabber.java is
available if you wish to examine it.) It first creates a new WordGrabber wg
with a root directory and a file of common words. The root directory is recursively
searched by the WordGrabber for files which have filename ending with .txt. When readWordsFromFiles
calls wg.startNextFile() the next file is opened to be ready for reading.
(startNextFile returns either the file number, or -1 if no more files are left. The file
numbers will be sequential except in the case of read errors.) readWordsFromFiles
then calls wg.posNextWord() and wg.nextWord() to get the
starting position of the next word in the file and the word itself. wg.posNextWord
returns -1 when there is no next word to read. readWordsFromFiles
then prints out the file number, the position of the word and the word. Common words, such
as "the" and "there", and any word of three or fewer letters are
suppressed by the word grabber so as to reduce the amount of work your program has to do.
The file of common words is the second parameter to the WordGrabber constructor.
Next examine the main part of MainHw4. This consists of a loop that reads in two search words and a command 'd', 'r' or 'x'. This loop calls two methods to parse the input lines. You can probably use this code as is, or with minor modifications.
prettyPrint is useful for breaking a long line up into pieces of at most 80 characters for printing. For example, I use this in my demo for the 'd' option, by making a long String of document numbers separated by spaces and then calling prettyPrint to print the string out. I also use this in the demo program to print the file extracts for the 'x' command. This is done with another helper routine, printExtractWithTwoWords, in MainHw4.java; it takes as parameters a file number and the positions of two words in the file, and prints out a two or three line extract from the file containing the two words. There is a method of WordGrabber, called getFileInfo, which returns information about the file, often including its title and author.
printFrequentWords is some old code that I used to create the list of common words. This is left in as an illustration of how to use the Java HashMap class. (See below for more on useful Java classes.) What the printFrequentWords routine does is form a HashMap where the keys are distinct words and the values are the frequency counts of the words. Each time a new word is read, it is looked up in the HashMap and the corresponding count is incremented. If it is not found, then the word is added as a key of the HashMap with a value of 1.
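That counting pattern might look like the following (a minimal sketch with hypothetical names; it uses generics purely for brevity):

```java
import java.util.HashMap;

// Frequency counting with a HashMap: keys are distinct words, values are
// how many times each word has been seen so far.
public class FreqSketch {
    static HashMap<String, Integer> countWords(String[] words) {
        HashMap<String, Integer> counts = new HashMap<>();
        for (String w : words) {
            Integer c = counts.get(w);
            counts.put(w, c == null ? 1 : c + 1);  // add with count 1, or increment
        }
        return counts;
    }
    public static void main(String[] args) {
        HashMap<String, Integer> m =
            countWords(new String[] {"lion", "hunters", "lion", "lion"});
        System.out.println(m.get("lion") + " " + m.get("hunters"));  // prints 3 1
    }
}
```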
What to do for the programming assignment:
You will want to start by writing code that reads in
the files one word at a time and creates inverted lists containing information about where
each word appears in the files. You should probably start with the following kind of
structure: create a HashMap where the keys are the words as read from the files, and the
values are lists or arrays of word occurrence information. (You can make them
ArrayLists, at least to start.) Each list will consist of a sequence of integers
<f1, p1, f2, p2, f3, p3, ..., fn, pn>
This indicates that the word occurs n times: the first occurrence was at character position p1 in file number f1, the second occurrence at position p2 in file number f2, etc. As you read in words from the files, look up each word in the HashMap. If the word is not in the HashMap, make a new list <f1, p1> and add the new key and list pair into the HashMap; alternatively, if the word is already in the HashMap, append fn+1, pn+1 to its list.
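The structure just described might be sketched as follows (a minimal sketch; the class and method names are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;

// Inverted lists: for each word, a flat list <f1, p1, f2, p2, ...> of
// (file number, position) pairs, stored as the HashMap value for that word.
public class InvertedListSketch {
    static HashMap<String, ArrayList<Integer>> lists = new HashMap<>();

    static void addOccurrence(String word, int fileNum, int pos) {
        ArrayList<Integer> list = lists.get(word);
        if (list == null) {            // word not seen before: make a new list
            list = new ArrayList<>();
            lists.put(word, list);
        }
        list.add(fileNum);             // append the pair <f, p>
        list.add(pos);
    }

    public static void main(String[] args) {
        addOccurrence("lion", 10, 37);
        addOccurrence("lion", 10, 212);
        addOccurrence("lion", 48, 5);
        System.out.println(lists.get("lion"));  // prints [10, 37, 10, 212, 48, 5]
    }
}
```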
Once you are able to create these lists, you can then start coding the functionality of the 'd', the 'r' and the 'x' commands. (It is strongly suggested you implement them in this order, one at a time.) The basic technique is that the two search words correspond to inverted lists. Then you walk through the inverted lists, incrementing your position in the lists one at a time, always incrementing the list in which the next position is the earliest. For the 'd' option, you need only keep track of which document numbers appear in both lists. For debugging purposes, you may find it handy to write a routine that prints out the lists of occurrences of the two search words.
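The two-list walk for the 'd' option might be sketched like this (a minimal sketch with hypothetical names; each inverted list is the flat <f1, p1, f2, p2, ...> form described above, with entries in file order):

```java
import java.util.ArrayList;

// Two-list walk for the 'd' command: always advance the list whose next
// occurrence comes earliest, recording documents that appear in both lists.
public class DocMatchSketch {
    static ArrayList<Integer> docsWithBoth(int[] la, int[] lb) {
        ArrayList<Integer> docs = new ArrayList<>();
        int i = 0, j = 0;
        while (i < la.length && j < lb.length) {
            if (la[i] == lb[j]) {                 // same document in both lists
                if (docs.isEmpty() || docs.get(docs.size() - 1) != la[i])
                    docs.add(la[i]);              // record it once
                // advance whichever occurrence comes earlier in the file
                if (la[i + 1] <= lb[j + 1]) i += 2; else j += 2;
            } else if (la[i] < lb[j]) {
                i += 2;                           // first list is behind: advance it
            } else {
                j += 2;
            }
        }
        return docs;
    }
    public static void main(String[] args) {
        int[] lion    = {10, 37, 10, 212, 48, 5};
        int[] hunters = {10, 90, 48, 300, 60, 4};
        System.out.println(docsWithBoth(lion, hunters));  // prints [10, 48]
    }
}
```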
For the 'r' command, you must keep track of how many times each search word occurs in each document, and how often pairs of the two words occur less than 144 symbols apart from each other. When calculating the distance between occurrences, you should use the starting position of each word (otherwise, you will differ from the way the demo program calculates numbers of pairs of close occurrences).
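Counting close pairs might be sketched like this (hypothetical names; note that every qualifying pair counts, so overlapping pairs are counted with the duplication described earlier):

```java
// Count close pairs for the 'r' command: every (occurrence of word 1,
// occurrence of word 2) pair in the same document whose starting positions
// are less than 144 symbols apart counts toward p.
public class ClosePairSketch {
    static int closePairs(int[] la, int[] lb, int doc) {
        int p = 0;
        for (int i = 0; i < la.length; i += 2) {
            if (la[i] != doc) continue;                  // restrict to one document
            for (int j = 0; j < lb.length; j += 2) {
                if (lb[j] == doc && Math.abs(la[i + 1] - lb[j + 1]) < 144) p++;
            }
        }
        return p;
    }
    public static void main(String[] args) {
        int[] a = {10, 100, 10, 150};
        int[] b = {10, 120, 10, 400};
        // Pairs (100,120) and (150,120) are close; both pairs with 400 are not.
        System.out.println(closePairs(a, b, 10));  // prints 2
    }
}
```

A real implementation would walk the two lists in tandem as described above rather than use a double loop, but the counting convention is the same.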
For the 'x' command, you collect the same information as for the 'r' command, but you also remember, for each document, the two close pairs of occurrences which are the closest together.
For wild card (prefix) matches, you will need to find the lists for each word that matches the prefix test, and walk through them appropriately.
List incrementing algorithms. As said above, you walk through the inverted lists, incrementing your position in the lists one at a time, always incrementing the list in which the next position is the earliest. A more detailed description is available.
(I earlier suggested a more complicated way of walking through the two lists, in the first version of this posted assignment. I now suggest you use the way above instead, and I will change the MainHw4Demo program to use the method suggested above. Functionally, the two methods are almost always equivalent, but they differ slightly in the way close pairs are detected.)
Debugging: You should start with a small number of files for testing purposes. This is why I supplied the MicroData and MiniData and MidiData test sets. Change the value of textRoot in MainHw4 to control which files are read. You can also limit the maximum number of files read (to as low as 1 if you wish) using the variable maxNumFiles as shown by example in the readWordsFromFiles sample method.
For debugging the creation of your inverted lists, you may want to write a simple routine that prints the contents of one or more lists to the terminal.
Algorithms/Code Design
There are many possible ways to write your code. But here I will
outline how I wrote the code. You may follow my outline, or create your own
algorithms if you wish. Quite possibly someone will improve on what I suggest.
Reading in the words and creating the inverted lists. In my first attempt, I read in the files and created the HashMap as outlined above, with the keys of the HashMap being the words, and the values being ArrayLists. These ArrayLists had Integer entries that came in pairs: the first Integer was a file number and the second was a position in the file. To read in the 4218 files (5,000,000 words, 87,000 distinct words, 3,500,000 non-common word occurrences) took about 13 minutes on my PC and used about 128 MB of memory. If you try this on the ieng9 machine, you will not be able to do more than about 1000 files. (This would be acceptable however, and net you nearly full credit.)
In my second attempt, I used the HashMap to assign a unique integer to each distinct word. I stored the frequency counts for each word in an array. I then re-opened the WordGrabber, allocated int arrays of the right length for each word, and stored the positions and file numbers of each word occurrence as integers. In fact, I packed both integer values into a single int, using the low-order 13 bits for the file number and the rest for a position. (A better strategy might have been to use an array of short's.) Coding up this kind of thing is optional, but if you do it, you should catch overflow conditions and generate an exception if necessary. This second attempt reduced the run time by about 40% and reduced the heap memory usage to a bit less than 32 MB.
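The packing trick might be sketched as follows (hypothetical names; the 13-bit file field matches the description above, and out-of-range values generate an exception as suggested):

```java
// Pack (file number, position) into one int: the low-order 13 bits hold the
// file number, the remaining bits the position. 13 bits allow file numbers
// up to 8191; the remaining 18 bits (keeping the int nonnegative) allow
// positions up to 262143, so overflow must be checked.
public class PackSketch {
    static int pack(int fileNum, int pos) {
        if (fileNum < 0 || fileNum >= (1 << 13))
            throw new IllegalArgumentException("file number overflow: " + fileNum);
        if (pos < 0 || pos >= (1 << 18))
            throw new IllegalArgumentException("position overflow: " + pos);
        return (pos << 13) | fileNum;
    }
    static int fileOf(int packed) { return packed & ((1 << 13) - 1); }
    static int posOf(int packed)  { return packed >>> 13; }

    public static void main(String[] args) {
        int packed = pack(4217, 123456);
        System.out.println(fileOf(packed) + " " + posOf(packed));  // prints 4217 123456
    }
}
```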
For the 'd' command, I just formed an array of ints and walked through each inverted list once. For each document number in the first list, I set the low order bit of the corresponding array int to 1. For each document in the second list, I set the second bit of the corresponding array int to 1. The array int's which ended up equal to 3, indicated documents that contained both words. (This is clearly not the very best implementation, but it works just fine.)
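That flag-array idea might be sketched like this (a minimal sketch with hypothetical names):

```java
// Flag array for the 'd' command: bit 0 marks documents appearing in the
// first word's inverted list, bit 1 those in the second's; a value of 3
// marks documents that contain both words.
public class FlagSketch {
    static void flag(int[] flags, int[] list, int bit) {
        for (int i = 0; i < list.length; i += 2) flags[list[i]] |= bit;
    }
    public static void main(String[] args) {
        int[] flags = new int[100];
        flag(flags, new int[] {10, 37, 48, 5}, 1);  // first word's documents
        flag(flags, new int[] {10, 90, 60, 4}, 2);  // second word's documents
        for (int d = 0; d < flags.length; d++)
            if (flags[d] == 3) System.out.println(d);  // prints 10
    }
}
```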
Part of the implementation of the 'd' command was a
"ListTraverser" class. This class includes methods for
incrementing positions, getting the current position, getting the next document position,
comparing which of two lists has the earlier current entry, and deciding which of
two lists should be incremented next.
These methods were also very handy for the 'r' and 'x' commands.
For the 'r' command, I made three arrays with counters for (a) the
number of occurrences of the first search word, (b) the number of occurrences of the second
word, and (c) the number of occurrences of close pairs of words. I walked through the
inverted lists, incrementing these array values as appropriate. Then, I allocated a
new array of document ranks, and looped through the documents calculating the ranks by the
formula given above:
(1+a+10p)*(1+b+10p).
Then I called Arrays.sort() to sort the documents by rank. (There
are several things I could have done to be more efficient: namely, I could have dispensed
with the first three arrays, and just directly calculated the three counters and the ranks
for each document as I encountered it. In addition, I could have kept the
information for only the best 10 documents rather than making so many large arrays.
However, the efficiency seemed quite adequate as I first coded it, especially since the
number of files, 4218, is fairly small, so I left it as described.)
For the 'x' command, I added to the functionality of the 'r' command by keeping, in addition, for each document, information about the two closest pairs of occurrences of the search words. For each close pair of occurrences, I needed to remember the positions of both words.
For wild card/prefix searches: In the code that read in the files,
I added in a stage that read the HashMap keys (use the HashMap.keySet()
method to get a Set view of the keys) and created an array of the
less than 88,000 distinct words. I then sorted the words with Arrays.sort().
For prefix searches, I used Arrays.binarySearch() to find the
beginning of the words that begin with the prefix, then searched linearly through the
sorted list of words starting at that point, to get all the appropriate inverted
lists. (It would have probably been fine to have just searched the words in
non-sorted order too, as there are only 87,000 words.)
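The binary-search-then-scan approach might be sketched as follows (a minimal sketch with hypothetical names; Arrays.binarySearch returns a negative value encoding the insertion point when the key itself is absent):

```java
import java.util.Arrays;

// Prefix matching over a sorted word array: binary search for where words
// with the prefix begin, then scan linearly while the prefix still matches.
public class PrefixSketch {
    static String[] matchPrefix(String[] sortedWords, String prefix) {
        int i = Arrays.binarySearch(sortedWords, prefix);
        if (i < 0) i = -i - 1;          // insertion point: first word >= prefix
        int start = i;
        while (i < sortedWords.length && sortedWords[i].startsWith(prefix)) i++;
        return Arrays.copyOfRange(sortedWords, start, i);
    }
    public static void main(String[] args) {
        String[] words = {"hunger", "hunt", "hunted", "hunters", "hunting", "lion"};
        System.out.println(Arrays.toString(matchPrefix(words, "hunt")));
        // prints [hunt, hunted, hunters, hunting]
    }
}
```

Each matching word's inverted list would then be fetched from the HashMap and walked as for an ordinary search.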
I also found it convenient to write a "MultiListTraverser"
class that encapsulated multiple inverted lists.
Java Resources. You will want to
use any of Sun's built-in Java data structures you can. The most useful can be found
in the java.util package. These include the HashSet,
HashMap, TreeMap, and TreeSet classes. They also include
the ArrayList class: an implementation of the resizable array (very
similar to the Vector class, which you may wish to use instead).
The class Arrays includes helpful routines for sorting and for
binary searching.
Documentation for these classes can be found on line at http://javasoft.com/j2se/1.3/docs/api/index.html.
Source code is available for these classes on ieng9 as previously announced.
Turn in: Turn in two items: README and MainHw4.java. Any helper classes for your code must be made inner classes, so that only one source file is needed.
MainHw4.java must be set up to read from the MegaData files. If there is a limit on how many files your program can read, please include a stopping condition in your code and stop reading files when the limit is reached. Also, document the limit in your README file.
The README file should explain how much of the above you successfully implemented. It should also explain how many files it runs on. It should explain the general idea of how you implemented the code, and particularly any significant differences between the implementation described above and your actual implementation. You should also include some sample search words that your program works well with. If your program fails our tests, we can still check your suggested sample words for functionality.
The bundleP4 program will be made available to turn in your program and README file.
Grading standards: The inclusion of the wild card feature was a last minute decision, and should be considered at least partly as an extra credit item. If you do everything except the wild card features, then this qualifies as at least an "A minus" grade. (Quite possibly I will be more generous than this.) Approximate partial standards are: 'd' command worth about 35 points, 'r' command worth about 35 points, 'x' command worth about 25 points, style worth about 10 points, wild card functionality worth about 15 points. Total points: 120 points.