Parsing Call Numbers Using BYU’s New Call Number Library

Let’s talk about call numbers, officially known as Library Classifications. The idea is good, allowing librarians to group material together by subject. But in practice they are nightmarish to work with programmatically.

Consider the Library of Congress (LC) Classification, used by many academic libraries around the United States. Like any other classification, it is necessarily complex. Not only must it properly handle the billions of creative works already in existence, but it must also allow for the future insertion of an infinitely large number of not-yet-created works. How do you insert an infinite number of works between “A” and “B”? By adding additional sequences of characters: “A1”, “A2”, “A345 .B678”, etc. And over the last 120 years since the LC Classification was invented, librarians have gotten very creative in this endeavor.

Consider also that there is no central canonical database of call numbers. Historically, each library defined its own. Obviously there were guidelines, and in recent decades there has been more standardization as libraries strive to share more of their metadata with each other. And LC itself of course plays a very central role in the classification process; many libraries simply copy LC’s own call number for any given work. But the reality is that call number creation is still rooted more in tradition and general guidelines than in hard-and-fast rules.

For example, Brigham Young University (BYU) holds the world’s most comprehensive academic collection on Mormonism (represented by the LC subject classes BX 8601-8695), and found themselves in need of more granular/discrete subject guidelines than those represented by the official LC subject schedules. So they defined their own schedule within the subject numbers laid out by LC. Multiply that customization by thousands of libraries over multiple generations of librarians, and the result is a very complicated and only semi-standard way of implementing even the most thought-out and defined library classifications.

And this sort of flexibility is not only allowed but encouraged in the library industry. LC’s own 424-page training manual on the LC classification is full of statements like “usually but not always”, “cataloger may adjust this at their discretion”, and “but in this situation it’s different.”

Finally, one must remember that library call numbers predate computers and the accompanying character sets that have developed into international standards. There are a number of situations, both by standard and by tradition, in which call numbers must be ordered differently than standard UTF-8 or ASCII-based sorting algorithms would normally order them.

It’s a mess. It may be a necessary mess, but it’s still a mess.

BYU’s Call Number Library

The IT group at BYU’s Harold B. Lee Library (HBLL) is trying to tackle that mess. This week, we are open-sourcing a call number library for Java that tries to solve some of the problems we have encountered working with library classifications: https://bitbucket.org/byuhbll/lib-java-callnumber.

As the principal author of this new library, I wanted to explain what it does and why we think it will be useful for the library community. There are two main problems this library tries to solve:

  1. How can we sort call numbers correctly?
  2. How can we programmatically parse call numbers and pull out discrete elements from them?

(Oh, and I should point out that while I’ve mostly focused on LC call numbers so far, we wanted the solution to these problems to work for other library classifications as well.)

We started by creating a CallNumber interface. For now, we’ve kept this interface simple; it requires only a single method, sortKey, which returns a non-pretty representation of the call number meant solely for ordering/sorting (fixing problem #1). It is worth noting that this interface extends the Serializable and Comparable interfaces, though we have defined a default implementation of compareTo that is suitable for most implementations. Additionally, the Javadoc lays out some contractual requirements that set behavioral expectations for implementations:

  • They should be immutable value objects, similar to String, Integer, and URI. They should override the hashCode and equals methods accordingly.
  • They should implement a constructor that accepts a single non-null, non-empty String argument (more on this later).
  • They should override the toString method to return a human-readable form of the call number.
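
To make that contract concrete, here is a minimal sketch of what such an interface could look like. The names SketchCallNumber and SimpleCallNumber are made up for illustration, and comparing sort keys inside the default compareTo is my assumption about a reasonable default, not a copy of the library's source.

```java
import java.io.Serializable;
import java.util.Locale;

// Hypothetical sketch of the interface shape described above; SketchCallNumber
// and SimpleCallNumber are illustrative names, not the library's own types.
interface SketchCallNumber extends Serializable, Comparable<SketchCallNumber> {

    // Non-pretty representation meant solely for ordering.
    String sortKey();

    // Assumed default ordering: compare the sort keys as plain strings.
    @Override
    default int compareTo(SketchCallNumber other) {
        return sortKey().compareTo(other.sortKey());
    }
}

// A minimal immutable value object that follows the three expectations listed above.
final class SimpleCallNumber implements SketchCallNumber {
    private static final long serialVersionUID = 1L;
    private final String raw;

    // Single-String constructor that rejects null and empty input.
    SimpleCallNumber(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            throw new IllegalArgumentException("call number must be non-null and non-empty");
        }
        this.raw = raw.trim();
    }

    @Override
    public String sortKey() {
        return raw.toLowerCase(Locale.ROOT);
    }

    // Human-readable form.
    @Override
    public String toString() {
        return raw;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof SimpleCallNumber && ((SimpleCallNumber) o).raw.equals(raw);
    }

    @Override
    public int hashCode() {
        return raw.hashCode();
    }
}
```

Because the ordering lives in a default method, an implementation like this one only has to produce a good sort key to become sortable.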

Once we established the common interface, we created two implementations based on the most common library classifications: LCCallNumber (based on the Library of Congress classification described above) and DeweyCallNumber (based on the popular Dewey library classification used by many public and educational libraries). Both of these implementations go beyond the basic requirements of the CallNumber interface and actually try to parse the provided call number string into the semantic elements of a call number. This is done in both cases using regular expressions.

I should point out here that the regex used to parse LCCallNumber was originally authored by Bill Dueber of the University of Michigan and released under the licensing terms of Perl. I would encourage any interested readers to check out the work that Bill and a few others have done in a related project focused on parsing and normalizing LC call numbers. We actually looked at contributing our code to that repository but ultimately decided that there were some significant differences in our scope and goals that made it more appropriate to set up a separate project instead.

In addition to the interface-level methods to retrieve both human readable and sortable representations of LC and Dewey call numbers, we provided implementation-specific methods to pull out discrete elements of each call number. Internally, these discrete pieces are used to determine how we need to massage the provided string for it to sort appropriately.
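
As a rough illustration of the idea (parse into pieces, then rebuild a sortable string from those pieces), here is a deliberately simplified sketch. The pattern below is my own toy regex, not the one used by LCCallNumber, and it only understands class letters, a whole class number, cutters, and a four-digit year. The class name SimpleLcSketch and its sort key format are likewise assumptions and will not match the library's real output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a simplified pattern, NOT the regex used by LCCallNumber.
final class SimpleLcSketch {
    // Groups: 1 = class letters, 2 = class number, 3 = run of cutters, 4 = year.
    private static final Pattern LC = Pattern.compile(
        "^\\s*([A-Za-z]{1,3})\\s*(\\d+)\\s*((?:\\.?\\s*[A-Za-z]\\d+\\s*)*)(\\d{4})?\\s*$");
    private static final Pattern CUTTER = Pattern.compile("[A-Za-z]\\d+");

    final String classLetters;
    final int classNumber;
    final List<String> cutters = new ArrayList<>();
    final String year; // null when the call number has no year

    SimpleLcSketch(String raw) {
        Matcher m = LC.matcher(raw);
        if (!m.matches()) {
            throw new IllegalArgumentException("Not recognized: " + raw);
        }
        classLetters = m.group(1).toUpperCase(Locale.ROOT);
        classNumber = Integer.parseInt(m.group(2));
        Matcher c = CUTTER.matcher(m.group(3));
        while (c.find()) {
            cutters.add(c.group());
        }
        year = m.group(4);
    }

    // Zero-pads the class number so that 4 sorts before 10, and lower-cases the rest.
    String sortKey() {
        StringBuilder key = new StringBuilder(classLetters.toLowerCase(Locale.ROOT));
        key.append(String.format("%06d", classNumber));
        for (String cutter : cutters) {
            key.append(' ').append(cutter.toLowerCase(Locale.ROOT));
        }
        if (year != null) {
            key.append(' ').append(year);
        }
        return key.toString();
    }
}
```

With this sketch, "PZ 4 .R798 H28 1998" gets a key that sorts before the key for "PZ 10 .R798 H28 1998", even though the raw strings compare the other way around because '1' comes before '4'.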

We were very impressed by how well this architecture seemed to solve the problems listed above, and issued an early (internal) release to start working with. Almost immediately, however, we discovered a problem. The HBLL does not use a single classification, and most of our use cases involved working with an arbitrary set of data that could include a mix of LC and Dewey call numbers. We quickly realized that we needed a “best-guess” parser that could iterate through large numbers of call numbers and construct the appropriate CallNumber implementations on the fly. We went back to the proverbial drawing board and wrote the CallNumberParser class. Here’s how it works:

When constructing a CallNumberParser, users list the CallNumber classes that should be considered as valid “targets” for subsequent parsing operations. Order matters, as strings that match the parsing criteria for multiple implementations will be parsed using the class listed earlier in the list. CallNumberParser is immutable and thread-safe, so once created, it can be reused throughout an entire application freely. It’s kind of awesome.
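
The "first listed target wins" behavior can be sketched in a few lines. This is a hypothetical stand-in, not the actual CallNumberParser: the class FirstMatchParser, its function-based targets, and the toy recognizers are all assumptions made for illustration, and it returns a simple label instead of a CallNumber to stay short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical stand-in for a first-match parser: tries each target in order
// and returns the result of the first one that accepts the string.
final class FirstMatchParser {
    private final List<Function<String, Optional<String>>> targets;

    @SafeVarargs
    FirstMatchParser(Function<String, Optional<String>>... targets) {
        // A defensive, unmodifiable copy keeps the parser immutable after construction.
        this.targets = Collections.unmodifiableList(new ArrayList<>(Arrays.asList(targets)));
    }

    String parse(String raw) {
        for (Function<String, Optional<String>> target : targets) {
            Optional<String> parsed = target.apply(raw);
            if (parsed.isPresent()) {
                return parsed.get();
            }
        }
        throw new IllegalArgumentException("No target matched: " + raw);
    }

    // Toy recognizers, far looser than real LC or Dewey parsing.
    static Optional<String> lc(String raw) {
        return raw.matches("[A-Z]{1,3}\\s*\\d+.*") ? Optional.of("LC") : Optional.empty();
    }

    static Optional<String> dewey(String raw) {
        return raw.matches("\\d{3}.*") ? Optional.of("Dewey") : Optional.empty();
    }

    static Optional<String> anything(String raw) {
        return Optional.of("Other");
    }
}
```

Listing a permissive target such as anything before lc means "PZ 4" comes back as "Other"; listing it after lc gives "LC". That is why the most specific targets should come first.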

Of course, a string may fail to match any of the provided implementations. We’ve provided a couple different ways to handle this situation based on the needs of the user. We defined a default implementation of CallNumber, UnclassifiedCallNumber, that will always parse any string – even empty and null strings. Users including UnclassifiedCallNumber at the end of their implementation list may rest assured that all values will get parsed. The price of this flexibility is that massaging unclassified data is of course impossible. The toString method will simply return the input string, and sortKey will return a lower-cased form of the same input string. But at an interface level, UnclassifiedCallNumbers behave just like any other CallNumber implementation. They can be checked for equality, compared, converted to human-readable or sortable strings, and so on. Leaving this implementation out of the list will cause unparseable call number candidates to throw an IllegalArgumentException instead.
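
For a feel of how little a catch-all implementation needs, here is a small sketch modeled on the description above. FallbackCallNumber is a hypothetical name, and treating null as the empty string is an assumption of this sketch; the article only says that null and empty strings are accepted.

```java
import java.util.Locale;
import java.util.Objects;

// Hypothetical fallback modeled on the description of UnclassifiedCallNumber.
// Mapping null to "" is an assumption made for this sketch.
final class FallbackCallNumber implements Comparable<FallbackCallNumber> {
    private final String raw;

    // Accepts any input, including null and empty strings.
    FallbackCallNumber(String raw) {
        this.raw = (raw == null) ? "" : raw;
    }

    // No massaging is possible, so the key is just the lower-cased input.
    String sortKey() {
        return raw.toLowerCase(Locale.ROOT);
    }

    // Returns the input string unchanged.
    @Override
    public String toString() {
        return raw;
    }

    @Override
    public int compareTo(FallbackCallNumber other) {
        return sortKey().compareTo(other.sortKey());
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof FallbackCallNumber && raw.equals(((FallbackCallNumber) o).raw);
    }

    @Override
    public int hashCode() {
        return Objects.hash(raw);
    }
}
```

Placed last in a target list, something like this guarantees a result for every input, at the cost of a sort key that is only as good as the raw string.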

Usage Examples

Basic Usage

Using the call number parser is pretty straightforward. If I had a list of mixed (LC and Dewey) call number strings that I wanted to parse, here’s how to do it:

// The following call numbers are actually used by BYU to represent
// the first two Harry Potter books in different collections.
String deweyHP1 = "823 R797h";
String lcHP1 = "PZ 4 .R798 H28 1998";
String deweyHP2 = "823 R797hp 2004";
String lcHP2 = "PZ 4 .R798 H23 1999";
// Initialize a CallNumberParser to handle LC and Dewey call numbers.
CallNumberParser parser = new CallNumberParser(LCCallNumber.class, DeweyCallNumber.class);
// Iterate through each raw call number string and parse it into a CallNumber value object.
for (String raw : Arrays.asList(deweyHP1, lcHP1, deweyHP2, lcHP2)) {
    CallNumber callNumber = parser.parse(raw);
    // Output to show that parsing worked.
    System.out.println(callNumber.getClass().getSimpleName());
    System.out.println("\tNormalized: " + callNumber.toString());
    System.out.println("\tOrderable: " + callNumber.sortKey());
}

Which will output the following:

DeweyCallNumber
    Normalized: 823 R797h
    Orderable: 000823 r797/h
LCCallNumber
    Normalized: PZ 4 .R798 H28 1998
    Orderable: pz000004 r798/ h28/ 001998
DeweyCallNumber
    Normalized: 823 R797hp 2004
    Orderable: 000823 r797/hp 002004
LCCallNumber
    Normalized: PZ 4 .R798 H23 1999
    Orderable: pz000004 r798/ h23/ 001999

Even Easier Basic Usage

I’m lazy, which is a somewhat desirable trait for a software developer. I realized that in actual usage, all of my CallNumberParser objects kept getting set up to use the same parsing targets. So I added some prebuilt parsers as static final variables within CallNumberParser.

//Use a prebuilt CallNumberParser that will correctly handle LC call numbers,
//Dewey call numbers, and the default call numbers created in SirsiDynix Symphony
//for new items. It will throw an IllegalArgumentException for anything else.
CallNumber a = CallNumberParser.SYMPHONY_STRICT.parse("PZ 4 .R798 H28 1998");
//This prebuilt CallNumberParser will handle all the targets described above,
//but will parse any other values - including null and empty strings - as
//UnclassifiedCallNumber entities, so it is guaranteed to handle EVERYTHING.
CallNumber b = CallNumberParser.SYMPHONY_NONSTRICT.parse("PZ 4 .R798 H28 1998");
//Note that since CallNumbers are value objects, the following is true, even
//though the two CallNumbers were created separately using different CallNumberParser
//instances.
boolean isEqual = a.equals(b); //TRUE

Ordering Call Numbers

Normally, trying to present an ordered list of call numbers is a huge pain. You have to worry about padding numbers, stripping out non-filing characters, etc. But the call number library removes all the pain from the process. For the following example, I will deliberately use three call numbers that do not sort correctly as simple String objects. According to our catalogers, “A88x” should sort before “A888”, but UTF-8 and ASCII-based ordering puts “A888” first. The following table shows the difference:

Call Number Order         | String Order
BF 637 .C6 A88 vol.24     | BF 637 .C6 A88 vol.24
BF 637 .C6 A88x vol.24    | BF 637 .C6 A888 vol.24
BF 637 .C6 A888 vol.24    | BF 637 .C6 A88x vol.24
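
The “String Order” column is easy to reproduce with nothing but the JDK. The helper below (StringOrderDemo is just an illustrative name) sorts the raw strings by their natural order, which for these ASCII-only values is plain character-code order: after the shared prefix “BF 637 .C6 A88”, a space sorts before “8”, and “8” sorts before “x”.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative helper showing what naive String sorting does to call numbers.
final class StringOrderDemo {
    // Returns a sorted copy using String's natural ordering; the input is left untouched.
    static List<String> sortAsStrings(List<String> raw) {
        List<String> copy = new ArrayList<>(raw);
        Collections.sort(copy);
        return copy;
    }
}
```

Feeding it the three call numbers in shelf order returns them with “A888” ahead of “A88x”, which is exactly the mismatch the table shows.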

Let’s make it go using the call number parser:

String first = "BF 637 .C6 A88 vol.24";
String second = "BF 637 .C6 A88x vol.24";
String third = "BF 637 .C6 A888 vol.24";
// Initialize a CallNumberParser. Since we know we're only parsing LCCallNumbers, we'll keep it simple.
CallNumberParser parser = new CallNumberParser(LCCallNumber.class);
// Parse the 3 call numbers and add them to a list (deliberately out of order).
List<CallNumber> list = new ArrayList<>();
list.add(parser.parse(third));
list.add(parser.parse(first));
list.add(parser.parse(second));
// Just to prove that call number sorting works correctly, shuffle the list before sorting it according to its
// natural order (which we can do easily, since the CallNumber interface extends Comparable).
Collections.shuffle(list);
Collections.sort(list);
for (CallNumber callNumber : list) {
    // Output to show that sorting worked.
    System.out.println(callNumber.getClass().getSimpleName());
    System.out.println("\tNormalized: " + callNumber.toString());
    System.out.println("\tOrderable: " + callNumber.sortKey());
}

Which will print the three call numbers as LCCallNumber entries in call number order: “BF 637 .C6 A88 vol.24” first, then “BF 637 .C6 A88x vol.24”, then “BF 637 .C6 A888 vol.24”.

Conclusion

We have had great success using this call number library internally. I am thrilled to be able to share it with the larger library community. Included in the repository are over 70 unit tests to verify the correctness of our parsing and sorting algorithms. The first round of unit tests was based on a great tutorial on LC call number sorting by Kent State University’s library. We have since added many more tests based on specific situations we’ve run into here at the HBLL.

Please check out the repository and let me know what you think in the comments below. Also, there are many more library classifications that we have not yet implemented as CallNumber entities. Please feel free to fork the repository and create pull requests expanding the functionality of this library!

 

Author’s note: The statements and views expressed in this article are the author’s own and do not represent the view of Brigham Young University or its sponsors.

Header image: “Carlyle Books on Library Shelf” by ParentingPatch, CC-BY-SA 3.0
