June | 2016 | bertag.net

I love YAML. I use it in almost all my applications to manage configuration. It is easy to read and write and allows for complex objects as well as arrays/lists of data. In fact, YAML 1.2 is technically a complete superset of JSON, so anything you can do with JSON you can do with YAML. YAML also comes with broad cross-language support. For example, there are some powerful libraries out there for working with YAML in Java, such as SnakeYAML and YamlBeans.

The Old

Several years back, my colleague Charles Draper wrote a library to simplify the process of parsing and working with YAML configurations. Initially this library was part of an internal utility package, but about a year ago we went through and split it out into its own repository. During that process, we debated and nailed down some of the expected behavior of this config library and established a few informal guidelines that governed development:

1. It had to be stupidly convenient to pull discrete pieces of data out of the Java representation of the data.
2. Since we have a number of student developers who come and go every few semesters, the learning curve needed to be minimal if we wanted to have any hope of widespread usage.
3. It had to be able to read from multiple files and merge them in a fashion similar to how CSS works.

I want to talk briefly about that 3rd requirement, since that really stands apart from the general case of “I just want to parse a YAML file.” In several situations, we have found it beneficial to have a cascading configuration loader that is able to parse up a baseline configuration, then read subsequent files that tweak the behavior to work for a specific server or environment.

For example, imagine that an application relies on an external database. In production, we of course want to read/write to the production database, but perhaps on our staging and development servers, we want to work with a test database instead. Nothing too crazy.

So we might have a baseline application.yml file that contains the bulk of our configuration along with some placeholders for database connection values. And on our production and staging servers respectively we create a production.yml and stage.yml file containing the specific connection values for that environment, as follows:

	# application.yml
	---
	database:
	  host: localhost
	  username: myuser
	  password: mypassword

	# other configuration as appropriate...

	# production.yml
	---
	database:
	  host: proddb.example.com
	  password: prodpassword

	# stage.yml
	---
	database:
	  host: stgdb.example.com
	  username: stguser
	  port: 3306

At application startup, then, what we really want to do is to load the baseline application.yml and then substitute the appropriate fields from the system-specific YAML files. The way we have been handling this is by having a master Config object which we loaded one or more YAML files into as data. The Config object provided a long list of access methods to allow data to be pulled out using XPath-type references.

It worked, and we were quite proud of what we built. But we have found something even better!

The New

I recently discovered that the venerable Jackson library has a YAML plugin. This plugin uses SnakeYAML under the hood to parse the YAML data using an ObjectMapper, which means that the output of the parsing operation can be a standard JsonNode or even a targeted POJO bean.

Armed with this plugin, we gutted our Config library this week and rebuilt it to use Jackson as the data parser. We are thrilled with how easy it was to set up and how simple it is to use. Our master Config object is gone altogether; applications now interact directly with the loaded data using the JsonNode or POJO outputs mentioned above. Check out how easy this makes loading and interacting with configuration data:

	# example.yml
	---
	a: alpha
	b: bravo
	c:
	  d: delta

	YamlLoader loader = new YamlLoader();
	Path sourcePath = Paths.get("example.yml");

	//The config library can load data into a generic JsonNode
	JsonNode node = loader.load(sourcePath);

	System.out.println(node.path("a").asText()); //Outputs "alpha"
	System.out.println(node.path("b").asText()); //Outputs "bravo"
	System.out.println(node.path("c").path("d").asText()); //Outputs "delta"


	YamlLoader loader = new YamlLoader();
	Path sourcePath = Paths.get("example.yml");

	//The config library can also load data into a Jackson-annotated POJO
	ExamplePOJO pojo = loader.load(ExamplePOJO.class, sourcePath);

	System.out.println(pojo.a); //Outputs "alpha"
	System.out.println(pojo.b); //Outputs "bravo"
	System.out.println(pojo.c.d); //Outputs "delta"


	public static class ExamplePOJO {

		@JsonProperty String a;
		@JsonProperty String b;
		@JsonProperty Charlie c;

		public static class Charlie {

			@JsonProperty String d;

		}

	}


Stupidly convenient? Check.

Minimal learning curve? Check.

But what about cascaded loading?

Since at the end of the day, we’re dealing with native Jackson objects, we did a little hunting to see if Jackson supported deep merge operations out of the box. As far as we can tell, it does not at the time of this writing. However, there is an open feature request for exactly that, and a number of people have rolled their own merge methods.

We took one of those methods and broke it down to understand exactly how it was working. We then added our own version of this merge logic into the YamlLoader class. In a nutshell, it merges two JsonNode objects A and B using the following logic:

1. If A is a missing node, then simply add B.
2. If either A or B is a simple field, then replace A with B.
3. If either A or B is an array, then replace A with B.
4. If both A and B are complex objects, then recursively call merge on each child element.
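The merge rules above can be sketched without Jackson at all. What follows is only an illustration, not the code inside YamlLoader: it assumes nested `Map<String, Object>` values as a stand-in for JsonNode objects, and the names `DeepMergeSketch` and `merge` are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the merge described above: nested Maps play the
// role of complex objects, Lists play arrays, anything else is a simple field.
final class DeepMergeSketch {

    // Returns a new map holding base overlaid with override; inputs are not modified.
    @SuppressWarnings("unchecked")
    static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> override) {
        Map<String, Object> result = new LinkedHashMap<>(base);
        for (Map.Entry<String, Object> entry : override.entrySet()) {
            Object a = result.get(entry.getKey());
            Object b = entry.getValue();
            if (a instanceof Map && b instanceof Map) {
                // Both sides are complex objects: recurse into the children.
                result.put(entry.getKey(),
                        merge((Map<String, Object>) a, (Map<String, Object>) b));
            } else {
                // A is missing, or either side is a simple field or an array: B wins.
                result.put(entry.getKey(), b);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Object> application = Map.of("database",
                Map.of("host", "localhost", "username", "myuser", "password", "mypassword"));
        Map<String, Object> production = Map.of("database",
                Map.of("host", "proddb.example.com", "password", "prodpassword"));

        // host and password come from production, username survives from application.
        System.out.println(merge(application, production).get("database"));
    }
}
```

Cascading over more than two files is then just a left-to-right fold: merge the first two, merge the result with the third, and so on.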

We have now theoretically satisfied our 3 basic requirements, so let’s see how this would work in our original example. We will instruct the YamlLoader class to load our baseline application.yml and then overwrite some data from the production.yml and stage.yml files:

	YamlLoader loader = new YamlLoader();

	// Cascade load application.yml and production.yml. Values in production.yml should
	// trump values in application.yml.
	ExamplePOJO pojo = loader.load(ExamplePOJO.class,
			Paths.get("application.yml"),
			Paths.get("production.yml"));

	System.out.println(pojo.database.host); //Outputs "proddb.example.com" (production.yml)
	System.out.println(pojo.database.username); //Outputs "myuser" (application.yml)
	System.out.println(pojo.database.password); //Outputs "prodpassword" (production.yml)
	System.out.println(pojo.database.port); //Outputs "null" (referenced by neither file)

	// This final example doesn't really jibe with our hypothetical situation, but it's
	// still interesting from a demo perspective. In this case, stage.yml will trump
	// values from the original application.yml, but will itself be trumped by values
	// in production.yml.
	pojo = loader.load(ExamplePOJO.class,
			Paths.get("application.yml"),
			Paths.get("stage.yml"),
			Paths.get("production.yml"));

	System.out.println(pojo.database.host); //Outputs "proddb.example.com" (production.yml)
	System.out.println(pojo.database.username); //Outputs "stguser" (stage.yml)
	System.out.println(pojo.database.password); //Outputs "prodpassword" (production.yml)
	System.out.println(pojo.database.port); //Outputs "3306" (stage.yml)


	public static class ExamplePOJO {

		@JsonProperty DatabasePOJO database;
		//Other fields as defined in application.yml (outside the scope of this demo)

		public static class DatabasePOJO {

			@JsonProperty String host;
			@JsonProperty String username;
			@JsonProperty String password;
			@JsonProperty Integer port;

		}

	}


Conclusion

We think this is a pretty slick way of managing configurations and will start rolling it out across our applications. We have open sourced it, so if you are interested in using it, please check out the repository and let us know what you think of it in the comments below!

Source: https://bitbucket.org/byuhbll/lib-java-config
Maven: Coming Soon

Author’s note: The statements and views expressed in this article are the author’s own and do not represent the view of Brigham Young University or its sponsors.

Let’s talk about call numbers, officially known as Library Classifications. The idea is good, allowing librarians to group material together by subject. But in practice they are nightmarish to work with programmatically.

Consider the Library of Congress (LC) Classification, used by many academic libraries around the United States. Like any other classification, it is necessarily complex. Not only must it properly handle the billions of creative works already in existence, but it must also allow for the future insertion of an infinitely large number of not-yet-created works. How do you insert an infinite number of works between “A” and “B”? By adding additional sequences of characters: “A1”, “A2”, “A345 .B678”, etc. And over the last 120 years since the LC Classification was invented, librarians have gotten very creative in this endeavor.

Consider also that there is no central canonical database of call numbers. Historically, each library defined their own. Obviously there were guidelines, and in recent decades there has been more standardization as libraries strive to share more of their metadata with each other. And LC itself of course plays a very central role in the classification process; many libraries simply copy LC’s own call number for any given work. But the reality is that call number creation is really still a process centered more in tradition and general guidelines than on hard and fast rules.

For example, Brigham Young University (BYU) holds the world’s most comprehensive academic collection on Mormonism (represented by the LC subject classes BX 8601-8695), and found themselves in need of more granular/discrete subject guidelines than those represented by the official LC subject schedules. So they defined their own schedule within the subject numbers laid out by LC. Multiply that customization by thousands of libraries over multiple generations of librarians, and the result is a very complicated and only semi-standard way of implementing even the most thought-out and defined library classifications.

And this sort of flexibility is not only allowed but encouraged in the library industry. LC’s own 424-page training manual on the LC classification is filled with statements along the lines of “usually but not always”, “cataloger may adjust this at their discretion”, and “but in this situation it’s different.”

Finally, one must remember that library call numbers predate computers and the accompanying character sets that have developed into international standards. There are a number of situations, both by standard and by tradition, in which call numbers must be ordered differently than standard UTF-8 or ASCII-based sorting algorithms would normally order them.

It’s a mess. It may be a necessary mess, but it’s still a mess.

BYU’s Call Number Library

The IT group at BYU’s Harold B. Lee Library (HBLL) is trying to tackle that mess. This week, we are open-sourcing a call number library for Java that tries to solve some of the problems we have encountered working with library classifications: https://bitbucket.org/byuhbll/lib-java-callnumber.

As the principal author of this new library, I wanted to explain what it does and why we think it will be useful for the library community. There are two main problems this library tries to solve:

1. How can we sort call numbers correctly?
2. How can we programmatically parse call numbers and pull out discrete elements from them?

(Oh, and I should point out that while I’ve mostly focused on LC call numbers so far, we wanted the solution to these problems to work for other library classifications as well.)

We started by creating a CallNumber interface. For now, we’ve kept this interface simple; it requires only a single method, sortKey, which returns a non-pretty representation of the call number meant solely for ordering/sorting (fixing problem #1). It is worth noting that this interface extends the Serializable and Comparable interfaces, though we have defined a default implementation of compareTo that is suitable for most implementations. Additionally, there are some contractual requirements outlined in the attached javadocs that set behavioral expectations for implementations:

They should be immutable value objects, similar to String, Integer, and URI. They should override the hashCode and equals methods accordingly.
They should implement a constructor that accepts a single non-null, non-empty String argument (more on this later).
They should override the toString method to return a human-readable form of the call number.
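As a rough picture of that contract, here is a reduced sketch. It is not the library's source: the names `CallNumberSketch` and `RawCallNumber` are hypothetical, and the default compareTo shown here (a plain string comparison of sort keys) is an assumption about what a sensible default might look like.

```java
import java.io.Serializable;
import java.util.Locale;

// Hypothetical reduction of the interface described above.
interface CallNumberSketch extends Serializable, Comparable<CallNumberSketch> {

    // Non-pretty representation used only for ordering.
    String sortKey();

    // Assumed default ordering: compare sort keys as plain strings.
    @Override
    default int compareTo(CallNumberSketch other) {
        return sortKey().compareTo(other.sortKey());
    }
}

// Minimal immutable value object that follows the three expectations listed above.
final class RawCallNumber implements CallNumberSketch {
    private static final long serialVersionUID = 1L;
    private final String raw;

    // Single-String constructor that rejects null and empty input.
    RawCallNumber(String raw) {
        if (raw == null || raw.isEmpty()) {
            throw new IllegalArgumentException("call number must be non-null and non-empty");
        }
        this.raw = raw;
    }

    @Override
    public String sortKey() {
        return raw.toLowerCase(Locale.ROOT);
    }

    // Human-readable form.
    @Override
    public String toString() {
        return raw;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof RawCallNumber && raw.equals(((RawCallNumber) o).raw);
    }

    @Override
    public int hashCode() {
        return raw.hashCode();
    }
}
```

Because ordering is delegated entirely to sortKey, an implementation only has to get its key right for Collections.sort and sorted collections to behave.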

Once we established the common interface, we created two implementations based on the most common library classifications: LCCallNumber (based on the Library of Congress classification described above) and DeweyCallNumber (based on the popular Dewey library classification used by many public and educational libraries). Both of these implementations go beyond the basic requirements of the CallNumber interface and actually try to parse the provided call number string into semantic elements of a call number. This is done in both cases using regular expressions.

I should point out here that the regex used to parse LCCallNumber was originally authored by Bill Dueber of the University of Michigan and released under the licensing terms of Perl. I would encourage any interested readers to check out the work that Bill and a few others have done in a related project focused on parsing and normalizing LC call numbers. We actually looked at contributing our code to that repository but ultimately decided that there were some significant differences in our scope and goals that made it more appropriate to set up a separate project instead.

In addition to the interface-level methods to retrieve both human readable and sortable representations of LC and Dewey call numbers, we provided implementation-specific methods to pull out discrete elements of each call number. Internally, these discrete pieces are used to determine how we need to massage the provided string for it to sort appropriately.

We were very impressed by how well this architecture seemed to solve the problems listed above, and issued an early (internal) release to start working with. Almost immediately, however, we discovered a problem. The HBLL does not use a single classification, and most of our use cases involved working with an arbitrary set of data that could include a mix of LC and Dewey call numbers. We quickly realized that we needed a “best-guess” parser that could iterate through large numbers of call numbers and construct the appropriate CallNumber implementations on the fly. We went back to the proverbial drawing board and wrote the CallNumberParser class. Here’s how it works:

When constructing a CallNumberParser, users list the CallNumber classes that should be considered as valid “targets” for subsequent parsing operations. Order matters, as strings that match the parsing criteria for multiple implementations will be parsed using the class listed earlier in the list. CallNumberParser is immutable and thread-safe, so once created, it can be reused throughout an entire application freely. It’s kind of awesome.

Of course, a string may fail to match any of the provided implementations. We’ve provided a couple different ways to handle this situation based on the needs of the user. We defined a default implementation of CallNumber, UnclassifiedCallNumber, that will always parse any string – even empty and null strings. Users including UnclassifiedCallNumber at the end of their implementation list may rest assured that all values will get parsed. The price of this flexibility is that massaging unclassified data is of course impossible. The toString method will simply return the input string, and sortKey will return a lower-cased form of the same input string. But at an interface level, UnclassifiedCallNumbers behave just like any other CallNumber implementation. They can be checked for equality, compared, converted to human-readable or sortable strings, and so on. Leaving this implementation out of the list will cause unparseable call number candidates to throw an IllegalArgumentException instead.
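The "first matching target wins" behavior, together with the optional catch-all, can be sketched as follows. This is an illustration under stated assumptions, not the real CallNumberParser: every name here (`FirstMatchParserSketch`, `Target`, `LettersFirst`, `DigitsFirst`, `Anything`) is hypothetical, the toy regular expressions are far cruder than real LC or Dewey parsing, and it assumes each target signals a non-match by throwing IllegalArgumentException from its String constructor.

```java
import java.lang.reflect.InvocationTargetException;
import java.util.List;

// Hypothetical sketch: try each target class in order, first success wins.
final class FirstMatchParserSketch {

    interface Target { }

    // Toy LC-like target: capital letters, optional space, then a digit.
    static final class LettersFirst implements Target {
        public LettersFirst(String raw) {
            if (raw == null || !raw.matches("[A-Z]{1,3}\\s*\\d.*")) {
                throw new IllegalArgumentException("not letters-first: " + raw);
            }
        }
    }

    // Toy Dewey-like target: starts with three digits.
    static final class DigitsFirst implements Target {
        public DigitsFirst(String raw) {
            if (raw == null || !raw.matches("\\d{3}.*")) {
                throw new IllegalArgumentException("not digits-first: " + raw);
            }
        }
    }

    // Toy catch-all: accepts anything, including null and empty strings.
    static final class Anything implements Target {
        public Anything(String raw) { }
    }

    private final List<Class<? extends Target>> targets;

    @SafeVarargs
    FirstMatchParserSketch(Class<? extends Target>... targets) {
        this.targets = List.of(targets); // immutable copy, so the parser can be shared
    }

    Target parse(String raw) {
        for (Class<? extends Target> type : targets) {
            try {
                return type.getDeclaredConstructor(String.class).newInstance(raw);
            } catch (InvocationTargetException e) {
                if (!(e.getCause() instanceof IllegalArgumentException)) {
                    throw new IllegalStateException(e.getCause());
                }
                // This target rejected the string; fall through to the next one.
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException("no usable String constructor on " + type, e);
            }
        }
        throw new IllegalArgumentException("no target could parse: " + raw);
    }

    public static void main(String[] args) {
        FirstMatchParserSketch strict =
                new FirstMatchParserSketch(LettersFirst.class, DigitsFirst.class);
        System.out.println(strict.parse("823 R797h").getClass().getSimpleName());
        System.out.println(strict.parse("PZ 4 .R798 H28 1998").getClass().getSimpleName());
    }
}
```

Appending `Anything.class` to the constructor arguments turns the strict parser into one that never throws, which mirrors the role UnclassifiedCallNumber plays above.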

Usage Examples

Basic Usage

Using the call number parser is pretty straightforward. If I had a list of mixed (LC and Dewey) call number strings that I wanted to parse, here’s how to do it:

	// The following call numbers are actually used by BYU to represent
	// the first two Harry Potter books in different collections.
	String deweyHP1 = "823 R797h";
	String lcHP1 = "PZ 4 .R798 H28 1998";
	String deweyHP2 = "823 R797hp 2004";
	String lcHP2 = "PZ 4 .R798 H23 1999";

	//Initialize a CallNumberParser to handle LC and Dewey call numbers.
	CallNumberParser parser = new CallNumberParser(LCCallNumber.class, DeweyCallNumber.class);

	//Iterate through each raw call number string and parse them into CallNumber value objects.
	for(String raw : Arrays.asList(deweyHP1, lcHP1, deweyHP2, lcHP2)) {
		CallNumber callNumber = parser.parse(raw);

		//Output to show that parsing worked.
		System.out.println(callNumber.getClass().getSimpleName());
		System.out.println("\tNormalized: " + callNumber.toString());
		System.out.println("\tOrderable: " + callNumber.sortKey());
	}


Which will output the following:

	DeweyCallNumber
		Normalized: 823 R797h
		Orderable: 000823 r797/h
	LCCallNumber
		Normalized: PZ 4 .R798 H28 1998
		Orderable: pz000004 r798/ h28/ 001998
	DeweyCallNumber
		Normalized: 823 R797hp 2004
		Orderable: 000823 r797/hp 002004
	LCCallNumber
		Normalized: PZ 4 .R798 H23 1999
		Orderable: pz000004 r798/ h23/ 001999


Even Easier Basic Usage

I’m lazy, which is a somewhat desirable trait for a software developer. I realized that in actual usage, all of my CallNumberParser objects kept getting set up to use the same parsing targets. So I added some prebuilt parsers as static final variables within CallNumberParser.

	//Use a prebuilt CallNumberParser that will correctly handle LC call numbers,
	//Dewey call numbers, and the default call numbers created in SirsiDynix Symphony
	//for new items. It will throw an IllegalArgumentException for anything else.
	CallNumber a = CallNumberParser.SYMPHONY_STRICT.parse("PZ 4 .R798 H28 1998");

	//This prebuilt CallNumberParser will handle all the targets described above,
	//but will parse any other values - including null and empty strings - as
	//UnclassifiedCallNumber entities, so it is guaranteed to handle EVERYTHING.
	CallNumber b = CallNumberParser.SYMPHONY_NONSTRICT.parse("PZ 4 .R798 H28 1998");

	//Note that since CallNumbers are value objects, the following is true, even
	//though the two CallNumbers were created separately using different CallNumberParser
	//instances.
	boolean isEqual = a.equals(b); //TRUE


Ordering Call Numbers

Normally, trying to present an ordered list of call numbers is a huge pain. You have to worry about padding numbers, stripping out non-filing characters, etc. But the call number library removes all the pain from the process. For the following example, I will deliberately use 3 call numbers that do not sort correctly as simple String objects. According to our catalogers, the “A88x” should sort before “A888”, but that is a violation of UTF-8 and ASCII-based ordering. The following table shows the difference:

Call Number Order	String Order
BF 637 .C6 A88 vol.24	BF 637 .C6 A88 vol.24
BF 637 .C6 A88x vol.24	BF 637 .C6 A888 vol.24
BF 637 .C6 A888 vol.24	BF 637 .C6 A88x vol.24
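To see why the two columns differ, it helps to look at just the last cutter. The sketch below is a toy, not the library's algorithm: the class `CutterSortSketch` and its methods are hypothetical, and the key format (lower-case, with a "/" closing the digit run) is only a guess modeled on the "Orderable" values printed earlier in this post.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical toy showing naive string order versus a key-based order.
final class CutterSortSketch {

    // One letter, a run of digits, optional trailing letters: "A88", "A88x", "A888".
    private static final Pattern CUTTER = Pattern.compile("([A-Za-z])(\\d+)([A-Za-z]*)");

    // Toy key: "A88x" becomes "a88/x". The '/' sorts before every digit and letter
    // in ASCII, so a suffixed cutter lands before a cutter with more digits.
    static String toyKey(String cutter) {
        Matcher m = CUTTER.matcher(cutter);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a simple cutter: " + cutter);
        }
        return (m.group(1) + m.group(2) + "/" + m.group(3)).toLowerCase(Locale.ROOT);
    }

    // Plain String ordering, as in the right-hand column.
    static List<String> sortNaively(List<String> cutters) {
        List<String> copy = new ArrayList<>(cutters);
        Collections.sort(copy);
        return copy;
    }

    // Ordering by the toy key, as in the left-hand column.
    static List<String> sortByToyKey(List<String> cutters) {
        List<String> copy = new ArrayList<>(cutters);
        copy.sort(Comparator.comparing(CutterSortSketch::toyKey));
        return copy;
    }

    public static void main(String[] args) {
        List<String> cutters = List.of("A888", "A88", "A88x");
        System.out.println(sortNaively(cutters));  // [A88, A888, A88x]
        System.out.println(sortByToyKey(cutters)); // [A88, A88x, A888]
    }
}
```

The point of the toy is only that a carefully chosen key string lets ordinary string comparison produce the cataloger's order.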

Let’s make it go using the call number parser:

	String first = "BF 637 .C6 A88 vol.24";
	String second = "BF 637 .C6 A88x vol.24";
	String third = "BF 637 .C6 A888 vol.24";

	//Initialize a CallNumberParser. Since we know we're only parsing LCCallNumbers, we'll keep it simple.
	CallNumberParser parser = new CallNumberParser(LCCallNumber.class);

	//Parse the 3 call numbers and add them to a list (deliberately out of order).
	List<CallNumber> list = new ArrayList<>();
	list.add(parser.parse(third));
	list.add(parser.parse(first));
	list.add(parser.parse(second));

	//Just to prove that call number sorting works correctly, shuffle the list before sorting it according to its
	//natural order (which we can do easily, since the CallNumber interface extends Comparable).
	Collections.shuffle(list);
	Collections.sort(list);

	for(CallNumber callNumber : list) {
		//Output to show that sorting worked.
		System.out.println(callNumber.getClass().getSimpleName());
		System.out.println("\tNormalized: " + callNumber.toString());
		System.out.println("\tOrderable: " + callNumber.sortKey());
	}


Which will print the three call numbers in the order of the “Call Number Order” column above: the A88 volume first, then A88x, then A888.

Conclusion

We have had great success using this call number library internally. I am thrilled to be able to share it with the larger library community. Included in the repository are over 70 unit tests to verify the correctness of our parsing and sorting algorithms. The first round of unit tests were based on this great tutorial on LC call number sorting by Kent State University’s library. We have since added many more tests based on specific situations we’ve run into here at the HBLL.

Please check out the repository and let me know what you think in the comments below. Also, there are many more library classifications that we have not yet implemented as CallNumber entities. Please feel free to fork the repository and create pull requests expanding the functionality of this library!

Source: https://bitbucket.org/byuhbll/lib-java-callnumber
Maven: Coming Soon

Author’s note: The statements and views expressed in this article are the author’s own and do not represent the view of Brigham Young University or its sponsors.

Header image: “Carlyle Books on Library Shelf” by ParentingPatch, CC-BY-SA 3.0

bertag.net

writings about programming, libraries, space, and other things

Month: June 2016

Java+YAML (Cascading) Configuration Made Simple


Parsing Call Numbers Using BYU’s New Call Number Library
