Latest News + Hitchhiker Blog

Tales of Migration, Part 12
In which We Wrestle with Data

This is the twelfth in a series covering the library system migration of the Bartlett Library at the Museum of International Folk Art and our partner library at the Laboratory of Anthropology, aka “Museum Hill Libraries.”

Since our libraries are moving from a non-MARC system into MARC, the data mapping section of the migration is the most treacherous and challenging. We delivered our data in late February. Now (as I write this) it’s spring and we’ve had several conversations with our data specialist.

What are we talking about? Unlike most libraries (headed from one MARC system to another), we have to come up with rules to remap every field in every record into MARC. Often the correspondence isn’t one-to-one, meaning we have to make up rules to determine how the data will be split out. For example: if our main entry field has a comma in it, then it probably contains an author name. Otherwise it’s probably a title. Sometimes, though, that rule will be wrong.
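To make the comma rule concrete, here is a minimal sketch in Python. It is not our vendor's actual script. The function name is made up for illustration, and the target tags (100 for a personal-name main entry, 245 for a title) are my assumption about where each kind of value would most likely land.

```python
def classify_main_entry(value: str) -> tuple[str, str]:
    """Guess whether a legacy main entry holds an author or a title.

    Rule from the post: a comma probably means "Surname, Forename".
    The MARC tags returned here are assumed targets, not a confirmed mapping.
    """
    if "," in value:
        return ("author", "100")  # assumed: personal-name main entry
    return ("title", "245")       # assumed: title statement


print(classify_main_entry("Boyd, E."))                       # ('author', '100')
print(classify_main_entry("Popular arts of Spanish New Mexico"))  # ('title', '245')
# A title that happens to contain a comma is misfiled as an author:
print(classify_main_entry("Santos, statues and sculpture"))  # ('author', '100')
```

The last line is the "sometimes that rule will be wrong" case: the rule is cheap and right most of the time, and the misses go on the list for manual cleanup.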

Even if you are making the simpler move from one MARC system to another, you still have some wonky data issues to figure out. Yours may have to do with things like serials data (notoriously hard to move), or circulation history. The art of working all this out goes like this:

• Recognize where your data looks weird
• Devise a cunning yet simple if-then rule to try to fix it
• Realize that there will still be some errors you’ll have to fix later

Not all vendors will apply your cunning rules, but we’re lucky to have one who will. Most of our most important problems, like subject headings that got mashed together with no separator, can be fixed through scripting. The rule: if an entry in the subject field has a capital letter with no space preceding it, then split the data at that point and create two separate subject headings. The same rule applies to multiple authors, which were also smashed together with no separator when they were extracted from the old system.

And here is where I learned how much I love my data expert, who quickly realized she needed to adjust the script so it does NOT apply this rule to names that start with “Mc” or “Mac” or “O’”, where the lack of a space before a capital letter is correct. I would have missed that completely and had a mess to clean up later.
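For the curious, the splitting rule and its exceptions might look something like the sketch below. This is my own illustration in Python, not the vendor's script, and the names in it are hypothetical. I have also narrowed the rule slightly as an assumption: it only splits where a capital letter directly follows a lowercase letter or an apostrophe, so that strings of initials or acronyms are left alone.

```python
import re

# Assumed boundary: a capital letter right after a lowercase letter or apostrophe.
BOUNDARY = re.compile(r"(?<=[a-z'’])(?=[A-Z])")

# Name prefixes where a capital with no preceding space is correct.
PREFIXES = ("Mc", "Mac", "O'", "O’")


def split_mashed(value: str) -> list[str]:
    """Split entries that were run together with no separator."""
    parts = []
    start = 0  # where the current entry begins
    for match in BOUNDARY.finditer(value):
        i = match.start()
        # The word containing this boundary starts after the last space,
        # or at the last split point, whichever comes later.
        word_start = max(value.rfind(" ", 0, i) + 1, start)
        if value[word_start:i] in PREFIXES:
            continue  # "McDonald", "MacArthur", "O'Brien": do not split
        parts.append(value[start:i])
        start = i
    parts.append(value[start:])
    return parts


print(split_mashed("Folk artWeaving"))
# ['Folk art', 'Weaving']
print(split_mashed("McDonald, MaryO'Brien, Sean"))
# ['McDonald, Mary', "O'Brien, Sean"]
```

Names like “DeWitt” or “LaFarge” would still be split incorrectly by this sketch, which is exactly the kind of leftover error that ends up on the manual cleanup list.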

We’re also taking advantage of migration to fix some fields that have been used inconsistently over time. One set of identifying numbers, for example, appears in a bunch of different formats, making it impossible to sort reports by that field. We came up with simple rules to put the numbers into the same format, making the sort possible.

Our data has turned out to be so complex that we’ve been reassigned from the very good data specialist we had in our first meeting to the most experienced data expert our vendor has. She is an artist in recognizing patterns in the data, seeing where things might go wrong, and figuring out how to fix as much as possible without manual intervention. I haven’t always been so lucky with migration data people, and this makes a huge difference. With a good data specialist the process of moving your data is creative and collaborative, and you can wind up with cleaner, more reliable information than you had coming in.

By the end of the process you should also have a good idea of what can’t be cleaned by the vendor, and where you need to spend your time fixing things manually. I’ll be spending the next few months catching the diacritics that came out of our old system as gobbledygook. Not much fun, but I know what I need to do, and that’s half the battle.
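If you face a similar cleanup, a small script can at least build the to-do list. The sketch below rests on a big assumption, which may not match our data or yours: that the gobbledygook comes from UTF-8 text that was read as Windows-1252 somewhere along the way. It only flags suspect values for a human to check and does not change anything.

```python
def looks_like_mojibake(text: str) -> bool:
    """Flag text whose diacritics were probably garbled in extraction.

    Assumption: the damage is UTF-8 misread as Windows-1252. If turning
    the text back into Windows-1252 bytes yields valid UTF-8 that differs
    from the input, the value was probably garbled.
    """
    if "\ufffd" in text:  # the Unicode replacement character
        return True
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except UnicodeError:
        return False  # the round trip failed, so the text is probably fine
    return repaired != text


print(looks_like_mojibake("SÃ¡nchez"))  # True
print(looks_like_mojibake("Sánchez"))   # False
print(looks_like_mojibake("Sanchez"))   # False
```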