One of my favorite aphorisms is “Time flies, whether you’re having fun or not.” I’m not sure where I heard it, but for sure I’m not creative enough to make it up on my own. The truth of it has been reinforced by the realization that here it is the end of January, post-Boston Midwinter, and I’ve done so little blogging for the past six months that it’s a stretch to call myself a blogger. Time to reclaim the turf. So this post is an attempt to summarize what I’ve been doing all that time, some of which has come to a sort of fruition, but some still ripening.

Last fall I participated as a speaker in a NISO Webinar “Bibliographic Control Alphabet Soup.” I decided for my topic to talk about some of the issues around building the RDA vocabularies from spreadsheets and ERDs (Entity Relationship Diagrams), which is what I had to work with on that task. (You can see the ERDs on the RDAOnline website.) Part of my reason for trying to tackle those issues in the webinar is that the vocabularies had become a major focus of my working life for quite a while (does the word ‘obsessive’ sound too dramatic?). At the time, we (Jon Phipps, Karen Coyle, Gordon Dunsire and I) were also trying to write an article about what we’d done with the RDA elements and vocabularies, and that article came out last week in D-Lib Magazine. Starting the article last fall prompted me to create some diagrams in an attempt to convey the structure of these vocabularies, and to provide examples for folks to look at while they puzzle through the ideas. I used a subset of the diagrams in the webinar as well, but most of them are used to better purpose in the article. I’m not at all sure that the several hundred folks listening in to the webinar got much of what I was trying to convey; it’s pretty new stuff for most people, and I was trying to fit too much into my limit of 20 minutes (not nearly enough time). I hope that for those who might have been overwhelmed or confused by the webinar, the article will help to make the work we’ve done a bit clearer.

The writers of the article had a number of purposes in mind, not least to document the decisions and rationale for the strategies taken by the DCMI/RDA Task Group charged with the work of building the vocabularies. Given that the library community generally has little experience with RDF and RDF vocabularies, it seemed particularly important to attempt to provide explanations that we hoped would be accessible to most librarians, and I hope we’ll have sufficient feedback to determine how close we came to that goal. Given that we expected the article to come out just after Midwinter, we did most of our presentations in Boston on the implications of the vocabularies rather than the mechanics. I talked about these at the Technical Services ‘Big Heads’ meeting (slides) and Jon, Karen and I included some of these expectations in our introduction to application profiles at CC:DA on Monday (slides).

Just before Midwinter we started a dialogue with the JSC about next steps, and I hope that some of the issues that have come up will be open to public discussion, just as the vocabulary building was done in public. We’re all in a fairly intense learning space at the moment (at least it seems so to me), and keeping the process open and visible for all seems to support that learning. We’re also continuing to update the vocabularies as errors are brought to our attention, and to complete portions where we need to add information. One such point is the property/subproperty hierarchies in the element sets: at the time the work was done there was a limitation in the Registry software that prevented us from including the proper hierarchies in both the general and FRBR-bounded portions of the vocabularies. That limitation is now removed, and the missing relationships will be added.

Interestingly, we’re getting lots of help in finding errors from our friends in Germany, who are doing nifty things with the vocabularies. They’ve been particularly helpful with things like inadvertent spaces introduced in URIs and other things difficult for a human proofreader to find. Because we don’t have good ways (yet) to visualize the relationships, their help has been invaluable. We urge anyone else who spots an error or who has a question to use our feedback button in the Registry to communicate with us about their concern.

By Diane Hillmann, January 27, 2010, 12:16 pm (UTC-5)

A few weeks ago I attended the opening of an amber exhibition at our wonderful Museum of the Earth which is only about 6 miles from my house. The exhibit had a little of everything: science, history, geography … and jewelry. I have to admit (and this will surprise no one who knows me) that the jewelry was a big draw, and I went laden (literally), with a varied selection of my own collection of amber. Hey, laugh if you will, but these days I work at home, and have very few opportunities to wear jewelry of any kind—so this opening was irresistible.

But, enough about jewelry, I want to talk about bugs and bibs! As you might expect in a science museum, there was far more emphasis on amber as a carrier (so to speak) of bits and pieces of the past, particularly the biological past. As a preservation medium, amber is hard to beat, though, of course, there are limitations in terms of the size of the biological specimen. I didn’t realize it, but apparently fake amber is everywhere, and one way to recognize the bio fakes is that they include specimens too big to be slowed down by sticky tree sap. The exhibit had some nice fakes, including a small snake in plastic colored to look like amber.

The interest of the scientist in amber is that it stops the process of decay for those creatures lucky (or unlucky) enough to be captured in its grasp. The amber captures a moment in a bug’s short life in a way that allows us to examine it closely and in detail in our own time, millions of years later. In much the same way, the Study of the North American MARC Records Marketplace by R2 Consulting captures a moment in time, very likely too late to have much of an effect on the future, but just in time to capture the state of the cataloging world before the tsunami arrives. [R2]

But the R2 report is as fascinating to a metadata maven as a bug in amber is to a biologist. It describes in detail the current world of cataloging distribution, focusing on the “dysfunctional market” that has grown like Topsy around distribution of MARC records. It gets exactly right the disconnect between the librarian sense that “records want to be free” and the business approach that production costs must be recouped and profit margins maintained for there to be any point in participation at all, and comes down predictably in support for the latter view.

It’s a fascinating read, particularly if the fact that LC commissioned the report is kept in mind—because this is hardly a context-free analysis. I was particularly interested in the description of the businesses outside of libraries supplying MARC records either as contractors or as part of a materials supply chain. As a former denizen of one of the large academics that R2 identifies as part of the “green tier” (more about that later), I was aware of the fact of that portion of the MARC marketplace, but had little contact with it.

The gist of the report is that the MARC distribution network is a dysfunctional hybrid, partly librarianishly “free” and part commercial marketplace. The authors feel that it should be possible to increase the supply of MARC records from “the community” without relying on poor beleaguered LC to supply them, and they give us a multitude of statistics to support that assertion. They believe that there’s enough time to accomplish this and save everybody money before the promised changes come to pass, and all must be re-thought.

My comments on this report, informed by my well-known biases, fall into a few convenient categories:

Dysfunctional? Probably …

Much of the first portion of the report is devoted to a description of the current “marketplace” and a discussion of the survey results that illuminate and inform the description. It’s here that R2 makes the case that LC is subsidizing the whole shebang, to the benefit of everyone else.

“Both libraries and vendors (at least the good ones) rely on “service” to their respective clienteles to distinguish themselves, but there are important distinctions in their respective definitions of the term. In the commercial world, service must exist within a context of profitability, in which all costs are covered and some additional increment is contributed to the company’s continued growth and as a return on the capital initially invested. The library service ethic is much more open‐ended and less directly constrained by costs.”

The report contains much interesting description of what the authors perceive as the bifurcated market, one which, in their view, inhibits the growth of useful marketplace incentives to increase output:

“This tension ‐‐ between community values and commercial values, between idealism and pragmatism, between social responsibility and private benefit – has deeply affected some aspects of the library market. Cataloging, regarded by many as the heart of librarianship, is one of those areas.”

It’s pretty clear where the authors come down in this conflict between “community” and “commercial” values:

“The impulse to share records for which the costs have not been fully recovered may make sense as a form of community good, but is not sustainable without some form of subsidy or exchange. From the commercial viewpoint, it’s simply bad business.”

And, perhaps more to the point:

“It should not go unnoticed that LC itself provides open access to its MARC records via multiple channels. The prevalence of open databases is a key factor in the economic confusion that plagues the MARC Record Market …”

The report goes on to a rather interesting and revealing categorization of the complex MARC marketplace into three tiers. The “Green Tier” includes the “ … oldest, most traditional segment of the market, in which nearly all MARC records originate.” This tier includes both libraries and businesses, as well as OCLC, and is, as such, a mix of the “community” and “commercial” as described earlier. The big thing is that they’re contributors to the marketplace, even if also consumers. According to R2’s statistics, this tier includes 97% of academic libraries, 63% of public libraries and a similar proportion of school libraries.

The next tier down (and it’s clearly down, in this categorization) is called the “Blue” or “opportunistic” tier, including by the authors’ definition “ … non-OCLC libraries and underfunded libraries without adequate cataloging capacity.” More interestingly, this tier “ … is also home to open database providers, and the pervasive (did they mean to say “pernicious”?) Z39.50 protocols used to locate and obtain MARC records free of charge.” But R2 makes note of the shifting borders between tiers: “Both in Canada and in the US, historically ‘green libraries’ are adopting ‘blue tier’ practices and expectation, as library budgets are cut and as Z39.50 targets proliferate. Nearly all libraries, regardless of size or type are strategically patient, periodically re-searching the ‘blue tier’ for certain records to become ‘available’; but for ‘blue tier’ libraries, this is the primary approach to cataloging … Open Access and Open Archives Initiatives reside in the blue tier, strongly supported by the basic philosophical stance that access to information should be free.”

The “bottom” tier is the non-library “purple” tier, and this description clearly defines the real threat to the current MARC world, not just the fuzzy-wuzzy library community notion of sharing: “The non-library (purple) tier operates to a large extent without appreciation for or experience with MARC records, and without much regard for the library market in general. It is important to remain aware of activity in this segment, of course, because developments here pose the most significant competitive threats to the traditional values and economic structures of the ‘traditional green tier,’ and even the ‘opportunistic blue tier.’ This is the place where newer technologies and non-MARC data formats are used and developed.”

Obviously, we have met the enemy of libraries, and according to R2 it happens to be us. But wait, there are some unexpected companions in the nasty “purple” tier. In addition to the usual suspects, like Google and Amazon, we find … “OCLC pro-actively operates within the “traditional green tier” and within the “purple non-library” tier. OCLC member libraries, however, are also very active in the “opportunistic blue tier,” sharing records in ways that may conflict with OCLC’s proprietary intent.”

The battle lines seem clearly drawn here, with the “information wants to be free” crowd clearly the enemy, whether in sheep’s clothing as traditional librarians or explicitly displaying wolfish teeth as a member of that unappreciative crowd that cares little about the current MARC marketplace and would like to see the library data silo dismantled brick by brick. No matter that we seek these changes for the benefit of libraries struggling to live within their budgets and to innovate to serve their users as well–shame, shame!

The R2 Solution

The report’s authors actually manage to ask THE most relevant question that should be (and often is) on our minds, but only to dismiss it as out of scope:

“The practice of cataloging has never before faced the level of scrutiny it now enjoys … or endures. Two types of question predominate. First, are traditional cataloging and the MARC record—even after modernization by RDA and FRBR—still necessary in an era of full‐text indexing, OpenURL linking, and other discovery options? While this is a worthy question, it is fortunately not within the purview of this report.”

Leaving aside the odd assumption that RDA and FRBR represent the “modernization” of the traditional MARC record, they couch the issue only in the context of a limited number of technologies, never mentioning the gorilla in the room, the data being built by others outside our comfy and bounded silo. Then they go on to pose the questions they would rather address:

“How do we as a profession understand and explain the costs and benefits of producing and distributing cataloging records? Where and by whom are most original records produced? What incentives exist to stimulate production? What are the barriers that discourage production? How does the library market assign value to the work of cataloging? What is the return on any organization’s investment in producing original catalog records? How does shared cataloging and free or low‐cost distribution of records affect the market? To what degree is market activity subsidized by LC and by the work of individual libraries?”

The problem is that, without an answer to question #1, the other questions seem hardly relevant.

“As noted there, the market is in need of adjustment, if it is to create an incentive for producers while retaining the community ethic of free sharing of data. The ethic of the cooperative can only be sustained if the full costs of production are borne by the community.”

It seems to me that the market will be adjusted, and the recognition of the full costs of traditional cataloging and the plunging ROI as we address Question #1 will hasten that readjustment, but probably not in the direction R2 predicts or that those seeking compensation for their MARC record production might want.

The authors provide some telling glimpses into their world view in their discussion about crosswalks:

“ONIX to MARC record translations and fully operable MARC to non‐MARC metadata crosswalks could dramatically alter this three‐tiered landscape. To date, major players in the blue and purple tiers have failed to buy into the concept of shared bibliographic and authority data. While some efforts to encourage cross‐market cooperation are underway (notably the OCLC/NISO forum), fierce competition flourishes within and between each tier of the market. Even more problematic, each tier has distinctly different needs and incentives, making it difficult to establish an adequate degree of shared urgency and/or investment in new solutions.” [RIN]

Clearly, in a world where the only relevant data one can see “out there” is ONIX, crosswalks seem a no-brainer, but to call this view “limited” seems far too kind.

Ultimately, R2 thinks we still have time to tweak the marketplace and flog out more MARC records by identifying and marshaling unused capacity (e.g., hidden catalogers) and providing economic incentives. In my view, this is a flawed argument, and takes away from the need to plan for the transition to a much different future. I agree that MARC will indeed be used by libraries for some time, but as a lossy exchange format, not the lynchpin of the library data world. R2’s strategy prolongs the old world, jeopardizing the possibilities of moving forward in a timely manner.

The Sacred Cow Effect

Sadly, the whole report, interesting though it is as a biological specimen, fails utterly to examine the data activity outside libraries except to demonize it and its proponents. In making the Library of Congress into Poor Nell, they also deny the innovations in creating and reusing data that LC itself has accomplished, for instance, the American Memory Project, the LC Flickr Project, and many other digital initiatives that have proactively (and openly) pushed the metadata envelope in ways that inspire and engage us. The report fails also to understand that the changes they fear, the ones that they rightly expect to undermine the current marketplace completely, are already nibbling ravenously around the edges of MARC and its traditional marketplace in ways that will hardly take the 5-10 years to make change become real that R2 predicts.

Last summer at ALA in Chicago, a small group of us pulled together a linked data program, hearteningly well attended, where Eric Miller persuasively predicted that the return on investment for integrating “free” metadata from “the cloud” will trump traditional concerns about quality. [Miller] Mainstream entities like the New York Times are moving aggressively into the linked data space, seeking to merge their data with the likes of DBpedia and FreeBase. [Sandhaus]

Consider this from MMA partner Jon Phipps: “The future cataloging marketplace will have to compete with ‘free and more than good enough’. Like the people who initially sneered at Google for being too simplistic and ignoring metadata when it came to searching, the professional cataloging community ignores (or tries to fend off) the enormous future output of Linked-Data-enabled systems at its peril. By opening up a clear relationship between the semantic web and library data sets, the RDA vocabularies represent a threat to the hegemony of catalogers. The RDA vocabularies are a disruptive, game-changing technology.” [Phipps]

The reality is that it’s not just the marketplace that’s changing, it’s also the profession. As part of the analysis of why the numbers of catalogers reported in their survey don’t lead to the expected output levels, R2 speculates that “These data lead us to ask what catalogers are doing. Bob Wolven and others suggest that catalogers are being called upon to apply their knowledge of cataloging principles to new initiatives; and specifically to creating metadata for digital and archival collections.” [Wolven] R2 seems to imply that this is a bad thing, taking away resources from the business of actually churning out MARC records, but certainly these newer roles are critical to the survival and renewal of libraries, far more than shoring up current MARC record production.

The solutions the R2 report poses, from paying more attention to recouping cataloging costs to re-centralizing creation of cataloging records, if taken up, would actively undermine a transition to participation in a more open, linked data world. They represent a step backward, in a community that has already internalized the values of sharing and decentralized data critical to seeing value in the world of openly accessible data lying on our doorstep.

Oddly enough, the report ends with a quote from my old friend Sherman Clarke (unattributed, so most likely as a comment to the survey):

“We collectively need to have a model that allows us to do some of the building of BIBCO records mechanically or through accretion of metadata from institutional records or other record loads. OCLC already does considerable building of the master record from incoming records; what we need is something more like the metadata that is becoming usual in NewGen environments. If someone adds a tag or review or picture, that becomes available in the master cluster. Not a BIBCO record, but a BIBCO cloud of metadata for a particular manifestation of a work/expression.”

Yup, you got it, Sherman. The change we need is not really about records, or catalogers; it’s a new way to think about information and added value.

[Miller] Miller, Eric. “Linked Data and Libraries: Grassroots Program: From Legacy Data to Linked Data, Preparing Libraries for Web 3.0.” Available at: zepheira.com/talks/ala-em-lod.pdf

[R2] Study (for the Library of Congress) of the North American MARC Records Marketplace, October 2009, R2 Consulting LLC, Ruth Fischer, Rick Lugg. Available at: www.loc.gov/bibliographic-future/news/MARC_Record_Marketplace_2009-10.pdf

[RIN] Research Information Network. (2009). Creating catalogues: bibliographic records in a networked world. Available at: www.rin.ac.uk/files/creating_catalogues_REPORT_June09.pdf

[Sandhaus] Sandhaus, Evan. “150 Years of Semantic Technology.” Presentation at the Cornell University Libraries Metadata Working Group Forum, Nov. 13, 2009. Slides will be available from: metadata-wg.mannlib.cornell.edu/forum/index.php?date=2009-11-13

[Wolven] Wolven, Robert. (2008). In search of a new model: Columbia University Libraries: Robert Wolven reflects on what’s next for cooperative cataloging. netConnect, 1/15/2008. Available at: www.libraryjournal.com/article/CA6514925.htm

By Diane Hillmann, November 24, 2009, 11:04 pm (UTC-5)

…in the RDA Ontologies. Do we? After all, they’re a big part of the ‘Access’ in Resource Description and Access (RDA). But they’re not particularly semantically meaningful, especially if you have the component parts available. An Access Point is just a structured string. For instance, a ‘Publication Statement’ Access Point for “The Daytona daily news” might look like:

“Daytona Beach, Florida : Geo. F. Crouch, 1903-1926”

It has a formal syntactic structure, and semantics derived from adherence to that structure when the string is created:

“Place of Publication” : “Publisher’s Name”, “Date of Publication”

Note that the punctuation is part of the formal grammar that helps parse a grammatically correct statement into its semantically meaningful constituent parts.
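
To make the role of the punctuation concrete, here is a minimal sketch in Python of parsing such a statement back into its parts. The regular expression and the function name are my own assumptions for illustration, not anything RDA specifies, and a real grammar would need to handle many more cases than this one sample.

```python
import re

# Hypothetical pattern for the "Place : Publisher, Date" grammar shown above.
# The place is everything before " : "; the date is the text after the last ", ".
PUBLICATION_STATEMENT = re.compile(
    r"^(?P<place>.+?) : (?P<publisher>.+), (?P<date>[^,]+)$"
)


def parse_publication_statement(statement: str) -> dict:
    """Split a publication statement into its parts using its punctuation."""
    match = PUBLICATION_STATEMENT.match(statement)
    if match is None:
        raise ValueError(f"not a well-formed publication statement: {statement!r}")
    return match.groupdict()


parts = parse_publication_statement(
    "Daytona Beach, Florida : Geo. F. Crouch, 1903-1926"
)
```

Note that the comma inside the place name is only handled correctly because the grammar says the place ends at the colon; change the punctuation and the meaning of the pieces changes with it.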

And this is the way we’ve been doing things forever (well it seems like forever) — semantics is derived from proper use of a syntax that everybody who is creating and using shared data has agreed upon in advance. And as long as everybody uses precisely the same syntax this works great. It works really, really well with structured syntaxes like MARC21:

260 $a Daytona Beach, Florida
260 $b Geo. F. Crouch
260 $c 1903-1926

…and hierarchical syntaxes like XML:

<publicationStatement>
  <placeOfPublication>Daytona Beach, Florida</placeOfPublication>
  <publisherName>Geo. F. Crouch</publisherName>
  <dateOfPublication>1903-1926</dateOfPublication>
</publicationStatement>
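
Going the other direction is just as mechanical. As a rough sketch (the element names come from the sample above; the function name is an assumption of mine), a few lines of Python can rebuild the Access Point string from the structured form:

```python
import xml.etree.ElementTree as ET

# A well-formed copy of the sample record, kept inline so the sketch is self-contained.
RECORD = """
<publicationStatement>
  <placeOfPublication>Daytona Beach, Florida</placeOfPublication>
  <publisherName>Geo. F. Crouch</publisherName>
  <dateOfPublication>1903-1926</dateOfPublication>
</publicationStatement>
"""


def access_point_from_xml(xml_text: str) -> str:
    """Assemble the 'Place : Publisher, Date' string from the XML sub-elements."""
    root = ET.fromstring(xml_text.strip())
    place = root.findtext("placeOfPublication")
    publisher = root.findtext("publisherName")
    date = root.findtext("dateOfPublication")
    return f"{place} : {publisher}, {date}"


statement = access_point_from_xml(RECORD)
```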

The RDA documents go so far as to call an Access Point an Element and its constituent parts Sub-Elements, again clearly thinking of this nice syntactically defined semantics.

But what if your data says, semantically, that “Place of Publication” isn’t the Name of the place or the Label for the place, but a URI that identifies the Place itself; a resource rather than a string. The Access Point rules don’t let you stick a URI in the Publication Statement where a string is supposed to be.

What about “Publisher’s Name”? That’s clearly going to be a string no matter what — names tend not to be resources. But there’s probably a Publisher resource out there, somewhere, with a URI that identifies the Publisher and probably has a Name or a Label property that provides a string that you can stick in the Publication Statement.

We’ll just ignore “Publication Date” for now, since that’s a very different can of worms: slimy, smelly worms.

At the moment, RDA doesn’t acknowledge the existence of the resources supplying the strings for an Access Point, and it lets substantial ambiguity sneak in with property names like “Place of Publication” rather than “Place of Publication Name”, the way it was resolved with “Publisher’s Name”. But that ambiguity didn’t exist when all the data was strings: strings you used for indexing, and displayed to the user, and didn’t have to go fetch from somewhere because they were right there in that 260 field.

I listened to a radio program that referred to all of the money that everyone in the world has available to invest as “The Global Pool of Money” and I think that applies quite nicely to the Linked-Data notion of the Semantic Web — “The Global Pool of Data”.

The open world model of the Semantic Web assumes that you will never have all of the available data that describes a resource, and the RDF data model supports this. Resources often exist, outside of traditional library data, available from the Global Pool of Data, that can supply the necessary labels.

But of course we usually just have the labels. This is library data made for cards that need to be put in the correct order and read by a person. And the Global Pool of Data usually just has a bunch of resources. This is linked data, in no particular order at all, meant to be read by a machine.

So, specifying Access Points as pre-coordinated strings actually provides us with a major opportunity when defining the ontologies; several opportunities actually:

  • We can formalize each Access Point specification into what Dublin Core calls a Syntax Encoding Scheme (SES) and say that each Access Point has a datatype.
  • We can clarify the semantics of using a label rather than a resource for properties (sub-elements) like “Publisher’s Name”
  • We can clarify the semantics of using a resource for “Place of Publication” and say that the label used in an Access Point must be the Name of the Place and this is distinctly different.

So, refined to use properties that are a bit more semantically clear, we have a slightly modified Publication Statement:

“Place of Publication Name” : “Publisher’s Name”, “Date of Publication”

…we tie these properties specifically to a FRBR Manifestation that RDA says must be what they describe, and in RDF the supporting ontology looks like:

rda:placeOfPublicationManifestation a owl:ObjectProperty
rda:PlaceOfPublicationName a owl:DatatypeProperty
rda:publisherManifestation a owl:ObjectProperty
rda:PublisherName a owl:DatatypeProperty

Here’s our sample instance again (by the way, this data is from a real linked data resource):

“Daytona Beach, Florida : Geo. F. Crouch, 1903-1926”

<http://chroniclingamerica.loc.gov/lccn/sn93063916>
  rda:publisherManifestation <http://???> (blank, but we know one must exist)
    rda:PublisherName "Geo. F. Crouch"
  rda:placeOfPublicationManifestation <http://dbpedia.org/resource/Daytona_Beach%2C_Florida>
    rda:PlaceOfPublicationName "Daytona Beach, Florida"
    rdfs:label "Daytona Beach, Florida"

This tiny chunk of data was gathered by hand and mapped (by me) from the existing resources and the labels supplied by those resources.
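
For anyone who wants to poke at this, here is a minimal sketch of the same data as plain subject/predicate/object tuples in Python, with the Access Point string derived by following each resource to its name. The `http://example.org/rda/` namespace and the blank-node identifier are placeholders I made up for illustration (the real property URIs aren't given here), and the date is passed in by hand since we've set dates aside.

```python
# Hypothetical namespace standing in for the RDA properties used above.
RDA = "http://example.org/rda/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

MANIFESTATION = "http://chroniclingamerica.loc.gov/lccn/sn93063916"
PLACE = "http://dbpedia.org/resource/Daytona_Beach%2C_Florida"
PUBLISHER = "_:publisher"  # placeholder: the publisher resource is not yet known

TRIPLES = [
    (MANIFESTATION, RDA + "publisherManifestation", PUBLISHER),
    (PUBLISHER, RDA + "PublisherName", "Geo. F. Crouch"),
    (MANIFESTATION, RDA + "placeOfPublicationManifestation", PLACE),
    (PLACE, RDA + "PlaceOfPublicationName", "Daytona Beach, Florida"),
    (PLACE, RDFS_LABEL, "Daytona Beach, Florida"),
]


def value_of(subject, predicate):
    """Return the first object found for a subject/predicate pair, or None."""
    for s, p, o in TRIPLES:
        if s == subject and p == predicate:
            return o
    return None


def publication_statement(manifestation, date):
    """Build the Access Point string from the names of the linked resources."""
    place = value_of(manifestation, RDA + "placeOfPublicationManifestation")
    publisher = value_of(manifestation, RDA + "publisherManifestation")
    place_name = value_of(place, RDA + "PlaceOfPublicationName")
    publisher_name = value_of(publisher, RDA + "PublisherName")
    return f"{place_name} : {publisher_name}, {date}"


statement = publication_statement(MANIFESTATION, "1903-1926")
```

The point of the exercise: the string is now something you compute from resources, not the only thing you have.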

Someday there will be services that comb through linked data looking for missing data like that Publisher resource, search the Global Pool of Data for resources with labels matching that library data (expressed in RDA/RDF), fill in the missing pieces, and present the lucky cataloger, and ultimately the user, with all that rich linked data.

The eXtensible Catalog project is working on services that do just that kind of thing, so someday may not be too far off.
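
As a toy illustration of that kind of label matching (and emphatically not a description of how the eXtensible Catalog services work), here is a sketch in Python. The in-memory `POOL` dictionary stands in for the Global Pool of Data, and the publisher URI in it is invented for the example.

```python
# Hypothetical stand-in for the "Global Pool of Data": resource URI -> label.
# The publisher URI is made up; the DBpedia URI is the one used above.
POOL = {
    "http://example.org/publisher/geo-f-crouch": "Geo. F. Crouch",
    "http://dbpedia.org/resource/Daytona_Beach%2C_Florida": "Daytona Beach, Florida",
}


def normalize(label: str) -> str:
    """Fold case and collapse runs of whitespace so near-identical labels match."""
    return " ".join(label.casefold().split())


def find_resources(label: str, pool: dict) -> list:
    """Return the URIs of every resource in the pool whose label matches."""
    wanted = normalize(label)
    return [uri for uri, candidate in pool.items() if normalize(candidate) == wanted]


candidates = find_resources("geo. f.  crouch", POOL)
```

Exact matching on normalized labels is the simplest thing that could work; a real service would presumably have to cope with variant forms of a name and with several resources sharing one label.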


By Jon, November 3, 2009, 10:42 am (UTC-5)

Last week I was in the UK, primarily to attend a DCMI Registry Community Workshop organized by UKOLN, scheduled for Friday, July 24th. Early the following week we found out that as we were gathered in York discussing distributed registries, Rachel Heery passed away after a long battle with breast cancer. Rachel was one of the founders of the Community (then called a working group), and was involved in building a number of registries, including the DCMI Registry and the IEMSR Registry at UKOLN.

There have been a lot of postings about Rachel this week from colleagues and friends, and I wanted to add my voice to that chorus of tributes to an exceptional person. I didn’t know Rachel as well as I would have liked; we worked on different continents and generally crossed paths primarily at DC conferences. But we were both members of two distinct minorities within DCMI: women, and implementers. Neither of us trained as technologists, and we both came to the sometimes dauntingly technical discussions at DCMI from the point of view of those trying to use DC for real projects, too often frustrated with the 50,000 foot viewpoints expressed by the more technically astute.

Stu Weibel, who knew Rachel, as I did, in the context of Dublin Core, brought her back for me the most strongly by reminding me of one of Rachel’s characteristic interjections:

“We emulate those we admire, and I have often found myself over the years using a phrase that signaled, from Rachel, an objection worthy of discussion… a sort of lilting “Hang on…!” Those who have worked with her will hear echoes of the tone and inflection that made the phrase hers, and commanded respectful attention, a flag that something was not quite right. I always think of her when I say it, and will always try to use it in the service of the honest brokerage of common goals that characterized Rachel’s efforts.”

Stu also reminds us that Rachel’s two most important contributions to the DC efforts (aside from her considerable intellect and personal presence) were in the areas of registries and application profiles–both have been a particular focus for me and Jon over the past four years or so. It gives me some solace to think that she would be pleased that implementers are still working hard in the areas she pioneered, though sad beyond measure that she will not see those efforts bear their promised fruit.

Others who comment on Rachel’s influence and career:
Lorna Campbell
Andy Powell
Lorcan Dempsey
Her UKOLN colleagues

By Diane Hillmann, August 4, 2009, 12:43 pm (UTC-5)

One of the most interesting programs at ALA Annual that I was involved with was the Linked Data grassroots program. Here’s the blurb:

From Legacy Data to Linked Data: Preparing Libraries for Web 3.0. “How can library cataloging data be transformed to function within ‘Web 3.0’ and be understood by non-library web applications? Speakers from both the library and Semantic Web communities will explore the situation in a non-technical manner and describe current work underway to transform legacy library data into linked data.”

The speakers were: Eric Miller (President, Zepheira, Inc.), me, Jennifer Bowen (Co-Principal Investigator, eXtensible Catalog Project, University of Rochester), Rebecca Guenther (Senior Networking and Standards Specialist, Network Development & MARC Standards Office, Library of Congress). Corey Harper of NYU introduced the speakers and fielded questions at the end. Because this was a Grassroots program, attempting to make a place for emerging trends in what is often a program consisting primarily of the hot issues of a year or two ago, all the approved programs got small rooms. The one we ended up in seated about 75, and we filled the floors, the aisles and much of the hallway outside the room. The room was in the Hilton, not easy to find, so it was gratifying how many people made the effort.

American Libraries reported on the program, and judging from the comments I’ve received, it was a successful session that has generated interest in further programming on the subject for next year (and we are actually talking about doing that). In my presentation, I made the case for the readiness of libraries for the challenges of linked data, citing the work done with the RDA vocabularies as foundational to that claim. I admit that part of that claim was, if not actually wishful thinking, at least a rhetorical device, but clearly we are at some kind of tipping point (or approaching it pretty quickly). Every six months when I talk to people at ALA extensively about this stuff, or when I’m out “on the road” talking to colleagues, there is more excitement and more interest on the part of librarians, who are definitely “getting it.”

Presentations are available on the ALA Wiki.

By Diane Hillmann, August 3, 2009, 5:25 pm (UTC-5)

ALA Annual in Chicago has been a blur: I did three presentations (which I hope to talk about, with links to the slides, as time permits). But one issue has been rolling over in my mind ever since I blurted something about it at my first presentation on Friday of Annual, when I was last up on a panel about “The Future of MARC.” Rebecca Guenther of LC spoke about the efforts to keep MARC relevant and Ted Fons of OCLC covered similar topics from the viewpoint of “The Big O” (thanks to Karen Schneider for that wonderful appellation!). A feature of both talks was the idea that reorganizing MARC records into “FRBR-ized” views was really all that was needed to take advantage of FRBR. I argued at the time that this was not the case, and the more I think about it, the more convinced I am that I was right: FRBR-ization is not the same as using FRBR in native RDA.

Part of my view is based on the differences between RDA and MARC semantics (not syntax, which is where the conversation usually goes). One of the most overlooked aspects of RDA in general is the rich vocabulary of relationships that it brings to the table for use in bibliographic description. Most people who’ve focused on RDA as a textual guidance or set of rules have overlooked this, because the relationship vocabulary appears in appendices, and most of us don’t consider appendices the most important part of anything. But consider this: in the RDA Vocabularies, each of these relationships has an identifier, is part of a hierarchy that allows expression of bibliographic relationships at several levels, and gives us the ability to use these relationships to navigate the bibliographic landscape without having to delve into records and interpret the text notes we’ve used for the same purpose in MARC. For instance, using RDA you can say that ‘Resource X’ is an abridgment of ‘Resource Y’ (and that ‘Resource Y’ has an abridged version in ‘Resource X’) in a way that a system can expose to the user with no muss or fuss. The relationship is specific, identified and explicitly defined if anybody needs that to apply or interpret it.
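To make the abridgment example a little more concrete, here is a minimal sketch in Python of what an identified, paired relationship buys a system. It is only an illustration: the relationship names (`abridgmentOf`, `abridgedAs`) and the `Statement` structure are made up for the example and are not the identifiers used in the RDA Vocabularies.

```python
from dataclasses import dataclass

# Hypothetical relationship names, invented for this sketch; the real
# RDA relationship identifiers are not reproduced here.
INVERSES = {
    "abridgmentOf": "abridgedAs",
    "abridgedAs": "abridgmentOf",
}


@dataclass(frozen=True)
class Statement:
    subject: str
    relationship: str
    obj: str


def with_inverse(stmt):
    """Return the statement together with the inverse statement it implies."""
    inverse = INVERSES[stmt.relationship]
    return [stmt, Statement(stmt.obj, inverse, stmt.subject)]


statements = with_inverse(Statement("Resource X", "abridgmentOf", "Resource Y"))
for s in statements:
    print(s.subject, s.relationship, s.obj)
```

Because the relationship is a identified value rather than a free-text note, a system can derive the reverse link mechanically instead of asking someone to read and interpret a record.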

In contrast, FRBR-ization only exposes what we can assert based on a mapping from MARC to FRBR (or RDA), which is at best the relationships between the FRBR Group 1 entities: the Work, Expression, Manifestation and Item. With the RDA array of identified relationships, we have a whole lot more. I suppose one could say that these are not necessarily part of the FRBR panoply, but if you consider them the “horizontal” relationships that fill in between the “vertical” relationships that Work, Expression, Manifestation and Item provide, then it’s possible to see how these relationships are enabled by the way the FRBR model has allowed us to rethink our world.

This is one of the issues that makes my head hurt when I think about the RDA “testing” regime that we keep hearing about. Are we wedded to the notion that if it can’t be crammed into MARC we aren’t going to use it? Can’t we start to think about MARC as a fairly lossy output format and move on to something that expresses the relationships we know will help us maintain some important functionality and credibility in the broader data world? As Jennifer Bowen and the eXtensible Catalog folks have discovered as they build the services to transform MARC into RDA (see my post on Jennifer’s paper for more about that) transforming MARC to RDA represents a fundamentally different set of problems and trade-offs than going the other way. [By the way, the XC Project was everywhere at Annual–go Jennifer!]

And as more vendors step up to the RDA plate and begin to build applications that start with RDA rather than trying to transform MARC into something that could be mistaken for RDA only in dim light, we’re going to have to accept the fact that, as with any other metadata mapping, there is no such thing as a free lunch or a round trip.

The Registry has some of these relationships registered already (see “RDA Roles” for the relationships between FRBR Group 1 and Group 2 and “RDA Relationships for Works, Expressions, Manifestations, Items” for the relationships between Group 1 entities), but be aware that these are not yet the final versions. I haven’t gotten the information yet about the final changes to allow me to make those updates, but when I do I’ll make an announcement to that effect.

By Diane Hillmann, July 18, 2009, 11:35 am (UTC-5)

Today I got a very disappointing note in my inbox, from the US National Libraries RDA Test Project. I guess I’d call it a “ding” letter, and I have to say it was more than a bit surprising. I had volunteered to help with the testing, not by creating records, mind you, but in analyzing the records other people create. Given the fact that I’ve been the co-chair of the DCMI/RDA Task Group, done the major part of the work in registering the RDA schemas and vocabularies, and have been involved in building the XML schemas that will be the basis of much of the data creation for many early RDA implementations, I figured my experience might come in handy. But apparently not …

Dear Diane: Thank you for your interest in the US National Libraries RDA Test project. The RDA Test Steering Committee regrets that you could not be selected as a formal test participant. Interest in the project was much greater than the Steering Committee originally anticipated, and it was necessary to select test partners from more than 90 applications. Every applicant had a great deal to offer to the project, and each was carefully considered. The Steering Committee based its final selections on the goal of ensuring that the RDA Test will reflect a cross-section of US cataloging agencies balanced by size, type of organization, OPAC and cataloging systems used, and areas of specialization in cataloging and collection development.

The Steering Committee will share the methodology for the test on its Website at URL . If you are interested in conducting your own test of RDA, we encourage you to produce records following this methodology and to share the results with the Steering Committee during the test period.

Thank you again for your interest in the RDA Test.

So, exactly what are they testing that makes my knowledge and experience useless? Darned if I know. But I can’t get beyond the notion that the testing regime I see described on the website is pretty limited, and it’s hard to imagine what the results can really tell us, aside from the obvious difficulties people will encounter in attempting to cram a FRBR-based structure into any one of our current flat MARC-based library systems.

Much more interesting, to me anyway, is the idea of what RDA records might look like in straight XML or RDF, without the necessity of the contortions involved in making it all “fit” into a MARC system. Without the layer of MARC contortion we might really be able to figure out whether catalogers could adjust to RDA and create FRBR-based records. It would be nice to think that some of the open source systems would find a way to play with these records and test some forward-looking, rather than backward-looking, implementation issues.

Any volunteers for an alternate testing regime?

By Diane Hillmann, May 29, 2009, 5:08 pm (UTC-5)

This week, Karen Coyle wrote a post, “LCSH as linked data: beyond ‘dash-dash’,” which provoked a discussion on the id.loc.gov discussion list.

It seems to me that there are several memes at play in this conversation:

LCSH and SKOS

As Karen points out, LCSH is more than just a simple thesaurus. It’s also a set of instructions for building structured strings in a way that’s highly meaningful for ordering physical cards in a physical catalog. In addition, each string component has specific semantics related to its position in the string, so it’s possible, if everyone knows and agrees on the rules, to parse the string and derive the semantics of each individual component. The result is a pre-coordinated index string.

These stand-alone pre-coordinated strings are perhaps much less meaningful in the context of linked open data (LOD), but this certainly doesn’t apply to the components. I think what Karen is pointing out is that, while it’s wonderful to have a subset of all of the components that can be used to construct LC Subject Headings published as LOD, there’s enough missing information to reduce the overall value. As I read it, she’s wishing for the missing semantics to be published as part of the LCSH linked data, and hoping that LC doesn’t rest on its well-earned laurels and call it a day.

Structured Strings

Dublin Core calls the rules that define a structured string a "Syntax Encoding Scheme" (SES) and basically, that’s what the rules defining the construction of LC Subject Headings seem to be. It’s structurally no different from saying that the string "05/10/09", if interpreted as a date using an encoding scheme/mask of "mm/dd/yy", ‘means’ day 10 in the month May in the year 2009 using the Gregorian calendar. Fascinatingly, that same ‘date’ can be expressed as a Julian day number of "2454962", but I digress.
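The date example can be sketched in a few lines of Python. This is just an illustration of "string plus mask yields meaning", using the standard library's `strptime` format codes as a stand-in for the "mm/dd/yy" mask; the helper names are my own, and the Julian day conversion relies on the fixed offset between Python's date ordinal and the Julian day number.

```python
from datetime import datetime

# Offset between Python's proleptic Gregorian ordinal (day 1 = 0001-01-01)
# and the Julian day number of the same calendar day.
ORDINAL_TO_JDN = 1721425


def parse_with_mask(value, mask):
    """Interpret a structured string according to an explicit encoding mask."""
    return datetime.strptime(value, mask).date()


def julian_day_number(day):
    """Convert a calendar date to its Julian day number."""
    return day.toordinal() + ORDINAL_TO_JDN


# "mm/dd/yy" expressed in strptime notation
as_us_date = parse_with_mask("05/10/09", "%m/%d/%y")
# The same characters under a "dd/mm/yy" mask mean something else entirely
as_eu_date = parse_with_mask("05/10/09", "%d/%m/%y")

print(as_us_date.isoformat())         # 2009-05-10
print(as_eu_date.isoformat())         # 2009-10-05
print(julian_day_number(as_us_date))  # 2454962
```

The string carries no meaning on its own; the mask is what tells a program which component is the month and which is the day, and that is the role the heading construction rules play for an LCSH string.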

As far as I can tell, no one has figured out a universally accepted (or any) way to define the semantic structure of an SES in a way that can be used by common semantic inference engines, and I don’t think that anyone in this discussion is asking for that. What’s needed is a way to say "Here’s a pre-coordinated string expressed as a skos:prefLabel, it has an identity, and here are its semantic components."

Additional data

So…

"Italy--History--1492-1559--Fiction"

…is expressed in id.loc.gov/authorities/sh2008115565#concept as…

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix terms: <http://purl.org/dc/terms/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

<http://id.loc.gov/authorities/sh2008115565#concept>
    skos:prefLabel "Italy--History--1492-1559--Fiction"@en ;
    rdf:type skos:Concept ;
    terms:modified "2008-03-15T08:10:27-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
    terms:created "2008-03-14T00:00:00-04:00"^^<http://www.w3.org/2001/XMLSchema#dateTime> ;
    owl:sameAs <info:lc/authorities/sh2008115565> ;
    skos:inScheme
        <http://id.loc.gov/authorities#geographicNames> ,
        <http://id.loc.gov/authorities#conceptScheme> ;
    terms:source "Work cat.: The family, 2001"@en . 

…and has a 151 field expressed in the authority file as…

151 __ |a Italy |x History |y 1492-1559 |v Fiction

…which has the additional minimal semantics of…

<http://id.loc.gov/authorities/sh2008115565#concept>
    loc_id:type "Geographic Name" ; #note that this is also expressed as a skos:inScheme property
    loc_id:topicalDivision "History" ;
    loc_id:chronologicalSubdivision "1492-1559" ;
    loc_id:formSubdivision "Fiction" ;
    loc_id:geographicName "Italy" .

…and this might also be expressed as…

<http://id.loc.gov/authorities/sh2008115565#concept>
    loc_id:type <http://id.loc.gov/authorities/sh2002011429> ;
    loc_id:topicalDivision <http://id.loc.gov/authorities/sh85061212> ;
    loc_id:formSubdivision <http://id.loc.gov/authorities/sh85048050> ;
    loc_id:geographicName <http://id.loc.gov/authorities/n79021783> ;
    dc:temporal "1492-1559" ;
    dc:spatial <http://sws.geonames.org/3175395/> ;
    dc:spatial <http://id.loc.gov/authorities/n79021783> .

Making sure that those strings in the first example are expressed as resource identifiers is also something that I think Karen is asking for. (BTW, the ability to look up a label by URL at id.loc.gov is really useful.)

I should point out that Ed, Antoine, Clay, and Dan’s DC2008 paper detailing the conversion of LCSH to SKOS goes into some detail (see section 2.7) about the LCSH to SKOS mapping, but doesn’t directly address the issue that Karen is raising about mapping the explicit semantics of the subfields.
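For what it’s worth, pulling those subfield semantics out of the authority record is not the hard part. Here’s a rough Python sketch that labels each component of the 151 field shown earlier; the property names simply mirror the illustrative loc_id: names used above (they are not a published vocabulary), and the parsing assumes the simple "|code value" display form rather than real MARC.

```python
# Subfield-code-to-property mapping. The property names echo the
# illustrative loc_id: names in this post; they are assumptions,
# not part of any published vocabulary.
SUBFIELD_PROPERTIES = {
    "a": "geographicName",
    "x": "topicalDivision",
    "y": "chronologicalSubdivision",
    "v": "formSubdivision",
}


def parse_subfields(field):
    """Split a '151 __ |a Italy |x History ...' display string into (code, value) pairs."""
    pairs = []
    # Everything before the first '|' is the tag and indicators, so skip it.
    for chunk in field.split("|")[1:]:
        code, _, value = chunk.strip().partition(" ")
        pairs.append((code, value.strip()))
    return pairs


def label_components(field):
    """Attach a property name to each subfield value, keeping the original order."""
    return [
        (SUBFIELD_PROPERTIES.get(code, "unmapped"), value)
        for code, value in parse_subfields(field)
    ]


field = "151 __ |a Italy |x History |y 1492-1559 |v Fiction"
components = label_components(field)
pref_label = "--".join(value for _, value in components)

for prop, value in components:
    print(prop, "=", value)
print(pref_label)
```

The same pass yields both the pre-coordinated label and the labeled parts, which is the point: the semantics are already in the source data and only need to be carried through to the published output.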

By Jon, May 20, 2009, 3:45 pm (UTC-5)

Friday I was in Hyde Park, NY, at the site of the Franklin D. Roosevelt Presidential Library and Museum, attending the NYLINK Annual Meeting. I’d been asked to come and talk about change in cataloging. NYLINK, formerly the SUNY/OCLC Network, is one of the struggling regional service providers that used to be the primary brokers for OCLC services for smaller libraries, and did most of the training for OCLC in the process. These organizations are now providing other kinds of services and training for a wide variety of libraries, and are still one of the best places for figuring out what working librarians are thinking.

I was particularly impressed with how this day-long meeting was organized. After the welcome and logistics, two speakers—Liz Chabot from Ithaca College and Susan Currie from Binghamton University—got things off with a bang by proposing seven “what ifs”:

… we stopped cataloging?
… librarians individually and as a profession promoted, used and helped develop Wikipedia?
… we accepted open source software as a way of being in control of the customer experience?
… we required all library staff to have expertise using technology?
… mistakes were expected and embraced—and librarians became the mistake masters?
… we didn’t make our customers work so hard?
… we let customers determine their loan periods?

Meeting participants spent some time talking about these questions, in the meantime getting to know one another at their tables, and getting fired up about the topics. The morning speaker, Joe Lucia from Villanova University, then started off with this energized group, and did a great job putting the important issues in a more general perspective. He spoke compellingly about our plight in libraries and the parallels with newspapers, described changes in how people deal with linear text vs. hypertext, and pointed out the work of R. David Lankes (a colleague of mine at Syracuse University), who talks about the “library as conversation.” Joe’s primary point in bringing together these threads is his contention that libraries are not in the “information” business so much as we are in the “knowledge and conversation” business. He talked very passionately about the library as a “commons” where knowledge and conversation happen with the active engagement of the library as organization, bringing students, faculty and others out into the open for exchanges of ideas, not just providing places for their solitary study.

But of course it will be no surprise that I was most engaged with the question “What if we stopped cataloging?” I’m not sure that question is the right one, or perhaps it’s just not specific enough. I might recast it as “What if we stopped cataloging the same old stuff?” By “same old stuff” I mean the secondary products of academic pursuits: books, journals, government documents, etc., the stuff that we have always cataloged. What if we began to catalog different stuff: the stuff we’re publishing ourselves, the podcasts we create of the talks given by our own faculty and those who visit us to discuss their research, the exhibitions we create out of our collections and materials borrowed from others, the primary materials our former faculty members have donated into our care, the things that we’ve always consigned to others (archivists, perhaps?) or not cataloged at all? What we have called cataloging, when it is not done for the first time, is these days for the most part consigned to staff who may not be catalogers, bought from vendors, or automatically claimed from larger databases to populate ours. Insofar as we have used these more efficient, more automated strategies, we have indeed “stopped cataloging.” What we haven’t done is modify our missions and our budgets to take more responsibility for these other often unique things that we have neglected, and as such we have been the instruments of our own demise.

So as we hear the bemoaning of the profession of cataloging, and ourselves sometimes obsess about the “how” of the changes in our lives brought on by technology, financial distress, and various other pressures, we would do well to remember that behind all this, the mission of the institutions we call home is changing significantly, and we can’t answer those “how” questions without looking anew at those shifting missions.

My presentation and Corey Harper’s came after lunch, and the group was exceptionally ready to hear what we had to say, thanks to the fabulous preparation provided by NYLINK, Joe Lucia, Liz Chabot and Susan Currie. I had some great conversations there, and on the ride home thought quite a bit about how proud I am to be a librarian, and the wonderful opportunities I continue to have to get to know some of the best people in the business.

By Diane Hillmann, May 11, 2009, 9:06 am (UTC-5)

This week I’ve been on the road, doing presentations at the New Jersey Library Association meetings on Wednesday, and Five Colleges in Western MA Friday morning. I’ve been doing all this travel in the faithful MetadataMobile, increasingly an object of interest, amusement and (almost) veneration by those who have heard about her. It’s been a great trip, with wonderful audiences, and as always, I learn a lot about what’s on the minds of those in the trenches.

I arrived later than I’d expected at NJLA, due to missing the exit I was supposed to use on the Garden State Parkway. I was very busy looking for a coffee and a rest stop, and it probably whizzed right by me. By the time I realized I was way too far down the Parkway, I decided to get off and use my newly acquired New Jersey Official Map to get me to my destination. I know, it’s very retro as a strategy, but maybe I was feeling that a little self-punishment was in order, or maybe I just needed to get off the big roads for a while.

As a child of the fifties, I made a gas station my first stop: I needed to gas up, and figured I’d confirm my location on my handy-dandy map and get some advice on my route. I had forgotten that New Jersey doesn’t do self-service gas pumping, and I’d also forgotten that virtually all the NJ gas stations I’ve visited in the past decade are staffed by (and possibly owned by) persons from somewhere else. When shown a map and asked for directions, they invariably act like people who have been lately dropped into their current location from the sky with no idea how to locate themselves in the world they inhabit (pretty close to reality, probably). I knew pretty much where I was, but wanted to figure out how to get to a big secondary East/West road that would take me to where I needed to be. It was hopeless: these guys (they’re always guys) couldn’t even locate the town they were in on my map, so I thanked them for their trouble and got back on the road. I ended up following my nose (and the MM’s internal compass, which tells me whether I’m going N, S, E, W or any rational combination thereof). Thankfully, in NJ if you go east you find the Atlantic eventually.

It was a gorgeous day to be taking the long way, and I passed by what remains of the agricultural areas in NJ, including a whole bunch of horse farms (one was called “Due Process Stables” and you could pretty much guess how that was funded). I got to the conference venue in Long Branch (right on the beach) in time to take a nice walk on the boardwalk and decompress from those long hours behind the wheel.

The program was about FRBR and RDA for real people, and Rhonda Marker from Rutgers presented first. Rhonda and I have presented together before, and generally have fun with it, since we sometimes disagree and are very open about it, to the general amusement (or maybe consternation) of the audience. We only had two hours, which wasn’t enough, but the audience was very receptive and had some good questions. The slides for this presentation are available here.

From Long Branch, NJ, I pointed the MetadataMobile north, through the horror of Big Apple driving (BIG trucks!), through Connecticut and up to Amherst, Mass., for a confab with a group from the Five Colleges. This had been in the planning works for some months, but was made more interesting by the announcement a few weeks ago of some pending reorganization of technical services operations in the colleges in response to the same kinds of financial pressures being experienced everywhere. I’d been forewarned about this by the organizers of this session, and included some of the issues most relevant to the group in my presentation (though the slides, available here, don’t always reflect this directly). My underlying point was that they were not, in fact, dealing with only one crisis (the financial meltdown) but two, if you include (as I do) the pending changes in how we do business as data creators and managers, changes that are absolutely necessary to avoid the continuing marginalization of our efforts to provide information to our users.

One thing that impressed me about this group was that it included a number of the library chiefs, systems folks and others not part of the cataloging cohort that normally predominates in my audiences. This is good (very good, in fact) because it signals to me that the issues I talk about are being seen as crucial to the discussions around a change in mission and strategy essential to creating positive change over the longer term, not just arguing against cuts in budgets for the short term. But, as President Obama points out, the one-crisis-at-a-time approach doesn’t work in a context of multiple crises related to one another in fundamental ways. Attempting the financially necessary reorganization without a refocus on mission, in the face of the huge challenges we face in remaining relevant in the current information environment, is self-defeating at best, suicidal at worst. I spoke to one of the chiefs after the presentation, and he said that he wished his public services folks had been there, too, and I agreed. Certainly the demise of the library catalog as we know it will affect their work hugely, and it behooves us all to break down those barriers of specialization as we face issues of our own survival as information providers.

What this also suggests is that the focus of the testing being done by the US national libraries cannot be limited to cost benefit analysis. If we fail to look at the issues of most importance to libraries, as the LC Working Group on the Future of Bibliographic Control certainly did, we risk our future entirely. Of course, in the current environment particularly, we need to pay attention to costs and efficiency as we move forward, but they cannot be our sole criteria for decision making.

For the knitters among my readers, I also made a pilgrimage to Webs and spent a goodly portion of my speaker’s fees to support the local economy. It was great, though at this point I’m going to have to do a “real” retirement sooner than anticipated to use up all the yarn and patterns I’ve amassed. Either that or recruit a chauffeur on these trips so I can knit on the way.

By Diane Hillmann, May 3, 2009, 2:23 pm (UTC-5)