A New Kind of Catalog Charley Pennell Principal Cataloger for Metadata North Carolina State University North Carolina Library Association 2007
Where is this talk headed? • Local motivation • National trends • What is Endeca? • Features • Does Endeca work? • Where are we going from here? • Where is everybody else going?
A little TRLN catalog primer • TRLN libraries (Duke, NCCU, NCSU, UNC-CH) jointly develop and maintain BIS, 1985-1992 • DRA implemented for catalog (UNC & Duke continue Acq/Serials modules), 1991-1993 • No integrated keyword/browse capability, 1993-1999 • Web2 catalog implemented, 1999- • Sirsi & DRA “merge” in 2002; Taos DOA
A little TRLN catalog primer 2 • NCSU & NCCU to Unicorn; Duke to Aleph; UNC-CH to Millennium, 2003-2004 • Sirsi/Dynix merger, 2004: vendor focus shifts (even more) toward school/public market • While agreeing to continue to support Web2, S/D increasingly looking to merge all product catalogs into single interface
What was the catalog lacking? • Simplicity: a simple, hopefully uncluttered interface • Interactivity: ways to interact with results to get better results • Forgiveness: just fix my typos and case errors, don’t make me feel stupid! • Response time: always • Real-time sorting: the limit is how many?!! • Relevance ranking: as if! • Web services: use the Web to repurpose data, enable mash-ups, add-ons & improvements
So, why DOES everyone think that the catalog stinks? "Most integrated library systems, as they are currently configured and used, should be removed from public view." - Roy Tennant, OCLC
The integrated library system • Historically, the ILS developed as an inventory control system for use by library staff only • First library automation systems (Plessey, CLSI, Geac, Innovative) were designed around circulation or acquisitions functions • Interaction time was calibrated to the slow pace of backroom work where the audience was basically captive • Staff focus on known-item searching, not resource discovery
The catalog as part of the ILS • The first integrated OPACs were veneers on top of existing inventory management systems; patrons & staff competed for system resources! They still do! • First OPACs allowed for browse only; early keyword searching restricted to certain fields (A/T/S) only • Libraries with no IT support were stuck with what their vendor provided and the enhancement process for improvements • Libraries with IT support created their own systems: BIS, NOTIS, Claremont Colleges, Georgetown, PALS, DOBIS/LIBIS
The state of the ILS in 2007 • Customer demands for increasing functionality in a marketplace with little $$ to spend has reduced the ILS vendor pool through mergers and buyouts • New functionality (multi-search, ERMS, E-Ref, ILL, etc.) increasingly being met by stand-alone and third party applications • Increasing competition from open source (Koha, Evergreen, Scriblio, LibraryThing) and e-commerce • Q: Is our dogged adherence to MARC the only thing keeping the remaining ILS vendors afloat?
The state of the catalog 2007 • Library users’ search expectations have been conditioned by interactions with commercial Websites and Google, with which Libraries can barely afford to compete, but must • Libraries are becoming increasingly virtual as users interact with us online (e-resources, Second Life) • User expectations for online experiences are more interactive, instantaneous, and inviting
Perhaps most importantly… • The information resources represented in the catalog represent a shrinking percentage of what end users need or want Calhoun’s Aristotelian vs. Copernican views of the catalog
What do users want from the OPAC? • Make subject searching in online catalogs easier using post-Boolean probabilistic searching with automatic spelling correction, term weighting, intelligent stemming, relevance feedback, and output ranking • Streamline users' book selection decisions at the catalog by adding tables of contents and back-of-the-book indexes to cataloging (i.e., metadata) records • Reduce the many failed subject searches by expanding the online catalog with full texts—journal and newspaper articles, encyclopedias, dissertations, government documents, etc. Increase finding strategies in online catalogs through the library classification -- Markey, Karen (2007). “The online library catalog: Paradise lost and paradise regained”, D-Lib Magazine, 13(1/2).
“Many researchers express surprise at the brevity (from one to three words) of the queries people submit to online systems. Belkin tells why so few words make up their queries, "Precisely because of the inquirer's lack of knowledge about a problem area, it is impossible to specify what would resolve it." For Belkin, the saving grace is the inquirer's ability to recognize what he or she wants or does not want during the course of the search. Therein lies an important solution to the problem—information systems that report results for easy eyeballing and instantaneous recognition of relevant possibilities.” – Karen Markey
What is Endeca? • A software company based in Cambridge, MA • A search and information access technology provider for a number of major e-commerce websites • Developers of the Endeca Information Access Platform
Endeca features • Commercial-strength search/sort speeds • Site customizable relevance ranking • Faceted browse • True browsing (LC classification) • Spell-checking • “Did you mean?” • Automatic word stemming
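The “Did you mean?” feature listed above can be illustrated with a small sketch. This is not Endeca's algorithm, which the slides do not describe; it assumes a simple similarity comparison against a list of indexed terms, using Python's standard `difflib`. The vocabulary and the `did_you_mean` function name are hypothetical.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of indexed terms; a real index would be far larger.
INDEXED_TERMS = ["cataloging", "classification", "metadata", "statistics", "psychology"]

def did_you_mean(query_term, vocabulary=INDEXED_TERMS, cutoff=0.8):
    """Return a suggested spelling, or None if the term is already indexed
    or nothing in the vocabulary is similar enough."""
    term = query_term.lower()  # forgive case errors
    if term in vocabulary:
        return None
    matches = get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    for typed in ["catalogging", "statistcs", "Metadata"]:
        print(typed, "->", did_you_mean(typed))
```

Lowering `cutoff` makes the suggestion more forgiving at the cost of more false matches.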
Endeca at NCSU Libraries • Went live in January 2006 • Works with a text version of a daily snapshot of Libraries’ MARC & other metadata • Used to improve the discovery portion of the library catalog • Interoperates with ILS for holdings, current availability status • Web2 interface still present for known item & authority searching
Implementation timeline • License / negotiation: Spring 2005 • Acquire: Summer 2005 • Implementation: • August 2005 : vendor training • September 2005 : finalize requirements • October 2005 – January 2006 : design and development • January 12, 2006 : go-live date • Widen to TRLN partners: Winter 2008
Implementation Team • Implementation Team brought together from IT, DLI, Cataloging, Collections, Reference, Circulation • Worked on indexing, UI, usability testing, etc. • Areas of contention • Number of initial search boxes (1 or 2) • Order, grouping of facets • Placement of classification hierarchies, breadcrumbs • Use of “search” and “browse” on tabs • Visualization aided by Tito’s wireframes
Brief view vs. Full view gives user choice about displaying holdings. Reduces complexity of continuing and online resources. 8th (and Final) Revision: Aggregate holdings information by library.
NCSU Endeca features [screenshot with callouts: Breadcrumbs, Call # browse, Results, Facets]
Features we started with • Faceted browse • Availability facet • Breadcrumbs • Spell check / Did you mean • Hierarchical subject browse based on LCC • Fuzzy link to live Web2 data • New book browse for titles added in last week only
Features that we’ve added • New book browse based on relative date (last week, last month, last three months) • RSS feeds based on user results • “Search within” results • Send search to TRLN partners • Static unique link to live Web2 data
Relevance ranking Based on locally customizable algorithm: • Most relevant: query exactly as entered • For multi-term searches: phrase match • Field match • title match more relevant than notes match • Other factors: • number of fields matched • weighted frequency • static ordering (publication date, circulation stats)
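The tiered ordering on this slide can be sketched as a sort key. This is a minimal illustration, not the NCSU configuration: the field weights, the record layout, and the function names are assumptions, and the "weighted frequency" factor is left out for brevity.

```python
# Assumed field weights: a title match outranks a notes match, as on the slide.
FIELD_WEIGHTS = {"title": 3, "subject": 2, "notes": 1}

def relevance_key(record, query):
    """Build a tuple that sorts records by the tiers described on the slide."""
    q = query.lower().strip()
    terms = q.split()
    fields = {name: str(record.get(name, "")).lower() for name in FIELD_WEIGHTS}
    exact = any(text == q for text in fields.values())           # query exactly as entered
    phrase = len(terms) > 1 and any(q in text for text in fields.values())
    matched = [name for name, text in fields.items()
               if all(t in text.split() for t in terms)]          # every term in one field
    best_field = max((FIELD_WEIGHTS[n] for n in matched), default=0)
    # The last element is a static tie-breaker (publication year, newest first).
    return (exact, phrase, best_field, len(matched), record.get("year", 0))

def rank(records, query):
    """Keep records where some field contains all terms, most relevant first."""
    hits = [r for r in records if relevance_key(r, query)[3] > 0]
    return sorted(hits, key=lambda r: relevance_key(r, query), reverse=True)

if __name__ == "__main__":
    sample = [
        {"id": 1, "title": "Civil war", "year": 1990},
        {"id": 2, "title": "The civil war in Spain", "year": 2001},
        {"id": 3, "title": "Reconstruction", "notes": "Covers the civil war", "year": 2005},
        {"id": 4, "title": "War and civil society", "year": 2006},
    ]
    print([r["id"] for r in rank(sample, "civil war")])
```

Because the key is a tuple, each tier only matters when all earlier tiers tie, which mirrors the "most relevant first" ordering of the bullets.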
Faceting at the NCSU Libraries • Follows on what we have learned from the commercial Web search model • Mines metadata already available via MARC record, local class number, ILS item categories, circ status, and date stamping • Required massive clean-up of 6xx subdivisions • Allows both pre- and post-coordinate limits • Uses table mapping to enable drilling down through call number results
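The table mapping mentioned in the last bullet might look like the sketch below. The table rows are a tiny illustrative slice with abbreviated captions, not the table NCSU used, and `lcc_path` is a hypothetical name. The idea is that each call number resolves to a path of classification captions that a user can drill down through.

```python
import re

# Illustrative slice of an LCC mapping table: (class letters, low, high, caption path).
LCC_TABLE = [
    ("QA", 273, 280, ["Q: Science", "QA: Mathematics",
                      "QA273-280: Probabilities. Mathematical statistics"]),
    ("Z", 693, 695.83, ["Z: Bibliography. Library science", "Z693-695.83: Cataloging"]),
]
# Fallback paths when only the class letters can be mapped.
CLASS_PATHS = {
    "Q": ["Q: Science"],
    "QA": ["Q: Science", "QA: Mathematics"],
    "Z": ["Z: Bibliography. Library science"],
}

def lcc_path(call_number):
    """Map an LC call number to a list of captions, broadest first."""
    m = re.match(r"\s*([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?", call_number.upper())
    if not m:
        return []
    letters = m.group(1)
    number = float(m.group(2)) if m.group(2) else None
    if number is not None:
        for cls, low, high, path in LCC_TABLE:
            if cls == letters and low <= number <= high:
                return path
    return CLASS_PATHS.get(letters, [])

if __name__ == "__main__":
    print(lcc_path("QA276 .M3 2005"))
    print(lcc_path("QA76.9 .D3"))
```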
Facet refinements • Availability • Author • Library • Format • Language • New(ness) • LC Classification • Subject: Topic • Subject: Genre • Subject: Region • Subject: Era
A single facet need not represent data from a single field • Single Unicorn item types (Book, Kit, Manuscript, Map, Data set) • Multiple Unicorn item types (Audio, Microform, Thesis/Dissertation, Software & Multimedia, Videos) • Leader byte 07 (Bib lvl): Journal, Magazine • Library (Online)
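A sketch of how one Format facet could draw on several sources at once. The item type codes and the `ONLINE` library code are hypothetical stand-ins for local Unicorn values; only the leader position (byte 07, where `s` marks a serial) comes from the MARC format itself.

```python
# Hypothetical Unicorn item type codes mapped to facet values.
# Several codes can collapse into a single facet value (e.g. Audio, Microform).
ITEM_TYPE_TO_FORMAT = {
    "BOOK": "Book",
    "MAP": "Map",
    "CD": "Audio",
    "CASSETTE": "Audio",
    "MICROFICHE": "Microform",
    "MICROFILM": "Microform",
}

def format_facet(leader, item_type, library):
    """Derive Format facet values from the MARC leader, ILS item type, and library."""
    values = set()
    if len(leader) > 7 and leader[7] == "s":   # leader byte 07: bibliographic level
        values.add("Journal, Magazine")
    if item_type in ITEM_TYPE_TO_FORMAT:
        values.add(ITEM_TYPE_TO_FORMAT[item_type])
    if library == "ONLINE":                    # assumed code for the online "library"
        values.add("Online")
    return sorted(values)

if __name__ == "__main__":
    print(format_facet("00000cas a2200000 a 4500", "MICROFILM", "DHHILL"))
    print(format_facet("00000cam a2200000 a 4500", "BOOK", "ONLINE"))
```

In this sketch the facet is multi-valued, so a journal held on microfilm appears under both refinements; whether that matches local practice is an open assumption.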
Ranking facet results by number of postings makes sense in a short list, but not in a long list
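One way to act on this observation is to switch the sort order by list length. The threshold of ten and the function name are assumptions made for illustration.

```python
def order_refinements(counts, max_ranked=10):
    """Order facet values by postings when the list is short, alphabetically when long.

    counts: dict mapping a facet value to its number of postings.
    """
    if len(counts) <= max_ranked:
        # Short list: most postings first, ties broken alphabetically.
        return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Long list: alphabetical, so users can scan for a known value.
    return sorted(counts.items(), key=lambda kv: kv[0].lower())

if __name__ == "__main__":
    print(order_refinements({"Book": 120, "Map": 8, "Audio": 8}))
```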
Technical overview • Raw MARC data → NCSU exports and reformats → Flat text files • Endeca Information Access Platform: Data Foundry (parse text files) → Indices → MDEX Engine • MDEX Engine → HTTP → NCSU Web Application → HTTP
MARC ingest • MARC flat text file(s) for ingest by Endeca. • Transformation accomplished with MARC4J. • Opportunity to manipulate data on the back-end.
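The flattening step might be sketched as follows. The slides say the transformation was done with MARC4J (a Java library); this Python sketch does not use it and assumes the record has already been parsed into tag and subfield pairs. The pipe-delimited output layout is hypothetical, since the slides do not show the actual file format.

```python
def flatten_record(record_id, fields):
    """Turn parsed MARC fields into flat text lines, one per field.

    fields: list of (tag, [(subfield_code, value), ...]) tuples.
    Output layout (hypothetical): record_id|tag|text
    """
    lines = []
    for tag, subfields in fields:
        # Back-end clean-up: drop trailing ISBD punctuation from each subfield.
        parts = [value.strip().rstrip(" /:;,") for _, value in subfields]
        lines.append(f"{record_id}|{tag}|{' '.join(parts)}")
    return "\n".join(lines)

if __name__ == "__main__":
    record = [
        ("245", [("a", "A new kind of catalog /"), ("c", "Charley Pennell.")]),
        ("650", [("a", "Online library catalogs.")]),
    ]
    print(flatten_record("NCSU1", record))
```

The clean-up line is one example of the "opportunity to manipulate data on the back-end": the source MARC records stay untouched while the indexed text is normalized.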
The end result… Video
Other Endeca library catalogs • Phoenix Public Library: http://www.phoenixpubliclibrary.org/ • McMaster University: http://libcat.mcmaster.ca • Florida Center for Library Automation: http://catalog.fcla.edu/ • Individual Florida universities: http://fs.catalog.fcla.edu/, etc.
Problems: authority control • Endeca is a keyword search engine; “browse” can only be effected using sort options • There is no authority control within Endeca itself; it relies on authority control within the ILS • To make use of available metadata, subjects were split along subdivisions; authors were not • Talks were held with the vendor to explain the potential for drawing on authority x-refs to collocate searches
Problems: subject context • Problems with wrong delimiter values (esp. $v) • Problems maintaining context in atomized LCSH • One-way relationships • English language$vDictionaries$xSpanish • Chronological headings devoid of geographic context • Cuba$xHistory$yRevolution, 1959 • Phrase headings expressed in multiple subdivisions • Prisoners$xAbuse of
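The loss of context is easy to see once the splitting is written out. The sketch below routes each 6xx subfield to a facet by its code; the mapping follows the facet names used elsewhere in this talk, but the function and the input layout are assumptions.

```python
# Subdivision codes routed to the subject facets named in this talk.
SUBDIVISION_FACET = {"x": "Topic", "v": "Genre", "y": "Era", "z": "Region"}

def split_subject(tag, subfields):
    """Atomize one subject heading into facet values.

    subfields: list of (code, value) pairs in their original order.
    """
    facets = {}
    for code, value in subfields:
        if code == "a":
            # $a of a 651 is a place; $a of other 6xx fields is treated as a topic here.
            facet = "Region" if tag == "651" else "Topic"
        else:
            facet = SUBDIVISION_FACET.get(code)
        if facet:
            facets.setdefault(facet, []).append(value)
    return facets

if __name__ == "__main__":
    print(split_subject("650", [("a", "English language"), ("v", "Dictionaries"), ("x", "Spanish")]))
    print(split_subject("651", [("a", "Cuba"), ("x", "History"), ("y", "Revolution, 1959")]))
```

In the first example "Spanish" lands in the Topic facet on its own, detached from "Dictionaries"; in the second, "Revolution, 1959" becomes an Era value with no link back to Cuba.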
Problems: subject hierarchies • Chronological hierarchy not built into $y • “19th century” does not subsume 1800-1809, 1801-1861, 1809-1817, 1815-1861, 1817-1825, Civil War, 1861-1865, etc. • Geological periods exist as text only (Ordovician, Pleistocene, etc.) • Some chronological headings are expressed as text in 650$a • Middle Ages • Nineteen sixties • Geographic hierarchy not consistent between 651 and 650 • $zNorth Carolina$zRaleigh • $aRaleigh (N.C.) • BT/NT/RT relationships from authority file lacking
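A rough sketch of how a chronological hierarchy could be layered over $y strings by extracting years and assigning centuries. It assumes the popular convention that 1800-1899 counts as the 19th century, and it does nothing for text-only periods such as "Ordovician" or "Middle Ages", which would still need a lookup list. The function name is hypothetical.

```python
import re

def centuries_for(heading):
    """Return the century numbers a chronological heading falls in."""
    ordinal = re.search(r"(\d{1,2})(?:st|nd|rd|th) century", heading)
    if ordinal:
        return [int(ordinal.group(1))]
    years = [int(y) for y in re.findall(r"\b(\d{4})\b", heading)]
    if not years:
        return []  # text-only heading: no dates to work from
    first, last = min(years), max(years)
    # Assumed convention: years 1800-1899 belong to century 19.
    return list(range(first // 100 + 1, last // 100 + 2))

if __name__ == "__main__":
    for y in ["19th century", "1800-1809", "Civil War, 1861-1865", "Ordovician"]:
        print(y, "->", centuries_for(y))
```

With such a mapping, a "19th century" refinement could also retrieve records whose $y is "Civil War, 1861-1865".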
Some potential solutions • Search behavior education • FAST (Faceted Application of Subject Terminology) • Web2 x-refs to redirect searches to Endeca • Combining $z hierarchies • Hierarchy lists
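"Combining $z hierarchies" could mean normalizing both geographic forms to one path, as in the sketch below. The qualifier table is a hypothetical two-entry sample and the function names are illustrative; a real table would need every jurisdiction abbreviation used in headings.

```python
import re

# Hypothetical sample: heading qualifiers mapped to the parent place name.
QUALIFIER_TO_PLACE = {"N.C.": "North Carolina", "Va.": "Virginia"}

def path_from_subdivisions(z_values):
    """$zNorth Carolina$zRaleigh is already broadest-first."""
    return tuple(z_values)

def path_from_heading(heading):
    """Turn a qualified heading such as 'Raleigh (N.C.)' into the same path form."""
    text = heading.strip()
    m = re.fullmatch(r"(.+?) \((.+)\)", text)
    if not m:
        return (text,)
    name, qualifier = m.groups()
    parent = QUALIFIER_TO_PLACE.get(qualifier)
    # Leave the heading untouched when the qualifier is not in the table.
    return (parent, name) if parent else (text,)

if __name__ == "__main__":
    print(path_from_subdivisions(["North Carolina", "Raleigh"]))
    print(path_from_heading("Raleigh (N.C.)"))
```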
“The new Endeca system is incredible. It would be difficult to exaggerate how much better it is than our old online card catalog (and therefore that of most other universities). I've found myself searching the catalog just for fun, whereas before it was a chore to find what I needed.” - NCSU Undergrad, Statistics “The new library catalog search features are a big improvement over the old system. Not only is the search extremely fast, but seemingly it's much more intelligent as well.” - NCSU faculty, Psychology