Beyond Data Mining: Delivering the Next Generation of Services from Library Data
Lynn Silipigni Connaway, Ph.D., Senior Research Scientist, OCLC
Timothy J. Dickey, Ph.D., Post-Doctoral Researcher, OCLC
WorldCat as an “Aggregate Collection”
• Data mining and analysis of WorldCat:
  • “…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making.”
  • Lavoie, B. F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape. Library Resources & Technical Services, 51, 106-115, at 107.
WorldCat: July 2008
• Manifestations (records): 108,828,533
• Works: 84,096,107
• Total holdings: 1,292,763,300
• Digital items: 3,182,550
• Institutions: 69,000
• Physical items: ~1.2 billion
Global Origins of WorldCat Materials (pie chart)
• Germany: 10%
• Rest of World: 27%
• Unknown: 17%
• France: 4%
• Canada: 3%
• UK: 8%
• US: 28%
Global Origins of WorldCat Materials
• Materials with non-US origins: 57.9 million (55%)
  • Top 5: Germany 10.0 million; UK 8.8 million; France 4.2 million; Netherlands 2.9 million; Canada 2.9 million
• Content languages: 478
  • 49% of WorldCat is non-English
  • Top 5 non-English: German 12 million; French 6.1 million; Spanish 3.5 million; Dutch 2.6 million; Japanese 2.4 million
• Non-English metadata language: 28 million (66 languages)
  • Top 5: German 11 million; French 1.8 million; Dutch 5.0 million; Finnish 0.7 million; Swedish 1.9 million
WorldCat as a Decision-Making Resource
• Collection management
• Cooperative collection development
• Comparative collection analysis
• Collection assessment
• Mass digitization
• Off-site storage
• Preservation
WorldCat as a Decision-Making Resource
• Services
• Virtual reference
• Recommender services
• Social networking
• Systems
• Precision
WorldCat as a Decision-Making Resource
• Three Areas of Data Mining Research:
  • OCLC WorldMap
  • Audience Level
  • Publisher Name Server
OCLC WorldMap™: Objectives
• Geographically represent WorldCat data
  • Titles published in each country
  • Holdings for titles published in each country
  • Languages represented for titles published in each country
OCLC WorldMap™: Objectives
• Geographically represent data from UNESCO, ARL, and NCES for each country
  • Number of
    • Libraries
    • Library volumes
    • Certified/degreed librarians
    • Registered library users
    • Library expenditures
    • Cultural heritage institutions (museums and archives)
    • Publishers
OCLC WorldMap™: Objectives
• Research prototype
• Support OCLC data mining research
• Visually display data for review and analysis
• Internal use
• Sales and marketing
• External use
• Library collection assessment and comparison
• Data may be processed at a glance
• Complement the AAU/ARL Global Resources Network project
• Project of the Council on Library and Information Resources (CLIR)
Audience Level: Rationale and Objectives
• Holdings represent selection decisions by librarians … which implies there are more than 1 billion individual selection decisions in the WorldCat holdings file
• Selections serve the interests of a library’s target community …
• Associate community (audience level) with library profiles, e.g., ARL, non-ARL academic, public, K-12 school … ?
• Thus we can infer materials’ audience level from holdings patterns, which in turn can support:
  • Collection management
  • Readers’ advisory services
  • Reference services
  • Information retrieval
“FRBRizing” Audience Level Results
• Calculate Audience Level for each Manifestation
• Aggregate weighted holdings for Work
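The two steps on this slide can be sketched roughly as follows. This is only an illustration, not OCLC's actual formula: the library-type scores in `TYPE_SCORE`, the function names, and the choice of a holdings-weighted mean are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical score per library type (0 = juvenile, 1 = research level).
# These values are assumptions for illustration, not OCLC's weights.
TYPE_SCORE = {"k12": 0.1, "public": 0.3, "academic": 0.6, "arl": 0.9}


def manifestation_level(holdings):
    """Holdings-weighted mean score for one manifestation.

    `holdings` maps a library type to the number of holding libraries.
    Returns None when the manifestation has no holdings.
    """
    total = sum(holdings.values())
    if total == 0:
        return None
    return sum(TYPE_SCORE[t] * n for t, n in holdings.items()) / total


def work_level(manifestations):
    """Pool the holdings of every manifestation of a work, then score the pool.

    Pooling means widely held manifestations count for more than rare ones.
    """
    pooled = Counter()
    for holdings in manifestations:
        pooled.update(holdings)
    return manifestation_level(pooled)


if __name__ == "__main__":
    paperback = {"public": 3, "arl": 1}
    scholarly_edition = {"arl": 4}
    print(manifestation_level(paperback))
    print(work_level([paperback, scholarly_edition]))
```

Pooling the counts before scoring is one possible reading of "aggregate weighted holdings"; averaging the per-manifestation scores would be another.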
Evaluating the OCLC Audience Level
• Random sample of 30 zoology books, all audience levels
• Human subjects
  • Ranked books “in increasing order of difficulty”
• Strong statistical correlation between human subjects’ ranking and programmatic ranking
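The slide does not say which statistic was used. As one plausible way to compare a human ranking with a programmatic one, the sketch below computes Spearman's rank correlation for two tie-free rankings; the choice of Spearman's rho and the sample ranks are assumptions made for illustration.

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rho for two tie-free rankings of the same items.

    Each argument maps an item to its rank (1 = easiest).
    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), valid only without ties.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same items")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two items")
    d_squared = sum((rank_a[k] - rank_b[k]) ** 2 for k in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    # Made-up ranks for five books; the study itself used 30.
    human = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}
    program = {"a": 1, "b": 3, "c": 2, "d": 4, "e": 5}
    print(spearman_rho(human, program))
```

A value near 1 would indicate that the two rankings order the books almost identically.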
Publisher Name Server: Research Objectives
• Resolve for data mining and quality of WorldCat
  • ISBN prefixes to publisher name
  • Variant publisher names to a preferred form
• Complement Collection Analysis Service
  • Librarians
  • Publishers
• Capture and profile attributes of individual publishers
  • Location(s)
  • Language(s) of materials published
  • Genre(s)/format(s)
  • Dominant subject domain(s)
  • Parent company and subsidiaries
Publisher Name Server: Methodology
• Programmatically cluster publishers’ records using ISBN prefixes
• Data clustering (The Free Dictionary)
  • “The science of extracting useful information from large data sets or databases”
  • Classification of similar objects into different groups
  • Partitioning of a data set into subsets (clusters)
  • Data in each subset (ideally) share some common trait
• Hand parse the entities and resolve ISBN prefixes
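A minimal sketch of the clustering step, assuming each record has already been reduced to a pair of ISBN publisher prefix and publisher string (how the prefix is extracted is not shown in the slides). The function name and the sample rows are illustrative only.

```python
from collections import Counter, defaultdict


def cluster_by_prefix(records):
    """Group publisher name strings by ISBN publisher prefix.

    `records` is an iterable of (isbn_prefix, publisher_string) pairs.
    Returns {prefix: Counter of name strings}, so that a reviewer can see
    every variant form and how often it occurs before hand parsing.
    """
    clusters = defaultdict(Counter)
    for prefix, name in records:
        clusters[prefix][name.strip()] += 1
    return dict(clusters)


if __name__ == "__main__":
    # Illustrative sample rows, not taken from WorldCat.
    sample = [
        ("0-19", "Oxford University Press"),
        ("0-19", "Oxford Univ. Press"),
        ("0-19", "Oxford University Press"),
        ("0-13", "Prentice-Hall"),
        ("0-13", "Prentice Hall"),
    ]
    for prefix, names in cluster_by_prefix(sample).items():
        print(prefix, names.most_common())
```

Each cluster shares one trait (the prefix); choosing the preferred form among the variants is left to the manual review the slide describes.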
Publisher Name Server: Database
• 1,750 publishing entities
• Relational database, preserving hierarchical relationships
• Begins with high-occurrence entities:
  • “Top 10” lists (USA, UK, Canada, Australia, Germany, France, Netherlands, Japan, Italy, China, Russia, Spain, Finland, Australia, Taiwan, New Zealand)
  • Top 10 university presses
  • Mergers and acquisitions, last 8 years
Publisher Name Server: Data Captured
Database fields:
• Publisher Name, Preferred Form
• Source of Preferred Form
• Former Names
• Variant Forms
• ISBN Prefixes
• HQ City
• HQ Country
• Other Cities
• URL
• Languages, Formats, Conspectus Subjects (data mining)
Data sources:
• U.S. Library of Congress, Name Authority File, 110 (Corporate Name) field
• Books In Print Online (R.R. Bowker)
• The International ISBN Registry (K.G. Saur)
• Publishers Weekly Online
• Hoover’s Handbook Online
• Standard and Poor’s Corporate Descriptions
• The Directory of Corporate Affiliations (DIALOG)
• Company websites
Publisher Name Server: Database
• More than 56,000 separate strings mapped to 1,750 entities
• 8.5 million OCLC records
  • 22% of these are Library of Congress records
• ~490 million holdings
• Hierarchical relationships maintained
Entity-Parsing in a World of Mergers and Acquisitions
Entities shown in the hierarchy chart for Pearson PLC:
• Pearson PLC
• Penguin Books
• Pearson Canada
• Pearson Technology Group
• Allen Lane
• Ladybird Books
• Riverhead Books
• Copp Clark
• Adobe Press
• Cisco Press
• Puffin Books
• Putnam Books
• Berkley Publishing Group
• Pearson Education, Inc.
• Avery
• Addison-Wesley Publishing Company
• Allyn and Bacon
• Prentice-Hall, Inc.
• Dominie Press
• Benjamin/Cummings Publishing Company
• Scott, Foresman and Company
• HarperCollins Educational Publishers
• Longmans, Green, and Co.
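One common way to keep parent and subsidiary relationships in a relational database is a self-referencing table queried with a recursive common table expression. The sketch below uses Python's built-in `sqlite3`; the table layout, the function names, and the handful of parent links are assumptions for illustration, not the Publisher Name Server's actual schema.

```python
import sqlite3


def build_db():
    """Create an in-memory table where each entity points at its parent."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE entity ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " parent_id INTEGER REFERENCES entity(id))"
    )
    # Illustrative rows only; the parent links are this sketch's assumption.
    con.executemany(
        "INSERT INTO entity VALUES (?, ?, ?)",
        [
            (1, "Pearson PLC", None),
            (2, "Penguin Books", 1),
            (3, "Puffin Books", 2),
            (4, "Pearson Education, Inc.", 1),
            (5, "Prentice-Hall, Inc.", 4),
        ],
    )
    return con


def descendants(con, root_name):
    """Return the named entity plus everything beneath it, at any depth."""
    rows = con.execute(
        """
        WITH RECURSIVE tree(id, name) AS (
            SELECT id, name FROM entity WHERE name = ?
            UNION ALL
            SELECT e.id, e.name FROM entity e JOIN tree t ON e.parent_id = t.id
        )
        SELECT name FROM tree ORDER BY name
        """,
        (root_name,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    con = build_db()
    print(descendants(con, "Pearson PLC"))
    print(descendants(con, "Penguin Books"))
```

With this layout a merger or acquisition becomes a change to one `parent_id`, and aggregate profiles can be rebuilt by walking the tree from the new parent.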
Publisher Profiles
• Oxford University Press
  • 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat)
• Pearson PLC
  • Includes 14 subsidiaries and acquisitions
  • Aggregate: 291,433 records (0.27% of WorldCat)
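The percentages on this slide follow from the July 2008 WorldCat record count given earlier in the deck (108,828,533). A short check, with a function name chosen here for illustration:

```python
WORLDCAT_RECORDS = 108_828_533  # manifestations (records), July 2008


def share_of_worldcat(records, total=WORLDCAT_RECORDS):
    """Return a publisher's record count as a percentage of WorldCat."""
    return 100 * records / total


if __name__ == "__main__":
    for publisher, records in [
        ("Oxford University Press", 210_095),
        ("Pearson PLC (aggregate)", 291_433),
    ]:
        print(f"{publisher}: {share_of_worldcat(records):.2f}% of WorldCat")
```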
Publisher Profiles – Top Languages
Oxford Univ. Press:
• English 96.74%
• Latin 0.51%
• German 0.39%
• Chinese 0.39%
• French 0.37%
• Spanish 0.28%
• Afrikaans 0.14%
• Middle English 0.13%
• Malay 0.09%
• Swahili 0.09%
Pearson PLC:
• English 95.27%
• Spanish 1.43%
• German 1.33%
• French 0.60%
• Dutch 0.55%
• Latin 0.26%
• Malay 0.06%
• Ancient Greek 0.05%
• Portuguese 0.05%
• Italian 0.04%
Publisher Profiles – Conspectus Divisions
Oxford Univ. Press:
• Language/Literature 27.12%
• History 11.92%
• Music 9.78%
• Philosophy/Religion 9.55%
• Business/Economics 6.15%
• Medicine 4.36%
• Law 3.85%
• Sociology 3.75%
• Political Science 3.58%
• Biology 2.60%
Pearson PLC:
• Language/Literature 18.67%
• Business/Economics 13.30%
• Computer Science 9.42%
• Engineering 8.04%
• History 7.59%
• Mathematics 6.04%
• Education 5.64%
• Sociology 4.18%
• Philosophy/Religion 3.81%
• Physical Sciences 2.75%
Publisher Profiles – Conspectus Categories
Oxford Univ. Press:
• English literature 10.66%
• English language 5.86%
• Instrumental music 3.48%
• Vocal music 3.09%
• Literature on music 2.26%
• History – Britain 1.82%
• Economic history 1.38%
• American lit. 1.35%
• History – S. Asia 1.30%
• General history 1.29%
Pearson PLC:
• English language 7.74%
• Business admin. 4.62%
• English literature 3.63%
• Economics 2.94%
• Comp. programming 2.39%
• Electrical engineering 2.24%
• Early childhood ed. 2.05%
• Computer software 1.88%
• U.S. federal law 1.80%
• Computer Science 1.54%
Publisher Profiles – Conspectus Subjects
Oxford Univ. Press:
• English – modern 5.57%
• English lit – prose 2.51%
• English lit – 19th c. 2.23%
• Juvenile lit. 1.06%
• English lit – poetry 1.03%
• English lit – collections 0.80%
• Biographies 0.76%
• English lit – 1900-1960 0.74%
• Shakespeare 0.68%
• Sacred choruses 0.66%
Pearson PLC:
• English – modern 7.68%
• Management 2.53%
• Programming 1.74%
• Arithmetic 1.09%
• Economic theory 1.06%
• Marketing 1.06%
• General algebra 1.04%
• Accounting 0.97%
• Juvenile lit. 0.93%
• English lit – 19th c. 0.89%
Projected MARC coding of Authorized Forms
• 710 Added Entry – Corporate Name
  • Add $4 for publisher name
  • Add $2 NAF where preferred form matches existing authority record (44% of current PNAF)
• 752 Added Entry – Hierarchical Place Name
  • Add $2 FAST where place of publication matches FAST geographical subject headings
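As a rough picture of what the projected coding could look like, the sketch below renders the two fields as display strings. It is not a MARC encoder: indicators are omitted, the subfield order is a simplification, and the helper names are invented for the example. The relator code `pbl` (publisher) and the source codes `naf` and `fast` are standard MARC code values, but how OCLC would actually apply them is an assumption.

```python
def marc_field(tag, subfields):
    """Render a field as 'TAG $code value ...' for display purposes only."""
    return tag + " " + " ".join(f"${code} {value}" for code, value in subfields)


def publisher_added_entry(preferred_name, in_naf):
    """710 with relator code $4 pbl; add $2 naf only when an authority record matches."""
    subfields = [("a", preferred_name), ("4", "pbl")]
    if in_naf:
        subfields.append(("2", "naf"))
    return marc_field("710", subfields)


def place_added_entry(country, city, in_fast):
    """752 with $a country and $d city; add $2 fast only when FAST has the place."""
    subfields = [("a", country), ("d", city)]
    if in_fast:
        subfields.append(("2", "fast"))
    return marc_field("752", subfields)


if __name__ == "__main__":
    print(publisher_added_entry("Oxford University Press", in_naf=True))
    print(place_added_entry("England", "Oxford", in_fast=True))
```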
Future Research
• Further data mining
  • Profile aspects of publication output
  • Deeper scaling into WorldCat (beyond ISBN)
• Plan for long-term maintenance
  • ISBN-13 compliance
  • File expansion of ongoing mergers/acquisition activities
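ISBN-13 compliance would mean, among other things, being able to map existing ISBN-10 values to their 13-digit form. The standard conversion prepends 978, drops the old check digit, and recomputes the check digit with alternating weights of 1 and 3. The function below is a sketch of that conversion; its name is chosen here, and it does not validate the incoming ISBN-10 check digit.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 (hyphens or spaces allowed) to an unhyphenated ISBN-13."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    # Keep the first nine digits; the tenth is the old check digit.
    core = "978" + digits[:9]
    # Weights alternate 1, 3, 1, 3, ... across the twelve digits.
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


if __name__ == "__main__":
    print(isbn10_to_isbn13("0-306-40615-2"))
```

Publisher prefixes carry over unchanged inside the 978 range, so an existing prefix table would only need the extra leading element to match 13-digit values.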
Thank You!
• Questions and Discussion
• Lynn Silipigni Connaway: connawal@oclc.org
• Timothy J. Dickey: dickeyt@oclc.org