Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Research Timothy J. Dickey, Ph.D.

Data Mining, Advanced Collection Analysis, and Publisher Profiles: An Update on the OCLC Publisher Name Authority File. Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Research Timothy J. Dickey, Ph.D. Post-Doctoral Researcher OCLC Research. Overall Research Goals.

Presentation Transcript


  1. Data Mining, Advanced Collection Analysis, and Publisher Profiles: An Update on the OCLC Publisher Name Authority File Lynn Silipigni Connaway, Ph.D. Senior Research Scientist OCLC Research Timothy J. Dickey, Ph.D. Post-Doctoral Researcher OCLC Research

  2. Overall Research Goals • To Build a Database that Will: • Identify • Authoritative strings for publisher names • Common variants for names and locations • Hierarchical references indicating relationships and nesting of subsidiaries • Definitions of publishing entities

  3. Overall Research Goals • To Build a Database that Will: • Produce • Profiles, including data-mined information regarding formats, languages, subjects, etc. for publishers • Conform • to international authority and standards practice, and • inter-operate with other OCLC products

  4. Issues & Challenges • Database Quality: • Historical Practices • “…the shortest form in which it can be understood.” [AACR2 2004] • Different versions of cataloging rules • Abbreviations • Errors and misspellings • Local Practices

  5. Method: Data Mining in an “Aggregate Collection” • Data Mining and Analysis of WorldCat: • “…affords high-level perspective on historical patterns, suggests future trends, and supplies useful intelligence with which to inform decision making.” • Lavoie, B.F., Connaway, L. S., & O’Neill, E. T. (2007). Mapping WorldCat’s digital landscape. Library Resources & Technical Services, 51, 106-115 at 107.

  6. WorldCat: July 2008 • Manifestations (records): 108,828,533 • Works: 84,096,107 • Total holdings: 1,292,763,300 • Digital items: 3,182,550 • Institutions: 69,000 • Physical items: ~1.2 billion

  7. Global Origins of WorldCat Materials • US 28% • Rest of World 27% • Unknown 17% • Germany 10% • UK 8% • France 4% • Canada 3%

  8. Global Origins of WorldCat Materials • Materials with non-US origins: 57.9 million (55%) • Top 5: Germany 10.0 million; UK 8.8 million; France 4.2 million; Netherlands 2.9 million; Canada 2.9 million • Content languages: 478 • 49% of WorldCat is non-English • Top 5 non-English: German 12 million; French 6.1 million; Spanish 3.5 million; Dutch 2.6 million; Japanese 2.4 million • Non-English metadata language: 28 million (66 languages) • Top 5: German 11 million; French 1.8 million; Dutch 5.0 million; Finnish 0.7 million; Swedish 1.9 million

  9. OCLC Publisher Name Server

  10. Publisher Name Server: Objectives • Resolve for data mining and quality of WorldCat • ISBN prefixes to publisher name • Variant publisher names to a preferred form • Complement Collection Analysis Service • Librarians & Publishers
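The second objective on this slide, resolving variant publisher names to a preferred form, can be illustrated with a small lookup. This is a minimal sketch and not the OCLC implementation: the `AUTHORITY` table, its variant strings, and the function names are hypothetical, and the normalization rule (lowercase, strip punctuation) is an assumption chosen for illustration.

```python
import re

# Hypothetical authority table: preferred form -> known variant strings.
# The variants are illustrative and are not taken from the OCLC file.
AUTHORITY = {
    "Oxford University Press": ["Oxford Univ. Press", "Oxford U.P.", "OUP"],
    "Prentice-Hall, Inc.": ["Prentice Hall", "Prentice-Hall"],
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    name = re.sub(r"[^\w\s]", " ", name.lower())
    return " ".join(name.split())


# Reverse index from every normalized string to its preferred form.
VARIANT_INDEX = {
    normalize(form): preferred
    for preferred, variants in AUTHORITY.items()
    for form in [preferred, *variants]
}


def preferred_form(raw: str):
    """Return the preferred form for a raw publisher string, or None."""
    return VARIANT_INDEX.get(normalize(raw))


print(preferred_form("OXFORD UNIV. PRESS"))
print(preferred_form("Unknown House"))
```

An exact-match index like this only catches variants that differ in case or punctuation; misspellings and abbreviations of the kind listed under Issues & Challenges would need fuzzy matching or manual review.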

  11. Publisher Name Server: Objectives • Capture and profile attributes of individual publishers: • Location(s) • Language(s) of materials published • Genre(s)/format(s) • Dominant subject domain(s) • Parent company and subsidiaries

  12. Publisher Name Server: Methodology • Programmatically cluster publishers’ records using ISBN prefixes • Data clustering • Classification of similar objects into different groups • Partitioning of a data set into subsets (clusters) • Hand parse the entities and resolve ISBN prefixes
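The clustering step described on this slide can be sketched as grouping records by ISBN prefix and collecting the publisher strings that share each prefix. This is an illustration under stated assumptions: the input ISBNs are assumed to be hyphenated ISBN-10s, so the prefix is simply the first two elements, and the sample records and function names are made up for the example.

```python
from collections import defaultdict


def isbn_prefix(isbn: str) -> str:
    """Return the group and publisher elements of a hyphenated ISBN-10."""
    parts = isbn.strip().split("-")
    if len(parts) != 4:
        raise ValueError(f"expected a hyphenated ISBN-10, got {isbn!r}")
    return "-".join(parts[:2])


def cluster_by_prefix(records):
    """Group (isbn, publisher_string) pairs by ISBN prefix."""
    clusters = defaultdict(set)
    for isbn, publisher in records:
        clusters[isbn_prefix(isbn)].add(publisher)
    return dict(clusters)


# Illustrative sample records, not drawn from WorldCat.
records = [
    ("0-19-852663-6", "Oxford University Press"),
    ("0-19-280722-6", "Oxford Univ. Press"),
    ("0-13-110362-8", "Prentice Hall"),
]

for prefix, names in sorted(cluster_by_prefix(records).items()):
    print(prefix, sorted(names))
```

Each cluster then holds the candidate variant strings that a person would review when hand parsing the entities. Unhyphenated ISBNs would need the ISBN range tables to locate the prefix boundary, which this sketch does not attempt.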

  13. Publisher Name Server: Database • 1750 publishing entities • Relational database, preserving hierarchical relationships • Begins with high-occurrence entities: • “Top 10” lists • Top 10 university presses • Mergers and acquisitions, last 8 years
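One common way to preserve hierarchical relationships in a relational database is a self-referencing parent column queried with a recursive common table expression. The sketch below uses SQLite through Python's standard `sqlite3` module. The schema, the table and column names, and the specific parent-child nesting are assumptions for illustration; only the entity names come from the presentation.

```python
import sqlite3


def build_db():
    """Create an in-memory table where each entity points at its parent."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entity ("
        " id INTEGER PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " parent_id INTEGER REFERENCES entity(id))"
    )
    # The nesting shown here is assumed, not confirmed by the slides.
    rows = [
        (1, "Pearson PLC", None),
        (2, "Penguin Books", 1),
        (3, "Puffin Books", 2),
        (4, "Pearson Education, Inc.", 1),
        (5, "Prentice-Hall, Inc.", 4),
    ]
    conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", rows)
    return conn


def descendants(conn, root_name):
    """Return the names of all entities nested under root_name, sorted."""
    sql = """
        WITH RECURSIVE tree(id, name) AS (
            SELECT id, name FROM entity WHERE name = ?
            UNION ALL
            SELECT e.id, e.name
            FROM entity e JOIN tree t ON e.parent_id = t.id
        )
        SELECT name FROM tree WHERE name <> ? ORDER BY name
    """
    return [row[0] for row in conn.execute(sql, (root_name, root_name))]


conn = build_db()
print(descendants(conn, "Pearson PLC"))
```

With this layout, a merger or acquisition becomes a single update to one `parent_id` value, and aggregate counts for a parent company can be computed by joining its descendants to the record table.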

  14. Example: Top U.S. Publishing Entities by ISBN

  15. Publisher Name Server: Data Captured • Data from published sources: Publisher Name, Preferred Form; Source of Preferred Form; Former Names; Variant Forms; ISBN Prefixes; HQ City; HQ Country; Other Cities; URL • Data from data mining: Languages; Formats; Conspectus Subjects • Sources: U.S. Library of Congress, National Authority File, 110 (Corporate Name) field; Books In Print Online (R.R. Bowker); The International ISBN Registry (K.G. Saur); Publishers Weekly Online; Hoover’s Handbook Online; Standard and Poor’s Corporate Descriptions; The Directory of Corporate Affiliations (DIALOG); Company websites

  16. Publisher Name Server: Current Scope • More than 56,000 separate strings mapped to 1750 entities • 8.5 million OCLC records • 22% of these are Library of Congress records • ~490 million holdings • Hierarchical relationships maintained

  17. Entity-Parsing in a World of Mergers and Acquisitions • Pearson PLC • Penguin Books • Pearson Canada • Pearson Technology Group • Allen Lane • Ladybird Books • Riverhead Books • Copp Clark • Adobe Press • Cisco Press • Puffin Books • Putnam Books • Berkley Publishing Group • Pearson Education, Inc. • Avery • Addison-Wesley Publishing Company • Allyn and Bacon • Prentice-Hall, Inc. • Dominie Press • Benjamin/Cummings Publishing Company • Scott, Foresman and Company • HarperCollins Educational Publishers • Longmans, Green, and Co.

  18. Publisher Profiles within WorldCat • Oxford University Press • 119,237 records with ISBNs mapped to 210,095 records (0.19% of WorldCat) • Pearson PLC • Includes 14 subsidiaries and acquisitions • Aggregate: 291,433 records (0.27% of WorldCat) • Springer (Firm) • 197,263 records (0.18% of WorldCat) • Reed Elsevier PLC • Includes dozens of subsidiaries • Aggregate: 370,029 records (0.34% of WorldCat)
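The percentages on this slide can be reproduced by dividing each publisher's record count by the 108,828,533 manifestation records reported for WorldCat in July 2008 (slide 6). The short check below uses only figures from the slides; the function and variable names are illustrative.

```python
WORLDCAT_RECORDS = 108_828_533  # manifestations (records), July 2008

# Record counts as reported on slide 18.
PUBLISHER_RECORDS = {
    "Oxford University Press": 210_095,
    "Pearson PLC": 291_433,
    "Springer (Firm)": 197_263,
    "Reed Elsevier PLC": 370_029,
}


def share_of_worldcat(records: int, total: int = WORLDCAT_RECORDS) -> float:
    """Return records as a percentage of total, rounded to two decimals."""
    return round(100 * records / total, 2)


for name, count in PUBLISHER_RECORDS.items():
    print(f"{name}: {share_of_worldcat(count)}%")
```

The results match the reported 0.19%, 0.27%, 0.18%, and 0.34%, which suggests the slide's shares are computed against total manifestation records rather than works or holdings.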

  19. WorldCat Publisher Profiles – Top Languages • Oxford Univ. Press: English 96.74%; Latin 0.51%; German 0.39%; Chinese 0.39%; French 0.37%; Spanish 0.28%; Afrikaans 0.14%; Middle English 0.13%; Malay 0.09%; Swahili 0.09% • Pearson PLC: English 95.27%; Spanish 1.43%; German 1.33%; French 0.60%; Dutch 0.55%; Latin 0.26%; Malay 0.06%; Ancient Greek 0.05%; Portuguese 0.05%; Italian 0.04%

  20. WorldCat Publisher Profiles – Top Languages • Springer (Firm): English 61.25%; German 37.10%; French 1.02%; Italian 0.29%; Polish 0.13%; Czech 0.04%; Spanish 0.04%; Hungarian 0.03%; Dutch 0.02%; Danish 0.02% • Reed Elsevier PLC: English 83.64%; French 9.34%; Dutch 2.32%; Spanish 0.95%; Italian 0.60%; Latin 0.27%; Afrikaans 0.16%; Ancient Greek 0.12%; Portuguese 0.09%; Polish 0.06%
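A language profile of this kind can be derived by counting the language code on each record attributed to a publisher and converting the counts to percentages. The sketch below is illustrative only: the function name, the input shape (one language code per record), and the sample data are assumptions, not the method used to produce the slides.

```python
from collections import Counter


def language_profile(language_codes, top_n=10):
    """Return the top_n (language, percent) pairs, most frequent first."""
    counts = Counter(language_codes)
    total = sum(counts.values())
    return [
        (lang, round(100 * n / total, 2))
        for lang, n in counts.most_common(top_n)
    ]


# Made-up sample: one language code per record for a single publisher.
sample = ["eng"] * 7 + ["ger"] * 2 + ["fre"]
print(language_profile(sample))
```

The same counting approach would apply to the format and Conspectus subject profiles by swapping the attribute being tallied.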

  21. WorldCat Publisher Profiles – Formats • Oxford University Press: Printed Material 89.57%; Computer File 8.23%; Microform 1.39%; Sound Recording 0.50%; Video Recording 0.16% • Springer (Firm): Printed Material 81.69%; Computer File 17.51%; Microform 0.71%; Video Recording 0.05% • Pearson PLC: Printed Material 92.98%; Microform 2.82%; Computer File 2.15%; Video Recording 0.70%; Sound Recording 0.67% • Reed Elsevier PLC: Printed Material 92.31%; Computer File 5.46%; Microform 1.85%; Video Recording 0.14%

  22. WorldCat Publisher Profiles – Conspectus Divisions • Oxford Univ. Press: Language/Literature 27.12%; History 11.92%; Music 9.78%; Philosophy/Religion 9.55%; Business/Economics 6.15%; Medicine 4.36%; Law 3.85%; Sociology 3.75%; Political Science 3.58%; Biology 2.60% • Pearson PLC: Language/Literature 18.67%; Business/Economics 13.30%; Computer Science 9.42%; Engineering 8.04%; History 7.59%; Mathematics 6.04%; Education 5.64%; Sociology 4.18%; Philosophy/Religion 3.81%; Physical Sciences 2.75%

  23. WorldCat Publisher Profiles – Conspectus Categories • Oxford Univ. Press: English literature 10.66%; English language 5.86%; Instrumental music 3.48%; Vocal music 3.09%; Literature on music 2.26%; History – Britain 1.82%; Economic history 1.38%; American lit. 1.35%; History – S. Asia 1.30%; General history 1.29% • Pearson PLC: English language 7.74%; Business admin. 4.62%; English literature 3.63%; Economics 2.94%; Comp. programming 2.39%; Electrical engineering 2.24%; Early childhood ed. 2.05%; Computer software 1.88%; U.S. federal law 1.80%; Computer Science 1.54%

  24. WorldCat Publisher Profiles – Conspectus Subjects • Oxford Univ. Press: English – modern 5.57%; English lit. – prose 2.51%; English lit. – 19th c. 2.23%; Juvenile lit. 1.06%; English lit. – poetry 1.03%; English lit. – collections 0.80%; Biographies 0.76%; English lit. – 1900-1960 0.74%; Shakespeare 0.68%; Sacred choruses 0.66% • Pearson PLC: English – modern 7.68%; Management 2.53%; Programming 1.74%; Arithmetic 1.09%; Economic theory 1.06%; Marketing 1.06%; General algebra 1.04%; Accounting 0.97%; Juvenile lit. 0.93%; English lit. – 19th c. 0.89%

  25. WorldCat Publisher Profiles – Conspectus Divisions • Springer (Firm): Computer Science 16.83%; Engineering 15.12%; Mathematics 12.96%; Medicine 9.93%; Physical Sciences 9.83%; Biology 5.22%; Business/Economics 5.13%; Health Professions 4.48%; Chemistry 3.14%; Geography 2.58% • Reed Elsevier PLC: Language/Literature 14.18%; Law 11.78%; Engineering 11.73%; Business/Economics 6.82%; Medicine 6.50%; Physical Sciences 5.01%; History 4.57%; Biology 4.32%; Health Professions 3.70%; Chemistry 3.51%

  26. WorldCat Publisher Profiles – Conspectus Categories • Springer (Firm): Computer science 5.23%; General math 4.48%; Health professions 4.03%; Electrical engineering 3.73%; General engineering 3.25%; Mathematical analysis 3.06%; Computer software 2.37%; Comp. programming 2.34%; Probability/Statistics 2.20%; Mech. engineering 2.17% • Reed Elsevier PLC: English literature 5.84%; Health professions 3.40%; English language 2.79%; U.S. federal law 2.32%; General engineering 2.26%; Electrical engineering 2.10%; General law 1.70%; Industrial economics 1.65%; Business admin. 1.53%; U.S. state law 1.46%

  27. WorldCat Publisher Profiles – Conspectus Subjects • Springer (Firm): Health professions 3.56%; Math collections 2.76%; Computer science 1.84%; Programming 1.46%; Access/security 1.10%; Artificial intelligence 1.03%; Mathematical stats 1.03%; Analytical physics 1.02%; Industrial management 0.99%; Engineering materials 0.90% • Reed Elsevier PLC: English – modern 2.68%; English – prose 2.06%; Health professions 1.92%; U.S. state law 1.37%; Industrial management 1.22%; Legal periodicals 1.16%; English lit. – 1900-1960 1.15%; Engineering materials 0.86%; English fiction 0.83%; Nuclear physics 0.68%

  28. Projected MARC coding of Authorized Forms • 710 Added Entry – Corporate Name • Add $4 for publisher name • Add $2 NAF where preferred form matches existing authority record (44% of current PNAF) • 752 Added Entry – Hierarchical Place Name • Add $2 FAST where place of publication matches FAST geographical subject headings
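To make the projected coding concrete, the sketch below renders the two added entries as display strings. It is a hedged illustration, not an OCLC specification: the helper names are hypothetical, the relator code `pbl` (publisher) and the source codes `naf` and `fast` are assumed to be the values intended by "$4", "$2 NAF", and "$2 FAST", and the indicator values and subfield order are assumptions (`#` stands for a blank indicator).

```python
def format_field(tag, indicators, subfields):
    """Render a MARC field as a display string such as '710 2# $a Name'."""
    body = " ".join(f"${code} {value}" for code, value in subfields)
    return f"{tag} {indicators} {body}"


def publisher_added_entry(name, in_naf):
    """Build a 710 added entry; add $2 only when an authority record matches."""
    subfields = [("a", name), ("4", "pbl")]  # assumed relator code: publisher
    if in_naf:
        subfields.append(("2", "naf"))  # assumed source code for the NAF
    return format_field("710", "2#", subfields)


def publication_place_entry(country, city, in_fast):
    """Build a 752 hierarchical place name; add $2 when FAST has a match."""
    subfields = [("a", country), ("d", city)]
    if in_fast:
        subfields.append(("2", "fast"))  # assumed source code for FAST
    return format_field("752", "##", subfields)


print(publisher_added_entry("Oxford University Press", in_naf=True))
print(publication_place_entry("England", "Oxford", in_fast=True))
```

Making `$2` conditional mirrors the slide's note that only some preferred forms (44% of the current file) match an existing authority record.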

  29. Ongoing Research • Further data mining • Profile other aspects of publication output • Profile other publishers • Trends over time • Author clusters • Geographic holdings patterns • Collection Analysis

  30. Ongoing Research • Plan for long-term maintenance • ISBN-13 compliance • File expansion of ongoing mergers/ acquisition activities • Deeper scaling into WorldCat (beyond ISBN)
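ISBN-13 compliance matters for a file keyed on ISBN prefixes because every existing ISBN-10 gains a "978" prefix and a new check digit when converted. The conversion below follows the standard ISBN-13 check-digit rule (alternating weights of 1 and 3, modulo 10); the function name is illustrative, and the sketch does not validate the incoming ISBN-10 check digit.

```python
def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 (hyphens optional) to an unhyphenated ISBN-13."""
    digits = isbn10.replace("-", "").replace(" ", "")
    if len(digits) != 10:
        raise ValueError(f"expected 10 characters, got {isbn10!r}")
    # Drop the old check digit and prepend the 978 Bookland prefix.
    core = "978" + digits[:9]
    # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


print(isbn10_to_isbn13("0-306-40615-2"))  # 9780306406157
```

Under this conversion a publisher prefix such as 0-19 becomes 978-0-19, so prefix lookups need to accept both forms while records in both formats coexist.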

  31. OCLC Publisher Name Server • Project page: • http://www.oclc.org/research/projects/publisherns/

  32. Thank You! • Questions and Discussion • Lynn Silipigni Connaway connawal@oclc.org • Timothy J. Dickey dickeyt@oclc.org
