Google Books


Presentation Transcript


  1. Google Books: Where we're going and how we got here. Jon Orwant, Engineering Manager, Google Books

  2. Overview
     • Why and how Google scans books
     • The Google Books settlement
     • From pages to ideas
     Google Confidential and Proprietary

  3. Why and How Google Scans Books

  4. Google’s mission: To organize the world’s information and make it universally accessible and useful.
     • Online content: billions of web pages
     • Offline content: billions of items becoming indexed

  5. Limited previews from publishers & authors

  6. http://books.google.com

  7. Google Books in a nutshell

  8. Vital stats
     Scans
     • Number of books scanned: 15M+
     • Number of pages: 4B
     • Number of words: 2T
     • Libraries: 40+
     • Publishers: 30K+
     Metadata
     • Number of books: 130M
     • Number of records: 4B
     • Number of metadata fields: 1T

  9. Identifying the book
                 Library of Congress          Books in Print
     title       Lord of the Rings, v.1       The Fellowship of the Ring
     author      John Roland Reuel Tolkien    J.R.R. Tolkien
     publisher   Houghton Mifflin             Ballantine Books
     year        1954                         1994

  10. How Google Handles Metadata
      • Collect data from 100+ sources (libraries, commercial aggregators, union catalogs, publishers, retailers)
      • Parse the records into our internal format
        MARC, ONIX, others...
        "UVA stores item data and call numbers in 955$a..."
      • Cluster the records into expressions and manifestations
      • Create a "best of" record for each cluster
      • Index and display elements of that record on books.google.com
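
The cluster and "best of" steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Google's pipeline: the record fields, the `cluster_key` rule (normalized title plus author surname) and the majority-vote merge in `best_of` are all hypothetical, and the sketch groups at the level of a work rather than separating expressions from manifestations.

```python
from collections import Counter, defaultdict
import re


def cluster_key(record):
    """Hypothetical clustering key: normalized title plus author surname."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower()).strip()
    surname = record["author"].split()[-1].lower()
    return (title, surname)


def cluster_records(records):
    """Group parsed records that share the same clustering key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[cluster_key(record)].append(record)
    return dict(clusters)


def best_of(records):
    """Build one merged record: the most common non-empty value per field."""
    merged = {}
    fields = {field for record in records for field in record}
    for field in sorted(fields):
        values = [record[field] for record in records if record.get(field)]
        if values:
            merged[field] = Counter(values).most_common(1)[0][0]
    return merged
```

A real system would need fuzzier keys (see the transliteration and date problems on the following slides) and a way to weigh sources against each other; simple majority voting is used here only to keep the example short.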

  11. 478 languages
      Kabardian: 16; Khasi: 78; Khoisan: 53; Khotanese: 21; Kikuyu, Gikuyu: 48; Kinyarwanda: 77; Kirghiz, Kyrgyz: 702; Kimbundu: 14; Konkani: 83; Komi: 48; Kongo: 134; Korean: 35905; Kosraean: 10; Kpelle: 6; Karachay-balkar: 17; Karelian: 28; Kru: 26; Kurukh: 30; Kuanyama: 9; Kumyk: 16; Kurdish: 220; Kutenai: 0; Klingon: 3; Kalmyk: 26; Kashubian: 14; Kara-kalpak: 102; Kabyle: 50; Kachin: 18; Kalaallisut: 82; Kamba: 29; Kannada: 2600; Karen: 50; Kashmiri: 289; Kanuri: 25; Kawi: 106; Kazakh: 1871

  12. Translit-aware similarity metrics for names and titles
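
The slide gives only a title, so the following is a guess at the general idea rather than the metric actually used. One simple ingredient of a transliteration-tolerant comparison is to fold away diacritics and case before measuring string similarity. The function names `fold` and `similarity` are hypothetical, and the sketch handles only Latin-script variants, not cross-script transliteration.

```python
import unicodedata
from difflib import SequenceMatcher


def fold(name):
    """Strip diacritics, case, and extra whitespace from a name or title."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(stripped.casefold().split())


def similarity(a, b):
    """Similarity in [0, 1] between two names after folding."""
    return SequenceMatcher(None, fold(a), fold(b)).ratio()
```

With this folding, "Antonín Dvořák" and "Antonin Dvorak" compare as identical, while competing romanizations such as "Dostoyevsky" and "Dostoevskij" still score high without being equal.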

  13. Material content & form
      <datafield tag="245" ind1=" " ind2=" ">
        <subfield code="a">[Turkey probe]</subfield>
      </datafield>
      <datafield tag="260" ind1=" " ind2=" ">
        <subfield code="a">Syracuse : Betty Crocker Supplies, ca 1987</subfield>
      </datafield>
      <datafield tag="300" ind1=" " ind2=" ">
        <subfield code="a">1 pointy thing , 46 cm. </subfield>
      </datafield>
      <datafield tag="650" ind1=" " ind2=" ">
        <subfield code="a">Microwave cookery</subfield>
      </datafield>
      <datafield tag="650" ind1=" " ind2=" ">
        <subfield code="a">April Fool's Day</subfield>
      </datafield>
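
A record like the one on this slide can be read with Python's standard XML parser. The sketch below is a minimal illustration: the wrapping `<record>` element and the helper name `subfields_by_tag` are assumptions, and real MARCXML normally declares an XML namespace, which this example leaves out to match the slide.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The slide's fields, wrapped in an assumed <record> root element.
SAMPLE = """<record>
  <datafield tag="245" ind1=" " ind2=" ">
    <subfield code="a">[Turkey probe]</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="a">Syracuse : Betty Crocker Supplies, ca 1987</subfield>
  </datafield>
  <datafield tag="300" ind1=" " ind2=" ">
    <subfield code="a">1 pointy thing , 46 cm. </subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2=" ">
    <subfield code="a">Microwave cookery</subfield>
  </datafield>
  <datafield tag="650" ind1=" " ind2=" ">
    <subfield code="a">April Fool's Day</subfield>
  </datafield>
</record>"""


def subfields_by_tag(xml_text, code="a"):
    """Map each datafield tag to the list of its subfield values for `code`."""
    root = ET.fromstring(xml_text)
    result = defaultdict(list)
    for datafield in root.iter("datafield"):
        for subfield in datafield.findall("subfield"):
            if subfield.get("code") == code and subfield.text:
                result[datafield.get("tag")].append(subfield.text.strip())
    return dict(result)
```

Calling `subfields_by_tag(SAMPLE)` collects the title (245), imprint (260), physical description (300) and both subject headings (650). Here the physical description is what shows that the item is not a book.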

  14. Cover generation

  15. Parsing Uncertain Dates
      • 18??
      • [196-?]
      • 1957/8
      • late 14th century
      • finita quarto nonas Januarias [1490]
      • mense Septembri: Anno Millesimo q[ui]ngentesimo decimonono
      • mense iulio, anno M.D.XXXX
      • התשנ״א (Hebrew year 5751 = Gregorian 1990/1 CE)
      • ١٣٧٣ (either Islamic year 1373 AH = Gregorian 1953/4 CE or Persian year 1373 AP = Gregorian 1994/5 CE)
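
Several of the formats above can be reduced to an (earliest, latest) Gregorian year range with a handful of patterns. The sketch below is an illustration only, with a hypothetical `year_range` helper. Treating "late" as the last third of a century is an arbitrary convention chosen for the example, and the spelled-out Latin, Hebrew and Arabic-Indic cases are deliberately left unresolved because they need language and calendar knowledge.

```python
import re

ROMAN = {"M": 1000, "D": 500, "C": 100, "L": 50, "X": 10, "V": 5, "I": 1}


def roman_to_int(numeral):
    """Convert a Roman numeral, tolerating additive forms such as XXXX."""
    values = [ROMAN[ch] for ch in numeral]
    total = 0
    for i, value in enumerate(values):
        following = values[i + 1] if i + 1 < len(values) else 0
        total += -value if value < following else value
    return total


def year_range(raw):
    """Return (earliest, latest) Gregorian years, or None if unrecognized.

    Patterns use [0-9] rather than \\d so that non-ASCII digits, whose
    calendar is ambiguous, are not silently read as Gregorian years.
    """
    s = raw.strip()
    m = re.search(r"(late\s+)?([0-9]{1,2})(?:st|nd|rd|th)\s+century", s, re.I)
    if m:
        end = int(m.group(2)) * 100
        start = end - 33 if m.group(1) else end - 99  # "late" = last third (assumption)
        return (start, end)
    m = re.search(r"(?<![0-9])([0-9]{4})/([0-9]{1,2})(?![0-9])", s)
    if m:  # e.g. 1957/8: the suffix replaces the final digits of the first year
        first, suffix = m.group(1), m.group(2)
        return (int(first), int(first[:-len(suffix)] + suffix))
    m = re.search(r"(?<![0-9])([0-9]{2,3})([-?]+)", s)
    if m and len(m.group(2)) >= 4 - len(m.group(1)):
        missing = 4 - len(m.group(1))  # e.g. 18?? or 196-?
        low = int(m.group(1)) * 10 ** missing
        return (low, low + 10 ** missing - 1)
    m = re.search(r"(?<![0-9])([0-9]{4})(?![0-9])", s)
    if m:
        return (int(m.group(1)), int(m.group(1)))
    m = re.search(r"(?<![A-Za-z])((?:[MDCLXVI]\.?){2,})(?![A-Za-z])", s)
    if m:
        year = roman_to_int(m.group(1).replace(".", ""))
        return (year, year)
    return None
```

For example, `year_range("[196-?]")` gives `(1960, 1969)` and `year_range("mense iulio, anno M.D.XXXX")` gives `(1540, 1540)`, while the Hebrew and Arabic-Indic examples return `None` rather than a wrong guess.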

  16. Annotations

  17. The Google Books Settlement

  18. Google Books Settlement
      • If approved, resolves lawsuit brought against Google by AAP & AG
      • Benefits:
      • Rightsholder control
      • Snippets => 20%
      • Library subscriptions
      • Free terminal in every US public library building
      • Downloadable books for purchase
      • Access for the print-disabled
      • Book Rights Registry: a non-profit organization to find and pay rightsholders
      • Research corpus

  19. Linguistic Analysis: "Research that performs linguistic analysis over the Research Corpus to understand language, linguistic use, semantics and syntax as they evolve over time and across different genres or other classifications of Books."

  20. From Pages to Ideas

  21. Books as a corpus of human knowledge
      • Understand one book
      • Understand all books
      • Understand relations between books

  22. Insights into human progress
      Old-fashioned trigrams       New-fangled trigrams
      oxide of lead                lesbian and gay
      may be thus                  health care professionals
      a heavy fire                 abuse and neglect
      a striking proof             the overall process
      miles distant from           shift away from
      terms of peace               the power elite
      presents the appearance      a research project
      more than mortal             the poor countries
      vexation of spirit           probability of failure
      zeal and devotion            increased awareness of
      Source: Matthew Gray & Yuan K. Shen
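
A toy version of this comparison: count word trigrams in two sets of texts and keep those that appear in one set but not the other. This is a sketch under simple assumptions, not the method Gray and Shen used. The function names and the crude tokenizer are hypothetical, and a real analysis would compare relative frequencies across dated corpora rather than exact presence or absence.

```python
from collections import Counter
import re


def trigrams(text):
    """Return the word trigrams of a text, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]


def distinctive(corpus_a, corpus_b, top=3):
    """Most frequent trigrams that occur in corpus_a but never in corpus_b."""
    counts_a = Counter(t for doc in corpus_a for t in trigrams(doc))
    counts_b = Counter(t for doc in corpus_b for t in trigrams(doc))
    only_a = {t: n for t, n in counts_a.items() if t not in counts_b}
    # Sort by descending count, then alphabetically for a stable order.
    return sorted(only_a, key=lambda t: (-only_a[t], t))[:top]
```

Passing a list of older texts as `corpus_a` and newer texts as `corpus_b` yields "old-fashioned" trigrams; swapping the arguments yields the "new-fangled" ones.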

  23. Semantic Stack

  24. Semantic Stack (video remix)

  25. Reframing the Victorians (Cohen & Gibbs, GMU)

  26. Victorian terms

  27. Discipline-specific progress occurs by...
      ...moving up one level
      ...or improving the results at one level by creating a reusable data set
      ...or reasonably using one level as a proxy for a higher level

  28. Reframing the Victorians
      ...reasonably using one level as a proxy for a higher level

  29. Interdisciplinary progress occurs by...
      ...moving up one level
      ...or improving the results at one level
      ...by creating infrastructure that can be used by others

  30. Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques
      Intralanguage translations (Efron, U. Illinois)

  31. Intralanguage translations
      improving the results at one level
      ...by creating infrastructure that can be used by others

  32. Automatic Identification and Extraction of Structured Linguistic Passages in Texts
      Grammar inference (Abney & Szymanski, Univ. Michigan)

  33. Grammar inference
      moving up one level
      ...by creating infrastructure that can be used by others

  34. The "Great Man" theory

  35. Thank You! Q&A
