
Automated Metadata Creation: Possibilities and Pitfalls

Presented by Wilhelmina Randtke on June 10, 2012, in Nashville, Tennessee, at the annual meeting of the North American Serials Interest Group. Materials posted at www.randtke.com/presentations/NASIG.html.


Presentation Transcript


  1. Automated Metadata Creation: Possibilities and Pitfalls • Presented by Wilhelmina Randtke • June 10, 2012 • Nashville, Tennessee • At the annual meeting of the North American Serials Interest Group. • Materials posted at www.randtke.com/presentations/NASIG.html

  2. Teaser: Preview of the sample project. http://www.fsulawrc.com

  3. Background: What is “metadata”? • Metadata = any indexing information • Examples: • MARC records • color, size, etc. to allow clothes shopping on a website • writing on the spine of a book • food labels

  4. What we'll cover • Automated indexing: • Human vs machine indexing • Range of tools for automated metadata creation: Techy and less techy. • Sample projects • A little background on relational databases • Database design for a looseleaf (a resource that changes state over time). • Sample project: The Florida Administrative Code 1970-1983

  5. Automated Indexing: What’s easy for computers? • Computers like black and white decisions. • Computers are bad with discretion.

  6. Word search vs. Subject headings

  7. One Trillion • 1,000,000,000,000 • webpages indexed in Google • … 4 years ago …

  8. Nevertheless… human indexing is alive and well

  9. How to fund indexing?

  10. http://www.ebay.com/sch/Dresses-/63861/i.html?_nkw=summer+dress

  11. How to fund indexing?

  12. How to fund indexing?

  13. Who made the metadata: Human or Machine? How Google Books gets its metadata: http://go-to-hellman.blogspot.com/2010/01/google-exposes-book-metadata-privates.html

  14. Not automated indexing, but a related concept…. • Always try to think about • how to reuse existing metadata.

  15. High Tech automated metadata creation

  16. The high end: Assigning subject headings with computer code • Some technologies: • UIMA (Unstructured Information Management Architecture) • GATE (General Architecture for Text Engineering) • KEA (Keyphrase Extraction Algorithm)

  17. Person’s role: • Select an appropriate ontology. • Configure the program so that it’s looking at outside sources. • Review the results and make sure the assigned subject headings are good. Program’s role: • Take ontology or thesaurus and apply it to each item to give subject headings. [Diagram: Ontology / Thesaurus + Item → Computer Program for Automated Indexing → Subject Headings]
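
The division of labor on this slide can be illustrated with a toy sketch. This is not how UIMA, GATE, or KEA work internally; it is plain whole-word term matching in Python, and the thesaurus, the sample item, and the function name `assign_headings` are all made up for illustration. The person supplies the thesaurus and reviews the output; the program applies the thesaurus to each item.

```python
import re


def assign_headings(text: str, thesaurus: dict[str, list[str]]) -> list[str]:
    """Return each subject heading whose trigger terms appear in the text."""
    # Break the item into lowercase words so matching is case-insensitive.
    words = set(re.findall(r"[a-z]+", text.lower()))
    headings = []
    for heading, terms in thesaurus.items():
        # A heading is assigned when any of its terms appears as a whole word.
        if any(term.lower() in words for term in terms):
            headings.append(heading)
    return sorted(headings)


# Hypothetical thesaurus: subject heading -> single-word terms that signal it.
THESAURUS = {
    "Education": ["school", "teacher", "student"],
    "Public health": ["vaccine", "hospital"],
}

item = "Rules for teacher certification in each school district."
print(assign_headings(item, THESAURUS))  # prints ['Education']
```

The sketch only handles single-word terms and has no notion of context, which is exactly why the human review step on the slide still matters.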

  18. http://www.nzdl.org/Kea/examples1.html

  19. The lower end: Deterministic fields

  20. There’s an app for that • Scripts for extracting fields from a thesis posted on GitHub: https://github.com/ao5357/thesisbot

  21. Batch OCR

  22. Many tools exist to extract text from PDFs into Excel

  23. Walkthrough – examining the extracted spreadsheets • http://fsulawrc.com/excelVBAfiles/index.html

  24. How to plan the program • Look for patterns • Write step-by-step instructions about how to process the Excel file • Remember: NO DISCRETION. Computers do not take well to discretion. • Good steps: • Go to the last line of the worksheet • Look for the letter a or A • Copy starting from the first number in the cell, up to and including the last number in the cell. • Bad steps: • Find the author’s name (this step needs to be broken into small “stupid” steps)
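
The three "good steps" on this slide can be sketched in code. This is a minimal Python sketch, not the Visual Basic script used in the project. It assumes one worksheet column has already been read into a list of strings; the function name `extract_number_span` and the sample cell text are hypothetical.

```python
import re
from typing import Optional


def extract_number_span(cells: list[str]) -> Optional[str]:
    """Apply the three 'good steps' to one worksheet column.

    `cells` is a hypothetical stand-in for a column of extracted text,
    top row first. Returns None whenever a step cannot be carried out.
    """
    # Step 1: go to the last line of the worksheet (last non-empty cell).
    non_empty = [c for c in cells if c.strip()]
    if not non_empty:
        return None
    last = non_empty[-1]

    # Step 2: look for the letter a or A.
    if "a" not in last.lower():
        return None

    # Step 3: copy from the first digit up to and including the last digit.
    match = re.search(r"\d(?:.*\d)?", last)
    return match.group(0) if match else None


# Made-up column: a title row, a blank row, then the row of interest.
print(extract_number_span(["Title page", "", "Rule 6A-1.001 General"]))  # prints 6A-1.001
```

Each step is a yes/no test or a mechanical copy, so the program never has to exercise discretion. When a step fails, the function returns `None` rather than guessing, which leaves a visible gap for a person to fill in later.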

  25. Writing the program • Identify appropriate advisors. • Remember, most IT staff on a campus just install computers in offices, etc. Programming and database planning are rare skills. The worst IT personnel will not realize that they do not have these skills. • If an IT staff member tells you they do not know how to do something, then go back to that person for advice on all future projects. • Try to find entry-level material on coding. • (Sadly, most computer programming instructions already assume you know some programming.) • If outsourcing or collaborating, remember that the index is the ultimate goal. Understanding of the index needs to be in the picture. You will probably have to bring it in.

  26. Finding Advisors: Most campus IT is about carrying heavy objects

  27. Finding Advisors: Most campus IT is about carrying heavy objects

  28. Perfection? • How close to perfection can you get? • Let’s run some code: • A spreadsheet with extracted text: http://fsulawrc.com/excelVBAfiles/23batch6A.xls • Visual Basic script: http://fsulawrc.com/excelVBAfiles/VBAscriptForFAC.docx • The files: You can retrieve some of these same files by searching 6A-1 in the main search for the database at www.fsulawrc.com

  29. How much metadata was missing?
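
One way to put a number on this question is to count blank fields per column in the automated index. The following is a minimal Python sketch, assuming the index has been exported as a list of records; the field names and sample records are hypothetical and are not data from the project.

```python
def missing_rates(records: list[dict[str, str]], fields: list[str]) -> dict[str, float]:
    """Return, for each field, the fraction of records where it is blank or absent."""
    if not records:
        return {field: 0.0 for field in fields}
    rates = {}
    for field in fields:
        # Treat a missing key, None, or whitespace-only text as missing metadata.
        blank = sum(1 for r in records if not (r.get(field) or "").strip())
        rates[field] = blank / len(records)
    return rates


# Made-up records standing in for rows of automatically extracted metadata.
sample = [
    {"rule_number": "6A-1.001", "date": ""},
    {"rule_number": "", "date": ""},
    {"rule_number": "6A-1.002", "date": "1975"},
    {"rule_number": "6A-1.003"},
]
print(missing_rates(sample, ["rule_number", "date"]))
# prints {'rule_number': 0.25, 'date': 0.75}
```

A per-field rate like this shows which extraction rules fail most often, which helps decide where human cleanup time is best spent.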

  30. Cheap and fast and incomplete • This is a search engine built on an index of the automated metadata only: • http://fsulawrc.com/automatedindex.php • It’s better than a shuffled pile of 30,000 pages. • It’s not very good. • If you are thousands of miles away, then this is better than print. If you are in the same room as organized print, print might be better.
