Automatic Metadata Generation <a look at some of the state-of-the-state> Jane Greenberg April 22, 2009
Overview • Automatic metadata generation • Definitions and approaches • Metadata classes and functions • Research and development • Brief comments on relationship to Dryad • Dryad’s priorities / discussion
Definitions • Metadata generation • The act of creating or producing metadata. Metadata can be generated via different processes, applications, and classes of people • Processes: human/manual, automatic, or a combination • Applications: LMS, CMS, metadata and content creation apps • Classes of people: professionals, technical staff, content creators, the public • Automatic metadata generation • Generating metadata with the aid of machine processing (Greenberg, J.; Severiens, T. (2007). DCMI Tools Glossary: http://www.dublincore.org/groups/tools/glossary.shtml)
Refined definitions and approaches (DCMI Tools Glossary) • Derived Metadata • Algorithms or pre-programmed profiles for obtaining property values. • Examples: document size or date modified; default values such as “rights access” or “creator” information. • Metadata Extraction • Automatic analysis of resource content. • Automatic indexing, IR techniques, data mining (to derive meaning), and manipulation of semi-structured text. • Metadata Harvesting • Automatically gathering metadata from existing sources (resource header, metadata registry), regardless of whether it was originally generated via automatic or manual means. • Extraction of semi-structured metadata from document content has an element of harvesting
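To make the distinction between derived and harvested metadata concrete, here is a minimal Python sketch. It is an illustration only, not part of the DCMI Tools Glossary: the function names, the field names (`extent_bytes`, `date_modified`), and the defaults profile are all assumptions chosen for the example. Derivation reads values from the file system and a pre-programmed profile; harvesting collects `<meta>` tags that already exist in a resource header.

```python
import os
from datetime import datetime, timezone
from html.parser import HTMLParser


def derive_metadata(path, defaults=None):
    """Derived metadata: values computed from the file plus profile defaults.

    Field names are hypothetical, chosen only for this sketch.
    """
    info = os.stat(path)
    record = {
        "extent_bytes": info.st_size,
        "date_modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }
    # Pre-programmed profile values (e.g. rights, creator) fill in the rest.
    record.update(defaults or {})
    return record


class MetaTagHarvester(HTMLParser):
    """Harvested metadata: collect existing <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.harvested = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name, content = attributes.get("name"), attributes.get("content")
        # Skip tags such as <meta charset=...> that carry no name/content pair.
        if name and content:
            self.harvested[name] = content


def harvest_metadata(html_text):
    """Return the name/content pairs found in the HTML header."""
    harvester = MetaTagHarvester()
    harvester.feed(html_text)
    return harvester.harvested
```

Harvesting here does not care whether a person or a program wrote the original tags, which matches the glossary's point that harvesting is independent of how the metadata was first generated.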
Refined definitions and approaches (Metadata Generation for Resource Discovery/JISC report, 2007) • “Automated metadata generation is still in its infancy but several approaches have emerged” (p.3) • Metatag harvesting • Content extraction • Automatic indexing or classification • Text and data mining • Social tagging
Metadata Types, Functions and Properties *A property can be multi-functional. (Greenberg, in press, 2009; Greenberg, 2005)
Overview • Automatic metadata generation • Definitions and approaches • Metadata classes and functions • Research and development • Dryad’s priorities • Concluding remarks
Context • Difficult to define this as one area of study because of overlap with many endeavors • Automatic indexing, natural language processing, data mining, even data warehousing and related efforts • Data integration and semantic integration are also relevant: a BIG… area • Mainly focused on documents (not data) • What is a document? (Buckland, JASIST, 1997) • Metadata records are documents
Research and Development Areas • Automatic indexing—document structure • Workflow and lifecycle • Image metadata • Software design • Maintenance
Automatic indexing—document structure • Experimental work focusing on document structure, using a Support Vector Machine (SVM) algorithm (e.g., Han et al., 2003) and a Dual and Variable-length output Hidden Markov Model (DVHMM) (Takasu, 2003), has been fairly successful for metadata generation. [Placement of title and author information, but small samples/experimental/domain specific.] • Similar work: semi-structured metadata extraction (Giuffrida, et al, 2000; Hu, et al, 2006; Mimno, et al, 2005) • Researchers have identified relationships between document genre, content, and structure (Toms, Campbell & Blades, 1999). • Document genre can inform textual density, which might “be used to predict metadata extraction algorithm performance for certain types of documents” (Greenberg, 2004). • Less is more / content richness
Example: Comstock, J.P.; McCouch, S.R.; Martin, B.C.; Tauer, C. J.; Vision, T. J.; Xu; and Pausch, R. C. (2005). The effects of resource availability and environmental conditions on genetic rankings for carbon isotope discrimination during growth in tomato and rice. Functional Plant Biology 32(12) 1089–1105.
Comstock, J.P.; McCouch, S.R.; Martin, B.C.; Tauer, C. J.; Vision, T. J.; Xu; and Pausch, R. C. (2005). The effects of resource availability and environmental conditions on genetic rankings for carbon isotope discrimination during growth in tomato and rice. Functional Plant Biology 32(12) 1089–1105. • ABSTRACT: Carbon isotope discrimination (Δ) is frequently used as an index of leaf intercellular CO2 concentration (ci) and variation in photosynthetic water use efficiency. In this study, the stability of Δ was evaluated in greenhouse-grown tomato and rice with respect to variable growth conditions including temperature, nutrient availability, soil flooding (in rice), irradiance, and root constriction in small soil volumes. Δ exhibited several characteristics indicative of contrasting set-point behaviour among genotypes of both crops. These included generally small main environmental effects and lower observed levels of genotype-by-environment interaction across the diverse treatments than observed in associated measures of relative growth rate, photosynthetic rate, biomass allocation pattern, or specific leaf area. Growth irradiance stood out among environmental parameters tested as having consistently large main effects on Δ for all genotypes screened in both crops. We suggest that this may be related to contrasting mechanisms of stomatal aperture modulation associated with the different environmental variables. For temperature and nutrient availability, feedback processes directly linked to ci and / or metabolite pools associated with ci may have played the primary role in coordinating stomatal conductance and photosynthetic capacity. In contrast, light has a direct effect on stomatal aperture in addition to feedback mediated through ci.
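As a rough illustration of what "metadata extraction" from an abstract like the one above could look like, the following Python sketch ranks candidate subject terms by raw frequency. This is a deliberately naive stand-in for automatic indexing; it is not the SVM or DVHMM approach cited earlier, and the stopword list, `min_length` threshold, and function name are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; a real indexer would use a fuller one.
STOPWORDS = {
    "with", "that", "this", "these", "from", "have", "been", "were",
    "than", "both", "into", "among", "also", "which", "their", "related",
}


def extract_keywords(text, top_n=5, min_length=4):
    """Return the top_n most frequent candidate terms in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        token
        for token in tokens
        if len(token) >= min_length and token not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(top_n)]
```

Frequency alone says nothing about meaning, so a term list produced this way would normally be a suggestion for a human or a controlled vocabulary lookup rather than final subject metadata.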
Workflow and automatic metadata generation • Kim, J.; Gil, Y.; and Ratnakar, V. (2006). Semantic Metadata Generation for Large Scientific Workflows. In Proceedings of the 5th International Semantic Web Conference, ISWC-2006, Athens, GA, USA. • The system is for large executable workflows in an earthquake science application • Semantic metadata generation and reasoning approach that supports creation of large workflows. The system… • Propagates metadata constraints for datasets (inputs/outputs for a dataset) • Describes datasets that are used or created by the workflow • Detects equivalent datasets and prevents unnecessary execution [instantiation!] • Manages large datasets and their provenance. [ditto]
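The idea of detecting equivalent datasets to avoid unnecessary execution can be sketched in a few lines of Python. Note the substitution: Kim et al. describe a semantic reasoning approach, whereas this sketch simply hashes a canonical form of the dataset description. The class name, method names, and description keys are hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(description):
    """Hash a canonical JSON form so key order does not affect equality."""
    canonical = json.dumps(description, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DatasetCatalog:
    """Remembers datasets by the fingerprint of their metadata description."""

    def __init__(self):
        self._by_fingerprint = {}

    def get_or_create(self, description, produce):
        """Return (dataset, created); run produce() only for unseen descriptions."""
        key = dataset_fingerprint(description)
        if key in self._by_fingerprint:
            # An equivalent dataset already exists: skip the workflow step.
            return self._by_fingerprint[key], False
        dataset = produce()
        self._by_fingerprint[key] = dataset
        return dataset, True
```

Exact hashing only catches descriptions that are identical after normalization; recognizing datasets that are equivalent but described differently is where the reasoning in the cited system would come in.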
Workflow and automatic metadata generation • Kepler scientific workflow system (http://en.wikipedia.org/wiki/Kepler_scientific_workflow_system) • Metadata recommendation: EML, part of the Science Environment for Ecological Knowledge (SEEK) project and used in KNB • Not clear to me where/how metadata is captured “during” the workflow • Dryad: difficulties because of the many different workflows, but something to think about • Plale, B. et al (2008). Towards Quantification of Limits in Automated Curation of e-Science Data. http://www.cs.indiana.edu/~plale/papers/Plale-e-Science08.pdf • Plale, B. et al (2008). Towards Quantifying Limits of Automated Curation of Geospatial Data. TechReport: http://www.cs.indiana.edu/pub/techreports/TR672.pdf • Automated metadata generation with e-Science workflows that execute dynamically adaptive regional weather forecasts and analysis on demand
Lifecycle w/in workflow, or?? • Moore, et al. (2002). Data Grid Implementations. http://www.ppdg.net/docs/WhitePapers/Capabilities-grids.v6.pdf • Defines functionalities w/in SRB, applicable to discussion of digital repositories/libraries, etc. • Includes a very large table of systems and functionalities, e.g., automated attribute generation for size and time stamp • Renear, et al. (2008). Collection/Item Metadata Relationships. Proceedings of the International Conference on Dublin Core and Metadata Applications: https://www.ideals.uiuc.edu/handle/2142/9144 • Explores modal notions and first-order logic formulations focusing on attribute/value propagation, value-propagation, and value-constraints • Rodriguez, M. A.; Bollen, J.; & Van de Sompel, H. (2008). Automatic Metadata Generation Using Associative Networks. ACM Transactions on Information Systems 27, no. 2: http://arxiv.org/abs/0807.0023 • Propagates metadata from metadata-rich to metadata-poor resources, the “Robin Hood principle”
Renear, et al (2008) • Questions being asked: • When does propagation convert information without loss? • What about propagation from items to collections? • How expressive a logic is needed for propagation rules? – how much of first order logic? – what extensions to first order logic? (modal, default, …?) – what are the consequences for computational efficiency? • See slides at: https://www.ideals.uiuc.edu/bitstream/handle/2142/9144/cimrDCMI08_Final.ppt.pdf?sequence=4
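A simple form of attribute/value propagation from a collection to its items can be sketched in Python as follows. This is only an illustration of the general idea, not the logic formalized by Renear et al.; the function name and the choice of which attributes propagate are assumptions, and deciding that list correctly is exactly the open question raised above.

```python
def propagate_to_items(collection, items, propagating_attributes):
    """Copy selected collection-level values onto items that lack them.

    Item-level values always win; the inputs are left unmodified.
    """
    enriched = []
    for item in items:
        record = dict(item)  # copy so the original item is not mutated
        for attribute in propagating_attributes:
            if attribute in collection and attribute not in record:
                record[attribute] = collection[attribute]
        enriched.append(record)
    return enriched
```

For example, a collection-level `rights` statement might reasonably be inherited by every item, while a collection `title` would not, which is why the caller must name the propagating attributes explicitly.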
Image metadata • GIS research (mainly technical capture) • NISO standards • Content vs. concept • Resolution/density/pixels/color vs. captions and vocabularies • Image & art tagging/collaborative work • Steve project (Trant, 2008) • OCR work w/specimens • Herbis project: http://www.herbis.org/
Software • Most cohesive advances are document centric and in the educational community • Creating, re-use, metadata inheritance • MetaTools Final Report (JISC) (http://www.jisc.ac.uk/media/documents/programmes/reppres/metatoolsfinalreport.pdf) • Data Fountains/iVia, DC-dot, SamgI, and the Yahoo! Term Extractor • Notes: JHOVE, DROID, the NLNZ Metadata Extraction Tool, and the Metadata Analysis Tool (MAT) • iRODS rules
Visualisation • MINDS: ≈ 6,000 records • 6,000-row HTML table: browser stress
AMeGA report, Section 7. Metadata Evaluation • Use a range of criteria to determine quality, give a confidence rating • How much metadata was harvested? • Was a digital signature associated with the metadata, and if so, was it registered as a trusted source? • How much metadata was extracted? • What extraction algorithm was used? • How well did the automatically generated metadata match content standards used to assign metadata values?
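One way to turn criteria like these into a single confidence rating is a weighted score. The Python sketch below is purely illustrative: the AMeGA report does not prescribe this formula, and the weights, field names, and the 0 to 1 conformance measure are all assumptions. The "what extraction algorithm was used" criterion is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class GenerationEvidence:
    harvested_fields: int            # fields filled by harvesting
    extracted_fields: int            # fields filled by extraction
    total_fields: int                # fields in the record
    signed_by_trusted_source: bool   # digital signature registered as trusted
    conformance: float               # 0..1 match against content standards


# Hypothetical weights that sum to 1.0; tune them for a real repository.
DEFAULT_WEIGHTS = {
    "harvested": 0.3,
    "extracted": 0.2,
    "signature": 0.2,
    "conformance": 0.3,
}


def confidence_rating(evidence, weights=DEFAULT_WEIGHTS):
    """Combine the evidence into a score between 0 and 1."""
    total = evidence.total_fields
    harvested_share = evidence.harvested_fields / total if total else 0.0
    extracted_share = evidence.extracted_fields / total if total else 0.0
    score = (
        weights["harvested"] * harvested_share
        + weights["extracted"] * extracted_share
        + weights["signature"] * float(evidence.signed_by_trusted_source)
        + weights["conformance"] * evidence.conformance
    )
    return round(score, 3)
```

A score like this is mainly useful for triage, for instance routing low-rated records to a human reviewer, rather than as an absolute measure of quality.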
AMM-GO • Jie Jin. (2008). NC Health Info and Go Local: An Analysis of Web Change Impacts on Metadata Quality and A Proposed Framework for Semi-Automatic Metadata Maintenance. A Master’s Paper for the M.S. in I.S. degree (UNC/SILS): http://etd.ils.unc.edu/dspace/bitstream/1901/510/1/JJ_MasterPaper2008.4.pdf. • Uses mathematical set theory for updating metadata records when a resource changes (see images p. 21 and 22). • Wei-Hsin Su. (2008). A Visual Enhancement for Metadata Generation Tools: A Semi-Automatic Approach via KWIC and Highlighting. A Master’s Paper for the M.S. in I.S. degree (UNC/SILS): http://etd.ils.unc.edu/dspace/bitstream/1901/515/1/weihsinsu.pdf.pdf. • Figure 13, p. 22
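The set-theory idea for metadata maintenance can be illustrated with plain Python sets. This is a generic sketch of comparing recorded terms against terms found in the changed resource, not a reproduction of the framework in Jin's paper; the function name and the keep/add/remove labels are assumptions.

```python
def plan_metadata_update(recorded_terms, current_terms):
    """Compare recorded metadata terms with terms found in the changed resource."""
    recorded, current = set(recorded_terms), set(current_terms)
    return {
        "keep": recorded & current,    # intersection: still supported by the resource
        "add": current - recorded,     # difference: new in the resource
        "remove": recorded - current,  # difference: no longer supported
    }
```

In a semi-automatic setting, the "add" and "remove" sets would be shown to a person for confirmation rather than applied directly to the record.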
Dryad’s priorities and discussion • Good metadata, intelligible • Easy to produce • How rich? • stages • What degree of machine processability do we want?