I. Introduction
• The state of the net today
II. What is metadata and why do we need it?
• An explanation
• Benefits of metadata
III. What different metadata schemes are available?
• Dublin Core
• Warwick Framework

Introduction
• How many are online?
  World total: 580.78 million
  Africa: 6.31 million
  Asia/Pacific: 167.86 million
  Europe: 185.83 million
  Middle East: 5.12 million
  Canada & USA: 182.67 million
  South America: 32.99 million
http://www.nua.com/surveys/how_many_online/index.html

• There were 162,128,493 hosts on the net
  Internet Software Consortium (July 2002) http://www.isc.org/ds/WWW-200207/index.html
• “The publicly indexable World Wide Web now contains about 800 million pages, encompassing about 6 terabytes of text data”
• “Our results show that search engines are increasingly falling behind in their efforts to index the web”
  Lawrence and Giles (1999). Accessibility of information on the web. Nature 400 (July 8). 107, 108

Coverage
• Search engine coverage of the publicly indexable web has decreased substantially, “with no engine indexing more than about 16% of web pages”
Unequal access
• Search engines are typically more likely to index sites that have more links to them (more ‘popular’ sites)
• They are also more likely to index US sites than non-US sites and .com rather than .edu sites
Out of date
• Indexing of new or modified pages by just one of the major search engines can take months
http://www.wwwmetrics.com/

http://www.thestandard.com/media/030600met5_theweb2.ppt

Information distribution
• 83% of sites contain commercial content and 6% contain scientific or educational content
• 1.5% of sites contain pornographic content
Low metadata use
• The simple HTML “keywords” and “description” metatags are only used on the homepages of 34% of sites
• Only 0.3% of sites use the Dublin Core metadata standard
http://www.wwwmetrics.com/

Some observations
• The net was never intended to be a tool for information organization and retrieval
• Network resources are proliferating rapidly, so some organization and method of access (beyond browsing) is needed
• These resources are increasing at an accelerating rate (we are helping)
• Material on the net is quirky, transient, and chaotically archived
• Because of the decentralized nature of the net, it is clear that an imposed scheme is unworkable

• So what is metadata and why do we need it?
The internet is full. Go away!
Metadata may be one way for us to find what we need, when we need it, and in the form we want
“The concept of metadata predates the Web, having … been coined ... in the 1960s to describe datasets effectively. Metadata is data about data, and ... provides basic information such as the author of a work, the date of creation, links to any related works, etc.”
Miller, P. (1996). Metadata for the Masses. Ariadne. http://www.ariadne.ac.uk/issue5/metadata-masses/

A metadata schema will usually have the following characteristics:
• A limited number of elements
• The name of each element
• The meaning of each element
The semantics of the scheme will be descriptive of
• The contents, location, physical attributes, type (e.g. text or image, map or model) and form (e.g. print copy, electronic file)
Key elements support access to published documents
• The originator of a work, its title, date and location of publication and the subject areas

• When we search, we find that irrelevant hits far outnumber relevant ones on a typical search engine results page
• What good is a search that returns 47,000 documents for the phrase “dublin core”?
“Metadata is information that describes other information sources. [It is] a potential remedy to the problem of finding relevant information on the Internet”
Thomas, C.T. and Griffin, L.S. (1999). Who will create metadata for the Internet. First Monday. 3(12). http://www.firstmonday.dk/issues/issue3_12/thomas/index.html

Two important measures are “relevant” here
• Recall (completeness of search)
  Relevant documents retrieved / total relevant documents in the set
  Missing a lot of relevant information means poor recall
• Precision (purity of search)
  Relevant documents retrieved / total documents returned
  Getting flooded by a lot of irrelevant information means poor precision
• Recall and precision factors of 10-20% are often acceptable for most purposes
• Search engines have precision factors of less than 1%
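
The two ratios above can be made concrete with a short sketch. The function name `precision_recall` and the document ids are invented for illustration and do not come from any particular search engine.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a single search.

    returned: ids of the documents the search returned
    relevant: ids of every relevant document in the collection
    """
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical search: 10 documents returned, 2 of them relevant,
# out of 8 relevant documents in the whole collection.
returned = range(1, 11)  # ids 1..10
relevant = [1, 2, 50, 51, 52, 53, 54, 55]
print(precision_recall(returned, relevant))
```

In this made-up case precision is 2/10 = 0.2 and recall is 2/8 = 0.25, so both fall near the 10-20% range described above.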

In addition, there is the interesting question of the type of metadata that is appropriate for the web
“There is an obvious requirement for metadata, [it] must be of a form suitable for interpretation both by the search engines and by human beings, and it must also be simple to create so that any web page author may easily describe the contents of their page and make it immediately both more accessible and more useful”
Miller, P. (1996). Metadata for the Masses. Ariadne. http://www.ariadne.ac.uk/issue5/metadata-masses/

• Metadata is the information necessary to identify, locate, organize, and access an electronic resource
• It describes what can be said about something and what people can do with it (rights)
• It describes datasets concisely using a standard format
  For this reason it has the unique ability to make all metadata records equal in worth
• Metadata records provide information about data in much the same way that library catalogues provide information about books
  A catalogue facilitates searching for particular topics or author(s); metadata is searchable in a comparable way

There are two levels of this problem
• Organizing an existing collection so that it is accessible over the Internet
  The American Memory Project at: http://lcweb2.loc.gov/ammem/ammemhome.html
  Berkeley Digital Sunsite collections at: http://sunsite.berkeley.edu/Admin/collection.html
• Developing schemes to organize directories of networked information and search tools
  This is being done with search engines and metadata

Who uses metadata?
Business uses of metadata
• External: advertising and search engine placement
• Internal: management of internal digital documents
Academic uses of metadata
• To provide a scheme for organizing digital information
• For extending access to these materials

Types of metadata
Descriptive (access)
• Description: captions, keywords or categories
• Access points, location, identifier
• Relationship to other objects
• File type, size or creation date
Administrative
• Management information
• Provenance: authentication, document conversion info
• Rights, terms and conditions
Structural
• Putting the object together from its logical components

Benefits of metadata
For the producer, it can
• Provide relevant details about the resource
• Provide information which is not in the resource (e.g. descriptive text for images or executable files)
• Highlight the most important aspects of the resource
For the indexing service
• No need to guess about resource content
• Highly structured data to index
• Less bandwidth, more efficient, easier to maintain

For the user
• More precise results via retrieval on surrogate content
• Field-based searching
• Access to non-textual resources
• Less information overload
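
Field-based searching on surrogate records can be sketched as follows. The three records, their titles and creators, and the function names are all invented for illustration; the point is only that restricting a query to one field returns fewer, more relevant matches than matching the term anywhere.

```python
# Hypothetical metadata records, invented for this example.
records = [
    {"title": "A History of Dublin", "creator": "A. Author",
     "subject": ["Ireland", "history"]},
    {"title": "Core Metadata Elements", "creator": "Dublin, J.",
     "subject": ["metadata"]},
    {"title": "Cataloguing the Web", "creator": "B. Writer",
     "subject": ["metadata", "Dublin Core"]},
]


def search_field(records, field, term):
    """Return the records whose given field contains the term (case-insensitive)."""
    term = term.lower()
    matches = []
    for record in records:
        value = record.get(field, "")
        # A field may hold one string or a list of strings (a repeated element).
        values = value if isinstance(value, list) else [value]
        if any(term in v.lower() for v in values):
            matches.append(record)
    return matches


def search_anywhere(records, term):
    """Return the records where the term appears in any field (full-text style)."""
    return [r for r in records if any(search_field([r], f, term) for f in r)]


print(len(search_anywhere(records, "dublin")))       # matches in any field
print(len(search_field(records, "subject", "dublin")))  # matches in subject only
```

Here the unrestricted search matches all three records, because "Dublin" also turns up in a title and in a creator name, while the subject-only search matches just the one record that is actually about the Dublin Core.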

Metadata can support many potential applications:
• Content ratings
• E-commerce
• Authentication
• Data management
• Intellectual property rights management
• Digital preservation
• Searching, location
• Resource management
• Quality/rating
• Semantic interoperability
• Resource discovery

There are two levels at which the problem can be attacked
• Classifying and organizing a core collection of digital materials
  The questions: what to collect, how to organize it, how to maintain it, and how to provide access to it
• Creating directories, search tools, metadata schemes and other means of access to digital materials outside the core collection
  The questions: what to include, why, maintenance, and the provision of access
These questions are becoming increasingly important in the design of digital libraries

III. What different metadata schemes are available?
The “Dublin Core Metadata Program”
• What are the necessary elements that should be used to describe networked information?
• This was discussed at an OCLC workshop in 1995
Goals
• Fostering a common understanding of the needs, strengths, shortcomings, and solutions of stakeholders
• Reaching consensus on a core set of metadata elements to describe networked resources
http://www.oclc.org:5047/oclc/research/conferences/metadata/dublin_core_report.html

A small set of metadata elements would be valuable
• Authors and publishers would provide metadata in a form that automated resource discovery tools could collect
• Network publishing tools would be created containing a template for metadata elements, simplifying the task of creating metadata records
• This type of record could serve as the basis for a more detailed cataloging record if the need arises
• If something like the Dublin Core becomes a standard, metadata records could be understood across user communities

Defined Universal Bibliographic Language for INternet and Coherent Online REsource
• It is a minimal information resource description set for information organization and resource discovery on the web
• It will improve searching with simple resource description semantics
• There is a consensus around a core element set that is:
  Simple and intuitive
  Cross-disciplinary
  International
  Flexible

The Dublin Core metadata element set supports resource discovery because it is:
• Easy for authors and content managers to create and maintain
• Interoperable, extensible, and platform independent
• Syntax-independent
• Intended for, but not limited to, network resources
• Intended to be embedded, but needn’t be
• Not intended to meet the complete metadata needs of any given community

These are the elements in the Dublin Core
• Title: The name of the object
• Creator: The person(s) primarily responsible for the intellectual content of the object
• Subject/keywords: The topic addressed by the work
  Typically expressed as keywords, key phrases or classification codes that describe a topic of the resource
• Description: An account of the content of the resource
  May include an abstract, table of contents, reference to a graphical representation of content or a free-text account of the content

• Publisher: The agent or agency responsible for making the object available
• Date: A date associated with an event in the life cycle of the resource
  YYYY-MM-DD
• Resource Type: The genre of the object, such as novel, poem, or dictionary
• Format: The data representation of the object, or the physical or digital manifestation of the resource
  Typically the media-type or dimensions of the resource
  May be used to determine the software, hardware or other equipment needed to display or operate it

• Resource Identifier: An unambiguous reference to the resource within a given context, using a string or number conforming to a formal identification system
  URI, URL, DOI, ISBN
• Relation: Relationship to other objects
• Source: Objects, either print or electronic, from which this object is derived, if applicable
• Coverage: The extent or scope of the content of the resource
  Will include spatial location (geographic coordinates or a place name), temporal period (a period label, date, or date range) or jurisdiction (a named administrative entity)

• OtherAgent/contributor: The person(s), such as editors and transcribers, who have made other significant intellectual contributions to the work
• Language: Language of the intellectual content
• Rights Management: Information about rights held in and over the resource
Using the Dublin Core:
dc.title=The Book of Me
dc.creator=Me
dc.subject=My life
dc.subject=All about me
dc.publisher=The Press of Me
dc.contributor=Only me

Here’s what it might look like embedded in an HTML document:
<HTML>
<HEAD>
<TITLE>The Home Page of Me</TITLE>
<META NAME="package" CONTENT="(TYPE=begin) Dublin Core">
<META NAME="DC.title" CONTENT="The story of me">
<LINK REL=SCHEMA.dc HREF="http://purl.org/dublin_core_elements#title">
<META NAME="DC.subject" CONTENT="biography, fascinating person, me">
<LINK REL=SCHEMA.dc HREF="http://purl.org/dublin_core_elements#subject">
<META NAME="DC.description" CONTENT="A hard hitting biography">
<LINK REL=SCHEMA.dc HREF="http://purl.org/dublin_core_elements#description">
<META NAME="DC.creator" CONTENT="Howard Rosenbaum">
<LINK REL=SCHEMA.dc HREF="http://purl.org/dublin_core_elements#creator">
</HEAD>
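
A publishing tool could generate META tags like these from a simple record. The sketch below is one possible way to do it, not a standard tool: the function name `to_meta_tags` is hypothetical, the record reuses the values from the HTML example, and the accompanying LINK tags are left out for brevity.

```python
from html import escape

# Element/value pairs taken from the HTML example. A list of pairs
# (rather than a dict) allows an element to be repeated.
record = [
    ("title", "The story of me"),
    ("subject", "biography, fascinating person, me"),
    ("description", "A hard hitting biography"),
    ("creator", "Howard Rosenbaum"),
]


def to_meta_tags(record, prefix="DC"):
    """Render (element, value) pairs as HTML META tags, one per line."""
    lines = []
    for element, value in record:
        # escape(..., quote=True) also escapes double quotes, so a value
        # cannot break out of the CONTENT attribute.
        content = escape(value, quote=True)
        lines.append('<META NAME="{}.{}" CONTENT="{}">'.format(prefix, element, content))
    return "\n".join(lines)


print(to_meta_tags(record))
```

Escaping the values matters in practice: an unbalanced or mistyped quotation mark inside a CONTENT attribute is enough to make the tag unreadable to an indexing tool.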

The Dublin Core has advantages:
• It is usable and flexible
• Its elements are designed to be clear enough to be understood without the need for training
• Elements such as intellectual content and physical format are easily identified with the work in hand
• It is not intended to supplant other resource descriptions, but rather to complement them
• It describes essential features of electronic documents that support resource discovery

Further advantages
• It is mostly syntax independent, to support its use in the widest range of applications
• All elements are optional, and each site can define elements as mandatory or optional
• All elements are repeatable
• The elements may be modified in limited and well-defined ways through the use of specific qualifiers, such as the name of the thesaurus used in the subject element
• It can be extended to meet the demands of more specialized communities
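
The rules that every element is optional, every element is repeatable, and each site may declare its own mandatory elements can be sketched as a small record check. The function `check_record` is hypothetical. The short element names follow the dc.* labels used in the sample record in these slides; names not shown there (such as `identifier` or `rights`) are assumed shortenings of the element names listed earlier.

```python
# Short names for the fifteen elements listed in these slides (partly assumed).
DC_ELEMENTS = {
    "title", "creator", "subject", "description", "publisher",
    "date", "type", "format", "identifier", "relation",
    "source", "coverage", "contributor", "language", "rights",
}


def check_record(record, mandatory=()):
    """Return a list of problems found in a record of (element, value) pairs.

    Repeated elements are allowed. Nothing is required unless the site
    lists it in `mandatory`.
    """
    problems = []
    present = set()
    for element, _value in record:
        if element not in DC_ELEMENTS:
            problems.append("unknown element: " + element)
        present.add(element)
    for element in mandatory:
        if element not in present:
            problems.append("missing mandatory element: " + element)
    return problems


# The sample record from the slides; "subject" appears twice.
record = [
    ("title", "The Book of Me"),
    ("creator", "Me"),
    ("subject", "My life"),
    ("subject", "All about me"),
    ("publisher", "The Press of Me"),
    ("contributor", "Only me"),
]

print(check_record(record))                               # no site rules
print(check_record(record, mandatory=("title", "date")))  # a site that requires a date
```

With no site rules the sample record passes, repeated subject and all. A site that makes `date` mandatory would flag the same record as incomplete.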

The Warwick Framework
• At the Warwick Workshop, researchers developed a “container architecture” known as the Warwick Framework
• The goal was to create an architecture that associates diverse types of metadata with a resource
• It is a mechanism for logically and physically aggregating distinct “packages” of metadata
The Framework is an advance because:
• It allows the designers of individual metadata sets to focus on specific requirements without concerns for generalization

• The syntax of each metadata set can vary in conformance with semantic requirements, community practices, and functional processing requirements
• The management of and responsibility for specific metadata sets is left to respective “communities of expertise”
• It promotes interoperability by allowing tools and agents to selectively access and manipulate individual packages and ignore others
• It permits access to different metadata sets that are related to the same object to be separately controlled
• It flexibly accommodates future metadata sets by not requiring changes to existing sets or the programs that make use of them
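
The container idea can be illustrated with a very small sketch. This is not the Warwick Framework specification: the class names `Container` and `Package`, the package kinds, and the example URL are all invented here, and per-package access control is not modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Package:
    kind: str        # name of the metadata set, e.g. "dublin-core"
    content: object  # opaque to the container; each set keeps its own syntax


@dataclass
class Container:
    resource: str    # the resource that the packages describe
    packages: list = field(default_factory=list)

    def add(self, package):
        """Attach another package without touching the existing ones."""
        self.packages.append(package)

    def select(self, kind):
        """Return only the packages of one kind; all others are ignored."""
        return [p for p in self.packages if p.kind == kind]


# Hypothetical resource with two unrelated metadata sets attached.
container = Container("http://example.org/me.html")
container.add(Package("dublin-core", {"title": "The story of me"}))
container.add(Package("terms-and-conditions", "Free for educational use"))

# A discovery tool reads the package it understands and skips the rest.
for package in container.select("dublin-core"):
    print(package.content["title"])
```

Adding a new kind of package later requires no change to `select` or to tools that only read the Dublin Core package, which is the flexibility the last bullet describes.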

