Languages are bridges … not barriers. ReferNet Technical Meeting 24-25 September 2009. Chiara Carlucci – CEDEFOP Library.
Languages are bridges … not barriers ReferNet Technical Meeting 24-25 September 2009 Chiara Carlucci – CEDEFOP Library
Languages are bridges … not barriers What it is … Why to use it … How to use it … What else ..
What Is there any place left for thesauri in this new information retrieval environment?
What For sure, there is a place for thesauri, but they must change in order to continue to be of value. A true thesaurus has equivalence relationships, but it also supports other kinds of relationship and provides navigation assistance by means of scope notes and other aids.
What A thesaurus suggests other ways of expressing an idea which is already in the user's mind, and reminds the user of related ideas that might be valuable in searching.
What It is useful to recall some classic principles of indexing: because documents are changing rapidly, because the habit of always doing the same things leads to repetitive, unreflective behaviour, and because the thesaurus is to be used as a thesaurus!
What It must be remembered that, though a thesaurus appears to be made up of natural-language terms, it is an artificial language: a controlled vocabulary with a limited number of descriptors, the meaning of each being understood through the context provided by the descriptors as a whole. In a bibliographical context (such as VET-Bib), the information provided by the whole system of descriptors is also supported by: • the title of the document • the abstract of the document
What A thesaurus is not: • a dictionary, which contains definitions and pronunciations. Unlike a dictionary, a thesaurus entry does not define words. • a glossary, which contains explanations of concepts relevant to a certain field of study or action. • a lexicon, because the lexicon of a language is its vocabulary, including its words and expressions. • a vocabulary, which is the set of words a person is familiar with in a language. A vocabulary usually grows and evolves with age, and serves as a useful and fundamental tool for communication and acquiring knowledge.
What The thesaurus is a thesaurus
What The thesaurus is a thesaurus, with its own hierarchical relationships, which are used to indicate terms that are narrower and broader in scope. A "Broader Term" (BT) is a more general term, e.g. “Apparatus” is a generalization of “Computers”. Reciprocally, a "Narrower Term" (NT) is a more specific term, e.g. “Digital Computer” is a specialization of “Computer”. BT and NT are reciprocals: a broader term necessarily implies at least one other term which is narrower. BT and NT are used to indicate class relationships as well as part-whole relationships.
What The thesaurus is a thesaurus, with its own equivalence relationships, which are used primarily to connect synonyms and near-synonyms. The Use (USE) and Used For (UF) indicators are used when an authorized term is to be used in place of another, unauthorized, term: the entry for the authorized term carries the indicator "UF" and, reciprocally, the entry for the unauthorized term carries the indicator "USE". Unauthorized terms are often called "entry vocabulary", "entry points", "lead-in terms", or "non-preferred terms"; they point to the authorized term (also referred to as the Preferred Term or Descriptor) that has been chosen to stand for the concept.
What The thesaurus is a thesaurus, with its own associative relationships, which are used to connect two related terms whose relationship is neither hierarchical nor equivalent. This relationship is described by the indicator "Related Term" (RT). Associative relationships should be applied with caution, since excessive use of RT will reduce specificity in searches. Consider the following: if the typical user is searching with term "A", would they also want resources tagged with term "B"? If the answer is no, then an associative relationship should not be established.
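The three kinds of relationship (BT/NT, USE/UF, RT) can be sketched as a small data structure. This is only an illustrative sketch, not how ETT is actually stored: the class and method names are assumptions, and apart from “Apparatus” and “Computers” the example terms are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus entry; all field names are illustrative."""
    label: str
    broader: set[str] = field(default_factory=set)    # BT
    narrower: set[str] = field(default_factory=set)   # NT
    related: set[str] = field(default_factory=set)    # RT
    used_for: set[str] = field(default_factory=set)   # UF (non-preferred terms)


class Thesaurus:
    def __init__(self) -> None:
        self.terms: dict[str, Term] = {}
        self.use: dict[str, str] = {}  # non-preferred term -> descriptor (USE)

    def _get(self, label: str) -> Term:
        return self.terms.setdefault(label, Term(label))

    def add_hierarchy(self, broader: str, narrower: str) -> None:
        # BT and NT are reciprocal, so both sides are recorded together.
        self._get(broader).narrower.add(narrower)
        self._get(narrower).broader.add(broader)

    def add_equivalence(self, descriptor: str, non_preferred: str) -> None:
        # UF on the descriptor, USE on the non-preferred entry term.
        self._get(descriptor).used_for.add(non_preferred)
        self.use[non_preferred] = descriptor

    def add_related(self, a: str, b: str) -> None:
        # RT is symmetric: each term lists the other.
        self._get(a).related.add(b)
        self._get(b).related.add(a)

    def preferred(self, label: str) -> str:
        """Resolve an entry term to its descriptor (unchanged if already one)."""
        return self.use.get(label, label)


t = Thesaurus()
t.add_hierarchy("Apparatus", "Computers")
t.add_hierarchy("Computers", "Digital computers")
t.add_equivalence("Computers", "Calculating machines")  # hypothetical lead-in term
t.add_related("Computers", "Data processing")           # hypothetical RT

print(t.preferred("Calculating machines"))
print(sorted(t.terms["Computers"].broader))
```

Recording both directions in a single call is one way to guarantee the reciprocity described above: a BT can never exist without its matching NT.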
Why • To translate the concept you are looking for into keywords • Multilingualism and standardisation are the main advantages of this powerful indexing tool covering the fields of VET • The thesaurus is an operational tool used to retrieve documents according to their semantic content • The thesaurus must be made available to users to help them identify their information needs • The thesaurus provides a conceptual framework for understanding reality through graphic presentations that preserve specificity • It presents the conceptual content of documents in an unambiguous way.
Why • A thesaurus is suited to the digital environment, where it shows its versatility • It is open to information interoperability, because the thesaurus context is not only an operating environment but also an organisational criterion • It can be integrated with other information retrieval tools
Why research in systems of unstructured information → web
Why ETT is used to index and represent the content of a document. It is mostly used by documentalists and librarians to identify the concepts laid down in the text and to represent them by assigning keywords from the thesaurus. This operation makes it possible to extract the relevant records from a collection of bibliographic references or from a full-text documentary database to answer the user’s query. End-users can combine ETT descriptors in order to represent their search query. Indexing with ETT enables all documents on the same subject to be retrieved through a single query.
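Combining descriptors in a query can be sketched with a small inverted index. This is a minimal illustration under stated assumptions: the record ids and the `search` function are hypothetical, and the descriptors are simply ones mentioned elsewhere in these slides, not real VET-Bib data.

```python
from collections import defaultdict

# Hypothetical records: record id -> descriptors assigned by the indexer.
records = {
    "rec1": {"transparency of qualifications", "certification of learning outcomes"},
    "rec2": {"transparency of qualifications", "key competences"},
    "rec3": {"key competences"},
}

# Inverted index: descriptor -> ids of the records indexed with it.
index = defaultdict(set)
for rec_id, descriptors in records.items():
    for descriptor in descriptors:
        index[descriptor].add(rec_id)


def search(*descriptors: str) -> set[str]:
    """Return the records indexed with ALL the given descriptors (AND)."""
    if not descriptors:
        return set()
    result = set(index.get(descriptors[0], set()))
    for descriptor in descriptors[1:]:
        result &= index.get(descriptor, set())
    return result


print(sorted(search("transparency of qualifications")))
print(sorted(search("transparency of qualifications", "key competences")))
```

A single descriptor returns every record on that subject; adding a second descriptor narrows the set to the records that carry both.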
Why ETT is useful for taxonomy and semantic web applications. The main role of a thesaurus is to standardise the indexing process in order to make searches simpler, more efficient and consistent regardless of the language of the query. It is a multilingual conceptual thesaurus which strives to satisfy both the Community and national needs on a wide range of subjects. Each descriptor is related to one concept in each of the languages.
Why Another interesting option offered by ETT is the possibility for users to ask questions in one language and retrieve the answers in different languages: something Google does not do, or at least not yet!
Why It is only a term. In this case the descriptor ‘transparency of qualifications’ represents a precise concept and can retrieve many web pages, not necessarily documents, that contain the descriptor in its exact form in the text.
Why In this case ‘transparency of qualifications’ is more than a descriptor: it is a concept. We can find documents relating to the subject even if: 1. the term is not within the text 2. the document is in a different language.
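The difference between matching a term and matching a concept can be sketched as follows. This is an assumption-laden illustration: the concept id, the document records and the function names are hypothetical, and the French and German labels are illustrative translations, not labels taken from ETT.

```python
from typing import Optional

# Hypothetical concept record: one language-independent id, one label per language.
# The id and the non-English labels are illustrative, not taken from ETT.
concepts = {
    "c001": {
        "en": "transparency of qualifications",
        "fr": "transparence des qualifications",
        "de": "Transparenz der Qualifikationen",
    },
}

# Documents are indexed with concept ids, not with strings found in their text.
documents = [
    {"id": "doc-en", "lang": "en", "concepts": {"c001"}},
    {"id": "doc-fr", "lang": "fr", "concepts": {"c001"}},
    {"id": "doc-other", "lang": "en", "concepts": set()},
]


def concept_for(label: str, lang: str) -> Optional[str]:
    """Find the concept id whose label in `lang` matches the query term."""
    for concept_id, labels in concepts.items():
        if labels.get(lang, "").lower() == label.lower():
            return concept_id
    return None


def find_documents(label: str, lang: str) -> list[str]:
    """Return ids of documents indexed with the concept, in any language."""
    concept_id = concept_for(label, lang)
    if concept_id is None:
        return []
    return [doc["id"] for doc in documents if concept_id in doc["concepts"]]


print(find_documents("transparence des qualifications", "fr"))
```

Because the match is made on the concept id, a query typed in French also finds the English document, whether or not the term itself appears in its text.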
Why ETT is also used on the Cedefop website for the automatic categorisation or classification of documents, and at the Library’s reference desk to categorise users’ questions. A simple click gives cross-lingual access to the translation of a descriptor or of the complete semantic chain of a descriptor. These advanced options open the door to many cross-lingual applications, such as calculating document similarity across languages.
How Indexing with ETT’s updated version … knowing how something is stored makes finding it easier
How The main, word-by-word alphabetical display is the most familiar, since it provides a variety of information for each descriptor. The term’s main entry in the alphabetical display shows the appropriate coordination: a scope note (SN), BT and NT, USE and UF relations, and RT. But be careful … this approach is easy for specialists to understand but not so easy for end-users: for example, the fact that BT and NT mean that two terms are related hierarchically is obvious only to specialists!
How Showing users the hierarchical structures is a useful mechanism for query expansion, also because … - users with varying levels of domain knowledge make use of thesauri in different ways - thesauri are capable of providing end-users with additional, useful terms for query formulation and expansion
How A KWIC (keyword in context) index is formed by sorting and aligning the words within an article title, so that each word in a title (except the stop words) can be looked up alphabetically in the index. It was a useful indexing method for technical manuals before computerized full-text search became common. The term permuted index is another name for a KWIC index, referring to the fact that it indexes all cyclic permutations of the headings. In this context, a cyclic permutation is a rotation of the heading: the words stay in the same cyclic order, but each keyword in turn is moved to the front.
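The construction of a permuted index can be sketched in a few lines. This is a simplified sketch, not a full KWIC layout (a printed KWIC index also aligns the keyword in a fixed column); the stop-word list and the function name `kwic` are assumptions.

```python
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "to"}  # illustrative list


def kwic(titles: list[str]) -> list[tuple[str, str]]:
    """Return (rotated title, original title) pairs, sorted by leading keyword."""
    entries = []
    for title in titles:
        words = title.split()
        for i, word in enumerate(words):
            if word.lower() in STOP_WORDS:
                continue  # stop words never become index entries
            # Cyclic shift: the keyword moves to the front, word order is kept.
            rotated = " ".join(words[i:] + words[:i])
            entries.append((rotated, title))
    return sorted(entries, key=lambda entry: entry[0].lower())


for rotated, original in kwic(["Transparency of qualifications"]):
    print(f"{rotated}  ->  {original}")
```

Each title produces one entry per non-stop word, so a three-word title with one stop word appears twice in the index, once under each keyword.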
How Indexing with ETT’s updated version • 465 new descriptors = have been added to the thesaurus since the 2008 edition, so you cannot search earlier literature using these descriptors. Older literature on topics represented by these terms is searchable using related descriptors.
How Indexing with ETT’s updated version • 415 deleted descriptors = are no longer used in indexing, but they may be used for searching database entries prior to ETT’s 2008 edition. More recent literature on topics represented by these terms is searchable using related descriptors.
How How can I add the new descriptors using VET det ? 1) Enter the new descriptors (pp. 16-19 of the ETT printed version) in the Notes field, preceded by the word NEWDESCRIPTOR and separated by commas, e.g. Notes field: NEWDESCRIPTOR certification of learning outcomes, key competences • If the new descriptor is a main descriptor, put NEWMAINDESCRIPTOR at the beginning 2) Do not enter the deleted descriptors (pp. 20-22 of the ETT printed version)
How Fundamental, basic, classic indexing rules: really important, because VET-Bib contains 70,000 records! 1. Index ONLY what is in the document, and index at the LEVEL of specificity of the document • Statements or assumptions are not indexed
How Fundamental indexing rules 2. Very general descriptors are not used unless the document covers a topic very broadly 3. Main descriptors cover the main focus or subject of a document 4. Other descriptors indicate less important aspects within the document
How Fundamental indexing rules 5. ETT avoids ‘indexing up’ to a broader descriptor when an appropriate, more specific descriptor exists
How Fundamental indexing rules • Indexing is complementary to information found in other parts of the document (mainly title and abstract)
How Fundamental indexing rules • The number of descriptors should be proportionate to the number of pages
How Fundamental indexing rules • “Indexable” concepts are translated into descriptors; using the thesaurus helps maintain consistency and prevents the proliferation of concepts
How Fundamental indexing rules • Thus a single descriptor may be imprecise, even ambiguous, while the greater the number of descriptors used together, the greater the precision
How Fundamental indexing rules • The word precision is used here in a technical sense: the proportion of the documents in a retrieved set that are relevant
How Fundamental indexing rules • The word recall is used to mean the proportion of all the relevant documents in the collection that are actually retrieved
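The two measures can be made concrete with a short calculation, using the standard information-retrieval definitions: precision is the share of retrieved documents that are relevant, and recall is the share of relevant documents that are retrieved. The document ids below are made up for illustration.

```python
def precision(retrieved: set[str], relevant: set[str]) -> float:
    """Share of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set[str], relevant: set[str]) -> float:
    """Share of all relevant documents that were actually retrieved."""
    if not relevant:
        return 0.0
    return len(retrieved & relevant) / len(relevant)


# Hypothetical query result: 4 documents retrieved, 3 relevant in the collection.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}

print(precision(retrieved, relevant))  # 2 relevant out of 4 retrieved
print(recall(retrieved, relevant))     # 2 retrieved out of 3 relevant
```

In this example two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).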
What else … … for the future Permitting the searcher to switch between navigating the thesaurus and searching the database can only improve access. An obvious way in which a thesaurus can be applied directly in retrieval is to use the relationships as a means of expanding the search. Research, however, has shown that these relationships must be used with caution (precision/recall).
What else … … for the future In general, expanding a search to include the narrower terms tends to improve recall without great sacrifice in precision. Expanding to include broader or related terms, while it does improve recall, typically has a significant negative impact on precision.
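Expansion to narrower terms, the safer of the two options, can be sketched as a walk down the NT links. This is a minimal sketch under stated assumptions: the `NARROWER` table and the function name are hypothetical, and the terms are the illustrative ones used for BT/NT earlier in these slides.

```python
# Hypothetical NT links: descriptor -> its directly narrower descriptors.
NARROWER = {
    "Apparatus": ["Computers"],
    "Computers": ["Digital computers"],
}


def expand_narrower(term: str) -> set[str]:
    """Return the term plus all of its narrower terms, at every level."""
    expanded: set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue  # guard against accidental loops in the data
        expanded.add(current)
        stack.extend(NARROWER.get(current, []))
    return expanded


print(sorted(expand_narrower("Computers")))
```

The expanded set would then be searched with OR, so a query on a descriptor also retrieves records indexed only with its more specific terms. Broader and related terms are deliberately left out, in line with the caution about precision.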
What else … … for the future • How is it possible to remain positive about the need for continued use of thesauri? Because only a thesaurus can become the basis of a more extensive semantic network that provides information not just on what terms are used in indexing but on how they are used within the system.