INIS Training Seminar: Subject Analysis, Thesaurus and Computer Assisted Indexing. Alexander Nevyjel, Head, Content Management Group. 23 – 27 November 2009, Vienna, Austria. Introduction to Subject Analysis.
INIS Training Seminar
Subject Analysis, Thesaurus and Computer Assisted Indexing
Alexander Nevyjel
Head, Content Management Group
23 – 27 November 2009
Vienna, Austria

Introduction to Subject Analysis
• Subject analysis should be carried out, whenever possible, by subject specialists with a good knowledge of the subject matter and a familiarity with the subject analysis tools of the respective database (subject categories, thesaurus, subject analysis rules)
• Steps of subject analysis
  • subject classification
  • abstracting
  • subject indexing

Subject Classification
• The main topic of the document determines the primary subject category
• If there are other significant topics, one or more secondary subject categories can be assigned in addition

Abstracting
• Each input item should contain an English abstract (exception: short communications)
• Abstracts in other languages are optional
• If an author abstract is available, it should be checked by the subject specialist and edited if necessary
• An abstract should be as informative as possible
• Emphasize what is novel about the information in the original document

Thesaurus
“A thesaurus is a terminological control device used in translating from the natural language of documents, indexers or users into a more constrained system language. It is a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge.”
This definition has been adopted by UNESCO: “Guidelines for the establishment and development of monolingual thesauri”, UNESCO, SC/W/255, Paris, September 1973

The Thesaurus and its Structure
Relationship    Symbol   Cross reference
hierarchical    BT       broader term (level 1, 2, ...)
hierarchical    NT       narrower term (level 1, 2, ...)
affinitive      RT       related term
preferential    UF       used for (reciprocally USE ...)
preferential    UF+      used for multiple (reciprocally USE ... AND ...)
preferential    SF       seen for (reciprocally SEE ... OR ...)

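The reciprocal nature of these cross references can be illustrated with a small in-memory structure. This is only a sketch: the class `Thesaurus`, the `RECIPROCAL` table and the sample entries are hypothetical and do not describe how the INIS Thesaurus is actually stored. UF+ (USE ... AND ...) is left out for brevity.

```python
from collections import defaultdict

# Reciprocal symbol for each relationship symbol. USE and SEE are the
# reciprocals named on the slide; BT <-> NT and RT <-> RT follow from the
# definitions of broader, narrower and related term.
RECIPROCAL = {
    "BT": "NT", "NT": "BT", "RT": "RT",
    "UF": "USE", "USE": "UF", "SF": "SEE", "SEE": "SF",
}


class Thesaurus:
    """Toy thesaurus (hypothetical): term -> symbol -> set of other terms."""

    def __init__(self):
        self.relations = defaultdict(lambda: defaultdict(set))

    def add(self, term, symbol, other):
        """Record `term symbol other` together with its reciprocal entry."""
        self.relations[term][symbol].add(other)
        self.relations[other][RECIPROCAL[symbol]].add(term)

    def related(self, term, symbol):
        """Return the terms linked to `term` by `symbol`, sorted."""
        return sorted(self.relations[term][symbol])


# Placeholder entries, for illustration only.
thesaurus = Thesaurus()
thesaurus.add("NARROW TERM", "BT", "BROAD TERM")
thesaurus.add("PREFERRED TERM", "UF", "forbidden term")

print(thesaurus.related("BROAD TERM", "NT"))       # ['NARROW TERM']
print(thesaurus.related("forbidden term", "USE"))  # ['PREFERRED TERM']
```

Storing both directions at insertion time keeps every lookup a plain dictionary access, at the cost of holding each relation twice.
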
Subject Indexing
Subject indexing means analysing the information content of a piece of literature and expressing the meaningful information content in the language of the database, using the controlled vocabulary of the Thesaurus
• Understanding of the content --> subject specialist
• Familiarity with Thesaurus and indexing rules
• Select a set of descriptors that describes the subject content of the piece of literature

Procedures for Indexing
• Carefully read the title and abstract and scan the body of the piece of literature
• Scan the full text (introduction, table of contents, tables, graphs, figures, conclusion) to find information items missing from the abstract or requiring more precision
• Identify the concept(s) about which the piece of literature contains useful information
• Translate the concepts into descriptors
• Avoid overindexing

Proposed Terms (Technical Note 175)
If no suitable descriptor exists in the Thesaurus for the retrieval of a useful concept, make a proposal for a new one, containing the following:
• Proposed term
• Proposed word block of the term (in particular proposed BTs)
• Potential forbidden terms pointing to this proposed descriptor
• Scope note when appropriate
• Explanation and justification for the proposal
• One or more sample records

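The parts of a proposal listed above could be captured in a small record with a completeness check. This is a minimal sketch: the class `TermProposal` and its field names are hypothetical, and treating the broader terms as mandatory is an assumption drawn from the wording "in particular proposed BTs".

```python
from dataclasses import dataclass, field


@dataclass
class TermProposal:
    """One proposed descriptor; all field names are illustrative only."""
    term: str
    broader_terms: list            # proposed BTs from the word block
    justification: str             # explanation and justification
    sample_records: list           # one or more sample records
    forbidden_terms: list = field(default_factory=list)  # optional
    scope_note: str = ""           # only when appropriate

    def missing_parts(self):
        """List the (assumed) mandatory parts that are still empty."""
        required = {
            "term": self.term,
            "broader_terms": self.broader_terms,
            "justification": self.justification,
            "sample_records": self.sample_records,
        }
        return [name for name, value in required.items() if not value]


# Placeholder draft with only the term filled in.
draft = TermProposal(term="EXAMPLE TERM", broader_terms=[],
                     justification="", sample_records=[])
print(draft.missing_parts())
# ['broader_terms', 'justification', 'sample_records']
```
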
The purpose of subject indexing is to enable useful retrieval

Computer-assisted Indexing - CAI
• Kick-off Meeting: Jan 2004
• Implementation and Customisation: Jun 2004
• Production Indexing: from Jun 2004, ongoing
• CAI version 1.0 final acceptance: Aug 2004
• Tuning of the system: from Aug 2004, ongoing
• CAI batch processing for Member States: Dec 2004
• CAI online from remote for MS: Nov 2007

CAI Thesaurus Extension
“Hidden terms” are character patterns representing the different appearances in the free text of a concept that is indexed by one or more descriptors.
• handled similarly to “forbidden terms”, with one or more USE relations
• CAI internal only
• not exported to INIS production system
• not exported to FIBRE
• not printed in any appearance of the thesaurus
• support identification of descriptors in the free text

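How hidden terms could support descriptor identification can be sketched as a lookup from free-text patterns to descriptors. The sample entries are borrowed from the example slides that follow, written in their free-text form; the function `suggest_descriptors` and its whole-token, case-insensitive matching are assumptions, not the CAI implementation.

```python
import re

# Hidden term (free-text pattern) -> descriptors reached via USE relations.
# Entries come from the example slides; the lookup logic is assumed.
HIDDEN_TERMS = {
    "MgB2": ["MAGNESIUM BORIDES"],
    "anaesthesia": ["ANESTHESIA"],
    "ivory coast": ["COTE D'IVOIRE"],
}


def suggest_descriptors(text, hidden_terms=HIDDEN_TERMS):
    """Return the sorted descriptors whose hidden terms occur in `text`."""
    found = set()
    for pattern, descriptors in hidden_terms.items():
        # Match whole tokens only, ignoring case.
        regex = r"(?<!\w)" + re.escape(pattern) + r"(?!\w)"
        if re.search(regex, text, flags=re.IGNORECASE):
            found.update(descriptors)
    return sorted(found)


print(suggest_descriptors("Anaesthesia protocols; superconductivity in MgB2."))
# ['ANESTHESIA', 'MAGNESIUM BORIDES']
```
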
Hidden Terms: Compounds
Descriptor              hidden term        free text
MAGNESIUM BORIDES       MgB_2              MgB2
MAGNESIUM CARBONATES    MgCO_3             MgCO3
MAGNESIUM HYDRIDES      MgH_2              MgH2
IRON BROMIDES           iron dibromide
IRON BROMIDES           iron tribromide
ARSENIC IONS            As"3"-             As3-
ACETYLENE               C_2H_2             C2H2
ACETALDEHYDE            C_2H_4O            C2H4O
ACETIC ACID             C_2H_4O_2          C2H4O2
approx. 1400 hidden terms (expected 3000)

Hidden Terms: Isotopes
Descriptor    hidden term     free text
CESIUM 137                    Cesium 137, Cesium-137
              "1"3"7cs        137Cs
              137 caesium     137 Caesium, 137-Caesium
              caesium 137     Caesium 137, Caesium-137
              137 cesium      137 Cesium, 137-Cesium
              137 cs          137 Cs, 137-Cs
              cs 137          Cs 137, Cs-137
              cs"1"3"7        Cs137
              cs137           Cs137
CESIUM 138    "1"3"8"mcs      138mCs
              cs"1"3"8"m      Cs138m
approx. 22,400 hidden terms

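The free-text spellings in the table follow a regular pattern: element name or symbol, mass number, either order, with a space, a hyphen or nothing in between. A sketch of how such variants could be enumerated is given here. The function `isotope_variants` is hypothetical; the slides do not say how the roughly 22,400 isotope hidden terms were actually produced.

```python
def isotope_variants(names, symbol, mass):
    """Enumerate free-text spellings for one isotope (assumed pattern).

    names  -- element names, e.g. ["Cesium", "Caesium"]
    symbol -- chemical symbol, e.g. "Cs"
    mass   -- mass number, e.g. 137
    """
    variants = set()
    for label in [*names, symbol]:
        for sep in (" ", "-"):
            variants.add(f"{label}{sep}{mass}")  # Cesium 137, Cs-137
            variants.add(f"{mass}{sep}{label}")  # 137 Cesium, 137-Cs
    # Compact forms without a separator, for the symbol only.
    variants.add(f"{mass}{symbol}")              # 137Cs
    variants.add(f"{symbol}{mass}")              # Cs137
    return sorted(variants)


variants = isotope_variants(["Cesium", "Caesium"], "Cs", 137)
print(len(variants))  # 14
```

The 14 generated strings coincide with the distinct free-text forms listed for CESIUM 137 in the table.
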
Hidden Terms: Elementary Particles
Descriptor                   hidden term      free text
B QUARKS                     bottom quarks
T QUARKS                     top quarks
ELECTRON NEUTRINOS           #nu#_e           νe
MUON NEUTRINOS               #nu#_#mu#        νμ
TAU NEUTRINOS                #nu#_#tau#       ντ
RHO-770 MESONS               #rho#-770        ρ-770
OMEGA-782 MESONS             #omega#-782      ω-782
KAONS NEUTRAL                K"0              K0
KAONS NEUTRAL SHORT-LIVED    K"0_S            K0S
KAONS NEUTRAL LONG-LIVED     K"0_L            K0L
approx. 300 hidden terms

Hidden Terms: UK/US Spellings
Descriptor                   hidden term
A CENTERS                    a centres
ACTIVITY METERS              activity metres
ANALOG COMPUTERS             analogue computers
ANESTHESIA                   anaesthesia
ARCHAEOLOGY                  archeology
AUSTRIAN ORGANIZATIONS       austrian organisations
BALLISTIC MISSILE DEFENSE    ballistic missile defence
BAYARD-ALPERT GAGES          bayard-alpert gauges
BEAM ANALYZERS               beam analysers
BEHAVIOR                     behaviour
CATALOGS                     catalogues
approx. 800 hidden terms

Hidden Terms: Diacritics and Countries
Descriptor                  hidden term
Diacritics:
BAECKLUND TRANSFORMATION    backlund transformation
BRUECKNER MODEL             bruckner model
BRUNSBUETTEL REACTOR        brunsbuttel reactor
MOESSBAUER EFFECT           mossbauer effect
Country Names:
CAMBODIA                    kampuchea
COTE D'IVOIRE               ivory coast
GREECE                      hellas
MYANMAR                     burma
SYRIA                       syrian arab republic
THAILAND                    siam
approx. 250 hidden terms

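The diacritics rows pair a transliterated descriptor (ö written as OE) with a hidden term in which the diacritic is simply dropped (ö written as o). Both spellings can be derived from the original name, as in this sketch. The two helper functions and the `TRANSLITERATION` table are assumptions inferred from the examples, not a documented INIS rule.

```python
import unicodedata

# Transliteration apparently used in the descriptors (inferred from
# MOESSBAUER, BRUECKNER, BAECKLUND). Keys are written as escapes:
# \u00e4 = a-umlaut, \u00f6 = o-umlaut, \u00fc = u-umlaut, \u00df = sharp s.
TRANSLITERATION = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}


def descriptor_spelling(word):
    """Transliterate umlauts (o-umlaut -> oe) and upper-case the result."""
    composed = unicodedata.normalize("NFC", word.lower())
    return "".join(TRANSLITERATION.get(ch, ch) for ch in composed).upper()


def stripped_spelling(word):
    """Drop diacritics entirely (o-umlaut -> o), as in the hidden terms."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(descriptor_spelling("Mössbauer"))  # MOESSBAUER
print(stripped_spelling("Mössbauer"))    # mossbauer
```
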
Hidden Terms: Other Spellings
Descriptor                  hidden term
Singular/Plural:
FUNGI                       fungus
FUNGI                       funguses
G MATRIX                    g matrices
G MATRIX                    g matrixes
Reverse Sequence:
ATOM-MOLECULE COLLISIONS    atom-molecule scattering
ATOM-MOLECULE COLLISIONS    molecule-atom scattering
ATOM-MOLECULE COLLISIONS    atom-molecule reactions
ATOM-MOLECULE COLLISIONS    molecule-atom reactions
ATOM-MOLECULE COLLISIONS    atom-molecule interactions
ATOM-MOLECULE COLLISIONS    molecule-atom interactions
approx. 900 hidden terms

CAI Thesaurus Extension
• Thesaurus
  • Valid Descriptors: 21,826
  • Forbidden Terms: 9,009
• CAI
  • Hidden Terms: 34,381
• Total (Terminological Knowledge Base): 65,216

Further Improvements Necessary
• “+” and “-” signs
  • K+ --> KAONS PLUS, KAONS MINUS, POTASSIUM IONS
• Case sensitivity
  • TiN --> TIN (instead of TITANIUM NITRIDES)
  • gas --> GALLIUM SULFIDES
  • “…who is the …” --> WHO (World Health Organization)
• Verbs versus nouns
  • “… this leads us to …” --> LEAD
  • “… this leaves it ….” --> LEAVES
• Homographic terms
  • Solutions --> SOLUTIONS or MATHEMATICAL SOLUTIONS
• Nuclear reactions, e.g. 14N(γ,α)10B
  • Targets
  • Beams
  • Reactions

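The case-sensitivity problems listed above arise when patterns such as TiN are matched without regard to case. A sketch of one possible remedy, exact-case matching for a small set of ambiguous patterns, is shown here. The table `CASE_SENSITIVE_TERMS` and the function are hypothetical illustrations built from the slide's examples; they are not the CAI rule set.

```python
import re

# Patterns that collide with ordinary words when case is ignored.
# Hypothetical table; "GaS" is assumed to be the formula behind the
# "gas --> GALLIUM SULFIDES" mismatch.
CASE_SENSITIVE_TERMS = {
    "TiN": "TITANIUM NITRIDES",
    "GaS": "GALLIUM SULFIDES",
    "WHO": "WHO",
}


def match_case_sensitive(text, terms=CASE_SENSITIVE_TERMS):
    """Return descriptors whose pattern occurs in `text` with exact case."""
    found = []
    for pattern, descriptor in terms.items():
        # No IGNORECASE flag: "tin", "gas" and "who" do not match.
        if re.search(r"(?<!\w)" + re.escape(pattern) + r"(?!\w)", text):
            found.append(descriptor)
    return found


print(match_case_sensitive("TiN coatings were deposited from the gas phase."))
# ['TITANIUM NITRIDES']
print(match_case_sensitive("The operator who is the licensee uses tin cans."))
# []
```

Exact-case matching does not help with the verb/noun and homograph cases, which need context beyond the matched string.
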
CAI Workflow
[Figure: workflow diagram with the branches Batch Mode, Interactive CAI Processing and Conventional Processing]

CAI Batch and Online Processing
• Input: MemSt-CC-yymmdd-xxxxxxxxxxx
  • MemSt is a standard prefix (meaning “member state”)
  • CC is the country code
  • yymmdd is the date when the file was generated
  • xxxxxxxxxxx is any additional identification
• Examples
  • MemSt-AR-041203-thisismytestfile
  • MemSt-FR-041212-fileidentification

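The naming convention above lends itself to a simple validation step before a file is submitted. This is a minimal sketch: the function `parse_input_name` is hypothetical, and two details are assumptions not stated on the slide, namely that CC is exactly two letters and that the additional identification consists of letters and digits only.

```python
import re
from datetime import datetime

# MemSt-CC-yymmdd-xxxxxxxxxxx. Assumptions: CC is two letters, and the
# trailing identification is a non-empty run of letters and digits.
FILE_NAME = re.compile(
    r"MemSt-(?P<cc>[A-Za-z]{2})-(?P<date>\d{6})-(?P<ident>[A-Za-z0-9]+)"
)


def parse_input_name(name):
    """Split a CAI input file name into country code, date and identifier."""
    match = FILE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid CAI input file name: {name!r}")
    # strptime also rejects impossible dates such as 041332.
    generated = datetime.strptime(match["date"], "%y%m%d").date()
    return match["cc"].upper(), generated, match["ident"]


country, generated, ident = parse_input_name("MemSt-AR-041203-thisismytestfile")
print(country, generated.isoformat(), ident)
# AR 2004-12-03 thisismytestfile
```
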
CAI Batch Processing
• Output: _MemSt-CC-yymmdd-xxxxxxxxxxx
• These files carry the CAI suggested descriptors in tag 800, preceded by the string ##CAI suggestions##;
• Example:
  • 800^##CAI suggestions##; DESCRIPTOR1; DESCRIPTOR2; DESCRIPTOR3; …….
• The files are sent back to the member state for reviewing

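A reviewer's tool could read the suggestions back out of tag 800 along the following lines. This is a sketch that assumes exactly the layout shown in the example (tag number, caret, marker string, then descriptors separated by semicolons); the function name and the sample descriptors are illustrative.

```python
MARKER = "##CAI suggestions##;"


def parse_suggestions(line):
    """Extract the suggested descriptors from a tag 800 line.

    Assumes the layout shown on the slide: "800^" + marker + descriptors
    separated by semicolons. Any other line yields an empty list.
    """
    tag, _, content = line.partition("^")
    if tag != "800" or not content.startswith(MARKER):
        return []
    body = content[len(MARKER):]
    return [d.strip() for d in body.split(";") if d.strip()]


line = "800^##CAI suggestions##; MAGNESIUM BORIDES; SUPERCONDUCTIVITY"
print(parse_suggestions(line))
# ['MAGNESIUM BORIDES', 'SUPERCONDUCTIVITY']
```
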
CAI Batch and Online Processing: Reviewing Process
• Delete all suggested descriptors which are too general
• Add relevant descriptors which were not found
  • numerical values, e.g. pressure ranges, temperature ranges, ...
  • nuclear reactions
  • chemical compounds, alloys, etc.
• CAI cleans up BT/NTs among its own suggestions; clean up BT/NTs resulting from manual additions
• Clean up suggestions arising from homographic terms

CAI Batch and Online Processing: Finalisation Process
CAI batch
• When reviewing of a record is completed: delete “##CAI suggestions##”
• When reviewing of all records is completed: submit file to “INIS Input Box”
CAI online
• When reaching the last record: press the “export and exit” button
• File goes directly to INIS production system or, if required, is sent back to the Member State for reviewing

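For the batch route, the marker doubles as a per-record "not yet reviewed" flag, so a simple check before submission is possible. The sketch below is hypothetical: the function names are invented, and removing the semicolon that follows the marker is an assumption, since the slide only says to delete the marker string.

```python
MARKER = "##CAI suggestions##"


def finalise_record(tag_800_line):
    """Remove the CAI marker from a reviewed tag 800 line."""
    tag, sep, content = tag_800_line.partition("^")
    if MARKER not in content:
        return tag_800_line  # already finalised, leave untouched
    # Assumption: the semicolon and spaces after the marker go as well.
    cleaned = content.replace(MARKER, "", 1).lstrip("; ")
    return f"{tag}{sep}{cleaned}"


def unreviewed(lines):
    """Return the lines that still carry the marker."""
    return [line for line in lines if MARKER in line]


records = [
    "800^##CAI suggestions##; DESCRIPTOR1; DESCRIPTOR2",
    "800^DESCRIPTOR3; DESCRIPTOR4",
]
print(len(unreviewed(records)))      # 1
print(finalise_record(records[0]))   # 800^DESCRIPTOR1; DESCRIPTOR2
```

A file would be ready for the “INIS Input Box” once `unreviewed` returns an empty list.
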
CAI Production Statistics, 01-06-2004 until 31-08-2009

CAI Batch Processing Statistics, 2005 until 31-08-2009

CAI online for Member States, introduced in July 2007
• Tested by: China, Germany, France, India, Japan, Switzerland, Uruguay
• Regularly in use by: Argentina, Brazil, China, Czech Republic, Japan, Switzerland
• CAI online and CAI batch are now regular services for Member States