220 likes | 462 Vues
ODF 000419 Benchmarking Oracle8 i Intermedia Text . Benchmarking Oracle 8 i Intermedia Text. Background for this benchmark Interesting new features in OIMT Benchmarking, methodology and problems Results Conclusions. ODF 000419 Benchmarking Oracle8 i Intermedia Text .
E N D
ODF 000419 Benchmarking Oracle8i Intermedia Text Benchmarking Oracle 8i Intermedia Text • Background for this benchmark • Interesting new features in OIMT • Benchmarking, methodology and problems • Results • Conclusions
ODF 000419 Benchmarking Oracle8i Intermedia Text Background for this benchmark The task of the thesis project The EDMS Search Engine CERN’s EDMS CADIM/EDB, managing product data documents MP5, managing physical components Oracle 7, database platform Implement Oracle8i in the future?
ODF 000419 Benchmarking Oracle8i Intermedia Text Oracle 8i Intermedia A product embedded in Oracle8i Intermedia allows a unified technique for accessing various types of data such as: • Text • Documents • Images • Audio • Video
ODF 000419 Benchmarking Oracle8i Intermedia Text Features in Intermedia Text • Database integration • Mixed queries against multiple text columns • Single SQL API • Indexes on most database columns • All index data in the database • Automatic triggers on textcolumns to detect changes • Full text search • Exact Word and phrase search, operators, removing stopwords • XML and HTML document section searching
ODF 000419 Benchmarking Oracle8i Intermedia Text Features in Intermedia Text • About Search (All languages) • Parses any text search to perform an optimal search • Complements full-text search • Theme Identification (English only) • Identifies strong themes in documents using a ”Theme base” • ”Thematic” information is put into the index • Themesearch is available via the about operator • May be customized for specific terminology
ODF 000419 Benchmarking Oracle8i Intermedia Text Features in Intermedia Text • Document services • View documents as plain text or HTML format • Store documents in a database, file system or at an URL adress • Multilingual text search • Full-text search in most languages including Japanese, Chinese and Korean • Support for all Oracle-NLS character sets • Stemming and Fuzzy search for Dutch, English, French, German and Spanish • Base-letter support and alternate spelling for Western European languages
ODF 000419 Benchmarking Oracle8i Intermedia Text Features in Intermedia Text To be investigated: • The text indexing technique • The new query operators • How to use this in the EDMS Search Engine
ODF 000419 Benchmarking Oracle8i Intermedia Text OIMT Textindexes An OIMT textindex can be created on: • varchar2 columns • Large object (LOB) columns • B-file columns, indexing entire files Allowing queries like: Select id from table where contains(column,’Atlas’)>0; 1 One of the detectors in the LHC ring is ATLAS 2 The Atlasmountains are situated in the north of Africa -> 1 … contains(column,’Atlas inner detector’)>0; 1 The problems with the ATLAS inner detector... 2 The inner detector of ATLAS... -> 1 Query optimization Selects the best executing plan based on analyze table…compute statistics;
ODF 000419 Benchmarking Oracle8i Intermedia Text OIMT Textindexes An OIMT textindex is an ”inverted” index 1 the LHC accelerator 2 the LEP accelerator -> accelerator: row #1 position 2, row #2 position 2 LHC: row #1 position 1 LEP: row #2 position 1 A textindex is built up by five objects Four tables: I,K,N,R and one b-tree index: X Create OIMT textindex statement create index myindex on table(column) indextype is ctxsys.context; Note that tablemust have a primary key Updating, two choices Automaticly by starting ctxsrv and commit Rebuild ”manually”
ODF 000419 Benchmarking Oracle8i Intermedia Text OIMT Textindexes The indexing ”pipeline” • Loops over the rows and reads data out of the column • Or from remote servers, accessed via http or ftp via pointers Datastore • Transformns the data into text representation • Output can be in HTML or XML Filter • Takes the output from the filter and converts it to plain text • Different sectioner for different formats • Detects important section tags Sectioner Lexer • Separates text into tokens and words • Remove stopwords
ODF 000419 Benchmarking Oracle8i Intermedia Text OIMT Textindexes The preference system allows customization of textindexes Classes of ”customizable” objects: DATASTORE, FILTER, SECTION_GROUP, LEXER, WORDLIST, STOPLIST, STORAGE Createa preference to customize an OIMT textindex: execute ctx_ddl.create_preference(’my_pref’, ’BASIC_LEXER’); execute ctx_ddl.set_attribute(’my_pref’, ’INDEX_THEMES’, ’YES’); create index my_index on table(column) indextype is ctxsys.context parameters(’LEXER my_pref’); Default preferences Unset preferences get their value from the default system
ODF 000419 Benchmarking Oracle8i Intermedia Text OIMT Query operators Scoring operators:WEIGHT(*), THRESHOLD(>), ACCUM(,), MINUS(-) Examples: …contains(column, ’(edms*2) AND cms’)>10; …contains(column, ’edms, cms’)>0; edms+cms scores higher than each word alone Word expansion operators:WILDCARD(%), FUZZY(?), STEM($), SOUNDEX(!), EQUIV(=) Examples: …contains(column, ’?mignets’)>0; Fuzzy correct missspellings 1 The magnets of the LHC accelerator… -> 1 …contains(column, ’$go’)>0; Stem considers e.g. go, went, gone as the ”same” word 1 I will go to the cinema. as well as plurals, magnet=magnets 2 I went back home. 3 The train has gone. -> 1,2,3 …contains(column, ’!dog’)>0; Soundex retrieves all word which sounds alike 1 I have a dog. 2 Someone has dug a hole. -> 1,2
ODF 000419 Benchmarking Oracle8i Intermedia Text OIMT Query operators Proximity operator: NEAR(;) Examples: …contains(column, NEAR((edms,cms),4, TRUE))>0; 1 EDMS is available for CMS, ATLAS, LHCb... 2 The EDMS at CERN manages all product data documents of CMS -> 1 Section limiting and theme operators:WITHIN, ABOUT about is used as a ”general” operator, optimizing the query by including stem($) and theme search if available Thesaurus operator: SYN Examples: SYN(dog) == {boxer} | {rotweiler} | {terrier} Boolean operators: AND, OR, NOT
ODF 000419 Benchmarking Oracle8i Intermedia Text OIMT Benchmarks • Is SELECT…CONTAINS(column,’word’)>0;faster than SELECT…LIKE(’%word%’);? • How do indexes on entire files perform? • What is the prize in terms of memory storage, maintenance? • Still fast retrieval? • Is updating flexible? • Do the query operators perform as expected?
ODF 000419 Benchmarking Oracle8i Intermedia Text Comparing CONTAINS and LIKE,retrieval times Fulltablescan using LIKE, as in EDMS Search today Y-axis: Retrieval time [s] X-axis: Tablesize [# of rows] Using OIMT’s CONTAINS
ODF 000419 Benchmarking Oracle8i Intermedia Text Comparing CONTAINS and LIKE,retrieval times Fulltablescan using LIKE, as in EDMS Search today Y-axis: Retrieval time [s] X-axis: Tablesize [# of rows] Using OIMT’s CONTAINS
ODF 000419 Benchmarking Oracle8i Intermedia Text Indexing entire files • Tests with smaller amounts of files (<1000) works fine • Encountered heavy problems when indexing up to 9000 files: • Security bug • A user gets the database owner’s OS privileges when accessing files through FILE DATASTORE. • Special roles will be introduced in the 8.1.7 version to manage this problem. • Not indexing certain Excel files, bug • Said to be fixed in the 8.1.7 version • Storage problems, tablespace, temp, shared pool, lobsegments • A OIMT textindex is complex, several storage parameters have to be extended, which was non-trivial. • Unrelevant errormessages • Somewtimes when problems occur, unrelevant errormessages or no errormessages at all are returned, making troubleshooting difficult.
ODF 000419 Benchmarking Oracle8i Intermedia Text Indexing entire files Statistics about the 9000 file index: Total indexsize: 1.0 GB Number of files: 8942 Accumulated filesize: 4.4 GB Tot indexsize/acc filesize: 23.6 % Average indexsize per file: 115.8 KB/file Average filesize: 490.1 KB/file Creation time: 7:06 hours Creation time per file: 2.9 seconds/file Updating: Works fine, 2 seconds/inserted row Quering: 1-3 seconds/query (depends on the query!) Machine: SUN 4CPU 300MHz 1.5 GB RAM
ODF 000419 Benchmarking Oracle8i Intermedia Text A view of the testtable Index created on this file reference column
ODF 000419 Benchmarking Oracle8i Intermedia Text Conclusions O8i IMT • A very interesting ”platform” for the future EDMS Search Engine • Will provide tools for fast full-text searches, with both simple and advanced queries Textindexes • Indexing entire files works, but is not trivial to do in version 8.1.6, to be improved in 8.1.7? • Quering is fast, both for indexed varchar2 columns and filecolumns • Updating textindexes can be done automatic • Index preferences may be customized • Multilingual Query Operators • Works mainly as expected, providing powerful tools for an advanced Search Engine • Wildcards a very slow when executed on fileindexed columns
ODF 000419 Benchmarking Oracle8i Intermedia Text The EDMS Search Engine Interface Application interfaces are non-trivial to create Interface important to make userfriendly Many hits may be retrieved from full-text searches, limiting these may be crucial Some ideas: Ask the users for hints Keep the interface as simple as possible, avoid too many graphical objects etc A simple and an advanced search option Menus for choosing query operators, scoring etc
ODF 000419 Benchmarking Oracle8i Intermedia Text An OIMT Test Search Engine: 9000 indexed files from the EDMS production database http://oraweb01.cern.ch:9000/anders/owa/edms_extended_search.enter_search