
ProjectWise 101 – Chapter 9 Document Indexing

ProjectWise 101 – Chapter 9 Document Indexing. Gary Cochrane – Technical Director Geospatial Sales – North America. Introduction: ProjectWise Document Indexing really means three things: Full Text Indexing (in support of full text searching), Thumbnail Extraction, and Document Property Extraction.


Presentation Transcript


  1. ProjectWise 101 – Chapter 9 Document Indexing • Gary Cochrane – Technical Director Geospatial Sales – North America

  2. Introduction • ProjectWise Document Indexing • Really means three things • Full Text Indexing, in support of full text searching • Thumbnail Extraction • Document Property Extraction • We won’t cover this one in PW101 • See Bentley Institute PW Admin course guide for this

  3. Full Text Indexing • We did not write the engine for this • But elected to use the one Microsoft provides • Included with every copy of Windows • That engine is called the MS Indexing Service • And it was installed in the VM as an optional Windows component • Microsoft indexes the following file formats • MSWord, Excel, PPT, HTML, XML, TXT

  4. Pre-installed in VM • ProjectWise Integration Server • ProjectWise Orchestration Framework • MicroStation V8i-SS1 • Supported Database Engine • Microsoft Message Queuing Service • Microsoft Indexing Service • Microsoft .NET Framework 2.0 • Windows Server 2003 with SP2

  5. Extending the MS Index Service • Microsoft provides an SDK for third parties to extend the Indexing service • So the Indexing service will know how to “filter” files from that vendor • For instance, Adobe provides an “iFilter” that teaches the MS Index Service how to extract text from a PDF file • The Adobe PDF iFilter is installed with Acrobat Reader V9x

  6. Indexing Overview • Within PW, Indexing consists of: • Scheduling • A process that wakes up, checks for new (or modified) files, adds them to the Copy-out queue, and goes back to sleep • Copy-out • Copy the file from the Storage Area to the machine running the Indexing Service. Then add the file to the extraction queue. • Remember, files may be stored on multiple servers • Also, in large installations, a machine may be dedicated to indexing

  7. Indexing Overview – Part II • Overview – continued • Extraction • This process gets the text from the file and adds it to the MS Index catalog. Then adds the file to the Update queue • Update • This process sets the flag on the file (in the PW database) that says it is “done” • New files are added with the flag set to “undone” • Check-out/in causes the flag to be set to “undone”
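The four stages on slides 6 and 7 can be sketched as a chain of queues. This is only a toy model in Python, not ProjectWise code: the `Doc` class, `run_pass`, `check_in`, and the in-memory `catalog` dictionary are hypothetical names standing in for the PW database flag and the MS Index catalog.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Doc:
    name: str
    text: str
    indexed: bool = False  # the "done" flag; new files start as "undone"


def run_pass(docs, catalog):
    """Run one pass of the four stages over every "undone" document."""
    # Scheduling: wake up, find new or modified files, queue them for copy-out.
    copy_out_queue = deque(d for d in docs if not d.indexed)

    # Copy-out: the real system copies the file to the indexing machine;
    # this model simply hands the document on to the extraction queue.
    extraction_queue = deque()
    while copy_out_queue:
        extraction_queue.append(copy_out_queue.popleft())

    # Extraction: pull the words out and add them to the catalog.
    update_queue = deque()
    while extraction_queue:
        doc = extraction_queue.popleft()
        for word in doc.text.lower().split():
            catalog.setdefault(word, set()).add(doc.name)
        update_queue.append(doc)

    # Update: set the flag that says the document is "done".
    while update_queue:
        update_queue.popleft().indexed = True


def check_in(doc, new_text):
    """A check-out/in with changes resets the flag to "undone"."""
    doc.text = new_text
    doc.indexed = False
```

Calling `run_pass` again after `check_in` picks up only the document whose flag was reset, which is the incremental behaviour the slides describe.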

  8. A note on “done” • Done does not necessarily mean it was successful • It means the file has been processed • In other words, what happens if an unknown file (Ex: an AutoCAD file) is sent to the Indexing Service? • The file is attempted… • And the indexing service says, “I don’t know how to extract text from this file” • There would be no point in trying the file again • So it is marked as “done”, even when unsuccessful
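The difference between “done” and “successful” can be shown with a small Python sketch. It assumes a hypothetical `EXTRACTORS` table keyed by file extension (standing in for the installed iFilters) and a hypothetical `process` function; neither is part of ProjectWise or the Indexing Service.

```python
from pathlib import Path

# Hypothetical stand-in for the installed iFilters: extension -> extractor.
EXTRACTORS = {".txt": lambda raw: raw}


def process(filename, raw):
    """Attempt a file once and always flag it as done, success or not."""
    extractor = EXTRACTORS.get(Path(filename).suffix.lower())
    if extractor is None:
        # Unknown format: nothing extracted, but retrying would not help,
        # so the file is still marked "done".
        return {"done": True, "success": False, "text": ""}
    return {"done": True, "success": True, "text": extractor(raw)}
```

In this model an unknown extension yields `done=True` with `success=False`, so the file would not be retried until its flag is reset.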

  9. MicroStation and AutoCAD • ProjectWise provides a mechanism to index the text from these file types • Instead of writing an iFilter, Bentley elected to: • Copy-out the file • Run MicroStation in the background, extract all the text, and write it to an XML file • Send the XML file to the Indexing Engine • Since MicroStation can parse DWG as well… • Then this method saved us from having to write two iFilters
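As a rough illustration of the “write it to an XML file” step, the Python sketch below wraps a list of extracted text strings in XML. The element names (`document`, `text`) and the `text_to_xml` function are assumptions made for this example; the slides do not describe the actual layout of the XML that MicroStation produces.

```python
import xml.etree.ElementTree as ET


def text_to_xml(doc_name, strings):
    """Wrap extracted text strings in a small XML document (hypothetical layout)."""
    root = ET.Element("document", name=doc_name)
    for s in strings:
        ET.SubElement(root, "text").text = s
    return ET.tostring(root, encoding="unicode")
```

The point of the design is that XML is a format the Indexing Service already understands, so no extra filter is needed for the drawing formats themselves.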

  10. Summary • So within ProjectWise, we index: • Word, PPT, Excel, XML, HTML, TXT • Adobe PDF • DGN, & DWG • More good news • iFilters can be found for many file formats • Some free, and some for purchase

  11. PW Orchestration Framework • Remember when we installed this? • PWOF is responsible for managing batch processes for ProjectWise • This includes all those processes discussed on the previous slides • For Full Text Indexing, that means • Scheduler process, Copy-out process, Extraction process, Updater process, and the MicroStation instance running in the background

  12. Lab 1a • PW Orchestration Framework • Start the Windows Task Manager • Hint: Right-click on empty part of Taskbar • Examine memory usage • On the Performance tab • Switch to Processes tab • Sort by Mem Usage column (descending) • Look for ustation.exe • Look for DmsAfpEngine(s) • Lots of memory consumed here…

  13. Lab 1b • Now open Services dialog • Remember “gears” icon on Quick-Launch • Locate PW Orchestration Framework service • Select the PW OF service, and choose> Stop • Watch memory usage in Task Manager • For remainder of exercise, we need PWOF running • So start it back up now • Note PWOF is configured for automatic startup • It will run each time machine is booted • Close Services and Task Manager

  14. Lab 2a • Open PW Administrator • Log in as> adminpw • Drill down to: • Document Processors> Full Text Indexing • Right-click, choose> Properties

  15. Lab 2b - Full Text Indexing • Accept default, unless Indexing is to be run on another machine • Turn on • adminpw • adminpw • Set to 60

  16. Lab 2c - Full Text Indexing • Enable all times in the schedule • Set to 2

  17. Lab 2d • Switch to File Type Associations tab • Press> Add • In the Extension field, enter> DWG • In the bottom field, enter> DGN • So that DWG files are processed as if they were DGN • Press> OK

  18. Lab 2e

  19. Lab 2f • Still on the File Type Associations tab • Again, press> Add • In the Extension field, enter> itiff • In the bottom, enable> Do not process these documents • You can’t extract text from a raster so this prevents wasted file transfers • Press> OK • Press OK again • To close the Full Text Indexing Properties

  20. Lab 2g • Open Task Manager again • Switch to Performance tab • Within 2 minutes, you should see heavy CPU usage • Memory usage will also go up • Up to 60 documents will be indexed in the first pass • If there are more than 60 documents to be done, then they will be queued in the next pass • 2 minutes from now
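The batch arithmetic on this slide can be worked through in a few lines of Python. The sketch assumes each pass handles at most 60 documents, that passes start 2 minutes apart, and that every pass finishes before the next one starts; the function names are hypothetical.

```python
import math


def passes_needed(backlog, batch_size=60):
    """Number of passes required to work through a backlog of documents."""
    return math.ceil(backlog / batch_size)


def minutes_until_last_pass_starts(backlog, batch_size=60, interval_min=2):
    """Upper bound on the wait before the final pass begins."""
    return passes_needed(backlog, batch_size) * interval_min
```

For example, a backlog of 150 documents needs 3 passes under these assumptions, so the last pass starts within about 6 minutes.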

  21. Analysis • All documents will eventually be processed • When done, the index will be ready for fast full text searches • Once the indexer has caught up, future load will be lighter due to only processing incremental documents

  22. Lab 3a • When done, close Task Manager, open PW Explorer • Log in as user1 • From the main tool box, select> Find Documents • Binocular icon • Change to Full Text tab • Enter Look For> detail • Press OK to start search • Then Close the Search dialog • Your results should include: DGNs, DWGs, and PDFs

  23. Lab 3b • Browse to: • User1/Document Indexing/MS-SHT • These files were not successful because they have an unknown extension • But they were attempted, and flagged as done • Return to PW Administrator • Select datasource name (pwdemo) • Right-click, choose> Properties • Change to Statistics tab • Choose Refresh • Review Full Text Statistics • Close dialog

  24. Lab 3c • While still in PW Administrator • Open Full Text Indexing Properties again • Switch to the File Type Associations tab • Press Add • In the Extension field, enter> SHT • In the bottom Extension field, enter> DGN • So that SHT files will be processed as if they were DGN files • Press OK to complete the Extension mapping • Press OK again to close the Properties dialog
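The File Type Associations set up in the labs (DWG and SHT treated as DGN, itiff skipped) behave like a simple lookup table. The Python sketch below models that idea; the `ASSOCIATIONS` dictionary and `effective_type` function are hypothetical and do not reflect how ProjectWise stores these settings.

```python
# Hypothetical model of the File Type Associations tab.
# None means "Do not process these documents".
ASSOCIATIONS = {"dwg": "dgn", "sht": "dgn", "itiff": None}


def effective_type(filename):
    """Return the type a file is processed as, or None if it is skipped."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    if ext in ASSOCIATIONS:
        return ASSOCIATIONS[ext]
    return ext  # no association: process under its own extension
```

With this table, `site.SHT` is handled as a DGN file, `scan.itiff` is never transferred, and `notes.txt` is processed as plain text.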

  25. Lab 3d • Once new file type has been added… • Now a small problem • These files were flagged as done, and the Indexer won’t try them again unless they are checked out/in • And even that won’t work unless you actually make changes… • PW compares files to version on server, and doesn’t transfer back if there are no changes

  26. Lab 3e • Rather than check them all out, and back in • From PW Administrator • Right-click Full Text Indexing • Choose> Mark folder Documents for Reprocessing • Browse “…” to • User1/Document Indexing/MS-SHT • Press OK • Press OK again
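Conceptually, marking a folder for reprocessing just resets the “done” flag on every document in that folder. The Python sketch below models this with plain dictionaries; the `mark_folder_for_reprocessing` function and the `path`/`done` keys are hypothetical and are not the ProjectWise implementation.

```python
def mark_folder_for_reprocessing(docs, folder):
    """Reset the "done" flag for every document under the given folder."""
    prefix = folder.rstrip("/") + "/"
    count = 0
    for doc in docs:
        if doc["path"].startswith(prefix):
            doc["done"] = False  # the next scheduler pass will pick it up
            count += 1
    return count
```

Documents outside the chosen folder keep their flag, so only the affected files are transferred and indexed again.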

  27. Analysis • Within 2 minutes, these documents will be re-processed • If you run the search again (in a few minutes), you should also get SHT files in your results • Re-visit Datasource statistics to see if the Full Text categories have changed

  28. Summary • Once the index is created, • You can stop the PW Orchestration Framework service • It is used to create the index, but not to search the index • This will save memory, and CPU cycles • So in a demo, your machine will run faster • BUT new (or modified) files will not be re-indexed • Up until now, the PWOF was not being used at all • Full Text Indexing is the first time we’ve needed PWOF, even though it has been running since installation

  29. PW Thumbnails • PW Thumbnails is not “indexing” in the proper sense, but it is similar in nature to Full Text • PW Thumbnails extracts a thumbnail from the document, and stores a copy in the PW database • This allows one to browse PW Explorer, and see thumbnails in the Preview Pane • Not all file types support thumbnails • Among those that do, some don’t do it per the industry standard

  30. Thumbnails – Part II • Important to remember • ProjectWise does not create thumbnails • It only extracts what might be in the file • A good test is to check to see if Windows Explorer displays a thumbnail for the file • If it does, then PW should as well

  31. Lab 4a • Open Windows Explorer • Browse to: • C:\PW-101 Class Files\Document Indexing\MS-V8 • Change to Thumbnail display • MicroStation V8 files have thumbnails

  32. Lab 4b • Browse through remaining Document Indexing folders • Note which include thumbnails • Additional notes • PDF files take a long time because you are really looking at a small view of the whole file, not a thumbnail • AutoCAD doesn’t adhere to the Industry standard • These files only display correctly because MicroStation is installed, and is responsible for displaying a thumbnail • Autodesk may have fixed this in later versions?

  33. Lab 5a • Open PW Administrator • Log in as> adminpw • Drill down to: • Document Processors> Thumbnail Extraction • Right-click, choose> Properties • Similar to Full Text Indexing • But actually less involved

  34. Lab 5b • Turn on • adminpw • adminpw • Set to 60

  35. Lab 5c • Enable all times in the schedule • Set to 2

  36. Lab 5d • No changes required on the File Type Associations tab • Press OK to complete the configuration and close the dialog • Within a few minutes, thumbnails should show up in the preview pane

  37. Analysis • Thumbnails are extracted and stored in the PW database • Because document storage may not be local • Thus “touching” the document to see thumbnail in real-time is not practical • Thumbnail notes • Requires less processing than full text • MicroStation not running in this process • Requires PWOF to extract, but not to display

  38. Review • Topics covered in this Chapter • Full text Indexing – Configuration • Full Text Searches • ProjectWise Orchestration Framework • Thumbnail Extraction • Microsoft Indexing Service • And iFilters to extend default supported file types • (I have a free Visio, and MSG iFilter from Microsoft)
