
CIS392 Text Processing, Retrieval, and Mining Spring 03

CIS392 Text Processing, Retrieval, and Mining, Spring 03. Instructor: Dr. Y. F. Brook Wu. BOW toolkit: http://www.cs.cmu.edu/~mccallum/bow. Log in to AFS. On campus: go to a computer lab in GITC 2305. At home: make sure your internet connection is established.


Presentation Transcript


  1. CIS392 Text Processing, Retrieval, and Mining, Spring 03
     Instructor: Dr. Y. F. Brook Wu
     BOW toolkit: http://www.cs.cmu.edu/~mccallum/bow
     Assign#1

  2. Log in to AFS
     • On campus: go to a computer lab in GITC 2305.
     • At home: make sure your internet connection is established.
     • Assuming you have Windows at home, click Start → Run.
     • Type “telnet afs1.njit.edu” (without the quotes; the first screen shows some useful information).
     • Enter your user name and password.
     • If your account doesn’t work, call the help desk at 973.596.2900; they can reset your password for you.

  3. Useful UNIX commands
     • Note: all filenames and commands on a UNIX system are case sensitive.
     • General syntax: command [option] argument
     • Options modify the way a command works, and they are optional.
     • Arguments are usually files; sometimes they are optional too.
     • Ex: rm -r directory_name

  4. Note
     • Typing two “-” next to each other in MS PowerPoint turns them into a single long dash, so some of the BOW and UNIX commands shown in these slides may not appear exactly as they should be typed. Please refer to the BOW help file and the UNIX documentation for their actual usage.

  5. Useful UNIX commands
     • man (manual), ex: man ls (manual for the ls command)
     • cd (change directory)
     • ls (list files and attributes)
     • dir (list files)
     • mkdir (create a directory)
     • rm (delete a file)
     • rm -fr directory_name (delete the whole directory and the files inside it)

  6. Useful UNIX commands
     • rmdir (remove an empty directory)
     • cp (copy)
     • pwd (print the current working directory)
     • pico (a text editor)
     • more filename (read a plain text file one screen at a time; press the space bar to continue and “q” to quit)
     • quota (show disk space usage)
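
To get a feel for the commands on slides 5 and 6, you could try a short practice session like the one sketched below. It works entirely inside a throwaway directory created with mktemp, so nothing in your account is touched. The names scratch, practice, note.txt, copy.txt, and empty_dir are made up for this illustration.

```shell
#!/bin/sh
# Practice session in a throwaway directory (all names here are illustrative).
scratch=$(mktemp -d)          # create a temporary scratch directory
cd "$scratch"
pwd                           # print the current working directory

mkdir practice                # create a directory
cd practice                   # go inside it
echo "hello AFS" > note.txt   # create a small text file
cp note.txt copy.txt          # copy it
ls -l                         # list files and their attributes
rm copy.txt                   # delete one file
listing=$(ls)                 # capture what is left in the directory
echo "$listing"

cd ..
rm -fr practice               # delete the directory and everything inside it
mkdir empty_dir
rmdir empty_dir               # rmdir only works on empty directories
remaining=$(ls -A)            # the scratch directory should now be empty
cd /
rmdir "$scratch"              # clean up the scratch directory itself
```

Remember that rm and rm -fr do not ask for confirmation and there is no recycle bin, so double-check the name before pressing Enter.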

  7. More useful UNIX commands
     • http://www.njit.edu/CSD/Docs/unixcmds.html
     • http://www.njit.edu/Directory/Admin/CSD/Academic_Computing/Manuals/UNIX/UNIX.html

  8. How to create your home page on the AFS system
     • Help info: http://www-ec.njit.edu/ec_info/newuser/web/web.html
     • Execute this command at the UNIX prompt: /usr/ec/bin/home.page.setup
     • Your URL: http://www-ec.njit.edu/~yourusername

  9. Overview of Retrieval Experiment
     • Create a sub-directory for CIS392 assignments under ~your_user_name/public_html
     • Create 3 sub-directories under that directory, one for each of the 3 automatic indexing activities
     • Perform the 3 automatic indexing activities, each with a different set of options

  10. Overview of Retrieval Experiment (cont)
      • Perform 3 retrievals for each of the 3 automatic indexing activities
      • Analyze how the different indexing options affect retrieval
      • Make an html page to present your results

  11. Creating sub-directories
      • Change directory to public_html by typing: cd public_html
      • mkdir cis392 (now you’ve created a directory for your CIS392 retrieval assignments)
      • cd cis392 (go inside the cis392 directory)

  12. Creating three sub-directories
      • mkdir model1 (this directory stores results from the default settings: no stemming, and stop words removed)
      • mkdir model2 (this directory stores results from the following settings: no stemming, and stop words INCLUDED)
      • mkdir model3 (this directory stores results from the following settings: stemming, and stop words removed)
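
The directory layout from slides 11 and 12 can also be created in one go, as in the sketch below. WEB_ROOT is a hypothetical variable introduced only for this example: on AFS you would set it to your public_html directory, and if you leave it unset the script falls back to a temporary directory so you can try it safely.

```shell
#!/bin/sh
# WEB_ROOT is a hypothetical variable; on AFS it would be ~/public_html.
# If it is not set, a temporary directory is used instead.
WEB_ROOT="${WEB_ROOT:-$(mktemp -d)}"

mkdir -p "$WEB_ROOT/cis392"   # -p: no error if the directory already exists
cd "$WEB_ROOT/cis392"

# One sub-directory per indexing activity.
for model in model1 model2 model3; do
    mkdir -p "$model"
done

ls                            # should list model1, model2 and model3
```

To run it against your real account, start it as WEB_ROOT=~/public_html sh make_dirs.sh, where make_dirs.sh is whatever name you saved the script under.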

  13. URL of your retrieval experiment
      • http://www-ec.njit.edu/~yourusername/cis392/cis392re.html
      • See a sample page created by Prof. Wu: http://www-ec.njit.edu/~wu/cis392/cis392re.html

  14. Getting Access to BOW and the Test Collection
      • There are three directories under ~wu/IR_Tools:
      • bow (the BOW system); to execute BOW, change directory to ~wu/IR_Tools/bow/bin
      • som (the self-organizing map program; do NOT use it now!)
      • tc (the test collection, Library and Information Science Abstracts); the text is under ~wu/IR_Tools/tc/lisa/text, in the directories group0 to group5

  15. Test Collection: LISA
      • The sample queries are stored in ~wu/IR_Tools/tc/lisa/LISA.QUE
      • The relevant documents corresponding to the queries are stored in ~wu/IR_Tools/tc/lisa/LISA.REL (“-1” marks the end of each entry)

  16. Operating BOW’s Arrow program
      • Read the information on BOW’s web site (again, the URL is listed in the “Resources” section of the class syllabus).
      • Read Arrow’s help file (available on the syllabus page; you should print a copy of the help file).

  17. Automatic Indexing
      • To begin the retrieval tasks, you first need to index the whole document collection.
      • Specify lexing options (stop word removal and/or stemming) at this time.
      • arrow -d ~yourusername/public_html/cis392/model1 --index ~wu/IR_Tools/tc/lisa/text/*
      • The * sign is a wildcard that matches all files and directories under ~wu/IR_Tools/tc/lisa/text

  18. Automatic Indexing
      • The -d parameter specifies the directory where the statistics produced by indexing are stored. (You have to specify this directory both when you index and when you retrieve documents.)
      • The path after --index specifies the location of the text collection.
      • The default lexing settings for the above task are: NO stemming performed, and stop words REMOVED.
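
The three indexing runs for model1, model2, and model3 can be sketched as below. This is a dry run: it only prints the commands so you can check them before typing them yourself. The option names --no-stoplist and --use-stemming are assumed here to be the BOW lexing options for keeping stop words and for turning stemming on; please confirm them in Arrow's help file before relying on them. yourusername is a placeholder, and print_index_cmd is a helper name made up for this sketch.

```shell
#!/bin/sh
# Dry run: print (do not execute) one indexing command per model directory.
# ASSUMPTION: --no-stoplist and --use-stemming are the lexing option names;
# verify them in Arrow's help file.
print_index_cmd() {
    # $1 = model directory name, $2 = extra lexing option (may be empty)
    printf 'arrow -d ~yourusername/public_html/cis392/%s %s--index ~wu/IR_Tools/tc/lisa/text/*\n' \
        "$1" "${2:+$2 }"
}

index_cmds=$(
    print_index_cmd model1 ""                 # defaults: no stemming, stop words removed
    print_index_cmd model2 "--no-stoplist"    # no stemming, stop words included
    print_index_cmd model3 "--use-stemming"   # stemming, stop words removed
)
printf '%s\n' "$index_cmds"
```

Each run writes its statistics into a different -d directory, which is what keeps the three models from overwriting one another.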

  19. Query assigned for retrieval
      • Please refer to the retrieval experiment section of the online syllabus to see which query you are assigned for the experiment (http://web.njit.edu/~wu/teaching/sp03/CIS392/CIS392-Sp03.htm).

  20. Retrieval
      • First specify where the indexing statistics are stored, and then the query to be performed.
      • arrow -d ~yourusername/public_html/cis392/model1 --num-hits-to-show=25 --query > ~yourusername/public_html/cis392/model1/retrieved_docs
      • The greater-than sign (>) redirects the output to a file; the path after it gives the output filename and where it will be stored.
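
Since the same retrieval has to be repeated against each model directory, the commands can be generated with a small loop, as sketched below. Again this is a dry run that only prints the commands. It assumes that Arrow, when --query is given without a file name, reads the query text from the keyboard; check the help file for the exact behavior. yourusername is a placeholder.

```shell
#!/bin/sh
# Dry run: print (do not execute) the retrieval command for each model.
# ASSUMPTION: --query without a file name reads the query from the keyboard.
retrieval_cmds=$(
    for model in model1 model2 model3; do
        # The tilde stays literal inside double quotes; it is expanded
        # only when you type the printed command at the prompt.
        dir="~yourusername/public_html/cis392/$model"
        printf 'arrow -d %s --num-hits-to-show=25 --query > %s/retrieved_docs\n' \
            "$dir" "$dir"
    done
)
printf '%s\n' "$retrieval_cmds"
```

Note that > overwrites an existing file without warning. If you store more than one retrieval in the same directory, give each output file its own name.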

  21. Presenting your retrieval experiment (RE)
      • Create a page named cis392re.html under your ~/public_html/cis392 directory.
      • This page should contain several pieces of information; see: http://web.njit.edu/~wu/cis392/cis392re.html

  22. Presenting your retrieval experiment (RE)
      • You can create this html page with the pico editor in UNIX (if you know basic html tags), Microsoft Word (save the file in html format), or Netscape Composer.
      • If you use an html editor, you might need FTP software: http://www.zdnet.com/downloads/stories/info/0,10615,30994,00.html
      • Before the due date: please check all items on your html page and make sure all of them are displayed properly.
      • After the due date: do not make changes. I can check when the files were last updated.
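
If you go the pico route, a bare-bones starting point might look like the sketch below, which writes a skeleton cis392re.html from the UNIX prompt. The headings and links are placeholders only; the items your page actually needs are the ones shown on the sample page. PAGE_DIR is a hypothetical variable: on AFS it would be ~/public_html/cis392, and if unset it falls back to a temporary directory.

```shell
#!/bin/sh
# PAGE_DIR is a hypothetical variable; on AFS it would be ~/public_html/cis392.
PAGE_DIR="${PAGE_DIR:-$(mktemp -d)}"

# Write a skeleton page; the quoted 'EOF' keeps the text exactly as typed.
cat > "$PAGE_DIR/cis392re.html" <<'EOF'
<html>
<head><title>CIS392 Retrieval Experiment</title></head>
<body>
<h1>CIS392 Retrieval Experiment</h1>
<!-- Placeholder sections: replace with the items the sample page asks for. -->
<h2>Model 1: no stemming, stop words removed</h2>
<p><a href="model1/retrieved_docs">Retrieved documents</a></p>
<h2>Model 2: no stemming, stop words included</h2>
<p><a href="model2/retrieved_docs">Retrieved documents</a></p>
<h2>Model 3: stemming, stop words removed</h2>
<p><a href="model3/retrieved_docs">Retrieved documents</a></p>
</body>
</html>
EOF

ls -l "$PAGE_DIR/cis392re.html"   # shows the size and last-modified time
```

The ls -l line at the end shows the file's last-modified time, which is the timestamp that changes whenever the file is edited.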
