Discussion Class 3: The Porter Stemmer
Discussion Classes

Format:
    Questions. Ask a member of the class to answer.
    Provide opportunity for others to comment.

When answering:
    Stand up. Give your name. Make sure that the TA hears it.
    Speak clearly so that all the class can hear.

Suggestions:
    Do not be shy about presenting partial answers.
    Differing viewpoints are welcome.
Question 1: Stemming

Who wrote this paper? When? For what audience?
Define the terms: stem, suffix, prefix, conflation.
What makes a good stemming algorithm? How would you measure it?
Porter proposes a criterion for removing suffixes. What is it? Do you agree with it?
Question 2: Effectiveness

    Earlier system          Present system
    precision   recall      precision   recall
        0       57.24           0       58.60
       10       56.85          10       58.13
       20       52.85          20       53.92
       30       42.61          30       43.51
       40       42.20          40       39.39
       50       39.06          50       38.85
       60       32.86          60       33.18
       70       31.64          70       31.19
       80       27.15          80       27.52
       90       24.59          90       25.85
      100       24.59         100       25.85

Explain the data in this table.
The paper calls this "the standard recall cutoff method". Have you any comments?
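One way to start reading the table is to compare the two systems level by level. The sketch below does only that arithmetic. It assumes that in each pair of columns the first number (0 to 100) is the recall level in percent and the second is the precision in percent, which is an interpretation of the table and not something the slide states. The names `compare` and `levels_where_worse` are made up for this illustration.

```python
# Values copied from the table. Assumption: in each pair, the first number
# is the recall level (%) and the second number is the precision (%).
RECALL_LEVELS = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
EARLIER = [57.24, 56.85, 52.85, 42.61, 42.20, 39.06, 32.86, 31.64, 27.15, 24.59, 24.59]
PRESENT = [58.60, 58.13, 53.92, 43.51, 39.39, 38.85, 33.18, 31.19, 27.52, 25.85, 25.85]


def compare(levels, earlier, present):
    """Return (level, present - earlier) pairs, rounded to two decimals."""
    return [(lvl, round(p - e, 2)) for lvl, e, p in zip(levels, earlier, present)]


def levels_where_worse(levels, earlier, present):
    """Return the levels at which the present system scores below the earlier one."""
    return [lvl for lvl, diff in compare(levels, earlier, present) if diff < 0]


if __name__ == "__main__":
    for lvl, diff in compare(RECALL_LEVELS, EARLIER, PRESENT):
        print(f"level {lvl:3d}: present - earlier = {diff:+.2f}")
    print("present system lower at:", levels_where_worse(RECALL_LEVELS, EARLIER, PRESENT))
```

The differences are small in both directions, which is worth keeping in mind when discussing whether the table shows a real improvement.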
Question 3: Categories of Stemmer

The following diagram illustrates the various categories of stemmer. Porter's algorithm is shown by the red path. What do these terms mean?

    Conflation methods
        Manual
        Automatic (stemmers)
            Affix removal
                Longest match
                Simple removal
            Successor variety
            Table lookup
            n-gram
Question 4: Mechanics Step 1a

The paper gives the following example of Step 1a. Explain what this step does.

    Suffix   Replacement   Examples
    sses     ss            caresses -> caress
    ies      i             ponies -> poni
                           ties -> ti
    ss       ss            caress -> caress
    s        null          cats -> cat
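The Step 1a table can be sketched as a chain of suffix tests, tried from the longest suffix to the shortest so that only one rule fires per word. This is a minimal illustration of the table as shown on the slide, assuming lowercase input. The function name `step_1a` is made up here and is not taken from the paper.

```python
def step_1a(word: str) -> str:
    """Apply the Step 1a rules from the table to a lowercase word."""
    # Rules are tested longest suffix first; the first match wins.
    if word.endswith("sses"):
        return word[:-2]   # sses -> ss
    if word.endswith("ies"):
        return word[:-2]   # ies -> i
    if word.endswith("ss"):
        return word        # ss -> ss (unchanged, and blocks the "s" rule)
    if word.endswith("s"):
        return word[:-1]   # s -> null
    return word


if __name__ == "__main__":
    for w in ["caresses", "ponies", "ties", "caress", "cats"]:
        print(w, "->", step_1a(w))
```

Note the role of the `ss -> ss` rule: it changes nothing itself, but because it matches first it stops the final `s` rule from turning "caress" into "cares".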
Question 5: Mechanics Step 1b

    Conditions   Suffix   Replacement   Examples
    (m > 0)      eed      ee            feed -> feed
                                        agreed -> agree
    (*v*)        ed       null          plastered -> plaster
                                        bled -> bled
    (*v*)        ing      null          motoring -> motor
                                        sing -> sing

(a) Explain this table.
(b) How does this table apply to: "exceeding", "ringed"?
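The three rules in this table can be sketched as follows. The sketch covers only the rows shown on the slide and leaves out the follow-up cleanup rules that the paper applies after a successful `ed` or `ing` removal. It assumes lowercase input, the paper's definition of a consonant (a letter other than a, e, i, o, u, and other than y preceded by a consonant), and that m counts the vowel-consonant sequences in the stem. All function names are made up for this illustration.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if not a and b)


def contains_vowel(stem: str) -> bool:
    """The *v* condition: the stem contains at least one vowel."""
    return any(not is_consonant(stem, i) for i in range(len(stem)))


def step_1b_table(word: str) -> str:
    """Apply only the three rules shown in the table."""
    # The longest matching suffix is chosen; if its condition fails,
    # no other rule in this step is tried.
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    for suffix in ("ed", "ing"):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            return stem if contains_vowel(stem) else word
    return word


if __name__ == "__main__":
    for w in ["feed", "agreed", "plastered", "bled", "motoring", "sing"]:
        print(w, "->", step_1b_table(w))
```

The same function can be tried on "exceeding" and "ringed" to check an answer to part (b).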
Question 6: Mechanics Step 5a

Step 5a is defined as follows. What does this do and why?

    (m>1) E ->                  probate -> probat
                                rate -> rate
    (m=1 and not *o) E ->       cease -> ceas
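Step 5a can be sketched in the same style. The sketch assumes lowercase input and the paper's definitions: m counts the vowel-consonant sequences in the stem, and the `*o` condition holds when the stem ends consonant-vowel-consonant with the final consonant not w, x or y. The function names are made up for this illustration.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def measure(stem: str) -> int:
    """Count vowel-to-consonant transitions, i.e. m in [C](VC)^m[V]."""
    kinds = [is_consonant(stem, i) for i in range(len(stem))]
    return sum(1 for a, b in zip(kinds, kinds[1:]) if not a and b)


def ends_cvc(stem: str) -> bool:
    """The *o condition: stem ends c-v-c and the last c is not w, x or y."""
    n = len(stem)
    if n < 3:
        return False
    return (
        is_consonant(stem, n - 3)
        and not is_consonant(stem, n - 2)
        and is_consonant(stem, n - 1)
        and stem[-1] not in "wxy"
    )


def step_5a(word: str) -> str:
    """Remove a final e when the remaining stem is long enough."""
    if word.endswith("e"):
        stem = word[:-1]
        m = measure(stem)
        if m > 1 or (m == 1 and not ends_cvc(stem)):
            return stem
    return word


if __name__ == "__main__":
    for w in ["probate", "rate", "cease"]:
        print(w, "->", step_5a(w))
```

In "rate" the stem "rat" has m = 1 and ends consonant-vowel-consonant, so neither rule applies and the e is kept. In "cease" the stem "ceas" also has m = 1 but ends vowel-vowel-consonant, so the e is removed.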
Question 7: Ad hoc decisions

Discuss the following:

"The algorithm is careful not to remove a suffix when the stem is too short, the length of the stem being given by its measure, m. There is no linguistic basis for this approach. It was merely observed that m could be used quite effectively to help decide whether or not it was wise to take off a suffix."

(a) What is m?
(b) Why is it a reasonable measure?
(c) What anomalies does it produce?
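As a starting point for part (a), the measure can be computed mechanically. The sketch below assumes the paper's definitions: every word has the form [C](VC)^m[V], where C is a run of consonants and V a run of vowels, and a consonant is a letter other than a, e, i, o, u and other than y preceded by a consonant. It assumes lowercase input, and the function names are made up for this illustration.

```python
def is_consonant(word: str, i: int) -> bool:
    """True if word[i] is a consonant in Porter's sense."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a vowel only when preceded by a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def cv_pattern(word: str) -> str:
    """Map each letter to 'c' or 'v', e.g. 'trouble' -> 'ccvvccv'."""
    return "".join("c" if is_consonant(word, i) else "v" for i in range(len(word)))


def measure(word: str) -> int:
    """Count the VC sequences: each vowel-to-consonant transition adds 1."""
    pattern = cv_pattern(word)
    return sum(1 for a, b in zip(pattern, pattern[1:]) if a == "v" and b == "c")


if __name__ == "__main__":
    for w in ["tr", "ee", "tree", "y", "by",
              "trouble", "oats", "trees", "ivy",
              "troubles", "private", "oaten", "orrery"]:
        print(f"{w:10s} {cv_pattern(w):10s} m = {measure(w)}")
```

Running it on stems of different lengths is a quick way to look for the anomalies asked about in part (c).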
Question 8: Stemming in Web searching

(a) In Web search engines, the tendency is not to use stemming. Why? (There are several answers.)
(b) Does your answer to part (a) mean that stemming is no longer useful?