 
                
1. Text Mining Overview  Piotr Gawrysiak
gawrysia@ii.pw.edu.pl
Warsaw University of Technology
Data Mining Group 
2. Topics Natural Language Processing 
Text Mining vs. Data Mining
The toolbox
Language processing methods
Single document processing
Document corpora processing
Document categorization – a closer look
Applications
Classic
Profiled document delivery
Related areas
Web Content Mining & Web Farming 
3. Natural Language Processing Natural language – test for Artificial Intelligence
Alan Turing 
NLP and NLU 
4. Information explosion 
5. Data Mining The information explosion is a problem not only for text but for all other kinds of data as well. Here data mining comes to the rescue.
6. Knowledge pyramid 
7. Text Mining – a definition TM can be described as a statistical method, because KDD is mostly based on statistics.
8. Text Mining tools Linguistic analysis
Thesauri, dictionaries, grammar analysers etc.
Machine translation
Automatic feature extraction
Automatic summarization
Document categorization
Document clustering
Information retrieval
Visualization methods 
9. Language analysis 
10. Thesaurus construction 
11. Machine translation 
12. Fully automatic approach 
13. Feature extraction 
14. Document summarization New area – multimedia document summarization
15. Document categorization & clustering 
16. Categorization/clustering system 
17. Information retrieval 
18. IR – exact match 
19. IR – fuzzy search 
20. Document visualization 
21. Document visualization 
22. Document categorization A closer look 
23. Measuring quality 
24. Metrics Accuracy is the probability that a document chosen at random from the set D is classified correctly. Precision is the probability that a document chosen at random from those the system deemed relevant is in fact relevant. Recall is the probability that a document that is actually relevant is recognized as such by the system. Fallout is the probability that a document that is in fact not relevant is incorrectly deemed relevant.
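The four metrics described above (accuracy, precision, recall, fallout) can be written as ratios of confusion-matrix counts for a single category. The following is a minimal sketch, not the author's implementation: the names Confusion and metrics are hypothetical, and returning 0.0 for an empty denominator is an assumption made here for convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Counts for one category over a document set D (hypothetical helper)."""
    tp: int  # relevant documents the system marked relevant
    fp: int  # non-relevant documents the system marked relevant
    fn: int  # relevant documents the system marked non-relevant
    tn: int  # non-relevant documents the system marked non-relevant


def _ratio(numerator: int, denominator: int) -> float:
    # Assumption: an undefined ratio (empty denominator) is reported as 0.0.
    return numerator / denominator if denominator else 0.0


def metrics(c: Confusion) -> dict:
    total = c.tp + c.fp + c.fn + c.tn
    return {
        # P(correct decision) for a random document from D
        "accuracy": _ratio(c.tp + c.tn, total),
        # P(actually relevant | deemed relevant)
        "precision": _ratio(c.tp, c.tp + c.fp),
        # P(deemed relevant | actually relevant)
        "recall": _ratio(c.tp, c.tp + c.fn),
        # P(deemed relevant | actually not relevant)
        "fallout": _ratio(c.fp, c.fp + c.tn),
    }


if __name__ == "__main__":
    # Illustrative counts only, not results from the presentation.
    for name, value in metrics(Confusion(tp=40, fp=10, fn=20, tn=30)).items():
        print(f"{name}: {value:.3f}")
```

With these illustrative counts, accuracy is 70/100, precision 40/50, recall 40/60 and fallout 10/40, which shows how a classifier can have reasonable precision while still missing a third of the relevant documents.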
 
25. Multiple class scenario 
26. Categorization example 
27. Document representations 
28. Bigram example 
29. Probabilistic interpretation 
30. Positional representation 
31. Creating positional representation 
32. Examples 
33. Processing representations 
34. Expanding and trimming 
35. Representation processing 
36. Attribute selection 
37. Attribute space remapping 
38. Applications 
39. Thank you Plato wrote in Phaedrus that the art of writing may be lethal to our knowledge and wisdom, because human beings will no longer rely on their memory and will instead recall everything from potentially misleading external sources.