This study applies memory-based learning to the ordering of prenominal adjectives, building on work by Robert Malouf (2000). It covers the method, the data-cleaning considerations, and the results obtained on adjective sequences from the British National Corpus. The features examined are morphological features, positional probabilities, and their combination. The combined feature set reaches 90.17% accuracy, slightly below Malouf's 91.85%, and data sparsity remains a problem.
Preventing Sexual Unprotected Intercourse: Prenominal Adjective Ordering • Ben Newman, Chris Collette
Motivation • Leather Old Green Chair • Ninja Mutant Teenage Turtles • Moral Irish High Standards • Sleeping Green Bag • Green Sleeping Bag • …and Sexual Unprotected Intercourse
Outline • Context • Method • Considerations • Memory-based Learning • Features • Results
Context • Prenominal Adjective Ordering • Statistics based on establishment of semantic classes • Building off of work by Robert Malouf (2000) • Ordering on a bigram level • Sparsity • Simplicity • Generally established approximation: • Size/length/shape < old/new/young < color < nationality < style < gerund < denominal • A < B means a class A adjective should precede a class B adjective
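The class-based approximation on this slide can be sketched in a few lines of Python. This is only an illustration of the "A < B" rule, not the method used in the study: the short class names (`size`, `age`, and so on) are shorthand for the classes listed above, and the toy `LEXICON` is hypothetical.

```python
# Approximate class precedence from the slide, earliest position first.
# "size" stands for size/length/shape and "age" for old/new/young.
CLASS_ORDER = ["size", "age", "color", "nationality", "style", "gerund", "denominal"]
RANK = {name: i for i, name in enumerate(CLASS_ORDER)}

# Hypothetical toy lexicon; a real one would need far broader coverage.
LEXICON = {
    "high": "size",
    "old": "age",
    "green": "color",
    "irish": "nationality",
    "sleeping": "gerund",
    "leather": "denominal",
}

def order_adjectives(adjectives):
    """Sort adjectives by class rank; unknown words go last, in their given order."""
    def key(word):
        cls = LEXICON.get(word.lower())
        return RANK.get(cls, len(CLASS_ORDER))
    return sorted(adjectives, key=key)  # sorted() is stable

print(order_adjectives(["leather", "old", "green"]))  # ['old', 'green', 'leather']
```

A lookup like this fails for any adjective missing from the lexicon, which is one way to see why the slides move to learned features instead of word lists.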
Method • Considerations • Capitalization • Turned into a feature • Non-Alphabetic Characters (&eacute; for é) • Left them in as extra information • Artificial Frequency of Rare Sequences • e.g. <Nationality> <adjective> in specific articles • Removed matching adjacent adjective sequences • Multi-word adjectives • Used POS tags as delimiters
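Two of these cleaning steps, extracting adjective pairs using POS tags and dropping matching adjacent sequences, might look roughly like the sketch below. It assumes the input is a list of `(word, tag)` tuples and that adjectives carry the BNC tags `AJ0`, `AJC`, or `AJS`; the slides do not say which tags were used, and the function names are hypothetical. It also reads "matching adjacent" as "the same pair repeated back to back in the extracted list", which is one possible interpretation.

```python
from itertools import groupby

ADJ_TAGS = ("AJ0", "AJC", "AJS")  # assumed adjective tags, not confirmed by the slides

def adjective_pairs(tagged_tokens, adj_tags=ADJ_TAGS):
    """Collect (adj, adj) pairs from consecutive tokens that are both tagged as adjectives.

    Each (word, tag) tuple is treated as one unit, so a multi-word adjective
    carrying a single tag stays together.
    """
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 in adj_tags and t2 in adj_tags:
            pairs.append((w1, w2))
    return pairs

def drop_adjacent_duplicates(pairs):
    """Collapse runs of identical pairs so one article cannot inflate a rare sequence."""
    return [pair for pair, _ in groupby(pairs)]

tokens = [("the", "AT0"), ("old", "AJ0"), ("green", "AJ0"), ("chair", "NN1")]
print(adjective_pairs(tokens))  # [('old', 'green')]
```

Only back-to-back repeats are collapsed here, so the same pair can still be counted again when it recurs later in the corpus.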
Method • Corpus: British National Corpus • 100 million words • 415,731 Adj Adj sequences • 404,686 sequences after adjacent duplicate removal • Memory-based Learning • Tilburg Memory-Based Learner • Order adjective Bigrams based on array of features • Everything is either ordered correctly or not • No precision versus recall
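The idea behind memory-based learning can be illustrated with a tiny nearest-neighbour classifier: store every training instance, then label a new instance by its closest stored neighbours. The sketch below is not the Tilburg Memory-Based Learner and does not reproduce its options such as feature weighting; it assumes a plain overlap distance (the number of mismatching feature values), and the class name and the `keep`/`swap` labels are hypothetical.

```python
from collections import Counter

def overlap_distance(a, b):
    """Number of feature positions where two instances differ."""
    return sum(x != y for x, y in zip(a, b))

class MemoryBasedClassifier:
    """Stores all training instances and classifies by majority vote of the k nearest."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, instances, labels):
        self.memory = list(zip(instances, labels))
        return self

    def predict(self, instance):
        nearest = sorted(
            self.memory, key=lambda item: overlap_distance(item[0], instance)
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]

def accuracy(model, instances, labels):
    """Share of pairs ordered correctly; each pair is simply right or wrong."""
    correct = sum(model.predict(x) == y for x, y in zip(instances, labels))
    return correct / len(labels)

train_x = [("o", "l", "d", "g", "r", "n"), ("g", "r", "n", "o", "l", "d")]
train_y = ["keep", "swap"]  # keep = pair is already in the right order
model = MemoryBasedClassifier(k=1).fit(train_x, train_y)
print(model.predict(("o", "l", "d", "r", "e", "d")))  # keep
```

Because every test pair gets exactly one decision, a single accuracy figure is enough, which matches the "no precision versus recall" point on the slide.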
Method • Features • Morphological • Last 8 characters of each Adj. as 16 individual features • First letter capitalization as well • Nationality and short word extra information • Improved test set accuracy by 0.14% • Brute Force • Lists of words for semantic classes • Lowered Accuracy • Positional Probabilities • Probability, estimated from the corpus, that a word comes first in an adjective pair • Combination
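The morphological and positional features could be computed along the following lines. This is a sketch under stated assumptions rather than the feature extractor used in the study: the padding symbol for adjectives shorter than 8 characters, the function names, and the exact estimate of the positional probability (times seen first divided by times seen in any pair) are all assumptions, and the extra nationality and short-word information is not reproduced.

```python
from collections import Counter

PAD = "_"  # assumed filler for adjectives shorter than n characters

def morph_features(adj, n=8):
    """Last n characters of one adjective, left-padded, one feature per character."""
    return list(adj[-n:].rjust(n, PAD))

def pair_features(first, second, n=8):
    """2 * n character features plus one capitalization flag per adjective."""
    caps = [first[:1].isupper(), second[:1].isupper()]
    return morph_features(first, n) + morph_features(second, n) + caps

def positional_probabilities(pairs):
    """Estimate, per word, how often it is the first member of the pairs it occurs in."""
    first_counts = Counter(a for a, _ in pairs)
    total_counts = Counter()
    for a, b in pairs:
        total_counts[a] += 1
        total_counts[b] += 1
    return {word: first_counts[word] / total_counts[word] for word in total_counts}

pairs = [("old", "green"), ("old", "leather"), ("green", "leather")]
print(positional_probabilities(pairs))  # {'old': 1.0, 'green': 0.5, 'leather': 0.0}
print(len(pair_features("old", "Irish")))  # 18
```

The combined setting on the slide would then amount to appending the two positional probabilities to the character and capitalization features of each pair.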
Results • Accuracies: • Morphological: 89.47% • Positional Probabilities: 89.02% • Combined: 90.17% • Analysis • Accurate • Exact effects of individual features and considerations difficult to extract • Less than Malouf’s 91.85% • Likely due to data cleaning (adjacent sequence removal) • Data sparsity continual problem