
TRANS: Transportation Research Analysis using NLP TechniqueS

Hyoungtae Cho, Melissa Egan, Ferhan Ture. Final Presentation, December 9, 2009. Project sponsor: Michael Pack, Director, Center for Advanced Transportation Technology Laboratory (CATT Lab), University of Maryland.


Presentation Transcript


  1. TRANS: Transportation Research Analysis using NLP TechniqueS • Hyoungtae Cho, Melissa Egan, Ferhan Ture • Final Presentation • December 9, 2009

  2. Project Sponsor • Michael Pack • Director, Center for Advanced Transportation Technology Laboratory (CATT Lab) • University of Maryland

  3. Outline • Motivation • Goals • Data • Methods • Clustering • Pairwise similarity • TRANS Demo • Future work • Conclusions

  4. Project motivation • Project was inspired by issues in the transportation research community. • First issue: Researchers in the field, including Michael Pack, have concerns about the inefficient use of funds due to repetitive research in the field. • Many research ideas and projects are repeatedly published with only slight repackaging. • It would be ideal if such projects could be detected at the time of their proposal.

  5. Project motivation, continued • Second issue: Categorization of research projects within the field. • Useful for: • Tracking the amount of research done in each sub-field. • Understanding research trends within the community. • Bringing researchers with similar interests together. • At the moment, these tasks are partially managed by the Transportation Research Board (TRB), but this is costly and not always effective. • Performing the tasks automatically will produce fast, cheap, and objective results. • Visualizing the results will make interpretation and analysis easier, and will communicate them to a larger portion of the community.

  6. Outline • Motivation • Goals • Data • Methods • Clustering • Pairwise similarity • TRANS Demo • Future work • Conclusions

  7. Project goals • First goal: Use natural language processing (NLP) techniques to analyze the research statements from past years. • Build a system that can • detect statements that are very similar, and • classify each statement with a topic/category. • Create visualizations to highlight interesting results. • E.g., trends in transportation research over the years

  8. Project goals, continued • Second goal: Create a web site to collect and analyze research ideas in the field. • Web site should: • Allow users to submit research needs statements or ideas. • Allow other users to vote on these ideas. • Generate appropriate visualizations to summarize research needs and interests.

  9. Outline • Motivation • Goals • Data • Methods • Clustering • Pairwise similarity • TRANS Demo • Future work • Conclusions

  10. Data Preprocessing • Extract research needs statements and paper abstracts from the TRB website using two crawlers • Crawler 1: research needs statements (fields: ID, Text, Date; IDs 1 to 809) • Crawler 2: paper abstracts (fields: ID, Text, Date, Address; IDs 1 to 9552)

  11. Clustering • An algorithm to group similar data points together • In our work, categories of statements and papers are not available • Idea: use clustering to group similar statements, then assign a category to each cluster

  12. Features • A weight is computed for each term • Tokenization removes stop words and truncates words to their stems • Each document is represented by a vector of feature weights • Example input: As a global recession of unprecedented scale threatens to engulf much of the United States economy, Congress and federal policy-makers have assembled a large package of government stimulus spending that can reverse job losses and revive consumer demand. Economists identify road construction as a good way to create jobs in the short-term and to boost economic productivity in the long-term by lowering transportation costs. As a result, highways feature prominently in the proposed Congressional economic stimulus bill and about $30 billion in new federal money for pavements, bridges, and tunnels is likely to flow to state departments of transportation (DOTs) in 2009 and 2010. • After tokenization (only unigrams): global recess unpreced scale threaten engulf unit state economi congress feder polici maker assembl larg packag govern stimulu spend can revers job loss reviv consum demand economist identifi road construct wai creat job short term boost econom product long term lower transport cost result highwai featur promin propos congression econom stimulu bill 30 billion new feder monei pavement bridg tunnel like flow state depart transport dot 2009 2010 • Example weights: global = 0.03 • recess = 0.7 • unpreced = 0.41 • …
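
The feature pipeline on this slide (tokenize, drop stop words, truncate to stems, weight each term) can be sketched as below. This is a minimal illustration, not the TRANS implementation: the slides do not say which stop-word list, stemmer, or weighting formula was used, so the short `STOP_WORDS` set, the `crude_stem` suffix stripper, and the tf-idf weighting are all assumptions introduced here.

```python
import math
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the list used by TRANS is not given.
STOP_WORDS = {"a", "an", "and", "as", "by", "for", "in", "is", "of", "that", "the", "to"}


def crude_stem(word):
    """Strip one common suffix. A hypothetical stand-in for a real stemmer."""
    for suffix in ("ation", "ing", "ion", "es", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words, stem (unigrams only)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return one {term: weight} dict per document, with weight = tf * idf.

    Assumption: idf = log(N / df), so a term that occurs in every document gets weight 0.
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors
```

Calling `tfidf_vectors` on a list of statement texts yields sparse vectors in which terms that are rare across the collection receive higher weights than terms shared by many documents.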

  13. k-means Clustering • User chooses number of clusters k (e.g., k=3) • k=3 documents randomly selected as 'centers' • For each document that has not been assigned to a cluster: find the distance from this document to each center, and assign it to the nearest one • The center of that cluster (cluster 1 in the example) is adjusted • Do the same for each unassigned document • Since clusters are not labeled by the algorithm, we look at the most frequent terms and manually decide on names (labels shown on the slide: Construction, Highways, Administration) • Figure: documents marked C1, C2, or C3 according to their assigned cluster
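
The steps on this slide can be sketched as a single sequential pass, shown below. This follows the slide literally (pick k documents as centers, assign each remaining document to the nearest center, adjust that center); standard k-means would repeat the assign-and-adjust cycle until assignments stop changing. The function name `sequential_kmeans`, the Euclidean distance, and the running-mean update are assumptions, since the slides do not state which distance or update rule TRANS uses.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sequential_kmeans(points, k, seed=0):
    """One pass of assign-then-adjust clustering, as described on the slide.

    Returns (assignments, centers), where assignments[i] is the cluster
    index of points[i] and centers[c] is the mean of cluster c's members.
    """
    rng = random.Random(seed)
    # k documents randomly selected as the initial centers.
    center_ids = rng.sample(range(len(points)), k)
    centers = [list(points[i]) for i in center_ids]
    sizes = [1] * k
    assignments = [None] * len(points)
    for cluster, i in enumerate(center_ids):
        assignments[i] = cluster

    for i, point in enumerate(points):
        if assignments[i] is not None:
            continue  # already serving as a center
        # Find the distance to each center and pick the nearest one.
        nearest = min(range(k), key=lambda c: euclidean(point, centers[c]))
        assignments[i] = nearest
        sizes[nearest] += 1
        # Adjust the center: running mean of all members seen so far.
        centers[nearest] = [
            m + (x - m) / sizes[nearest] for m, x in zip(centers[nearest], point)
        ]
    return assignments, centers
```

Naming the clusters remains a manual step: after clustering, the most frequent terms of each cluster's documents would be inspected and a label chosen by hand.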

  14. Outline • Introduction • Motivation • Data • Methods • Clustering • Pairwise similarity • TRANS Demo • Future work • Conclusions

  15. Pairwise Similarity • Given two documents, compute a similarity score between 0.0 (no similarity) and 1.0 (exactly the same) • Example scores: 0.28, 0.51, 0.83 • Can be used to detect duplicate work and generate "more like this" lists • Uses the same features as clustering
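
One way to obtain a score in the 0.0 to 1.0 range from term-weight vectors is cosine similarity, sketched below. The slides do not name the similarity measure TRANS uses, so cosine similarity is an assumption here, as are the helper names `cosine_similarity` and `most_similar` and the `{term: weight}` dictionary representation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse {term: weight} vectors.

    With non-negative weights the result lies in [0.0, 1.0]:
    0.0 when no terms are shared, 1.0 when the vectors point the same way.
    """
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_id, vectors, top_n=3):
    """Rank every other document by similarity to the query ("more like this")."""
    scores = [
        (other_id, cosine_similarity(vectors[query_id], vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]
```

For duplicate detection, pairs whose score exceeds a chosen threshold could be flagged for manual review; the threshold itself would need to be tuned on real statements.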

  16. Outline • Introduction • Motivation • Data • Methods • Clustering • Pairwise similarity • TRANS Demo • Future work • Conclusions

  17. Demo • TRANS Java Applet • TRANS Web Application

  18. Future Work • Better Features • Using N-gram features • Transportation Ontology • LDA Topic Presentation • Visualization for sub-categorization • Citation Network Analysis

  19. Conclusion • Implemented a transportation research visualization tool, TRANS • TRANS tool • TRANS website • Could be extended to other academic fields
