
Text Analysis Method Using Latent Topics for Field Notes in Area Studies

Taizo Yamada, Historiographical Institute, The University of Tokyo. Contribution: text analysis for Area Studies, applying a topic model to field notes in Area Studies.


Presentation Transcript


  1. Text Analysis Method Using Latent Topics for Field Notes in Area Studies. Taizo Yamada, Historiographical Institute, The University of Tokyo. PNC2013

  2. Contribution
     • Text analysis for Area Studies
       • We apply a topic model to a field note in Area Studies.
       • We use LDA (Latent Dirichlet Allocation) as the topic model.
       • Similar fragments or scenes in the field note can be obtained.
     • Visualization of the relationships between place names
       • The place information has no latitude or longitude.
       • We do not have any dictionary of place names.

  3. Outline
     • Background and purpose
     • Methodology of text analysis
       • Text structuring
       • Term extraction
       • Characterization of terms
       • Method of obtaining similar text fragments
     • Visualization and system
     • Conclusion

  4. Background
     • Recently, Area Studies has made remarkable progress.
       • Researchers in Area Studies can search and analyze large volumes of data easily and quickly,
       • using information technology such as web technology, data analysis, and data engineering.
     • To promote such analysis, researchers have published databases:
       • catalogues, images, statistical data, spatial data, and temporal data.
     • For further progress of the field,
       • we believe that text analysis is one of the essential elements.
       • A text such as a field note contains descriptions of sights, scenes, and customs,
       • but beyond these surface descriptions, latent topics or subjects can be key elements characterizing the area.

  5. Purpose
     • A text analysis method for field notes in Area Studies
       • We prepare a field note database in which the data unit is a description of a sight or a scene.
     • To detect latent topics, we use latent Dirichlet allocation (LDA).
       • LDA is a topic model.
       • In LDA, each text is viewed as a mixture of latent topics, and each topic as a mixture of words.
     • To detect the track of the investigator in a field note
       • Visualizing the track presents the relations between place names.

  6. Text (1)
     • Target: Koichi Takaya, “The Field note collection2 Sumatra” (in Japanese)
     • Period: 1984.10.19 to 1985.1.18
     • Area: the whole of Sumatra Island

  7. Text structuring (1)

  8. Text structuring (1)

  9. Text structuring (2)

  10. Term extraction (1)
     • Morphological analysis
       • MeCab + ipadic (morphological analyzer; dictionary)
     • Text (a scene): マングローブ。前面の海にはバガン( 魚取り用の櫓) いくつもある。
       (English: "Mangroves. In the sea in front there are many bagan (towers for catching fish).")
     • Result of morphological analysis:
         マングローブ  名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ
         。  記号,句点,*,*,*,*,。,。,。
         前面  名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン
         の  助詞,連体化,*,*,*,*,の,ノ,ノ
         海  名詞,一般,*,*,*,*,海,ウミ,ウミ
         に  助詞,格助詞,一般,*,*,*,に,ニ,ニ
         は  助詞,係助詞,*,*,*,*,は,ハ,ワ
         バガン  名詞,一般,*,*,*,*,*
         。  記号,句点,*,*,*,*,。,。,。
         魚  名詞,一般,*,*,*,*,魚,サカナ,サカナ
         取り  名詞,接尾,一般,*,*,*,取り,トリ,トリ
         用  名詞,接尾,一般,*,*,*,用,ヨウ,ヨー
         の  助詞,連体化,*,*,*,*,の,ノ,ノ
         櫓  名詞,一般,*,*,*,*,櫓,ロ,ロ
         。  記号,句点,*,*,*,*,。,。,。
         いくつ  名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ
         も  助詞,係助詞,*,*,*,*,も,モ,モ
         ある  動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
         。  記号,句点,*,*,*,*,。,。,。
         EOS
     • Legend: “名詞”: noun, “助詞”: postpositional particle, “記号”: symbol, “動詞”: verb

  11. Term extraction (2)
     • Input: the result of morphological analysis (the same MeCab output as on slide 10)
     • Bag-of-Words: Bakauhumi:1 マングローブ:1 前面:1 海:1 バガン:1 魚取り用:1 櫓:1 ココヤシ:1 下:1 家:1 チョウジ:1 斜面:1
     • Extraction target: nouns only
       • Some noun types, such as pronouns and numbers, are not extracted.
     • The number of distinct terms is 5,666.
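The step from MeCab output to a bag-of-words can be sketched as follows. This is a minimal illustration, not the author's confirmed implementation: it assumes MeCab's default ipadic output format (surface form, a tab, then comma-separated features), and the function name `bag_of_nouns` is hypothetical. The rule that attaches suffix nouns (名詞,接尾) to the preceding noun is also an assumption, chosen because it reproduces the compound term 魚取り用 shown in the slide's bag-of-words.

```python
from collections import Counter

# MeCab (ipadic) output for the scene on slide 10: "surface<TAB>features".
MECAB_OUTPUT = """\
マングローブ\t名詞,一般,*,*,*,*,マングローブ,マングローブ,マングローブ
。\t記号,句点,*,*,*,*,。,。,。
前面\t名詞,一般,*,*,*,*,前面,ゼンメン,ゼンメン
の\t助詞,連体化,*,*,*,*,の,ノ,ノ
海\t名詞,一般,*,*,*,*,海,ウミ,ウミ
に\t助詞,格助詞,一般,*,*,*,に,ニ,ニ
は\t助詞,係助詞,*,*,*,*,は,ハ,ワ
バガン\t名詞,一般,*,*,*,*,*
。\t記号,句点,*,*,*,*,。,。,。
魚\t名詞,一般,*,*,*,*,魚,サカナ,サカナ
取り\t名詞,接尾,一般,*,*,*,取り,トリ,トリ
用\t名詞,接尾,一般,*,*,*,用,ヨウ,ヨー
の\t助詞,連体化,*,*,*,*,の,ノ,ノ
櫓\t名詞,一般,*,*,*,*,櫓,ロ,ロ
。\t記号,句点,*,*,*,*,。,。,。
いくつ\t名詞,代名詞,一般,*,*,*,いくつ,イクツ,イクツ
も\t助詞,係助詞,*,*,*,*,も,モ,モ
ある\t動詞,自立,*,*,五段・ラ行,基本形,ある,アル,アル
。\t記号,句点,*,*,*,*,。,。,。
EOS
"""

# Noun subtypes that the slide says are not extracted: pronoun, number.
EXCLUDED_SUBTYPES = {"代名詞", "数"}


def bag_of_nouns(mecab_output):
    """Count nouns in MeCab output, joining suffix nouns to the noun before them."""
    counts = Counter()
    current = ""  # the noun (plus any attached suffixes) being built
    for line in mecab_output.splitlines():
        surface, _, feature_str = line.partition("\t")
        features = feature_str.split(",") + ["", ""]  # pad so EOS lines are safe
        pos, subtype = features[0], features[1]
        is_noun = pos == "名詞" and subtype not in EXCLUDED_SUBTYPES
        if is_noun and subtype == "接尾" and current:
            current += surface  # assumption: a suffix extends the previous noun
            continue
        if current:
            counts[current] += 1  # the previous term is complete
            current = ""
        if is_noun and subtype != "接尾":
            current = surface
    if current:
        counts[current] += 1
    return counts


bag = bag_of_nouns(MECAB_OUTPUT)
for term, count in bag.items():
    print(f"{term}:{count}")
```

The pronoun いくつ is dropped, and 魚, 取り, and 用 are merged into one term, which matches the entries for this scene in the slide's bag-of-words.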

  12. Term extraction (3)
     • Field note excerpt (the extracted terms are marked up on the slide):
         720km: Jakarta 出発
         830km: Bakauhumi(*1)
           ① マングローブ。前面の海にはバガン( 魚取り用の櫓) いくつもある。
           ② ココヤシ多い。この下に少し家ある。
           ③ チョウジの多い斜面。
         853km: 稲。今若実り。
         54km: このあたりよりチョウジ多くなる。その下を時に耕している。トウモロコシを植えるらしい。
         70km: 水田をよく見る。東に海見える。
         77-79km: ココヤシが多い。時に水田あり、それ実っている。
         85km: ココヤシ園広い。時にチョウジがある。
         90km: 西海岸に来る。マングローブあるが、その背後にはココヤシ多い。
         97km: チョウジが多い。この辺りは殆どがジャワ人だという。
         01km: Sidomulyo。周り、シラス台地。
         11km: 5 ~ 10 年生のココヤシ多い。他に、チョウジ、バナナ、ランブータン、ドリアン。
         18km: 左の海にはバガンが100 基ほど見える。
         22km: 海岸は広くココヤシ。これ60 年生。高みはチョウジ多い。
     • Mark up the extracted terms.
       • The terms may characterize the scene in the text.
       • The extracted terms differ from scene to scene.
     • What features do the terms have?
       • We need a method for detecting these features,
       • but we do not have any thesaurus or dictionary.
       • Therefore, we introduce a topic model to detect them.
       • Using a topic model, we can detect latent topics as the features.

  13. Using topic model (1)
     • We use LDA (Latent Dirichlet Allocation) as the topic model.
     • Topic model
       • Models the co-occurrence of terms.
       • The result gives a classification of terms.
     • Kinds of topic model
       • LSI (Latent Semantic Indexing): introduces latent topics into the VSM (Vector Space Model).
       • PLSI (Probabilistic Latent Semantic Indexing): redefines LSI as a probabilistic model.
       • LDA: improves PLSI through Bayesian learning.

  14. Using topic model (2)
     • LDA: D. M. Blei et al., “Latent Dirichlet Allocation”, 2003.
       • A document generation model in which the generating probability of latent topics follows a Dirichlet distribution.
       • Latent topics can be determined once the parameters of LDA have been estimated.
     • Parameters of LDA
       • The notation covers the latent topic (z), the generating probability, the document (d), the term, and the total number of terms in d.
       • Dir: Dirichlet distribution

  15. Using topic model (2)
     • LDA: D. M. Blei et al., “Latent Dirichlet Allocation”, 2003.
       • A document generation model in which the generating probability of latent topics follows a Dirichlet distribution.
       • Latent topics can be determined once the parameters of LDA have been estimated.
     • Parameters of LDA
       • The notation covers the latent topic (z), the generating probability, the document (d), the term, and the total number of terms in d.
       • Dir: Dirichlet distribution
     • Generative process
       • θ is generated from α.
       • A topic is generated according to θ.
       • A term is generated according to the topic z_k and β.
       • A document is generated from its terms.
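For reference, the generative process described on this slide can be written out in the standard notation of Blei et al. (2003). The slide's own formula did not survive in the transcript, so the symbols below are the conventional ones and are assumptions about the slide's notation: θ_d is the topic mixture of document d, z_{d,n} the topic of the n-th term, w_{d,n} the n-th term, and N_d the number of terms in d.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dir}(\alpha) \\
z_{d,n} \mid \theta_d &\sim \mathrm{Multinomial}(\theta_d),
  \qquad n = 1, \dots, N_d \\
w_{d,n} \mid z_{d,n}, \beta &\sim \mathrm{Multinomial}(\beta_{z_{d,n}}) \\
p(\theta_d, \mathbf{z}_d, \mathbf{w}_d \mid \alpha, \beta)
  &= p(\theta_d \mid \alpha)
     \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d)\,
                       p(w_{d,n} \mid z_{d,n}, \beta)
\end{align*}
```

The last line is the joint probability of one document: the Dirichlet draw of θ_d, followed by one topic draw and one term draw per position.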

  16. Detection of latent topics
     • Features of LDA
       • Text
         • A set of terms
         • Has multiple topics
       • Term
         • Belongs to multiple topics,
         • not only to one specific topic
     • Spatial change (scene change)
       • Visualizing the detection results lets us follow the change.
       • Latent topics change as the location changes.
     • Which scenes, then, are similar to each other?
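To make the idea that each scene has multiple topics concrete, here is a small sketch of LDA inference in plain Python. The slides do not say which inference method or software was used, so collapsed Gibbs sampling is a swapped-in, commonly used technique here. The function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the number of topics, and the toy scenes (built from terms that appear on the slides) are all assumptions for illustration.

```python
import random


def lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns (doc_topic, topic_word): per-document topic distributions and
    per-topic term distributions, both smoothed by alpha and beta.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    n_dk = [[0] * num_topics for _ in docs]                      # topic counts per doc
    n_kw = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]  # term counts per topic
    n_k = [0] * num_topics                                       # total terms per topic

    # Random initial topic assignment for every term occurrence.
    assign = []
    for d, doc in enumerate(docs):
        topics = []
        for w in doc:
            k = rng.randrange(num_topics)
            topics.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        assign.append(topics)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assign[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    doc_topic = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in n_dk[d]]
        for d, doc in enumerate(docs)
    ]
    topic_word = [
        {w: (n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in vocab}
        for k in range(num_topics)
    ]
    return doc_topic, topic_word


# Toy scenes (hypothetical groupings of terms taken from the slides).
scenes = [
    ["マングローブ", "海", "バガン", "櫓"],
    ["ココヤシ", "家"],
    ["チョウジ", "斜面"],
    ["ココヤシ", "水田"],
    ["マングローブ", "ココヤシ"],
]
doc_topic, topic_word = lda_gibbs(scenes, num_topics=2)
for scene, mixture in zip(scenes, doc_topic):
    print(scene, [round(p, 2) for p in mixture])
```

Each row of `doc_topic` is a mixture over topics for one scene, and each entry of `topic_word` gives a term a probability under every topic, so a term is never tied to a single topic. With so few scenes the topics themselves are not meaningful; the sketch only shows the shape of the output.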

  17. Similarity between texts (1)
     • We introduce the VSM (Vector Space Model).
       • The VSM requires a feature vector for each text.
       • Each element of the vector is the total number of terms per topic.
       • The similarity between two vectors is calculated as their cosine similarity.
     • Notation
       • x, y: texts (scenes)
       • The weight of topic z_k in text x is given by tf-idf weighting.
       • tf: the frequency of topic z_k in text x
       • df: the number of texts that contain topic z_k
       • N: the number of texts
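The similarity computation described above can be sketched as follows. The slide's formula did not survive in the transcript, so the exact weighting is an assumption: this sketch uses the common form weight = tf × log(N / df), where tf is the number of terms of a text assigned to a topic, df the number of texts containing that topic, and N the number of texts. The function names and the toy counts are hypothetical.

```python
import math


def tfidf_vectors(topic_counts):
    """Turn per-text topic counts into tf-idf weighted feature vectors.

    topic_counts[x][k] is the number of terms in text x assigned to topic k.
    """
    n_texts = len(topic_counts)
    n_topics = len(topic_counts[0])
    # df[k]: number of texts in which topic k occurs at least once.
    df = [sum(1 for row in topic_counts if row[k] > 0) for k in range(n_topics)]
    # Assumed idf form: log(N / df); topics that never occur get weight 0.
    idf = [math.log(n_texts / df[k]) if df[k] else 0.0 for k in range(n_topics)]
    return [[row[k] * idf[k] for k in range(n_topics)] for row in topic_counts]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data: three scenes, three topics (illustrative counts only).
counts = [
    [3, 0, 1],
    [2, 0, 1],
    [0, 4, 0],
]
vectors = tfidf_vectors(counts)
sim_01 = cosine(vectors[0], vectors[1])
sim_02 = cosine(vectors[0], vectors[2])
print(round(sim_01, 3), round(sim_02, 3))
```

Scenes 0 and 1 share the same topics in similar proportions and score close to 1, while scene 2 shares no topic with scene 0 and scores 0. One consequence of this idf form is that a topic occurring in every text receives weight 0.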

  18. Similarity between texts (2)

  19. Track of investigation (1)
     • Beginning of the text
       • Date: Oct. 19, '84
       • “Jakarta よりKotabumiへ行く。” (English: "Go from Jakarta to Kotabumi.")
       • The text describes a movement from “Jakarta” to “Kotabumi”.
     • Tracking the movement
       • Extract place names.
       • Rule:
         • from: ○○[から|より|出発|…]
         • to: ○○[へ|まで|に|泊|…]
       • Unfortunately, we do not have any dictionaries or gazetteers.
       • For the time being, the extracted place names are simply connected to each other.
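The from/to rule above can be sketched with regular expressions. This is an illustration under stated assumptions, not the author's implementation: it assumes place names are written in Latin letters starting with a capital, as in the example sentence, and it uses only the markers listed on the slide (the slide's "…" indicates further markers that are not given here). The names `FROM_RE`, `TO_RE`, and `extract_movement` are hypothetical.

```python
import re

# Assumption: a place name is a capitalized run of Latin letters.
PLACE = r"([A-Z][A-Za-z]+)"

# Markers taken from the slide's rule; the slide elides further markers.
FROM_RE = re.compile(PLACE + r"\s*(?:から|より|出発)")
TO_RE = re.compile(PLACE + r"\s*(?:へ|まで|に|泊)")


def extract_movement(text):
    """Return (origins, destinations) found in one field note sentence."""
    origins = FROM_RE.findall(text)
    destinations = TO_RE.findall(text)
    return origins, destinations


origins, destinations = extract_movement("Jakarta よりKotabumiへ行く。")
print(origins, destinations)
```

Because there is no gazetteer, this pattern cannot tell a place name from any other capitalized word followed by one of the particles, so false matches are possible on real field notes.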

  20. Track of investigation (2)
     • Force-directed graph, drawn using D3.js (http://d3js.org/)
     • [Figure: graph of place names. Labels include Jakarta, Pekanbaru, Tembilahan, Solok, and Singapore; periods shown are Oct. '84, Nov. '84, Dec. '84, and Jan. '85.]
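A sequence of extracted place names can be turned into input for a force-directed graph roughly as follows. This is a sketch: the slide names D3.js but does not show the data format, so the `nodes`/`links` layout with `id`, `source`, and `target` keys is an assumption based on the shape commonly used in D3 force-layout examples. The function name `to_d3_graph` is hypothetical, and the route below is illustrative only: the Jakarta to Kotabumi step comes from slide 19, while the order of the remaining names is not recoverable from the transcript.

```python
import json


def to_d3_graph(route):
    """Connect consecutive place names into a nodes/links structure."""
    # dict.fromkeys keeps each name once, in order of first visit.
    nodes = [{"id": name} for name in dict.fromkeys(route)]
    links = [
        {"source": a, "target": b}
        for a, b in zip(route, route[1:])
        if a != b  # skip repeated mentions of the same place
    ]
    return {"nodes": nodes, "links": links}


# Illustrative route; only the first step is confirmed by the slides.
route = ["Jakarta", "Kotabumi", "Pekanbaru", "Tembilahan"]
graph = to_d3_graph(route)
print(json.dumps(graph, ensure_ascii=False, indent=2))
```

Since the place names have no coordinates, a force-directed layout is a natural fit: positions come from the link structure alone rather than from latitude and longitude.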

  21. Conclusion and future work
     • We introduced a text analysis method for field notes in Area Studies.
       • Using the topic model LDA
       • Tracking the investigator
     • Future work
       • Improve the text analysis for Area Studies.
       • What kind of system do researchers in Area Studies want?
       • We will consider the answer and develop a system accordingly.

  22. Thank you for listening to my presentation.
     • E-mail: t_yamada@hi.u-tokyo.ac.jp
