ACL (+WS) 2007 / EMNLP-CoNLL 2007 Survey

Takashi Ninomiya, Nakagawa Laboratory, The University of Tokyo
Machine Learning Study Group, December 6, 2007

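ACL 2007 / EMNLP-CoNLL 2007
• June 23-30, 2007
• Prague
• Beautiful streets and a castle
• However, it is apparently also a risky place: statistically, 48 out of 800 registered participants are said to get pickpocketed...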

Memories of Prague

Domain Adaptation
• J. Jiang, C. X. Zhai (2007) Instance Weighting for Domain Adaptation in NLP, in Proc. of ACL 2007
• J. E. Miller, M. Torii, K. Vijay-Shanker (2007) Building Domain-Specific Taggers without Annotated (Domain) Data, in Proc. of EMNLP-CoNLL 2007
• J. Blitzer, M. Dredze, F. Pereira (2007) Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, in Proc. of ACL 2007
• J. Blitzer, R. McDonald, F. Pereira (2006) Domain Adaptation with Structural Correspondence Learning, in Proc. of EMNLP 2006
• R. K. Ando, T. Zhang (2005) A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, in JMLR, 6:1817-1853

Domain Adaptation: motivation (1/2)
• NLP tools that achieve high performance in one specific domain perform significantly worse in different domains
• NLP tools: POS tagger, named entity tagger, parser, sentiment analyzer
• Specific domain: newspaper text
• Different domains: speech, blogs, e-mail, biomedical papers

Domain Adaptation: motivation (2/2)
• Many high-performance NLP tools rely on supervised learning
• A specific domain has a relatively large amount of annotated data
• A slightly different domain has only a small amount of annotated data, or none at all
• Yet unsupervised methods do not perform as well as supervised methods...
Therefore:
• Adapt a classifier trained on a large amount of annotated data to a different domain
• Make full use of the small amount of annotated data
• Exploit a large amount of raw (unannotated) data

Domain Adaptation: terminology
• Domain
  • Source domain: the domain that has a large amount of annotated data and where analysis already performs well
  • Target domain: the domain under study, where we want higher performance but have little or no annotated data
• Assumptions
  • A large amount of annotated data in the source domain
  • Little or no annotated data in the target domain
  • A large amount of unannotated (raw) data in the target domain

Approach 1 (Story #1)
• Training data
  • Source domain: large amount of annotated data
  • Target domain: small amount of annotated data
[Figure: source domain (annotated newspaper data) and target domain (annotated blog and biomedical-paper data), with model parameters θ and θ′]

Approach 2 (Story #2)
• Training data
  • Source domain: large amount of annotated data
  • Target domain: very large amount of raw data
[Figure: source domain (annotated newspaper data) and target domain (raw blog and biomedical-paper data), with model parameters θ and θ′]

Approach 3 (Story #3)
• Training data
  • Source domain: large amount of annotated data
  • Target domain
    • Very large amount of raw data
    • Small amount of annotated data
[Figure: source domain (annotated newspaper data) and target domain (annotated and raw blog and biomedical-paper data), with model parameters θ and θ′]

Naive methods: simple baselines that come to mind first
• SrcOnly: use only the annotated data of the source domain
• TargetOnly: use only the annotated data of the target domain
• All: use the annotated data of both the source and target domains together
• Weighted: combine both, weighting the source and target annotated data according to their amounts
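The four baselines above differ only in which annotated examples they pass to the learner and with what weight. A minimal sketch, assuming examples are `(x, y)` pairs and that Weighted scales each source example by `|target| / |source|` so that both domains carry the same total weight; the slide does not give the exact weighting, so that formula and the function name are assumptions:

```python
def build_naive_training_sets(source, target):
    """Return the naive training sets as lists of (x, y, weight) triples."""
    src_only = [(x, y, 1.0) for x, y in source]
    tgt_only = [(x, y, 1.0) for x, y in target]
    all_data = src_only + tgt_only
    # Assumed weighting: the (large) source set is down-weighted so that its
    # total weight equals the total weight of the (small) target set.
    src_weight = len(target) / len(source)
    weighted = [(x, y, src_weight) for x, y in source] + tgt_only
    return {
        "SrcOnly": src_only,
        "TargetOnly": tgt_only,
        "All": all_data,
        "Weighted": weighted,
    }


source = [((i,), "O") for i in range(8)]           # 8 toy source examples
target = [((100,), "PERSON"), ((101,), "O")]       # 2 toy target examples
sets = build_naive_training_sets(source, target)
```

Any learner that accepts per-instance weights can then be trained on the chosen set.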

Naive methods (continued)
• Pred: use the output of the classifier trained on the source domain as one of the features of the target-domain classifier
• LinInt: linearly interpolate the output of the classifier trained on the source domain and the output of the classifier trained on the target domain
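Both baselines can be sketched in a few lines. The sketch below assumes feature vectors are dicts of feature name to value and classifier outputs are dicts of label to probability; the function names, the `src_pred=` feature prefix, and the interpolation weight `lam` are illustrative assumptions, not taken from the slides:

```python
def pred_features(x, source_predict):
    """Pred: add the source classifier's predicted label as an extra feature."""
    augmented = dict(x)
    augmented["src_pred=" + source_predict(x)] = 1.0
    return augmented


def lin_int(p_source, p_target, lam):
    """LinInt: lam * p_target + (1 - lam) * p_source for every label."""
    labels = set(p_source) | set(p_target)
    return {
        y: lam * p_target.get(y, 0.0) + (1.0 - lam) * p_source.get(y, 0.0)
        for y in labels
    }


x = {"word=Smith": 1.0, "capitalized": 1.0}
x_aug = pred_features(x, lambda features: "PERSON")   # toy source classifier
mixed = lin_int({"PERSON": 0.8, "O": 0.2}, {"PERSON": 0.4, "O": 0.6}, lam=0.5)
```

In practice `lam` would be tuned on held-out target-domain data.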

Instance Weighting for Domain Adaptation in NLP (Jiang & Zhai 2007)
• A model that uses all three types of data
• Data
  • Annotated data: {(x_i, y_i)}, i = 1, ..., N
    • x_i is the input feature vector
    • y_i is the output
  • Raw data: {x_j}, j = 1, ..., M
• Objective function and parameters

Instance Weighting: objective function
• Objective function
  • In ordinary supervised learning: empirical estimation from the training data
• Expanding p(x, y) = p(y | x) p(x) gives two kinds of adaptation:
  • Labeling adaptation: adapt p(y | x)
  • Instance adaptation: adapt p(x)
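The decomposition above can be written out as a short derivation. This is a sketch of the standard instance-weighting argument, following my reading of Jiang & Zhai (2007); the symbols $\theta_t^{*}$, $N_s$ (the number of annotated source-domain instances) and the subscripts $s$/$t$ for source/target are notation assumed here:

```latex
\begin{align*}
\theta_t^{*}
  &= \arg\max_{\theta} \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}}
     p_t(x, y) \log p(y \mid x; \theta)\, dx \\
  &= \arg\max_{\theta} \int_{\mathcal{X}} \sum_{y \in \mathcal{Y}}
     \frac{p_t(x, y)}{p_s(x, y)}\, p_s(x, y) \log p(y \mid x; \theta)\, dx \\
  &\approx \arg\max_{\theta} \frac{1}{N_s} \sum_{i=1}^{N_s}
     \underbrace{\frac{p_t(x_i)}{p_s(x_i)}}_{\text{instance adaptation}}
     \;\underbrace{\frac{p_t(y_i \mid x_i)}{p_s(y_i \mid x_i)}}_{\text{labeling adaptation}}
     \log p(y_i \mid x_i; \theta)
\end{align*}
```

Each source-domain instance is thus reweighted by two ratios, one for p(x) and one for p(y | x), which correspond to the two kinds of adaptation listed above.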

Instance Weighting (1)
• Labeling adaptation: adapting p(y | x)
  • p_s(y | x): probability in the source domain
  • p_t(y | x): probability in the target domain
  • For each source-domain instance (x_i, y_i), estimate how similar p_s(y_i | x_i) and p_t(y_i | x_i) are; if they are similar, use the instance as training data
  • More precisely, a source-domain instance (x_i, y_i) is used as training data if y_i = argmax_y p_t(y | x_i)
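The selection rule above can be sketched as a filter over the source-domain data. The sketch assumes `target_posterior(x)` returns a dict of label to p_t(label | x), for example from a classifier trained on the small annotated target set; that callable and the function name are hypothetical:

```python
def filter_by_target_agreement(source_data, target_posterior):
    """Keep (x, y) from the source domain only if y = argmax_y' p_t(y' | x)."""
    kept = []
    for x, y in source_data:
        dist = target_posterior(x)           # label -> p_t(label | x)
        best = max(dist, key=dist.get)       # argmax over labels
        if best == y:
            kept.append((x, y))
    return kept


def toy_target_posterior(x):
    # Toy stand-in for a target-domain model: capitalized tokens are names.
    if x[0].isupper():
        return {"PERSON": 0.9, "O": 0.1}
    return {"PERSON": 0.2, "O": 0.8}


source_data = [("Smith", "PERSON"), ("bank", "O"), ("Apple", "O")]
kept = filter_by_target_agreement(source_data, toy_target_posterior)
```

Here `("Apple", "O")` is dropped because the target-domain model disagrees with its source-domain label.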

Instance Weighting (2)
• Instance adaptation: adapting p(x)
  • Rescale the count C of each source-domain instance to C · p_t(x) / p_s(x)
  • No experiments were reported, because this ratio is difficult to estimate
[Figure: for a source-domain instance with feature vector (1, 0, 1, 1, 0, 0, 1) labeled PERSON, p(PERSON | (1,0,1,1,0,0,1)) is kept, while the source-domain p((1,0,1,1,0,0,1)) is replaced with the target-domain p((1,0,1,1,0,0,1))]
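To make the count rescaling concrete, here is a toy sketch. It assumes the feature vectors are discrete and hashable, and it estimates p_t(x) / p_s(x) from smoothed relative frequencies; that estimator (and the smoothing constant `alpha`) is purely an illustrative assumption, since the slide notes that the ratio is hard to estimate in practice:

```python
from collections import Counter, defaultdict


def density_ratio_weights(source_xs, target_xs, alpha=1.0):
    """Estimate p_t(x) / p_s(x) for each source x with add-alpha smoothing."""
    src, tgt = Counter(source_xs), Counter(target_xs)
    vocab_size = len(set(src) | set(tgt))

    def prob(counter, total, x):
        return (counter[x] + alpha) / (total + alpha * vocab_size)

    return {
        x: prob(tgt, len(target_xs), x) / prob(src, len(source_xs), x)
        for x in src
    }


def reweighted_counts(source_data, weights):
    """Replace each unit count C = 1 of (x, y) by C * p_t(x) / p_s(x)."""
    counts = defaultdict(float)
    for x, y in source_data:
        counts[(x, y)] += weights[x]
    return dict(counts)


a, b = (1, 0, 1), (0, 1, 1)
source_data = [(a, "PERSON"), (a, "PERSON"), (a, "O"), (b, "O")]
target_xs = [a, b, b, b]                      # raw target-domain inputs
weights = density_ratio_weights([x for x, _ in source_data], target_xs)
counts = reweighted_counts(source_data, weights)
```

Inputs that are rarer in the target domain than in the source domain (`a`) are down-weighted, and inputs that are more common (`b`) are up-weighted.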

Instance Weighting (3)
• Bootstrapping
  • θ^(n-1): parameters at the (n-1)-th training iteration
  • Generate annotated target-domain data (x_i, y_i) by analyzing the raw target-domain data x_i with θ^(n-1)
    • y_i = argmax_y' p(y' | x_i)
  • Use only the top-k predictions as training data
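The loop above can be sketched as generic self-training. The `train` and `predict_proba` callables, the ranking of predictions by their maximum probability, and the toy centroid classifier are all assumptions made for illustration; the slide only says that the top-k automatically labeled instances are used:

```python
import math


def bootstrap(labeled, raw, train, predict_proba, k, iterations):
    """Self-training: repeatedly label raw data and keep the top-k predictions."""
    labeled, pool = list(labeled), list(raw)
    model = train(labeled)                      # parameters theta^(0)
    for _ in range(iterations):
        if not pool:
            break
        scored = []
        for x in pool:
            dist = predict_proba(model, x)      # label -> p(label | x)
            y = max(dist, key=dist.get)         # y = argmax_y' p(y' | x)
            scored.append((dist[y], x, y))
        scored.sort(key=lambda item: item[0], reverse=True)
        labeled.extend((x, y) for _, x, y in scored[:k])
        pool = [x for _, x, _ in scored[k:]]
        model = train(labeled)                  # parameters theta^(n)
    return model, labeled


def train_centroids(labeled):
    """Toy learner on scalar inputs: one mean per label."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def centroid_proba(model, x):
    """Toy posterior: normalized exp(-distance to each label's mean)."""
    scores = {y: math.exp(-abs(x - c)) for y, c in model.items()}
    total = sum(scores.values())
    return {y: s / total for y, s in scores.items()}


model, data = bootstrap([(0.0, "A"), (10.0, "B")], [1.0, 9.0, 5.2],
                        train_centroids, centroid_proba, k=2, iterations=1)
```

With `k=2` and one iteration, the two confidently labeled points (1.0 and 9.0) are added to the training data, while the ambiguous point 5.2 stays in the raw pool.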

Instance Weighting: results
• Results with labeling adaptation only
• Adding annotated target-domain data

Instance Weighting: results
• Results with bootstrapping

Methods that use no annotated target-domain data
• J. E. Miller, M. Torii, K. Vijay-Shanker (2007) Building Domain-Specific Taggers without Annotated (Domain) Data, in Proc. of EMNLP-CoNLL 2007
• HMM tagger trained with the EM algorithm
  • Initial transition probabilities come from the annotated source-domain corpus (Penn WSJ)
  • Initial emission probabilities are learned from the raw target-domain corpus and the annotated source-domain corpus: a word starts from the emission probabilities of its most similar word
[Figure: example words “phosphorylation”, “phosphorylate”, “phosphorylates”, “phosphorylately”, and “create”]
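The emission initialization can be illustrated with a toy sketch. Everything specific here is an assumption: the slide does not say how word similarity is measured, so the shared-suffix length below is only a hypothetical stand-in, and `source_emission` is a hypothetical table of word to {tag: p(word | tag)} estimated from the annotated source corpus. Renormalization before running EM is omitted:

```python
def common_suffix_len(a, b):
    """Length of the longest common suffix of two words (toy similarity)."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


def init_emissions(target_words, source_emission, similarity=common_suffix_len):
    """Give each target word the emission values of its most similar source word."""
    init = {}
    for word in target_words:
        if word in source_emission:
            nearest = word                      # seen in the source corpus
        else:
            nearest = max(source_emission, key=lambda s: similarity(word, s))
        init[word] = dict(source_emission[nearest])
    return init


source_emission = {
    "creates": {"VBZ": 0.01},
    "creation": {"NN": 0.02},
}
init = init_emissions(["phosphorylates", "phosphorylation"], source_emission)
```

In this toy run, “phosphorylates” inherits the values of “creates” and “phosphorylation” those of “creation”; EM on the raw target corpus would then refine these starting values.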

Methods that use no annotated target-domain data
• J. Blitzer, M. Dredze, F. Pereira (2007) Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, in Proc. of ACL 2007
• J. Blitzer, R. McDonald, F. Pereira (2006) Domain Adaptation with Structural Correspondence Learning, in Proc. of EMNLP 2006
• R. K. Ando, T. Zhang (2005) A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, in JMLR, 6:1817-1853

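SVD-ASO
• Main problem: training data (x_i, y_i) → test (x, ?)
• Auxiliary problems: create multiple problems that differ from the main problem
• Unsupervised approach
  • Set up tasks that resemble the main problem
  • The tasks must be ones whose labels can be derived from x_i alone, without using the gold labels y_i of the training data
  • Example: for POS tagging, predicting the next word
  • Example: for text genre classification, splitting a text into two halves and predicting the most frequent word of one half from the other half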
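The "predict the next word" example can be sketched as follows: each auxiliary problem is a binary task "is the next token w?", and its labels come from raw text alone. The feature templates (`cur=`, `prev=`) and the function name are illustrative assumptions; note that the next token itself is deliberately kept out of the features:

```python
def next_word_aux_problems(sentences, aux_words):
    """Build one binary dataset per auxiliary word from unlabeled sentences."""
    problems = {w: [] for w in aux_words}
    for sent in sentences:
        for i in range(len(sent) - 1):
            # Features may only look at the current and previous tokens.
            features = {"cur=" + sent[i]: 1.0}
            if i > 0:
                features["prev=" + sent[i - 1]] = 1.0
            for w in aux_words:
                label = 1 if sent[i + 1] == w else 0
                problems[w].append((features, label))
    return problems


raw_sentences = [["the", "dog", "barks"], ["the", "cat"]]
problems = next_word_aux_problems(raw_sentences, ["dog"])
labels = [label for _, label in problems["dog"]]
```

Because no gold labels y_i are needed, such datasets can be generated in arbitrary quantity from unannotated text.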

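SVD-ASO
• Main problem: training data (x_i, y_i) → test (x, ?)
• Auxiliary problems: create multiple problems that differ from the main problem
• Semi-supervised approach
  • Create two independent feature maps Φ1 and Φ2
  • Build a classifier for the main problem using Φ1
  • The auxiliary problems use Φ2 to predict the output of the main-problem classifier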

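SVD-ASO
• For all problems l = 1, ..., m, find θ, w_l, v_l from the following loss function
• θ is a matrix shared by all problems
  • It is obtained by SVD
• v_l and w_l are weight vectors specific to each problem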
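For reference, the joint objective can be written as below. This is a sketch based on my reading of Ando & Zhang (2005); the symbols $n_l$ (number of examples for problem $l$), $L$ (loss function), $\lambda_l$ (regularization constant) and $(x_i^{l}, y_i^{l})$ (the $i$-th example of problem $l$) are notation assumed here:

```latex
\begin{align*}
f_l(x) &= w_l^{\top} x + v_l^{\top} \theta\, x, \\
\bigl[\hat{\theta}, \{\hat{w}_l, \hat{v}_l\}\bigr]
  &= \arg\min_{\theta, \{w_l, v_l\}} \sum_{l=1}^{m}
     \left( \frac{1}{n_l} \sum_{i=1}^{n_l}
       L\bigl( (w_l + \theta^{\top} v_l)^{\top} x_i^{l},\; y_i^{l} \bigr)
       + \lambda_l \lVert w_l \rVert^{2} \right)
  \quad \text{s.t. } \theta \theta^{\top} = I
\end{align*}
```

Only $w_l$ is regularized; the shared matrix $\theta$ projects the input onto a low-dimensional space that is useful across all $m$ problems.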

SVD-ASO: algorithm

Applying SVD-ASO to domain adaptation
• Treat the auxiliary problems as a different domain that has no gold labels
• POS tagger
  • J. Blitzer, R. McDonald, F. Pereira (2006) Domain Adaptation with Structural Correspondence Learning, in Proc. of EMNLP 2006
• Sentiment analysis
  • J. Blitzer, M. Dredze, F. Pereira (2007) Biographies, Bollywood, Boom-boxes and Blenders: Domain Adaptation for Sentiment Classification, in Proc. of ACL 2007

Applying SVD-ASO to POS tagger domain adaptation: algorithm
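As a rough illustration of how auxiliary problems arise in structural correspondence learning (Blitzer et al. 2006), the sketch below covers only the first two steps as I understand them: choosing pivot features that are frequent in the unlabeled data of both domains, and building one binary "does this pivot occur?" dataset per pivot from the remaining features. The function names, the `min_count` threshold, and the ranking by the smaller of the two domain frequencies are assumptions; training the pivot predictors and the SVD step are omitted:

```python
from collections import Counter


def select_pivots(source_docs, target_docs, m, min_count=1):
    """Pick up to m features that occur frequently in both domains."""
    src = Counter(f for doc in source_docs for f in set(doc))
    tgt = Counter(f for doc in target_docs for f in set(doc))
    shared = [f for f in src if src[f] >= min_count and tgt[f] >= min_count]
    # Rank by the smaller of the two document frequencies, ties by name.
    shared.sort(key=lambda f: (-min(src[f], tgt[f]), f))
    return shared[:m]


def pivot_prediction_data(docs, pivots):
    """For each pivot, pair the non-pivot features with a 0/1 occurrence label."""
    pivot_set = set(pivots)
    data = {p: [] for p in pivots}
    for doc in docs:
        feats = set(doc)
        non_pivot = sorted(feats - pivot_set)
        for p in pivots:
            data[p].append((non_pivot, 1 if p in feats else 0))
    return data


source_docs = [["the", "stock", "rose"], ["the", "market", "fell"]]
target_docs = [["the", "gene", "rose"], ["the", "protein"]]
pivots = select_pivots(source_docs, target_docs, m=2)
aux_data = pivot_prediction_data(target_docs, pivots)
```

The weight vectors of the pivot predictors would then be stacked into a matrix whose SVD yields the shared projection θ.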
