June 16, 15
Slide overview
Presentation about Modeless Japanese Input Method
I am Yukino Ikegami ...†
Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary
Yukino Ikegami, Setsuo Tsuruta
2014/01/20
Necessity of Japanese Input Method
• Japanese has many characters
  – Kana
    • Hiragana – 81 characters, e.g. いろはにほへと
    • Katakana – 81 characters, e.g. イロハニホヘト
  – Kanji (Chinese characters)
    • More than 6,000 characters, e.g. 以呂波仁保反止
• These characters cannot be typed directly on a keyboard
➢ A Japanese input method (converting alphabetic input to Japanese characters) is necessary
If every Japanese character were assigned its own key…
• Too many keys!
• A Japanese input method is necessary
Japanese Input Method -Roman to Kana-Kanji Converter-
• Flow
  1. Receive the Romanized input: ① nekodesu
  2. Convert the Romanized input into Kana using a Roman-to-Kana table: ② ねこです
  3. Convert Kana into Kanji (if necessary): ③ 猫です
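Step 2 of this flow can be sketched as a longest-match table lookup. This is only an illustration: the table `ROMAN_TO_KANA` is a hypothetical, deliberately tiny subset, the function name is made up for this example, and step 3 (Kana to Kanji) is not covered.

```python
# Hypothetical, deliberately tiny Roman-to-Kana table (assumption:
# a real table covers every syllable and many special cases).
ROMAN_TO_KANA = {
    "a": "あ", "re": "れ", "ha": "は",
    "ne": "ね", "ko": "こ", "de": "で", "su": "す",
}


def roman_to_kana(text, table=ROMAN_TO_KANA):
    """Convert Romanized input to Kana by greedy longest match.

    Characters that match no table entry are passed through unchanged.
    """
    max_len = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate chunk first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                out.append(table[chunk])
                i += size
                break
        else:
            # No table entry starts here: keep the character as typed.
            out.append(text[i])
            i += 1
    return "".join(out)


print(roman_to_kana("nekodesu"))  # prints ねこです
```

Greedy matching is the simplest choice here; a production converter would also need rules for cases such as doubled consonants and "n" before another consonant.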
Problems on Japanese Input Method
• Need to switch input modes between Japanese and ASCII
  e.g. to input ‘あれは8Byteです’ (That is 8Byte):
  areha [Return] [ASCII Mode] 8Byte [Japanese Mode] desu
  ([ASCII Mode] and [Japanese Mode] are each a mode switch)
• Switching is cumbersome!
Adding Terms to the Dictionary for the Mode-Switching Problem
• Add terms from other languages to the dictionary of a conventional input method editor
• Shortcomings
  – New terms are coined continuously
  – Homograph problem
Related Work
• Modeless Pinyin-Chinese Input [Chen et al. 2000]
  – Converts alphabet (Pinyin) to Chinese
  – Uses only word-surface features for classification
• Type-Any [Ehara et al. 2009]
  – Converts alphabet to any language
  – Requires pressing a delimiter key when converting
  – Uses only word-surface features for classification
Approach -Modeless Japanese Input Method-
• Automatically switch the input mode
  1. Generate a discriminative model with a Support Vector Machine (SVM)
    – The model uses multiple n-gram features
  2. Use the discriminative model to decide, for each segment of an alphabet sequence, whether it is Kana or not
    – e.g. nekohacatdesu → nekoha / cat / desu → ねこはcatです (Japanese / English / Japanese)
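The segmentation in the example can be sketched as grouping per-character labels into runs. This assumes a classifier has already produced one label per character; here the labels are written by hand, and the label letters ("K" for Kana, "A" for ASCII) and the function name are assumptions made for this illustration.

```python
from itertools import groupby


def segment_by_label(text, labels):
    """Group consecutive characters that share a label into segments.

    `labels` holds one label per character of `text`
    (assumed here: "K" = convert to Kana, "A" = keep as ASCII).
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    segments = []
    # groupby yields one group per run of identical labels.
    for label, group in groupby(zip(labels, text), key=lambda pair: pair[0]):
        segments.append((label, "".join(ch for _, ch in group)))
    return segments


text = "nekohacatdesu"
labels = "KKKKKKAAAKKKK"  # hand-written stand-in for classifier output
print(segment_by_label(text, labels))
# prints [('K', 'nekoha'), ('A', 'cat'), ('K', 'desu')]
```

Each "K" segment would then go through Kana conversion, while each "A" segment is left as typed.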
Main flow of Modeless Japanese Input Method
[Flowchart. Nodes: User input (alphabet sequence); Non-Japanese Dic.; Discriminative Model; loop over each character in the user input; decision "is the character still ASCII?" (True / False); Kana conversion; System Response (Kana & alphabet sequence)]
Flow of Generating Discriminative Model
1. Load texts
  • 猫はcatです
2. Kanji to Kana
  • Using a Japanese morphological analyzer (MeCab)
  • ネコハcatデス
3. Kana to ASCII
  • Using a Kana-to-ASCII table (used by Google Japanese Input)
  • nekohacatdesu
4. ASCII to n-gram
  • Character-surface: ne, ek, nek, ko, eko, oh, koh, ha, oha...
  • Character-type: LL, LL, LLL, LL, LLL, LL, LLL...
  • History: KK, KK, KKK, KK, KKK, KKK...
5. n-gram to ID
  • 1, 3, 4, 13, 22...
6. Describe as binary model
  • 1:1, 3:1, 4:1, 13:1, 32:1...
7. Learning on SVM
  • 1.344, 0.691, 0.023, -1.398...
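The steps from n-gram extraction to the binary "ID:1" description can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `surf=`/`type=` prefixes, and the choice to number IDs from 1 in order of first appearance are all assumptions. The "ID:1" layout resembles the sparse index:value format accepted by SVM tools such as LIBSVM; the SVM training itself is not shown.

```python
def char_type(ch):
    """Map a character to U (upper), L (lower), N (number) or S (symbol)."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "N"
    return "S"


def ngrams_by_end(text, sizes=(2, 3)):
    """List character n-grams ordered by the position where they end."""
    grams = []
    for end in range(len(text)):
        for n in sizes:
            start = end - n + 1
            if start >= 0:
                grams.append(text[start:end + 1])
    return grams


def to_sparse_line(features, vocab):
    """Turn feature strings into a sorted 'id:1' line, growing vocab as needed."""
    ids = sorted({vocab.setdefault(f, len(vocab) + 1) for f in features})
    return " ".join(f"{i}:1" for i in ids)


surface = ngrams_by_end("nekoha")
types = ["".join(char_type(c) for c in gram) for gram in surface]
print(surface)  # ['ne', 'ek', 'nek', 'ko', 'eko', 'oh', 'koh', 'ha', 'oha']

vocab = {}  # hypothetical feature-to-ID table, filled on the fly
features = [f"surf={g}" for g in surface] + [f"type={t}" for t in types]
print(to_sparse_line(features, vocab))
```

Because the features are binary, a repeated n-gram such as LL contributes a single "ID:1" entry rather than a count.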
n-gram Features
Example: あれは8Byte is typed as areha8Byte
(n-gram upper limit n = 2, window size m = 2, focus point xi = the 2nd “a”)
• Character-Surface
  – Substrings backward and forward of the focus point
  – e.g. -2/ha -1/a8 0/8B 1/By
• Character-Type
  – Upper-case (U), Lower-case (L), Number (N), and Symbol (S)
  – e.g. -2/LL -1/LN 0/NU 1/UL
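The two feature families can be reproduced with a small window function. The offset convention used here (the n-gram labelled k starts one character after focus + k) is an assumption inferred from the example values, not something the slide states, and the sketch emits only n-grams of exactly length n. Function names are hypothetical.

```python
def char_type(ch):
    """Map a character to U (upper), L (lower), N (number) or S (symbol)."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "N"
    return "S"


def window_features(text, focus, n=2, m=2):
    """Return (surface, type) feature lists around the focus index.

    Assumed convention: offsets run from -m to m - 1, and the n-gram for
    offset k starts at index focus + k + 1.
    """
    surface, types = [], []
    for offset in range(-m, m):
        start = focus + offset + 1
        gram = text[start:start + n]
        # Skip windows that fall off either end of the input.
        if start < 0 or len(gram) < n:
            continue
        surface.append(f"{offset}/{gram}")
        types.append(f"{offset}/" + "".join(char_type(c) for c in gram))
    return surface, types


surface, types = window_features("areha8Byte", focus=4)  # index 4 is the 2nd "a"
print(surface)  # ['-2/ha', '-1/a8', '0/8B', '1/By']
print(types)    # ['-2/LL', '-1/LN', '0/NU', '1/UL']
```

The character-type features let the model generalize beyond the exact letters seen in training, for instance by treating any lower-case letter followed by a digit alike.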
Generating Non-Japanese Dictionary
• Words that never appear in Japanese-only text
  – Longer than 5 characters
  – Contain a substring that cannot be converted to Kana
• Source
  – Corpus of Contemporary American English (COCA)
  – Japanese Wikipedia article title list
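A possible filter for dictionary candidates is sketched below. Several points are assumptions: the criteria are combined with a logical AND, "contains a substring that cannot be converted" is approximated as "the whole word cannot be split into Romanized syllables", and `ROMAN_SYLLABLES` is a hypothetical, tiny stand-in for a full Roman-to-Kana table.

```python
# Hypothetical, tiny subset of Romanized syllables (assumption).
ROMAN_SYLLABLES = {
    "a", "i", "u", "e", "o", "n",
    "ka", "ki", "ku", "ke", "ko",
    "na", "ni", "nu", "ne", "no",
    "ta", "te", "to", "ha", "de", "su", "re",
}


def convertible_to_kana(word, syllables=ROMAN_SYLLABLES):
    """Return True if the whole word can be split into known syllables."""
    reachable = [True] + [False] * len(word)
    for end in range(1, len(word) + 1):
        for size in (1, 2, 3):
            start = end - size
            if start >= 0 and reachable[start] and word[start:end] in syllables:
                reachable[end] = True
                break
    return reachable[len(word)]


def is_non_japanese_entry(word, japanese_only_vocab, min_length=5):
    """Apply the three criteria, assumed here to be combined with AND."""
    w = word.lower()
    return (
        len(w) > min_length
        and not convertible_to_kana(w)
        and w not in japanese_only_vocab
    )


print(is_non_japanese_entry("keyboard", set()))  # prints True
print(is_non_japanese_entry("kanata", set()))    # prints False
```

The dynamic-programming check avoids the false negatives a greedy split could produce when a shorter syllable must be chosen first.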
Compare with Conventional IME
• Conventional method
  areha [Return] [Alphabet Mode] 8Byte [Japanese Mode] desu
  (two mode switches) Key presses: 17
• Modeless Japanese input method
  areha8Bytedesu
  Key presses: 14
• The number of key presses is reduced
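The two counts can be checked with a short tally. The counting rule is an assumption: each bracketed action counts as one key press, every typed character counts as one, and Shift for the capital "B" is not counted.

```python
def count_key_presses(steps):
    """Count key presses: 1 per bracketed action, 1 per typed character."""
    total = 0
    for step in steps:
        if step.startswith("[") and step.endswith("]"):
            total += 1          # e.g. [Return] or a mode-switch key
        else:
            total += len(step)  # plain typed characters
    return total


conventional = ["areha", "[Return]", "[Alphabet Mode]", "8Byte", "[Japanese Mode]", "desu"]
modeless = ["areha8Bytedesu"]

print(count_key_presses(conventional))  # prints 17
print(count_key_presses(modeless))      # prints 14
```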
Datasets used in Evaluation Experiment
• Generating Model & Evaluating Method
  – Balanced Corpus of Contemporary Written Japanese (BCCWJ)
    • Books, magazines, blogs, government documents, and others
• Non-Japanese Dictionary Source
  – COCA
  – Japanese Wikipedia article title list
Criteria
Results of Evaluation
                   Baseline                  Proposed method
                   (Char. surface n-gram)    (Char. {surface, type} n-gram & Dictionary)
Kana Precision     .998                      .999
Kana Recall        .989                      .996
Kana F1-measure    .993                      .998
ASCII Precision    .953                      .968
ASCII Recall       .780                      .884
ASCII F1-measure   .858                      .924
• Outperforms baseline
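The F1-measure is the harmonic mean of precision and recall. As a quick check under that standard definition, a precision of .953 with a recall of .780 gives .858, and .968 with .884 gives .924; all of these are among the reported values. The function name is chosen for this example.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_measure(0.953, 0.780), 3))  # prints 0.858
print(round(f1_measure(0.968, 0.884), 3))  # prints 0.924
```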
User test
• 4 females and 7 males
• Input example sentences (chat, mail, technological text)
Person No.        1 2 3 4 5 6 7 8 9
Conventional IME  18.18 17.89 15.4 12.71 11.09 10.18 11.42 12.38 10.48
Proposed method   12.23 6.03 13.34 14.68 9.88 7.00 … 11.03 11.37 10.30
• Outperforms conventional method
Summary
• Switching input mode is cumbersome
• Hybrid Modeless Japanese Input Method
  – Automatically switches the input mode between Japanese and ASCII
  – Uses an n-gram feature model for discrimination
    • character-{surface, type}
  – Outperforms conventional methods