"An R package for creating and exploring word2vec and other vector models"を試す(+ GloVeと比較)
Introduction
{wordVectors}, an R package that makes word2vec usable from R, was recently published on GitHub, so I gave it a try right away (and also compared its results against GloVe on the exercises from the 100 Language Processing Knocks).
The package is a revision of {tmcn.word2vec}, which wraps the original word2vec C implementation via .C(); on top of that it lets you change the model-building parameters and defines a set of functions for computing similarities and extracting similar words.
The package's GitHub repository: https://github.com/bmschmidt/wordVectors
Playing with {wordVectors}
Using the text data mentioned in the README (a collection of cookbooks), let's try out the functions defined in {wordVectors}.
Loading libraries, defining constants, and preparing the input file
# Install first, if needed:
# library(devtools)
# devtools::install_github("bmschmidt/wordVectors")
library(wordVectors)
library(magrittr)
library(tsne)

SET_PRI_FILE <- list(
  TARGET_URL = "http://archive.lib.msu.edu/dinfo/feedingamerica/cookbook_text.zip",
  DESTINATION = "cookbooks.zip",
  TARGET_DIR = "cookbooks"
)
SET_WORD2VEC_FILE <- list(
  TRAIN = "cookbooks.txt",
  MODEL = "cookbook_w2v.model"
)
SET_WORD2VEC_PARAM <- list(
  VECTORS = 100,
  WINDOW = 10,
  THREADS = 1
)

# Download and unzip the corpus
if (!file.exists(SET_PRI_FILE$DESTINATION)) {
  download.file(url = SET_PRI_FILE$TARGET_URL, destfile = SET_PRI_FILE$DESTINATION)
}
unzip(zipfile = SET_PRI_FILE$DESTINATION, exdir = SET_PRI_FILE$TARGET_DIR)
Text preprocessing
# Tokenize, strip punctuation, lowercase, etc. with `wordVectors::prep_word2vec()`
## If the text has already been preprocessed, the advice is NOT to run `prep_word2vec()`
## (see the README, "Testing the setup", from the sentence "Note: this prep_word2vec" onward)
> wordVectors::prep_word2vec(origin = SET_PRI_FILE$TARGET_DIR, destination = SET_WORD2VEC_FILE$TRAIN, lowercase = TRUE)
Beginning tokenization to text file at cookbooks.txt
cookbooks/amem.txt.
cookbooks/amwh.txt.
cookbooks/army.txt.........
cookbooks/aunt.txt....
cookbooks/bart.txt...
(... progress output for the remaining cookbooks/*.txt files omitted ...)
cookbooks/zuni.txt...
> file.exists(SET_WORD2VEC_FILE$TRAIN)
[1] TRUE
Training the word embeddings
# Learn word embeddings from the preprocessed file (here "cookbooks.txt")
## "Progress" can apparently exceed 100% in some runs
> word2vec_model <- wordVectors::train_word2vec(
+   train_file = SET_WORD2VEC_FILE$TRAIN,
+   output_file = SET_WORD2VEC_FILE$MODEL,
+   vectors = SET_WORD2VEC_PARAM$VECTORS,
+   window = SET_WORD2VEC_PARAM$WINDOW,
+   threads = SET_WORD2VEC_PARAM$THREADS
+ )
Starting training using file /Users/yamano357/cookbooks.txt
Vocab size: 25699
Words in train file: 10373493
Alpha: 0.000042  Progress: 99.94%  Words/thread/sec: 45.47k

# The model written to disk can be reloaded with `wordVectors::read.vectors()`
# word2vec_model <- wordVectors::read.vectors(filename = SET_WORD2VEC_FILE$MODEL)

# Show the learned vectors
> word2vec_model
A VectorSpaceModel object of 25699 words and 100 vectors
            V1       V2        V3        V4        V5        V6
</s>  0.002886 0.001317 -0.001745 -0.000629 -0.001375  0.003481
the  -0.011442 0.191060  0.330702 -0.290557 -0.291106 -0.182090
and  -0.028725 0.102838  0.345141 -0.277161 -0.295462 -0.145107
of    0.009846 0.145960  0.363108 -0.233844 -0.238913 -0.053135
a    -0.012503 0.171922  0.460543 -0.352876 -0.293058 -0.263102
in   -0.029359 0.172742  0.339801 -0.298472 -0.372895 -0.029149
to   -0.020429 0.234571  0.205464 -0.180147 -0.303744 -0.114839
it    0.031375 0.088285  0.256754 -0.228791 -0.101396 -0.137876
with  0.005248 0.108248  0.437498 -0.213623 -0.228777 -0.174847
or   -0.237626 0.088747  0.653050 -0.353354 -0.450014 -0.181351

# Indexing by word is also possible
> word2vec_model[["fish"]]
A VectorSpaceModel object of 1 words and 100 vectors
        V1       V2       V3       V4        V5        V6
 -0.195566 0.264130 0.275779 0.202528 -0.070085 -0.131659
Trying the package's functions
# `wordVectors::nearest_to()` returns the n words closest to the given vector
# (note that the values are cosine distances, so the query word itself comes back as 0)
> word2vec_model %>%
+   wordVectors::nearest_to(vector = word2vec_model[["fish"]], n = 20)
      fish   summary      eggs   leading    butter    cheese     meats   omelets vegetables reformation     soups       tea      nuts
 0.0000000 0.2293168 0.2806487 0.3034526 0.4148904 0.4483116 0.4645882 0.4811739  0.5383347   0.5442060 0.5460867 0.5565817 0.5625072
     flesh   oysters     xlvii     yeast omnivorous   cookery   expense
 0.5659814 0.5827559 0.5847825 0.5885628  0.5898036 0.5912861 0.5965578

# Several words can be passed at once
## they come back combined into a single vector
> word2vec_model %>%
+   wordVectors::nearest_to(vector = word2vec_model[[c("salmon", "eels")]], n = 20)
      eels    salmon    smelts     trout  sturgeon   halibut      carp   haddock    5050to    turbot     brook       cod flounders     perch     tench
 0.1089593 0.1539040 0.1872735 0.2088890 0.2331441 0.2633695 0.2635104 0.2722559 0.2794106 0.2830821 0.2879877 0.2980044 0.3059720 0.3118793 0.3131088
     soles  mackerel      shad      bass     skate
 0.3131817 0.3145151 0.3151385 0.3154590 0.3175680

# Could `wordVectors::reject()` and `wordVectors::project()` be used for word-sense disambiguation? (to be confirmed)
## `reject()` removes the "summary" component (the results feel a bit closer to cooking)
> word2vec_model %>%
+   wordVectors::nearest_to(
+     vector = word2vec_model[["fish"]] %>%
+       wordVectors::reject(word2vec_model[["summary"]]),
+     n = 20
+   )
      fish     meats     soups   omelets   oysters vegetables     clams     roast    dishes 6565oyster       tea  65oyster 6464oyster     fresh
 0.3627815 0.4387543 0.5038327 0.5306427 0.5343506  0.5607881 0.5860421 0.6010517 0.6045615  0.6182329 0.6209280 0.6222338  0.6288608 0.6383299
  fishfish 6565fulton    salads      game meatsmeats   66fulton
 0.6413938  0.6440959 0.6467585 0.6477857  0.6515009  0.6519537

## `project()` onto the "pasta" component on top of that (did it move toward Italian cooking?)
> word2vec_model %>%
+   wordVectors::nearest_to(
+     vector = word2vec_model[["fish"]] %>%
+       wordVectors::reject(word2vec_model[["summary"]]) %>%
+       wordVectors::project(word2vec_model[["pasta"]]),
+     n = 20
+   )
       pasta       budino     minestra   margherita           sm        salsa     acciughe        burro        carob         alla        genoa       alpine
2.220446e-16 3.388659e-01 3.644817e-01 4.112720e-01 4.114118e-01 4.240042e-01 4.318024e-01 4.324377e-01 4.333508e-01 4.354541e-01 4.371379e-01 4.387601e-01
     beoregh          del   maccheroni       funghi  cappelletti         riso    nuremberg     semolino
4.410889e-01 4.502326e-01 4.523589e-01 4.563918e-01 4.612530e-01 4.651600e-01 4.659397e-01 4.746725e-01

# Cosine similarity itself is also provided
## trout is more similar to salmon than eels are
> wordVectors::cosineSimilarity(
+   x = word2vec_model[[c("salmon")]], y = word2vec_model[[c("trout")]]
+ )
          [,1]
[1,] 0.6474464
> wordVectors::cosineSimilarity(
+   x = word2vec_model[[c("salmon")]], y = word2vec_model[[c("eels")]]
+ )
          [,1]
[1,] 0.5119509

# `wordVectors::magnitudes()` computes each vector's "magnitude"
## (I have not pinned down what "magnitude" means here; it looks like the
##  Euclidean (L2) norm of each row vector, checked in the sketch below)
## magnitude of every word
> wordVectors::magnitudes(matrix = word2vec_model) %>%
+   head(n = 10)
      </s>        the        and         of          a         in         to         it       with         or
0.02704919 1.79423257 1.74728611 1.62943262 1.81958723 1.69660988 1.66308927 1.66002468 1.83629404 2.05842731
# magnitude of one specified word
> wordVectors::magnitudes(matrix = word2vec_model[[c("the")]])
[1] 1.794233
# magnitude of the combined vector for several words
> wordVectors::magnitudes(matrix = word2vec_model[[c("the", "and")]])
[1] 1.743263

# Project the vectors down to two dimensions with t-SNE and plot
wordVectors::plot(
  x = word2vec_model,
  y = word2vec_model[[c("salmon", "trout", "eels")]]
)
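To pin down the two quantities above, here is a minimal sketch of mine (assuming a VectorSpaceModel row can be coerced to a plain numeric vector): cosineSimilarity() should match the textbook dot-product formula, and magnitudes() should match the Euclidean (L2) norm of each row.

# Manual cosine similarity <x, y> / (|x| * |y|); should reproduce 0.6474464 above
v_salmon <- as.numeric(word2vec_model[["salmon"]])
v_trout  <- as.numeric(word2vec_model[["trout"]])
sum(v_salmon * v_trout) / (sqrt(sum(v_salmon^2)) * sqrt(sum(v_trout^2)))

# Manual "magnitude" as the L2 norm of the row for "the"; should reproduce 1.794233 above
sqrt(sum(as.numeric(word2vec_model[["the"]])^2))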
Open questions
# I don't quite see how indexing with several words at once differs from adding the vectors
## I expected the two to match, so this needs investigation
> word2vec_model[[c("salmon", "eels")]]
A VectorSpaceModel object of 1 words and 100 vectors
         V1        V2        V3        V4         V5        V6
 -0.2952295 0.1111025 0.1369030 0.1668495 -0.0012665 0.1551060
> (word2vec_model[["salmon"]] + word2vec_model[["eels"]])[, 1:6]
        V1       V2       V3       V4        V5       V6
 -0.590459 0.222205 0.273806 0.333699 -0.002533 0.310212
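Comparing the two outputs, every value in the multi-word result is exactly half of the corresponding sum (for example, -0.590459 / 2 = -0.2952295), which suggests that indexing with several words returns the mean of their vectors rather than the sum. A minimal check of that reading:

# Hypothesis: word2vec_model[[c("salmon", "eels")]] is the MEAN of the two vectors
((word2vec_model[["salmon"]] + word2vec_model[["eels"]]) / 2)[, 1:6]
# -0.2952295 0.1111025 0.1369030 0.1668495 -0.0012665 0.1551060
# which matches the multi-word indexing output above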
Comparing with GloVe
On analogy tasks (semantic addition and subtraction over word vectors, e.g. checking whether vec("king") - vec("man") + vec("woman") lands near vec("queen")), GloVe from the Stanford NLP group is said to beat word2vec in accuracy (although there are also reports that word2vec with negative sampling comes out ahead). So I took the exercises from chapter 10 of the 100 Language Processing Knocks ("Vector Space Models II") and ran them with both the word2vec embeddings from {wordVectors} described above and embeddings learned by GloVe. The R code for the exercises reuses what I have published on RPubs.
Note that the GloVe results come from glove-python, called from R via {PythonInR}.
- GloVe: Global Vectors for Word Representation
- maciejkula/glove-python · GitHub
- Pennington, Socher, and Manning (2014). GloVe: Global Vectors for Word Representation.
- 単語の分散表現と構成性の計算モデルの発展 ("Advances in distributed word representations and computational models of compositionality", in Japanese)
Loading libraries and defining functions and constants
library(hadleyverse)
library(PythonInR)
library(wordVectors)

# Prerequisites
# https://github.com/maciejkula/glove-python
# $ git clone https://github.com/maciejkula/glove-python.git
# $ sudo python setup.py develop
# $ sudo python setup.py install
# PythonInR::pyIsConnected()
# PythonInR::pyExit()
# PythonInR::pyConnect()

# Helper for setting constants on the Python side
# (without an explicit cast, integers are passed over as floats, hence the "L" suffixes below)
defPyConst <- function (param_list) {
  sapply(
    X = seq(from = 1, to = length(param_list)),
    FUN = function (i) {
      # pick the right cast; note that base ifelse() cannot return functions,
      # so a plain if/else chain is used here
      value <- param_list[i][[1]]
      cast_fun <- if (is.integer(x = value)) {
        as.integer
      } else if (is.character(x = value)) {
        as.character
      } else {
        as.numeric
      }
      PythonInR::pySet(
        key = stringr::str_to_lower(string = names(param_list[i])),
        value = cast_fun(value)
      )
    }
  )
}
callPyConst <- function (param_list_vec) {
  sapply(X = param_list_vec, FUN = defPyConst)
}

# When several Python environments coexist, point directly at the one where glove was installed
PythonInR::pyImport(import = "sys")  # sys is needed for the sys.path tweak below
PythonInR::pyExec(code = 'sys.path = ["/usr/local/lib/python2.7/site-packages/glove-0.0.1-py2.7-macosx-10.11-x86_64.egg"] + sys.path')
PythonInR::pyImport(import = c("Glove"), from = c("glove"))
PythonInR::pyImport(import = c("Corpus"), from = c("glove"))
PythonInR::pyImport(import = c("gensim"))

# The input corpus reuses the one built in the earlier vector-space chapter of the 100 Knocks (problems 8x)
SET_CORPUS <- list(
  FILE_NAME = list(
    WIKI = "enwiki-corpus.txt"
  ),
  CORPUS_PARAM = list(
    WINDOW_SIZE = 10L
  )
)
# GloVe parameters
SET_GLOVE_PARAM <- list(
  MODEL = list(
    NO_COMPONENTS = 100L,
    LEARNING_RATE = 0.05
  ),
  TRAIN = list(
    EPOCHS = 30L,
    NO_THREADS = 1L
  )
)
# word2vec parameters
SET_WORD2VEC_PARAM <- list(
  MODEL = list(
    VECTORS = 100L,
    OUTPUT_FILE_NAME = "w2v.model"
  ),
  TRAIN = list(
    THREADS = 1
  )
)
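To see the int/float issue that defPyConst() works around, a quick hypothetical check (assuming a connected Python session):

# Without the integer cast, an R numeric arrives on the Python side as a float
PythonInR::pySet(key = "a", value = 10)    # Python: a == 10.0
PythonInR::pySet(key = "b", value = 10L)   # Python: b == 10
PythonInR::pyPrint(objName = "type(a)")    # expected: <type 'float'>
PythonInR::pyPrint(objName = "type(b)")    # expected: <type 'int'>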
GloVe (calling glove-python via {PythonInR})
# Set the constants on the Python side
callPyConst(
  param_list_vec = list(
    SET_CORPUS$FILE_NAME, SET_CORPUS$CORPUS_PARAM,
    SET_GLOVE_PARAM$MODEL, SET_GLOVE_PARAM$TRAIN
  )
)

# Read the text (dropping sentences no longer than the window size)
sentences <- stringr::str_split(
  string = readr::read_lines(file = SET_CORPUS$FILE_NAME$WIKI, n_max = -1),
  pattern = "[:space:]",
  n = Inf
)
sentences <- sentences[sapply(X = sentences, FUN = length) > SET_CORPUS$CORPUS_PARAM$WINDOW_SIZE]
PythonInR::pySet(key = "sentences", value = sentences)

# Build the co-occurrence matrix from the corpus
create_corpus <- '
corpus = Corpus()
corpus.fit(corpus = sentences, window = window_size)
'
PythonInR::pyExec(code = create_corpus)
# PythonInR::pyPrint(objName = 'len(corpus.dictionary)')

# Train the embeddings
train_glove <- '
glove_model = Glove(no_components = no_components, learning_rate = learning_rate)
glove_model.fit(matrix = corpus.matrix, epochs = epochs, no_threads = no_threads, verbose = False)
glove_model.add_dictionary(corpus.dictionary)
'
PythonInR::pyExec(code = train_glove)

# Pull the learned embeddings back into R
word_vectors <- PythonInR::pyGet(key = 'glove_model.word_vectors')
words <- sort(x = PythonInR::pyGet(key = 'corpus.dictionary')) + 1
rownames(word_vectors) <- names(words)

# Show part of the learned embeddings
# (first 5 of the 100 dimensions; rows [6,] through [100,] are omitted here)
> t(head(x = word_vectors, n = 10))
     Anarchism     is            a           political    philosophy   that         advocates    stateless    societies     often
[1,] -0.0197145532 -0.4361363946 -0.40416750 -0.193117726 -0.140611586 -0.237563057 -0.046598826  0.009139675 -0.0821713315 -0.151447989
[2,] -0.0051968756 -0.3356956869 -0.45635064 -0.220704961 -0.120949534 -0.398483653 -0.081941467 -0.020349749 -0.1421136639 -0.161386703
[3,] -0.0259156176 -0.4448075527 -0.34874114 -0.130795252 -0.083520086 -0.328155519 -0.076097158  0.004337605 -0.1461577022 -0.246513429
[4,] -0.0043913481 -0.5169732639 -0.70765749 -0.103548952 -0.145778789 -0.530804483 -0.116348320 -0.013401881 -0.1168139116 -0.254454518
[5,] -0.0305508373 -0.5764952948 -0.52620944 -0.027373519 -0.257815001 -0.097907021 -0.079508698 -0.073136353 -0.1206446645  0.032631675
(... rows [6,] through [100,] omitted ...)

# Top 10 words by cosine similarity to "king"
> dplyr::data_frame(
+   word = rownames(word_vectors),
+   similarity = as.numeric(
+     x = wordVectors::cosineSimilarity(
+       x = word_vectors,
+       y = word_vectors[c("king"), , drop = FALSE]
+     )
+   )
+ ) %>%
+   dplyr::arrange(dplyr::desc(similarity)) %>%
+   dplyr::top_n(wt = similarity, n = 10) %>%
+   t
           [,1]        [,2]        [,3]        [,4]        [,5]        [,6]        [,7]        [,8]        [,9]        [,10]
word       "king"      "emperor"   "McCreadie" "throne"    "ruler"     "Lord"      "kingdom"   "King"      "prince"    "Alexander"
similarity "1.0000000" "0.8208892" "0.8202088" "0.8145893" "0.8099106" "0.8015906" "0.7881976" "0.7855727" "0.7853441" "0.7782566"

# Top 10 words by cosine similarity to "king" - "man" + "woman"
## the results do not look that good
> dplyr::data_frame(
+   word = rownames(word_vectors),
+   similarity = as.numeric(
+     x = wordVectors::cosineSimilarity(
+       x = word_vectors,
+       y = word_vectors[c("king"), , drop = FALSE] -
+         word_vectors[c("man"), , drop = FALSE] +
+         word_vectors[c("woman"), , drop = FALSE]
+     )
+   )
+ ) %>%
+   dplyr::arrange(dplyr::desc(similarity)) %>%
+   dplyr::top_n(wt = similarity, n = 10) %>%
+   t
           [,1]        [,2]        [,3]        [,4]        [,5]        [,6]        [,7]        [,8]        [,9]        [,10]
word       "king"      "prince"    "emperor"   "monarch"   "ruler"     "Lord"      "throne"    "crown"     "bishop"    "McCreadie"
similarity "0.9225327" "0.7988402" "0.7718948" "0.7685985" "0.7591299" "0.7580483" "0.7569725" "0.7565388" "0.7469163" "0.7420729"
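The top-n pipeline above gets repeated, so it could be wrapped in a small helper. A sketch (top_glove_neighbors is my own name, not part of any package):

# Hypothetical helper: top-n neighbors by cosine similarity for a plain
# embedding matrix whose rownames are the words (mirrors nearest_to(),
# but reports similarities rather than distances)
top_glove_neighbors <- function (embeddings, query_vector, n = 10) {
  sims <- as.numeric(
    wordVectors::cosineSimilarity(x = embeddings, y = query_vector)
  )
  head(sort(setNames(sims, rownames(embeddings)), decreasing = TRUE), n = n)
}
# top_glove_neighbors(word_vectors, word_vectors["king", , drop = FALSE])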
word2vec (using {wordVectors})
word2vec_model <- wordVectors::train_word2vec(
  train_file = SET_CORPUS$FILE_NAME$WIKI,
  output_file = SET_WORD2VEC_PARAM$MODEL$OUTPUT_FILE_NAME,
  vectors = SET_WORD2VEC_PARAM$MODEL$VECTORS,
  window = SET_CORPUS$CORPUS_PARAM$WINDOW_SIZE,
  threads = SET_WORD2VEC_PARAM$TRAIN$THREADS
)

# Show part of the learned embeddings
> word2vec_model
A VectorSpaceModel object of 87343 words and 100 vectors
            V1        V2        V3        V4        V5        V6
</s>  0.004879  0.000446 -0.003817  0.004790 -0.003004 -0.004449
the  -0.072486  0.011465 -0.076258 -0.008318 -0.136823 -0.017355
of   -0.104772  0.075201 -0.013187 -0.154044 -0.036493 -0.062068
and  -0.005420  0.026938 -0.051364  0.003884 -0.038359 -0.049197
in    0.001223 -0.021385 -0.147751 -0.179655 -0.062868 -0.085900
to   -0.025559  0.004284  0.190833  0.016868 -0.187808 -0.127656
a    -0.033775  0.025930 -0.121103  0.023335 -0.124543 -0.032977
was   0.041974  0.092868 -0.185291 -0.081785 -0.112621 -0.116294
The  -0.002049  0.038936 -0.115069 -0.044791 -0.178706 -0.001728
is   -0.154961  0.158679 -0.139067  0.030145  0.051581 -0.080358

# Top 10 words closest to "king" (cosine)
> word2vec_model %>%
+   wordVectors::nearest_to(vector = word2vec_model[["king"]], n = 10)
        king       throne        ruler       vassal       prince     Bosporan      emperor         Uroš    Tiridates     claimant
1.110223e-16 1.635638e-01 1.977024e-01 2.077052e-01 2.127661e-01 2.217835e-01 2.257257e-01 2.303447e-01 2.322657e-01 2.367654e-01

# Top 10 words closest to "king" - "man" + "woman"
## again, the results do not look that good
> word2vec_model %>%
+   wordVectors::nearest_to(
+     vector = (word2vec_model[["king"]] - word2vec_model[["man"]] + word2vec_model[["woman"]]),
+     n = 10
+   )
     king    throne    prince    Kosala     ruler    vassal    Wuzong   emperor   kingdom   Emperor
0.1155837 0.2154374 0.2379058 0.2689162 0.2948246 0.3043600 0.3082114 0.3101102 0.3170385 0.3265235
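One caveat when reading these numbers next to the GloVe ones: nearest_to() appears to report cosine distances (the query word itself comes back as roughly 0), whereas the GloVe pipeline above reported similarities (king = 1.0). A small sketch to put the two on the same scale, assuming distance = 1 - similarity:

# If nearest_to() returns 1 - cosine similarity (the ~0 value for "king" itself
# suggests it does), the similarities can be recovered as:
1 - wordVectors::nearest_to(word2vec_model, vector = word2vec_model[["king"]], n = 10)
# king ~ 1.00, throne ~ 0.84, ruler ~ 0.80, ...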
Comparing on the 100 Knocks tasks
Apart from the name of the GloVe result object, the code is reused as-is (some parts could be factored out and shared, but I have not done so here).
Function definitions
# def function 8x. ----------------------------------------------------------
filterCosineSim <- function (seed_word_vector, target_word_vectors, extract_rownames = NULL) {
  word_vectors <- rbind(seed_word_vector, target_word_vectors)
  numerator <- crossprod(x = t(x = word_vectors))
  denominator <- diag(numerator)
  return((numerator / sqrt(outer(denominator, denominator)))[extract_rownames, ])
}

createArithmeticWordVector <- function (word_sence, def_arithmetic) {
  return(
    colSums(
      do.call(
        what = "rbind",
        args = lapply(
          X = names(def_arithmetic),
          FUN = function (each_arithmetic) {
            return(
              switch(EXPR = as.character(def_arithmetic[each_arithmetic]),
                "+" = + word_sence[each_arithmetic, ],
                "-" = - word_sence[each_arithmetic, ]
              )
            )
          }
        )
      )
    )
  )
}

# def function 92. ----------------------------------------------------------
createArithmticWordName <- function (target_word, set_arithmetic = c("1" = "-", "2" = "+", "3" = "+")) {
  word_arithmetic <- as.character(set_arithmetic)
  names(word_arithmetic) <- target_word[as.integer(names(set_arithmetic))]
  return(list(word_arithmetic))
}

fetchMulutiCosineSimilarity <- function (seed_word_vector, target_words_sence, seed_word_name, split_size) {
  # convert from sparse matrix to dense matrix
  seed_word_vector <- t(apply(X = seed_word_vector, MARGIN = 1, FUN = as.matrix))
  target_words_sence <- data.frame(
    t(apply(X = target_words_sence, MARGIN = 1, FUN = as.matrix)),
    stringsAsFactors = FALSE
  )
  target_words_sence$split <- sample.int(
    n = split_size, size = nrow(target_words_sence), replace = TRUE
  )
  fetch_cs <- lapply(
    X = split(
      x = target_words_sence[, !is.element(colnames(target_words_sence), "split")],
      f = target_words_sence$split
    ),
    FUN = function (target_sence) {
      # this part differs from fetchCosineSimilarity
      cosine_sim_res <- filterCosineSim(
        seed_word_vector = seed_word_vector,
        target_word_vectors = as.matrix(target_sence),
        extract_rownames = seed_word_name
      )
      cosine_sim_res <- cosine_sim_res[, !is.element(
        el = colnames(cosine_sim_res),
        set = rownames(seed_word_vector))
      ]
      return(
        na.omit(
          object = replace(x = cosine_sim_res, list = is.nan(cosine_sim_res), values = 0)
        )
      )
    }
  )
  fetch_cs <- do.call(what = "cbind", args = fetch_cs)
  return(fetch_cs)
}

applyAnalogy <- function (wordvec_analogy, wordvec_matrix,
                          apply_arithmetic_param = list(
                            arithmetic_pattern = c("1" = "-", "2" = "+", "3" = "+"),
                            extract_size = 10
                          ),
                          is_sort = TRUE) {
  create_wordvec_arithmtic_word_name <- apply(
    X = wordvec_analogy[, 1:3], MARGIN = 1, FUN = stringr::str_c, collapse = "_"
  )
  def_wordvec_arithmetic_lst <- do.call(
    what = "rbind",
    args = apply(
      X = wordvec_analogy, MARGIN = 1, FUN = createArithmticWordName,
      set_arithmetic = apply_arithmetic_param$arithmetic_pattern
    )
  )[, ]
  fetch_arithmetic_cs <- fetchMulutiCosineSimilarity(
    seed_word_vector = matrix(
      data = sapply(
        X = def_wordvec_arithmetic_lst,
        FUN = createArithmeticWordVector,
        word_sence = wordvec_matrix
      ),
      nrow = length(def_wordvec_arithmetic_lst),
      ncol = ncol(wordvec_matrix),
      byrow = TRUE,
      dimnames = list(create_wordvec_arithmtic_word_name, NULL)
    ),
    target_words_sence = wordvec_matrix,
    seed_word_name = create_wordvec_arithmtic_word_name,
    split_size = as.integer(nrow(wordvec_matrix) / 5000)
  )
  return(
    do.call(
      what = "rbind",
      args = lapply(
        X = seq(from = 1, to = length(def_wordvec_arithmetic_lst)),
        FUN = function (i) {
          fetched_arithmetic_vec <- fetch_arithmetic_cs[i, ]
          fetched_arithmetic_vec <- fetched_arithmetic_vec[
            setdiff(x = names(fetched_arithmetic_vec), names(def_wordvec_arithmetic_lst[[i]]))
          ]
          if (is_sort) {
            fetched_arithmetic_vec <- sort(x = fetched_arithmetic_vec, decreasing = TRUE)
          }
          fetched_arithmetic_vec <- fetched_arithmetic_vec[
            seq(from = 1, to = apply_arithmetic_param$extract_size)
          ]
          return(
            list(
              word = names(fetched_arithmetic_vec),
              similarity = as.numeric(fetched_arithmetic_vec)
            )
          )
        }
      )
    )
  )
}

splitWordVector <- function (target_lst) {
  return(
    stringr::str_split_fixed(
      string = stringr::str_c(
        mapply(target_lst$word, target_lst$similarity, FUN = stringr::str_c, sep = ":"),
        collapse = ":"
      ),
      pattern = ":",
      n = length(target_lst$word) * 2
    )
  )
}

# def function 94. ----------------------------------------------------------
extractSimi <- function (word_1, word_2, sim_mat) {
  if (is.element(el = word_1, set = colnames(x = sim_mat)) &
      is.element(el = word_2, set = colnames(x = sim_mat))) {
    return(
      dplyr::data_frame(word_1 = word_1, word_2 = word_2, similarity = sim_mat[word_1, word_2])
    )
  } else {
    return(
      dplyr::data_frame(word_1 = word_1, word_2 = word_2, similarity = 0)
    )
  }
}

extractWordVecSim <- function (target_words, word_sim_mat, word_sim_word) {
  # few enough words to compute all cosine similarities at once
  wordvec_sim <- as.matrix(
    x = filterCosineSim(
      seed_word_vector = word_sim_mat,
      target_word_vectors = NULL,
      extract_rownames = rownames(x = word_sim_mat)
    ) %>%
      replace(x = ., list = is.na(.), values = 0)
  )
  diag(x = wordvec_sim) <- 1
  # use the computed similarity matrix to look up each word pair's similarity
  return(
    dplyr::bind_rows(
      target_words %>%
        dplyr::rowwise(.) %>%
        dplyr::do(
          word2vec_sim = extractSimi(
            word_1 = .$word_1, word_2 = .$word_2, sim_mat = wordvec_sim
          )
        ) %>%
        .$word2vec_sim
    )
  )
}
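As a quick sanity check of filterCosineSim(), here is a toy example of mine with a hypothetical three-word, two-dimensional "embedding":

# Toy 2-d vectors: "a" and "b" are orthogonal, "c" sits between them
toy <- matrix(
  data = c(1, 0,
           0, 1,
           1, 1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a", "b", "c"), NULL)
)
filterCosineSim(
  seed_word_vector = toy["a", , drop = FALSE],
  target_word_vectors = toy[c("b", "c"), ],
  extract_rownames = "a"
)
# expected: a = 1 (itself), b = 0 (orthogonal), c = 1/sqrt(2), roughly 0.707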
Constant definitions
# Files used by the exercises
SET_EVAL <- list(
  # word analogy evaluation data
  ANALOGY = "https://word2vec.googlecode.com/svn/trunk/questions-words.txt",
  # the file used from the WordSimilarity-353 Test Collection evaluation data
  SIMILARITY_COMBINED = "combined.csv"
)
# def const 91. ---------------------------------------------------------------
SET_EXTRACT_PATTERN <- list(
  SECTION_START = "^:",
  TARGET_SECTION = "family"
)
# def const 92. ---------------------------------------------------------------
SET_APPLY_ARITHMETIC <- list(
  ARITHMETIC_PATTERN = c("1" = "-", "2" = "+", "3" = "+"),
  EXTRACT_SIZE = 10,
  WRITE_TOP_SIM = 1
)
# def const 93. ---------------------------------------------------------------
SET_ANALOGY_COL_PROF <- list(
  TRUE_COL = 4,
  SELECT_COL = 5
)
Comparison
# 91. -------------------------------------------------------------------------
# all lines, tagged with a section ID
# (assumes questions-words.txt has already been downloaded to the working directory)
read_analogy <- dplyr::data_frame(
  text = readr::read_lines(file = basename(path = SET_EVAL$ANALOGY), n_max = -1)
) %>%
  dplyr::mutate(
    section_id = cumsum(
      x = stringr::str_detect(string = .$text, pattern = SET_EXTRACT_PATTERN$SECTION_START)
    )
  )
# keep only the needed section
analogy_eval_word <- read_analogy %>%
  dplyr::filter(
    is.element(
      el = .$section_id,
      set = read_analogy %>%
        dplyr::filter(
          stringr::str_detect(string = .$text, pattern = SET_EXTRACT_PATTERN$TARGET_SECTION)
        ) %>%
        .$section_id
    )
  ) %>%
  .$text
analogy_eval_word <- analogy_eval_word[-1]

# 92. -------------------------------------------------------------------------
analogy_eval_word_mat <- stringr::str_split_fixed(
  string = analogy_eval_word, pattern = "[:space:]", n = 4
)
create_arithmtic_word_name <- apply(
  X = analogy_eval_word_mat[, 1:3], MARGIN = 1, FUN = stringr::str_c, collapse = "_"
)
def_arithmetic_lst <- do.call(
  what = "rbind",
  args = apply(
    X = analogy_eval_word_mat, MARGIN = 1, FUN = createArithmticWordName,
    set_arithmetic = SET_APPLY_ARITHMETIC$ARITHMETIC_PATTERN
  )
)[, ]
include_analogy_word <- unique(as.character(analogy_eval_word_mat[, 1:3]))

# word2vec
> setdiff(x = include_analogy_word, y = rownames(word2vec_model))
[1] "grandpa"     "stepbrother" "grandma"     "policewoman" "stepsister"

wordvec_analogy_eval <- analogy_eval_word_mat[!apply(
  X = !apply(
    X = analogy_eval_word_mat, MARGIN = 1, FUN = is.element, set = rownames(word2vec_model)
  ),
  MARGIN = 2, FUN = any
), ]
# now every word is present
> setdiff(x = unique(as.character(wordvec_analogy_eval[, 1:3])), y = rownames(word2vec_model))
character(0)

wordvec_arithmetic_top_n <- applyAnalogy(
  wordvec_analogy = wordvec_analogy_eval,
  wordvec_matrix = word2vec_model,
  apply_arithmetic_param = list(
    arithmetic_pattern = SET_APPLY_ARITHMETIC$ARITHMETIC_PATTERN,
    extract_size = SET_APPLY_ARITHMETIC$EXTRACT_SIZE
  ),
  is_sort = TRUE
)
# keep only what the task needs
analogy_append_wordvec_res <- cbind(
  wordvec_analogy_eval,
  t(apply(
    X = wordvec_arithmetic_top_n, MARGIN = 1, FUN = splitWordVector
  ))[, seq(from = 1, to = SET_APPLY_ARITHMETIC$WRITE_TOP_SIM * 2)]
)

# GloVe
> setdiff(x = include_analogy_word, y = rownames(word_vectors))
[1] "grandpa"

glove_analogy_eval <- analogy_eval_word_mat[!apply(
  X = !apply(
    X = analogy_eval_word_mat, MARGIN = 1, FUN = is.element, set = rownames(word_vectors)
  ),
  MARGIN = 2, FUN = any
), ]
# now every word is present
> setdiff(x = unique(as.character(glove_analogy_eval[, 1:3])), y = rownames(word_vectors))
character(0)

glove_arithmetic_top_n <- applyAnalogy(
  wordvec_analogy = glove_analogy_eval,
  wordvec_matrix = word_vectors,
  apply_arithmetic_param = list(
    arithmetic_pattern = SET_APPLY_ARITHMETIC$ARITHMETIC_PATTERN,
    extract_size = SET_APPLY_ARITHMETIC$EXTRACT_SIZE
  ),
  is_sort = TRUE
)
# keep only what the task needs
analogy_append_glove_res <- cbind(
  glove_analogy_eval,
  t(apply(
    X = glove_arithmetic_top_n, MARGIN = 1, FUN = splitWordVector
  ))[, seq(from = 1, to = SET_APPLY_ARITHMETIC$WRITE_TOP_SIM * 2)]
)

# 93. -------------------------------------------------------------------------
# word2vec
> sum(
+   analogy_append_wordvec_res[, SET_ANALOGY_COL_PROF$TRUE_COL] ==
+     analogy_append_wordvec_res[, SET_ANALOGY_COL_PROF$SELECT_COL]
+ ) / nrow(analogy_eval_word_mat)
[1] 0.2924901
> nrow(analogy_append_wordvec_res)
[1] 380
# GloVe
> sum(
+   analogy_append_glove_res[, SET_ANALOGY_COL_PROF$TRUE_COL] ==
+     analogy_append_glove_res[, SET_ANALOGY_COL_PROF$SELECT_COL]
+ ) / nrow(analogy_eval_word_mat)
[1] 0.2924901
> nrow(analogy_append_glove_res)
[1] 462

# 94. -------------------------------------------------------------------------
# read the gold-standard word-pair similarity data
read_wordsim <- readr::read_csv(
  file = SET_EVAL$SIMILARITY_COMBINED,
  n_max = -1, skip = 1,
  col_names = c("word_1", "word_2", "similarity_score")
)
similarity_word <- unique(as.character(unlist(read_wordsim[, 1:2])))

# word2vec
word2vec_sim_word <- rownames(word2vec_model)
word2vec_sim <- dplyr::left_join(
  x = read_wordsim,
  y = extractWordVecSim(
    target_words = read_wordsim,
    word_sim_mat = word2vec_model[
      is.element(el = word2vec_sim_word, set = similarity_word),
    ],
    word_sim_word = word2vec_sim_word
  ),
  by = c("word_1" = "word_1", "word_2" = "word_2")
)
> head(x = word2vec_sim, n = 15)
Source: local data frame [15 x 4]

      word_1        word_2 similarity_score similarity
1       love           sex             6.77  0.3925980
2      tiger           cat             7.35  0.5601799
3      tiger         tiger            10.00  1.0000000
4       book         paper             7.46  0.3963225
5   computer      keyboard             7.62  0.4075819
6   computer      internet             7.58  0.6018095
7      plane           car             5.77  0.4098035
8      train           car             6.31  0.4230066
9  telephone communication             7.50  0.4230575
10 television         radio             6.77  0.6657252
11      media         radio             7.42  0.4370717
12       drug         abuse             6.85  0.5924038
13      bread        butter             6.19  0.7688245
14   cucumber        potato             5.92  0.8931500
15     doctor         nurse             7.00  0.5380360

# GloVe
glove_sim_word <- rownames(word_vectors)
glove_sim <- dplyr::left_join(
  x = read_wordsim,
  y = extractWordVecSim(
    target_words = read_wordsim,
    word_sim_mat = word_vectors[
      is.element(el = glove_sim_word, set = similarity_word),
    ],
    word_sim_word = glove_sim_word
  ),
  by = c("word_1" = "word_1", "word_2" = "word_2")
)
> head(x = glove_sim, n = 15)
Source: local data frame [15 x 4]

      word_1        word_2 similarity_score similarity
1       love           sex             6.77  0.6236812
2      tiger           cat             7.35  0.7486611
3      tiger         tiger            10.00  1.0000000
4       book         paper             7.46  0.7486611
5   computer      keyboard             7.62  0.6654303
6   computer      internet             7.58  0.7072629
7      plane           car             5.77  0.7837656
8      train           car             6.31  0.7281016
9  telephone communication             7.50  0.6292834
10 television         radio             6.77  0.9021711
11      media         radio             7.42  0.6435035
12       drug         abuse             6.85  0.7523819
13      bread        butter             6.19  0.8421186
14   cucumber        potato             5.92  0.3579985
15     doctor         nurse             7.00  0.6959713

# 95. -------------------------------------------------------------------------
> word2vec_sim %>%
+   dplyr::select(similarity_score, similarity) %>%
+   cor(method = "spearman")
                 similarity_score similarity
similarity_score        1.0000000  0.5963557
similarity              0.5963557  1.0000000
> glove_sim %>%
+   dplyr::select(similarity_score, similarity) %>%
+   cor(method = "spearman")
                 similarity_score similarity
similarity_score        1.0000000  0.3810665
similarity              0.3810665  1.0000000
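For reference, Spearman's rank correlation is simply the Pearson correlation of the rank-transformed values, so the word2vec figure can be reproduced by hand (a minimal check):

# Spearman correlation = Pearson correlation computed on ranks
cor(
  x = rank(word2vec_sim$similarity_score),
  y = rank(word2vec_sim$similarity)
)
# should reproduce the 0.5963557 reported above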
The two tied on the analogy task (accuracy 0.292 for both), while on the similarity task word2vec clearly beat GloVe (Spearman 0.596 vs. 0.381).
Summary
I ran {wordVectors}, the package that brings word2vec to R, tried out its functions, and compared it against GloVe called via {PythonInR}. A few things still need more investigation, such as how project/reject relate to the analogy arithmetic and how indexing with multiple words behaves (apparently averaging, as checked above), but the package is plenty for experimenting with word2vec. It currently supports only word2vec, but other vector models (for example, GloVe) look set to be supported as well, so I am looking forward to that.
I also compared the accuracy of word2vec and GloVe on the 100 Knocks exercises, but this was very much a first pass; I did not do any serious parameter tuning or data preprocessing. Different choices there could well change the results, so I leave that as future work.
My one regret is that if the package had been released a little earlier, I could have used it for the 100 Knocks.
References
- http://nlp.stanford.edu/projects/glove/glove.pdf
- word2vecよりも高性能らしいGloVeを触ってみた ("Trying out GloVe, said to outperform word2vec", in Japanese) - 鴨川にあこがれる日々
- Word Embedding using GloVe - のんびりしているエンジニアの日記
- Getting Started with Word2Vec and GloVe in Python | Text Mining Online | Text Analysis Online | Text Processing Online
Execution environment
> devtools::session_info()
Session info ------------------------------------------------------------------
 setting  value
 version  R version 3.2.2 (2015-08-14)
 system   x86_64, darwin13.4.0
 ui       RStudio (0.99.486)
 language (EN)
 collate  ja_JP.UTF-8
 tz       Asia/Tokyo

Packages ----------------------------------------------------------------------
 package     * version     date       source
 assertthat  * 0.1         2013-12-06 CRAN (R 3.2.0)
 colorspace    1.2-6       2015-03-11 CRAN (R 3.2.0)
 crayon        1.3.0       2015-06-05 CRAN (R 3.2.1)
 curl          0.9         2015-06-19 CRAN (R 3.2.0)
 DBI           0.3.1       2014-09-24 CRAN (R 3.2.0)
 devtools    * 1.8.0       2015-05-09 CRAN (R 3.2.0)
 digest        0.6.8       2014-12-31 CRAN (R 3.2.0)
 dplyr       * 0.4.2.9002  2015-07-25 Github (hadley/dplyr@75e8303)
 ggplot2     * 1.0.1       2015-03-17 CRAN (R 3.2.0)
 git2r         0.10.1      2015-05-07 CRAN (R 3.2.0)
 gtable        0.1.2       2012-12-05 CRAN (R 3.2.0)
 hadleyverse * 0.1         2015-08-09 Github (aaboyles/hadleyverse@16532fe)
 haven       * 0.2.0       2015-04-09 CRAN (R 3.2.0)
 lazyeval      0.1.10.9000 2015-07-25 Github (hadley/lazyeval@ecb8dc0)
 lubridate   * 1.3.3       2013-12-31 CRAN (R 3.2.0)
 magrittr    * 1.5         2014-11-22 CRAN (R 3.2.0)
 MASS          7.3-43      2015-07-16 CRAN (R 3.2.2)
 memoise       0.2.1       2014-04-22 CRAN (R 3.2.0)
 munsell       0.4.2       2013-07-11 CRAN (R 3.2.0)
 pack          0.1-1       2015-04-21 local
 plyr        * 1.8.3       2015-06-12 CRAN (R 3.2.0)
 proto         0.3-10      2012-12-22 CRAN (R 3.2.0)
 PythonInR   * 0.1-1       2015-09-19 CRAN (R 3.2.0)
 R6            2.1.1       2015-08-19 CRAN (R 3.2.0)
 Rcpp          0.12.0      2015-07-26 Github (RcppCore/Rcpp@6ae91cc)
 readr       * 0.1.1.9000  2015-07-25 Github (hadley/readr@f4a3956)
 readxl      * 0.1.0       2015-04-14 CRAN (R 3.2.0)
 reshape2      1.4.1       2014-12-06 CRAN (R 3.2.0)
 rstudioapi    0.3.1       2015-04-07 CRAN (R 3.2.0)
 rversions     1.0.1       2015-06-06 CRAN (R 3.2.0)
 scales        0.2.5       2015-06-12 CRAN (R 3.2.0)
 stringi       0.5-5       2015-06-29 CRAN (R 3.2.0)
 stringr     * 1.0.0.9000  2015-07-25 Github (hadley/stringr@380c88f)
 testthat    * 0.10.0      2015-05-22 CRAN (R 3.2.0)
 tidyr       * 0.2.0.9000  2015-07-25 Github (hadley/tidyr@0dc87b2)
 tsne        * 0.1-2       2012-05-02 CRAN (R 3.2.0)
 wordVectors * 1.0         2015-10-29 Github (bmschmidt/wordVectors@cfd14a5)
 xml2        * 0.1.1       2015-06-02 CRAN (R 3.2.0)