Word2Vec pre-trained models for the KCC corpus (Korean language)


[Demo] http://nlp.kookmin.ac.kr/kcc/word2vec/demo

1. Testing the Korean Word2Vec/FastText pre-trained models in Python ("wv_KMA_tokens_test.py" in KCC_KMA_Word2Vec.zip)

# Download FastText-KCC150.zip -- the "FastText pre-trained model for KCC150",
# one of the pre-trained models at http://nlp.kookmin.ac.kr/kcc/word2vec
# Install Python and the 'gensim' library, then run:
# C> pip install gensim
# C> python

from gensim.models import Word2Vec

model_name = "FastText-KCC150.model"  # contained in "FastText-KCC150.zip"
model = Word2Vec.load(model_name)

# Word-vector arithmetic: subtract (-) one single concept from a compound
# concept, then add (+) another single concept
# compound concept('여배우' actress) - single concept('여자' woman) + single concept('남자' man)
# Ex1) (여배우[actress] + 남자[man]) - 여자[woman] = ?
print(model.wv.most_similar(positive=[u'여배우', u'남자'], negative=[u'여자'], topn=10))
# (여왕[queen] + 남자[man]) - 여자[woman] = ?
print(model.wv.most_similar(positive=[u'여왕', u'남자'], negative=[u'여자'], topn=10))

# Ex2) (서울[Seoul] - 대한민국[South Korea]) + 일본[Japan] = ?
print(model.wv.most_similar(positive=[u'서울', u'일본'], negative=[u'대한민국'], topn=10))

# Example: print the topn most similar words
model.wv.most_similar('보수', topn=20)      # 보수: conservative
model.wv.most_similar('진보', topn=20)      # 진보: progressive

model.wv.most_similar('혐오', topn=20)      # 혐오: hatred
model.wv.most_similar('성소수자', topn=20)  # 성소수자: sexual minority
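
For reference, gensim's KeyedVectors API also offers pairwise cosine similarity and an odd-one-out query; the word choices below are illustrative:

# Cosine similarity between two word vectors
print(model.wv.similarity('보수', '진보'))

# Pick the word that fits least with the others
print(model.wv.doesnt_match(['서울', '일본', '대한민국']))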

2. [Download] Korean FastText/Word2Vec models -- choose one of the .zip files below

FastText pre-trained models for the KCC corpus (Korean language)
  FastText pre-trained model for KCC150 (recommended)
  FastText pre-trained model for KCC460

Word2Vec pre-trained models for the KCC corpus (Korean language)
  Word2Vec pre-trained model for KCC150 (recommended)
  Word2Vec pre-trained model for KCCq28
  Word2Vec pre-trained model for KCC940
  Word2Vec pre-trained model for KCC460

  Word2Vec pre-trained model for KCC150+q28+940
  Word2Vec pre-trained model for KCC150+q28

  Word2Vec pre-trained model for KCC460+150+940+q28
  Word2Vec pre-trained model for KCC460+150+940
  Word2Vec pre-trained model for KCC460+150

  Word2Vec pre-trained model for KCCq28+150+940+460
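
A practical note on choosing between the two families: FastText builds word vectors from character n-grams, so it can produce a vector even for a word missing from the training vocabulary, which plain Word2Vec cannot. A minimal sketch, assuming the FastText-KCC150.model file from section 1 and loading it via gensim's FastText class:

from gensim.models import FastText

model = FastText.load("FastText-KCC150.model")

# An out-of-vocabulary token still gets a vector, composed from its
# subword n-grams (the token below is made up for illustration)
oov = '미등록단어'
print(model.wv[oov][:5])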


3. Python sources for Word2Vec/FastText training

KCC_KMA_Word2Vec.zip -- Word2Vec training for a "KMA tokenized Korean corpus".
  wv_KMA_tokens_train.py -- trains on one file (see the sketch after this list)
  wv_KMA_tokens_train_ADD.py -- trains on two or more files
  wv_KMA_tokens_test.py -- loads a pre-trained model and runs the tests
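
For orientation, here is a minimal sketch of the kind of single-file training that "wv_KMA_tokens_train.py" performs; the corpus file name and hyperparameters are illustrative assumptions (gensim 4.x API), not the script's actual settings:

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One sentence per line, tokens separated by whitespace (KMA-tokenized, UTF-8)
sentences = LineSentence("KCC150_tokenized.txt")   # hypothetical file name

# Hyperparameters here are placeholders, not the script's real settings
model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
model.save("Word2Vec-KCC150.model")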

KCC_KMA_FastText_doc2vec.zip -- FastText/Doc2Vec training for a "KMA tokenized Korean corpus".
  FastText_Train.py -- trains on one file
  FastText_Train_ADD.py -- trains on two or more files (incremental pattern sketched below)
  doc2vec.py -- model training & test
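
The *_ADD scripts train on two or more files. In gensim this is typically done by updating the vocabulary and then continuing training; a hedged sketch of that pattern (file names and parameters are again illustrative):

from gensim.models import FastText
from gensim.models.word2vec import LineSentence

# Train on the first tokenized file
first = LineSentence("KCC150_tokenized.txt")
model = FastText(first, vector_size=100, window=5, min_count=5, workers=4)

# Continue on a second file: grow the vocabulary, then train again
second = LineSentence("KCC940_tokenized.txt")
model.build_vocab(second, update=True)
model.train(second, total_examples=model.corpus_count, epochs=model.epochs)

model.save("FastText-KCC150+940.model")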


4. Korean tokenized raw corpus for Word2Vec/FastText training

Word2Vec training is performed by "wv_KMA_tokens_train.py" on a "Korean tokenized raw corpus".
--> The "Korean tokenized raw corpus" is tokenized by the KLT2000 Korean morphological analyzer.
--> Download one of the "Korean tokenized raw corpus" files below for self-training
    (most are EUC-KR encoded; see the conversion sketch after the list).

  KMA tokenized KCC corpus for Word Embedding: KCC150 ("EUCKR" encoded file)
  KMA tokenized KCC corpus for Word Embedding: KCC940 ("EUCKR" encoded file)
  KMA tokenized KCC corpus for Word Embedding: KCCq28 ("EUCKR" encoded file)
  KMA tokenized KCC corpus for Word Embedding: KCC460 ("EUCKR" encoded file)
  --> KMA tokenized KCC corpus for Word Embedding: KCC460 ("UTF8" encoded file)
  Korean Wiki Text -- ko_wiki_text.zip
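
Most of the corpora above are EUC-KR encoded, while gensim's LineSentence expects UTF-8 input, so an EUC-KR file needs converting first; a minimal sketch (file names are placeholders):

# Convert an EUC-KR tokenized corpus to UTF-8
# (Python's cp949 codec is a superset of EUC-KR)
with open("KCC150_euckr.txt", encoding="cp949") as src, \
     open("KCC150_utf8.txt", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line)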

The files above were created automatically by the KLT2000 Korean morphological analyzer; see below for details.
--> You can download the KCC (Korean raw corpus) at http://nlp.kookmin.ac.kr/kcc

  C> index2018.exe -c input.txt output.txt

  https://cafe.naver.com/nlpkang/3
  https://cafe.naver.com/nlpk/278