Word2Vec pre-training models for KCC corpus (Korean language)
1. Testing Korean Word2Vec/FastText pre-training models in Python ("wv_KMA_tokens_test.py" in KCC_KMA_Word2Vec.zip)
# Download FastText-KCC150.zip -- the "FastText pre-training model for KCC150",
# one of the pre-training models at http://nlp.kookmin.ac.kr/kcc/word2vec
# Install Python and the 'gensim' library, then run:
# C> pip install gensim
# C> python
from gensim.models import Word2Vec
model_name = "FastText-KCC150.model" # included in "FastText-KCC150.zip"
model = Word2Vec.load(model_name)
# Word-vector arithmetic: remove (-) a single concept from a compound concept,
# then add (+) another single concept
# compound concept ('actress') - single concept ('woman') + single concept ('man')
# Ex1) (actress + man) - woman = ?
print(model.wv.most_similar(positive=[u'여배우', u'남자'], negative=[u'여자'], topn=10))  # (actress + man) - woman
print(model.wv.most_similar(positive=[u'여왕', u'남자'], negative=[u'여자'], topn=10))   # (queen + man) - woman
# Ex2) (Seoul - South Korea) + Japan = ?
print(model.wv.most_similar(positive=[u'서울', u'일본'], negative=[u'대한민국'], topn=10))
# Example: print the topn most similar words
model.wv.most_similar('보수', topn=20)      # 'conservative'
model.wv.most_similar('진보', topn=20)      # 'progressive'
model.wv.most_similar('혐오', topn=20)      # 'hate'
model.wv.most_similar('성소수자', topn=20)  # 'sexual minority'
2. [Download] Korean FastText/Word2Vec models -- one of the .zip files below
FastText pre-training models for KCC corpus (Korean language)
FastText pre-training model for KCC150 (recommended)
FastText pre-training model for KCC460
Word2Vec pre-training models for KCC corpus (Korean language)
Word2Vec pre-training model for KCC150 (recommended)
Word2Vec pre-training model for KCCq28
Word2Vec pre-training model for KCC940
Word2Vec pre-training model for KCC460
Word2Vec pre-training model for KCC150+q28+940
Word2Vec pre-training model for KCC150+q28
Word2Vec pre-training model for KCC460+150+940+q28
Word2Vec pre-training model for KCC460+150+940
Word2Vec pre-training model for KCC460+150
Word2Vec pre-training model for KCCq28+150+940+460
3. Python sources for Word2Vec/FastText training
KCC_KMA_Word2Vec.zip -- word2vec training for "KMA tokenized Korean corpus".
wv_KMA_tokens_train.py -- training one file
wv_KMA_tokens_train_ADD.py -- training two or more files
wv_KMA_tokens_test.py -- load pre-trained model & test
KCC_KMA_FastText_doc2vec.zip -- FastText/Doc2Vec training for "KMA tokenized Korean corpus".
FastText_Train.py -- training one file
FastText_Train_ADD.py -- training two or more files
doc2vec.py -- model training & test
4. Korean tokenized raw corpus for Word2Vec/FastText training
Word2Vec training is performed by "wv_KMA_tokens_train.py" on a "Korean tokenized raw corpus".
--> A "Korean tokenized raw corpus" is a corpus tokenized by the KLT2000 Korean morphological analyzer.
--> Download one of the "Korean tokenized raw corpus" files below for self-training
KMA tokenized KCC corpus for Word Embedding: KCC150 ("EUCKR" encoded file)
KMA tokenized KCC corpus for Word Embedding: KCC940 ("EUCKR" encoded file)
KMA tokenized KCC corpus for Word Embedding: KCCq28 ("EUCKR" encoded file)
KMA tokenized KCC corpus for Word Embedding: KCC460 ("EUCKR" encoded file)
--> KMA tokenized KCC corpus for Word Embedding: KCC460 ("UTF8" encoded file)
Korean Wiki Text -- ko_wiki_text.zip
The files above are created automatically by the KLT2000 Korean morphological analyzer. See below for details.
--> You can download "KCC(Korean raw corpus)" at http://nlp.kookmin.ac.kr/kcc
C> index2018.exe -c input.txt output.txt
https://cafe.naver.com/nlpkang/3
https://cafe.naver.com/nlpk/278
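Note that gensim's LineSentence decodes input as UTF-8, so the EUC-KR encoded corpora above need a small adapter before self-training. Two options are sketched below; the file name in the usage comment is a placeholder:

```python
# Option 1: one-time conversion of an EUC-KR corpus file to UTF-8
def convert_euckr_to_utf8(src, dst):
    with open(src, encoding="euc-kr", errors="replace") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(line)

# Option 2: stream EUC-KR lines directly as token lists, bypassing LineSentence
class EucKrCorpus:
    def __init__(self, path):
        self.path = path

    def __iter__(self):
        with open(self.path, encoding="euc-kr", errors="replace") as f:
            for line in f:
                yield line.split()

# Usage sketch ("KCC150_tokenized.txt" is a placeholder name):
# from gensim.models import Word2Vec
# model = Word2Vec(EucKrCorpus("KCC150_tokenized.txt"), vector_size=100, min_count=5)
```

The UTF-8 KCC460 download above needs neither step and can be fed to LineSentence directly.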