=================================================================================
KCC150, KCCq28, KCC940 -- Korean Contemporary Corpus of Written Sentences
Total 732 million words (48,878,948 sentences)
=================================================================================
KCC (Korean Contemporary Corpus) -- raw sentences of the Korean language

1) KCC150 -- 150,705,457 words (11,961,347 sentences)
   Sentences with quotes are not included.
2) KCCq28 -- 28,782,776 words (1,337,721 sentences with double quotes)
   Every sentence includes a quote.
3) KCC940 -- 93,210,332 words (6,263,454 sentences)
   No sentence is longer than 30 words.
4) KCC460 -- about 460 million words (29,316,426 sentences)
   No sentence is longer than 30 words.
5) KCC text corpus for Word2Vec word embedding of the above KCC corpus.
   [Download] Korean word embedding model

Seung-Shik Kang, Ph.D
Professor at Kookmin University
Email: nlpkang AT g m a i l . c o m
=================================================================================
Download one of the following (UTF8 encoding).
- KCC150_Korean_sentences_UTF8.txt.gz -- "UTF8" encoded file
- KCCq28_Korean_sentences_UTF8_v2.zip -- "UTF8" encoded file (0xA1A1 -> 0x20)
- KCC940_Korean_sentences_UTF8_V2.txt.gz -- "UTF8" encoded file (merge Grapheme to word)
- KCC460 -- no "UTF8" encoded file. Download KCC460_EUCKR.txt.gz and convert it to UTF8 with iconv.
- http://203.246.112.71/sskang/kcc/KCC460.txt.gz -- EUCKR encoding
Download iconv --> $ iconv -c -f cp949 -t utf-8 KCC460_EUCKR.txt > KCC460_utf8.txt
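Where iconv is not at hand, the same EUCKR-to-UTF8 conversion can be sketched in Python. This is only a minimal sketch under stated assumptions: the function name transcode_gz_to_utf8 and the output file name are illustrative and not part of the corpus distribution, and the input is assumed to be the gzip-compressed EUCKR file listed above. It decodes with Python's cp949 codec (a superset of EUC-KR) and silently drops undecodable bytes, which is roughly what iconv's -c flag does.

```python
import gzip


def transcode_gz_to_utf8(src_gz, dst_txt, src_encoding="cp949"):
    """Stream a gzip-compressed EUC-KR/CP949 text file into a UTF-8 text file.

    Hypothetical helper, not shipped with the corpus.
    Bytes that cannot be decoded are dropped (errors="ignore").
    Returns the number of lines written.
    """
    count = 0
    # newline="" keeps the original line endings untouched on both sides.
    with gzip.open(src_gz, "rt", encoding=src_encoding,
                   errors="ignore", newline="") as src, \
            open(dst_txt, "w", encoding="utf-8", newline="") as dst:
        for line in src:  # one sentence per line, so memory use stays small
            dst.write(line)
            count += 1
    return count
```

Calling transcode_gz_to_utf8("KCC460_EUCKR.txt.gz", "KCC460_utf8.txt") would write the converted text and return the line count, which can be compared with the sentence count stated for KCC460.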
Download one of the following text files for word embedding (Word2Vec) --> EUCKR encoded files
These files are created automatically by the KLT2000 Korean morphological analyzer. See below for the details.
C> index2018.exe -c test.txt output.txt
https://cafe.naver.com/nlpkang/3
https://cafe.naver.com/nlpk/278
- KCC150_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
================================================================================
In case the large files are difficult to download because of download problems...
================================================================================
1) KCC150 -- 150 million words, split for download
   The 11,961,347 sentences (150 million words) of KCC150 are split into 12 files of 1 million sentences (lines) each.
   Concatenating the 12 files below in order gives the same content as KCC150_Korean_sentences_UTF8.txt (except that the encoding is EUCKR).
- KCC150_K01.txt.gz
- KCC150_K02.txt.gz
- KCC150_K03.txt.gz
- KCC150_K04.txt.gz
- KCC150_K05.txt.gz
- KCC150_K06.txt.gz
- KCC150_K07.txt.gz
- KCC150_K08.txt.gz
- KCC150_K09.txt.gz
- KCC150_K10.txt.gz
- KCC150_K11.txt.gz
- KCC150_K12.txt.gz
2) KCCq28 -- 28.78 million words, split for download
   The 1,331,721 sentences containing double quotes (28.78 million words) are split into 7 files of 200,000 sentences (lines) each.
   Concatenating the 7 files below in order gives the same content as KCCq28_Korean_sentences_UTF8.txt (except that the encoding is EUCKR).
- KCCq28_Q01.txt.gz
- KCCq28_Q02.txt.gz
- KCCq28_Q03.txt.gz
- KCCq28_Q04.txt.gz
- KCCq28_Q05.txt.gz
- KCCq28_Q06.txt.gz
- KCCq28_Q07.txt.gz
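
Joining the split files in order can be sketched in Python as follows. This is a rough sketch, not an official tool: the function name merge_parts_to_utf8 and the output file name are hypothetical, each part is assumed to be a gzip-compressed EUCKR text file that ends with a newline, and the output is written as UTF8 rather than EUCKR for convenience. Sorting by file name is enough to restore the order because the part numbers are zero-padded (K01 ... K12, Q01 ... Q07).

```python
import gzip


def merge_parts_to_utf8(part_paths, dst_txt, src_encoding="cp949"):
    """Concatenate gzip-compressed EUC-KR parts into a single UTF-8 file.

    Hypothetical helper, not shipped with the corpus.
    Parts are processed in sorted file-name order.
    Returns the total number of lines written.
    """
    total = 0
    with open(dst_txt, "w", encoding="utf-8", newline="") as dst:
        for path in sorted(part_paths):  # zero-padded names sort correctly
            with gzip.open(path, "rt", encoding=src_encoding,
                           errors="ignore", newline="") as src:
                for line in src:
                    dst.write(line)
                    total += 1
    return total
```

For example, merge_parts_to_utf8(glob.glob("KCC150_K*.txt.gz"), "KCC150_merged_utf8.txt") (with glob imported from the standard library) would join the 12 KCC150 parts; the returned line count can then be checked against the sentence count given for the corpus.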