KCC150, KCCq28, KCC940 -- Korean Contemporary Corpus of Written Sentences



=================================================================================
   KCC150, KCCq28, KCC940 -- Korean Contemporary Corpus of Written Sentences
   Total 732 million words (48,878,948 sentences)
=================================================================================

KCC (Korean Contemporary Corpus) -- raw sentences of the Korean language
1) KCC150 -- 150,705,457 words (11,961,347 sentences)
   Sentences with quotes are not included.
   
2) KCCq28 -- 28,782,776 words (1,337,721 sentences with double quotes)
   Every sentence includes a quote.

3) KCC940 -- 93,210,332 words (6,263,454 sentences)
   Every sentence is at most 30 words long.
   
4) KCC460 -- about 460 million words (29,316,426 sentences)
   Every sentence is at most 30 words long.

5) KCC text corpus for Word2Vec word embedding, built from the KCC corpora above.
   [Download] Korean word embedding model
   
Seung-Shik Kang, Ph.D.
Professor at Kookmin University
Email: nlpkang AT g m a i l . c o m
=================================================================================

Download one of the following files (UTF8 encoding).
- KCC150_Korean_sentences_UTF8.txt.gz -- "UTF8" encoded file
- KCCq28_Korean_sentences_UTF8_v2.zip -- "UTF8" encoded file (full-width space 0xA1A1 replaced with ASCII space 0x20)
- KCC940_Korean_sentences_UTF8_V2.txt.gz -- "UTF8" encoded file (graphemes merged into words)
- KCC460 -- no "UTF8" encoded file is provided. Download KCC460_EUCKR.txt.gz and convert it to UTF8 with iconv.
- http://203.246.112.71/sskang/kcc/KCC460.txt.gz -- EUCKR encoding
Download iconv and run it on the uncompressed file --> $ iconv -c -f cp949 -t utf-8 KCC460_EUCKR.txt > KCC460_utf8.txt
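
Where iconv is not available, the same conversion can be done in Python. The following is a minimal sketch, not part of the KCC distribution: the function name transcode_gz_to_utf8 is hypothetical, and it assumes the downloaded .gz file holds CP949/EUCKR text. Undecodable bytes are dropped, mirroring the -c option of iconv.

```python
import gzip


def transcode_gz_to_utf8(src_gz, dst_txt, src_encoding="cp949"):
    """Stream a gzip-compressed EUCKR/CP949 text file into a UTF-8 text file.

    Returns the number of lines written. The file is processed line by line,
    so the whole corpus never has to fit in memory.
    """
    lines = 0
    # errors="ignore" drops bytes that cannot be decoded (like `iconv -c`).
    # newline="" keeps the original line endings untouched.
    with gzip.open(src_gz, "rt", encoding=src_encoding,
                   errors="ignore", newline="") as fin, \
            open(dst_txt, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
            lines += 1
    return lines
```

For example, transcode_gz_to_utf8("KCC460_EUCKR.txt.gz", "KCC460_utf8.txt") would read the compressed download directly, so no separate unpacking step is needed.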

Download one of the following text files for word embedding (Word2Vec) --> EUCKR encoded files
These files were created automatically by the KLT2000 Korean morphological analyzer. See below for the details.

  C> index2018.exe -c test.txt output.txt
  https://cafe.naver.com/nlpkang/3 
  https://cafe.naver.com/nlpk/278 
- KCC150_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
- KCC940_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
- KCCq28_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
- KCC460_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
--> KCC460_Korean_sentences for Word2Vec embedding ("UTF8" encoded file)
- Korean Wiki Text -- ko_wiki_text.zip
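
Because the Word2Vec files above are EUCKR encoded, a training script has to decode them while reading. The following is a minimal sketch under stated assumptions: the class name KccSentences is hypothetical, and the input is assumed to hold one sentence per line with tokens separated by whitespace (the exact output format of KLT2000 is not described in this file). A restartable iterable of token lists like this is the input shape that trainers such as gensim's Word2Vec accept.

```python
import gzip


class KccSentences:
    """Restartable iterable over token lists, one list per non-empty line."""

    def __init__(self, path, encoding="cp949"):
        self.path = path
        self.encoding = encoding

    def __iter__(self):
        # Read .gz files directly; fall back to plain text otherwise.
        opener = gzip.open if self.path.endswith(".gz") else open
        with opener(self.path, "rt", encoding=self.encoding,
                    errors="ignore") as f:
            for line in f:
                tokens = line.split()  # assumption: whitespace-separated tokens
                if tokens:
                    yield tokens
```

Each pass over a KccSentences object reopens the file, which matters for trainers that iterate over the corpus more than once.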

================================================================================
For cases where the large files are difficult to download...
================================================================================
1) KCC150 -- 150 million words, split for download
   The 11,961,347 sentences of KCC150 (150 million words) are split into 12 files of 1 million sentences (lines) each.
   Concatenating the 12 files below in order gives the same text as KCC150_Korean_sentences_UTF8.txt (except that the encoding is EUCKR).

   - KCC150_K01.txt.gz
   - KCC150_K02.txt.gz
   - KCC150_K03.txt.gz
   - KCC150_K04.txt.gz
   - KCC150_K05.txt.gz
   - KCC150_K06.txt.gz
   - KCC150_K07.txt.gz
   - KCC150_K08.txt.gz
   - KCC150_K09.txt.gz
   - KCC150_K10.txt.gz
   - KCC150_K11.txt.gz
   - KCC150_K12.txt.gz
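
The split files can be joined back into a single file in Python. The following is a minimal sketch, not the author's tooling: the function name join_parts is hypothetical, the parts are assumed to be gzip-compressed EUCKR/CP949 text, and the output is written as UTF-8. Sorting the file names gives the right order because the part numbers are zero-padded (K01 ... K12).

```python
import glob
import gzip


def join_parts(pattern, dst_txt, src_encoding="cp949"):
    """Concatenate gzip-compressed parts, in file-name order, into one UTF-8 file.

    Returns the total number of lines written.
    """
    parts = sorted(glob.glob(pattern))
    if not parts:
        raise FileNotFoundError(f"no files match {pattern!r}")
    lines = 0
    with open(dst_txt, "w", encoding="utf-8", newline="\n") as fout:
        for part in parts:
            with gzip.open(part, "rt", encoding=src_encoding,
                           errors="replace") as fin:
                for line in fin:
                    # Guard: if a part ends without a newline, add one so the
                    # last sentence is not glued to the next part's first one.
                    if not line.endswith("\n"):
                        line += "\n"
                    fout.write(line)
                    lines += 1
    return lines
```

For example, join_parts("KCC150_K*.txt.gz", "KCC150_joined_utf8.txt") would process the 12 parts listed above, provided they are all in the current directory.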

2) KCCq28 -- 28.78 million words, split for download
   The 1,337,721 sentences containing double quotes (28.78 million words) are split into 7 files of 200,000 sentences (lines) each.
   Concatenating the 7 files below in order gives the same text as KCCq28_Korean_sentences_UTF8.txt (except that the encoding is EUCKR).

   - KCCq28_Q01.txt.gz
   - KCCq28_Q02.txt.gz
   - KCCq28_Q03.txt.gz
   - KCCq28_Q04.txt.gz
   - KCCq28_Q05.txt.gz
   - KCCq28_Q06.txt.gz
   - KCCq28_Q07.txt.gz
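
Since every part holds one sentence per line, counting lines is a quick way to check that a download is complete. The following is a minimal sketch with a hypothetical function name, count_lines. It reads the parts in binary mode, so no character decoding is involved.

```python
import gzip


def count_lines(gz_paths):
    """Return a dict mapping each gzip file path to its number of lines."""
    counts = {}
    for path in gz_paths:
        with gzip.open(path, "rb") as f:
            # Iterating a binary file yields one chunk per line.
            counts[path] = sum(1 for _ in f)
    return counts
```

The per-file counts can then be compared with the sentences-per-file figures given above, and their sum with the total sentence count of the corpus.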

Korean morphological analyzer