KCC150, KCCq28, KCC940 -- Korean Contemporary Corpus of Written Sentences



=================================================================================
   KCC150, KCCq28, KCC940 -- Korean Contemporary Corpus of Written Sentences
   Total 732 million words (48,878,948 sentences)
=================================================================================

KCC (Korean Contemporary Corpus) -- raw written sentences of the Korean language
1) KCC150_Korean_sentence_EUCKR.txt
   --> 150,705,457 words (11,961,347 sentences)
   Sentences with quotes are not included.
   
2) KCCq28_Korean_sentence_EUCKR.txt
   All the sentences include a quote.
   --> 28,782,776 words (1,337,721 sentences with double quotes)

3) KCC940_Korean_sentence_EUCKR.txt
   All the sentences are no more than 30 words.
   --> 93,210,332 words (6,263,454 sentences)
   
4) KCC460_Korean_sentence_EUCKR.txt
   All the sentences are no more than 30 words.
   --> about 460 million words (29,316,426 sentences)

5) Text corpus for Word2Vec word embedding, built from the KCC corpora above.
   [Download] Korean word embedding model
   
Seung-Shik Kang, Ph.D
Professor at Kookmin University
Email: nlpkang AT g m a i l . c o m
=================================================================================
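
The sentence and word counts listed above can be checked locally once a file is downloaded. The following is a minimal Python sketch, not the author's counting method. It assumes one sentence per line and treats every whitespace-separated token as a word; the function names are illustrative and not part of the KCC distribution.

```python
import gzip


def open_corpus(path, encoding="cp949"):
    """Open a plain or gzip-compressed corpus file as text (hypothetical helper)."""
    opener = gzip.open if path.endswith(".gz") else open
    return opener(path, "rt", encoding=encoding, errors="replace")


def count_sentences_and_words(lines):
    """Count non-blank lines (sentences) and whitespace-separated tokens (words)."""
    sentences = 0
    words = 0
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # blank lines are not counted as sentences
        sentences += 1
        words += len(tokens)
    return sentences, words


def within_length(line, max_words=30):
    """Return True if the line has at most max_words whitespace-separated words."""
    return len(line.split()) <= max_words
```

For example, count_sentences_and_words(open_corpus("KCC150_Korean_sentences_EUCKR.txt.gz")) returns a (sentences, words) pair that can be compared with the figures above, provided the assumptions about line and word boundaries hold.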

Download one of the following. Each corpus is the same in both encodings (EUCKR or UTF8).
- KCC150_Korean_sentences_EUCKR.txt.gz -- "EUCKR" encoded file
- [OLD] KCCq28_Korean_sentences_EUCKR.txt.gz -- "EUCKR" encoded file
- [NEW] KCCq28_Korean_sentences_EUCKR_v2.zip -- "EUCKR" encoded file (0xA1A1 -> 0x20)
- [OLD] KCC940_Korean_sentences_EUCKR.txt.gz -- "EUCKR" encoded file
- [NEW] KCC940_Korean_sentences_EUCKR_V2.txt.gz -- "EUCKR" encoded file (merge Grapheme to word)
- KCC460_EUCKR.txt.gz -- "EUCKR" encoded file
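
The note "0xA1A1 -> 0x20" refers to the EUC-KR byte pair 0xA1 0xA1, which encodes the full-width (ideographic) space U+3000; in the v2 files it is replaced by an ASCII space. The same normalization can be applied to an [OLD] file. The sketch below is an assumption about how to do this safely, not the procedure used to build the v2 files, and the function name is illustrative. It decodes before replacing, because a raw byte search for A1 A1 can match across a character boundary (for example the trail byte of 0xB0A1 followed by the lead byte of 0xA1A2).

```python
FULLWIDTH_SPACE = "\u3000"  # what the EUC-KR byte pair 0xA1 0xA1 decodes to


def normalize_spaces(raw, encoding="cp949"):
    """Replace full-width spaces with ASCII spaces in EUC-KR/CP949 encoded bytes."""
    text = raw.decode(encoding, errors="replace")
    text = text.replace(FULLWIDTH_SPACE, " ")
    return text.encode(encoding, errors="replace")
```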

- KCC150_Korean_sentences_UTF8.txt.gz -- "UTF8" encoded file
- [OLD] KCCq28_Korean_sentences_UTF8.txt.gz -- "UTF8" encoded file
- [NEW] KCCq28_Korean_sentences_UTF8_v2.zip -- "UTF8" encoded file (0xA1A1 -> 0x20)
- [OLD] KCC940_Korean_sentences_UTF8.txt.gz -- "UTF8" encoded file
- [NEW] KCC940_Korean_sentences_UTF8_V2.txt.gz -- "UTF8" encoded file (merge Grapheme to word)
- KCC460 -- no "UTF8" encoded file. Download KCC460_EUCKR.txt.gz, decompress it, and convert it to UTF8 with iconv.
With iconv installed --> $ iconv -c -f cp949 -t utf-8 KCC460_EUCKR.txt > KCC460_utf8.txt
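
If iconv is not available, the conversion to UTF8 described above can also be done with the Python standard library. This is a minimal sketch under the assumption that the file is CP949/EUC-KR text compressed with gzip; the function name and the output file name are illustrative. Passing errors="ignore" drops undecodable bytes, which roughly mirrors the -c option of iconv.

```python
import gzip


def transcode_gz(src_path, dst_path, src_encoding="cp949", dst_encoding="utf-8"):
    """Stream a gzip-compressed text file into an uncompressed file in another encoding."""
    with gzip.open(src_path, "rt", encoding=src_encoding,
                   errors="ignore", newline="") as src, \
         open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        for line in src:  # line by line, so the whole corpus never sits in memory
            dst.write(line)


# Example call (file names as used on this page):
# transcode_gz("KCC460_EUCKR.txt.gz", "KCC460_utf8.txt")
```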

If you cannot download from the above server, try the following server.
- http://203.246.112.71/kcc/KCC150_Korean_sentences_EUCKR.txt.gz
- http://203.246.112.71/kcc/KCCq28_Korean_sentences_EUCKR.txt.gz
- http://203.246.112.71/kcc/KCC940_Korean_sentences_EUCKR.txt.gz
- http://203.246.112.71/kcc/KCC460.txt.gz

Download one of the following text files for word embedding (Word2Vec) --> EUCKR encoded files
These files were created automatically by the KLT2000 Korean morphological analyzer. See below for details.

  C> index2018.exe -c test.txt output.txt
  https://cafe.naver.com/nlpkang/3 
  https://cafe.naver.com/nlpk/278 
- KCC150_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
- KCC940_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
- KCCq28_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
- KCC460_Korean_sentences for Word2Vec embedding ("EUCKR" encoded file)
--> KCC460_Korean_sentences for Word2Vec embedding ("UTF8" encoded file)
- Korean Wiki Text -- ko_wiki_text.zip

================================================================================
If the large files are difficult to download, use the split files below.
================================================================================
1) KCC150_Korean_sentences_EUCKR.txt -- 150 million words, split for download
   The 11,961,347 sentences of KCC150 (150 million words) are split into 12 files of 1 million sentences (lines) each.
   Concatenating the 12 files below in order reproduces KCC150_Korean_sentences_EUCKR.txt.

   - KCC150_K01.txt.gz
   - KCC150_K02.txt.gz
   - KCC150_K03.txt.gz
   - KCC150_K04.txt.gz
   - KCC150_K05.txt.gz
   - KCC150_K06.txt.gz
   - KCC150_K07.txt.gz
   - KCC150_K08.txt.gz
   - KCC150_K09.txt.gz
   - KCC150_K10.txt.gz
   - KCC150_K11.txt.gz
   - KCC150_K12.txt.gz

2) KCCq28_Korean_sentences_EUCKR.txt -- 28.78 million words, split for download
   The 1,337,721 sentences containing double quotes (28.78 million words) are split into 7 files of 200,000 sentences (lines) each.
   Concatenating the 7 files below in order reproduces KCCq28_Korean_sentences_EUCKR.txt.

   - KCCq28_Q01.txt.gz
   - KCCq28_Q02.txt.gz
   - KCCq28_Q03.txt.gz
   - KCCq28_Q04.txt.gz
   - KCCq28_Q05.txt.gz
   - KCCq28_Q06.txt.gz
   - KCCq28_Q07.txt.gz
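
The split files above have to be joined in numeric order to rebuild the full corpus. The following is a minimal Python sketch of that step, assuming all parts sit in the current directory under the names listed above and that each part ends with a newline; the function name and the output file names are illustrative. The parts are copied as raw bytes, so the EUCKR encoding is left untouched.

```python
import gzip
import shutil


def join_parts(part_paths, out_path):
    """Decompress each .gz part and append its bytes to out_path, in the given order."""
    with open(out_path, "wb") as out:
        for part in part_paths:
            with gzip.open(part, "rb") as src:
                shutil.copyfileobj(src, out)


# Part names as listed above: K01..K12 for KCC150, Q01..Q07 for KCCq28.
kcc150_parts = [f"KCC150_K{i:02d}.txt.gz" for i in range(1, 13)]
kccq28_parts = [f"KCCq28_Q{i:02d}.txt.gz" for i in range(1, 8)]

# Example calls:
# join_parts(kcc150_parts, "KCC150_Korean_sentences_EUCKR.txt")
# join_parts(kccq28_parts, "KCCq28_Korean_sentences_EUCKR.txt")
```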

Korean morphological analyzer