KCC150, KCCq28, KCC940 -- Korean Contemporary Corpus of Written Sentences



=================================================================================
   Total 272,698,565 words (19,562,522 sentences)
=================================================================================

KCC (Korean Contemporary Corpus) -- raw written sentences of the Korean language
1) KCC150_Korean_sentences_EUCKR.txt
   --> 150,705,457 words (11,961,347 sentences)
   Sentences with quotes are not included.
   
2) KCCq28_Korean_sentences_EUCKR.txt
   Every sentence contains a quotation.
   --> 28,782,776 words (1,337,721 sentences with double quotes)

3) KCC940_Korean_sentences_EUCKR.txt
   No sentence is longer than 30 words.
   --> 93,210,332 words (6,263,454 sentences)
   
Seung-Shik Kang, Ph.D.
Professor at Kookmin University
Email: nlpkang AT g m a i l . c o m
=================================================================================
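
The word and sentence counts listed above can be sanity-checked after
download. The following is a minimal Python sketch, not the procedure used to
produce the published figures. It assumes one sentence per line and treats a
"word" as a whitespace-separated token; the function name is hypothetical.

```python
import gzip


def count_sentences_and_words(path, encoding="euc_kr"):
    """Count non-empty lines (sentences) and whitespace-separated tokens (words).

    Assumption: the corpus file holds one sentence per line.
    """
    sentences = 0
    words = 0
    # errors="replace" lets the scan continue past bytes the codec cannot decode.
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as f:
        for line in f:
            tokens = line.split()
            if tokens:  # skip blank lines
                sentences += 1
                words += len(tokens)
    return sentences, words
```

For a UTF8 download, pass encoding="utf-8". The totals may differ from the
published ones if the original counts used a different definition of "word".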

Download one encoding of each file below; the corpus content is the same in
both (EUCKR or UTF8 encoding).
- KCC150_Korean_sentences_EUCKR.txt.gz -- "EUCKR" encoded file
- KCCq28_Korean_sentences_EUCKR.txt.gz -- "EUCKR" encoded file
- KCC940_Korean_sentences_EUCKR.txt.gz -- "EUCKR" encoded file

- KCC150_Korean_sentences_UTF8.txt.gz -- "UTF8" encoded file
- KCCq28_Korean_sentences_UTF8.txt.gz -- "UTF8" encoded file
- KCC940_Korean_sentences_UTF8.txt.gz -- "UTF8" encoded file
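
If only an EUCKR file is at hand, it can be re-encoded locally. The sketch
below is one possible way to do that in Python, assuming the file decodes with
Python's "euc_kr" codec (if it contains characters outside that set, "cp949"
may be needed instead). The function name is hypothetical.

```python
import gzip


def reencode_gz(src_path, dst_path, src_encoding="euc_kr", dst_encoding="utf-8"):
    """Copy a gzipped text file line by line into a new encoding.

    Returns the number of lines written.
    """
    lines = 0
    # newline="" keeps the original line endings untouched on read and write.
    with gzip.open(src_path, "rt", encoding=src_encoding, newline="") as src, \
         gzip.open(dst_path, "wt", encoding=dst_encoding, newline="") as dst:
        for line in src:
            dst.write(line)
            lines += 1
    return lines
```

Because no error handler is set, a decoding problem raises UnicodeDecodeError
instead of silently altering the text.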

If you cannot download from the server above, try the following server.
- http://203.246.112.72/kcc/KCC150_Korean_sentences_EUCKR.txt.gz
- http://203.246.112.72/kcc/KCCq28_Korean_sentences_EUCKR.txt.gz
- http://203.246.112.72/kcc/KCC940_Korean_sentences_EUCKR.txt.gz
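
A scripted download from this server might look like the following Python
sketch. It streams the response to disk so that the large files are never held
in memory. The helper names are hypothetical, and only the EUCKR files listed
above are known to be at this address.

```python
import shutil
import urllib.request

# Server address quoted in the list above.
MIRROR = "http://203.246.112.72/kcc/"


def mirror_url(filename):
    """Build the URL of a corpus file on the server above."""
    return MIRROR + filename


def download(url, dest_path, timeout=60):
    """Stream url to dest_path in chunks and return dest_path."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
         open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest_path
```

For example, download(mirror_url("KCCq28_Korean_sentences_EUCKR.txt.gz"),
"KCCq28_Korean_sentences_EUCKR.txt.gz") fetches the smallest of the three
files.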

================================================================================
If the large files are difficult to download, use the split files below.
================================================================================
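1) KCC150_Korean_sentences_EUCKR.txt -- the 150 million words, split for download
   The 11,961,347 sentences (150 million words) of KCC150 are split into
   12 files of 1 million sentences (lines) each.
   Concatenating the 12 files below in order reproduces
   KCC150_Korean_sentences_EUCKR.txt.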

   - KCC150_K01.txt.gz
   - KCC150_K02.txt.gz
   - KCC150_K03.txt.gz
   - KCC150_K04.txt.gz
   - KCC150_K05.txt.gz
   - KCC150_K06.txt.gz
   - KCC150_K07.txt.gz
   - KCC150_K08.txt.gz
   - KCC150_K09.txt.gz
   - KCC150_K10.txt.gz
   - KCC150_K11.txt.gz
   - KCC150_K12.txt.gz
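
The split files can be recombined in order with a few lines of Python. This is
a sketch with hypothetical helper names. It assumes each part ends with a
newline, so that no two sentences are merged where one part meets the next.

```python
import gzip
import shutil


def part_names(prefix, count):
    """Return names such as KCC150_K01.txt.gz ... KCC150_K12.txt.gz, in order."""
    return [f"{prefix}{i:02d}.txt.gz" for i in range(1, count + 1)]


def join_parts(part_paths, out_path):
    """Decompress each part in the given order and append it to out_path."""
    with open(out_path, "wb") as out:
        for path in part_paths:
            with gzip.open(path, "rb") as part:
                # Bytes are copied unchanged, so the encoding is preserved.
                shutil.copyfileobj(part, out)
    return out_path
```

With the parts in the current directory,
join_parts(part_names("KCC150_K", 12), "KCC150_Korean_sentences_EUCKR.txt")
writes the combined file. The same helpers work for the KCCq28 parts with
part_names("KCCq28_Q", 7).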

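2) KCCq28_Korean_sentences_EUCKR.txt -- the 28.78 million words, split for download
   The 1,337,721 sentences containing double quotes (28.78 million words) are
   split into 7 files of 200,000 sentences (lines) each.
   Concatenating the 7 files below in order reproduces
   KCCq28_Korean_sentences_EUCKR.txt.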

   - KCCq28_Q01.txt.gz
   - KCCq28_Q02.txt.gz
   - KCCq28_Q03.txt.gz
   - KCCq28_Q04.txt.gz
   - KCCq28_Q05.txt.gz
   - KCCq28_Q06.txt.gz
   - KCCq28_Q07.txt.gz
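
Before recombining, the line count of each downloaded part can be checked
against what a fixed-size split implies. The sketch below assumes every part
except the last is full; both function names are hypothetical.

```python
import gzip


def expected_part_sizes(total_lines, lines_per_part):
    """Line counts implied by splitting total_lines into fixed-size parts."""
    full, rest = divmod(total_lines, lines_per_part)
    return [lines_per_part] * full + ([rest] if rest else [])


def count_lines_gz(path):
    """Count the lines of a gzipped file without decoding the text."""
    with gzip.open(path, "rb") as f:
        return sum(1 for _ in f)
```

For KCC150, expected_part_sizes(11_961_347, 1_000_000) gives eleven parts of
1,000,000 lines followed by one part of 961,347 lines, which matches the 12
files listed. The same check applies to the KCCq28 parts with a part size of
200,000.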

Korean morphological analyzer