Tuesday, December 20, 2022

[NLP - Data] A list of Korean Acoustic Corpora

The Speech Corpus of Reading-Style Standard Korean (NIKL 2005; https://github.com/homink/speech.ko)

  • 120 hours (??)
  • Read speech 
  • 120 speakers -- gender-balanced (60 male, 60 female), aged 19 to 71 at the time of recording in 2003
  • Region: Seoul metropolitan area -- speakers of Seoul dialect 
  • Content: 19 well-known short stories and essays containing a total of 930 sentences
  • Available format: Each sentence is stored as a separate wav file in the corpus. 
  • Around 88,800 audio files
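
Since each sentence is stored as its own wav file, the uncertain "120 hours" figure can be checked by summing file durations. The following is a minimal sketch, assuming the files are standard RIFF/WAV files that Python's `wave` module can open; the function name `total_hours` and the directory path in the usage note are hypothetical.

```python
import wave
from pathlib import Path


def total_hours(corpus_root):
    """Return (number of wav files, total duration in hours) under corpus_root."""
    total_seconds = 0.0
    n_files = 0
    # Walk the directory tree recursively; the real layout of the corpus is not assumed.
    for wav_path in Path(corpus_root).rglob("*.wav"):
        with wave.open(str(wav_path), "rb") as w:
            # Duration of one file = number of frames / frames per second.
            total_seconds += w.getnframes() / w.getframerate()
        n_files += 1
    return n_files, total_seconds / 3600.0
```

Calling `total_hours("path/to/corpus")` (a placeholder path) would return the file count, which can be compared with the roughly 88,800 files listed above, together with the measured number of hours.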

The Korean Corpus of Spontaneous Speech (http://koreascience.or.kr/article/JAKO201521159149292.page)

  • To obtain the material, you need to contact the authors.
  • 40 hours
  • 40 speakers (age and gender -- balanced)
  • Interview speech in a sociolinguistic interview format (close to a monologue), one hour per speaker
  • Similar to the Buckeye corpus
  • All the utterances are transcribed.

  • 3 hours
  • Talk speech (monologue)
  • 41 speakers -- gender not balanced (32 male and 9 female)
  • Regions: Seoul (14), Busan (14) and Daejeon / Daedeok (13)
  • ASR subset: close to 3 hours of audio (2 hours 48 minutes), which accounts for 26.4% of the 11,704 fragments and 23.6% of the total audio length.
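
The percentages in the bullet above also imply the size of the ASR subset and of the full recording set. The arithmetic below is a small sketch using only the figures quoted there; because those percentages are rounded, the derived numbers are approximate.

```python
total_fragments = 11_704        # all fragments, as quoted above
asr_share_of_fragments = 0.264  # 26.4% of the fragments
asr_share_of_audio = 0.236      # 23.6% of the audio length
asr_hours = 2 + 48 / 60         # 2 hours 48 minutes = 2.8 hours

# Number of fragments in the ASR subset.
asr_fragments = round(total_fragments * asr_share_of_fragments)

# Total audio length implied by the ASR subset's share.
total_hours = asr_hours / asr_share_of_audio

print(f"ASR fragments: about {asr_fragments}")
print(f"Total audio: about {total_hours:.1f} hours")
```

This works out to roughly 3,090 fragments in the ASR subset and just under 12 hours of audio overall.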

  • 50 hours
  • 60,000 utterances after cleaning, drawn from a pool recorded by 11,000 people, each of whom read 10 unique sentences (each repeated once or twice).
  • Read speech: 60,000 pairs of a short sentence and its corresponding spoken utterance in the restaurant reservation domain (speakers' requests only).
  • 11,000 speakers (age and gender distribution unknown) -- but only 10 utterances per speaker.


  • Large-scale Korean open domain dialog speech corpus from AIHub
  • 610 hours: 510 hours (pre-training), 100 hours (fine-tuning)
  • Description:
1. Around 1,000 hours
2. Spontaneous speech
3. 2,000 speakers
4. Conversation between two people about various topics (e.g., weather, economics)
5. ETRI transcription rule
6. Files: audio is segmented at the utterance level (at long pauses) and stored as 16 kHz, 16-bit headerless linear PCM (endianness not specified here); transcripts are encoded in EUC-KR
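
Headerless PCM files carry no metadata, so the sample rate, sample width, and byte order have to be supplied by the reader. Below is a minimal sketch using only the Python standard library. The function names are hypothetical, the default byte order of `"little"` is an assumption (the description above does not say which one applies), and the transcript is assumed to be a plain text file.

```python
import array
import sys

SAMPLE_RATE = 16_000  # Hz, from the corpus description


def read_raw_pcm(path, byteorder="little"):
    """Read headerless 16-bit linear PCM into an array of signed integers."""
    samples = array.array("h")  # "h" = signed 16-bit integer
    with open(path, "rb") as f:
        samples.frombytes(f.read())
    # array uses the machine's native byte order; swap if the file differs.
    if byteorder != sys.byteorder:
        samples.byteswap()
    return samples


def duration_seconds(samples):
    """Duration of a mono signal sampled at SAMPLE_RATE."""
    return len(samples) / SAMPLE_RATE


def read_transcript(path):
    """Read an EUC-KR encoded transcript as a Python string."""
    with open(path, encoding="euc_kr") as f:
        return f.read().strip()
```

If the decoded audio sounds like noise, the byte order is probably wrong and `byteorder="big"` is worth trying. If a transcript fails to decode, `cp949` (a superset of EUC-KR) is a common fallback, though whether it is needed for this corpus is an assumption.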

  • 95.7 hours
  • Read speech
  • 46,347 utterances, 181 speakers, 27,330 unique sentences

  • 69 hours
  • General open-domain dialog utterances
  • 2,000 native Korean speakers recorded in a clean environment
  • Dialogues between two people conversing freely on a variety of topics, with manually transcribed utterances
  • Dual transcription (orthography and pronunciation), plus disfluency tags marking the spontaneity of speech, such as filler words, repeated words, and word fragments
  • For preprocessing, use the script at https://github.com/sooftware/ksponspeech
