Cheonkam's Deep Learning Space: [NLP - Data] A list of Korean Acoustic Corpora

Tuesday, December 20, 2022

[NLP - Data] A list of Korean Acoustic Corpora

The Speech Corpus of Reading-Style Standard Korean (NIKL 2005; https://github.com/homink/speech.ko)

About 120 hours (figure not verified)
Read speech
120 speakers -- gender balanced (60 male, 60 female); ages ranged from 19 to 71 at the time of recording in 2003.
Region: Seoul metropolitan area -- speakers of Seoul dialect
Content: 19 well-known short stories and essays containing a total of 930 sentences
Available format: Each sentence is stored as a separate wav file in the corpus.
Around 88,800 audio files
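
Since each sentence is a separate wav file, the uncertain total duration can be checked directly once the corpus is downloaded. The following is a minimal sketch, not part of the corpus tooling: it assumes the files are standard PCM WAV files that Python's built-in wave module can open, and the directory path in the usage comment is hypothetical.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of one PCM WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def corpus_hours(root):
    """Count all .wav files under root and sum their durations.

    Returns a (file_count, total_hours) tuple.
    """
    files = sorted(Path(root).rglob("*.wav"))
    total_seconds = sum(wav_duration_seconds(f) for f in files)
    return len(files), total_seconds / 3600.0


# Usage (the path is a placeholder, not the real corpus layout):
# n_files, hours = corpus_hours("/data/nikl_read_speech")
# print(n_files, round(hours, 1))
```

Comparing the reported file count and hours with the figures listed above is a quick sanity check that a download is complete.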

The Korean Corpus of Spontaneous Speech (http://koreascience.or.kr/article/JAKO201521159149292.page)

In order to get the material, you need to contact the authors.
40 hours
40 speakers (age and gender -- balanced)
Interview speech in a sociolinguistic interview format, close to a monologue -- one hour per speaker
Similar to the Buckeye corpus
All the utterances are transcribed.

Pansori-TEDxKR (https://github.com/yc9701/pansori-tedxkr-corpus)

3 hours
Talk speech (monologue)
41 speakers -- gender not balanced (32 male and 9 female)
Regions: Seoul (14), Busan (14) and Daejeon / Daedeok (13)
The ASR corpus is drawn from a pool of 11,704 fragments and is close to 3 hours (2 hours 48 minutes) in audio length, which corresponds to 26.4% of the total number of fragments and 23.6% of the total audio length.
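
As a back-of-the-envelope check, the percentages above imply the approximate size of the ASR subset and of the full fragment pool. The sketch below only redoes that arithmetic from the numbers quoted in this post; the derived values are estimates, not figures reported by the corpus authors.

```python
# Numbers quoted in the list above.
total_fragments = 11_704
asr_minutes = 2 * 60 + 48          # 2 hours 48 minutes
fragment_share = 0.264             # ASR subset as a share of all fragments
audio_share = 0.236                # ASR subset as a share of all audio

# Implied number of fragments in the ASR subset.
asr_fragments = round(total_fragments * fragment_share)

# Implied total audio length of the whole fragment pool, in hours.
total_hours = asr_minutes / audio_share / 60

print(asr_fragments, round(total_hours, 1))
```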

ClovaCall (https://github.com/ClovaAI/ClovaCall)

50 hours
Cleaned set: 60,000 utterances, selected from a pool recorded by 11,000 people, each of whom read 10 unique sentences (each repeated once or twice).
Read speech: 60,000 pairs of a short sentence and its corresponding spoken utterance in a restaurant reservation domain (only the speakers' requests).
11,000 speakers (age and gender unknown) -- but only 10 utterances per speaker.

AIHub (https://aihub.or.kr/aihubdata/data/view.do?currMenu=115&topMenu=100&aihubDataSe=realm&dataSetSn=123)

Large-scale Korean open domain dialog speech corpus from AIHub
610 hours: 510 hours (pre-training), 100 hours (fine-tuning)
Description:

1. Around 1,000 hours
2. Spontaneous speech
3. 2,000 speakers
4. Conversation between two people about various topics (e.g., weather, economics)
5. ETRI transcription rule
6. Files: segmented at the utterance level (at long pauses); audio format: 16kHz/16bits, headerless (endian) linear PCM; transcription format: EUC-KR
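
Because the audio has no header and the transcripts are not UTF-8, standard WAV readers and default text decoding will not work on these files as-is. The sketch below shows one way to load them with the Python standard library. It assumes mono audio and little-endian byte order (the description above only says "endian"), and the function names are hypothetical; adjust the byteorder argument if the samples sound like noise.

```python
import array
import sys
from pathlib import Path

SAMPLE_RATE = 16000  # Hz, from the format description above


def read_raw_pcm(path, byteorder="little"):
    """Read a headerless 16-bit PCM file into an array of signed 16-bit samples.

    byteorder="little" is an assumption, not something the source states.
    """
    data = Path(path).read_bytes()
    data = data[: len(data) - len(data) % 2]  # drop a trailing odd byte, if any
    samples = array.array("h")                # "h" = signed 16-bit integer
    samples.frombytes(data)
    if byteorder != sys.byteorder:
        samples.byteswap()                    # convert to the machine's byte order
    return samples


def pcm_duration_seconds(samples, sample_rate=SAMPLE_RATE):
    """Duration of a mono sample array in seconds."""
    return len(samples) / sample_rate


def read_transcript(path):
    """Decode an EUC-KR transcript file into a Python string."""
    return Path(path).read_text(encoding="euc-kr").strip()
```

Once decoded this way, transcripts can be re-saved as UTF-8 so that downstream tools do not need to know about the original encoding.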

Zeroth (https://github.com/goodatlas/zeroth)

95.7 hours
Read speech
46,347 utterances, 181 speakers, 27,330 unique sentences

KsponSpeech (https://aihub.or.kr/aidata/105)

969 hours
General open-domain dialog utterances
2,000 native Korean speakers, recorded in a clean environment
Dialogues between two people conversing freely on a variety of topics, with the utterances manually transcribed
Dual transcription (orthography and pronunciation), plus disfluency tags marking the spontaneity of speech, such as filler words, repeated words, and word fragments
For preprocessing, use the script at https://github.com/sooftware/ksponspeech
