ML Scientist

Connecting Scholars with the Latest Academic News and Career Paths

Featured News

LDC Releases New Mandarin Chinese Audio and Transcripts

LDC has released new Mandarin Chinese audio and transcript datasets to support machine translation and information retrieval research.

The Linguistic Data Consortium (LDC) has released two new publications: BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio (LDC2025S04) and BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations (LDC2025T05).

The audio dataset consists of 93 hours of unscripted telephone conversations between native speakers of Mainland Mandarin Chinese; 60% of the recordings are being publicly released for the first time. The transcripts and translations dataset contains verbatim transcripts and English translations of that conversational telephone speech.

  • The data is divided into training, development, and evaluation partitions.
  • Transcribers used simplified Chinese orthography and added minimal markup.
  • 89% of the transcripts were translated into English.

These datasets were developed to support DARPA BOLT, a program focused on machine translation and information retrieval for less formal genres. LDC members can access the corpora through their LDC accounts, while non-members may license the data for a fee.

Tags: LDC, Mandarin Chinese, Audio Corpus, Transcripts, Machine Translation, DARPA BOLT, Linguistic Data Consortium