List of children's speech corpora

A child speech corpus is a speech corpus documenting first-language acquisition. Such databases are used in the development of computer-assisted language learning systems and the characterization of children's speech at different ages.^[1] Children's speech varies not only by language, but also by region within a language. It can also differ for specific groups such as autistic children, especially when emotion is considered, so different databases are needed for different populations. Corpora are available for American and British English as well as for many other European languages.^[1]^[2]^[3]

Overview of Children's Speech Corpora

In the table below, the age range may be described in terms of school grades. "K" denotes "kindergarten" while "G" denotes "grade". For example, an age range of "K - G10" refers to speakers ranging from kindergarten age to grade 10.
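The grade notation can be read mechanically. The following is a minimal sketch, not part of any corpus toolkit: the function names `parse_grade` and `parse_grade_range` are hypothetical, and representing kindergarten as grade 0 is an assumption made only so that ranges can be compared numerically.

```python
import re


def parse_grade(token: str) -> int:
    """Convert a grade label to a number: 'K' -> 0 (assumed), 'G<n>' -> n."""
    token = token.strip().upper()
    if token == "K":
        return 0  # assumption: kindergarten is treated as grade 0
    match = re.fullmatch(r"G(\d+)", token)
    if match is None:
        raise ValueError(f"not a grade label: {token!r}")
    return int(match.group(1))


def parse_grade_range(text: str) -> tuple[int, int]:
    """Split a range such as 'K - G10' into (lowest grade, highest grade)."""
    low, high = (parse_grade(part) for part in text.split("-"))
    return low, high


print(parse_grade_range("K - G10"))
print(parse_grade_range("G3 - G5"))
```

Rows whose age range is given in years (for example "6 - 11") are not grade labels, so this sketch raises a `ValueError` for them rather than guessing.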

This table is based on a paper from the Interspeech conference, 2016.^[4] This online article is intended to provide an interactive table for readers and a place where the speech research community can continuously update information about children's speech corpora.

Corpus	Author	Languages	# Speakers	# Utt.	Duration	Age Range	Date	Remarks
Boulder Learning—MyST Corpus (v0.4.0) ^[5]	Cole et al.^[6]	English	1371	228,874	~393h	G3 - G5	2019	dialog interaction between a student and a virtual tutor on science topics; sessions typically last 20-40 minutes (wall clock); roughly 49% of the utterances have been transcribed, with more in progress; volunteers are encouraged; available free for research, with a flat $10K fee for commercial use
CMU Kids Corpus ^[7]	Eskenazi	English	24M, 52F	5180		6 - 11	1997
CSLU Kids' Speech Corpus ^[8]	Shobaki	English	1100	1017		K - G10	2007
PF-STAR Children's Speech Corpus ^[9]^[10]	Russell	English	158		~14.5h	4 - 14	2006	word-level transcriptions
CALL-SLT ^[11]	Rayner	German		5000			2014
TBALL ^[12]	Kazemzadeh	English	256	5000	40h	K - G4	2005	partially non-native speech
CASS_CHILD ^[13]	Gao	Mandarin	23			1 - 4	2012	phonetic transcriptions
CU Children's Read and Prompted Speech Corpus ^[14]	Hagen	English	663	~100		K - G5	2001	consists of isolated words, sentences and short spontaneous story telling; word-level transcriptions
CU Story Corpus ^[14]	Hagen	English	106	5000	40h	G3 - G5	2003	consists of story prompts and spontaneous spoken summary of the material; word-level transcriptions
Providence Corpus ^[15]	Demuth	English	6		363h	1 - 3	2006	mother-child spontaneous speech interactions; broad phonetic transcription
Lyon Corpus ^[16]	Demuth	French	4		185h	1 - 3	2007	mother-child spontaneous speech interactions; broad phonetic transcription
Demuth Sesotho Corpus ^[17]	Demuth	Sesotho	4	~13250	98h	2 - 4	1992	family/peer spontaneous speech interactions; morphologically tagged
CHIEDE ^[18]	Garrote	Spanish	59	15444	~8h		2008	spontaneous conversation, personal interviews, adult-child interaction; orthographic transcriptions; automatic phonological transcription
TIDIGITS ^[19]	Leonard	English	326 (101 children)			6 - 15	1993	mix of adult and child speakers
FAU Aibo Emotion Corpus	Steidl	German	51		9h	10 - 13		human-annotated with 11 emotion categories
Swedish NICE Corpus ^[20]	Bell		5580			8 - 15	2005	consists of child-machine and adult-child interactions; orthographic transcriptions
SingaKids-Mandarin ^[4]	Chen	Mandarin	255	79,843	125h	7 - 12	2016	word and phone-level transcriptions; human-annotated proficiency ratings
CFSC ^[21]	Pascual	Filipino	57		~8h	6 - 11	2012	consists of children's read speech; contains both good pronunciations and reading miscues; partially transcribed at the word and phoneme levels

References

[1] Habernal, Ivan; Matousek, Vaclav (2013). Text, Speech, and Dialogue: 16th International Conference, TSD 2013, Pilsen, Czech Republic, September 1-5, 2013, Proceedings. Springer. p. 545. ISBN 9783642405853. Retrieved 11 December 2015.
[2] Neustein, Amy (2014). Speech and Automata in Health Care. Walter de Gruyter. pp. 225–226. ISBN 9781614515159. Retrieved 11 December 2015.
[3] Ronzhin, Andrey; Potapova, Rodmonga; Fakotakis, Nikos (2015). Speech and Computer: 17th International Conference, SPECOM 2015, Athens, Greece, September 20-24, 2015, Proceedings. Springer. pp. 144–145. ISBN 9783319231327. Retrieved 11 December 2015.
[4] Nancy F. Chen, Rong Tong, Darren Wee, Peixuan Lee, Bin Ma and Haizhou Li. SingaKids-Mandarin: Speech Corpus of Singaporean Children Speaking Mandarin Chinese, in Proc. of Interspeech, 2016.
[5] "MyST Corpus | Boulder Learning inc". Retrieved 2019-07-17.
[6] "My Science Tutor and the MyST Corpus". ResearchGate. Retrieved 2019-07-17.
[7] Maxine Eskenazi, Jack Mostow, and David Graff. The CMU Kids Corpus LDC97S63. Web Download. Philadelphia: Linguistic Data Consortium, 1997.
[8] Khaldoun Shobaki, John-Paul Hosom, and Ronald Cole. CSLU: Kids' Speech Version 1.1 LDC2007S18. Web Download. Philadelphia: Linguistic Data Consortium, 2007.
[9] Martin Russell. The PF-STAR British English Children's Speech Corpus. The Speech Ark Limited. 2006.
[10] Anton Batliner, Mats Blomberg, Shona D'Arcy, Daniel Elenius, Diego Giuliani, Matteo Gerosa, Christian Hacker, Martin Russell, Stefan Steidl, Michael Wong. The PF STAR Children's Speech Corpus. In Proc. of Interspeech, 2005.
[11] Manny Rayner, Nikos Tsourakis, Claudia Baur, Pierrette Bouillon, Johanna Gerlach. CALL-SLT: A Spoken CALL System based on grammar and speech recognition. In Linguistic Issues in Language Technology, vol. 10, issue 2. 2014.
[12] Abe Kazemzadeh, Hong You, Markus Iseli, Barbara Jones, Xiaodong Cui, Margaret Heritage, Patti Price, Elaine Anderson, Shrikanth Narayanan and Abeer Alwan. TBALL Data Collection: The Making of a Young Children's Speech Corpus, in Proc. of Interspeech, 2005.
[13] Jun Gao, Aijun Li and Ziyu Xiong. Mandarin Multimedia Child Speech Corpus: CASS_CHILD in International Conference on Speech Database and Assessments (Oriental COCOSDA), 2012.
[14] Andreas Hagen, Bryan Pellom and Ronald Cole. Children's Speech Recognition with Application to Interactive Books and Tutors in IEEE Workshop on Automatic Speech Recognition and Understanding, 2003.
[15] Demuth, K., Culbertson, J. & Alter, J. 2006. Word-minimality, epenthesis, and coda licensing in the acquisition of English. Language & Speech, 49, 137-174.
[16] Demuth, K. & A. Tremblay. 2007. Prosodically-conditioned variability in children's production of French determiners. Journal of Child Language, 34, 1-29.
[17] Demuth, K. 1992. Acquisition of Sesotho. In D. Slobin (ed.), The Cross-Linguistic Study of Language Acquisition, vol 3, 557-638. Hillsdale, N.J.: Lawrence Erlbaum Associates.
[18] Marta Garrote. CHIEDE: A Spontaneous Child Language Corpus of Spanish. Ph.D. thesis, Universidad Autónoma de Madrid, Spain. 2008.
[19] R. Gary Leonard, and George Doddington. TIDIGITS LDC93S10. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
[20] Linda Bell, Johan Boye, Joakim Gustafson, Mattias Heldner, Anders Lindström and Mats Wirén. The Swedish NICE Corpus - Spoken Dialogues between Children and Embodied Characters in a Computer Game Scenario, in Proc. of Eurospeech, 2005.
[21] Pascual, R. M.; Guevara, R. C. L. (November 2012). "Developing a children's Filipino speech corpus for application in automatic detection of reading miscues and disfluencies". TENCON 2012 IEEE Region 10 Conference. pp. 1–6. doi:10.1109/TENCON.2012.6412235. ISBN 978-1-4673-4824-9. S2CID 8795591.

