Deeply Audio Datasets
Machine / Deep Learning Training Data

We offer five large-scale audio datasets, each with different characteristics, collected over multiple years. They can be used to train and improve machine learning and deep learning models. We also offer a pre-trained model for each dataset. Data collection was supported by the Korean government, and the data has been verified by the government agency NIPA. Each dataset is available for commercial or academic use at different price points. Detailed descriptions, statistics, and audio samples are provided on each dataset page below.

Please do not hesitate to contact us with price inquiries. Enter your email below.

For a faster reply, email contact@deeplyinc.com.

Nonverbal Vocal Data

Parent-Child Data

Emotional Speech Data

Multiple Location Data

Vocal Distance Data

coughing sound sample

Nonverbal Vocalization Data

This nonverbal voice dataset does not contain spoken language. It covers 16 types of nonverbal sounds, including screams, laughter, cries, moans, and tickling. It comprises 57 hours of data collected from 1,419 people, and its quality has been verified through double inspection.

Parent-Child Vocal Interaction Data

This dataset consists of various conversations between parents and children. The interactions are classified into 8 categories, including talking, singing, and crying. A total of 282 hours of data has been verified through double inspection. Each conversation was recorded with two types of cell phones (iPhone X and Samsung Galaxy S7) at distances of 0.4m, 2.0m, and 4.0m. The same device and distance conditions were used in three different locations: a studio apartment, a dance studio, and an anechoic room.
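The recording setup described above can be written out as a small grid of conditions. The following is a minimal sketch, not part of the dataset tooling: it assumes every device, distance, and location combination was recorded, and the names `DEVICES`, `DISTANCES_M`, `LOCATIONS`, and `recording_conditions` are hypothetical and do not reflect the actual file layout or metadata schema.

```python
from itertools import product

# Values taken from the dataset description; the names are illustrative only.
DEVICES = ["iPhone X", "Samsung Galaxy S7"]
DISTANCES_M = [0.4, 2.0, 4.0]
LOCATIONS = ["studio apartment", "dance studio", "anechoic room"]


def recording_conditions():
    """Return every (device, distance, location) combination as a dict.

    Assumes all combinations were recorded, which the description
    suggests but does not state outright.
    """
    return [
        {"device": device, "distance_m": distance, "location": location}
        for device, distance, location in product(DEVICES, DISTANCES_M, LOCATIONS)
    ]


if __name__ == "__main__":
    conditions = recording_conditions()
    print(len(conditions))  # 2 devices x 3 distances x 3 locations = 18
```

Under that assumption, a grid like this could be used to group samples by condition when comparing model performance across devices, distances, or rooms.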

child singing sample

Emotional Speech Corpus

This corpus consists of voice data that conveys various emotions. Sentences with positive, neutral, or negative meanings were recorded with the speaker conveying neutral emotion. Other sentences were recorded with the speaker conveying positive, neutral, or negative emotion. A total of 290 hours of data has been verified through double inspection. Each utterance was recorded with two types of cell phones (iPhone X and Samsung Galaxy S7) at distances of 0.4m, 2.0m, and 4.0m. To account for the acoustic characteristics of the recording space, each piece of data was recorded in a studio apartment, a dance studio, or an anechoic room.

negative voice sample

Multiple Location Data

Voice data was recorded in three spaces: an anechoic chamber, a studio apartment, and a dance studio. An anechoic chamber is a room designed to stop the reflection of sound or electromagnetic waves, which results in an exceptionally low amount of reverb. The studio apartment produces moderate reverb, while the dance studio produces high reverb. A total of 570 hours of audio has been verified through double inspection.

Anechoic chamber sample

Sample recorded at 0.4m

Vocal Distance Data

Each speaker is recorded by 3 microphones simultaneously. The microphones are placed at 3 different distances from the speaker: 0.4m, 2.0m, and 4.0m. A total of 573 hours of data was recorded and validated through double inspection. This dataset can be used to improve the quality of speech research across various distances.

Our datasets are already used by large companies, research institutes, and universities to improve speech, nonverbal, and emotional AI analysis.

More information regarding our datasets can be found at the following links:

* GitHub link

* blog post
