examples for AQA, AAC, SER, SEC/ASC
Congrats on the release!
We've gotten some examples for VC and TTS; could we get more examples, in particular for the other capabilities, and the prompts used during training for those?
I'm after:
audio question answering (AQA),
audio captioning (AAC),
speech emotion recognition (SER),
sound event/scene classification (SEC/ASC)
Hi guys, thanks for your attention. You can refer to our benchmark evaluation files, which contain the evaluation prompts for the different tasks: https://github.com/MoonshotAI/Kimi-Audio-Evalkit/blob/master/data/download_benchmark.py. As for the training task prompts, we have designed many; here are some examples.
For the speech emotion task, the training prompts are:
1) Identify the predominant emotion in this speech.\nOptions:\n(A) neutral\n(B) joy\n(C) sadness\n(D) anger\n(E) surprise\n(F) fear\n(G) disgust\n.Answer with the option's letter from the given choices directly and only give the best option.
2) Based on the speech, what is the main emotion?\nOptions:\n(A) neutral\n(B) joy\n(C) sadness\n(D) anger\n(E) surprise\n(F) fear\n(G) disgust\n.Answer with the option's letter from the given choices directly and only give the best option.
For the acoustic scene classification task, the evaluation prompts follow this general format:
1) Identify the acoustic scene in the audio.\nOptions:\n(A) beach\n(B) bus\n(C) cafe or restaurant\n(D) car\n(E) city center\n(F) forest path\n(G) grocery store\n(H) home\n(I) library\n(J) metro station\n(K) office\n(L) park\n(M) residential area\n(N) train\n(O) tram\n.Answer with the option's letter from the given choices directly and only give the best option.
2) Classify the location heard in the sound.\nOptions:\n(A) beach\n(B) bus\n(C) cafe or restaurant\n(D) car\n(E) city center\n(F) forest path\n(G) grocery store\n(H) home\n(I) library\n(J) metro station\n(K) office\n(L) park\n(M) residential area\n(N) train\n(O) tram\n.Answer with the option's letter from the given choices directly and only give the best option.
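Since the SER and ASC prompts above share the same multiple-choice shape (question, "Options:" list with letter labels, then the answer instruction), here is a small sketch for generating prompts in that style. The helper name and structure are my own for illustration, not something from the Kimi-Audio repo:

```python
# Build a multiple-choice prompt in the style of the training prompts
# quoted above: question, "Options:" with lettered choices, then the
# fixed answer instruction. Purely illustrative; not from the repo.
import string

SUFFIX = ("Answer with the option's letter from the given choices "
          "directly and only give the best option.")

def build_mcq_prompt(question: str, options: list) -> str:
    letters = string.ascii_uppercase
    lines = [question, "Options:"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append(SUFFIX)
    return "\n".join(lines)

emotions = ["neutral", "joy", "sadness", "anger",
            "surprise", "fear", "disgust"]
print(build_mcq_prompt("Identify the predominant emotion in this speech.",
                       emotions))
```

Swapping in the scene list (beach, bus, cafe or restaurant, ...) gives the ASC prompts in the same way.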
For the acoustic event detection task, the evaluation prompts follow this general format:
1) Identify the sound event in the audio.
2) What sound event occurs in this audio?
For the audio captioning task, the evaluation prompts follow this general format:
1) Please describe the sound events in the audio.
2) Please generate the audio caption.
As for the Audio Question Answering (AQA) task, the evaluation prompts are open-ended: you can ask any question (for example, you can refer to datasets like MMAU, ClothoAQA, Comp-R, AVQA, MusicAVQA, etc.).
It's best to follow our prompt format, but the specific content can vary freely.
Hi, this feature is not currently supported.
@YifeiXin
If I want, for example, to perform acoustic event detection on an audio clip in which several events occur, and I would like to receive a result in the style of:
(start_1, end_1, event_1), (start_2, end_2, event_2), etc.
Is this possible?
The way I've done it in my own work is mostly via Scribe v1 - I should have enough data to build a distilled Whisper that can do that, but it's time-intensive. Alternatively, you can always snip the audio up, generate data per event, and compound it that way. But at zero-shot there isn't anything out there that can give word-level timestamps on N audio events.
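To make the "snip it up" idea concrete, here is a minimal sketch: slide a window over the audio, run a single-event classifier on each chunk, and merge adjacent chunks that share a label into (start, end, event) tuples. The `classify_chunk` callable is a placeholder for whatever per-chunk classifier you use (e.g. one event-detection prompt to the model per chunk); everything here is my own illustration, not an API of the repo:

```python
# Approximate timestamped event detection by windowed classification:
# classify overlapping chunks, then merge consecutive windows that
# agree on the label. Timestamps are only as precise as the hop size.

def detect_events(duration, classify_chunk, window=1.0, hop=0.5):
    """Return a list of (start, end, event) tuples."""
    raw = []
    t = 0.0
    while t < duration:
        end = min(t + window, duration)
        raw.append((t, end, classify_chunk(t, end)))
        t += hop
    # merge consecutive overlapping windows with the same label
    merged = []
    for start, end, label in raw:
        if merged and merged[-1][2] == label and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged

# toy classifier standing in for a real per-chunk model call:
# "dog_bark" for chunks starting before 2 s, "siren" afterwards
toy = lambda s, e: "dog_bark" if s < 2.0 else "siren"
print(detect_events(4.0, toy))
```

Note that segment boundaries from this approach can overlap by up to the window length, so it is a coarse substitute for true event-level timestamps, not a replacement for a model trained on them.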