examples for AQA, AAC, SER, SEC/ASC
Congrats on the release!
We've gotten some examples for VC and TTS; could we get more examples, in particular for the other capabilities, and the prompts used during training for those?
I'm after:
audio question answering (AQA),
audio captioning (AAC),
speech emotion recognition (SER),
sound event/scene classification (SEC/ASC)
Hi guys, thanks for your attention. You can refer to our benchmark evaluation files, which contain the evaluation prompts for the different tasks: https://github.com/MoonshotAI/Kimi-Audio-Evalkit/blob/master/data/download_benchmark.py. As for the training task prompts, we have designed many; here are some examples.
For the speech emotion task, the training prompts are:
1) Identify the predominant emotion in this speech.\nOptions:\n(A) neutral\n(B) joy\n(C) sadness\n(D) anger\n(E) surprise\n(F) fear\n(G) disgust\n.Answer with the option's letter from the given choices directly and only give the best option.
2) Based on the speech, what is the main emotion?\nOptions:\n(A) neutral\n(B) joy\n(C) sadness\n(D) anger\n(E) surprise\n(F) fear\n(G) disgust\n.Answer with the option's letter from the given choices directly and only give the best option.
For the acoustic scene classification task, the evaluation prompts follow this general format:
1) Identify the acoustic scene in the audio.\nOptions:\n(A) beach\n(B) bus\n(C) cafe or restaurant\n(D) car\n(E) city center\n(F) forest path\n(G) grocery store\n(H) home\n(I) library\n(J) metro station\n(K) office\n(L) park\n(M) residential area\n(N) train\n(O) tram\n.Answer with the option's letter from the given choices directly and only give the best option.
2) Classify the location heard in the sound.\nOptions:\n(A) beach\n(B) bus\n(C) cafe or restaurant\n(D) car\n(E) city center\n(F) forest path\n(G) grocery store\n(H) home\n(I) library\n(J) metro station\n(K) office\n(L) park\n(M) residential area\n(N) train\n(O) tram\n.Answer with the option's letter from the given choices directly and only give the best option.
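Since the SER and ASC prompts above share the same multiple-choice shape (question, "Options:" list with letter labels, then the answer instruction), here is a small sketch for generating prompts in that style. The helper name and structure are my own for illustration, not something from the Kimi-Audio repo:

```python
# Build a multiple-choice prompt in the style of the training prompts
# quoted above: question, "Options:" with lettered choices, then the
# fixed answer instruction. Purely illustrative; not from the repo.
import string

SUFFIX = ("Answer with the option's letter from the given choices "
          "directly and only give the best option.")

def build_mcq_prompt(question: str, options: list) -> str:
    letters = string.ascii_uppercase
    lines = [question, "Options:"]
    lines += [f"({letters[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append(SUFFIX)
    return "\n".join(lines)

emotions = ["neutral", "joy", "sadness", "anger",
            "surprise", "fear", "disgust"]
print(build_mcq_prompt("Identify the predominant emotion in this speech.",
                       emotions))
```

Swapping in the scene list (beach, bus, cafe or restaurant, ...) gives the ASC prompts in the same way.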
For the acoustic event detection task, the evaluation prompts follow this general format:
1) Identify the sound event in the audio.
2) What sound event occurs in this audio?
For the audio captioning task, the evaluation prompts follow this general format:
1) Please describe the sound events in the audio.
2) Please generate the audio caption.
As for the Audio Question Answering (AQA) task, the evaluation prompts are open-ended: you can ask any question (for example, you can refer to datasets like MMAU, ClothoAQA, Comp-R, AVQA, MusicAVQA, etc.).
It's best to follow our prompt format, but the specific content can vary freely.
Hi, this feature is not currently supported.
@YifeiXin
If I want, for example, to perform acoustic event detection on an audio clip in which several events occur, and I would like to receive a result in the style of:
(start_1, end_1, event_1), (start_2, end_2, event_2), etc.
Is this possible?
The way I've done it in my own work is mostly via Scribe v1 - I should have enough data to build a distilled Whisper that can do that, but it's time-intensive. Alternatively, you can always snip the audio up, generate data per event, and compound it that way. But at zero-shot there isn't anything out there that can give word-level timestamps on N audio events.
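To make the "snip it up" idea concrete, here is a minimal sketch: slide a window over the audio, run a single-event classifier on each chunk, and merge adjacent chunks that share a label into (start, end, event) tuples. The `classify_chunk` callable is a placeholder for whatever per-chunk classifier you use (e.g. one event-detection prompt to the model per chunk); everything here is my own illustration, not an API of the repo:

```python
# Approximate timestamped event detection by windowed classification:
# classify overlapping chunks, then merge consecutive windows that
# agree on the label. Timestamps are only as precise as the hop size.

def detect_events(duration, classify_chunk, window=1.0, hop=0.5):
    """Return a list of (start, end, event) tuples."""
    raw = []
    t = 0.0
    while t < duration:
        end = min(t + window, duration)
        raw.append((t, end, classify_chunk(t, end)))
        t += hop
    # merge consecutive overlapping windows with the same label
    merged = []
    for start, end, label in raw:
        if merged and merged[-1][2] == label and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], end, label)
        else:
            merged.append((start, end, label))
    return merged

# toy classifier standing in for a real per-chunk model call:
# "dog_bark" for chunks starting before 2 s, "siren" afterwards
toy = lambda s, e: "dog_bark" if s < 2.0 else "siren"
print(detect_events(4.0, toy))
```

Note that segment boundaries from this approach can overlap by up to the window length, so it is a coarse substitute for true event-level timestamps, not a replacement for a model trained on them.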