Datasets #2

@StuartIanNaylor

The Single Word Target Segment at https://commonvoice.mozilla.org/en/datasets is not much better than a multilingual Google speech commands, apart from containing the word 'Hey'. It is great as an example, but the shorter and less unique a KW is, the harder it is to classify accurately without overlap with another label.
So 10 short numbers is probably one of the hardest tasks you can give a KWS, but it is often chosen as an example because 'word' datasets are rare and the choice of example is limited.
Apart from numbers, the Common Voice dataset does have 'Firefox', but what I have tended to do, with a bit of sox, is create 'Hey Marvin', 'Hey Sheila' and 'Hey House' from the Google command set.
That can still give a great known-word label, but we are still really short of phonetic variation of words: there just aren't enough of them for the !KW (not-keyword) label.
https://mlcommons.org/en/multilingual-spoken-words/ is brilliant for that, and it also opens up a few other options, such as adding 'Hey' or combining words, to fill the 1 s KW envelope with as much unique spectral info as you can, with some margins.
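
As a rough illustration of combining two word clips into one keyword inside a 1 s envelope, here is a minimal sketch in plain Python. It is not the sox recipe itself: the 16 kHz sample rate, the silence threshold, the 800-sample (50 ms) gap and the function names `trim_silence` and `combine_words` are all assumptions made for the example, and samples are taken to be floats in the range -1.0 to 1.0.

```python
SAMPLE_RATE = 16000  # assumed sample rate; 1 s envelope = 16000 samples


def trim_silence(samples, threshold=0.02):
    """Drop leading and trailing samples quieter than the (assumed) threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) >= threshold]
    if not loud:
        return []
    return samples[loud[0]:loud[-1] + 1]


def combine_words(first, second, target_len=SAMPLE_RATE, gap=800):
    """Join two trimmed word clips and centre them in a fixed-length envelope.

    The remaining space is split into silent margins on both sides.
    """
    joined = trim_silence(first) + [0.0] * gap + trim_silence(second)
    if len(joined) > target_len:
        raise ValueError("combined keyword does not fit the envelope")
    pad = target_len - len(joined)
    left = pad // 2
    return [0.0] * left + joined + [0.0] * (pad - left)
```

With real clips, `first` would be the samples of a 'Hey' recording and `second` those of, say, a 'Marvin' recording; clips that are too long together raise an error rather than being silently truncated.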

I really like what you have done in storing the processed MFCCs in a .npz dataset. It works as a hugely compressive codec: completely lossy, but it never needs to be converted back to PCM, and the end size is tiny.
I have a favorite recipe for a single-KW dataset with the labels 'Noise', 'KW' and '!KW'. That much I learnt from the Google-research streaming KWS repo. If you have a relatively phonetically unique KW, with other spoken words in '!KW' and non-spoken noise in 'Noise', the accuracy balloons, both because of the uniqueness and because '!KW' gets a much more varied dataset that is further expanded by the 'Noise' label.
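
To make the storage idea concrete, here is a minimal sketch of packing feature arrays and the three labels into a compressed .npz with NumPy. The array keys (`features`, `labels`, `label_names`), the feature shape and the helper names are hypothetical and not taken from this repo, and the features are random placeholders rather than real MFCCs.

```python
import io

import numpy as np

LABELS = ("Noise", "KW", "!KW")  # the three-label recipe described above


def pack_features(features, label_ids):
    """Serialise features and integer label ids into compressed .npz bytes."""
    buf = io.BytesIO()
    np.savez_compressed(
        buf,
        features=features.astype(np.float32),
        labels=label_ids.astype(np.uint8),
        label_names=np.array(LABELS),
    )
    return buf.getvalue()


def unpack_features(blob):
    """Load the arrays back; the features are never converted back to audio."""
    with np.load(io.BytesIO(blob)) as data:
        names = [str(n) for n in data["label_names"]]
        return data["features"], data["labels"], names


# Placeholder data: 6 clips, 49 frames, 10 coefficients (assumed shape).
rng = np.random.default_rng(0)
example_features = rng.standard_normal((6, 49, 10))
example_labels = np.array([0, 1, 2, 0, 1, 2])
example_blob = pack_features(example_features, example_labels)
```

Writing to a file instead of an in-memory buffer only requires passing a filename to `np.savez_compressed` and `np.load`.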
Also, again stealing much of what I gathered from the Google-research repo: mixing the KW at 70-80% with noise at 25-35% levels can help massively with open mics, so the example would work with more than just headset mics.
I asked about spec-augment in another issue; please just close these off after reading, as they are only questions of curiosity. The mixing above makes a huge difference, whilst I am not that sure about spec-augment.
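
The KW and noise mixing can be sketched as below. This is one possible reading, not the Google-research implementation: it assumes the percentages are peak amplitudes relative to full scale, that samples are floats in the range -1.0 to 1.0, and that a short noise clip may be looped to the keyword length. The function names are hypothetical.

```python
import random


def scale_to_peak(samples, level):
    """Rescale a clip so its largest absolute sample equals `level`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [s * level / peak for s in samples]


def mix_kw_with_noise(kw, noise, rng=random):
    """Mix a keyword clip at 70-80% peak with noise at 25-35% peak."""
    if not noise:
        raise ValueError("noise clip is empty")
    kw_level = rng.uniform(0.70, 0.80)
    noise_level = rng.uniform(0.25, 0.35)
    # Loop the noise so it covers the whole keyword clip.
    looped = [noise[i % len(noise)] for i in range(len(kw))]
    kw_scaled = scale_to_peak(kw, kw_level)
    noise_scaled = scale_to_peak(looped, noise_level)
    # The two peaks can sum to more than 1.0, so clip to the valid range.
    mixed = [max(-1.0, min(1.0, a + b)) for a, b in zip(kw_scaled, noise_scaled)]
    return mixed, kw_level, noise_level
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible between runs.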

There is no license, @mazko, and I would like to feed back upstream to you, because what you have done is, as far as I can gather, about perfect: a DS-CNN model that can run on TFLite Micro. I am also looking at the ESP32, though my interest is in the S3.
So for embedded down to micro you have provided a dataset and model that are just about perfect, and rather than copy and rewrite them it would be great if you could add some sort of license.
PS: are you aware of any better models that will run on Micro? DS-CNN surpasses a straight CNN, but something like a CRNN is quite 'heavy', and as far as I know LSTM and GRU are not available.

I am aiming at a client/server design that makes KWS a bolt-on to any ASR, rather than a branded embedded KWS: a simple websockets broadcast on KW, with on-device training providing additional training from usage data organised into simple zones.
The server side is just the audio stream to a loopback (snd-aloop) for threading, with a simple folder to hold the metadata of zones.