Adapt Tortoise-TTS for another language - machine-learning

I watched a YouTube video about voice cloning: https://www.youtube.com/watch?v=Kfr_FZof_hs
It's an interesting topic, but this project's repository only supports English: https://colab.research.google.com/drive/1NxiY3zHN4Nd8J3YAqFsbYaOB71IiLE04?usp=sharing#scrollTo=JrK20I32grP6
I want to adapt it for Italian.
I am a beginner in machine learning.
What do I need to do to get Tortoise-TTS to "learn" Italian?
Is it necessary to train the model on audio files, to rebuild the model entirely, or something else?
Can you advise me?

You can check the GitHub issue linked below, where the project's creator answers this question.
Creator's answer:
Here is what you need to train this:
a wav2vec or similar ASR model for your language
at least 10,000 hours of usable spoken language, with no environmental noises, music, etc. This does not need to be transcribed. I used audiobooks and podcasts for English.
approximately 16 months total of V100 time
https://github.com/neonbjb/tortoise-tts/issues/5
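To make the first requirement concrete: pretrained wav2vec2 checkpoints for Italian already exist on the Hugging Face hub, so you may not need to train the ASR component from scratch. Below is a minimal sketch of loading one and transcribing a clip; this is an illustration, not the author's actual pipeline, and the checkpoint name is a community model given purely as an example.

```python
# Hedged sketch: run an Italian wav2vec2 ASR checkpoint from the Hugging Face
# hub. The model name is an example community checkpoint, not something the
# Tortoise author prescribes; any Italian CTC model plays the same role.
import torch
import librosa
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

MODEL_ID = "jonatasgrosman/wav2vec2-large-xlsr-53-italian"  # example checkpoint

processor = Wav2Vec2Processor.from_pretrained(MODEL_ID)
model = Wav2Vec2ForCTC.from_pretrained(MODEL_ID)

speech, _ = librosa.load("clip.wav", sr=16_000)  # wav2vec2 expects 16 kHz mono
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits   # (batch, frames, vocab)

pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])       # rough Italian transcript
```

The much harder requirements are the 10,000 hours of clean Italian audio and the months of GPU time; the ASR model is only the first ingredient.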

Related

Good resources for finding intents from strings using NLP

I am looking to decode intents from many strings from many conversations I have stored in a database, so I can use machine learning to create an intelligent chatbot. I have heard of and tested tools like Amazon Lex, but I am looking to extract the intent from a string rather than create my own intents. Here are some sample questions from the data I am working with:
Hi, can I please find out the location of the nearest Depot to Meadow Springs WA 6210?
Is there any chance we could get 34 cases from Melbourne to Sydney by Friday. I am hoping it will be a Sydney to Sydney tomorrow but if not can anyone do this by Friday from Melbourne?
can you tell me how the warranty claim is going on booking number 9528 thanks
Intents are usually created for a specific application by providing examples. However, some services do provide pre-defined intents you can use.
LUIS provides prebuilt "domains" which include some place-related queries.
Snips has an "intents library" that you can use.
That may be able to get you started. If that doesn't work, this guide to a from-scratch implementation may be useful.
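If neither prebuilt library fits, one middle ground is zero-shot classification, where you supply candidate intent names at query time instead of training examples. A minimal sketch, assuming the Hugging Face transformers library; the intent labels here are invented for illustration:

```python
# Hedged sketch: score a raw string against candidate intents with a zero-shot
# NLI pipeline, so no intent-specific training data is needed. The intent
# labels are invented placeholders, not a standard taxonomy.
from transformers import pipeline

classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

query = "Can you tell me how the warranty claim is going on booking 9528?"
intents = ["depot_location", "delivery_request", "warranty_status"]

result = classifier(query, candidate_labels=intents)
print(result["labels"][0], result["scores"][0])  # top-scoring intent
```

You would still want to validate the label set against a held-out sample of your stored conversations before trusting it in a chatbot.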

How to adapt acoustic model for UK/Indian accent

I am working on speech recognition for the Indian accent. For better recognition, I want to create a language model for the Indian accent.
The tutorial I found describes the process for Linux. Is there any way to do acoustic model adaptation on Windows? Is there any alternative to recorded sound for creating an acoustic model?
I found an online language model tool here: http://www.speech.cs.cmu.edu/tools/lmtool.html
But it is for the US accent.
Is there any online tool to create an Indian/UK accent language model?
First a quick clarification. Creating a new language model will not help you with accents. The language model only specifies in which orders the words can appear. You are looking for acoustic, not language, model adaptation.
Because adaptation is a complex process, there are no good online tools for it (the one you linked is a language model tool). I would recommend downloading SphinxTrain at http://cmusphinx.sourceforge.net for a good all-purpose adaptation/training tool. There are many tutorials and forums to help you get started.
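For orientation, here is where the adapted model plugs in once SphinxTrain has produced it. This sketch uses the classic pocketsphinx Python bindings; all paths are placeholders, and "en-us-adapted" is a hypothetical directory name for the adaptation output. Note that the acoustic model and the language model are separate inputs, which is exactly the distinction made above.

```python
# Hedged sketch: decode with an adapted acoustic model in pocketsphinx.
# 'en-us-adapted' is a placeholder for the directory that SphinxTrain-style
# adaptation produces; the language model and dictionary are unchanged.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'en-us-adapted')        # adapted ACOUSTIC model
config.set_string('-lm', 'en-us.lm.bin')          # LANGUAGE model (unchanged)
config.set_string('-dict', 'cmudict-en-us.dict')  # pronunciation dictionary
decoder = Decoder(config)

decoder.start_utt()
with open('test.raw', 'rb') as f:                 # 16 kHz, 16-bit mono PCM
    while buf := f.read(1024):
        decoder.process_raw(buf, False, False)
decoder.end_utt()
print(decoder.hyp().hypstr)
```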

Named Entity recognition

I am using Stanford NER to remove identities from essays.
It detects names like Werner, but Indian names such as Ram, Shyam, etc. go undetected.
What should I do to make them recognizable?
You should train the NER model on Indian names. I could not find detailed information on how to achieve that, but this FAQ page ( http://nlp.stanford.edu/software/crf-faq.shtml#a ) has some information which may be a starting point for you. Questions 2-3 in particular are directly related to your question.
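That FAQ trains CRFClassifier from tab-separated token/label files, so the practical first step is producing such a file with Indian names marked as PERSON. A small sketch of writing that format from Python; the sentences and labels here are invented examples:

```python
# Hedged sketch: emit training data in the one-token-per-line TSV format that
# Stanford's CRFClassifier consumes (token <TAB> label, blank line between
# sentences). The example sentences and labels are invented.
sentences = [
    [("Ram", "PERSON"), ("met", "O"), ("Shyam", "PERSON"), ("yesterday", "O")],
    [("Werner", "PERSON"), ("flew", "O"), ("to", "O"), ("Delhi", "LOCATION")],
]

with open("indian-names.train.tsv", "w", encoding="utf-8") as f:
    for sentence in sentences:
        for token, label in sentence:
            f.write(f"{token}\t{label}\n")
        f.write("\n")  # sentence boundary
```

You would then point the trainFile property of a CRFClassifier properties file at this TSV and run the Java trainer as the FAQ describes.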

iOS / C: Algorithm to detect phonemes

I am searching for an algorithm to determine whether real-time audio input matches one of 144 given (and comfortably distinct) phoneme pairs.
Preferably the lowest level that does the job.
I'm developing radical / experimental musical training software for iPhone / iPad.
My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair 'laa duu bee' etc in response to visual stimulus.
I have done a lot of research into this, and it looks like my best bet may be to use one of the iOS Sphinx wrappers (iPhone App › Add voice recognition? is the best source of information I have found). However, I can't see how I would adapt such a package. Can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?
Would training by the user be necessary? I would have thought not, as it is such an elementary task compared with full language models of thousands of words and a far greater and more subtle phoneme base. However, it would be acceptable (though not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.
Is there a simpler approach? I feel like using a fully featured continuous speech recogniser is using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that would solve the problem.
So really I'm hunting for any open source software that recognises phonemes.
P.S. I need a solution which runs pretty much in real time: even as the student is singing the note, the display should first blink to show that it picked up the phoneme pair that was sung, and then glow to show whether they are singing the correct pitch.
If you are looking for a phone-level open source recogniser, then I would recommend HTK. Very good documentation is available for this tool in the form of the HTK Book, which contains an entire chapter dedicated to building a phone-level real-time speech recogniser. From your problem statement above, it seems like you might be able to rework that example into your own solution. Possible pitfalls:
Since you want to build a phone-level recogniser, the amount of data needed to train the phone models would be very high. Your training database should also be balanced in terms of the distribution of the phones.
Building a speaker-independent system would require data from more than one speaker. And lots of it.
Since this is open source, you should also check the licensing info for any additional details about shipping the code. A good alternative would be to use the on-phone recorder and send the recorded waveform over a data channel to a server for recognition, pretty much like what Google does.
I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.
One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.
Using Sphinx is probably a great start on this problem. I haven't gotten very far in the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)
...using a sledgehammer to crack a nut.
I wouldn't label your problem a nut, I'd say it's more like a beast. It may be a different beast than natural language speech recognition, but it is still a beast.
All the best with your problem solving.
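On the "minimum technology" point: since at most 12 pairs would be user-trained, one baseline worth trying before any full recogniser is nearest-template matching with dynamic time warping over MFCCs. This is my suggestion, not something from the Sphinx or HTK docs. A desktop prototype sketch in Python with librosa, with file names invented; an iOS version would reimplement the same idea in C against the audio callback:

```python
# Hedged sketch: classify a sung syllable by dynamic-time-warping distance
# between its MFCCs and a few recorded templates. Paths and template names
# are invented; one recorded clip per phoneme pair is assumed.
import librosa

def mfcc(path):
    y, sr = librosa.load(path, sr=16_000, mono=True)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

templates = {pair: mfcc(f"templates/{pair}.wav")
             for pair in ("laa", "duu", "bee")}    # one clip per phoneme pair

def dtw_distance(query, template):
    D, _ = librosa.sequence.dtw(X=query, Y=template, metric="euclidean")
    return D[-1, -1]                               # accumulated alignment cost

def classify(path):
    q = mfcc(path)
    return min(templates, key=lambda p: dtw_distance(q, templates[p]))

print(classify("sung_note.wav"))
```

With only 12 (or 144) comfortably distinct classes this kind of nearest-template scheme can go a long way before a full HMM recogniser is justified.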
Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.
http://www.hfink.eu/matchbox
This page links to both a YouTube video demo and the GitHub source.
I'm guessing it would still be a lot of work to mould it into the shape I'm after, but it definitely does do a lot of the work already.

Music analysis software [closed]

Greetings
I may have imagined this, but does anyone know if Last.fm previously used some form of open source project to perform analysis on music to determine similar music?
As it has now moved to a paid version, I'd like to make something which can add known music to my playlist. (I hate scanning my computer for similar music manually.)
Failing that, does anyone know of any system that I could use to replace this? Ideally I'd like some form of API/source code that I can use to automate the whole process into batch jobs.
Thanks,
[edit]
Ideally I was looking for something more along the lines of content matching. I'm the type of person who just throws all my music into one unorganized location. Then, being lazy, I would ideally expect a playlist of similar music to be generated for me.
Last.fm uses http://www.audioscrobbler.net/ - it also provides access to its database via an API.
[/edit]
Music similarity is not an easy problem.
There are two general approaches to solving this problem.
Approach 1.
Throw data at the problem. This is the approach Last.fm and Pandora take: basically one huge database maintained by either a community or a group of experts. Note that to use this approach you will need clean metadata or some kind of audio fingerprinting solution like MusicBrainz. Once you have the feature database, you can use algorithms such as the Pearson correlation coefficient to find similar items.
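As a toy illustration of that last step (the play-count data below is invented), the Pearson correlation between two songs' per-user play counts:

```python
# Toy sketch of Approach 1's similarity step: correlate per-user play counts
# for pairs of songs with the Pearson correlation coefficient. Data invented.
import numpy as np

# rows = users, columns = songs (play counts)
plays = np.array([[10, 8,  0,  1],
                  [ 9, 7,  1,  0],
                  [ 0, 1, 12,  9],
                  [ 1, 0,  9, 11]], dtype=float)

def pearson(a, b):
    a, b = a - a.mean(), b - b.mean()
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

for j in range(1, plays.shape[1]):
    print(f"song 0 vs song {j}: {pearson(plays[:, 0], plays[:, j]):+.3f}")
```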
Approach 2.
Throw algorithms at the problem, in particular computer audition algorithms. This means you calculate vectors of the various features a song contains and, using neural nets and a variety of other techniques, find other songs with similar vectors. This approach has been used successfully for automatic genre classification and query by example.
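A compact sketch of this approach, with mean MFCCs standing in for a real feature set and invented file names; a serious system would use much richer features, as the tools mentioned below do:

```python
# Hedged sketch of Approach 2: one feature vector per track (mean MFCCs over
# the first minute), neighbours ranked by cosine similarity. File names are
# placeholders for a real music library.
import numpy as np
import librosa

def feature_vector(path):
    y, sr = librosa.load(path, sr=22_050, mono=True, duration=60)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).mean(axis=1)

library = {name: feature_vector(name) for name in ("a.mp3", "b.mp3", "c.mp3")}

def most_similar(seed):
    v = library[seed]
    cos = lambda u: (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return sorted((n for n in library if n != seed),
                  key=lambda n: cos(library[n]), reverse=True)

print(most_similar("a.mp3"))  # other tracks, closest first
```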
If you are looking for open source software for music analysis, Marsyas can do pretty much everything the commercial stuff can do. It is the brainchild of George Tzanetakis, and on his web site you can find many papers about the state of affairs in computer audition.
There's a web API at The Echo Nest that includes a get_similar web service that allows you to retrieve artists similar to a set of seed artists. You can use this to help build playlists. The Echo Nest also has a set of web APIs that will perform a detailed analysis of a track (similar to the aforementioned Marsyas) that one could use as the basis for an acoustic-based song similarity method. (Caveat: I work at The Echo Nest.) Of course, if you use iTunes, there are some canned solutions. iTunes now has a music recommender / playlist generator that will build playlists of songs from similar artists. Similarly, the company Mufin has an iTunes add-on which will perform acoustic analysis of your tracks and use it to build playlists.
If you are interested in building your own music similarity system, I suggest that you take a look at the proceedings of ISMIR (the International Society for Music Information Retrieval). There's quite a bit of research around music similarity and playlisting that you'll find helpful. You can find the proceedings at ismir.net.
Wouldn't it be simpler/more efficient to query (or build?) some internet database based on genre/style/etc.? I used Last.fm and similar sites but never felt they did anything more than this (at least the results weren't indicating that) ;)
I am not very sure what exactly you want, but how about MusicBrainz?
To be clear, AudioScrobbler is the tech built by Last.fm to run their service. They collect stats on the tracks which people listen to (also 'Like's of tracks and artists).
So Last.fm does social similarity... users who listened to X also listened to Y - you like X so maybe you will also like Y.
Given a large enough user base submitting stats, social similarity is likely to provide better results than computer analysis approaches. For example, try querying the Last.fm API for artists similar to one you know: it will probably come up with some good matches and a few obscure or oddball ones, which nonetheless reflect real people's listening habits. The more obscure the artist you search for, the more likely you are to get weird matches.
Even if you could get the automatic genre classification method described by George Tzanetakis to work well, you would be missing out on the subjective judgements of quality supplied by real people. For example, two tracks might both look like 'Jazz', but there are many different kinds of Jazz, and I might be interested in non-Jazz albums that a favourite jazz musician has played on. Social similarity is more likely to capture that information.
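Querying Last.fm for similar artists, as suggested above, is only a few lines. A sketch against the public artist.getSimilar method; you need your own API key, and "YOUR_API_KEY" is a placeholder:

```python
# Hedged sketch: fetch artists similar to a seed artist from the Last.fm API.
# Requires a personal API key; response fields follow the documented JSON.
import requests

resp = requests.get("https://ws.audioscrobbler.com/2.0/", params={
    "method": "artist.getsimilar",
    "artist": "Miles Davis",
    "api_key": "YOUR_API_KEY",   # placeholder
    "format": "json",
    "limit": 5,
})
resp.raise_for_status()
for artist in resp.json()["similarartists"]["artist"]:
    print(artist["name"], artist["match"])  # name and similarity score
```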
I used to use Predixis Magic Mixer. It would perform a brief analysis of the audio in a file, produce a "fingerprint", and compare it to fingerprints in a central database. If the track was listed, it would store the identification code resulting from the analysis of the entire file in the client copy. If not, it would do a full analysis on the client computer (which takes a while), upload that to the central database, and keep the local copy as well. From that information it could set up a playlist that relates tunes to one another, depending on the actual sounds. I have not used it for a few years, so I don't know if the central database servers are still in operation; a web search suggests not. It should still work, but every file will require full analysis.
