Music analysis software [closed] - machine-learning

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
Greetings
I may have imagined this, but does anyone know whether Last.fm previously used some form of open source project to analyse music and determine similar music?
As it has now moved to a paid version, I'd like to make something that can add known music to my playlist. (I hate scanning my computer for similar music manually.)
Failing that, does anyone know of any system that I could use to replace this? Ideally I'd like some form of API or source code that I can use to automate the whole process into batch jobs.
Thanks,
[edit]
Ideally I was looking for something more along the lines of content matching. I'm the type of person who just throws all my music into one unorganized location. Then being lazy I would ideally expect a playlist to be generated giving me a similar music type of playlist.
Last.fm uses http://www.audioscrobbler.net/ - it also provides access to its database via an API.
[/edit]

Music similarity is not an easy problem.
There are two general approaches to solving this problem.
Approach 1.
Throw data at the problem. This is the approach Last.fm and Pandora take. It's basically one huge database maintained by either a community or a group of experts. Note that to use this approach you will need clean metadata or some kind of audio fingerprinting solution such as MusicBrainz. Once you have the feature database, you can use measures such as the Pearson correlation coefficient to find similar items.
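To make the Pearson step concrete, here is a minimal sketch in plain Python. The `ratings` table and the track names are made up for illustration; a real feature database would hold whatever per-listener or per-expert scores you have collected.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant row carries no correlation information
    return cov / sqrt(var_x * var_y)

# Hypothetical database: each track is scored by the same five listeners.
ratings = {
    "track_a": [5, 4, 1, 1, 3],
    "track_b": [4, 5, 2, 1, 3],
    "track_c": [1, 2, 5, 4, 3],
}

def most_similar(name, table):
    """Return (score, other_track) pairs, best match first."""
    scores = [
        (pearson(table[name], row), other)
        for other, row in table.items()
        if other != name
    ]
    return sorted(scores, reverse=True)
```

Calling `most_similar("track_a", ratings)` ranks `track_b` ahead of `track_c`, because the same listeners scored those two tracks in the same direction.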
Approach 2.
Throw algorithms at the problem, in particular computer audition algorithms. This means you calculate a vector of features for each song and then, using neural nets and a variety of other techniques, find other songs with similar vectors. This approach has been used successfully for automatic genre classification and query by example.
If you are looking for open source software for music analysis, Marsyas can do pretty much everything the commercial stuff can do. It's the brainchild of George Tzanetakis, and on his web site you can find many papers about the state of affairs in computer audition.
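As a rough illustration of "find other songs with similar vectors", the sketch below uses a plain nearest-neighbour search rather than a neural net. The feature names and values are invented; in practice they would come from an analysis tool.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical per-song features: (tempo in BPM, spectral centroid in Hz, zero-crossing rate)
features = {
    "song_1": (128.0, 3100.0, 0.12),
    "song_2": (126.0, 2950.0, 0.11),
    "song_3": (72.0, 1200.0, 0.04),
    "song_4": (80.0, 1350.0, 0.05),
}

def standardise(table):
    """Scale every feature to zero mean and unit variance so no single unit dominates."""
    columns = list(zip(*table.values()))
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return {
        name: tuple((v - m) / s for v, (m, s) in zip(vec, stats))
        for name, vec in table.items()
    }

def nearest(name, table, k=1):
    """Return the k songs whose standardised vectors are closest (Euclidean) to `name`."""
    scaled = standardise(table)
    target = scaled[name]
    dists = [
        (sqrt(sum((a - b) ** 2 for a, b in zip(target, vec))), other)
        for other, vec in scaled.items()
        if other != name
    ]
    return [other for _, other in sorted(dists)[:k]]
```

Standardising first matters here because the centroid is in the thousands while the zero-crossing rate is a small fraction; without it the distance would be decided by the centroid alone.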

There's a web API at The Echo Nest that includes a get_similar web service, which retrieves artists similar to a set of seed artists. You can use this to help build playlists. The Echo Nest also has a set of web APIs that perform a detailed analysis of a track (similar to the aforementioned Marsyas), which one could use as the basis for an acoustic song similarity method. (Caveat: I work at The Echo Nest.) Of course, if you use iTunes, there are some canned solutions. iTunes now has a music recommender / playlist generator that will build playlists of songs from similar artists. Similarly, the company Mufin has an iTunes add-on that performs acoustic analysis of your tracks and uses this analysis to build playlists.
If you are interested in building your own music similarity system, I suggest that you take a look at the proceedings of ISMIR (the International Society for Music Information Retrieval). There's quite a bit of research on music similarity and playlisting that you'll find helpful. You can find the proceedings at ismir.net

Wouldn't it be simpler and more efficient to query (or build?) some internet database based on genre, style, etc.? I used last.fm and similar sites but never felt they did anything more than this (at least the results didn't indicate it) ;)

I am not very sure what exactly you want, but how about MusicBrainz?

To be clear, AudioScrobbler is the tech built by Last.fm to run their service. They collect stats on the tracks which people listen to (also 'Like's of tracks and artists).
So Last.fm does social similarity... users who listened to X also listened to Y - you like X so maybe you will also like Y.
Given a large enough user base submitting stats, social similarity is likely to provide better results than computer analysis approaches. For example, try querying the Last.fm API for artists similar to someone you know: it will probably come up with some good matches and a few obscure or oddball ones, which nonetheless reflect real people's listening habits. The more obscure the artist you search for, the more likely you are to get weird matches.
Even if you could get the automatic genre classification method described by George Tzanetakis to work well, you would be missing out on the subjective judgements of quality supplied by real people. For example, two tracks may both look like 'Jazz', but there are many different kinds of jazz, and I might be interested in non-jazz albums that a favourite jazz musician has played on. Social similarity is more likely to capture that information.
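To show what "users who listened to X also listened to Y" can look like in code, here is a small sketch. It is not how Last.fm actually computes similarity; the `histories` data is invented and the score is a simple co-occurrence count normalised by each artist's popularity.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt

# Hypothetical listening histories: user -> set of artists they played
histories = {
    "u1": {"Miles Davis", "John Coltrane", "Herbie Hancock"},
    "u2": {"Miles Davis", "John Coltrane"},
    "u3": {"Miles Davis", "Herbie Hancock", "Radiohead"},
    "u4": {"Radiohead", "Portishead"},
}

def co_listen_similarity(histories):
    """Score each artist pair by shared listeners / sqrt(listeners_a * listeners_b)."""
    listeners = defaultdict(int)
    together = defaultdict(int)
    for artists in histories.values():
        for artist in artists:
            listeners[artist] += 1
        for a, b in combinations(sorted(artists), 2):
            together[(a, b)] += 1
    sims = defaultdict(dict)
    for (a, b), shared in together.items():
        score = shared / sqrt(listeners[a] * listeners[b])
        sims[a][b] = score
        sims[b][a] = score
    return sims

def similar_artists(artist, sims):
    """Artists that share listeners with `artist`, strongest overlap first."""
    return sorted(sims[artist], key=lambda other: sims[artist][other], reverse=True)
```

With this toy data, Radiohead shows up as a weak match for Miles Davis purely because one listener played both, which is the kind of oddball-but-real result described above.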

I used to use Predixis Magic Mixer. It performed a brief analysis of the audio in a file, produced a "fingerprint", and compared it to fingerprints in a central database. If the track was listed, it copied the stored identification code (the result of analysing the entire file) into the client's copy. If not, it did a full analysis on the client computer (which takes a while), uploaded the result to the central database, and kept a local copy as well. From that information it could set up a playlist that related tunes to one another based on the actual sounds. I have not used it for a few years, so I can't confirm first-hand whether the central database servers are still in operation, but a web search says no. It should still work, but every file will require full analysis.


Improving Twilio Speech Recognition of Proper Nouns

I am working on an application that gathers a user's voice input for an IVR. The input we're capturing is a limited set of proper nouns, but even though we have added hints for all of the possible options, we very frequently get back unintelligible results, possibly because our users have various accents from all parts of the world. I'm looking for a way to improve the speech recognition results beyond just using hints. The available Google adaptive classes will not be useful, as none match the type of input that we're gathering. I see that Twilio recently added something called experimental_utterances that may help, but I'm finding little technical documentation on what it does or how to implement it.
Any guidance on how to improve our speech recognition results?
Google does a decent job of recognising proper names, but only asynchronously, not in real time. I've not seen a PaaS tool that can do this in real time. I recommend you change your approach: identify callers based on ANI or account number, or have them record their name for manual transcription.

Stream multiple media sources from a single software/hardware encoder?

It's been a while since I first started looking into this and I still haven't found any feasible solutions, here's to hoping someone might have some suggestions/ideas...
The situation: We currently have a couple of live streams carrying mixed source content (some are streamed as file playlists that get modified to change the files in the playlist, while others are streamed as live video directly from an input). For each new live stream we usually just end up setting up a new streamer, which feels rather counterproductive and wasteful.
The question: Is there a hardware or software solution (Linux or Windows) that would allow live streaming of multiple sources from the same encoder, for example two independent file playlists and optionally one or two live A/V inputs?
According to my findings, with the help of FFmpeg it is possible to stream multiple live A/V inputs and even file playlists, but it requires too much hacking to get working, and playlists have to be redone by hand and restarted every time changes are made. This might work for me personally, but it won't do for less tech-savvy people.
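For anyone going the FFmpeg route, the "redone by hand" part can at least be scripted. The sketch below generates a playlist in the format read by FFmpeg's concat demuxer and builds the matching command line; it does not run FFmpeg and does not avoid the restart when the playlist changes. The helper names, file names and destination URL are placeholders, and `-c copy` assumes every file already uses codecs that suit the output container.

```python
def playlist_text(media_files):
    """Return playlist contents in the format read by FFmpeg's concat demuxer."""
    lines = []
    for item in media_files:
        # Inside single quotes, a literal quote is written as '\''
        escaped = str(item).replace("'", "'\\''")
        lines.append(f"file '{escaped}'")
    return "\n".join(lines) + "\n"

def build_stream_command(playlist_path, destination_url):
    """Build (but do not run) an ffmpeg command that streams the playlist."""
    return [
        "ffmpeg",
        "-re",               # read input at its native rate, as a live source would
        "-f", "concat",      # treat the input as a concat playlist
        "-safe", "0",        # allow absolute paths in the playlist
        "-i", str(playlist_path),
        "-c", "copy",        # assumption: no re-encoding needed
        "-f", "flv",
        destination_url,
    ]

# Example with placeholder names; write the text to a file before running the command.
example_playlist = playlist_text(["intro.mp4", "show_part1.mp4", "show_part2.mp4"])
example_command = build_stream_command("playlist.txt", "rtmp://example.invalid/live/stream1")
```

A less technical operator could then edit a plain list of files and let a wrapper script regenerate the playlist and relaunch the command, one process per output stream on the same machine.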
I'm basically looking for a way to reduce the computer hardware instead of allowing it to exponentially grow with each addition of a new live streaming source/destination.
Thank you for all your input and all the posted solutions. By sheer luck I found the solution I was originally looking for.
For anyone else looking for this or similar solution, the combo of systems that can combat our unusual requirements (and that can be integrated into our existing workflow by adjusting the hardware/software to meet our needs instead of us adjusting to hardware/software requirements/limitations) are: Sorenson Squeeze Server 3.0, MediaExcel Hero Live and MediaExcel File

Human annotation tool for corpora in NLP [closed]

I am trying to build my own training corpus for Named Entity Recognition, but I don't know if there is already an existing tool for this or if I have to implement one myself.
Basically, what I need to do is take a corpus and manually tag it word by word, which is pretty tedious, but it has to be done.
Can anyone tell me if there is already an existing one and where to get it?
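For context, word-by-word NER annotations are often stored as one tag per token in a BIO scheme (B- begins an entity, I- continues it, O is outside any entity). The sketch below converts span annotations into that form; the sentence, labels and function name are illustrative, not the output format of any particular tool.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans into (token, BIO tag) pairs."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the same entity
    return list(zip(tokens, tags))

# Hypothetical annotated sentence
tokens = ["Ada", "Lovelace", "lived", "in", "London"]
spans = [(0, 2, "PER"), (4, 5, "LOC")]
tagged = spans_to_bio(tokens, spans)
```

Annotation tools generally let you mark the spans with the mouse and export something equivalent, so you rarely have to type the per-word tags yourself.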
I had a good experience working with BRAT.
GATE is another option, but it is a very complex annotation tool with a steeper learning curve.
We had a nice experience using DataTurks. They provide a nice, intuitive UI that lets you add collaborators and offers insights into the data, a leaderboard for annotators, and some other funky features.
https://dataturks.com
For online annotation of a text or HTML corpus of relatively short documents, I also recommend BRAT. You will have to go under the hood of the Python web application if you want to do anything custom. It also failed to work for me on large HTML documents (100 or so pages).
I have also used stand-alone apps:
Protege + Knowtator: a bit cumbersome to set up and use, but it works;
GATE: also cumbersome, and it somewhat works. Back up your annotations at regular intervals, as you might get surprised by a stack trace that also wipes or corrupts your annotated corpus (which is just serialized Java objects).
If you are dealing with PDF documents, we built a web-based PDF Annotation Tool: NOTA. It accepts anything printed to PDF, including scans. We do commercial OCR on our end to recover text from images. There is a REST API to create color-coded annotation schemas and pre-populate documents with annotations, as well as a REST API for exporting formatted text and annotation offsets. There is also a JS API you can use to customize any annotation workflows, add metadata to annotations, etc. Relationships are not supported out of the box. Large documents, 200+ pages are supported. Email us and we can give you an API key to try it out. Details and documentation links can be found here. It is free for small research projects.
I co-develop the web-based text annotation tool tagtog.net
There is nothing to install, and you can define the types of entities you want to annotate. Additionally, you can annotate relationships, document labels, and much more. You can upload your documents in many different formats, including PDF and Markdown, and annotate collaboratively with your team. We have put great care into making the interface easy and beautiful.
You can start right away with a free account. Also I would be happy to help you with any doubt or issue you may have; just ping me or write us an email to the address shown on the website, tagtog.net.
Our annotation tool Prodigy is very scriptable, and is designed for active learning. It integrates especially well with our NLP library spaCy.
We've paid particular attention to the Named Entity Recognition (NER) annotation workflows, as entity annotation can otherwise be very slow. I have a tutorial video on this:
https://www.youtube.com/watch?v=l4scwf8KeIA
There is also a tool called Dataturks: a super simple, fully online NLP annotation tool. It is easy enough that I can even get my teammates to complete datasets for our projects.
Try TagEditor.
It is a desktop application designed to annotate text for training with the spaCy library.
You can tag named entities, dependencies, parts of speech, and text categories, and output a JSON file.

shazam for voice recognition on iphone

I am trying to build an app that allows the user to record individual people speaking, save the recordings on the device, and tag each recording with the name of the person who spoke. Then there is the detection mode, in which I record someone and the app can tell me their name if they are in the local database.
First of all - is this possible at all? I am very new to iOS development and not so familiar with the available APIs.
More importantly, which API should I use (ideally free) to match the incoming voice against the recordings I have in the local database? This should behave something like Shazam, but much simpler, since the database I am matching against is much smaller.
If you're new to iOS development, I'd start with the core app to record the audio and let people manually choose a profile/name to attach it to and worry about the speaker recognition part later.
You obviously have two options for the recognition side of things: You can either tie in someone else's speech authentication/speaker recognition library (which will probably be in C or C++), or you can try to write your own.
How many people are going to use your app? You might be able to create something basic yourself: if it's the difference between a man and a woman, you could probably figure that out by doing an FFT spectral analysis of the audio and finding where the frequency peaks are. Obviously the frequencies used to enunciate different phonemes are going to vary somewhat, so solving the general case for two people who sound fairly similar is probably hard. You'll need to train the system with a bunch of spoken text and build some kind of model of frequency distributions. You could try clustering or something similar, but you're going to run into a fair bit of maths fairly quickly (Gaussian mixture models and the like). There are libraries and projects that will do this. You might be able to port this one from MATLAB, for example: https://github.com/codyaray/speaker-recognition
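As a very rough sketch of the "find where the frequency peaks are" idea, the code below runs a naive DFT (slow, but dependency-free) over a short window and reports the strongest bin. It is tested only on synthetic sine tones; with real speech the strongest peak may be a harmonic rather than the fundamental, so treat it as a starting point. The function names, sample rate and window length are my own choices.

```python
import cmath
import math

def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest bin in a naive DFT of the samples."""
    n = len(samples)
    if n < 4:
        raise ValueError("need at least 4 samples")
    # A Hann window reduces spectral leakage from the edges of the frame.
    windowed = [
        s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(samples)
    ]
    best_bin, best_mag = 0, 0.0
    for k in range(1, n // 2):  # skip DC, stop below the Nyquist bin
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * sample_rate / n

def synth_tone(freq, sample_rate=8000, n=400):
    """Generate n samples of a pure sine tone, standing in for recorded audio."""
    return [math.sin(2 * math.pi * freq * i / sample_rate) for i in range(n)]

low_voice_peak = dominant_frequency(synth_tone(120.0), 8000)
high_voice_peak = dominant_frequency(synth_tone(220.0), 8000)
```

With 400 samples at 8000 Hz each bin is 20 Hz wide, so this only separates voices whose peaks are clearly apart. On a device you would use a proper FFT routine instead of this O(n^2) loop.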
If you want to take something off-the-shelf, I'd go with a straight C library like mistral, as it should be relatively easy to call into from Objective-C.
The SpeakHere sample code should get you started for audio recording and playback.
Also, it may well take longer for the user to train your app to recognise them than it's worth in time-saving from just picking their name from a list. Unless you're intending their voice to be some kind of security passport type thing, it might just not be worth bothering with.

iOS / C: Algorithm to detect phonemes

I am searching for an algorithm to determine whether realtime audio input matches one of 144 given (and comfortably distinct) phoneme-pairs.
Preferably the lowest level that does the job.
I'm developing radical / experimental musical training software for iPhone / iPad.
My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair 'laa duu bee' etc in response to visual stimulus.
I have done a lot of research into this, and it looks like my best bet may be to use one of the iOS Sphinx wrappers ( iPhone App › Add voice recognition? is the best source of information I have found ). However, I can't see how I would adapt such a package. Can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?
Would training be necessary by the user? I would have thought not, as it is such an elementary task, compared with full language models of thousands of words and far greater and more subtle phoneme base. However, it would be acceptable (not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.
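To spell out the arithmetic: 12 consonants times 12 vowels gives 144 pairs, while the proposed 12 training pairs still cover every consonant and every vowel exactly once. The symbols below are placeholders, not the actual phonemes of the system.

```python
from itertools import product

# Placeholder symbols standing in for the real 12 consonants and 12 vowels
consonants = [f"C{i}" for i in range(1, 13)]
vowels = [f"V{i}" for i in range(1, 13)]

# Every consonant combined with every vowel: the full recognition vocabulary
all_pairs = list(product(consonants, vowels))

# consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12
training_pairs = list(zip(consonants, vowels))

covered_consonants = {c for c, _ in training_pairs}
covered_vowels = {v for _, v in training_pairs}
```

Whether 12 recordings are enough in practice is a separate question, since a consonant can sound different depending on the vowel that follows it.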
Is there a simpler approach? I feel like using a fully featured continuous speech recogniser is using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that would solve the problem.
So really I'm hunting for any open source software that recognises phonemes.
PS: I need a solution that runs pretty much in real time, so that even as they are singing the note, an indicator first blinks on to show that it picked up the phoneme pair that was sung, and then glows to show whether they are singing the correct pitch.
If you are looking for a phone-level open source recogniser, then I would recommend HTK. Very good documentation is available with this tool in the form of the HTK Book. It also contains an entire chapter dedicated to building a phone level real-time speech recogniser. From your problem statement above, it seems to me like you might be able to re-work that example into your own solution. Possible pitfalls:
Since you want to build a phone-level recogniser, the amount of data needed to train the phone models would be very large. Also, your training database should be balanced in terms of the distribution of the phones.
Building a speaker-independent system would require data from more than one speaker, and lots of it.
Since this is open source, you should also check the licensing terms for any conditions on shipping the code. A good alternative would be to use the on-phone recorder and then send the recorded waveform over a data channel to a server for recognition, pretty much like what Google does.
I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.
One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.
Using Sphinx is probably a great start on this problem. I haven't gotten very far in the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)
"...using a sledgehammer to crack a nut."
I wouldn't label your problem a nut, I'd say it's more like a beast. It may be a different beast than natural language speech recognition, but it is still a beast.
All the best with your problem solving.
Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.
http://www.hfink.eu/matchbox
This page links to both YouTube video demo and github source.
I'm guessing it would still be a lot of work to mould it into the shape I'm after, but it definitely does a lot of the work.
