Text recommendation / suggestion / machine learning

I am working on a personal project using some publicly available data sets. I have to build a text recommendation system that suggests meaningful text (1-2 lines) to users based on pre-defined rules.
How do I go about building a text recommendation system, and how do I define the rules? (The rules are simple checks like AC: if the check passes, suggest a text; if not, suggest an alternate text.) I also need to know how I can make the system learn by itself to update the rules.
Any inputs, links to research papers, GitHub links, etc. will help.

First of all, you will be encountering a cold start problem.
The simple solution would be to recommend all-time favorite texts. More sophisticated solutions are here.
Thereafter you can use collaborative filtering by collecting implicit feedback on the texts, such as views, clicks, comments, and bookmarks, because users seldom give explicit ratings.
So you can use implicit actions as ratings, for example:
'VIEW': 1.0,
'LIKE': 2.0,
'BOOKMARK': 3.0,
'FOLLOW': 4.0,
'COMMENT CREATED': 5.0
An example of this kind of recommender system can be found here.
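A minimal sketch of that idea, assuming the logged events sit in a pandas DataFrame with hypothetical columns user, item, and event (the weights are just the mapping above; this is item-item collaborative filtering, not a complete recommender):

```python
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Weight implicit actions as pseudo-ratings (same mapping as above).
EVENT_WEIGHTS = {
    'VIEW': 1.0,
    'LIKE': 2.0,
    'BOOKMARK': 3.0,
    'FOLLOW': 4.0,
    'COMMENT CREATED': 5.0,
}

def item_similarities(events: pd.DataFrame) -> pd.DataFrame:
    """events has columns: user, item, event (hypothetical schema)."""
    events = events.assign(rating=events['event'].map(EVENT_WEIGHTS))
    # Keep the strongest signal per (user, item) pair.
    ratings = events.groupby(['user', 'item'])['rating'].max().reset_index()
    users = ratings['user'].astype('category')
    items = ratings['item'].astype('category')
    matrix = csr_matrix((ratings['rating'], (users.cat.codes, items.cat.codes)))
    # Item-item cosine similarity: "users who viewed X also bookmarked Y".
    sims = cosine_similarity(matrix.T)
    return pd.DataFrame(sims, index=items.cat.categories, columns=items.cat.categories)
```

Recommending for a user is then a matter of taking the items they interacted with and looking up the most similar items they have not seen yet.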

Automatic text / HTML annotation / highlighting

Nowadays there is software which, when provided with a text or an HTML page, will output a summary.
I wonder if anything exists to automatically annotate (or at least highlight) the same documents.
The idea is to keep the full text but highlight the most meaningful parts (somewhat like a summarisation tool would do, I guess), and maybe provide additional inferred insights.
Also, I would like to know how it works if it exists :) Would it really be very different from summarization, or is it just the same principles with a different "output format"?
I'm looking for something to annotate HTML documents, like what AnnotatorJS is designed for.
This is not a complete answer, but it can lead to what you want. The first suggestion is looking at GATE. It provides a great annotation framework and as long as you don't want to program anything for it, it is easy to use. The second thing is to search for summarization plug-ins for GATE. GATE has been around for such a long time that I am sure someone has already implemented a summarization plug-in for it.
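On the "is it just summarization with a different output format" question: extractive summarizers score sentences and keep only the top ones; highlighting can reuse the same scoring but keep everything and mark the winners instead. A rough sketch of that idea with deliberately naive word-frequency scoring (no GATE involved; the <mark> tag is just one way to render it in HTML):

```python
import re
from collections import Counter

def highlight(text: str, ratio: float = 0.3) -> str:
    """Wrap the highest-scoring sentences in <mark> tags instead of dropping the rest."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence: str) -> float:
        # Average corpus frequency of the sentence's words.
        tokens = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    n_keep = max(1, int(len(sentences) * ratio))
    top = set(sorted(sentences, key=score, reverse=True)[:n_keep])
    return ' '.join(f'<mark>{s}</mark>' if s in top else s for s in sentences)
```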

Mahout recommendations with metadata related to preference

I was planning to write a recommender which treats preferences differently depending on contextual information (the time the preference was made, the device used to make it, ...).
Within the Mahout in Action book and within the code examples shipped with Mahout I can't seem to find anything related. In some examples there is metadata (a.k.a. content) used to express user or item similarity, but that's not what I'm looking for.
I wonder if anyone has already attempted to do something similar with Mahout?
Edit:
A practical example could be that the current session is done on a mobile device, and this should cause a boost (rating * 1.1) for all preferences tracked on mobile devices and a drop (rating * 0.9) for preferences tracked otherwise.
...
Another example could be that some ratings are collected implicitly and others explicitly. How would I be able to keep track of this fact without "coding" it directly into the tracked value, and how would I be able to use that information when calculating the scores?
I would say one approach is to use the Rescorer class to do just that, but my guess is that this is what you are referring to when you say that's not what you are looking for.
Another approach would be to pre-process the entire data you have to adjust the preferences according to your needs, before using Mahout to generate recommendations.
If you provide some more detail on how you expect to use your data to modify preferences, people here would be able to help even further.
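Mahout itself is Java, but the pre-processing idea is language-agnostic: adjust the tracked preferences before handing them to the recommender. A rough sketch under assumptions taken from the question's example (the CSV layout and the device column are made up for illustration; Mahout's FileDataModel then just reads plain userID,itemID,preference lines):

```python
import csv

def adjust_preferences(in_path: str, out_path: str) -> None:
    """Boost preferences tracked on mobile devices, dampen the rest (weights from the question)."""
    # Hypothetical input rows: user_id,item_id,rating,device
    with open(in_path, newline='') as src, open(out_path, 'w', newline='') as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst)
        for row in reader:
            factor = 1.1 if row['device'] == 'mobile' else 0.9
            adjusted = float(row['rating']) * factor
            # Output in the plain userID,itemID,preference form a file data model expects.
            writer.writerow([row['user_id'], row['item_id'], round(adjusted, 3)])
```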

How to store math equations/symbols and display them on the web?

I want to build a website where people can create tests with questions and answers. I want people to be able to type math equations/symbols into a textbox or something like that; the input will be stored in a database and also displayed on the web, like an image.
My idea is to store the user's input in LaTeX syntax and then display it using MathJax, but I don't know if that's possible or if there is a better way to do this.
One problem is that the user input will mix normal text with "math text" (LaTeX), so how can I separate them and only save the LaTeX? Please give me some ideas or suggest a way to solve this, thanks.
P.S.: I'm building this site in Ruby on Rails. I found the gem mathjax-rails, but it doesn't seem to work.
Consider building off Gollum. It is the backend for the wiki system GitHub uses and works fairly well with LaTeX equations (currently there is a very irritating bug with less-than/greater-than symbols, but it is documented and will likely be fixed in the next release). I started using it this summer to take notes in math classes; an example of a full page of rendered LaTeX equation notes is here.
Note: You must be logged into GitHub in order for the equations to render.
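On separating the "math text" from normal text: the usual approach is to have users wrap math in standard delimiters such as $...$ or \(...\), store the whole string as-is, and let MathJax pick the delimiters up client-side. If you really do want to extract only the LaTeX fragments on the server, a rough sketch (Python rather than Ruby, and the delimiter choice is an assumption about how you configure MathJax):

```python
import re

# Match $...$ or \( ... \) delimited math (a common MathJax delimiter setup).
MATH_PATTERN = re.compile(r'\$(.+?)\$|\\\((.+?)\\\)', re.DOTALL)

def extract_latex(mixed_text: str) -> list[str]:
    """Return only the LaTeX fragments from text that mixes prose and math."""
    return [a or b for a, b in MATH_PATTERN.findall(mixed_text)]

# Example: extract_latex("Solve $x^2 + 1 = 0$ for x.") -> ['x^2 + 1 = 0']
```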

iOS / C: Algorithm to detect phonemes

I am searching for an algorithm to determine whether realtime audio input matches one of 144 given (and comfortably distinct) phoneme-pairs.
Preferably the lowest level that does the job.
I'm developing radical / experimental musical training software for iPhone / iPad.
My musical system comprises 12 consonant phonemes and 12 vowel phonemes, demonstrated here. That makes 144 possible phoneme pairs. The student has to sing the correct phoneme pair 'laa duu bee' etc in response to visual stimulus.
I have done a lot of research into this, and it looks like my best bet may be to use one of the iOS Sphinx wrappers ("iPhone App › Add voice recognition?" is the best source of information I have found). However, I can't see how I would adapt such a package. Can anyone with experience using one of these technologies give a basic rundown of the steps that would be required?
Would training by the user be necessary? I would have thought not, as it is such an elementary task compared with full language models of thousands of words and a far larger and more subtle phoneme base. However, it would be acceptable (though not ideal) to have the user train 12 phoneme pairs: { consonant1+vowel1, consonant2+vowel2, ..., consonant12+vowel12 }. The full 144 would be too burdensome.
Is there a simpler approach? I feel like using a fully featured continuous speech recogniser is using a sledgehammer to crack a nut. It would be far more elegant to use the minimum technology that would solve the problem.
So really I'm hunting for any open source software that recognises phonemes.
PS: I need a solution which runs pretty much in real time, so even as they are singing the note, it first blinks to show that it picked up the phoneme pair that was sung, and then it glows to show whether they are singing the correct pitch.
If you are looking for a phone-level open source recogniser, then I would recommend HTK. Very good documentation is available with this tool in the form of the HTK Book. It also contains an entire chapter dedicated to building a phone level real-time speech recogniser. From your problem statement above, it seems to me like you might be able to re-work that example into your own solution. Possible pitfalls:
Since you want to build a phone-level recogniser, the amount of data needed to train the phone models would be very large. Also, your training database should be balanced in terms of the distribution of the phones.
Building a speaker-independent system would require data from more than one speaker. And lots of that too.
Since this is open-source, you should also check into the licensing info for any additional details about shipping the code. A good alternative would be to use the on-phone recorder and then have the recorded waveform sent over a data channel to a server for the recognition, pretty much something like what google does.
I have a little bit of experience with this type of signal processing, and I would say that this is probably not the type of finite question that can be answered definitively.
One thing worth noting is that although you may restrict the phonemes you are interested in, the possibility space remains the same (i.e. infinite-ish). User training might help the algorithms along a bit, but useful training takes quite a bit of time and it seems you are averse to too much of that.
Using Sphinx is probably a great start on this problem. I haven't gotten very far in the library myself, but my guess is that you'll be working with its source code yourself to get exactly what you want. (Hooray for open source!)
...using a sledgehammer to crack a nut.
I wouldn't label your problem a nut, I'd say it's more like a beast. It may be a different beast than natural language speech recognition, but it is still a beast.
All the best with your problem solving.
Not sure if this would help: check out OpenEars' LanguageModelGenerator. OpenEars uses Sphinx and other libraries.
http://www.hfink.eu/matchbox
This page links to both a YouTube video demo and the GitHub source.
I'm guessing it would still be a lot of work to mould it into the shape I'm after, but it definitely does do a lot of the work.
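On the "minimum technology" point: for a fixed, small set of comfortably distinct phoneme pairs, a much simpler route than a full recogniser is template matching, e.g. comparing MFCCs of the incoming audio against one recorded template per pair with dynamic time warping. A rough prototyping sketch of that idea in Python using librosa (the file layout is made up, and a real-time iOS version would need to be ported to C/Objective-C and made streaming):

```python
import librosa
import numpy as np

def mfcc(path: str) -> np.ndarray:
    """Load an audio file and return its MFCC feature matrix (13 x frames)."""
    y, sr = librosa.load(path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

def classify(input_path: str, template_paths: dict) -> str:
    """Return the label of the template whose MFCCs align best (lowest DTW cost) with the input."""
    query = mfcc(input_path)
    best_label, best_cost = None, np.inf
    for label, path in template_paths.items():
        template = mfcc(path)
        # Dynamic time warping between the two MFCC sequences.
        cost_matrix, _ = librosa.sequence.dtw(X=query, Y=template)
        cost = cost_matrix[-1, -1] / (query.shape[1] + template.shape[1])  # length-normalised
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label

# Example (hypothetical files): classify('sung.wav', {'laa': 'laa.wav', 'duu': 'duu.wav'})
```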

Is there a good script similar to Iconfinder.com?

Basically I'm looking for a search engine that searches through a given database. The content being searched will be text.
You will probably want to use a service such as Solr. The easiest way to get started using it is to find a 'cloud' based version, such as Websolr. However, the solution will depend on what language you wish to use when programming your site.
Solutions depend somewhat on language:
1. For Java/C#, you have Lucene/Solr
2. For Python, you have Haystack
You could do text search in the DB directly via LIKE/ILIKE, but performance depends on the DB.
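For the simplest LIKE-based option, a rough sketch with SQLite (the documents table and its columns are made up for the example):

```python
import sqlite3

def search(db_path: str, query: str, limit: int = 20):
    """Naive substring search over a hypothetical documents(id, title, body) table."""
    conn = sqlite3.connect(db_path)
    try:
        pattern = f'%{query}%'
        # LIKE with a leading wildcard can't use a normal index, so this scans the table;
        # fine for small datasets, use Solr/Lucene or a full-text index beyond that.
        return conn.execute(
            "SELECT id, title FROM documents WHERE title LIKE ? OR body LIKE ? LIMIT ?",
            (pattern, pattern, limit),
        ).fetchall()
    finally:
        conn.close()
```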
Iconfinder was coded specifically for icon search and at the time (launched in 2007) there were no scripts that worked well for this.
Building a search engine like Iconfinder is not rocket science. I think the hardest part is getting the SQL tuned and figuring out how to rank the content. At the moment I collect data about impressions and downloads and calculate a value from that. The icons' rank is based on this value (downloads/impressions) and how well the keywords match the tags for the given icon.
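A rough sketch of that kind of ranking; the exact way the two signals are combined here (multiplication) is an assumption, since the answer only says the rank uses the downloads/impressions value together with tag matching:

```python
def rank_score(downloads: int, impressions: int, query_terms: set, tags: set) -> float:
    """Combine a popularity ratio with keyword/tag overlap, as described above."""
    popularity = downloads / impressions if impressions else 0.0
    tag_match = len(query_terms & tags) / len(query_terms) if query_terms else 0.0
    return popularity * tag_match

# Example: rank_score(120, 1000, {'arrow', 'blue'}, {'arrow', 'icon'}) -> 0.12 * 0.5 = 0.06
```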