Open Alternatives to Google Prediction API

A recent announcement by Google about the Google Prediction API sounded very interesting. It could be useful for a project that is coming up, and would probably do a better job than some custom code I was considering.
However, there is some vendor lock-in. Google retain the trained model, and could later choose to overcharge me for it. It occurred to me that there are probably open-source equivalents, if I was willing to host the training myself (I am) and live without their ability to throw hardware at the problem at a moment's notice.
The last time I looked at third-party machine-learning code was many years ago, and there were a lot of details that needed to be carefully considered and customised for your project. Google appear to have hidden those decisions and to take care of them for you. To me, this is still indistinguishable from magic, but I would like to hear whether others can do the same.
So my question is:
What alternatives to Google Prediction API exist which:
categorise data with supervised machine learning,
can be easily configured (or don't need configuration) for different kinds and scales of data-sets, and
are open-source and self-hosted (or, at the very least, give you royalty-free use of your model, without a dependence on a third party)?

Maybe Apache Mahout?

PredictionIO is an open source machine learning server for software developers to create predictive features, such as personalization, recommendation and content discovery.

I have been looking recently at tools like the Google Prediction API; one of the first ones I was pointed to was the Weka machine learning toolkit, which could be worth checking out for anyone looking.
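For reference, a minimal supervised-classification sketch with Weka's Java API; the file name is just a placeholder, and it assumes a standard ARFF training set with the class label as the last attribute:

```java
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaSketch {
    public static void main(String[] args) throws Exception {
        // Load a training set (placeholder file name); the class label is the last attribute.
        Instances train = DataSource.read("train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // Train a decision tree (J48 is Weka's C4.5 implementation).
        Classifier tree = new J48();
        tree.buildClassifier(train);

        // Classify an instance; here we simply re-classify the first training row.
        double label = tree.classifyInstance(train.instance(0));
        System.out.println("Predicted class: " + train.classAttribute().value((int) label));
    }
}
```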

I'm not sure if it's relevant, but DirectedEdge seems to be doing exactly that :)

There is a good free-to-use service, Yandex Predictor, with a quota of 100,000 requests per day. It works for text only, and supports several languages and spell correction.
You need to get a free API key, then you can use the simple RESTful API. The API supports JSON, XML and JSONP as output.
Unfortunately I cannot find documentation in English; you can use Google Translate.
I can translate the docs if there is some demand.
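As a rough sketch, calling a RESTful endpoint like that from Java could look like the following; the URL, path and parameter names below are placeholders rather than the documented Yandex API, so substitute the real endpoint and your key from the docs:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class PredictorCall {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint and parameters -- check the official docs for the real ones.
        String apiKey = "YOUR_API_KEY";
        String query = URLEncoder.encode("hello wor", "UTF-8");
        URL url = new URL("https://predictor.example.invalid/predict.json?key=" + apiKey + "&q=" + query);

        // Plain HTTP GET; the service returns JSON, XML or JSONP depending on the format requested.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```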

Related

Moses - Online Integration

We're actually looking to integrate Moses into our localization workflow. Our application is in Java and we're looking at using Moses' functionalities using xml-rpc calls.
Specifically, we're looking at APIs for:
Incremental training (i.e. avoiding having to retrain the model from scratch every time we wish to use some new training data),
Domain-specific training (i.e. it should maintain separate phrase tables for each domain the input data belongs to), and
Decoding.
The tutorial says that these can be achieved via XML-RPC calls, but I can't find any examples or clear ways to do them. Can someone please provide some examples?
Also, I would like to know whether the training and decoding phases can be done in a distributed manner.
Thanks!
This question is perfectly suited to the Moses mailing list:
http://www.statmt.org/moses/?n=Moses.MailingLists
Moses server documentation (via XML-RPC):
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc28
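For the decoding call specifically, here is a minimal sketch using the Apache XML-RPC client from Java. It assumes mosesserver is running on localhost port 8080 and exposes a translate method that takes a struct with a "text" field; check the documentation above for the exact method and parameter names:

```java
import java.net.URL;
import java.util.HashMap;
import java.util.Map;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

public class MosesClient {
    public static void main(String[] args) throws Exception {
        // Point the client at the running mosesserver instance (host/port are assumptions).
        XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
        config.setServerURL(new URL("http://localhost:8080/RPC2"));
        XmlRpcClient client = new XmlRpcClient();
        client.setConfig(config);

        // The server expects a struct; "text" holds the tokenized source sentence.
        Map<String, String> params = new HashMap<String, String>();
        params.put("text", "das ist ein kleines haus");

        // execute() returns a struct; the translation is typically under a "text" key.
        Object result = client.execute("translate", new Object[] { params });
        System.out.println(((Map<?, ?>) result).get("text"));
    }
}
```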
However, I have had better experiences with moses/contrib/web/bin/daemon.pl, which also runs a server and lets you communicate with it via a TCP stream.
General examples are harder to find (everyone has a different environment, ...), but make your question more specific and send it to the Moses mailing list. (e.g. someone had a problem with server installation: http://comments.gmane.org/gmane.comp.nlp.moses.user/7242 )

Would Codeigniter 2 be a good choice for my case - explained below?

I have spent a week trying to wrap my head around the Yii framework, but while I do get a sense of its elegance, I am finding the learning curve rather steep compared to the 2 days I spent on CodeIgniter. My background is in Unix systems programming (communication stacks), with no MVC exposure, and I know only basic PHP (which I find fairly simple and straightforward).
I've considered the no-framework approach, but find it even more daunting given that I've almost no web-development experience. A framework, at minimum, would give guidance in terms of architecture and design.
I might be shooting myself in the foot, but with a tight deadline on ramp-up, and delivering a somewhat complex web-application, I need to get productive real fast.
So I am wondering if the community can guide me on whether CodeIgniter 2.x would be a good choice for me, given the following requirements:
Easy to learn, so that I am able to deliver something functional quickly. Thus it needs extensive, easy-to-use documentation, tutorials (beyond simple blogs) and a very active community.
The framework needs to make it easy to integrate features like:
User registration with captcha
User verification using random verification key sent via mobile phone
Send Email, short-message to mobile phone
Integrate with Payment Gateway
Have a significant number (close to a hundred) of possible CRUD operations
Doesn't get in the way of (if not actively making it easy) working with AJAX, for things like timeline presentation - including audio snippets, photographs and video snippets
Doesn't get in the way of (if not actively making it easy) making the web application accessible on mobile devices such as smartphones
Has reasonable performance. Need not be the fastest, but performance is a concern, although secondary.
Of course, I do not need all the features on day one, and I am willing to invest some time in reading and learning about the framework, but I wouldn't want to read an entire manual first.
Note that I've already searched the CodeIgniter forums and found discussion of some of the required functionality; however, most of the interesting libraries seem to be available only for CodeIgniter 1.7, and I found little confirmation of them also being available for CodeIgniter 2.1! Also, all the CodeIgniter books are for 1.7 and none for 2.1. Does that mean that 2.x doesn't have enough adoption and community support?
Yes. CodeIgniter 2 is a good choice.
It is pretty easy to use and learn. I'd suggest understanding the MVC architecture in general though. Their official documentation is awesome, although sometimes I yearn for the straightforward API format. You'll notice that they don't show all the parameters a function will accept up-front; sometimes you have to read the entire page to figure out all your available options. Note: you'll find that there is no one way or standard for using models in CI. They're as helpful as you manually code them to be.
There are tons of libraries and helpers to do pretty much anything you need, as well as tutorials on how to do it. I'm not sure what you meant by a verification key sent via mobile phone. AJAX is not a problem; CI has this pretty handy is_ajax_request() function that's really useful. Note: there used to be a problem with AJAX requests expiring sessions. Not sure if that's still an issue. As far as making it accessible for mobile devices, you'll find more issues on the front end than the back.
Baseline (virgin CI) performance is pretty good. It's up to you (your code and queries) to keep it lean.
Many of the libraries you find may say that they were made for 1.7, but may still work with 2.x. You can try updating them yourself if necessary; we'd be glad to help. Note that "plugins" have been deprecated in CI 2, so you'll have to convert plugins to helpers or libraries (depending on your needs). CI 1.7 has a 3-year lead on CI 2, so it'll take some time for "the community" to catch up.
Hope this helps.

A production-ready, real-time recommendation engine that's easy to set up

I want to store a large number of data points for user actions, like likes, tags, etc. (I have plans for both e-commerce and document management).
With the data points, I want to support functions such as
"users who loved X loved Y,Z" recommendations
"fetch more stuff similar to X,Y" clustering.
By production-ready and real-time, I mean that I can enter data points and make queries at the same time; the server will take care of answering queries and updating scores by itself.
I searched around the interwebs and the solutions that come up fall into one of two categories:
Data-mining libraries that are mostly academic-oriented and are meant for large batch operations, not heavy real-time queries
Hadoop/Mahout, which is production-ready and supports real-time updates and queries, but has a steep learning curve and is tough to administer.
For recommenders, Mahout has a non-distributed recommender implementation that does not use Hadoop. In fact, this is the only part that is real-time; the Hadoop-based parts are not.
I think there is little learning curve to it; see here and here for a pretty complete writeup.
Mahout in Action chapters 2-5 cover this quite well too.
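To make the non-distributed part concrete, here is a minimal sketch using Mahout's Taste API; the CSV file name and the neighbourhood size are just illustrative, and the file is assumed to contain userID,itemID,rating lines:

```java
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class TasteSketch {
    public static void main(String[] args) throws Exception {
        // "ratings.csv" is a placeholder: lines of userID,itemID,rating.
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // "Users who loved X loved Y": user-based collaborative filtering.
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top-3 recommendations for user 1, computed in real time against the in-memory model.
        List<RecommendedItem> recs = recommender.recommend(1, 3);
        for (RecommendedItem item : recs) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```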
Please understand that for useful recommendations, the various parameters of such a system must be carefully fine-tuned. The out-of-the-box functionality many systems have (Oracle data mining, Microsoft data mining extensions, etc.) just offers the core functionality.
So in the end, you will not get around the "steep learning curve", I guess. That is why you need experts for data mining. If there were a point-and-click solution, it would already be integrated everywhere.
An example for "similar items": I laughed hard when Amazon once recommended that I buy two products: the Debian Linux Administrator's Handbook and ... the Debian Linux Administrator's Handbook WITH CD.
I hope you get the key point of this example: to a plain algorithm, the two books appear "similar", and thus a sensible combination. To a human, it is pointless to buy the same book twice. You need to teach such rules to any recommendation system, as they cannot be trivially learned from the data. There will always be good results and useless results, and you need to tune and parameterize the system carefully.
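If you end up using Mahout's Taste recommender, one place to encode such rules is an IDRescorer, which lets you filter or re-score candidate items before they are returned. A sketch (how the set of near-duplicate item IDs is built is entirely up to your own catalogue logic):

```java
import java.util.Set;
import org.apache.mahout.cf.taste.recommender.IDRescorer;

// Business-rule filter: drop items we consider near-duplicates of something
// the user already owns. The excluded set is computed elsewhere and passed in.
public class DuplicateEditionRescorer implements IDRescorer {
    private final Set<Long> nearDuplicateItemIDs;

    public DuplicateEditionRescorer(Set<Long> nearDuplicateItemIDs) {
        this.nearDuplicateItemIDs = nearDuplicateItemIDs;
    }

    @Override
    public double rescore(long itemID, double originalScore) {
        return originalScore; // scores unchanged; we only filter
    }

    @Override
    public boolean isFiltered(long itemID) {
        return nearDuplicateItemIDs.contains(itemID);
    }
}

// Usage with the recommender above:
// recommender.recommend(userID, 3, new DuplicateEditionRescorer(excludedIDs));
```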

Image/File hosting storage best practices and standards

We are building an image and file hosting website and we will save these files on our servers, so I want to know if there are any best practices or standards I need to read and follow to make our website scalable and easy to extend in the future.
If there are books, articles or videos on this subject, please share them.
In my experience dealing with large amounts of data, it's always best to opt for the cloud; check out Amazon S3 (Amazon AWS) or Windows Azure.
Features like a CDN (CloudFront) are a big plus.
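As an illustration, uploading a file to S3 with the AWS SDK for Java takes only a few lines; the bucket name, key, region and file name below are placeholders:

```java
import java.io.File;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class S3UploadSketch {
    public static void main(String[] args) {
        // Credentials come from the default provider chain (environment variables, profile, etc.).
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withRegion(Regions.EU_WEST_1)   // placeholder region
                .build();

        // Placeholder bucket/key/file; the key becomes the object's path in the bucket.
        s3.putObject("my-image-bucket", "uploads/photo-123.jpg", new File("photo-123.jpg"));

        // The object can then be served directly, or via CloudFront sitting in front of the bucket.
        System.out.println(s3.getUrl("my-image-bucket", "uploads/photo-123.jpg"));
    }
}
```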
I believe this is not a simple question that can be answered without knowing:
how many files are expected?
how many user/file accesses per hour/day/minute?
your usage scenarios for these files (downloading? streaming? how many concurrent files downloaded at once?)
are you stuck with one particular OS (Windows) and filesystem (NTFS), or is there freedom in this?
My personal note: building your own image/file hosting is not a trivial task; I strongly recommend that you hire somebody with experience in this area.
I would recommend that, if possible, you look at a 3rd-party solution that provides an API. You'll then get the benefits of lower cost of ownership, no maintenance costs for the hardware, and continual updates thrown in for free when the 3rd party adds new features to the core offering. I know this from first-hand experience: we scoped out the options for doing this in a recent project and came to the conclusion that we'd spend 100 times more on our own solution and, even then, might not get it right. We opted for a company called Razuna, who offer both a hosted and an open-source version of their platform. Their API is very straightforward and can be consumed inside your MVC app with potentially only a few days' effort (depending on your use case). The beauty of this approach is that the hosted elements are actually on the Nirvanix backbone and are served via their CDN - so win-win.
You can get the details at:
http://www.razuna.com
and can view the api docs at:
http://wiki.razuna.com/display/ecp/Developer+Guides
Good luck, and if you need any further real-life guidance on this, feel free to come back. Oh, and by the way, we were also able to ask for 'paid for' features to be added to the core offering at pretty much standard market day rates.

What is the best Delphi n-tier low bandwidth technology?

I need to deploy a Delphi app in an environment that needs a centralized data and file storage system (for document imaging) but has multiple branch offices with relatively poor interconnectivity. I believe a 3-tier database application is the best way to go, so I can provide a rich desktop experience with relatively lightweight data-transfer needs. So far I have looked briefly at Delphi DataSnap, kbmMW and the RemObjects SDK. It seems that kbmMW and the RemObjects SDK use the least bandwidth. Does anyone have any experience deploying any of these technologies in challenging environments with a significant number of users (I need to support 700+)? Thanks!
It depends on whether you are tied to remote datasets. If you aren't dataset-bound, then SOAP would likely be a good choice. Or, what I've done is write my own protocol that is similar to SOAP in nature. This was done before SOAP was standard, and I'm glad I did - it gives you the ability to control more of the flow of data. It's a given that if you have poor connectivity, you will be spending time supporting it, and it's very nice if it's your own code you are supporting rather than having to wait on a vendor. (Although KBM and REM are known to be pretty good vendors.)
Personal note: 700 users in a document imaging application over poor connectivity sounds like a mess. Spend the money on upgrading connectivity as it'll be cheaper in the long run.
Both kbmMW and the RO SDK offer a binary format, which is more compact than the SOAP format, especially when you are working with documents.
The RO SDK seems to offer more GUI tools to help you build your services.
Also give the RealThinClient SDK a look; it's a lightweight remoting framework.
But whatever framework you go with, your design will make it fast or slow. I have some applications working on slow 128 kb lines, and they work perfectly without any user complaints, but I don't do large file transfers.
One thing to remember... it's not the number of users, but the number of them using the resources at the same time that will be the issue. Attempt to develop your application "server stateless" if at all possible; this will allow greater flexibility in the long term if you find you have to add more servers to the pool to support your customer base. The hardest thing about n-tier is scaling beyond the first server... plan on that from the start. Each request should not know anything about a prior request - or, at the very least, the request should have a way of passing the context so the server can look it up in a session table or something.
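The idea is language-agnostic; here is a rough sketch in Java (only to illustrate the pattern, since the question is about Delphi) where every request carries a token and the server reconstructs the context from a shared session table:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Minimal illustration of a "stateless" request handler: no request depends on
// a previous one being handled by the same server instance, because all shared
// state lives in a session table (in production this would be a database or a
// distributed cache rather than an in-memory map).
public class StatelessServerSketch {
    private final Map<String, String> sessionTable = new ConcurrentHashMap<>();

    // Login creates a session entry and hands the token back to the client.
    public String login(String userName) {
        String token = UUID.randomUUID().toString();
        sessionTable.put(token, userName);
        return token;
    }

    // Every subsequent request passes the token; the server looks the context up fresh.
    public String fetchDocumentList(String sessionToken) {
        String user = sessionTable.get(sessionToken);
        if (user == null) {
            throw new IllegalStateException("Unknown or expired session");
        }
        return "documents for " + user;
    }
}
```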
Personally, I would recommend RemObjects. I have used it with good results.
I don't know if it's the very best / most efficient (glad you asked this question!), but I've had good results with the RemObjects SDK + DataAbstract. The latter made much of the plumbing less involved, which was helpful. Still implementing, but so far so good.
If you really want to go "low-bandwidth", use the BSD sockets API - that'll give you full control over what's being sent, and you can send as little information as you want. Of course, then you'll have to implement all the tiers yourself, but hey - that's still an option :D
