Moses - Online Integration - localization

We're looking to integrate Moses into our localization workflow. Our application is in Java, and we're looking at using Moses' functionality via XML-RPC calls.
Specifically, we're looking at APIs for:
Incremental training (i.e. avoid having to retrain the model from scratch every time we wish to use some new training data)
Domain-specific training (i.e. it should maintain separate phrase tables for each domain the input data belongs to)
Decoding
The tutorial says that these can be achieved via XML-RPC calls, but I can't find any examples or clear instructions for doing them. Can someone please provide some examples?
Also, I would like to know if the training and decoding phases can be done in a distributed manner.
Thanks!

This question is perfectly suited to the Moses mailing list:
http://www.statmt.org/moses/?n=Moses.MailingLists
Moses server documentation (XML-RPC):
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc28
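For the decoding part specifically, a call from Java can look roughly like the sketch below, using the Apache XML-RPC client library. The host, port and sample sentence are assumptions for illustration; the "translate" method name and the "text" field are the ones described in the Moses server documentation above, so verify them against your Moses version.

    import java.net.URL;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.xmlrpc.client.XmlRpcClient;
    import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;

    public class MosesXmlRpcClient {
        public static void main(String[] args) throws Exception {
            // Assumes mosesserver is listening on localhost:8080 (hypothetical port).
            XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
            config.setServerURL(new URL("http://localhost:8080/RPC2"));
            XmlRpcClient client = new XmlRpcClient();
            client.setConfig(config);

            // The server's "translate" method takes a struct holding the source text.
            Map<String, Object> params = new HashMap<String, Object>();
            params.put("text", "das ist ein kleines haus");

            Object result = client.execute("translate", new Object[] { params });
            // The response is a struct as well; the translation is under the "text" key.
            System.out.println(((Map<?, ?>) result).get("text"));
        }
    }

If that works, you can reuse the same client object for every sentence you need translated.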
However, I have had better experiences with moses/contrib/web/bin/daemon.pl, which also runs a server; you communicate with it over a TCP stream.
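If you go the daemon.pl route, communication is just a plain socket. The sketch below assumes a simple line-oriented exchange (one source sentence out, one translated line back) on a made-up port; check daemon.pl itself for the exact framing it expects.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.PrintWriter;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class MosesDaemonClient {
        public static void main(String[] args) throws Exception {
            // Hypothetical host/port; use whatever daemon.pl was started with.
            try (Socket socket = new Socket("localhost", 9999)) {
                PrintWriter out = new PrintWriter(
                        new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
                BufferedReader in = new BufferedReader(
                        new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));

                // Assumed protocol: write one source sentence, read one translation back.
                out.println("das ist ein kleines haus");
                System.out.println(in.readLine());
            }
        }
    }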
General examples are harder to find (everyone has a different environment, ...), but make your question more specific and send it to the Moses mailing list. (E.g. someone had a problem with server installation: http://comments.gmane.org/gmane.comp.nlp.moses.user/7242 )

How do I know when a web site is using RDF?

How do I know when a web site uses RDF?
For example, I know that eBay and Amazon use RDF because I've read it in many articles, but how do I know it in practice?
In practice, there is currently no single standardized way for websites to "advertise" their use of RDF. You find out because they inform you about it in some fashion, e.g. by publishing a link to API documentation that describes how they use RDF, or indeed by writing an article about it, so pretty much the same way you find out about any REST API / web service. Of course, in the case of RDF/linked data you are often helped by the fact that other datasets you already know about may be linking to the new source, thus making it discoverable.
There are some attempts at defining more standardized mechanisms for 'advertising' a website's linked data use. The W3C VoID specification is the closest thing to a standard in that regard: it provides a vocabulary for publishers to describe the data and access mechanisms they offer, and it also gives pointers on how to make things discoverable. Unfortunately, it is not (yet) very widely adopted.
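If a site does publish a VoID description, probing it programmatically is straightforward. The sketch below uses Apache Jena; the URL is hypothetical, and the /.well-known/void location is the convention the VoID specification suggests, so treat this as one possible discovery path rather than something every site supports.

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.ResIterator;
    import org.apache.jena.vocabulary.RDF;

    public class VoidProbe {
        public static void main(String[] args) {
            // Hypothetical site; the VoID spec suggests publishing a description at /.well-known/void.
            String voidUrl = "http://example.org/.well-known/void";

            Model model = ModelFactory.createDefaultModel();
            model.read(voidUrl);  // Jena fetches and parses whatever RDF the server returns.

            // List everything the description declares as a void:Dataset.
            ResIterator datasets = model.listResourcesWithProperty(
                    RDF.type, model.createResource("http://rdfs.org/ns/void#Dataset"));
            while (datasets.hasNext()) {
                System.out.println(datasets.nextResource());
            }
        }
    }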

A production-ready, real-time recommendation engine that's easy to set up

I want to store a large number of data points for user actions, like likes, tags, etc. (I have plans for both e-commerce and document management).
With the data points, I want to support functions such as
"users who loved X loved Y,Z" recommendations
"fetch more stuff similar to X,Y" clustering.
By production-ready and real-time, I mean that I can enter data points and make queries at the same time; the server will take care of answering queries and updating scores by itself.
I searched around the interwebs and the solutions that come up are either:
Data-mining libraries that are mostly academic-oriented and are meant for large batch operations, not for heavy real-time queries
Hadoop/Mahout, which is production-ready and supports real-time updates and queries, but has a steep learning curve and is tough to administer.
For recommenders, Mahout has a non-distributed recommender implementation that does not use Hadoop. In fact, this is the only part that is real-time; the Hadoop-based parts are not.
I think there is little learning curve to it; see here and here for a pretty complete writeup.
Mahout in Action chapters 2-5 cover this quite well too.
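To make the non-Hadoop part concrete, a user-based recommender with Mahout's Taste API boils down to a few lines. This is a minimal sketch; ratings.csv, the neighborhood size and the user ID are placeholder choices you would adapt to your own data.

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class RecommenderSketch {
        public static void main(String[] args) throws Exception {
            // ratings.csv: one "userID,itemID,preference" line per data point (likes, tags, ...).
            DataModel model = new FileDataModel(new File("ratings.csv"));

            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

            // "Users who loved X also loved Y, Z": top 3 items for user 42.
            List<RecommendedItem> items = recommender.recommend(42L, 3);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }

Swapping the similarity measure or the neighborhood size is exactly the kind of tuning such a system needs.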
Please understand that for useful recommendations, the various parameters of such a system must be carefully fine-tuned. The out-of-the-box functionality many systems have (Oracle data mining, Microsoft data mining extensions, etc.) offers just the core functionality.
So in the end, you will not get around the "steep learning curve", I guess. That is why you need experts for data mining. If there were a point-and-click solution, it would already be integrated everywhere.
An example for "similar items": I laughed hard when Amazon once recommended that I buy two products: the Debian Linux Administrators Handbook and ... the Debian Linux Administrators Handbook WITH CD.
I hope you get the key point of this example: to a plain algorithm, the two books appear "similar", and thus a sensible combination. To a human, it is pointless to buy the same book twice. You need to teach such rules to any recommendation system, as they cannot be trivially learned from the data. There will always be good results and useless results, and you need to tune and parameterize the system carefully.

Open Alternatives to Google Prediction API

A recent announcement by Google about the Google Prediction API sounded very interesting. It could be useful for a project that is coming up, and would probably do a better job than some custom code I was considering.
However, there is some vendor lock-in. Google retain the trained model, and could later choose to overcharge me for it. It occurred to me that there are probably open-source equivalents, if I was willing to host the training myself (I am) and live without their ability to throw hardware at the problem at a moment's notice.
The last time I looked at third-party model-training code was many years ago, and there were a lot of details that needed to be carefully considered and customised for your project. Google appear to have hidden those decisions, and take care of them for you. To me, this is still indistinguishable from magic, but I would like to hear whether others can do the same.
So my question is:
What alternatives to the Google Prediction API exist which:
categorise data with supervised machine learning,
can be easily configured (or don't need configuration) for different kinds and scales of data-sets, and
are open-source and self-hosted (or, at the very least, provide you with royalty-free use of your model, without a dependence on a third party)?
Maybe Apache Mahout?
PredictionIO is an open source machine learning server for software developers to create predictive features, such as personalization, recommendation and content discovery.
I have been looking recently at tools like the Google Prediction API; one of the first ones I was pointed to was the Weka machine learning toolkit, which could be worth checking out for anyone looking.
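For anyone who wants to see how little code a basic Weka classifier takes, here is a minimal sketch; train.arff is a hypothetical ARFF file with the class attribute in the last column, and J48 is just one of the many classifiers Weka ships with.

    import weka.classifiers.Classifier;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class WekaSketch {
        public static void main(String[] args) throws Exception {
            // Load the training data and tell Weka which attribute is the class label.
            Instances data = new DataSource("train.arff").getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            // Train a J48 decision tree (supervised classification).
            Classifier classifier = new J48();
            classifier.buildClassifier(data);

            // Predict the class of the first training instance as a quick smoke test.
            double predicted = classifier.classifyInstance(data.instance(0));
            System.out.println(data.classAttribute().value((int) predicted));
        }
    }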
I'm not sure if it's relevant, but directededge seems to be doing exactly that :)
There is a good free-to-use service, Yandex Predictor, with a quota of 100,000 requests per day. It works on text only, and supports several languages and spell correction.
You need to get a free API key; then you can use its simple RESTful API. The API supports JSON, XML and JSONP output.
Unfortunately, I cannot find documentation in English. You can use Google Translate.
I can translate the docs if there is some demand.

What are some interesting projects to solve in Erlang for learning purposes? [closed]

I recently discovered Erlang and am now working my way through a couple of tutorials. By now I'm looking forward to actually implementing something as a hobby project. I'm not really interested in yet another chat server; I would like to code something more interesting (yes, I'm aware that this is a rather fuzzy term) which is also manageable, so I can finish it in my spare time.
Any suggestions?
Edit: The project should preferably highlight Erlang's strengths (concurrency, distribution).
Build a distributed system that searches twitter feeds in real time and allows anyone to perform searches from a web front end.
Build a distributed file system. Implement distributed B*Trees or B+Trees as the base of this file system. Do it in Erlang.
Build a distributed key value store on top of the distributed file system built in step 2.
Build a distributed web index (to be used by a distributed web search engine) on top of the key value store.
Build a distributed linker. Advanced build automation offers remote agent processing for distributed builds and/or distributed processing.
Build a MMORPG backend that relies on distributed storage of the game/player state and distributed processing of user requests.
For something for yourself, consider writing a simple server; something that, for example, services date/time requests or -- a little fancier -- an HTTP daemon that serves only static content.
The best part of Erlang is the way it handles concurrency; exercise that.
Project Euler, for sure.
Some things from my copious to-do list that would both be good learning exercises and be helpful to the Erlang community at large:
Profile all the available Key/Value stores:
Write a library for testing insert, lookup, delete, search times for a variety of K/V stores
Create a benchmark suite people can run
Make it work with ets, dets, proplists, gb_trees, dict, orddict, redblack trees, bdb, tokyocabinet, ...
Produce pretty graphs
Make it easy to update, contribute to and run on anyone's machine
Write a new io_lib:format routine that uses named parameters:
io_lib:nformat("Hi there ~{name}s~n.", [{name, "Bob"}]).
This is useful for internationalisation if the position of parameters changes when the language of the format string changes.
Extend erl -make (make.erl)
Allow adding code paths (so that you don't need to do erl -pa LibraryPath -make)
Compile/load behaviour modules before modules that implement those behaviours
Handle hierarchical modules correctly (output path in particular)
This doesn't exactly answer your question, but if you are looking for an interesting free, open-source project that is written in Erlang, you should definitely check out CouchDB. From the website:
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API. Among other features, it provides robust, incremental replication with bi-directional conflict detection and resolution, and is queryable and indexable using a table-oriented view engine with JavaScript acting as the default view definition language. CouchDB is written in Erlang, but can be easily accessed from any environment that provides means to make HTTP requests. There are a multitude of third-party client libraries that make this even easier for a variety of programming languages and environments.
The CouchDB website has more details. Happy coding!
Find something Erlang doesn't have that you understand and like. I did that with etap (https://github.com/ngerakines/etap/). Nick has since taken over maintenance, and it's used internally at EA Games. It was fun to make, and like a previous poster said, it was something real, so I learned to solve real-world problems while working on it.
File indexing/search system. This was going to be my intro project, but I've switched over to something else.
Once you've got it working you could move the indexes to Mnesia, and then spread the thing out over other nodes to have a whole-network index.

What weaknesses can be found in using Erlang?

I am considering Erlang as a potential for my upcoming project. I need a "highly scalable, highly reliable" (duh, what project doesn't?) web server to accept HTTP requests, but not really serve up HTML. We have thousands of distributed clients (other systems, not users) that will be submitting binary data to a central cluster of servers for offline processing. Responses would be very short: success, fail, error code, minimal data. We want to use HTTP since it is our best chance of traversing firewalls.
Given this limited information about the project, can you provide any weaknesses that might pop up using a technology like Erlang? For instance, I understand Erlang's text processing capabilities might leave something to be desired.
Your comments are appreciated.
Thanks.
This sounds like a perfect candidate for a language like Erlang. The scaling properties of the language are very good, but if you're worried about the data processing abilities, you shouldn't be. It's a very powerful language, with many libraries available for developers. It's an old language, and it's been heavily used/tested in the past, so everything you want to do has probably already been done to some degree.
Make sure you use Erlang version R11B5 or newer! Earlier versions of Erlang did not provide the ability to time out TCP sends. This means stalled or malicious clients can mount a DoS attack on your application by refusing to receive data you send them, thus locking up the sending process.
See issue OTP-6684 from R11B5's release notes.
With Erlang, the scalability and reliability are there, but your project definition doesn't outline what type of text processing you will need.
I think Erlang's main limitation might be finding experienced developers in your area. Do some research on the availability of Erlang architects and coders.
If you are going to teach yourself or have your developers learn it on the job, keep in mind that it is a very different way of coding, and that while the core documentation is good, a lot of people do wish there were more examples. Of course, the very active community easily makes up for that.
I understand Erlang's text processing capabilities might leave something to be desired.
The starling project already provides basic Unicode support, and there is an EEP (Erlang Enhancement Proposal) currently in draft that aims to bring Unicode support into the mainstream of Erlang/OTP.
I encountered some problems with Redis read performance from Erlang. Here is my question. I tend to think the reason is the Erlang-written client module, which has trouble processing tons of strings during communication with Redis.
