Myrrix tagging API to represent/weight parent/child Item relationship - mahout

I've been using the Tagging API to tag my items in order to allow Item-Item 'similarity' scores to be calculated, so: Item 1 gets tagged with {UK, MALE, 50}, Item 2 with {FRANCE, MALE, 22}, that kind of thing. That's been working fine.
What I'd like to do is represent item-item 'relationships', so if my application says that 1 is a parent of 2 (and just to make things a little more complex, this is multi-level), I'd like to be able to tell Myrrix to pull those two items a little closer together.
My first solution was to add a 'PARENT_[name]' tag to each Item and, for each parent it has, add a 'PARENT_[parentname]' tag, with a lower weight as we go up the hierarchy. That did succeed in pulling parents and children closer.
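For concreteness, a minimal sketch of what that weighting scheme might look like when generating the tag ingest. The decay factor, base weight and row format here are assumptions for illustration, not the actual Myrrix tagging API:

```python
# Sketch of the decaying-weight PARENT_ tag scheme described above.
# DECAY, BASE_WEIGHT and the CSV-style output are assumptions, not Myrrix specifics.
DECAY = 0.5        # assumed: halve the weight for each level up the hierarchy
BASE_WEIGHT = 1.0  # assumed weight for the item's own PARENT_ tag

def parent_tag_rows(item_id, lineage):
    """lineage: the item's own name followed by its ancestors, nearest first."""
    weight = BASE_WEIGHT
    for name in lineage:
        yield item_id, "PARENT_" + name, round(weight, 4)
        weight *= DECAY

# e.g. item 2 is 'widget', whose parent is 'gadget', whose parent is 'catalogue'
for item, tag, weight in parent_tag_rows(2, ["widget", "gadget", "catalogue"]):
    print(f"{item},{tag},{weight}")  # one weighted tag per row
```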
Unfortunately the overall quality of suggestions seemed to fall a little, and the results seemed increasingly variable: run the import again and the results come out looking completely different. Is this something that can be fixed at the features / lambda level?
I'm still not really clear what 'features' represents, but my suspicion is that by massively increasing the number of possible tags I need to configure the model very differently...

That's the right way to think about it. It's overloading the API a fair bit, but still principled.
It may or may not actually help the results. It kind of depends on whether users who like A will also like B because they have a common product family. Maybe for music; unlikely for things you buy once like a toaster.
Variability comes from the random starting point. You will get different models each time. If the difference is significant when you start from scratch, then you are likely getting into over-fitting. It may be that your # of features is too high or lambda too low for the data set.
You should also run an eval to see whether the scores are good at all. If it's scoring poorly, yeah it's a case of parameters that are well off their best values.
The idea is that you need not build a new model from scratch every time though.

Related

When should inferred relationships and nodes be used over explicit ones?

I was looking up how to utilise temporary relationships in Neo4j when I came across this question: Cypher temp relationship
and the comment underneath it made me wonder when they should be used; since no one argued against him, I thought I would bring it up here.
I come from a mainly SQL background and my main reason for using virtual relationships was to eliminate duplicated data and do traversals to get properties of something instead.
For a more specific example, let's say we have a robust cake recipe, which has sugar as an ingredient. The sugar is what makes the cake sweet.
Now imagine a use case where I don't like sweet cakes so I want to get all the ingredients of the recipe that make the cake sweet and possibly remove them or find alternatives.
Then there's another use case where I just want foods that are sweet. I could work backwards from the sweet ingredients to get to the food or just store that a cake is sweet in general, which saves time from traversal and makes a query easier. However, as I mentioned before, this duplicates known data that can be inferred.
Sorry if the example is too strange, I suck at making them. I hope the main question comes across, though.
My feeling is that the only valid scenario for creating redundant "shortcut" relationships is this:
Your use case has a stringent time constraint (e.g., average query time must be less than 200ms), but your neo4j query -- despite optimization -- exceeds that constraint, and you have verified that adding "shortcut" relationships will indeed make the response time acceptable.
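To make that concrete with the cake example from the question, here is a rough sketch (using the Neo4j Python driver; the labels, relationship types and connection details are made up for illustration) of the multi-hop query versus the redundant "shortcut" relationship it would replace:

```python
# Sketch only: the schema below is invented for the cake example, and the
# connection details are placeholders.
from neo4j import GraphDatabase

TRAVERSAL = """
MATCH (r:Recipe {name: $name})-[:HAS_INGREDIENT]->(i:Ingredient {sweet: true})
RETURN i.name AS ingredient
"""

# The redundant "shortcut": one hop instead of filtering ingredients, but every
# write that changes ingredients must now maintain SWEETENED_BY as well.
SHORTCUT = """
MATCH (r:Recipe {name: $name})-[:SWEETENED_BY]->(i:Ingredient)
RETURN i.name AS ingredient
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    sweet_via_traversal = session.run(TRAVERSAL, name="Cake").value("ingredient")
    sweet_via_shortcut = session.run(SHORTCUT, name="Cake").value("ingredient")
```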
You should be aware that adding redundant "shortcut" relationships comes with its own costs:
Queries that modify the DB would need to be more complex (to modify the redundant relationships) and also slower.
You'd always have to add the redundant relationships -- even if you never actually need some (most?) of them.
If you want to make concurrent updates to the DB, the chances that you may lose some updates and introduce inconsistencies into the DB would increase -- meaning that you'd have to work even harder to avoid inconsistencies.
NOTE: For visualization purposes, you can use virtual nodes and relationships, which are temporary and not actually stored in the DB.

Is this problem suitable for machine learning - brain.js?

The problem I would like to solve is how to choose the best seats on a train based on some ordered user preferences, e.g. whether they'd like a seat facing forwards or backwards (or don't care), whether they'd like a seat at a table or not, whether they need to be near a toilet, luggage rack, buffet car or the door, window or aisle seat, and whether they want the aisle to the left or the right (which can be very important for someone with a stiff knee!).
Most customers will specify one or two preferences, others may specify more. For some, being near the toilet might be the most important factor; for others, having that table to work at might be the most important.
There may be more than one passenger (although they will share preferences). These should be sat as close to each other as possible. 2 passengers would ideally be sat next to each other, or opposite each other at a table seat. A group of 8 passengers might best be split into 2 groups of 4 or 4 groups of 2...
Position is defined by carriage number (seats in the same carriage are better than seats in different carriages) and by x/y coordinate within that carriage - so it's easy enough to calculate the distance between any pair of seats, but a BIG job to calculate distances between EVERY pair of seats...
Each [available] seat (pre-filtered by ticket class) will have the above attributes either defined or set to NULL (for unknown - seat facing is often unknown).
So for training I can provide a vast array of example trains and customer preferences with the best balance of preferences versus position.
For execution I want to provide a run-time specific array of seats with attributes, a set of user preferences and a set of weightings for those preferences (e.g. passenger 1 thinks being near the toilet is most important, passenger 2 thinks having a table is most important, passenger 3 thinks being in the quiet carriage is...) and finally the number of passengers.
Output will be an array of seats (one per passenger) that strike the best compromise between matching as many customer preferences as possible (usually not possible to match all preferences) and keeping the seats fairly close to each other.
e.g. We might be able to match 2 preferences with seats 2 rows apart, but match 3 preferences with seats 10 rows apart...
Obviously distance will need a weighting just like the individual preferences, in order to choose between those two options. I suppose a distance of not greater than X becomes just one more customer preference...
I've not done any ML work before, so it's all going to be a learning exercise for me. I wish I had the time to just play and see what comes out, but I don't. I'd be happy to do that, but I need a reasonable expectation of a positive result, otherwise I'll have to focus on a more traditional approach. Limited time and all that...
So, my questions are:
Is this a suitable problem for machine learning?
If so, is brain.js a good choice, or is something else more suitable? AWS ML service perhaps?
Any advice on how to organise all my data into something suitable for an ML engine to process?
Machine Learning is good at finding hidden patterns in complex data. In your case, you would need a lot of data where user preferences are already matched with optimal seating arrangements.
You could then try to see if the ML model can actually make optimal seating arrangements by itself. It’s an interesting problem but it may also lead to unexpected seating :)
If you don’t have training data you could collect it live, by registering where people sit down, knowing their preferences.
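As a rough illustration of how the data could be organised (the third question), here is one possible encoding. The attribute names, the 0/0.5/1 treatment of unknowns, and the per-(seat, passenger) framing are all assumptions - just one way it might be flattened into numbers, not a recommendation:

```python
# Rough sketch: flatten one (seat, passenger) situation into a numeric vector.
seat = {
    "carriage": 3, "x": 2, "y": 14,
    "forward_facing": 1,   # 1 = yes, 0 = no, 0.5 = unknown/NULL (assumed convention)
    "table": 0, "window": 1, "near_toilet": 0, "quiet_carriage": 1,
}
preferences = {            # per-passenger preference weights, 0 = don't care
    "forward_facing": 0.2, "table": 1.0, "window": 0.5,
    "near_toilet": 0.0, "quiet_carriage": 0.0,
}

# One input vector per (seat, passenger) pair: seat attributes followed by
# the passenger's preference weights, always in the same fixed order.
keys = ["forward_facing", "table", "window", "near_toilet", "quiet_carriage"]
features = [seat["carriage"], seat["x"], seat["y"]] \
         + [seat[k] for k in keys] + [preferences[k] for k in keys]

# Training target: e.g. 1 if this seat was part of the "best" allocation in the
# historical example, 0 otherwise.
label = 1
```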

Node vs Relationship

So I've just worked through the tutorial and I'm unclear about a few things. The main one, however, is how do you decide when something is a relationship and when it should be a Node?
For example, in the Movies Database, there is a relationship showing who acted in which film. A property of that relationship is the Role. BUT, what if it's a series of films? The role may well be constant between films (say, Jack Ryan in The Hunt for Red October, Patriot Games etc.)
We may also want to have some kind of Character bio, which would obviously remain constant between movies. Worse, the actor may change from one movie to another (Alec Baldwin then Harrison Ford.) There are many others like this (James Bond, for example).
Even if the actor doesn't change (main roles in Harry Potter), the character is constant. So, at what point would the Role become a node in its own right? When it does, can I have a 3-way relationship (Actor-Role-Movie)? Say I start off with it being a relationship and then, down the line, decide it should've been a node - is there a simple way to go through the database and convert it?
No, there is no simple way to convert your data model. When you start your own database, first take time to find a fitting schema. There is no single ideal way to create a schema, and there are many different models that fit the same situation without being totally wrong.
My strategy is to put less information on the relationship itself. I only add properties that directly concern the relationship and store all the other data in the nodes. Also think about properties you could use for traversing the graph. For example, you might need some flags, or even different relationship types, for relationships that are more or less the same. apoc.algo.aStar only includes the relationship types you want (you could exclude certain nodes by giving them a special relationship type). So keep in mind which procedures you might want to use later.
Try to create the schema as simple as possible and find a way to stay consistent about what things are nodes and what deserves to be a relationship. Don't mix them up.
Choose a design that makes sense for you! (device 1)-[cable]-(device 2) vs. (device 1)-[has cable]-(cable)-[has cable]-(device 2): in this case I'd prefer the first, because [has cable] wouldn't bring any more information. Contrary to what I wrote above, I would put a lot of information on this [cable] relationship, but it totally makes sense for me because I wouldn't want to search a device node for cable information.
For your example, giving the role its own node is also a valid way. If you especially want to query which actors had the same role in common, I'd totally go for giving the role an extra node.
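As a sketch of what that "shared role" query could look like once Role is its own node - the labels and relationship types here are illustrative, not a fixed schema, run via the Neo4j Python driver:

```python
# Illustrative only: Person-PLAYED->Role-IN->Movie is one possible modelling of
# the Actor-Role-Movie three-way relationship mentioned in the question.
from neo4j import GraphDatabase

SHARED_ROLE = """
MATCH (a1:Person)-[:PLAYED]->(role:Role)<-[:PLAYED]-(a2:Person),
      (role)-[:IN]->(m:Movie)
WHERE a1 <> a2
RETURN role.name AS role,
       collect(DISTINCT a1.name) AS actors,
       collect(DISTINCT m.title) AS movies
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for record in session.run(SHARED_ROLE):
        print(record["role"], record["actors"], record["movies"])
```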
Summary:
Think of what you want to do with the data and choose the easiest model.

Predicting Football match winners based only on previous data of same match

I'm a huge football (soccer) fan and interested in Machine Learning too. As a project for my ML course I'm trying to build a model that would predict the chance of winning for the home team, given the names of the home and away team. (I query my dataset and accordingly create data points based on previous matches between those two teams.)
I have data for several seasons for all teams, however I have the following issues that I would like some advice with. The EPL (English Premier League) has 20 teams which play each other at home and away (380 total games in a season). Thus, each season, any 2 teams play each other only twice.
I have data for the past 10+ years, resulting in 2*10=20 data points for the two teams. However I do not want to go back more than 3 years, since I believe teams change quite considerably over time (Man City, Liverpool) and this would only introduce more error into the system.
So this results in just around 6-8 data points for each pair of teams. However, I do have several features (up to 20+) for each data point, like full-time goals, half-time goals, passes, shots, yellows, reds, etc. for both teams, so I can include features like recent form, recent home form, recent away form etc.
However, the idea of having only 6-8 data points to train with seems incorrect to me. Any thoughts on how I could counter this problem? (If this is a problem in the first place, that is.)
Thanks!
EDIT: FWIW, here's a link to my report which I compiled at the completion of my project. https://www.dropbox.com/s/ec4a66ytfkbsncz/report.pdf . It's not 'great' stuff but I think some of the observations I managed to elicit were pretty cool (like how my prediction worked very well for the Bundesliga because Bayern win the league all the time).
That's an interesting problem which I don't think has a unique solution. However, there are a couple of little things I would try if I were in your position.
I share your concern about 6-8 points per class being too little data to build a reliable model. So I would try to model the problem a bit differently. In order to have more data for each class, instead of having 20 classes I would have only two (home/away) and I would add two features, one for the home team and one for the away team. In that setup, you can still predict which team would win given whether it is playing home or away, and your problem has more data to produce a result.
Another idea would be to take data from other European leagues. Since teams are now a feature and not a class, it shouldn't add too much noise to your model and you could benefit from the additional data (assuming that those features are valid in other leagues).
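A minimal sketch of that reframing, assuming a scikit-learn style workflow and made-up column names (one row per match, both team names one-hot encoded as features, the home result as the label):

```python
# Sketch only: "matches.csv" and its column names are assumptions, not the
# asker's actual dataset.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

matches = pd.read_csv("matches.csv")
X = matches[["home_team", "away_team", "home_form", "away_form"]]
y = matches["home_win"]          # 1 if the home side won, else 0

model = make_pipeline(
    ColumnTransformer(
        [("teams", OneHotEncoder(handle_unknown="ignore"),
          ["home_team", "away_team"])],
        remainder="passthrough"),  # numeric form features pass through unchanged
    LogisticRegression(max_iter=1000),
)
model.fit(X, y)
```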
I have a similar system - a good base for source data is football-data.co.uk.
I have used the last N seasons for each league and built a model (believe me, more than 3 years is a must!). It depends on your criterion function - whether the criterion is best fit or maximum profit, you can build your own prediction model accordingly.
One very good thing to know is that each league is different; bookmakers also give different home-win odds on the favourite in Belgium than in the 5th English league, where you can find real value odds, for instance.
Out of that you can compile an interesting model, such as betting tips to beat bookmakers on specific matches, using your patterns to find value bets. Or you can try to chase as many winning tips as you can, but possibly earn less (draws pay a lot of money even though fewer draws win).
Hopefully I gave you some ideas, for more feel free to ask.
Don't know if this is still helpful, but features like full-time goals, half-time goals, passes, shots, yellows, reds, etc. are features that you don't have for the new match that you want to classify.
I would treat this as a classification problem (you want to classify the match into one of 3 categories: 1, X, or 2) and add more features that you can also compute for the new match, e.g. the number of missing players (due to injury/red cards), the number of wins/draws/losses each team has had in a row immediately BEFORE the match, which is the home team (already mentioned), goals scored in the last few matches home and away, etc.
Having 6-8 matches is the real problem. This dataset is very small and there would be a lot of over-fitting, but if you use features like the ones I mentioned, I think you could also use older data.
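As an illustration of deriving such "recent form" features that are known before kickoff, here is a rough pandas sketch; the column names are assumed, not taken from the actual dataset:

```python
# Sketch of pre-match "form" features. Assumed columns:
# date, home_team, away_team, home_goals, away_goals.
import pandas as pd

matches = pd.read_csv("matches.csv").sort_values("date")

# One row per (team, match) with the points that team earned in it.
home = matches.assign(team=matches["home_team"],
                      pts=(matches["home_goals"] > matches["away_goals"]) * 3
                          + (matches["home_goals"] == matches["away_goals"]) * 1)
away = matches.assign(team=matches["away_team"],
                      pts=(matches["away_goals"] > matches["home_goals"]) * 3
                          + (matches["away_goals"] == matches["home_goals"]) * 1)
long_form = pd.concat([home, away]).sort_values("date")

# Points over each team's previous 5 matches, shifted so the current match
# only sees results that finished before it kicked off.
long_form["form5"] = (long_form.groupby("team")["pts"]
                      .transform(lambda s: s.shift(1).rolling(5, min_periods=1).sum()))
```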

New(?) attempt to structure RESTful base URLs

We all love REST, especially when it comes to the development of APIs. Doing so for the last few years, I keep stumbling upon the same problem: nested resources. It seems we're living at the two ends of a scale. Let me introduce an example.
/galaxies/8/solarsystems/5/planets/1/continents/4/countries.json
Neato. Cases like that seem to happen everywhere, no matter in what shape they materialize. Now I'd like to be able to fetch all the countries in a solar system while still being able to fetch countries deeply scoped as shown above.
It seems I have two choices here. The first one, I flatten my nested structure and introduce a lot of GET parameters (that need to be well documented and understood by my API user) like so:
/countries.json?galaxy=8&solarsystem=5&planet=1&continent=4
I could flatten all my resources like that and win a unique endpoint base URL for each one. Good point … unique endpoints per resource!
But what's the price? Something that does not feel natural, is not discoverable and does not behave like the tree structure beneath my resources. Conclusion: bad idea, but widely practiced.
On the other hand I could try to get rid of as many additional GET parameters as possible, creating endpoints like that:
/galaxies/8/solarsystems/5/countries.json
But I would also need:
/galaxies/8/solarsystems/5/planets/1/continents/4/countries.json
This seems to be the other side of the scale: the fewest additional GET parameters and more natural behaviour, but still not what I'd expect as an API user.
Most APIs I've worked with in the last years follow one or the other paradigm. It seems there is at least one bullet to bite. So why not do the following:
If there are resources that nest naturally, let's nest them exactly in the way we'd expect them to be nested. What we achieve, first of all, is a unique endpoint for every resource, as long as we keep to that scheme:
/galaxies.json
/galaxies/8/solarsystems.json
/galaxies/8/solarsystems/5/planets.json
/galaxies/8/solarsystems/5/planets/1/continents.json
/galaxies/8/solarsystems/5/planets/1/continents/4/countries.json
OK, but how do we solve the initial problem? I wanted to fetch all the countries in a solar system while still being able to fetch countries fully scoped under galaxies, solar systems, planets and continents. Here's what feels natural to me:
/galaxies/8/solarsystems/5/planets/0/continents/0/countries.json # give me all countries in the solarsystem 5
/galaxies/8/solarsystems/0/planets/0/continents/0/countries.json # give me all countries in the galaxy 8
… and so on, and so on. Now you may argue "OK, but the zero there ….." and you are right. It does not look really nice. So why not change the two calls above to something like this:
/galaxies/8/solarsystems/5/planets/all/continents/all/countries.json # give me all countries in the solarsystem 5
/galaxies/8/solarsystems/all/planets/all/continents/all/countries.json # give me all countries in the galaxy 8
Neat, eh? So what do we achieve? No additional GET parameters and still stable base URLs for each resource's endpoint. What's the price? Yep, at the very least longer URLs, especially during testing by hand using tools like curl.
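As a sketch of how the "all" placeholder could be handled on the server side (Flask is just used for illustration here, and find_countries is a hypothetical lookup):

```python
# Illustration only: one route serves both the fully scoped and the "all" cases.
from flask import Flask, jsonify

app = Flask(__name__)

def find_countries(**ancestor_ids):
    """Hypothetical lookup: return countries matching the given ancestor IDs."""
    return []

@app.route("/galaxies/<galaxy>/solarsystems/<solarsystem>"
           "/planets/<planet>/continents/<continent>/countries.json")
def countries(galaxy, solarsystem, planet, continent):
    # Treat the literal segment "all" as "no filter at this level".
    segments = {"galaxy": galaxy, "solarsystem": solarsystem,
                "planet": planet, "continent": continent}
    scope = {name: value for name, value in segments.items() if value != "all"}
    return jsonify(find_countries(**scope))
```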
I wonder whether this could be a way to improve not only the maintainability but also the ease of use of APIs. If so, why doesn't anyone take an approach like that? I can't imagine I'm the first one to have this idea, so there must be valid counter-arguments against an approach like that. I don't see any. Do you?
I would really like to hear your opinions and arguments for or against an approach like that. Maybe there are ideas for improvement … it would be great to hear from you. In my opinion this could lead to much better structured APIs, so hopefully someone will read this and reply.
Regards.
Jan
It would all depend upon how the data is presented. Would the user really need to know the galaxy # to find a specific country? If so, then what you propose makes sense. However, it seems to me that what you are proposing, while structured and presented well, doesn't allow clients to search for a child element unless the parent is a known quantity.
In your example, if I had a specific id for a continent I would need to know the planet, solar system and galaxy as well. In order to find the specific continent I would need to fetch everything for each possible parent until I found it.
Presenting structured data in this manner is fine. Using this structure when you only have a piece of the data may be a bit cumbersome. It all depends upon what you are trying to accomplish.
Nested resource URLs are usually bad. The approach I generally take is to use unique IDs.
Design your DB so that it is only going to have one continent with ID 4. Then, instead of the horrible /galaxies/8/solarsystems/5/planets/1/continents/4/countries.json, all you need is the simple /continents/4/countries.json. Clear, sufficient, and memorable.
The :shallow routing option in Rails does this automatically.
For "all countries in a solar system", I'd use /solar_systems/5/countries.json -- that is, don't try to shoehorn it into the generic URL scheme. (And note the underscore.)
