I'm trying to get 2D vectors for a set of countries. I've built my graph by the following process (see the picture):
each node represents a country
each edge represents the land border between 2 countries (or nodes)
I'm using the Node2Vec library to manage it, but the results are not relevant.
import networkx as nx
from node2vec import Node2Vec

countries = [
    "France", "Andorra",
    "Spain", "Italy", "Switzerland",
    "Germany", "Portugal"
]
crossing_borders = [
    ("France", "Andorra"),
    ("France", "Spain"),
    ("Andorra", "Spain"),
    ("France", "Italy"),
    ("France", "Switzerland"),
    ("Italy", "Switzerland"),
    ("Switzerland", "Italy"),
    ("Switzerland", "Germany"),
    ("France", "Germany"),
    ("Spain", "Portugal")
]

graph = nx.Graph()
graph.add_nodes_from(countries)
graph.add_edges_from(crossing_borders)

# Generate walks
node2vec = Node2Vec(graph, dimensions=2, walk_length=2, num_walks=50)

# Learn embeddings
model = node2vec.fit(window=1)
I would like countries that share a land border to be closer to each other. As shown below, Spain is too far from France. I only considered direct borders, which is why walk_length=2.
Do you have any idea that would fit my problem?
If I understand correctly, Node2Vec is based on word2vec, and thus, like word2vec, requires a large amount of varied training data and shows useful results when learning dense high-dimensional vectors per entity.
A mere 7 'words' (country nodes) with a mere 10 'sentences' of 2 words each (edge pairs) thus isn't especially likely to do anything useful. (It wouldn't be in word2vec.)
These countries literally are regions on a sphere. A sphere's surface can be mapped to a 2-D plane - hence, 'maps'. If you just want a 2-D vector for each country, which reflects their relative border/distance relationships, why not just lay your 2-D coordinates over an actual map large enough to show all the countries, and treat each country as its 'geographical center' point?
Or more formally: translate the x-longitude/y-latitude of each country's geographical center into whatever origin-point/scale you need.
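For instance (a minimal sketch, using rough, uncalibrated centroid coordinates purely for illustration):

# Approximate geographical centres, (latitude, longitude); illustrative values only.
centroids = {
    "France": (46.6, 2.2),      "Andorra": (42.5, 1.6),
    "Spain": (40.2, -3.6),      "Italy": (42.8, 12.8),
    "Switzerland": (46.8, 8.2), "Germany": (51.1, 10.4),
    "Portugal": (39.7, -8.0),
}
# Treat longitude as x and latitude as y, then shift/scale as needed.
vectors = {country: (lon, lat) for country, (lat, lon) in centroids.items()}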
If this simple, physically-grounded approach is inadequate, then being explicit about why it's inadequate might suggest next steps. Something that's an incremental transformation of those starting points to meet whatever extra constraints you want may be the best solution.
For example, if your not-yet-stated formal goal is that "every country-pair with an actual border should be closer than any country-pair without a border", then you could write code to check that, list any deviations, and try to 'nudge' the deviations to be more compliant with that constraint. (It might not be satisfiable; I'm not sure. And if you added other constraints, like "any country pair with just 1 country between them should be closer than any country pair with 2 countries between them", satisfying them all at once could become harder.)
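As a concrete illustration of that first constraint, here is a small check, assuming a `vectors` dict of 2-D coordinates (like the one sketched above) and the `crossing_borders` list from the question:

from itertools import combinations
from math import dist

border_set = {frozenset(edge) for edge in crossing_borders}
bordered, unbordered = [], []
for a, b in combinations(vectors, 2):
    d = dist(vectors[a], vectors[b])
    (bordered if frozenset((a, b)) in border_set else unbordered).append((d, a, b))

# Any bordered pair farther apart than the closest non-bordered pair is a deviation.
closest_non_neighbour = min(unbordered)[0]
violations = [(a, b, d) for d, a, b in bordered if d > closest_non_neighbour]
print(violations)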
Ultimately, next steps may depend on exactly why you want these per-country vectors.
Another thing worth checking out might be the algorithms behind 'force-directed graphs'. There, after specifying a graph's desired edges/edge-lengths, and some other parameters, a physics-inspired simulation will arrive at some 2-d layout that tries to satisfy the inputs. See for example from the JS world:
https://github.com/d3/d3-force
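In Python, networkx ships a Fruchterman-Reingold force-directed layout (spring_layout) that could be applied directly to the graph from the question; a quick sketch:

import networkx as nx

positions = nx.spring_layout(graph, seed=42)   # dict: country -> 2-D numpy array
for country, xy in positions.items():
    print(country, xy)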
In games like StarCraft you can have up to 200 units (per player) on a map.
There are small maps but also big ones.
When you, for example, grab 50 units and tell them to go to the other side of the map, some algorithm kicks in and they find a path through the obstacles (rivers, hills, rocks and so on).
My question is: how does the game not slow down when it has 50 paths to calculate? In the meantime other things happen, like drones collecting minerals and buildings being constructed. And if the map is big, it should be harder and slower.
So even if the algorithm is good, it will take some time for 100 units.
Do you know how this works? Maybe the algorithm is similar to other games'.
As I said, when you tell units to move you don't see any delay for calculating the path; they start to run to the destination immediately.
The question is how they make the units go along the shortest path, but fast.
There is no delay in most of these games (StarCraft, WarCraft and so on).
Thank you.
I guess it just needs to subdivide the problem and memoize the results. Example: 2 units. Unit1 goes from A to C but the shortest path goes through B. Unit2 goes from B to C.
B to C only needs to be calculated once and can be reused by both.
See https://en.m.wikipedia.org/wiki/Dynamic_programming
That Wikipedia page specifically mentions Dijkstra's algorithm for pathfinding, which works by subdividing the problem and storing results to be reused.
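One way to realise that reuse, assuming the walkable map is modelled as a weighted networkx graph (`map_graph` is a hypothetical name): a single Dijkstra run from the shared destination yields the shortest path from every node, and every unit heading there reads from the same result.

import networkx as nx

def paths_to_destination(graph, destination):
    # Shortest path from every reachable node to `destination`, computed once.
    return nx.single_source_dijkstra_path(graph, destination)

# paths = paths_to_destination(map_graph, "C")
# unit1_path = list(reversed(paths["A"]))   # A -> ... -> C (passes through B)
# unit2_path = list(reversed(paths["B"]))   # B -> ... -> C, reusing the same computation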
There is also a pretty good-looking alternative here: http://www.gamasutra.com/blogs/TylerGlaiel/20121007/178966/Some_experiments_in_pathfinding__AI.php. It takes into account dynamic stuff like obstacles and still performs very well (video demo: https://www.youtube.com/watch?v=z4W1zSOLr_g).
Another interesting technique takes a completely different approach:
Calculate the shortest path from the goal position to every point on the map. See the full explanation here: https://www.youtube.com/watch?v=Bspb9g9nTto - although this one is inefficient for large maps.
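A rough sketch of that goal-to-everywhere idea on a uniform-cost grid: one breadth-first search from the goal produces a distance field (a "flow field") that every unit can follow by stepping to the neighbouring tile with a smaller value.

from collections import deque

def flow_field(passable, goal):
    # `passable` is a 2-D list of booleans, `goal` an (x, y) tuple.
    width, height = len(passable), len(passable[0])
    distance = {goal: 0}
    queue = deque([goal])
    while queue:
        x, y = queue.popleft()
        for nx_, ny_ in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx_ < width and 0 <= ny_ < height \
                    and passable[nx_][ny_] and (nx_, ny_) not in distance:
                distance[(nx_, ny_)] = distance[(x, y)] + 1
                queue.append((nx_, ny_))
    return distance   # each unit moves towards whichever neighbour has the smallest value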
First of all, 100 units is not such a large number; pathfinding is fast enough on modern computers that it is not a big resource sink. Even in older games, optimizations were made to make it even faster, and you can see that units will sometimes get lost or stuck, which shouldn't really happen with a general algorithm like A*.
If the map does not change, you can preprocess it to build a set of nodes representing regions of the map. For example, if the map is two islands connected by a narrow bridge, there would be three "regions": island 1, island 2 and the bridge. In reality you would probably do this with some graph algorithm, not manually. For instance (a rough sketch follows these steps):
Score every tile with distance to nearest impassable tile.
Put all adjacent tiles with score above the threshold in the same region.
When done, gradually expand outwards from all regions to encompass low-score tiles as well.
Make a new graph where each region-region intersection is a node, and calculate shortest paths between them.
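A rough sketch of the first three steps, assuming `passable` is a 2-D boolean numpy array and using scipy's distance transform and connected-component labelling (the threshold value is arbitrary). Building the region graph from the adjacencies of the resulting labels is left out.

from scipy import ndimage

def build_regions(passable, threshold=3):
    # 1. Score every tile with its distance to the nearest impassable tile.
    score = ndimage.distance_transform_edt(passable)
    # 2. Adjacent tiles whose score clears the threshold form the region "cores".
    cores, n_regions = ndimage.label(score >= threshold)
    # 3. Expand outwards: every remaining passable tile joins its nearest core.
    _, nearest = ndimage.distance_transform_edt(cores == 0, return_indices=True)
    regions = cores[tuple(nearest)]
    regions[~passable] = 0          # impassable tiles stay unassigned (label 0)
    return regions, n_regions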
Then your pathfinding algorithm becomes two-stage (see the sketch after these steps):
Find which region the unit is in.
Find which region the target is in.
If different regions, calculate shortest path to target region first using the region graph from above.
Once in the same region, calculate path normally on the tile grid.
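A minimal sketch of that two-stage lookup, assuming a `region_of(tile)` helper, a tile-level A* function `astar_tiles`, and a precomputed networkx `region_graph` whose edges carry a hypothetical "gateway" tile attribute (the tile where the two regions meet):

import networkx as nx

def hierarchical_path(start, goal, region_graph, region_of, astar_tiles):
    if region_of(start) == region_of(goal):
        return astar_tiles(start, goal)                  # same region: plain A*
    # Coarse plan over the small region graph first...
    region_plan = nx.shortest_path(region_graph, region_of(start), region_of(goal))
    path, current = [], start
    for r_a, r_b in zip(region_plan, region_plan[1:]):
        gateway = region_graph.edges[r_a, r_b]["gateway"]   # assumed edge attribute
        path += astar_tiles(current, gateway)               # ...then refine each leg on the tile grid
        current = gateway
    return path + astar_tiles(current, goal)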
When moving between distant locations, this should be much faster because you are now searching through a handful of nodes (on the region graph) plus a relatively small number of tiles, instead of the hundreds of tiles that comprise those regions. For example, if we have 3 islands A, B, C with bridges 1 and 2 connecting A-B and B-C respectively, then units moving from A to C don't really need to search all of B every time; they only care about the shortest way from bridge 1 to bridge 2. If you have a lot of islands this can really speed things up.
Of course the problem is that regions may change due to, for instance, buildings blocking a path or units temporarily obstructing a passageway. The solution to this is up to your imagination. You could try to carefully update the region graph every time the map is altered, if the map is rarely altered in your game. Or you could just let units naively trust the region graph until they bump into an obstacle. In some games you can see particularly bad cases of the latter, where a unit will continue running towards a valley even after it's been walled off, and only after hitting the wall will it turn back and go around. I think the original StarCraft had this issue when units blocked a narrow path: they would try to take a really long detour instead of waiting for the crowd to free up a bridge.
There are also algorithms that accomplish analogous optimizations without explicitly building the region graph; for instance, JPS (jump point search) works roughly this way.
Suppose I have a whole set of recipes in text format, with nothing else about them being known in advance. I must divide this data into 'recipes for baked goods' and 'other recipes'.
For a baked good, an excerpt from the recipe might read thusly:
"Add the flour to the mixing bowl followed by the two beaten eggs, a pinch of salt and baking powder..."
These have all been written by different authors, so the language and vocabulary are not consistent. I need an algorithm or, better still, an existing machine learning library (implementation language is not an issue) that I can 'teach' to distinguish between these two types of recipe.
For example I might provide it with a set of recipes that I know are for baked goods, and it would be able to analyse these in order to gain the ability to make an estimate as to whether a new recipe it is presented with falls into this category.
Getting the correct answer every time is not critical, but the classification should be reasonably reliable. Having researched this problem, it is clear to me that my AI/ML vocabulary is not extensive enough to allow me to refine my search.
Can anyone suggest a few libraries, tools or even concepts/algorithms that would allow me to solve this problem?
What you are looking for is anomaly / outlier detection.
In your example, "baked goods" is the data you are interested in, and anything that doesn't look like what you have seen before (not a baked good) is an anomaly / outlier.
scikit-learn has a limited number of methods for this. Another common method is to compute the average distance between data points, and then treat anything new that is more than the average + c * standard deviation away as an outlier.
More sophisticated methods exist as well.
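A hedged sketch of the scikit-learn route: fit a one-class model on TF-IDF vectors of the known baked-goods recipes, then flag anything the model considers an outlier as "other". The parameters and example texts are illustrative, not tuned.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

baked_recipes = [
    "Add the flour to the mixing bowl followed by the two beaten eggs, a pinch of salt and baking powder...",
    "Cream the butter and sugar, fold in the flour, then bake at 180C for 25 minutes...",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(baked_recipes)
detector = OneClassSVM(nu=0.1, kernel="linear").fit(X)

new_recipe = ["Simmer the onions and lentils in stock for twenty minutes..."]
is_baked = detector.predict(vectorizer.transform(new_recipe))  # +1 = looks like the baked set, -1 = outlier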
You can try case-based reasoning.
Extract specific words or phrases that would put a recipe into the baked-goods category. If none are present, it goes into the other-recipes category.
You can get clever and add word sets {} so you don't need to look for an exact phrase.
Add a weight to each word, and if the total gets over a threshold value, put the recipe into baked goods.
So {"oven" => 10, "flour" => 5, "eggs" => 3}.
My reasoning is that if it is going in the "oven" it is likely to be getting baked. If you need to distinguish between baking a cake and roasting a joint, then this needs adjusting. Likewise "flour" is associated with something that is going to be baked, as are eggs.
Add pairs {("beaten", "eggs") => 5}; notice this is different from a phrase {"beaten eggs" => 10} in that the words in the pair can appear anywhere in the recipe.
Negatives: {"chill in the fridge" => -10}.
Negators: {"dust with flour" => "-flour"}.
Absolutes: {"bake in the oven" => 10000} is just a way of saying {"bake in the oven" => "it is a baked good"} by making the number so high that it will be over the threshold on its own.
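Pulled together, a toy sketch of that scoring scheme (all the words, weights and the threshold are illustrative):

def looks_baked(recipe, threshold=10):
    text = recipe.lower()
    weights = {"oven": 10, "flour": 5, "eggs": 3, "baking powder": 6}
    negatives = {"chill in the fridge": -10}
    pairs = {("beaten", "eggs"): 5}          # both words anywhere in the recipe
    negators = {"dust with flour": "flour"}  # phrase cancels a keyword's weight
    absolutes = {"bake in the oven": 10000}  # effectively an instant "yes"

    score = 0
    for phrase, w in {**weights, **negatives, **absolutes}.items():
        if phrase in text:
            score += w
    for (a, b), w in pairs.items():
        if a in text and b in text:
            score += w
    for phrase, word in negators.items():
        if phrase in text and word in text:
            score -= weights.get(word, 0)
    return score >= threshold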
I have a small question which has been baffling me for a while. I have a dataset with interesting features, but some of them are dimensionless quantities. I've tried using z-scores on them, but that made things worse. These are:
Timestamps (like YYYYMMDDHHMMSSMis): I am taking the last 9 chars from this.
User IDs (in a hash form): how do I extract meaning from them?
IP addresses (you know what those are): I only extract the first 3 chars.
City (has an ID like 1, 15, 72): how do I extract meaning from this?
Region (same as city): should I extract meaning from this or just leave it?
The rest of the features are prices, widths and heights, which I understand. Any help or insight would be much appreciated. Thank you.
Timestamps can be transformed into Unix timestamps, which are reasonable natural numbers.
User IDs/cities/regions are nominal values, which have to be encoded somehow. The most common approach is to create as many "dummy" dimensions as there are possible values. So if you have 100 cities, you create 100 dimensions and put "1" only in the one representing a particular city (and 0 in the others).
IPs should rather be removed, or transformed into some small group of them (based on DNS/network identification and the nominal-to-dummy transformation above).
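A small sketch of those transformations with pandas (column names and values are made up for illustration):

import pandas as pd

df = pd.DataFrame({
    "city_id":   [1, 15, 72, 15],
    "region_id": [3, 3, 9, 9],
    "price":     [10.0, 12.5, 7.0, 9.9],
})
# Nominal values -> "dummy" (one-hot) dimensions: city_id becomes
# city_id_1 / city_id_15 / city_id_72 columns of 0s and 1s, and so on.
encoded = pd.get_dummies(df, columns=["city_id", "region_id"])

# A timestamp can be parsed and turned into Unix seconds.
unix_seconds = pd.Timestamp("2023-01-05 13:15:00").timestamp()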
In Mahout, I'm setting up a GenericUserBasedRecommender, pretty straightforward for now, typical settings.
In generating a "preference" value for an item, we have the following 5 data points:
Positive interest
User converted on item (highest possible sign of interest)
Normal like (user expressed interest, e.g. like buttons)
Indirect expression of interest (clicks, cursor movements, measuring "eyeballs")
Negative interest
Indifference (items the user ignored when active on other items, a vague expression of disinterest)
Active dislike (thumbs down, remove item from my view, etc)
Over what range should I express these different attributes? Let's use a 1-100 scale for discussion.
Should I be keeping 'Active dislike' and 'Indifference' clustered close together, for example at 1 and 5 respectively, with all the likes clustered in the 90-100 range?
Should 'Indifference' and 'Indirect expression of interest' be closer to the center? As in, 'Indifference' in the 20-35 range and 'Indirect like' in the 60-70 range?
Should 'User conversion' blow the scale away and be head and shoulders above the others? As in: 'User conversion' at 100, lesser likes at ~65, dislikes clustered in the 1-10 range?
On a scale of 1-100, is 50 effectively "null", or equivalent to no data point at all?
I know the final answer lies in trial and error and in the meaning of our data, but as far as the algorithm goes, I'm trying to understand at what point I need to tip the scales between interest and disinterest for the algorithm to function properly.
The actual range does not matter, not for this implementation. 1-100 is OK, 0-1 is OK, etc. The relative values are all that really matters here.
These values are estimated by a simple linearly-weighted average, so the response ought to be "linear": it should match the intuition that if action X gets a score 2x higher than action Y, then X is an indicator of twice as much interest in real life.
A decent place to start is to simply size them relative to their frequency. If click-to-conversion rate is 2%, you might make a click worth 2% of a conversion.
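For example (made-up counts, purely to illustrate the frequency-based sizing):

# Rarer (stronger) signals get proportionally more weight than frequent ones.
event_counts = {"conversion": 200, "like": 1000, "click": 10000}
base = event_counts["conversion"]
weights = {event: base / count for event, count in event_counts.items()}
# -> {'conversion': 1.0, 'like': 0.2, 'click': 0.02}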
I would ignore the "Indifference" signal you propose. It is likely going to be too noisy to be of use.
Say I want to build a check-in aggregator that counts visits across platforms, so that I can know for a given place how many people have checked in there on Foursquare, Gowalla, BrightKite, etc. Is there a good library or set of tools I can use out of the box to associate the venue entries in each service with a unique place identifier of my own?
I basically want a function that can map from a pair of (placename, address, lat/long) tuples to [0,1) confidence that they refer to the same real-world location.
Someone must have done this already, but my google-fu is weak.
Yes, you can submit the two addresses using geocoder.net (assuming you're a .Net developer, you didn't say). It provides a common interface for address verification and geocoding, so you can be reasonably sure that one address equals another.
If you can't get them to standardize and match, you can compare their distances and assume they are the same place if they are below a certain threshold away from each other.
I'm pessimistic that such a tool is already accessible.
A good solution for matching pairs, based on the entity-resolution literature, would be to (a rough sketch follows below):
get the placenames, and define and use a good distance function on them (e.g. edit distance),
get the addresses, standardize them (e.g. with the mentioned geocoder.net tools), and also define a distance between them,
get the coordinates and compute a distance (this is easy: there are lots of libraries and tools for geographic distance calculations, and that seems to be a good metric),
turn the distances into probabilities ("what is the probability of such a distance, if we suppose these are the same place?") (not straightforward),
and combine the probabilities (also not straightforward).
Then maybe a closure-like algorithm (close the set according to merging pairs above a given probability threshold) can also help to find all the matches (for example when different names accumulate for a given venue).
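A rough, uncalibrated sketch of that pairwise matcher (the weights, the distance scale and the similarity measure are all assumptions, not tuned values):

import math
from difflib import SequenceMatcher

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two lat/long points, in kilometres.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def match_confidence(venue_a, venue_b):
    # venue = (placename, address, (lat, lon))
    name_a, addr_a, (lat_a, lon_a) = venue_a
    name_b, addr_b, (lat_b, lon_b) = venue_b
    name_sim = SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()
    addr_sim = SequenceMatcher(None, addr_a.lower(), addr_b.lower()).ratio()
    dist_km = haversine_km(lat_a, lon_a, lat_b, lon_b)
    geo_sim = math.exp(-dist_km / 0.1)   # drops towards 0 beyond a few hundred metres
    # Naive combination of the three signals into a [0, 1) confidence.
    return min(0.5 * name_sim + 0.2 * addr_sim + 0.3 * geo_sim, 0.999)

print(match_confidence(("Joe's Pizza", "7 Carmine St, New York", (40.7306, -74.0027)),
                       ("Joes Pizza",  "7 Carmine Street, NYC",  (40.7305, -74.0028))))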
It wouldn't be a bad tool or service to build, however.