I have a dataset which contains information about houses worldwide with the following features: house size, number of bedrooms, city name, country name, garden or not, ... (and many other typical house information). And the target variable is the price of the house.
I know that strings are not acceptable as input in a Machine Learning or Neural Network model so instead of doing one hot encoding for the city name and the country name (because I would end up with a few hundred columns) I decided to replace the city name with its geographical coordinates (one column with longitude and one column with latitude).
The city where a house is located will obviously help determine the price of the house.
So does replacing the city name with its longitude and latitude preserve this important information? Is it all right to replace the city name with its longitude and latitude?
Cartesian coordinates can be useful for the model to some extent. However, for certain models such as decision trees, properly modeling the dependency of the target variable on geographical coordinates can require overly complex models. For a clear and visual understanding of this you may check this.
A common approach in these cases is to transform the coordinates into polar coordinates and add them as new features. When you think about it, you're adding a new way of expressing the same thing, just in a different scale or system. That way a tree will require fewer splits to model this spatial dependency of the samples.
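As a rough illustration of the polar-coordinate idea, here is a minimal Python sketch. The function name add_polar_features and the choice of reference center are assumptions made for this example, not something from the question; it also treats latitude and longitude as flat x/y offsets, which is only an approximation.

```python
import math

def add_polar_features(lat, lon, center_lat=0.0, center_lon=0.0):
    """Return (radius, angle) of a point relative to a chosen center.

    Assumption: latitude/longitude are treated as planar y/x offsets,
    and the center is an arbitrary reference point (hypothetical).
    """
    dx = lon - center_lon
    dy = lat - center_lat
    radius = math.hypot(dx, dy)   # distance from the center in degrees
    angle = math.atan2(dy, dx)    # direction in radians, in [-pi, pi]
    return radius, angle

# Keep the original columns and add the polar ones as extra features.
rows = [{"lat": 48.85, "lon": 2.35}, {"lat": 40.71, "lon": -74.01}]
for row in rows:
    row["radius"], row["angle"] = add_polar_features(row["lat"], row["lon"])
```

Where to put the center is a modeling choice; the mean coordinate of the training data or a known city center are both reasonable starting points to try.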
That being said, I would not completely replace the existing geolocation data with coordinates. It would probably also be interesting to add some aggregates/statistics based on the country or city data, rather than one-hot encoding them or just replacing them with coordinates.
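One possible aggregate is the mean price per city, computed on the training rows only so the target does not leak into validation data. The sketch below assumes rows are plain dicts with hypothetical "city" and "price" keys; the helper names are made up for this example.

```python
from collections import defaultdict

def city_mean_price(train_rows):
    """Mean target value per city, computed from training rows only."""
    totals = defaultdict(lambda: [0.0, 0])  # city -> [sum of prices, count]
    for row in train_rows:
        acc = totals[row["city"]]
        acc[0] += row["price"]
        acc[1] += 1
    return {city: s / n for city, (s, n) in totals.items()}

def add_city_feature(rows, means, default):
    """Copy each row and attach the city mean; unseen cities get `default`."""
    return [dict(row, city_mean_price=means.get(row["city"], default))
            for row in rows]

train = [
    {"city": "Paris", "price": 500000.0},
    {"city": "Paris", "price": 700000.0},
    {"city": "Lyon", "price": 300000.0},
]
means = city_mean_price(train)
global_mean = sum(r["price"] for r in train) / len(train)
encoded = add_city_feature(train, means, global_mean)
```

The same pattern works for other statistics (median, count of listings, country-level means), and the original city and country columns stay available alongside the new feature.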
Currently, I'm working on dimensional modeling and have a question in regards to an outrigger dimension.
The company is trading and acts as a broker between customer and supplier.
For a fact table, "Fact Trades", we include dimCustomer and dimSupplier.
Each of these dimensions has an address.
My question is whether it is correct to use outrigger dimensions that refer to geography. This way we can measure how much was delivered from an origin and how much was delivered to a city.
I am curious as to what the best practice is. I hope you can help explain how this should be modelled correctly and why.
I hope my question was clear and that I have posted it in the correct place.
Thanks in advance.
I can think of at least 3 possible options; your particular circumstances will determine which is best for you:
1. If you often filter your fact by geography without needing company/person information (e.g. how many trades were between London and New York?), then I would create a standalone geography dimension and link it directly to your fact (twice: once for customer and once for supplier). This also doesn't stop you from having geographic attributes in your customer/supplier Dims, as a dimensional model is not normalised.
2. If geographic attributes change significantly more often than the customer/supplier attributes, and the customer/supplier Dims have a lot of attributes, then it may be worth creating an outrigger dim for the geographical attributes, as this reduces the maintenance required for the customer/supplier Dims. However, given that most companies/people rarely change their address, this is probably unlikely.
3. Keep the geographical attributes in the customer/supplier Dims. I would probably do this anyway, even if also picking option 1 above.
Just out of interest - do customer and supplier have significantly different sets of attributes (I assume they are both companies or people)? Is it necessary to create separate Dims for them?
I'm a software engineering student and new to data mining. I want to implement a solution to find similar users based on their interests and skills (sets of strings).
I think I cannot use k-nearest neighbors with an edit distance (Levenshtein or ..)
If someone could help with that please
The first thing you should do is convert your data into some reasonable representation, so that you will have a well-defined notion of distance between suitably represented users.
I would recommend converting all strings into some canonical form, then sorting all n distinct skills and interest strings into a dictionary D. Now for each user u, construct a vector v(u) with n components, which has i-th component set to 1 if the property in dictionary entry i is present, and 0 otherwise. Essentially we represented each user with a characteristic vector of her interests/skills.
Now you can compare users with Jaccard index (it's just an example, you'll have to figure out what works best for you). With the notion of a distance in hand, you can start trying out various approaches. Here are some that spring to mind:
apply hierarchical clustering if the number of users is sufficiently small;
apply association rule learning (I'll leave you to think out the details);
etc.
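To make the representation and the distance concrete, here is a minimal Python sketch of the characteristic vectors and the Jaccard index described above. The canonical form used here (lowercasing and collapsing whitespace) and the sample users are assumptions for illustration; real data may need stemming or synonym handling.

```python
def canonical(s):
    """Assumed canonical form: lowercase with whitespace collapsed."""
    return " ".join(s.lower().split())

def build_dictionary(users):
    """Sorted list D of all distinct canonical skill/interest strings."""
    return sorted({canonical(s) for items in users.values() for s in items})

def characteristic_vector(items, dictionary):
    """Component i is 1 if dictionary entry i is present for the user, else 0."""
    present = {canonical(s) for s in items}
    return [1 if entry in present else 0 for entry in dictionary]

def jaccard(u, v):
    """Jaccard index of two 0/1 vectors: |intersection| / |union|."""
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0

users = {
    "alice": ["Python", "machine learning", "Chess"],
    "bob": ["python", "Chess ", "cooking"],
    "carol": ["gardening"],
}
D = build_dictionary(users)
vectors = {name: characteristic_vector(items, D) for name, items in users.items()}
similarity = jaccard(vectors["alice"], vectors["bob"])
```

A distance for clustering can then be taken as 1 minus the Jaccard index.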
Let's say I have 1000 objects that have a latitude/longitude value.
What I'd like to do is send up a user's latitude/longitude and return the 50 closest objects. I know how I could do that with one value - if I had just a latitude, but location seems different with the two values so I'm not sure how I'd do that.
What would be the best way to accomplish that? Feel free to point me to a tutorial or anything - I'm very new to using location in rails :).
There isn't any detail in the original question regarding data representation, but let's suppose you have lat/long values, each in the form [lat, long], in a list called locations. Let's suppose further you have a metric method for the distance between two of them, dist(p1, p2).
Then this will collect all of the location pair combinations and their respective distances as a collection of triples, [dist, p1, p2]:
locations.combination(2).collect { |p| [dist(p[0], p[1]), p[0], p[1]] }
You can then sort this, which will order by distance, and pick off the top 50:
locations.combination(2).collect { |p| [dist(p[0], p[1]), p[0], p[1]] }.sort.first(50)
You'd have to see if it works with 1,000 objects, as this will create 499,500 combinations initially and I don't know what the Ruby array capacity is.
This is a classic computer science problem called the nearest neighbor problem. See this Wikipedia article for two popular solutions.
The article describes the point with x, y coordinates. You can just substitute longitude for x and latitude for y, keeping in mind that this flat-plane treatment is only an approximation that works best over small areas.
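For only 1,000 objects, a linear scan is usually enough. The sketch below is in Python rather than Ruby, and its names (haversine_km, closest, the dict keys) are assumptions for illustration. It uses the haversine great-circle distance instead of the flat x/y substitution, and the same structure ports to Ruby directly.

```python
import heapq
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/long points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def closest(objects, user_lat, user_lon, k=50):
    """Return the k objects nearest to the user, nearest first (linear scan)."""
    return heapq.nsmallest(
        k, objects,
        key=lambda o: haversine_km(user_lat, user_lon, o["lat"], o["lon"]),
    )

objects = [
    {"name": "near", "lat": 51.51, "lon": -0.13},
    {"name": "far", "lat": 40.71, "lon": -74.01},
    {"name": "mid", "lat": 48.85, "lon": 2.35},
]
nearest_two = closest(objects, 51.50, -0.12, k=2)
```

If the object count grows well beyond a few thousand, a spatial index such as the k-d tree mentioned in the nearest neighbor literature, or a database with geospatial support, becomes more attractive than scanning every row.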
I am trying to realize a datamodel in Neo4j. The model has points of interest in a city and streets. The streets connect the points.
Initially I thought that points and streets should both be represented in the graph database as nodes.
Between these two different types of nodes there is a relationship ("point is connected with").
Now I am considering the possibility that, instead of representing the street as a node, it is perhaps more correct to represent the street as a relationship ("connects two points").
And this is my question actually. What is the more correct way to represent the network (line part) in a model: with nodes or with relationships?
The only major difference between relationships and nodes is that relationships must exist between two nodes. This means that you wouldn't be able to store a specific street if you didn't store two points of interest that it connects. So, if you see this being an issue, you may want to store streets as nodes. If you are certain that you will only want to store streets if there are points of interest in your database that exist on the street, then it'd make more sense to represent the streets as relationships.
In general, you should try to avoid storing properties in nodes that you only intend to use to find relationships between them. In this case, you mention possibly storing a "point is connected with" property in each point of interest node. This would work, but is essentially just saying that a relationship exists between two points without actually using a relationship. Again, in the case where you want to be able to store streets that don't have points of interest on them, this may be necessary, and you could store such streets by leaving the "point is connected with" property as NULL, but I would advise against this.
Another thing to think about is what you would store in the relationship. If you go with the model where streets are nodes, it becomes very difficult to represent quantities like distances between points of interest without adding relationships into your graph specifically for those properties, which may as well be properties of a street relationship.
UPDATE: Thought I'd add an example query to show how making the streets relationships can simplify your logic and make using your database much simpler and more intuitive.
Imagine you wish to find the path with the fewest points of interest between points A and B.
This is what the query would look like with the relationships model:
MATCH (a:Point {name: "foo"}), (b:Point {name: "bar"}),
p = shortestPath((a)-[:Street*]-(b))
RETURN p
By using relationships where appropriate, you enable the capabilities of Neo4j, allowing you to get a lot of work done with relatively simple queries. It's hard to think of a way to write this query in the model where you represent streets as nodes, but it would in all likelihood be much more complex and less efficient.
I'm writing an iOS application that pulls events from a public Google calendar, pulls out the free-form "Location" field, and drops a pin on a map corresponding to the given location. I want to make the app as flexible as possible using some kind of string search or fuzzy matching algorithms, but I'm not sure where to begin.
There are several things a calendar moderator may enter into the Location field:
A building name and room number (e.g. Foo Hall Room 123)
A building abbreviation and room number (e.g. FOO 123)
A shorthand room or location name (e.g. Foo)
Currently, I have a SQLite database composed of one table, each row storing a latitude, longitude, full building name (Foo Hall), and standardized building abbreviation (FOO).
I want to take the moderator's free-form string and obtain the correct coordinates from the database (if present).
I've tried using LIKE '%FOO%' and similar patterns, as well as Levenshtein Distance, but I run into issues, for instance if the actual building name is "Example Foo and Bar Building" and the location entered by moderator is "Example Bar Building".
The three options I've considered are...
Force the moderator to enter in a standardized abbreviation or building name. This could potentially be a tedious process for the calendar moderators, so I'm trying to avoid this if possible.
Do a crude substring search that checks if the entered string is contained anywhere in the database string. This is what my university does on their website, but it obviously isn't very flexible.
Implement a more complex fuzzy string matching algorithm that provides maximum flexibility but will take an order of magnitude more time to implement. If the right one already exists, that would be the ideal solution!!
Which of these options (if any) seems the best? Is there a better alternative that I haven't thought of? Is there a library that does what I need and I just haven't found it yet?
Thanks in advance for any help!