Free Shape Files for districts of the world - geolocation

I am looking for shape files (polygon data) for all the districts in the world. I can find a few files from different sources for the districts of a single country, but I need one up-to-date file that covers all the districts in the world, or one reliable source from which I can get such files, perhaps one per country. Please help me find this.
PS: shape files for countries and regions are easily available, but those are not what I need in my case.

Geofabrik (http://download.geofabrik.de/) provides shapefiles for the whole world generated from OpenStreetMap data. These include administrative areas.
Natural Earth (see http://www.naturalearthdata.com/features/) has first-order administrative boundaries for the whole world.
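For instance, a minimal GeoPandas sketch for loading the Natural Earth first-order admin boundaries; the file name below is the 10m admin-1 shapefile you would download from that site, and the "admin" column name should be verified against the data:

```python
# A minimal sketch, assuming you have downloaded and unzipped the Natural Earth
# "Admin 1 - States, Provinces" shapefile; the path below is hypothetical.
import geopandas as gpd

admin1 = gpd.read_file("ne_10m_admin_1_states_provinces.shp")

# Inspect the attribute table and filter to one country (column names follow the
# Natural Earth schema; check them with admin1.columns before relying on them).
print(admin1.columns)
india_admin1 = admin1[admin1["admin"] == "India"]
india_admin1.to_file("india_admin1.shp")
```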

Related

Best Grasshopper plugin to analyse floor plans

I'm trying to figure out the best way to analyse a Grasshopper/Rhino floor plan. I am trying to create a room map to determine how many doors it takes to reach an exit in a residential building. The inputs are the room curves, names and doors.
I have tried to use space syntax or SYNTACTIC, but some of the components are missing. A lot of the plugins I have been looking at are good at creating floor plans but not at analysing them.
Your help would be greatly appreciated :)
You could create some sort of spine that goes through the rooms, passing only through doors, and then run path finding across that topology, counting how many "hops" you need to reach the exit.
One way to get the topology is to create a data structure (a tuple, key-value pair) that holds the curve (room) and a point (the door). Then loop over each pair of rooms and check whether a door point of one room is closer than some threshold to the other room; if it is, store the relationship as a graph (in the abstract sense; you don't really need to make lines out of it, but if you plan to use other plugins for path finding this can be useful). Finally, run a path-finding algorithm (Dijkstra's, A*, etc.) to find the shortest distance.
As for SYNTACTIC: if copying the GHA (after unblocking it) from the installation path to the special components folder (or pointing to that folder from _GrasshopperDeveloperSettings) doesn't work, tick the "Memory load *.GHA assemblies using COFF byte arrays" option in _GrasshopperDeveloperSettings.
*Note that SYNTACTIC won't give you any automatic topology.
If you need some pseudo-code just write a comment and I'd be happy to help.
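For instance, here is a minimal Python sketch of the door-graph idea, independent of any Grasshopper specifics; the room names, door points and tolerance are hypothetical stand-ins for your curves and door geometry:

```python
# Minimal sketch of the door-graph idea (hypothetical data, no Rhino/Grasshopper API calls).
# Each room is reduced to the set of door points that belong to it; two rooms are
# connected if they share a door point within some tolerance.
from collections import deque
from math import dist

TOL = 0.01  # tolerance for "same door", in model units (assumption)

# room name -> list of door points (x, y)
rooms = {
    "bedroom":  [(0.0, 1.0)],
    "hallway":  [(0.0, 1.0), (2.0, 1.0)],
    "entrance": [(2.0, 1.0)],
}

# Build the adjacency graph: rooms sharing a door (within TOL) get an edge.
graph = {name: set() for name in rooms}
names = list(rooms)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if any(dist(p, q) <= TOL for p in rooms[a] for q in rooms[b]):
            graph[a].add(b)
            graph[b].add(a)

def doors_to_exit(start, exit_room="entrance"):
    """Breadth-first search: number of door 'hops' from start to the exit room."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        room, hops = queue.popleft()
        if room == exit_room:
            return hops
        for nxt in graph[room] - seen:
            seen.add(nxt)
            queue.append((nxt, hops + 1))
    return None  # no route through doors

print(doors_to_exit("bedroom"))  # -> 2
```

The same structure drops straight into a GhPython component if you feed it the door points extracted from your curves.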

Georeference automatically thousands of scanned maps

I need to automatically georeference thousands of scanned maps (263 inner points inside the map and some on the territory limits). All the maps represent the same territory; only some conventions change. I tried to identify contours with OpenCV, and they look like good features for machine learning and georeferencing, but I don't know how to identify a specific contour in different images. If I could, I could assign the georeference for that contour. Any ideas? Other possibilities?
Here are two maps to clarify the idea: [two scanned map images were attached in the original post]
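One possible direction (a rough sketch only, with hypothetical file names): extract contours from a reference map that is already georeferenced and from a new scan, then use OpenCV's matchShapes (Hu-moment based, so fairly robust to scale and rotation) to find which contour in the new scan best matches a known reference contour, and transfer the georeference from that match.

```python
# Rough sketch: match one known contour from a georeferenced reference map
# against all contours found in a new scan. File names are hypothetical.
import cv2

def contours_of(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    # OpenCV 4.x return signature: (contours, hierarchy)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    return contours

ref_contours = contours_of("reference_map.png")
new_contours = contours_of("new_scan.png")

# Pick the reference contour you care about (here: simply the largest one, as an example).
ref = max(ref_contours, key=cv2.contourArea)

# matchShapes returns a dissimilarity score based on Hu moments (0 = identical shape).
best = min(new_contours, key=lambda c: cv2.matchShapes(ref, c, cv2.CONTOURS_MATCH_I1, 0.0))
print(cv2.matchShapes(ref, best, cv2.CONTOURS_MATCH_I1, 0.0))
```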

Is there a reputable source that provides mappings of UN/LOCODEs to Olsen Timezones?

I've been researching CLDR and IANA in order to find a centralized mapping of UN/LOCODEs to Olsen Timezones.
Ideally I would like to have for example:
+--------------+--------------------+
|un_locode |timezone |
+--------------+--------------------+
|USLAX | America/Los_Angeles|
+--------------+--------------------+
for every UN/LOCODE.
Are my noob skills failing me in understanding how to use these sources to reach my goal? (If so, please help point me towards the scripting that would allow me to automate producing these mappings.)
Or, do these sources fail to have the data correlation that I'm looking for? (If so please let me know if you have a reliable source).
We faced the exact same problem and hence had to provide a solution.
This solution involves linking the UN/LOCODES database with a geolocation/timezone database.
There are a few caveats to this approach that were captured by Matt Johnson's answer and the accompanying comments.
Namely:
the UN/LOCODE database of coordinates is not complete[1] and sometimes has inaccurate data[2]
in some cases, a 1 to 1 mapping between the UN/LOCODE and a timezone is impossible due to the political nature of the timezones.
the two points above are worsened by the inaccuracy of free coordinates-to-timezone databases. It is helpful to get a dataset that also includes territorial waters, so that port timezones can be properly linked to the country they belong to.
The following repository https://github.com/Portchain/un_locodes_sql contains the code to extract and link the data. It outputs a SQL file that can be imported into a PostgreSQL DB.
The geolocation/timezone data is based on the geo-tz[3] module which seems to source its data from timezone-boundary-builder[4].
Again, the list provided by our repository is of course incomplete and inaccurate. If you see any error in the data, please open a github issue and let's make an accurate, open source list of UN/LOCODE, coordinates and timezone information.
[1] For example, both Los Angeles and San Francisco, USA (USLAX & USSFO) are missing coordinates in the UN/LOCODE database.
[2] The petroleum port of Abu al Bukhoosh (AEABU) is situated in Abu Dhabi (UAE). Its coordinates in the UN/LOCODE database position the port right in the middle of the Persian Gulf (https://www.port-directory.com/ports/abu_al_bukhoosh/). When resolved, this causes the timezone to be unknown.
[3] https://github.com/evansiroky/node-geo-tz
[4] https://github.com/evansiroky/timezone-boundary-builder
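If you prefer Python over the Node.js geo-tz stack used by that repository, the coordinate-to-timezone step can be sketched with the timezonefinder package (an alternative I'm suggesting, not what the repository uses); the parsing below assumes the usual "DDMMN DDDMMW" format of the UN/LOCODE Coordinates column, and the example values are illustrative:

```python
# Sketch: resolve a UN/LOCODE coordinate string to an IANA timezone.
# Assumes the standard UN/LOCODE "DDMMN DDDMMW" coordinate format and the
# timezonefinder package (pip install timezonefinder). Values are illustrative.
from timezonefinder import TimezoneFinder

tf = TimezoneFinder()

def unlocode_coords_to_latlon(coords):
    """'5004N 01426E' -> (50.066..., 14.433...): degrees and minutes, no seconds."""
    lat_str, lon_str = coords.split()
    lat = int(lat_str[:2]) + int(lat_str[2:4]) / 60.0
    lon = int(lon_str[:3]) + int(lon_str[3:5]) / 60.0
    if lat_str[-1] == "S":
        lat = -lat
    if lon_str[-1] == "W":
        lon = -lon
    return lat, lon

lat, lon = unlocode_coords_to_latlon("5004N 01426E")   # illustrative value
print(tf.timezone_at(lat=lat, lng=lon))                # e.g. 'Europe/Prague'
```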
The GeoNames free database of cities (which is available to download) provides: city names, latitude/longitude and, most importantly, timezone information. You can fairly quickly make your own database connecting this information with the UN/LOCODE code lists based on the name/country/coordinates.
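A rough sketch of that join with pandas; the column layouts below are assumptions to verify against the actual downloads (the GeoNames cities dump is tab-separated with documented columns, and the UN/LOCODE code list is published as CSV):

```python
# Rough sketch: join the GeoNames cities dump with the UN/LOCODE code list on
# (country code, city name). File names and column names are assumptions;
# verify them against the files you actually download.
import pandas as pd

geonames_cols = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1", "admin2",
    "admin3", "admin4", "population", "elevation", "dem", "timezone", "modified",
]
cities = pd.read_csv("cities15000.txt", sep="\t", names=geonames_cols, dtype=str)
cities["name_key"] = cities["asciiname"].str.lower()

# UN/LOCODE code list as CSV (header names assumed; the published files may need headers added).
locodes = pd.read_csv("unlocode.csv", dtype=str)
locodes["name_key"] = locodes["NameWoDiacritics"].str.lower()

merged = locodes.merge(
    cities[["country_code", "name_key", "timezone"]],
    left_on=["Country", "name_key"],
    right_on=["country_code", "name_key"],
    how="left",
)
print(merged[["Country", "Location", "timezone"]].head())
```

Name-based matching alone will leave ambiguities (duplicate city names, spelling variants), so falling back to coordinates for the unmatched rows is advisable.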
I've not seen such a source. You could try to create one by mapping the lat/lon coordinates for those entries that have them, and correlating to IANA time zone by one of the methods listed here.
However, be sure to read Wikipedia's article about UN/LOCODE, especially the part describing errors with coordinates. Also note that many of the coordinates are simply not in the data - why? I don't know.
The list of UN/LOCODEs for the US is here, and shows Los Angeles to be US LAX (not UNLAX). Its coordinates field is blank.
If you can find some other reliable source of UN/LOCODE to lat/lon, then you are in business. A quick search found that GeoNames claims to have this in their premium data subscription, but I haven't investigated further.
CLDR's map is here: https://unicode.org/reports/tr35/#Time_Zone_Identifiers
I saw CLDR tagged but not mentioned.

Finding features for classifying document into printable or non-printable

I would like to perform a binary classification of documents (.txt, .pdf, .jpeg, .img, etc.) into two categories: printable and non-printable. Essentially, our school runs a free printing service for clubs, but the reality is that many clubs abuse the free printing and end up printing their homework, papers, etc., which amounts to thousands of dollars in ink and paper. Thus we would like to use some unsupervised methods to help limit this by determining whether a document is, with high probability, not club related (e.g. a biophysics paper; there is no biophysics club!).
So this is a very simple binary classification problem. I am not looking for low-level implementation details or which ML algorithms I should use, but rather how I should discover the relevant features that will then be fed to the training, etc.
My first idea was to gather all the documents that students print in the library. The idea is that if you have actual club printing, you'll do it for free at the club printing center rather than pay for it at the library. That would be a massive dataset, assuming every document printed at the library is assigned to the non-printable (non-club) category. Unfortunately, the school is very liberal and opposed to allowing this due to privacy concerns, so it is not really an option without legal risks.
A similar option would be to collect documents that are tied to courses / school work, e.g. course syllabi and course documents available online (homework assignments, papers, etc.), and do feature extraction / selection on these. The assumption is that students abusing the printing would generally print material relevant to their studies.
While this approach should perform reasonably for .pdf and .txt documents, I am at a loss as to how to classify image-based documents, besides perhaps using the title of the document and other metadata. A clever violator could simply convert all their text documents to an image format to circumvent this system. However, that is outside the scope of this question and should be saved for a future question / research. For now the scope is just text-based documents.
Note that there are previous questions on topics similar to this, but mine is very specific and I believe it may pose challenges that something like movie review classification might not have to face.
I just wanted to leave a comment, but it ended up way longer than I imagined.
While this is an interesting problem, I'm not sure ML will get you what you need easily.
Firstly, your classification problem is of the type "A vs. the world", and A isn't strictly defined. Unless you know exactly what kind of material the clubs print, you can't really say whether new material belongs to that class or not.
This will prove particularly difficult when you need to assemble a training set large enough to cover whatever can or cannot be printed. That task will be extremely tedious, and as you said you won't have access to what the clubs usually print, so at best you will have a large class imbalance in your training set.
Since the goal is to make the system automated (if there is human interaction anyway, it's faster to check what will be printed than to build an ML algorithm whose scores a human will have to investigate anyway), the number of false positives and false negatives will also be problematic. There will be cases where the clubs won't be able to print things they have the right to.
As you said, you could simplify the problem greatly by classifying Course Material vs. Not Course Material. For that I would look towards bag-of-words (BoW), because some words are more frequent in papers or course material (anything remotely technical) than elsewhere. The number of words as well as the overall size of the file seem like sensible features to extract. The structure is often also distinctive, so it might be a good idea to extract things like "number of lines with fewer than x words", "number of lines per page", "number of pictures" (if that's something you can extract from the file), and so on.
For pictures, the main thing to check would be whether it is a scan of something (I guess they will often scan and print course-related material); the format of the image is already a good indication, but I don't see other features that would be particularly "course related".
So for me: if you can't really define one of your two classes precisely, don't go with classification, or reduce the problem to something you can really define (course-related material).
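As a concrete illustration of the BoW plus simple structural features idea above, here is a sketch; the labelled example texts are toy placeholders, and extracting plain text from the submitted files is assumed to happen beforehand:

```python
# Sketch: bag-of-words features plus a couple of simple structural features
# for a "course material vs. not" classifier. Training texts/labels are toy
# placeholders; in practice you would first extract text from the submitted files.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "Problem set 3: prove the convergence of the gradient descent iterates ...",
    "Join the hiking club this Saturday! Free snacks and a campus meetup ...",
]
labels = [1, 0]  # 1 = course material, 0 = not course material

def structural_features(text):
    lines = text.splitlines() or [text]
    short_lines = sum(1 for l in lines if len(l.split()) < 5)
    return [len(text.split()), short_lines / len(lines)]  # word count, short-line ratio

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
bow = vectorizer.fit_transform(texts)
extra = csr_matrix(np.array([structural_features(t) for t in texts]))
X = hstack([bow, extra])

clf = LogisticRegression().fit(X, labels)
```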
If you are able to compile a "black list" of documents students are not allowed to print, you can then implement a multi-layer rejection mechanism.
I would suggest these 3 levels:
1) compare the MD5 of the file they want to print with a database of the MD5s of all the black-listed documents.
2) if 1) is passed, repeat 1) at the page level rather than the document level (perhaps they want to print just a few pages rather than the entire document).
3) if 2) is passed, compare the page they want to print with the pages of the black-listed documents using an image similarity method, like SSIM. If you get a high score between the page they want to print and one of the black-listed items, do not print, and update your MD5 database accordingly.
If 3) is passed: print!
A few words about SSIM: this method is quite robust to noise, so even a smart student who adds some sort of noise to the image will be caught.
However:
you have to find a proper way to extract a region of interest (ROI) from the page and from the documents in the database (if the two ROIs are in two different areas of the page, SSIM will be negative);
SSIM might be slow! A C implementation is definitely needed here;
I think SSIM is not rotation invariant, hence the check will fail if they print the page upside down (unless you have a smart way to rotate the page).
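A minimal sketch of levels 1) and 3), assuming hashlib for the MD5 check and scikit-image for SSIM; file paths and the blacklist store are hypothetical:

```python
# Sketch of the MD5 and SSIM layers. File paths and the blacklist store are
# hypothetical; level 2 would hash each extracted page the same way as the whole file.
import hashlib

from skimage.io import imread
from skimage.metrics import structural_similarity
from skimage.transform import resize

def md5_of(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def looks_blacklisted(page_image_path, blacklisted_image_paths, threshold=0.9):
    """Return True if the page resembles any black-listed page (SSIM above threshold)."""
    page = imread(page_image_path, as_gray=True)
    for blocked_path in blacklisted_image_paths:
        blocked = imread(blocked_path, as_gray=True)
        blocked = resize(blocked, page.shape)            # SSIM needs equal-sized images
        score = structural_similarity(page, blocked, data_range=1.0)
        if score >= threshold:
            return True
    return False

blacklist_md5s = {md5_of(p) for p in ["blacklisted_doc1.pdf"]}   # hypothetical blacklist
if md5_of("submitted.pdf") in blacklist_md5s:                    # level 1
    print("reject: exact match with a black-listed document")
```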

Entity resolution for venues and other geo locations

Say I want to build a check-in aggregator that counts visits across platforms, so that I can know for a given place how many people have checked in there on Foursquare, Gowalla, BrightKite, etc. Is there a good library or set of tools I can use out of the box to associate the venue entries in each service with a unique place identifier of my own?
I basically want a function that can map from a pair of (placename, address, lat/long) tuples to [0,1) confidence that they refer to the same real-world location.
Someone must have done this already, but my google-fu is weak.
Yes, you can submit the two addresses using geocoder.net (assuming you're a .NET developer; you didn't say). It provides a common interface for address verification and geocoding, so you can be reasonably sure that one address equals another.
If you can't get them to standardize and match, you can compare their distances and assume they are the same place if they are within a certain threshold of each other.
I'm pessimistic that such a tool is already accessible.
A good solution to match pairs, based on the entity resolution literature, would be to:
get the placenames, and define and use a good distance function on them (e.g. edit distance),
get the addresses, standardize them (e.g. with the mentioned geocoder.net tools), and also define a distance between them,
get the coordinates and compute a distance (this is easy: there are lots of libraries and tools for geographic distance calculations, and it seems to be a good metric),
turn the distances into probabilities ("what is the probability of such a distance, if we suppose these are the same place?") (not straightforward),
and combine the probabilities (also not straightforward).
Then maybe a closure-like algorithm (close the set by merging pairs above a given probability threshold) can also help to find all the matches (for example when different names accumulate for a given venue).
It wouldn't be a bad tool or service to build, however.
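A toy sketch of combining the placename and coordinate distances from the recipe above into a single confidence score; the weights and the distance decay are arbitrary assumptions, not calibrated probabilities:

```python
# Toy sketch: combine name similarity and geographic distance into a rough
# [0, 1) confidence that two venue records refer to the same place.
# The weights and the distance decay are arbitrary assumptions, not calibrated probabilities.
from difflib import SequenceMatcher
from math import atan2, cos, exp, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 6371.0 * 2 * atan2(sqrt(a), sqrt(1 - a))

def same_place_confidence(rec_a, rec_b):
    """rec = (placename, address, (lat, lon)); address comparison is omitted here."""
    name_sim = SequenceMatcher(None, rec_a[0].lower(), rec_b[0].lower()).ratio()
    dist_km = haversine_km(*rec_a[2], *rec_b[2])
    dist_score = exp(-dist_km / 0.2)        # decays to ~0 beyond a few hundred metres
    return 0.5 * name_sim + 0.5 * dist_score

a = ("Blue Bottle Coffee", "66 Mint St", (37.7823, -122.4076))
b = ("Blue Bottle Coffee Co.", "66 Mint Plaza", (37.7824, -122.4075))
print(same_place_confidence(a, b))   # close to 1 for near-identical records
```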
