How to use external data in an OSRM profile - lua

It this Mapbox blog post, Lauren Budorick shares how they got working a routing engine with OSRM that uses elevation data in order to give cyclists better routes... AMAZING!
I also want to explore the potential of OSRM's routing when plugging in external (user-generated) data, but I'm still having a hard time grasping how OSRM's profiles work. I think I get the main idea, that every way (or node?) is piped into a few functions that, all toghether, scores how good that path is.
But that's it, there are plenty of missing parts in my head, like what do each of the functions Lauren uses in her profile do. If anyone could point me to some more detailed information on how all of this works, you'd make my next week much, much easier :)
Also, in Lauren's post, inside source_function she loads a ./srtm_bayarea.asc file. What does that .asc file looks like? How would one generate a file like that one from, let's say, data stored in a pgsql database? Can we use some other format, like GeoJSON?
Then, when in segment_function she uses things like source.lon and target.lat, are those refered to the raw data stored in the asc file? Or is that file processed into some standard that maps everything to comply it?
As you can see, I'm a complete newbie on routing and maybe GIS in general, but I'd love to learn more about this standards and tools that circle around the OSRM ecosystem. Can you share some tips with me?

I think I get the main idea, that every way (or node?) is piped into a few functions that, all toghether, scores how good that path is.
Right, every way and every node are scored as they are read from an OSM dump to determine passability of a node and speed of a way (used as the scoring heuristic).
A basic description of the data format can be found here. As it reads, data immediately available in ArcInfo ASCII grids includes SRTM data. Currently plaintext ASCII grids are the only supported format. There are several great Python tools for GIS developers that may help in converting other data types to ASCII grids - check out rasterio, for example. Here's an example of a really simple python script to convert NED IMGs to ASCII grids:
import sys
import rasterio as rio
import numpy as np
args = sys.argv[1:]
with rio.drivers():
with rio.open(args[0]) as src:
elev = src.read()[0]
profile = src.profile
def shortify(x):
if x == profile['nodata']:
return -9999
elif x == np.finfo(x).tiny:
return 0
else:
return int(round(x))
out_elev = [map(shortify, row) for row in elev]
with open(args[0] + '.asc', 'a') as dst:
np.savetxt(dst, np.array(out_elev),fmt="%s",delimiter=" ")
source.lon and target.lat e.g: source and target are nodes provided as arguments by the extraction process. Their coordinates are used to look up data at each location during extraction.
Make sure to read thoroughly through the relevant wiki page (already linked).
Feel free alternately to open a Github issue in
https://github.com/Project-OSRM/osrm-backend/issues with OSRM
questions.

Related

Beam/Dataflow design pattern to enrich documents based on database queries

Evaluating Dataflow, and am trying to figure out if/how to do the following.
My apologies if anything in the above is trivial--trying to wrap our heads around Dataflow before we make a decision on using Beam, or something else like Spark, etc.
General use case is for machine learning:
Ingesting documents which are individually processed.
In addition to easy-to-write transforms, we'd like to enrich each document based on queries against databases (that are largely key-value stores).
A simple example would be a gazetteer: decompose the text into ngrams, and then check if those ngrams reside in some database, and record (within a transformed version of the original doc) the entity identifier given phrases map to.
How to do this efficiently?
NAIVE (although possibly tricky with the serialization requirement?):
Each document could simply query the database individually (similar to Querying a relational database through Google DataFlow Transformer), but, given that most of these are simple key-value stores, it seems like there should be a more efficient way to do this (given the real problems with database query latency).
SCENARIO #1: Improved?:
Current strawman is to store the tables in Bigquery, pull them down (https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py), and then use them as side inputs, that are used as key-value lookups within the per-doc function(s).
Key-value tables range from generally very small to not-huge (100s of MBs, maybe low GBs). Multiple CoGroupByKey with same key apache beam ("Side inputs can be arbitrarily large - there is no limit; we have seen pipelines successfully run using side inputs of 1+TB in size") suggests this is reasonable, at least from a size POV.
1) Does this make sense? Is this the "correct" design pattern for this scenario?
2) If this is a good design pattern...how do I actually implement this?
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/gcp/bigquery.py#L53 shows feeding the result to the document function as an AsList.
i) Presumably, AsDict is more appropriate here, for the above use case? So I'd probably need to run some transformations first on the Bigquery output to separate it into key, value tuple; and make sure that the keys are unique; and then use it as a side input.
ii) Then I need to use the side input in the function.
What I'm not clear on:
for both of these, how to manipulate the output coming off of the Bigquery pull is murky to me. How would I accomplish (i) (assuming it is necessary)? Meaning, what does the data format look like (raw bytes? strings? is there a good example I can look into?)
Similarly, if AsDict is the correct way to pass it into the func, can I just reference things like a dict normally is used in python? e.g., side_input.get('blah') ?
SCENARIO #2: Even more improved? (for specific cases):
The above scenario--if achievable--definitely does seem like it is superior continuous remote calls (given the simple key-value lookup), and would be very helpful for some of our scenarios. But if I take a scenario like a gazetteer lookup (like above)...is there an even more optimized solution?
Something like, for every doc, writing our all the ngrams as keys, with values as the underlying indices (docid+indices within the doc), and then doing some sort of join between these ngrams and the phrases in our gazeteer...and then doing another set of transforms to recover the original docs (now w/ their new annotations).
I.e., let Beam handle all of the joins/lookups directly?
Theoretical advantage is that Beam may be a lot quicker in doing this than, for each doc, looping over all of the ngrams and doing a check if the ngram is in the side_input.
Other key issues:
3) If this is a good way to do things, is there any trick to making this work well in the streaming scenario? Text elsewhere suggests that the side input caching works more poorly outside the batch scenario. Right now, we're focused on batch, but streaming will become relevant in serving live predictions.
4) Any Beam-related reason to prefer Java>Python for any of the above? We've got a good amount of existing Python code to move to Dataflow, so would heavily prefer Python...but not sure if there are any hidden issues with Python in the above (e.g., I've noticed Python doesn't support certain features or I/O).
EDIT: Strawman? for the example ngram lookup scenario (should generalize strongly to general K:V lookup)
Phrases = get from bigquery
Docs (indexed by docid) (direct input from text or protobufs, e.g.)
Transform: phrases -> (phrase, entity) tuples
Transform: docs -> ngrams (phrase, docid, coordinates [in document])
CoGroupByKey key=phrase: (phrase, entity, docid, coords)
CoGroupByKey key=docid, group((phrase, entity, docid, coords), Docs)
Then we can iteratively finalize each doc, using the set of (phrase, entity, docid, coords) and each Doc
Regarding the scenarios for your pipeline:
Naive scenario
You are right that per-element querying of a database is undesirable.
If your key-value store is able to support low-latency lookups by reusing an open connection, you can define a global connection that is initialized once per worker instead of once per bundle. This should be acceptable your k-v store supports efficient lookups over existing connections.
Improved scenario
If that's not feasible, then BQ is a great way to keep and pull in your data.
You can definitely use AsDict side inputs, and simply go side_input[my_key] or side_input.get(my_key).
Your pipeline could look something like so:
kv_query = "SELECT key, value FROM my:table.name"
p = beam.Pipeline()
documents_pcoll = p | ReadDocuments()
additional_data_pcoll = (p
| beam.io.BigQuerySource(query=kv_query)
# Make row a key-value tuple.
| 'format bq' >> beam.Map(lambda row: (row['key'], row['value'])))
enriched_docs = (documents_pcoll
| 'join' >> beam.Map(lambda doc, query: enrich_doc(doc, query[doc['key']]),
query=AsDict(additional_data_pcoll)))
Unfortunately, this has one shortcoming, and that's the fact that Python does not currently support arbitrarily large side inputs (it currently loads all of the K-V into a single Python dictionary). If your side-input data is large, then you'll want to avoid this option.
Note This will change in the future, but we can't be sure ATM.
Further Improved
Another way of joining two datasets is to use CoGroupByKey. The loading of documents, and of K-V additional data should not change, but when joining, you'd do something like so:
# Turn the documents into key-value tuples as well[
documents_kv_pcoll = (documents_pcoll
| 'format docs' >> beam.Map(lambda doc: (doc['key'], doc)))
enriched_docs = ({'docs': documents_kv_pcoll, 'additional_data': additional_data_pcoll}
| beam.CoGroupByKey()
| 'enrich' >> beam.Map(lambda x: enrich_doc(x['docs'][0], x['additional_data'][0]))
CoGroupByKey will allow you to use arbitrarily large collections on either side.
Answering your questions
You can see an example of using BigQuery as a side input in the cookbook. As you can see there, the data comes parsed (I believe that it comes in their original data types, but it may come in string/unicode). Check the docs (or feel free to ask) if you need to know more.
Currently, Python streaming is in alpha, and it does not support side inputs; but it does support shuffle features such as CoGroupByKey. Your pipeline using CoGroupByKey should work well in streaming.
A reason to prefer Java over Python is that all these features work in Java (unlimited-size side inputs, streaming side inputs). But it seems that for your use case, Python may have all you need.
Note: The code snippets are approximate, but you should be able to debug them using the DirectRunner.
Feel free to ask for clarification, or to ask about other aspects if you feel like it'd help.

OSM - Boundary Export from XML file

I've been attempting to export boundary information from an OSM file. My process is nearly there however I have an issue with the polygon I'm generating drawing random lines.
I would appreciate some insight on where I may be going wrong.
Step 1: Export the OSM data into XML
osmfilter -v greater-london-latest.osm --keep="boundary= admin_level= place=" > b.txt
Step 2: Run a script to process the XML.
cycle each relation node
load the member ways
load the nodes from each specified way
record the lat/lon and build a poly set
This produces a series of lat/lon which when I build them as a polygon give the correct overall shape I'm looking for. However, there are issues with the connecting lines I assume..
My polygon output
I'm actually looking for this, which is similar but Im obviously missing something.
Actual Poly Im looking to generate
Again, thanks for any help.
Ways in relations are not necessarily sorted. See answers to this question on how to sort ways, especially the answer by user geocodezip.
Alternatively you can make use of various tools/libraries to do the sorting for you. Unfortunately I can't point you directly to one but there are various tools capable of sorting relation members, including the OSM website itself, JOSM, overpass turbo (I guess), some JS stuff, [...].
Maybe some other user can help out with pointing to some good examples?

Advice (Best practices) for handling large number of large 2D arrays in HDF5 files

I am using a python program to write a 4000x4000 array into an hdf5 file.
Then, I read the data by a c-program where I need it as an input to do some simulations. I need approximately 1000 of these 4000x4000 arrays (meaning, I am doing 1000 simulation runs).
My question now is the following: Which way is "better", 1000 separate hdf5 files or one big hdf5-file with 1000 different dataset (named 'dataset_%04d')?
Any advice or best practices behaviour for this kind of problem is greatly appreciated (as I am not too familiar with hdf5).
In case, this is of interest, here is the python code I am using to write the hdf5 file:
import h5py
h5f = h5py.File( 'data_0001.h5', 'w' )
h5f.create_dataset( 'dataset_1', data=myData )
h5f.close
This is really interesting as I'm currently dealing with similar problem.
Performace
To investigate the problem a little closer, I have created following file
import h5py
import numpy as np
def one_file(shape=(4000, 4000), n=1000):
h5f = h5py.File('data.h5', 'w')
for i in xrange(n):
dataset = np.random.random(shape)
dataset_name = 'dataset_{:08d}'.format(i)
h5f.create_dataset(dataset_name, data=dataset)
print i
h5f.close()
def more_files(shape=(4000, 4000), n=1000):
for i in xrange(n):
file_name = 'data_{:08d}'.format(i)
h5f = h5py.File(file_name, 'w')
dataset = np.random.random(shape)
h5f.create_dataset('dataset', data=dataset)
h5f.close()
print i
Then, in IPython,
>>> from testing import one_file, more_files
>>> %timeit one_file(n=25) # with n=25, the resulting file is 3.0GB
1 loops, best of 3: 42.5 s per loop
>>> %timeit more_files(n=25)
1 loops, best of 3: 41.7 s per loop
>>> %timeit one_file(n=250)
1 loops, best of 3: 7min 29s per loop
>>> %timeit more_files(n=250)
1 loops, best of 3: 8min 10s per loop
The difference is quite surprising to me, for n=25 having more files is faster, however this is no longer truth for more datasets.
Experience
As others noted in comments, there is probably no correct answer as this is very problem specific. I deal with hdf5 files for my research in plasma physics. I don't know if it helps you but I can share my hdf5 experience.
I'm running lots of simulations and output for a given simulation used to go to one hdf5 file. When a simulation finished, it dumped it's state to this hdf5 file, so later I was able to take this state and extend simulation from that point (I could change some parameters as well and I don't need to start from scratch). The output from this simulation went again to the same file. This was great - I had only one file for one simulation. However, there are certain drawbacks with this approach:
When a simulation crashes, you end up with a file that is not 'complete' - you can't start new simulation from that file.
There is no simple way, how you can safely take a look into hdf5 file when another process is writing to that file. If it happen that you try to read from and another process is writing to, you end up corrupted file and all you data is lost!
I don't know of any simple way how you can delete groups from a file (I anyone know a way, let me know). So, if I need to restructure a file, I need to create new one from it (h5copy, h5repack, ...).
So I end up with this approach which works much better:
I'm periodically flushing state from a simulation and after that I'm writing to a new file. If simulation crashes, I need only delete last file and I don't lost that much of cpu time.
I'm currently only plotting data from all files but the last one. Note that there is another way: see here, but my approach is definitely simpler and I'm ok with that.
It is much better to process more small files than one huge file - you see the progress and so on.
Hope this helps.
A little late to the party, I know, but I thought I'd share my experiences. My data sizes are smaller, but from a simplicity-of-analysis standpoint I actually prefer one large (1000, 4000, 4000) dataset. In your case, it looks like you'd need to use the maxshape property to make it extendable as you create new results. Saving multiple separate datasets makes it hard to look at trends across datasets since you have to slice them all separately. With one dataset you could do eg. data[:, 5, 20] to look across the 3rd axis. Also, to address the corruption problem, I highly recommend using h5py.File as a context manager:
with h5py.File('myfilename') as f:
f.create_dataset('mydata', data=data, maxshape=(1000, 4000, 4000))
This automatically closes the file even if there is an exception. I used to curse incessantly due to corrupted data and then I started doing this and haven't had a problem since.

F# - Organisation of algorithms in a file

I do not find a good way to organize various algorithms. Today the file is like this :
1/ Extraction of values from Excel
2/ First algorithm based on these values (extracted from Excel) starting with
"let matriceAlgo1 ="
3/ Second algorithm starting from the same values
"let matriceAlgo2 ="
4/ Synthesis algorithm, doing a weighted average (depending on several values) of the 2/ and 3/ and selecting the result to be shown.
"let matriceSynthesis ="
My question is the following : what should i put before the different parts of this file in order to just call them by there name ? I have seen answers explaining that Module could be an answer but I don't know how to apply it in my case (or anything else if it's not the good answer).At the end, I would like to be able to write something like this :
"launch Extraction
launch First Algorithm
launch Second Algorithm
Launch Synthesis"
The way I usually organize files is to have some clear visual separator between different sections of a file (see for example Crawler.fsx on GitHub) and then have one "main" section at the end that calls functions declared previously.
I don't really use modules unless I have a large number of functions with clashing names. It would be good idea to use modules if your algorithm consists of more functions (e.g. Alg1.initialize, Alg1.run, etc.). Then you could easily switch between using different algorithms using module alias:
module Alg = Alg1 // or Alg2
let a = Alg.initialize
Alg.run a
If the file is getting longer, then you could also move sections to separate files and use #load "File.fs" to load algorithms or functions from a file. In that case, you probably need to use modules, but you can always open the module after loading the file.

Comparing using Map Reduce(Cloudera Hadoop 0.20.2) two text files of size of almost 3GB

I'm trying to do the following in hadoop map/reduce( written in java, linux kernel OS)
Text files 'rules-1' and 'rules-2' (total 3GB in size) contains some rules, each rule are separated by endline character, so the files can be read using readLine() function.
These files 'rules-1' and 'rules-2' needs to be imported as a whole from hdfs in every map function in my cluster i.e. these file are not splittable across different map function.
Input to the mapper's map function is a text file called 'record' (each line is terminated by endline character), so from the 'record' file we get the (key, value) pair. The file is splittable and can be given as input to different map function used in the whole map/reduce process.
What needs to be done is compare each value(i.e. lines from record file) with the rules inside 'rules-1' and 'rules-2'
Problem is, if I pull out each line of rules-1 and rules-2 files to a static arraylist only once, so that each mapper can share the same arraylint and try to compare elements in the arraylist with the each input value from the record file, I get a memory overflow error, since 3GB cannot be stored at a time in the arraylist.
Alternatively, if I import only few lines from the rules-1 and rules-2 files at a time and compare them to each value, map/reduce is taking a lot time to finish its job.
Could you guys provide me any other alternative ideas how can this be done without the memory overflow error? Will it help if I put those file-1 and file-2 inside a hdfs supporting database or something? I'm going out of ideas actually.Would really appreciate if some of you guys could provide me your valuable suggestions.
Iif you input files are small - you can load them into static variables and use rules as an input.
If above is not a case I can suggest the following ways:
a) To give rule-1 and rule-2 high replication factor close to the number of nodes you have. Then you can read from HDFS rule=1 and rule-2 for each record in the input relatively efficient - because it will be sequential read from the local datanode.
b) If you can consider some hash function which, when applied to the rule and to the input string will predict without false negatives that they can match - then you can emit this hash for rules, input record and resolve all possible matches in the reducer. It will be very similar to the way how a join is done using MR
c) I would consider some other optimization techniques like building search trees, or sorting since otherwise the problem looks computationally expensive and will took forever...
On this page find Real-World Cluster Configurations
it will cover file size configuration
You could use the param "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory for your mappers. You might not be able to run as many map slots per server but with more servers in your cluster you could still parallelize your job.
Read the content text file from the MapReduce function and read the keyword text file from the mapper function (for reading your HDFS) and split using StringTokenizer value.toString reading from MapReduce and in your mapper function write HDFS read text file code it will read line-by-line so use two while loops here you compare. Whenever you want data send it to reducer.
Split the 3gb text file into several text files and apply that all text files as usual MapReduce your previous program.
For splitting text file I written Java program and you decide how many lines you want write in each text file.

Resources