F# - Organisation of algorithms in a file - f#

I do not find a good way to organize various algorithms. Today the file is like this :
1/ Extraction of values from Excel
2/ First algorithm based on these values (extracted from Excel) starting with
"let matriceAlgo1 ="
3/ Second algorithm starting from the same values
"let matriceAlgo2 ="
4/ Synthesis algorithm, doing a weighted average (depending on several values) of the 2/ and 3/ and selecting the result to be shown.
"let matriceSynthesis ="
My question is the following : what should i put before the different parts of this file in order to just call them by there name ? I have seen answers explaining that Module could be an answer but I don't know how to apply it in my case (or anything else if it's not the good answer).At the end, I would like to be able to write something like this :
"launch Extraction
launch First Algorithm
launch Second Algorithm
Launch Synthesis"

The way I usually organize files is to have some clear visual separator between different sections of a file (see for example Crawler.fsx on GitHub) and then have one "main" section at the end that calls functions declared previously.
I don't really use modules unless I have a large number of functions with clashing names. It would be good idea to use modules if your algorithm consists of more functions (e.g. Alg1.initialize, Alg1.run, etc.). Then you could easily switch between using different algorithms using module alias:
module Alg = Alg1 // or Alg2
let a = Alg.initialize
Alg.run a
If the file is getting longer, then you could also move sections to separate files and use #load "File.fs" to load algorithms or functions from a file. In that case, you probably need to use modules, but you can always open the module after loading the file.

Related

Build vocab in doc2vec

I have a list of abstracts and articles approx 500 in csv each paragraph contains approx 800 to 1000 words whenever I build vocab and print with words giving none and how I can improve results?
lst_doc = doc.translate(str.maketrans('', '', string.punctuation))
target_data = word_tokenize(lst_doc)
train_data = list(read_data())
model = gensim.models.doc2vec.Doc2Vec(vector_size=50, min_count=2, epochs=40)
train_vocab = model.build_vocab(train_data)
print(train_vocab)
{train = model.train(train_data, total_examples=model.corpus_count,
epochs=model.epochs) }
Output:
None
A call to build_vocab() only builds the vocabulary inside the model, for further usage. That function call doesn't return anything, so your train_vocab variable will be Python None.
So, the behavior you're seeing is as expected, and you should say more about what your ultimate aims are, and what you'd want to see as steps towards those aims, if you're stuck.
If you want to see reporting of the progress of your calls to build_vocab() or train(), you can set the logging level to INFO. This is always a usually a good idea working to learn a new library: even if initially the copious info shown is hard to understand, by reviewing it you'll start to see the various internal steps, and internal counts/timings/etc, that hint whehter things are doing well or poorly.
You can also examine the state of the model and its various internal properties after the code has run.
For example, the model.wv property contains, after build_vocab(), a Gensim KeyedVectors structure holding all the untrained ready-for-training vectors. You can ask for its length (len(model.wv) or examine the discovered active list of words (model.wv.index_to_key).
Other comments:
It's not clear your 1st two lines – assigning into lst_doc and target_data – affect anything further, since it's unclear what read_data() might be doing to fill the train_corpus.
Often low min_count values worsen results, by including more words that have so few usage examples that they're little more than noise during training.
only 500 documents is rather small compared to most published work showing impressive results with this algorithm, which uses tens-of-thousands of documents (if not millions). So, keep in mind that results on such a small dataset may be unrepresentative of what's possible with a larger corpus - in terms of quality, optimal parameters, etc.

When are placeholders necessary?

Every TensorFlow example I've seen uses placeholders to feed data into the graph. But my applications work fine without placeholders. According to the documentation, using placeholders is the "best practice", but they seem to make the code unnecessarily complex.
Are there any occasions when placeholders are absolutely necessary?
According to the documentation, using placeholders is the "best practice"
Hold on, this quote is out-of-context and could be misinterpreted. Placeholders are the best practice when feeding data through feed_dict.
Using a placeholder makes the intent clear: this is an input node that needs feeding. Tensorflow even provides a placeholder_with_default that does not need feeding — but again, the intent of such a node is clear. For all purposes, a placeholder_with_default does the same thing as a constant — you can indeed feed the constant to change its value, but is the intent clear, would that not be confusing? I doubt so.
There are other ways to input data than feeding and AFAICS all have their uses.
A placeholder is a promise to provide a value later.
Simple example is to define two placeholders a,b and then an operation on them like below .
a = tf.placeholder(tf.float32)
b = tf.placeholder(tf.float32)
adder_node = a + b # + provides a shortcut for tf.add(a, b)
a,b are not initialized and contains no data Because they were defined as placeholders.
Other approach to do same is to define variables tf.Variable and in this case you have to provide an initial value when you declare it.
like :
tf.global_variables_initializer()
or
tf.initialize_all_variables()
And this solution has two drawbacks
Performance wise that you need to do one extra step with calling
initializer however these variables are updatable .
in some cases you do not know the initial values for these variables
so you have to define it as a placeholder
Conclusion :
use tf.Variable for trainable variables such as weights (W) and biases (B) for your model or when Initial values are required in
general.
tf.placeholder allows you to create operations and build computation graph, without needing the data. In TensorFlow
terminology, we then feed data into the graph through these
placeholders.
I really like Ahmed's answer and I upvoted it, but I would like to provide an alternative explanation that might or might not make things a bit clearer.
One of the significant features of Tensorflow is that its operation graphs are compiled and then executed outside of the original environment used to build them. This allows Tensorflow do all sorts of tricks and optimizations, like distributed, platform independent calculations, graph interoperability, GPU computations etc. But all of this comes at the price of complexity. Since your graph is being executed inside its own VM of some sort, you have to have a special way of feeding data into it from the outside, for example from your python program.
This is where placeholders come in. One way of feeding data into your model is to supply it via a feed dictionary when you execute a graph op. And to indicate where inside the graph this data is supposed to go you use placeholders. This way, as Ahmed said, placeholder is a sort of a promise for data supplied in the future. It is literally a placeholder for things you will supply later. To use an example similar to Ahmed's
# define graph to do matrix muliplication
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
# this is the actual operation we want to do,
# but since we want to supply x and y at runtime
# we will use placeholders
model = tf.matmul(x, y)
# now lets supply the data and run the graph
init = tf.global_variables_initializer()
with tf.Session() as session:
session.run(init)
# generate some data for our graph
data_x = np.random.randint(0, 10, size=[5, 5])
data_y = np.random.randint(0, 10, size=[5, 5])
# do the work
result = session.run(model, feed_dict={x: data_x, y: data_y}
There are other ways of supplying data into the graph, but arguably, placeholders and feed_dict is the most comprehensible way and it provides most flexibility.
If you want to avoid placeholders, other ways of supplying data are either loading the whole dataset into constants on graph build or moving the whole process of loading and pre-processing the data into the graph by using input pipelines. You can read up on all of this in the TF documentation.
https://www.tensorflow.org/programmers_guide/reading_data

PENTAHO data integration data source/destination mapping

I'm reaching you hoping to find answers about Pentaho data integrator limitation.
I'm currentlty working on a 1 to 1 data source integration and would like to make it n to 1-n. This requires dynamic jobs creation and would like to know if any of came across such issue. My 1 to 1 is working perfectly, it integration form differents data source types (CSV, databases "Mysql, Oracle ...) to same date destination and need to make it n to 1-n.
There is a Metadata Injection Step just for that.
A use case similar to yours is described by Diethard here.
Because it seams that you have a lot of different source format, it may be a good investment to read the use case of Jens, the author of the step, here, which (apart for the automation) is precisely your case.
AFAIK in Pentaho DI, it is not possible to create dynamic transformations for any random data sources. PDI looks for the input columns to be available in the input stream before it loads the data to the target database. For example, if you are using 1 data source (in MySQL) and loading the same to the csv output, the csv output step is expecting the presence of input columns in the data source step (Table input). If you are trying to load any n random data sources you need to define input columns/fields for each of them individually.
Alternatively there are few things which you can explore:
1. Fast Dump in Text File Output step:
There is an option to fast data dump the data set in Text file output step. Here you don't need to define any output column. The input fields will be automatically dumped without formatting as it is. You can use this to map all of the input sources to a csv format and then load it to their targets.
2. Extending Java and Kettle together to build a solution:
PDI allows you to create custom JAVA codes on top of kettle. You can check this blog for more. You can use this idea to create custom code to pass n data sources fields to the kettle as a parameter and execute them. {note: i haven't tried this step, just thinking out loud here}
Hope this helps :)

How to use external data in an OSRM profile

It this Mapbox blog post, Lauren Budorick shares how they got working a routing engine with OSRM that uses elevation data in order to give cyclists better routes... AMAZING!
I also want to explore the potential of OSRM's routing when plugging in external (user-generated) data, but I'm still having a hard time grasping how OSRM's profiles work. I think I get the main idea, that every way (or node?) is piped into a few functions that, all toghether, scores how good that path is.
But that's it, there are plenty of missing parts in my head, like what do each of the functions Lauren uses in her profile do. If anyone could point me to some more detailed information on how all of this works, you'd make my next week much, much easier :)
Also, in Lauren's post, inside source_function she loads a ./srtm_bayarea.asc file. What does that .asc file looks like? How would one generate a file like that one from, let's say, data stored in a pgsql database? Can we use some other format, like GeoJSON?
Then, when in segment_function she uses things like source.lon and target.lat, are those refered to the raw data stored in the asc file? Or is that file processed into some standard that maps everything to comply it?
As you can see, I'm a complete newbie on routing and maybe GIS in general, but I'd love to learn more about this standards and tools that circle around the OSRM ecosystem. Can you share some tips with me?
I think I get the main idea, that every way (or node?) is piped into a few functions that, all toghether, scores how good that path is.
Right, every way and every node are scored as they are read from an OSM dump to determine passability of a node and speed of a way (used as the scoring heuristic).
A basic description of the data format can be found here. As it reads, data immediately available in ArcInfo ASCII grids includes SRTM data. Currently plaintext ASCII grids are the only supported format. There are several great Python tools for GIS developers that may help in converting other data types to ASCII grids - check out rasterio, for example. Here's an example of a really simple python script to convert NED IMGs to ASCII grids:
import sys
import rasterio as rio
import numpy as np
args = sys.argv[1:]
with rio.drivers():
with rio.open(args[0]) as src:
elev = src.read()[0]
profile = src.profile
def shortify(x):
if x == profile['nodata']:
return -9999
elif x == np.finfo(x).tiny:
return 0
else:
return int(round(x))
out_elev = [map(shortify, row) for row in elev]
with open(args[0] + '.asc', 'a') as dst:
np.savetxt(dst, np.array(out_elev),fmt="%s",delimiter=" ")
source.lon and target.lat e.g: source and target are nodes provided as arguments by the extraction process. Their coordinates are used to look up data at each location during extraction.
Make sure to read thoroughly through the relevant wiki page (already linked).
Feel free alternately to open a Github issue in
https://github.com/Project-OSRM/osrm-backend/issues with OSRM
questions.

Comparing using Map Reduce(Cloudera Hadoop 0.20.2) two text files of size of almost 3GB

I'm trying to do the following in hadoop map/reduce( written in java, linux kernel OS)
Text files 'rules-1' and 'rules-2' (total 3GB in size) contains some rules, each rule are separated by endline character, so the files can be read using readLine() function.
These files 'rules-1' and 'rules-2' needs to be imported as a whole from hdfs in every map function in my cluster i.e. these file are not splittable across different map function.
Input to the mapper's map function is a text file called 'record' (each line is terminated by endline character), so from the 'record' file we get the (key, value) pair. The file is splittable and can be given as input to different map function used in the whole map/reduce process.
What needs to be done is compare each value(i.e. lines from record file) with the rules inside 'rules-1' and 'rules-2'
Problem is, if I pull out each line of rules-1 and rules-2 files to a static arraylist only once, so that each mapper can share the same arraylint and try to compare elements in the arraylist with the each input value from the record file, I get a memory overflow error, since 3GB cannot be stored at a time in the arraylist.
Alternatively, if I import only few lines from the rules-1 and rules-2 files at a time and compare them to each value, map/reduce is taking a lot time to finish its job.
Could you guys provide me any other alternative ideas how can this be done without the memory overflow error? Will it help if I put those file-1 and file-2 inside a hdfs supporting database or something? I'm going out of ideas actually.Would really appreciate if some of you guys could provide me your valuable suggestions.
Iif you input files are small - you can load them into static variables and use rules as an input.
If above is not a case I can suggest the following ways:
a) To give rule-1 and rule-2 high replication factor close to the number of nodes you have. Then you can read from HDFS rule=1 and rule-2 for each record in the input relatively efficient - because it will be sequential read from the local datanode.
b) If you can consider some hash function which, when applied to the rule and to the input string will predict without false negatives that they can match - then you can emit this hash for rules, input record and resolve all possible matches in the reducer. It will be very similar to the way how a join is done using MR
c) I would consider some other optimization techniques like building search trees, or sorting since otherwise the problem looks computationally expensive and will took forever...
On this page find Real-World Cluster Configurations
it will cover file size configuration
You could use the param "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory for your mappers. You might not be able to run as many map slots per server but with more servers in your cluster you could still parallelize your job.
Read the content text file from the MapReduce function and read the keyword text file from the mapper function (for reading your HDFS) and split using StringTokenizer value.toString reading from MapReduce and in your mapper function write HDFS read text file code it will read line-by-line so use two while loops here you compare. Whenever you want data send it to reducer.
Split the 3gb text file into several text files and apply that all text files as usual MapReduce your previous program.
For splitting text file I written Java program and you decide how many lines you want write in each text file.

Resources