Specifically, how does it distribute the work over the tensor on the machine, and how is the batch_dims gathering implemented at a lower level?
I tried tracing through the Python code and searching the GitHub repository, but I'm confused by the tensor, array, and list operators and can't tell where the actual implementation lives.
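For reference, here is how far I've gotten: I can reproduce the documented batch_dims semantics in plain Python, but this is only a behavioural sketch, not the low-level kernel (which, as far as I can tell, is C++ code under tensorflow/core/kernels):

    import tensorflow as tf

    params = tf.reshape(tf.range(2 * 3 * 4), (2, 3, 4))   # shape (2, 3, 4)
    indices = tf.constant([[2, 0], [1, 2]])                # shape (2, 2)

    # Built-in batched gather: with axis=1 and batch_dims=1,
    # result[b, j, :] == params[b, indices[b, j], :]
    batched = tf.gather(params, indices, axis=1, batch_dims=1)

    # The same thing "by hand": one plain gather per batch element, then stack.
    manual = tf.stack([tf.gather(params[b], indices[b]) for b in range(2)])

    print(tf.reduce_all(batched == manual).numpy())  # True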
I'm planning a TFF scheme in which the clients send the server data besides the weights, such as their hardware information (e.g. CPU frequency). To achieve that, I need to call functions of third-party Python libraries, like psutil. Is it possible to serialize (using tff.tf_computation) such functions?
If not, what could be a solution to achieve this objective in a scenario where I'm using a remote executor setup through gRPC?
Unfortunately no, this does not work without modification. TFF uses TensorFlow graphs to serialize the computation logic to run on remote machines. TFF does not interpret Python code on the remote machines.
There may be a solution in creating a TensorFlow custom op. This would mean writing C++ code to retrieve the CPU frequency, and then a Python API to add the operation to the TensorFlow graph during computation construction. TensorFlow's "Create an op" guide provides detailed instructions.
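As a rough sketch of what the Python side could look like, assuming you have already built a hypothetical cpu_info_op.so that registers a cpu_frequency op (both names are placeholders, not an existing TensorFlow op):

    import tensorflow as tf
    import tensorflow_federated as tff

    # Hypothetical shared library built from your C++ op; the file name and
    # the op name are placeholders for whatever you register in C++.
    cpu_info = tf.load_op_library("./cpu_info_op.so")

    @tff.tf_computation
    def report_cpu_frequency():
        # Because the op is part of the serialized TensorFlow graph, it runs
        # on the remote worker that executes the computation, not on the
        # machine that constructed it.
        return cpu_info.cpu_frequency()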
I want to know if there is any basic difference between how the spacy_sklearn and tensorflow_embedding pipelines operate under the hood. I mean, tensorflow_embedding must also be using the same concepts of word embeddings, reducing the dimensionality of data using PCA, etc. Is the only difference, then, that spacy_sklearn has some pre-trained data to draw upon in the form of pre-trained vectors, while the tensorflow pipeline does not? Is my understanding correct? Also, how is the tensorflow_embedding pipeline related to the TensorFlow framework offered by Google?
I tried looking up the TensorFlow framework on Google, but could not get any specific answer. I also searched the Rasa community pages, but again found no help.
The spacy_sklearn pipeline uses pre-trained word vectors. This is useful if we don't have very much training data.
The tensorflow_embedding pipeline doesn't use any pre-trained word vectors; it is fitted specifically to our dataset. The advantage of the tensorflow_embedding pipeline is that the word vectors will be customised for our domain.
For more information, please refer to the link below:
https://rasa.com/docs/nlu/choosing_pipeline/
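If it helps, switching between the two pipelines is just a change to the pipeline entry in the NLU config. A rough sketch using the pre-1.0 rasa_nlu Python API (the data path is a placeholder; check the docs for the exact API of your Rasa version):

    from rasa_nlu.training_data import load_data
    from rasa_nlu.config import RasaNLUModelConfig
    from rasa_nlu.model import Trainer

    # Placeholder path to your NLU training examples.
    training_data = load_data("data/nlu.md")

    # "spacy_sklearn" relies on pre-trained spaCy word vectors, while
    # "tensorflow_embedding" learns embeddings from scratch on your own data.
    config = RasaNLUModelConfig({"language": "en",
                                 "pipeline": "tensorflow_embedding"})

    trainer = Trainer(config)
    interpreter = trainer.train(training_data)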
So as a fun project, I've been messing with the TensorFlow API (in Java, unfortunately, but I should be able to get some results out anyway). My first goal is to develop a model for 2D point cloud filtering. So I have written code that generates random clouds at 224x172 resolution, computes the result of a neighbor-density filter, and stores both (see images below).
So basically I have generated data for both an input and an expected output, and this can be done as often as needed to build a massive dataset.
I have both the input and output arrays stored as 224x172 binary arrays (0 for no point at index, 1 for a point at that index). So my input and output are both 224x172. At this point, I'm not sure how to translate my input to my expected result. I'm not sure how to weight each "pixel" of my cloud, or how to "teach" the program the expected result. Any suggestions/guidance on whether this is even possible for my given scenario would be appreciated!
Please don't be too hard on me... I'm a complete noob when it comes to machine learning.
Imagine that TensorFlow is a set of building blocks (like LEGO) that lets you construct machine learning models. After a model is constructed, it can be trained and evaluated.
So basically your question can be divided into three steps:
1. I'm new to machine learning. Please guide me in choosing a model that fits the task.
2. I'm new to TensorFlow. I have an idea for a model (see 1) and I want to construct it with TensorFlow.
3. I'm new to TensorFlow's Java API. I know how to build a model using TensorFlow (see 2), but I'm stuck in Java.
This sounds scary, but it's really not too bad. I'd suggest the following plan:
1. You need to look through machine learning models to find one that suits your case. So ask yourself: which models could be used for cloud filtering? And, more fundamentally, do you really need a machine learning model at all? Why not use simple math formulas?
Those are the questions you might ask yourself.
OK, assume you've found some model. For example, you've found a paper describing a neural network able to solve your task. Then go to the next step.
2. TensorFlow has a lot of examples and code snippets, so you may even find your model already implemented somewhere.
The bad thing is that most code examples and most of the API are Python-based. But since you want to get into machine learning, I'd suggest studying Python. It's easy to pick up, and it's very common in the scientific world because it lets you avoid wasting time on wrappers, configuration, and so on (as Java requires); you just start solving your task from the first line of the script. (A minimal sketch for your 224x172 setup appears at the end of this answer.)
As I said at the start about the TensorFlow and LEGO similarity, I'd add that there are higher-level additions to TensorFlow, so you work not with individual building blocks but with whole layers of blocks.
Something like tflearn. It's very good, especially if you don't have a deep math or machine learning background: it lets you build machine learning models in a very simple and understandable way. Need to add a neural network layer? Here you are, and all without complex low-level tensor operations.
The disadvantage is that you won't be able to load a tflearn model from Java.
Anyway, we assume that at the end of this step you are able to build your model, train it, and evaluate the model and the prediction quality.
3. So you have your machine learning model and you understand TensorFlow mechanics; if you still need to work with Java, that should now be much easier.
Note that you won't be able to load a tflearn model from Java. You could try Jython to call Python functions directly from Java, though I haven't tried it.
And along this path (1-3) you will definitely have more questions. So welcome to SO.
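Since step 2 is the one people usually get stuck on, here is the minimal sketch promised above: one way to frame your 224x172 in/out problem as an image-to-image model using tf.keras in Python. The random arrays only stand in for your generated clouds and filter results, and the layer sizes are illustrative guesses rather than tuned values:

    import numpy as np
    import tensorflow as tf

    # Stand-ins for your generated data: 0/1 grids of shape 224x172 with one
    # channel; input cloud -> expected filtered cloud.
    x_train = np.random.randint(0, 2, size=(1000, 224, 172, 1)).astype("float32")
    y_train = np.random.randint(0, 2, size=(1000, 224, 172, 1)).astype("float32")

    # A small fully convolutional network: each output "pixel" is the
    # predicted probability that a point survives the neighbor-density filter.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(224, 172, 1)),
        tf.keras.layers.Conv2D(16, 5, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(16, 5, padding="same", activation="relu"),
        tf.keras.layers.Conv2D(1, 1, padding="same", activation="sigmoid"),
    ])

    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=3, batch_size=32)

    # Threshold the per-pixel probabilities to get a 0/1 cloud back.
    filtered = (model.predict(x_train[:1]) > 0.5).astype("int32")

Once something like this trains in Python, you can export it as a SavedModel and load it from the Java API, which keeps the Java side small.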
In a paper titled "Machine Learning at the Limit," Canny et al. report substantial word2vec processing-speed improvements.
I'm working with the BIDMach library used in this paper, and cannot find any resource that explains how Word2Vec is implemented or how it should be used within this framework.
There are several scripts in the repo:
getw2vdata.sh
getwv2data.ssc
I've tried running them (after building the referenced tparse2.exe file) with no success.
I've tried modifying them to get them to run, but got nothing but errors back.
I emailed the author and posted an issue on the GitHub repo, but have gotten nothing back. I only heard from somebody else having the same trouble, who says he got it to run, but at much slower speeds than reported, despite newer GPU hardware.
I've searched all over trying to find anyone who has used this library to achieve these speeds, with no luck. There are multiple references floating around that point to this library as the fastest implementation out there and cite the numbers in the paper:
Intel research references the reported numbers without running the code on a GPU (they cite the numbers reported in the original paper)
an old reddit post pointing to BIDMach as the best (but the OP says "I haven't tested BIDMach myself yet")
an SO post citing BIDMach as the best (the OP doesn't actually run the library to make this claim...)
many more, not worth listing, that cite BIDMach as the best/fastest without examples, or with claims of "I haven't tested it myself..."
When I search for a similar library (gensim) and the import code required to run it, I find thousands of results and tutorials, but a similar search for the BIDMach code yields only the BIDMach repo itself.
This BIDMach implementation certainly carries the reputation for being the best, but can anyone out there tell me how to use it?
All I want to do is run a simple training process to compare it to a handful of other implementations on my own hardware.
Every other implementation of this concept that I can find either works with the original shell-script test file, provides actual instructions, or provides shell scripts of its own to test with.
UPDATE:
The author of the library has added additional shell scripts to get the previously mentioned scripts running, but exactly what they mean or how they work is still a total mystery, and I can't figure out how to run the word2vec training procedure on my own data.
EDIT (for bounty)
I'll give the bounty to anyone who can explain how I'd use my own corpus (text8 would be great), train a model, and then save the output vectors and the vocabulary to files that can be read by Omer Levy's Hyperwords.
This is exactly what the original C implementation would do with arguments -binary 1 -output vectors.bin -save-vocab vocab.txt
This is also what Intel's implementation does, and other CUDA implementations, etc, so this is a great way to generate something that can be easily compared with other versions...
UPDATE (bounty expired without answer)
John Canny has updated a few scripts in the repo and added an fmt.txt file, making it possible to run the test scripts that are packaged in the repo.
However, my attempt to run this with the text8 corpus yields near 0% accuracy on the Hyperwords test.
Running the training process on the billion-word benchmark (which is what the repo scripts now do) also yields well-below-average accuracy on the Hyperwords test.
So, either the library never yielded accuracy on these tests, or I'm still missing something in my setup.
The issue remains open on github.
BIDMach's Word2vec is a tool for learning vector representations of words, also known as word embeddings. To use it, you first need to download and install BIDMach, an open-source machine learning library written in Scala. Once BIDMach is installed, you can use its word2vec functionality to train a model on a corpus of text data; the training call takes a number of parameters, such as the size of the word vectors, the number of epochs to train for, and the type of model to use. You can find more detailed instructions and examples in the BIDMach documentation.
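This isn't a BIDMach answer, but as a point of comparison, producing output equivalent to the original C implementation's -binary 1 -output vectors.bin -save-vocab vocab.txt can be done with gensim roughly like this (gensim 4.x API assumed; the corpus path and hyperparameters are illustrative):

    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    # Placeholder path to the unzipped text8 corpus; parameters roughly
    # mirror the defaults of the original C implementation.
    sentences = Text8Corpus("text8")
    model = Word2Vec(sentences, vector_size=200, window=5, negative=5,
                     min_count=5, sg=1, workers=4)

    # Binary vectors plus a vocabulary file, the equivalent of
    # -binary 1 -output vectors.bin -save-vocab vocab.txt
    model.wv.save_word2vec_format("vectors.bin", fvocab="vocab.txt", binary=True)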
Following the Spark MLlib Guide we can read that Spark has two machine learning libraries:
spark.mllib, built on top of RDDs.
spark.ml, built on top of Dataframes.
According to this and this question on StackOverflow, DataFrames are better (and newer) than RDDs and should be used whenever possible.
The problem is that I want to use common machine learning algorithms (e.g. Frequent Pattern Mining, Naive Bayes, etc.) and spark.ml (for DataFrames) doesn't provide such methods; only spark.mllib (for RDDs) provides these algorithms.
If DataFrames are better than RDDs and the referred guide recommends the use of spark.ml, why aren't the common machine learning methods implemented in that lib?
What's the missing point here?
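For concreteness, here is roughly what using one of these algorithms looks like today with the RDD-based spark.mllib API (a PySpark sketch with toy data, just to show the shape of the API):

    from pyspark import SparkContext
    from pyspark.mllib.classification import NaiveBayes
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    sc = SparkContext(appName="mllib-naive-bayes-sketch")

    # spark.mllib works on RDDs of LabeledPoint, not on DataFrames.
    data = sc.parallelize([
        LabeledPoint(0.0, Vectors.dense([1.0, 0.0])),
        LabeledPoint(0.0, Vectors.dense([2.0, 0.0])),
        LabeledPoint(1.0, Vectors.dense([0.0, 1.0])),
        LabeledPoint(1.0, Vectors.dense([0.0, 2.0])),
    ])

    model = NaiveBayes.train(data)
    print(model.predict(Vectors.dense([0.0, 1.0])))  # expected: 1.0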
Spark 2.0.0
Currently Spark is moving strongly towards the DataFrame API, with ongoing deprecation of the RDD API. While the number of native "ML" algorithms is growing, the main points highlighted below are still valid, and internally many stages are still implemented directly using RDDs.
See also: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Spark < 2.0.0
I guess that the main missing point is that spark.ml algorithms in general don't operate on DataFrames, so in practice it is more a matter of having an ml wrapper than anything else. Even a native ML implementation (like ml.recommendation.ALS) uses RDDs internally.
Why not implement everything from scratch on top of DataFrames? Most likely because only a very small subset of machine learning algorithms can actually benefit from the optimizations currently implemented in Catalyst, not to mention be efficiently and naturally implemented using the DataFrame API / SQL.
The majority of ML algorithms require an efficient linear algebra library, not tabular processing. Using a cost-based optimizer for linear algebra could be an interesting addition (I think Flink already has one), but it looks like for now there is nothing to gain here.
The DataFrame API gives you very little control over the data. You cannot use a custom partitioner*, you cannot access multiple records at a time (I mean a whole partition, for example), you're limited to a relatively small set of types and operations, and you cannot use mutable data structures, and so on.
Catalyst applies local optimizations. If you pass it a SQL query / DSL expression it can analyze it, reorder it, and apply early projections. All of that is great, but typical scalable algorithms require iterative processing, so what you really want to optimize is the whole workflow, and DataFrames alone are not faster than plain RDDs and, depending on the operation, can actually be slower.
Iterative processing in Spark, especially with joins, requires fine-grained control over the number of partitions; otherwise weird things happen. DataFrames give you no control over partitioning. Also, DataFrame / Dataset don't provide native checkpoint capabilities (fixed in Spark 2.1), which makes iterative processing almost impossible without ugly hacks (a short sketch of this point follows below).
Ignoring low-level implementation details, some groups of algorithms, like FPM, don't fit very well into the model defined by ML pipelines.
Many optimizations are limited to native types, not UDT extensions like VectorUDT.
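A small PySpark sketch of the partitioning and checkpointing point mentioned above (the checkpoint path and sizes are placeholders):

    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="partitioning-sketch")
    sqlContext = SQLContext(sc)
    sc.setCheckpointDir("/tmp/checkpoints")  # placeholder path

    pairs = sc.parallelize([(i % 10, i) for i in range(1000)])

    # RDD API: explicit control over how the data is partitioned, plus
    # checkpoint() to cut the lineage between iterations.
    partitioned = pairs.partitionBy(10)
    partitioned.checkpoint()
    partitioned.count()  # materialize so the checkpoint actually happens

    # DataFrame API: repartition() exists, but you get no say in the
    # partitioner, and before Spark 2.1 there was no DataFrame checkpoint().
    df = sqlContext.createDataFrame(pairs, ["key", "value"])
    df = df.repartition(10)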
There is one more problem with DataFrames, which is not really related to machine learning: when you decide to use a DataFrame in your code you give away almost all the benefits of static typing and type inference. It is highly subjective whether you consider this a problem, but one thing is for sure: it doesn't feel natural in the Scala world.
Regarding better, newer and faster I would take a look at Deep Dive into Spark SQL’s Catalyst Optimizer, in particular the part related to quasiquotes:
The following figure shows that quasiquotes let us generate code with performance similar to hand-tuned programs.
* This has been changed in Spark 1.6 but it is still limited to default HashPartitioning