Is there any Lua implementation of data frames - structures for data analysis which? Something like Python pandas. I want to do some statistical operations using LuaJIT.
Yes, there is now. Check out torch-dataframe that I and Alex are developing. Our main priority is reliability that we check with a rich test suite. Performance comes second although we try to avoid performance hogs within the limits of Lua.
The package is currently far from the pandas sophistication, but feel free to contribute with any methods you feel lacking.
You may want to look at Torch7 that provides N-dimensional arrays with support for various statistical and mathematical operations and is based on LuaJIT.
Related
With respect specifically to CatBoost:
Under what scenarios might one want to use fewer than the max number of threads of one's CPU? I cannot find an answer to this.
Is there a fixed cost/overhead associated with each core utilized? I.e., is more always better for all data set types/sizes?
Do the answers to the questions above generalize to all machine learning algorithms?
I think that most of the reasons for changing the thread_count are not catboost specific. Other libraries like sklearn offer the same feature. Reasons for not running with all CPUs are:
Debugging: If there is a problem it might be handy to only have one thread thus making the process more simple.
You want other processes on your machine to have CPU power. Especially if you have a server for in-memory data analysis shared by a team of data scientists. Your colleagues won't be happy if you take all resources.
Your job is so small that it simply does not need all the resources.
Your parallelize in another way: For example you try different hyper parameters using cross validation. Then it would make sense to dedicate one CPU to training one model rather than training a model with with all CPUs and then move on to train the next model with all CPUs
I hope this answers question 1. This generalizes to other in-memory ml libraries like sklearn.
Regarding question 2 I'm not sure. CatBoost does the parallelisation somewhere in its C++ Code and uses it via Cython in the Python package. I assume it introduces some overhead (since distributed computing always introduces overhead) but it's probably not too much. You could find out by timing some experiments.
I use Tensorflow for deep learning work, but I was interested in some of the features of Julia for ML. Now in Tensorflow, there is a clear standard that protocol buffers--meaning TFRecords format is the best way to load sizable datasets to the GPUs for model training. I have been reading the Flux, KNET, documentation as well as other forum posts looking to see if there is any particular recommendation on the most efficient data format. But I have not found one.
My question is, is there a recommended data format for the Julia ML libraries to facilitate training? In other words, are there any clear dataset formats that I should avoid because of bad performance?
Now, I know that there is a Protobuf.jl library so users can still use protocol buffers. I was planning to use protocol buffers for now, since I can then use the same data format for Tensorflow and Julia. However, I also found this interesting Reddit post about how the user is not using protocol buffers and just using straight Julia Vectors.
https://www.reddit.com/r/MachineLearning/comments/994dl7/d_hows_julia_language_mit_for_ml/
I get that the Julia ML libraries are likely data storage format agnostic. Meaning that no matter what format in which the data is stored, the data gets decoded to some sort of vector or matrix format anyway. So in that case I can use whatever format. But just wanted to make sure I did not miss anything in the documentation or such about problems or low performance due to using the wrong data storage format.
For in-memory use just use arrays and vectors. They're just big contiguous lumps of memory with some metadata. You can't really get any better than that.
For serializing to another Julia process, Julia will handle that for you and use the stdlib Serialization module.
For serializing to disk you should either Just use Serialization.serialize (possibly compressed) or, if you think you might need to read from another program or if you think you'll change Julia version before you're done with the data you can use BSON.jl or Feather.jl.
In the near future, JLSO.jl will be a good option for replacing Serialization.
Following the Spark MLlib Guide we can read that Spark has two machine learning libraries:
spark.mllib, built on top of RDDs.
spark.ml, built on top of Dataframes.
According to this and this question on StackOverflow, Dataframes are better (and newer) than RDDs and should be used whenever possible.
The problem is that I want to use common machine learning algorithms (e.g: Frequent Pattern Mining,Naive Bayes, etc.) and spark.ml (for dataframes) don't provide such methods, only spark.mllib(for RDDs) provides this algorithms.
If Dataframes are better than RDDs and the referred guide recommends the use of spark.ml, why aren't common machine learning methods implemented in that lib?
What's the missing point here?
Spark 2.0.0
Currently Spark moves strongly towards DataFrame API with ongoing deprecation of RDD API. While number of native "ML" algorithms is growing the main points highlighted below are still valid and internally many stages are implemented directly using RDDs.
See also: Switch RDD-based MLlib APIs to maintenance mode in Spark 2.0
Spark < 2.0.0
I guess that the main missing point is that spark.ml algorithms in general don't operate on DataFrames. So in practice it is more a matter of having a ml wrapper than anything else. Even native ML implementation (like ml.recommendation.ALS use RDDs internally).
Why not implement everything from scratch on top of DataFrames? Most likely because only a very small subset of machine learning algorithms can actually benefit from the optimizations which are currently implemented in Catalyst not to mention be efficiently and naturally implemented using DataFrame API / SQL.
Majority of the ML algorithms require efficient linear algebra library not a tabular processing. Using cost based optimizer for linear algebra could be an interesting addition (I think that flink already has one) but it looks like for now there is nothing to gain here.
DataFrames API gives you very little control over the data. You cannot use partitioner*, you cannot access multiple records at the time (I mean a whole partition for example), you're limited to a relatively small set of types and operations, you cannot use mutable data structures and so on.
Catalyst applies local optimizations. If you pass a SQL query / DSL expression it can analyze it, reorder, apply early projections. All of that is that great but typical scalable algorithms require iterative processing. So what you really want to optimize is a whole workflow and DataFrames alone are not faster than plain RDDs and depending on an operation can be actually slower.
Iterative processing in Spark, especially with joins, requires a fine graded control over the number of partitions, otherwise weird things happen. DataFrames give you no control over partitioning. Also, DataFrame / Dataset don't provide native checkpoint capabilities (fixed in Spark 2.1) which makes iterative processing almost impossible without ugly hacks
Ignoring low level implementation details some groups of algorithms, like FPM, don't fit very well into a model defined by ML pipelines.
Many optimizations are limited to native types, not UDT extensions like VectorUDT.
There is one more problem with DataFrames, which is not really related to machine learning. When you decide to use a DataFrame in your code you give away almost all benefits of static typing and type inference. It is highly subjective if you consider it to be a problem or not but one thing for sure, it doesn't feel natural in Scala world.
Regarding better, newer and faster I would take a look at Deep Dive into Spark SQL’s Catalyst Optimizer, in particular the part related to quasiquotes:
The following figure shows that quasiquotes let us generate code with performance similar to hand-tuned programs.
* This has been changed in Spark 1.6 but it is still limited to default HashPartitioning
I asked myself the question whether most people normally code the machine learning algorithms themselves or whether they are likely to use existing solutions like Weka or R packages.
Of course it depends on the problem - but let's say that I want to use a common solution like a neural network. Is there still a reason to code it myself? To understand the mechanism better and adapt it? Or is the thought of standardized solutions more important?
This is not a good question for Stackoverflow. It's an opinion question, not a programming problem.
Nevertheless, here is my take:
It depends on what you want to do.
If you want to find which algorithm works best for your data problem at hand, try ELKI, Weka, R, Matlab, SciPy, whatever. Try out all the algorithms you can find, and spend even more time on preprocessing your data.
If you know which algorithm you need and need to get it into production, many of these tools will not perform good enough or be easy enough to integrate. Instead, check if you can find low level libraries such as libSVM that provide the functionality you need. If these don't exist, roll your own optimized code.
If you want to do research in this domain, you are best off with extending the existing tools. ELKI and Weka have APIs that you can plug into to provide extensions. R doesn't really have an API (CRAN it's a mess...) but people just dump their code somewhere and (hopefully) add a manual how to use it. Extending these frameworks can save you a lot of effort: you have comparison methods ready to use, and you can re-use a lot of their code. ELKI for example has a lot of index structures to accelerate algorithms. Most of the time, the index acceleration is much harder to write than the actual algorithm. So if you can reuse the existing indexes, this will make your algorithms much faster, too (and you will also benefit from future enhancements to these frameworks).
If you want to learn about existing algorithms you better implement them yourself. You'll be surprised how much more there is to optimizing some algorithms than what is taught in class. E.g. APRIORI. The basic idea is quite simple. But getting all the pruning details right, I say 1 out of 20 students gets these details. If you implement APRIORI, then benchmark it against a known good implementation and try to understand why yours is much slower, then you'll actually discover the subtle details to the algorithms. And don't be surprised to see a factor of 100 performance difference between ELKI, R, Weka etc. - it's can still be the same algorithm, just implemented more or less efficiently when it comes to actual data structures used, memory layout etc.
I'm not looking for a Neural Networks library, since I'm creating new kinds of networks. For that I need a good "dataflow" language.
Of course you can do this in C, C++, Java and co. but dealing from scratch with the multithreading etc. would be a nightmare.
At the other extremity, languages like Oz or Erlang seem more adapted, but they don't have many libraries, and they are harder to master (it's easy to play with them, but is it OK to create complete software ?).
What would you suggest ?
I watched an interesting conference presentation about using Erlang for Neural Networks. You might want to check it out:
From Telecom Networks to Neural Networks; Erlang, as the unintentional Neural Network Programming Language
I also know that the presented system is going to be open-sourced any day now according the authors tweet.
Erlang is very well suited for NN.
Neurons can be modeled by processes (no problem with having millions of them)
Connections/synapses can be represented by PIDs of target neuron. It is very easy to initialize such a network as part of standard init procedure in OTP. Communication would be realized by message passing.
Maybe it would be good to have global address space in ETS/mnesia (build in datastores) in order to do dynamic reconfiguration of network structure.
Pattern matching in receive block can determine what kind of signal neuron receives and modify it on the fly.
It would be very easy to monitor such a network.
Also consider that Erlang NN would be 'live' all the time. You would be able to query neurons, layers, routers etc any time.
In C/C++ you just read current state of arrays/data structure.
Regarding performance, we all know that C/C++ is orders of magnitude faster than Erlang,
however NN topic is tricky.
If the network would hold very few neurons, in very wide address space, in regular array,
iterating over it again and again could be costly (in C). Equivalent situation in Erlang would be solved by single query to root/roots (input layer) neurons, which would propagate query directly to well addressed neighborhs.
DXNN1, and DXNN2 which was built and introduced in the textbook: Handbook of Neuroevolution Through Erlang: http://www.amazon.com/Handbook-Neuroevolution-Through-Erlang-Gene/dp/1461444624/ref=zg_bs_760204_22
Are open source and available at: https://github.com/CorticalComputer
If you are interested in data flow programming and multi-threading then I would suggest National Instruments LabVIEW. In this case you don't need to bother about multi-threading since its already there and you can also use OOP since now OOP is also native with LabVIEW. LabVIEW OOP is also purely based on data flow programming paradigm.
If you have any Java experience, then use Scala which is a JVM language that is based on the same concept of "actors" as Erlang. But it is less strict than Erlang and can easily use any existing Java libraries.
Then, when you find a computationally expensive task that would work better in Erlang, you can use Erlang's jinterface library to communicate between your Scala code and your distributed Erlang nodes.
Using Java does not mean dealing from scratch with multithreading - just use one of numerous Java Actor Libraries.
It's not a language in and of itself, but Emergent is very powerful and can be highly customized (it has a full scripting language).
It's open source, too, which could be helpful as a guide if you need to make your own version for your novel architectures.
Why reinvent the wheel? Try PyBrain. It's free and very comprehensive:
Quickstart
Another big plus for Erlang is full integration with Drakon
http://drakon-editor.sourceforge.net/drakon-erlang/intro.html
It all depends on your application. C++, Python are some good programming languages for machine learning