From what I know about Storm, it's used to analyze Twitter tweets to get trending topics, but can it be used to analyze data from a government census? And since that data is structured, is Storm suitable for it?
Storm is generally used for processing unending streams of data, e.g. logs, the Twitter stream, or in my case the output of a web crawler.
I believe census-type data would come in the form of a fixed report, which could be treated as a stream, but it would probably lend itself better to batch processing via something like MapReduce on Hadoop (possibly with Cascading or Scalding as layers of abstraction over the details).
The structured nature of the data wouldn't prevent the use of any of these technologies; the choice depends more on the problem you are trying to solve.
Storm is designed for stream processing, where data arrives continuously. Your application already has all the data it needs to process, so batch processing is better suited. If the data is structured, you can use R or other tools for the analysis, or write scripts to convert the data so it can go into R as input. Only if it is a huge dataset and you want to process it faster should you think about getting into Hadoop and writing your program around the analysis you have to do. Suggesting an architecture is only possible if you provide more details about the data size and the sort of analysis you are looking to do on it. If it is a smaller dataset, both Hadoop and Storm can be overkill for the problem that has to be solved.
--gtaank
Disclaimer: I am new to perf and still trying to learn the ins and outs.
I have run into the following scenario: I need to analyze the performance of a program that consists of reading data, preprocessing the data, performing the data processing proper, and outputting results. I am currently only interested in the performance of the data-processing part and would like to use perf for various performance analyses. However, every usage method I know of so far profiles the other parts (reading data, preprocessing data, outputting results) as well. The time spent in those parts is relatively large, which significantly affects my ability to interpret perf's results.
Therefore, I would like to tell perf to analyze only the performance of a certain section of code (the data processing).
As far as I know, VTune lets me embed __itt_resume and __itt_pause in my code to achieve this. I am not sure whether a similar analysis is possible with perf.
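To make the goal concrete, here is the kind of pause/resume wrapper I am imagining. I sketch it against perf's --control fifo interface, which I gather recent perf versions provide, though I am not sure this is the intended approach; the FIFO paths, helper name, and program phases are placeholders:

```python
# Sketch only: restrict measurement to the data-processing phase by sending
# 'enable'/'disable' commands to perf over its control FIFOs.
#
# Assumed invocation (paths are placeholders):
#   mkfifo /tmp/perf_ctl /tmp/perf_ack
#   perf stat -D -1 --control fifo:/tmp/perf_ctl,/tmp/perf_ack -- python3 program.py
#
# '-D -1' starts with events disabled until an 'enable' command arrives.

CTL_FIFO = "/tmp/perf_ctl"
ACK_FIFO = "/tmp/perf_ack"

# Keep both FIFOs open for the whole run, as perf does on its side.
ctl = open(CTL_FIFO, "w")
ack = open(ACK_FIFO, "r")

def perf_command(cmd):
    """Send 'enable' or 'disable' to perf and wait for its acknowledgement."""
    ctl.write(cmd + "\n")
    ctl.flush()
    ack.readline()  # perf answers 'ack\n' once the command has taken effect

def read_data():      pass  # placeholders for the real program phases
def preprocess():     pass
def process_data():   pass
def write_results():  pass

if __name__ == "__main__":
    read_data()               # not measured
    preprocess()              # not measured
    perf_command("enable")    # roughly what __itt_resume does in VTune
    process_data()            # only this region is counted
    perf_command("disable")   # roughly what __itt_pause does in VTune
    write_results()           # not measured
```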
Any pointers on this would be greatly appreciated.
I use TensorFlow for deep learning work, but I am interested in some of the features Julia offers for ML. In TensorFlow there is a clear standard: protocol buffers, i.e. the TFRecords format, are the best way to load sizable datasets to the GPUs for model training. I have been reading the Flux and Knet documentation, as well as other forum posts, looking for any particular recommendation on the most efficient data format, but I have not found one.
My question is, is there a recommended data format for the Julia ML libraries to facilitate training? In other words, are there any clear dataset formats that I should avoid because of bad performance?
Now, I know that there is a Protobuf.jl library, so users can still use protocol buffers. I was planning to use protocol buffers for now, since I could then use the same data format for TensorFlow and Julia. However, I also found this interesting Reddit post about a user who is not using protocol buffers and is just using straight Julia vectors.
https://www.reddit.com/r/MachineLearning/comments/994dl7/d_hows_julia_language_mit_for_ml/
I get that the Julia ML libraries are likely agnostic about the data storage format, meaning that no matter what format the data is stored in, it gets decoded into some sort of vector or matrix format anyway, so in that case I can use whatever format I like. But I just wanted to make sure I did not miss anything in the documentation about problems or low performance caused by using the wrong data storage format.
For in-memory use just use arrays and vectors. They're just big contiguous lumps of memory with some metadata. You can't really get any better than that.
For serializing to another Julia process, Julia will handle that for you using the stdlib Serialization module.
For serializing to disk, you should either just use Serialization.serialize (possibly compressed) or, if you think you might need to read the data from another program or that you'll change Julia versions before you're done with it, use BSON.jl or Feather.jl.
In the near future, JLSO.jl will be a good option for replacing Serialization.
I want to start developing a recommendation system for big data, say 2 GB of log data per day. For this purpose, which is preferred between RHadoop and Apache Mahout?
Please answer this question from different aspects, such as availability of code, speed, etc.
If you know R and your data is not that big, try SparkR, but note that most of the massive R package collection does not integrate well with Spark's distributed data.
If you have big data and are OK with an R-like Scala API, then Mahout is better. You can get your math working on sample data, and the same code will automatically scale to production size.
I have a few questions related to the use of Apache Spark for real-time analytics using Java. When the Spark application is submitted, the data stored in a Cassandra database are loaded and processed via a machine learning algorithm (Support Vector Machine). Through Spark's streaming extension, when new data arrive they are persisted in the database, the existing dataset is re-trained, and the SVM algorithm is executed. The output of this process is also stored back in the database.
Apache Spark's MLlib provides an implementation of a linear support vector machine. In case I want a non-linear SVM implementation, should I implement my own algorithm, or can I use existing libraries such as libsvm or jkernelmachines? These implementations are not based on Spark's RDDs; is there a way to use them without re-implementing the algorithm from scratch on RDD collections? If not, that would be a huge effort if I wanted to test several algorithms.
Does MLlib provide out-of-the-box utilities for data scaling before executing the SVM algorithm, as described in section 2.2 of http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf?
When a new dataset is streamed in, do I need to re-train on the whole dataset? Is there any way I could just add the new data to the already trained model?
To answer your questions piecewise,
On the first question: Spark provides the MLUtils class, which allows you to load data in the LIBSVM format into RDDs, so the data-loading portion at least won't stop you from utilizing that library. You could also implement your own algorithms if you know what you're doing, although my recommendation would be to take an existing one, tweak the objective function, and see how it runs. Spark basically gives you the functionality of a distributed Stochastic Gradient Descent process; you can do anything with it.
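To make the data-loading point concrete, here is a minimal sketch in PySpark; the same MLUtils and SVMWithSGD classes exist in the Scala and Java APIs, and the path, split, and iteration count are placeholders:

```python
from pyspark import SparkContext
from pyspark.mllib.util import MLUtils
from pyspark.mllib.classification import SVMWithSGD

sc = SparkContext(appName="svm-sketch")

# Load LIBSVM-formatted data into an RDD of LabeledPoint (placeholder path).
data = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/data.libsvm")
training, test = data.randomSplit([0.8, 0.2], seed=42)

# Train MLlib's linear SVM (SGD-based); the iteration count is illustrative.
model = SVMWithSGD.train(training, iterations=100)

# Simple error estimate on the held-out split.
predictions = test.map(lambda p: (model.predict(p.features), p.label))
error_rate = predictions.filter(lambda pl: pl[0] != pl[1]).count() / float(test.count())
print("test error: %f" % error_rate)
```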
On the second question (built-in utilities for data scaling): not that I know of. Hopefully someone else knows the answer.
On the third question: what do you mean by re-training when the whole data is streamed?
From the docs,
.. except fitting occurs on each batch of data, so that the model continually updates to reflect the data from the stream.
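To illustrate the per-batch updating the docs describe, here is a rough PySpark sketch; I am using StreamingLogisticRegressionWithSGD only because, as far as I know, MLlib has no streaming SVM, and the input format, paths, and feature count are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.mllib.classification import StreamingLogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, batchDuration=10)  # 10-second micro-batches

NUM_FEATURES = 10  # placeholder

def parse_point(line):
    # Assumed input format: "label,f1 f2 f3 ..." (illustrative only).
    label, feats = line.split(",")
    return LabeledPoint(float(label), [float(x) for x in feats.split()])

# New training examples dropped into this directory are picked up per batch.
training_stream = ssc.textFileStream("hdfs:///path/to/incoming").map(parse_point)

model = StreamingLogisticRegressionWithSGD()
model.setInitialWeights([0.0] * NUM_FEATURES)

# The model is updated on every micro-batch rather than re-fit on all history.
model.trainOn(training_stream)

ssc.start()
ssc.awaitTermination()
```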
I have a text that I want to analyze by sending it to several sentiment analysis APIs and storing their results. This text can be a tweet, for example.
In the training phase, a human defines the real sentiment of the text, and the APIs that gave the same answer get a better ranking. The machine also analyzes the main topic of the text.
In the Use phase:
The machine receives a text, analyzes its main topic, determines which APIs have worked best for that topic, and merges those best APIs' results based on their rating.
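A rough sketch of the selection-and-merge step I have in mind, where the topic detection, the API clients, the ratings table, and the score range are all placeholders I made up:

```python
# Illustrative sketch: pick the APIs that historically performed best for the
# detected topic and merge their scores, weighted by that per-topic rating.
from typing import Callable, Dict

# Per-topic ratings learned in the training phase (placeholder values).
ratings: Dict[str, Dict[str, float]] = {
    "politics": {"api_a": 0.9, "api_b": 0.6, "api_c": 0.4},
    "sports":   {"api_a": 0.5, "api_b": 0.8, "api_c": 0.7},
}

# Hypothetical API clients, each returning a sentiment score in [-1, 1].
apis: Dict[str, Callable[[str], float]] = {
    "api_a": lambda text: 0.2,
    "api_b": lambda text: -0.1,
    "api_c": lambda text: 0.4,
}

def detect_topic(text: str) -> str:
    """Placeholder for the main-topic analysis."""
    return "politics"

def merged_sentiment(text: str, top_k: int = 2) -> float:
    topic = detect_topic(text)
    # Keep only the top_k best-rated APIs for this topic.
    best = sorted(ratings[topic].items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Weighted average of their scores, with the per-topic rating as the weight.
    total_weight = sum(weight for _, weight in best)
    return sum(apis[name](text) * weight for name, weight in best) / total_weight

print(merged_sentiment("Some tweet about the election"))
```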
I thought about using something like a recommendation engine, e.g. prediction.io.
Is this the best way to solve the problem?
What technologies can I use?