Do I need to master Hadoop before learning Mahout? How far can I go with its data mining features without learning Hadoop?
Master it? No. If you are using the parts of the project that are based on Hadoop, then basic knowledge is required, and sufficient. If you are using the parts that are not based on Hadoop, then you don't need Hadoop at all.
Mahout gives you the tools to experiment with data mining. Mahout does support a Hadoop-based implementation in case the dataset is huge, but it works fairly well without Hadoop on a single machine. The same code works with or without Hadoop (Hadoop is picked up if the Hadoop configuration parameter is set). Knowing Hadoop is one more weapon in your Big Data arsenal.
I want to train an XGBoost model on a very large dataset that would cause OutOfMemory errors if trained on a single machine. I'm also interested in seeing how fast the training can be. For the sake of argument, let's assume the training dataset is fixed, so please no answers about feature reduction.
To get this running as a Custom Job on Google Cloud, I figured I have to parse the TF_CONFIG environment variable, which is set behind the scenes on all the nodes specified by an input argument when creating the job.
There is an example in the Docker image used on this page. The image appears to parse the TF_CONFIG variable and manually set up distributed training using the xgboost.rabit module from xgboost.
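For reference, TF_CONFIG is a JSON string, so reading it looks roughly like this (a sketch; the exact cluster and task keys depend on how the job is configured):

    import json
    import os

    # TF_CONFIG describes the cluster topology and this node's role in it,
    # e.g. {"cluster": {"worker": [...]}, "task": {"type": "worker", "index": 0}}
    tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
    cluster = tf_config.get("cluster", {})  # host:port lists per task type
    task = tf_config.get("task", {})        # this node's role and index
    print(task.get("type"), task.get("index"))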
The other way I'm thinking of doing this is to use a library like Dask or Ray to run the distributed training, but I don't have any experience with those.
Can someone provide an example of one or the other?
Ray provides a distributed XGBoost trainer that can run on common cloud providers (including Google Cloud). See the example here: https://docs.ray.io/en/latest/ray-air/examples/xgboost_example.html
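A minimal sketch based on that example (the sample dataset and worker count are illustrative; check the linked docs for the current API, which changes between Ray versions):

    import ray
    from ray.air.config import ScalingConfig
    from ray.train.xgboost import XGBoostTrainer

    # Public sample dataset used in the Ray docs; substitute your own data.
    dataset = ray.data.read_csv("s3://anonymous@air-example-data/breast_cancer.csv")
    train_dataset, valid_dataset = dataset.train_test_split(test_size=0.3)

    trainer = XGBoostTrainer(
        # num_workers controls how many actors/nodes share the training load.
        scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
        label_column="target",
        num_boost_round=20,
        params={"objective": "binary:logistic", "eval_metric": ["logloss", "error"]},
        datasets={"train": train_dataset, "valid": valid_dataset},
    )
    result = trainer.fit()
    print(result.metrics)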
Is Octave good to learn for Machine Learning?
Or would Python and other libraries do?
It depends on what you want to do.
Octave is excellent for fast prototyping and learning. The language is simple, so you can focus on grasping the concepts of ML. On the other hand, Python is very powerful and has an unparalleled stack of libraries and frameworks that let you dive into machine learning at whatever level you are comfortable with. It is also a simple language that you can get comfortable with pretty quickly.
If you just want to play with machine learning a little, I would recommend Octave, as it's simple and straightforward. In all other cases, I would recommend Python, as it's a powerful language for building complete systems and has a large community that can help you with almost any problem you encounter.
I have just completed the Machine Learning course by Andrew Ng and would like to proceed further.
I also want to learn the Python implementation of machine learning from the beginning so that I can practice on Kaggle.
Also, is there a good book, tutorial, or similar resource so that I can proceed without wasting time searching for one?
The best book with implementations of machine learning algorithms in Python is, unequivocally, "Introduction to Machine Learning with Python: A Guide for Data Scientists" by Andreas C. Müller and Sarah Guido. Machine learning algorithms in Python are available through a package called scikit-learn, which has everything you need for machine learning: the algorithms, scaling, cross-validation. And the book is written by one of the core developers of scikit-learn itself.
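To give a flavour of scikit-learn, here is a minimal sketch combining scaling, a classifier, and cross-validation in one pipeline (the iris dataset and logistic regression are just placeholders):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # Scaling and the model are chained, so cross-validation refits both per fold.
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(scores.mean())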
I am new to Apache Flume and I am trying to do a PoC with Apache Flume and Hadoop, but I don't know which Hadoop version will be suitable for this exercise.
Please help.
I've tested Flume with several versions of Hadoop and it has always worked. The official Apache Flume documentation does not specify any required Hadoop version for its HDFS Sink, so I guess it uses some Hadoop API that has not changed over time (which is really good). Let's do the exercise of going into the details:
The HDFSWriterFactory class used by HDFSEventSink.process() to get an HDFS writer may provide a:
HDFSSequenceFile: it uses a org.apache.hadoop.io.SequenceFile in order to write the data.
HDFSDataStream: it uses a org.apache.flume.serialization.EventSerializer.
HDFSCompressedDataStream: again, it uses a org.apache.flume.serialization.EventSerializer.
On the one hand, org.apache.hadoop.io.SequenceFile is quite large and seems to maintain a lot of deprecated methods for writing the data, which could explain the compatibility across Hadoop versions. On the other hand, org.apache.flume.serialization.EventSerializer writes to a standard java.io.OutputStream, which I think is quite stable.
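For completeness, which of those writers gets used is selected by the hdfs.fileType property on the sink. A minimal agent configuration might look like this (agent and sink names are illustrative; only the sink section is shown):

    # Hypothetical agent named "agent1" with one HDFS sink named "k1".
    agent1.sinks.k1.type = hdfs
    agent1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
    # SequenceFile -> HDFSSequenceFile, DataStream -> HDFSDataStream,
    # CompressedStream -> HDFSCompressedDataStream
    agent1.sinks.k1.hdfs.fileType = SequenceFile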
Hi, I want to predict the health level (high, medium, low) of a leaf using image processing and data mining. So far, my plan is to extract colors from the leaf and use a Bayes algorithm to predict the leaf's health; the data mining part is complete now, but I need extra features for the prediction. We only use orchid leaves, so I can't use the vein structure. Can anyone suggest what other features can be extracted from a leaf to identify its health level? Any ideas or comments that help me improve my project are welcome. Thanks.
There are many possible approaches to a problem like this. One common method is the bag-of-features model. Take a look at this example using the Computer Vision System Toolbox in MATLAB.
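If you are working in Python instead, a rough bag-of-visual-words sketch with OpenCV and scikit-learn (a stand-in for the MATLAB toolbox; the file names and cluster count are illustrative assumptions) could look like this:

    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    # 1. Extract local descriptors from each image (ORB here; SIFT also works).
    orb = cv2.ORB_create()
    descriptors = []
    for path in ["leaf1.jpg", "leaf2.jpg"]:  # hypothetical image files
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, des = orb.detectAndCompute(img, None)
        if des is not None:
            descriptors.append(des)

    # 2. Cluster all descriptors into a 50-word "visual vocabulary".
    vocab = KMeans(n_clusters=50, n_init=10)
    vocab.fit(np.vstack(descriptors).astype(np.float32))

    # 3. Represent each image as a histogram over visual words; these
    #    histograms become the features for a classifier (e.g. Naive Bayes).
    def bag_of_words(des):
        words = vocab.predict(des.astype(np.float32))
        hist, _ = np.histogram(words, bins=np.arange(51))
        return hist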