Finding data statistics using Mahout

I am new to Mahout and am trying to work out how to use my dataset to present some relationships. My dataset has the form
IPs,timestamp,bytes_tranferred
What relationships can I derive from this data so that I can present some meaningful values using Mahout? Currently I plan to use it to show which client (in the IPs column) had the most traffic in a given time period, so I guess I will have to group the IPs together. Are there any better ideas, and how can I do this in Java code? Kindly suggest.
Thanks in Advance

It basically depends on your requirements. You can calculate the data transferred in a given time period, the IPs that transferred data during a time period, and so on. But I don't think you need the Mahout framework to calculate these; a simple MapReduce (MR) job can do all of this.
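As a rough illustration of that grouping, here is a minimal sketch in Python (used for brevity, although the question asks about Java). It assumes each record has already been parsed into an (ip, timestamp, bytes) tuple; the function name traffic_per_client and the sample rows are made up for illustration. A simple MR job would apply the same logic: the mapper emits (IP, bytes) for records inside the time window and the reducer sums the bytes per IP.

```python
from collections import defaultdict
from datetime import datetime


def traffic_per_client(rows, start, end):
    """Sum bytes per IP for records with start <= timestamp < end.

    Returns (ip, total_bytes) pairs, busiest client first.
    """
    totals = defaultdict(int)
    for ip, ts, nbytes in rows:
        if start <= ts < end:
            totals[ip] += nbytes
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample records: (IPs, timestamp, bytes_tranferred)
rows = [
    ("10.0.0.1", datetime(2014, 1, 1, 10, 5), 1200),
    ("10.0.0.2", datetime(2014, 1, 1, 10, 15), 300),
    ("10.0.0.1", datetime(2014, 1, 1, 10, 45), 800),
    ("10.0.0.2", datetime(2014, 1, 1, 11, 30), 5000),  # outside the window below
]

ranking = traffic_per_client(
    rows, datetime(2014, 1, 1, 10), datetime(2014, 1, 1, 11)
)
print(ranking)  # [('10.0.0.1', 2000), ('10.0.0.2', 300)]
```

Changing start and end gives the ranking for any other period; grouping by hour or day instead would only change how the key is built.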

Related

In a regression task, how do I find which independent variables are to be ignored or are not important?

In the regression problem I'm working with, there are five independent columns and one dependent column. I cannot share the data set details directly due to privacy, but one of the independent variables is an ID field which is unique for each example.
I feel like I should not be using the ID field in estimating the dependent variable. But this is just a gut feeling. I have no strong reason to do this.
What shall I do? Is there any way to decide which variables to use and which to ignore?
Well, I agree with @desertnaut. The ID attribute does not seem relevant when building a model and provides no help in prediction.
The term you are looking for is feature selection. Since it is a broad topic, I will just mention the methods most commonly used by data scientists.
For regression problems you can try a correlation heatmap to find the features that are highly correlated with the target (here sns is seaborn and df is a pandas DataFrame holding the features and the target):
sns.heatmap(df.corr())
There are several other ways too, such as PCA or the built-in feature selection of tree-based models, to find the right features for your model.
You can also try James Phillips's method. This approach is limited, since the number of models to train grows linearly with the number of features. But in your case, where you have only four features to compare, you can try it out: compare the regression model trained on all four features with models trained on only three features, dropping each of the four features in turn. This means training four reduced regression models and comparing them with the full one.
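The drop-one-feature comparison can be sketched roughly as follows. This is only an illustration under stated assumptions: the data is synthetic (the target depends on f0 and f1 only, while f2 and f3 are noise), the feature names are made up, and the model is a plain least-squares fit written with the standard library so the snippet runs without extra packages. In practice you would likely use your own data, a library regressor, and proper cross-validation instead of a single hold-out split.

```python
import random


def fit_ols(X, y):
    """Fit y ~ intercept + X by solving the normal equations (Gauss-Jordan)."""
    rows = [[1.0] + list(r) for r in X]  # prepend an intercept column
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting: move the largest remaining entry onto the diagonal
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]  # [intercept, coef_1, ..., coef_n]


def mse(coefs, X, y):
    """Mean squared error of the linear model on (X, y)."""
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], r)) for r in X]
    return sum((p - t) ** 2 for p, t in zip(preds, y)) / len(y)


def holdout_mse(X, y, keep, n_train):
    """Train on the first n_train rows using only columns in keep; score on the rest."""
    Xk = [[r[j] for j in keep] for r in X]
    coefs = fit_ols(Xk[:n_train], y[:n_train])
    return mse(coefs, Xk[n_train:], y[n_train:])


# Synthetic stand-in data (an assumption, not the asker's data set)
rng = random.Random(0)
names = ["f0", "f1", "f2", "f3"]
X = [[rng.gauss(0, 1) for _ in names] for _ in range(200)]
y = [3.0 * r[0] - 2.0 * r[1] + rng.gauss(0, 0.1) for r in X]

all_cols = list(range(len(names)))
results = {"all": holdout_mse(X, y, all_cols, 150)}
for j, name in enumerate(names):
    keep = [c for c in all_cols if c != j]  # drop exactly one feature
    results["without " + name] = holdout_mse(X, y, keep, 150)

for label, err in results.items():
    print(f"{label:12s} held-out MSE = {err:.3f}")
```

A feature whose removal barely changes the held-out error is a candidate for dropping; a feature whose removal makes the error jump is one the model relies on.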
As you say, the ID variable is unique for each example. The model therefore cannot learn anything from it: every example has a new ID, and since each ID occurs only once there are no general patterns to learn.
Regarding feature elimination, it depends. If you have domain knowledge, you can engineer or remove features as needed based on that alone. If you don't know much about the domain, you can try some basic techniques like backward selection, forward selection, etc. via cross-validation to get the model with the best value of the metric you're working with.

RapidMiner - Time Series Segmentation

I am fairly new to RapidMiner. I have a historical financial data set (with attributes Date, Open, Close, High, Low, Volume Traded) from Yahoo Finance and I am trying to find a way to segment it as in the image below:
I am also planning to perform this segmentation on more than one such data set and then compare the segmentations with each other (i.e. Segment 1 for Data Set A against Segment 1 for Data Set B), so I would prefer an equal number of segments for each data set.
I am aware that certain extensions are available within the RapidMiner Marketplace, however I do not believe that any of them have what I am looking for. Your assistance is much appreciated.
Edit: I am currently trying to replicate the Voting-Based Outlier Mining for Multiple Time Series (V-BOMM) with multiple data sets. So far, I am able to perform the operation by recording and comparing common dates against each other.
However, I would like to enhance the process to compare Segments rather than simply dates. I have gone through the existing functionalities of RapidMiner, and thus far I don't believe any fit my requirements.
I have also considered Dynamic Time Warping, but I can't seem to find an available functionality in RapidMiner.
Ultimate question: Can someone guide me to functionalities that can help replicate the segmentation in the attached image such that the segments can be compared between Historic Data Sets in RapidMiner? Also, can someone guide me on how to implement Dynamic Time Warping using RapidMiner?
I would use the new version of the Time Series extension, using the windowing features to segment the time series into whatever parts you want. There is a nice explanation of the new tools in the blog section of the community.
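The windowing itself would be done with RapidMiner operators, so the following is only a small Python illustration of the idea, not RapidMiner code. It assumes that "segment" simply means a contiguous window of closing prices and that an equal number of windows per data set is wanted; the function names and the price lists are hypothetical, and the per-segment summary (relative change from first to last close) is just one possible way to compare Segment i of one data set with Segment i of another.

```python
def split_into_segments(values, n_segments):
    """Split values into n_segments contiguous windows of near-equal length."""
    if not 1 <= n_segments <= len(values):
        raise ValueError("n_segments must be between 1 and len(values)")
    base, extra = divmod(len(values), n_segments)
    segments, start = [], 0
    for i in range(n_segments):
        size = base + (1 if i < extra else 0)  # spread the remainder over the first windows
        segments.append(values[start:start + size])
        start += size
    return segments


def segment_return(segment):
    """Relative change from the first to the last value in a segment."""
    return (segment[-1] - segment[0]) / segment[0]


# Hypothetical closing prices for two data sets of the same length
closes_a = [10, 11, 12, 11, 13, 14, 15]
closes_b = [20, 19, 18, 18, 17, 19, 21]

segments_a = split_into_segments(closes_a, 3)
segments_b = split_into_segments(closes_b, 3)

# Compare Segment i of data set A against Segment i of data set B
for i, (seg_a, seg_b) in enumerate(zip(segments_a, segments_b), start=1):
    print(i, round(segment_return(seg_a), 3), round(segment_return(seg_b), 3))
```

Because both series are cut into the same number of windows, the segments line up one to one even when the series have different lengths.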

Finding items which are similar

I have a large database of items from a retail company. If I want to find the items that are similar to a particular item, can I use Pearson correlation in Spark ML to do that? Is there a better algorithm for this? How do I make sure the machine also keeps learning as it evolves?
Edit - I implemented a MapReduce program to find the distance between various features. But how can I turn it into a machine learning solution? Suppose I let the program identify the correct neighbor; how can the program make use of this learning next time?
Using the Azure ML recommendations model, it is easy to perform tasks such as "related purchase items"; it would be a quick and easy start.
https://gallery.cortanaintelligence.com/MachineLearningAPI/Recommendations-2

Analyzing Sensor Data stored in cassandra and draw graphs

I'm collecting data from different sensors and write them to a Cassandra database.
The sensor ID acts as the partition key and the timestamp of the sensor data as the clustering column. Additionally, a value of the sensor is stored.
Each sensor collects about 30000 to 60000 values a day.
The simplest thing I want to do is draw a graph showing this data. This is not a problem for a few hours, but when showing a week or an even longer range, all the data has to be loaded into the backend (a Rails application) for further processing. This isn't really fast with my test dataset and won't be faster in production, I think.
So my question is how to speed this up. I thought about pre-processing the data directly in the database, but it seems that Cassandra isn't able to do such things.
For a graph with a width of 1000px it isn't useful to draw tens of thousands of points, so it would be better to fetch only relevant, pre-aggregated data from the database.
For example, when showing the data for a whole day in a graph with a width of 1000px, it would be enough to take 1000 average values (each an average over a bucket of about 86 seconds: 60*60*24 / 1000 = 86.4).
Is this a good approach? Or are there other techniques to speed this up? How would I handle this with the database? Create a second table and store some average values? But the resolution of the graph may change...
Another approach would be drawing mean values by day, week, month and so on. Maybe for this a second table could do a good job!
Cassandra is all about letting you write and read your data quickly. Think of it as just a data store. It can't (really) do any processing on that data.
If you want to do operations on it, then you are going to need to put the data into something else. Storm is quite popular for building computation clusters for processing data from Cassandra, but without knowing exactly what scale you need to operate at, that may be overkill.
Another option which might suit you is to aggregate data on the way in, or perhaps in nightly jobs. This is how OLAP is often done with other technologies. This can work if you know in advance what you need to aggregate. You could build your sets into hourly, daily, whatever, then pull a smaller amount of data into Rails for graphing (and possibly aggregate it even further to exactly meet the desired graph requirements).
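To make the aggregation idea concrete, here is a minimal Python sketch (the backend in the question is Rails, so treat this purely as an illustration of the logic). It assumes readings arrive as (unix_timestamp_seconds, value) pairs; the function name bucket_averages and the sample readings are made up. The same function covers both cases discussed: fixed hourly or daily rollups computed ahead of time, and a bucket width derived from the graph width.

```python
from collections import defaultdict


def bucket_averages(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns (bucket_start, mean_value) pairs ordered by time.
    """
    sums = defaultdict(lambda: [0.0, 0])  # bucket_start -> [total, count]
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds
        acc = sums[bucket_start]
        acc[0] += value
        acc[1] += 1
    return [(start, total / count) for start, (total, count) in sorted(sums.items())]


# Hypothetical sensor: one reading every 2 seconds for a day (43200 values)
DAY = 60 * 60 * 24
points = [(ts, float(ts)) for ts in range(0, DAY, 2)]

# Bucket width for a 1000px wide graph, rounded up so there are at most 1000 buckets
width_px = 1000
bucket = -(-DAY // width_px)  # ceiling division -> 87 seconds
per_pixel = bucket_averages(points, bucket)

# Pre-aggregated hourly rollup, e.g. computed in a nightly job
hourly = bucket_averages(points, 60 * 60)

print(len(points), len(per_pixel), len(hourly))
```

If the rollups are stored in a second table at a few fixed resolutions, the backend can read the closest one and, if needed, aggregate it further to match the current graph width.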
For the purposes of storing, aggregating, and graphing your sensor data, you might consider RRDtool which does basically everything you describe. Its main limitation is it does not store raw data, but instead stores aggregated, interpolated values. (If you need the raw data, you can still use Cassandra for that.)
AndySavage is onto something here when it comes to precomputing aggregate values. This does require you to understand in advance the sorts of metrics you'd like to see from the sensor values generally.
You correctly identify the limitation of a graph in informing the viewer. Questions you need to ask really fall into areas such as:
When you aggregate are you interested in the mean, median, spread of the values?
What's the biggest aggregation that you're interested in?
What's the goal of the data visualisation - is it really necessary to be looking at a whole year of data?
Are outliers the important part of the dataset?
Each of these questions will lead you down a different path with visualisation and the application itself too.
Once you know what you want to do, an ETL process harnessing some form of analytical processing will be needed. This is where the Hadoop world would be worth investigating.
Regarding your decision to use Cassandra as your timeseries historian, how is that working for you? I'm looking at technical solutions for a similar requirement at the moment and it's one of the options on the table.

Should I use Mahout for this?

I want to recommend items that are tagged and are categorized into three price categories (cheap, regular and expensive). I know that recommendations could be achieved with Mahout, but here is why I don't know how to use it.
Mahout is based on other users' opinions, but all of the items that I want to recommend are new ones that don't have any preferences set yet.
Is Mahout the right tool for this? Is this content-based (which Mahout doesn't support yet?), or should I use classification?
Thanks!
Since I've never built a recommender system, do not take this answer too seriously (no one has answered it, so I will try).
A recommendation system has to be built on some already known (or partially known) data. If you have only new (unseen) data, the only possibility is to use a clustering algorithm to build some clusters.
And if those clusters are OK, they can be used for training a recommendation system.
Mahout is just a tool that implements various ML methods. You can use other tools like Weka, R, ...
If you have no data at all about a new user, there's really nothing you can do to make recommendations, no matter what you do. There is zero input that would differentiate the person from anyone else.
Good systems should however be able to do something reasonable after the first input is available.
This is not a classifier problem by nature, no. It is also not a clustering tool, other answers notwithstanding.
The price categories are not core to any rec process you would use. You have other data presumably, what is it? That's important.
Finally whether or not to use Mahout depends on taste. You would use it if you want to use Java and Hadoop. And in turn you would only consider Hadoop if you had very large input, and few people have that much data (like >10M data points at least).
(Well, not quite -- my recommender pieces in Mahout pre-date Hadoop and are for on-line, smaller-scale applications. You might indeed be interested in this, if you are working in Java.)

Resources