I would like to use slope one as an item recommender. The problem is that I have groups of items that are not correlated with each other. It seems to me that there is no way to tell Mahout to use the diff storage for just one group of products. I want to achieve this because the groups have an average of 100 items each, and I would prefer not to create a mongoDbDiffStorage from scratch. Does the rescorer tell Mahout which differences should be computed, so that storing useless data can be avoided?
Thanks
You mean that you want to consider some items completely unrelated to each other? No, you'd have to write your own DiffStorage for that.
Or, I might suggest that if these item sets are completely unrelated, then what you really have is a separate recommender problem for each group of items, and you can use a Recommender for each subset of data. This would probably be easier and more efficient in many ways.
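For example, here is a minimal sketch of that per-group approach. It assumes an older Mahout release that still ships slope-one, and one preference file per unrelated item group; the file paths, group names and IDs below are placeholders.

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class PerGroupRecommenders {
  public static void main(String[] args) throws Exception {
    String[] groups = {"books", "music", "garden"};   // placeholder group names
    Map<String, Recommender> recommenders = new HashMap<>();

    for (String group : groups) {
      // One preference file per unrelated item group, so diffs are only ever
      // computed within a group (~100 items each).
      DataModel model = new FileDataModel(new File("/data/prefs-" + group + ".csv"));
      recommenders.put(group, new SlopeOneRecommender(model));
    }

    // At request time, pick the recommender for the group the user is browsing.
    Recommender bookRecommender = recommenders.get("books");
    System.out.println(bookRecommender.recommend(42L, 5));   // placeholder user ID
  }
}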
I have a complex data model consisting of around a hundred tables containing business data. Some tables are very wide, with up to four hundred columns. Columns can have various data types: integers, decimals, text, dates, etc. I'm looking for a way to identify relevant / important information stored in these tables.
I fully understand that business knowledge is essential to correctly process a data model. What I'm looking for are some strategies to pre-process the tables and identify columns that should be taken to a later stage, where analysts will actually look into them. For example, I could use data profiling and statistics to find and exclude columns that don't have any data at all, or where all records have the same value. This way I could potentially eliminate 30% of the fields. However, I'm interested in exploring how AI and machine learning techniques could be used to identify important columns, hoping I could identify around 80% of the relevant data. I'm aware that what counts as relevant information will depend on the questions I want to ask. But even then, I hope I could narrow down the columns to simplify the manual assessment taking place in the next stage.
Could anyone provide some guidance on how to use AI and Machine Learning to identify relevant columns in such wide tables? What strategies and techniques can be used to pre-process tables and identify columns that should be taken to the next stage?
Any help or guidance would be greatly appreciated. Thank you.
F.
The most common approach I've seen to evaluate the analytical utility of columns is the correlation method. This would tell you if there is a relationship (positive or negative) among specific column pairs. In my experience you'll be able to more easily build analysis outputs when columns are correlated - although, these analyses may not always be the most accurate.
However, before you even do that, like you indicate, you would probably need to narrow down your list of columns using much simpler methods. For example, you could surely eliminate a whole bunch of columns based on datatype and basic count statistics.
Less common analytic data types (IDs, blobs, binary, etc.) can probably be excluded first, followed by running a simple COUNT(DISTINCT ColName) and COUNT(*) WHERE ColName IS NULL. This will help eliminate unique IDs, keys, and other similar fields: if all the rows are distinct, the column is not a good candidate for analysis. The same goes for NULLs: if the percentage of NULLs is greater than some threshold, you can eliminate those columns as well.
To automate this, depending on your database, you could create a fairly simple stored procedure or function that loops through all the tables and columns and runs a data type check, a distinct count and a NULL-percentage analysis on each field.
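A rough sketch of that loop over JDBC, assuming PostgreSQL and its information_schema; the connection string is a placeholder, identifiers are not quoted or escaped, and the skip-list and thresholds are arbitrary, so treat it as a starting point rather than a finished procedure.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ColumnProfiler {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost/mydb", "user", "password")) {

      String columnsSql =
          "SELECT table_name, column_name, data_type "
        + "FROM information_schema.columns WHERE table_schema = 'public'";

      try (Statement meta = conn.createStatement();
           ResultSet cols = meta.executeQuery(columnsSql)) {
        while (cols.next()) {
          String table = cols.getString("table_name");
          String column = cols.getString("column_name");
          String type = cols.getString("data_type");

          // First cut: skip data types that rarely carry analytical value.
          if (type.equals("bytea") || type.equals("uuid")) {
            continue;
          }

          String statsSql =
              "SELECT COUNT(*) AS total, "
            + "COUNT(DISTINCT " + column + ") AS distinct_vals, "
            + "COUNT(*) - COUNT(" + column + ") AS nulls "
            + "FROM " + table;

          try (Statement s = conn.createStatement();
               ResultSet rs = s.executeQuery(statsSql)) {
            rs.next();
            long total = rs.getLong("total");
            long distinctVals = rs.getLong("distinct_vals");
            long nulls = rs.getLong("nulls");

            // Drop columns that are empty, constant, mostly NULL, or unique per row.
            boolean keep = total > 0
                && distinctVals > 1
                && distinctVals < total
                && nulls < 0.9 * total;
            System.out.printf("%s.%s: distinct=%d nulls=%d keep=%b%n",
                table, column, distinctVals, nulls, keep);
          }
        }
      }
    }
  }
}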
Once you've narrowed down the list of columns, you can use a .corr() function to run the analysis against all the remaining columns in something like a Python script.
If you wanted to keep everything in the database, Postgres also supports a corr() aggregate function, but you'll only be able to run this on 2 columns at a time, like this:
SELECT corr(column1,column2) FROM table;
so you'll need to build a procedure that evaluates multiple columns at once.
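Since corr() only accepts two columns at a time, the pairwise loop could look like the following sketch, again over JDBC with a placeholder connection and a hypothetical table name; the numeric-type filter and the 0.5 reporting threshold are arbitrary choices for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class PairwiseCorrelation {
  public static void main(String[] args) throws Exception {
    String table = "sales";   // hypothetical table name
    try (Connection conn = DriverManager.getConnection(
        "jdbc:postgresql://localhost/mydb", "user", "password")) {

      // Collect the numeric columns of the table from the catalog.
      List<String> numericCols = new ArrayList<>();
      String colSql = "SELECT column_name FROM information_schema.columns "
          + "WHERE table_name = ? AND data_type IN "
          + "('smallint','integer','bigint','numeric','real','double precision')";
      try (PreparedStatement ps = conn.prepareStatement(colSql)) {
        ps.setString(1, table);
        try (ResultSet rs = ps.executeQuery()) {
          while (rs.next()) {
            numericCols.add(rs.getString(1));
          }
        }
      }

      // corr() only takes two columns, so evaluate every pair.
      try (Statement s = conn.createStatement()) {
        for (int i = 0; i < numericCols.size(); i++) {
          for (int j = i + 1; j < numericCols.size(); j++) {
            String a = numericCols.get(i);
            String b = numericCols.get(j);
            try (ResultSet rs = s.executeQuery(
                "SELECT corr(" + a + ", " + b + ") FROM " + table)) {
              rs.next();
              double r = rs.getDouble(1);
              // Report only reasonably strong relationships.
              if (!rs.wasNull() && Math.abs(r) > 0.5) {
                System.out.printf("%s ~ %s : %.3f%n", a, b, r);
              }
            }
          }
        }
      }
    }
  }
}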
I've thought about this technical challenge for some time. In general it's an AI-solvable problem, since there are easy features to extract, such as unique values, clustering, distributions, etc.
We want to bake this ability into https://columns.ai. Obviously we haven't gotten there yet; the first step we have taken is to collect stats for all columns when a data connection is made, identify columns that have a similar range of unique values, and generate a bunch of query templates for users to explore their dataset.
If you're interested, please take a look. As we keep advancing this part, it will get closer to an AI model that finds relevant columns. Cheers!
I'm struggling to find a real-world example of how to use Google Cloud Dataflow combiners to run a common ETL task which aggregates records on multiple keys (e.g. Date, Location) and sums values over different measures (e.g. GrossValue, NetValue, Quantity). I can only find examples with a typical key/value (e.g. Day/Value) aggregation. Any hints on how this is done with the Python SDK would be appreciated.
I'm not 100% sure I understand your question. Do you have separate elements you are trying to join the data together for, in which case you may wish to use CoGroupByKey? Or does a single element have multiple fields?
Hope some of this info helps,
I would suggest looking at windowing, which will allow you to subdivide a PCollection according to the timestamps of its individual elements. If you want to see all the events for a particular day, this may be useful; you may want to window across a day's worth of data. The Python examples of windowing are worth a look, and this link is also useful for understanding how you can use GroupByKey in different ways.
Another option is to determine what date each of your elements belongs to, and use a GroupByKey to key it with "[location][date][other]". You may need to do something like this if you want to join the data based on multiple fields.
See this GroupByKey example, but change the key to use your multiple fields concatenated.
Here is an example of reducing with a custom combiner. You can add logic there to do a custom aggregation over multiple different measures.
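As a sketch of the composite-key idea, here is the same shape with the Apache Beam Java SDK (in the Python SDK the equivalent is a beam.Map that builds the key, followed by beam.CombinePerKey or a Sum combiner); the dates, locations and values below are made up for the example.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Sum;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MultiKeyAggregation {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Records", Create.of(
            // Key = Date|Location|Measure, value = amount; in practice the key
            // would be built in a preceding Map/ParDo from your parsed records.
            KV.of("2016-01-01|Berlin|GrossValue", 100.0),
            KV.of("2016-01-01|Berlin|NetValue", 84.0),
            KV.of("2016-01-01|Berlin|GrossValue", 50.0),
            KV.of("2016-01-02|Munich|Quantity", 3.0)))
        // One sum per (Date, Location, Measure) combination.
        .apply("SumPerCompositeKey", Sum.doublesPerKey())
        // Format for whatever sink you use (TextIO, BigQueryIO, ...).
        .apply("Format", MapElements.into(TypeDescriptors.strings())
            .via((KV<String, Double> kv) -> kv.getKey() + "," + kv.getValue()));

    p.run().waitUntilFinish();
  }
}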
I am designing a system for anomaly detection.
There are multiple approaches to building such a system. I chose to implement one facet of such a system by detecting features shared by the majority of samples. I acknowledge the possible insufficiencies of such a method, but for my specific use case: (1) it suffices to know that a new sample contains (or lacks) features shared by the majority of past data in order to make a quick decision; (2) I'm interested in the insights such a method will offer into the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key: value} features. I choose to model the training dataset by collecting all the features observed in the data (the set of all unique keys) and setting that as the model's feature space. I define each sample by setting its values for the keys it has, and None for the features it does not include.
Given this training data set, I want to determine which features recur in the data, and, for such recurring features, whether they mostly share a single value.
My question:
A simple solution would be to count everything: for each of the N features, calculate the distribution of its values. However, as M and N are potentially large, I wonder if there is a more compact way to represent the data, or a more sophisticated method for making claims about the features' frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing this task, that would be even better.
If I understand your question correctly: you need to go over all the data anyway, so why not use hashing?
Actually two hash tables:
Inner hash table for the distribution of feature values.
Outer hash table for feature existence.
In this way, the size of the inner hash table will indicate how common the feature is in your data, and the actual values will indicate how they differ from one another. Another thing to notice is that you go over your data only once, and the time complexity of (almost) every operation on a hash table (if you allocate enough space from the beginning) is O(1).
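A minimal sketch of that two-table idea, with an outer map keyed by feature name and an inner map counting how often each value occurs; the sample data below is made up.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FeatureFrequencies {
  public static void main(String[] args) {
    // Each sample is a sparse {key: value} map; these three are made-up examples.
    List<Map<String, String>> samples = List.of(
        Map.of("color", "red", "size", "L"),
        Map.of("color", "red"),
        Map.of("color", "blue", "size", "L"));

    // Outer table: feature -> (value -> count). One pass over the data, O(1) per update.
    Map<String, Map<String, Integer>> stats = new HashMap<>();
    for (Map<String, String> sample : samples) {
      for (Map.Entry<String, String> f : sample.entrySet()) {
        stats.computeIfAbsent(f.getKey(), k -> new HashMap<>())
             .merge(f.getValue(), 1, Integer::sum);
      }
    }

    // A feature is "common" if it appears in most samples, and "stable" if one
    // value dominates its inner table.
    for (Map.Entry<String, Map<String, Integer>> e : stats.entrySet()) {
      int occurrences = e.getValue().values().stream().mapToInt(Integer::intValue).sum();
      System.out.printf("%s: seen in %d/%d samples, %d distinct values%n",
          e.getKey(), occurrences, samples.size(), e.getValue().size());
    }
  }
}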
Hope it helps
We are using Mahout to get user-based and item-based recommendations. We are using a FileDataModel that contains a mapping of userId to itemId (not sorted in any form), TanimotoCoefficientSimilarity and a GenericBooleanPrefItemBasedRecommender:
// Boolean (like/view) preferences loaded from a file of userId,itemId pairs
DataModel dataModel = new FileDataModel(new File("/FilePath"));
// Item-item similarity based only on co-occurrence, suitable for boolean preferences
_itemSimilarity = new TanimotoCoefficientSimilarity(dataModel);
// Cache results so repeated recommend() calls for the same user are cheap
_recommender = new CachingRecommender(new GenericBooleanPrefItemBasedRecommender(dataModel, _itemSimilarity));
We also have a rescorer to filter out some of the results. We are calling the built-in recommend method of the recommender:
_recommender.recommend(userID, howMany, _rescorer);
We have around 200K users, 55k products and around 4 million entries as user-product preferences.
The problem we are facing is that the first call to the recommend method for a user takes around 300-400 ms to return the list of recommended items, which is not feasible for our needs. I am looking for optimisation techniques that someone has used with Mahout, or for someone who has implemented their own recommend method on top of the given one, or advice on whether we should sort the data files in some way before passing them in. We are trying to get the recommendation time down to around 100 ms.
Any suggestions would be really helpful.
Your best bet is to look into CandidateItemsStrategy to further limit how many possibilities are considered. See:
https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/cf/taste/recommender/CandidateItemsStrategy.html
Candidate Strategy for GenericUserBasedRecommender in Mahout
I have one product, let's say a book. Now I want to retrieve the k products that are most similar to this product. How can I do this with Mahout?
The products are stored in a MySQL database so I'd use the JDBCDataModel.
For computing the similarities I'd prefer the LogLikelihoodTest.
But which recommender should I choose? It seems that all recommenders are designed to recommend items to a user rather than items similar to a given item.
I'm going to guess at the question here. You have user-item data, where users are real people and items are books. You are using LogLikelihoodSimilarity as the basis for some recommender, either user-based or item-based.
You don't need a recommender if you just want the most similar items. Just use LogLikelihoodSimilarity, which is an ItemSimilarity, to compute the similarity with all other items and take the most similar ones. In fact, look at the TopItems class, which even does that logic for you.
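As a sketch, the same result can be reached through GenericItemBasedRecommender#mostSimilarItems, which uses that TopItems logic internally; the file path and IDs below are placeholders, and a JDBCDataModel over your MySQL tables would plug in the same way.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MostSimilarBooks {
  public static void main(String[] args) throws Exception {
    // Placeholder file model; a MySQLJDBCDataModel would be used here instead.
    DataModel model = new FileDataModel(new File("/path/to/prefs.csv"));
    ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
    GenericItemBasedRecommender recommender =
        new GenericItemBasedRecommender(model, similarity);

    long bookId = 123L;   // placeholder item ID
    int k = 10;
    // Ranks every other item by log-likelihood similarity and keeps the top k.
    List<RecommendedItem> similar = recommender.mostSimilarItems(bookId, k);
    for (RecommendedItem item : similar) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}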