Would a time-aware recommender system work for my data set? - machine-learning

I have implicit feedback data.
Customer Data: <CustomerID> <Product Bought> <Date of Purchase>
C1 P1 01-11-2008
C1 P2 01-01-2009
C1 P3 01-01-2020
C2 P1 01-01-2021
I am building a recommender system. I have used a co-occurrence matrix to build it, in GraphLab, with the Jaccard similarity measure.
Now my objective is to recommend products that a customer is likely to buy in the next 6 months. For C2, I should recommend product P2 and not P3. How do I handle this problem? I did learn about CARS (context-aware recommender systems), esp. fastFM, but it did not serve me well, and item-item similarity is not doing well for me either. Please help me figure out how to handle this problem.
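One simple way to make the co-occurrence approach time-aware is to count only ordered pairs of purchases that fall within the target window, so that P1 -> P2 (bought two months apart by C1) counts while P1 -> P3 (bought eleven years apart) does not. A minimal sketch in plain Python (not GraphLab; the 183-day window is an assumption standing in for "next 6 months"):

    # A sketch of a windowed, ordered co-occurrence recommender: a pair
    # (p1 -> p2) is counted only when p2 was bought within 183 days of p1.
    from collections import defaultdict
    from datetime import datetime

    purchases = [
        ("C1", "P1", "01-11-2008"),
        ("C1", "P2", "01-01-2009"),
        ("C1", "P3", "01-01-2020"),
        ("C2", "P1", "01-01-2021"),
    ]

    WINDOW_DAYS = 183  # "next 6 months" (assumption)

    by_customer = defaultdict(list)
    for cust, prod, date in purchases:
        by_customer[cust].append((prod, datetime.strptime(date, "%d-%m-%Y")))

    # Windowed, ordered co-occurrence counts.
    cooc = defaultdict(int)
    for items in by_customer.values():
        items.sort(key=lambda pd: pd[1])
        for i, (p1, t1) in enumerate(items):
            for p2, t2 in items[i + 1:]:
                if (t2 - t1).days <= WINDOW_DAYS:
                    cooc[(p1, p2)] += 1

    def recommend(customer):
        """Score unowned products by windowed co-occurrence with owned ones."""
        owned = {p for p, _ in by_customer[customer]}
        scores = defaultdict(int)
        for (p1, p2), n in cooc.items():
            if p1 in owned and p2 not in owned:
                scores[p2] += n
        return sorted(scores, key=scores.get, reverse=True)

    print(recommend("C2"))  # -> ['P2'] (P3 is outside every window)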

Related

Could you explain this question? I am new to ML, and I faced this problem, but its solution is not clear to me.

The problem (transcribed below from the question's image, titled "Question 2"):
Many substances that can burn (such as gasoline and alcohol) have a chemical structure based on carbon atoms; for this reason they are called hydrocarbons. A chemist wants to understand how the number of carbon atoms in a molecule affects how much energy is released when that molecule combusts (meaning that it is burned). The chemist obtains the dataset below. In the column on the right, kJ/mol is the unit measuring the amount of energy released.
You would like to use linear regression (h_a(x) = a0 + a1*x) to estimate the amount of energy released (y) as a function of the number of carbon atoms (x). Which of the following do you think will be the values you obtain for a0 and a1? You should be able to select the right answer without actually implementing linear regression.
A) a0 = −1780.0, a1 = −530.9
B) a0 = −569.6, a1 = −530.9
C) a0 = −1780.0, a1 = 530.9
D) a0 = −569.6, a1 = 530.9
Since all the a0 options are negative but two of the a1 options are positive, let's figure out a1 first.
As you can see, increasing the number of carbon atoms makes the energy more and more negative, so the relation cannot be positively correlated, which rules out options C and D.
Then, for the intercept, the value that produces the least error is the correct one. For x = 1 and x = 10 (easiest to calculate), option A predicts about −2300 and −7000, while option B predicts about −1100 and −5900, so one would prefer B over A.
PS: You might be thinking there should be obvious values for a0 and a1 in the data; there aren't. The intention of the question is to give you a general understanding of the best fit. Also, this way of solving is kind of machine learning as well.
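If you do want to check the arithmetic numerically, here is a sketch with NumPy. The (x, y) points below are hypothetical stand-ins for the question's table (which is only available as an image); only the roughly linear downward trend matters for the argument:

    import numpy as np

    # Hypothetical stand-in data: energy released (kJ/mol) vs. carbon atoms.
    x = np.array([1, 2, 3, 4, 10])
    y = np.array([-1100, -1630, -2160, -2690, -5880])

    candidates = {
        "A": (-1780.0, -530.9),
        "B": (-569.6, -530.9),
        "C": (-1780.0, 530.9),
        "D": (-569.6, 530.9),
    }

    # Compare the four candidate lines by sum of squared errors (SSE).
    for name, (a0, a1) in candidates.items():
        sse = np.sum((y - (a0 + a1 * x)) ** 2)
        print(name, f"SSE = {sse:,.0f}")   # B comes out smallest here

    # An actual least-squares fit recovers (a1, a0) directly:
    print(np.polyfit(x, y, 1))             # roughly [-531.2, -567.2] here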

Minimum Spanning Tree of Subgraph

Suppose you have a graph G = (V, E). You can do whatever you want in terms of preprocessing on this graph G (within reasonable time and space constraints for a graph with a few thousand vertices, so you couldn't just store every possible answer, for example).
Now suppose I select a subset V' of V. I want the MST of the subgraph induced by V'. How do you do this quickly and efficiently?
There are two ways to solve the problem; their performance depends on the characteristics of the problem instance:
Applying an MST algorithm to the induced subgraph (solving from scratch; see the sketch at the end of this answer).
Using a dynamic algorithm to update the tree after changes to the problem.
There are two types of dynamic algorithms:
I) Edge insertion and deletion:
G. Ramalingam and T. Reps, “On the computational complexity of
dynamic graph problems,” Theoret. Comput. Sci., vol. 158, no. 1, pp.
233–277, 1996.
II) Edge weight decrease and increase:
D. Frigioni, A. Marchetti-Spaccamela, and U. Nanni, “Fully dynamic
output bounded single source shortest path problem,” in ACM-SIAM
Symp. Discrete Algorithms, 1996, pp. 212–221.
D. Frigioni, A. Marchetti-Spaccamela, and U. Nanni, “Fully dynamic algorithms for maintaining shortest paths trees,” J. Algorithms, vol. 34, pp. 251–281, 2000.
You can use them directly, or adapt them to your problem so that they also handle node insertion and deletion.
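For the from-scratch option, here is a minimal sketch of Kruskal's algorithm restricted to the subgraph induced by V' (plain Python with a union-find; the example graph is made up, and if the induced subgraph is disconnected you get a minimum spanning forest rather than a tree):

    # Kruskal's algorithm on the subgraph induced by V'.
    # Edges are (weight, u, v) triples.
    def mst_of_induced_subgraph(edges, v_prime):
        v_prime = set(v_prime)
        parent = {v: v for v in v_prime}       # union-find structure

        def find(v):
            while parent[v] != v:
                parent[v] = parent[parent[v]]  # path halving
                v = parent[v]
            return v

        tree = []
        # Keep only edges with both endpoints in V', cheapest first.
        for w, u, v in sorted(e for e in edges
                              if e[1] in v_prime and e[2] in v_prime):
            ru, rv = find(u), find(v)
            if ru != rv:                       # no cycle: take the edge
                parent[ru] = rv
                tree.append((w, u, v))
        return tree

    # Example: a 5-vertex graph, asking for the MST over V' = {0, 1, 2, 3}.
    edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 2, 3), (5, 3, 4), (6, 1, 4)]
    print(mst_of_induced_subgraph(edges, {0, 1, 2, 3}))
    # -> [(1, 0, 1), (2, 2, 3), (3, 1, 2)]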

Decision tree completeness and unclassified data

I made a program that trains a decision tree based on the ID3 algorithm, using an information-gain function (Shannon entropy) for feature selection (splitting).
Once I had trained a decision tree, I tested it by classifying unseen data and realized that some data instances cannot be classified: there is no path in the tree that classifies the instance.
An example (this is an illustration example but I encounter the same problem with a larger and more complex data set):
With f1 and f2 being the predictor variables (features) and y the categorical target, the value ranges are:
f1: [a1; a2; a3]
f2: [b1; b2; b3]
y : [y1; y2; y3]
Training data:
("a1", "b1", "y1");
("a1", "b2", "y2");
("a2", "b3", "y3");
("a3", "b3", "y1");
Trained tree:
          [f2]
        /   |   \
      b1    b2    b3
      |     |     |
      y1    y2   [f1]
                 /   \
               a2     a3
               |       |
               y3      y1
The instance ("a1", "b3") cannot be classified with the given tree.
Several questions come to mind:
Does this situation have a name? Tree incompleteness, or something like that?
Is there a way to know whether a decision tree will cover all combinations of unseen instances (all combinations of feature values)?
Does the reason for this "incompleteness" lie in the structure of the data set, in the algorithm used to train the decision tree (ID3 in this case), or elsewhere?
Is there a method to classify these unclassifiable instances with the given decision tree, or must one use another tool (random forest, neural networks...)?
This situation cannot occur with the ID3 decision-tree learner---regardless of whether it uses information gain or some other heuristic for split selection. (See, for example, ID3 algorithm on Wikipedia.)
The "trained tree" in your example above could not have been returned by the ID3 decision-tree learning algorithm.
This is because when the algorithm selects a d-valued attribute (i.e. an attribute with d possible values) on which to split the given leaf, it creates d new children (one per attribute value). In particular, in your example above, the node [f1] would have three children, corresponding to attribute values a1, a2, and a3.
It follows from the previous paragraph (and, in general, from the way the ID3 algorithm works) that any well-formed vector---of the form (v1, v2, ..., vn, y), where vi is a value of i-th attribute and y is the class value---should be classifiable by the decision tree that the algorithm learns on a given train set.
Would you mind providing a link to the software you used to learn the "incomplete" trees?
To answer your questions:
Not that I know of. It doesn't make sense to learn such "incomplete trees." If we knew that some attribute values will never occur then we would not include them in the specification (the file where you list attributes and their values) in the first place.
With the ID3 algorithm, you can prove---as I sketched in the answer---that every tree returned by the algorithm will cover all possible combinations.
You're using the wrong algorithm. Data has nothing to do with it.
There is no such thing as an unclassifiable instance in decision-tree learning. One usually defines the decision-tree learning problem as follows. Given a train set S of examples x_1, x_2, ..., x_m, each of the form x_i = (v_i1, v_i2, ..., v_in, y_i), where v_ij is the value of the j-th attribute and y_i is the class value in example x_i, learn a function (represented by a decision tree) f: X -> Y, where X is the space of all possible well-formed vectors (i.e. all possible combinations of attribute values) and Y is the space of all possible class values, that minimizes an error function (e.g. the number of misclassified examples). From this definition, you can see that f is required to map any combination of attribute values to a class value; thus, by definition, every possible instance is classifiable.
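To make the mechanism concrete, here is a minimal ID3-style sketch on the toy data from the question (my own illustration, not the asker's program): each split creates one child per declared attribute value, and a value with no training examples gets a leaf labelled with the parent node's majority class, so every well-formed instance reaches a leaf.

    # Minimal ID3-style learner: one child per *declared* attribute value.
    import math
    from collections import Counter

    DOMAINS = {"f1": ["a1", "a2", "a3"], "f2": ["b1", "b2", "b3"]}

    def entropy(rows):
        counts = Counter(y for *_, y in rows)
        total = len(rows)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def id3(rows, attrs):
        classes = {y for *_, y in rows}
        if len(classes) == 1:
            return classes.pop()                       # pure leaf
        if not attrs:
            return Counter(y for *_, y in rows).most_common(1)[0][0]

        def gain(a):                                   # information gain of a
            i = list(DOMAINS).index(a)
            rem = sum(
                len(part) / len(rows) * entropy(part)
                for v in DOMAINS[a]
                if (part := [r for r in rows if r[i] == v])
            )
            return entropy(rows) - rem

        best = max(attrs, key=gain)
        i = list(DOMAINS).index(best)
        majority = Counter(y for *_, y in rows).most_common(1)[0][0]
        children = {}
        for v in DOMAINS[best]:                        # one child per value
            subset = [r for r in rows if r[i] == v]
            children[v] = id3(subset, attrs - {best}) if subset else majority
        return (best, children)

    def classify(tree, instance):
        while isinstance(tree, tuple):
            attr, children = tree
            tree = children[instance[attr]]            # a child always exists
        return tree

    train = [("a1", "b1", "y1"), ("a1", "b2", "y2"),
             ("a2", "b3", "y3"), ("a3", "b3", "y1")]
    tree = id3(train, {"f1", "f2"})
    print(classify(tree, {"f1": "a1", "f2": "b3"}))    # classified; no gap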

Best Solution for Recommendation

I am trying to find an appropriate function to obtain an accurate similarity between two persons according to their favourites.
For instance, persons are connected to tags, and each person's affinity for a tag is kept as a numeric value on the edge to the tag node. I want to recommend similar persons to each person.
I have found two solutions:
Cosine Similarity
There is a Cosine function in Neo4j, but it accepts only one input, while for the formula above I need to pass vectors, such as:
for "a": a = [10, 20, 45], where each number indicates the person's desire for a tag.
for "b": b = [20, 50, 70]
Pearson Correlation
While searching the net and your documentation, I found:
http://neo4j.com/docs/stable/cypher-cookbook-similarity-calc.html#cookbook-calculate-similarities-by-complex-calculations
My question is: what is the logic behind this formula?
What is the difference between r and H?
Because at first glance I think H1 and H2 are always equal to one, unless I should consider the rest of the graph.
Thanks in advance for any help.
I think the purpose of H1 and H2 is to normalize the results of the times property (the number of times the user ate the food) across food types. You can experiment with this example in this Neo4j console.
Since you mention other similarity measures you might be interested in this GraphGist, Similarity Measures For Collaborative Filtering With Cypher. It has some simple examples of calculating Pearson correlation and Jaccard similarity using Cypher.
This example makes it a little hard to understand what is going on, because here H1 and H2 are both 1. A better example would show each person eating a different number of food types, so you'd be able to see the value of H change. If "me" also ate "vegetables", "pizza", and "hotdogs", their H would be 4.
I can't help you with Neo4j; I just want to point out that cosine similarity and Pearson's correlation coefficient are essentially the same thing. If you decode the different notations, you'll find that the only difference is that Pearson's zero-centers the vectors first. So you can define Pearson's as follows:
Pearson(a, b) = Cosine(a - mean(a), b - mean(b))
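A quick numeric check of that identity, using the vectors from the question (a sketch with NumPy):

    import numpy as np

    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    def pearson(a, b):
        # Pearson = cosine of the mean-centered vectors.
        return cosine(a - a.mean(), b - b.mean())

    a = np.array([10.0, 20.0, 45.0])   # tag-affinity scores for person "a"
    b = np.array([20.0, 50.0, 70.0])   # tag-affinity scores for person "b"
    print(cosine(a, b))                # ~0.98
    print(pearson(a, b))               # ~0.94
    print(np.corrcoef(a, b)[0, 1])     # same value as pearson(a, b)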

How can I use machine learning to extract larger chunks of text from a document?

I am currently learning about machine learning, as I think it might help me solve a problem I have. However, I am unsure about which techniques I should apply. I apologise in advance for probably not knowing enough about this field to even ask a proper question.
What I want to do is extract the significant parts of a knitting pattern (the actual pattern, not all the intro and stuff like that). For instance, I would like to feed this web page into my program and get out something like this:
{
  "title": "Boot Style Red and White Baby Booties for Cold Weather",
  "directions": "
    Right Bootie.
    Cast on (31, 43) with white color.
    Rows (1, 3, 5, 7, 9, 10, 11): K.
    Row 2: K1, M1, (K14, K20), M1, K1, M1, (K14, K20), M1, K1. (35, 47 sts)
    Row 4: K2, M1, (K14, K20), M1, K3, M1, (K14, K20), M1, K3. (39, 51 sts)
    Row 6: K3, M1, (K14, K20), M1, K5, M1, (K14, K20), M1, K3. (43, 55 sts)
    ..."
}
I've been reading about extracting smaller parts, like sentences and words, and also about stuff like Named Entity Recognition, but they all seem to be focused on very small parts of the text.
My current thought is to use supervised learning, but I'm very unsure about how to extract features from the text. Naive methods like using letters, words or even sentences as features seem like they wouldn't be relevant enough to yield any kind of satisfactory result (and also, there would be tons of features, unless I use some kind of sampling). But what really are the significant features for finding out which parts are what in a knitting pattern?
Can someone point me in the right direction of algorithms and methods to do extraction of larger portions of the text?
One way to see this is as a straightforward classification problem: for each sentence in the page, you want to determine if it's relevant to you or not. Optionally, you have different classes of relevant sentences, such as "title" and "directions".
As a result, for each sentence you need to extract the features that contain information about its status. This will likely involve tokenizing the sentence, and possibly applying some type of normalization. Initially I would focus on features such as individual words (M1, K1, etc.) or n-grams (a number of adjacent words). Yes, there are many of them, but a good classifier will learn which features are informative, and which are not. If you're really worried about data sparseness, you can also reduce the number of features by mapping similar "words" such as M1 and K1 to the same feature.
Additionally, you will need to label a set of example sentences, to serve as the training and test sets for your classifier. This will allow you to train the system, evaluate its performance and compare different approaches.
To start, you can experiment with some simple but popular classification methods, such as Naive Bayes.
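As a hedged starting point, here is what that pipeline could look like with scikit-learn (the handful of labelled lines below are hypothetical; in practice you would label a few hundred lines from real patterns):

    # Treat each line of the page as a short "sentence", extract word and
    # bigram counts, and train a Naive Bayes classifier on labelled lines.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_lines = [
        "Cast on 31 with white color.",
        "Row 2: K1, M1, K14, M1, K1.",
        "Row 4: K2, M1, K14, M1, K2.",
        "These booties are perfect for cold winter days!",
        "Share this pattern with your friends on Facebook.",
        "Thanks for visiting my blog, I post a new pattern every week.",
    ]
    labels = ["directions", "directions", "directions",
              "intro", "intro", "intro"]

    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),   # words and word bigrams
        MultinomialNB(),
    )
    model.fit(train_lines, labels)

    print(model.predict(["Row 6: K3, M1, K20, M1, K3."]))      # -> directions
    print(model.predict(["I designed this one for my niece."]))  # likely intro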
