Which ML algorithm can find pairs in two datasets? - machine-learning

I have two datasets for which I need to find matching pairs.
In dataset 1 I have reported data from payment providers, which will result in bank payouts. Each record contains a monetary amount, potentially a reference, a date, etc.
The actual incoming bank transactions are in dataset 2. Each transaction also contains a monetary amount, complementary bank info, a date, an account number, etc.
In the case where the "reference" from dataset 1 is part of the "complementary bank info" from dataset 2, finding pairs is trivial. Also, finding the correct pairs based on monetary amount and dates works fine.
However, there are cases where the monetary amounts are off by a few cents or the "reference" appears in the complementary bank info with extra whitespace. Those cases are currently handled manually, i.e. a person hand-picks and matches the pairs.
I'd like to try to train a machine-learning algorithm to find these pairs, but I am struggling a bit with the correct approach. I assume this is a case for supervised learning - but which algorithm would find those pairs across the two datasets?

Try a "nearest neighbors" algorithm with the number of neighbors set to 1.
Specifically, you want to perform a "search", which is a somewhat broader concept than "classification" or "regression". See the sklearn.neighbors.NearestNeighbors class for more ideas.
there are cases where the monetary amounts are off by a few cents or the "reference" has whitespace in the complementary bank info
Nearest-neighbor search supports a distance cutoff (the radius parameter of scikit-learn's radius_neighbors, or simply a threshold applied to the returned distances). Find a cutoff value that captures "a few cents difference" but rejects "a dollar or more difference".
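As a concrete illustration, here is a minimal sketch using scikit-learn's NearestNeighbors to match amounts within a few-cents tolerance. The amounts and the 0.05 cutoff are invented for the example; in practice you would build the feature vector from (scaled) amount and date difference.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Hypothetical amounts: dataset 1 (reported payouts) vs. dataset 2 (bank transactions).
reported = np.array([[100.00], [250.50], [75.25]])
incoming = np.array([[100.02], [75.25], [250.47]])

# Index the incoming transactions, then find the single closest one per payout.
nn = NearestNeighbors(n_neighbors=1).fit(incoming)
distances, indices = nn.kneighbors(reported)

# Accept a candidate pair only if the distance is within "a few cents".
tolerance = 0.05
matches = [(i, int(j))
           for i, (d, j) in enumerate(zip(distances.ravel(), indices.ravel()))
           if d <= tolerance]
```

Anything beyond the cutoff would still fall through to manual review, which keeps the automated step conservative.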


Using a Word2Vec Model to Extract Data

I've used gensim Word2Vec to learn the embedding of monetary amounts and other numeric data in bank transaction memos. The goal is to use this to be able to extract these amounts and currencies from future input strings.
Design
Our input strings are something like
"AMAZON.COM TXNw98e7r3347 USD 49.00 # 1.283"
During preprocessing, I tokenize and replace all tokens that could be a monetary amount (strings consisting only of digits, commas, and at most one decimal point) with a special VALUE_TOKEN. I also manually replace exchange rates with RATE_TOKEN. The result would be
["AMAZON", ".COM", "TXNw", "98", "e", "7", "r", "3347", "USD", "VALUE_TOKEN", "#", "RATE_TOKEN"]
With all my preprocessed token lists in the list data, I generate the model
model = Word2Vec(data, window=3, min_count=3)
The embeddings of model that I'm most interested in are that of VALUE_TOKEN, RATE_TOKEN, as well as any currencies (USD, EUR, CAD, etc.). Now that I generated the model, I'm not sure what to do with it.
Problem
Say I have a new string that the model has never seen before,
new_string = "EUR 299.99 RATE 1.3289 WITH FEE 5.00"
I would like to use model to identify which tokens of new_string are most contextually similar to VALUE_TOKEN (which should return ["299.99", "5.00"]) and which is closest to RATE_TOKEN ("1.3289"). It should be able to classify these based on the learned embeddings. I can preprocess new_string the way I do the training data, but because I don't know the exchange rate beforehand, all three tokens of ["299.99", "5.00", "1.3289"] will be tagged the same (either with VALUE_TOKEN or a new UNIDENTIFIED_TOKEN).
I've looked into methods like most_similar and similarity but don't think they work for tokens that are not necessarily in the vocabulary. What methods should I use to do this? Is this the right approach?
Word2vec's fuzzy, dense embedded token representations don't strike me as the right tool for what you're doing, though they might perhaps be an indirect contributor to a hybrid approach.
In particular:
The word2vec algorithm originated from, and has the most consistent public results on, natural-language texts, with their particular patterns of relative token frequencies and varied co-occurrences. Certainly, many have applied it with success to other kinds of text/record data, but such uses may require a lot more preprocessing/parameter-tuning, and to the extent the underlying data has some fixed, highly repetitive scheme, it might be more amenable to other approaches.
If you replace all known values with 'VALUE_TOKEN', & all known rates with 'RATE_TOKEN', then the model is only going to learn token-vectors for 'VALUE_TOKEN' & 'RATE_TOKEN'. Such a model won't be able to supply any vector for non-replaced tokens it's never seen like '$1.2345' or '299.99'. Even collapsing all those to 'UNIDENTIFIED_TOKEN' just limits the model to whatever it learned earlier was the vector for 'UNIDENTIFIED_TOKEN' (if any, in the training data).
I've not noticed existing word2vec implementations offering an interface for inferring the word-vector of a new unknown word from just one or several new examples of its appearance in context. (They could, in the same style as the new-document-vector inference used by 'Paragraph Vectors'/Doc2Vec, but they just don't.) The closest I've seen is Gensim's predict_output_word(), which does a CBOW-like forward-propagation on negative-sampling models, to every 'output node' (one per known word), to give a ranked list of the known words most likely to appear given some context words.
That predict_output_word() might, if fed the surrounding known tokens, contribute to your needs by whether it says your 'VALUE_TOKEN' or 'RATE_TOKEN' is the more likely model prediction. You could adapt its code to evaluate only those two candidates, if you're always sure the right answer is one or the other, for a speed-up. A simple comparison of the average of the context-word vectors with the candidate-answer vectors might be as effective as the full forward-propagation.
Alternatively, you might want to use the word2vec model solely as a source of features (via context words) for some other classifier, which is trained to answer VALUE or RATE. This other classifier's input might include things like:
some average of the vectors of all nearby tokens
the full vectors of closest neighbors
a one-hot encoding ('bag-of-words') of all nearby (or 'preceding' and 'following') known tokens, assuming the vocabulary of non-numerical tokens is fairly short and highly indicative
If the data streams might include arbitrary new or corrupted tokens whose meaning might be inferrable from substrings, you could consider a FastText model as well.

Is this problem a classification or regression?

In a lecture from Andrew Ng, he asked whether the problem below is a classification or a regression problem. Answer: It is a regression problem.
You have a large inventory of identical items. You want to predict how
many of these items will sell over the next 3 months.
Looks like I am missing something. Per my understanding it should be a classification problem, because we have to classify each item into one of two categories, i.e. it can be sold or not, which are discrete values, not continuous ones.
Not sure where the gap in my understanding is.
Your thinking is that you have a database of items with their respective features and want to predict if each item will be sold. At the end, you would simply count the number of items that can be sold. If you frame the problem this way, then it would be a classification problem indeed.
However, note the following sentence in your question:
You have a large inventory of identical items.
Identical items means that all items will have exactly the same features. If you come up with a binary classifier that tells whether a product can be sold or not, since all feature values are exactly the same, your classifier would put all items in the same category.
I would guess that, to solve this problem, you would probably have access to the time-series of sold items per month for the past 5 years, for instance. Then, you would have to crunch this data and interpolate to the future. You won't be classifying each item individually but actually calculating a numerical value that indicates the number of sold items for 1, 2, and 3 months in the future.
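That forecasting step can be sketched as an ordinary regression on synthetic monthly counts (the history, the linear trend, and the noise level are all invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: monthly units sold over 5 years, linear trend plus noise.
rng = np.random.RandomState(0)
months = np.arange(60).reshape(-1, 1)
sold = 200 + 5 * months.ravel() + rng.normal(0, 10, 60)

reg = LinearRegression().fit(months, sold)

# Predict units sold for the next 3 months: a numerical output, not a category.
forecast = reg.predict(np.array([[60], [61], [62]]))
forecast_counts = np.rint(forecast).astype(int)
```

Note that rounding the predictions to whole units at the end does not turn the task into classification; the target is still a quantity on a numerical scale.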
According to Pattern Recognition and Machine Learning (Christopher M. Bishop, 2006):
Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression.
On top of that, it is important to understand the difference between categorical, ordinal, and numerical variables, as defined in statistics:
A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) and there is no intrinsic ordering to the categories.
(...)
An ordinal variable is similar to a categorical variable. The difference between the two is that there is a clear ordering of the variables. For example, suppose you have a variable, economic status, with three categories (low, medium and high). In addition to being able to classify people into these three categories, you can order the categories as low, medium and high.
(...)
A numerical variable is similar to an ordinal variable, except that the intervals between the values of the numerical variable are equally spaced. For example, suppose you have a variable such as annual income that is measured in dollars, and we have three people who make $10,000, $15,000 and $20,000.
Although your end result will be an integer (a discrete set of numbers), note it is still a numerical value, not a category. You can mathematically manipulate numerical values (e.g. calculate the average number of sold items in the next year, or find the peak number of sold items in the next 3 months) but you cannot do that with discrete categories (e.g. what would be the average of a cellphone and a telephone?).
Classification problems are the ones where the output is either categorical or ordinal (discrete categories, as per Bishop). Regression problems output numerical values (continuous variables, as per Bishop).
Your system might be restricted to outputting integers instead of real numbers, but that won't change the nature of the variable from being numerical. Therefore, your problem is a regression problem.

Model that predict both categorical and numerical output

I am building an RNN for a time-series model which has a categorical output.
For example, if the previous patterns are "A", "B", "A", "B", the model predicts the next is "A".
There's also a numerical level associated with each category.
For example, A is 100 and B is 50,
so the series is A(100), B(50), A(100), B(50).
I have the model framework to predict the next is "A"; it would be nice to predict the (100) at the same time.
For a real-life example, consider national weather data.
You are predicting the next few days' weather type (sunny, windy, raining, etc.); at the same time, it would be nice if the model also predicted the temperature.
Or, for Amazon, analyzing a customer's transaction patterns:
Customer A shopped in categories
electronics ($100), household ($10), ...
Predict which category this customer is likely to shop next, and at the same time predict the amount of that transaction.
Researched a bit, have not found any relevant research on similar topics.
What is stopping you from adding an extra output to your model? You could have one categorical output and one numerical output next to each other. Every neural network library out there supports multiple outputs.
You will need to normalise your output data, though. Categories should be one-hot encoded, and numerical values should be normalised, e.g. by dividing by some maximal value.
Researched a bit, have not found any relevant research on similar topics.
Because this is not really a 'topic'. This is something completely normal, and it does not require some special kind of network.
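A minimal sketch in Keras, with placeholder toy data and layer sizes: one shared RNN body, a softmax head for the category, and a linear head for the (scaled) level.

```python
import numpy as np
from tensorflow import keras

# Toy alternating A/B sequences: 3 one-hot steps in -> next category + its level.
# One-hot: A = [1, 0], B = [0, 1]; levels scaled by the max (A=100 -> 1.0, B=50 -> 0.5).
X = np.array([[[1, 0], [0, 1], [1, 0]],
              [[0, 1], [1, 0], [0, 1]]] * 64, dtype="float32")
y_cat = np.array([[0, 1], [1, 0]] * 64, dtype="float32")   # next category, one-hot
y_num = np.array([[0.5], [1.0]] * 64, dtype="float32")     # next level, scaled

inp = keras.Input(shape=(3, 2))
h = keras.layers.SimpleRNN(16)(inp)
out_cat = keras.layers.Dense(2, activation="softmax", name="category")(h)
out_num = keras.layers.Dense(1, name="level")(h)

# Two outputs, two losses: cross-entropy for the category, MSE for the level.
model = keras.Model(inp, [out_cat, out_num])
model.compile(optimizer="adam",
              loss={"category": "categorical_crossentropy", "level": "mse"})
model.fit(X, {"category": y_cat, "level": y_num}, epochs=5, verbose=0)
pred_cat, pred_num = model.predict(X[:1], verbose=0)
```

The two losses can also be weighted against each other (loss_weights in compile) if one output dominates training.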

Binary recommendation algorithms

I'm currently doing some research for a school assignment. I have two data streams, one is user ratings and the other is search, click and order history (binary data) of a webshop.
I found that collaborative filtering is the best family of algorithms if you are using rating data. I found and researched these algorithms:
Memory-based
    user-based
        Pearson correlation
        constrained Pearson
        vector similarity (cosine)
        mean squared difference
        weighted Pearson
        correlation threshold
        max number of neighbours
        weighted by correlation
        Z-score normalization
    item-based
        adjusted cosine
        maximum number of neighbours
        similarity fusion
Model-based
    regression-based
    Slope One
    LSI/SVD
    regularized SVD (RSVD/RSVD2/NSVD2/SVD++)
    integrated neighbor-based
    cluster-based smoothing
Now I'm looking for a way to use the binary data, but I'm having a hard time figuring out whether it is possible to use binary data instead of rating data with these algorithms, or whether there is a different family of algorithms I should be looking at.
I apologize in advance for spelling errors, since I have dyslexia and am not a native writer. Thanks marc_s for helping.
Take a look at data mining algorithms such as association rule mining (aka market basket analysis). You've come upon a tough problem in recommendation systems: unary and binary data are common but the best algorithms for personalization don't work well with them. Rating data can represent preference for a single user-item pair; e.g., I rate this movie 4 stars out of 5. But with binary data, we have the least granular type of rating data: I either like or don't like something, or have or have not consumed it.
Be careful not to confuse binary and unary data: unary data means that you have information that a user consumed something (which is coded as 1, much like binary data), but you have no information about whether a user didn't like or consume something (which is coded as NULL instead of binary data's 0). For instance, you may know that a person viewed 10 web pages, but you don't have any idea what she would have thought of other pages had she known they were available. That's unary data. You can't assume any preference information from NULL.
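For a feel of how binary data can still drive an item-based recommender, here is a tiny sketch with an invented interaction matrix: cosine similarity works directly on 0/1 vectors.

```python
import numpy as np

# Hypothetical binary user-item matrix (1 = clicked/ordered): 4 users x 3 items.
interactions = np.array([
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
    [1, 0, 0],
])

# Item-item cosine similarity computed directly from the binary columns.
norms = np.linalg.norm(interactions, axis=0)
sim = (interactions.T @ interactions) / np.outer(norms, norms)

# Score unseen items for user 3 (who only interacted with item 0).
user = interactions[3]
scores = sim @ user
scores = np.where(user == 1, 0, scores)  # don't re-recommend owned items
best_item = int(np.argmax(scores))
```

With unary data, as noted above, the 0s really mean "unknown", so such scores should be read as relative rankings rather than as dislike signals.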

Document classification using Naive Bayes

I have a question regarding the particular Naive Bayes algorithm that is used in document classification. Following is what I understand:
1. construct some probability of each word in the training set for each known classification
2. given a document, we extract all the words that it contains
3. multiply together the probabilities of the words being present in a classification
4. perform (3) for each classification
5. compare the results of (4) and choose the classification with the highest posterior
What I am confused about is the part when we calculate the probability of each word given training set. For example for a word "banana", it appears in 100 documents in classification A, and there are totally 200 documents in A, and in total 1000 words appears in A. To get the probability of "banana" appearing under classification A do I use 100/200=0.5 or 100/1000=0.1?
I believe your model will classify more accurately if you count the number of documents the word appears in, not the number of times the word appears in total. In other words,
Classify "Mentions Fruit":
"I like Bananas."
should be weighed no more or less than
"Bananas! Bananas! Bananas! I like them."
So the answer to your question would be 100/200 = 0.5.
The description of Document Classification on Wikipedia also supports my conclusion:
Then the probability that a given document D contains all of the words W, given a class C, is p(D|C) = prod_i p(w_i|C)
http://en.wikipedia.org/wiki/Naive_Bayes_classifier
In other words, the document classification algorithm Wikipedia describes tests how many of the list of classifying words a given document contains.
By the way, more advanced classification algorithms will examine sequences of N-words, not just each word individually, where N can be set based on the amount of CPU resources you are willing to dedicate to the calculation.
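The two counting conventions correspond to the Bernoulli and multinomial event models for Naive Bayes. With the numbers from the question (and assuming, for the token count, that each of those 100 documents mentions "banana" exactly once):

```python
# Numbers from the question: class A has 200 documents, "banana" appears in
# 100 of them, and class A contains 1000 word tokens in total.
docs_in_A = 200
docs_with_banana = 100
tokens_in_A = 1000
banana_tokens = 100  # assumption: one mention per document

# Bernoulli event model: P(word present | class) from document counts.
p_bernoulli = docs_with_banana / docs_in_A        # 0.5

# Multinomial event model: P(word | class) from token counts.
p_multinomial = banana_tokens / tokens_in_A       # 0.1
```

In practice both estimates are usually smoothed (e.g. Laplace/add-one) so that unseen words don't zero out the whole product.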
UPDATE
My direct experience is based on short documents. I would like to highlight research that @BenAllison points out in the comments, which suggests my answer is invalid for longer documents. Specifically:
One weakness is that by considering only the presence or absence of terms, the BIM ignores information inherent in the frequency of terms. For instance, all things being equal, we would expect that if 1 occurrence of a word is a good clue that a document belongs in a class, then 5 occurrences should be even more predictive.
A related problem concerns document length. As a document gets longer, the number of distinct words used, and thus the number of values of x(j) that equal 1 in the BIM, will in general increase.
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529
