k nearest neighbor in SAS: how to get the neighbor list for each row? - machine-learning

I'm currently using PROC DISCRIM in SAS to run a kNN analysis on a data set, but the problem may require me to get the top-k neighbor list for each row in my table. How can I get this list from SAS?
Thanks for the answer, but I'm looking for the neighbor list for each data point. For example, if I have the data set:
name age zipcode alcohol
John 26 08439 yes
Cathy 49 47789 no
Smith 37 90897 no
Tom 34 88642 yes
then I need the list:
name neighbor1 neighbor2
John Tom Cathy
Cathy Tom Smith
Smith Cathy Tom
Tom John Cathy
I could not find this output in SAS. Is there any way I can program it to get this list? Thank you!

I am not a SAS user, but a quick web lookup seems to give a good answer to your problem:
As far as I know, you do not have to implement it yourself. DISCRIM is enough.
Code for iris data from http://www.sas-programming.com/2010/05/k-nearest-neighbor-in-sas.html
ods select none;
proc surveyselect data=iris out=iris2
samprate=0.5 method=srs outall;
run;
ods select all;
%let k=5;
proc discrim data=iris2(where=(selected=1))
test=iris2(where=(selected=0))
testout=iris2testout
method=NPAR k=&k
listerr crosslisterr;
class Species;
var SepalLength SepalWidth PetalLength PetalWidth;
title2 'Using KNN on Iris Data';
run;
A long and detailed description is also available here:
http://analytics.ncsu.edu/sesug/2012/SD-09.pdf
And from the sas community:
Simply ask PROC DISCRIM to use the nonparametric method with the option "METHOD=NPAR K=". Do not use the "R=" option at the same time, as it corresponds to the radius-based nearest-neighbor method. Also pay attention to how PROC DISCRIM treats categorical data automatically; sometimes you may want to convert categorical data into metric coordinates in advance. Since PROC DISCRIM doesn't output the tree it builds internally, use the "data= test= testout=" options to score a new data set.
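If the per-row neighbor list cannot be read off the PROC DISCRIM output, one fallback is to compute it directly with a brute-force search. The following is a minimal Python sketch, not SAS. The function name `neighbor_lists`, the choice of Euclidean distance, and the use of age as the only feature are assumptions made for illustration, so the lists it produces are not meant to reproduce the illustrative table in the question.

```python
import math

def neighbor_lists(rows, k):
    """Return {name: [k nearest other names]} using Euclidean distance.

    rows maps a row name to a tuple of numeric features (hypothetical layout).
    """
    result = {}
    for name, feats in rows.items():
        # Distance from this row to every other row, ties broken by name.
        others = sorted(
            (math.dist(feats, other_feats), other)
            for other, other_feats in rows.items()
            if other != name
        )
        result[name] = [other for _, other in others[:k]]
    return result

# Age is used as the only feature here (an assumption for illustration).
people = {"John": (26,), "Cathy": (49,), "Smith": (37,), "Tom": (34,)}
lists = neighbor_lists(people, k=2)
for name, neighbors in lists.items():
    print(name, *neighbors)
```

The pairwise loop is quadratic in the number of rows, which is fine for small tables; features on different scales would normally be standardized first.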

Related

How to search a string for multiple multi-word phrases in Swift or Objective-C

I want to parse a large number of strings for canned phrases or names and then store the names, if found, in an array where the order counts.
So for example, starting with a string such as:
str = "The movie stars Robert Duvall and James Earl Jones and pits them against a villain played expertly by Brando in an action packed adventure."
I would like to search against an array of actors:
names = [Robert Duvall, Henry Fonda, Brando, Marlon Brando, Jane Fonda, James Earl Jones, Peter Fonda, Montgomery Clift] etc where the actors can have one, two or three names.
Initially, I could simply check for a match on the triples using strpos or convert the string to triples and do a match on triples as in James Earl Jones. Then I could remove his name and search the remainders for other doubles or individual words. However, this approach starts to get very complicated quickly and I'm wondering if there isn't a more elegant approach.
//This road looks very messy indeed...
NSArray *triples = [self getTriples:str]; // get all combinations of three sequential words
NSArray *pieces = [NSMutableArray new];
NSMutableArray *matches = [NSMutableArray new];
for (NSUInteger i = 0; i < [triples count]; i++) {
    NSString *phrase = triples[i];
    for (NSUInteger j = 0; j < [names count]; j++) {
        NSString *name = names[j];
        if ([phrase caseInsensitiveCompare:name] == NSOrderedSame) {
            [matches addObject:phrase];
            // rumps has two elements, before and after (if the phrase occurs once)
            NSArray *rumps = [str componentsSeparatedByString:phrase];
            NSString *start = rumps[0];
            NSString *end = rumps[1];
            // Search before for a name
            // Search after for a name
        }
    }
} // end triples
Thanks for any suggestions.
Here is an idea based on your names string.
Split names on commas and store them in an array, say a1
Loop through a1 and see if you have any match on full name
If not, loop a1 again and split on spaces into a2
Here I am not that clear on your logic, but maybe like so? Now in this inner loop, where you loop a2
If a2 has three elements / names then you assume no match? Or you can check all possible combinations, not too bad for just 3 (123 already checked, then 132, 213, 231, 312, 321 and you're done with 3 names).
If it has two elements only check in reverse (21, you already checked 12).
If still no match you can check on the individual elements of a2 if that is what you want, so check on 1, 2 (and possibly 3).
Any match you use the corresponding a1 element - which is what you want, the full name, right?
You can use an index set and set the index into a1 - the actor you found to prevent dups.
Here is one possible algorithm sketch, there will be no real code – indeed as I write this it has not been written in Objective-C or Swift, it is an algorithm which can be implemented in both (and other) languages.
In coding it you may find the algorithm missed something (i.e. there could be errors, this was written directly into the answer, it is a sketch!), in which case go back and refine the algorithm and repeat.
Our sample name list:
James Earl Jones, James, Marlon Brando, Earl Jones, Brando, James Earl
and sample text:
James, James Earl and James Earl Jones all regular meet for coffee
The algorithm is based around the observations:
[Note: In the description we assume left-to-right text and that search for a match moves left-to right. The algorithm will work for right-to-left with simple adjustments, for mixed direction text it will get messier!]
Matches cannot overlap. E.g. "James Earl" is not both "James" and "James Earl". We say a match consumes the text.
Only names which are prefixes of another one need care, ones which are "postfixes" do not. E.g. if looking for "James" and "James Earl" you must look for the latter first to avoid getting a match on "James" and then missing "James Earl", as a match on "James" has consumed those characters. However "Earl Jones" and "James Earl Jones" can be searched for at the same time, the latter will match first.
In a collection of names which do not contain any prefixes they can all be matched at the same time using a regular expression with alternates. E.g. "James Earl Jones" and "Earl Jones" can be matched by the RE "James Earl Jones|Earl Jones"
When you have prefixes, so you search for the longer first, a match for a shorter name can only occur to the left of the match for the longer one.
The algorithm uses regular expression matching, as provided by NSRegularExpression in Objective-C & Swift; and ranges, as provided by NSRange, which allow searching in part of a string.
The outline:
1. Sort your names. E.g.:
Brando, Earl Jones, James, James Earl, James Earl Jones, Marlon Brando
2. Divide your names into two lists by removing any name which is a prefix of its immediately following name and placing it into a second list. E.g.:
Brando, Earl Jones, James Earl Jones, Marlon Brando
James, James Earl
3. If the second list is not empty, repeat step (2) on it, producing a third list; keep repeating until no prefixes have been removed. E.g. our sample names produce the 3 lists:
Brando, Earl Jones, James Earl Jones, Marlon Brando
James Earl
James
4. Convert each list to a regular expression using alternation to produce a list of regular expressions to use in searching. E.g.:
"Brando|Earl Jones|James Earl Jones|Marlon Brando", "James Earl", "James"
(At this point we realise the sample names could have been better as only the first RE required alternation. Oh well...)
Now we are ready to use our prepared regular expressions to find the matches.
5. Set the search range to the whole text, and the match range to be empty/no value.
6. Set the current RE to the first.
7. Using the current RE, search for the first match within the search range to produce a match range. If there is no new match goto (9). E.g. using our sample, where the match range is indicated by []'s:
James, James Earl and [James Earl Jones] all regular meet for coffee
8. Set the new search range to run from the start of the current search range to the end of the match range, advance the current RE, and goto (7). E.g. the sequence of matches for the first name goes:
James, James Earl and [James Earl Jones] all regular meet for coffee
James, [James Earl] and James Earl Jones
[James], James Earl
9. We now have our first matching range; record it, set the new search range to run from the end of the matched range to the end of the text, and if this new search range is non-empty goto (6).
10. Done, we have the list of matches.
If you don't want the list of actual matches but just a collection of unique matches then accumulate a set (e.g. NSMutableSet/Set) of matches as you go.
Have fun coding (and refining, coding...) the algorithm. If you get stuck ask a new question, reference this Q&A, describe your algorithm as it is then, show your implementation, detail your problem, etc. and someone will undoubtedly help you along. HTH.
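For readers who want something to run, the outline can be sketched as follows. This is a Python sketch using the `re` module rather than NSRegularExpression, and the function names `split_prefix_lists` and `find_names` are made up for illustration. It also makes two refinements that are assumptions rather than part of the outline: an RE that finds nothing is skipped instead of ending the search, and a match from a later (shorter-name) RE is only accepted if it starts strictly before the current candidate, so that "James" cannot displace a "James Earl" starting at the same position.

```python
import re

def split_prefix_lists(names):
    """Sort the names, then repeatedly move every name that is a prefix of
    its immediate successor into the next list."""
    current = sorted(n for n in names if n)  # ignore empty names
    lists = []
    while current:
        kept, removed = [], []
        for i, name in enumerate(current):
            is_prefix = i + 1 < len(current) and current[i + 1].startswith(name)
            (removed if is_prefix else kept).append(name)
        lists.append(kept)
        current = removed
    return lists

def find_names(text, names):
    """Return the matched names in the order they occur in text."""
    # One alternation RE per prefix-free list.
    regexes = [
        re.compile("|".join(re.escape(n) for n in group))
        for group in split_prefix_lists(names)
    ]
    matches = []
    start = 0
    while start < len(text):
        end = len(text)
        best = None
        for rx in regexes:
            m = rx.search(text, start, end)
            if m is None:
                continue  # refinement: try the remaining REs anyway
            if best is None or m.start() < best.start():
                best = m          # earlier match wins
                end = m.end()     # narrow the range for shorter names
        if best is None:
            break
        matches.append(best.group())
        start = best.end()        # matched text is consumed
    return matches

names = ["James Earl Jones", "James", "Marlon Brando",
         "Earl Jones", "Brando", "James Earl"]
text = "James, James Earl and James Earl Jones all regular meet for coffee"
print(find_names(text, names))
```

Matching here is case-sensitive and ignores word boundaries; both would need attention in a real implementation.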

How to handle release year difference in movie recommendation

I am part of a movie recommendation project. We have developed a doc2vec model using gensim.
You can have a look at gensim documentation if needed.
https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar
I trained the model, and when I take the top 10 similar movies for a film based on cast, it returns very old movies (release_yr of 1960, 1950, ...). I tried including release_yr as a parameter to the gensim model, but it still shows me old movies. How can I handle this release_yr difference? When I look at the top 10 recommendations for a film, I need movies whose release_yr difference is small (such as movies from within 10 years, not more). How can I do that?
Code for the doc2vec model
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def d2v_doc(titles_df):
    # One tagged document per title, tagged with its id_titles value
    tagged_data = [TaggedDocument(words=_d, tags=[str(titles_df['id_titles'][i])]) for i, _d in enumerate(titles_df['doc'])]
    model_d2v = Doc2Vec(vector_size=300, min_count=10, dm=1)
    model_d2v.build_vocab(tagged_data)
    model_d2v.train(tagged_data, epochs=100, total_examples=model_d2v.corpus_count)
    return model_d2v
titles_df dataframe contains columns(id_titles, title, release_year, actors, director, writer, doc)
col_names = ['actors', 'director','writer','release_year']
titles_df['doc'] = titles_df[col_names].apply(lambda x: ' '.join(x.astype(str)), axis=1).str.split()
Code for the top 10 similar movies
import pandas as pd

def titles_lookup(similar_doc, titles_df):
    df = pd.DataFrame(similar_doc, columns=['id_titles', 'similarity'])
    df = pd.merge(df, titles_df[['id_titles', 'title', 'release_year']], on='id_titles', how='left')
    print(df)

def demo_d2v_title(model, titles_df, id_titles):
    similar_doc = model.docvecs.most_similar(id_titles)
    titles_lookup(similar_doc, titles_df)

def demo(model, titles_df):
    print('hunt for red october')
    demo_d2v_title(model, titles_df, 'tt0099810')
The output for Top 10 similar movies for film - "hunt for red october"
id_titles similarity title release_year
0 tt0105112 0.541722 Patriot Games 1992.0
1 tt0267626 0.524941 K19: The Widowmaker 2002.0
2 tt0112740 0.496758 Crimson Tide 1995.0
3 tt0052151 0.471951 Run Silent Run Deep 1958.0
4 tt1922685 0.464007 Phantom 2013.0
5 tt0164184 0.462187 The Sum of All Fears 2002.0
6 tt0058962 0.459588 The Bedford Incident 1965.0
7 tt0109444 0.456760 Clear and Present Danger 1994.0
8 tt0063121 0.455807 Ice Station Zebra 1968.0
9 tt0146309 0.452572 Thirteen Days 2001.0
You can see from the output that I'm still getting old movies. Please help me solve this.
Thanks in advance.
Doc2Vec only knows text-similarity; it has no notion of other fields.
So if you want to discard matches according to some criterion other than text-similarity, one that is only represented outside the Doc2Vec model, you'll have to do that in a separate step.
So, you could use .most_similar() with a topn=len(model.docvecs) parameter to get back all movies, ranked. Then, filter that result-set by discarding any whose year is too far from your desired year. Then, trim that result-set to the top N that you really want.
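The filter-then-trim step can be sketched in plain Python. This is an illustrative sketch: `filter_by_year`, its parameters, and the `release_years` lookup dict are hypothetical names, and `ranked` stands in for the full result of `model.docvecs.most_similar(id_titles, topn=len(model.docvecs))` (in gensim 4 the attribute is `model.dv`). The sample rows are taken from the output in the question, and 1990 is assumed as the release year of the query film.

```python
def filter_by_year(ranked, release_years, target_year, max_gap=10, topn=10):
    """Keep the best-ranked ids whose release year is within max_gap years.

    ranked: list of (id, similarity) pairs, best first.
    release_years: dict mapping id -> release year (hypothetical lookup).
    """
    kept = []
    for doc_id, score in ranked:
        year = release_years.get(doc_id)
        if year is None:
            continue  # no year on record, cannot judge the gap
        if abs(year - target_year) <= max_gap:
            kept.append((doc_id, score))
            if len(kept) == topn:
                break
    return kept

# Rows copied from the question's output table.
ranked = [
    ("tt0105112", 0.541722), ("tt0267626", 0.524941), ("tt0112740", 0.496758),
    ("tt0052151", 0.471951), ("tt1922685", 0.464007), ("tt0164184", 0.462187),
    ("tt0058962", 0.459588), ("tt0109444", 0.456760), ("tt0063121", 0.455807),
    ("tt0146309", 0.452572),
]
release_years = {
    "tt0105112": 1992, "tt0267626": 2002, "tt0112740": 1995, "tt0052151": 1958,
    "tt1922685": 2013, "tt0164184": 2002, "tt0058962": 1965, "tt0109444": 1994,
    "tt0063121": 1968, "tt0146309": 2001,
}
recent = filter_by_year(ranked, release_years, target_year=1990)
for doc_id, score in recent:
    print(doc_id, score, release_years[doc_id])
```

Because the filter discards candidates, it needs the full ranking as input; filtering only the default top 10 can leave fewer than N results.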

How to predict an item's category given its name?

Currently I have a database of about 600,000 records representing merchandise with category information, like the following:
{'title': 'Canon camera', 'category': 'Camera'},
{'title': 'Panasonic refrigerator', 'category': 'Refrigerator'},
{'title': 'Logo', 'category': 'Toys'},
....
But some merchandise has no category information.
{'title': 'Iphone6', 'category': ''},
So I'm wondering whether it is possible to train a text classifier on my items' names using scikit-learn, to help me predict which category an item should belong to. I'm framing this as multi-class text classification, but there are also one or more pictures for each item, so maybe deep learning/Keras could also be used?
I don't know the best way to solve this problem, so any suggestion or advice is welcome. Thank you for reading this.
P.S. the actual text is in Japanese
You could build a 2-char / 3-char model and calculate values, e.g. how often the 3-gram "pho" appears in the category "Camera".
trigrams = {}
for record in records:  # only the ones with categories
    title = record['title']
    cat = record['category']
    for trigram in zip(title, title[1:], title[2:]):
        if trigram not in trigrams:
            # first occurrence: start every category count at zero
            trigrams[trigram] = {}
            for category in categories:
                trigrams[trigram][category] = 0
        trigrams[trigram][cat] += 1
Now you can use the title's trigrams to calculate a score:
scores = []
for trigram in zip(title, title[1:], title[2:]):
    if trigram not in trigrams:
        continue  # trigram never seen in the labelled data
    score = []
    for cat in categories:
        score.append(trigrams[trigram][cat])
    # Normalize
    sum_ = float(sum(score))
    score = [s / sum_ for s in score]
    scores.append(score)
Now scores contains a probability distribution for every trigram: P(class | trigram). It does not take into account that some classes are just more common (the prior, see Bayes' theorem). I'm currently also not quite sure whether you should do something about the problem that some titles might just be really long and thus have a lot of trigrams. I guess taking the prior into account handles that already.
If it turns out that you have many trigrams missing, you could switch to bigrams. Or simply do Laplace smoothing.
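One way to fold in both the prior and Laplace smoothing is a small naive Bayes classifier over character trigrams. Note that this swaps the per-trigram P(class | trigram) table above for P(trigram | class) times the class prior, so it is a related technique rather than the exact scoring described. The sketch below uses only the standard library; the names `char_ngrams`, `train`, and `predict`, the lowercasing, and the toy records are assumptions for illustration.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """All overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(records, n=3):
    """Count n-grams per category and titles per category (the prior)."""
    counts = defaultdict(Counter)  # category -> n-gram -> count
    priors = Counter()             # category -> number of titles
    vocab = set()
    for record in records:
        cat = record['category']
        if not cat:
            continue  # unlabeled records cannot be used for training
        priors[cat] += 1
        for gram in char_ngrams(record['title'].lower(), n):
            counts[cat][gram] += 1
            vocab.add(gram)
    return counts, priors, vocab

def predict(title, model, n=3, alpha=1.0):
    """Return the category with the highest smoothed log-probability."""
    counts, priors, vocab = model
    total_titles = sum(priors.values())
    best_cat, best_score = None, -math.inf
    for cat in priors:
        total = sum(counts[cat].values())
        score = math.log(priors[cat] / total_titles)  # log prior
        for gram in char_ngrams(title.lower(), n):
            # Laplace smoothing: unseen n-grams get a small nonzero probability
            score += math.log((counts[cat][gram] + alpha) / (total + alpha * len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

records = [
    {'title': 'Canon camera', 'category': 'Camera'},
    {'title': 'Nikon camera', 'category': 'Camera'},
    {'title': 'Panasonic refrigerator', 'category': 'Refrigerator'},
    {'title': 'Iphone6', 'category': ''},
]
model = train(records)
print(predict('Sony camera', model))
```

Summing log-probabilities avoids underflow on long titles, and the prior term addresses the point that some categories are simply more common.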
edit: I've just seen that the text is in Japanese. I think the n-gram approach might be useless there. You could translate the name. However, it is probably easier to just take other sources for this information (e.g. wikipedia / amazon / ebay?)

How mahout user based recommendation works?

I am using the generic user-based recommender of the Mahout Taste API to generate recommendations.
I know it recommends based on ratings given by other users in the past, but I don't understand the mathematics behind how it selects the recommended item. For example:
for user id 58
itemid ratings
231 5
235 5.5
245 5.88
The 3 neighbors, with item IDs and ratings, are:
{231 4, 254 5, 262 2, 226 5}
{235 3, 245 4, 262 3}
{226 4, 262 3}
It recommends item 226 to me. How?
Thanks in advance,
It depends on the UserSimilarity and the UserNeighborhood you have chosen for your recommender. But in general the algorithm works as follows for user u:
for every other user w
    compute a similarity s between u and w
retain the top users, ranked by similarity, as a neighborhood n
for every item i that some user in n has a preference for, but that u has no preference for yet
    for every other user v in n that has a preference for i
        compute a similarity s between u and v
        incorporate v's preference for i, weighted by s, into a running average
Source: Mahout in Action http://manning.com/owen/
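The weighted running average in the last step can be sketched in Python using the data from the question. This is not Mahout code: `estimate_preferences` is a hypothetical name, and the similarity values 0.9, 0.8, and 0.5 are made up, since the real values depend on the chosen UserSimilarity. The resulting ranking therefore need not match what Mahout returns.

```python
from collections import defaultdict

def estimate_preferences(user_prefs, neighbors):
    """Estimate ratings for items the user has not rated yet.

    user_prefs: dict item -> rating for the target user.
    neighbors: list of (similarity, prefs dict) pairs for the neighborhood.
    Returns (item, estimate) pairs, best estimate first.
    """
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for similarity, prefs in neighbors:
        for item, rating in prefs.items():
            if item in user_prefs:
                continue  # the user already has a preference for this item
            weighted_sum[item] += similarity * rating
            weight_total[item] += similarity
    estimates = {
        item: weighted_sum[item] / weight_total[item]
        for item in weighted_sum
        if weight_total[item] > 0
    }
    return sorted(estimates.items(), key=lambda pair: pair[1], reverse=True)

user_58 = {231: 5, 235: 5.5, 245: 5.88}
neighbors = [
    (0.9, {231: 4, 254: 5, 262: 2, 226: 5}),  # similarities are assumed values
    (0.8, {235: 3, 245: 4, 262: 3}),
    (0.5, {226: 4, 262: 3}),
]
ranked = estimate_preferences(user_58, neighbors)
for item, estimate in ranked:
    print(item, round(estimate, 3))
```

Only items 226, 254, and 262 are candidates, because the user has already rated 231, 235, and 245; which of them ranks first depends entirely on the similarity weights.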

How to boost score of Solr search results based on criteria?

Background:
1 - I'm using WebSolr for this search.
2 - I have two fields stored in websolr - name and id.
I want to search for these entries based on name AND boost the search score based on these criteria:
if id in [x1,x2..xN] then +2
if id in [y1,y2..yN] then +1
else +0
From my research, the answer lies in the following
- Function query, or
- DisMaxQParser
I have looked at the documentation, but IMO it's not very comprehensive.
Any help is appreciated.
You can use boosts. Try a query like
name:searchString AND (id:[x1 TO xN]^2 OR id:[y1 TO yN]^1)
In addition to hkn's approach, you could also use DisMax query parser boost queries:
q=queryString
&defType=dismax
&qf=…
&bq=id:[x1+TO+xN]^3
&bq=id:[y1+TO+yN]^2
(Untested, but should convey the idea.)
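The range syntax `id:[x1 TO xN]` matches every id between the two endpoints. If the ids are arbitrary sets rather than contiguous ranges, the boost clause can instead list them explicitly as `id:(x1 OR x2 ...)^2`. The Python sketch below only builds the request parameters; the helper names `boost_clause` and `build_params`, the sample ids, and the use of `name` as the `qf` field are assumptions for illustration, and the result is untested against a live Solr instance.

```python
from urllib.parse import urlencode

def boost_clause(field, ids, boost):
    """Build a boosted clause such as id:(11 OR 12)^2 for an explicit id set."""
    return "{}:({})^{}".format(field, " OR ".join(str(i) for i in ids), boost)

def build_params(search, top_ids, mid_ids):
    """DisMax parameters: match on name, boost ids in the two sets."""
    return [
        ("q", search),
        ("defType", "dismax"),
        ("qf", "name"),                          # assumed query field
        ("bq", boost_clause("id", top_ids, 2)),  # stronger boost
        ("bq", boost_clause("id", mid_ids, 1)),  # weaker boost
    ]

# Hypothetical ids, purely for illustration.
params = build_params("camera", [11, 12], [21, 22])
query_string = urlencode(params)
print(query_string)
```

Because boost queries are optional clauses, documents whose id is in neither set still match on name and simply receive no extra boost.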
