Mahout's recommenditembased returns already existing item - mahout

I have used mahout's (v 0.9) recommenditembased with arguments
--input /usr_pref.csv --numRecommendations 10 --output /out/ --tempDir /temp1/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
On checking the result I found out that it recommended for a user an item that he had already rated. Why did something like this happen?
Thank you for your time.
As requested here is a snippet of the recommendations:
34175 [89005462:1.7624004,89017464:0.11477072,89011967:0.11375865,89007606:0.113421306,14103126:0.11096669,89002502:0.10888276,14103124:0.106607914,89011035:0.10636083,40111014:0.104254685,89016109:0.104254685]
and the corresponding line from user preferences:
34175,89005462,0.07596562
I have uploaded the two files to Dropbox.
recommendations: https://www.dropbox.com/s/uapzq0926y7427p/outusrpref_final
user preferences: https://www.dropbox.com/s/6nru9799udgrzl8/usr_pref_final.csv
UPDATE
Acting on the idea that my problem had to do with the range of my ratings, I multiplied them by 100 and then truncated them to two decimal digits. After running the recommender again I found no duplicates. Still, I don't understand why this happens.

Apache Mahout is recommending an item that the user has already rated?
It is possible that the input file you pass to Mahout does not contain that user's rating for the item.
For example:
If you pass input.csv to Mahout, check that the file is up to date, i.e. that it contains a row with that user ID, that item ID and the rating.
Mahout should not recommend an item whose rating is already present in your input file.
Solution:
Add the user's existing rating to your input file, run the job again and check the result. This may fix your problem.

Example:
Now check the input file that you are going to feed to Mahout.
Example: input.csv
979 300 2.0
979 400 1.0
800 200 3.0
800 300 4.0
Recommendations.csv (In this case userid 979,itemid 200,ratings 1.0)
979 [200:1.0]
800 [400:2.0]
Note:
Mahout will recommend only item 200 for user ID 979. It will not recommend items 300 and 400, since they have already been rated and are stored in the input.csv that is fed to Mahout.
Likewise, open your two files and cross-check them manually. As far as I know, Mahout will not recommend an item that has already been rated.
Suggestion:
For testing create a small set of input data and test it so that it would be easy to track and identify.
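The manual cross-check suggested above can also be scripted. The following is a minimal sketch, assuming the preference file has `user,item,rating` rows and each recommendation line looks like `user [item:score,item:score,...]`, as in the snippets in the question. The function names are hypothetical, item IDs are assumed to be numeric, and the second preference row is made-up sample data.

```python
import csv
import re


def load_rated_items(pref_lines):
    """Map each user ID to the set of item IDs that user already rated."""
    rated = {}
    for row in csv.reader(pref_lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        user, item = row[0].strip(), row[1].strip()
        rated.setdefault(user, set()).add(item)
    return rated


def find_already_rated(rec_lines, rated):
    """Yield (user, item) pairs where a recommended item was already rated."""
    for line in rec_lines:
        line = line.strip()
        if not line:
            continue
        # Everything before "[" is the user ID, the rest is item:score pairs.
        user, _, rest = line.partition("[")
        user = user.strip()
        for item in re.findall(r"(\d+):", rest):
            if item in rated.get(user, set()):
                yield user, item


# First row is taken from the question; the second row is made up.
prefs = ["34175,89005462,0.07596562", "34175,11111111,0.5"]
recs = ["34175\t[89005462:1.7624004,89017464:0.11477072]"]

duplicates = list(find_already_rated(recs, load_rated_items(prefs)))
print(duplicates)
```

With real data, the two lists could be replaced by open file handles for the preference file and the recommendation output.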

Related

How to check an input string contains street address or not?

We want to identify the address fields in a document. To do so, we converted the document to OCR files using Tesseract. From the Tesseract output we want to check whether a string contains an address field or not. What is the right strategy to solve this problem?
It's not possible to solve this problem with a regex, because address fields differ across documents and countries.
We tried NLTK for classifying the words, but it does not work well for address fields.
Required output
I am staying at 234 23 Philadelphia - Contains address field <234 23 Philadelphia>
I am looking for a place to stay - Does not contain an address
Please provide your suggestions for solving this problem.
As in many ML problems, there are multiple possible solutions, and the important part (the one that commonly has the greater impact) is not which algorithm or model you use, but feature engineering, data preprocessing, standardization and the like. The first solution that comes to my mind (it's just an idea; I would test it and see how it performs) is:
Take your training examples and list the N most commonly used words across all of them (that's your vocabulary). Every word is then represented by a number (its index in the list).
Transform your training examples: read every training example and replace every word with its number in the vocabulary.
Finally, for every training example create a feature vector of the same size as the vocabulary. For every word in the vocabulary, the feature vector holds 0 (the word doesn't occur in the example) or 1 (it occurs), or the count of how many times the word appears (again, this is feature engineering).
Train multiple classifiers, varying algorithms, parameters, training set sizes, etc., and use cross-validation to choose your best model.
And from there follow the standard ML workflow...
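The vocabulary and feature-vector steps above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: tokenization is a lowercase whitespace split, and the helper names `build_vocabulary` and `to_feature_vector` are hypothetical. The two example sentences come from the question.

```python
from collections import Counter


def tokenize(text):
    # Assumption: a lowercase whitespace split is good enough for a sketch.
    return text.lower().split()


def build_vocabulary(examples, n):
    """Map each of the n most common words to its index in the vocabulary."""
    counts = Counter(word for text in examples for word in tokenize(text))
    return {word: index for index, (word, _) in enumerate(counts.most_common(n))}


def to_feature_vector(text, vocabulary, binary=True):
    """Return a presence (0/1) vector or, with binary=False, a count vector."""
    vector = [0] * len(vocabulary)
    for word in tokenize(text):
        index = vocabulary.get(word)
        if index is not None:
            vector[index] = 1 if binary else vector[index] + 1
    return vector


examples = [
    "I am staying at 234 23 Philadelphia",
    "I am looking for a place to stay",
]
vocabulary = build_vocabulary(examples, n=5)
vectors = [to_feature_vector(text, vocabulary) for text in examples]
print(vocabulary)
print(vectors)
```

The resulting vectors can then be passed to whichever classifiers you want to compare during cross-validation.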
If you are interested in just checking YES or NO and not extraction of complete address, One simple solution can be NER.
You can try to check if Text contains Location or not.
For Example :
import nltk

def check_location(text):
    # Tokenize, POS-tag and NE-chunk the text; named entities come back as subtrees.
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))):
        if hasattr(chunk, "label"):
            # GPE and GSP are the location-like entity labels checked here.
            if chunk.label() == "GPE" or chunk.label() == "GSP":
                return "True"
    return "False"

text = "I am staying at 234 23 Philadelphia."
print(text + " - " + check_location(text))
text = "I am looking for a place to stay."
print(text + " - " + check_location(text))
Output:
# I am staying at 234 23 Philadelphia. - True
# I am looking for a place to stay. - False
If you want to extract complete address as well, you will need to train your own model.
You can check: NER with NLTK , CRF++.
You're right. Using regex to find an address in a string is messy.
There are APIs that will attempt to extract addresses for you. These APIs are not guaranteed to extract every address from a string, but they will do their best. One example of a street address extraction API is from SmartyStreets. Documentation here and demo here.
Something to consider is that even your example (I am staying at 234 23 Philadelphia) doesn't contain a full address. It's missing a state or ZIP code field. This makes it very difficult to programmatically determine whether there is an address. Once a state or ZIP code is added to that sample string (I am staying at 234 23 Philadelphia PA), it becomes much easier to programmatically determine whether the string contains an address.
Disclaimer: I work for SmartyStreets
A better method for this task could be as follows:
Train your own custom NER model (extending spaCy's pre-trained model, or building your own CRF++ / CRF-biLSTM model if you have annotated data), or use a pre-trained model such as spaCy's large model or geopandas, etc.
Define a weighted scoring mechanism based on your problem statement.
For example, let's assume every address has 3 important components: an address, a telephone number and an email ID.
Text that has all three of them would get a score of 33.33% + 33.33% + 33.33% = 100%.
To identify whether it's an address field or not, you may take into account the percentage of spaCy's location tags (GPE, FAC, LOC, etc.) out of the total tokens in the text, which gives a good estimate of how many location tags are present. Then run a regex for postal codes and match the found city names against the 3-4 words just before the found postal code. If there's an overlap, you have correctly identified a postal code and hence an address field (got your 33.33% score!).
For telephone numbers, certain checks and regexes could do it, but an important criterion is that these phone checks are performed only if an address field has been located in the text above.
For emails/web addresses, again you could perform nominal regex checks, and finally add all 3 scores into a cumulative value.
An ideal address would get a score of 100, while missing fields will yield 66%, etc. The rest of the text would get a score of 0.
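A rough sketch of this scoring scheme follows. Everything in it is an assumption made for illustration: the three regular expressions (the postal one only covers US ZIP codes), the function name `address_score`, and the `known_cities` set, which stands in for the city names that a real NER model would return.

```python
import re

# Illustrative patterns only; real data will need broader ones.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
POSTAL_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")  # assumption: US ZIP codes


def address_score(text, known_cities):
    """Return a 0-100 score with one third each for address, phone and email."""
    score = 0.0
    postal = POSTAL_RE.search(text)
    has_address = False
    if postal:
        # Compare the 4 words just before the postal code with the city names.
        words_before = text[:postal.start()].replace(",", " ").split()[-4:]
        has_address = any(word.lower() in known_cities for word in words_before)
    if has_address:
        score += 1 / 3
        # Phone check runs only when an address was located; the postal code
        # is cut out first so it cannot be mistaken for part of a phone number.
        remainder = text[:postal.start()] + text[postal.end():]
        if PHONE_RE.search(remainder):
            score += 1 / 3
    if EMAIL_RE.search(text):
        score += 1 / 3
    return round(score * 100, 2)


cities = {"philadelphia"}  # stand-in for NER output
full = "Visit us at 12 Main St, Philadelphia PA 19104. Call 215-555-0100 or mail info@example.com"
partial = "12 Main St, Philadelphia PA 19104, info@example.com"
print(address_score(full, cities))
print(address_score(partial, cities))
print(address_score("I am looking for a place to stay", cities))
```

In a real pipeline the city lookup would be replaced by the location entities produced by the NER model, and the weights could be tuned to the problem instead of being equal thirds.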
Hope it helped! :)
Why do you say regular expressions won't work?
Basically, define all the different forms of address you might encounter in the form of regular expressions. Then, just match the expressions.

How do you include categories with 0 responses in SPSS frequency output?

Is there a way to display response options that have 0 responses in SPSS frequency output? The default is for SPSS to omit in the frequency table output any response option that is not selected by at least a single respondent. I looked for a syntax-driven option to no avail. Thank you in advance for any assistance!
It doesn't show because not a single case in the data has that attribute. So, by forcing a row of zeros, you'll need to realize we're asking SPSS to do something incorrect.
Having said that, you can introduce a fake case with the missing category. E.g. if you have Orange, Apple, and Pear, but no one answered that they like Pear, then add one fake case that says Pear.
Now, make a new weight variable that consists only of 1s. But for the Pear case, make it very small, like 0.00001. Then go to Data > Weight Cases > Weight cases by and move that new weight variable over. Click OK to apply. What happens now is that SPSS will treat the real cases with a weight of 1 and the fake case with a weight that is 1/100,000 of a normal case. If you rerun the frequency table, you should see the category with the zero count show up.
If you have purchased the Custom Table module you can also do that directly as well, as far as I can tell from their technical document. That module costs 637 to 3630 depending on license type, so probably only worth a try if your institute has it.
So, I'm a noob with SPSS, I (shame on me) have a cracked version of SPSS 22 and if I understood your question correctly, this is my solution:
double click the Frequency table in Output
right click table, select Table Properties
go to General and then uncheck the Hide empty rows and columns option
Hope this helps someone!
If your SPSS version has no Custom Tables installed and you haven't collected money for that module yet then use the following (run this syntax):
*Note: please use variable names up to 8 characters long.
set mxloops 1000. /*in case your list of values is longer than 40
matrix.
get vars /vari= V1 V2 /names= names /miss= omit. /*V1 V2 here is your categorical variable(s)
comp vals= {1,2,3,4,5,99}. /*let this be the list of possible values shared by the variables
comp freq= make(ncol(vals),ncol(vars),0).
loop i= 1 to ncol(vals).
comp freq(i,:)= csum(vars=vals(i)).
end loop.
comp names= {'vals',names}.
print {t(vals),freq} /cnames= names /title 'Frequency'. /*here you are - the frequencies
print {t(vals),freq/nrow(vars)*100} /cnames= names /format f8.2 /title 'Percent'. /*and percents
end matrix.
*If variables have missing values, they are deleted listwise. To include missings, use
get vars /vari= V1 V2 /names= names /miss= -999. /*or other value
*To exclude missings individually from each variable, analyze by separate variables.

Similarity between LDA results over two different number of topics?

Suppose we run LDA with 20 topics and then run it again with 30 topics. Will the two results share those 20 topics and otherwise look similar?
Short answer: no. The way LDA works is that it uses a Gibbs sampler to get a Dirichlet distribution over document vectors. Allocations are then made on this sample and hence will always differ, both because of sampling randomness and because of allocation uncertainty, unless you set an explicit random seed and run with the same number of topics k. Take a look at the original paper, Blei et al. 2003, to see how k is defined.
UPDATE (with regard to comment): Hierarchical LDA (hLDA) is trying to solve the problem of retaining topics and subtopics by constructing levels of topics following the Chinese restaurant model. But it's still not perfect.
The way flat LDA works, however, is that it looks at documents rather than topics to produce further results. Say you get topic 0 (the first table in the restaurant) and all documents try to sit there, but there is not really enough space, so you create another topic 1 where some docs feel more comfortable, and so on. You are right from the point of view of how these tables are created. But there is one critical thing: topic 0 CHANGES when you create a new table/topic 1, because some documents have left the first table and taken their words (or the probabilities of co-occurrence thereof) with them to the new table, and all words in topic 0 get reshuffled given the new situation. The same happens when you create more tables/topics: all the previous ones are re-estimated as well. Hence, you will never get the same 20 topics when rerunning with 30.

Speaker adaptation with HTK

I am trying to adapt a monophone-based recogniser to a specific speaker. I am using the recipe given in HTKBook 3.4.1 section 3.6.2. I am getting stuck on the HHEd part, which I am invoking like so:
HHEd -A -D -T 1 -H hmm15/hmmdefs -H hmm15/macros -M classes regtree.hed monophones1eng
The error I end up with is as follows:
ERROR [+999] Components missing from Base Class list (2413 3375)
ERROR [+999] BaseClass check failed
The folder classes contains the file global which has the following contents:
~b "global"
<MMFIDMASK> *
<PARAMETERS> MIXBASE
<NUMCLASSES> 1
<CLASS> 1 {*.state[2-4].mix[1-25]}
The hmmdefs file within hmm15 had some mixture components missing (I am using 25 mixture components per state of each phone). I tried to "fill in the blanks" by adding mixture components with random mean and variance values but zero weights. This too has had no effect.
The hmms are left-right hmms with 5 states (3 emitting), each state modelled by a 25 component mixture. Each component in turn is modelled by an MFCC with EDA components. There are 46 phones in all.
My questions are:
1. Is the way I am invoking HHEd correct? Can it be invoked in the above manner for monophones?
2. I know that the base class list (rtree.base) must contain every single mixture component, but where do I find these missing mixture components?
NOTE: Please let me know in case more information is needed.
Edit 1: The file regtree.hed contains the following:
RN "models"
LS "stats_engOnly_3_4"
RC 32 "rtree"
Thanks,
Sriram
The way you invoke HHEd looks fine. The components are missing because they have become defunct. To deal with defunct components, read HTKBook-3.4.1 Section 8.4, page 137.
Questions:
- What does regtree.hed contain?
- How much data (in hours) are you using? 25 mixtures might be excessive.
You might want to use a more gradual increase in mixtures - MU +1 or MU +2 and limit the number of mixtures (a guess: 3-8 depending on training data amount).

How to use libsvm for text classification?

I'd like to write a spam filter program with SVM, and I chose libsvm as the tool.
I have 1000 good mails and 1000 spam mails, which I split into:
700 good_train mails 700 spam_train mails
300 good_test mails 300 spam_test mails
Then I wrote a program to count how many times each word occurs in each file, and got results like:
good_train_1.txt:
today 3
hello 7
help 5
...
I learned that libsvm needs format like:
1 1:3 2:1 3:0
2 1:3 2:3 3:1
1 1:7 3:9
as its input. I know that 1, 2, 1 is the label, but what does 1:3 mean?
How could I transfer what I've got to this format?
Likely, the format is
classLabel attribute1:count1 ... attributeN:countN
N is the total number of different words in your text corpus. You will have to check the documentation for the tool you are using (or its sources) to see whether you can use a sparser format by leaving out the attributes that have count 0.
How could I transfer what I've got to this format?
Here's how I would do this. I would use the script you've got to compute the word counts for each mail in the training set. Then, use another script to transfer that data into the LIBSVM format that you've shown earlier. (This can be done in a variety of ways, but it should be reasonable to write in a language with easy input/output, like Python.) I would batch all "good-mail" data into one file and label that class as "1". Then, I would do the same with the "spam-mail" data and label that class "-1". As nologin said, LIBSVM requires the class label to precede the features, but the feature indices themselves can be any numbers as long as they appear in ascending order, e.g. 2:5 3:6 5:9 is allowed, but not 3:23 1:3 7:343.
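The conversion step could look roughly like the sketch below. It assumes the word counts are already available as one dictionary per mail; the helper names `build_index` and `to_libsvm_line` are hypothetical, and the spam counts are made-up sample data. Each `index:value` pair says that the word with that index occurred that many times, and words with a count of 0 are simply left out, as in the third sample line of the question (`1 1:7 3:9`).

```python
def build_index(count_dicts):
    """Assign every distinct word a 1-based feature index (alphabetical order)."""
    words = sorted({word for counts in count_dicts for word in counts})
    return {word: i for i, word in enumerate(words, start=1)}


def to_libsvm_line(label, counts, index):
    """Format one mail as 'label idx:count ...' with indices in ascending order."""
    pairs = sorted(
        (index[word], count)
        for word, count in counts.items()
        if word in index and count != 0  # zero counts are omitted (sparse format)
    )
    return " ".join([str(label)] + [f"{i}:{count}" for i, count in pairs])


good_mail = {"today": 3, "hello": 7, "help": 5}  # counts from the question
spam_mail = {"hello": 1, "offer": 4}             # made-up example counts

index = build_index([good_mail, spam_mail])
lines = [
    to_libsvm_line(1, good_mail, index),    # good mail labelled 1
    to_libsvm_line(-1, spam_mail, index),   # spam labelled -1
]
print("\n".join(lines))
```

The index must be built once from the training data and reused for the test set, so that the same word always maps to the same feature number.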
If you're concerned that your data is not in the correct format, use their script
checkdata.py
before training and it should report any possible errors.
Once you have two separate files with data in the correct format, you can call
cat file_good file_spam > file_training
and generate a training file that contains data on both good and spam mail. Then, do the same with the testing set. One practical advantage of arranging the data this way is that you know the first 700 (or 300) mails in the training (or testing) set are good mail, and the remaining ones are spam. This makes it easier to write other scripts that act on the data, such as precision/recall code.
If you have other questions, the FAQ at http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html should be able to answer a few, as well as the various README files that come with installation. (I personally found the READMEs in the "Tools" and "Python" directories to be a great boon.) Sadly, the FAQ does not touch much on what nologin said, about data being in a sparse format.
On a final note, I doubt that you need to keep counts of every possible word that could appear in mail. I would recommend counting only the most common words you would suspect to appear in spam mail. Other potential features include total word count, average word length, average sentence length, and other possible data that you feel may be helpful.
