I am using Ubuntu 12.04, Hadoop 1.0.4, and Mahout 0.7, running a recommendation job on a Hadoop cluster. When I give it an input file in the format below, the MapReduce job runs fine but produces no result (blank output).
tataRecommend100.txt (userID - productID - preference)
14218954 54518 4
14218954 617691 2
14218954 616488 2
14218954 614975 2
14218954 605662 1
14218954 619979 1
14218954 14183 3
14218954 611309 5
14218954 615242 3
14218954 13138 1
14232708 54518 1
14232708 617691 3
14232708 616488 1
14232708 614975 5
14232708 605662 4
command:
bin/hadoop jar /home/hadoop/apacheC/mahout-distribution-0.7/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input /tataDocomo/recommend/tataRecommend100.txt --output /tataDocomo/recommend/tataRecommendOutput
Your data is simply too sparse and too small to make recommendations from. Try a non-toy data set.
Could it be that you haven't given it a user ID for whom you want recommendations? That happened to me the first time I tried it: no output. You put those user IDs in the file you pass via --userFile.
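One more thing worth checking: if I remember correctly, RecommenderJob parses each input line as userID,itemID[,preference] separated by commas or tabs, so a purely space-separated file can yield no usable preferences. The exact delimiter handling is an assumption worth verifying against your Mahout version; here is a minimal sketch for normalizing the file:

```python
# Convert space-separated "userID productID preference" lines into the
# comma-separated triples Mahout's RecommenderJob is commonly fed.
def normalize_prefs(lines):
    out = []
    for line in lines:
        parts = line.split()          # split on any whitespace
        if len(parts) == 3:           # keep only well-formed triples
            out.append(",".join(parts))
    return out

raw = ["14218954 54518 4", "14218954 617691 2"]
print(normalize_prefs(raw))
```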
Table 1:
Position  Team
1         MCI
2         LIV
3         MAN
4         CHE
5         LEI
6         AST
7         BOU
8         BRI
9         NEW
10        TOT
Table 2:
Position  Team
1         LIV
2         MAN
3         MCI
4         CHE
5         AST
6         LEI
7         BOU
8         TOT
9         BRI
10        NEW
The output I'm looking for is a position difference of 10, the total of the per-team positional differences. The positional difference is always positive, whether a team moves up or down (think of it as a league table). How can I do this in Excel/Google Sheets?
Table 2 New (using a formula to find the positional difference):
Position  Team  Positional Difference
1         LIV   1
2         MAN   1
3         MCI   2
4         CHE   0
5         AST   1
6         LEI   1
7         BOU   0
8         TOT   2
9         BRI   1
10        NEW   1
Assuming that Table 1 is in columns A:B and Table 2 in columns D:E, try this:
=IFNA(ABS(INDEX(A:B,MATCH(E2,B:B,0),1)-D2),"-")
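For cross-checking, here is the same lookup-and-absolute-difference logic sketched in Python; the two dicts simply mirror the example tables, so the spreadsheet column layout does not matter here:

```python
# Positional difference: for each team, |position in table 1 - position in table 2|.
table1 = {"MCI": 1, "LIV": 2, "MAN": 3, "CHE": 4, "LEI": 5,
          "AST": 6, "BOU": 7, "BRI": 8, "NEW": 9, "TOT": 10}
table2 = {"LIV": 1, "MAN": 2, "MCI": 3, "CHE": 4, "AST": 5,
          "LEI": 6, "BOU": 7, "TOT": 8, "BRI": 9, "NEW": 10}

diffs = {team: abs(table1[team] - pos) for team, pos in table2.items()}
print(diffs)
print(sum(diffs.values()))  # total positional difference
```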
I have a dataset where each ID has visited a website and recorded their risk level, which is coded 0-3. They have then returned to the website at a later date and recorded their risk level again. I want to calculate the difference between each of an ID's later risk levels and their first recorded risk level.
For example my dataset looks like this:
ID Timestamp RiskLevel
1 20-Jan-21 2
1 04-Apr-21 2
2 05-Feb-21 1
2 12-Mar-21 2
2 07-May-21 3
3 09-Feb-21 2
3 14-Mar-21 1
3 18-Jun-21 0
And I would like it to look like this:
ID Timestamp RiskLevel DifFromFirstRiskLevel
1 20-Jan-21 2 .
1 04-Apr-21 2 0
2 05-Feb-21 1 .
2 12-Mar-21 2 1
2 07-May-21 3 2
3 09-Feb-21 2 .
3 14-Mar-21 1 -1
3 18-Jun-21 0 -2
What should I do?
One way to approach this is with the strategy in my answer here, but I will use a different approach this time:
sort cases by ID timestamp.
compute firstRisk=risklevel.
if $casenum>1 and ID=lag(ID) firstRisk=lag(firstRisk).
execute.
compute DifFromFirstRiskLevel=risklevel-firstRisk.
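For readers outside SPSS, the same first-value-per-ID logic can be sketched in plain Python (data hard-coded from the example; None plays the role of the "." in the desired output):

```python
# For each ID, subtract the first recorded risk level from every later one.
rows = [(1, "20-Jan-21", 2), (1, "04-Apr-21", 2),
        (2, "05-Feb-21", 1), (2, "12-Mar-21", 2), (2, "07-May-21", 3),
        (3, "09-Feb-21", 2), (3, "14-Mar-21", 1), (3, "18-Jun-21", 0)]

first = {}                            # first risk level seen per ID
result = []
for pid, ts, risk in rows:            # rows assumed sorted by ID, timestamp
    if pid not in first:
        first[pid] = risk
        diff = None                   # "." in the desired output
    else:
        diff = risk - first[pid]
    result.append((pid, ts, risk, diff))
print(result)
```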
I have the following columns in Google Sheets:
Equipments   Amount      Equipment 1  Equipment 2
----------   ------      -----------  -----------
Equipment 1  2           Process 1    Process 3
Equipment 2  3           Process 2    Process 4
                                      Process 5
I need to produce Equipment 1 x2 and Equipment 2 x3.
When the equipment is produced, Process 1 is executed 2 times, Process 2 2 times, Process 3 3 times, Process 4 3 times, and Process 5 3 times.
So I need to generate such list:
Process 1
Process 1
Process 2
Process 2
Process 3
Process 3
Process 3
Process 4
Process 4
Process 4
Process 5
Process 5
Process 5
Of course, I want a formula that is dynamic (e.g. I can add another equipment or change the processes of a particular equipment).
Single list using REPT:
=TRANSPOSE(SPLIT(JOIN(",",FILTER(REPT(C2:C&",",B2),C2:C<>"")),","))
Multi-list REPT:
=TRANSPOSE(SPLIT(JOIN(",",FILTER(REPT(C2:C&",",VLOOKUP(D2:D,A:B,2,)),C2:C<>"")),","))
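For clarity, here is what the REPT/SPLIT expansion computes, sketched in Python with the amounts and process lists taken from the question:

```python
# Repeat each process once per unit of the equipment it belongs to.
amounts = {"Equipment 1": 2, "Equipment 2": 3}
processes = {"Equipment 1": ["Process 1", "Process 2"],
             "Equipment 2": ["Process 3", "Process 4", "Process 5"]}

plan = [proc
        for equip, procs in processes.items()
        for proc in procs
        for _ in range(amounts[equip])]
print(plan)
```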
There is no easy way to solve your problem with formulas.
I would strongly suggest you write a script; it's easier than you think. You can even record an action and then look at the code needed to reproduce it.
For example, here are two data files:
file1:
target
1 6791340 10.9213
2 6934561 9.6791
3 6766224 9.5835
4 6753444 9.1097
5 6809077 8.7386
6 6818752 8.7172
file2:
1 6766224 11.7845
2 6753444 9.6863
3 6809077 9.5252
4 6818752 9.3867
5 6791340 9.1914
6 6934561 9.1914
file3(output):
target
1 6791340 10.9213 5 9.1914
2 6934561 9.6791 6 9.1914
3 6766224 9.5835 1 11.7845
4 6753444 9.1097 2 9.6863
5 6809077 8.7386 3 9.5252
6 6818752 8.7172 4 9.3867
As you can see, the order of the target column stays exactly the same as in file1, but file2's rows now follow the order of file1's target column, and the columns taken from file2 are rearranged accordingly. The real files are big; "target" is written here just for clarification. Any guidance, please?
This is what I tried:
awk 'NR==FNR{ a[$2]=$1; next }{ print a[$1],$1,$2 }' file1 file2 > output
But this changes the order of the target column.
Switch the order of file processing:
$ awk 'NR==FNR{a[$2]=$1 OFS $3; next} ($2 in a){print $0, a[$2]}' f2 f1
1 6791340 10.9213 5 9.1914
2 6934561 9.6791 6 9.1914
3 6766224 9.5835 1 11.7845
4 6753444 9.1097 2 9.6863
5 6809077 8.7386 3 9.5252
6 6818752 8.7172 4 9.3867
($2 in a) can be removed if the second columns are guaranteed to match.
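The same two-pass idea, sketched in Python: index file2 by its second column, then walk file1 in its original order (the first three rows of the example are hard-coded for brevity):

```python
# Order-preserving join on column 2: file1's row order is kept,
# and the matching fields from file2 are appended.
file1 = ["1 6791340 10.9213", "2 6934561 9.6791", "3 6766224 9.5835"]
file2 = ["1 6766224 11.7845", "5 6791340 9.1914", "6 6934561 9.1914"]

lookup = {}
for line in file2:                    # first pass: index file2 by its key column
    rank, target, score = line.split()
    lookup[target] = (rank, score)

merged = []
for line in file1:                    # second pass: keep file1's order
    rank, target, score = line.split()
    if target in lookup:
        merged.append(" ".join([rank, target, score, *lookup[target]]))
print(merged)
```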
Isn't the easiest way to sort the two files on column two, join them, and then sort again on column 1? Be aware that this buffers data and calls several programs; a clean awk solution is given by Sundeep.
% join -j2 <(sort -g -k2 file1) <(sort -g -k2 file2) \
-o 1.1,1.2,1.3,2.1,2.3 | sort -g -k1
1 6791340 10.9213 5 9.1914
2 6934561 9.6791 6 9.1914
3 6766224 9.5835 1 11.7845
4 6753444 9.1097 2 9.6863
5 6809077 8.7386 3 9.5252
6 6818752 8.7172 4 9.3867
The flag -o 1.1,1.2,1.3,2.1,2.3 is the output option of join: it says to print column 1 of file 1 (1.1), followed by column 2 of file 1 (1.2), and so on.
man join:
-o FORMAT
    obey FORMAT while constructing output line
FORMAT is one or more comma or blank separated specifications, each being 'FILENUM.FIELD' or '0'. Default FORMAT outputs the join field, the remaining fields from FILE1, the remaining fields from FILE2, all separated by CHAR. If FORMAT is the keyword 'auto', then the first line of each file determines the number of fields output for each line.
Without this option, you would still have to swap columns 1 and 2:
join -j2 <(sort -g -k2 file1) <(sort -g -k2 file2) | awk '{t=$2;$2=$1;$1=t}1' | sort -g -k1
I've got a dataset with repeated measures that looks roughly like this:
ID v1 v2 v3 v4
1 3 4 2 NA
1 2 NA 6 7
2 4 3 6 4
2 NA 2 7 9
. . . . .
n . . . .
What I want to know is: how many NAs are there for each participant across the variables v1-v4 (e.g. participant 1 is missing 2 of 8 responses)?
Missing values are always reported per variable, not per participant, so how do I do this? Maybe there is a way using the AGGREGATE command with ID as BREAK?
Use COUNT to count the missing values into a new variable, then AGGREGATE by ID (or SPLIT FILE by ID and run FREQUENCIES).
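For illustration, the per-participant count sketched in plain Python, with None standing in for a missing value (only the first two participants from the example):

```python
# Count missing responses per participant across v1-v4.
rows = [(1, [3, 4, 2, None]), (1, [2, None, 6, 7]),
        (2, [4, 3, 6, 4]),    (2, [None, 2, 7, 9])]

missing = {}
for pid, values in rows:
    missing[pid] = missing.get(pid, 0) + sum(v is None for v in values)
print(missing)
```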