Mahout recommender returns no results

My Mahout recommender is returning no results although from the looks of the evaluator output it seems like it should:
2014-10-15 18:33:36,704 INFO GenericDataModel - Processed 90 users
2014-10-15 18:33:36,735 INFO AbstractDifferenceRecommenderEvaluator - Beginning evaluation using 0.99 of GenericDataModel[users:1116,1117,1118...]
2014-10-15 18:33:36,767 INFO GenericDataModel - Processed 89 users
2014-10-15 18:33:36,767 INFO AbstractDifferenceRecommenderEvaluator - Beginning evaluation of 75 users
2014-10-15 18:33:36,767 INFO AbstractDifferenceRecommenderEvaluator - Starting timing of 75 tasks in 8 threads
2014-10-15 18:33:36,782 INFO StatsCallable - Average time per recommendation: 15ms
2014-10-15 18:33:36,782 INFO StatsCallable - Approximate memory used: 876MB / 1129MB
2014-10-15 18:33:36,782 INFO StatsCallable - Unable to recommend in 0 cases
2014-10-15 18:33:36,845 INFO AbstractDifferenceRecommenderEvaluator - Evaluation result: 1.0599938694784354
I'm assuming that "Unable to recommend in 0 cases" means that it was able to recommend in all cases.
I then iterate over the user ID set and only see
2014-10-15 18:33:36,923 DEBUG GenericUserBasedRecommender - Recommending items for user ID '1164'
2014-10-15 18:33:36,923 DEBUG Recommendations are: []
for each ID.
Am I reading the debug log correctly?
Thanks.

I'm not exactly sure what that log message means. It looks like that stat is printed every 1000 iterations, so it presumably refers to the past 1000 requests rather than all time, but that's just a guess.
In any case, you will very seldom be able to recommend to all users. There will often be users who get no recommendations because they have too little usage history, or usage history that does not overlap with other users'. There will also be new users for whom you have no preference history; they will get no collaborative filtering recs either. Remember that you are only using preferences that have been expressed, which does not mean all users are represented.
You should always have some fallback method to make recommendations in this case; even recently popular or promoted items would be better than nothing.
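For illustration, here is a minimal sketch of what such a fallback could look like on top of Mahout's Taste API. The method names and the popularity heuristic are mine, not something Mahout ships; it simply ranks items by how many users expressed a preference for them:
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.recommender.GenericRecommendedItem;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class FallbackExample {

    // If collaborative filtering has nothing for this user, fall back
    // to globally popular items instead of returning an empty list.
    static List<RecommendedItem> recommendWithFallback(
            Recommender recommender, DataModel model, long userID, int howMany)
            throws TasteException {
        List<RecommendedItem> recs = recommender.recommend(userID, howMany);
        return recs.isEmpty() ? mostPopularItems(model, howMany) : recs;
    }

    // Naive popularity: rank items by how many users expressed a preference.
    static List<RecommendedItem> mostPopularItems(DataModel model, int howMany)
            throws TasteException {
        List<RecommendedItem> popular = new ArrayList<>();
        LongPrimitiveIterator items = model.getItemIDs();
        while (items.hasNext()) {
            long itemID = items.nextLong();
            popular.add(new GenericRecommendedItem(itemID,
                    model.getNumUsersWithPreferenceFor(itemID)));
        }
        Collections.sort(popular,
                (a, b) -> Float.compare(b.getValue(), a.getValue()));
        return popular.subList(0, Math.min(howMany, popular.size()));
    }
}
Anything smarter (recency, promoted items) can be swapped into mostPopularItems without touching the calling code.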

Related

Getting actual memory usage per user session in SSAS tabular model

I'm trying to build a report which would show actual memory usage per user session when working with a particular SSAS tabular in-memory model. The model itself is relatively big (~100 GB in memory) and the test queries are relatively heavy: no filters, lowest granularity level, a couple of SUM measures, plus exporting 30k rows to CSV.
First, I tried querying the following DMV:
select SESSION_SPID
,SESSION_CONNECTION_ID
,SESSION_USER_NAME
,SESSION_CURRENT_DATABASE
,SESSION_USED_MEMORY
,SESSION_WRITES
,SESSION_WRITE_KB
,SESSION_READS
,SESSION_READ_KB
from $system.discover_sessions
where SESSION_USER_NAME='username'
and SESSION_SPID=29445
and got the following results:
(screenshot: $system.discover_sessions result)
I was expecting SESSION_USED_MEMORY to show at least several hundred MB, but the biggest value I got was 11 KB (the official MS documentation for this DMV indicates that SESSION_USED_MEMORY is in kilobytes).
I've also tried querying 2 more DMVs:
SELECT SESSION_SPID
,SESSION_COMMAND_COUNT
,COMMAND_READS
,COMMAND_READ_KB
,COMMAND_WRITES
,COMMAND_WRITE_KB
,COMMAND_TEXT FROM $system.discover_commands
where SESSION_SPID=29445
and
select CONNECTION_ID
,CONNECTION_USER_NAME
,CONNECTION_BYTES_SENT
,CONNECTION_DATA_BYTES_SENT
,CONNECTION_BYTES_RECEIVED
,CONNECTION_DATA_BYTES_RECEIVED from $system.discover_connections
where CONNECTION_USER_NAME='username'
and CONNECTION_ID=2047
But I also got quite underwhelming results: 0 used memory from $system.discover_commands, and 4.8 MB for CONNECTION_DATA_BYTES_SENT from $system.discover_connections, which still seems smaller than what the actual session would take.
These results don't seem to correspond to a very blunt test, where users send similar queries via Power BI and we observe a ~40 GB spike in RAM allocation on the SSAS server for 4 users (so roughly 10 GB per user session).
Has anyone used these (or any other DMVs or methods) to get actual user session memory consumption? Using a SQL trace dump would be a last resort, since it would require parsing and loading the result into a DB, and my goal is a real-time report showing active user sessions.

questions related to wrk2 benchmark tool about their latencies and requests

I have some questions about the wrk2 benchmark tool. I searched a lot and did not find answers to them. If you have any understanding of these tools, please help me.
What "count" column represents in Detailed Percentile spectrum? example Did they show the total number of requests whose latency is within "value" (column name) range? Correct me if i am wrong.
What "latency(i)" and "requests" represent in done function provided by wr2 and wrk? and How can I get that values? done_function
How can I get the total number of requests generated per minute and their latencies? Do "latency(i)" and "requests" give me some information about them?
What "-B (batch latency)" option in wrk does? My output remains the same whether i use this option or not. batch
In the wrk2 README.md, I didn't understand these lines. Can you please explain them?

Measure service latency with Prometheus

I am new to Prometheus and Grafana. My primary goal is to get the response time per request.
To me this seemed like a simple thing, but whatever I do, I do not get the results I require.
I need to be able to analyse the service latency in the last minutes/hours/days. The current implementation I found was a simple SUMMARY (without definition of quantiles) which is scraped every 15s.
Is it possible to get the average request latency of the last minute from my Prometheus SUMMARY?
If YES: How? If NO: What should I do?
Currently I am using the following query:
rate(http_response_time_sum{application="myapp",handler="myHandler", status="200"}[1m])
/
rate(http_response_time_count{application="myapp",handler="myHandler", status="200"}[1m])
I am getting two "datasets". The value of the first is "NaN". I suppose this is the result of a division by zero.
(I am using spring-client).
Your query is correct. The result will be NaN if there have been no requests in the past minute: both rates are then 0, and 0/0 is NaN.
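For what it's worth, here is a minimal sketch of how such a Summary is typically created with the plain Prometheus Java simpleclient (metric and label names are copied from your query; your Spring integration may wire this up differently). The client then exports http_response_time_sum and http_response_time_count, and your rate()/rate() division yields the average over the window: if _sum grew by 30 seconds and _count by 120 requests in the last minute, the average latency is 0.25 s.
import io.prometheus.client.Summary;

public class TimedHandler {

    // Exports http_response_time_sum and http_response_time_count;
    // rate(..._sum[1m]) / rate(..._count[1m]) then gives average latency.
    static final Summary responseTime = Summary.build()
            .name("http_response_time")
            .help("Response time in seconds.")
            .labelNames("application", "handler", "status")
            .register();

    void handle() {
        // Status label is hardcoded here for brevity; real code would
        // set it after the response is known.
        Summary.Timer timer = responseTime
                .labels("myapp", "myHandler", "200").startTimer();
        try {
            // ... actual request handling ...
        } finally {
            timer.observeDuration();
        }
    }
}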

Solr: How to get all results corresponding to a query

I am using rsolr gem to integrate solr search with my RoR app. Now for each search, I need to specify the rows parameter, which is the number of results I want to retrieve. In order to retrieve all results corresponding to a query, I set the rows parameter to a high value as mentioned in this post.
But doing that makes the processing really really slow and I am getting the following error in the rails logs:
[2014-01-11 15:51:08] ERROR WEBrick::HTTPStatus::RequestURITooLarge
[2014-01-11 15:51:08] ERROR TypeError: can't convert nil into an exact number
/home/nish/.rvm/gems/ruby-1.9.2-p320@voylla/gems/activesupport-3.1.10/lib/active_support/core_ext/time/calculations.rb:266:in `-'
/home/nish/.rvm/gems/ruby-1.9.2-p320@voylla/gems/activesupport-3.1.10/lib/active_support/core_ext/time/calculations.rb:266:in `minus_with_duration'
/home/nish/.rvm/gems/ruby-1.9.2-p320@voylla/gems/activesupport-3.1.10/lib/active_support/core_ext/time/calculations.rb:277:in `minus_with_coercion'
/home/nish/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/webrick/accesslog.rb:42:in `setup_params'
/home/nish/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/webrick/httpserver.rb:164:in `access_log'
/home/nish/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/webrick/httpserver.rb:87:in `run'
/home/nish/.rvm/rubies/ruby-1.9.2-p320/lib/ruby/1.9.1/webrick/server.rb:183:in `block in start_thread'
How can I fix this issue? Thanks
Your error is related to RoR, not Solr. It's telling you the problem: the requested URI is too large. WEBrick is not a production-caliber web server, and v1.9.3 appears to limit HTTP request length to 2083 characters (per this other SO question).
The short-term fix? Use a web server that doesn't limit your requested URI length to something so short.
However, that's just one part of the fix: the approach you're taking will grow linearly or worse in execution time relative to the number of results. Not only does the number of results affect performance, but also the size of the documents being retrieved.
Can you share your requirements that led to an implementation where all results are returned with each query?
From the Solr FAQ:
This is impractical in most cases. People typically only want to do this when they know they are dealing with an index whose size guarantees the result sets will always be small enough that they can feasibly be transmitted in a manageable amount -- but if that's the case, just specify what you consider a "manageable amount" as your rows param and get the best of both worlds (all the results when your assumption is right, and a sanity cap on the result size if it turns out your assumptions are wrong).
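The question uses rsolr/Ruby, but as a language-neutral illustration of the "manageable rows plus paging" idea, here is a sketch with SolrJ (URL, core name, and page size are placeholders):
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class PagedFetch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/mycore").build();
        int rows = 500;  // a "manageable amount" per request
        int start = 0;
        List<SolrDocument> all = new ArrayList<>();
        while (true) {
            SolrQuery q = new SolrQuery("*:*");
            q.setStart(start);
            q.setRows(rows);
            QueryResponse rsp = solr.query(q);
            all.addAll(rsp.getResults());
            start += rows;
            // Stop once we've paged past the total match count.
            if (start >= rsp.getResults().getNumFound()) {
                break;
            }
        }
        System.out.println("Fetched " + all.size() + " documents");
        solr.close();
    }
}
For genuinely deep result sets, Solr's cursorMark paging is preferable to large start offsets, which get more expensive the deeper you page.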

Running the mahout example for collaborative filtering: where are the results?

I am experimenting a bit with Mahout. I built everything and had a look at the examples. I am mostly interested in collaborative filtering, so I started with the example on finding recommendations from the BookCrossing dataset. I managed to get everything working, and the sample runs without errors. However, the output is something like this:
INFO: Creating FileDataModel for file /tmp/taste.bookcrossing.
INFO: Reading file info...
INFO: Read lines: 433647
INFO: Processed 10000 users
INFO: Processed 20000 users
INFO: Processed 30000 users
INFO: Processed 40000 users
INFO: Processed 50000 users
INFO: Processed 60000 users
INFO: Processed 70000 users
INFO: Processed 77799 users
INFO: Beginning evaluation using 0.9 of BookCrossingDataModel
INFO: Processed 10000 users
INFO: Processed 20000 users
INFO: Processed 22090 users
INFO: Beginning evaluation of 4245 users
INFO: Starting timing of 4245 tasks in 2 threads
INFO: Average time per recommendation: 296ms
INFO: Approximate memory used: 115MB / 167MB
INFO: Unable to recommend in 1 cases
INFO: Average time per recommendation: 67ms
INFO: Approximate memory used: 107MB / 167MB
INFO: Unable to recommend in 2363 cases
INFO: Average time per recommendation: 72ms
INFO: Approximate memory used: 146MB / 167MB
INFO: Unable to recommend in 5095 cases
INFO: Average time per recommendation: 71ms
INFO: Approximate memory used: 113MB / 167MB
INFO: Unable to recommend in 7596 cases
INFO: Average time per recommendation: 71ms
INFO: Approximate memory used: 130MB / 167MB
INFO: Unable to recommend in 10896 cases
INFO: Evaluation result: 1.0895580110095793
When I check the code, I can see that it does this:
RecommenderIRStatsEvaluator evaluator = new GenericRecommenderIRStatsEvaluator();
File ratingsFile = TasteOptionParser.getRatings(args);
DataModel model = ratingsFile == null
    ? new BookCrossingDataModel(true)
    : new BookCrossingDataModel(ratingsFile, true);
IRStatistics evaluation = evaluator.evaluate(
    new BookCrossingBooleanRecommenderBuilder(),
    new BookCrossingDataModelBuilder(),
    model,
    null,
    3,
    Double.NEGATIVE_INFINITY,
    1.0);
log.info(String.valueOf(evaluation));
So that seems to be correct, but I would like to see more details on the generated suggestions and/or similarities. The object returned is of type IRStatistics, which only exposes some numbers about the statistics of the results. Should I look somewhere else? Is this recommender not intended for getting any actual recommendations?
You are not actually generating recommendations here; you are just performing an evaluation.
This example from the Mahout in Action book (link) should give you an idea on how to actually get recommendations.
The example only requests recommendations for one user; in your case you would iterate through all the users, get each user's recommendations, and then decide what to do with them, such as writing them to a file.
Also, the example doesn't use the data model builder or the recommender builder, but it shouldn't be hard for you to figure that out by looking at the method signatures. A sketch of the basic loop follows.
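For instance, here is a minimal sketch in the spirit of the Mahout in Action example (file name and neighborhood size are placeholders; since the BookCrossing example uses boolean preferences, you would more likely swap in LogLikelihoodSimilarity and GenericBooleanPrefUserBasedRecommender):
import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommendForAllUsers {
    public static void main(String[] args) throws Exception {
        // Build the recommender directly instead of evaluating it.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood =
                new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender =
                new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Ask for 3 recommendations for every user in the model.
        LongPrimitiveIterator users = model.getUserIDs();
        while (users.hasNext()) {
            long userID = users.nextLong();
            List<RecommendedItem> recs = recommender.recommend(userID, 3);
            for (RecommendedItem rec : recs) {
                System.out.println(userID + "\t" + rec.getItemID()
                        + "\t" + rec.getValue());
            }
        }
    }
}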
