These are the fields that I have
Dimension: Usertype, with the values "New Visitor" and "Returning Visitor"
Measure: users - the sum across both new and returning visitors
How can I create a calculation that counts only New Visitors inside the users measure pill, without using the filter option?
For those wondering, the answer is:
usertype - dimension (New Visitor, Returning Visitor)
users - measure (Total of both New Visitor and Returning Visitor)
MIN({FIXED [usertype]='New Visitor': SUM([users])})
An alternative, possibly more efficient, approach is to define a calculated field that incorporates your filter. Say you define a calculated field called New Users as
IF [Usertype] = "New Visitor" THEN [Users] END
(This field will evaluate to null if [Usertype] <> "New Visitor".)
If you then place SUM([New Users]) on some shelf, you’ll have your result - without the complexity of an LOD calculation. LOD calcs are incredibly useful, but they add complexity and performance cost, so I’d recommend using them only when needed. In many cases, such as this, simpler methods work fine.
What, if any, is a good best practice / approach for a use case where a given business activity uses estimates that are then replaced by actuals as they become available? In the same way that effective dates can be used to "automatically" (without users having to know about it) retrieve historically accurate dimension rows, is there a similar way to have actuals "automatically" replace the estimates without overwriting the data? I'd rather not have separate fact tables or columns and require that users "know" about this and manually change it to get the latest actuals.
Why not have 2 measures in your fact table, one for estimate and one for actual?
You could then have a View over the fact table with a single measure calculated as "if actual = 0 then estimate else actual".
Users who just need the current position can use the View; users who need the full picture can access the underlying fact table.
Given a query and a document, I would like to compute a similarity score using Gensim doc2vec.
Each document consists of multiple fields (e.g., main title, author, publisher, etc.)
For training, is it better to concatenate the document fields and treat each row as a unique document or should I split the fields and use them as different training examples?
For inference, should I treat a query like a document? Meaning, should I call the model (trained over the documents) on the query?
The right answer will depend on your data & user behavior, so you'll want to try several variants.
Just to get some initial results, I'd suggest combining all fields into a single 'document', for each potential query-result, and using the (fast-to-train) PV-DBOW mode (dm=0). That will let you start seeing results, doing either some informal assessment or beginning to compile some automatic assessment data (like lists of probe queries & docs that they "should" rank highly).
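For illustration, here is a rough training sketch along those lines, assuming a gensim 4.x install; the field names, toy records, and parameter values are placeholders, not from your data:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

# Toy records standing in for your real data; the field names are just examples.
records = [
    {"title": "Modeling documents with Doc2Vec", "author": "John Doe", "publisher": "Acme Press"},
    {"title": "An introduction to Gensim", "author": "Jane Roe", "publisher": "Acme Press"},
]

def to_tagged_doc(doc_id, record):
    # Concatenate all fields into one text, then tokenize.
    text = " ".join(str(value) for value in record.values())
    return TaggedDocument(words=simple_preprocess(text), tags=[doc_id])

corpus = [to_tagged_doc(i, rec) for i, rec in enumerate(records)]

# PV-DBOW (dm=0) is fast to train and a reasonable first baseline.
model = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=40, workers=4)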
You could then try testing the idea of making the fields separate docs – either instead-of, or in addition-to, the single-doc approach.
Another option might be to create specialized word-tokens per field. That is, when 'John' appears in the title, you'd actually preprocess it to be 'title:John', and when in author, 'author:John', etc. (This might be in lieu of, or in addition to, the naked original token.) That could enhance the model to also understand the shifting senses of each token, depending on the field.
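As a sketch of that preprocessing idea (the sample record and the keep_plain option are just illustrative choices, not anything from your pipeline):

from gensim.utils import simple_preprocess

def field_tokens(record, keep_plain=True):
    # Emit field-qualified tokens such as 'title:john'; optionally keep the plain token too.
    tokens = []
    for field, value in record.items():
        for tok in simple_preprocess(str(value)):
            tokens.append(f"{field}:{tok}")
            if keep_plain:
                tokens.append(tok)
    return tokens

print(field_tokens({"title": "Graph Databases", "author": "John Smith"}))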
Then, provided you have enough training data and choose other model parameters well, your search interface might also preprocess queries similarly, when the user indicates a certain field, and get improved results. (Or maybe not: it's just an idea to be tried.)
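At query time, one possible pattern (continuing from the training sketch above and again assuming gensim 4.x; the query string is a placeholder) is to infer a vector for the query's tokens and rank the stored documents by similarity:

from gensim.utils import simple_preprocess

# `model` is the Doc2Vec instance trained earlier.
query = "introduction to doc2vec modeling"
query_vec = model.infer_vector(simple_preprocess(query))

# Rank stored documents (by the tags used at training time) against the query vector.
for tag, score in model.dv.most_similar([query_vec], topn=5):
    print(tag, score)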
In all cases, if you need precise results – exact matches of well-specified user queries – more traditional searches like exact DB matches/greps, or full-text reverse-indexes, will outperform Doc2Vec. But when queries are more approximate, and results need filling-out with near-in-meaning-even-if-not-in-literal-tokens results, a fuzzier vector document representation may be helpful.
Suppose I have a bunch of User nodes, each with a gender property that can be male or female. Now, in order to cluster users by gender, I have two choices of structure:
1) Add an index to the gender property, and use a WHERE to select users under a gender.
2) Create a Male node and a Female node, with edges linking them to the relevant users. Then, every time I query on gender, I use a pattern such as (:Male)-[]->(:User).
My question is, which one is better?
Indices should never be a replacement for putting things in the graph.
Indexing is great for looking up unique values and, in some cases, groups of values; however, with the caching that Neo4j can do (and the flexibility of modeling your domain in the graph), you often have better options.
Indexing a property that only has two values (give or take) is not the best use of an index, and it likely won't net much of a performance boost given how many results each property value returns.
That said, going with option #2 can create supernodes, a bottle-necking issue which can become a major headache depending on your model.
Maybe consider using labels (:Male and :Female, for example) as they are essentially "schema indices". Also keep in mind you can use multiple labels per node, so you could have (user:User:Male), etc. It also helps to avoid supernodes while not creating a classic or "legacy" index.
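A minimal sketch of the label approach via the official Python driver (the connection details and gender values are placeholders, and the Cypher shown is just one way to do the one-off migration):

from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # One-off migration: turn the existing gender property into labels.
    session.run("MATCH (u:User {gender: 'male'}) SET u:Male")
    session.run("MATCH (u:User {gender: 'female'}) SET u:Female")

    # Clustering by gender is then a simple label scan, with no property lookup.
    record = session.run("MATCH (u:User:Male) RETURN count(u) AS males").single()
    print(record["males"])

driver.close()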
HTH
I have two models - Score & Weight.
Each of these models have about 5 attributes.
I need to be able to create a weighted_score for my User, which is basically the attribute-by-attribute product: Score.attribute_A * Weight.attribute_A, Score.attribute_B * Weight.attribute_B, etc.
Am I better off creating a 3rd model - say Weighted_Score, where I store the product value for each attribute in a row with the user_id and then query that table whenever I need a particular weighted_score (e.g. my_user.weighted_score.attribute_A) or am I better off just doing the calculations on the fly every time?
I am asking from an efficiency stand-point.
Thanks.
I think the answer is very situation-dependent. Creating a third table may be a good idea if the calculation is very expensive, you don't want to bog down the rest of the system, and it's OK for you to respond to the user right away with a message saying that the calculation will occur in the future. In that case, you can offload the processing into a background worker and create an instance of the third model asynchronously. Additionally, you should de-normalize the table so that you can access it directly without having to look up the Weight/Score records.
Some other ideas:
Focus optimizations on the model that has many records. If Weight, for instance, will only have 100 records, but Score could have infinite, then load Weight into memory and focus all your effort on optimizing the Score queries.
Use memoization on the calc methods
Use caching on the most expensive actions/methods. If you don't care too much about how frequently the values update, you can explicitly sweep the cache nightly or something.
Unless there is a need to store the calculated score (let's say it changes and you want to preserve the changes to it), I don't see any benefit in adding the complexity of storing it in a separate table.
I am a newbie learning mahout.
I learned that there are five recommenders in Mahout: user-based, item-based, ...
The dataset I am using is MovieLens 100K.
I am thinking of implementing a movie recommender that differs a little from the user-based one; i.e., instead of taking a user id as input to recommend movies to a single user, I want to take user demographic information, e.g., age range, gender, occupation, and zip code.
But the problem is: how do I create my own user similarity method (the original one takes two long user ids as parameters), and how do I combine the u.user and u.data files?
I understand your question now. I think the simplest thing is to temporarily create a dummy user with the demographic properties you are querying for, and then recommend for that dummy user.
Yes, you would have to write a UserSimilarity that implements whatever similarity rule you want on top of the demographic data.
Maybe there is another solution.
I implemented my own Rescorer to deal with the u.user file and the input (gender, age range, ...). If every piece of information matches, I put the corresponding user id into a FastIDSet.
Then, in the rescore method, I check whether the current user id is in the FastIDSet; if it is, I boost the score.
In my own Recommender, I use PlusAnonymousUserDataModel to get a temporary id, and call the method recommend(id, howMany, rescorer).
However, after trying different dataset files, I get 0 recommended items.
I am wondering whether this is the right way to use PlusAnonymousUserDataModel.