What describes the value of RecommendedItem for a Boolean-Recommender - mahout

If one calculates the recommendations for a boolean DataModel, the RecommendedItems have some numbers in their value field.
What does it represent? (Understandably, it can't be the calculated preference).
The class GenericRecommendedItem-API only says: "A value expressing the strength of the preference for the recommended item. The range of the values depends on the implementation. Implementations must use larger values to express stronger preference."

It's intentionally opaque so you don't rely on any particular value. It happens to be a sum of similarities if I recall correctly -- all similarities between the user's items and that item.

Related

Treat missing value as it is in Decision Tree

I have a dataset in which some variable (categorical variable and numerical variable) has missing values. Example, i have a variable "area" with numerical value which divided into two categories, "area (today)" and "area (-1 day)". If a data row categorized as "new comer" then it will have no value on "area (-1 day)". So, normal missing value handling like removal or mean not working here. Do i have to label no value on "area (-1 day)" as a category where the variable is originally numeric? Or, is there any other suggestions?
Treating the newcomer as a separate class makes sense, because that's how you are treating it in your dataset - you have a separate area column for it.
Otherwise you can check various other Imputation techniques to suit your use case. Regression imputation might suit your case.
HTH

Missing value replacement based on class

I've been reading an article on Random Forests, and in missing value replacement section (https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#missing1) they say:
If the mth variable is not categorical, the method computes the median of all values of this variable in class j, then it uses this value to replace all missing values of the mth variable in class j.
Wouldn't that undermine the entire process? If most values in some column are missing, then after this procedure the new values could be used to easily identify the class, and the resulting classifier would be useless. Am I missing something here?
The resulting classifier is not necessarily useless, it depends on the characteristics of the 'missingness' (the event that a feature value is missing). If its distribution is identical between train and test set (which is a prevailing implicit assumption in ML), it is doing the right thing. However it is indeed problematic if there is a discrepancy, e.g., if missing values are an artifact of the way the training data was generated and mostly associated with one class, while at test time feature values are always fully known. In this case, imputation might lead to incorrect conclusions, especially if the number of missing values is large.

Best practices for handling non decimal variables. [ACM KDD 2009 CUP]

For practice I decided to use neural network to solve problem of classification (2 classes) stated by ACM Special Interest Group on Knowledge Discovery and Data Mining at 2009 cup. The problem I have found is that the data set contains a lot of "empty" variables and I am not sure how to handle them. Furthermore second question appears. How to handle with other non decimals like strings. What are Your best practices?
Most approaches require numerical features, so the categorical ones have to be converted into counts. E.g. if a certain string is present among the attributes of an instance, it's count is 1, otherwise 0. If it occurs more than once, it's count increases correspondingly. From this point of view any feature that is not present (or "empty" as you put it) has a count of 0. Note that the attribute names have to be unique.

java auto correct pattern-matcher - which item is the most similar in a given set?

I was wondering how to implement the following problem: Say I have a 'set' of Strings and I wish to know which one is the most related to a given value.
Example:
String value= "ABBCCE";
Set contains: {"JJKKLL", "ABBCC", "AAPPFFEE", "AABBCCDD", "ABBCEE", "AABBCCEE"}
By 'most related' I assume there could be many options (valid one can be the last 2), but at least we can ignore some items (JJKKLLL).
What should be the approach to solve this kind of a problem (that at minmum, a result like AABBCCEE would be acceptable)
Any java code would be appreciated :-)
You could try using the Levenshtein Distance between your "target" string (e.g. "ABBCCE") and each element in your set. Pick a maximum threshold above which you will consider items to be unrelated (in your example here, a threshold of one or two perhaps), and reject everything in the set that has a Levenshtein Distance greater than that from the target string.
An example implementation of the Levenshtein Distance computation in Java can be found here.
You may be interested in the Levenstein distance metric, which measures similarities between two strings, including insertions and removals.

How should I present a cost field to the user, and store it in the database?

Right now I have two fields for cost. One for dollars and one for cents. This works, but it is a bit ugly. It also doesn't allow the user to enter the term "free" or "no cost" if they want. But if I only have one field, I might have to make my parser a bit smarter. What do you think?
On the server side, I combine dollars and cents to store them as decimals in my database. Mainly so that I can gather statistics (cost averages, etc.) quickly.
Do you think it is better to store the cost as a string? Then whenever I actually use the cost for stats or other purposes, I would convert it to a decimal at that point. Or am I on the right track?
There is a rule in database design that states that "atomic data" should not be split. By this rule a price, or cost is such an example of atomic data and therefore it should never be split among multiple columns just like you shouldn't split a phone number among multiple columns (unless you really have a very good reason for it - very rare)
Use a DECIMAL data type. Something like DECIMAL(8,3) should work and it's supported by all ANSI SQL compliant database products!
You can consult Joe Celko's "Thinking In Sets" book for a discussion of this topic. See section 1.6.2, pages 21-22.
EDIT -
It seems from your question that you are also concerned with how to accept user's input in a form that resembles the price (xxxx.xx) - hence the two input boxes, for the whole dollars, and the pennies.
I recommend using a single input box and then doing input validation using Regular Expressions to match your format (i.e. something like [0-9]+(.[0-9]{1,3})? would probably work but could be improved). You could then parse the validated string to a Decimal type in your language, or just pass it as a string into your database - SQL will know how to cast it to a DECIMAL type.
Keep the whole cost as decimal. If it's free, then keep the cost as 0. In presentation if cost is zero - write "free" instead of 0.
I generally store the cost as the lowest unit (pennies) and then convert it to whole dollars later.
So a cost of $4.50 gets stored as 450. Free items would be -1 pennies. You could store free things as 0 pennies as well, this gives you the flexibility to use 0 and -1 to mean two slightly different things (free vs no sale?).
It also makes it easier to support countries that don't use cents if you choose to go that route.
As for presenting the data entry field, I personally don't like it when I have to keep switching fields for tiny things (like when they break up phone numbers into 3 fields, or IP addresses into 4). I'd present one field, and let the users type the decimal point in themselves. That way, your users don't have to tab (or click, if they are unfamiliar with tab) to the next field.
Use cents, use 450 for $4.50 this will save you problems that are arising very often
from the fact that floating point operations are not safe. Just try the following expression in irb:
0.4 - 0.3 == 0.1 will return false. All because of floating point representation
innacuracies.
In my models I'm always using:
attr_accessor :price_with_cents
def price_with_cents
self.price/100.00
end
def price\_with\_cents==(num)
self.price = (num.to_f * 100.00).to_i
end
And the name of column is just price and integer type.
I don't have much experience with decimal columns and their representation in ruby (which can be float that is problematic as i've shown at the begining).
Don't allow garbage to make it to your database. If you're expecting a dollar amount on a field, than make sure it's valid before it gets in there. This will allow you to report better on the data and allow simpler formatting on output.
I suggest making this a single field with validation on update or insert.
if field != SpecialFreeTag then
try to convert to decimal
if fail then report to user
otherwise accept value
Use try parse or regular expressions to help with the validation.
I would store the cost as decimal with the scale being no less than 2 and maybe even 3-5. If something is bought in bulk the unit cost could easily include fractions of a cent. Free items have a cost of 0. If the cost is unknown then allow null values also.

Resources