My goal is to include or exclude dimensional data from a calculation that creates a category on that dimension, in this example Customer Name. I have achieved the inclusion/exclusion using parameters, but they only accept single values. That means I would need to create several parameters to achieve a selection of 10 items or more.
To explain the case in full: I'm using the SuperStore sample dataset on Tableau Desktop 2021.1, and I have created the following calculation
Top 10 Customers
IF {FIXED [Customer Name]: SUM([Sales])} > 10000
THEN [Customer Name]
ELSE "Other"
END
That renders the following visual
How can I move Bart Watters and Denny Joy to Other without filtering the data? The idea is to give the user the ability to classify, instead of hard-coding the selection into the calculation.
I was analyzing a dataset in which I have column names as follows: [id, location, tweet, target_value]. I want to handle the missing values for the column location in some rows. So I thought to extract the location from the tweet column for that row (if the tweet contains some location) and put that value in the location column for that row.
Now I have some questions regarding the above approach.
Is this a good way to do it? Can we fill some missing values by using the training data itself? Won't this be considered a redundant feature (because we are deriving the values of this feature from some other feature)?
Can you please clarify your dataset a little bit more?
First, if we assume that the location is where the tweet was posted from, then your method (filling out the location column in the rows where that information is missing) becomes wrong.
Secondly, if we assume that the tweet text correctly contains the location information, then you can fill out the missing rows using the tweets' location information.
If our second assumption is correct, then it would be a good way, because you are feeding your dataset with correct information. In other words, you are giving the model more detailed information so that it can predict more accurately in the testing process.
Regarding your question "Won't this be considered a redundant feature (because we are deriving the values of this feature from some other feature)?":
You can try to remove the location column from your model and train your model with the remaining 3 columns. Then you can check the success of the new model using different metrics (accuracy etc.) and compare it with the results of the model trained on all 4 columns. After that, if there is no important difference (i.e. the results do not get noticeably worse), you could say the column is redundant. You can also use Principal Component Analysis (PCA) to detect correlated columns.
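For example, a rough sketch of that comparison could look like the following (the file name and the toy feature construction are just placeholders; use whatever model and metric you already have):

# Sketch: compare a model with and without the 'location' column
# (column names are taken from the question, everything else is assumed).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

df = pd.read_csv("tweets.csv")  # assumed file with id, location, tweet, target_value
X = pd.get_dummies(df[["location"]]).join(df["tweet"].str.len().fillna(0).rename("tweet_len"))
y = df["target_value"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

full_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
acc_full = accuracy_score(y_test, full_model.predict(X_test))

# Same split, but drop every column derived from 'location'
loc_cols = [c for c in X.columns if c.startswith("location")]
reduced_model = RandomForestClassifier(random_state=0).fit(X_train.drop(columns=loc_cols), y_train)
acc_reduced = accuracy_score(y_test, reduced_model.predict(X_test.drop(columns=loc_cols)))

print(acc_full, acc_reduced)  # similar scores would suggest the column is redundant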
Finally, please NEVER use training data in your test dataset. It will give you an overly optimistic evaluation, and when you use your model in a real-world environment, it will most probably fail.
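If you do decide to fill missing locations from the tweet text, one way to avoid that kind of leakage is to split first and build the fill logic only from the training split. A toy sketch (the lookup heuristic and file name are made up for illustration):

# Sketch: split first, then derive the location fill only from the training split.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("tweets.csv")
train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

# Vocabulary of known locations built from the training split only
known_locations = set(train_df["location"].dropna().unique())

def extract_location(tweet):
    # Very naive lookup, just to show the idea
    if not isinstance(tweet, str):
        return None
    for loc in known_locations:
        if str(loc).lower() in tweet.lower():
            return loc
    return None

train_df["location"] = train_df["location"].fillna(train_df["tweet"].map(extract_location))
test_df["location"] = test_df["location"].fillna(test_df["tweet"].map(extract_location))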
Content-based filtering (CBF): it works on the basis of product/item attributes. Say user_1 has ordered (or liked) some items in the past.
Now we need to identify the relevant features of those ordered items and compare them with other items in order to recommend new ones.
Well-known models for finding similar items based on a feature set are random forests or decision trees.
Collaborative filtering (CLF): it uses user behavior. Say user_1 has ordered (or liked) some items in the past. Now we find similar users: users who ordered/liked the same items in the past can be considered similar. Now we can recommend some of the items ordered by similar users, based on scores.
One well-known model for finding similar users is KNN.
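For illustration only, a minimal sketch of the similar-user idea with scikit-learn's NearestNeighbors (the interaction matrix below is made up):

# Sketch: find similar users with k-nearest neighbours on a user-item matrix.
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Rows = users, columns = items, 1 = user ordered/liked the item (toy data)
user_item = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
])

knn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(user_item)
distances, neighbours = knn.kneighbors(user_item[[0]])  # users most similar to user 0
print(neighbours)  # e.g. [[0, 1]] -> user 1 behaves most like user 0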
Question: say I have to find similar users not based on their behavior (as I described for CLF) but based on some user profile features like nationality/height/weight/language/salary etc. Would that be considered CBF or CLF?
A second, related doubt I have is that neither CBF nor CLF will work for a new user in the system, as they have not done any activity in the system yet. Is that correct? Is the same true when the system is new or just launched, since we won't have much data then?
You can think of the content-based approach as a regression problem, where the x_i's are your data points and their corresponding y_i's are the ratings given by the user.
You have correctly stated CLF: it uses a user-item matrix, from which it creates item-item or user-user matrices, and then recommends products/items based on these matrices.
But in content-based you need to build a vector corresponding to each user. E.g., let's say we want to create a vector for a Netflix user. This vector can include features like how many movies this user has watched, what genre of movies they like, whether they are a critical user, etc., plus some of the features you have mentioned like average salary, and this vector will have a y_i which will be the rating. These kinds of recommendation systems are known as content-based, and this answers your first question.
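For illustration, a toy sketch of that setup (the feature names, ratings and the choice of a plain linear regression are assumptions, not the only option):

# Sketch: one feature vector per user plus a rating target, then a simple regression.
import pandas as pd
from sklearn.linear_model import LinearRegression

users = pd.DataFrame({
    "movies_watched": [120, 15, 300],
    "likes_drama":    [1, 0, 1],
    "likes_action":   [0, 1, 1],
    "avg_salary":     [45000, 60000, 52000],
})
ratings = pd.Series([4.0, 2.5, 4.5])  # y_i: rating given by each user

model = LinearRegression().fit(users, ratings)
print(model.predict(users.head(1)))   # predicted rating for the first user's profile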
Coming to your second question, about how one recommends items when a new user/item comes into the picture: this is known as the cold start problem. In that case you can use the geographical location of that user to pick the top items watched by people in their country and recommend based on that. Once they start rating those top items, both your CLF and content-based approaches can work as they normally do.
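A minimal sketch of that fallback (the view data and country codes are invented for illustration):

# Sketch: cold-start fallback that recommends the most-watched items in the new user's country.
import pandas as pd

views = pd.DataFrame({
    "country": ["US", "US", "US", "DE", "DE"],
    "item":    ["A", "A", "B", "C", "A"],
})

def cold_start_recommend(country, k=2):
    # Most frequently viewed items among users from the same country
    return (views[views["country"] == country]["item"]
            .value_counts()
            .head(k)
            .index.tolist())

print(cold_start_recommend("US"))  # ['A', 'B']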
I am using a time series database (InfluxDB) and I am trying to understand how to design a measurement (table).
My background is in relational databases, where it is common to join tables.
In my current project we are writing different sensor values (like temperature and pressure) for many vehicles to a measurement, along with associated identifiers, so that we know the specific details of each value we measure.
Measurement: Sensor_Trans
Tags: time, vehicleId, sensorId
Fields: value (temperature or pressure)
Later, when I want to use these values, I need additional details about the specific values.
Note that I currently have 20+ unique tags for each sensor measurement, like:
unit of measure, size of vehicle, sensor description, etc.
For example: I want to know the engine pressure in kPa for all cars with four doors.
For example: I want to know the exhaust temperature in degrees C for truck 89.
I'd like to know what is considered best practice when designing time series measurements (tables)?
1- Do I add more tags that provide the additional information directly to the measurement?
2- Do I keep the Vehicle and Sensor definitions in a relational table and join in code?
3- Other?
1- Do I add more tags that provide the additional information directly to the measurement? Yes, you can do that, but keep in mind that adding more tags also consumes more memory. Please refer to the system requirements at the following link:
https://docs.influxdata.com/influxdb/v1.7/guides/hardware_sizing/
2- Do I keep the Vehicle and Sensor definitions in a relational table and join in code? No need if you implement the above; alternatively, you could design a relational DB schema for your entire need instead of keeping two different databases.
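For reference, writing a point with a few extra descriptive tags via the InfluxDB 1.x Python client could look roughly like this (host, database name and tag values are assumptions for illustration):

# Sketch: write one point to Sensor_Trans with additional descriptive tags.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="telemetry")

point = {
    "measurement": "Sensor_Trans",
    "tags": {
        "vehicleId": "truck_89",
        "sensorId": "exhaust_temp_1",
        "unit": "degC",          # extra tags keep each point self-describing,
        "vehicleType": "truck",  # but every distinct tag value adds to series
        "doors": "4",            # cardinality and therefore memory use
    },
    "time": "2021-05-01T12:00:00Z",
    "fields": {"value": 412.5},
}

client.write_points([point])

# Querying by the descriptive tags then needs no join, e.g.:
# SELECT value FROM Sensor_Trans WHERE vehicleType = 'truck' AND unit = 'degC'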
Let's say I have a large dataset from an online gaming platform (like Steam) which has 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play in the future for a given date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I'm not sure what to do when I have millions of unique classes. Also, please suggest whether we can use any other method to preprocess the data.
Using the user id directly in the model is not a good idea, since that would result, as you said, in a large number of features, but also in overfitting, since you would get one id per line (if I understood your data correctly). It would also make your model useless for a new user id, and you would have to retrain your model each time you have a new user.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to perform a clustering of the users based on other features, and then pass the cluster as a feature instead of the user id, but I don't know if this is a good idea since I don't know the kind of data you have.
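A rough sketch of that clustering idea (the aggregated features, file name and number of clusters are assumptions; adjust to your data):

# Sketch: replace user_id with a cluster label learned from per-user aggregates.
import pandas as pd
from sklearn.cluster import KMeans

logs = pd.read_csv("play_logs.csv")  # date, user_id, number_of_hours_played, no_of_games

per_user = logs.groupby("user_id").agg(
    total_hours=("number_of_hours_played", "sum"),
    avg_hours=("number_of_hours_played", "mean"),
    games=("no_of_games", "max"),
)

kmeans = KMeans(n_clusters=20, random_state=0).fit(per_user)
per_user["user_cluster"] = kmeans.labels_

# Join the cluster id back onto the raw rows and use it instead of user_id
logs = logs.merge(per_user[["user_cluster"]], left_on="user_id", right_index=True)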
Also, you are talking about making a prediction for a given date. The data you described doesn't suggest that, but if you have the number of hours over multiple dates, this is closer to a time series prediction problem, which is different from a 'classic' regression problem.
I am trying to solve a problem with Mahout. The question is: we have users and courses, and a user can view a course or take a course. If a user is viewing a course frequently, then I have to recommend taking the course. I have data like userid and itemid, and there are no preferences associated with them.
EX:
1 2
1 7
2 4
2 8
3 5
4 6
where in the first column 1 is the userid and in the 2nd column 2 is the course id. The twist is that the 2nd column can hold both the viewed and/or completed versions of a particular course. Suppose courseA, when viewed, has id 2 and the same courseA, when taken, has id 7 for user 1. If a user other than user 1 comes and views courseA, then I have to predict that courseA should be taken. Now the problem here is that if all users are viewing a course but not taking it, then user-based recommendation in Mahout will fail, because from a business perspective we have to recommend that the course they are viewing should be taken. Do I need to factorize my dataset here, or which algorithm is best suited for this kind of problem?
One problem is that viewing may not predict (and certainly won't predict as well) that the user wants to take the course. You should look at the new cross-cooccurrence recommender stuff in Mahout v1. It's part of a complete revamp of Mahout on Spark using a new Scala DSL and a built-in optimizer for linear algebra. The command-line job you are looking for is spark-itemsimilarity, and it can ingest your user and item ids directly without translating them into cardinal non-negative numbers.
The algorithm takes the actions you know you want to recommend (user takes a course); these are the strongest "indicators" that can be used in your recommender. Then it finds correlated views: views that led to the user taking that course. This is done with the spark-itemsimilarity job, which can take two actions at a time, finding correlations, filtering out noise, and producing two "indicators". From the job you get two sparse matrices; each row is an item from the "user takes a course" action dataset, and the values are an ordered list of item ids that are most similar. The first output will be items similar by other people taking the course, the second will be items similar by other people viewing and taking the course.
Input uses application-specific IDs. You can leave your data mixed if you include a filter term that identifies the action. It looks something like:
user-id-1,item-id1,user-took-class
user-id-1,item-id2,user-viewed-class-page
user-id-1,item-id5,user-viewed-class-page
...
The output is text delimited (think CSV, but you can control the format) and is all item-id tokens; by default it looks like this:
item-id-1,item-id-100 item-id-200 item-id-250 ...
This is an item id, a comma, and an ordered list of similar items separated by spaces. Index this with a search engine and use the current user's history of action 1 to query against the primary indicator, and the user's history of action 2 against the secondary cross-cooccurrence indicator. These can be indexed together as two fields of the same doc, so there is only one query against two fields. This also gives you a server that is as scalable as Solr or Elasticsearch. You just create the data models with Mahout, then index and query them with a search engine.
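As a rough illustration of the indexing/query side with the Elasticsearch Python client (index name, field names and the sample histories are made up, and exact parameter names vary by client version):

# Sketch: index each item's indicator lists as one doc with two fields,
# then query both fields with the current user's history.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# One doc per course: indicators produced by spark-itemsimilarity
es.index(index="courses", id="item-id-1", body={
    "took_indicators": "item-id-100 item-id-200 item-id-250",
    "viewed_indicators": "item-id-7 item-id-42",
})

# Current user's history: action 1 = courses taken, action 2 = courses viewed
took_history = "item-id-200"
viewed_history = "item-id-7 item-id-42"

results = es.search(index="courses", body={
    "query": {
        "bool": {
            "should": [
                {"match": {"took_indicators": took_history}},
                {"match": {"viewed_indicators": viewed_history}},
            ]
        }
    }
})
print([hit["_id"] for hit in results["hits"]["hits"]])  # recommended courses, best first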
Mahout docs: http://mahout.apache.org/users/recommender/intro-cooccurrence-spark.html
Presentation on the theory and other things you can do with these techniques: http://www.slideshare.net/pferrel/unified-recommender-39986309
Using this technique you can take virtually the entire user clickstream, recorded as separate actions, and use it to make better recommendations. The actions don't even have to be on the same items. You can use the user's search term history, for instance, and get a cross-cooccurrence indicator. In this case the output would have search terms that led users to take the course, and so your query would be the current user's search term history.