I have a dataset that contains a number of different Subject IDs. Each is part of a specific Chain and was assigned a Generation number in an experiment. I have the mean StructureScore (how much a language is structured) for each participant, but I also want to see what the mean StructureScore is for each generation.
For example, Generation 7 of Chain E appears 4 times, so I want the mean score for those 4 participants. I'm not sure how to make a new dataset containing just those mean StructureScores. Any suggestion is appreciated.
Check this out!
This type of operation is exactly what aggregate was designed for:
Mean per group in a data.frame
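For example, here is a minimal sketch assuming your data frame is called d and has columns Chain, Generation and StructureScore (rename to match your actual data):

```r
# Hypothetical data; replace with your own data frame
d <- data.frame(
  SubjectID      = 1:8,
  Chain          = rep(c("E", "F"), each = 4),
  Generation     = c(7, 7, 7, 7, 6, 6, 7, 7),
  StructureScore = c(0.61, 0.58, 0.70, 0.66, 0.40, 0.45, 0.52, 0.49)
)

# Mean StructureScore for every Chain x Generation combination,
# returned as a new data frame with one row per group
gen_means <- aggregate(StructureScore ~ Chain + Generation, data = d, FUN = mean)
gen_means
```

If you want the mean per Generation regardless of Chain, drop Chain from the formula (StructureScore ~ Generation).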
The problem I would like to solve is how to choose the best seats on a train based on some ordered user preferences, e.g. whether they'd like a seat facing forwards or backwards (or don't care), whether they'd like a seat at a table or not, whether they need to be near a toilet, luggage rack, buffet car, or the door, whether they want a window or aisle seat, and whether they want the aisle to their left or right (which can be very important for someone with a stiff knee!).
Most customers will specify one or two preferences, others may specify more. For some, being near the toilet might be the most important factor; for others, having that table to work at might be the most important.
There may be more than one passenger (although they will share preferences). These should be sat as close to each other as possible. 2 passengers would ideally be sat next to each other, or opposite each other at a table seat. A group of 8 passengers might best be split into 2 groups of 4 or 4 groups of 2...
Position is defined by carriage number (seats in the same carriage are better than seats in different carriages) and by x/y coordinates within that carriage, so it's easy enough to calculate the distance between any pair of seats, but a BIG job to calculate distances between EVERY pair of seats.
Each [available] seat (pre-filtered by ticket class) will have the above attributes either defined or set to NULL (for unknown - seat facing is often unknown).
So for training I can provide a vast array of example trains and customer preferences with the best balance of preferences versus position.
For execution I want to provide a run-time-specific array of seats with attributes, a set of user preferences, a set of weightings for those preferences (e.g. passenger 1 thinks being near the toilet is most important, passenger 2 thinks having a table is most important, passenger 3 thinks being in the quiet carriage is...) and finally the number of passengers.
Output will be an array of seats (one per passenger) that strike the best compromise between matching as many customer preferences as possible (usually not possible to match all preferences) and keeping the seats fairly close to each other.
e.g. we might be able to match 2 preferences with seats 2 rows apart, but match 3 preferences with seats 10 rows apart...
Obviously distance will need a weighting just like the individual preferences do, which is needed to choose between those two options. I suppose a distance not greater than X becomes just one more customer preference...
I've not done any ML work before, so it's all going to be a learning exercise for me. I wish I had the time to just play and see what comes out, but I don't. I'd be happy to do that, but I need a reasonable expectation of a positive result, otherwise I'll have to focus on a more traditional approach. Limited time and all that...
So, my questions are:
Is this a suitable problem for machine learning?
If so, is brain.js a good choice, or is something else more suitable? AWS ML service perhaps?
Any advice on how to organise all my data into something suitable for an ML engine to process?
Machine Learning is good at finding hidden patterns in complex data. In your case, you would need a lot of data where user preferences are already matched with optimal seating arrangements.
You could then try to see if the ML model can actually make optimal seating arrangements by itself. It’s an interesting problem but it may also lead to unexpected seating :)
If you don’t have training data you could collect it live, by registering where people sit down, knowing their preferences.
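To make that concrete, one hypothetical way to lay out the collected records is one row per (passenger, candidate seat), with the seat's attributes, the passenger's preference weightings, and whether that seat was actually chosen; every column name here is made up:

```r
# Hypothetical training layout (all names are illustrative, not a real schema)
training <- data.frame(
  carriage       = c(1, 1, 2),
  facing_forward = c(TRUE, NA, FALSE),  # NA where seat facing is unknown
  has_table      = c(TRUE, FALSE, TRUE),
  dist_to_toilet = c(4, 12, 2),         # rows of seats away from nearest toilet
  pref_table     = c(0.8, 0.8, 0.8),    # passenger's weighting for a table
  pref_toilet    = c(0.1, 0.1, 0.1),    # passenger's weighting for the toilet
  chosen         = c(1, 0, 0)           # label: 1 = the seat the passenger took
)
```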
I have a bike rental dataset. The target variable is Count, i.e. the total count of bike rentals, which is the sum of two other variables in the dataset: the casual user count and the registered user count.
So my question is: how should I perform modelling on this dataset?
Please suggest a next step, as I'm thinking of dropping the casual and registered user variables and keeping only the Count variable as the target, along with the other predictor variables.
The question is rather vague but I will attempt to answer it.
I am not too sure what it is that you want to predict. I assume it is the number of bikes that would be rented out at some future time.
If the distinction between casual and registered is important and has significant meaning to the purpose of your project, then you should probably treat them as separate features and not combine them into one.
Conversely, if the distinction is not important and you only care about the total number of bikes, then you should be fine combining them and using the sum.
I think you should try to understand what you are trying to accomplish and what questions you wish to answer with your analysis.
I converted my two target variables into one by summing them up and then created a new model with only one target variable.
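For reference, a minimal sketch of that step, assuming the usual bike-sharing column names casual and registered (adjust to your data):

```r
# Hypothetical slice of the dataset; replace with your own data
bike <- data.frame(
  temp       = c(0.24, 0.31, 0.45, 0.52),
  humidity   = c(0.81, 0.70, 0.62, 0.55),
  casual     = c(3, 8, 25, 40),
  registered = c(13, 32, 90, 120)
)

# Combine the two component counts into the single target variable...
bike$count <- bike$casual + bike$registered

# ...and drop them so they cannot leak into the predictors
bike$casual     <- NULL
bike$registered <- NULL

# Fit a model with count as the only target
model <- lm(count ~ ., data = bike)
```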
Let's say I have a large dataset from an online gaming platform (like Steam) which has 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play in the future for a given date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I'm not sure what to do when I have millions of unique classes. Also, please suggest any other methods we can use to preprocess the data.
Directly using the user id in the model is not a good idea, since, as you said, it would result in a large number of features, but also in overfitting, since you would get one id per line (if I understood your data correctly). It would also make your model useless for a new user id, and you would have to retrain your model each time you get a new user.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to perform a clustering of the users based on other features, and then pass the cluster as a feature instead of the user id, but I don't know if this is a good idea since I don't know what kind of data you have.
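As a rough sketch of the clustering idea, assuming you can first aggregate per-user behaviour (all names here are hypothetical):

```r
# Hypothetical per-user summary features (replace with your real aggregates)
users <- data.frame(
  user_id   = c("u1", "u2", "u3", "u4", "u5", "u6"),
  avg_hours = c(0.5, 1.2, 6.0, 5.5, 12.0, 11.3),
  n_games   = c(2, 3, 10, 8, 25, 30)
)

# Cluster users on their behaviour rather than their id
set.seed(42)
km <- kmeans(scale(users[, c("avg_hours", "n_games")]), centers = 3)

# Use the cluster label as a model feature in place of user_id
users$user_cluster <- factor(km$cluster)
```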
Also, you are talking about making a prediction for a given date. The data you described doesn't suggest that, but if you have the number of hours across multiple dates, this is closer to a time-series prediction problem, which is different from a 'classic' regression problem.
TL;DR: What's the best way to model 37 different attributes/"checkpoints" (that can be graded as Pass/Fail/Not Applicable) in a dimension for a star schema where each row in the fact table is a communication that is graded against the checkpoints in question?
I've developed a star schema model where each row in the fact table is a single communication. These communications go through a series of graded "checks" (e.g. "Posted on Time", "Correct Email Subject", "XYZ content copied correctly", etc.) and each check can be graded as "Passed", "Missed", or "Not Applicable".
Different types of communications are graded on different sets of checks (e.g. one type of communication may only be graded on three checks, the rest being "Not Applicable", whereas another type of communication is graded on 19 checks). There are 37 total unique checks.
I've built a "CommunicationGrading" Type 2 slowly changing dimension to facilitate reporting of which "checks" communications are scoring most poorly. The dimension has 37 columns, one for each attribute, and each row is a permutation of the attributes and the score they can receive (pass/fail/NA). A new row is added when a new permutation becomes available - filling all possible permutations unfortunately returns millions of rows, whereas this way is < 100 rows, much less overhead. I've created 37 separate measures that aggregate the # of communications that have missed each of the 37 separate "checks".
I can quickly build a treemap in PBI, drag the 37 measures onto it, see the total # of communications that have missed each "check", and determine that X # of communications missed Y checkpoint this month. The problem comes when I want to use the visual as a slicer (e.g. selecting a check/tile on the treemap to see, in a table beneath the treemap, which individual communications missed that check) or when determining the top N "checks" for a given slice of data.
From what I can tell, the issue is that I'm using 37 different attributes and measures rather than one attribute and one measure (where I could drag the single measure into Values and the single attribute/column containing all checks into the Group field of the treemap visual). The problem is, I'm stumped on how best to model this Grading dimension. Would it involve trimming the dimension down to just two columns, one for the checks and one for the checks' possible scores, then creating a bridge table to handle the M:M relationship? Other ideas?
Your dimension (implemented as a junk dimension, something to google) is one way of doing it, although if going down that road I'd break it down into multiple dimensions of related checkpoints to massively reduce the permutations in each. It also isn't clear why this would need to be a Type 2: is there a history of this dimension you would need to track?
However, one approach I'd suggest exploring is having a new fact for each communication's score at each checkpoint: you could have one dimension for the grade result (passed, failed, not applicable) and one dimension for the checkpoint (which is just the checkpoint description). It would also allow you to count on that fact rather than having to maintain 37 different measures. You may wish to keep a fact at the communication level if there is some aggregate information to retain, but that would depend on your requirements.
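To illustrate the shape of that checkpoint-level fact (the table and column names below are only illustrative), counting misses then becomes a single group-by instead of 37 measures:

```r
# Illustrative checkpoint-level fact: one row per communication per applicable check
fact_check <- data.frame(
  communication_id = c(101, 101, 101, 102, 102),
  checkpoint       = c("Posted on Time", "Correct Email Subject", "XYZ content copied",
                       "Posted on Time", "Correct Email Subject"),
  grade            = c("Passed", "Missed", "Passed", "Missed", "Missed")
)

# A single "missed communications" count per checkpoint
aggregate(communication_id ~ checkpoint,
          data = subset(fact_check, grade == "Missed"),
          FUN  = length)
```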
What is generally considered the correct approach when you are performing a regression and your training data contains 'incidents' of some sort, but there may be a varying number of these items per training line?
To give you an example: suppose I wanted to predict the likelihood of accidents on a number of different roads. For each road, I may have a history of multiple accidents, and each accident will have its own attributes (date (how recent), number of casualties, etc.). How does one encapsulate all this information in one line?
You could, for example, assume a maximum of (say) ten incidents and include the details of each as separate inputs (date1, NoC1, date2, NoC2, etc.), but the problem is that we want each item to be treated similarly, and the model will treat items in column 4 as fundamentally separate from those in column 2, which it should not.
Alternatively, we could include one row for each incident, but then any other columns in each row which are not related to these 'incidents' (such as age of the road, width, etc.) will be included multiple times and hence bias the results.
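To make the two layouts concrete (all column names and values are just for illustration):

```r
# Layout 1: wide, padded to a fixed maximum number of incidents per road
wide <- data.frame(
  road_id = c("A1", "B2"),
  age     = c(12, 30),
  width   = c(7.3, 6.1),
  date1   = c("2022-03-01", "2021-07-15"),
  noc1    = c(2, 0),
  date2   = c("2023-01-20", NA),  # padding where a road has fewer incidents
  noc2    = c(1, NA)
)

# Layout 2: long, one row per incident, with road-level columns repeated
long <- data.frame(
  road_id = c("A1", "A1", "B2"),
  age     = c(12, 12, 30),
  width   = c(7.3, 7.3, 6.1),
  date    = c("2022-03-01", "2023-01-20", "2021-07-15"),
  noc     = c(2, 1, 0)
)
```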
What is the standard method used to accomplish this?
Many thanks