Role playing dimensions other than Date/Time - data-warehouse

I am working on a data warehouse project and came across the concept of a role-playing dimension. I am wondering: has anyone used a role-playing dimension table other than the Date/Time dimension? Wherever I search on this topic, the Date dimension is the only example given, and I can't think of any other dimension that plays multiple roles in a similar manner!

http://www.ibm.com/support/knowledgecenter/SSEP7J_11.0.0/com.ibm.swg.ba.cognos.ug_fm.doc/c_bp-multiplerelationships.html
http://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/role-playing-dimension/
I think these two examples explain it fairly well. Order Date, Ship Date, and Close Date in the IBM reference are, to me, a good example of a role-playing dimension: three different "roles" relating to the same time dimension.
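A minimal sketch of that pattern, with assumed table and column names: one physical dim_date table is joined three times under different aliases, once per role.

```python
# A minimal sketch of the date example above (table/column names are assumed):
# one physical dim_date table plays three roles via aliases in the join.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT);
CREATE TABLE fact_order (
    order_id       INTEGER PRIMARY KEY,
    order_date_key INTEGER REFERENCES dim_date(date_key),
    ship_date_key  INTEGER REFERENCES dim_date(date_key),
    close_date_key INTEGER REFERENCES dim_date(date_key)
);
INSERT INTO dim_date VALUES (20240101, '2024-01-01'), (20240105, '2024-01-05'), (20240110, '2024-01-10');
INSERT INTO fact_order VALUES (1, 20240101, 20240105, 20240110);
""")

# Each alias of dim_date is one "role" of the same physical dimension.
rows = conn.execute("""
SELECT o.order_id,
       od.full_date AS order_date,
       sd.full_date AS ship_date,
       cd.full_date AS close_date
FROM fact_order o
JOIN dim_date od ON od.date_key = o.order_date_key
JOIN dim_date sd ON sd.date_key = o.ship_date_key
JOIN dim_date cd ON cd.date_key = o.close_date_key
""").fetchall()
print(rows)   # [(1, '2024-01-01', '2024-01-05', '2024-01-10')]
```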

Related

Legality of data usage

So I'm working on a project for my university where our current plan is to use the YouTube API and do some data analysis. We have some idea of what's allowed, since we're looking at the Terms of Service and the Developer Policies, but we're not entirely sure about a few things.
Our project does not focus on anything like monetary gain or predicting estimated income from a video, nor on trying to determine user data such as passwords or usernames. It's much more about the content and statistics of the videos than anything else.
Our current ideas, which we want to be sure would be OK to do and use:
Determine the category of a video given its title
Determine the category of a video given its tags
Determine the category of a video given its description
Determine the category of a video given its thumbnail
Some combination of above to create an ensemble model
Clustering videos by category/view counts
Sentiment analysis on comments
Trending topics over time
This is just a vague list for now, but I would love to be able to reach out and figure out exactly what we're allowed to use the data for.
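For context, the kind of metadata pull we have in mind looks roughly like this; it's a sketch using the public videos.list endpoint of the YouTube Data API v3, and the API key and video id are placeholders.

```python
# Sketch of pulling the fields our ideas rely on (title, tags, description,
# category, view count) from the YouTube Data API v3 videos.list endpoint.
# The API key and video id below are placeholders.
import requests

API_KEY = "YOUR_API_KEY"
VIDEO_ID = "SOME_VIDEO_ID"

resp = requests.get(
    "https://www.googleapis.com/youtube/v3/videos",
    params={"part": "snippet,statistics", "id": VIDEO_ID, "key": API_KEY},
    timeout=10,
)
item = resp.json()["items"][0]
features = {
    "title": item["snippet"]["title"],
    "tags": item["snippet"].get("tags", []),          # tags may be absent
    "description": item["snippet"]["description"],
    "category_id": item["snippet"]["categoryId"],
    "view_count": int(item["statistics"]["viewCount"]),
}
print(features)
```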

Google Coral Model Selection?

I'm trying to find a good object detection model to use in my application, running on the Coral TPU, but I have a few questions about where to find a good option.
My application is watching a security camera for "interesting" objects, and notifying me in real-time when detected. As such, I have the following requirements:
fast. I would like to analyze images at a rate of around 5-10 frames per second, for quicker notification (you'd be surprised how far a car can move in one second)
accurate. I don't want to be notified that there is a train in my driveway every time the shadows change :)
Of course, both of these are "soft" requirements, but ideally they are the goal. So far, for model selection, all I have found are the models on the coral.ai page: https://coral.ai/models/object-detection/ which leads to my questions:
Those models are listed as "not production-quality models". If that is the case, where might I find production quality models?
If retraining is the answer to making them "production quality", how would I go about that? That page gives instructions for training the models to recognize additional object types, but I don't need that - I just need fast and accurate recognition of a handful of object types (people and the various forms of transportation they might use to arrive in my driveway, plus common mammals such as dogs and moose). I would also need to know where to get training material (I could pull frames off my camera, but that would be a royal pain).
Sticking to the models on that page, it looks like I have a choice of "fast", such as the SSD models, or "good", such as the EfficientDet-Lite models. Is that going to be generally true? I have also noticed the EfficientDet-Lite models use a LOT more CPU, even though they should be running on the Coral TPU.
Are there other differences between the SSD models and the EfficientDet-Lite models that would recommend one over the other?
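For reference, this is roughly how I'm running one of the coral.ai COCO SSD models now and simply ignoring the classes I don't care about; the pycoral calls are the documented ones, while the model, label, and image paths are placeholders.

```python
# Minimal sketch: run a coral.ai COCO detection model with pycoral, keep only a
# handful of classes, and raise the score threshold to cut false positives.
# Model, label, and image paths are placeholders.
from PIL import Image
from pycoral.adapters import common, detect
from pycoral.utils.dataset import read_label_file
from pycoral.utils.edgetpu import make_interpreter

MODEL = "ssd_mobilenet_v2_coco_quant_postprocess_edgetpu.tflite"
LABELS = read_label_file("coco_labels.txt")
WANTED = {"person", "car", "truck", "bicycle", "motorcycle", "dog"}

interpreter = make_interpreter(MODEL)
interpreter.allocate_tensors()

image = Image.open("frame.jpg").convert("RGB")
_, scale = common.set_resized_input(
    interpreter, image.size, lambda size: image.resize(size, Image.LANCZOS))
interpreter.invoke()

# A higher score_threshold means fewer "train in the driveway" false alarms.
for obj in detect.get_objects(interpreter, score_threshold=0.5, image_scale=scale):
    name = LABELS.get(obj.id, str(obj.id))
    if name in WANTED:
        print(f"{name}: {obj.score:.2f} at {obj.bbox}")
```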

Repeated measures for 3+ groups comparing percentages

I'm new to SPSS. I have data on skin cancer diagnoses for the years 2004-2018. I want to compare how the distribution of new cases across body parts changes between the different years. I've managed to create a crosstab and a grouped bar graph that show the percentages, but I would like to run a statistical analysis to see whether the changes in distribution over time are significant. The groups I have are face, trunk, arm, leg, and not specified. The number of cases for each year varies greatly, which is why I'm looking to compare the ratios (percentages) between the different body sites. The only explanations I've found all refer to repeated observations of the same subject, which is not the case here (a person is only included with their first diagnosis, so they can only appear in one of the years).
The analysis would be similar to comparing the percentages in an election between 3+ parties and how that distribution changes over the years, but I haven't found any tutorials for that. Please help!
The CTABLES or Custom Tables procedure, if you have access to it, will let you create a crosstabulation like the one you mention, and then let you test both for any overall change in the distribution of types and for differences between each pair of columns within each row.
More generally, problems like this would usually be handled with loglinear or logit models.
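Outside SPSS, the "any overall change" part of that test amounts to a chi-square test of independence on the year-by-body-site crosstab; a sketch in Python with scipy, using made-up counts purely to show the call, might look like this.

```python
# Chi-square test of independence on a year x body-site crosstab.
# The counts below are made up purely to illustrate the call.
import numpy as np
from scipy.stats import chi2_contingency

# rows = years, columns = face, trunk, arm, leg, not specified
counts = np.array([
    [120,  80, 40, 30, 10],   # year 1
    [150,  95, 55, 35, 12],   # year 2
    [210, 140, 90, 60, 15],   # year 3
])
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.4f}")
```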

Evaluate Collaborative-Filtering

I'm at the end of my project and my company asked me to evaluate the model without metrics. In brief, after obtaining the 10 best recommendations, I should check whether these recommendations are among the movies that the new user wanted to see. I don't understand how I can do that if the algorithm is what predicts these movies.
Eventually, I found a possible answer to my question. A user said that a possible approach could be to hide a few data points at random for every user, make recommendations using your algorithm, and then uncover the hidden data and see how many of them match the recommendations.
But I still don't have clear ideas. Could anyone help me?
Here is how you can do the evaluation (a sketch follows after these steps):
Filter the users who have more than 20 ratings with value 5 (the exact numbers will depend on your dataset);
Randomly select two movies per user;
That's our test set: it won't be used during training, but these movies should appear in the top recommendations for the selected users.
You can find more details and a practical implementation in the article about building a recommendation system based on Bayesian Personalized Ranking.
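A minimal sketch of that hold-out procedure, assuming ratings come as (user, movie, rating) tuples; recommend_top_n is a placeholder for whatever top-N function your trained model exposes.

```python
# Sketch of the hold-out evaluation above: hide two 5-star movies per qualifying
# user, retrain on the rest, then measure how often the hidden movies appear in
# each user's top-N list (hit rate). `recommend_top_n` is a placeholder.
import random
from collections import defaultdict

def build_test_set(ratings, min_liked=20, per_user=2, seed=0):
    """ratings: iterable of (user_id, movie_id, rating) tuples."""
    rng = random.Random(seed)
    liked = defaultdict(list)
    for user, movie, rating in ratings:
        if rating == 5:
            liked[user].append(movie)

    held_out = {u: rng.sample(m, per_user) for u, m in liked.items() if len(m) > min_liked}
    hidden = {(u, m) for u, movies in held_out.items() for m in movies}
    train = [r for r in ratings if (r[0], r[1]) not in hidden]
    return train, held_out

def hit_rate(held_out, recommend_top_n, n=10):
    """Fraction of hidden movies that show up in the user's top-n recommendations."""
    hits = total = 0
    for user, movies in held_out.items():
        recs = set(recommend_top_n(user, n))   # top-n from the model trained on `train`
        hits += sum(1 for m in movies if m in recs)
        total += len(movies)
    return hits / total if total else 0.0

# Usage (placeholders): train, held_out = build_test_set(all_ratings)
# model = fit_your_cf_model(train); print(hit_rate(held_out, model.recommend_top_n))
```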

Pattern for managing large user uploaded datasets?

I'm a relatively noob programmer. I am creating a web-based GIS tool where users can upload custom datasets ranging from 10 rows to 1 million. The datasets can have variable columns and data types. How do you manage these user-submitted datasets?
Is the creation of a table per dataset a bad idea? (BTW, I'll be using PostgreSQL as the database.)
My apologies if this is already answered somewhere, but my search did not turn up any good results. I may be using bad keywords in my search.
Thanks!
Creating a table per dataset is not a 'bad' idea at all. swivel.com was a very similar app to what you are describing; we used a table per dataset, and it worked very well for generating graphs from user-uploaded datasets and comparing data across datasets using joins. We had over 10k datasets, close to a million graphs, and some datasets were very large.
You also get a lot of free functionality out of your ORM layer; for instance, we could use ActiveRecord for working with a dataset (each dataset is a generated model class with its table set to the actual table).
Pitfall-wise, you have to do a LOT of joins if you have any kind of cross-dataset calculations.
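A minimal sketch of the table-per-dataset idea in Python with psycopg2 (the setup above used Rails/ActiveRecord); connection details are omitted and the column types are assumed to come from a whitelist you control.

```python
# Sketch of table-per-dataset with psycopg2: each upload gets its own generated
# table whose columns mirror the uploaded file. Assumes `pg_type` values come
# from a whitelist you control (never raw user input).
import psycopg2
from psycopg2 import sql

def create_dataset_table(conn, dataset_id, columns):
    """columns: list of (name, pg_type), e.g. [("city", "text"), ("pop", "bigint")]."""
    table = sql.Identifier(f"dataset_{dataset_id}")
    cols = sql.SQL(", ").join(
        sql.SQL("{} {}").format(sql.Identifier(name), sql.SQL(pg_type))
        for name, pg_type in columns
    )
    ddl = sql.SQL("CREATE TABLE {} (id bigserial PRIMARY KEY, {})").format(table, cols)
    with conn.cursor() as cur:
        cur.execute(ddl)
    conn.commit()

# Usage (placeholder DSN):
# conn = psycopg2.connect("dbname=gis user=app")
# create_dataset_table(conn, 42, [("city", "text"), ("population", "bigint")])
```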
My coworkers and I recently tackled a similar problem where we had a poor data model in MySQL and were looking for better ways to implement it. We weighed a few different options, including MongoDB, and ended up using the entity-attribute-value (EAV) model. The EAV model is essentially a three-column model; it allowed us to use a single model to represent a variable number of columns and data types.
You can read a little about our problem here, but it sounds like it might be a good fit for you too.
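For comparison, the three-column EAV layout described above might look roughly like this in PostgreSQL terms; every cell of every uploaded dataset becomes one row, and the table and column names are purely illustrative.

```python
# Illustrative DDL for the three-column entity-attribute-value layout: one
# shared table holds all datasets, one row per cell, values stored as text
# and cast on read. Names are made up for the sketch.
EAV_SCHEMA = """
CREATE TABLE dataset_cells (
    entity_id  bigint NOT NULL,  -- one logical row of one uploaded dataset
    attribute  text   NOT NULL,  -- the uploaded column name
    value      text,             -- the cell value, cast on read
    PRIMARY KEY (entity_id, attribute)
);
"""
print(EAV_SCHEMA)
```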
