Best way to use actors when predicting a movie rating - machine-learning

So I'm trying to predict movie ratings based on several variables. I would like to include actors because that has a pretty large impact on the success of a movie. I've come up with several options.
1) Get the top 5 actors for each movie. Just have a unique integer that represents those actors and use that. I'm worried there are too many unique actors for a model to use this effectively though.
2) Take an average of all ratings of the movies that the actor performs in and use it like a key performance indicator. Have 5 separate columns for the top 5 actors in the movie with the KPI of each actor in the column.
3) Same as two except instead of five separate columns, combine them into a single value for the movie.
I'm thinking option two will be best. Is there a better way to go about this? If anyone has had any similar experiences I would love to hear how you solved it.
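To make options two and three concrete, here is a minimal sketch in plain Python. The movie data, the function names and the choice of the global mean rating as a fallback for unseen actors are all assumptions made for illustration, not something the question specifies.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical training data: (title, rating, billed actors in order).
train_movies = [
    ("Movie A", 8.0, ["Actor 1", "Actor 2"]),
    ("Movie B", 6.0, ["Actor 1", "Actor 3"]),
    ("Movie C", 4.0, ["Actor 3"]),
]

def actor_kpis(movies):
    """Mean rating of the movies each actor appears in (the per-actor KPI)."""
    ratings = defaultdict(list)
    for _title, rating, actors in movies:
        for actor in actors:
            ratings[actor].append(rating)
    return {actor: mean(values) for actor, values in ratings.items()}

def movie_features(actors, kpis, fallback, top_n=5):
    """Option 2: one KPI column per billed actor. Option 3: their mean."""
    columns = [kpis.get(actor, fallback) for actor in actors[:top_n]]
    combined = mean(columns) if columns else fallback
    # Pad with the fallback so every movie has exactly top_n columns.
    columns += [fallback] * (top_n - len(columns))
    return columns, combined

kpis = actor_kpis(train_movies)
# Assumption: unseen actors and empty slots get the global mean rating.
global_mean = mean(rating for _title, rating, _actors in train_movies)
columns, combined = movie_features(
    ["Actor 1", "Actor 3", "Actor 9"], kpis, global_mean
)
```

One caveat with either option: the KPI should be computed only from training movies, and ideally without the movie being scored, otherwise the target rating leaks into its own feature.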

Related

Better way to model RATED relationship in neo4j movie graph database

I want to know which is the better approach to modeling the [:RATED] relationship in a movie database in Neo4j. I can think of the following two approaches:
Approach 1 feels more straightforward and, from a design standpoint, more academically correct.
However, approach 1 requires n (:Movie) nodes. One might say that approach 2 looks more natural, as the graph can contain only one (:Movie) node for a particular movie ("The Matrix" in this case), which can exist regardless of whether anyone rates it. However, I feel less comfortable storing rating values on the [:RATED] relationship. Is that correct from a purely design perspective?
Also, what if we are dealing with a node that does not represent an entity? For example, suppose a bunch of cars replace the users in the image above and an accident replaces "The Matrix". In this case the (:Accident) node may not exist by default, but is only created when an accident occurs. Also, accidents faced by two different cars are different instances of (:Accident) and have many attributes associated with them, such as time and place. In this case it makes more design sense to create a separate (:Accident) node for each car whenever it encounters an accident, with the properties stored on that node, instead of having a single (:Accident) node with the properties stored on the relationships pointing from (:Car) to (:Accident). But then it will create a lot of (:Accident) nodes. What is the best approach for this scenario from a design perspective and a performance perspective?
Summarizing:
Is approach 2 perfectly fine from a design perspective? (Especially storing properties on relationships which might have been stored on nodes instead)
What are the possible design and performance drawbacks of approach 2?
In general, whatever approach you choose to use should fit your use cases and queries.
Given your example, approach 2, using one Matrix :Movie node, is a perfectly fine design given the use case of tracking movie ratings. This is the same approach used in the Movie graph you can load up in Neo4j. Try that out, and note that the graph would be chaotic and difficult to query if there were a separate :Movie node for every single relationship to a :Movie.
You'll note that in approach 1, there is absolutely nothing different between each of the Matrix :Movie nodes. That's a strong indicator that you should be modeling the thing as a single node instead of multiple. It's also more difficult to query if you're using multiple nodes for the same thing, as the database can no longer use a single node as a starting point for the movie to get data based on relationships from it. Your queries about the movie itself also become slightly more complicated, in that you will need to add LIMIT 1 when matching to the movie by name, otherwise the query will match to all the multiple Matrix movies, which could be in the thousands or more depending on how many ratings there are.
Even though some of the other queries you might use for this model are going to use similar Cypher, or even the same Cypher queries, you will be needlessly impacting db operations through this data model. Consider an average rating query. With a single Matrix :Movie node, it's a matter of matching on the single :Movie node (by indexed or unique name), then taking the average of all its relationships. With multiple Matrix :Movie nodes, your match will match on thousands (or more) redundant nodes, and for all of those nodes it will need to pull those relationships and average them together. That's a ton of db hits you didn't need to do.
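The difference in work can be illustrated with a small in-memory sketch. This is plain Python standing in for the graph, not Cypher or a Neo4j API, and the users, ratings and function names are made up for illustration.

```python
from statistics import mean

# Approach 2: one entry per movie. Each (user, rating) pair stands in for
# a [:RATED] relationship carrying a rating property.
rated_edges = {
    "The Matrix": [("alice", 5), ("bob", 4), ("carol", 3)],
}

def average_rating(title):
    """Look up the single movie node, then average over its relationships."""
    return mean(rating for _user, rating in rated_edges[title])

# Approach 1: a separate movie node per rating, so every node is scanned.
duplicated_nodes = [
    {"title": "The Matrix", "rated_by": "alice", "rating": 5},
    {"title": "The Matrix", "rated_by": "bob", "rating": 4},
    {"title": "The Matrix", "rated_by": "carol", "rating": 3},
]

def average_rating_duplicated(title):
    """Match every duplicate node for the title before averaging."""
    matches = [node for node in duplicated_nodes if node["title"] == title]
    return mean(node["rating"] for node in matches)
```

Both functions return the same average, but the second has to touch one node per rating just to find the movie, which mirrors the extra db hits described above.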
Also, keep in mind the difficulty of using this approach when combining this for other use cases. For example, consider if we had to change your data model to include actors and directors, similar to the movie db you can import in neo4j. If we had multiple nodes for every single rating for every single movie, which node would we use when creating relationships between actors and directors and the movie they worked in? With that kind of data model, there are no good choices for modeling this kind of data efficiently or clearly.
Considering your second case, it makes sense to make a new :Accident node with each accident, with details of the accident in each node. If two or more cars in your db are involved in the same accident, then it makes sense to use the same accident node to represent the accident, and make relationships from the multiple cars to the same accident they were involved in. That saves you from duplicating data about the same accident instance, and clearly models the participants in the accident, along with any other related data that is associated with the accident. You could always store accident data specific to the car in question on the relationship between the car and the accident, such as the damage sustained, and whether the driver of the car was found at fault.
It should be clear in this data model that there should be separate :Accident nodes (unless, as mentioned, it's the same accident for multiple cars), as the data between accidents will differ, and requires you to capture them in separate nodes. This is far different than your movie data model, where it does not make sense to use multiple :Movie nodes for the same movie, since the data is all the same.
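As a rough sketch of that shape (plain Python dataclasses standing in for graph nodes and relationships; the class, field and car names are assumptions for illustration), each accident is its own object, shared by every car involved, with car-specific details kept on the link:

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Accident:
    """One node per accident instance, shared by every car involved."""
    occurred_on: date
    place: str
    # Each entry stands in for a relationship from a car to this accident,
    # carrying the car-specific properties (damage, fault).
    involvements: list = field(default_factory=list)

    def involve(self, car, damage, at_fault):
        self.involvements.append(
            {"car": car, "damage": damage, "at_fault": at_fault}
        )

solo = Accident(date(2020, 5, 1), "Springfield")
solo.involve("car-1", damage="bumper", at_fault=True)

pileup = Accident(date(2020, 3, 9), "Shelbyville")
pileup.involve("car-2", damage="door", at_fault=False)
pileup.involve("car-3", damage="hood", at_fault=True)

accidents = [solo, pileup]

# Questions about accidents can be answered from the accident nodes directly.
multi_car = [a for a in accidents if len(a.involvements) > 1]
by_date = sorted(accidents, key=lambda a: a.occurred_on)
```

With one object per accident, "which accidents involved several cars" and "accidents in date order" are simple filters and sorts over the accidents themselves.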
As for storing data in relationships, again that depends upon your data model, and what makes the most sense. For ratings, storing the rating on the relationship to the movie looks fine to me.
There are cases where you may consider creating intermediary nodes to store data on a node instead of a relationship. Consider an employment graph, with :Person and :Company nodes. You could model this simply with :WORKS_AT relationships between nodes, but you would need to store data about the employment on the relationship, such as hireDate, salary, jobTitle, etc. That might be fine...but you could always extract that into its own node, an :Employment node between a :Person and a :Company to hold that data. That could let us index those properties, making it easier to query :Persons for a :Company in order of hireDate, for example, which wouldn't be as efficient if the data was on the relationships, as you can't index on relationship properties (at least in Neo4j versions without relationship property indexes).
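A minimal sketch of the intermediate-node idea, again in plain Python rather than Cypher; the Employment class, its fields and the sample people are hypothetical:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Employment:
    """Intermediate node sitting between a person and a company."""
    person: str
    company: str
    hire_date: date
    job_title: str

employments = [
    Employment("Ann", "Acme", date(2019, 6, 1), "Engineer"),
    Employment("Bob", "Acme", date(2017, 2, 15), "Analyst"),
    Employment("Cy", "Globex", date(2018, 9, 30), "Designer"),
]

def people_by_hire_date(company):
    """People employed at the company, earliest hire first."""
    rows = [e for e in employments if e.company == company]
    return [e.person for e in sorted(rows, key=lambda e: e.hire_date)]
```

Because the employment details live on their own record, they can be filtered, sorted and (in a database) indexed like any other node.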
EDIT
Concerning cardinality of nodes, when to use a single node instance vs multiple node instances, again, that's usually best answered as you answer the questions of "does this make logical sense for this data model" and "is this easy and efficient to query this data?"
The two cases you presented, for Matrix :Movie nodes and :Accident nodes, each demonstrate opposite cases for this.
A single Matrix :Movie node makes sense; I think it may be a stretch to find use cases that would require multiple copies of Matrix nodes.
However, if you had to model movie showings of The Matrix, then that might call for a :Showing node, of which there would be several (per time and per theater), but all of them referencing the same Matrix :Movie node. It's the same movie, but it has multiple showings.
For :Accidents, it makes sense to use multiple :Accident nodes, each one representing a particular instance of an accident. In many cases there will be only one :Car associated with a single :Accident node, a driver crashing into something without involving other drivers. In other cases, when it's a multi-car collision, then several cars are involved in the same :Accident, so you would have the :Accident node with the time and location and details, and relationships with the :Cars involved in that particular accident.
While it's possible to use a single :Accident node for ALL accidents, and have the details on the relationships, you'll quickly encounter problems with some of the likely queries you'll need to make. For example, how do you know which accidents were multi-car accidents, and which cars were involved? We would have to examine all relationships to the single :Accident node, and even then we'd have to do extra logic to figure out the associations. What if we wanted to order :Accidents by date? We can't use indexes on relationship properties, so again we have to touch on all relationships and inspect their properties and sort them all. What if we wanted to indicate location based on closest city to the accident, for fast lookup of accidents in certain cities? Again, we can't use indexes on relationship properties for fast lookup. If we already have :City nodes, we can't create relationships between the relevant :City node and the crash relationship; we need a node for that.
I could list more cases, but it's fairly clear that multiple :Accident nodes are needed per accident (again, sharing the node for :Cars involved in the same :Accident).
This is one of those cases where, even if you missed it when thinking about whether the data model makes sense, considering the kinds of queries you want to make, and their efficiency, should push you toward a better means of modeling your data...in this case, using multiple :Accident nodes.

Rails model with multiple values for a field

I have a model Movie. That can have multiple Showtimes. Each Showtime is a pair of start and end times. Movies get saved in the database.
So although a Movie might have_many Showtimes, does that really need to be a model, or just a class, or some kind of custom tuple-like type?
I have seen where you can have a field with an array of values, but this would not be basic values as each value is a pair of times.
What is the best way to achieve this?
Showtimes should be a model, yes. Here are a few reasons:
Most relational databases don't natively support a tuple or array type.
What if you want to query movies occurring at a particular time? This would be difficult to do with a custom field, but would be relatively trivial with a separate table.
Most importantly, it enables better flexibility and extensibility through decreased coupling. For instance, does a showtime always exist exclusively to a movie? What if you want to extend your schema to add theatres where each theatre has many showtimes?
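To illustrate the querying point, here is a sketch of the separate-table layout using Python's built-in sqlite3 module rather than Rails. The table and column names (movies, showtimes, starts_at, ends_at) and the sample rows are assumptions; in Rails the same shape would come from a Showtime model that belongs to Movie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE showtimes (
        id INTEGER PRIMARY KEY,
        movie_id INTEGER NOT NULL REFERENCES movies(id),
        starts_at TEXT NOT NULL,  -- ISO 8601 strings sort chronologically
        ends_at TEXT NOT NULL
    );
    INSERT INTO movies VALUES (1, 'The Matrix'), (2, 'Alien');
    INSERT INTO showtimes (movie_id, starts_at, ends_at) VALUES
        (1, '2024-01-05T18:00', '2024-01-05T20:16'),
        (1, '2024-01-05T21:00', '2024-01-05T23:16'),
        (2, '2024-01-05T19:30', '2024-01-05T21:27');
""")

def movies_playing_at(moment):
    """Titles of movies with a showtime that covers the given moment."""
    rows = conn.execute(
        """
        SELECT DISTINCT m.title
        FROM movies m JOIN showtimes s ON s.movie_id = m.id
        WHERE s.starts_at <= ? AND ? < s.ends_at
        ORDER BY m.title
        """,
        (moment, moment),
    ).fetchall()
    return [title for (title,) in rows]
```

With the times packed into a single custom field on the movie row, the same question would need parsing in application code instead of a plain WHERE clause.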

Fact table linked to Slowly Changing Dimension

I'm struggling to understand the best way to model a particular scenario for a data warehouse.
I have a Person dimension, and a Tenancy dimension. A person could be on 0, 1 or (rarely) multiple tenancies at any one time, and will often have a succession of tenancies over time. A tenancy could have one or more people associated with it. The people associated with a tenancy can change over time, and tenancies generally last for many years.
One option is to add tenancy reference, start and end dates to the Person Dimension as type 2 SCD columns. This would work well as long as I ignore the possibility of multiple concurrent tenancies for a person. However, I have other areas of the data warehouse where I am facing a similar design issue and ignoring multiple relationships is not a possibility.
Another option is to model the relationship as an accumulating snapshot fact table. I'm not sure how well this would work in practice though as I could only link it to one version of a Person and Tenancy (both of which will have type 2 SCD columns) and that would seem to make it impossible to produce current or historical reports that link people and tenancies together.
Are there any recommended ways of modelling this type of relationship?
Edit based on the patient answer and comments given by SQL.Injection
I've produced a basic model showing the model as described by SQL.Injection.
I've moved tenancy start/end dates to the 'junk' dimension (Dim.Tenancy) and added Person tenancy start/end dates to the fact table as I felt that was a more accurate way to describe the relationship.
However, now that I see it visually I don't think that this is fundamentally any different from the model that I started with, other than the fact table is a periodic snapshot rather than an accumulating snapshot. It certainly seems to suffer from the same flaw that whenever I update a type 2 slowly changing attribute in any of the dimensions it is not reflected in the fact.
In order to make this work to reflect current changes and also allow historical reporting it seems that I will have to add a row to the fact table every time a SCD2 change occurs on any of the dimensions. Then, in order to prevent over-counting by joining to multiple versions of the same entity I will also need to add new versions of the other related dimensions so that I have new keys to join on.
I need to think about this some more. I'm beginning to think that the database model is right and that it's my understanding of how the model will be used that is wrong.
In the meantime any comments or suggestions are welcome!
Your problem is similar to sale transactions with multiple items. The difference is that a transaction usually has multiple items, whereas your tenancy fact usually has a single person (the tenant).
Your hydra is born because you are trying to model the tenancy as a dimension, when you should be modeling it as a fact.
The reason I think you have a tenancy dimension is that somewhere you have a rent fact. To model the rent fact, consider using the same approach I stated above: if two persons are tenants of the same property, two fact records should be inserted each month:
1) And now comes some magic (that is no magic at all): split the value of the rent by the number of tenants and store it in the fact
2) store also the full value of the rent (you don't know how the data scientist is going to use the data)
3) check 1) with the business user (I mean the people that build the risk models); there might be some advanced rule on how to do the splitting (a similar thing happens when the cost of shipping is to be divided across multiple item lines of the same order -- it might not be uniformly distributed)
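The three points above could look roughly like this. It is only a sketch under assumptions: an even split, at least one tenant, hypothetical column names, and the rounding remainder parked on the first row, which is exactly the kind of rule point 3) says to confirm with the business.

```python
from decimal import Decimal, ROUND_HALF_UP

def rent_fact_rows(tenancy_id, month, tenants, full_rent):
    """One fact row per tenant, with an even split and the full amount."""
    full = Decimal(full_rent)
    share = (full / len(tenants)).quantize(
        Decimal("0.01"), rounding=ROUND_HALF_UP
    )
    rows = [
        {
            "tenancy_id": tenancy_id,
            "month": month,
            "person": person,
            "allocated_rent": share,  # point 1: the split value
            "full_rent": full,        # point 2: the full value as well
        }
        for person in tenants
    ]
    # Assumption: the rounding remainder goes on the first row so that the
    # allocated amounts sum back to the full rent.
    rows[0]["allocated_rent"] += full - share * len(tenants)
    return rows

rows = rent_fact_rows("T-1", "2024-01", ["ann", "bob", "cy"], "1000.00")
```

Summing allocated_rent across the rows gives the rent once, while full_rent on each row must not be summed across tenants of the same tenancy.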

How to organize 2 resources in a rails app that are associated

I'm planning a new Rails app for teaching lessons. Each lesson will belong to a track (e.g. Level I). So each track has_many lessons.
But each lesson should have a Lesson Number within each track. What is a good way to present the lessons in sequential order (e.g. Lesson 4, Track 2)?
I'm not sure exactly how to assign lesson numbers within the tracks and keep them in sequential order. If they didn't have to be organized into tracks then I could just use created_at to put them into sequential order.
Is the lesson number something that the user creating the lesson needs to assign manually when they create or update a lesson?
Anyone have opinions on a good way to do this?
I would have an extra column, e.g. position (or similar), in the lesson/track tables if you want your users to have control over what the lesson/track numbers are, as opposed to just having them in sequential order by creation date.
You can do that either by giving the lesson creator a field to enter the lesson number, or some sort of UI sortable interface they can use to organize the order. When displaying the lessons/tracks you'd probably want to order by the position field.
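A small sketch of the position-column idea, written in plain Python rather than ActiveRecord; the record layout and helper names are hypothetical. A new lesson defaults to the next free position in its track, and the creator can override it.

```python
# Hypothetical lesson records; "position" is the extra ordering column.
lessons = [
    {"track": 2, "position": 2, "title": "Loops"},
    {"track": 1, "position": 1, "title": "Setup"},
    {"track": 2, "position": 1, "title": "Variables"},
]

def next_position(track):
    """Default position for a new lesson: one past the track's maximum."""
    taken = [lesson["position"] for lesson in lessons if lesson["track"] == track]
    return max(taken, default=0) + 1

def add_lesson(track, title, position=None):
    """Append a lesson, letting the creator override the default position."""
    if position is None:
        position = next_position(track)
    lessons.append({"track": track, "position": position, "title": title})

def ordered_lessons():
    """Lessons sorted by track, then by position within the track."""
    return sorted(lessons, key=lambda lesson: (lesson["track"], lesson["position"]))

add_lesson(2, "Functions")
labels = [
    f"Lesson {lesson['position']}, Track {lesson['track']}"
    for lesson in ordered_lessons()
]
```

This sketch does not handle two lessons claiming the same position; a sortable UI would normally renumber the track when an item is moved.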

Handling lots of COUNT queries for a report

I am putting together a report that shows statistical information about products for a company that owns those products. This report, in the form I need, contains as many as 150 'counts', because we are filling the table with the counts for 12 product types against 15 different statistical categories.
Here's the set up of the models. I'm afraid it's a little complicated!
Company is the entity accessing the report.
Company has many Products through Matchings; and
Product has many Companies through Matchings.
Matching belongs_to Order.
Example report:
___________|_Available/Active/Light Available/Active/Heavy (+12 columns)__
Perishable |
Intangible |
(+10 rows) |
The product types are in the Product table (they run down the left side of the report).
The categories across the top of the report are combinations of three criteria: two from Product and one from Order.
Example - for one cell in the Perishable row, show me how many matchings exist for whom the order type is 'active', the product's weight is 'light' and the product status is 'available'.
On its own the above query is not too bad, but if I keep going like this I'm going to have ~170 queries for this report - both an inelegant and highly impractical solution. Is there a magic ActiveRecord way to deal with this scenario?
You could always create a background job to run regularly and pre-cache the results, or pre-generate the entire report. This would free your users from having to sit and wait for 170 queries to run, and I assume it would be acceptable to have slightly stale results.
As for the elegance and practicality of it, the only magic you could use is SQL. Your object model wasn't built for reporting; don't feel bad about using a tool that was.
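As a sketch of what "use SQL" can mean here, a single grouped query can return every cell of the report at once. The example uses Python's built-in sqlite3 module, and the column names (product_type, status, weight, order_type) and sample rows are guesses at the schema described in the question, not the asker's real tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (
        id INTEGER PRIMARY KEY, product_type TEXT, status TEXT, weight TEXT
    );
    CREATE TABLE orders (id INTEGER PRIMARY KEY, order_type TEXT);
    CREATE TABLE matchings (
        id INTEGER PRIMARY KEY,
        company_id INTEGER, product_id INTEGER, order_id INTEGER
    );
    INSERT INTO products VALUES
        (1, 'Perishable', 'available', 'light'),
        (2, 'Perishable', 'available', 'heavy'),
        (3, 'Intangible', 'available', 'light');
    INSERT INTO orders VALUES (1, 'active'), (2, 'active');
    INSERT INTO matchings (company_id, product_id, order_id) VALUES
        (7, 1, 1), (7, 1, 2), (7, 2, 1), (7, 3, 2), (8, 3, 1);
""")

def report_counts(company_id):
    """All report cells for one company, from a single grouped query."""
    rows = conn.execute(
        """
        SELECT p.product_type, p.status, o.order_type, p.weight, COUNT(*)
        FROM matchings m
        JOIN products p ON p.id = m.product_id
        JOIN orders o ON o.id = m.order_id
        WHERE m.company_id = ?
        GROUP BY p.product_type, p.status, o.order_type, p.weight
        """,
        (company_id,),
    ).fetchall()
    return {
        (ptype, status, otype, weight): n
        for ptype, status, otype, weight, n in rows
    }

counts = report_counts(7)
```

Combinations with no matchings simply do not appear in the result, so the report would read each cell with something like counts.get(key, 0). In Rails, a similar result shape is usually reachable by combining joins with a grouped count.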
There is a statistics gem that does this sort of thing. It does allow you to cache the statistics.
I've used it for lightweight statistics like counts and averages but have never taken benchmarks, which is definitely something you'll want to do if performance is a concern.
