Is there any way to translate an EER model into a dimensional model automatically?

I've got an EER model in MySQL and would like to derive a dimensional model from it.
1. Can I translate it one-to-one?
2. Is there any way to create a dimensional model digitally?
EDIT: By question 2 I mean: is there a program made for that purpose, i.e. to draw such a figure? For now I assume I'd simply use Office.

The short answer is no, it is not possible. Even if it were technically possible (of course, it is always possible to write a program to perform almost any task), the resulting model would be useless.
A dimensional model is designed based on your reporting requirements, not on the structure of any source system you might be importing data from.
I'm not sure what you mean by your second point, "Is there any way to create a dimensional model digitally?". Everything you do on a computer is "digital", so could you please clarify exactly what you are asking here?


Question regarding role-playing dimension

I hope you can help me with one question regarding role-playing dimensions.
When using views for a role-playing dimension, does it matter which view is referred to later in the analysis? In particular, when sorting on the role-playing dimension, can this be done no matter which view is used?
Hope the question is clear enough. If not, let me know and I will elaborate.
Thanks in advance.
Do you mean you have created a view similar to "SELECT * FROM DIM" for each role the Dim plays? If that's all you've done, then you could use any of these views in a subsequent SQL statement that joins the DIM to a FACT table - but obviously if you use the "wrong" view it's going to be very confusing for anyone trying to read your SQL (or for you, trying to understand what you've written, in three months' time!)
For example, if you have a fact table with keys OrderDate and ShipDate that both reference your DateDim then you could create vwOrderDate and vwShipDate. You could then join FACT.OrderDate to vwShipDate and FACT.ShipDate to vwOrderDate and it will make no difference to the actual resultset your query produces (apart from, possibly, column names).
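If it helps to see this concretely, here is a minimal sketch using Python's sqlite3 module; the table, view, and column names are borrowed from the example above and the data is invented, so treat it as an illustration rather than a real schema:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical minimal schema based on the example above.
cur.executescript("""
CREATE TABLE DateDim (DateKey INTEGER PRIMARY KEY, CalendarDate TEXT);
CREATE TABLE FactOrders (OrderDate INTEGER, ShipDate INTEGER, Amount REAL);

-- One view per role; both are simple pass-throughs over the same dimension.
CREATE VIEW vwOrderDate AS SELECT * FROM DateDim;
CREATE VIEW vwShipDate  AS SELECT * FROM DateDim;

INSERT INTO DateDim VALUES (20240101, '2024-01-01'), (20240105, '2024-01-05');
INSERT INTO FactOrders VALUES (20240101, 20240105, 99.0);
""")

# Joining FACT.OrderDate to the "wrong" view still returns the same rows,
# because both views expose identical data.
for view in ("vwOrderDate", "vwShipDate"):
    rows = cur.execute(
        f"SELECT d.CalendarDate, f.Amount "
        f"FROM FactOrders f JOIN {view} d ON f.OrderDate = d.DateKey"
    ).fetchall()
    print(view, rows)

Either join prints the same resultset, which is exactly why using the "wrong" view is confusing rather than incorrect.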
However, unless the applicable attributes are very different for the different roles, I really wouldn't bother creating views for role-playing Dims: it's unnecessary overhead that's just going to confuse anyone you've given access to at this level of the DB (who presumably have pretty strong SQL skills to be given this level of access?).
If you are trying to make life easier for end users, then either create these types of "views" in the models of the BI tool(s) they are using - and not directly in the DB - or, if they are being given access to the DB, create View(s) across the Fact(s) and all their joined Dimensions.

How to preprocess a categorical feature (with a large number of unique values) before feeding it to a machine learning model?

Let's say I have a large dataset from an online gaming platform (like Steam) with the columns 'date, user_id, number_of_hours_played, no_of_games', and I have to write a model to predict how many hours a user will play on a given future date. Now, user_id has a large number of unique values (in the millions). I know that for categorical data we can use one-hot encoding, but I'm not sure what to do when I have millions of unique classes. Please also suggest any other methods we could use to preprocess the data.
Using the user id directly in the model is not a good idea, since, as you said, that would result in a large number of features, but also in overfitting, since you would get one id per line (if I understood your data correctly). It would also make your model useless for any new user id, and you would have to retrain your model each time you get a new user.
What I would recommend in the first place is to drop this variable and try to build a model with only the other variables.
Another idea you could try is to perform a clustering of the users based on the other features, and then pass the cluster as a feature instead of the user id (see the sketch below) - but I don't know if this is a good idea, since I don't know what kind of data you have.
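For illustration only, here is a rough sketch of that clustering idea, assuming a pandas DataFrame with the columns you listed (the values, the aggregations, and the cluster count are all invented):

import pandas as pd
from sklearn.cluster import KMeans

# Toy data with the columns from the question; values are made up.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3],
    "number_of_hours_played": [2.0, 3.5, 10.0, 12.0, 0.5, 1.0],
    "no_of_games": [1, 2, 5, 6, 1, 1],
})

# Aggregate per-user behaviour, then cluster the users on those aggregates.
user_features = df.groupby("user_id").agg(
    mean_hours=("number_of_hours_played", "mean"),
    total_games=("no_of_games", "sum"),
)
user_features["cluster"] = KMeans(n_clusters=2, n_init=10).fit_predict(user_features)

# Use the cluster label as a model feature instead of the raw user_id.
df = df.merge(user_features["cluster"].reset_index(), on="user_id")
print(df.head())

A new user can then be assigned to the nearest existing cluster instead of forcing a full retrain.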
Also, you are talking about making a prediction for a given date. The data you described doesn't suggest that, but if you have the number of hours across multiple dates, this is closer to a time-series prediction problem, which is different from a 'classic' regression problem.

Named Entity Recognition upper case issue

I recently switched the model I use for NER in spacy from en_core_web_md to xx_ent_wiki_sm.
I noticed that the new model always recognises fully uppercase words such as NEW JERSEY or NEW YORK as organisations. I would be able to provide training data to retrain the model, although it would be very time consuming. However, I am uncertain whether the model would lose the assumption that uppercase words are organisations, or whether it would instead keep the assumption and create some exceptions to it. Might it even learn that every all-uppercase word with fewer than five letters is likely to be an organisation and everything with more letters is not? I just don't know exactly how the training will affect the model.
en_core_web_md seems to deal fine with acronyms while ignoring words like NEW JERSEY. However, the overall performance of xx_ent_wiki_sm is better for my use case.
I ask because the assumption as such is still pretty useful, as it allows us to identify acronyms such as IBM as an organisation.
The xx_ent_wiki_sm model was trained on Wikipedia, so it's very biased towards what Wikipedia considers an entity and what's common in the data. (It also tends to frequently recognise "I" as an entity, since sentences in the first person are so rare on Wikipedia.) So post-training with more examples is definitely a good strategy, and what you're trying to do sounds feasible.
The best way to prevent the model from "forgetting" about the uppercase entities is to always include examples of entities that the model previously recognised correctly in the training data (see: the "catastrophic forgetting problem"). The nice thing is that you can create those programmatically by running spaCy over a bunch of text and extracting uppercase entities:
uppercase_ents = [ent for ent in doc.ents if all(t.is_upper for t in ent)]
See this section for more examples of how to create training data using spaCy. You can also use spaCy to generate the lowercase and titlecase variations of the selected entities to bootstrap your training data, which should hopefully save you a lot of time and work.
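As a rough sketch of that bootstrapping step (the sample text is invented, the model must be installed locally, and the (text, annotations) tuples assume spaCy v2's training-data format):

import spacy

nlp = spacy.load("xx_ent_wiki_sm")

texts = ["IBM opened a new office in NEW JERSEY."]  # placeholder corpus
train_examples = []
for doc in nlp.pipe(texts):
    # Keep only entities written entirely in uppercase.
    uppercase_ents = [ent for ent in doc.ents if all(t.is_upper for t in ent)]
    for ent in uppercase_ents:
        # Generate lowercase and titlecase variants to bootstrap training data.
        # (Assumes the entity string occurs once in the sentence.)
        for variant in (ent.text, ent.text.lower(), ent.text.title()):
            text = doc.text.replace(ent.text, variant)
            start = text.index(variant)
            train_examples.append(
                (text, {"entities": [(start, start + len(variant), ent.label_)]})
            )

print(train_examples)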

Designing a points based system similar to Stack Overflow in Ruby on Rails

I'm not trying to recreate Stack Overflow and I did look at similar questions but they don't have many answers.
I'm interested in how to design a rails app, particularly the models and their associations, in order to capture various different kinds of actions and their points amount. Additionally these points decay over time and there are possible modifiers in the form of other actions or other data I'm tracking.
For example if I were designing Stack Overflow (which again I'm not) it would go something like the following.
Creating a question = 5 points
Answering a question = 10 points
The selected correct answer is a x2 modifier on the points for answering a question.
From a design perspective it seems to me like I need 3 models for the key parts.
The action model is polymorphic so it can belong to questions, answers, or whatever; the kind of association is stored in the type field. It also contains a points field that is calculated at creation time by a lookup in the points model, which I discuss next. It should also update a total points field on the user model, which I won't discuss here.
The points model is a lookup table where actions go to figure out their points. It uses the action's type as a key. It also stores the numeric amount for the points and a field for their decay.
The modifier model is the one I'm not sure what to do with. I think it should probably be a lookup table too, like points, keyed on the action's type field. Additionally, it needs some sort of conditional for when it should be applied, and I'm not sure how to store a conditional statement. It also needs to store how the points are modified: for example x2, +5, -10, /100, etc. The other problem is how the modifier gets applied after the action has already happened. In my example it would be when a question is selected as answered - by that time the points have already been set. The only way I can think of doing it is to have an after_save on every model that could be a modifier, which checks the modifier table and applies it. That seems wrong to me somehow, though.
There are other problems too, like how to handle the decay. I guess I need a cron job that just recalculates everyone's points, but that seems like it doesn't scale well.
I'm not sure if I'm overthinking this or what, but I'd like some feedback.
I tend to prefer a log-aggregate-snapshot approach, where you log discrete events and then periodically aggregate the changes and store them in a separate table. This would allow you to handle something like decay as an insert job rather than an update job. Depending on how many votes there are, you could even aggregate them over time and just roll forward from a specific point (though there probably aren't enough per question or answer for this to be a concern); but given that you may have other things to track, like a user's total points, that may be a good thing to snapshot.
I think you need to figure out how you are going to handle decay before you address it in an aggregate snapshot table, however.
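As a language-neutral illustration of the pattern (plain Python rather than Rails, all names invented), with decay appended as a new event instead of an update of old rows:

from collections import defaultdict

events = []  # append-only log: (user_id, action_type, points)

POINTS = {"create_question": 5, "create_answer": 10}

def log_action(user_id, action_type):
    events.append((user_id, action_type, POINTS[action_type]))

def log_decay(user_id, amount):
    # Decay is just another appended event: an insert, not an update.
    events.append((user_id, "decay", -amount))

def snapshot_totals():
    # Periodic aggregation; in practice you'd persist this to a snapshot table.
    totals = defaultdict(int)
    for user_id, _action, points in events:
        totals[user_id] += points
    return dict(totals)

log_action(1, "create_question")
log_action(1, "create_answer")
log_decay(1, 2)
print(snapshot_totals())  # {1: 13}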
Rails now has a gem for this feature:
https://github.com/tute/merit

SPROC to update record: how to handle unchanged values

I'm calling an update SPROC from my DAL, passing in all(!) fields of the table as parameters. For the biggest table this is a total of 78.
I pass all these parameters even if only one value has changed.
This seems rather inefficient to me, and I wondered how to do it better.
I could define all parameters as optional and only pass the ones that changed, but my DAL does not know which values changed, because I'm just passing it the model object.
I could do a SELECT on the table before updating and compare the values to find out which ones changed, but that is probably way too much overhead as well(?).
I'm kinda stuck here ... I'm very interested in what you think of this.
EDIT: I forgot to mention: I'm using C# (Express Edition) with SQL Server 2008 (also Express). I wrote the DAL "myself" (using this article).
It's maybe not the latest state-of-the-art way of doing it (since it's from 2006 - "pre-LINQ", so to say, though LINQ works only for local SQL instances in Express anyway), but my main goal was learning C#, so I guess this isn't too bad.
If you can change the DAL (without the changes being discarded once the layer is "regenerated" from the new schema), I would recommend passing one structure containing the changed columns with their values, and another structure containing the key columns and values for the update.
This can be done using hashtables, and if the schema is known, it should be fairly easy to handle this in the "new" update function.
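As a rough sketch of that idea (in Python rather than C#, with hypothetical table and column names), building a parameterised UPDATE from the two structures might look like this:

def build_update(table, changed, keys):
    # Parameterised UPDATE touching only the changed columns.
    set_clause = ", ".join(f"{col} = ?" for col in changed)
    where_clause = " AND ".join(f"{col} = ?" for col in keys)
    sql = f"UPDATE {table} SET {set_clause} WHERE {where_clause}"
    return sql, list(changed.values()) + list(keys.values())

sql, params = build_update(
    "Customers",
    changed={"Email": "new@example.com"},
    keys={"CustomerId": 42},
)
print(sql)     # UPDATE Customers SET Email = ? WHERE CustomerId = ?
print(params)  # ['new@example.com', 42]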
If this is an automated DAL, these are some of the drawbacks of using DALs.
You could implement journalized change tracking in your model objects. This way you could keep track of any changes in your objects by saving the previous value of a property every time a new value is set. This information could be stored in one of two ways:
As part of each object's own private state
Centrally in a "manager" class.
In the first solution, you could easily implement this functionality in a base class and have it run in all model objects through inheritance.
In the second solution, you need to create some kind of container class that keeps a reference and a unique identifier for any model object that is created, and records all changes in its state in a central store. This is similar to the way many ORM (Object-Relational Mapping) frameworks achieve this kind of functionality.
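For the first solution, a minimal sketch (again in Python rather than C#, with invented names) of a base class that journals the previous value whenever a property is reassigned:

class ChangeTracking:
    def __init__(self):
        # Bypass our own __setattr__ while creating the journal itself.
        object.__setattr__(self, "_changes", {})

    def __setattr__(self, name, value):
        old = getattr(self, name, None)
        if old != value:
            self._changes[name] = (old, value)
        object.__setattr__(self, name, value)

    def changed_values(self):
        # Only the fields that actually changed need to reach the UPDATE.
        return {name: new for name, (_old, new) in self._changes.items()}

class Customer(ChangeTracking):
    pass

c = Customer()
c.email = "old@example.com"
c._changes.clear()          # pretend the object was just loaded from the DB
c.email = "new@example.com"
print(c.changed_values())   # {'email': 'new@example.com'}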
There are off the shelf ORMs that support these kinds of scenarios relatively well. Writing your own ORM will leave you without many features like this.
I find the "object.Save()" pattern leads to this kind of behavior, but there is no reason you need to follow that pattern (while I'm not personally a fan of object.Save(), I feel like I'm in the minority).
There are multiple ways your data layer can know what changed, and most of them are supported by off-the-shelf ORMs. You could also potentially make the UI and/or business layers smart enough to pass that knowledge into the data layer.
Two options that I prefer:
Generating/hand-coding update methods that only take the set of parameters that tend to change.
Generating the update statements completely on the fly.
