Data warehouse - multiple choice survey - data-warehouse

I want to use a data warehouse to store the questions and answers from a multiple-choice survey, so my proposal is to design a star schema. So far I have done the following:
I built a fact table with the following fields: userID, surveyID, questionID, answerID and date. I also built dimension tables for users, surveys, questions and answers.
The purpose is to build reports that let us analyze the percentage of correct answers that users give per question, which questions have the highest error rate, and which answers are most common for the questions with the highest error rate.
However, to get this I would need a table relating the question and answer dimensions, and this would break the star schema.
I am new to designing this type of model; could someone guide me?
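A minimal sketch of the design described above, using Python's built-in sqlite3 module. The table and column names, the sample rows, and in particular the is_correct flag on the answer dimension are assumptions made for illustration; they are not part of the original design. In this sketch the fact row itself carries both the question key and the answer key, so the reports can be computed without a separate table relating the two dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_answer (
    answer_id  INTEGER PRIMARY KEY,
    is_correct INTEGER NOT NULL      -- assumed attribute: 1 = correct, 0 = wrong
);
CREATE TABLE fact_response (
    user_id INTEGER, survey_id INTEGER,
    question_id INTEGER, answer_id INTEGER, response_date TEXT
);
-- Hypothetical sample data: two questions, two answers each, three users.
INSERT INTO dim_answer VALUES (10, 1), (11, 0), (20, 1), (21, 0);
INSERT INTO fact_response VALUES
    (1, 1, 1, 10, '2020-01-01'), (1, 1, 2, 21, '2020-01-01'),
    (2, 1, 1, 10, '2020-01-02'), (2, 1, 2, 21, '2020-01-02'),
    (3, 1, 1, 11, '2020-01-03'), (3, 1, 2, 20, '2020-01-03');
""")

def error_rates():
    """Return (question_id, error_rate) pairs, worst question first."""
    return conn.execute("""
        SELECT f.question_id, 1.0 - AVG(a.is_correct) AS error_rate
        FROM fact_response f JOIN dim_answer a ON a.answer_id = f.answer_id
        GROUP BY f.question_id ORDER BY error_rate DESC
    """).fetchall()

def common_answers(question_id):
    """Return (answer_id, count) pairs for one question, most frequent first."""
    return conn.execute("""
        SELECT answer_id, COUNT(*) AS n FROM fact_response
        WHERE question_id = ? GROUP BY answer_id ORDER BY n DESC
    """, (question_id,)).fetchall()

worst_question = error_rates()[0][0]
print(worst_question, common_answers(worst_question))
```

Whether the correctness flag belongs on the answer dimension or as a measure on the fact row is a design choice; the sketch only shows that the three reports can be expressed against a single fact table.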

Related

Concept Based Text Summarization (Abstraction) [closed]

I am looking for an engine that does AI text summarization based on the concept or meaning of a sentence. I looked at open-source projects (ginger, paraphrase, ace), but they don't do the job.
They work by finding synonyms for each word and substituting them for the original words. This generates a lot of alternatives to a sentence, but the meaning is wrong most of the time.
I have worked with Stanford's engine to highlight an article and extract the most important sentences based on those highlights, but this is still not abstraction; it's extraction.
It would also make sense for the engine I'm looking for to learn over time, so that results improve after each summary.
Please help out here, your help is greatly appreciated!
I don’t know of any open-source project that fits your requirements for abstraction and meaning, as I understand them.
But I have some ideas about how to build such an engine and how to train it.
In a few words, I think we all keep some Bayesian-network-like structure in our minds, which helps us not only to classify data but also to form an abstract meaning of a text or message.
Since it is impossible to extract that whole structure of abstract categories from our minds, I think it’s better to build a mechanism that allows us to reconstruct it step by step.
Abstract
The key idea of the proposed solution is to extract the meaning of a conversation using approaches that are easier for an automated computer system to operate on. This makes it possible to create a convincing illusion of a real conversation with another person.
The proposed model supports two levels of abstraction:
The first, less complex level consists of recognizing a single word or a group of words as relating to a category, an instance, or an instance attribute.
An instance is an instantiation, from a general category, of a real or abstract subject, object, action, attribute or other kind of entity. An example is a concrete relation between two or more subjects: the relation between a particular employer and employee, or between a particular city and the country where it is situated.
This basic meaning-recognition approach allows us to create a bot that is able to sustain a conversation. This ability is based on recognizing the basic elements of meaning: categories, instances and instance attributes.
The second, more complicated level is based on recognizing scenarios and storing them in the conversation context together with instances and categories, as well as using them to complete other recognized scenarios.
Related scenarios are used to complete the next message of the conversation. Some scenarios can also be used to generate the next message, or to recognize a meaning element by applying conditions and using meaning elements from the context.
Something like this:
The basic classification should be entered manually, with later corrections and additions by teachers.
Words and scenarios from a sentence in the conversation can be filled in from the context.
Conversation scenarios/categories can be filled in with previously recognized instances or with instances described later in the conversation (self-learning).
Pic 1 – basic flow of word detection/categorization
Pic 2 – big-picture view of the overall system
Pic 3 – meaning element classification
Pic 4 – a possible basic structure of categories

How to recommend K similar items to a user who has bought N items?

Suppose a user buys n items from my website; I need an algorithm or a method (using Mahout maybe? How?) so that I can recommend k similar items to the user. I don't have user ratings.
The k recommendations need to be based upon the user's buying history (his n items).
The items have fields such as "name", "author" and "keywords", and I need to recommend the most similar items. What happens if I add user ratings along with this? How would I take that into account?
I've read the Mahout docs, but it seems to always need some sort of ratings. How will I provide ratings, if say, I have had only a couple of customers so far?
There is no perfect way to build a recommender.
Recommendations without user ratings
Calculate the item-item similarity according to the keywords, name and author. Then you can propose the most similar items not seen yet. As items don't change often, you can store the similarity-table somewhere.
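A minimal sketch of this idea in plain Python rather than Mahout. It assumes Jaccard overlap of tokens as the similarity measure, which is only one of several reasonable choices; the function names and the sample catalog are hypothetical.

```python
def tokens(item):
    """Collect lowercase tokens from the name, author and keywords fields."""
    words = set(item["name"].lower().split())
    words.add(item["author"].lower())
    words.update(k.lower() for k in item["keywords"])
    return words

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(bought_ids, catalog, k):
    """Rank items not yet bought by their best similarity to any bought item."""
    bought = [tokens(catalog[i]) for i in bought_ids]
    scores = {}
    for item_id, item in catalog.items():
        if item_id in bought_ids:
            continue
        t = tokens(item)
        scores[item_id] = max((jaccard(t, b) for b in bought), default=0.0)
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]

# Hypothetical sample data.
catalog = {
    1: {"name": "Intro to SQL", "author": "A. Writer", "keywords": ["database", "sql"]},
    2: {"name": "Advanced SQL", "author": "A. Writer", "keywords": ["database", "sql"]},
    3: {"name": "Garden Basics", "author": "B. Green", "keywords": ["plants"]},
}
print(recommend({1}, catalog, 1))
```

Because the similarities depend only on item fields, the pairwise scores can be precomputed and stored, as suggested above, instead of being recomputed per request.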
Recommendations with user ratings
If you don't want to have user ratings, you could also store the view-history of a user. This results in a "boolean" rating (only having "seen" and "not seen"). With this pseudo-rating, you can generate recommendations with user-similarity. Users having seen similar things are similar.
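The boolean pseudo-rating approach could look roughly like the following sketch, again in plain Python rather than Mahout. The Jaccard measure, the function names and the sample view histories are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend_from_views(user, views, k):
    """Score unseen items by the summed similarity of the users who saw them."""
    seen = views[user]
    scores = {}
    for other, other_seen in views.items():
        if other == user:
            continue
        sim = jaccard(seen, other_seen)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for item in other_seen - seen:
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]

# Hypothetical view histories: user -> set of item ids seen.
views = {
    "ann": {"a", "b"},
    "bob": {"a", "b", "c"},
    "cid": {"d"},
}
print(recommend_from_views("ann", views, 2))
```

Here "ann" and "bob" share two seen items, so the item only "bob" has seen is proposed to "ann"; "cid" shares nothing with "ann" and is ignored.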
For further reading, I strongly recommend the book Mahout in Action. It contains a lot of information about how to use Mahout.

Apache Mahout: can we combine User-Item and Item-Item?

I am new to Mahout, and am still playing around with it.
My question is, is it appropriate to combine Item-Item and User-Item?
My use case is a social networking application that recommends something to the current user based on the user's own historical data (with higher priority), combined with recommendations derived from the historical data of the user's friends (with lower priority), and displays the result as a list ordered by rating.
The reason is that a new user might not have much historical data in the system, so we can recommend something based on his friends' historical data. Once the user accumulates enough historical data, the recommendations should be based more on that.
Is it appropriate to design the system in this way?
Thank you for your time,
George
This is fairly simple to write. You can create recommendations for the user, and then combine them with recommendations for the other users. A naive version of this logic would be to merge the lists of recommendations, adding the scores of items that appear in more than one list. Maybe you add N friends' recs together, and then add N times the user's own recs. You then take the top recommendations from this merged list.
This doesn't exist in the project per se but it's quite easy to write a method to do this on the List<RecommendedItem> that comes back from recommend().
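The merge logic described above can be sketched as follows. This is plain Python operating on (item_id, score) pairs rather than Mahout's Java API; the function name, the pair representation and the sample scores are hypothetical, and falling back to a weight of 1 when there are no friends is an added assumption.

```python
from collections import defaultdict

def merge_recommendations(own, friends, k):
    """Merge (item_id, score) lists, weighting the user's own list N times.

    own:     list of (item_id, score) for the current user
    friends: list of such lists, one per friend (N = len(friends))
    """
    totals = defaultdict(float)
    for recs in friends:
        for item_id, score in recs:
            totals[item_id] += score
    # Assumption: with no friends, the user's own scores keep a weight of 1.
    own_weight = max(1, len(friends))
    for item_id, score in own:
        totals[item_id] += own_weight * score
    ranked = sorted(totals.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:k]

# Hypothetical scores.
own = [(1, 1.0), (2, 0.5)]
friends = [[(2, 0.5), (3, 0.75)], [(3, 0.25)]]
print(merge_recommendations(own, friends, 2))
```

The weight on the user's own list is the knob that shifts priority between personal history and friends' history; a real system would probably tune it rather than fix it at N.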

Data Warehousing: Redundant combinations of dimensions

I have built my own, very basic data warehouse. In it I have very simple cubes, for example:
Fact: ReviewRatingByday
Dimensions: Review, Organization, Date
In the OLTP side of my application, an Organization has a 1 to many relationship with Reviews.
Currently my data warehouse provides my Fact's extract function with all possible combinations of the dimensions. This results in redundant combinations where a given Review is combined with an Organization, yet the Review is in fact associated with a different Organization.
How do other data warehouse systems avoid this?
Should I mirror my OLTP relationships in my Dimensions?
I don't really understand your question. If some combinations of Review and Organization do not exist in the source data, then you will have no rows for them in the fact table anyway. So where is the "redundant combination"?
I think you might be asking, "how do I show users only valid combinations of Review and Organization when they select their report criteria". If that's correct then you have two main options:
Use a reporting tool that is able to present only valid combinations to the user
Combine Review and Organization into a single dimension that contains all valid combinations of Review and Organization (Kimball's term for this is a mini-dimension)
If I misunderstood your question, please give some more information about exactly what your issue is, especially what you mean by "redundant combination".

Design of a data warehouse with more than one fact tables

I'm new to data warehousing. First, I should mention that my copy of The Data Warehouse Toolkit is on its way to my mailbox (snail mail :P). But I'm already studying all this stuff with what I find on the net.
What I don't find on the net, however, is what to do when you seem to have more than one fact in a DW. In my case (insurance), I have refunds that occur on a non-regular basis. One client can have none for 3 months and then ten in the same month. On the other hand, I have a "subscription fee" (not sure what the correct English term is, but you get the point) that occurs every month or every three months. These clearly seem like two distinct facts to me.
Those two are loosely coupled by some dimensions, like the client or the "insurance product". Now, are these two different warehouses, on which I have to produce two different reports and then connect the reports outside of the DW? Or is there a way to design this to fit a single decent DW? Or should I combine these two facts into one? I would probably lose granularity on refunds then.
Some blog I read said a DW always has one fact table. Others mention the step of designing the fact tables, plural, but there is no clear instruction on whether there is a link between them or whether they are just distinct components of the same DW project.
Does anyone know some references on that precise part of DW design?
I realize that I am answering an old post, but I am not satisfied with either of the answers provided. I feel that neither answered the question.
A schema can have one or more facts, but these facts are not linked by any key relationship. It is best practice not to join fact tables in a single query as you would when querying a normalized/transactional database. Because of the many-to-many joins this produces, the results would be incorrect if attempted.
The answer you are looking for is that you need to "drill across", which basically means that you query each fact table (schema) separately and merge the results. This can be done using SQL or, preferably, via a reporting/analytics tool that references the data warehouse. Instead of duplicating the answers on how to do this, I will direct everyone to two very good articles:
Three ways to drill across by Chris Adamson
and
The Soul of the Warehouse - Drilling Across by Ralph Kimball
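As a rough illustration of drilling across, the sketch below aggregates two fact tables separately and merges the results on a shared dimension key. It uses Python's built-in sqlite3 module; the table names, columns and sample amounts are hypothetical and are not taken from the articles above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_refund (client_id INTEGER, amount REAL);
CREATE TABLE fact_fee    (client_id INTEGER, amount REAL);
-- Hypothetical sample data.
INSERT INTO fact_refund VALUES (1, 50), (1, 25), (2, 10);
INSERT INTO fact_fee    VALUES (1, 100), (2, 100), (2, 100), (3, 100);
""")

def totals_by_client(table):
    """Aggregate one fact table on its own, at the grain of the shared dimension."""
    # The table name comes from this script, never from user input.
    rows = conn.execute(
        f"SELECT client_id, SUM(amount) FROM {table} GROUP BY client_id"
    )
    return dict(rows)

def drill_across():
    """Merge the two separately aggregated result sets on client_id."""
    refunds = totals_by_client("fact_refund")
    fees = totals_by_client("fact_fee")
    clients = sorted(set(refunds) | set(fees))
    return [(c, fees.get(c, 0.0), refunds.get(c, 0.0)) for c in clients]

for client_id, fee_total, refund_total in drill_across():
    print(client_id, fee_total, refund_total)
```

Joining the two fact tables directly on client_id would instead pair every refund row with every fee row for the same client, so client 1's single fee would be counted twice; aggregating first avoids that fan-out.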
You can have as many fact tables as you like. In your example you may have something like:
dimProduct lists several products -- subscription being one of those.
dimTransactionType would list possible transactions (purchase, refund, recurring subscription fee ...)
Now suppose you are interested in simplified subscription reporting, you could add a factSubscription like this:
Taking your questions backwards.
A data warehouse can have more than one fact table. However, you do want to minimize joins between fact tables. It's ok to duplicate fact information in different fact tables.
Of the objects you mentioned:
Refund is a fact. Timestamp is the dimension of the refund fact.
Subscription fee is a fact. Timestamp is the dimension of the subscription fee fact.
A refund can happen more than once. I'm guessing that each customer has one subscription fee. So it appears we have two fact tables so far, customer, and customer refund.
If you knew that there could only be at the most 3 refunds (as an example), then you would eliminate the customer refund fact table, and put 3 refund columns in the customer table.
You also mention insurance. A customer can have more than one policy. So we have a third fact table.
A data warehouse is usually designed using a star schema. The star schema is basically one fact table connected to one or more dimension tables. You'll probably have more than one star in a data warehouse, since we already defined 3 fact tables.
