SurveyMonkey: editing a Likert scale

I am currently working on my dissertation, and as part of it I have constructed a questionnaire on SurveyMonkey.
In one of the questions, a matrix type with ten items and four choices, I made a miscalculation: I graded the scale from 1-4 instead of 0-3, and I already have about 14 responses. Now, if I edit the questionnaire and recode the scale to 0-3, how will that affect the answers I already have?
Are they going to change automatically to conform to the new scale, or is it going to disrupt my whole questionnaire?

You can change it and it won't affect your responses. The weight you set on a choice is used during the analysis stage and is not actually stored with the response.
The help docs don't say so explicitly, but they seem to suggest it's safe. From the Analyzing Results section:
If needed, you can change the weight of each answer choice in the Design section of the survey, even after the survey has collected responses.
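If you'd rather leave the live survey untouched, the same recoding can be applied after export: since a 1-4 and a 0-3 grading differ only by a constant shift, it is just a subtraction. A minimal sketch, assuming you export the answers as plain numeric codes (the sample values below are invented):

```python
# Recode one exported matrix-question column from a 1-4 scale to 0-3.
# 'responses' is a hypothetical exported column, not real survey data.
responses = [1, 4, 2, 3, 4, 1]

recoded = [r - 1 for r in responses]  # shift every answer down by one
print(recoded)  # [0, 3, 1, 2, 3, 0]
```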

Designing a Data Warehouse/ Star Schema - Choosing facts

Consider a crowdfunding system whereby anyone in the world can invest in a project.
I have the normalized database design in place and now I am trying to create a data warehouse (OLAP) from it.
I have come up with the following:
This has been denormalized and I have chosen Investment as the fact table because I think the following examples could be useful business needs:
Look at investments by project type
Investments by time periods i.e. total amount of investments made per week etc.
Having done some reading (The Data Warehouse Toolkit by Ralph Kimball), I feel like my schema isn't quite right. The book says to declare the grain (in my case, each investment) and then add facts within the context of the declared grain.
Some facts I have included do not seem to match the grain: TotalNumberOfInvestors, TotalAmountInvestedInProject, PercentOfProjectTarget.
But I feel these could be useful as you could see what these amounts are at the time of that investment.
Do these facts seem appropriate? Finally, is the TotalNumberOfInvestors fact implicitly made with the reference to the Investor dimension?
I think "one row for each investment" is a good candidate grain.
The problem with your fact table design is that you include columns which should actually be calculations in your data application (OLAP cube).
TotalNumberOfInvestors can be calculated by taking the distinct count of investors.
TotalAmountInvestedInProject should be removed from the fact table because it is actually a calculation with assumptions. Try grouping by project and then taking the sum of InvestmentAmount, which is a more natural approach.
PercentOfProjectTarget is calculated by taking the sum of FactInvestment.InvestmentAmount divided by the sum of DimProject.TargetAmount. A constraint for making this calculation work is having at least one member of DimProject in your report.
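All three measures can be derived at query time from the investment grain. A minimal sketch using SQLite (the table names follow the question, but the schema details and numbers are invented for illustration):

```python
import sqlite3

# Sketch: compute the "facts" at query time instead of storing them.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimProject (ProjectKey INTEGER PRIMARY KEY, TargetAmount REAL);
CREATE TABLE FactInvestment (
    ProjectKey INTEGER, InvestorKey INTEGER, InvestmentAmount REAL);
INSERT INTO DimProject VALUES (1, 1000.0);
INSERT INTO FactInvestment VALUES (1, 10, 250.0), (1, 11, 250.0), (1, 10, 100.0);
""")

row = con.execute("""
    SELECT COUNT(DISTINCT f.InvestorKey)            AS TotalNumberOfInvestors,
           SUM(f.InvestmentAmount)                  AS TotalAmountInvested,
           SUM(f.InvestmentAmount) / p.TargetAmount AS PercentOfTarget
    FROM FactInvestment f JOIN DimProject p USING (ProjectKey)
    GROUP BY f.ProjectKey
""").fetchone()
print(row)  # (2, 600.0, 0.6)
```

Because the calculations live in the query (or cube), the fact table keeps a clean one-row-per-investment grain.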
Hope this helps,
Mark.
Either calculate these additional measures in a reporting tool or create a set of aggregated fact tables on top of the base one. They will be less granular and will reference only a subset of dimensions.
Projects seem to be a good candidate. It will be an accumulating snapshot fact table that you can also use to track projects' life cycle.

How to build an efficient ItemBasedRecommender in Mahout?

I am building an item-based recommender system for 10 million users who rate categories, out of 20 possible categories (news categories like politics, sport, etc.).
I would like each one of them to be recommended at least one other category which they don't know (no rating).
I ran a GenericUserBasedRecommender and asked for recommendations for each user, but it is extremely slow: maybe 1000 users processed per minute.
My questions are:
1- Can I run this same GenericUserBasedRecommender on Hadoop, and would it really be faster? I have seen and run an ItemBasedRecommender from the command line on a cluster, but I would rather run a user-based one.
1.5- I saw many users not getting a single recommendation. What is the algorithm's criterion for deciding whether a user gets a recommendation? I thought it could be that the users who don't get recommendations are the ones who only gave a single rating, but I don't understand why.
2- Is there another, smarter way to deal with my problem? Maybe some clustering solution instead of recommendation? I don't exactly see how.
3- Finally, am I right in saying that the algorithms that have no command-line interface are not meant to be used with Hadoop?
Thank you for your answers.
Sometimes you won't get recommendations for certain items or users because there are few items over which they overlap. It could also be a case where the user data is "enough", but the user's behaviour/usage patterns are very unusual and/or disagree with popular trends in the data.
You could perhaps try a LogLikelihood- or Tanimoto-based ItemSimilarity.
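For intuition, the Tanimoto coefficient ignores rating values entirely and compares only the sets of things two users (or two items) co-occur with, which can help when most users have very few ratings. A minimal sketch (the category sets are invented):

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty histories share nothing
    return len(a & b) / len(a | b)

# Two users who rated {politics, sport} and {politics, tech}:
print(tanimoto({"politics", "sport"}, {"politics", "tech"}))  # 0.3333333333333333
```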
Another thing you could look into is a matrix-factorization-based model. You could use the ALSWRFactorizer to generate recommendations. This method decomposes the original user-item matrix into a user-feature matrix, an item-feature matrix, and a diagonal matrix, reduces the dimensionality, and then reconstructs the matrix closest to the original with the same (reduced) rank. You might lose some data with this method, but the missing values in the user-item matrix are imputed, and you get estimated preference/recommendation values.
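Mahout's ALSWRFactorizer uses alternating least squares, but the imputation idea itself can be sketched with a tiny SGD factorizer. This is an illustrative stand-in, not Mahout's algorithm; the data and hyperparameters below are invented:

```python
import random

def factorize(ratings, n_features=2, steps=2000, lr=0.01, reg=0.02):
    """Toy matrix factorization by SGD (not Mahout's ALS-WR).

    ratings: dict mapping (user, item) -> rating; missing pairs are unknown.
    Returns a predict(user, item) function that also covers missing pairs.
    """
    rng = random.Random(0)
    users = {u for u, _ in ratings}
    items = {i for _, i in ratings}
    U = {u: [rng.uniform(0, 0.1) for _ in range(n_features)] for u in users}
    V = {i: [rng.uniform(0, 0.1) for _ in range(n_features)] for i in items}
    for _ in range(steps):
        for (u, i), r in ratings.items():
            err = r - sum(a * b for a, b in zip(U[u], V[i]))
            for k in range(n_features):
                U[u][k] += lr * (err * V[i][k] - reg * U[u][k])
                V[i][k] += lr * (err * U[u][k] - reg * V[i][k])
    return lambda u, i: sum(a * b for a, b in zip(U[u], V[i]))

# Toy data: user 2 never rated "sport"; the model imputes an estimate.
ratings = {(1, "politics"): 5, (1, "sport"): 4,
           (2, "politics"): 5, (3, "sport"): 4, (3, "tech"): 2}
predict = factorize(ratings)
print(predict(2, "sport"))  # imputed estimate for the unrated pair
```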
If you have the features and not just implicit ratings, you could probably experiment with clustering techniques, perhaps start with Hierarchical Clustering.
I did not quite get your last question.

Detecting HTML table orientation based only on table data

Given an HTML table with none of its cells identified as <th> or "header" cells, I want to automatically detect whether the table is a "vertical" table or a "horizontal" table.
For example:
This is a Horizontal table:
and this is a vertical table:
Of course, keep in mind that the bold property, along with the shading and any other styling properties, will not be available at classification time.
I was thinking of approaching this by statistical means. I could hand-write a couple of features like "if the first row has numbers but the first column doesn't, it's probably a vertical table", give a score to each feature, and combine them to decide the class of the table orientation.
Is that how you would approach such a problem? I haven't used any statistics-based algorithm before and I am not sure what would be optimal for such a problem.
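The hand-written-feature idea in the question can be sketched as a simple comparison of "how numeric" the first row is versus the first column. The feature and the numeric test below are made-up examples, not a tuned model:

```python
def looks_numeric(cell):
    """Crude test for whether a cell holds a number ('3', '4.5', '1,200')."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False

def classify_orientation(rows):
    """rows: table as a list of rows, each a list of cell strings.

    If the first row is more numeric than the first column, the headers
    are probably in the first column -> a "vertical" table, and vice versa.
    """
    first_row = rows[0]
    first_col = [r[0] for r in rows]
    row_numeric = sum(looks_numeric(c) for c in first_row) / len(first_row)
    col_numeric = sum(looks_numeric(c) for c in first_col) / len(first_col)
    if row_numeric > col_numeric:
        return "vertical"
    if col_numeric > row_numeric:
        return "horizontal"
    return "unknown"

print(classify_orientation([["Age", "23", "35"], ["Score", "7", "9"]]))   # vertical
print(classify_orientation([["Age", "Score"], ["23", "7"], ["35", "9"]])) # horizontal
```

In practice you would add more such features (string-length variance per row/column, repeated values, etc.) and learn their weights from labelled examples rather than fixing them by hand.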
This is a bit of a confusing question. You are asking about an ML method, but it seems you have not created training/cross-validation/test sets yet. Without the data preprocessing step, any discussion of ML methods is useless.
If I'm right and you haven't created datasets yet, give us more info on the data (if you look at one example, how do you know the table is vertical or horizontal? how much data do you have? are you always sure whether a table is vertical/horizontal? ...).
If you have already created training/cross-validation/test sets, give us more details on what the training set looks like (what are the features? how many examples? do you need a white-box solution where you can see why an ML model gives you a particular result? ...).
How general is the domain for the tables? I know some web table schema identification algorithms use types, properties, and instance data from a general knowledge schema such as Freebase to attempt to identify the property associated with a column. You might try leveraging that knowledge in a classifier.
If you want to do this without any external information, you'll need a bunch of hand labelled horizontal and vertical examples.
You say "of course" the font information isn't available, but I wouldn't be so quick to dismiss it, since it's potentially a source of very useful information. Are you sure you can't get your data from a little further back in the pipeline, so that you have access to this info?

Designing a points based system similar to Stack Overflow in Ruby on Rails

I'm not trying to recreate Stack Overflow and I did look at similar questions but they don't have many answers.
I'm interested in how to design a rails app, particularly the models and their associations, in order to capture various different kinds of actions and their points amount. Additionally these points decay over time and there are possible modifiers in the form of other actions or other data I'm tracking.
For example if I were designing Stack Overflow (which again I'm not) it would go something like the following.
Creating a question = 5 points
Answering a question = 10 points
The selected correct answer is a x2 modifier on the points for answering a question.
From a design perspective it seems to me like I need 3 models for the key parts.
The action model is polymorphic so it can belong to questions, answers, or whatever. The kind of association is stored in the type field. It also contains a points field that is calculated at creation time by a lookup in the points model I will discuss next. It should also update a total points on the user model, which I won't discuss here.
The points model is a lookup table where actions go to figure out their points. It uses the action's type as a key. It also stores the numeric amount for the points and a field for their decay.
The modifier model is the one I'm not sure what to do with. I think it should probably be a lookup table too, like points, keyed on the action's type field. Additionally it needs some sort of conditional for when it should be applied, and I'm not sure how to store a conditional statement. It also needs to store how the points are modified, for example x2, +5, -10, /100, etc. The other problem is how the modifier gets applied after the action has already happened; in my example that would be when a question is selected as answered, by which time the points were already set. The only way I can think of doing it is to have an after_save on every model that could be a modifier, which checks the modifier table and applies any matches. That seems wrong to me somehow, though.
There are other problems too like how to handle the decay. I guess I need a cron job that just recalculates everyone's points but that seems like it doesn't scale well.
I'm not sure if I'm over thinking this or what but I'd like some feedback.
I tend to prefer a log-aggregate-snapshot approach, where you log discrete events and then periodically aggregate changes and store those in a separate table. This would allow you to handle something like decay as an insert job rather than an update job. Depending on how many votes there are, you could even aggregate them over time and just roll forward from a specific point (though there probably aren't enough per question or answer for this to be a concern); but given that you may have other things to track, like a user's total points, that may be a good thing to snapshot.
I think you need to figure out how you are going to handle decay before you address it in an aggregate snapshot table, however.
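One way to sidestep a recalculate-everything cron job is to keep point events immutable and apply decay lazily at read time. A framework-agnostic sketch in Python (the exponential half-life model and all numbers are assumptions for illustration):

```python
HALF_LIFE_DAYS = 30.0  # made-up decay rate: points halve every 30 days

def decayed_value(points, awarded_at, now):
    """Exponential decay computed at read time; stored rows never change."""
    age_days = (now - awarded_at) / 86400.0
    return points * 0.5 ** (age_days / HALF_LIFE_DAYS)

def total_points(events, now):
    """events: (points, awarded_at_unix_ts) rows from an append-only log.

    Totals are derived on read, so no batch job has to rewrite every
    user's score; a snapshot table can cache these sums if reads get hot.
    """
    return sum(decayed_value(p, t, now) for p, t in events)

now = 1_700_000_000.0  # a fixed "current" timestamp for the example
events = [(10, now), (10, now - 30 * 86400)]  # one fresh, one 30 days old
print(total_points(events, now))  # 15.0
```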
Rails now has a gem to achieve this feature:
https://github.com/tute/merit

Fact Constellation Schema

I made a fact constellation schema with 2 fact tables and 16 dimension tables, 4 of which are common dimension tables. One of the dimension tables needs to be normalized because data from the data source can have a variable number of rows. Can I still call it a fact constellation schema if there is a branch in a dimension table?
I hope you understand what I am trying to say.
Cheers.
I know it's been a while; I'm just writing to help anyone else who needs information on this topic. Normally a fact constellation model is made up of star models, where no artificial or natural hierarchy should be present. But according to your needs you can add normalized (hierarchical) dimension tables. In this case your fact constellation is made up of snowflakes instead of stars.
You may still call it a constellation schema with a "sliced dimension table".
This term appears in an Oracle data warehousing book which I read around 7 years ago.
Regards,
Jit
