Data at hand: 1,000 questionnaires drawing on a finite database of questions, say 100 questions about name, gender, income, etc. Each questionnaire contains 10 to 30 questions from this question database. The wording of a given question is identical across questionnaires. Each of the 100 questions has a unique label (Q1 to Q100) in the database.
Task: creating a new questionnaire. Assuming I know which questions I need to ask on the new questionnaire (say 20 questions, including Q1, Q5, Q10, Q22, etc.), I need to know in what order to place them.
Machine learning question: how do I learn patterns from the existing data to help me order the 20 questions on my new questionnaire?
A simple but inaccurate solution would be to count the positions of each question label in the existing data. Say Q1 appeared 300 times in the existing data and 70% of the time it was the first question on the questionnaire; then I predict Q1 would have order = 1 on any new questionnaire.
Alternatively, I can compute the average order of each question in the existing data. Say Q1 has a mean order of 2.53 and Q10 has a mean order of 1.33. Then, when I create a new questionnaire that contains both Q1 and Q10, Q1 will be placed after Q10.
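For instance, a minimal sketch of the mean-order baseline (toy data and made-up variable names, just to show what I mean):

```python
from collections import defaultdict

# Toy stand-in for the existing data: each questionnaire is an ordered list
# of question labels (in practice there would be 1,000 of these).
questionnaires = [
    ["Q10", "Q1", "Q5", "Q22"],
    ["Q1", "Q10", "Q6", "Q5"],
    ["Q10", "Q22", "Q6", "Q1"],
]

# Mean position of each question across all questionnaires that contain it.
positions = defaultdict(list)
for questionnaire in questionnaires:
    for pos, label in enumerate(questionnaire, start=1):
        positions[label].append(pos)
mean_order = {label: sum(p) / len(p) for label, p in positions.items()}

# Order the questions of a new questionnaire by their observed mean position.
new_questions = ["Q1", "Q5", "Q10", "Q22"]
print(sorted(new_questions, key=lambda label: mean_order.get(label, float("inf"))))
```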
The above methods fail to capture relationships between questions. For instance, maybe Q5 always appears after Q6. I hope the algorithm can capture hidden patterns like this.
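For example, one way to make a pattern like "Q5 always appears after Q6" visible in the data would be to count pairwise precedences (again just a rough sketch on toy data, not a proposed solution):

```python
from collections import defaultdict
from itertools import combinations

questionnaires = [
    ["Q10", "Q1", "Q5", "Q22"],
    ["Q6", "Q1", "Q10", "Q5"],
    ["Q6", "Q22", "Q5", "Q1"],
]

# For each ordered pair (a, b), count how often a appears before b
# when both are on the same questionnaire.
precedes = defaultdict(int)
together = defaultdict(int)
for questionnaire in questionnaires:
    for i, j in combinations(range(len(questionnaire)), 2):
        a, b = questionnaire[i], questionnaire[j]
        precedes[(a, b)] += 1
        together[frozenset((a, b))] += 1

pair = frozenset(("Q6", "Q5"))
print(precedes[("Q6", "Q5")] / together[pair])  # 1.0 here: Q6 always precedes Q5
```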
I'm working on a quarterly report. The report should look something like this:
| col | Calculation | Source table |
| --- | --- | --- |
| Start_Balance | Sum at start of time period | Account_balance |
| Sell Transactions | Sum of all sell values between the two time periods | Transactions |
| Buy Transactions | Sum of all buy values between the two time periods | Transactions |
| End Balance | Sum at the end of time period | Account_balance |

So, for example:

| Calculation | Sum |
| --- | --- |
| Start_Balance | 1000 |
| Sell Transactions | 500 |
| Buy Transactions | 750 |
| End Balance | 1250 |
The problem here is that I'm working with a relational star schema: one of the facts is semi-additive and the other is additive, so they behave differently along the time dimension.
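To illustrate what I mean, here is a toy pandas version of the two behaviours (the table and column names are just for illustration, not my actual schema): the additive fact is summed over the whole quarter, while the semi-additive fact has to be taken at the period boundaries rather than summed over time.

```python
import pandas as pd

# Assumed toy versions of the two fact tables.
transactions = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-10", "2024-02-05", "2024-03-20"]),
    "type": ["sell", "buy", "buy"],
    "value": [500, 400, 350],
})
account_balance = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-31"]),
    "balance": [1000, 1150, 1250],
})

start, end = pd.Timestamp("2024-01-01"), pd.Timestamp("2024-03-31")
in_period = transactions["date"].between(start, end)

# Additive fact: sum over the whole quarter.
sell_total = transactions.loc[in_period & (transactions["type"] == "sell"), "value"].sum()
buy_total = transactions.loc[in_period & (transactions["type"] == "buy"), "value"].sum()

# Semi-additive fact: take the balance at the period boundaries, never a sum over time.
start_balance = account_balance.loc[account_balance["date"] == start, "balance"].iloc[0]
end_balance = account_balance.loc[account_balance["date"].le(end), "balance"].iloc[-1]

print(start_balance, sell_total, buy_total, end_balance)  # 1000 500 750 1250
```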
In my case I'm using Cognos Analytics, but I think this problem applies to any BI tool. What would be best practice to deal with this issue? I'm certain I can come up with some SQL query that combines these two tables into one table that the report reads from, but this doesn't seem like best practice, or is it? Another approach would be to create some measures in the BI tool; I'm not a big fan of that approach because it seems the least sustainable, and I'm unfamiliar with it.
For Cognos, you can stitch the tables.
The technique has to do with how Cognos aggregates.
Framework Manager joins are typically 1-to-n, describing the relationship.
A star schema has the fact table in the middle, representing the n side, with all of the outer tables describing/grouping the data, representing the 1 side.
Fact tables (quantitative data, the stuff you want to sum) should be on the many side of the relationship.
Descriptive tables (qualitative data, the stuff you want to describe or group by) should be on the 1 side (instead of the many).
To stitch, we have multiple tables that we want to treat as facts.
Take the common tables that you would use for grouping, like the period (there are probably others, like company or customer).
Connect each of the fact tables to the common tables (a.k.a. dimensions) like this:
Account_balance N to 1 Company
Account_balance N to 1 Period
Account_balance N to 1 Customer
Transactions N to 1 Company
Transactions N to 1 Period
Transactions N to 1 Customer
This will cause Cognos to perform a full outer join with a coalesce, allowing you to handle the fact tables even though they have different levels of granularity.
Remember that with an outer join you may have to handle nulls, and you may need to use a summary filter depending on your reporting needs.
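Conceptually, the stitched result behaves something like the following pandas sketch (the table and column names are my own assumptions, not what Cognos actually generates): each fact is aggregated to the common grain first, then full-outer-joined on the conformed key, which is effectively coalesced into a single column, and measures with no match come back as nulls.

```python
import pandas as pd

# Assumed toy fact tables, already carrying the conformed dimension key.
account_balance = pd.DataFrame({
    "period": ["2024Q1", "2024Q2"],
    "balance": [1000, 1250],
})
transactions = pd.DataFrame({
    "period": ["2024Q1", "2024Q3"],
    "amount": [500, 750],
})

# Aggregate each fact to the common grain (period) independently
# (for the balance this would be a semi-additive rule in practice).
bal = account_balance.groupby("period", as_index=False)["balance"].sum()
txn = transactions.groupby("period", as_index=False)["amount"].sum()

# Full outer join on the conformed key: the key ends up as one coalesced column,
# and measures with no matching row on the other side are null.
stitched = bal.merge(txn, on="period", how="outer")
print(stitched)
```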
You will want to include the common tables on your report, which might conflict with how you want the report to look.
An easy workaround is to add them to the layout and then set the Box Type property to None, so the SQL behaves the way you want and the report looks the way you want.
You'll probably need to set up determinants in the Framework Manager model. The following does a good job of explaining this:
https://www.ibm.com/docs/en/cognos-analytics/11.0.0?topic=concepts-multiple-fact-multiple-grain-queries
What's the best way to model 37 different attributes/"checkpoints" (that can be graded as Pass/Fail/Not Applicable) in a dimension for a star schema where each row in the fact table is a communication that is graded against the checkpoints in question?
TL;DR:
I've developed a star schema model where each row in the fact table is a single communication. These communications go through a series of graded "checks" (e.g. "Posted on Time", "Correct Email Subject", "XYZ content copied correctly", etc.) and each check can be graded as "Passed", "Missed", or "Not Applicable".
Different types of communications are graded on different sets of checks (e.g. one type of communication may only be graded on three checks, the rest being "Not Applicable", whereas another type of communication is graded on 19 checks). There are 37 total unique checks.
I've built a "CommunicationGrading" Type 2 slowly changing dimension to facilitate reporting of which "checks" communications are scoring most poorly. The dimension has 37 columns, one for each attribute, and each row is a permutation of the attributes and the score they can receive (pass/fail/NA). A new row is added when a new permutation becomes available - filling all possible permutations unfortunately returns millions of rows, whereas this way is < 100 rows, much less overhead. I've created 37 separate measures that aggregate the # of communications that have missed each of the 37 separate "checks".
I can quickly build a treemap in PBI, drag the 37 measures onto it, see the total # of communications that have missed each "check", and determine that X # of communications missed Y checkpoint this month. The problem comes when I want to use the visual as a slicer (e.g. selecting a check/tile on the treemap to see, in a table beneath the treemap, which individual communications missed that check) or when determining the top N "checks" for a given slice of data.
From what I can tell, the issue is because I'm using 37 different attributes and measures rather than one attribute and one measure (where I could drag the single measure into Values and the single attribute/column containing all checks into Group field in the treemap visual). The problem is, I'm stumped on how to best model this/the Grading dimension. Would it involve trimming the dimension down to just two columns, one for the checks and one for the checks' possible scores, then creating a bridge table to handle the M:M relationship? Other ideas?
Your dimension (implemented as a junk dimension, something to google) is one way of doing it, although if going down that road I'd break it into multiple dimensions of related checkpoints to massively reduce the permutations in each. It also isn't clear why this would need to be a Type 2: is there a history of this dimension you would need to track?
However, one approach I'd suggest exploring is a new fact row for each communication's score at each checkpoint: you could have one dimension for the grade result (passed, failed, not applicable) and one dimension for the checkpoint (which is just the checkpoint description). It would also allow you to count on that fact rather than having to maintain 37 different measures. You may wish to keep a fact at the communication level if there is some aggregate information to retain, but that would depend on your requirements.
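A rough sketch of what that grain looks like, and how a single count measure then covers all 37 checks (the table and column names below are made up for illustration):

```python
import pandas as pd

# One fact row per (communication, checkpoint), with the grade as a dimension value.
fact_checkpoint_grade = pd.DataFrame({
    "communication_id": [101, 101, 101, 102, 102],
    "checkpoint": ["Posted on Time", "Correct Email Subject", "XYZ content copied correctly",
                   "Posted on Time", "Correct Email Subject"],
    "grade": ["Passed", "Missed", "Not Applicable", "Missed", "Passed"],
})

# A single count measure replaces the 37 separate measures:
missed_by_checkpoint = (
    fact_checkpoint_grade[fact_checkpoint_grade["grade"] == "Missed"]
    .groupby("checkpoint")["communication_id"]
    .nunique()
    .sort_values(ascending=False)
)
print(missed_by_checkpoint)
```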
Consider a crowdfunding system whereby anyone in the world can invest in a project.
I have the normalized database design in place, and now I am trying to create a data warehouse (OLAP) from it.
I have come up with the following:
This has been denormalized, and I have chosen Investment as the fact table because I think the following examples could represent useful business needs:
Look at investments by project type
Investments by time periods i.e. total amount of investments made per week etc.
Having done some reading (The Data Warehouse Toolkit: Ralph Kimball) I feel like my schema isn't quite right. The book says to declare the grain (in my case each Investment) and then add facts within the context of the declared grain.
Some facts I have included do not seem to match the grain: TotalNumberOfInvestors, TotalAmountInvestedInProject, PercentOfProjectTarget.
But I feel these could be useful as you could see what these amounts are at the time of that investment.
Do these facts seem appropriate? Finally, is the TotalNumberOfInvestors fact implicitly made with the reference to the Investor dimension?
I think "one row for each investment" is a good candidate grain.
The problem with your fact table design is that you include columns which should actually be calculations in your data application (OLAP cube).
TotalNumberOfInvestors can be calculated by taking the distinct count of investors.
TotalAmountInvestedInProject should be removed from the fact table because it is actually a calculation with assumptions. Try grouping by project and then taking the sum of InvestmentAmount, which is a more natural approach.
PercentOfProjectTarget is calculated by taking the sum of FactInvestment.InvestmentAmount divided by the sum of DimProject.TargetAmount. A constraint for making this calculation work is having at least one member of DimProject in your report.
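For illustration, here is the same idea outside the cube, using assumed pandas versions of FactInvestment and DimProject (the column names are guesses on my part):

```python
import pandas as pd

fact_investment = pd.DataFrame({
    "project_id": [1, 1, 2],
    "investor_id": [10, 11, 10],
    "InvestmentAmount": [500.0, 250.0, 1000.0],
})
dim_project = pd.DataFrame({
    "project_id": [1, 2],
    "TargetAmount": [2000.0, 5000.0],
})

# Derive the measures at query time instead of storing them in the fact table.
per_project = fact_investment.groupby("project_id").agg(
    TotalNumberOfInvestors=("investor_id", "nunique"),        # distinct count of investors
    TotalAmountInvestedInProject=("InvestmentAmount", "sum"),  # sum grouped by project
).reset_index()

# Percent of project target needs DimProject in the query as well.
per_project = per_project.merge(dim_project, on="project_id")
per_project["PercentOfProjectTarget"] = (
    per_project["TotalAmountInvestedInProject"] / per_project["TargetAmount"]
)
print(per_project)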
Hope this helps,
Mark.
Either calculate these additional measures in a reporting tool or create a set of aggregated fact tables on top of the base one. They will be less granular and will reference only a subset of dimensions.
Projects seem to be a good candidate. It will be an accumulating snapshot fact table that you can also use to track projects' life cycle.
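As a rough sketch (assumed column names again), a project-grain aggregate built on top of the base investment fact might look like the following; a true accumulating snapshot would also carry milestone dates that get updated over the project's life cycle.

```python
import pandas as pd

# Base fact: one row per investment.
fact_investment = pd.DataFrame({
    "project_id": [1, 1, 2],
    "investor_id": [10, 11, 10],
    "InvestmentAmount": [500.0, 250.0, 1000.0],
    "investment_date": pd.to_datetime(["2024-01-05", "2024-02-10", "2024-01-20"]),
})

# Aggregated (project-grain) fact: less granular, references only the project dimension.
fact_project_snapshot = fact_investment.groupby("project_id").agg(
    first_investment_date=("investment_date", "min"),
    latest_investment_date=("investment_date", "max"),
    total_invested=("InvestmentAmount", "sum"),
    number_of_investors=("investor_id", "nunique"),
).reset_index()
print(fact_project_snapshot)
```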
I have carried out an evaluation of a product using a Likert-scale questionnaire and imported the data into SPSS. I have my columns arranged as follows:
ID, Group, Q1, Q2, Q3, Q4
I have two different groups completing the questionnaire, with each person having a different numerical ID. Under the Q columns, I have the score given by that person (from 1 to 5) on the Likert scale.
In all there are over 300 responses.
I am running the analysis using 'Descriptive Statistics/Frequencies' from the menu bar and not getting the tables I am looking for. Basically, it is including all respondents together, whereas I would like it to compare the two groups in the tables.
How can I get descriptive statistics on questionnaire items by group using SPSS?
In addition, if you have any further tips as to what analysis I could perform on this type of data in SPSS I'd be most grateful. I'd like to show that there isn't a significant difference in opinions between the groups, and from looking at the data, it appears that this is the case.
One option
split file by group
run descriptive statistics as usual
See this SPSS FAQ item from UCLA on how to analyze data by categories.
The short answer to your question is that crosstabs Q1 to Q4 by group. will produce the table you want. Or, if you have the CTABLES package available, a more compact table will be produced by:
variable level group_id Q1 to Q4 (nominal).
ctables
/table Q1 + Q2 + Q3 + Q4 by group_id.
Either can be elaborated on to produce other statistics if wanted. It seems to me a chi-square test would be sufficient for your question.
As far as further analysis goes, it is a bit of an open-ended question that needs more focus to answer effectively. I frequently suggest visual exploration for such exploratory analysis, and hence I would suggest perusing the question Visualizing Likert responses using R or SPSS on this site for potential ideas about how to visualize the responses. Another motivating post may be How to visualize 3D contingency matrix?.
There are a ton of other questions related to analyzing Likert responses on this site, though, and it is difficult to give more specific advice without a more specific motivation for the analysis.
While the above answers all have their good points, I usually prefer this procedure (type the following into a syntax window and Run):
means q1 to q4 by group/stat anova.
This will give you group means, sample sizes, and standard deviations as well as tests of the difference in means between the groups, for each of the variables Q1 to Q4. Of course, the tests will only give you valid results to the extent that your data meet the standard assumptions of anova. Some may say that variables measured on an ordinal 1-5 scale are not suitable for anova, and in academic contexts this is often true, but in business contexts most people are willing to sacrifice some rigor for the sake of convenience. It's much more convenient to compare 4x2=8 means than it is to compare the distributions of 4x5x2=40 categories of responses.
This can easily be done by using the "Crosstabs" function in SPSS for Windows:
Analyze --> Descriptive Statistics --> Crosstabs. Move the dependent variable(s) into the "Row(s)" box, then move the grouping variable into the "Column(s)" box, then click OK.
I have a problem that has effectively been reduced to a Travelling Salesman Problem with multiple salesmen. I have a list of cities to visit from an initial location, and I have to visit all cities with a limited number of salesmen.
I am trying to come up with a heuristic and was wondering if anyone could give a hand. For example, if I have 20 cities and 2 salesmen, the approach I thought of taking is a two-step one. First, divide the 20 cities up randomly into 10 cities per salesman and find the tour for each as if it were independent, for a few iterations. Afterwards, I'd either swap a city between salesmen or reassign a city to the other salesman and re-find the tours. Effectively, it'd be a TSP followed by a minimum-makespan problem. The problem with this is that it'd be too slow, and generating a good neighbourhood of swaps or reassignments is hard.
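For concreteness, here is a rough Python sketch of that two-step idea (everything here, from the Euclidean coordinates to the nearest-neighbour sub-tours, is an assumption made for illustration, not a claim about the best way to do either step):

```python
import math
import random

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def nearest_neighbour_tour(depot, cities):
    # Greedy tour from the depot through the given cities (step 1's per-salesman TSP).
    tour, remaining = [depot], list(cities)
    while remaining:
        nxt = min(remaining, key=lambda c: dist(tour[-1], c))
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

def tour_length(tour):
    return sum(dist(a, b) for a, b in zip(tour, tour[1:]))

def makespan(assignments, depot):
    # Makespan = length of the longest salesman's tour.
    return max(tour_length(nearest_neighbour_tour(depot, a)) for a in assignments)

random.seed(0)
depot = (0.0, 0.0)
cities = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(20)]

# Step 1: random split between the two salesmen, independent tours.
random.shuffle(cities)
assignments = [cities[:10], cities[10:]]
best = makespan(assignments, depot)

# Step 2: try moving single cities to the other salesman; keep any move that
# reduces the makespan, and repeat until no move helps.
improved = True
while improved:
    improved = False
    for src, dst in ((0, 1), (1, 0)):
        for city in list(assignments[src]):
            candidate = [list(a) for a in assignments]
            candidate[src].remove(city)
            candidate[dst].append(city)
            if makespan(candidate, depot) < best:
                assignments, best = candidate, makespan(candidate, depot)
                improved = True
                break
        if improved:
            break
print(best)
```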
Can anyone give advice on how I could improve the above? Thanks.