Multiple Object With Same Value in K-Means - machine-learning

I have a problem with my data.
This is my healthcare database:
(Name, Value1, Value2, Value3, Value4)
Jhon 10, 20, 30, 40
Jhon 9, 12, 21, 33
Noah 8, 22, 18, 10
Anna 9, 19, 29, 32
Clark 11, 4, 17, 20
In a healthcare database one person can be ill two, three, or more times. As the example shows, there are two Jhon records because he was ill twice.
I am using k-means to get two clusters (cluster 1: group 1, cluster 2: group 2) along with their members.
I want to get output like this:
Group 1: Jhon, Clark
Group 2: Noah, Anna, Jhon
As you can see, Jhon appears twice, so one person can end up in both group 1 and group 2. How can I fix this problem?

K-means works by iterating between a pair of steps. You basically alternate between:
1. assuming you know the mapping of instances to clusters, calculating the cluster centers;
2. assuming you know the cluster centers, assigning instances to clusters.
Thus if you have constraints, e.g., that all jhon (sic) should belong to the same cluster, you can incorporate this into step 2: you'll need to find the cluster for which the simultaneous assignment of all of them is the most likely.
See Constrained k-means clustering with background for details.
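As a rough illustration of the idea above, here is a minimal pure-Python sketch, not the algorithm from the referenced work. It assumes that "most likely" can be approximated by the smallest summed squared distance, so in the assignment step all records sharing a name are assigned together to the center that minimizes their total squared distance. The function names `squared_distance` and `grouped_kmeans`, the seed, and the iteration limit are hypothetical choices made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def grouped_kmeans(records, k=2, iterations=50, seed=0):
    """Cluster (name, vector) records so that each name gets exactly one cluster."""
    rng = random.Random(seed)

    # Collect all records belonging to the same person.
    by_name = {}
    for name, vector in records:
        by_name.setdefault(name, []).append(vector)

    # Assumption: initial centers are k randomly chosen records.
    centers = [list(vector) for _, vector in rng.sample(records, k)]
    assignment = {}

    for _ in range(iterations):
        # Assignment step: pick, per person, the center with the lowest
        # total squared distance over ALL of that person's records.
        new_assignment = {}
        for name, vectors in by_name.items():
            costs = [sum(squared_distance(v, c) for v in vectors) for c in centers]
            new_assignment[name] = costs.index(min(costs))
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment

        # Update step: each center becomes the mean of the records assigned to it.
        for cluster in range(k):
            members = [v for name, vectors in by_name.items()
                       if assignment[name] == cluster for v in vectors]
            if members:  # keep the old center if a cluster is empty
                centers[cluster] = [sum(col) / len(members) for col in zip(*members)]

    return assignment, centers


records = [
    ("Jhon", (10, 20, 30, 40)),
    ("Jhon", (9, 12, 21, 33)),
    ("Noah", (8, 22, 18, 10)),
    ("Anna", (9, 19, 29, 32)),
    ("Clark", (11, 4, 17, 20)),
]
assignment, centers = grouped_kmeans(records)
for cluster in range(2):
    print(f"Group {cluster + 1}:", [n for n, c in assignment.items() if c == cluster])
```

Because the assignment is keyed by name rather than by record, each person appears in exactly one group, whatever the data or the initial centers.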


Criterion range broadcasting along sum range in `SUMIF`

Can Google spreadsheets broadcast rows or columns in their functions, in particular in SUMIF, if different range dimensions are used for two arguments?
For example, I was hoping SUMIF(B1:F1, "Offices", B2:F4) would return 56+23+23, because the 1x5 range B1:F1 would be repeated in the row dimension to match the 3x5 range B2:F4 (this repeating is called dimension broadcasting). Unfortunately, this doesn't work: SUMIF just ignores the 2 rows it has no criteria for and returns 56.
  A      B            C        D     E          F
1 Month  Maintenance  Offices  Cars  Employees  Cars
2 Jan    23           56       43    23         56
3 Feb    12           23       67    43         21
4 March  44           23       45    56         45
Question: Can I specify a criterion in SUMIF where the column or row stays fixed, as is possible with conditional formatting? Put differently, how can one specify a criterion in SUMIF such that the criterion range is broadcast?
Why DSUM does not work: Notice that SUMIF(B1:F4,"Offices",B1:F4) could do the trick, except that it would fail here because B1:F4 isn't a proper database: there are two columns named "Cars". Moreover, DSUM requires the column headers to be adjacent to their data, while I would imagine I'd want to place the total sums between the header and the data. That being said, I also want to learn what is, in my opinion, a powerful concept, not just the DSUM function.
Conditional formatting: Google Spreadsheets does offer broadcasting in formatting. For example, if I format the range A1:F4 on the condition =A$1="Cars", then the two "Cars" columns would be formatted, even though they are columns D and F.
Comparison with numpy, which does broadcasting: numpy, which can be used as a spreadsheet library when programming in Python, does something called (dimension) broadcasting. Consider an array (read: spreadsheet) a of 1 row and 3 columns and another array b of 3 rows and 3 columns. We could then ask numpy to multiply both element-wise (read: cell-wise), and it will repeat the single row of a three times to match the dimension of b:
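import numpy

# a has 1 row and 3 columns
a = numpy.array([
    [0, 1, 2]
])
# b has 3 rows and 3 columns
b = numpy.array([
    [0, 1, 2],
    [3, 4, 5],
    [7, 8, 9],
])
# the single row of a is broadcast across the three rows of b
a * b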
Outcome, notice that the entire first column is multiplied with 0, the entire second with 1 and the entire third with 2:
numpy.array([
    [0, 1, 4],
    [0, 4, 10],
    [0, 8, 18],
])
=SUMPRODUCT(QUERY(TRANSPOSE(A1:F10), "where Col1 = 'Offices'"))
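For comparison, the broadcast SUMIF that the question asks for can be sketched in numpy using the sheet data from the question. This is only an illustration of the concept, not a spreadsheet formula; the helper name `sumif_broadcast` is hypothetical.

```python
import numpy as np

# B1:F1, the 1x5 criterion range (column headers)
headers = np.array(["Maintenance", "Offices", "Cars", "Employees", "Cars"])

# B2:F4, the 3x5 sum range
data = np.array([
    [23, 56, 43, 23, 56],  # Jan
    [12, 23, 67, 43, 21],  # Feb
    [44, 23, 45, 56, 45],  # March
])


def sumif_broadcast(criteria_row, criterion, values):
    """Sum every cell of `values` whose column header equals `criterion`."""
    mask = criteria_row == criterion  # boolean array of shape (5,)
    # np.where broadcasts the (5,) mask against the (3, 5) values,
    # i.e. the header row is repeated for each data row.
    return int(np.where(mask, values, 0).sum())


offices_total = sumif_broadcast(headers, "Offices", data)  # 56 + 23 + 23
cars_total = sumif_broadcast(headers, "Cars", data)        # both "Cars" columns
print(offices_total, cars_total)
```

The duplicate "Cars" header is not a problem here, because the mask simply selects both matching columns.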

How can I easily label my data in Power BI?

Question
Is there a fast, scalable way to replace number values by mapped text labels in my visualisations?
Background
I often find myself with questionnaire data of the following format:
ID   Sex  Age class  Answer to question
001  1    2          5
002  2    3          2
003  1    3          1
004  2    5          1
The Sex, Age class and Answer column values actually map to text labels. For the example of Sex:
ID Description
0 Unknown
1 Man
2 Woman
Similar mappings are possible for the other columns.
If I create visualisations of e.g. the distribution of sex in my respondent group I'll get a visual showing that 50% of my data has sex 1 and 50% of my data has sex 2.
The data itself often originates from an Excel or CSV file.
What I have tried
To make that visualisation meaningful to other people I:
1. create a second table containing the mapping between the value and label;
2. create a relationship between the source data and the mapping;
3. use the Description column of my mapping table as a category in my visualisations.
I have to do this for several columns in my dataset, which makes this a tedious process.
Ideal solution
A method that allows me to define, per column, a mapping between values and corresponding text labels. SPSS' VALUE LABELS command comes to mind.
You can simply create a calculated column on your table that defines how you want to map each ID value using a SWITCH function, and use that column in your visual. For example,
Sex Label =
SWITCH([Sex],
    1, "Man",
    2, "Woman",
    "Unknown"
)
(Here, the last argument is the else value, which is returned if none of the previous values match.)
If you want to do a whole bunch at a time, you can create a new table from your existing table using ADDCOLUMNS like this:
Test =
ADDCOLUMNS(
    Table1,
    "Sex Label", SWITCH([Sex], 1, "Man", 2, "Woman", "Unknown"),
    "Question 1 Label", SWITCH([Question 1], 1, "Yes", 2, "No", "Don't Know"),
    "Question 2 Label", SWITCH([Question 2], 1, "Yes", 2, "No", "Don't Know"),
    "Question 3 Label", SWITCH([Question 3], 1, "Yes", 2, "No", "Don't Know")
)
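The per-column mapping that the question describes as its ideal solution can also be sketched outside Power BI, for instance when preparing the Excel or CSV data before import. The following is a plain-Python illustration only. The names `VALUE_LABELS` and `label_rows` are hypothetical, the fallback label mirrors the else value of SWITCH, and only the Sex mapping is filled in because it is the only one given in the question.

```python
# Per-column value labels, in the spirit of SPSS' VALUE LABELS.
VALUE_LABELS = {
    "Sex": {0: "Unknown", 1: "Man", 2: "Woman"},
}


def label_rows(rows, value_labels, default="Unknown"):
    """Return copies of the rows with an extra '<column> Label' field per mapped column."""
    labelled = []
    for row in rows:
        new_row = dict(row)
        for column, mapping in value_labels.items():
            # dict.get plays the role of the SWITCH else value.
            new_row[column + " Label"] = mapping.get(row[column], default)
        labelled.append(new_row)
    return labelled


rows = [
    {"ID": "001", "Sex": 1, "Age class": 2, "Answer": 5},
    {"ID": "002", "Sex": 2, "Age class": 3, "Answer": 2},
    {"ID": "003", "Sex": 1, "Age class": 3, "Answer": 1},
    {"ID": "004", "Sex": 2, "Age class": 5, "Answer": 1},
]
labelled_rows = label_rows(rows, VALUE_LABELS)
print([r["Sex Label"] for r in labelled_rows])
```

Adding another mapped column then only requires another entry in the dictionary, not another relationship or lookup table.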

Google sheets nested if statement workaround

Link to sheet:
I'm trying to make a scorecard and leaderboard for my golf team, and I need to calculate how many holes a person has finished. The nested if statement in cell J2
=if(G11, 18,
=if(G10, 17,
=if(G9, 16,
=if(G8, 15,
=if(G7, 14,
=if(G6, 13,
=if(G5, 12,
=if(G4, 11,
=if(G3, 10,
=if(C11, 9,
=if(C10, 8,
=if(C9, 7,
=if(C8, 6,
=if(C7, 5,
=if(C6, 4,
=if(C5, 3,
=if(C4, 2,
=if(C3, 1, 0))))))))))))))))))
should accomplish what I need, but there are too many functions in the cell for it to work.
The current function checks the cell where the 18th hole score should be, and if it's there, the player is through 18 holes. If not, it goes to the first nested if and checks the 17th hole score cell, and so on.
I know I could split the function across three different cells and it would work fine, but I'm curious if anyone has any better ideas.
Thanks!
"I need to calculate how many holes a person has finished."
I believe what you need is the COUNT function.
=COUNT({G3:G11;C3:C11})
This gives the total number of holes a person has finished.
The formula below returns, for one set of nine holes, an array holding the hole number wherever a score has been entered (and 0 elsewhere):
=ArrayFormula(E3:E11*(G3:G11<>""))
The formula below returns the highest hole number among all the holes that have a score entered:
=MAX(ArrayFormula(E3:E11*(G3:G11<>"")),ArrayFormula(A3:A11*(C3:C11<>"")))
I broke it up for clarity, but the MAX formula is what I guess you need.
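The difference between the two approaches (counting scored holes versus finding the highest scored hole, which is what the nested IF computes) can be sketched in Python. The scorecard below is made up for illustration, and `None` stands for an empty cell.

```python
def holes_played(scores):
    """Analogue of COUNT: the number of holes that have a score."""
    return sum(1 for score in scores if score is not None)


def last_hole_reached(scores):
    """Analogue of the MAX formula: the highest hole number that has a score."""
    return max(
        (hole for hole, score in enumerate(scores, start=1) if score is not None),
        default=0,  # no scores entered yet
    )


# Hypothetical scorecard: holes 1-3 and 5 are scored, hole 4 was skipped.
scores = [4, 5, 3, None, 4] + [None] * 13

print(holes_played(scores))       # counts 4 scored holes
print(last_hole_reached(scores))  # hole 5 is the furthest scored hole
```

The two results only differ when a hole in the middle of the round has no score, so which one is appropriate depends on how skipped holes should be treated.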

How to create subscale scores for 4 subscales of the REI using SPSS?

I need to create subscale scores for 4 subscales of the REI: REI_Appear; REI_Hlth; REI_Mood; REI_Enjoy. The items comprising each subscale are as follows:
Appearance (9 items): 1, 5, 9, 13, 16, 17, 19, 21, 24
Health (8 items): 3, 6, 8, 15, 18, 20, 22, 23
Mood (4 items): 2, 7, 12, 14
Enjoyment (3 items): 4, 10, 25
For example, I have placed REI_Appear in the Target Variable box, but I'm unsure what to place in the Numeric Expression box for it to work.
There are several important issues:
- Do you want means or sums or some other composite?
- Do any items need reversing?
- How do you want to handle missing data?
Assuming you want means, there are no items needing reversing, and you want a participant to have at least 3 items to get a score, then you could use:
COMPUTE REI_Appear = MEAN.3(item1, item5, ..., item24).
EXECUTE.
where you replace item1 etc. with the relevant variable names.
I have an existing post dedicated to the topic of computing scale scores for psychological tests which discusses some of these issues further.
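To make the behaviour of mean.3 concrete, here is a small Python sketch of the same scoring rule under the assumptions stated above (means, no reversed items, at least 3 valid items per subscale). It is not SPSS syntax. The names `mean_n` and `score_participant` are hypothetical, `None` stands for a missing response, and the example responses are made up.

```python
# Item numbers per subscale, as listed in the question.
SUBSCALES = {
    "REI_Appear": [1, 5, 9, 13, 16, 17, 19, 21, 24],
    "REI_Hlth": [3, 6, 8, 15, 18, 20, 22, 23],
    "REI_Mood": [2, 7, 12, 14],
    "REI_Enjoy": [4, 10, 25],
}


def mean_n(values, min_valid):
    """Mean of the non-missing values, or None if fewer than min_valid are present."""
    valid = [v for v in values if v is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)


def score_participant(items, min_valid=3):
    """items maps item number -> response (or None); returns one score per subscale."""
    return {
        name: mean_n([items.get(number) for number in numbers], min_valid)
        for name, numbers in SUBSCALES.items()
    }


# Hypothetical participant: every item answered 4, except items 2 and 4 are missing.
items = {number: 4 for number in range(1, 26)}
items[2] = None
items[4] = None
scores = score_participant(items)
print(scores)
```

In this example the Enjoyment subscale has only two valid items left, so it gets no score, while the Mood subscale still has three and is scored normally. Note that with a threshold of 3, the three-item Enjoyment subscale tolerates no missing items at all.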

Data structure for Matrix lookup in ruby, rails

I want to use a matrix-type data structure for storing and looking up values.
A 2D array can be used for this, but I am looking for a better structure.
Requirements:
Matrix columns are fixed, but rows can increase.
For example, see the following structure:
Issue| col1, col2, col3, col4
1 | 0, 1, 0, 0
2 | 0, 1, 0, 1
3 | 1, 1, 0, 0
[Values in the structure are used as flag or status fields.]
Now I want to use this structure for lookups.
Say I want to know the value for issue 2, col1 (which is 0 in the above example).
What would be a better structure in Ruby for the above scenario?
Comments, please?
What about a hash?
h = { 1 => [0, 1, 0, 0],
      2 => [0, 1, 0, 1],
      3 => [1, 1, 0, 0] }
# Fetch the value for issue 2, col1 (array indices are 0-based)
puts h[2][0]
In case your data set is large and you want faster lookups and a more flexible design (what happens if you add a column later as your design evolves?), you might consider an in-memory database like supermodel. That way, you can avoid reinventing the wheel and you gain a lot of functionality and flexibility with very little effort.
