Gephi - generate graph using matrix data

Can you help me visualise an undirected graph?
I have about 500 strings that look like this:
;javascript;java;tapestry;d;design;jquery;css;html;com;air;testing;events;crm;soa;documentation;.a;email;iso;dynamic;mobile;this;project;resolution;s;automation;web;like;e-commerce;profile;commerce;out;jobs;inventory;operators;environment;system;include;integration;relationship;field;implementation;key;.profile;planning;knockout.js;sun;packaging;collaboration;report;public;virtual;communication;send;state;member;execution;solution;provider;members;continuous;writing;e;cuba;required;transactional;subject;manual;capacity;portfolio;.so;leader;take
;c;python;java;.a;basic;equivalent;cad;requirements;catia;.x;nx;self;communication;selected;base;summary
;javascript;c;python;java;rest;android;security;linux;sql;git;design;perl;css;html;svn;yaml;architecture;ios;json;api;ubuntu;pyramid;deployment;bash;documentation;configuration;frameworks;module;object;.a;multitasking;centos;hosting;project;fluent;administrator;monitoring;control;specifications;web;version;platform;admin;components;out;minimum;environment;system;include;using;key;falcon;communication;migrate;deadlines;ansible;back;cycle;production;red;analysis;administration;graphic;maintenance;autonomy;french;required;environments;hat;lead;arch;take
and what I would like to do with them is calculate and visualise the edges between the shared elements of the strings. For example, if javascript and python both appear in the first two strings, the edge between them in the final graph would get thicker with every string in which both occur.
What I've done so far is to parse out the strings and separate each one in a 1/0 matrix, with the string names as column names (in a csv file), but that did not seem to work because I don't know if the labels could be seen in Gephi as column names.
        javascript  java  tapestry
---------------------------------------
Row 1   1           0     1
Row 2   0           1     0
Row 3   1           1     1
So I transposed the matrix to get the strings all in a column, but those columns enumerated by number don't mean much to me.
name Col1 Col2 Col3 Col4
------- ------------------------
javascript 1 0 0 0 1
java 0 1 1 0 0
tapestry 1 0 1 0 1
I am thinking a matrix multiplied by its transpose might help, although I'm not sure how the math works for interpreting the result.

what I would like to do with them is calculate and visualise the edges between the shared elements of the strings.
Gephi just visualizes what is created manually, or selected for import.
Can you help me visualise an undirected graph?
Graph as .csv
Turning associations into a static graph (as Gephi-compatible .csv file):
Nodes
List unique names, save to .csv like:
id,label
0,"node #1"
1,"node #2"
2,"node #3"
Optionally, add additional columns as required.
Edges
Enumerate associations, add to weight for each occurrence, save to .csv, like:
id,source,target,label,weight
0,0,1,"edge #1",1
1,0,2,"edge #2",3
Alternatively, create new edge per association-occurrence and merge them after import.
Do not create edges for 0-weight values (no connection).
The source and target columns reference node ids.
The label column is optional.
The file may also contain a type column (which takes the value Directed or Undirected).
Optionally, convert to weight (0.0 - 1.0 as opposed to amount) by recalculating weight as:
weight = weight / highest_weight
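The steps above can be sketched in Python. This is a minimal illustration, not part of the original answer: the three sample strings, the function names build_graph and to_csv, and the choice to emit a type column are all assumptions made for the example. It counts how often each pair of terms appears in the same string, skips pairs that never co-occur, and rescales weights by the highest weight.

```python
import csv
import io
from collections import Counter
from itertools import combinations

# Hypothetical sample input: three short strings in the question's format.
raw_strings = [
    ";javascript;java;tapestry;css",
    ";c;python;java",
    ";javascript;c;python;java;css",
]

def build_graph(strings):
    """Return (nodes, edges): nodes maps label -> id, edges maps (id, id) -> count."""
    nodes = {}
    edges = Counter()
    for s in strings:
        # Split on ';', drop empty pieces, and de-duplicate within one string.
        terms = sorted({t for t in s.split(";") if t})
        for t in terms:
            nodes.setdefault(t, len(nodes))
        # Every unordered pair of terms in the same string adds 1 to that edge.
        for a, b in combinations(terms, 2):
            key = tuple(sorted((nodes[a], nodes[b])))
            edges[key] += 1
    return nodes, edges

def to_csv(header, rows):
    """Render rows as CSV text (written to memory here, not to disk)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

nodes, edges = build_graph(raw_strings)
highest = max(edges.values())
nodes_csv = to_csv(["id", "label"], [(i, label) for label, i in nodes.items()])
edges_csv = to_csv(
    ["source", "target", "type", "weight"],
    # weight = weight / highest_weight, as described above
    [(s, t, "Undirected", w / highest) for (s, t), w in sorted(edges.items())],
)
print(nodes_csv)
print(edges_csv)
```

The pair counts are the same numbers that appear off the diagonal when the transpose of the 1/0 matrix (rows as strings, columns as terms) is multiplied by the matrix itself, so the matrix idea from the question and this loop lead to the same edge weights.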

Related

Tableau question: How to link a reference table to a dynamic calculated field value (which is an integer)? I'm assigning P values

Since Tableau does not have a function for P-values (correct me if I'm wrong here), I created a spreadsheet with all possible sample sizes under two different alphas/significance levels and need to connect the appropriate p-value to a calculated field from the main database source (aggregate count of people). I assumed I could easily match numbers with a condition to bring back the p-value in a calculated field, yet I'm hitting a brick wall. The biggest issue seems to be that the field I want to join the P-value reference table to is an aggregated integer. Also, I do not have any extensions, and my end result needs to be an integer, not a graph.
Any secret tricks here?
Seems I cannot blend the reference table in nor join it to an aggregate?
Thanks!
I found a workaround by calculating the critical value for a two-tailed t-test in Tableau. However, I didn't figure out how to join based on an aggregated calculated field. Workaround: I used a conditional statement, copying and pasting about 100 critical values based on (sample size - 2), aka degrees of freedom, into a calculated field. To save time, use Excel to fill the conditions down to 120. Worked like a charm!
Here is the conditional logic for alpha = .2 (80%) in two tailed t-test (replace the ## line with about 117 rows):
IF [degrees of freedom] = 1 THEN 3.08
ELSEIF [degrees of freedom] = 2 THEN 1.89
ELSEIF [degrees of freedom] = 3 THEN 1.64
##ELSEIF [...calculate down to 120] = ... then ...
ELSEIF [degrees of freedom] > 120 THEN 1.28
END
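As an alternative to filling the rows down in Excel, the calculated-field text could be generated by a short script. The following is only a sketch: the function name tableau_case is hypothetical, and the dictionary holds just the three critical values quoted above, so a real run would need the remaining rows from a t-table.

```python
# Critical values keyed by degrees of freedom. Only the three rows quoted
# in the answer are listed; the rest would have to come from a t-table.
critical_values = {1: 3.08, 2: 1.89, 3: 1.64}
large_df_value = 1.28  # value the answer uses for large degrees of freedom

def tableau_case(values, large_value, field="[degrees of freedom]"):
    """Build the IF / ELSEIF / END text for a Tableau calculated field."""
    lines = []
    for i, (df, crit) in enumerate(sorted(values.items())):
        keyword = "IF" if i == 0 else "ELSEIF"
        lines.append(f"{keyword} {field} = {df} THEN {crit}")
    # Everything above the largest listed row falls back to the large-df value.
    lines.append(f"ELSEIF {field} > {max(values)} THEN {large_value}")
    lines.append("END")
    return "\n".join(lines)

calc_text = tableau_case(critical_values, large_df_value)
print(calc_text)
```

With only three rows listed, the final branch starts at 4 degrees of freedom, which is too early to be accurate; it moves to the right place once the dictionary is filled down to 120.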

SPSS set labels equal to value in column

As the title says.
What I'm trying to do is find a way to set the value labels of a column equal to the values in another column.
A B
1 Car
2 Bike
3 Van
1 Car
3 Van
Column A contains the numeric values. Column B contains the labels.
I want to tell SPSS to take the value 1 and assign it the label "Car" (and so on), as is classically done manually with:
VALUE LABELS A
1 "Car"
2 "Bike"
3 "Van".
Execute.
The syntax below will automatically create a new syntax that adds the value labels as you described.
Before starting, I'm recreating the sample data you posted to demonstrate on:
data list list/A (f1) B (a10).
begin data
1 "Car"
2 "Bike"
3 "Van"
1 "Car"
3 "Van"
end data.
dataset name orig.
Now we get to work:
* first we aggregate the data to get only one line for every value/label pair.
dataset declare agg.
aggregate out=agg /break A B /nn=n.
dataset activate agg.
* now we use the data to create a syntax file with the value label commands.
string cat (a50).
compute cat=concat('"', B, '"').
write out="yourpath\my labels syntax.sps" /"add value labels A ", A, cat, ".".
execute.
* getting back to the original data we can now execute the syntax.
dataset activate orig.
insert file="yourpath\my labels syntax.sps".
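For readers who prefer to see the idea outside SPSS, the same two steps (collapse to unique value/label pairs, then emit one labelling command per pair) can be sketched in Python. This is an illustration under assumptions, not a replacement for the syntax above: the function name label_syntax is hypothetical and the pairs are typed in by hand rather than read from a dataset.

```python
# Value/label pairs as they appear in the sample data (with repeats).
pairs = [(1, "Car"), (2, "Bike"), (3, "Van"), (1, "Car"), (3, "Van")]

def label_syntax(rows, variable="A"):
    """Return one 'add value labels' command per unique (value, label) pair."""
    # dict.fromkeys keeps first-seen order and drops repeated pairs,
    # playing the role of the AGGREGATE step.
    unique = dict.fromkeys(rows)
    return [
        f'add value labels {variable} {value} "{label}".'
        for value, label in unique
    ]

syntax_lines = label_syntax(pairs)
print("\n".join(syntax_lines))
```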

Is there a way to use same field as rows and columns in google sheets to count unique occurrence between columns?

Looking to convert
Task id   John   Jan   Juliet
1         1      1     0
2         1      0     1
3         0      1     1
4         0      0     1
5         0      1     1
6         1      1     0
7         0      1     0
8         1      0     0
9         0      1     1
10        1      1     0
To
          John   Jan   Juliet
John             3     1
Jan       3            3
Juliet    1      3
I have set up a new sheet ("Erik Help") in your sample spreadsheet.
In B1:
=SORT(FILTER(Sheet1!B1:1,Sheet1!B1:1<>""))
This simply fills the top row with your names list, sorted alphabetically.
In A2:
=TRANSPOSE(SORT(FILTER(Sheet1!B1:1,Sheet1!B1:1<>"")))
This fills A2 down with the same names list as above, just vertically.
In B2 is the main formula for the grid (which is then dragged over and down):
=ArrayFormula(IF( ($A2="") + (B$1="") + ($A2=B$1),, SUM(MMULT(IF((FILTER(Sheet1!$B$2:$L,Sheet1!$A$2:$A<>"")=1) * (Sheet1!$B$1:$L$1=$A2),1,0), SEQUENCE(COLUMNS(Sheet1!$B$1:$L$1),1,1,0)) * MMULT(IF((FILTER(Sheet1!$B$2:$L,Sheet1!$A$2:$A<>"")=1) * (Sheet1!$B$1:$L$1=B$1),1,0), SEQUENCE(COLUMNS(Sheet1!$B$1:$L$1),1,1,0)))))
The first ( ) + ( ) + ( ) tests three OR conditions. If any is true, the cell will be left blank. This is what allows the formula to be dragged all the way right and down without throwing errors and, in essence, "waiting" for new data from the first two formulas above that it can process.
The rest of the formula is too complex to warrant full explanation (e.g., how MMULT works in detail), this being a volunteer-run site. (Writing the formula took more time than I generally spend in a day on this or other forums.) But here's the gist.
Two columns, each produced by an MMULT (matrix multiplication), are multiplied together and then SUMmed. Inside the first MMULT, the IF builds a grid the same size as the Sheet1 grid, filled with 1 only where two conditions are met: there was already a 1 in that slot, and the name above it matches the name to the left of the current cell in the "Erik Help" grid. Every other slot is a zero. Multiplying that grid by a column of ones (the SEQUENCE) collapses it to one value per task row. The second MMULT does the same, except that the name above must match the name above the current cell in "Erik Help." The two columns are multiplied row by row, and if the product is a 1, we know that BOTH names had a 1 for that task. Once SUMmed, we get the count of shared projects for those two names.
As this formula is dragged, cell references not locked with a dollar sign will adjust, so that two different names will be compared by the two MMULT grids.
Because this solution requires comparing arrays with arrays with arrays, I don't currently see how a further array solution is possible, hence the need for the formulas to be dragged. That is, each of these formulas is already jam-packed with array processing.
Again, the formula is currently dragged all the way to Column Z and down to Row 200. However, it only references up to Column L (which is as far as your current names list goes). If your real world application has more names and thus carries over past Column L, the easiest way to change all of the formulas at once is this:
Go to the "Erik Help" sheet (which you can, of course, rename as you like).
Hit Ctrl-H to open the Find/Replace dialog box.
Enter $L in the FIND field and $? in the REPLACE field (where ? will be the new column to which you want the results to extend, e.g., $M or $P, etc.)
Choose "This sheet" from the "Search" drop-down.
Check the box next to "Also search within formulas."
Click the "Replace all" button.
If the data set shrinks or grows again, do the same steps, just changing the old furthest column reference for the new furthest column reference.
Here is a super-simple way of doing it which just changes the pair of columns selected in the countifs as the formula moves across and down by relative addressing:
=countifs(index($B$2:$D,0,row(A1)),1,index($B$2:$D,0,column(A1)),1)
pulled down and across.
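The logic behind the COUNTIFS version, counting the rows in which two chosen columns both hold a 1, can be checked with a small Python sketch. The data is copied from the question's table; the function name shared_counts is an assumption made for the example.

```python
from itertools import combinations

names = ["John", "Jan", "Juliet"]
# One tuple per task, copied from the question's table (Task id omitted).
tasks = [
    (1, 1, 0), (1, 0, 1), (0, 1, 1), (0, 0, 1), (0, 1, 1),
    (1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 0),
]

def shared_counts(names, rows):
    """Count, for each pair of names, the rows where both columns are 1."""
    counts = {}
    for i, j in combinations(range(len(names)), 2):
        # A task is shared when both columns hold a 1 in the same row.
        shared = sum(1 for row in rows if row[i] == 1 and row[j] == 1)
        counts[(names[i], names[j])] = shared
    return counts

counts = shared_counts(names, tasks)
for (a, b), n in counts.items():
    print(a, b, n)
```

The three counts it produces match the target grid in the question (3 for John and Jan, 1 for John and Juliet, 3 for Jan and Juliet).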
Attempt at more general solution.
The question is tagged pivot-table. Although a pivot table approach seems useful, the data is in exactly the wrong format to achieve it. The task would be to transform the data from ones and zeroes to column numbers so
1 1 0 => 1 2
1 0 1 => 1 3
1 1 1 => 1 2, 1 3 and 2 3.
This can be achieved by generating pairs of numbers as follows and performing a lookup in the original data:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
The formulas to generate these sequences are
=ArrayFormula(quotient(mod(sequence(90,1,0),9),3)+1)
and
=ArrayFormula(mod(sequence(90,1,0),3)+1)
(9 because there are 3X3 pairs per row of data, 90 because there are 10 rows of data).
The following generates a lookup for each row of data
=ArrayFormula(quotient(sequence(90,1,0),9)+1)
Putting all this together and wrapping it in a pivot query gives
=ArrayFormula(query({vlookup(quotient(sequence(90,1,0),9)+2,{row(B2:D),B2:D},quotient(mod(sequence(90,1,0),9),3)+2,0)*(quotient(mod(sequence(90,1,0),9),3)+1),
vlookup(quotient(sequence(90,1,0),9)+2,{row(B2:D),B2:D},mod(sequence(90,1,0),3)+2,0)*(mod(sequence(90,1,0),3)+1)},
"select count(Col1) where Col1<>0 and Col2<>0 group by Col1 pivot Col2"))
The formula can be generalised to different numbers of rows and columns.
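The index arithmetic in that formula may be easier to follow as a loop. The Python sketch below mirrors the three sequences (data row, first column, second column) and the final grouped count; the variable names are assumptions, and the data is the question's table.

```python
from collections import Counter

# One tuple per task, copied from the question's table (Task id omitted).
rows = [
    (1, 1, 0), (1, 0, 1), (0, 1, 1), (0, 0, 1), (0, 1, 1),
    (1, 1, 0), (0, 1, 0), (1, 0, 0), (0, 1, 1), (1, 1, 0),
]
n_cols = 3
pairs_per_row = n_cols * n_cols  # 9 column pairs per row of data

pivot = Counter()
for k in range(len(rows) * pairs_per_row):       # sequence(90,1,0)
    r = k // pairs_per_row                       # quotient(k,9): data row
    first = (k % pairs_per_row) // n_cols + 1    # quotient(mod(k,9),3)+1
    second = k % n_cols + 1                      # mod(k,3)+1
    # Multiply the looked-up 0/1 by the column number, as the formula does.
    col1 = rows[r][first - 1] * first
    col2 = rows[r][second - 1] * second
    # "where Col1<>0 and Col2<>0 group by Col1 pivot Col2"
    if col1 != 0 and col2 != 0:
        pivot[(col1, col2)] += 1

for pair in sorted(pivot):
    print(pair, pivot[pair])
```

Unlike the target grid in the question, this count also fills the diagonal, where each column is paired with itself and the result is simply the number of 1s in that column.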

How to get the sum of a column up to a certain value?

I have a google sheet that I am using to try and calculate leveling and experience points. Column A has the level and Column B has the exp needed to reach the next level. i.e. To get to Level 3 you need 600 exp.
A B
1 200
2 400
3 600
...
99 19800
In cell I2 I have an integer for an amount of exp (e.g. 2000); in cell J2 I want to figure out what level someone would be at if they started from 0.
Put this in column J and drag down as required. Rounddown(I2,-2) rounds I2 down to the nearest 100. Index match finds a match in column B and returns the value in column A of the matched row.
=index(A2:A100,match(ROUNDDOWN(I2,-2),B2:B100,0))
Using a helper column (for example Z): put =sum(B$1:B1) in cell Z1 and drag down. This computes the total exp required to complete each level. In J2, use the formula
=vlookup(I2, {Z:Z, A:A}, 2) + 1
which looks up I2 in column Z and returns the level from column A for the nearest match that is less than or equal to the search key. It adds 1 to find the level that would be reached, because your table is offset by one: the entry against level N is about achieving level N+1.
You may want to put 0 0 on top of the table, to correctly handle the amounts under 200. Or treat them with a separate if condition.
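The helper-column idea (running totals plus an approximate lookup) can be sketched in Python as follows. It assumes, as this answer does, that column B holds the exp needed to go from each level to the next; the function name level_for is hypothetical.

```python
from bisect import bisect_right
from itertools import accumulate

# Column B from the question: 200, 400, ..., 19800 for levels 1..99.
exp_to_next = [200 * level for level in range(1, 100)]
# Helper column: running total of exp needed to complete each level.
cumulative = list(accumulate(exp_to_next))

def level_for(exp):
    """Return the level reached with `exp` points, starting from level 1."""
    # The number of running totals <= exp is the number of completed levels;
    # add 1 because the entry against level N is about reaching level N+1.
    return bisect_right(cumulative, exp) + 1

print(level_for(2000))
```

Because bisect_right returns 0 when the amount is below the first total, amounts under 200 come out as level 1 without an extra 0 0 row.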
Using algebra
In your specific scenario, the point amount required for level N can be computed as
200*(1+2+3+...+(N-1)) = 200*(N-1)*N/2 = 100*(N-1/2)^2 - 25
So, given x amount of points, we can find N directly with algebra:
N = floor(sqrt((x+25)/100)+1/2)
which means that the formula
=floor(sqrt((I2 + 25) / 100) + 1/2)
will have the desired effect in cell J2, without the need for an extra column and vlookup.
However, the second approach only works for these specific point values.
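One way to gain confidence in the closed form is to compare it with a plain summing loop. The sketch below does that in Python; both function names are assumptions, and the comparison only covers multiples of 50 up to 100000.

```python
import math

def level_closed_form(exp):
    # N = floor(sqrt((x + 25) / 100) + 1/2), from the derivation above.
    return math.floor(math.sqrt((exp + 25) / 100) + 0.5)

def level_by_summing(exp):
    # Reference: keep adding 200, 400, 600, ... while the total fits in exp.
    level, total = 1, 0
    while total + 200 * level <= exp:
        total += 200 * level
        level += 1
    return level

# Collect every tested amount where the two methods disagree.
mismatches = [
    x for x in range(0, 100001, 50)
    if level_closed_form(x) != level_by_summing(x)
]
print(mismatches)
```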

Compute subranks in spreadsheet column in combination with ArrayFormula (Google Sheets)

I'm trying to find the inverse rank within categories using an ArrayFormula. Let's suppose a sheet containing
A B C
---------- -----
1 0.14 2
1 0.26 3
1 0.12 1
2 0.62 2
2 0.43 1
2 0.99 3
Columns A:B are input data, with an unknown number of useful rows filled-in manually. A is the classifier categories, B is the actual measurements.
Column C is the inverse ranking of B values, grouped by A. This can be computed for a single cell, and copied to the rest, with e.g.:
=1+COUNTIFS($B$2:$B,"<" & $B2, $A$2:$A, "=" & $A2)
However, if I try to use ArrayFormula:
=ARRAYFORMULA(1+COUNTIFS($B$2:$B,"<" & $B2:$B, $A$2:$A, "=" & $A2:$A))
It only computes one row, instead of filling all the data range.
A solution using COUNT(FILTER(...)) instead of COUNTIFS fails likewise.
I want to avoid copy/pasting the formula since the rows may grow in the future and forgetting to copy again could cause obscure miscalculations. Hence I would be glad for help with a solution using ArrayFormula.
Thanks.
I don't see a solution with array formulas available in Sheets. Here is an array solution with a custom function, =inverserank(A:B). The function, given below, should be entered in Script Editor (Tools > Script Editor). See Custom Functions in Google Sheets.
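function inverserank(arr) {
  // Drop rows whose category (first column) is blank.
  arr = arr.filter(function(r) {
    return r[0] != "";
  });
  // For each row r1, start at 1 and add 1 for every row r2 in the same
  // category with a smaller value.
  return arr.map(function(r1) {
    return arr.reduce(function(rank, r2) {
      return rank + (r2[0] == r1[0] && r2[1] < r1[1] ? 1 : 0);
    }, 1);
  });
}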
Explanation: the double array of values in A:B is
filtered, to get rid of empty rows (where A entry is blank)
mapped, by the function that takes every row r1 and then
reduces the array, counting each row (r2) only if it has the same category and smaller value than r1. It returns the count plus 1, so the smallest element gets rank 1.
No tie-breaking is implemented: for example, if there are two smallest elements, they both get rank 1, and there is no rank 2; the next smallest element gets rank 3.
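The same filter, map and count logic can be tried outside Sheets. Here is a minimal Python sketch of it, run on the sample data from the question; the name inverse_rank and the tuple layout are assumptions for the example.

```python
def inverse_rank(rows):
    """rows: list of (category, value). Returns each row's rank within its category."""
    # Drop rows whose category is blank, like the filter step.
    rows = [r for r in rows if r[0] != ""]
    # Rank = 1 + number of rows in the same category with a smaller value.
    return [
        1 + sum(1 for other in rows if other[0] == row[0] and other[1] < row[1])
        for row in rows
    ]

sample = [(1, 0.14), (1, 0.26), (1, 0.12), (2, 0.62), (2, 0.43), (2, 0.99)]
ranks = inverse_rank(sample)
print(ranks)

# Ties share a rank and the next rank is skipped.
tied = inverse_rank([(1, 5), (1, 5), (1, 7)])
print(tied)
```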
Well this does give an answer, but I had to go through a fairly complicated manoeuvre to find it:
=ArrayFormula(iferror(VLOOKUP(row(A2:A),{sort({row(A2:A),A2:B},2,1,3,1),row(A2:A)},4,false)-rank(A2:A,A2:A,true),""))
So
Sort cols A and B with their row numbers.
Use a lookup to find where those sorted row numbers now are: their position gives the rank of that row in the original data plus 1 (3,4,2,6,5,7).
Return the new row number.
Subtract the rank obtained just by ranking on column A (1,1,1,4,4,4) to get the rank within each group.
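The sort-then-subtract idea can be sketched in Python to show why it works. This is an illustration under assumptions (the function name rank_by_sorting is hypothetical), not a translation of the formula itself.

```python
def rank_by_sorting(rows):
    """rows: list of (category, value). Returns each row's rank within its category."""
    # Order row indices by (category, value), like sorting on columns A then B.
    order = sorted(range(len(rows)), key=lambda i: rows[i])
    position = {row_index: pos for pos, row_index in enumerate(order, start=1)}
    # Rank of each row by category alone: 1 + rows with a smaller category.
    category_rank = [1 + sum(1 for r in rows if r[0] < row[0]) for row in rows]
    # Position in the full sort minus the category offset gives the subrank.
    return [position[i] - category_rank[i] + 1 for i in range(len(rows))]

sample = [(1, 0.14), (1, 0.26), (1, 0.12), (2, 0.62), (2, 0.43), (2, 0.99)]
subranks = rank_by_sorting(sample)
print(subranks)
```

In this sketch, rows that tie on both category and value receive consecutive ranks in their original order, which differs from a counting approach where tied rows share a rank.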
In the particular case where the classifiers (col A) are whole numbers and the measurements (col B) are fractions, you could just add the two columns and use rank:
=ArrayFormula(iferror(rank(A2:A+B2:B,if(A2:A<>"",A2:A+B2:B),true)-rank(A2:A,A2:A,true)+1,""))
My version of an array formula, it works when column A contains text:
=ARRAYFORMULA(RANK(ARRAY_CONSTRAIN(VLOOKUP(A1:A,{UNIQUE(FILTER(A1:A,A1:A<>"")),ROW(INDIRECT("a1:a"&COUNTUNIQUE(A1:A)))},2,)*1000+B1:B,COUNTA(A1:A),1),ARRAY_CONSTRAIN(VLOOKUP(A1:A,{UNIQUE(FILTER(A1:A,A1:A<>"")),ROW(INDIRECT("a1:a"&COUNTUNIQUE(A1:A)))},2,)*1000+B1:B,COUNTA(A1:A),1),1) - COUNTIF(A1:A,"<"&OFFSET(A1,,,COUNTA(A1:A))))

Resources