Twitter co-hashtagging with OpenRefine

I am using OpenRefine to format some Twitter metadata into an edge list to be read by Gephi.
It works easily if I want to study user-mention associations or user-hashtags associations.
Now I would like to study co-hashtagging, that is, how often hashtags co-occur in tweets.
Doing this in OpenRefine (which I do not know very well) is a bit trickier, and I need some help.
My data are in CSV, with two columns: the user name, and a comma-separated string of the hashtags used in the tweet.
To get user-hashtag edge lists with OpenRefine, I use "Split multi-valued cells" on the hashtags column and then "Fill down" on the user column (very easy).
I do not know how to get hashtag-hashtag edge lists. I can use "Split multi-valued cells" on the hashtags column to get a new row for every hashtag mentioned in a tweet. But how do I "fill" the rows so as to get all combinations of hashtag-hashtag co-occurrence?
Example:
Data:
User | Hashtags
Dario | Data mining, R, OpenRefine
Desired result:
Hashtag 1 | Hashtag 2
Data mining | R
Data mining | OpenRefine
R | OpenRefine

Also posted on the OpenRefine Google Group:
I think you could do this with a combination of forEach and forRange. Try the following transformation on the cell containing the comma delimited hashtags:
forEachIndex(value.split(","),i,v,forRange(i+1,value.split(",").length(),1,j,v.trim() + "," + value.split(",")[j].trim()).join("|")).join("|")
This should produce a pipe-delimited list of the unique combinations. You can then use "Split multi-valued cells" with | as the separator to get one pair per row.
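For comparison, here is a minimal Python sketch of the same pairing logic, in case you want to check the result outside OpenRefine. It is not part of the GREL answer; the function name hashtag_pairs is made up for this illustration, and it assumes the hashtags are comma-separated as in the question.

```python
from itertools import combinations

def hashtag_pairs(cell, sep=","):
    """Return every unordered pair of hashtags found in one delimited cell."""
    # Trim whitespace and drop empty fragments (e.g. from a trailing comma).
    tags = [t.strip() for t in cell.split(sep) if t.strip()]
    # combinations(..., 2) yields each pair once, in original order.
    return list(combinations(tags, 2))

pairs = hashtag_pairs("Data mining, R, OpenRefine")
for first, second in pairs:
    print(first, "|", second)
```

Each tuple corresponds to one row of the hashtag-hashtag edge list.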

Here is my suggestion.
Let's use your example:
User | Hashtags
Dario | Data mining, R, OpenRefine
1°/ Use the function "Split multi-valued cells" on the Hashtags column.
You should get something like:
User | Hashtags
Dario | Data mining
      | R
      | OpenRefine
2°/ Try this transformation on the Hashtags column:
if((row.record.cells["Hashtags"].value[-1])==value,value+","+(row.record.cells["Hashtags"].value[0]),value+","+(row.record.cells["Hashtags"].value[-1]))
3°/ Split your column into columns based on the "," separator.
It works for me.
Edit:
This solution generates a duplicate entry, which can easily be removed like this:
Join multi-valued cells, using a | separator (for example).
You get something like:
1.
Dario
Data mining,Prout|R,Prout|OpenRefine,Prout|Prout,Data mining
2.
Essai
Data mining,R|R,Data mining
Then split the cells into columns based on the | separator.
Finally, remove the first hashtag column.

Related

Unnest two columns in google sheet

I have a table like this one (basically data from a Google Form, with multiple-choice answers in columns A and B and non-multiple-choice data in column C). I need a separate row for each multiple-choice answer.
Column A | Column B | Email
A,B | XX,YY | 1#gmail.com
A,C | FF,DD | 2#gmail.com
I tried to un-nest the first column and keep the remaining columns.
I tried several approaches I found using FLATTEN and SPLIT with array formulas, but I don't really know where to start.
Any help or hint would be much appreciated!
You can use the SPLIT function on column A and then the INDEX function. For the table above, you can use:
=index(split(A2,","),1,1)
The SPLIT function separates the text at the given delimiter, returning an array with 1 row and 2 columns; the INDEX function then returns the element in the first row and first column of that array. To return the second element from column A, just change it to:
=index(split(A2,","),1,2)
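As a rough Python equivalent of the SPLIT plus INDEX idea (a sketch only; nth_choice is a hypothetical helper name, and returning an empty string for a missing position is a choice made here, whereas INDEX in Sheets would raise an error):

```python
def nth_choice(cell, n, sep=","):
    """Return the n-th (1-based) item of a delimited cell, or "" if absent."""
    parts = [p.strip() for p in cell.split(sep)]
    # Guard the index so a short cell does not raise IndexError.
    return parts[n - 1] if 1 <= n <= len(parts) else ""

first = nth_choice("A,B", 1)   # like INDEX(SPLIT(A2, ","), 1, 1)
second = nth_choice("A,B", 2)  # like INDEX(SPLIT(A2, ","), 1, 2)
```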
I think there's no easy solution for this. You're asking for as many combinations of elements as there are multiple-choice selections, and every function in Google Sheets has limits on how many elements it can produce. One very useful function here is REDUCE. With REDUCE, and with SEQUENCE running over the number of comma-separated elements as counted by COUNTA, you can build this formula:
=QUERY(REDUCE({"Col A","Col B","Email"},SEQUENCE(COUNTA(A2:A)),LAMBDA(z,c,{z;LAMBDA(ax,bx,
REDUCE({"","",""},SEQUENCE(ax),LAMBDA(w,a,
{w;
REDUCE({"","",""},SEQUENCE(bx),LAMBDA(y,b,
{y;INDEX(SPLIT(INDEX(A2:A,c),","),,a),INDEX(SPLIT(INDEX(B2:B,c),","),,b),INDEX(C2:C,c)}
))})))
(COUNTA(SPLIT(INDEX(A2:A,c),",")),COUNTA(SPLIT(INDEX(B2:B,c),",")))})),
"Where Col1 is not null",1)
Since I had to use an "initial value" in every REDUCE, I then used QUERY to filter out the empty values.
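To make the intended result easier to see, here is a small Python sketch of the same un-nesting: every combination of a Column A item and a Column B item, with the email repeated. This only illustrates the logic and is not a replacement for the Sheets formula; the function name unnest is hypothetical and the rows are the ones from the question.

```python
from itertools import product

def unnest(rows, sep=","):
    """Expand each (col_a, col_b, email) row into one row per A/B combination."""
    out = []
    for col_a, col_b, email in rows:
        a_items = [x.strip() for x in col_a.split(sep)]
        b_items = [x.strip() for x in col_b.split(sep)]
        # product() pairs every A item with every B item, A varying slowest.
        for a, b in product(a_items, b_items):
            out.append((a, b, email))
    return out

rows = [("A,B", "XX,YY", "1#gmail.com"), ("A,C", "FF,DD", "2#gmail.com")]
flat = unnest(rows)
for row in flat:
    print(row)
```

With two items in each multiple-choice cell, every input row expands to four output rows.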

Google sheet Query formula that use text from one column to match another column returning text from adjacent cell

I have a dropdown text column, Claims!B2:B, that is supposed to match Ref!A2:A and return the corresponding Ref!B2:B text.
I tried
=ArrayFormula(IF($B$2:$B="","", LOOKUP($B$2:$B,Ref!$A$2:$A,Ref!$B$2:$B)))
Some results are not consistent.
=QUERY(Ref!A2:B,"Select B where A = "&B2:B&"")
This results in an error.
=FILTER(Ref!B2:B,Ref!$A$2:$A=B2:B)
This gives wrong results and is not arrayed.
I would like to know the simplest array formula for this scenario and, if possible, how to correct my other attempts, as part of my learning process.
Sample data attached. sample supplier
Please use the following for your Category column:
=INDEX(IFNA(VLOOKUP(B2:B,Ref!A2:C,3,0)))
and the following for your GST Stats column:
=INDEX(IFNA(VLOOKUP(B2:B,Ref!A2:C,2,0)))
(As you can see, the only difference is the column index: 3 versus 2.)
Functions used:
INDEX
IFNA
VLOOKUP
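For readers who think more easily in code, the exact-match lookup with a blank fallback can be sketched in Python as below. This is an illustration under assumptions: lookup_column is a hypothetical name, and the reference rows are invented sample data, since the real sheet is not shown.

```python
def lookup_column(keys, ref_rows, col_index):
    """Exact-match lookup per key; col_index is 1-based, misses return ""."""
    table = {}
    for row in ref_rows:
        # Keep the first row seen for each key, as an exact-match VLOOKUP does.
        table.setdefault(row[0], row)
    return [table[k][col_index - 1] if k in table else "" for k in keys]

# Hypothetical stand-in for Ref!A2:C (supplier, GST value, category).
ref = [("Acme", "Registered", "Supplies"), ("Bolt", "Unregistered", "Services")]
claims = ["Bolt", "", "Acme", "Zed"]  # hypothetical stand-in for B2:B

categories = lookup_column(claims, ref, 3)
gst = lookup_column(claims, ref, 2)
```

Blank or unknown keys simply produce an empty string, which is the role IFNA plays in the formulas.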

Search element in comma separated list based on substring match in Google Sheets

I have a list of data containing comma-separated, department-wise orders in column B, laid out in a format like:
Order_Num-X1|Dept_Name-Y1,Order_Num-X2|Dept_Name-Y2
and so on...
See the table below:
Is it possible to split the data and distribute it into the corresponding department columns, as outlined in columns C, D and E?
I tried what was suggested in this post, but I got stuck filtering a separated list stored in a single cell.
Try
=IFERROR(ARRAYFORMULA(query(trim(split(flatten($A$2:$A&"|"&split($B$2:$B,",")),"|")),"select Col2 where Col3='"&C$1&"' and Col1='"&$A2&"' ",0)))
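The idea behind the formula is to split each cell on "," and then on "|", and to keep the order numbers whose department matches the column header. A minimal Python sketch of that parsing step follows; orders_by_department is a hypothetical name, and the input string is the format given in the question.

```python
def orders_by_department(cell, item_sep=",", field_sep="|"):
    """Map each department in a cell to the list of its order numbers."""
    result = {}
    for item in cell.split(item_sep):
        item = item.strip()
        if not item:
            continue  # skip empty fragments, e.g. from a trailing comma
        # partition() splits on the first "|" only: order on the left, dept on the right.
        order, _, dept = item.partition(field_sep)
        result.setdefault(dept.strip(), []).append(order.strip())
    return result

cell = "Order_Num-X1|Dept_Name-Y1,Order_Num-X2|Dept_Name-Y2"
by_dept = orders_by_department(cell)
print(by_dept)
```

Each department key then corresponds to one of the department columns in the sheet.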

Query + Transpose based on value in Column B if Column A contains certain text

I am currently working with Google Forms and want to rearrange the way the responses are displayed on the "Response Sheet". The only way I can think of doing this is by importing or moving the data to another sheet that would select and transpose certain columns if Column A contains a key value.
This is what I'm seeing as the input, and what I would like to see as the output if Column A contains certain text:
Input & Output
Thank you in advance for your help!
O.K.
1. I rewrite the headings, A2:E2.
2. I take the whole of the first five columns without headings, A3:E6.
3. I display the content of columns A, B, F, G, H for all the rows that have 'A1' in the first column.
4. I stack the tables built in points 2 and 3 and sort them by the first column.
My solution is here:
https://docs.google.com/spreadsheets/d/1n7Ppd8v75mb3qrnJz_Jh_b4HNaj4i56X9wRGnz0l6i8/copy
={A2:E2;
sort({A3:E6;
query(A3:H6,"select A,B,F,G,H where A ='A1'",0)})
}
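The same stacking can be sketched in Python to show what the formula is doing. This is only an illustration: stack_and_sort is a hypothetical name, the sample rows are invented (the real sheet is behind the link), and the sort is assumed to be on the first column only.

```python
def stack_and_sort(header, rows, key="A1"):
    """Header row, then columns A..E of all rows plus A,B,F,G,H of matching rows."""
    base = [tuple(r[:5]) for r in rows]  # columns A..E
    # Columns A, B, F, G, H for rows whose first column equals the key.
    extra = [(r[0], r[1], r[5], r[6], r[7]) for r in rows if r[0] == key]
    # Stable sort on the first column only, keeping ties in their stacked order.
    return [tuple(header)] + sorted(base + extra, key=lambda r: r[0])

header = ("h1", "h2", "h3", "h4", "h5")  # hypothetical headings
rows = [
    ("B2", "x", "c1", "d1", "e1", "f1", "g1", "h1"),
    ("A1", "y", "c2", "d2", "e2", "f2", "g2", "h2"),
]
table = stack_and_sort(header, rows)
```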

Techniques to accommodate new entries in google sheets

As you can see, I transpose codes into unique column headings so that debits and credits are analysed and summed. The sums are transposed in another sheet to create a summary profit/loss account. I need help with how to replicate the sum formula in column I so that it serves any expanded set of transposed unique codes, and with whether and how I should use ARRAYFORMULA for the individual cell output.
EDIT
Actual output looks like this:
My problem is how to automatically accommodate new entries/codes in the totals row and in the main body of cells. The data belongs to a residents' committee, so I can only show anonymous data as an image.
EDIT 2
Actual input is imported from bank records, then coded:
QUERY is pretty good for the SUM part.
Starting in column I, you can do:
=ArrayFormula(INDEX(QUERY(
0+OFFSET(I4,0,0,ROWS(F6:F),COUNTA(UNIQUE(F4:F))),
"select "&
JOIN(
",",
"sum(Col"&SEQUENCE(COUNTA(UNIQUE(F4:F)))&")"
)
),2))
The 0+ (or VALUE, which does the same thing here) makes blank data cells default to 0; otherwise the query fails. It also lets us refer to the columns by sequence number (Col1, Col2, and so on), which is what the second argument does: it builds a query string of the form select sum(Col1),sum(Col2),...,sum(ColN). Since QUERY returns a header row by default, we could relabel everything in the query statement, but that adds too much extra code, so the easier option is to use INDEX to select the row of sums.
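To see what the JOIN and SEQUENCE part produces, here is a short Python sketch that builds the same query string and computes per-column sums with blanks treated as 0. The function names are hypothetical and the grid is invented sample data.

```python
def build_sum_query(n_columns):
    """Build 'select sum(Col1),sum(Col2),...' for the given number of columns."""
    return "select " + ",".join(f"sum(Col{i})" for i in range(1, n_columns + 1))

def column_sums(grid):
    """Sum each column, treating blank cells ("" or None) as 0."""
    cleaned = [[0 if cell in ("", None) else cell for cell in row] for row in grid]
    # zip(*cleaned) transposes rows into columns.
    return [sum(col) for col in zip(*cleaned)]

query_text = build_sum_query(3)
totals = column_sums([[10, "", -5], ["", 20, None]])
print(query_text)
print(totals)
```

The number of sum(ColN) terms grows with the number of unique codes, which is how the totals row keeps up with new entries.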
The EQ part is fairly straightforward to turn into an array formula. Starting in I4:
=ArrayFormula(
(FILTER(F4:F,F4:F<>"")=FILTER(I2:2,I2:2<>""))*
IF(
Array_constrain(G4:G,COUNTA(FILTER(F4:F,F4:F<>"")),1),
G4:G,
-H4:H
)
)
The FILTERs just filter out the blank cells, and the Array_Constrain sizes the G column to the same size as the filtered F column.
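In Python terms, the array formula builds a grid where each row puts its signed amount under the matching code and 0 everywhere else. The sketch below assumes blank codes only appear at the end of the column; allocate and the sample values are hypothetical, and columns G and H are kept as neutral names because the thread does not say which one holds debits.

```python
def allocate(codes, headers, col_g, col_h):
    """One row per coded entry: signed amount under the matching header, else 0."""
    grid = []
    for code, g, h in zip(codes, col_g, col_h):
        if code == "":
            continue  # mirrors FILTER(F4:F, F4:F<>"")
        # Mirrors IF(G, G, -H): use G when it is non-blank and non-zero, else minus H.
        amount = g if g else -(h or 0)
        # Mirrors (code = header) * amount: TRUE keeps the amount, FALSE gives 0.
        grid.append([amount if code == head else 0 for head in headers])
    return grid

grid = allocate(
    codes=["RENT", "FEES", "RENT", ""],
    headers=["FEES", "RENT"],
    col_g=[100, "", 50, ""],
    col_h=["", 30, "", ""],
)
```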
