Arrange downloaded data in a more useful way in Google Sheets

We currently have a fixed-format report whose data we can only manipulate after download. Simplified, it looks like this:
raw report data extracted to google sheets
a b c
1 Start Date Time Adhering to Schedule (Hours) Time Not Adhering to Schedule (Hours)
2 Employee: A Supervisor: X
3 5/4/2022 7.65 1.35
4 5/5/2022 8.12 0.88
5 5/6/2022 6.95 2.05
6 5/9/2022 8.7 0.3
7 5/10/2022 7.45 1.55
8 5/11/2022 8.63 0.37
9 5/12/2022 8.08 0.92
10 5/13/2022 6.13 0.13
11 Totals: 61.71 7.55
12 Employee: B Supervisor: X
13 5/1/2022 3.8 0.27
14 5/2/2022 6.72 2.28
15 5/3/2022 6.1 2.9
16 5/4/2022 8.43 0.57
17 5/5/2022 5.85 0.53
18 5/10/2022 6.13 2.87
19 5/11/2022 0 1.5
20 5/12/2022 2 1.5
21 5/13/2022 1.75 1.75
22 Totals: 40.78 14.17
I would like some help in constructing a new sheet via formulas so that it rearranges the raw data as follows:
desired output
a b c d e
1 EMPLOYEE SUPERVISOR Start Date Time Adhering to Schedule (Hours) Time Not Adhering to Schedule (Hours)
2 A X 04/05/22 7.65 1.35
3 A X 05/05/22 8.12 0.88
4 A X 06/05/22 6.95 2.05
5 A X 09/05/22 8.70 0.30
6 A X 10/05/22 7.45 1.55
7 A X 11/05/22 8.63 0.37
8 A X 12/05/22 8.08 0.92
9 A X 13/05/22 6.13 0.13
10 B X 01/05/22 3.80 0.27
11 B X 02/05/22 6.72 2.28
12 B X 03/05/22 6.10 2.90
13 B X 04/05/22 8.43 0.57
14 B X 05/05/22 5.85 0.53
15 B X 10/05/22 6.13 2.87
16 B X 11/05/22 0.00 1.50
17 B X 12/05/22 2.00 1.50
18 B X 13/05/22 1.75 1.75
It probably needs some combination of QUERY(), ARRAYFORMULA(), TRANSPOSE() and/or INDEX(), but I can't quite figure it out and need some help getting started on the right track. The dates and the data for each employee are dynamic, so the formula that produces the desired result needs to adjust to that as well.
thanks!
edit: adding a sample trix for reference :) https://docs.google.com/spreadsheets/d/1m_FCGcnXvnEiMZ8X4K1eEsMljORWV4V1Yq_81vFnx4Y/edit?usp=sharing

Global solution
in E1
={ArrayFormula(if(A1:A="Totals:",,{
substitute(lookup(row(A1:A),row(A1:A)/if(ISNUMBER(A1:A),0,1),A1:A),"Employee: ",""),
substitute(lookup(row(A1:A),row(A1:A)/if(ISNUMBER(A1:A),0,1),C1:C),"Supervisor: ","")
})),Arrayformula(if(ISNUMBER(A1:A),{A1:A,B1:B,C1:C},))}
Alternatively, in 3 steps (3 array formulas),
try this in H1
=arrayformula(if(left(A1:A,6)="Totals",,if(left(A1:A,8)="Employee",{B1:B,D1:D,E1:E,E1:E,E1:E},{E1:E,E1:E,A1:A,B1:B,C1:C})))
then, back in F1, to fill every row with its employee and supervisor
=ArrayFormula({lookup(row(H:H),row(H:H)/if(H:H<>"",1,0),H:H),lookup(row(I:I),row(I:I)/if(I:I<>"",1,0),I:I)})
finally, if you want a more compact result, in M1
=query(F:L,"select F,G,J,K,L where J is not null",0)
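The fill-down logic behind these formulas can be sketched outside Sheets as well. The Python below is only an illustration, not part of the answer: `flatten_report` is a hypothetical helper, the sample rows are abbreviated from the question, and it assumes the employee label sits in column A and the supervisor label in column C, as the lookups above suggest.

```python
from datetime import date

def flatten_report(rows):
    """Carry the latest employee/supervisor header down onto each dated row."""
    flat = []
    employee = supervisor = ""
    for a, b, c in rows:
        if isinstance(a, str) and a.startswith("Employee: "):
            # Header row: remember who the following rows belong to.
            employee = a.replace("Employee: ", "")
            supervisor = str(c).replace("Supervisor: ", "")
        elif isinstance(a, date):
            # Data row: same role as the ISNUMBER(A1:A) test in the formula.
            flat.append((employee, supervisor, a, b, c))
        # Column titles and "Totals:" rows fall through and are dropped.
    return flat

# Abbreviated sample (assumed layout: A = label or date, B and C = hours).
raw = [
    ("Start Date", "Time Adhering to Schedule (Hours)", "Time Not Adhering to Schedule (Hours)"),
    ("Employee: A", "", "Supervisor: X"),
    (date(2022, 5, 4), 7.65, 1.35),
    (date(2022, 5, 5), 8.12, 0.88),
    ("Totals:", 15.77, 2.23),
    ("Employee: B", "", "Supervisor: X"),
    (date(2022, 5, 1), 3.8, 0.27),
    ("Totals:", 3.8, 0.27),
]

result = flatten_report(raw)
for row in result:
    print(row)
```

Each output row carries the employee and supervisor of the nearest header above it, which is what the lookup(row(...), row(...)/if(...)) construction does in the sheet.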

Related

Google Sheets: Convert Horizontal Transaction Data into Chronological Statement + Combining Columns of Data

On a sheet named, "Performance," I have data concerning stock trades in a row like so:
A B C D E F G H I J
1 TICKER TRADE OPEN DATE TRADE CLOSED DATE SHARES AVG BUY INVESTMENT AVG SALE PROCEEDS PROFIT/LOSS ROIC:
2 ABC 01/05/22 03/31/22 107 $14.22 -$1,521.54 $15.00 $1,605.00 $83.46 5.49%
3 BCA 01/05/22 03/31/22 344 $14.52 -$4,994.88 $15.00 $5,160.00 $165.12 3.31%
4 CAB 01/05/22 03/31/22 526 $12.55 -$6,601.30 $13.00 $6,838.00 $236.70 3.59%
... and so forth ...
Within the same workbook but on a separate sheet named, "Contributions/Withdrawals," I have a list of contributions and withdrawals like so:
A B
1 DATE AMOUNT
2 01/05/22 $700.00
3 02/05/22 $700.00
4 03/05/22 $400.00
5 03/15/22 -$7,000.00
... and so forth ...
I need to convert the first table of trade transactions into a vertical column format exactly like what is in the Contributions/Withdrawals table. (Note that each trade transaction actually represents two transactions, one for opening with its own date, and one for closing with its date.) Finally, I need to stack both tables of transactions in date order to make a combined chronological list of transactions so that I can run an XIRR formula on it.
The resulting table on a sheet named, "Cash Flows," needs to look like this:
A B
1 DATE AMOUNT
2 01/05/22 -$1,521.54
3 01/05/22 -$4,994.88
4 01/05/22 -$6,601.30
5 01/05/22 $700.00
6 02/05/22 $700.00
7 03/05/22 $700.00
8 03/10/22 $400.00
9 03/15/22 -$7,000.00
10 03/31/22 $1,605.00
11 03/31/22 $5,160.00
12 03/31/22 $6,838.00
Using the following in cell A2 and B2...
A2 =SORT({Performance!$B$2:$B;Performance!$C$2:$C;'Contributions/Withdrawals'!$A$2:$A})
B2 =SORT({Performance!$F$2:$F;Performance!$H$2:$H;'Contributions/Withdrawals'!$B$2:$B})
...almost gets me there, but the transactions are not lining up with the correct dates. Google Sheets is ordering the amounts from smallest to largest. What I end up with is this:
A B
1 DATE AMOUNT
2 01/05/22 -$7,000.00
3 01/05/22 -$6,602.72
4 01/05/22 -$6,602.39
5 01/05/22 -$6,601.30
6 01/05/22 -$6,596.40
7 01/05/22 -$6,587.10
8 01/05/22 -$4,994.88
9 01/05/22 -$3,315.26
10 01/05/22 -$3,284.91
11 01/05/22 -$1,521.54
12 02/05/22 $400.00
13 03/05/22 $700.00
14 03/10/22 $700.00
15 03/15/22 $700.00
16 03/31/22 $1,605.00
17 03/31/22 $3,249.00
18 03/31/22 $3,731.00
19 03/31/22 $5,160.00
20 03/31/22 $6,348.00
21 03/31/22 $6,532.00
22 03/31/22 $6,786.00
23 03/31/22 $6,838.00
Any help would be appreciated. Thanks!
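You are very close indeed! Sorting each column separately loses the pairing between dates and amounts. Instead, stack the dates into one column and the amounts into another, join the two into a single two-column range, and sort that range by its first column:
=SORT({{Performance!$B$2:$B;Performance!$C$2:$C;'Contributions/Withdrawals'!$A$2:$A},{Performance!$F$2:$F;Performance!$H$2:$H;'Contributions/Withdrawals'!$B$2:$B}},1,TRUE)
(If your locale uses a comma as the decimal separator, you may need to change the comma between the two inner arrays to a backslash.)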
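To see why the dates and amounts have to travel together, here is a small Python sketch of the same idea. It is an illustration under assumptions: `cash_flows` is a hypothetical helper, and the figures are the three trades and four contributions listed in the question.

```python
from datetime import date

# (open date, close date, investment, proceeds), taken from the Performance sample.
trades = [
    (date(2022, 1, 5), date(2022, 3, 31), -1521.54, 1605.00),
    (date(2022, 1, 5), date(2022, 3, 31), -4994.88, 5160.00),
    (date(2022, 1, 5), date(2022, 3, 31), -6601.30, 6838.00),
]
# (date, amount), taken from the Contributions/Withdrawals sample.
contributions = [
    (date(2022, 1, 5), 700.00),
    (date(2022, 2, 5), 700.00),
    (date(2022, 3, 5), 400.00),
    (date(2022, 3, 15), -7000.00),
]

def cash_flows(trades, contributions):
    """Build (date, amount) pairs first, then sort the pairs by date."""
    flows = []
    for opened, closed, investment, proceeds in trades:
        flows.append((opened, investment))  # opening leg of the trade
        flows.append((closed, proceeds))    # closing leg of the trade
    flows.extend(contributions)
    # Sorting whole pairs keeps every amount attached to its own date.
    return sorted(flows, key=lambda flow: flow[0])

flows = cash_flows(trades, contributions)
for day, amount in flows:
    print(day.isoformat(), amount)
```

Sorting two independent columns, as in the original A2/B2 formulas, is the equivalent of sorting the dates and the amounts as separate lists, which is why the amounts came out ordered from smallest to largest.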

Averaging a Data Series in a Google Sheet to a single entry per period regardless of the number of samples in the larger period?

I have a small data set of ~200 samples taken over twenty years, with two columns of data that sometimes have multiple entries for the same period (i.e. age or date). When I plot it, even though the data spans 20 years, the graph heavily reflects the number of samples in each period rather than the period itself. For example, there may be 2 or 3 samples during age 23, 1 for age 24, 20 for age 25, and 10 for age 35. The number of samples depended entirely on the need for additional data at the time, so there is simply no consistency in the sample rate.
How do I get a max or an average for each period (age), so that there is only one entry per period in the sheet (about one entry per year), without having to create a separate sheet full of separate queries and charting off of that?
What I have tried in Google Sheets (where my data is) is choosing "aggregate" on the chart's x-axis series (which is the age period). That flattens the graph a bit but doesn't reduce the series.
A read-only link to the spreadsheet is HERE for reference.
The data looks something like this:
3/27/2013 36.4247 2.5 29.3
4/10/2013 36.4630 1.8 42.8
4/15/2013 36.4767 2.2 33.9
5/2/2013 36.5233 2.2 33.9
5/21/2013 36.5753 1.91 39.9
5/29/2013 36.5973 1.94 39.2
7/29/2013 36.7644 1.98 38.3
10/25/2013 37.0055 1.7 45.6
2/28/2014 37.3507 1.85 50 41.3
6/1/2014 37.6055 1.98 38 38.1
12/1/2014 38.1068 37
6/1/2015 38.6055 2.18 34 33.9
12/11/2015 39.1342 3.03 23 23.1
12/14/2015 39.1425 3.18 22 21.9
12/15/2015 39.1452 3.44 20 20.0
12/17/2015 39.1507 3.61 19 18.9
12/21/2015 39.1616 3.62 19 18.8
12/23/2015 39.1671 3.32 21 20.8
12/25/2015 39.1726 3.08 23 22.7
12/28/2015 39.1808 3.12 22 22.4
12/29/2015 39.1836 2.97 24 23.7
12/30/2015 39.1863 3.57 19 19.1
12/31/2015 39.1890 3.37 20 20.5
1/1/2016 39.1918 3.37 20 20.5
1/3/2016 39.1973 2.65 27 27.0
1/4/2016 39.2000 2.76 26 25.8
try:
=QUERY(SORTN(SORT({YEAR($A$6:$A), B6:B}, 1, 0, 2, 0), 9^9, 2, 1, 1),
"where Col1 <> 1899")
demo spreadsheet
and build a chart from there
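For readers who want to check the intent of the formula, the following Python sketch reduces a series to one maximum per calendar year. It is an assumption-laden illustration: `max_per_year` is a hypothetical helper, and the sample pairs reuse a few dates and the first measurement column from the question rather than the real spreadsheet ranges.

```python
from datetime import date

# (sample date, measurement); a few rows borrowed from the question's data.
samples = [
    (date(2013, 3, 27), 2.5),
    (date(2013, 4, 10), 1.8),
    (date(2013, 4, 15), 2.2),
    (date(2014, 2, 28), 1.85),
    (date(2014, 6, 1), 1.98),
    (date(2014, 12, 1), None),   # blank cell in the sheet
    (date(2015, 6, 1), 2.18),
    (date(2015, 12, 11), 3.03),
]

def max_per_year(samples):
    """Return one (year, max value) pair per year, skipping blank values."""
    best = {}
    for day, value in samples:
        if value is None:
            continue
        if day.year not in best or value > best[day.year]:
            best[day.year] = value
    return sorted(best.items())

yearly = max_per_year(samples)
print(yearly)
```

Replacing the comparison with a running sum and count would give a yearly average instead of a maximum.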

Clustering to achieve heterogeneous groups

I want to group 100 users based on a categorical variable (which can be low, medium, or high). The group size should be 3. I want to get the maximal heterogeneity within groups, assuming that users are distributed equally. I wonder if I can use some clustering algorithm to group based on the dissimilarity? Any suggestions?
I don't believe you need a clustering algorithm to group the data based upon a categorical variable.
Based on your question, I think this should work.
# Code
from sklearn.model_selection import train_test_split
group1, group23 = train_test_split(data, test_size=2/3., stratify=data['lab'])
group2, group3 = train_test_split(group23, test_size=1/2., stratify=group23['lab'])
The stratify argument keeps the proportion of each category the same in every split, so each group ends up with an even mix of low, medium, and high users.
# Sample output
print(data)
val1 val2 lab
0 1 1 L
1 2 2 L
2 3 3 L
3 4 4 M
4 5 5 M
5 6 6 M
6 7 7 H
7 8 8 H
8 9 9 H
print(group1)
val1 val2 lab
4 5 5 M
1 2 2 L
6 7 7 H
print(group2)
val1 val2 lab
8 9 9 H
2 3 3 L
3 4 4 M
print(group3)
val1 val2 lab
0 1 1 L
7 8 8 H
5 6 6 M
train_test_split() Documentation
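If the goal is many groups of three users rather than three large groups, one possible alternative is to sort users by category and deal them out round-robin. This is a swapped-in technique, not the train_test_split approach above, and `heterogeneous_groups` is a hypothetical helper written only for illustration. It assumes the categories are roughly equally frequent, as the question states.

```python
def heterogeneous_groups(labels, group_size=3):
    """Deal users round-robin, in category order, into len(labels) // group_size groups.

    Returns a list of groups, each a list of indices into `labels`.
    Leftover users (when the count is not a multiple of group_size)
    are added to the first groups, which then have one extra member.
    """
    n_groups = max(1, len(labels) // group_size)
    # Users with the same category become neighbours in this ordering...
    order = sorted(range(len(labels)), key=lambda i: labels[i])
    groups = [[] for _ in range(n_groups)]
    # ...so dealing them out one by one spreads each category across groups.
    for position, index in enumerate(order):
        groups[position % n_groups].append(index)
    return groups

labels = ["L", "L", "L", "M", "M", "M", "H", "H", "H"]
groups = heterogeneous_groups(labels, group_size=3)
for members in groups:
    print([labels[i] for i in members])
```

With 100 users this yields 33 groups, one of which has four members.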

Fill blank variable with subsequent values

I have a dataset structured like below;
id contracthours13 contracthours14 contracthours13u contracthours14u
12 . 13 . 13
13 30 30 . .
14 . . 15 16
15 . 5 6 7
If contracthours13 is missing, I want the value in contracthours14 to move across. If that is missing too, I want contracthours13u to move across, and then contracthours14u if the previous three are all missing. I know this is fairly simple syntax, but I just can't get my head around how to do it without having to run simpler syntax three times. If anyone could help it would be greatly appreciated.
Edit: below is what I would like my dataset to look like afterwards.
id contracthours13
12 13
13 30
14 15
15 5
Look up VECTOR / LOOP examples.
DATA LIST FREE / ID CH13 CH14 CH13U CH14U.
BEGIN DATA.
1 -1 13 -1 -1
2 30 30 -1 -1
3 -1 -1 15 16
4 -1 5 6 7
END DATA.
DATASET NAME DSRaw.
RECODE ALL (-1=SYSMIS).
VECTOR V= CH14 TO CH14U.
LOOP #i = 1 TO 3 IF (NVALID(CH13)=0).
COMPUTE CH13=V(#i).
END LOOP IF NVALID(V(#i))=1.
LIST.
EXE.
**List**
ID CH13 CH14 CH13U CH14U
1.00 13.00 13.00 . .
2.00 30.00 30.00 . .
3.00 15.00 . 15.00 16.00
4.00 5.00 5.00 6.00 7.00
Number of cases read: 4 Number of cases listed: 4
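The same "first non-missing value wins" rule can be sketched in Python for comparison. This is only an illustration of the logic, not SPSS syntax: `first_valid` is a hypothetical helper, and `None` stands in for a system-missing value. The rows are the four cases from the question.

```python
def first_valid(*values):
    """Return the first value that is not missing (None), or None if all are missing."""
    for value in values:
        if value is not None:
            return value
    return None

# Priority order in which the columns should fill contracthours13.
columns = ["contracthours13", "contracthours14", "contracthours13u", "contracthours14u"]

rows = [
    {"id": 12, "contracthours13": None, "contracthours14": 13, "contracthours13u": None, "contracthours14u": 13},
    {"id": 13, "contracthours13": 30, "contracthours14": 30, "contracthours13u": None, "contracthours14u": None},
    {"id": 14, "contracthours13": None, "contracthours14": None, "contracthours13u": 15, "contracthours14u": 16},
    {"id": 15, "contracthours13": None, "contracthours14": 5, "contracthours13u": 6, "contracthours14u": 7},
]

filled = {row["id"]: first_valid(*(row[name] for name in columns)) for row in rows}
print(filled)
```

The loop stops at the first valid value, which mirrors the END LOOP IF condition in the syntax above.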

Using COUNTIFS on 3 different columns and then need to SUM a 4th column?

I have written the formula below, but I do not know how to change it so that it adds up the numbers in column AB (AB2:AB552). As it is, the formula counts the cells in that range that contain numbers, but I need the total of those numbers as my final result. Any help would be great.
=COUNTIFS(Cases!B2:B552,"1",Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")
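Assuming you don't actually need the intermediate counts, the SUMIFS function should give you the final result. Its first argument is the range to sum, followed by the same range/criterion pairs you used in COUNTIFS:
=SUMIFS(Cases!AB2:AB552,Cases!B2:B552,1,Cases!G2:G552,"c*",Cases!X2:X552,"No",Cases!AB2:AB552,">0")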
Testing this with some limited data:
Row B G X AB
2 2 a No 10
3 1 c No 24
4 2 c No 4
5 1 c No 0
6 1 a Yes 9
7 2 c No 12
8 2 c No 6
9 2 b No 0
10 1 b No 0
11 1 a No 10
12 2 c No 6
13 1 c No 20
14 1 c No 4
15 1 b Yes 22
16 1 b Yes 22
The formula above returned 48, the sum of AB3, AB13, and AB14, which were the only rows matching all four criteria.
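The result can be cross-checked with a short Python sketch that applies the same four conditions to the test rows. This is an illustration only: the tuples copy the test table, and the "starts with c" test stands in for the asker's "c*" wildcard.

```python
# (B, G, X, AB) for sheet rows 2 through 16 of the test data.
cases = [
    (2, "a", "No", 10),
    (1, "c", "No", 24),
    (2, "c", "No", 4),
    (1, "c", "No", 0),
    (1, "a", "Yes", 9),
    (2, "c", "No", 12),
    (2, "c", "No", 6),
    (2, "b", "No", 0),
    (1, "b", "No", 0),
    (1, "a", "No", 10),
    (2, "c", "No", 6),
    (1, "c", "No", 20),
    (1, "c", "No", 4),
    (1, "b", "Yes", 22),
    (1, "b", "Yes", 22),
]

def matches(b, g, x, ab):
    """All four criteria must hold, as in COUNTIFS/SUMIFS."""
    return b == 1 and g.lower().startswith("c") and x == "No" and ab > 0

count = sum(1 for row in cases if matches(*row))       # what COUNTIFS reports
total = sum(row[3] for row in cases if matches(*row))  # what SUMIFS reports
print(count, total)
```

The count of matching rows and the sum of their AB values are different numbers, which is the distinction between the two functions.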
