I just discovered that my source data is being updated from a third-party source which I can't change. As a result, my orders file actually contains the full order history, including updates of quantity, etc.
I am trying to create a sheet that pulls ONLY the summary values for the most recent version of each order. The example below is an actual extract from the data set, without all the extra data. As you can see, Bill's order was updated three times before it shipped.
I need to group on Order Number and return ONLY the last update, from 09/08/2021.
There are many rows (17,000 to be exact) covering approximately 8,000 orders, and about 10% of the orders are updated like this. Does anyone have any suggestions for grouping and reporting on the latest date?
A B C D E
Order Number | Name | Item | QTY | Updated
1001 | Bill | ABC | 10 | 30/07/2021
1001 | Bill | DEF | 5 | 30/07/2021
1001 | Bill | GHI | 5 | 30/07/2021
1001 | Bill | ABC | 10 | 07/08/2021
1001 | Bill | DEF | 5 | 07/08/2021
1001 | Bill | GHI | 7 | 07/08/2021
1001 | Bill | ABC | 2 | 09/08/2021
1001 | Bill | DEF | 4 | 09/08/2021
1001 | Bill | GHI | 2 | 09/08/2021
I want to pull back a query that groups by order number for the last update and sums the QTY.
For this subset of data, the result should look like this:
1001 | Bill | 8 | 09/08/2021
=query(Orders!A1:E,"Select A, B, Sum(D), E group by A, B, E Where E = date ‘”&Text(Max(Orders!E:E),"YYY-MM-DD”)&”‘”,1)
I am getting an error. Any idea? Thanks
try:
=QUERY(Orders!A1:E,
"select A,B,sum(D),max(E)
where E = date '"&TEXT(MAX(Orders!E:E), "YYYY-MM-DD")&"'
group by A,B", 1)
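Note that MAX(Orders!E:E) is the latest date in the whole sheet, so this works when every order you report on shares that same last-update date. If each order can have its own latest date, one possible sketch (untested; assumes the same column layout with a header row) first keeps only each order's most recent rows, then aggregates:
=ARRAYFORMULA(QUERY(
   FILTER(Orders!A2:E,
          IFERROR(Orders!E2:E = VLOOKUP(Orders!A2:A,
                  QUERY(Orders!A2:E, "select A, max(E) group by A", 0),
                  2, FALSE), FALSE)),
   "select Col1, Col2, sum(Col4), max(Col5) group by Col1, Col2", 0))
The inner QUERY builds an order-number-to-latest-date lookup, the FILTER keeps only rows whose Updated value matches their order's latest date, and the outer QUERY sums the QTY per order.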
I use a Google Spreadsheet to keep track of my wine cellar, with a simple sheet listing number of bottles / name of the wine / where it's from:
+--------------+------------+-------------+
| # of bottles | Wine | Appellation |
+--------------+------------+-------------+
| 2 | Talbot | St Julien |
| 16 | Marbuzet | St Estephe |
| 1 | Terrebrune | Bandol |
| 10 | Madiniere | Cote Rotie |
+--------------+------------+-------------+
I'd like to get a roundup of the appellations I have the most of, sorted by number of bottles, e.g.:
+--------------+-------------+
| # of bottles | Appellation |
+--------------+-------------+
| 16 | St Estephe |
| 10 | Cote Rotie |
| ... | ... |
+--------------+-------------+
I know how to get the sorted list of appellations (=SORT(UNIQUE($C$2:$C$999)), with the wine origin in column C) and the matching number of bottles (=SUMIFS(A:A,C:C,<cell with appellation name>)), but I'm stuck at sorting by the number of bottles instead.
With QUERY
=QUERY(A:C,"select sum(A),C group by C order by sum(A) desc",1)
To rename the header:
=QUERY(A:C,"select sum(A),C group by C order by sum(A) desc label sum(A) '# of bottles'",1)
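When querying whole columns like this, empty rows can show up as a blank group; a where clause filters them out (same data layout assumed):
=QUERY(A:C,"select sum(A),C where C is not null group by C order by sum(A) desc label sum(A) '# of bottles'",1)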
With SORT and SUMIF
=ArrayFormula(SORT({SUMIF(C:C,UNIQUE(C2:C),A:A),UNIQUE(C2:C)},1,FALSE))
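Here the {...} array literal builds a two-column array, pairing each summed bottle count with its unique appellation, and SORT(..., 1, FALSE) sorts that array by its first column in descending order.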
I have a data source (coming from a Google Sheet) of engagements that has two columns:
Submitted Date
ID
Each row is a unique engagement.
I want to show a single Scorecard widget that has the total average # of engagements per month. For example, if:
2020-01 - 5 rows / engagements
2020-02 - 7 rows / engagements
2020-03 - 4 rows / engagements
Then the scorecard would show an average of 5.33 rows/engagements.
Here is some sample data:
| Submitted Date | ID |
|----------------|------|
| 2020-01-02 | ID01 |
| 2020-01-05 | ID02 |
| 2020-01-10 | ID03 |
| 2020-01-12 | ID04 |
| 2020-01-21 | ID05 |
| 2020-02-01 | ID06 |
| 2020-02-02 | ID07 |
| 2020-02-05 | ID08 |
| 2020-02-15 | ID09 |
| 2020-02-16 | ID10 |
| 2020-02-17 | ID11 |
| 2020-02-21 | ID12 |
| 2020-03-10 | ID13 |
| 2020-03-15 | ID14 |
| 2020-03-20 | ID15 |
| 2020-03-25 | ID16 |
I know I can pre-process this data in another sheet in Google to create a table that shows # of rows per month and then in Data Studio I can create an average of that. I am trying to avoid doing that.
In pseudocode, the formula below is COUNT(ID) / COUNT_DISTINCT(Year Month) (in this case, 16 / 3):
COUNT(ID) / COUNT_DISTINCT(TODATE(Submitted Date, "%Y%m"))
Since each row is a unique engagement, I would first extract the year-month from the Submitted Date column, then count the occurrences per month and take the average of those counts.
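As a quick sanity check of the same arithmetic back in the source sheet (a sketch; assumes the dates are in column A and the IDs in column B, with a header row):
=ARRAYFORMULA(COUNTA(B2:B) / COUNTA(UNIQUE(TEXT(FILTER(A2:A, A2:A<>""), "YYYY-MM"))))
With the sample data above this gives 16 / 3 ≈ 5.33.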
I need to design a star schema for order processing. The progress of an order looks like this:
Customer C places an order for item I with quantity 100
Factory F1 takes the order partially, with quantity 30
Factory F2 takes the order partially, with quantity 20
Buy 50 items from the market
F1 delivers 20 items
F1 delivers 7 items
F1 cancels the contract (we need to buy 3 more items from the market)
F2 delivers 20 items
Buy 3 items from the market
Complete the order
How can I design a fact table in this case, given that the number of steps is not fixed and the events are not all of the same type?
I'm sorry for my bad English.
The definition of an Accumulating Snapshot Fact table according to Kimball is:
summarizes the measurement events occurring at predictable steps between the beginning and the end of a process.
For this particular use case I would go with a transaction fact table, since the events (steps) are unpredictable; it is more like an event fact table, similar to logs or audits.
| order_key | date_key | full_datetime | entity_key (customer, factory, etc. varchar) | entity_type | state | quantity |
|-----------|----------|---------------------|----------------------------------------------|-------------|----------|----------|
| 1 | 20190602 | 2019-06-02 04:30:00 | C1 | customer | request | 100 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F1 | factory | receive | 30 |
| 1 | 20190602 | 2019-06-02 05:30:00 | F2 | factory | receive | 20 |
| 1 | 20190602 | 2019-06-02 05:40:00 | Company? | company | buy | 50 |
| 1 | 20190603 | 2019-06-03 02:40:00 | F1 | factory | deliver | 20 |
| 1 | 20190603 | 2019-06-03 04:40:00 | F1 | factory | deliver | 7 |
| 1 | 20190603 | 2019-06-03 06:40:00 | F1 | factory | cancel | 3 |
| 1 | 20190604 | 2019-06-04 07:40:00 | F2 | factory | deliver | 20 |
| 1 | 20190604 | 2019-06-04 07:40:00 | Company? | company | buy | 3 |
| 1 | 20190604 | 2019-06-04 09:40:00 | Company? | company | complete | 100 |
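A minimal DDL sketch of this fact table (table and column names are illustrative, not from the original post):
CREATE TABLE fact_order_event (
    order_key     INT         NOT NULL, -- FK to the order dimension
    date_key      INT         NOT NULL, -- FK to the date dimension (YYYYMMDD)
    full_datetime DATETIME    NOT NULL, -- exact event timestamp
    entity_key    VARCHAR(20) NOT NULL, -- customer, factory, company, ...
    entity_type   VARCHAR(20) NOT NULL,
    state         VARCHAR(20) NOT NULL, -- request / receive / buy / deliver / cancel / complete
    quantity      INT         NULL      -- NULL when the event carries no quantity
);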
I'm not sure about your reporting needs as they were not specified, but assuming you need to measure the lag/duration of unpredictable steps, you could PIVOT and use dynamic SQL to create the required view:
SQL Server dynamic PIVOT query?
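Alternatively, durations between consecutive events per order can be measured without pivoting, using a window function (a sketch, assuming SQL Server and the illustrative table above):
SELECT order_key, state, full_datetime,
       DATEDIFF(MINUTE,
                LAG(full_datetime) OVER (PARTITION BY order_key ORDER BY full_datetime),
                full_datetime) AS minutes_since_previous_event
FROM fact_order_event;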
Let me know if you came up with something different, as I'm interested in this particular use case. Good luck!
I have an InfluxDB measurement currently set up with the following "schema":
+----+-------------+-----------+
| ts | cost(field) | type(tag) |
+----+-------------+-----------+
| 1 | 10 | 'a' |
| 1 | 20 | 'b' |
| 2 | 12 | 'a' |
| 2 | 18 | 'b' |
| 2 | 22 | 'c' |
+----+-------------+-----------+
I am trying to write a query that groups my table by timestamp and gets the delta between the field values of two different tags. If I want the delta between tag 'a' and tag 'b', it should give me the following result (note that I ignore tag 'c'):
+----+-----------+------------+
| ts | type(tag) | delta_cost |
+----+-----------+------------+
| 1 | 'a' | 10 |
| 2 | 'b' | 6 |
+----+-----------+------------+
Is this something Influx can do, or am I using the wrong tool?
Just managed to answer my own question. While one of the obvious ways would be a self-join, Influx does not support joins anymore. We can, however, use nested selects in the following format:
SELECT MEAN(cost_a) - MEAN(cost_b) AS delta_cost
FROM
  (SELECT cost AS cost_a FROM tablename WHERE "type" = 'a'),
  (SELECT cost AS cost_b FROM tablename WHERE "type" = 'b')
GROUP BY time(60s)
Since I am getting my data every 60 seconds anyway, and I have a guarantee of exactly one point per tag per 60-second window, I can use GROUP BY and take the MEAN without any problems.
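(One caveat, depending on the InfluxDB version: a GROUP BY time() query generally also expects an explicit time range in the WHERE clause, e.g. AND time > now() - 1h; without one it may aggregate everything from epoch 0 up to now().)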
There are 50 exams to be written online by millions of students. One person may or may not write more than one exam, and a person can also write a single exam more than once (retries).
So which of the two solutions below is better for this case? I am also okay with a better solution than these two.
Option 1. Store each exam in its own table:
Subject 1
+----------------+---------+
| student id | Marks |
+----------------+---------+
| 1 | 85 |
| 2 | 32 |
| 2 | 60 |
+----------------+---------+
Subject 2
+----------------+---------+
| student id | Marks |
+----------------+---------+
| 1 | 85 |
| 2 | 32 |
| 2 | 60 |
+----------------+---------+
As above, each table contains a student id only if that particular person has taken that exam, with multiple occurrences of the student id if they have taken it more than once.
Option 2:
+----------------+---------+---------+
| student id | Subject | Marks |
+----------------+---------+---------+
| 1 | Subj1 | 85 |
| 2 | Subj1 | 32 |
| 2 | Subj1 | 60 |
| 1 | Subj2 | 80 |
| 3 | Subj2 | 90 |
+----------------+---------+---------+
with all the values in a single table.
Which is better from a performance and storage perspective?
I think the best design here is the following:
Table STUDENT with information about students
Table EXAM with information about exams
Table EXAM_TRY with references to the STUDENT and EXAM tables, plus fields DATE_OF_EXAM and RESULT_OF_EXAM
Two indexes on the foreign keys in EXAM_TRY
Depending on the situation, an index on the date field (for example, you would need it for planning examiners' workloads)
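A minimal DDL sketch of that design (column names and types are assumptions, not from the original answer):
CREATE TABLE student (
    student_id INT PRIMARY KEY,
    name       VARCHAR(100) NOT NULL
);
CREATE TABLE exam (
    exam_id INT PRIMARY KEY,
    subject VARCHAR(100) NOT NULL
);
CREATE TABLE exam_try (
    exam_try_id    INT PRIMARY KEY,
    student_id     INT  NOT NULL REFERENCES student(student_id),
    exam_id        INT  NOT NULL REFERENCES exam(exam_id),
    date_of_exam   DATE NOT NULL,
    result_of_exam INT  NOT NULL  -- marks; one row per attempt, so retries are just extra rows
);
-- The two foreign-key indexes
CREATE INDEX idx_exam_try_student ON exam_try(student_id);
CREATE INDEX idx_exam_try_exam ON exam_try(exam_id);
One row per attempt means both "has this student taken this exam" and the full retry history come from simple queries against EXAM_TRY.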