Google SQL Query with 3 criteria - google-sheets

The query below should give me a usage rate for the specified material code (cell B7) based on the following criteria:
It should match the cell value.
It should contain 'Taking out of Inventory'.
Finally, the data should fall within the date range of the last 4 weeks.
= QUERY(Inventory, "SELECT SUM(D) WHERE B = '"&B7&"' AND C = 'Taking out of inventory' AND A >= TODAY(-30) LABEL SUM(D) ''",-1)
The output should sum the values in the Quantity column for the past 4 weeks - the 4-week usage rate.
Timestamp | Inventory # | Inventory Type | Quantity | Sales Order (S/O) # | Purchase Order (P/O) #
(The Sales Order and Purchase Order columns are blank for every row in this sample, so they are omitted below.)
6/20/22 10:42:16 | AAR | Cycle Count (Full) (Physical count of quantities on hand) | 240
6/20/22 10:45:11 | AB1 | Cycle Count (Full) (Physical count of quantities on hand) | 30
6/20/22 10:46:22 | ABC | Cycle Count (Full) (Physical count of quantities on hand) | 3
6/20/22 10:47:52 | ABD | Cycle Count (Full) (Physical count of quantities on hand) | 80
6/20/22 10:48:49 | ABN | Cycle Count (Full) (Physical count of quantities on hand) | 21
6/20/22 10:50:14 | AAV | Cycle Count (Full) (Physical count of quantities on hand) | 400
6/20/22 10:50:44 | ABA | Cycle Count (Full) (Physical count of quantities on hand) | 3
6/20/22 11:02:00 | ABG | Cycle Count (Full) (Physical count of quantities on hand) | 0
6/20/22 11:03:03 | AAX | Cycle Count (Full) (Physical count of quantities on hand) | 85
6/20/22 11:05:35 | ABM | Cycle Count (Full) (Physical count of quantities on hand) | 0
6/20/22 11:06:08 | AB5 | Cycle Count (Full) (Physical count of quantities on hand) | 10
6/20/22 11:07:06 | AAS | Cycle Count (Full) (Physical count of quantities on hand) | 60
6/20/22 11:07:48 | AAT | Cycle Count (Full) (Physical count of quantities on hand) | 250
6/20/22 11:08:50 | ABQ | Cycle Count (Full) (Physical count of quantities on hand) | 20
6/20/22 11:09:37 | AB4 | Cycle Count (Full) (Physical count of quantities on hand) | 0
6/20/22 11:14:34 | AC3 | Cycle Count (Full) (Physical count of quantities on hand) | 80
6/20/22 11:15:05 | ABW | Cycle Count (Full) (Physical count of quantities on hand) | 80
6/20/22 11:18:29 | AAB | Cycle Count (Full) (Physical count of quantities on hand) | 448
6/20/22 11:19:37 | ABY | Cycle Count (Full) (Physical count of quantities on hand) | 0
6/20/22 11:30:35 | AC4 | Cycle Count (Full) (Physical count of quantities on hand) | 10
6/20/22 11:31:54 | AC9 | Cycle Count (Full) (Physical count of quantities on hand) | 22
6/20/22 11:32:30 | AC7 | Cycle Count (Full) (Physical count of quantities on hand) | 80
6/20/22 11:37:17 | AC2 | Cycle Count (Full) (Physical count of quantities on hand) | 1
6/20/22 11:40:40 | ABV | Cycle Count (Full) (Physical count of quantities on hand) | 10
6/20/22 11:56:06 | AB2 | Cycle Count (Full) (Physical count of quantities on hand) | 240
6/20/22 12:44:46 | ABP | Cycle Count (Full) (Physical count of quantities on hand) | 50
6/20/22 12:45:28 | ABR | Cycle Count (Full) (Physical count of quantities on hand) | 2
6/20/22 12:46:51 | AA3 | Cycle Count (Full) (Physical count of quantities on hand) | 240
6/20/22 12:47:33 | AA4 | Cycle Count (Full) (Physical count of quantities on hand) | 360
6/20/22 12:48:27 | AAE | Cycle Count (Full) (Physical count of quantities on hand) | 50
6/20/22 12:49:37 | AAJ | Cycle Count (Full) (Physical count of quantities on hand) | 100
6/20/22 12:50:06 | AA7 | Cycle Count (Full) (Physical count of quantities on hand) | 880
6/20/22 12:50:58 | AA8 | Cycle Count (Full) (Physical count of quantities on hand) | 485
6/20/22 12:52:04 | AAC | Cycle Count (Full) (Physical count of quantities on hand) | 350
6/20/22 12:52:35 | AAC | Adding to Inventory | 20
6/20/22 12:53:17 | AAI | Cycle Count (Full) (Physical count of quantities on hand) | 0
6/20/22 12:55:36 | AC1 | Cycle Count (Full) (Physical count of quantities on hand) | 20
6/20/22 13:01:48 | ABI | Cycle Count (Full) (Physical count of quantities on hand) | 8
6/20/22 13:02:14 | ABS | Cycle Count (Full) (Physical count of quantities on hand) | 26
6/20/22 13:04:25 | ABF | Cycle Count (Full) (Physical count of quantities on hand) | 50
6/20/22 13:05:52 | AB5 | Cycle Count (Full) (Physical count of quantities on hand) | 58
6/20/22 13:06:34 | ABU | Cycle Count (Full) (Physical count of quantities on hand) | 50
6/20/22 13:10:16 | ACD | Cycle Count (Full) (Physical count of quantities on hand) | 86
6/20/22 13:29:55 | AAS | Taking out of Inventory | 15
6/20/22 13:30:46 | ABN | Adding to Inventory | 8
6/20/22 13:33:44 | AA7 | Taking out of Inventory | 60
6/20/22 13:42:13 | ACE | Cycle Count (Full) (Physical count of quantities on hand) | 140
6/20/22 13:47:03 | ACK | Cycle Count (Full) (Physical count of quantities on hand) | 170
6/20/22 14:01:51 | ADA | Cycle Count (Full) (Physical count of quantities on hand) | 22
6/20/22 14:02:14 | AD7 | Cycle Count (Full) (Physical count of quantities on hand) | 0
6/20/22 14:04:23 | ACM | Cycle Count (Full) (Physical count of quantities on hand) | 85
6/20/22 14:10:52 | ACC | Cycle Count (Full) (Physical count of quantities on hand) | 172
6/20/22 14:23:25 | AD4 | Cycle Count (Full) (Physical count of quantities on hand) | 85
6/20/22 14:26:12 | AD5 | Cycle Count (Full) (Physical count of quantities on hand) | 130
6/20/22 14:30:25 | AD3 | Cycle Count (Full) (Physical count of quantities on hand) | 186
6/20/22 15:03:29 | AD9 | Cycle Count (Full) (Physical count of quantities on hand) | 63
6/20/22 15:04:04 | ADO | Cycle Count (Full) (Physical count of quantities on hand) | 9
6/20/22 15:06:03 | AD8 | Cycle Count (Full) (Physical count of quantities on hand) | 113
6/20/22 15:28:35 | AAC | Taking out of Inventory | 50
6/21/22 7:51:51 | AAV | Taking out of Inventory | 18
6/21/22 11:13:09 | AB4 | Cycle Count (Full) (Physical count of quantities on hand) | 100
6/21/22 11:13:45 | ABC | Cycle Count (Full) (Physical count of quantities on hand) | 100
6/21/22 11:14:05 | ABG | Cycle Count (Full) (Physical count of quantities on hand) | 50
6/21/22 11:14:39 | ABM | Cycle Count (Full) (Physical count of quantities on hand) | 125
6/21/22 11:15:03 | AB5 | Cycle Count (Full) (Physical count of quantities on hand) | 200
6/21/22 11:15:33 | ABN | Cycle Count (Full) (Physical count of quantities on hand) | 100
6/21/22 11:15:50 | ABQ | Cycle Count (Full) (Physical count of quantities on hand) | 350
6/21/22 11:16:15 | ABC | Taking out of Inventory | 17
6/21/22 11:16:35 | ABG | Taking out of Inventory | 11
6/21/22 11:41:12 | AB4 | Taking out of Inventory | 1
6/21/22 11:41:39 | AB5 | Taking out of Inventory | 1
6/21/22 11:42:11 | ABG | Taking out of Inventory | 3
6/21/22 12:32:50 | AAC | Taking out of Inventory | 2
6/21/22 15:45:04 | AAC | Taking out of Inventory | 20
6/22/22 8:51:11 | AD5 | Taking out of Inventory | 3
6/22/22 8:50:52 | AD4 | Taking out of Inventory | 3
6/22/22 13:19:04 | ABM | Taking out of Inventory | 125
6/23/22 10:57:06 | AB4 | Taking out of Inventory | 2
6/23/22 10:57:46 | ABC | Taking out of Inventory | 2
6/23/22 10:58:22 | AB2 | Taking out of Inventory | 1
6/23/22 10:59:17 | ABN | Taking out of Inventory | 4
6/23/22 10:59:52 | AB1 | Taking out of Inventory | 4
6/23/22 11:01:09 | ABQ | Taking out of Inventory | 350
6/24/22 8:38:06 | ABM | Cycle Count (Full) (Physical count of quantities on hand) | 2
6/27/22 10:54:35 | AB5 | Taking out of Inventory | 175
6/27/22 10:55:12 | AAB | Taking out of Inventory | 7
6/27/22 12:08:11 | ABV | Taking out of Inventory | 1
6/27/22 13:35:29 | ADA | Taking out of Inventory | 2
6/27/22 13:36:09 | ABN | Taking out of Inventory | 10
6/27/22 13:38:27 | AD0 | Taking out of Inventory | 2
6/27/22 13:38:59 | ADA | Taking out of Inventory | 12
6/27/22 15:32:06 | AAS | Taking out of Inventory | 15
6/28/22 13:14:38 | AB4 | Taking out of Inventory | 50
6/29/22 7:54:13 | ABC | Taking out of Inventory | 2
6/29/22 7:54:42 | ABP | Taking out of Inventory | 2
6/29/22 7:55:40 | AAR | Taking out of Inventory | 4
6/29/22 7:57:40 | AAX | Taking out of Inventory | 2
6/29/22 7:58:21 | AA8 | Taking out of Inventory | 1
7/6/22 8:12:47 | AB1 | Taking out of Inventory | 12
7/6/22 8:13:32 | AB2 | Taking out of Inventory | 4
7/11/22 8:36:41 | AAV | Cycle Count (Full) (Physical count of quantities on hand) | 320
7/11/22 8:37:42 | AAR | Cycle Count (Full) (Physical count of quantities on hand) | 240
7/11/22 9:05:43 | AB2 | Cycle Count (Full) (Physical count of quantities on hand) | 205
7/11/22 9:10:40 | AAC | Cycle Count (Full) (Physical count of quantities on hand) | 270
7/11/22 9:14:01 | AA8 | Cycle Count (Full) (Physical count of quantities on hand) | 445
7/11/22 9:15:01 | AA7 | Cycle Count (Full) (Physical count of quantities on hand) | 880
7/11/22 9:15:56 | AA3 | Cycle Count (Full) (Physical count of quantities on hand) | 240
7/11/22 9:19:40 | AA4 | Cycle Count (Full) (Physical count of quantities on hand) | 350
7/11/22 9:20:37 | AAE | Cycle Count (Full) (Physical count of quantities on hand) | 25
7/11/22 9:21:16 | AAJ | Cycle Count (Full) (Physical count of quantities on hand) | 120
7/11/22 9:22:06 | ABR | Cycle Count (Full) (Physical count of quantities on hand) | 2
7/11/22 9:24:20 | ABA | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/12/22 12:00:16 | ABF | Cycle Count (Full) (Physical count of quantities on hand) | 50
7/12/22 12:01:55 | ABS | Cycle Count (Full) (Physical count of quantities on hand) | 27
7/12/22 12:02:29 | ABI | Cycle Count (Full) (Physical count of quantities on hand) | 8
7/12/22 12:08:24 | ABU | Cycle Count (Full) (Physical count of quantities on hand) | 50
7/12/22 12:09:35 | ABT | Cycle Count (Full) (Physical count of quantities on hand) | 36
7/12/22 12:11:37 | ABP | Cycle Count (Full) (Physical count of quantities on hand) | 45
7/12/22 15:14:11 | AC5 | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/13/22 9:16:58 | ACK | Cycle Count (Full) (Physical count of quantities on hand) | 160
7/13/22 9:23:02 | ACE | Cycle Count (Full) (Physical count of quantities on hand) | 135
7/13/22 9:26:59 | ACD | Cycle Count (Full) (Physical count of quantities on hand) | 85
7/13/22 11:47:05 | AD7 | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/13/22 11:47:44 | ADA | Cycle Count (Full) (Physical count of quantities on hand) | 6
7/13/22 11:53:08 | ACM | Cycle Count (Full) (Physical count of quantities on hand) | 90
7/13/22 11:57:48 | ACC | Cycle Count (Full) (Physical count of quantities on hand) | 160
7/13/22 12:31:18 | AD4 | Cycle Count (Full) (Physical count of quantities on hand) | 87
7/13/22 12:35:38 | AD5 | Cycle Count (Full) (Physical count of quantities on hand) | 125
7/13/22 12:39:35 | AD3 | Cycle Count (Full) (Physical count of quantities on hand) | 175
7/13/22 13:00:26 | AD9 | Cycle Count (Full) (Physical count of quantities on hand) | 64
7/13/22 13:01:03 | AD0 | Cycle Count (Full) (Physical count of quantities on hand) | 7
7/13/22 13:03:36 | AD8 | Cycle Count (Full) (Physical count of quantities on hand) | 115
7/15/22 9:52:20 | AB4 | Adding to Inventory | 7
7/15/22 9:52:54 | ABG | Adding to Inventory | 7
7/15/22 9:53:22 | ABM | Adding to Inventory | 2
7/15/22 9:53:49 | ACK | Adding to Inventory | 8
7/18/22 6:07:57 | AAR | Cycle Count (Full) (Physical count of quantities on hand) | 250
7/18/22 6:08:35 | AB1 | Cycle Count (Full) (Physical count of quantities on hand) | 4
7/18/22 6:09:09 | ABC | Cycle Count (Full) (Physical count of quantities on hand) | 1
7/18/22 6:09:43 | ABD | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/18/22 6:10:14 | ABN | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/18/22 6:10:47 | ABA | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/18/22 6:11:40 | ABG | Cycle Count (Full) (Physical count of quantities on hand) | 7
7/18/22 6:12:08 | AAX | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/18/22 6:12:51 | ABM | Cycle Count (Full) (Physical count of quantities on hand) | 2
7/18/22 6:13:43 | AB5 | Cycle Count (Full) (Physical count of quantities on hand) | 25
7/18/22 6:14:25 | AAS | Cycle Count (Full) (Physical count of quantities on hand) | 21
7/18/22 6:14:53 | AAT | Cycle Count (Full) (Physical count of quantities on hand) | 240
7/18/22 6:15:26 | AB4 | Cycle Count (Full) (Physical count of quantities on hand) | 60
7/18/22 6:16:01 | ABQ | Cycle Count (Full) (Physical count of quantities on hand) | 20
7/18/22 6:20:58 | ABW | Cycle Count (Full) (Physical count of quantities on hand) | 80
7/18/22 6:21:48 | AC5 | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/18/22 6:22:07 | AC4 | Cycle Count (Full) (Physical count of quantities on hand) | 0
7/18/22 6:22:56 | ABV | Cycle Count (Full) (Physical count of quantities on hand) | 1
7/18/22 6:24:30 | AC7 | Taking out of Inventory | 78
7/18/22 6:32:03 | AC9 | Cycle Count (Full) (Physical count of quantities on hand) | 20
7/18/22 6:33:38 | AB2 | Cycle Count (Full) (Physical count of quantities on hand) | 180
7/18/22 6:34:15 | ABR | Cycle Count (Full) (Physical count of quantities on hand) | 2
7/18/22 6:34:51 | AA3 | Cycle Count (Full) (Physical count of quantities on hand) | 240
7/18/22 6:36:08 | AA4 | Cycle Count (Full) (Physical count of quantities on hand) | 300
7/18/22 6:36:32 | AAE | Cycle Count (Full) (Physical count of quantities on hand) | 20
7/18/22 6:37:38 | AAJ | Cycle Count (Full) (Physical count of quantities on hand) | 115
7/18/22 6:38:05 | AA7 | Cycle Count (Full) (Physical count of quantities on hand) | 880
7/18/22 6:38:34 | AA8 | Cycle Count (Full) (Physical count of quantities on hand) | 440
7/18/22 6:39:24 | AAC | Cycle Count (Full) (Physical count of quantities on hand) | 260

try:
=QUERY(Inventory,
 "SELECT SUM(D)
  WHERE B = '"&B7&"'
    AND C = 'Taking out of Inventory'
    AND A >= date '"&TEXT(TODAY()-30, "e-m-d")&"'
  LABEL SUM(D) ''", )
Two things to note: QUERY string comparisons are case-sensitive, so the literal has to match the sheet exactly ('Taking out of Inventory', with a capital I), and TODAY() is not available inside the query string, so the cutoff has to be passed in as a date literal built with TEXT.
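
For sanity-checking the result outside Sheets, here is a minimal pandas sketch of the same logic under the question's assumptions (match the code, match 'Taking out of Inventory' exactly, restrict to the last 28 days, sum Quantity). The usage_rate_4w helper is made up for this illustration; only a few rows from the sample data are included.

# Illustrative only: the QUERY's filter-and-sum logic, written in pandas.
import pandas as pd

def usage_rate_4w(df, code, today):
    """Sum Quantity for `code` where Inventory Type is 'Taking out of Inventory'
    and Timestamp falls within the last 28 days (4 weeks) of `today`."""
    today = pd.Timestamp(today)
    mask = (
        (df["Inventory #"] == code)
        & (df["Inventory Type"] == "Taking out of Inventory")   # exact, case-sensitive match
        & (df["Timestamp"] >= today - pd.Timedelta(days=28))
    )
    return df.loc[mask, "Quantity"].sum()

# A few rows for material ABC taken from the sample data above.
inventory = pd.DataFrame(
    {
        "Timestamp": pd.to_datetime(["2022-06-21 11:16:15", "2022-06-23 10:57:46", "2022-06-29 07:54:13"]),
        "Inventory #": ["ABC", "ABC", "ABC"],
        "Inventory Type": ["Taking out of Inventory"] * 3,
        "Quantity": [17, 2, 2],
    }
)
print(usage_rate_4w(inventory, "ABC", today="2022-07-18"))   # 21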

Related

Memory Leak inspection with Windbg - Increase in heap size doesn't reflect in Heap detail

I'm trying to pin down the location of a memory leak in a certain process via WinDbg, and have come across a strange problem.
Using WinDbg, I created two memory dump snapshots - one before and one after the leak - which showed an increase of around 20 MB (detected via Performance Monitor's private bytes counter). The output of !heap -s shows a similar size difference in one of the heaps before and after the leak:
Before:
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
03940000 08000002 48740 35312 48740 4372 553 9 5 2d LFH
External fragmentation 12 % (553 free blocks)
03fb0000 08001002 7216 3596 7216 1286 75 4 8 0 LFH
External fragmentation 35 % (75 free blocks)
05850000 08001002 60 16 60 5 2 1 0 0
...
After:
Heap Flags Reserv Commit Virt Free List UCR Virt Lock Fast
(k) (k) (k) (k) length blocks cont. heap
-----------------------------------------------------------------------------
03940000 08000002 64928 55120 64928 6232 1051 26 5 51 LFH
External fragmentation 11 % (1051 free blocks)
03fb0000 08001002 7216 3596 7216 1236 73 4 8 0 LFH
External fragmentation 34 % (73 free blocks)
05850000 08001002 60 16 60 5 2 1 0 0
...
See the first Heap (03940000) - there is a difference in committed KBs of around 55120 - 35312 = 19808 KB = 20.2 MB.
However, when I inspected that heap with !heap -stat -h 03940000, it displayed the following for both dump files:
size #blocks total ( %) (percent of total busy bytes)
3b32 1 - 3b32 (30.94)
1d34 1 - 1d34 (15.27)
880 1 - 880 (4.44)
558 1 - 558 (2.79)
220 1 - 220 (1.11)
200 2 - 400 (2.09)
158 1 - 158 (0.70)
140 2 - 280 (1.31)
...(rest of the lines show no difference)
size #blocks total ( %) (percent of total busy bytes)
3b32 1 - 3b32 (30.95)
1d34 1 - 1d34 (15.27)
880 1 - 880 (4.44)
558 1 - 558 (2.79)
220 1 - 220 (1.11)
200 2 - 400 (2.09)
158 1 - 158 (0.70)
140 2 - 280 (1.31)
...(rest of the lines show no difference)
As you can see, there is hardly any difference between the two, despite the aforementioned ~20 MB size difference.
Is there an explanation for that?
Note: I have also inspected the unmanaged memory using UMDH - there wasn't a noticeable size difference there either.
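
For anyone repeating this comparison by hand, a small hedged helper like the following can diff the two !heap -stat -h tables (the size, #blocks and total columns are hexadecimal) and print which allocation sizes actually grew. It only automates the manual comparison described above; it is not a WinDbg feature.

# Diff two "!heap -stat -h <heap>" outputs to see which allocation sizes grew.
# Lines look like: "    3b32 1 - 3b32  (30.94)" with hex size, #blocks and total.
import re

ROW = re.compile(r"^\s*([0-9a-f]+)\s+([0-9a-f]+)\s+-\s+([0-9a-f]+)", re.I)

def busy_bytes_by_size(dump_text):
    """Return {allocation size: total busy bytes} parsed from one !heap -stat -h dump."""
    totals = {}
    for line in dump_text.splitlines():
        m = ROW.match(line)
        if m:
            size, _blocks, total = (int(g, 16) for g in m.groups())
            totals[size] = total
    return totals

def print_growth(before_text, after_text):
    before, after = busy_bytes_by_size(before_text), busy_bytes_by_size(after_text)
    for size in sorted(set(before) | set(after)):
        delta = after.get(size, 0) - before.get(size, 0)
        if delta:
            print(f"size 0x{size:x}: {delta:+,} bytes")

# Usage: save the two captured tables to files, then:
# print_growth(open("before.txt").read(), open("after.txt").read())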

KDB+/q: how to implement an xbar aggregation after an xgroup aggregation on kdb table

I'm trying to run an xbar aggregation on trade data after an xgroup aggregation, however I can't seem to get it right:
I am trying to take a table of the following format (consisting of trades):
time side amount price exchange
------------------------------------------------
2019.08.22T12:01:04.389 sell 54 9953.5 exchange1
2019.08.22T12:01:05.034 sell 205 9953.5 exchange1
2019.08.22T12:01:05.754 sell 150 9953.5 exchange1
2019.08.22T12:01:06.375 sell 516 9953.5 exchange1
2019.08.22T12:01:07.044 sell 100 9953.5 exchange1
2019.08.22T12:01:07.691 sell 1500 9953.5 exchange1
2019.08.22T12:01:08.393 sell 300 9953.5 exchange1
2019.08.22T12:01:09.005 sell 2254 9953.5 exchange2
2019.08.22T12:01:09.625 sell 500 9957.5 exchange2
2019.08.22T12:01:10.448 sell 5330 9953.5 exchange2
2019.08.22T12:01:11.065 sell 260 9953.5 exchange2
2019.08.22T12:01:11.701 sell 38 9953.5 exchange2
2019.08.22T12:01:12.404 sell 44 9953.5 exchange2
2019.08.22T12:01:12.974 sell 41 9953.5 exchange2
On one hand I would like to use xbar to group them into 5-minute time buckets, i.e.
select price, amount by 5 xbar time.minute from trades
and on the other I am trying to group them by side and exchange, i.e.
`exchange`side xgroup trades
I am looking for the best method to combine the above 2 methods such that I have 4 groups bucketed/windowed/aggregated by time i.e.
exchange1 sell time1 price1 amt1
time2 price2 amt2
exchange1 buy time1 ...
time2 ...
exchange2 sell time1 ...
time2 ...
exchange2 buy time1 ...
time2 ...
etc.
How would one succinctly achieve this?
Thanks
If you're trying to aggregate over 15min buckets with groupings then you can do it in the by clause:
trades:([]exch:100?`P`Q;sym:100?`IBM`MSFT;side:100?`B`S;time:asc 0D10:20+0D00:01*100?100;price:100?100.;size:100?1000);
q)select avg price, sum size by exch,side,sym,15 xbar time.minute from trades
exch side sym minute| price size
---------------------| -------------
P B IBM 10:30 | 34.14991 369
P B IBM 10:45 | 46.46884 1204
P B IBM 11:15 | 30.9058 1106
P B IBM 11:30 | 22.88752 1196
P B IBM 11:45 | 12.47049 494
...
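
For readers more at home in pandas than q, here is a hedged sketch of the same by-clause aggregation (average price and total size per exchange/side/sym in 15-minute buckets). The randomly generated trades frame only mirrors the shape of the q example; it is not real data.

# Pandas sketch of "select avg price, sum size by exch,side,sym,15 xbar time.minute from trades".
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100
trades = pd.DataFrame({
    "exch": rng.choice(["P", "Q"], n),
    "sym": rng.choice(["IBM", "MSFT"], n),
    "side": rng.choice(["B", "S"], n),
    "time": pd.Timestamp("2019-08-22 10:20") + pd.to_timedelta(np.sort(rng.integers(0, 100, n)), unit="m"),
    "price": rng.uniform(0, 100, n),
    "size": rng.integers(0, 1000, n),
})

# pd.Grouper(freq="15min") plays the role of "15 xbar time.minute".
out = (trades
       .groupby(["exch", "side", "sym", pd.Grouper(key="time", freq="15min")])
       .agg(price=("price", "mean"), size=("size", "sum")))
print(out.head())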

Tableau running count reset

I have a list of sporting matches by time with result and margin. I want Tableau to keep a running count of the number of matches since the last x (say, since the last draw - where margin = 0).
This will mean that on every record, the running count will increase by one unless that match is a draw, in which case it will drop back to zero.
I have not found a method of achieving this. The only way I can see to restart counts is via dates (e.g. a new year).
As an aside, I can easily achieve this by creating a running count tally OUTSIDE of Tableau.
The interesting thing is that Tableau then doesn't deal with this well when there is more than one result on the same day.
For example, if the structure is:
GameID Date Margin Running count
...
48 01-01-15 54 122
49 08-01-15 12 123
50 08-01-15 0 124
51 08-01-15 17 0
52 08-01-15 23 1
53 15-01-15 9 2
...
Then when trying to plot running count against date, Tableau rearranges the data to show:
GameID Date Margin Running count
...
48 01-01-15 54 122
51 08-01-15 17 0
52 08-01-15 23 1
49 08-01-15 12 123
50 08-01-15 0 124
53 15-01-15 9 2
...
I assume it is doing this because by default it sorts the running count data in ascending order when dates are identical.
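
Since the poster mentions computing the tally outside Tableau anyway, a hedged pandas sketch of that pre-computation may be useful. It follows the convention in the sample table (the draw row keeps its count; the reset happens on the following game); with only the six sample rows the pre-draw counts start from 0 rather than 122.

# Running count of games since the last draw (Margin == 0), resetting on the game
# after the draw, as in the sample table above.
import pandas as pd

games = pd.DataFrame(
    {
        "GameID": [48, 49, 50, 51, 52, 53],
        "Date": ["01-01-15", "08-01-15", "08-01-15", "08-01-15", "08-01-15", "15-01-15"],
        "Margin": [54, 12, 0, 17, 23, 9],
    }
).sort_values("GameID")   # sort by GameID so same-day games keep their intended order

# True on the row immediately after a draw; the cumulative sum labels each "since last draw" segment.
after_draw = games["Margin"].shift(fill_value=1).eq(0)
games["Running count"] = games.groupby(after_draw.cumsum()).cumcount()
print(games)

Sorting explicitly by GameID before computing the tally also sidesteps the reordering problem described above when several games share a date.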

How to reduce Ipython parallel memory usage

I'm using IPython parallel in an optimisation algorithm that loops a large number of times. Parallelism is invoked in the loop using the map method of a LoadBalancedView (twice), a DirectView's dictionary interface, and an invocation of the %px magic. I'm running the algorithm in an IPython notebook.
I find that the memory consumed by both the kernel running the algorithm and one of the controllers increases steadily over time, limiting the number of loops I can execute (since available memory is limited).
Using heapy, I profiled memory use after a run of about 38 thousand loops:
Partition of a set of 98385344 objects. Total size = 18016840352 bytes.
Index Count % Size % Cumulative % Kind (class / dict of class)
0 5059553 5 9269101096 51 9269101096 51 IPython.parallel.client.client.Metadata
1 19795077 20 2915510312 16 12184611408 68 list
2 24030949 24 1641114880 9 13825726288 77 str
3 5062764 5 1424092704 8 15249818992 85 dict (no owner)
4 20238219 21 971434512 5 16221253504 90 datetime.datetime
5 401177 0 426782056 2 16648035560 92 scipy.optimize.optimize.OptimizeResult
6 3 0 402654816 2 17050690376 95 collections.defaultdict
7 4359721 4 323814160 2 17374504536 96 tuple
8 8166865 8 196004760 1 17570509296 98 numpy.float64
9 5488027 6 131712648 1 17702221944 98 int
<1582 more rows. Type e.g. '_.more' to view.>
You can see that about half the memory is used by IPython.parallel.client.client.Metadata instances. A good indicator that results from the map invocations are being cached is the 401177 OptimizeResult instances, the same number as the number of optimize invocations via lbview.map - I am not caching them in my code.
Is there a way I can control this memory usage on both the kernel and the IPython parallel controller (whose memory consumption is comparable to the kernel's)?
IPython parallel clients and controllers store results and other metadata from past transactions.
The IPython.parallel.Client class provides a method for clearing this data:
Client.purge_everything()
documented here. There are also purge_results() and purge_local_results() methods that give you some control over what gets purged.
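
A hedged sketch of how those purge calls might be slotted into the optimisation loop described in the question; the objective function, candidate generation and purge cadence are placeholders, not part of the original code.

# Periodically purge cached results/metadata so the client and hub don't grow without bound.
# Assumes an IPython.parallel cluster is already running (e.g. started with `ipcluster start`).
import math
from IPython.parallel import Client   # the package is called `ipyparallel` in newer releases

def objective(x):                     # placeholder for the real optimisation step
    return math.sin(x) * x

rc = Client()
lbview = rc.load_balanced_view()

purge_every = 1000                    # placeholder cadence; tune to the available memory
for i in range(38000):
    candidates = [i + 0.1 * k for k in range(10)]      # placeholder candidate generation
    results = lbview.map(objective, candidates, block=True)
    best = min(results)               # keep only what you actually need from each iteration
    if i % purge_every == 0:
        # purge_results() / purge_local_results() give finer-grained control if needed
        rc.purge_everything()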

optimize hive query for multitable join

INSERT OVERWRITE TABLE result
SELECT /*+ STREAMTABLE(product) */
i.IMAGE_ID,
p.PRODUCT_NO,
p.STORE_NO,
p.PRODUCT_CAT_NO,
p.CAPTION,
p.PRODUCT_DESC,
p.IMAGE1_ID,
p.IMAGE2_ID,
s.STORE_ID,
s.STORE_NAME,
p.CREATE_DATE,
CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
FROM image i
JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
JOIN STORE s ON p.STORE_NO = s.STORE_NO
JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;
I have a join query where I am joining huge tables and I am trying to optimize this Hive query. Here are some facts about the tables:
image table has 60m rows,
product table has 1b rows,
product_cat has 1000 rows,
store has 1m rows,
stock_info has 100 rows,
customizable_image has 200k rows.
A product can have one or two images (image1 and image2), and product-level information is stored only in the product table. I tried moving the join with product to the bottom, but I couldn't, as all the other joins that follow require data from the product table.
Here is what I have tried so far:
1. I gave the hint to Hive to stream the product table as it's the biggest one.
2. I bucketed the table (during create table) into 256 buckets (on image_id) and then did the join - it didn't give me any significant performance gain.
3. I changed the input format to sequence file from text file (gzip files), so that it is splittable and hence more mappers can be run if Hive wants to run more mappers.
Here are some key logs from the Hive console. I ran this Hive query in AWS. Can anyone help me understand the primary bottleneck here? This job is only processing a subset of the actual data.
Stage-14 is selected by condition resolver.
Launching Job 1 out of 11
Number of reduce tasks not specified. Estimated from input data size: 22
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Kill Command = /home/hadoop/bin/hadoop job -kill job_201403242034_0001
Hadoop job information for Stage-14: number of mappers: 341; number of reducers: 22
2014-03-24 20:55:05,709 Stage-14 map = 0%, reduce = 0%
.
2014-03-24 23:26:32,064 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 34198.12 sec
MapReduce Total cumulative CPU time: 0 days 9 hours 29 minutes 58 seconds 120 msec
.
2014-03-25 00:33:39,702 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 20879.69 sec
MapReduce Total cumulative CPU time: 0 days 5 hours 47 minutes 59 seconds 690 msec
.
2014-03-26 04:15:25,809 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 3903.4 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 3 seconds 400 msec
.
2014-03-26 04:25:05,892 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 2707.34 sec
MapReduce Total cumulative CPU time: 45 minutes 7 seconds 340 msec
.
2014-03-26 04:45:56,465 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3901.99 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 1 seconds 990 msec
.
2014-03-26 04:54:56,061 Stage-26 map = 100%, reduce = 100%, Cumulative CPU 2388.71 sec
MapReduce Total cumulative CPU time: 39 minutes 48 seconds 710 msec
.
2014-03-26 05:12:35,541 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 3792.5 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 3 minutes 12 seconds 500 msec
.
2014-03-26 05:34:21,967 Stage-5 map = 100%, reduce = 100%, Cumulative CPU 4432.22 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 13 minutes 52 seconds 220 msec
.
2014-03-26 05:54:43,928 Stage-21 map = 100%, reduce = 100%, Cumulative CPU 6052.96 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 40 minutes 52 seconds 960 msec
MapReduce Jobs Launched:
Job 0: Map: 59 Reduce: 18 Cumulative CPU: 3903.4 sec HDFS Read: 37387 HDFS Write: 12658668325 SUCCESS
Job 1: Map: 48 Cumulative CPU: 2707.34 sec HDFS Read: 12658908810 HDFS Write: 9321506973 SUCCESS
Job 2: Map: 29 Reduce: 10 Cumulative CPU: 3901.99 sec HDFS Read: 9321641955 HDFS Write: 11079251576 SUCCESS
Job 3: Map: 42 Cumulative CPU: 2388.71 sec HDFS Read: 11079470178 HDFS Write: 10932264824 SUCCESS
Job 4: Map: 42 Reduce: 12 Cumulative CPU: 3792.5 sec HDFS Read: 10932405443 HDFS Write: 11812454443 SUCCESS
Job 5: Map: 45 Reduce: 13 Cumulative CPU: 4432.22 sec HDFS Read: 11812679475 HDFS Write: 11815458945 SUCCESS
Job 6: Map: 42 Cumulative CPU: 6052.96 sec HDFS Read: 11815691155 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 32 minutes 59 seconds 120 msec
OK
The query is still taking longer than 5 hours in Hive, whereas in an RDBMS it takes only 5 hrs. I need some help in optimizing this query so that it executes much faster. Interestingly, when I ran the task with 4 large core instances, the time taken improved only by 10 minutes compared to the run with 3 large core instances, but when I ran the task with 3 medium cores, it took 1 hr 10 mins more.
This brings me to the question, "is Hive even the right choice for such complex joins?"
I suspect the bottleneck is just in sorting your product table, since it seems much larger than the others. I think joins with Hive for tables over a certain size become untenable, simply because they require a sort.
There are parameters to optimize sorting, like io.sort.mb, which you can try setting so that more sorting occurs in memory rather than spilling to disk, re-reading and re-sorting. Look at the number of spilled records, and see if this is much larger than your inputs. There are a variety of ways to optimize sorting. It might also help to break your query up into multiple subqueries so it doesn't have to sort as much at one time.
For the stock_info and product_cat tables, you could probably keep them in memory since they are so small (check out the 'distributed_map' UDF in Brickhouse: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java). For the customizable_image table, you might be able to use a bloom filter, if having a few false positives is not a real problem.
To completely remove the join, perhaps you could store the image info in a key/value store like HBase and do lookups instead. Brickhouse also has UDFs for HBase, like hbase_get and hbase_cached_get.
