How to parse CLI output with cascading elements using TextFSM

I am trying to parse CLI output that has cascading elements using TextFSM and Python (ref: https://github.com/google/textfsm/wiki/TextFSM).
Using the example below, how can I get the value of 'CPU utilization' for each slot?
Routing Engine status:
  Slot 0:
    Current state                  Master
    Election priority              Master (default)
    Temperature                 39 degrees C / 102 degrees F
    CPU temperature             55 degrees C / 131 degrees F
    DRAM                      2048 MB
    Memory utilization          76 percent
    CPU utilization:
      User                      95 percent
      Background                 0 percent
      Kernel                     4 percent
      Interrupt                  1 percent
      Idle                       0 percent
    Model                          RE-4.0
    Serial ID                      xxxxxxxxxxxx
    Start time                     2008-04-10 20:32:25 PDT
    Uptime                         180 days, 22 hours, 45 minutes, 20 seconds
    Load averages:                 1 minute   5 minute  15 minute
                                       0.96       1.03       1.03
Routing Engine status:
  Slot 1:
    Current state                  Backup
    Election priority              Backup
    Temperature                 30 degrees C / 86 degrees F
    CPU temperature             31 degrees C / 87 degrees F
    DRAM                      2048 MB
    Memory utilization          14 percent
    CPU utilization:
      User                       0 percent
      Background                 0 percent
      Kernel                     0 percent
      Interrupt                  1 percent
      Idle                      99 percent
    Model                          RE-4.0
    Serial ID                      xxxxxxxxxxxx
    Start time                     2008-01-22 07:32:10 PST
    Uptime                         260 days, 10 hours, 45 minutes, 39 seconds
Template:
Value Required Slot (\d+)
Value State (\w+)
Value Temp (\d+)
Value CPUTemp (\d+)
Value DRAM (\d+)
Value User (\d+)
Value Background (\d+)
Value Kernel (\d+)
Value Interrupt (\d+)
Value Idle (\d+)
Value Model (\S+)

Start
  ^Routing Engine status: -> Record RESlot
  ^\s+CPU utilization: -> Record SUBRESlot

RESlot
  ^\s+Slot\s+${Slot}
  ^\s+Current state\s+${State}
  ^\s+Temperature\s+${Temp} degrees
  ^\s+CPU temperature\s+${CPUTemp} degrees
  ^\s+DRAM\s+${DRAM} MB
  ^\s+Model\s+${Model} -> Start

SUBRESlot
  ^\s+User\s+${User}\s+percent
  ^\s+backgroud\s+${Background}\s+percent
  ^\s+Kernel\s+${Kernel}\s+percent
  ^\s+Interrupt\s+${Interrupt}\s+percent
  ^\s+Idle\s+${Idle}\s+percent -> Start
Output:
Slot, State, Temp, CPUTemp, DRAM, User, Background, Kernel, Interrupt, Idle, Model
0, Master, 39, 55, 2048, , , , , , RE-4.0
1, Backup, 30, 31, 2048, , , , , , RE-4.0
As you can see, the CPU utilization values are not getting populated.
I would really appreciate any pointers.

There are two mistakes in your template:
1. You should Record only once you have all the data for a slot.
2. The Model value you are trying to match in the last line of the RESlot state can only be matched after the CPU utilization section has been passed in the input text. Please note that TextFSM parses the file line by line.
(Also note the typo backgroud in your SUBRESlot state; that rule will never match the Background line.)
You can use the template below to get your data:
Value Required Slot (\d+)
Value State (\w+)
Value Temp (\d+)
Value CPUTemp (\d+)
Value DRAM (\d+)
Value User (\d+)
Value Background (\d+)
Value Kernel (\d+)
Value Interrupt (\d+)
Value Idle (\d+)
Value Model (\S+)

Start
  ^Routing Engine status: -> RESlot
  ^\s+CPU utilization: -> SUBRESlot

RESlot
  ^\s+Slot\s+${Slot}
  ^\s+Current state\s+${State}
  ^\s+Temperature\s+${Temp} degrees
  ^\s+CPU temperature\s+${CPUTemp} degrees
  ^\s+DRAM\s+${DRAM} MB -> Start

SUBRESlot
  ^\s+User\s+${User}\s+percent
  ^\s+Background\s+${Background}\s+percent
  ^\s+Kernel\s+${Kernel}\s+percent
  ^\s+Interrupt\s+${Interrupt}\s+percent
  ^\s+Idle\s+${Idle}\s+percent -> SUBModel

SUBModel
  ^\s+Model\s+${Model} -> Record Start
Result:
Slot, State, Temp, CPUTemp, DRAM, User, Background, Kernel, Interrupt, Idle, Model
0, Master, 39, 55, 2048, 95, 0, 4, 1, 0, RE-4.0
1, Backup, 30, 31, 2048, 0, 0, 0, 1, 99, RE-4.0
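For completeness, here is a minimal sketch of how the corrected template can be run from Python with the textfsm module. The file names jnpr_re.textfsm and re_status.txt are assumptions for illustration; use your own paths.

import textfsm

# Hypothetical file names; adjust to your environment.
with open("jnpr_re.textfsm") as template_file:
    fsm = textfsm.TextFSM(template_file)

with open("re_status.txt") as cli_output:
    rows = fsm.ParseText(cli_output.read())

# fsm.header lists the Value names; each row is one Record.
print(fsm.header)
for row in rows:
    print(row)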

Averaging a Data Series in a Google Sheet to a single entry per period regardless of the number of samples in the larger period?

I have a small data set of ~200 samples taken over twenty years, with two columns of data that sometimes have multiple entries per period (i.e. age or date). When I go to plot it, even though the data spans 20 years, the graph heavily reflects the number of samples in each period rather than the period itself. For example, during age 23 there may be 2 or 3 samples, 1 for age 24, 20 for age 25, and 10 for age 35; the number of samples depended entirely on the need for additional data at the time, so there is simply no consistency to the sample rate.
How do I get a Max, or an Average/Max, for each period (age) and ensure there is only one entry per period in the sheet (about one entry per year), without having to create a separate sheet full of separate queries and charting off of that?
What I have tried in Google Sheets (where my data is) is choosing "aggregate" on the chart's x-series (which is the age period), which helps flatten the graph a bit but doesn't reduce the series.
A read-only link to the spreadsheet is HERE for reference.
Data looking something like this:
3/27/2013 36.4247 2.5 29.3
4/10/2013 36.4630 1.8 42.8
4/15/2013 36.4767 2.2 33.9
5/2/2013 36.5233 2.2 33.9
5/21/2013 36.5753 1.91 39.9
5/29/2013 36.5973 1.94 39.2
7/29/2013 36.7644 1.98 38.3
10/25/2013 37.0055 1.7 45.6
2/28/2014 37.3507 1.85 50 41.3
6/1/2014 37.6055 1.98 38 38.1
12/1/2014 38.1068 37
6/1/2015 38.6055 2.18 34 33.9
12/11/2015 39.1342 3.03 23 23.1
12/14/2015 39.1425 3.18 22 21.9
12/15/2015 39.1452 3.44 20 20.0
12/17/2015 39.1507 3.61 19 18.9
12/21/2015 39.1616 3.62 19 18.8
12/23/2015 39.1671 3.32 21 20.8
12/25/2015 39.1726 3.08 23 22.7
12/28/2015 39.1808 3.12 22 22.4
12/29/2015 39.1836 2.97 24 23.7
12/30/2015 39.1863 3.57 19 19.1
12/31/2015 39.1890 3.37 20 20.5
1/1/2016 39.1918 3.37 20 20.5
1/3/2016 39.1973 2.65 27 27.0
1/4/2016 39.2000 2.76 26 25.8
Try:
=QUERY(SORTN(SORT({YEAR($A$6:$A), B6:B}, 1, 0, 2, 0), 9^9, 2, 1, 1),
 "where Col1 <> 1899")
The inner SORT orders the rows by year and value in descending order; SORTN with display-ties mode 2 then keeps exactly one row per unique year, and the WHERE clause drops rows whose date cell was blank (YEAR of an empty date evaluates to 1899).
demo spreadsheet
and build a chart from there
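If you would rather collapse the data outside Sheets first, here is a rough equivalent sketch in Python with pandas (not part of the original answer; the CSV name and the "date" column header are assumptions):

import pandas as pd

# Hypothetical export of the sheet; the first column holds the dates.
df = pd.read_csv("samples.csv", parse_dates=["date"])

# One row per year: take the max (or .mean()) of each numeric column.
yearly = df.groupby(df["date"].dt.year).max(numeric_only=True)
print(yearly)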

Kilobytes or Kibibytes in GNU time?

We are doing some performance measurements including some memory footprint measurements. We've been doing this with GNU time.
But, I cannot tell if they are measuring in kilobytes (1000 bytes) or kibibytes (1024 bytes).
The man page for my system says of the %M format key (which we are using to measure peak memory usage): "Maximum resident set size of the process during its lifetime, in Kbytes."
I assume K here means the SI "Kilo" prefix, and thus kilobytes.
But having looked at a few other memory measurements of various things through various tools, I trust that assumption like I'd trust a starved lion to watch my dogs during a week-long vacation.
I need to know because, for our tests, the difference between 1000-byte and 1024-byte Kbytes adds up to nearly 8 gigabytes, and I'd like to think I can cut the potential error in our measurements down by a few billion bytes.
Using the testing setup below, I have determined that GNU time on my system measures in kibibytes.
The program below (allocator.c) allocates a block of data and touches it in small strides (every 128 bytes) to ensure that it all gets paged in. Note: this test only works if the entirety of the allocated data can be paged in at once; otherwise time's measurement will only reflect the largest resident portion of the memory.
allocator.c:
#include <stdio.h>
#include <stdlib.h>

#define min(a,b) ( ((a) > (b)) ? (b) : (a) )

volatile char* data;
const unsigned long step = 128;

int main(int argc, char** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <bytes>\n", argv[0]);
        return 1;
    }
    unsigned long k = strtoul(argv[1], NULL, 10);
    if (k > 0) {
        printf("Allocating %lu (%s) bytes\n", k, argv[1]);
        data = (char*) malloc(k);
        if (data == NULL) {
            perror("malloc");
            return 1;
        }
        /* Touch the block every `step` bytes so that every page is
         * faulted in and therefore counted in the resident set size.
         * The counter must be unsigned long: an int would overflow
         * before reaching 3 GiB. */
        for (unsigned long i = 0; i < k; i += step) {
            data[min(i, k - 1)] = (char) i;
        }
        free((char*) data);
    } else {
        printf("Bad size: %s => %lu\n", argv[1], k);
    }
    return 0;
}
compile with: gcc -O3 allocator.c -o allocator
Runner Bash script:
kibibyte=1024
kilobyte=1000
mebibyte=$(expr 1024 \* ${kibibyte})
megabyte=$(expr 1000 \* ${kilobyte})
gibibyte=$(expr 1024 \* ${mebibyte})
gigabyte=$(expr 1000 \* ${megabyte})

for mult in $(seq 1 3); do
    bytes=$(expr ${gibibyte} \* ${mult})
    echo ${mult} GiB \(${bytes} bytes\)
    echo "... in kibibytes: $(expr ${bytes} / ${kibibyte})"
    echo "... in kilobytes: $(expr ${bytes} / ${kilobyte})"
    /usr/bin/time -v ./allocator ${bytes}
    echo "===================================================="
done
For me this produces the following output:
1 GiB (1073741824 bytes)
... in kibibytes: 1048576
... in kilobytes: 1073741
Allocating 1073741824 (1073741824) bytes
Command being timed: "./a.out 1073741824"
User time (seconds): 0.12
System time (seconds): 0.52
Percent of CPU this job got: 75%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.86
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1049068
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 262309
Voluntary context switches: 7
Involuntary context switches: 2
Swaps: 0
File system inputs: 16
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
2 GiB (2147483648 bytes)
... in kibibytes: 2097152
... in kilobytes: 2147483
Allocating 2147483648 (2147483648) bytes
Command being timed: "./a.out 2147483648"
User time (seconds): 0.21
System time (seconds): 1.09
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.31
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2097644
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 524453
Voluntary context switches: 4
Involuntary context switches: 3
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
3 GiB (3221225472 bytes)
... in kibibytes: 3145728
... in kilobytes: 3221225
Allocating 3221225472 (3221225472) bytes
Command being timed: "./a.out 3221225472"
User time (seconds): 0.38
System time (seconds): 1.60
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.98
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3146220
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 786597
Voluntary context switches: 4
Involuntary context switches: 3
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
In the "Maximum resident set size" entry, I see values that are closest to the kibibytes value I expect from that raw byte count. There is some difference because its possible that some memory is being paged out (in cases where it is lower, which none of them are here) and because there is more memory being consumed than what the program allocates (namely, the stack and the actual binary image itself).
Versions on my system:
> gcc --version
gcc (GCC) 6.1.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> /usr/bin/time --version
GNU time 1.7
> lsb_release -a
LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.10 (Final)
Release: 6.10
Codename: Final

TextFSM: Cisco NXOS

I have started exploring Google's TextFSM; it's great for screen scraping, but I'm stuck. Here is my template for the command vsh_lc -c "show platform internal bcm-usd funcstats":
Value Filldown Chassis (.*)
Value FUNC (.*)
Value COUNT (.*)
Value TOTAL (.*)
Value MIN (.*)
Value AVG (.*)
Value MAX (.*)

Start
  ^${Chassis}:
  ^\s+${FUNC}\s+${COUNT}\s+${TOTAL}\s+${MIN}\s+${AVG}\s+${MAX} -> Record
Raw Output:
FUNC ID Stats:
==============================================================================================================================
Func Name Count Total Min Avg Max
==============================================================================================================================
OPT_TLV_DISPATCH 374599 195554872 32 522 82953
TLV Process 2209072 193806960 21 87 49448
TLV Type Stats:
==============================================================================================================================
Func Name Count Total Min Avg Max
==============================================================================================================================
bcm_l2_addr_add 26337 903799 24 34 1020
bcm_l2_addr_delete 27893 1462054 25 52 16169
bcm_l3_egress_create 13 1127 55 86 120
bcm_l3_egress_destroy 96172 2829352 16 29 4445
bcm_l3_host_add 374240 15358864 33 41 2267
bcm_l3_host_delete 96166 5105960 33 53 1930
bcm_l3_route_add 1197940 87904190 53 73 49366
bcm_field_entry_policer_get 36768 81346 1 2 4436
bcm_field_entry_prio_get 41364 105509 1 2 2707
bcm_field_entry_stat_get 36768 46577 0 1 43
bcm_field_stat_get 147072 2331072 13 15 4378
bcm_policer_get 36768 76539 1 2 4199
my_l3_host_create 96167 14690261 83 152 38770
==============================================================================================================================
Retry Count: 0, Retry Success Count: 0
Parity Errors: 0, Parity Errors Uncorrectable: 0
Port Restarts on PHY error: 0
For some reason it is not parsing the values into the table. Please help!
The template below will give you the required output:
Value Filldown Chassis (.+)
Value FUNC (\w+|.+)
Value COUNT (\d+)
Value TOTAL (\d+)
Value MIN (\d+)
Value AVG (\d+)
Value MAX (\d+)

Start
  ^${Chassis} Stats:
  ^${FUNC}\s+${COUNT}\s+${TOTAL}\s+${MIN}\s+${AVG}\s+${MAX} -> Record
The regex (.*) matches any characters at all; always keep each regex specific to what has to be matched.
Use an online regex tool such as Regexr to test your regexes.
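To verify the template, it can be exercised from Python in the same way; here is a minimal sketch (the file names are hypothetical) that turns each record into a dict:

import textfsm

# Hypothetical file names; adjust to your environment.
with open("bcm_funcstats.textfsm") as template_file:
    fsm = textfsm.TextFSM(template_file)

with open("funcstats.txt") as cli_output:
    rows = fsm.ParseText(cli_output.read())

# Pair each row with the Value names so records are easy to inspect.
for row in rows:
    record = dict(zip(fsm.header, row))
    print(record["Chassis"], record["FUNC"], record["MAX"])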

GLMM glmer and glmmADMB - comparison error

I am trying to test whether there are differences in the number of seeds obtained in five different populations with different applied treatments, with maternal plant and paternal plant as random effects. First I tried to fit a glmer model.
library(lme4)  # glmer() comes from lme4

dat <- dat[, c(12, 7, 6, 13, 8, 11)]
dat$parents <- factor(paste(dat$mother, dat$father, sep = "_"))
compareTreat <- function(d)
{
    d$treatment <- factor(d$treatment)
    print(tapply(d$pop, list(d$pop, d$treatment), length))
    print(summary(fit <- glmer(seed_no ~ treatment + (1 | pop/mother) +
                                   (1 | pop/father), data = d, family = "poisson")))
}
Then I compared two treatments in two populations (pop 64 and pop 121, in this case). The other populations do not have these particular treatments, so I get NA values for those.
compareTreat(subset(dat,treatment%in%c("IE 5x","IE 7x")&pop%in%c(64,121)))
This is the output:
IE 5x IE 7x
10 NA NA
45 NA NA
64 31 27
121 33 28
144 NA NA
Generalized linear mixed model fit by maximum likelihood (Laplace
Approximation) [glmerMod]
Family: poisson ( log )
Formula: seed_no ~ treatment + (1 | pop/mother) + (1 | pop/father)
Data: d
AIC BIC logLik deviance df.resid
592.5 609.2 -290.2 580.5 113
Scaled residuals:
Min 1Q Median 3Q Max
-1.8950 -0.8038 -0.2178 0.4440 1.7991
Random effects:
Groups Name Variance Std.Dev.
father.pop (Intercept) 3.566e-01 5.971e-01
mother.pop (Intercept) 9.456e-01 9.724e-01
pop (Intercept) 1.083e-10 1.041e-05
pop.1 (Intercept) 1.017e-10 1.008e-05
Number of obs: 119, groups: father:pop, 81; mother:pop, 24; pop, 2
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.74664 0.24916 2.997 0.00273 **
treatmentIE 7x -0.05789 0.17894 -0.324 0.74629
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
tretmntIE7x -0.364
It seems there are no differences between treatments. But as there are many zeros in the data, a zero-inflated model would be worth trying. I tried with glmmADMB, writing the script like this:
library(glmmADMB)  # glmmadmb() comes from glmmADMB

compareTreat <- function(d)
{
    d$treatment <- factor(d$treatment)
    print(tapply(d$pop, list(d$pop, d$treatment), length))
    print(summary(fit_zip <- glmmadmb(seed_no ~ treatment + (1 | pop/mother) +
                                          (1 | pop/father), data = d,
                                      family = "poisson", zeroInflation = TRUE)))
}
Then I compared the treatments again; nothing else in the call has changed.
compareTreat(subset(dat,treatment%in%c("IE 5x","IE 7x")&pop%in%c(64,121)))
But in that case, the output is
IE 5x IE 7x
10 NA NA
45 NA NA
64 31 27
121 33 28
144 NA NA
Error in pop:father : NA/NaN argument
In addition: Warning messages:
1: In pop:father :
numerical expression has 119 elements: only the first used
2: In pop:father :
numerical expression has 119 elements: only the first used
3: In eval(parse(text = x), data) : NAs introduced by coercion
Called from: eval(parse(text = x), data)
I tried changing everything I could think of, but I still don't know where the problem is.
If I remove (1|pop/father) from the glmmadmb call, the model runs, but that does not feel correct. I wonder whether the mistake is in the function prior to the glmmadmb call (although it worked fine with the glmer model) or in the comparison itself after the model. I also tried removing NAs with na.omit in case that was the issue, but it made no difference. Why does the script stop instead of continuing to run?
I am a beginner with RStudio; my R version is 3.4.2, "Short Summer". If someone with experience could point me in the right direction I would be very grateful!
H.

Optimize Hive query for multi-table join

INSERT OVERWRITE TABLE result
SELECT /*+ STREAMTABLE(product) */
i.IMAGE_ID,
p.PRODUCT_NO,
p.STORE_NO,
p.PRODUCT_CAT_NO,
p.CAPTION,
p.PRODUCT_DESC,
p.IMAGE1_ID,
p.IMAGE2_ID,
s.STORE_ID,
s.STORE_NAME,
p.CREATE_DATE,
CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
FROM image i
JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
JOIN STORE s ON p.STORE_NO = s.STORE_NO
JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;
I have a join query where I am joining huge tables, and I am trying to optimize it. Here are some facts about the tables:
image table has 60m rows,
product table has 1b rows,
product_cat has 1000 rows,
store has 1m rows,
stock_info has 100 rows,
customizable_image has 200k rows.
A product can have one or two images (image1 and image2), and product-level information is stored only in the product table. I tried moving the join with product to the bottom, but I couldn't, as all the following joins require data from the product table.
Here is what I have tried so far:
1. I gave Hive the hint to stream the product table, as it is the biggest one.
2. I bucketed the table (during create table) into 256 buckets (on image_id) and then did the join; it didn't give me any significant performance gain.
3. I changed the input format from text file (gzip files) to sequence file, so that it is splittable and hence more mappers can be run if Hive wants to run more mappers.
Here are some key logs from the Hive console. I ran this query on AWS. Can anyone help me understand the primary bottleneck here? This job is only processing a subset of the actual data.
Stage-14 is selected by condition resolver.
Launching Job 1 out of 11
Number of reduce tasks not specified. Estimated from input data size: 22
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Kill Command = /home/hadoop/bin/hadoop job -kill job_201403242034_0001
Hadoop job information for Stage-14: number of mappers: 341; number of reducers: 22
2014-03-24 20:55:05,709 Stage-14 map = 0%, reduce = 0%
.
2014-03-24 23:26:32,064 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 34198.12 sec
MapReduce Total cumulative CPU time: 0 days 9 hours 29 minutes 58 seconds 120 msec
.
2014-03-25 00:33:39,702 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 20879.69 sec
MapReduce Total cumulative CPU time: 0 days 5 hours 47 minutes 59 seconds 690 msec
.
2014-03-26 04:15:25,809 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 3903.4 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 3 seconds 400 msec
.
2014-03-26 04:25:05,892 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 2707.34 sec
MapReduce Total cumulative CPU time: 45 minutes 7 seconds 340 msec
.
2014-03-26 04:45:56,465 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3901.99 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 1 seconds 990 msec
.
2014-03-26 04:54:56,061 Stage-26 map = 100%, reduce = 100%, Cumulative CPU 2388.71 sec
MapReduce Total cumulative CPU time: 39 minutes 48 seconds 710 msec
.
2014-03-26 05:12:35,541 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 3792.5 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 3 minutes 12 seconds 500 msec
.
2014-03-26 05:34:21,967 Stage-5 map = 100%, reduce = 100%, Cumulative CPU 4432.22 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 13 minutes 52 seconds 220 msec
.
2014-03-26 05:54:43,928 Stage-21 map = 100%, reduce = 100%, Cumulative CPU 6052.96 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 40 minutes 52 seconds 960 msec
MapReduce Jobs Launched:
Job 0: Map: 59 Reduce: 18 Cumulative CPU: 3903.4 sec HDFS Read: 37387 HDFS Write: 12658668325 SUCCESS
Job 1: Map: 48 Cumulative CPU: 2707.34 sec HDFS Read: 12658908810 HDFS Write: 9321506973 SUCCESS
Job 2: Map: 29 Reduce: 10 Cumulative CPU: 3901.99 sec HDFS Read: 9321641955 HDFS Write: 11079251576 SUCCESS
Job 3: Map: 42 Cumulative CPU: 2388.71 sec HDFS Read: 11079470178 HDFS Write: 10932264824 SUCCESS
Job 4: Map: 42 Reduce: 12 Cumulative CPU: 3792.5 sec HDFS Read: 10932405443 HDFS Write: 11812454443 SUCCESS
Job 5: Map: 45 Reduce: 13 Cumulative CPU: 4432.22 sec HDFS Read: 11812679475 HDFS Write: 11815458945 SUCCESS
Job 6: Map: 42 Cumulative CPU: 6052.96 sec HDFS Read: 11815691155 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 32 minutes 59 seconds 120 msec
OK
The query is still taking longer than 5 hours in Hive, whereas in an RDBMS it takes only 5 hrs. I need some help optimizing this query so that it executes much faster. Interestingly, when I ran the task with 4 large core instances, the time taken improved by only 10 mins compared to the run with 3 large core instances, but when I ran it with 3 medium core instances it took 1 hr 10 mins more.
This brings me to the question: "Is Hive even the right choice for such complex joins?"
I suspect the bottleneck is just in sorting your product table, since it seems much larger than the others. Joins in Hive on tables over a certain size tend to become untenable, simply because they require a sort.
There are parameters to optimize sorting, like io.sort.mb, which you can try setting so that more of the sorting happens in memory rather than spilling to disk and being re-read and re-sorted. Look at the number of spilled records and see if it is much larger than your inputs. There are a variety of ways to optimize sorting. It might also help to break your query up into multiple subqueries so it doesn't have to sort as much at one time.
For the stock_info and product_cat tables, you could probably keep them in memory since they are so small; check out the 'distributed_map' UDF in Brickhouse (https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java). For the customizable images, you might be able to use a Bloom filter, if having a few false positives is not a big problem.
To completely remove the join, perhaps you could store the image info in a key-value store like HBase and do lookups instead. Brickhouse also has UDFs for HBase, like hbase_get and hbase_cached_get.
