Vitis HLS 2020.2: Pre-synthesis failed, but it didn't report any errors. How can I find out the cause of the error and fix it?

#define READ_COL 4

void read_data(kern_colmeta *colmeta
               , int ncols
               , HeapTupleHeaderData *htup
               , cl_char tup_dclass[READ_COL]
               , cl_long tup_values[READ_COL])
{
    char *addr; // __attribute__((unused));
    EXTRACT_HEAP_TUPLE_BEGIN(addr, colmeta, ncols, htup);
    EXTRACT_HEAP_TUPLE_NEXT(addr);
    EXTRACT_HEAP_TUPLE_NEXT(addr);
    EXTRACT_HEAP_READ_32BIT(addr, tup_dclass[3], tup_values[3]);
    EXTRACT_HEAP_TUPLE_NEXT(addr);
    EXTRACT_HEAP_TUPLE_NEXT(addr);
    EXTRACT_HEAP_TUPLE_NEXT(addr);
    //EXTRACT_HEAP_READ_POINTER(addr, tup_dclass[1], tup_values[1]);
    EXTRACT_HEAP_TUPLE_NEXT(addr);
    //EXTRACT_HEAP_READ_POINTER(addr, tup_dclass[2], tup_values[2]);
    EXTRACT_HEAP_TUPLE_END();
}

void accel(char *a, char *b) //, int* o)
{
#pragma HLS INTERFACE m_axi depth=125 port=a
#pragma HLS INTERFACE m_axi depth=1984 port=b
    kern_colmeta col[16];
    memcpy(col, b, sizeof(kern_colmeta) * 16);
    HeapTupleHeaderData *htup;
    htup = (HeapTupleHeaderData *)a;
    cl_char tup_dclass[READ_COL];
    cl_long tup_values[READ_COL];
    read_data(col, 16, htup, tup_dclass, tup_values);
}
The top function is accel(); the error occurs at the call to EXTRACT_HEAP_READ_32BIT(). The C simulation results are normal, but as soon as I try to run synthesis, it fails. The error log looks like this:
INFO: [HLS 200-1510] Running: csynth_design
INFO: [HLS 200-111] Finished File checks and directory preparation: CPU user time: 0 seconds. CPU system time: 0 seconds. Elapsed time: 0 seconds; current allocated memory: 205.967 MB.
INFO: [HLS 200-10] Analyzing design file '/root/gyf/hls/Unable_to_Schedule/kernel_fun.cpp' ...
INFO: [HLS 200-111] Finished Source Code Analysis and Preprocessing: CPU user time: 0.23 seconds. CPU system time: 0.08 seconds. Elapsed time: 0.21 seconds; current allocated memory: 207.566 MB.
INFO: [HLS 200-777] Using interface defaults for 'Vitis' flow target.
INFO: [HLS 200-111] Finished Command csynth_design CPU user time: 3.3 seconds. CPU system time: 0.35 seconds. Elapsed time: 3.56 seconds; current allocated memory: 209.139 MB.
Pre-synthesis failed.
while executing
"source accel.tcl"
("uplevel" body line 1)
invoked from within
"uplevel \#0 [list source $arg] "
INFO: [HLS 200-112] Total CPU user time: 5.81 seconds. Total CPU system time: 0.99 seconds. Total elapsed time: 5.43 seconds; peak allocated memory: 208.758 MB.
INFO: [Common 17-206] Exiting vitis_hls at Fri Feb 25 13:28:33 2022..
The Project code

I have exactly the same error:
INFO: [HLS 200-111] Finished Command csynth_design CPU user time: 45.76 seconds. CPU system time: 0.94 seconds. Elapsed time: 46.21 seconds; current allocated memory: 194.761 MB.
Pre-synthesis failed.
while executing
"source ../scripts/ip_v6.tcl"
("uplevel" body line 1)
invoked from within
"uplevel \#0 [list source $arg] "
I would be interested to know how I can track down the source of the error. I also get the same error message if I launch the vitis_hls GUI.

Related

Kilobytes or Kibibytes in GNU time?

We are doing some performance measurements, including some memory footprint measurements, and we've been doing this with GNU time.
But I cannot tell whether it is measuring in kilobytes (1000 bytes) or kibibytes (1024 bytes).
The man page for my system says of the %M format key (which we are using to measure peak memory usage): "Maximum resident set size of the process during its lifetime, in Kbytes."
I assume K here means the SI "Kilo" prefix, and thus kilobytes.
But having looked at a few other memory measurements of various things through various tools, I trust that assumption like I'd trust a starved lion to watch my dogs during a week-long vacation.
I need to know, because for our tests 1000 vs 1024 Kbytes adds up to a difference of nearly 8 gigabytes, and I'd like to think I can cut down the potential error in our measurements by a few billion.
Using the below testing setup, I have determined that GNU time on my system measures in Kibibytes.
The program below (allocator.c) allocates data and touches it in small strides to ensure that it all gets paged in. Note: this test only works if the entire allocation can be paged in at once; otherwise time's measurement will only reflect the largest resident portion of the memory.
allocator.c:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define min(a,b) ( ( (a)>(b) )? (b) : (a) )
volatile char* data;
const int step = 128;
int main(int argc, char** argv ){
    if( argc < 2 ){
        fprintf( stderr, "Usage: %s <bytes>\n", argv[0] );
        return 1;
    }
    unsigned long k = strtoul( argv[1], NULL, 10 );
    if( k > 0 ){ /* k is unsigned, so a 'k >= 0' check would always be true */
        printf( "Allocating %lu (%s) bytes\n", k, argv[1] );
        data = (char*) malloc( k );
        if( data == NULL ){
            fprintf( stderr, "Allocation failed\n" );
            return 1;
        }
        /* touch every 'step' bytes so that each page becomes resident;
           'i' must be unsigned long, or it overflows for sizes >= 2 GiB */
        for( unsigned long i = 0; i < k; i += step ){
            data[min(i,k-1)] = (char) i;
        }
        free( (void*) data );
    } else {
        printf("Bad size: %s => %lu\n", argv[1], k );
    }
    return 0;
}
compile with: gcc -O3 allocator.c -o allocator
Runner Bash Script:
kibibyte=1024
kilobyte=1000
mebibyte=$(expr 1024 \* ${kibibyte})
megabyte=$(expr 1000 \* ${kilobyte})
gibibyte=$(expr 1024 \* ${mebibyte})
gigabyte=$(expr 1000 \* ${megabyte})
for mult in $(seq 1 3); do
    bytes=$(expr ${gibibyte} \* ${mult})
    echo "${mult} GiB (${bytes} bytes)"
    echo "... in kibibytes: $(expr ${bytes} / ${kibibyte})"
    echo "... in kilobytes: $(expr ${bytes} / ${kilobyte})"
    /usr/bin/time -v ./allocator ${bytes}
    echo "===================================================="
done
For me this produces the following output:
1 GiB (1073741824 bytes)
... in kibibytes: 1048576
... in kilobytes: 1073741
Allocating 1073741824 (1073741824) bytes
Command being timed: "./a.out 1073741824"
User time (seconds): 0.12
System time (seconds): 0.52
Percent of CPU this job got: 75%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:00.86
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 1049068
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 262309
Voluntary context switches: 7
Involuntary context switches: 2
Swaps: 0
File system inputs: 16
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
2 GiB (2147483648 bytes)
... in kibibytes: 2097152
... in kilobytes: 2147483
Allocating 2147483648 (2147483648) bytes
Command being timed: "./a.out 2147483648"
User time (seconds): 0.21
System time (seconds): 1.09
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.31
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 2097644
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 524453
Voluntary context switches: 4
Involuntary context switches: 3
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
3 GiB (3221225472 bytes)
... in kibibytes: 3145728
... in kilobytes: 3221225
Allocating 3221225472 (3221225472) bytes
Command being timed: "./a.out 3221225472"
User time (seconds): 0.38
System time (seconds): 1.60
Percent of CPU this job got: 99%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.98
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 3146220
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 786597
Voluntary context switches: 4
Involuntary context switches: 3
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
====================================================
In the "Maximum resident set size" entry, I see values that are closest to the kibibytes value I expect from that raw byte count. There is some difference because its possible that some memory is being paged out (in cases where it is lower, which none of them are here) and because there is more memory being consumed than what the program allocates (namely, the stack and the actual binary image itself).
Versions on my system:
> gcc --version
gcc (GCC) 6.1.0
Copyright (C) 2016 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
> /usr/bin/time --version
GNU time 1.7
> lsb_release -a
LSB Version: :base-4.0-amd64:base-4.0-noarch:core-4.0-amd64:core-4.0-noarch:graphics-4.0-amd64:graphics-4.0-noarch:printing-4.0-amd64:printing-4.0-noarch
Distributor ID: CentOS
Description: CentOS release 6.10 (Final)
Release: 6.10
Codename: Final

Why are the load test values different in siege 3.0.8 and siege 4.0.4?

I have run the command siege -c50 -d10 -t3M http://www.example.com with both versions, 3.0.8 and 4.0.4, and got totally different results. Can anyone explain why the values differ between these versions, and what I can do about it?
In version 4.0.4
Transactions: 1033 hits
Availability: 100.00 %
Elapsed time: 179.47 secs
Data transferred: 26.31 MB
Response time: 8.45 secs
Transaction rate: 5.76 trans/sec
Throughput: 0.15 MB/sec
Concurrency: 48.63
Successful transactions: 1033
Failed transactions: 0
Longest transaction: 72.85
Shortest transaction: 3.65
In version 3.0.8
Transactions: 133 hits
Availability: 100.00 %
Elapsed time: 179.08 secs
Data transferred: 27.59 MB
Response time: 50.95 secs
Transaction rate: 0.74 trans/sec
Throughput: 0.15 MB/sec
Concurrency: 37.84
Successful transactions: 133
Failed transactions: 0
Longest transaction: 141.14
Shortest transaction: 8.34
Thank You.
HTML parsing was added in version 4.0.0 and is enabled by default. It issues additional requests for page resources such as style sheets, images, and JavaScript, which is why 4.0.4 records far more transactions for the same page.
You can enable or disable this feature in the siege.conf file by setting the value of parser to true or false.
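For example, the relevant line looks something like this (a sketch; in siege 4.x the file typically lives at ~/.siege/siege.conf, but check where your build installed it):
# turn off the HTML parser so siege 4.x stops fetching page assets
parser = false
With the parser off, 4.0.4 should count roughly the same transactions as 3.0.8, since only the page itself is requested.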

iPhone 6s has a longer pre-main time than 6 plus

While I was trying to optimize the app's launch time, I found an interesting fact: the pre-main time needed on an iPhone 6s is actually much longer than on an iPhone 6 Plus (warm launch).
On iPhone 6 plus, it looks like this:
Total pre-main time: 241.93 milliseconds (100.0%)
dylib loading time: 74.81 milliseconds (30.9%)
rebase/binding time: 15.96 milliseconds (6.6%)
ObjC setup time: 23.67 milliseconds (9.7%)
initializer time: 127.31 milliseconds (52.6%)
slowest intializers :
libSystem.B.dylib : 7.21 milliseconds (2.9%)
libBacktraceRecording.dylib : 5.97 milliseconds (2.4%)
AirshipKit : 5.17 milliseconds (2.1%)
My App : 199.60 milliseconds (82.5%)
However, on the iPhone 6s it takes more than three times as long:
Total pre-main time: 891.80 milliseconds (100.0%)
dylib loading time: 680.36 milliseconds (76.2%)
rebase/binding time: 59.18 milliseconds (6.6%)
ObjC setup time: 42.74 milliseconds (4.7%)
initializer time: 109.43 milliseconds (12.2%)
slowest intializers :
libSystem.B.dylib : 7.36 milliseconds (0.8%)
My App : 147.96 milliseconds (16.5%)
Both are running the same system (iOS 10.3). The iPhone 6s has more RAM and a better CPU than the 6 Plus, but I don't understand why it actually takes longer to finish the pre-main work.
Can anyone give me some hints? Thanks!
P.S. I did try this several times.

Spring Data Neo4j -- tx.close() takes too long

I'm trying to import data from the relational database into neo4j. The process goes like this (simplified a little bit):
while (!sarBatchService.isFinished()) {
    logger.info("New batch started");
    Date loadDeklFrom = sarBatchService.getStartDateForNewBatch();
    Date loadDeklTo = sarBatchService
            .getEndDateForNewBatch(loadDeklFrom); // = loadDeklFrom + 2 hours
    logger.info("Dates calculated");
    Date startTime = new Date();
    List<Dek> deks = dekLoadManager
            .loadAllDeks(loadDeklFrom, loadDeklTo); // loading data from the relational database (POINT A)
    Date endLoadTime = new Date();
    logger.info("Deks loaded");
    GraphDatabase gdb = template.getGraphDatabase();
    Transaction tx = gdb.beginTx();
    logger.info("Transaction started!");
    try {
        for (Dek dek : deks) {
            /* transform dek into nodes, and save
               these nodes with Neo4jTemplate.save() */
        }
        logger.info("Deks saved");
        Date endImportTime = new Date();
        int aff = sarBatchService.insertBatchData(loadDeklFrom,
                loadDeklTo, startTime, endLoadTime, endImportTime,
                deks.size()); // (POINT B)
        if (aff != 1) {
            String msg = "Something went wrong";
            throw new RuntimeException(msg);
        }
        logger.info("Batch data saved into relational database");
        tx.success();
        logger.info("Transaction marked as success.");
    } catch (NoSuchFieldException | SecurityException
            | IllegalArgumentException | IllegalAccessException
            | NoSuchMethodException | InstantiationException
            | InvocationTargetException e1) {
        logger.error("Something bad happened :(");
        logger.error(e1.getStackTrace().toString());
    } finally {
        logger.info("Closing transaction...");
        tx.close(); // (POINT C)
        logger.info("Transaction closed!");
        logger.info("Need more work? " + !sarBatchService.isFinished());
    }
}
So, the data in the relational database has a timestamp indicating when it was stored, and I load it in two-hour intervals (POINT A in the code). After that I iterate over the loaded data, transform it into nodes (spring-data-neo4j nodes), store them in Neo4j, and store information about the current batch (POINT B) in the relational database. I log almost every step to make debugging easier.
The program successfully finishes 158 batches. The problem starts with the 159th batch: the program stops at POINT C in the code (tx.close()) and waits there for about 4 hours (a step that usually takes a few seconds). After that it continues normally.
I've tried running it on Tomcat 7 with a 10 GB heap and with a 4 GB heap. The result is the same (it blocks on the 159th batch). The maximum number of nodes in one transaction is between 10k and 15k (7k on average), and the 159th batch has fewer than 10k nodes.
The interesting part is that everything goes well if the data is loaded in 4-hour or 12-hour intervals. Also, if I restart Tomcat or execute only the 159th batch, everything passes without problems.
I'm using Spring 3.2.8 with spring-data-neo4j 3.0.2.
This is the Neo4j message.log:
...
2014-11-24 15:21:38.080+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 418ms [total block time: 150.973s]
2014-11-24 15:21:45.722+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 377ms [total block time: 151.35s]
...
2014-11-24 15:23:57.381+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 392ms [total block time: 156.593s]
2014-11-24 15:24:06.758+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotating [/home/pravila/data/neo4j/nioneo_logical.log.1] # version=22 to /home/pravila/data/neo4j/nioneo_logical.log.2 from position 26214444
2014-11-24 15:24:06.763+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotate log first start entry # pos=24149878 out of [339=Start[339,xid=GlobalId[NEOKERNL|5889317606667601380|364|-1], BranchId[ 52 49 52 49 52 49 ],master=-1,me=-1,time=2014-11-24 15:23:13.021+0000/1416842593021,lastCommittedTxWhenTransactionStarted=267]]
2014-11-24 15:24:07.401+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotate: old log scanned, newLog # pos=2064582
2014-11-24 15:24:07.402+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Log rotated, newLog # pos=2064582, version 23 and last tx 267
2014-11-24 15:24:07.684+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotating [/home/pravila/data/neo4j/index/lucene.log.1] # version=6 to /home/pravila/data/neo4j/index/lucene.log.2 from position 26214408
2014-11-24 15:24:07.772+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotate log first start entry # pos=25902494 out of [134=Start[134,xid=GlobalId[NEOKERNL|5889317606667601380|364|-1], BranchId[ 49 54 50 51 55 52 ],master=-1,me=-1,time=2014-11-24 15:23:13.023+0000/1416842593023,lastCommittedTxWhenTransactionStarted=133]]
2014-11-24 15:24:07.871+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotate: old log scanned, newLog # pos=311930
2014-11-24 15:24:07.878+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Log rotated, newLog # pos=311930, version 7 and last tx 133
2014-11-24 15:24:10.919+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 214ms [total block time: 156.807s]
2014-11-24 15:24:17.486+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 405ms [total block time: 157.212s]
...
2014-11-24 15:25:28.692+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 195ms [total block time: 159.316s]
2014-11-24 15:25:33.238+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotating [/home/pravila/data/neo4j/nioneo_logical.log.2] # version=23 to /home/pravila/data/neo4j/nioneo_logical.log.1 from position 26214459
2014-11-24 15:25:33.242+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotate log first start entry # pos=24835943 out of [349=Start[349,xid=GlobalId[NEOKERNL|-6436474643536791121|374|-1], BranchId[ 52 49 52 49 52 49 ],master=-1,me=-1,time=2014-11-24 15:25:27.038+0000/1416842727038,lastCommittedTxWhenTransactionStarted=277]]
2014-11-24 15:25:33.761+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Rotate: old log scanned, newLog # pos=1378532
2014-11-24 15:25:33.763+0000 INFO [o.n.k.i.t.x.XaLogicalLog]: Log rotated, newLog # pos=1378532, version 24 and last tx 277
2014-11-24 15:25:37.031+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 148ms [total block time: 159.464s]
2014-11-24 15:25:45.891+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 153ms [total block time: 159.617s]
....
2014-11-24 15:26:48.447+0000 WARN [o.n.k.EmbeddedGraphDatabase]: GC Monitor: Application threads blocked for an additional 221ms [total block time: 161.641s]
I don't know what's going on here...
Please help.
It very much looks like you have a leaking, perhaps outer, transaction there, so the inner transaction that you show actually finishes but the outer one continues to accumulate state. As Neo4j doesn't suspend outer transactions but purely nests them, there will be no real commit until you hit the outer tx.success(); tx.close();
You should be able to see this if you take a thread dump when it blocks, to check whether it is actually stuck in the commit.
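To make the nesting behaviour concrete, here is a minimal sketch using the plain Neo4j 2.x embedded API (org.neo4j.graphdb) rather than the Spring Data Neo4j wrapper used above; the class and the batch loop are made up, only the transaction semantics matter:
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;

public class NestedTxSketch {
    void importBatches(GraphDatabaseService db) {
        Transaction outer = db.beginTx();          // e.g. opened further up the call stack
        try {
            for (int batch = 0; batch < 3; batch++) {
                Transaction inner = db.beginTx();  // nested: joins 'outer', no commit scope of its own
                try {
                    // ... create and save the nodes for this batch ...
                    inner.success();               // only marks the nested scope as successful
                } finally {
                    inner.close();                 // nothing is written to the store here
                }
            }
            outer.success();
        } finally {
            outer.close();                         // the real commit: all batches are flushed at once
        }
    }
}
The try/finally around each beginTx() guarantees that close() runs, but it does not change where the commit happens: if something up the call stack is playing the role of 'outer', every batch accumulates in that one transaction until it closes.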
After hours and hours of searching and testing, I tried to rerun the whole batch with the 4-hour interval. It also stopped, after the 145th batch (transaction). The difference was that this time it threw an error ("too many open files"). I set the ulimit for open files to unlimited and now it works. The only mystery is why the program didn't throw an error the first time.

Optimize Hive query for multi-table join

INSERT OVERWRITE TABLE result
SELECT /*+ STREAMTABLE(product) */
i.IMAGE_ID,
p.PRODUCT_NO,
p.STORE_NO,
p.PRODUCT_CAT_NO,
p.CAPTION,
p.PRODUCT_DESC,
p.IMAGE1_ID,
p.IMAGE2_ID,
s.STORE_ID,
s.STORE_NAME,
p.CREATE_DATE,
CASE WHEN custImg.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg1.IMAGE_ID is NULL THEN 0 ELSE 1 END,
CASE WHEN custImg2.IMAGE_ID is NULL THEN 0 ELSE 1 END
FROM image i
JOIN PRODUCT p ON i.IMAGE_ID = p.IMAGE1_ID
JOIN PRODUCT_CAT pcat ON p.PRODUCT_CAT_NO = pcat.PRODUCT_CAT_NO
JOIN STORE s ON p.STORE_NO = s.STORE_NO
JOIN STOCK_INFO si ON si.STOCK_INFO_ID = pcat.STOCK_INFO_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg ON i.IMAGE_ID = custImg.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg1 ON p.IMAGE1_ID = custImg1.IMAGE_ID
LEFT OUTER JOIN CUSTOMIZABLE_IMAGE custImg2 ON p.IMAGE2_ID = custImg2.IMAGE_ID;
I have a join query where I'm joining huge tables, and I'm trying to optimize it. Here are some facts about the tables:
the image table has 60M rows,
the product table has 1B rows,
product_cat has 1,000 rows,
store has 1M rows,
stock_info has 100 rows,
customizable_image has 200K rows.
A product can have one or two images (image1 and image2), and product-level information is stored only in the product table. I tried moving the join with product to the bottom, but I couldn't, since all of the subsequent joins require data from the product table.
Here is what I have tried so far:
1. I gave Hive the hint to stream the product table, since it's the biggest one.
2. I bucketed the table into 256 buckets (on image_id, during create table) and then did the join - it didn't give any significant performance gain.
3. I changed the input format from textfile (gzip files) to sequence file, so that it is splittable and Hive can run more mappers if it wants to.
Here are some key logs from the Hive console. I ran this query on AWS. Can anyone help me understand the primary bottleneck here? This job is only processing a subset of the actual data.
Stage-14 is selected by condition resolver.
Launching Job 1 out of 11
Number of reduce tasks not specified. Estimated from input data size: 22
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Kill Command = /home/hadoop/bin/hadoop job -kill job_201403242034_0001
Hadoop job information for Stage-14: number of mappers: 341; number of reducers: 22
2014-03-24 20:55:05,709 Stage-14 map = 0%, reduce = 0%
.
2014-03-24 23:26:32,064 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 34198.12 sec
MapReduce Total cumulative CPU time: 0 days 9 hours 29 minutes 58 seconds 120 msec
.
2014-03-25 00:33:39,702 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 20879.69 sec
MapReduce Total cumulative CPU time: 0 days 5 hours 47 minutes 59 seconds 690 msec
.
2014-03-26 04:15:25,809 Stage-14 map = 100%, reduce = 100%, Cumulative CPU 3903.4 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 3 seconds 400 msec
.
2014-03-26 04:25:05,892 Stage-30 map = 100%, reduce = 100%, Cumulative CPU 2707.34 sec
MapReduce Total cumulative CPU time: 45 minutes 7 seconds 340 msec
.
2014-03-26 04:45:56,465 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 3901.99 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 5 minutes 1 seconds 990 msec
.
2014-03-26 04:54:56,061 Stage-26 map = 100%, reduce = 100%, Cumulative CPU 2388.71 sec
MapReduce Total cumulative CPU time: 39 minutes 48 seconds 710 msec
.
2014-03-26 05:12:35,541 Stage-4 map = 100%, reduce = 100%, Cumulative CPU 3792.5 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 3 minutes 12 seconds 500 msec
.
2014-03-26 05:34:21,967 Stage-5 map = 100%, reduce = 100%, Cumulative CPU 4432.22 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 13 minutes 52 seconds 220 msec
.
2014-03-26 05:54:43,928 Stage-21 map = 100%, reduce = 100%, Cumulative CPU 6052.96 sec
MapReduce Total cumulative CPU time: 0 days 1 hours 40 minutes 52 seconds 960 msec
MapReduce Jobs Launched:
Job 0: Map: 59 Reduce: 18 Cumulative CPU: 3903.4 sec HDFS Read: 37387 HDFS Write: 12658668325 SUCCESS
Job 1: Map: 48 Cumulative CPU: 2707.34 sec HDFS Read: 12658908810 HDFS Write: 9321506973 SUCCESS
Job 2: Map: 29 Reduce: 10 Cumulative CPU: 3901.99 sec HDFS Read: 9321641955 HDFS Write: 11079251576 SUCCESS
Job 3: Map: 42 Cumulative CPU: 2388.71 sec HDFS Read: 11079470178 HDFS Write: 10932264824 SUCCESS
Job 4: Map: 42 Reduce: 12 Cumulative CPU: 3792.5 sec HDFS Read: 10932405443 HDFS Write: 11812454443 SUCCESS
Job 5: Map: 45 Reduce: 13 Cumulative CPU: 4432.22 sec HDFS Read: 11812679475 HDFS Write: 11815458945 SUCCESS
Job 6: Map: 42 Cumulative CPU: 6052.96 sec HDFS Read: 11815691155 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 days 7 hours 32 minutes 59 seconds 120 msec
OK
The query still takes longer than 5 hours in Hive, whereas in the RDBMS it takes only 5 hrs. I need some help optimizing this query so that it executes much faster. Interestingly, when I ran the job with 4 large core instances, the time improved by only 10 minutes compared to the run with 3 large core instances, but when I ran it with 3 medium core instances it took 1 hr 10 mins more.
This brings me to the question: is Hive even the right choice for such complex joins?
I suspect the bottleneck is just in sorting your product table, since it is much larger than the others. I think joins in Hive become untenable for tables over a certain size, simply because they require a sort.
There are parameters to optimize sorting, like io.sort.mb, which you can try setting so that more of the sort happens in memory rather than spilling to disk and being re-read and re-sorted. Look at the number of spilled records and see whether it is much larger than your inputs. There are a variety of ways to optimize sorting. It might also help to break your query up into multiple subqueries so it doesn't have to sort as much at one time.
For the stock_info and product_cat tables, you could probably keep them in memory since they are so small (check out the 'distributed_map' UDF in Brickhouse: https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/dcache/DistributedMapUDF.java). For the customizable images, you might be able to use a Bloom filter, if having a few false positives is not a big problem.
To completely remove that join, you could perhaps store the image info in a key-value store like HBase and do lookups instead. Brickhouse also has UDFs for HBase, like hbase_get and hbase_cached_get.
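As a concrete starting point for the sorting and small-table suggestions above (illustrative only: the parameter names are from the Hadoop 1.x / Hive generation this job appears to run on, and the values are guesses to tune against your own spilled-records counters, not recommendations):
-- bigger in-memory sort buffer per map task, in MB; watch the spilled-records counter
set io.sort.mb=512;
-- merge more spill streams per pass
set io.sort.factor=50;
-- let Hive turn joins against the tiny tables (product_cat, stock_info) into map joins
set hive.auto.convert.join=true;
-- size threshold, in bytes, below which a table qualifies for a map join
set hive.mapjoin.smalltable.filesize=50000000;
A map join keeps the small side entirely in the mappers' memory, so those joins stop contributing to the shuffle and sort.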
