How to load Bucketed HIVE table using LOAD DATA LOCAL INPATH - apache-hive

Can we load a bucketed Hive table using the LOAD DATA LOCAL INPATH ... command? I ran it for a sample file, but all values were inserted as NULL.
hduser@ubuntu:~$ cat /home/hduser/Desktop/hive_external/hive_external/emp2.csv
101,EName1,110.1
102,EName2,120.1
103,EName3,130.1
hive (default)> load data local inpath '/home/hduser/Desktop/hive_external/hive_external' overwrite into table emp_bucket;
Loading data to table default.emp_bucket
Table default.emp_bucket stats: [numFiles=1, numRows=0, totalSize=51, rawDataSize=0]
OK
Time taken: 1.437 seconds
hive (default)> select * from emp_bucket;
OK
emp_bucket.emp_id emp_bucket.emp_name emp_bucket.emp_salary
NULL NULL NULL
NULL NULL NULL
NULL NULL NULL
Time taken: 0.354 seconds, Fetched: 3 row(s)
hive (default)> show create table emp_bucket;
OK
createtab_stmt
CREATE TABLE `emp_bucket`(
`emp_id` int,
`emp_name` string,
`emp_salary` float)
CLUSTERED BY (
emp_id)
INTO 3 BUCKETS
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'hdfs://localhost:54310/user/hive/warehouse/emp_bucket'
TBLPROPERTIES (
'COLUMN_STATS_ACCURATE'='true',
'numFiles'='1',
'numRows'='0',
'rawDataSize'='0',
'totalSize'='51',
'transient_lastDdlTime'='1457967994')
Time taken: 0.801 seconds, Fetched: 22 row(s)
But when I populated the table with an INSERT statement instead, the data was inserted successfully.
hive (default)> select * from koushik.emp2;
OK
emp2.id emp2.name emp2.salary
101 EName1 110.1
102 EName2 120.1
103 EName3 130.1
Time taken: 0.266 seconds, Fetched: 3 row(s)
hive (default)> insert overwrite table emp_bucket select * from koushik.emp2;
Query ID = hduser_20160314080808_ae88f1c8-3db6-4a5c-99d2-e9a5312c597d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 3
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1457951378402_0002, Tracking URL = http://localhost:8088/proxy/application_1457951378402_0002/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1457951378402_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2016-03-14 08:09:33,203 Stage-1 map = 0%, reduce = 0%
2016-03-14 08:09:48,243 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 3.24 sec
2016-03-14 08:09:59,130 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 6.39 sec
2016-03-14 08:10:02,382 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 8.8 sec
2016-03-14 08:10:03,442 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.03 sec
MapReduce Total cumulative CPU time: 11 seconds 30 msec
Ended Job = job_1457951378402_0002
Loading data to table default.emp_bucket
Table default.emp_bucket stats: [numFiles=3, numRows=3, totalSize=51, rawDataSize=48]
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 11.03 sec HDFS Read: 12596 HDFS Write: 273 SUCCESS
Total MapReduce CPU Time Spent: 11 seconds 30 msec
OK
emp2.id emp2.name emp2.salary
Time taken: 103.027 seconds
hive (default)> select * from emp_bucket;
OK
emp_bucket.emp_id emp_bucket.emp_name emp_bucket.emp_salary
102 EName2 120.1
103 EName3 130.1
101 EName1 110.1
Time taken: 0.08 seconds, Fetched: 3 row(s)
So the question is: can't a bucketed Hive table be loaded from a file?

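You may have to enable bucketing enforcement before populating a bucketed table.
Set this property first and then load your data:
set hive.enforce.bucketing = true;
Note that this property affects INSERT statements; LOAD DATA only moves the file into the table directory and does not redistribute rows into buckets.
Comment here if it doesn't work.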

Apparently Hive does not support bucketing on external tables. Thus, instead of the LOAD DATA INPATH route, you apparently have to use INSERT OVERWRITE TABLE ... (cf. the Hadoop tutorial).
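Two details in the question's transcript can be illustrated with a small Python sketch. It is not Hive code, and the function names are made up for illustration. First, the SHOW CREATE TABLE output lists no field delimiter, so the sketch assumes LazySimpleSerDe falls back to its default separator (Ctrl-A, \x01); a comma-separated line then arrives as a single field, which may be why all three columns read as NULL. Second, it assumes the bucket rule of older Hive versions for an int column (the hash is the value itself, taken modulo the bucket count); newer versions may hash differently.

```python
from typing import Callable, Optional, Tuple

DEFAULT_FIELD_DELIM = "\x01"  # assumption: SerDe default, since the DDL sets no delimiter


def _convert(value: Optional[str], caster: Callable):
    """Return caster(value), or None when the value is missing or unparsable."""
    if value is None:
        return None
    try:
        return caster(value)
    except ValueError:
        return None


def parse_emp_row(line: str, delim: str = DEFAULT_FIELD_DELIM) -> Tuple:
    """Map one text line onto (emp_id int, emp_name string, emp_salary float)."""
    fields = line.rstrip("\n").split(delim)
    fields += [None] * (3 - len(fields))  # pad missing trailing columns
    emp_id, emp_name, emp_salary = fields[:3]
    return (_convert(emp_id, int), emp_name, _convert(emp_salary, float))


def bucket_for(emp_id: int, num_buckets: int = 3) -> int:
    """Assumed rule for int keys in older Hive: (hash & Integer.MAX_VALUE) % buckets."""
    return (emp_id & 0x7FFFFFFF) % num_buckets


if __name__ == "__main__":
    line = "101,EName1,110.1"
    print(parse_emp_row(line))        # parsed with the assumed default delimiter
    print(parse_emp_row(line, ","))   # parsed with a comma delimiter
    print({i: bucket_for(i) for i in (101, 102, 103)})
```

Under these assumptions, 102, 103 and 101 fall into buckets 0, 1 and 2, which is consistent with the row order returned after the INSERT OVERWRITE run and with numFiles=3, whereas LOAD DATA left a single file (numFiles=1).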

Related

InfluxQL time calculations return no records

I'd like to query InfluxDB using InfluxQL and exclude any rows from 0 to 5 minutes after the hour.
Seems pretty easy to do using the time field (the number of nanoseconds since the epoch) and a little modulus math. But the problem is that any WHERE clause with even the simplest calculation on time returns zero records.
How can I get what I need if I can't perform calculations on time? How can I exclude any rows from 0 to 5 minutes after the hour?
# Returns 10 records
SELECT * FROM "telegraf"."autogen"."processes" WHERE time > 0 LIMIT 10
# Returns 0 records
SELECT * FROM "telegraf"."autogen"."processes" WHERE (time/1) > 0 LIMIT 10
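The modulus math described above can at least be applied on the client after fetching the rows. The following is a minimal Python sketch, assuming each row is a dict whose "time" value is an integer count of nanoseconds since the epoch; the row layout and the function names are hypothetical. It drops rows whose offset into the hour lies in [0, 5) minutes.

```python
NS_PER_MINUTE = 60 * 1_000_000_000
NS_PER_HOUR = 60 * NS_PER_MINUTE


def outside_first_minutes(time_ns: int, minutes: int = 5) -> bool:
    """True when the timestamp is at least `minutes` past the start of its hour."""
    return time_ns % NS_PER_HOUR >= minutes * NS_PER_MINUTE


def drop_start_of_hour(rows, minutes: int = 5):
    """Keep only rows that are not within the first `minutes` of any hour."""
    return [row for row in rows if outside_first_minutes(row["time"], minutes)]


if __name__ == "__main__":
    base = 10 * NS_PER_HOUR  # an arbitrary full hour, for illustration only
    rows = [
        {"time": base + 2 * NS_PER_MINUTE, "running": 1},
        {"time": base + 5 * NS_PER_MINUTE, "running": 2},
        {"time": base + 59 * NS_PER_MINUTE, "running": 3},
    ]
    print(drop_start_of_hour(rows))
```

This filters after the query has run, so it does not reduce the amount of data transferred from the database.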

KsqlDB HOPPING window retention doesn't work

I am using ksqlDB version 0.14.0-rc732.
I declared the following query:
CREATE TABLE LIVE_TRAFFIC AS
select
devicemac,
sum(traffic -> bytesIn) AS bytes_in,
sum(traffic -> bytesOut) AS bytes_out
from FLAT_TRAFFIC
WINDOW HOPPING (SIZE 1 MINUTES, ADVANCE BY 1 MINUTES, RETENTION 15 MINUTES, GRACE PERIOD 1 MINUTES)
GROUP by devicemac;
But the 15-minute retention defined in the window clause doesn't seem to take effect.
Rows keep being added to the table and older ones are not removed.

Showing hourly average (histogram) in Grafana

Given a time series of (electricity) market data with one data point per hour, I want to show a bar graph of the all-time (or selected time frame) average for every hour of the day, so that an analyst can easily compare actual prices to the all-time averages (which hour of the day is most or least expensive).
We have CrateDB as the backend, which is used in Grafana just like a Postgres source.
SELECT
extract(HOUR from start_timestamp) as "time",
avg(marketprice) as value
FROM doc.el_marketprices
GROUP BY 1
ORDER BY 1
So my data basically looks like this:
time value
23.00 23.19
22.00 25.38
21.00 29.93
20.00 31.45
19.00 34.19
18.00 41.59
17.00 39.38
16.00 35.07
15.00 30.61
14.00 26.14
13.00 25.20
12.00 24.91
11.00 26.98
10.00 28.02
9.00 28.73
8.00 29.57
7.00 31.46
6.00 30.50
5.00 27.75
4.00 20.88
3.00 19.07
2.00 18.07
1.00 19.43
0 21.91
After hours of fiddling around with Bar Graphs, Histogram Mode, the Heatmap Panel and much more, I am just not able to draw a simple hours-of-the-day histogram with this in Grafana. I would very much appreciate any advice on how to use any panel to get this accomplished.
Your query doesn't return valid time series data for Grafana:
The time field is not a valid timestamp, so don't extract only the hour; provide the full start_timestamp (I hope it is a timestamp data type and the value is in UTC).
Add a WHERE time condition using Grafana's macro $__timeFilter.
Use Grafana's macro $__timeGroupAlias for hourly grouping.
SELECT
$__timeGroupAlias(start_timestamp,1h,0),
avg(marketprice) as value
FROM doc.el_marketprices
WHERE $__timeFilter(start_timestamp)
GROUP BY 1
ORDER BY 1
This will give you data for a historic graph with hourly average values.
The required histogram may be tricky, but you can try to create a metric that holds the extracted hour, e.g.
SELECT
$__timeGroupAlias(start_timestamp,1h,0),
extract(HOUR from start_timestamp) as "metric",
avg(marketprice) as value
FROM doc.el_marketprices
WHERE $__timeFilter(start_timestamp)
GROUP BY 1, 2
ORDER BY 1
And then visualize it as a histogram. Remember that Grafana is designed for time series data, so you need a proper timestamp (not only extracted hours; if necessary, you can fake it). Otherwise you will have a hard time visualizing non-time-series data in Grafana. This second query may not work as is, but it should at least give you the idea.
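The idea of faking a timestamp can be sketched outside of SQL. The Python below is an illustration only, assuming the prices arrive as (timezone-aware datetime, price) pairs; the function name and the choice of 1970-01-01 as the synthetic day are arbitrary. It averages prices per hour of day and places each average on the synthetic day, so the result has 24 points at most, each with a valid timestamp.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Arbitrary synthetic day used only to carry the hour of day as a real timestamp.
ANCHOR_DAY = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_of_day_averages(points, anchor_day=ANCHOR_DAY):
    """Return [(fake_timestamp, mean_price)] with one entry per hour of day seen."""
    totals = defaultdict(lambda: [0.0, 0])  # hour -> [sum of prices, count]
    for ts, price in points:
        bucket = totals[ts.astimezone(timezone.utc).hour]
        bucket[0] += price
        bucket[1] += 1
    return [
        (anchor_day.replace(hour=hour), total / count)
        for hour, (total, count) in sorted(totals.items())
    ]


if __name__ == "__main__":
    sample = [
        (datetime(2020, 1, 1, 0, tzinfo=timezone.utc), 20.0),
        (datetime(2020, 1, 2, 0, tzinfo=timezone.utc), 24.0),
        (datetime(2020, 1, 1, 1, tzinfo=timezone.utc), 19.0),
    ]
    for fake_ts, mean_price in hour_of_day_averages(sample):
        print(fake_ts.isoformat(), mean_price)
```

With timestamps like these, the dashboard time range would have to cover the synthetic day for the points to be displayed.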

Query result as a set of interval ranges in PostgreSQL (Rails)

I have a timestamp column for which I have to calculate the time difference from now and group the results into a set of intervals.
To get the time difference in hours, I have written this query:
result = ActiveRecord::Base.connection.exec_query("SELECT id,(EXTRACT(EPOCH FROM CURRENT_TIMESTAMP - image_retouch_items.created_at)/3600)::INTEGER AS latency FROM image_retouch_items WHERE status= 0;");
The result of my query is:
"id" "latency"
104 5928
106 5917
158 5751
162 5736
95 5940
85 5950
How can I get the result as a set of intervals (in hours)? For example, for each row whose time difference lies in the range 0-24 hours, the count for that interval should be incremented,
i.e.
interval count
0-24 2
24-48 3
48-72 0
How can I get that in a single query?
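The grouping being asked for comes down to integer division of the latency by the interval width. Here is a minimal Python sketch of that arithmetic, applied to the latency values shown above; the function name is hypothetical, the ranges are treated as half-open (a latency of exactly 24 falls into 24-48), and intervals with no rows are simply absent rather than reported with a count of 0.

```python
from collections import Counter


def latency_histogram(latencies_hours, width=24):
    """Count latencies per interval of `width` hours, keyed by a 'low-high' label."""
    counts = Counter(hours // width for hours in latencies_hours)
    return {
        f"{index * width}-{(index + 1) * width}": count
        for index, count in sorted(counts.items())
    }


if __name__ == "__main__":
    latencies = [5928, 5917, 5751, 5736, 5940, 5950]  # values from the query result
    for interval, count in latency_histogram(latencies).items():
        print(interval, count)
```

The same integer division could presumably be moved into the SQL itself with a GROUP BY on the divided value, which would keep everything in a single query.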

InfluxDB average of distinct count over time

Using InfluxDB v0.9, say I have this simple query:
select count(distinct("id")) FROM "main" WHERE time > now() - 30m and time < now() GROUP BY time(1m)
Which gives results like:
08:00 5
08:01 10
08:02 5
08:03 10
08:04 5
Now I want a query that produces points with an average of those values over 5 minutes. So the points are now 5 minutes apart, instead of 1 minute, but are an average of the 1 minute values. So the above 5 points would be 1 point with a value of the result of (5+10+5+10+5)/5.
For clarity: the following does not produce the result I am after, since it is just a distinct count over 5 minutes, whereas I want the average of the 1-minute counts.
select count(distinct("id")) FROM "main" WHERE time > now() - 30m and time < now() GROUP BY time(5m)
This doesn't work (gives errors):
select mean(distinct("id")) FROM "main" WHERE time > now() - 30m and time < now() GROUP BY time(5m)
Also doesn't work (gives error):
select mean(count(distinct("id"))) FROM "main" WHERE time > now() - 30m and time < now() GROUP BY time(5m)
In my actual usage "id" is a string (content, not a tag, because count distinct not supported for tags in my version of InfluxDB).
To clarify a few points for readers, in InfluxQL, functions like COUNT() and DISTINCT() can only accept fields, not tags. In addition, while COUNT() supports the nesting of the DISTINCT() function, most nested or sub-functions are not yet supported. In addition, nested queries, subqueries, or stored procedures are not supported.
However, there is a way to address your need using continuous queries, which are a way to automate the processing of data and writing those results back to the database.
First take your original query and make it a continuous query (CQ).
CREATE CONTINUOUS QUERY count_foo ON my_database_name BEGIN
SELECT COUNT(DISTINCT("id")) AS "1m_count" INTO main_1m_count FROM "main" GROUP BY time(1m)
END
There are other options for the CQ, but that basic one will wake up every minute, calculate the COUNT(DISTINCT("id")) for the prior minute, and then store that result in a new measurement, main_1m_count.
Now, you can easily calculate your 5 minute mean COUNT from the pre-calculated 1 minute COUNT results in main_1m_count:
SELECT MEAN("1m_count") FROM main_1m_count WHERE time > now() - 30m GROUP BY time(5m)
(Note that by default, InfluxDB uses epoch 0 and now() as the lower and upper time range boundaries, so it is redundant to include and time < now() in the WHERE clause.)
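The two-stage aggregation described in this answer (a distinct count per minute, then a mean of those counts per 5 minutes) can be checked on a small data set outside the database. The following Python sketch is only an illustration of the arithmetic, assuming points are given as (epoch seconds, id) pairs; the function name is hypothetical, and minutes without any points are ignored rather than counted as 0.

```python
from collections import defaultdict


def mean_of_minute_distinct_counts(points, window_minutes=5):
    """Return {window_start_epoch_seconds: mean of per-minute distinct id counts}."""
    # Stage 1: distinct ids per 1-minute bucket (what the continuous query stores).
    ids_per_minute = defaultdict(set)
    for epoch_seconds, point_id in points:
        ids_per_minute[epoch_seconds // 60].add(point_id)

    # Stage 2: mean of the 1-minute counts within each larger window.
    counts_per_window = defaultdict(list)
    for minute, ids in ids_per_minute.items():
        counts_per_window[minute // window_minutes].append(len(ids))

    return {
        window * window_minutes * 60: sum(counts) / len(counts)
        for window, counts in sorted(counts_per_window.items())
    }


if __name__ == "__main__":
    # Synthetic data reproducing the per-minute counts 5, 10, 5, 10, 5.
    points = []
    for minute, distinct in enumerate([5, 10, 5, 10, 5]):
        for n in range(distinct):
            points.append((minute * 60 + n, f"id-{n}"))
            points.append((minute * 60 + n + 1, f"id-{n}"))  # duplicate id, same minute
    print(mean_of_minute_distinct_counts(points))
```

For the five example minutes in the question, the single resulting window holds (5+10+5+10+5)/5.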

Resources