The EPL looks like this:
select
cast(a.ReportTime,date,dateformat:'yyyy-MM-dd') as ReportTime,
a.Source,
aa.RequestNum,
a.ServerTotal,
a.ServerSucc,
b.Total,
b.Succ,
NULL as DataChange_LastTime,
c.Response
from IntlTotalCountEvent.win:time_batch(2 min) as aa
inner join A.win:time_batch(2 min) as a on aa.ReportTime=a.ReportTime and aa.Source=a.Source
inner join B.win:time_batch(2 min) as b on a.ReportTime=b.ReportTime and a.Source=b.Source
inner join C.win:time_batch(1 min 30 sec) as c on a.ReportTime=c.ReportTime and a.Source=c.Source
Sometimes it works, but sometimes it doesn't, even though the ReportTime and Source fields carry the same data.
Batch windows start when an event arrives. Each batch window is therefore independent and un-aligned with respect to other batch windows. If you want to align the batch windows, use a context such as "create context Every2Min start @now end after 2 min" together with "win:keepall()" instead of "win:time_batch", as sketched below.
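A minimal sketch of that approach, reusing the stream and property names from the question above (only two of the streams are shown, and the 2-minute interval is illustrative):
create context Every2Min start @now end after 2 min

// all statements under the same context share the same 2-minute boundaries
context Every2Min
select aa.RequestNum, a.ServerTotal, a.ServerSucc
from IntlTotalCountEvent.win:keepall() as aa
inner join A.win:keepall() as a
  on aa.ReportTime = a.ReportTime and aa.Source = a.Source
output snapshot when terminated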
We are facing this strange issue with Hive on HDInsight 4.0 - Hive 3.1.0
By default, Hive is set to handle all tables as transactional.
We have 3 tables which join together:
a
b
c
Initially, two of those tables (b, c) were partitioned by year/month.
Now we have repartitioned them (b, c) by year/month/day, which generates around 200 partitions for each table.
Now, if we do a select from a join b join c, we get a transaction lock error.
However, if I do select a join b it works fine, and if I do select a join c it works fine.
Also, if I restrict the join clause for one of the newly partitioned tables to scan only one partition, like
Select a join b on 1=1 join c on 1=1 and c.YEAR=2019 AND c.MONTH=1
it also works fine.
It seems that the increased number of partitions the join has to scan (or something along those lines) is preventing Hive from taking a read lock on those tables... which is very strange, since the tables do not share anything except the same database.
Any ideas?
Full error:
java.sql.SQLException: Error while processing statement: FAILED: Error in acquiring locks: Error communicating with the metastore
at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:401)
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:266)
at com.hortonworks.hivestudio.hive.HiveJdbcConnectionDelegate.execute(HiveJdbcConnectionDelegate.java:56)
at com.hortonworks.hivestudio.hive.actor.StatementExecutor.runStatement(StatementExecutor.java:93)
at com.hortonworks.hivestudio.hive.actor.StatementExecutor.handleMessage(StatementExecutor.java:74)
at com.hortonworks.hivestudio.hive.actor.HiveActor.onReceive(HiveActor.java:45)
at akka.actor.UntypedAbstractActor$$anonfun$receive$1.applyOrElse(AbstractActor.scala:243)
at akka.actor.Actor.aroundReceive(Actor.scala:514)
at akka.actor.Actor.aroundReceive$(Actor.scala:512)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:132)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:527)
at akka.actor.ActorCell.invoke(ActorCell.scala:496)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
This simple query is timing out; any ideas how to optimise it using some BigQuery tricks?
SELECT
s.typeFlight s_type, r.distance, r.price, (d.booking_token IS NULL) clicked
FROM [search.searches] s
LEFT JOIN [search.search_results] r ON r.searchid=s.searchid
LEFT JOIN [search.clicks] d ON d.booking_token=r.booking_token
WHERE s.saved_at BETWEEN TIMESTAMP('2016-03-01 00:00:00')
AND TIMESTAMP('2016-03-05 00:00:00')
Query settings:
Query Priority: Batch
Destination Table: bucket-984:search.result
Write Preference: Overwrite table
Allow Large Results: true
The data comes from a search engine, so the clicks table is small (under a million rows), but the searches and search_results tables are huge. The query processes about 5 TB of data.
You could push the where filtering into the first select so there's less data to join:
SELECT
s.typeFlight s_type, r.distance, r.price, (d.booking_token IS NULL) clicked
FROM (
SELECT typeFlight, searchid
FROM [search.searches]
WHERE saved_at BETWEEN TIMESTAMP('2016-03-01 00:00:00')
AND TIMESTAMP('2016-03-05 00:00:00')
) s
LEFT JOIN [search.search_results] r ON r.searchid=s.searchid
LEFT JOIN [search.clicks] d ON d.booking_token=r.booking_token
Sometimes it is helpful to look at the Query Plan Explanation https://cloud.google.com/bigquery/query-plan-explanation to see where your query is spending time.
Let's consider a simple object with the same representation in a SQL database, with properties (columns): Id, UserId, Ip.
I would like to prepare a query that generates an event when one user logs in from 2 (or more) IP addresses within a 1-hour period.
My SQL looks like:
SELECT id,user_id,ip FROM w_log log
LEFT JOIN
(SELECT user_id, count(distinct ip) AS ip_count FROM w_log GROUP BY user_id) ips
ON log.user_id = ips.user_id
WHERE ips.ip_count > 1
Transformation to EPL:
SELECT * FROM LogEntry.win:time(1 hour) logs LEFT INNER join
(select UserId,count(distinct Ip) as IpCount FROM LogEntry.win:time(1 hour)) ips
ON logs.UserId = ips.UserId where ips.IpCount>1
Exception:
Additional information: Incorrect syntax near '(' at line 1 column 100,
please check the outer join within the from clause near reserved keyword 'select'
UPDATE:
I was successfully able to create a schema and a named window, and to insert data into it (or update it). I would like to increase the counter when a new LogEvent arrives in the win:time(10 seconds) window and decrease it when the event leaves the 10-second window. Unfortunately, istream() doesn't seem to return true/false depending on whether the event is in the insert or the remove stream.
create schema IpCountRec as (ip string, hitCount int)
create window IpCountWindow.win:time(10 seconds) as IpCountRec
on LogEvent.win:time(10 seconds) log
merge IpCountWindow ipc
where ipc.ip = log.ip
when matched and istream()
then update set hitCount = hitCount + 1
when matched and not istream()
then update set hitCount = hitCount - 1
when not matched
then insert select ip, 1 as hitCount
Is there something I missed?
In EPL I don't think it is possible to put a subquery into the from-clause. You can restructure this using "insert into". A named window or a table is also an EPL alternative.
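For example, a sketch of the insert-into alternative for the original requirement, assuming the LogEntry event type with UserId and Ip properties as used in the question (the statement split and the UserIpCount name are illustrative):
// count distinct IPs per user over the last hour and publish into a derived stream
insert into UserIpCount
select UserId, count(distinct Ip) as IpCount
from LogEntry.win:time(1 hour)
group by UserId

// fire whenever a user has been seen from more than one IP within the hour
select * from UserIpCount where IpCount > 1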
I have a nightly job that runs and computes some data in Hive. It is partitioned by day.
Fields:
id bigint
rank bigint
Yesterday
output/dt=2013-10-31
Today
output/dt=2013-11-01
I am trying to figure out if there is an easy way to get the incremental changes between today and yesterday.
I was thinking about doing a left outer join, but I am not sure what that looks like since it's the same table.
This is what it might look like if they were different tables:
SELECT * FROM a LEFT OUTER JOIN b
ON (a.id=b.id AND a.dt='2013-11-01' and b.dt='2013-10-31' ) WHERE a.rank!=B.rank
But on the same table it would be something like:
SELECT * FROM a LEFT OUTER JOIN a
ON (a.id=a.id AND a.dt='2013-11-01' and a.dt='2013-10-31' ) WHERE a.rank!=a.rank
Suggestions?
This would work
SELECT a.*
FROM A a LEFT OUTER JOIN A b ON a.id = b.id
WHERE a.dt='2013-11-01' AND b.dt='2013-10-31' AND <your-rank-conditions>;
This is efficient: it spans only one MapReduce job.
So I figured it out, using subqueries and joins:
select * from (select * from table where dt='2013-11-01') a
FULL OUTER JOIN
(select * from table where dt='2013-10-31') b
on (a.id=b.id)
where a.rank!=b.rank or a.rank is null or b.rank is null
The above will give you the diff.
You can take the diff and figure out what you need to ADD/UPDATE/DELETE (see the sketch after these rules):
UPDATE if a.rank is not null and b.rank is not null, i.e. the rank changed
DELETE if a.rank is null and b.rank is not null, i.e. the user is no longer ranked
ADD if a.rank is not null and b.rank is null, i.e. this is a new user
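If it helps, the same three rules can also be expressed directly in the query with a CASE expression. This is only a sketch; the table and column names follow the query above ("table" is the placeholder name used there), and change_type is a made-up label column:
select
  coalesce(a.id, b.id) as id,
  case
    when a.rank is not null and b.rank is not null then 'UPDATE'  -- rank changed
    when a.rank is null and b.rank is not null then 'DELETE'      -- no longer ranked
    when a.rank is not null and b.rank is null then 'ADD'         -- new user
  end as change_type
from (select * from table where dt='2013-11-01') a
full outer join (select * from table where dt='2013-10-31') b
  on a.id = b.id
where a.rank != b.rank or a.rank is null or b.rank is null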
The problem is pretty simple: extract only the not-exists records from 2 different streams using the Esper engine.
The ID exists in streamA but does NOT EXIST in streamB.
In SQL it would look like this:
SELECT *
FROM tableA
WHERE NOT EXISTS (SELECT *
FROM tableB
WHERE tableA.Id = tableB.Id)
I've tried it Esper style but it doesn't work:
SELECT *
FROM streamA.win:ext_timed(timestamp, 5 seconds) as stream_A
WHERE NOT EXISTS
(SELECT stream_B.Id
FROM streamB.win:ext_timed(timestamp, 5 seconds) as stream_B
WHERE stream_A.Id = stream_B.Id)
Sadly, if stream_A.Id is inserted before stream_B.Id, it satisfies the query conditions and the query doesn't work as intended.
Any suggestions on how to identify "ID exists in streamA but does not exist in streamB" using Esper?
One simple way is to time-order the streams, so that A and B events are timestamp-ordered before sending them in.
Or you could delay A, as in this query:
select * from pattern [every a=streamA -> timer:interval(1 sec)] as delayed_a
where not exists (... where delayed_a.a.id = b.id)
There is no need for an externally timed window for streamA. For externally timed behavior in general use external timer events.
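A minimal sketch of what the complete delayed-pattern statement might look like, assuming both event types carry an Id property as in the question (the 5-second window and 1-second delay are illustrative only):
// delay each streamA event by 1 second, then check whether a matching streamB event has arrived
select delayed_a.a.Id as Id
from pattern [every a=streamA -> timer:interval(1 sec)].win:time(5 sec) as delayed_a
where not exists
  (select b.Id
   from streamB.win:time(5 sec) as b
   where delayed_a.a.Id = b.Id)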