I have some eventlog data in HDFS that, in its raw format, looks like this:
2015-11-05 19:36:25.764 INFO [...etc...]
An external table points to this HDFS location:
CREATE EXTERNAL TABLE `log_stage`(
`event_time` timestamp,
[...])
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
For performance, we'd like to query this in Impala. The log_stage data is inserted into a Hive/Impala Parquet-backed table by executing a Hive query: INSERT INTO TABLE log SELECT * FROM log_stage. Here's the DDL for the Parquet table:
CREATE TABLE `log`(
`event_time` timestamp,
[...])
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
The problem: when queried in Impala, the timestamps are 7 hours ahead:
Hive time: 2015-11-05 19:36:25.764
Impala time: 2015-11-06 02:36:25.764
> as.POSIXct("2015-11-06 02:36:25") - as.POSIXct("2015-11-05 19:36:25")
Time difference of 7 hours
Note: The timezone of the servers (from /etc/sysconfig/clock) is set to "America/Denver" on every node, which is currently 7 hours behind UTC.
It seems that Impala is taking events that are already in UTC, incorrectly assuming they're in America/Denver time, and adding another 7 hours.
Do you know how to sync the times so that the Impala table matches the Hive table?
Hive writes timestamps to Parquet differently. You can use the impalad flag -convert_legacy_hive_parquet_utc_timestamps to tell Impala to do the conversion on read. See the TIMESTAMP documentation for more details.
This blog post has a brief description of the issue:
When Hive stores a timestamp value into Parquet format, it converts local time into UTC time, and when it reads data out, it converts back to local time. Impala, on the other hand, does no conversion when it reads the timestamp field out; hence, UTC time is returned instead of local time.
The impalad flag tells Impala to do the conversion when reading timestamps in Parquet produced by Hive. It does incur some small cost, so you should consider writing your timestamps with Impala if that is an issue for you (though it likely is minimal).
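If restarting impalad with that flag is not an option, a per-query workaround is to shift the stored value explicitly with from_utc_timestamp(); a minimal sketch, assuming the log table from the question and the America/Denver zone:

-- convert the UTC value Hive wrote back to local time at query time
SELECT from_utc_timestamp(event_time, 'America/Denver') AS event_time_local
FROM log;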
On a related note, as of Hive v1.2, you can also disable the timezone conversion behaviour with this flag:
hive.parquet.timestamp.skip.conversion
"Current Hive implementation of parquet stores timestamps to UTC, this flag allows skipping of the conversion on reading parquet files from other tools."
This was added in as part of https://issues.apache.org/jira/browse/HIVE-9482
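A minimal sketch of using it in a Hive session before reading a Parquet table that some other tool wrote (the table name is hypothetical; for Parquet written by Hive itself, Hive still applies its own conversion):

-- don't apply the local-time adjustment when reading Parquet files produced by other tools
SET hive.parquet.timestamp.skip.conversion=true;
SELECT event_time FROM log_written_by_impala LIMIT 10;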
Lastly, not a timezone issue exactly, but for compatibility between Spark (v1.3 and up) and Impala on Parquet files, there's this flag:
spark.sql.parquet.int96AsTimestamp
https://spark.apache.org/docs/1.3.1/sql-programming-guide.html#configuration
Other: https://issues.apache.org/jira/browse/SPARK-12297
Be VERY careful with the answers above due to https://issues.apache.org/jira/browse/IMPALA-2716
For now, the best workaround is not to use the TIMESTAMP data type at all and to store timestamps as strings.
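A minimal sketch of that workaround, with hypothetical table/column names; the string sorts and filters correctly as text, and can still be cast on read where real timestamp semantics are needed:

-- store the event time as a plain string so neither engine applies a timezone conversion
CREATE TABLE log_str (
  event_time STRING    -- e.g. '2015-11-05 19:36:25.764'
)
STORED AS PARQUET;

-- cast on read only where timestamp arithmetic is actually required
SELECT CAST(event_time AS TIMESTAMP) AS event_time_ts
FROM log_str;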
As mentioned in
https://docs.cloudera.com/documentation/enterprise/latest/topics/impala_timestamp.html
You can use --use_local_tz_for_unix_timestamp_conversions=true and --convert_legacy_hive_parquet_utc_timestamps=true to match Hive results.
The first flag makes Impala convert to the local timezone whenever you use a date/time function. You can set both as Impala Daemon startup options, as described in this document:
https://docs.cloudera.com/documentation/enterprise/5-6-x/topics/impala_config_options.html
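A quick way to see what the first flag governs (a sketch): from_unixtime() renders an epoch value as a string, and the flag controls whether that rendering is done in UTC (the default) or in the server's local time zone:

-- returns '1970-01-01 00:00:00' by default (UTC);
-- with --use_local_tz_for_unix_timestamp_conversions=true it is rendered in local time instead
SELECT from_unixtime(0) AS epoch_start;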
Related
I have a database in which the timestamps are all in UTC, but for just this one database I need to convert them over to CST for any (and all) timestamp fields.
There are 200 tables, so I don't have a mapping of every table/field that needs to be updated. Is there a way to do this without using convert_timezone or dateadd in every query written?
The database instance is set to CST, but that database is in UTC.
You would need to write a stored procedure that reads the ACCOUNT_USAGE.COLUMNS view, identifies the columns that have a date/timestamp data type, and then constructs a SQL statement for each table that updates the values using CONVERT_TIMEZONE.
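A minimal sketch of the statement-generation step (the database name and target zone are hypothetical); the generated UPDATEs would then be executed by the stored procedure or by hand:

-- build one UPDATE per timestamp column, converting the stored UTC values to CST
SELECT 'UPDATE ' || table_schema || '.' || table_name ||
       ' SET ' || column_name || ' = CONVERT_TIMEZONE(''UTC'', ''America/Chicago'', ' || column_name || ');' AS stmt
FROM snowflake.account_usage.columns
WHERE table_catalog = 'MY_DB'          -- hypothetical database name
  AND data_type LIKE 'TIMESTAMP%'
  AND deleted IS NULL;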
I have two tables, both of which have a timestamp field [TIMESTAMP_TZ], and when I perform a join based on this timestamp field, the plan in Snowflake shows an automatic conversion of these timestamps to LTZ. For example:
(TO_TIMESTAMP_LTZ(CAG.LOAD_DATE_UTC) = TO_TIMESTAMP_LTZ(PIT.CSAT_AGREEMENT_LDTS))
Any reason why this is happening?
TIMESTAMP_TZ means your timestamp carries a time zone, while TIMESTAMP_LTZ is a timestamp in your local time zone. The time zones of your two timestamps are probably different, so Snowflake converts both to your local time zone automatically in order to match them correctly.
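A small illustration (a sketch; the literals are hypothetical): two TIMESTAMP_TZ values with different offsets that denote the same instant compare as equal once both sides are normalized to LTZ, which is what the rewritten join predicate is doing.

-- both literals describe the same instant, expressed in different zones
SELECT TO_TIMESTAMP_LTZ('2015-11-05 19:36:25 -07:00')
     = TO_TIMESTAMP_LTZ('2015-11-06 02:36:25 +00:00') AS same_instant;   -- TRUE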
I have a table with a column named timestamp, of type TIMESTAMP, in BigQuery. When I display it in my console, I can see timestamps such as: 2015-10-19 21:25:35 UTC
I then query my table using the BigQuery API, and when I display the result of the query, I notice that the timestamp has been converted into some kind of very big number in string form, like 1.445289935E9.
Any idea how I can convert it back to a normal time? Something I can use in my Ruby code?
# the API returns seconds since the Unix epoch as a string; convert it to a Float and build a Time
Time.at("1.468768144014E9".to_f)
I have this error:
OCIError: ORA-00936: missing expression:
SELECT COUNT(*) AS count_all, date("T_ORDER_PRODUCT_ITEMS"."CREATED_AT") AS date_t_order_product_items_cre
FROM "T_ORDER_PRODUCT_ITEMS"
GROUP BY date("T_ORDER_PRODUCT_ITEMS"."CREATED_AT")
It works fine with group by id.
Locally I use SQLite and nothing goes wrong, but when I deploy to a server that uses Oracle, I get this error.
I use this query in SQLite:
Transaction::OrderProductItem.group('date(created_at)').count
SQLite has a DATE function, but not Oracle.
You need to use TO_DATE to convert from varchar to date:
SELECT COUNT(*) AS count_all,
       to_date("T_ORDER_PRODUCT_ITEMS"."CREATED_AT", 'DD-MM-YYYY') AS date_t_order_product_items_cre
       --                                            ^^^^^^^^^^^^
       --                                            your date format
FROM "T_ORDER_PRODUCT_ITEMS"
GROUP BY to_date("T_ORDER_PRODUCT_ITEMS"."CREATED_AT", 'DD-MM-YYYY');
--                                                     ^^^^^^^^^^^^
--                                                     your date format
While not strictly necessary, it is good practice to explicitly set the date format when using TO_DATE (note that the format mask goes in single quotes). Otherwise it defaults to the session's NLS date format, which depends on your locale configuration, and if that changes for whatever reason it will lead to difficult-to-track bugs.
SQLite does not have a proper data type for dates/times, but in Oracle it is always preferable to store dates directly in a DATE column. This avoids conversion on the fly while querying, not to mention that wrapping the column in a function prevents the use of any index you might have on it.
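With a real DATE column, the original grouping needs no string parsing at all; a sketch, assuming created_at is stored as DATE:

-- TRUNC strips the time-of-day portion, so rows group by calendar day
SELECT COUNT(*) AS count_all, TRUNC(created_at) AS created_day
FROM t_order_product_items
GROUP BY TRUNC(created_at);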
However, if you really need a cross-product solution, your best bet is probably to store your dates as strings in a lexicographically comparable date/time format, like ISO 8601, and do without a proper date type and/or date functions entirely. For simple searching/ordering this will work, but it will increase complexity when you need to extract date components or perform calculations.
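A sketch of that string-based approach, assuming created_at holds ISO 8601 strings such as '2015-11-05T19:36:25'; SUBSTR exists in both SQLite and Oracle, so the same query runs on both:

-- the first 10 characters of an ISO 8601 string are the calendar date
SELECT COUNT(*) AS count_all, SUBSTR(created_at, 1, 10) AS created_day
FROM t_order_product_items
GROUP BY SUBSTR(created_at, 1, 10);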
That being said, if it is a requirement for your project to be compatible with both SQLite and Oracle (and possibly other databases), it might be worth considering an abstraction layer between your code and the generated SQL (like SQLAlchemy's SQL Expression Language for Python; I'm pretty sure such things exist for Ruby as well).
INFORMIX-SE:
SE allows you to create an audit file for any table. The audit file has the same schema as the table being audited plus a header consisting of several columns, one of them being a_time INTEGER, which contains a UNIX timestamp of when the row was added, updated or deleted. UNIX timestamp is an INTEGER value corresponding to the number of seconds since 1970-01-01T00:00:00Z. Can anyone come up with an algorithm which can accurately convert these seconds into a date and time?
Here is an algorithm that shows you how to do it: http://forums.mrplc.com/index.php?showtopic=13294&view=findpost&p=65248
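For what it's worth, Informix Dynamic Server (not SE) has a built-in for exactly this conversion; a sketch with a hypothetical audit-file table name, in case the data ever ends up there:

-- DBINFO('utc_to_datetime', n) turns seconds-since-1970 into a DATETIME value (IDS only, not SE)
SELECT a_time, DBINFO('utc_to_datetime', a_time) AS changed_at
FROM audit_mytable;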