I was reading through the Snowflake documentation and haven't found a solution yet, so I'm coming to you. I have a table in Snowflake with a variant column where I store JSON data. Do you know of a way to dynamically convert the results of a query on a variant column to a tabular format?
For example, I have a query like
select json_data from database.schema.table limit 2
Which would return something like
JSON_DATA
{"EventName": "Test", "EventValue": 100}
{"EventName": "Test", "EventValue": 200}
Is there a way to return it as a table without having to reference the keys? I know I can do
select
json_data['EventName'] EventName,
json_data['EventValue'] EventValue
from
database.schema.table
But I am looking for something more dynamic like
select * from table(json_to_table(select json_data from database.schema.table)) limit 2
That could return
EventName | EventValue
Test      | 100
Test      | 200
I'm looking for any internal solution (stored procedures, UDFs, Snowflake functions I might have missed...anything except external functions).
While there's no way to create dynamic column lists currently, you can run a stored procedure to build (and rebuild) a view. This avoids having to manually type and maintain a long list of columns.
After creating the SP at the bottom, you can use it like this:
create or replace table MY_TABLE(JSON_DATA variant);
insert into MY_TABLE select parse_json('{"EventName": "Test", "EventValue": 100}');
insert into MY_TABLE select parse_json('{"EventName": "Test", "EventValue": 200}');
call create_view_over_json('MY_TABLE', 'JSON_DATA', 'MY_VIEW');
select * from MY_VIEW;
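For the two sample rows above, the generated view should return the data in tabular form, along these lines (EventValue comes back as FLOAT because the procedure maps integer attributes to FLOAT):
EVENTNAME | EVENTVALUE
Test      | 100
Test      | 200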
Here is the stored procedure that creates the view. Note that if the table is very large, the query using Snowflake's TYPEOF() function will take quite a while to determine the column types. If the structure is known to be consistent, you can point the procedure at a sample table, or at one created with a limit 1000.
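For example, a sketch of that sampling approach (the table names here are illustrative):
create or replace table MY_SAMPLE as select JSON_DATA from MY_BIG_TABLE limit 1000;
call create_view_over_json('MY_SAMPLE', 'JSON_DATA', 'MY_VIEW');
Note that the generated view will select from MY_SAMPLE; once the column list looks right, point the view at the full table by editing the generated DDL or rerunning the procedure against it.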
create or replace procedure create_view_over_json (TABLE_NAME varchar, COL_NAME varchar, VIEW_NAME varchar)
returns varchar
language javascript
as
$$
/****************************************************************************************************************
 *
 * CREATE_VIEW_OVER_JSON - Craig Warman, Alan Eldridge and Greg Pavlik, Snowflake Computing, 2019, 2020, 2021
 *
 * This stored procedure creates a view on a table that contains JSON data in a column
 * of type VARIANT. It can be used for easily generating views that enable access to
 * this data for BI tools without the need for manual view creation based on the
 * underlying JSON document structure.
 *
 * Parameters:
 * TABLE_NAME - Name of the table that contains the semi-structured data.
 * COL_NAME   - Name of the VARIANT column in the aforementioned table.
 * VIEW_NAME  - Name of the view to be created by this stored procedure.
 *
 * Usage Example:
 * call create_view_over_json('db.schema.semistruct_data', 'variant_col', 'db.schema.semistruct_data_vw');
 *
 * Important notes:
 * - This is the "basic" version of a more sophisticated procedure. Its primary purpose
 *   is to illustrate the view generation concept.
 * - This version of the procedure does not support:
 *   - Column case preservation (all view column names will be case-insensitive).
 *   - JSON document attributes that are SQL reserved words (like TYPE or NUMBER).
 *   - "Exploding" arrays into separate view columns - instead, arrays are simply
 *     materialized as view columns of type ARRAY.
 * - Execution of this procedure may take an extended period of time for very
 *   large datasets, or for datasets with a wide variety of document attributes
 *   (since the view will have a large number of columns).
 *
 * Attribution:
 * I leveraged code developed by Alan Eldridge as the basis for this stored procedure.
 *
 ****************************************************************************************************************/
var currentActivity;
try{
currentActivity = "building the query for column types";
var elementQuery = GetElementQuery(TABLE_NAME, COL_NAME);
currentActivity = "running the query to get column names";
var elementRS = GetResultSet(elementQuery);
currentActivity = "building the column list";
var colList = GetColumnList(elementRS);
currentActivity = "building the view's DDL";
var viewDDL = GetViewDDL(VIEW_NAME, colList, TABLE_NAME);
currentActivity = "creating the view";
return ExecuteSingleValueQuery("status", viewDDL);
}
catch(err){
return "ERROR: Encountered an error while " + currentActivity + ".\n" + err.message;
}
/****************************************************************************************************************
 *
 * End of main function. Helper functions below.
 *
 ****************************************************************************************************************/
function GetElementQuery(tableName, columnName){
// Build a query that returns a list of elements which will be used to build the column list for the CREATE VIEW statement
var sql =
`
SELECT DISTINCT regexp_replace(regexp_replace(f.path,'\\\\[(.+)\\\\]'),'(\\\\w+)','\"\\\\1\"') AS path_name, -- This generates paths with levels enclosed by double quotes (ex: "path"."to"."element"). It also strips any bracket-enclosed array element references (like "[0]")
DECODE (substr(typeof(f.value),1,1),'A','ARRAY','B','BOOLEAN','I','FLOAT','D','FLOAT','STRING') AS attribute_type, -- This generates column datatypes of ARRAY, BOOLEAN, FLOAT, and STRING only
REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\\\[(.+)\\\\]'),'[^a-zA-Z0-9]','_') AS alias_name -- This generates column aliases based on the path
FROM
#~TABLE_NAME~#,
LATERAL FLATTEN(#~COL_NAME~#, RECURSIVE=>true) f
WHERE TYPEOF(f.value) != 'OBJECT'
AND NOT contains(f.path, '['); -- This prevents traversal down into arrays
`;
sql = sql.replace(/#~TABLE_NAME~#/g, tableName);
sql = sql.replace(/#~COL_NAME~#/g, columnName);
return sql;
}
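/* Illustration: for the sample MY_TABLE data above, this query returns rows like
     PATH_NAME      ATTRIBUTE_TYPE   ALIAS_NAME
     "EventName"    STRING           EventName
     "EventValue"   FLOAT            EventValue
*/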
function GetColumnList(elementRS){
/*
Add elements and datatypes to the column list
They will look something like this when added:
col_name:"name"."first"::STRING as name_first,
col_name:"name"."last"::STRING as name_last
*/
var col_list = "";
while (elementRS.next()) {
if (col_list != "") {
col_list += ", \n";
}
col_list += COL_NAME + ":" + elementRS.getColumnValue("PATH_NAME"); // Start with the element path name
col_list += "::" + elementRS.getColumnValue("ATTRIBUTE_TYPE"); // Add the datatype
col_list += " as " + elementRS.getColumnValue("ALIAS_NAME"); // And finally the element alias
}
return col_list;
}
function GetViewDDL(viewName, columnList, tableName){
var sql =
`
create or replace view #~VIEW_NAME~# as
select
#~COLUMN_LIST~#
from #~TABLE_NAME~#;
`;
sql = sql.replace(/#~VIEW_NAME~#/g, viewName);
sql = sql.replace(/#~COLUMN_LIST~#/g, columnList);
sql = sql.replace(/#~TABLE_NAME~#/g, tableName);
return sql;
}
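/* Illustration: for the sample data above, the generated DDL looks like
     create or replace view MY_VIEW as
     select
     JSON_DATA:"EventName"::STRING as EventName,
     JSON_DATA:"EventValue"::FLOAT as EventValue
     from MY_TABLE;
*/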
/****************************************************************************************************************
 *
 * Library functions
 *
 ****************************************************************************************************************/
function ExecuteSingleValueQuery(columnName, queryString) {
// Execute a query and return the value of the named column in the first row
var stmt = snowflake.createStatement({sqlText: queryString});
var rs = stmt.execute();
rs.next();
return rs.getColumnValue(columnName);
}
function GetResultSet(sql){
// Execute the query and return the result set so the caller can iterate over it
var stmt = snowflake.createStatement({sqlText: sql});
return stmt.execute();
}
$$;
I have several sheets that import various scores based on file reviews for different areas. I want to calculate an office average for those offices that have had more than one review in each period, but there's no way to tell ahead of time which offices will have more than one, so in each list there could be
office 1 score
office 2 score
office 2 score
office 3 score
Etc.
Is there a way to automate this, e.g. find the duplicates and average them, or do I have to look through after the imports and do it by hand?
Cheers :)
Meg
You can use the query function in Sheets.
Put this in cell E1:
=query(A:C,"select B,avg(C) where B is not null group by B label avg(C) 'Office average' ",1)
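This assumes the office names are in column B and the scores in column C: the query groups the rows by office and averages each group's scores, so offices with a single review simply show that review's score as their average.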
function getDataSubset() {
const ss = SpreadsheetApp.getActiveSpreadsheet()
const ssId = ss.getId();
const sheet = ss.getSheetByName('contact')
const sheetName = sheet.getName()
const lastRow = sheet.getLastRow();
const lastCol = sheet.getLastColumn();
// let theQuery = "SELECT * WHERE job ='a job'" // works
let theQuery = "SELECT A, B WHERE E ='a job' AND G > 30" //works
// with header row - result is an array of objects if header row is specified
// const a1Range = sheet.getDataRange().getA1Notation();
// let result = Utils.gvizQuery(
// ssId // YOUR_SPREADSHEET_ID
// ,theQuery
// ,sheetName // can be a number (the sheetId), or the name of the sheet; if not needed, but headers are, pass in undefined
// ,a1Range // specify range, ex: `A2:O`
// ,1 // HEADER_ROW_INDEX_IF_NEEDED> - always a number
// );
// no header row - result is an array of arrays
const a1Range = sheet.getRange(2, 1, lastRow - 1, lastCol).getA1Notation(); // data starts at row 2, so there are lastRow - 1 data rows
let result = Utils.gvizQuery(
ssId // YOUR_SPREADSHEET_ID
,theQuery
,sheetName // can be a number (the sheetId), or the name of the sheet; if not needed, but headers are, pass in undefined
,a1Range // specify range, ex: `A2:O`
// HEADER_ROW_INDEX_IF_NEEDED> - always a number
);
console.log( JSON.stringify(result) );
}
/**
* https://stackoverflow.com/questions/51327982/how-to-use-google-sheets-query-or-google-visualization-api-from-apps-script/51328419#51328419
*/
(function(context) {
const Utils = (context.Utils || (context.Utils = {}));
/**
* Queries a spreadsheet using the Google Visualization API's datasource URL.
*
* @param {String} ssId Spreadsheet ID.
* @param {String} query Query string.
* @param {String|Number} sheetId Sheet ID (gid if number, name if string). [OPTIONAL]
* @param {String} range Range. [OPTIONAL]
* @param {Number} headers Header rows. [OPTIONAL]
*/
Utils.gvizQuery = function(ssId, query, sheetId, range, headers) {
var response = JSON.parse( UrlFetchApp
.fetch(
Utilities.formatString(
"https://docs.google.com/spreadsheets/d/%s/gviz/tq?tq=%s%s%s%s",
ssId,
encodeURIComponent(query),
(typeof sheetId === "number") ? "&gid=" + sheetId :
(typeof sheetId === "string") ? "&sheet=" + sheetId :
"",
(typeof range === "string") ? "&range=" + range :
"",
"&headers=" + ((typeof headers === "number" && headers > 0) ? headers : "0")
),
{
"headers":{
"Authorization":"Bearer " + ScriptApp.getOAuthToken()
}
}
)
.getContentText()
.replace("/*O_o*/\n", "") // remove JSONP wrapper
.replace(/(google\.visualization\.Query\.setResponse\()|(\);)/gm, "") // remove JSONP wrapper
),
table = response.table,
rows;
if (typeof headers === "number") {
rows = table.rows.map(function(row) {
return table.cols.reduce(
function(acc, col, colIndex) {
acc[col.label] = row.c[colIndex] && row.c[colIndex].v;
return acc;
},
{}
);
});
} else {
rows = table.rows.map(function(row) {
return row.c.reduce(
function(acc, col) {
acc.push(col && col.v);
return acc;
},
[]
);
});
}
return rows;
};
Object.freeze(Utils);
})(this);
Currently, my script is logging values in column E based on the position of the last input in columns A and B. Is there a way to prevent these gaps?
function archiveForecast() { // wrapper added so the snippet is complete; the name is illustrative
var sss = SpreadsheetApp.openById('sampleID');
var ss = sss.getSheetByName('Forecast data');
var range = ss.getRange('B126');
const now = new Date();
const data = range.getValues().map(row => row.concat(now));
var tss = SpreadsheetApp.openById('sampleID2');
var ts = tss.getSheetByName('Archived Data');
ts.getRange(ts.getLastRow()+1, 5,1,2).setValues(data);
}
Try something like this:
ts.getRange(getLastRow_(ts, 5) + 1, 5, 1, 2).setValues(data);
Here's a copy of the getLastRow_() function:
/**
* Gets the position of the last row that has visible content in a column of the sheet.
* When column is undefined, returns the last row that has visible content in any column.
*
* @param {Sheet} sheet A sheet in a spreadsheet.
* @param {Number} columnNumber Optional. The 1-indexed position of a column in the sheet.
* @return {Number} The 1-indexed row number of the last row that has visible content.
*/
function getLastRow_(sheet, columnNumber) {
// version 1.5, written by --Hyde, 4 April 2021
const values = (
columnNumber
? sheet.getRange(1, columnNumber, sheet.getLastRow() || 1, 1)
: sheet.getDataRange()
).getDisplayValues();
let row = values.length - 1;
while (row && !values[row].join('')) row--;
return row + 1;
}
An alternative way to find the last row is via filter().
Code:
// Sample data to be inserted
const data = [[2.4, '5/5/2021']];
var tss = SpreadsheetApp.openById('sampleID2');
var ts = tss.getSheetByName('Archived Data');
// get values on column E and filter the cells with values and get their length
var column = ts.getRange("E1:E").getValues();
var lastRow = column.filter(String).length;
ts.getRange(lastRow + 1, 5, 1, 2).setValues(data);
Note:
This approach works when the column has no blank cells in between. If you skip a cell, the lastRow will not be calculated properly and data might be overwritten: for example, if E1:E5 has values but E3 is blank, filter(String).length returns 4, so the next write lands on row 5 and overwrites E5. As long as you do not have gaps in your column, this will be fine.
Resource:
Determining the last row in a single column
We are using DB2 for iSeries V7R3 on an AS400 system.
In one of our stored procedures, we prepare dynamic SQL queries. Each SQL query is assigned to a different variable. When we execute the stored procedure it sometimes fails, but when we retry with the same parameters it works.
After putting logging into the stored procedure, we observed that in the failed cases the value used for variable 2 came from variable 1. However, sometimes it uses select * for variable 1 as well; after a retry it works OK.
The stored procedure and the logged output are below. We'd appreciate any help on this; we are running out of ideas.
CREATE PROCEDURE (
) DYNAMIC RESULT SETS 1
LANGUAGE SQL
SPECIFIC SYMDTA.PRC_RETRIEVE_CLAIM_LIST
NOT DETERMINISTIC
MODIFIES SQL DATA
CALLED ON NULL INPUT
COMMIT ON RETURN YES
CONCURRENT ACCESS RESOLUTION USE CURRENTLY COMMITTED
SET OPTION ALWBLK = *ALLREAD ,
ALWCPYDTA = *OPTIMIZE ,
COMMIT = *NONE ,
DECRESULT = (31, 31, 00) ,
DYNDFTCOL = *NO ,
DYNUSRPRF = *USER ,
SRTSEQ = *HEX
BEGIN
DECLARE DATACLAIM CLOB ( 1048576 ) DEFAULT ' ' ;
DECLARE GCLAIMCOUNT CLOB ( 1048576 ) DEFAULT ' ' ;
DECLARE CR_CLAIM_LIST_STMT CURSOR WITH HOLD FOR CLM_DATA_STMT ;
DECLARE CR_CLAIM_COUNT_STMT CURSOR WITH HOLD FOR CLM_COUNT_STMT ;
SET DATACLAIM = 'SELECT * FROM table ';
SET GCLAIMCOUNT = 'select count(*) from table';
INSERT INTO DEBUGGING_DYNAMIC_QUERIES VALUES ( POLICY_NO , DATACLAIM , CURRENT TIMESTAMP , 'DATACLAIM' ) ;
INSERT INTO DEBUGGING_DYNAMIC_QUERIES VALUES ( GCLAIMCOUNT , CURRENT TIMESTAMP , 'GCLAIMCOUNT' ) ;
PREPARE CLM_DATA_STMT FROM DATACLAIM ;
OPEN CR_CLAIM_LIST_STMT ;
PREPARE CLM_COUNT_STMT FROM GCLAIMCOUNT ;
OPEN CR_CLAIM_COUNT_STMT ;
FETCH CR_CLAIM_COUNT_STMT INTO TOTAL_RECORDS_G4 ;
CLOSE CR_CLAIM_COUNT_STMT ;
Output of the debug table:

Wrong:
DATACLAIM = "select * " - 2020-01-01 11:00 AM
GCLAIMCOUNT = "Select * " - 2020-01-01 11:01 AM

After retry:
DATACLAIM = "select * " - 2020-01-01 12:00 PM
GCLAIMCOUNT = "Select count(*) " - 2020-01-01 12:01 PM
I have an iOS project and data is written into an SQLite Database. For example, 'OBJECTROWID' in a table LDOCLINK stores info about a linked document.
OBJECTROWID starts off as a string with the format <3d98f71f 3cd9415b a978c010 b1cef941> but is cast to (NSData *) before being inserted into the database. The actual handling of the database insertion was written by a much more experienced programmer than myself. Anyway, the database displays the OBJECTROWID column in the form X'3D98F71F3CD9415BA978C010B1CEF941'. I am a complete beginner with SQLite queries and cannot seem to return the correct row by using the WHERE clause with OBJECTROWID = or OBJECTROWID like.
SELECT * FROM LDOCLINK WHERE OBJECTROWID like '%';
gives all the rows (obviously) but I want the row where OBJECTROWID equals <3d98f71f 3cd9415b a978c010 b1cef941>. I have tried the following and none of them work:
SELECT * FROM LDOCLINK WHERE OBJECTROWID = 'X''3d98f71f3cd9415ba978c010b1cef941' (no error; I thought I was escaping the single quote that appears after the X, but this didn't work)
SELECT * FROM LDOCLINK WHERE OBJECTROWID like '%<3d98f71f 3cd9415b a978c010 b1cef941>%'
I cannot even get a match for two adjacent characters such as the initial 3D:
SELECT * FROM LDOCLINK WHERE OBJECTROWID like '%3d%' (no error reported, but it doesn't return anything)
SELECT * FROM LDOCLINK WHERE OBJECTROWID like '%d%' This is the strangest result as it returns ONLY the two rows that DON'T include my <3d98f71f 3cd9415b a978c010 b1cef941>, seemingly arbitrarily.
SELECT * FROM LDOCLINK WHERE OBJECTTYPE = '0' returns these same rows, just to illustrate that the interface works (SQLite Manager).
I also checked out this question and this one but I still could not get the correct query.
Please help me to return the correct row (actually two rows in this case - the first and third).
EDIT:
The code that writes to the database involves many classes. The method shown below is, I think, the main part of the serialisation (case 8 is the relevant one).
-(void)serializeValue:(NSObject*)value ToBuffer:(NSMutableData*)buffer
{
switch (self.propertyTypeID) {
case 0:
{
SInt32 length = 0;
if ( (NSString*)value )
{
/*
NSData* data = [((NSString*)value) dataUsingEncoding:NSUnicodeStringEncoding];
// first 2 bytes are unicode prefix
length = data.length - 2;
[buffer appendBytes:&length length:sizeof(SInt32)];
if ( length > 0 )
[buffer appendBytes:([data bytes]+2) length:length];
*/
NSData* data = [((NSString*)value) dataUsingEncoding:NSUTF8StringEncoding];
length = data.length;
[buffer appendBytes:&length length:sizeof(SInt32)];
if ( length > 0 )
[buffer appendBytes:([data bytes]) length:length];
}
else
[buffer appendBytes:&length length:sizeof(SInt32)];
}
break;
//depends on the realisation of DB serialisation
case 1:
{
Byte b = 0;
if ( (NSNumber*)value )
b = [(NSNumber*)value boolValue] ? 1 : 0;
[buffer appendBytes:&b length:1];
}
break;
//........
case 8:
{
int length = 16;
[buffer appendBytes:[(NSData*)value bytes] length:length];
}
break;
default:
break;
}
}
So, as pointed out by Tom Kerr, this post answered my question. Almost. The syntax wasn't exactly right. The suggested form was SELECT * FROM LDOCLINK WHERE OBJECTROWID.Id = X'a8828ddfef224d36935a1c66ae86ebb3'; but I actually had to drop the .Id part, making:
SELECT * FROM LDOCLINK WHERE OBJECTROWID = X'3d98f71f3cd9415ba978c010b1cef941';
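A sketch of an alternative that avoids the blob literal syntax is to compare against SQLite's built-in hex() function (it returns uppercase hex, so the comparison string must be uppercase):
SELECT * FROM LDOCLINK WHERE hex(OBJECTROWID) = '3D98F71F3CD9415BA978C010B1CEF941';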
So I am playing with LPeg to replace a Boost.Spirit grammar. I must say boost::spirit is far more elegant and natural than LPeg, but it is a bitch to work with due to the constraints of current C++ compiler technology and the issues of template metaprogramming in C++. The type mechanism is, in this case, your enemy rather than your friend. LPeg, on the other hand, while ugly and basic, results in more productivity.
Anyway, I digress. Part of my LPeg grammar looks as follows:
function get_namespace_parser()
local P, R, S, C, V =
lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.V
namespace_parser =
lpeg.P{
"NAMESPACE";
NAMESPACE = V("WS") * P("namespace") * V("SPACE_WS") * V("NAMESPACE_IDENTIFIER")
* V("WS") * V("NAMESPACE_BODY") * V("WS"),
NAMESPACE_IDENTIFIER = V("IDENTIFIER") / print_string ,
NAMESPACE_BODY = "{" * V("WS") *
V("ENTRIES")^0 * V("WS") * "}",
WS = S(" \t\n")^0,
SPACE_WS = P(" ") * V("WS")
}
return namespace_parser
end
This grammar (although incomplete) matches the following: namespace foo {}. I'd like to achieve the following semantics (which are common use cases with Boost.Spirit):
Create a local variable for the namespace rule.
Add a namespace data structure to this local variable when namespace IDENTIFIER { has been matched.
Pass the newly created namespace data structure to the NAMESPACE_BODY for further construction of the AST... so on and so forth.
I am sure this use case is achievable, but no examples show it, and I don't know the language or the library well enough to figure out how to do it. Can someone show the syntax for it?
Edit: After a few days of trying to dance with LPeg, and getting my feet trodden on, I have decided to go back to Spirit :D It is clear that LPeg is meant to be woven together with Lua functions and that such weaving is very free-form (whereas Spirit has clear, very well documented semantics). I simply do not have the right mental model of Lua yet.
Though "Create a local variable for the namespace rule" sounds disturbingly like "context-sensitive grammar", which is not really for LPEG, I will assume that you want to build an abstract syntax tree.
In Lua, an AST can be represented as a nested table (with named and indexed fields) or a closure, doing whatever task that tree is meant to do.
Both can be produced by a combination of nested LPEG captures.
I will limit this answer to AST as a Lua table.
The most useful LPEG captures in this case will be:
lpeg.C( pattern ) -- simple capture,
lpeg.Ct( pattern ) -- table capture,
lpeg.Cg( pattern, name ) -- named group capture.
The following example based on your code will produce a simple syntax tree as a Lua table:
local lpeg = require'lpeg'
local P, V = lpeg.P, lpeg.V
local C, Ct, Cg = lpeg.C, lpeg.Ct, lpeg.Cg
local locale = lpeg.locale()
local blank = locale.space ^ 0
local space = P' ' * blank
local id = P'_' ^ 0 * locale.alpha * (locale.alnum + '_') ^ 0
local NS = P{ 'ns',
-- The upper level table with two fields: 'id' and 'entries':
ns = Ct( blank * 'namespace' * space * Cg( V'ns_id', 'id' )
* blank * Cg( V'ns_body', 'entries' ) * blank ),
ns_id = id,
ns_body = P'{' * blank
-- The field 'entries' is, in turn, an indexed table:
* Ct( (C( V'ns_entry' )
* (blank * P',' * blank * C( V'ns_entry') ) ^ 0) ^ -1 )
* blank * P'}',
ns_entry = id
}
lpeg.match( NS, 'namespace foo {}' ) will give:
table#1 {
["entries"] = table#2 {
},
["id"] = "foo",
}
lpeg.match( NS, 'namespace foo {AA}' ) will give:
table#1 {
["entries"] = table#2 {
"AA"
},
["id"] = "foo",
}
lpeg.match( NS, 'namespace foo {AA, _BB}' ) will give:
table#1 {
["entries"] = table#2 {
"AA",
"_BB"
},
["id"] = "foo",
}
lpeg.match( NS, 'namespace foo {AA, _BB, CC1}' ) will give:
table#1 {
["entries"] = table#2 {
"AA",
"_BB",
"CC1"
},
["id"] = "foo",
}
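If you want something closer to Spirit's semantic actions, a minimal sketch (assuming the NS grammar above) is to attach a function capture with /. The function receives the table produced by the outer Ct capture and can build whatever namespace structure you need:
local function make_ns(t)
  -- t is the table captured by the 'ns' rule: { id = ..., entries = {...} }
  return { kind = 'namespace', name = t.id, entries = t.entries }
end

local ast = lpeg.match( NS / make_ns, 'namespace foo {AA, _BB}' )
-- ast.kind == 'namespace', ast.name == 'foo', ast.entries == { 'AA', '_BB' }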