ZetaSQL - Creating a simple catalog with tables and columns using local service - parsing

We are using a Python client binding for the ZetaSQL gRPC local service in our application to analyze statements and extract referenced tables and output columns.
It is possible to extract referenced tables with the local service using the following simplified Python code:
import json

import zetasql.local_service as zql
from google.protobuf.json_format import MessageToJson

conn = zql.connect()
language_options = conn.GetLanguageOptions(
    zql.pb2.LanguageOptionsRequest(maximum_features=True)
)
# Used to allow the ZetaSQL parser to parse `CREATE TABLE AS` statements
language_options.supported_statement_kinds.pop()
req = zql.pb2.ExtractTableNamesFromStatementRequest(
    sql_statement=sql, options=language_options
)
res = conn.ExtractTableNamesFromStatement(req)
return json.loads(MessageToJson(res))
However, from what I see here, the local service doesn't have the full functionality of the Java client, mainly the ability to create a simple catalog with tables and columns so that arbitrary SQL statements can be analyzed. Setting analyzer options doesn't seem to be possible either.
Is it possible to analyze SQL statements using ZetaSQL with only the local service? If not, what should be the alternative approach to extract output columns?
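For what it's worth, the local service proto does define an Analyze RPC whose request can carry a SimpleCatalogProto with tables and columns inline, so analyzing statements without the Java client looks possible in principle. I haven't verified how much of this the Python binding exposes, so treat the following as a hedged sketch only: the message and field names (AnalyzeRequest, AnalyzerOptionsProto, SimpleCatalogProto, SimpleTableProto, SimpleColumnProto, TypeProto) are taken from local_service.proto and type.proto, and where the binding re-exports them (assumed here to be zql.pb2) is an assumption, as is the example table.

# Hedged sketch: describe the catalog inline in the AnalyzeRequest.
# Message locations and the TYPE_* enum constants may live elsewhere in your binding.
catalog = zql.pb2.SimpleCatalogProto(
    name="catalog",
    table=[
        zql.pb2.SimpleTableProto(
            name="orders",  # hypothetical table for illustration
            column=[
                zql.pb2.SimpleColumnProto(
                    name="id",
                    type=zql.pb2.TypeProto(type_kind=zql.pb2.TYPE_INT64),
                ),
                zql.pb2.SimpleColumnProto(
                    name="status",
                    type=zql.pb2.TypeProto(type_kind=zql.pb2.TYPE_STRING),
                ),
            ],
        )
    ],
)
req = zql.pb2.AnalyzeRequest(
    sql_statement=sql,
    simple_catalog=catalog,
    options=zql.pb2.AnalyzerOptionsProto(language_options=language_options),
)
res = conn.Analyze(req)  # res.resolved_statement should hold the resolved AST

If this works, the output columns should be reachable from the resolved statement (for a query, its output_column_list); you may also need to populate builtin_function_options on the catalog for statements that use built-in functions.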

Related

databricks: reading from datawarehouse temp directory

I am writing a function like the following:
def fromdw():
    df = spark.read \
        .format("com.databricks.spark.sqldw") \
        .option("url", myurl) \
        .option("query", sqlquery) \
        .option("forward_spark_azure_storage_credentials", "True") \
        .option("tempdir", mytempurl) \
        .load()
    return df
May I know whether the tempdir option is compulsory? I want to do a read without tempdir, because it is slow as it requires staging the results in a Blob folder first.
Yes. Use JDBC for smaller, faster queries. Use the Databricks Spark connector for SQL DW (Synapse) for moving large data sets. This connector bypasses both the Spark driver node and the Synapse head node, and by using Azure Storage as a staging area it allows the unload and load to be performed in parallel by the clusters in either direction.
JDBC is documented here, and looks like this:
connectionProperties = {
    "user": user,
    "password": pw,
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
pushdown_query = "(select * from whatever) q"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query, properties=connectionProperties)
display(df)
(notice how you have to wrap the SELECT to use a query instead of a table name).
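One more option, depending on your Spark version: since Spark 2.4 the built-in JDBC source also accepts a query option, so you can skip the derived-table wrapping. A hedged sketch reusing jdbcUrl, sqlquery, user and pw from above:

# Spark 2.4+ only: pass the query directly instead of wrapping it as "(...) q".
df = (spark.read.format("jdbc")
      .option("url", jdbcUrl)
      .option("query", sqlquery)
      .option("user", user)
      .option("password", pw)
      .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
      .load())
display(df)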

In PySpark, reading CSV files fails if even one path does not exist. How can we avoid this?

In PySpark, reading CSV files from different paths fails if even one path does not exist.
Logs = spark.read.load(Logpaths, format="csv", schema=logsSchema, header="true", mode="DROPMALFORMED")
Here Logpaths is an array that contains multiple paths, and these paths are created dynamically depending on the given startDate and endDate range. If Logpaths contains 5 paths and the first 3 exist but the 4th does not, then the whole extraction fails. How can I avoid this in PySpark, or how can I check their existence before reading?
In Scala I did this by checking file existence and filtering out non-existent paths using the Hadoop HDFS filesystem globStatus function.
val Path = "/bilal/2018.12.16/logs.csv"
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
val fileStatus = fs.globStatus(new org.apache.hadoop.fs.Path(Path))
So I got what I was looking for. The code I posted in the question can be used in Scala for the file existence check; in PySpark we can use the code below.
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("bilal/logs/log.csv"))
This is essentially the same code as in Scala: in both cases we are using the Hadoop Java library, and that Java code runs on the JVM that Spark itself runs on.
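Putting the two together, here is a small sketch of how the existence check could be applied to the original Logpaths list before reading (assuming the spark session, Logpaths and logsSchema from the question; untested):

# Filter out non-existent paths before calling spark.read.load
sc = spark.sparkContext
hadoop_fs = sc._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(sc._jsc.hadoopConfiguration())

existing_paths = [p for p in Logpaths if fs.exists(hadoop_fs.Path(p))]

Logs = spark.read.load(existing_paths, format="csv", schema=logsSchema,
                       header="true", mode="DROPMALFORMED")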

Avoiding SSIS script task to convert utf-8 to unicode for AS400 data to SQL Server

After many tries I have concluded that the optimal way to transfer data from AS400 (non-Unicode) to SQL Server with SSIS is:
1. Use the native transfer utility to dump the data to TSV (tab delimited)
2. Convert the files from UTF-8 to Unicode
3. Use BULK INSERT to load them into SQL Server
For step 2 I found ready-made code that does this:
string from = @"\\appsrv02\c$\bg_f0101.tsv";
string to = @"\\appsrv02\c$\bg_f0101.txt";

using (StreamReader reader = new StreamReader(from, Encoding.UTF8, false, 1000000))
using (StreamWriter writer = new StreamWriter(to, false, Encoding.Unicode, 1000000))
{
    while (!reader.EndOfStream)
    {
        var line = reader.ReadLine();
        if (line.Length > 0)
            writer.WriteLine(line);
    }
}
I need to fully understand what is happening here with the encoding and why this is necessary.
How can I replace this script task with a more elegant solution?
I don't have much insight into exactly why you need the UTF-8 conversion task, except to say that SQL Server - I believe - uses UCS-2 as its native storage format, and this is similar to UTF-16, which is what your task converts the file to. I'm surprised SSIS can't work with a UTF-8 input source, though.
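To make the encoding point concrete, here is a tiny illustration (Python, just for demonstration; not part of the SSIS flow) of what the script task changes at the byte level:

# The same text takes a different byte form under the two encodings.
text = "bg_f0101"
print(text.encode("utf-8"))      # 1 byte per ASCII character, as in the AS400 dump
print(text.encode("utf-16-le"))  # 2 bytes per character; .NET's Encoding.Unicode writes
                                 # UTF-16LE like this (plus a BOM at the start of the file)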
My main point is to answer the "How could I replace this script task with a more elegant solution?":
I have had a lot of success using the HiT OLEDB/400 Server. It allows you to set up your AS/400 / iSeries / System i / whatever IBM are calling it this week as a linked server in SQL Server, and you can then access the 400's data directly from the server it's linked to using the standard 4-part SQL syntax, e.g. SELECT * FROM my400.my400.myLib.myFile.
Or even better, it's much more efficient as a passthrough query using EXEC...AT.
Using this you would not need SSIS at all; you'd just need a simple stored proc that does an insert into your destination table directly from the 400 data.

Neo4j: Inserting 7k nodes is slow (Spring Data Neo4j / SpringRestGraphDatabase)

I'm building an application where my users can manage dictionaries. One feature is uploading a file to initialize or update the dictionary's content.
The part of the structure I'm focusing on for a start is Dictionary -[:CONTAINS]->Word.
Starting from an empty database (Neo4j 1.9.4, but I also tried 2.0.0M5), accessed via Spring Data Neo4j 2.3.1 in a distributed environment (therefore using SpringRestGraphDatabase, but testing against localhost), I'm trying to load 7k words into 1 dictionary. However, I can't get it done in less than 8-9 minutes on a Linux machine with a Core i7, 8 GB RAM and an SSD drive (ulimit raised to 40000).
I've read lots of posts about loading/inserting performance using REST and I've tried to apply the advice I found, but without better luck. The BatchInserter tool doesn't seem to be a good option for me due to my application constraints.
Can I hope to load 10k nodes in a matter of seconds rather than minutes?
Here is the code I came up with after all my reading:
Map<String, Object> dicProps = new HashMap<String, Object>();
dicProps.put("locale", locale);
dicProps.put("category", category);
Dictionary dictionary = template.createNodeAs(Dictionary.class, dicProps);

Map<String, Object> wordProps = new HashMap<String, Object>();
Set<Word> words = readFile(filename);
for (Word gw : words) {
    wordProps.put("txt", gw.getTxt());
    Word w = template.createNodeAs(Word.class, wordProps);
    template.createRelationshipBetween(dictionary, w, Contains.class, "CONTAINS", true);
}
I solved this kind of problem by creating a CSV file and then having Neo4j read it. The steps needed are:
Write a class which takes the input data and creates a CSV file from it (it can be one file per node kind, or you can also create a file which is used to build the relationships).
In my case I also created a servlet which allows Neo4j to read that file over HTTP.
Create the Cypher statements which read and parse that CSV file. Here are some samples of the ones I use (if you use Spring Data, also remember about the labels):
simple one:
load csv with headers from {fileUrl} as line
merge (:UserProfile:_UserProfile {email: line.email})
more complicated:
load csv with headers from {fileUrl} as line
match (c:Calendar {calendarId: line.calendarId})
merge (a:Activity:_Activity {eventId: line.eventId})
on create set a.eventSummary = line.eventSummary,
    a.eventDescription = line.eventDescription,
    a.eventStartDateTime = toInt(line.eventStartDateTime),
    a.eventEndDateTime = toInt(line.eventEndDateTime),
    a.eventCreated = toInt(line.eventCreated),
    a.recurringId = line.recurringId
merge (a)-[r:EXPORTED_FROM]->(c)
return count(r)
Try the below:
Use the native Neo4j API rather than spring-data-neo4j when performing batch operations.
Commit in batches, e.g. every 500 words.
NOTE: There are certain properties (type) added by SDN which will be missing when using the native approach.

Equivalent of Rails console for Node.js

I am trying out the Node.js Express framework, and I'm looking for a plugin that allows me to interact with my models via a console, similar to the Rails console. Is there such a thing in the Node.js world?
If not, how can I interact with my Node.js models and data, such as manually adding/removing objects, testing methods on data, etc.?
Create your own REPL by making a JS file (e.g. console.js) with the following lines/components:
Require node's built-in repl: var repl = require("repl");
Load in all your key variables like db, any libraries you swear by, etc.
Load the repl by using var replServer = repl.start({});
Attach the repl to your key variables with replServer.context.<your_variable_names_here> = <your_variable_names_here>. This makes the variable available/usable in the REPL (node console).
For example: If you have the following line in your node app:
var db = require('./models/db')
Add the following lines to your console.js
var db = require('./models/db');
replServer.context.db = db;
Run your console with the command node console.js
Your console.js file should look something like this:
var repl = require("repl");
var epa = require("epa");
var db = require("db");

// connect to database
db.connect(epa.mongo, function(err){
    if (err){ throw err; }

    // open the repl session
    var replServer = repl.start({});

    // attach modules to the repl context
    replServer.context.epa = epa;
    replServer.context.db = db;
});
You can even customize your prompt like this:
var replServer = repl.start({
    prompt: "Node Console > ",
});
For the full setup and more details, check out:
http://derickbailey.com/2014/07/02/build-your-own-app-specific-repl-for-your-nodejs-app/
For the full list of options you can pass the repl like prompt, color, etc: https://nodejs.org/api/repl.html#repl_repl_start_options
Thank you to Derick Bailey for this info.
UPDATE:
GavinBelson has a great recommendation for running with sequelize ORM (or anything that requires promise handling in the repl).
I am now running sequelize as well, and for my node console I'm adding the --experimental-repl-await flag.
It's a lot to type in every time, so I highly suggest adding:
"console": "node --experimental-repl-await ./console.js"
to the scripts section in your package.json so you can just run:
npm run console
and not have to type the whole thing out.
Then you can handle promises without getting errors, like this:
const product = await Product.findOne({ where: { id: 1 } });
I am not very experienced in using Node, but you can enter node in the command line to get to the Node console. I then used to require the models manually.
Here is the way to do it, with SQL databases:
Install and use Sequelize; it is Node's ORM answer to Active Record in Rails. It even has a CLI for scaffolding models and migrations.
node --experimental-repl-await
> models = require('./models');
> User = models.User; //however you load the model in your actual app this may vary
> await User.findAll(); //use await, then any sequelize calls here
TLDR
This gives you access to all of the models just as you would have in Rails Active Record. Sequelize takes a bit of getting used to, but in many ways it is actually more flexible than Active Record while still having the same features.
Sequelize uses promises, so to run these properly in the REPL you will want to use the --experimental-repl-await flag when running node. Otherwise, you can get Bluebird promise errors.
If you don't want to type out the require('./models') step, you can use console.js - a setup file for the REPL at the root of your directory - to preload this. However, I find it easier to just type this one line out in the REPL.
It's simple: add a REPL to your program
This may not fully answer your question, but to clarify, node.js is much lower-level than Rails, and as such doesn't prescribe tools and data models like Rails. It's more of a platform than a framework.
If you are looking for a more Rails-like experience, you may want to look at a more 'full-featured' framework built on top of node.js, such as Meteor, etc.
