JSqlParser for Data Lineage? - jsqlparser

We are a data warehouse development team and most of our ETL logic can be expressed as a series of SQL select statements. I am looking for a tool to extract data lineage in a structured manner by parsing the queries.
The query and the simplified lineage output would look like this:
Query:
SELECT A AS COLUMN_1, B AS COLUMN_2, A+B AS COLUMN_SUM FROM MYTABLE;
Output
COLUMN_1: MYTABLE.A
COLUMN_2: MYTABLE.B
COLUMN_3: MYTABLE.A
COLUMN_3: MYTABLE.B
Is JSQLParser a good tool for this purpose? Any pointers or experiences on how to use the tool would be appreciated too.

JSqlParser does the parsing and gives you a structured way to look at your SQL. By the way JSqlParser is pretty good at this.
But it has no knowledge about your database schema, therefore it cannot know if column_1 A is from table MYTABLE. A more obvious example would be
select a, b from table1, table2
This knowledge must somehow injected by you :).
To make a simple parse do something like
Statement statement = CCJSqlParserUtil.parse(sql);
To extract used columns you could use the TablesNamesFinder utility, provided by JSqlParser like
Select selectStatement = (Select) statement;
TablesNamesFinder tablesNamesFinder = new TablesNamesFinder() {
#Override
public void visit(Column tableColumn) {
System.out.println(tableColumn);
}
};
System.out.println(" and tables=" + tablesNamesFinder.getTableList(selectStatement));
As you see one way to surf your data is some kind of visitor pattern.
If you have more questions feel free to use JSqlParsers gitter room or file an issue at github.

Related

Joining 2 large relations in Rascal

I'm trying to join two relations in Rascal, much like a SQL join, with the following code:
rel[loc,loc,loc] methodInvocationsWithClass = {arround 40000 tuples};
rel[loc,loc] declaredClassHierarchy = {around 20000 tuples};
{ <from,to,class,super> | <from,to,class> <- methodInvocationsWithClass, <sub,super> <- declaredClassHierarchy, class == sub };
While this does exactly what I need it appears it only works well on small relations and doesn't scale well.
Is there perhaps a more efficient alternative way to accomplish this?
Indeed, we have the join keyword for this. Also lots of other useful relational operations are supported. Either by keywords or functions inside the Relation module.

Propel query with nested statements and empty field value

I have quite a complex SQL query which I would like to transform into Propel but I am not sure about the best approach.
The query I need looks like this:
SELECT id_loan
FROM loan loanA
JOIN loan_funding on fk_loan = loanA.id_loan
JOIN `user` userA on loan_funding.fk_user = userA.id_user
WHERE
userA.`acc_internal_account_id` is not null
AND loanA.`state` = 'payment_origination'
AND loanA.id_loan IN (
SELECT id_loan from loan loanB
JOIN loan_funding on fk_loan = id_loan
JOIN `user` userB on loan_funding.fk_user = userB.id_user
WHERE
userB.`acc_internal_account_id` is null
AND loanB.`state` = 'payment_origination'
GROUP BY loanB.id_loan
)
GROUP BY loanA.id_loan
LIMIT 1;
What I would like to have is something completely based on the Generated Query Methods but I do not quite get how to do it.
Performance is not an issue but as for now it is unclear where and how those queries will be called from. However, it is important to get back an object as we need to use the getters and setters.
I found this website: http://propelorm.org/blog/2011/02/02/how-can-i-write-this-query-using-an-orm-.html which looks really cool and helpful, however, I am not sure what option fits best here.
I do not expect a complete solution but maybe some thoughts how to narrow down the problem...
What confuses me is especially the part where it compares the id_loan and fk_loan before it goes to the user table. How would this relationship be represented by propel? Might it be better to split the whole thing in multiple queries?
Any hints appreciated!

Performance of generated T-SQL from Entity Framework

I recently used Entity Framework for a project, despite my DBA's strong disapproval. So one day he came to my office complaining about generated T-SQL that reaches his database.
For instance, when I want to select a product based on the id, I write something like this:
context.Products.FirstOrDefault(p=>p.Id==id);
Which translates to
SELECT ... FROM (SELECT TOP 1 ... FROM PRODUCTS WHERE ID=#id)
So he is shouting, "Why on earth would you write a SELECT * FROM (SELECT TOP 1)"
So I changed my code to
context.Products.Where(p=>p.Id==id).ToList().FirstOrDefault()
and this produces a much cleaner T-SQL:
SELECT ... FROM PRODUCTS WHERE ID=#id
The inner query and the TOP 1 dissappeared. Enough mambling, my question is this: Does the first query really put an overhead for SQL Server? Is it harder to parse than the second method? The Id column has a Clustered index on. I want a good answer so I can rub it on his face (or mine)
Thanks,
Themos
Have you tried running the queries manually and comparing the executions plans?
The biggest problem here isn't that the SQL isn't perfectly formed to your DBA's standards (although I'm fairly certain that the query engine will optimize out the extra select). The second query actually returns the entire contents of the Products table which you then analyse in memory and this is definitely a task that should be performed by the DB and not the application layer.
In short, he's being a pedant; leave it the way it was.

Linq-to-SQL query - Need to filter by IDs returned by Full-Text Search sql functions - Hitting limit for Contains

My objective:
I have built a working controller action in MVC which takes user input for various filter criteria and, using PredicateBuilder (part of LinqKit - sorry, I'm not allowed enough links yet) builds the appropriate LINQ query to return rows from a "master" table in SQL with a couple hundred thousand records. My implementation of the predicates is totally inelegant, as I'm new to a lot of this, and under a very tight deadline, but it did make life easier. The page operates perfectly as-is.
To this, I need to add a Full-Text search filter. Understanding the way LINQ translates Contains to LIKE(%%), using the advice in Simon Blog: LINQ-to-SQL - Enabling Full-Text Searching, I've already prepared Table Functions in SQL to run Freetext queries on the relevant columns. I have 4 functions, to match the query against 4 separate tables.
My approach:
At the moment, I'm building the predicates (I'll spare you) for the initial IQueryable data object, running a LINQ command to return them, like so:
var MyData = DB.Master_Items.Where(outer);
Then, I'm attempting to further filter MyData on the Keys returned by my full-text search functions:
var FTS_Matches_Subtable_1 = (from tbl in DB.Subtable_1
join fts in DB.udf_Subtable_1_FTSearch(KeywordTerms)
on tbl.ID equals fts.ID
select tbl.ForeignKey);
... I have 4 of those sets of matches which I've tried to use to filter my original dataset in several ways with no success. For instance:
MyNewData = MyData.Where(d => FTS_Matches_Subtable_1.Contains(d.Key) ||
FTS_Matches_Subtable_2.Contains(d.Key) ||
FTS_Matches_Subtable_3.Contains(d.Key) ||
FTS_Matches_Subtable_4.Contains(d.Key));
I just get the error: The incoming tabular data stream (TDS) remote procedure call (RPC) protocol stream is incorrect. Too many parameters were provided in this RPC request. The maximum is 2100.
I get that it's because I'm trying to pass a relatively large set of data into the Contains function and LINQ is converting each record into a separate parameter, exceeding the limit.
I just don't know how to get around it.
I found another post linq expression to return property value which seemed SO promising. I tried ifwdev's solution (2nd highest ranked answer): using LinqKit to build an extension that will break up the queries into manageable chunks. But I can't figure out how to implement it. Out of my depth right now maybe?
Is there another approach that I'm missing? Some simpler way to accomplish this that I've overlooked?
Sorry for the long post. But thank you for any help you can provide!
This is a perfect time to go back to raw ado.net.
Twisting things around just to use linq to sql is probably just as time consuming if you wrote the query and hydration by hand.

Is a full list returned first and then filtered when using linq to sql to filter data from a database or just the filtered list?

This is probably a very simple question that I am working through in an MVC project. Here's an example of what I am talking about.
I have an rdml file linked to a database with a table called Users that has 500,000 rows. But I only want to find the Users who were entered on 5/7/2010. So let's say I do this in my UserRepository:
from u in db.GetUsers() where u.CreatedDate = "5/7/2010" select u
(doing this from memory so don't kill me if my syntax is a little off, it's the concept I am looking for)
Does this statement first return all 500,000 rows and then filter it or does it only bring back the filtered list?
It filters in the database since your building your expression atop of an ITable returning a IQueryable<T> data source.
Linq to SQL translates your query into SQL before sending it to the database, so only the filtered list is returned.
When the query is executed it will create SQL to return the filtered set only.
One thing to be aware of is that if you do nothing with the results of that query nothing will be queried at all.
The query will be deferred until you enumerate the result set.
These folks are right and one recommendation I would have is to monitor the queries that LinqToSql is creating. LinqToSql is a great tool but it's not perfect. I've noticed a number of little inefficiencies by monitoring the queries that it creates and tweaking it a bit where needed.
The DataContext has a "Log" property that you can work with to view the queries created. I created a simple HttpModule that outputs the DataContext's Log (formatted for sweetness) to my output window. That way I can see the SQL it used and adjust if need be. It's been worth its weight in gold.
Side note - I don't mean to be negative about the SQL that LinqToSql creates as it's very good and efficient almost every time. Another good side effect of monitoring the queries is you can show your friends that are die-hard ADO.NET - Stored Proc people how efficient LinqToSql really is.

Resources