What are the options when it comes to handling Data Lineage in Snowflake? [closed] - data-warehouse

Any ideas or options for handling Data Lineage in Snowflake? We follow a microservice architecture in which a set of stored procedures, each containing quite a few SQL queries, is executed as soon as certain events are triggered.
Example: when Table A is populated, SP_Populate_Table_B is executed and, as a result, Table B is populated. We have a big set of SPs because we populate the Staging Area, the Data Vault and our Dimensional Model.
We are on the lookout for a good way of handling all the metadata around this microservice way of performing our ETL: essentially an automated way to track dependencies between tables, visualize the orchestration, better manage changes to the SPs when tables change, and so on.
Can you please advise on some frameworks or tools, preferably open source, that you have tried with Snowflake? Would dbt be a solution to this?
Thank you
Pantelis

dbt is a good solution for deploying your warehouse as code, but not a great one if you use your warehouse as a database that services write intermediary tables into.
If you care about data lineage, and you're willing to rethink the SP approach, then I would recommend dbt as a tool to deploy your warehouse infrastructure as code and to easily understand the downstream dependencies of your data.
dbt is great if you are willing to approach everything as an ELT problem and allow dbt to be the infrastructure that transforms a subset of your mass-loaded data/events into something that is ready to be analyzed or ingested for BI.
Read this for more context:
https://discourse.getdbt.com/t/understanding-idempotent-data-transformations/518
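For the lineage piece specifically, every ref() between dbt models is recorded in the manifest.json artifact that dbt writes on each run, so the table-level DAG can be pulled out programmatically. A minimal sketch in Python, assuming a compiled dbt project (target/manifest.json is dbt's default artifact path; nothing here is specific to your project):

    import json

    # dbt writes this artifact on every `dbt compile` / `dbt run`
    with open("target/manifest.json") as f:
        manifest = json.load(f)

    # Each model lists its upstream dependencies under depends_on.nodes
    for node_id, node in manifest["nodes"].items():
        if node["resource_type"] != "model":
            continue
        for parent_id in node["depends_on"]["nodes"]:
            print(f"{parent_id} -> {node_id}")

dbt docs generate / dbt docs serve render the same graph visually, which would cover the "visualize the orchestration" part of your question.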

I'm not 100% sure whether it supports Snowflake just yet, but I'd highly recommend looking into Pachyderm. I believe it was built to solve just this kind of problem.
Might be worth a look or even contributing to if you really want Snowflake support.

Related

Is there any way to synch tasks and source code data between 2 TFS 2012 servers? [closed]

We use TFS to manage tasks and source code. We have two TFS 2012 servers. One is used by developers in India and the other by developers in the US. Is there any way to sync or mirror the two TFS servers, so that at any point both contain the same data (tasks and source code)?
If not an immediate sync, is there any way to schedule an hourly or daily sync?
This is not a simple question nor a trivial project. You have two independent systems and you want them to hold the same data, which is practically impossible; you can, however, get equivalent data in both systems.
My first option would be to collapse the two systems into one and leverage capabilities -- like TFS Proxy, caching HTTP proxies, WAN optimization hardware -- to reduce the latency impact for people further from the system.
This is preferable from a data management point of view and gives much more freedom to teams; it requires good infrastructure and network design.
The second option is to use the TFS Integration Platform to synchronize the data. This requires careful planning, but it is generally doable. You also need to put a process in place so that semi-structural changes, say to Areas or Branches, are managed on one side only. Remember that bug #42 on one system will be #89 on the other!
I have seen implementations of both, and suggest hiring a good consultant to guide you through the minefields, so to speak.

Convert ASP.NET MVC/EF application to NoSQL application [closed]

We are thinking about converting an ASP.NET MVC/EF/SQL application to a NoSQL application to improve development speed and application speed. It looks like there are mixed responses on RavenDB; if you are using NoSQL in a .NET environment, I want to know what you are using.
Also, our business still relies on SSRS for reports, and it is important that we can still export data from the NoSQL store to a SQL environment. What would you suggest for exporting data from NoSQL to SQL Server?
Thanks for your comments.
My 2 cents:
RavenDB is a good document database. I'm using it in a .NET environment and it integrated nicely.
Move to a NoSQL database only if your data makes sense in such a structure or the foreseen performance improvement is compelling enough. RavenDB is a document database so it works great with documents but it's much harder to work with relational data. You'll likely find that keeping relational data in a SQL database is more efficient from a development perspective, but perhaps you'll find better performance with a NoSQL database (probably not RavenDB) at the expense of some developer efficiency.
Be open to a mix of SQL and NoSQL in your environment. For example you may find that your relational data fits best in SQL and your document data fits best in RavenDB. Or perhaps you'll want your document data in both places which would require some SQL-RavenDB syncing.
For exporting from RavenDB to SQL, check out RavenDB's Index Replication Bundle. Please see Matt's comment about the latest bundle to use.
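If the bundle does not cover your case, a fallback is a small export job that flattens documents into rows yourself. Below is a rough sketch rather than anything RavenDB-specific, written in Python with pyodbc purely for illustration; it assumes the documents have already been loaded as dicts and that a matching Orders table exists in SQL Server (table, columns and connection string are made up):

    import pyodbc

    # Documents as they might come out of the document store (illustrative)
    documents = [
        {"Id": "orders/1", "Customer": "ACME", "Total": 120.50},
        {"Id": "orders/2", "Customer": "Initech", "Total": 99.00},
    ]

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=Reporting;Trusted_Connection=yes;"
    )
    cursor = conn.cursor()

    # Flatten each document into a row; SSRS can then report on the table as usual
    cursor.executemany(
        "INSERT INTO Orders (Id, Customer, Total) VALUES (?, ?, ?)",
        [(d["Id"], d["Customer"], d["Total"]) for d in documents],
    )
    conn.commit()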
You will find that the open-source BrightStarDB is the closest fit to a DbContext-style solution like Entity Framework's. As for pulling data out, you can export it either by writing your own app/query or by using its own query language, so you have a few options there.
Check it out at BrightStarDb.com

buffer tasks processing [closed]

The task is to process a lot of commands that need to be saved in some kind of stack or buffer.
While one method pushes the data in, there will be multiple threads or processes that take these tasks one by one and process them.
Right now the idea is to save the tasks in a buffer backed by a NoSQL database, so we can fetch an object and delete it at the same time.
I'm thinking that there is probably already a solution to this problem: some kind of server or library that handles task processing and distribution between multiple instances.
Is there such a thing?
Well, the pattern implementation depends on your specific needs. Your question is too general for a better answer than the one provided as a comment by @AljoshaBre: "this is a classic - producer/consumer problem. look it up on the webz.". If you look at the Wikipedia article on the producer/consumer problem, you can find the pattern implemented in Java; the general pattern is small, but more details are required to address your specific needs. You say something about "task processing and distribution between multiple instances", and that points to a more specific architectural pattern called distributed message queues. The Apache project ActiveMQ aims to implement exactly such a pattern.
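To make the single-process version of the pattern concrete, here is a minimal sketch using Python's standard library queue and threading modules (the command payloads and the worker count are placeholders):

    import queue
    import threading

    tasks = queue.Queue()  # thread-safe FIFO buffer shared by producer and workers

    def producer():
        for i in range(10):
            tasks.put(f"command-{i}")  # push commands into the buffer
        for _ in range(3):
            tasks.put(None)            # one sentinel per worker to signal shutdown

    def worker(worker_id):
        while True:
            cmd = tasks.get()
            if cmd is None:
                break
            print(f"worker {worker_id} processing {cmd}")

    workers = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
    for t in workers:
        t.start()
    producer()
    for t in workers:
        t.join()

A message broker gives you the same pattern across multiple processes or machines, plus persistence and acknowledgements.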
I found Gearman, a PHP queue manager, and a solution with Node.js (node-amqp) and php-amqp using RabbitMQ.
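For the RabbitMQ route, here is a minimal publish/consume sketch with the pika client; the localhost broker and the queue name are assumptions:

    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="tasks", durable=True)

    # Producer side: push a command onto the queue
    channel.basic_publish(exchange="", routing_key="tasks", body="command-1")

    # Consumer side: each worker instance runs this and takes tasks one by one
    def handle(ch, method, properties, body):
        print(f"processing {body!r}")
        ch.basic_ack(delivery_tag=method.delivery_tag)

    channel.basic_consume(queue="tasks", on_message_callback=handle)
    channel.start_consuming()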

Which graph database [closed]

Which graph database should I use when dealing with a couple of thousand nodes and a couple of thousand relationships? Are these big numbers for any database or not? Which graph database is the fastest at read operations (assuming all data is loaded once at the beginning)?
I had a look at Neo4j and its visualization tool. Will I be able to have such a visualization tool in my application?
The questions you'll need to ask and answer for a graph database are similar to any other database. How much data? In memory or persistent? How will you interface with it? Embedded or a server process? Distributed or localized? Licensing?
A couple of thousand nodes and relationships is small for a graph database and most any graph database solution will work. For most people Neo4j is a fine choice, but there are some caveats. First, the licensing of Neo4j can be problematic in many situations. Secondly, the visualizer is part of the Neo4j server process - which means you're going to have another server process running. If you're concerned about the licensing you may want to check out OrientDB, which is under the Apache license, and thus very flexible.
From the sounds of it, you have a fairly small system and may be able to get by with TinkerGraph, an in-memory graph database from Marko Rodriguez and the TinkerPop hackers. It has the option to persist your data to a file if needed, is amazingly lightweight, and, like Neo4j and OrientDB, supports all the graph tools from the TinkerPop stack, including the JUNG ouplementation, which can give you the visualizations you desire.
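For scale, a couple of thousand nodes and relationships will fit comfortably in any of these. As a rough idea of what application-side reads can look like against Neo4j, here is a sketch assuming the official Python driver and a local Bolt endpoint (labels, credentials and the query are illustrative):

    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

    with driver.session() as session:
        # Load a small graph once at startup
        session.run(
            "MERGE (a:Node {id: $a}) MERGE (b:Node {id: $b}) MERGE (a)-[:LINKS_TO]->(b)",
            a=1, b=2,
        )
        # Read-only traversal over the loaded data
        result = session.run(
            "MATCH (a:Node)-[:LINKS_TO]->(b:Node) RETURN a.id AS src, b.id AS dst"
        )
        for record in result:
            print(record["src"], "->", record["dst"])

    driver.close()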

A basic query about data mining [closed]

Using data mining, we are able to find useful patterns in a large data set using techniques like correlation, and there must be some open source tools for this (what are some examples?).
Is this pull-based or push-based? I mean, do we provide a data set as well as specific queries as input to the data mining engine, and it provides us answers (as in SQL)? Or do we only supply a large data set as input, and the engine on its own finds patterns (which we never knew existed and/or could not have formulated queries for), so that we don't really pull specific answers from it; it pushes the patterns to us?
A quick read of the Wikipedia article doesn't clear up my doubts.
As an open source option, have a look at Weka.
Regarding the push/pull question: it's a bit of both, but it's not quite that simple. You must be looking for something. For example, if you are looking for clusters, there are unsupervised algorithms that will give you an answer with minimal guidance.
In practice things are more meaningful if you know about the data you analyse and you are looking at regularities and patterns that make sense.
Playing with Weka will give you a better idea of the range of possibilities.
Python and R are other great open source tools that are very popular in the data mining area.
A great tool that I used recently is scikit-learn.
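To make the push/pull point concrete, here is a minimal unsupervised scikit-learn sketch: you supply only the data and how many clusters to look for, and the algorithm finds the groupings on its own (the synthetic data is just a stand-in):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic data standing in for "a large set of data"
    X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

    # Unsupervised: no queries or labels supplied, only the number of clusters to find
    model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
    print(model.cluster_centers_)
    print(model.labels_[:10])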
