In Python ETL code, how does one implement metadata? - data-warehouse

I have a Python ETL process that moves data from MySQL databases to a Vertica data warehouse.
The ETL code opens files exported from MySQL, aggregates and denormalizes the data using Python's Pandas library, and writes new files that are later loaded into the Vertica data warehouse. The code is simple and works fine.
I happened to run into a presentation on building large-scale enterprise ETL networks, and the presenter stressed the importance of including metadata in the process: being able to carry metadata about the data set and its schema. But no specifics were given.
This makes me think my ETL process, which has no such metadata concept, is too amateurish, and I would like to incorporate schema metadata. Generally, how do I do that?
The presentation (at 40:20): https://www.youtube.com/watch?v=1SQWzG3FIu4#t=2418
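For what it's worth, one common low-tech way to start is to emit a metadata "sidecar" file next to each data file the ETL writes, capturing the schema (column names and types), row count, source, and extraction time. Below is a minimal sketch of that idea with Pandas; the file names, fields, and sample data are hypothetical, not taken from the question.

```python
import json
from datetime import datetime, timezone

import pandas as pd


def write_with_metadata(df: pd.DataFrame, data_path: str, source: str) -> None:
    """Write the DataFrame plus a JSON sidecar describing its schema and lineage."""
    df.to_csv(data_path, index=False)

    metadata = {
        "source": source,  # e.g. the MySQL table the extract came from
        "extracted_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(df),
        "columns": [
            {"name": col, "dtype": str(dtype)}
            for col, dtype in df.dtypes.items()
        ],
    }
    with open(data_path + ".meta.json", "w") as f:
        json.dump(metadata, f, indent=2)


# Hypothetical usage: the loader into Vertica can read the sidecar first,
# validate the schema against the target table, and log the lineage.
df = pd.DataFrame({"customer_id": [1, 2], "total": [10.5, 20.0]})
write_with_metadata(df, "daily_totals.csv", source="mysql.orders")
```

The sidecar gives downstream steps something to validate against (did a column disappear, did a type change, did the row count drop) before anything is loaded into the warehouse.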

Related

Google Cloud Architecture: Can a data lake be used for OLTP?

I want to design a large-scale web application in the Google Cloud, and I need an OLAP system that creates ML models, which I plan to build by sending all data through Pub/Sub into a BigTable data lake. The models are created by Dataproc processes.
The models are deployed to microservices that execute them on data from user sessions. My question is: where do I store the "normal business data" for these microservices? Do I have to separate the data for the microservices that provide the web application from the data in the data lake, e.g. by using MariaDB instances (one database per microservice)? Or can I connect them to BigTable?
Regarding the data lake: are there alternatives to BigTable? Another developer told me that an option is to store data on Google Cloud Storage (buckets) and access it with Dataproc to save the cross-region costs of BigTable.
Wow, lots of questions, lots of hypotheses, and lots of possibilities. The best answer is "it all depends on your needs"!
Where do I store the "normal business data" for this micro services?
What do you want to do in these microservices?
Relational data? Use a relational database like MySQL or PostgreSQL on Cloud SQL.
Document-oriented storage? Use Firestore or Datastore if the queries on documents are "very simple" (very). Otherwise, you can look at a partner or marketplace solution like MongoDB Atlas or Elastic.
Or can I connect them with BigTable?
Yes you can, but do you need to? If you need the raw data before processing, then yes, connect to BigTable and query it!
If not, it's better to have a batch process that pre-processes the raw data and stores only the summary in a relational or document database (better latency for the user, but less detail).
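As a hedged illustration of that "pre-process and store only the summary" pattern, here is a small Python sketch. The raw-event fetch is a hypothetical placeholder (in practice it could be a BigTable scan or the output of a Dataproc job), and the summary is written to a relational serving database with pandas and SQLAlchemy; the table names, columns, and connection string are invented.

```python
import pandas as pd
from sqlalchemy import create_engine


def fetch_raw_events() -> pd.DataFrame:
    """Hypothetical placeholder for reading raw events from the data lake
    (e.g. a BigTable scan or the output of a Dataproc job)."""
    return pd.DataFrame([
        {"user_id": "u1", "event": "click", "ts": "2021-01-01T10:00:00"},
        {"user_id": "u1", "event": "click", "ts": "2021-01-01T10:05:00"},
        {"user_id": "u2", "event": "view",  "ts": "2021-01-01T11:00:00"},
    ])


raw = fetch_raw_events()

# Aggregate down to the level of detail the microservices actually need.
summary = (
    raw.groupby(["user_id", "event"])
       .size()
       .reset_index(name="event_count")
)

# Store only the summary in the serving database (e.g. Cloud SQL / PostgreSQL).
engine = create_engine("postgresql+psycopg2://app:secret@10.0.0.5/appdb")  # hypothetical DSN
summary.to_sql("user_event_summary", engine, if_exists="replace", index=False)
```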
Are there alternatives to BigTable?
It depends on your needs. BigTable is great for high throughput. If you have fewer than 1 million streaming writes per second, you can consider BigQuery. You can also query BigTable tables with the BigQuery engine thanks to federated tables.
BigTable, BigQuery, and Cloud Storage are all reachable from Dataproc, so use whichever you need!
Another developer told me that an option is to store data on Google Cloud Storage (Buckets)
Yes, you can stream to Cloud Storage, but be careful: you don't have checksum validation, so you can't be sure that your data hasn't been corrupted.
Note
You can also think about your application in another way. If you publish events into Pub/Sub, one common pattern is to process them with Dataflow, at least for the pre-processing; your Dataproc job for training your model will be easier that way!
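A minimal sketch of that Pub/Sub to Dataflow pre-processing pattern with the Apache Beam Python SDK, assuming JSON events; the topic, output table, and cleaning logic are hypothetical, and real runs would add Dataflow runner options (project, region, --runner=DataflowRunner).

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def clean(event: dict) -> dict:
    # Hypothetical pre-processing: keep only the fields the model training needs.
    return {"user_id": event.get("user_id"), "action": event.get("action"), "ts": event.get("ts")}


options = PipelineOptions(streaming=True)  # add Dataflow runner/project/region options for a real job

with beam.Pipeline(options=options) as p:
    (
        p
        | "ReadEvents" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
        | "Parse" >> beam.Map(json.loads)
        | "Clean" >> beam.Map(clean)
        | "WriteForTraining" >> beam.io.WriteToBigQuery(
            "my-project:analytics.clean_events",  # hypothetical destination table
            schema="user_id:STRING,action:STRING,ts:TIMESTAMP",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        )
    )
```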
If you train a TensorFlow model, you can also consider BigQuery ML, not for the training (unless a standard model fits your needs, which I doubt), but for the serving part.
Load your TensorFlow model into BigQuery ML.
Simply query your data with BigQuery as input to your model, submit it to your model, and get the prediction immediately. You can store the result directly in BigQuery with an INSERT ... SELECT query. The processing for the prediction is free; you pay only for the data scanned by BigQuery.
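A hedged sketch of what that flow might look like from Python with the google-cloud-bigquery client: import a saved TensorFlow model into BigQuery ML, then run the prediction as an INSERT ... SELECT over ML.PREDICT. The dataset, model, table, and GCS path below are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses application default credentials

# 1) Import the saved TensorFlow model into BigQuery ML (hypothetical GCS path).
client.query("""
    CREATE OR REPLACE MODEL `analytics.session_model`
    OPTIONS (MODEL_TYPE = 'TENSORFLOW',
             MODEL_PATH = 'gs://my-bucket/models/session_model/*')
""").result()

# 2) Score new rows and store the predictions back into BigQuery in one statement.
client.query("""
    INSERT INTO `analytics.session_predictions`
    SELECT *
    FROM ML.PREDICT(MODEL `analytics.session_model`,
                    (SELECT * FROM `analytics.user_sessions`))
""").result()
```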
As I said, lots of possibilities. Narrow your question to get a sharper answer! Anyway, hope this helps.

Apache Kudu vs InfluxDB on time series data for fast analytics

How does Apache Kudu compare with InfluxDB for IoT sensor data that requires fast analytics (e.g. robotics)?
Kudu has recently released v1.0. I have a few specific questions on how Kudu handles the following:
Sharding?
Data retention policies (keeping data for a specified number of data points, or time and aggregating/discarding data thereafter)?
Is there roll-up/aggregation functionality (e.g. converting 1-second interval data into 1-minute interval data)?
Is there support for continuous queries (i.e. materialised views on the data, e.g. a query over the last 60 seconds maintained on an ongoing basis)?
How is the data stored between disk and memory?
Can regular time series be induced from an irregular one (converting irregular event data into regular time intervals)?
Also are there any other distinct strengths and/or weaknesses between Kudu and InfluxDB?
Kudu is a much lower-level datastore than InfluxDB. It's more like a distributed file system that provides a few database-like features than a full-fledged database. It currently relies on a query engine such as Impala for finding data stored in Kudu.
Kudu is also fairly young. It likely would be possible to build a time-series database with Kudu as the distributed store underneath it, but currently the closest implementation to that would be this proof-of-concept project.
As for the answers to your questions:
1) Kudu stores data in tablets and offers two ways of partitioning data: range partitions and hash-based partitioning (see the sketch after this list).
2) Nope. Although if the data were structured with range partitioning, dropping a tablet should be an efficient operation (similar to how InfluxDB drops whole shards when deleting data).
3) Query engines that work with Kudu, such as Impala or Spark, are able to do this.
4) Impala does have some support for views.
5) Data is stored in a columnar format similar to Parquet; however, Kudu's big selling point is that it allows the columnar data to be mutable, which is something that is very difficult with current Parquet files.
6) While I'm sure you could get Spark or Impala to do this, it's not a built-in feature.
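To make (1) a little more concrete, here is a hedged sketch using the kudu-python client to create a hash-partitioned table for sensor readings; the master address, table name, and columns are hypothetical, and range partitioning on the timestamp is the other option mentioned above.

```python
import kudu
from kudu.client import Partitioning

# Connect to the Kudu master (hypothetical host/port).
client = kudu.connect(host="kudu-master.example.com", port=7051)

# Define a simple schema for sensor readings.
builder = kudu.schema_builder()
builder.add_column("sensor_id").type(kudu.string).nullable(False)
builder.add_column("ts").type(kudu.unixtime_micros).nullable(False)
builder.add_column("value").type(kudu.double)
builder.set_primary_keys(["sensor_id", "ts"])
schema = builder.build()

# Hash-partition on sensor_id so writes spread across tablets.
partitioning = Partitioning().add_hash_partitions(column_names=["sensor_id"], num_buckets=4)

client.create_table("sensor_readings", schema, partitioning)
```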
Kudu is still a new project, and it is not really designed to compete with InfluxDB but rather to provide a highly scalable and highly performant storage layer for a service like InfluxDB. The ability to append data to a Parquet-like data structure is really exciting, though, as it could eliminate the need for lambda architectures.

When do we use a data mart and when a data warehouse?

I am new to DW. When should we use the term data mart and when should we use the term data warehouse? Please explain with an example, maybe your own example or in terms of AdventureWorks.
I don't work on MS SQL Server, but here's a generic example with a business use case.
Let me add another term to this. First off, there is a main transactional database that interacts with your application (assuming you have an application to interact with, obviously). The data gets written into the master database (hopefully you are using master-slave replication) and simultaneously gets copied into the slave. According to the business and reporting requirements, cleansing and ETL are performed on the application data, and the data is aggregated and stored in a denormalized form to improve reporting performance and reduce the number of joins. Complex pre-calculated data is readily available to the business user for reporting and analysis purposes. This is a dimensional database - a denormalized form of the main transactional database (which is most probably in 3NF).
But, as you may know, all businesses have different supporting systems which also bring in data in the form of spreadsheets, CSVs and flat files. This data is usually for a single domain, such as the call center, collections, and so on. We can call each such separate domain's data a data mart. The data from the different domains is also operated upon by an ETL tool and is denormalized in its own fashion. When we combine all the data marts and dimensional databases to solve reporting and analysis problems for the business, we get a data warehouse.
Assume that you have a major application running on a website - which is your main business. You have all primary consumer interaction on that website. That will give you your primary dimensional database. For consumer support, you may have a separate solution, such as Avaya or Genesys, implemented in your company - they will provide you the data on the same (or probably a different) server. You prepare ETLs to load that data onto your own server. You call the resultant data sets data marts. And you combine all of these things to get a data warehouse. I know I am being repetitive, but that is on purpose.
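To make the ETL/denormalization step described above a little more concrete, here is a hedged Pandas sketch of turning normalized transactional tables into a pre-aggregated reporting table; the tables, columns, and sample data are invented, not from AdventureWorks.

```python
import pandas as pd

# Normalized transactional extracts (hypothetical).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [10, 10, 20],
    "amount": [100.0, 50.0, 75.0],
    "order_date": pd.to_datetime(["2021-01-03", "2021-01-20", "2021-02-01"]),
})
customers = pd.DataFrame({
    "customer_id": [10, 20],
    "region": ["EU", "US"],
})

# Denormalize: do the join once here so reporting queries don't have to.
denorm = orders.merge(customers, on="customer_id", how="left")

# Pre-aggregate to the grain the business reports on (region x month).
report = (
    denorm.assign(month=denorm["order_date"].dt.to_period("M").astype(str))
          .groupby(["region", "month"], as_index=False)
          .agg(total_sales=("amount", "sum"), order_count=("order_id", "count"))
)
print(report)
```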

Can I connect directly to the output of a Mahout model with other data-related tools?

My only experience with Machine Learning / Data Mining is via SQL Server Analysis Services.
Using SSAS, I can set up models and fire direct singleton queries against them to do things like real-time market basket analysis and product suggestions. I can grab the "results" from the model as a flattened resultset and visualize it elsewhere.
Can I connect directly to the output of a Mahout model with other data-related tools in the same manner? For example, is there any way I can pull out a tabular resultset so I can render it with the visualization tool of my choice? An ODBC driver, maybe?
Thanks!
The output of Mahout is generally a file on HDFS, though you could dump it out anywhere Hadoop can put data. With another job to translate it into whatever form you need, it's readable. And if you can find an ODBC driver for the data store you put it in, then yes.
So I suppose the answer is: no, there is no integration with any particular consumer by design. But you can probably hook up whatever you imagine.
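A hedged sketch of that "translate and consume" step: pull the Mahout job's output off HDFS with the hadoop CLI, parse it, and hand it to whatever tool you like as a tabular result. It assumes the output path is hypothetical and the part files are readable by `hadoop fs -text` (plain text or standard SequenceFiles); other writables would need a translation job or `mahout seqdumper` first.

```python
import subprocess

import pandas as pd

# Hypothetical HDFS path of the Mahout job output.
OUTPUT_GLOB = "/user/hadoop/mahout-output/part-r-*"

# `hadoop fs -text` prints key<TAB>value lines for text files and
# standard SequenceFiles alike.
raw = subprocess.run(
    ["hadoop", "fs", "-text", OUTPUT_GLOB],
    check=True, capture_output=True, text=True,
).stdout

rows = [line.split("\t", 1) for line in raw.splitlines() if line.strip()]
results = pd.DataFrame(rows, columns=["key", "value"])

# From here the result set is tabular: export it for your visualization tool,
# or load it into a database that does have an ODBC driver.
results.to_csv("mahout_results.csv", index=False)
```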
There are some bits that are designed to be real-time systems queried via an API, but I don't think that's what you mean.

What are the merits of using CouchDB vs Hadoop to store/analyze web app log data?

I want to move up from using plain Rails log files for my web applications, so I can analyze page views and usage patterns. I've heard CouchDB is sometimes used for this.
On the other hand, I know of people who just feed the plain text log files into Hadoop and reduce them into summary stats that they then insert into MySQL.
What are the pros and cons of each of these two methods of logging and analyzing log files?
I can only speak for CouchDB, but the main benefits of using a document database to store things like these are:
They are schemaless, so you can alter the schema of your log entries and still run queries across the various versions of the schema you might have.
The map/reduce algorithm is a very powerful way to do grouping queries (see the sketch after this list).
The REST interface makes it technology-agnostic in terms of consuming the data.
Scaling is horizontal and "infinite".
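To illustrate the map/reduce and REST points, here is a hedged Python sketch against a local CouchDB using only its HTTP API: it defines a design document with a JavaScript map/reduce view that counts log entries per path, then queries it with group=true. The database name, view name, credentials, and document shape are hypothetical.

```python
import requests

COUCH = "http://admin:secret@localhost:5984"   # hypothetical credentials
DB = f"{COUCH}/weblogs"

# Create the database and insert a sample log document (schemaless JSON).
requests.put(DB)
requests.post(DB, json={"path": "/products/42", "status": 200, "ts": "2021-01-01T10:00:00Z"})

# Design document: a JavaScript map function plus the built-in _count reduce.
design = {
    "views": {
        "hits_by_path": {
            "map": "function (doc) { if (doc.path) { emit(doc.path, 1); } }",
            "reduce": "_count",
        }
    }
}
requests.put(f"{DB}/_design/logs", json=design)

# Query the view grouped by key: one row per path with its hit count,
# consumable by anything that speaks HTTP/JSON.
resp = requests.get(f"{DB}/_design/logs/_view/hits_by_path", params={"group": "true"})
print(resp.json()["rows"])
```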
