Google Cloud Architecture: Can a data lake be used for OLTP? - machine-learning

I want to design a large-scale web application in Google Cloud, and I need an OLAP system that creates ML models. I plan to feed it by sending all data through Pub/Sub into a Bigtable data lake; the models are created by Dataproc processes.
The models are deployed to microservices that execute them on data from user sessions. My question is: where do I store the "normal business data" for these microservices? Do I have to separate the data for the microservices that serve the web application from the data in the data lake, e.g. by using MariaDB instances (one database per microservice)? Or can I connect them to Bigtable?
Regarding the data lake: are there alternatives to Bigtable? Another developer told me that one option is to store the data on Google Cloud Storage (buckets) and access it with Dataproc, to save on Bigtable cross-region costs.

Wow, lots of questions, lots of hypotheses, and lots of possibilities. The best answer is "it all depends on your needs"!
Where do I store the "normal business data" for these microservices?
What do you want to do in these microservices?
Relational data? Use a relational database like MySQL or PostgreSQL on Cloud SQL.
Document-oriented storage? Use Firestore or Datastore if the queries on the documents are "very simple" (really very simple). Otherwise, look at a partner or marketplace solution such as MongoDB Atlas or Elastic.
Or can I connect them with BigTable?
Yes you can, but do you need to? If you need the raw data before processing, then yes, connect to Bigtable and query it!
If not, it's better to have a batch process that pre-processes the raw data and stores only the summary in a relational or document database (better latency for the user, but less detail).
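As a minimal sketch of that batch pre-processing idea: collapse raw lake events into one summary row per user, the kind of record you would keep in a relational or document database for low-latency reads. The event fields and summary shape here are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical raw events as they might land in the data lake
# (field names are assumptions, not a fixed schema).
raw_events = [
    {"user": "alice", "ts": "2020-01-01T10:00:03", "action": "click"},
    {"user": "alice", "ts": "2020-01-01T10:00:41", "action": "purchase"},
    {"user": "bob",   "ts": "2020-01-01T10:01:02", "action": "click"},
]

def summarize(events):
    """Collapse raw events into one summary per user: the compact
    record a microservice would read instead of the raw data."""
    summary = defaultdict(lambda: {"events": 0, "purchases": 0, "last_seen": None})
    for e in events:
        s = summary[e["user"]]
        s["events"] += 1
        if e["action"] == "purchase":
            s["purchases"] += 1
        ts = datetime.fromisoformat(e["ts"])
        if s["last_seen"] is None or ts > s["last_seen"]:
            s["last_seen"] = ts
    return dict(summary)

summaries = summarize(raw_events)
print(summaries["alice"]["events"])  # 2
```

In production the same shape of job would run on Dataproc or Dataflow and write the summaries to Cloud SQL or Firestore.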
Are there alternatives to BigTable?
It depends on your needs. Bigtable is great for high throughput. If you have fewer than about 1 million streaming writes per second, you can consider BigQuery. You can also query a Bigtable table with the BigQuery engine thanks to federated (external) tables.
Bigtable, BigQuery, and Cloud Storage are all reachable from Dataproc, so pick whatever you need!
Another developer told me that an option is to store data on Google Cloud Storage (Buckets)
Yes, you can stream to Cloud Storage, but be careful: you don't have checksum validation, so you can't be sure your data hasn't been corrupted.
Note
You can also think about your application another way. If you publish events into Pub/Sub, a common pattern is to process them with Dataflow, at least for the pre-processing; your Dataproc job for training your model will be easier that way!
If you train a TensorFlow model, you can also consider BigQuery ML: not for the training (unless a standard model fits your needs, which I doubt), but for the serving part.
Load your TensorFlow model into BigQuery ML.
Simply query your data with BigQuery as the input of your model, submit it to the model, and get the prediction immediately. You can store the result directly in BigQuery with an INSERT ... SELECT query. The processing for the prediction is free; you pay only for the data scanned by BigQuery!
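For illustration, the two statements involved might look like this, written as Python strings you would submit with the BigQuery client. The dataset, table, model, and bucket names are all made up.

```python
# Import a TensorFlow SavedModel from Cloud Storage into BigQuery ML
# (all identifiers below are hypothetical).
CREATE_MODEL_SQL = """
CREATE OR REPLACE MODEL `mydataset.session_model`
OPTIONS (model_type='tensorflow',
         model_path='gs://my-bucket/models/session_model/*')
"""

# Predict over session data and store the results with INSERT ... SELECT.
PREDICT_SQL = """
INSERT INTO `mydataset.predictions`
SELECT *
FROM ML.PREDICT(MODEL `mydataset.session_model`,
                (SELECT * FROM `mydataset.sessions`))
"""

print(PREDICT_SQL)
```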
As I said, lots of possibilities. Narrow your question to get a sharper answer! Anyway, hope this helps.

Related

Storing audit logs on GCP

I would like to store audit logs on our GCP cluster (where our app is). There are different storage/DB options out there. We are looking at a single table, bucket, or similar, without relationships.
Background: we are delivering an enterprise, high-scale SaaS solution.
What I need to do with our audit logs: write them, search them by audit-log fields/columns, and combine filters (AND, OR). Sort options are also important.
I focused on the following options (please let me know if there is something else that matches better):
Cloud Storage
Cloud Firestore
GCP managed Atlas Kafka
Our requirements are:
to have a scalable and high performance storage
that data are encrypted at rest
to have search capability (full-text search would be perfect, but I'm fine with a simple search by column/field)
What I've found so far from the requirements point of view:
Mongo has better performance than Firestore. Not sure how Cloud Storage (standard class) compares with Mongo.
Cloud Storage and Cloud Firestore encrypt data at rest. Not sure about Mongo.
Cloud Firestore and Mongo have search capability out of the box (though not full-text search). Cloud Storage can be searched with BigQuery over permanent/temporary external tables.
My gut feeling is that Cloud Storage is not the best choice. I think its search capability would be cumbersome, and it is object storage intended for large binary objects (images, videos). Please correct me if I'm wrong.
The last two are closer to a matching solution. From the enterprise standpoint, Mongo looks like the better fit.
Please let me know your thoughts.
Use BigQuery! You can sink the logs directly into BigQuery. In GCP, all data is encrypted at rest. BigQuery is a powerful data warehouse with strong query capabilities. All your requirements are met with this solution.
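For example, a search meeting the stated requirements (filter by fields, AND/OR combinations, sorting) could look like the query below, shown as a Python string you would submit with the BigQuery client. The project, dataset, and column names are hypothetical.

```python
# A sketch of an audit-log search against a hypothetical BigQuery sink table:
# field filters combined with AND/OR, plus a sort, as the requirements ask.
AUDIT_QUERY = """
SELECT timestamp, actor, action, resource
FROM `myproject.logs.audit`
WHERE (actor = 'alice' OR actor = 'bob')
  AND action = 'DELETE'
ORDER BY timestamp DESC
"""

print(AUDIT_QUERY)
```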

Apache Kudu vs InfluxDB on time series data for fast analytics

How does Apache Kudu compare with InfluxDB for IoT sensor data that requires fast analytics (e.g. robotics)?
Kudu has recently released v1.0. I have a few specific questions on how Kudu handles the following:
Sharding?
Data retention policies (keeping data for a specified number of data points, or a period of time, and aggregating/discarding it thereafter)?
Is there roll-up/aggregation functionality (e.g. converting 1 s interval data into 1 min interval data)?
Is there support for continuous queries (i.e. materialised views on the data, e.g. a query over the last 60 seconds, maintained on an ongoing basis)?
How is the data stored between disk and memory?
Can a regular time series be induced from an irregular one (converting irregular event data into regular time intervals)?
Also are there any other distinct strengths and/or weaknesses between Kudu and InfluxDB?
Kudu is a much lower-level datastore than InfluxDB. It's more like a distributed file system that provides a few database-like features than a full-fledged database. It currently relies on a query engine such as Impala for finding data stored in Kudu.
Kudu is also fairly young. It would likely be possible to build a time series database with Kudu as the distributed store underneath, but currently the closest thing to that is a proof-of-concept project.
As for the answers to your questions:
1) Kudu stores data in tablets and offers two ways of partitioning data: range partitioning and hash-based partitioning.
2) No. Although if the data is structured with range partitioning, dropping a tablet should be an efficient operation (similar to how InfluxDB drops whole shards when deleting data).
3) Query engines that work with Kudu, such as Impala or Spark, are able to do this.
4) Impala does have some support for views.
5) Data is stored in a columnar format similar to Parquet. However, Kudu's big selling point is that it allows the columnar data to be mutable, which is very difficult with plain Parquet files.
6) While I'm sure you could get Spark or Impala to do this, it's not a built-in feature.
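To make question 3 (and, incidentally, question 6) concrete, here is the kind of roll-up a query engine would perform over Kudu data, sketched in plain Python with made-up samples. Averaging into fixed 1-minute buckets also turns an irregular series into a regular one.

```python
from collections import defaultdict

# (timestamp_seconds, value) samples at roughly 1 s resolution,
# possibly irregular -- the data itself is invented.
samples = [(0, 1.0), (1, 3.0), (59, 2.0), (61, 10.0), (65, 6.0)]

def rollup(samples, bucket_s=60):
    """Aggregate raw samples into fixed-width buckets (here: 1-minute
    averages), keyed by the bucket's start timestamp."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts // bucket_s * bucket_s].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

minute_avgs = rollup(samples)
print(minute_avgs)  # {0: 2.0, 60: 8.0}
```

In practice you would express the same GROUP BY over the truncated timestamp in Impala or Spark SQL rather than in application code.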
Kudu is still a new project, and it is not really designed to compete with InfluxDB, but rather to provide a highly scalable and highly performant storage layer for a service like InfluxDB. The ability to append data to a Parquet-like data structure is really exciting, though, as it could eliminate the need for lambda architectures.

When should we use a data mart and when a data warehouse?

I am new to DW. When should we use the term "data mart" and when should we use the term "data warehouse"? Please explain with an example, maybe your own or in terms of AdventureWorks.
I don't work on MS SQL Server, but here's a generic example with a business use case.
Let me add another term to this. First off, there is a main transactional database which interacts with your application (assuming you have an application to interact with, obviously). The data gets written into the master database (hopefully you are using master-slave replication) and simultaneously gets copied into the slave. According to the business and reporting requirements, cleansing and ETL are performed on the application data; the data is aggregated and stored in a denormalized form to improve reporting performance and reduce the number of joins. Complex pre-calculated data is readily available to the business user for reporting and analysis purposes. This is a dimensional database: a denormalized form of the main transactional database (which is most probably in 3NF).
But, as you may know, all businesses have different supporting systems which also bring in data in the form of spreadsheets, CSVs, and flat files. This data is usually for a single domain, such as call center, collections, and so forth. We can call each such separate domain's data a data mart. The data from the different domains is also operated on by an ETL tool and denormalized in its own fashion. When we combine all the data marts and dimensional databases to solve reporting and analysis problems for the business, we get a data warehouse.
Assume that you have a major application running on a website, which is your main business. You have all primary consumer interaction on that website. That will give you your primary dimensional database. For consumer support, you may have a separate solution, such as Avaya or Genesys, implemented in your company; it will provide you data on the same (or probably a different) server. You prepare ETLs to load that data onto your own server. You call the resultant data sets data marts. And you combine all of these things to get a data warehouse. I know I am being repetitive, but that is on purpose.
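A tiny sketch of the denormalization step described above: join the normalized source tables once, up front, so reports need no joins at query time. The tables and fields here are invented.

```python
# Normalized source tables, as they might exist in the 3NF transactional DB
# (all data invented for illustration).
customers = {1: {"name": "Acme", "region": "EU"}}
products  = {7: {"name": "Widget", "price": 5.0}}
orders    = [{"customer_id": 1, "product_id": 7, "qty": 3}]

def denormalize(orders):
    """Pre-join customers and products into flat, report-ready rows:
    the shape of a row in the dimensional (denormalized) database."""
    rows = []
    for o in orders:
        c, p = customers[o["customer_id"]], products[o["product_id"]]
        rows.append({
            "customer_name": c["name"],
            "region": c["region"],
            "product_name": p["name"],
            "revenue": p["price"] * o["qty"],
        })
    return rows

fact_rows = denormalize(orders)
print(fact_rows[0]["revenue"])  # 15.0
```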

Can I connect directly to the output of a Mahout model with other data-related tools?

My only experience with Machine Learning / Data Mining is via SQL Server Analysis Services.
Using SSAS, I can set up models and fire direct singleton queries against them to do things like real-time market basket analysis and product suggestions. I can grab the "results" from the model as a flattened resultset and visualize them elsewhere.
Can I connect directly to the output of a Mahout model with other data-related tools in the same manner? For example, is there any way I can pull out a tabular resultset so I can render it with the visualization tool of my choice? An ODBC driver, maybe?
Thanks!
The output of Mahout is generally a file on HDFS, though you could dump it anywhere Hadoop can put data. With another job to translate it into whatever form you need, it's readable. And if you can find an ODBC driver for the data store you put it in, then yes.
So I suppose the answer is: no, there is no integration with any particular consumer by design. But you can probably hook up whatever you imagine.
There are some bits that are designed to be real-time systems queried via an API, but I don't think that's what you mean.
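As a sketch of such a "translate" job: assuming the output resembles Mahout's tab-separated recommender format, `userID<TAB>[itemID:score,...]` (treat that exact format as an assumption; your job's output may differ), flattening it into a tabular resultset is one small step.

```python
import csv
import io

# Two lines in the style of Mahout's recommender output on HDFS
# (the format and values are assumptions for illustration).
mahout_output = "42\t[105:4.5,213:3.9]\n43\t[105:2.1]\n"

def to_rows(text):
    """Flatten Mahout-style recommendation lines into
    (user, item, score) tuples: a tabular resultset."""
    rows = []
    for line in text.strip().splitlines():
        user, recs = line.split("\t")
        for pair in recs.strip("[]").split(","):
            item, score = pair.split(":")
            rows.append((int(user), int(item), float(score)))
    return rows

rows = to_rows(mahout_output)

# From here, a CSV for your visualization tool (or a load into any
# ODBC-accessible store) is trivial:
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(rows[0])  # (42, 105, 4.5)
```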

Implementing large scale log file analytics

Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, Google, etc. perform the large-scale (e.g. multi-TB range) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.
I know that the general approach is to use MapReduce to distribute each query over a cluster (e.g. using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a timestamp and that, in general, the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies from that data.
Would a column-oriented DB like Bigtable (or HBase) be an efficient way to store, and more importantly query, such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, e.g. a reverse index?
Unfortunately there is no one size fits all answer.
I am currently using Cascading, Hadoop, S3, and Aster Data to process hundreds of gigabytes a day through a staged pipeline inside AWS.
Aster Data is used for the queries and reporting since it provides a SQL interface to the massive data sets cleaned and parsed by the Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.
Keep in mind that tools like HBase and Hypertable are key/value stores, so they don't do ad-hoc queries and joins; you need a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.
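That out-of-band pattern can be illustrated with a toy map/reduce over timestamped log events, matching the question's "slice between two timestamps, then aggregate" shape. The shards and events below are invented; in production each mapper would see one HDFS shard.

```python
from collections import Counter

# Hypothetical log events: (timestamp, url), split across two "shards".
shard_a = [(100, "/home"), (170, "/buy"), (260, "/home")]
shard_b = [(130, "/home"), (400, "/buy")]

def map_shard(events, t0, t1):
    """Map step: keep only events in [t0, t1) and count hits per URL."""
    return Counter(url for ts, url in events if t0 <= ts < t1)

def reduce_counts(partials):
    """Reduce step: merge the per-shard counters into one result."""
    total = Counter()
    for p in partials:
        total.update(p)
    return total

hits = reduce_counts(map_shard(s, 100, 300) for s in (shard_a, shard_b))
print(hits["/home"])  # 3
```

The timestamp filter in the map step is exactly the "subset of rows" selection the question worries about; with a time-ordered key it becomes a cheap range scan instead of a full pass.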
In full disclosure, I am a developer on the Cascading project.
http://www.asterdata.com/
http://www.cascading.org/
The book Hadoop: The Definitive Guide from O'Reilly has a chapter which discusses how Hadoop is used at two real-world companies.
http://my.safaribooksonline.com/9780596521974/ch14
Have a look at the paper Interpreting the Data: Parallel Analysis with Sawzall by Google. It describes the tool Google uses for log analysis.
