I would like to store audit logs on our GCP cluster (where our app runs). There are different storage/DB options out there. We are looking at a single table, bucket, or similar, without any relationships.
Background: we deliver a high-scale enterprise SaaS solution.
What I need to do with our audit logs: write them, search them by field/column, and combine conditions (AND, OR). Sort options are also important.
I have focused on the following options (please let me know if something else would be a better match):
Cloud Storage
Cloud Firestore
GCP-managed MongoDB Atlas
Our requirements are:
to have a scalable and high performance storage
that data are encrypted at rest
to have search capability (full-text search would be perfect, but I'm fine with simple search by column/field)
What I've found so far from requirements point:
Mongo has better performance than Firebase. I'm not sure how Cloud Storage (standard mode) compares with Mongo.
Cloud Storage and Cloud Firestore do encrypt data. I'm not sure about Mongo.
Cloud Firestore and Mongo have search capability out of the box (not full-text search). Cloud Storage can be searched with BigQuery through permanent or temporary tables.
My gut feeling is that Cloud Storage is not the best choice. I think its search capability is rather cumbersome. Also, it is designed for large binary documents (images, videos). Please correct me if I'm wrong.
The last two are closer to what we need. From the enterprise standpoint, Mongo looks closer.
Please let me know your thoughts.
Use BigQuery! You can sink the logs directly into BigQuery. In GCP, all data is encrypted at rest. BigQuery is a powerful data warehouse with strong query capabilities. All your requirements are met with this solution.
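To make the "search by column, combine with AND/OR, then sort" requirement concrete, here is a minimal sketch of the query shape. It uses Python's built-in sqlite3 as a local stand-in, not BigQuery itself; the table name `audit_logs` and its columns are assumptions made up for illustration. BigQuery SQL accepts the same WHERE / ORDER BY structure.

```python
import sqlite3

# Hypothetical audit-log schema; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE audit_logs (ts TEXT, actor TEXT, action TEXT, resource TEXT)"
)
conn.executemany(
    "INSERT INTO audit_logs VALUES (?, ?, ?, ?)",
    [
        ("2021-03-01T10:00:00Z", "alice", "login", "console"),
        ("2021-03-01T10:05:00Z", "bob", "delete", "bucket/reports"),
        ("2021-03-01T10:07:00Z", "alice", "delete", "bucket/tmp"),
        ("2021-03-01T10:09:00Z", "carol", "update", "bucket/reports"),
    ],
)


def search_audit_logs(conn, actor, resource, action):
    """Return rows where (actor OR resource matches) AND action matches, newest first."""
    sql = (
        "SELECT ts, actor, action, resource FROM audit_logs "
        "WHERE (actor = ? OR resource = ?) AND action = ? "
        "ORDER BY ts DESC"
    )
    return conn.execute(sql, (actor, resource, action)).fetchall()


rows = search_audit_logs(conn, "alice", "bucket/reports", "delete")
for row in rows:
    print(row)
```

The values are passed as query parameters rather than concatenated into the SQL string, which is worth keeping when the filters come from user input.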
Google just released Cloud Firestore, their new Document Database for apps.
I have been reading the documentation but I don't see a lot of differences between Firestore and Firebase DB.
The main point is that Firestore uses documents and collections, which make querying easier than in Firebase, a traditional NoSQL database with a JSON base.
I would like to know a bit more about their differences, or usages, or whether Firestore just came to replace Firebase DB?
I wrote an entire blog post all about this very question, and I recommend you check it out (or the official documentation) for a more complete answer.
But if you want the quick(-ish) summary, here it is:
Better querying and more structured data -- While the Realtime Database is just a giant JSON tree, Cloud Firestore is a little more structured. All your data consists of documents (which are basically key-value stores) and collections (which are collections of documents). Documents will also frequently point to subcollections, which contain other documents, which themselves can contain other documents, and so on.
This structured data helps you out in two ways. First, all queries are shallow, meaning that you can request a document without grabbing all the data underneath. This means you can keep your data stored hierarchically in a way that makes more sense to you without having to worry about keeping your database shallow. Second, you have more powerful queries. For instance, you can now query across multiple fields without having to create those "combo" fields that combine (and denormalize) data from other parts of your database. In some cases, Cloud Firestore will just run those queries directly, and in other cases, it will automatically create and maintain indexes for you.
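The "shallow query" point can be sketched with a toy in-memory model. This is not the Firebase SDK: the dictionaries, paths, and helper functions below are all hypothetical and only mimic the two data shapes described above.

```python
# Realtime Database style: one JSON tree. Reading a node returns everything below it.
rtdb_tree = {
    "users": {
        "alice": {
            "name": "Alice",
            "posts": {
                "p1": {"title": "Hello"},
                "p2": {"title": "Again"},
            },
        }
    }
}


def rtdb_read(tree, path):
    """Walk the tree along a slash-separated path and return the whole subtree."""
    node = tree
    for key in path.split("/"):
        node = node[key]
    return node


# Firestore style: each document is stored under its own path, so subcollection
# documents are separate entries rather than children of the parent's data.
firestore_docs = {
    "users/alice": {"name": "Alice"},
    "users/alice/posts/p1": {"title": "Hello"},
    "users/alice/posts/p2": {"title": "Again"},
}


def firestore_get(docs, path):
    """Return only the fields of the requested document (a shallow read)."""
    return docs[path]


deep = rtdb_read(rtdb_tree, "users/alice")         # includes all posts
shallow = firestore_get(firestore_docs, "users/alice")  # only the user's own fields
print(deep)
print(shallow)
```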
Designed to Scale -- Cloud Firestore will be able to scale better than the Realtime Database. It's important to note that your queries scale to the size of your result set, not your data set. So searching will remain fast no matter how large your data set might become.
Easier manual fetching of data -- Like the Realtime Database, you can set up listeners in Cloud Firestore to stream in changes in real-time. But if you don't want that kind of behavior, and just want a simple "fetch my data" call, Cloud Firestore has that as well, and it's built in as a primary use case. (They're much better than the once calls in Realtime Database-land)
Multi region support -- This basically means more reliability, as your data is shared across multiple data centers at once. But you still have strong consistency, meaning you can always make a query and be assured that you're getting the latest version of your data.
Different pricing model -- While the Realtime Database primarily charges based on storage or network bandwidth, Cloud Firestore primarily charges based on the number of operations you perform. Will this be better, or worse? It depends on your app.
For powering a news app, turn-based multiplayer game, or something like your own version of Stack Overflow, Cloud Firestore will probably look pretty favorable from a pricing standpoint. For something like a real-time group drawing app where you're sending across multiple updates a second to multiple people, it probably will be more expensive than the Realtime Database.
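A rough way to reason about the two pricing models is to compare them for your own traffic shape. The sketch below uses made-up placeholder rates and made-up workloads, not real Firebase prices, so only the direction of the comparison is meaningful; plug in the current published rates before drawing conclusions.

```python
# Placeholder rates: these are NOT real Firebase prices.
PRICE_PER_100K_OPS = 0.10        # hypothetical cost per 100,000 operations
PRICE_PER_GB_TRANSFERRED = 1.00  # hypothetical cost per GB of bandwidth


def per_operation_cost(operations):
    """Cost under an operation-based model (Cloud Firestore style)."""
    return operations / 100_000 * PRICE_PER_100K_OPS


def per_bandwidth_cost(operations, bytes_per_operation):
    """Cost under a bandwidth-based model (Realtime Database style)."""
    gigabytes = operations * bytes_per_operation / 1e9
    return gigabytes * PRICE_PER_GB_TRANSFERRED


# Hypothetical workloads: a news app does few, large reads;
# a drawing app does a huge number of tiny updates.
news_ops, news_bytes = 1_000_000, 20_000
draw_ops, draw_bytes = 500_000_000, 50

print("news:", per_operation_cost(news_ops), per_bandwidth_cost(news_ops, news_bytes))
print("draw:", per_operation_cost(draw_ops), per_bandwidth_cost(draw_ops, draw_bytes))
```

With these placeholder numbers, the news app is cheaper per operation and the drawing app is cheaper per byte, which is the pattern described above.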
Why you still might want to use the Realtime Database -- It comes down to a few reasons.
That whole "it'll probably be cheaper for apps that make lots of frequent updates" thing I mentioned previously,
It's been around for a long time and has been battle tested by thousands of apps,
It's got better latency and when you need something with reliably low latency for a real-timey feel, the Realtime Database might work better.
For most new apps, we recommend you check out Cloud Firestore. But if you have an app that's already on the Realtime Database, I don't really recommend switching just for the sake of switching, unless you have a compelling reason to do so.
Reasons to choose Cloud Firestore over Realtime Database
It is an improved version
Firebase database was enough for basic applications. But it was not powerful enough to handle complex requirements. That is why Cloud Firestore is introduced. Here are some major changes.
The basic file structure is improved.
Offline support for the web client.
Supports more advanced querying.
Write and transaction operations are atomic.
Reliability and performance improvements
Scaling will be automatic.
Will be more secure.
Pricing
In Cloud Firestore, rates are lower, even though it charges primarily for operations performed in your database, along with bandwidth and storage. You can set a daily spending limit too. Here are the complete details about billing.
Future plans of Google
When they discovered the flaws in the Real-time Database, they created another product rather than improving the old one. Even though there are no reliable details revealing their current position on the Real-time Database, it is time to start thinking that it is likely to be abandoned.
I suggest this link from Google as well:
Firebase Real-time Database vs Firestore
Extracted from the Google docs, a small summary:
Firebase Realtime Database is a JSON-based NoSQL database, meant for mobile apps, regional, and typically used to store and sync data between users/devices in real time with extremely low latency.
Firestore is a JSON-like NoSQL database meant for high-concurrency, global, easily auto-scaling persistence, designed for any client (not only mobile apps), with typical use cases such as asset tracking, real-time analytics, retail product catalogs, social user profiles, gaming leaderboards, chat-based applications, etc.
Cloud Firestore is Firebase's database for mobile app
development. It builds on the successes of the Realtime Database with
a new, more intuitive data model. Cloud Firestore also features
richer, faster queries and scales further than the Realtime Database.
Realtime Database is Firebase's original database. It's an efficient,
low-latency solution for mobile apps that require synced states
across clients in realtime.
To choose between Firebase Realtime Database and Cloud Firestore based on your application requirements, read the official documentation here.
I have searched for it in many blogs, but it seems all of them present a biased view. I myself have a slight bias towards Prometheus now. However, I did not find any good article that explains a use case of Prometheus for sensor data.
In my case, we manufacture IoT devices and we have a lot of data coming in. Until now we have been using MongoDB for everything. I now want to switch to a time-series database, but I am unsure whether Prometheus is a suitable choice.
I am comfortable writing my own metric converter that converts my sensor data into the Prometheus metrics format (if something doesn't exist already).
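Such a converter could look roughly like the sketch below, which renders one sensor reading as a line in the Prometheus text exposition format (`name{label="value"} value [timestamp_ms]`). The metric name, label names, and the shape of the `reading` dictionary are assumptions for illustration.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus_line(name, labels, value, timestamp_ms=None):
    """Render one sample as a line of the Prometheus text exposition format."""
    label_text = ",".join(
        f'{key}="{escape_label_value(str(val))}"'
        for key, val in sorted(labels.items())
    )
    line = f"{name}{{{label_text}}} {value}"
    if timestamp_ms is not None:
        line += f" {timestamp_ms}"  # optional timestamp, milliseconds since epoch
    return line


# Hypothetical sensor reading; the field names are made up for this example.
reading = {"device_id": "dev-42", "sensor": "temperature", "value": 21.5}

line = to_prometheus_line(
    "sensor_reading",
    {"device_id": reading["device_id"], "sensor": reading["sensor"]},
    reading["value"],
)
print(line)
```

Keep in mind that Prometheus normally scrapes these lines from an HTTP endpoint, so the devices (or a gateway in front of them) would need to expose them.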
Don't feel bad, lots of folks start out trying MongoDB for IoT applications because Mongo claims it's great for IoT. Only problem is, it's terrible for IoT. :-)
What you need is a true Time Series Database (TSDB). If you want to be able to query your data with SQL, try out QuestDB. It's the fastest open source TSDB out there and it's small.
I think I found it: VictoriaMetrics. I haven't seen anything as amazing as VM. First, it supports both the Prometheus and InfluxDB write protocols (and some other time-series database protocols as well), and it supports a query language similar to Prometheus's. It has vmagent, which we can easily run as multiple instances. It has cluster support, and performance-wise there is nothing like it.
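For the InfluxDB write protocol mentioned above, a sensor reading has to be rendered as a line-protocol string (`measurement,tag=value field=value timestamp_ns`). The following is a minimal sketch that only handles float fields and simple measurement names; the measurement, tag, and field names are hypothetical.

```python
def escape_tag(text):
    """Backslash-escape commas, equals signs and spaces in tag/field keys and tag values."""
    return text.replace(",", "\\,").replace("=", "\\=").replace(" ", "\\ ")


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Render one point as an InfluxDB line-protocol string (float fields only)."""
    tag_text = "".join(
        f",{escape_tag(key)}={escape_tag(val)}" for key, val in sorted(tags.items())
    )
    field_text = ",".join(
        f"{escape_tag(key)}={float(val)}" for key, val in sorted(fields.items())
    )
    return f"{measurement}{tag_text} {field_text} {timestamp_ns}"


# Hypothetical device reading.
point = to_line_protocol(
    "sensor",
    {"device_id": "dev-42"},
    {"temperature": 21.5, "humidity": 40},
    1600000000000000000,
)
print(point)
```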
I want to design a large-scale web application in the Google Cloud and I need an OLAP system that creates ML models, which I plan to design by sending all data through Pub/Sub into a BigTable data lake. The models are created by Dataproc processes.
The models are deployed to microservices that execute them on data from user sessions. My question is: Where do I store the "normal business data" for these microservices? Do I have to separate the data for the microservices that provide the web application from the data in the data lake, e.g. by using MariaDB instances (one DB per microservice)? Or can I connect them with BigTable?
Regarding the data lake: Are there alternatives to BigTable? Another developer told me that an option is to store data on Google Cloud Storage (Buckets) and access this data with DataProc to save cross-region costs from BigTable.
Wow, lots of questions, lots of hypotheses and lots of possibilities. The best answer is "it all depends on your needs"!
Where do I store the "normal business data" for these microservices?
What do you want to do in these microservices?
Relational data? Use a relational database like MySQL or PostgreSQL on Cloud SQL.
Document-oriented storage? Use Firestore or Datastore if the queries on documents are "very simple" (very). Otherwise, you can look at partner or marketplace solutions like MongoDB Atlas or Elastic.
Or can I connect them with BigTable?
Yes, you can, but do you need this? If you need the raw data before processing, then yes, connect to BigTable and query it!
If not, it's better to have a batch process that pre-processes the raw data and stores only the summary in a relational or document database (better latency for the user, but less detail).
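The "pre-process and store only the summary" idea can be sketched in plain Python as follows. The event fields (`user`, `duration_ms`) are assumptions; in practice the input would come from the data lake and the output would be written to the relational or document database.

```python
from collections import defaultdict


def summarize(events):
    """Collapse raw events into one small summary row per user."""
    totals = defaultdict(lambda: {"events": 0, "total_ms": 0})
    for event in events:
        row = totals[event["user"]]
        row["events"] += 1
        row["total_ms"] += event["duration_ms"]
    return {
        user: {"events": row["events"], "avg_ms": row["total_ms"] / row["events"]}
        for user, row in totals.items()
    }


# Hypothetical raw events as they might sit in the data lake.
raw_events = [
    {"user": "u1", "duration_ms": 100},
    {"user": "u1", "duration_ms": 300},
    {"user": "u2", "duration_ms": 50},
]

summary = summarize(raw_events)
print(summary)
```

The microservice then reads one small row per user instead of scanning the raw events, which is where the latency gain comes from.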
Are there alternatives to BigTable?
It depends on your needs. BigTable is great for high throughput. If you have fewer than 1 million streaming writes per second, you can consider BigQuery. You can also query a BigTable table with the BigQuery engine thanks to federated tables.
BigTable, BigQuery and Cloud Storage are all reachable from Dataproc, so use whichever you need!
Another developer told me that an option is to store data on Google Cloud Storage (Buckets)
Yes, you can stream to Cloud Storage, but be careful: you don't have checksum validation, so you can't be sure that your data hasn't been corrupted.
Note
You can think about your application in another way. If you publish events into Pub/Sub, one common pattern is to process them with Dataflow, at least for the pre-processing; your Dataproc job for training your model will be easier this way!
If you train a TensorFlow model, you can also consider BigQuery ML: not for the training (unless a standard model fits your needs, which I doubt), but for the serving part.
Load your TensorFlow model into BigQuery ML
Simply query your data with BigQuery as the input of your model, submit it to your model and get the prediction immediately. You can store that directly in BigQuery with an INSERT ... SELECT query. The processing for the prediction is free; you pay only for the data scanned in BigQuery!
As I said, lots of possibilities. Narrow your question to get a sharper answer! Anyway, hope this helps.
I had been building my database using Cloud Firestore because this was the easiest to implement. However, the querying capabilities of Firestore are insufficient for what I want to build, mostly due to the fact it can't handle querying inequalities on multiple fields. I need a SQL database.
I have an instance of Google Cloud SQL set up. The integration is far harder than Firebase where you just need to add a Cocoapods Pod. From my research it looks like I need to set up a Cloud SQL proxy, although if there is a simpler way of connecting it, I'd be glad to hear about it.
Essentially, I need a way for an iOS client to read from and write to a SQL database. Cloud SQL seemed like the best, most scalable option (though I'd be open to hearing about alternatives that are easy to implement).
You probably don't want to configure your application to rely on connecting directly to an SQL database. Firestore is a highly scalable database that can handle thousands of connections - MySQL and Postgres do not scale as cleanly.
Instead, you should consider constructing a simple front end service that can be used to query the database and return formatted results. There are a variety of benefits to structuring this way, including being able to further optimize or distribute your queries. Google AppEngine and Google Cloud Functions can both be used to stand up such a service quickly, and both provide easy connection options to Cloud SQL.
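A rough sketch of what the body of such a front-end service might look like is below. It uses sqlite3 as a local stand-in for Cloud SQL and leaves out the HTTP layer; the `items` table, the parameter names, and the `handle_search` function are all hypothetical. The point is that the client sends simple parameters and the service owns the SQL, including range filters on more than one column.

```python
import json
import sqlite3


def make_demo_db():
    """Create an in-memory database standing in for the real Cloud SQL instance."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT, price REAL, stock INTEGER)")
    conn.executemany(
        "INSERT INTO items VALUES (?, ?, ?)",
        [("pen", 1.5, 100), ("book", 12.0, 3), ("lamp", 30.0, 0), ("desk", 150.0, 7)],
    )
    return conn


def handle_search(conn, params):
    """Validate client parameters, run a parameterized query, return JSON text."""
    max_price = float(params.get("max_price", "1e9"))
    min_stock = int(params.get("min_stock", "0"))
    rows = conn.execute(
        "SELECT name, price, stock FROM items "
        "WHERE price <= ? AND stock >= ? ORDER BY price",
        (max_price, min_stock),
    ).fetchall()
    return json.dumps([{"name": n, "price": p, "stock": s} for n, p, s in rows])


# What the iOS client would receive for ?max_price=50&min_stock=1
response = handle_search(make_demo_db(), {"max_price": "50", "min_stock": "1"})
print(response)
```

Because the client never sees database credentials or raw SQL, the queries can later be optimized or moved to another database without shipping a new app version.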
I’ve found that querying with Firestore is best designed around your front end needs. Using nested sub collections, the ref property or document/collection id relationships can get you most of what you need for the front end.
You could also use Firebase Functions, written in most of the major languages, to perform stateless transactions against Cloud SQL, Spanner or any other GCP database instance.
Alternatively, you could push container images to Google Container Registry and easily deploy them to Kubernetes Engine, Compute Engine or Cloud Run. Each has its own trade-offs and advantages.
One advantage to using Firestore is to easily tie users with authentication {uid}; rules to protect the backend; custom claims for role based permissions on the front end and access to real-time streams as observables with extremely low latency.
Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, Google, et al. perform the large-scale (e.g. multi-TB range) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.
I know that the general approach is to use map reduce to distribute each query over a cluster (e.g. using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a time stamp, and that in general the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies in that data.
Would a column-oriented DB like Big Table (or HBase) be an efficient way to store, and more importantly, query such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, eg. a reverse index?
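To illustrate the access pattern I have in mind, here is a toy sketch (not Bigtable or HBase code): events are kept sorted by timestamp, each column is stored separately, and a time-slice query becomes two binary searches plus an aggregate over a single column. The column names and values are made up.

```python
from bisect import bisect_left, bisect_right

# Toy column-oriented layout: one list per column, all ordered by timestamp.
timestamps = [100, 105, 110, 120, 130, 145]
latency_ms = [12, 15, 11, 40, 13, 14]


def mean_latency(start, end):
    """Average latency for events with start <= timestamp <= end."""
    lo = bisect_left(timestamps, start)   # first event at or after start
    hi = bisect_right(timestamps, end)    # one past the last event at or before end
    window = latency_ms[lo:hi]            # only the needed column is touched
    return sum(window) / len(window) if window else None


result = mean_latency(105, 130)
print(result)
```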
Unfortunately there is no one size fits all answer.
I am currently using Cascading, Hadoop, S3, and Aster Data to process hundreds of gigabytes a day through a staged pipeline inside AWS.
Aster Data is used for the queries and reporting since it provides a SQL interface to the massive data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.
Keep in mind that tools like HBase and Hypertable are key/value stores, so they can't do ad-hoc queries and joins without the help of a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.
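As an illustration of an out-of-band join, here is a minimal reduce-side join written in plain Python. It is not the Cascading or Hadoop API; the record shapes and function names are hypothetical. Both data sets are mapped to (key, tagged record) pairs, grouped by key, and joined in the reduce step.

```python
from collections import defaultdict


def map_phase(users, clicks):
    """Emit (join key, tagged record) pairs from both inputs."""
    for user in users:
        yield user["user_id"], ("user", user)
    for click in clicks:
        yield click["user_id"], ("click", click)


def reduce_phase(pairs):
    """Group pairs by key, then join each user with that user's clicks (inner join)."""
    groups = defaultdict(lambda: {"user": None, "clicks": []})
    for key, (kind, record) in pairs:
        if kind == "user":
            groups[key]["user"] = record
        else:
            groups[key]["clicks"].append(record)

    joined = []
    for key in sorted(groups):
        group = groups[key]
        if group["user"] is None:
            continue  # click with no matching user: dropped by the inner join
        for click in group["clicks"]:
            joined.append(
                {"user_id": key, "name": group["user"]["name"], "url": click["url"]}
            )
    return joined


users = [{"user_id": 1, "name": "Ann"}, {"user_id": 2, "name": "Bo"}]
clicks = [
    {"user_id": 1, "url": "/a"},
    {"user_id": 2, "url": "/b"},
    {"user_id": 1, "url": "/c"},
    {"user_id": 3, "url": "/d"},
]

joined = reduce_phase(map_phase(users, clicks))
print(joined)
```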
In full disclosure, I am a developer on the Cascading project.
http://www.asterdata.com/
http://www.cascading.org/
The book Hadoop: The Definitive Guide by O'Reilly has a chapter that discusses how Hadoop is used at two real-world companies.
http://my.safaribooksonline.com/9780596521974/ch14
Have a look at the paper Interpreting the Data: Parallel Analysis with Sawzall by Google. This is a paper on the tool Google uses for log analysis.