Graphite, Elasticsearch, CloudWatch, Prometheus, and InfluxDB are all supported backends for Grafana. I am creating an application with a Grafana front-end, but I am not able to understand how these backends differ and which would be best for my application (I would prefer open source). My use case is a static log file imported from an external server, which I want to parse and load into the database to be consumed by Grafana. The data can have up to 5000 time-series data points for about 100 measurements. The database need not be distributed. I would be glad to get some tips on how to select a backing database from these. Thanks in advance!
Good answer by Brian, but adding more. You have to think about monitoring as three sets of data, which unfortunately means that in OSS you need a large mix of tools and projects. The fundamentals of monitoring consist of metrics (numbers, which Grafana is good at visualizing), events (unstructured text, which ELK is good at collecting and visualizing), and metadata (relationships, configuration, and other elements which span the other two categories).
Most people will use different technology stacks for each.
Metrics:
Graphite - Old, but well proven (uses Whisper, an RRD-style data store)
InfluxDB - Newest, but less proven. Probably the best technology today
Prometheus - Uses its own binary, file-based data store.
Events:
ElasticSearch - Java-based unstructured data store; needs a lot of hardware to scale.
Once you have the metrics and events to visualize, you'll need a bunch of tools. On ElasticSearch the ELK stack is most common: E = ElasticSearch, L = Logstash (log ingestion), K = Kibana (visualization). Another alternative is Graylog, which is better than Kibana IMHO.
Grafana is common, but not the best visualization tool. Unfortunately, the OSS tools out there just aren't great with metrics today.
That sounds like an event logging use case, so Elasticsearch is probably your best bet.
For metrics use cases, Prometheus would be a good choice.
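If you go the Elasticsearch route, ingestion can be as simple as parsing the file and bulk-indexing it. Here is a minimal Python sketch; the line format ("timestamp measurement value"), the index name "app-logs", and the localhost:9200 address are all assumptions of mine, not something from your post:

    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch("http://localhost:9200")

    def read_events(path):
        # Assumed line format: "<ISO timestamp> <measurement> <value>"
        with open(path) as f:
            for line in f:
                ts, measurement, value = line.split()
                yield {
                    "_index": "app-logs",
                    "_source": {
                        "@timestamp": ts,
                        "measurement": measurement,
                        "value": float(value),
                    },
                }

    helpers.bulk(es, read_events("app.log"))

Grafana's Elasticsearch data source can then be pointed at that index and use the @timestamp field for the time axis.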
The goal of the project is to plot the x,y,z coordinates (attrs from an entity) in a 3D graph which updates as they change.
Note: it's not important how the values of x, y, z change; it can be, for example, by hand from the prompt using curl.
At first, I thought about using QuantumLeap, CrateDB and Grafana, but when I deployed it I realised that Grafana doesn't support the CrateDB plugin anymore (https://community.grafana.com/t/plugin-cratedb-not-available/17165), and I got errors (I tried using PostgreSQL as explained here: https://crate.io/a/pair-cratedb-with-grafana-6-x/).
At this point, I would like to ask for some recommendations: do you think I need to work with time-series data? If not, how should I address the problem? If yes, can I use another database manager with QuantumLeap, supported by Grafana, that works with this time-series format? Or should I perhaps not use Grafana, and instead access the time-series data in the Crate database manually via some front-end software that shows the 3D graph?
This is all a matter of question framing. Because the data format is well defined, you can indirectly use any tool with any NGSI Context Broker.
The problem can be broken down into the following steps:
What Graphing/Business Intelligence tools are available?
What databases do they support?
Which FIWARE Components can push data into a supported Database?
Now the simplest answer (given the user's needs), and the one proposed in the question, is to use Grafana - the Postgres plugin for Grafana will read from a CrateDB database, and the QuantumLeap component can persist time-series data into CrateDB, which is compatible with the Postgres wire protocol. An example of how to do this can be found in the QuantumLeap documentation.
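Because CrateDB speaks the Postgres wire protocol, any standard Postgres client can read the data QuantumLeap persisted, which is essentially what the Grafana plugin does. A rough sketch in Python, where the host, credentials and table name are assumptions (QuantumLeap derives the table name from the NGSI entity type, e.g. "Robot" -> doc.etrobot):

    import psycopg2

    # CrateDB's Postgres wire endpoint; connection details and table name assumed.
    conn = psycopg2.connect(host="localhost", port=5432, user="crate", dbname="doc")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT time_index, x, y, z FROM doc.etrobot "
            "ORDER BY time_index DESC LIMIT 100"
        )
        for time_index, x, y, z in cur.fetchall():
            print(time_index, x, y, z)
    conn.close()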
However, you could use a component such as Draco or Cygnus to persist your data to a database (Draco is easier here, since you could write a custom NiFi step to push data in your preferred format).
Alternatively, you could use the Cosmos Spark or Flink connectors to listen to an incoming stream of context data and persist it to a database.
Or you could write a custom microservice which listens to the NGSI notification endpoint (fed by a subscription), interprets the payload, and pushes it to the database of your choice.
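As a rough sketch of that last option, assuming an NGSI v2 subscription pointing at /notify and attribute names x, y and z (SQLite is only used here to keep the example self-contained), the microservice could look something like this:

    import sqlite3
    from flask import Flask, request

    app = Flask(__name__)
    db = sqlite3.connect("positions.db", check_same_thread=False)
    db.execute(
        "CREATE TABLE IF NOT EXISTS positions "
        "(entity_id TEXT, ts TEXT, x REAL, y REAL, z REAL)"
    )

    @app.route("/notify", methods=["POST"])
    def notify():
        payload = request.get_json()
        # NGSI v2 notifications carry the changed entities in the "data" array.
        for entity in payload.get("data", []):
            db.execute(
                "INSERT INTO positions VALUES (?, datetime('now'), ?, ?, ?)",
                (entity["id"],
                 entity["x"]["value"],
                 entity["y"]["value"],
                 entity["z"]["value"]),
            )
        db.commit()
        return "", 204

    if __name__ == "__main__":
        app.run(port=5050)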
Once you have the data in a database, there are plenty of tools other than Grafana available - consider using the Knowage Engine or Apache Superset, for example.
I have searched for this in many blogs, but it seems they all present a biased view. I myself am now leaning a little towards Prometheus; however, I did not find any good article that explains a use case of Prometheus for sensor data.
In my case, we manufacture IoT devices and we have a lot of data coming in. Until now we have been using MongoDB for everything, but now I want to switch to a time-series database, and I am really confused about whether I can choose Prometheus or not.
I am comfortable writing my own metric converter that can convert my sensor data into the Prometheus metrics format (if something doesn't already exist).
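For reference, the official Python client already does most of that conversion work. A minimal sketch, where the metric name, label and read_sensor() are placeholders of mine rather than anything from the actual setup:

    import random
    import time

    from prometheus_client import Gauge, start_http_server

    # Metric and label names are placeholders.
    temperature = Gauge("sensor_temperature_celsius",
                        "Last temperature reading", ["device_id"])

    def read_sensor(device_id):
        # Placeholder for the real device read-out.
        return 20.0 + random.random()

    if __name__ == "__main__":
        start_http_server(8000)   # Prometheus scrapes http://host:8000/metrics
        while True:
            temperature.labels(device_id="device-42").set(read_sensor("device-42"))
            time.sleep(15)

Keep in mind Prometheus pulls: the server scrapes this endpoint on its own schedule, so this fits current-state gauges better than pushing a backlog of historical readings.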
Don't feel bad; lots of folks start out trying MongoDB for IoT applications because Mongo claims it's great for IoT. The only problem is, it's terrible for IoT. :-)
What you need is a true Time Series Database (TSDB). If you want to be able to query your data with SQL, try out QuestDB. It's the fastest open source TSDB out there and it's small.
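For example, since QuestDB speaks the Postgres wire protocol, you can query it from Python with an ordinary Postgres driver. A sketch, where port 8812 and the admin/quest credentials are the QuestDB defaults as far as I know, and the table and column names are made up:

    import psycopg2

    # QuestDB's Postgres wire endpoint; defaults and table layout assumed.
    conn = psycopg2.connect(host="localhost", port=8812,
                            user="admin", password="quest", dbname="qdb")
    with conn.cursor() as cur:
        cur.execute(
            "SELECT device_id, avg(temperature) "
            "FROM sensor_data "
            "WHERE timestamp > dateadd('h', -1, now()) "
            "GROUP BY device_id"
        )
        print(cur.fetchall())
    conn.close()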
I think I found it: VictoriaMetrics. I haven't seen anything as amazing as VM. First, it supports both the Prometheus and InfluxDB write protocols (and not just these; it supports some other time-series database protocols as well) and a query language similar to Prometheus's. It has vmagent, of which you can easily run multiple instances. It has cluster support, and performance-wise there is nothing like it.
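As a small illustration of the Influx write support, you can push a sample to a single-node VictoriaMetrics instance with nothing but an HTTP POST (port 8428 is the default; the measurement and tag names are made up):

    import time
    import requests

    # InfluxDB line protocol sample; timestamp is in nanoseconds.
    line = "sensor_temperature,device_id=device-42 value=21.7 {}".format(time.time_ns())
    resp = requests.post("http://localhost:8428/write", data=line)
    resp.raise_for_status()

The sample then becomes queryable with PromQL/MetricsQL; as far as I know VM names it by joining the measurement and field, e.g. sensor_temperature_value.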
How does Apache Kudu compare with InfluxDB for IoT sensor data that requires fast analytics (e.g. robotics)?
Kudu has recently released v1.0. I have a few specific questions on how Kudu handles the following:
Sharding?
Data retention policies (keeping data for a specified number of data points, or time and aggregating/discarding data thereafter)?
Is there roll-up/aggregation functionality (e.g. converting 1s interval data into 1min interval data)?
Is there support for continuous queries (i.e. materialised views on the data - e.g. a query over the last 60 seconds, maintained on an ongoing basis)?
How is the data stored between disk and memory?
Can regular time series be induced from an irregular one (converting irregular event data into regular time intervals)?
Also are there any other distinct strengths and/or weaknesses between Kudu and InfluxDB?
Kudu is a much lower-level datastore than InfluxDB. It's more like a distributed file system that provides a few database-like features than a full-fledged database. It currently relies on a query engine such as Impala for finding data stored in Kudu.
Kudu is also fairly young. It would likely be possible to build a time-series database with Kudu as the distributed store underneath it, but currently the closest implementation to that would be this proof-of-concept project.
As for the answers to your questions:
1) Kudu stores data in tablets and offers two ways of partitioning data: range partitioning and hash-based partitioning (there is a small sketch of both after this list).
2) No, although if the data is structured with range partitioning, dropping a tablet should be an efficient operation (similar to how InfluxDB drops whole shards when deleting data).
3) Query engines that work with Kudu, such as Impala or Spark, are able to do this.
4) Impala does have some support for views
5) Data is stored in a columnar format similar to Parquet; however, Kudu's big selling point is that it allows the columnar data to be mutable, which is something that is very difficult with current Parquet files.
6) While I'm sure you could get Spark or Impala to do this, it's not a built-in feature.
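A hedged sketch of points 1 and 3, going through Impala with the impyla client; the host, port and all table/column names are my own assumptions, and your Impala deployment needs to be configured against the Kudu cluster for STORED AS KUDU to work:

    from impala.dbapi import connect

    conn = connect(host="impala-host", port=21050)   # assumed Impala daemon
    cur = conn.cursor()

    # Point 1 - partitioning: hash on the series key, range on time.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            host STRING,
            metric STRING,
            ts BIGINT,            -- epoch seconds
            val DOUBLE,
            PRIMARY KEY (host, metric, ts)
        )
        PARTITION BY HASH (host, metric) PARTITIONS 4,
                     RANGE (ts) (PARTITION 0 <= VALUES < 2000000000)
        STORED AS KUDU
    """)

    # Point 3 - roll-up: aggregate 1-second points into 1-minute averages at query time.
    cur.execute("""
        SELECT host, metric, ts - (ts % 60) AS minute_ts, avg(val) AS avg_val
        FROM metrics
        GROUP BY host, metric, ts - (ts % 60)
    """)
    print(cur.fetchall())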
Kudu is still a new project, and it is not really designed to compete with InfluxDB but rather to provide a highly scalable and highly performant storage layer for a service like InfluxDB. The ability to append data to a Parquet-like data structure is really exciting, though, as it could eliminate the need for lambda architectures.
I have a question concerning graph databases. Is there a mechanism to use graph databases in a distributed environment? I mean, can you distribute a graph database? Can we even traverse a graph database in a distributed environment?
You can definitely do it.
There are several databases which scale very well nowadays (JanusGraph, OrientDB, ArangoDB, etc.).
Even if you have a very big database which has to be scaled beyond a single datacenter to multiple geo-distributed datacenters, you still have options.
For example, you can use JanusGraph with a Cassandra or ScyllaDB storage backend. That gives you the option to asynchronously synchronize all your data across different datacenters.
Of course, there are some issues to be solved, like consistency and so on, but with today's tools it's very possible to organize a distributed graph database.
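As a small illustration, a Gremlin traversal against a JanusGraph server looks the same whether the storage backend is a single node or a geo-distributed Cassandra/ScyllaDB cluster. A sketch with gremlinpython, where the server URL and the vertex labels/properties are made up:

    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    # Assumed Gremlin Server endpoint exposed by JanusGraph.
    conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
    g = traversal().withRemote(conn)

    # Find people one hop away from "alice"; the traversal is written the same
    # way regardless of how the underlying storage is distributed.
    names = g.V().has("person", "name", "alice").out("knows").values("name").toList()
    print(names)

    conn.close()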
Neo4j Enterprise Edition features clustering; read more at http://neo4j.com/docs/stable/ha.html.
Yes, you can use all sorts of graph databases in distributed environments. Can you distribute a graph database? Definitely yes.
BUT - distributing the same graph database to many different places (to speed up reads) is quite easy and done all the time. Distributing a ridiculously massive database (so that parts of the graph are in a bunch of different places) is quite hard.
I recommend this related question which talks about sharding and distributing databases. Pay particular attention to the bit about "sharding is an anti-pattern".
Can anyone point me to a reference or provide a high-level overview of how companies like Facebook, Yahoo, Google, etc. perform the large-scale (e.g. multi-TB) log analysis that they do for operations and especially web analytics?
Focusing on web analytics in particular, I'm interested in two closely-related aspects: query performance and data storage.
I know that the general approach is to use MapReduce to distribute each query over a cluster (e.g. using Hadoop). However, what's the most efficient storage format to use? This is log data, so we can assume each event has a timestamp and that in general the data is structured and not sparse. Most web analytics queries involve analyzing slices of data between two arbitrary timestamps and retrieving aggregate statistics or anomalies in that data.
Would a column-oriented DB like Bigtable (or HBase) be an efficient way to store, and more importantly, query such data? Does the fact that you're selecting a subset of rows (based on timestamp) work against the basic premise of this type of storage? Would it be better to store it as unstructured data, e.g. a reverse index?
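For what it's worth, here is a minimal sketch of the row-key idea with HBase: if the row key starts with a series identifier plus a timestamp, a time slice becomes a cheap contiguous scan rather than a full-table read. happybase is one common Python client for HBase's Thrift gateway; the table, key layout and column names here are assumptions of mine:

    import happybase

    # Assumed Thrift gateway and key layout: row key = "<site>/<ISO timestamp>".
    conn = happybase.Connection("hbase-thrift-host")
    table = conn.table("weblogs")

    start = b"site.example/2009-06-01T00:00:00"
    stop  = b"site.example/2009-06-02T00:00:00"

    hits = 0
    for row_key, columns in table.scan(row_start=start, row_stop=stop):
        hits += int(columns[b"stats:count"])
    print("hits in range:", hits)

    conn.close()

So timestamp-based row selection doesn't fight the storage model as long as the key is designed for it; ad-hoc aggregation across many series is where MapReduce still comes in.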
Unfortunately there is no one-size-fits-all answer.
I am currently using Cascading, Hadoop, S3, and Aster Data to process hundreds of gigabytes a day through a staged pipeline inside AWS.
Aster Data is used for the queries and reporting since it provides a SQL interface to the massive data sets cleaned and parsed by Cascading processes on Hadoop. Using the Cascading JDBC interfaces, loading Aster Data is quite a trivial process.
Keep in mind that tools like HBase and Hypertable are key/value stores, so they don't do ad-hoc queries and joins without the help of a MapReduce/Cascading app to perform the joins out of band, which is a very useful pattern.
In full disclosure, I am a developer on the Cascading project.
http://www.asterdata.com/
http://www.cascading.org/
The book Hadoop: The Definitive Guide by O'Reilly has a chapter which discusses how Hadoop is used at two real-world companies.
http://my.safaribooksonline.com/9780596521974/ch14
Have a look at the paper Interpreting the Data: Parallel Analysis with Sawzall by Google. This is a paper on the tool Google uses for log analysis.