InfluxDB Raspberry: send data periodically to logging host - influxdb

I would like to use InfluxDB with my Raspberry Pi/openHAB home automation.
I am just worried about DB size/performance.
So my plan would be: log only 1 month on the Pi and let it be cleaned automatically.
Cleaning, I understand, is easy with retention policies (automatically clear old data).
For long-term analysis I want to collect all the data on a server.
Now the question: how can I export the data on the Pi into a flat file before retention removes it, and afterwards import that data into a separate InfluxDB on a different server?
(Or even better: is there a way to do this in a sort of cluster mode?)
thanks a lot,
Chris

I use InfluxDB on a Pi for sensor logs. I have been logging 4 records every 5 seconds for more than 3 months and performance on my Pi is really good. I don't have the file size at hand, but it was not more than 10 MB.
You can use InfluxDB in cluster mode, but I am not sure it will answer your question about data cleaning.
To export data, you can use the InfluxDB API to get all series in the database, then all the data, and flush that into a JSON file. You can then use the API to load that file into another DB.
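For illustration, a rough sketch of that retention/export/import flow over the InfluxDB 1.x HTTP API might look like this. The database, measurement and host names are placeholders, and numeric fields without tags are assumed to keep the line-protocol step short; it is not a drop-in solution.

# Sketch only: InfluxDB 1.x HTTP API assumed on both hosts.
import requests

PI = "http://raspberrypi:8086"         # the Pi doing the short-term logging
SERVER = "http://archive-server:8086"  # the long-term InfluxDB
DB = "openhab_db"                      # hypothetical database name
MEASUREMENT = "temperature"            # hypothetical measurement

# 1) Keep only about one month of data on the Pi via a retention policy.
requests.post(f"{PI}/query", params={
    "q": f'CREATE RETENTION POLICY "one_month" ON "{DB}" DURATION 30d REPLICATION 1 DEFAULT'
})

# 2) Export recent points from the Pi as JSON through the query endpoint.
resp = requests.get(f"{PI}/query", params={
    "db": DB,
    "q": f'SELECT * FROM "{MEASUREMENT}" WHERE time > now() - 1d',
    "epoch": "ns",
})
series = resp.json()["results"][0].get("series", [])

# 3) Re-import into the long-term InfluxDB via the line-protocol /write endpoint.
lines = []
for s in series:
    columns = s["columns"]                 # first column is "time"
    for row in s["values"]:
        point = dict(zip(columns, row))
        timestamp = point.pop("time")
        fields = ",".join(f"{k}={v}" for k, v in point.items() if v is not None)
        lines.append(f"{MEASUREMENT} {fields} {timestamp}")

requests.post(f"{SERVER}/write",
              params={"db": DB, "precision": "ns"},
              data="\n".join(lines))

In practice you would run something like this from cron with a window matching the schedule, and write the intermediate JSON/line-protocol to a flat file if you want an on-disk copy between the two steps.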

Related

Real-time computation

I have an algorithm written in Python and MySQL which takes a CSV file and some properties as input and then runs for 20-25 minutes to produce output.
I want to make it real-time, so that if the input CSV is uploaded or a property is changed, the output is updated without needing to re-run the algorithm.
Note: the data the algorithm runs on can be very large.
I need help making this a real-time computation.
I am trying to change MySQL to a NoSQL DB, but it still takes time to run and is not real-time.
You should try using one of the streaming services for event-driven updates. Instead of CSV, write the data to a stream like Kafka or Kinesis. Write a consumer that reads incoming events and updates the result without running the full algorithm. You might be able to use Apache Flink for aggregation against Kafka as well.
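As an illustration, a minimal consumer along those lines might look like this (kafka-python is assumed, and the topic name, grouping column and incremental update are placeholders for your own logic):

import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "input-rows",                                # hypothetical topic name
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Keep a running result instead of re-running the 20-25 minute batch job.
running_totals = {}

for message in consumer:
    row = message.value                          # one record that used to be a CSV line
    key = row["category"]                        # hypothetical grouping column
    running_totals[key] = running_totals.get(key, 0) + row["amount"]
    # ...update only the affected part of the output here,
    # e.g. write the new total for this key back to the database.

The point of the design is that each event touches only the part of the result it affects, so the cost per update is tiny compared to recomputing over the whole dataset.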

Is the query over data stored in remote storage by Prometheus efficient? Do I need Grafana Mimir to scale?

We know that Prometheus has three phases of data storage:
In-memory: this is where the most recent data is stored. It allows for fast queries using PromQL since it is in RAM. [Am I wrong?]
After a few hours the in-memory data is persisted to disk in the form of blocks.
After the data retention period is over, data is stored in remote storage.
I wanted to ask whether it is efficient to query the data stored in remote storage. If I need a lot of metrics to monitor for my org, do I need Grafana Mimir, which handles up to 1 billion active metrics?
Also, as a side question, how many MBs/GBs of metrics can Prometheus store before the retention period is over?
Sparingly. Yes. Prometheus won't like it if you try to query over a few years, for example, since it will go to storage for everything, but getting metrics from storage for an hour is easy and won't be a problem.
How many MBs/GBs of metrics can Prometheus store? It's irrelevant. The retention period is independent of the amount of data stored. You can store 100 MB in a day or 100 GB in a day, it doesn't matter. What will matter is cardinality.
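For example, the kind of query that stays cheap is a short range query like the one below against the Prometheus HTTP API; the server URL and metric name are placeholders:

import time
import requests

PROM = "http://localhost:9090"
now = time.time()

# One hour of data at 60s resolution: served mostly from recent blocks.
resp = requests.get(f"{PROM}/api/v1/query_range", params={
    "query": "rate(http_requests_total[5m])",  # hypothetical metric
    "start": now - 3600,
    "end": now,
    "step": "60s",
})
for series in resp.json()["data"]["result"]:
    print(series["metric"], len(series["values"]), "samples")

The same query with start pushed back a few years is what the answer warns against, since every block in long-term storage has to be touched.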

Query performance on ADLS Gen2

I'm trying to migrate our "old school" database (mostly time series) to an Azure Data Lake.
So I took a random table (10 years of data, 200M records, 20 GB), copied the data into a single CSV file, and also split the same data into 4000 daily files (in monthly folders).
On top of those 2 sets of files, I created 2 external tables... and I'm getting pretty much the same performance for both of them. (?!?)
No matter what I'm querying, whether I'm looking for data on a single day (thus in a single small file) or summing over the whole dataset, it basically takes 3 minutes, whether I'm looking at the single file or the 4000 daily files. It's as if the whole dataset had to be loaded into memory before doing anything.
So is there a setting somewhere that I could change to avoid loading all the data when it's not required? It could literally make my queries 1000x faster.
As far as I understand, indexes are not possible on external tables. Creating a materialized view would defeat the purpose of using a lake.
Full disclosure: I'm new to Azure Data Lake Storage; I'm trying to see if it's the correct technology to address our issue.
Best practice is to use the Parquet format, not CSV. It is a columnar format optimized for OLAP-like queries.
With Synapse (preview), you can then use the SQL on-demand engine (serverless) when you do not need to provision a DW cluster, and you will be charged per TB of scanned data.
Or you can spin up a Synapse cluster and ingest your data using the COPY command into the DW (it is in preview as well).
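For illustration, a rough sketch of converting the CSV export to date-partitioned Parquet with pandas/pyarrow; the file paths and the timestamp column name are placeholders, and for the full 200M-row table you would convert in chunks or let Synapse/Spark do it rather than pandas:

import pandas as pd

df = pd.read_csv("timeseries.csv", parse_dates=["timestamp"])  # hypothetical column
df["year"] = df["timestamp"].dt.year
df["month"] = df["timestamp"].dt.month

# Columnar, compressed files that the query engine can prune by partition and column.
df.to_parquet("timeseries_parquet/", engine="pyarrow", partition_cols=["year", "month"])

With a layout like that, a query for a single day only has to read the matching partition and the columns it references, instead of scanning every CSV byte.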

How to add a database on a server so it is accessible by all computers on the network

I made a sample database for my students to learn SQL.
I created it, added 30 entries to it, and saved it.
I cannot copy the same file to 100 computers in my lab, so please tell me how to do this.
I searched the net but to no avail.
sql> tables
-----------------------------------
dhana
-----------------------------------
task completed in 0.57 seconds
I want to put the same database on 100 computers, but opening each Windows XP computer, copying the file from the network, pasting it and shutting the computer down again would take a long time and is too tedious.
Hmm.
Refer to Computer Science for Class 11 with Python by Sumita Arora and you may find it.
You are not searching the net properly; questions like this are easy to find on Google.
What is your web browser?
You can use something like MySQL Workbench and access your server remotely, although I've never tried 100 simultaneous connections. Another option is to SSH from the clients into your server and use the database CLI. Of course, I assume you want one database and many clients, not many databases.
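To illustrate the one-server/many-clients idea, a lab machine could connect to the shared server like this (mysql-connector-python is assumed; the host name, account and database name are placeholders):

import mysql.connector

# Each of the 100 lab computers connects to the single shared server;
# nothing is copied to the local disk.
conn = mysql.connector.connect(
    host="teacher-pc.local",   # the one machine hosting the database
    user="student",            # an account granted read access, e.g.
                               # GRANT SELECT ON school.* TO 'student'@'%';
    password="student",
    database="school",
)

cur = conn.cursor()
cur.execute("SHOW TABLES")
for (table,) in cur.fetchall():
    print(table)               # would list the sample table, e.g. dhana
conn.close()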

Fetch data subset from gmond

This is in the context of a small data-center setup where the number of servers to be monitored is only in the double digits and may grow only slowly to a few hundred (if at all). I am a Ganglia newbie and have just completed setting up a small Ganglia test bed (and have been reading and playing with it). A couple of things I realise:
gmetad supports interactive queries on port 8652, with which I can get metric data subsets, say the data of a particular metric family in a specific cluster.
gmond seems to always return the whole dump of data for all metrics from all nodes in a cluster (on doing 'netcat host 8649').
In my setup, I don't want to use gmetad or RRD. I want to directly fetch data from the multiple gmond clusters and store it in a single data store. There are a couple of reasons not to use gmetad and RRD:
I don't want multiple data stores in the whole setup. I can have one dedicated machine fetch data from the multiple, few clusters and store it.
I don't plan to use gweb as the data front end. The data from Ganglia will be fed into a different monitoring tool altogether. With this setup, I want to eliminate the latency that another layer of gmetad could add. That is, if gmetad polls, say, every minute and my management tool polls gmetad every minute, that adds 2 minutes of delay, which I feel is unnecessary for a relatively small/medium-sized setup.
There are a couple of problems in this approach for which I need help:
I cannot get filtered data from gmond. Is there some plugin that can help me fetch individual metric/metric-group information from gmond (since different metrics are collected at different intervals)?
gmond output is very verbose text. Is there some other (hopefully binary) format that I can configure for export?
Is my idea of eliminating gmetad/RRD completely a very bad idea? Has anyone tried this approach before? What should I be careful of in doing so, from a data-collection standpoint?
Thanks in advance.
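In case it helps frame the question, here is a rough sketch of what my collector currently has to do: pull the whole XML dump from gmond on port 8649 and filter it on the collector side, since gmond itself returns everything. The host name and metric name are placeholders, and the default XML output format is assumed.

import socket
import xml.etree.ElementTree as ET

def fetch_gmond_xml(host, port=8649):
    """Read the complete GANGLIA_XML document gmond sends on connect."""
    chunks = []
    with socket.create_connection((host, port)) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)

root = ET.fromstring(fetch_gmond_xml("gmond-node-1"))
# Keep only one metric family, e.g. load_one, across all hosts in the cluster.
for host in root.iter("HOST"):
    for metric in host.iter("METRIC"):
        if metric.get("NAME") == "load_one":
            print(host.get("NAME"), metric.get("VAL"))

What I am looking for is a way to avoid transferring and parsing the full dump in the first place.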
