InfluxDB 2 storage retention / maximum size

I am using InfluxDB 2.2 to store and aggregate data on a gateway device. The environment is fairly limited in terms of disk space, and I do not know at what interval or in what volume data gets ingested. Retention is not much of a requirement; all I want is to make sure that the InfluxDB database does not grow larger than, say, 5 GB.
I know that I could just set restrictive retention bounds, but this does not feel like an ideal solution. Do you see any way to achieve this?

It seems that you are mainly concerned about disk space. If so, there are several workarounds you could try:
Retention policy: this is similar to a TTL in other NoSQL databases, and it lets you delete obsolete data automatically. How long the retention period should be really depends on your use case. You could run the instance for a few days, watch how the disk space grows, and then adjust your retention policy.
Downsampling: "Downsampling is the process of aggregating high-resolution time series within windows of time and then storing the lower resolution aggregation to a new bucket." Not all data needs to be available at full resolution at all times. Most of the time, the fresher the data (i.e. hot data), the more frequently it is fetched; for historical data you often only need the big picture, i.e. a less granular view. For example, if you are collecting data at second-level granularity, you could run a downsampling task that retains only the mean of the values at minute or even hour precision. That will save you a lot of space while hardly affecting your trend views (see the sketch below).
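As a rough illustration of both ideas, here is a minimal sketch using the influxdb-client-java library. The bucket names (raw, downsampled), the one-week retention period, the task schedule, and the connection details are all assumptions made for the example, not part of the original question:

```java
import com.influxdb.client.InfluxDBClient;
import com.influxdb.client.InfluxDBClientFactory;
import com.influxdb.client.domain.Bucket;
import com.influxdb.client.domain.BucketRetentionRules;
import com.influxdb.client.domain.Organization;

import java.util.Collections;

public class RetentionAndDownsampling {
    public static void main(String[] args) {
        // Placeholder connection details.
        InfluxDBClient client = InfluxDBClientFactory.create(
                "http://localhost:8086", "my-token".toCharArray(), "my-org");

        // 1. Tighten the retention of the hypothetical "raw" bucket to 7 days,
        //    so high-resolution data is expired automatically.
        Bucket raw = client.getBucketsApi().findBucketByName("raw");
        BucketRetentionRules retention = new BucketRetentionRules();
        retention.setEverySeconds(7 * 24 * 3600);
        raw.setRetentionRules(Collections.singletonList(retention));
        client.getBucketsApi().updateBucket(raw);

        // 2. Register a task that runs every hour and writes 1-minute means
        //    into a long-lived "downsampled" bucket.
        String flux = String.join("\n",
                "from(bucket: \"raw\")",
                "  |> range(start: -1h)",
                "  |> aggregateWindow(every: 1m, fn: mean)",
                "  |> to(bucket: \"downsampled\")");
        Organization org = client.getOrganizationsApi().findOrganizations().get(0);
        client.getTasksApi().createTaskEvery("downsample-1m", flux, "1h", org);

        client.close();
    }
}
```

Whether this keeps you under a hard 5 GB cap still depends on your ingest rate, so it is worth monitoring disk usage for a while and tightening the retention period or widening the aggregation window if needed.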

Related

Is the time it takes to retrieve data from disk to RAM linear?

Let's suppose that it takes X seconds to retrieve 100 MB from disk to RAM in one retrieval. Does that mean that it will also take X seconds to retrieve the same 100 MB split into N retrievals of 100 MB/N each, with each retrieval taking roughly X/N seconds?
It depends on various factors such as the storage device, the file system, the operating system and the retrieval method.
In general, when you perform multiple small retrievals, the time it takes to retrieve the data may be longer than if you perform a single large retrieval, due to the overhead of performing multiple operations. This is because the storage device may have to seek to different locations on the disk, which takes time, and the operating system may have to manage multiple requests.
Additionally, if you're using a hard disk drive (HDD) as the storage device, the time it takes to retrieve the data may be longer because hard drives are slower than solid-state drives (SSD) when it comes to random read operations. However, if you're using an SSD, the difference in retrieval time between multiple small retrievals and a single large retrieval may be less significant.
In the end, it's hard to give a general answer without more information about the specific storage device, file system and operating system you're using. But, in general, you should expect that multiple small retrievals might take longer than a single large retrieval.
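If you want to measure this on your own hardware, a rough benchmark is easy to write. The sketch below (file name and chunk counts are placeholders) times reading 100 MB in one call versus in N smaller positioned reads. Keep in mind that the OS page cache can hide the difference, so use a cold cache or a file larger than RAM for a fair comparison.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ReadBenchmark {

    static final int TOTAL_BYTES = 100 * 1024 * 1024;   // 100 MB

    public static void main(String[] args) throws IOException {
        Path file = Path.of("testdata.bin");             // placeholder file, >= 100 MB

        System.out.printf("1 read      : %d ms%n", timeReads(file, 1));
        System.out.printf("100 reads   : %d ms%n", timeReads(file, 100));
        System.out.printf("10000 reads : %d ms%n", timeReads(file, 10_000));
    }

    // Reads TOTAL_BYTES from the file in `chunks` equally sized positioned reads
    // and returns the elapsed wall-clock time in milliseconds.
    static long timeReads(Path file, int chunks) throws IOException {
        int chunkSize = TOTAL_BYTES / chunks;
        ByteBuffer buffer = ByteBuffer.allocateDirect(chunkSize);
        long start = System.nanoTime();
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            for (int i = 0; i < chunks; i++) {
                buffer.clear();
                channel.read(buffer, (long) i * chunkSize);
            }
        }
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```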

Xodus high insertion rate

I'm using Xodus for storing time-series data (100-500 million rows are inserted daily.)
I have multiple stores in one environment. A new store is created every day, and older stores (created more than 30 days ago) can be deleted. Recently my total environment size grew to 500 GB.
Reading/writing speed degraded dramatically. After an initial investigation it turned out that the Xodus background cleaner thread was consuming almost all I/O resources: iostat showed almost 90% utilization, with 20 MB/s of reads and 0 MB/s of writes.
I decided to give the background cleaner thread some time to clean up the environment, but it kept running for a few days, so eventually I had to delete the whole environment.
Xodus is a great tool, but it looks to me like I made the wrong choice: Xodus is not designed for inserting huge amounts of data, due to its append-only modification design. If you insert too much data, the background cleaner thread cannot compact it and ends up consuming all I/O.
Can you advise any tips and tricks for working with large data sizes in Xodus? I could create a new environment every day instead of creating a new store.
If you are okay with fetching data from different environments, then you will definitely benefit from creating an instance of Environment every day instead of an instance of Store. In that case, the GC will only ever work on a single day's worth of data. The insertion rate will be more or less constant, whereas fetching will slowly degrade as the total amount of data grows.
If working with several environments within a single JVM, make sure the exodus.log.cache.shared setting of EnvironmentConfig is set to true.
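A minimal sketch of that layout, assuming Xodus' Environments/EnvironmentConfig API (the directory naming, store name, and bindings here are illustrative; setLogCacheShared corresponds to the exodus.log.cache.shared setting mentioned above):

```java
import jetbrains.exodus.bindings.LongBinding;
import jetbrains.exodus.bindings.StringBinding;
import jetbrains.exodus.env.Environment;
import jetbrains.exodus.env.EnvironmentConfig;
import jetbrains.exodus.env.Environments;
import jetbrains.exodus.env.Store;
import jetbrains.exodus.env.StoreConfig;

import java.time.LocalDate;

public class DailyEnvironments {
    public static void main(String[] args) {
        // Share the log cache so several environments open in one JVM
        // don't each allocate their own cache (exodus.log.cache.shared).
        EnvironmentConfig config = new EnvironmentConfig();
        config.setLogCacheShared(true);

        // One environment per day, e.g. /data/ts/2024-01-15; directories older
        // than 30 days can simply be closed and deleted.
        String dir = "/data/ts/" + LocalDate.now();
        Environment env = Environments.newInstance(dir, config);

        env.executeInTransaction(txn -> {
            Store store = env.openStore("points", StoreConfig.WITHOUT_DUPLICATES, txn);
            // Illustrative write: timestamp key -> serialized value.
            store.put(txn,
                    LongBinding.longToEntry(System.currentTimeMillis()),
                    StringBinding.stringToEntry("value"));
        });

        env.close();
    }
}
```

Queries that span several days then need to open (or keep open) the environments for the days involved, which matches the slow fetch degradation described above.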

Is one large sorted set or many small sorted sets more memory performant in Redis

I'm trying to design a data abstraction for Redis using sorted sets. My scenario is that I would have either ~60 million keys in one large sorted set or ~2 million small sorted sets with maybe 10 keys each. In either scenario the functions I would be using are O(log(N)+M), so time complexity isn't a concern. What I am wondering is what the trade-offs are in terms of memory impact. Having many sorted sets would allow for more flexibility, but I'm unsure whether the memory cost would become a problem. I know Redis says it now optimizes memory usage for smaller sorted sets, but it's unclear to me by how much and what size is too big.
Having many small sorted sets would also help spread the load over different Redis instances, in case the data set grows beyond a single host's memory limit.
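If you want to see the effect of the small-sorted-set optimization concretely, you can check the encoding and the per-key memory that Redis reports for representative keys. A rough sketch with the Jedis client (key names and sizes are made up; the threshold for the compact encoding is governed by zset-max-listpack-entries / zset-max-ziplist-entries depending on your Redis version):

```java
import redis.clients.jedis.Jedis;

public class SortedSetMemoryCheck {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // A small sorted set with 10 members: below the configured threshold
            // it stays in the compact listpack/ziplist encoding.
            for (int i = 0; i < 10; i++) {
                jedis.zadd("small:1", i, "member-" + i);
            }

            // A large sorted set with 100k members: this is converted to the
            // regular skiplist encoding, which costs more memory per element.
            for (int i = 0; i < 100_000; i++) {
                jedis.zadd("big", i, "member-" + i);
            }

            System.out.println("small encoding: " + jedis.objectEncoding("small:1"));
            System.out.println("big encoding:   " + jedis.objectEncoding("big"));
            System.out.println("small bytes:    " + jedis.memoryUsage("small:1"));
            System.out.println("big bytes:      " + jedis.memoryUsage("big"));
        }
    }
}
```

Comparing bytes-per-member of the big set against bytes of one small set multiplied by ~2 million should give a first-order answer for your data. Keep in mind that every key also carries a fixed dictionary overhead, so 2 million small sets pay for 2 million key entries on top of the set contents.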

SpreadsheetGear -- Generating a large report via copy and paste seems to use a lot of memory and processor

I am attempting to generate a large workbook-based report with three supporting worksheets of 100, 12,000 and 12,000 rows and a final, entirely formula-based output sheet that ends up representing about 120 entities at 100 rows apiece. I generate a template range and copy and paste it, replacing the entity ID cell after pasting each new range. It is working fine, but I noticed that memory usage in the IIS Express process is approximately 500 MB and it is also using 100% of the processor.
Are there any guidelines for generating workbooks in this manner?
At least in terms of memory utilization, it would help to have some comparison, maybe against Excel, in how much memory is utilized to simply have the resultant workbook opened. For instance, if you were to open the final report in both Excel and the "SpreadsheetGear 2012 for Windows" application (available in the SpreadsheetGear folder under the Start menu), what does the Task Manager measure for each of these applications in terms of memory consumption? This may provide some insight as to whether the memory utilization you are seeing in the actual report-building process is unusually high (is there a lot of extra overhead for your routine?), or just typical given the size of the workbook you are generating.
In terms of CPU utilization, this one is a bit more difficult to pinpoint and is certainly dependent on your hardware as well as implementation details in your code. Running a VS profiler against your routine would certainly be interesting, if you have that tool available to you. Generally speaking, the CPU time could potentially be broken up into a couple of broad categories: CPU cycles used to "build" your workbook and CPU cycles used to "calculate" it. It could be helpful to determine which of these dominates the CPU. One way to do this might be to ensure, if possible, that calculations don't occur until you are finished actually generating the workbook. In fact, avoiding any unnecessary calculations could potentially speed things up...it depends on the workbook, though. You can avoid calculations by setting IWorkbookSet.Calculation to Manual mode and not calling any of the IWorkbook "Calculate" methods (Calculate/CalculateFull/CalculateFullRebuild) until you are finished with this process. If you don't have access to a profiler tool, you could set some timers and Console.WriteLines and monitor the Task Manager to see how your CPU fluctuates during different parts of your routine. With any luck you might be able to better isolate which part of the routine is taking the most time.

What do these CloudKit metrics tell me (indexing cost, data usage, limit, metadata storage)?

The CloudKit backend gives me these numbers. What do they tell me?
Should I stop indexing some attributes?
Do I use too much data?
In CloudKit you have a limit on the amount of data you can use for your app, which starts at 5 GB and increases with every user of your app. Besides your actual data, indexes also take up some of this storage. If you think you are coming close to the limits of that free storage, it might help to remove some of the indexes.
