Xodus high insertion rate

I'm using Xodus for storing time-series data (100-500 million rows are inserted daily).
I have multiple stores per environment. A new store is created every day, and older stores (created more than 30 days ago) can be deleted. Recently my total environment size grew to 500 GB.
Read/write speed degraded dramatically. After an initial investigation it turned out that the Xodus background cleaner thread was consuming almost all I/O resources: iostat shows almost 90% utilization with 20 MB/s of reads and 0 MB/s of writes.
I decided to give the background thread some time to clean up the environment, but it kept running for a few days, so eventually I had to delete the whole environment.
Xodus is a great tool, but it looks like I made the wrong choice: Xodus is not designed for inserting huge amounts of data, due to its append-only modification design. If you insert too much data, the background cleaner thread will not be able to compact it and will consume all I/O.
Can you advise any tips and tricks for working with large data sizes in Xodus? I could create a new environment every day instead of creating a new store.

If you are okay with fetching data from different environments, then you will definitely benefit from creating an instance of Environment every day instead of an instance of Store. In that case, the GC will work on only a daily amount of data. The insertion rate will be more or less constant, whereas fetching will slowly degrade as the total amount of data grows.
If working with several environments within a single JVM, make sure the exodus.log.cache.shared setting of EnvironmentConfig is set to true.
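A minimal sketch of that setup, assuming the standard jetbrains.exodus.env API; the directory layout, the store name and the single put are placeholders:

import jetbrains.exodus.bindings.LongBinding;
import jetbrains.exodus.bindings.StringBinding;
import jetbrains.exodus.env.*;
import java.time.LocalDate;

public class DailyEnvironments {
    public static void main(String[] args) {
        // Reuse one config for every environment; the shared log cache
        // (exodus.log.cache.shared) lets all open environments share a single page cache.
        EnvironmentConfig config = new EnvironmentConfig().setLogCacheShared(true);

        // Open (or create) today's environment in its own directory, e.g. /data/2024-05-01
        Environment env = Environments.newInstance("/data/" + LocalDate.now(), config);
        try {
            env.executeInTransaction(txn -> {
                Store store = env.openStore("events", StoreConfig.WITHOUT_DUPLICATES, txn);
                // Placeholder insert: key = timestamp, value = payload
                store.put(txn,
                        LongBinding.longToEntry(System.currentTimeMillis()),
                        StringBinding.stringToEntry("payload"));
            });
        } finally {
            env.close();
        }
        // Environments older than 30 days can simply be closed and their directories
        // deleted, so expired data never has to go through the GC at all.
    }
}

With this layout the cleaner only ever has to compact one day's worth of log files, and dropping expired data becomes a cheap directory delete instead of a long-running garbage-collection job.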

Related

InfluxDB 2 storage retention/max. size

I am using InfluxDB 2.2 to store and aggregate data on a gateway device. The environment is pretty limited regarding space, and I do not know at what interval, or in what volume, the data gets ingested. Retention is not much of a requirement; all I want is to make sure that the InfluxDB database does not grow larger than, let's say, 5 GB.
I know that I could just set restrictive bounds on the retention, but this does not feel like an ideal solution. Do you see any possibility to achieve this?
It seems that you are mostly concerned about disk space. If so, there are several workarounds you could try:
Retention policy: this is similar to TTL in other NoSQL databases, and it helps you delete obsolete data automatically. How long you set the retention policy to really depends on the business you are running. You could run the instance for a few days, see how the disk space grows, and then adjust your retention policy.
Downsampling: "Downsampling is the process of aggregating high-resolution time series within windows of time and then storing the lower resolution aggregation to a new bucket." Not all data needs to be retrieved at all times. Most of the time, the fresher the data (i.e. hot data), the more frequently it will be fetched. What's more, for historical data you might only need the big picture, i.e. less granularity. For example, if you are collecting data at second-level granularity, you could run a downsampling task that retains only the mean of the indicator values at minute or even hour precision. That will save you a lot of space while barely affecting your trend views (a small sketch of the idea follows below).
See more details here.
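InfluxDB does this with downsampling tasks, but the underlying idea is just windowed aggregation. A language-agnostic sketch of the concept (plain Java here, with a made-up Sample record and hard-coded second-resolution input), collapsing per-second points into per-minute means:

import java.util.*;
import java.util.stream.Collectors;

public class Downsample {
    // A hypothetical second-resolution sample: epoch seconds plus a value.
    record Sample(long epochSecond, double value) {}

    public static void main(String[] args) {
        List<Sample> raw = List.of(
                new Sample(1_700_000_000L, 1.0),
                new Sample(1_700_000_030L, 3.0),   // same minute window as the first point
                new Sample(1_700_000_075L, 10.0)); // next minute window

        // Group by 60-second window and keep only the mean per window:
        // storage drops from one point per second to one point per minute.
        Map<Long, Double> perMinuteMean = raw.stream()
                .collect(Collectors.groupingBy(
                        s -> s.epochSecond() / 60,
                        TreeMap::new,
                        Collectors.averagingDouble(Sample::value)));

        perMinuteMean.forEach((minuteWindow, mean) ->
                System.out.println("minute window " + minuteWindow + " -> mean " + mean));
    }
}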

Rising Total Memory on Heroku Dyno

I have a website hosted on a Heroku dyno that allows a maximum of 512MB of memory.
My site allows users to upload raw time series data in CSV format, and I wanted to load test the performance of uploading a CSV with ~100k rows (3.2 MB in size). The UI lets the user upload the file, which in turn kicks off a Sidekiq job to import each row in the file into my database. It stores the uploaded file under /tmp storage on the dyno, which I believe gets cleared on each periodic restart of the dyno.
Everything actually finished without error, and all 100k rows were inserted. But several hours later I noticed my site was almost unresponsive and I checked Heroku metrics.
At the exact time I had started the upload, the memory usage began to grow and quickly exceeded the maximum 512MB.
The logs confirmed this fact -
# At the start of the job
Aug 22 14:45:51 gb-staging heroku/web.1: source=web.1 dyno=heroku.31750439.f813c7e7-0328-48f8-89d5-db79783b3024 sample#memory_total=412.68MB sample#memory_rss=398.33MB sample#memory_cache=14.36MB sample#memory_swap=0.00MB sample#memory_pgpgin=317194pages sample#memory_pgpgout=211547pages sample#memory_quota=512.00MB
# ~1 hour later
Aug 22 15:53:24 gb-staging heroku/web.1: source=web.1 dyno=heroku.31750439.f813c7e7-0328-48f8-89d5-db79783b3024 sample#memory_total=624.80MB sample#memory_rss=493.34MB sample#memory_cache=0.00MB sample#memory_swap=131.45MB sample#memory_pgpgin=441565pages sample#memory_pgpgout=315269pages sample#memory_quota=512.00MB
Aug 22 15:53:24 gb-staging heroku/web.1: Process running mem=624M(122.0%)
I can restart the Dyno to clear this issue, but I don't have much experience in looking at metrics so I wanted to understand what was happening.
If my job finished in ~30 mins, what are some common reasons why the memory usage might keep growing? Prior to the job it was pretty steady.
Is there a way to tell what data is being stored in memory? It would be great to do a memory dump, although I don't know if it will be anything more than hex address data.
What are some other tools I can use to get a better picture of the situation? I can reproduce the situation by uploading another large file to gather more data.
Just a bit lost on where to start investigating.
Thanks!
Edit: We have the Heroku New Relic addon, which also collects data. Annoyingly enough, New Relic reports a different/normal memory usage value for that same time period. Is this common? What is it measuring?
These are the most probable reasons for that:
Scenario 1. You process the whole file by first loading every record from the CSV into memory, doing some processing, and then iterating over it and storing it in the database.
If that's the case, then you need to change your implementation to process the file in batches: load 100 records, process them, store them in the database, repeat (as sketched below). You can also look at the activerecord-import gem to speed up your inserts.
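The original job is Ruby/Sidekiq, but the batching pattern itself is language-agnostic. A rough sketch of the idea in Java over JDBC (the points table, its columns, the SQLite URL and the batch size of 100 are placeholders, and a suitable JDBC driver is assumed to be on the classpath):

import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BatchedCsvImport {
    private static final int BATCH_SIZE = 100; // tune to trade memory for round-trips

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:points.db");
             BufferedReader reader = Files.newBufferedReader(Path.of("upload.csv"));
             PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO points (ts, value) VALUES (?, ?)")) {
            conn.setAutoCommit(false);
            String line;
            int pending = 0;
            while ((line = reader.readLine()) != null) {
                String[] cols = line.split(",");
                insert.setString(1, cols[0]);
                insert.setString(2, cols[1]);
                insert.addBatch();
                // Flush every BATCH_SIZE rows so only a small slice of the file
                // is ever held in memory at once.
                if (++pending % BATCH_SIZE == 0) {
                    insert.executeBatch();
                    conn.commit();
                }
            }
            insert.executeBatch(); // remaining rows
            conn.commit();
        }
    }
}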
Scenario 2. You have a memory leak in your script. Maybe you process in batches, but you hold references to unused objects and they are never garbage collected.
You can find out by using the ObjectSpace module. It has some pretty useful methods.
count_objects will return a hash with counts of the different objects currently on the heap:
ObjectSpace.count_objects
=> {:TOTAL=>30162, :FREE=>11991, :T_OBJECT=>223, :T_CLASS=>884, :T_MODULE=>30, :T_FLOAT=>4, :T_STRING=>12747, :T_REGEXP=>165, :T_ARRAY=>1675, :T_HASH=>221, :T_STRUCT=>2, :T_BIGNUM=>2, :T_FILE=>5, :T_DATA=>1232, :T_MATCH=>105, :T_COMPLEX=>1, :T_NODE=>838, :T_ICLASS=>37}
It's just a hash, so you can look for a specific type of object:
ObjectSpace.count_objects[:T_STRING]
=> 13089
You can plug this snippet into different points in your script to see how many objects are on the heap at a given time. To get consistent results you should manually trigger the garbage collector before checking the counts; this ensures that you only see live objects.
GC.start
ObjectSpace.count_objects[:T_STRING]
Another useful method is each_object, which iterates over all objects currently on the heap:
ObjectSpace.each_object { |o| puts o.inspect }
Or you can iterate over objects of one class:
ObjectSpace.each_object(String) { |o| puts o.inspect }
Scenario 3. You have a memory leak in a gem or system library.
This is like the previous scenario, but the problem lies outside your code. You can also track this down with ObjectSpace. If you see that some objects are retained after calling a library method, there is a chance that this library has a memory leak. The solution would be to update such a library.
Take a look at this repo. It maintains a list of gems with known memory leak problems. If you are using something from this list, I suggest updating it quickly.
Now, addressing your other questions. Even with a perfectly healthy app on Heroku or any other provider, you will see memory increase over time, but it should stabilise at some point. Heroku restarts dynos about once a day, so in your metrics you will see sudden drops followed by a slow increase over a span of a couple of days.
New Relic by default shows averaged data from all instances. You should probably switch to showing data only from your worker dyno to see the correct memory usage.
Finally, I recommend reading this article about how Ruby uses memory. Many useful tools are mentioned there, derailed_benchmarks in particular. It was created by a guy from Heroku (at that time) and is a collection of benchmarks for the most common problems people have on Heroku.

Spreadsheet Gear -- Generating large report via copy and paste seems to use a lot of memory and processor

I am attempting to generate a large workbook-based report with three supporting worksheets of 100, 12,000 and 12,000 rows and a final, entirely formula-based output sheet that ends up representing about 120 entities at 100 rows apiece. I generate a template range and copy and paste it, replacing the entity ID cell after pasting each new range. It is working fine, but I noticed that memory usage in the IIS Express process is approximately 500 MB and that it is taking 100% processor usage as well.
Are there any guidelines for generating workbooks in this manner?
At least in terms of memory utilization, it would help to have some comparison, maybe against Excel, in how much memory is utilized to simply have the resultant workbook opened. For instance, if you were to open the final report in both Excel and the "SpreadsheetGear 2012 for Windows" application (available in the SpreadsheetGear folder under the Start menu), what does the Task Manager measure for each of these applications in terms of memory consumption? This may provide some insight as to whether the memory utilization you are seeing in the actual report-building process is unusually high (is there a lot of extra overhead for your routine?), or just typical given the size of the workbook you are generating.
In terms of CPU utilization, this one is a bit more difficult to pinpoint and is certainly dependent on your hardware as well as implementation details in your code. Running the VS Profiler against your routine would certainly be interesting, if you have this tool available to you. Generally speaking, the CPU time could potentially be broken up into a couple of broad categories: CPU cycles used to "build" your workbook and CPU cycles used to "calculate" it. It could be helpful to determine which of these dominates the CPU.
One way to do this might be to, if possible, ensure that calculations don't occur until you are finished actually generating the workbook. In fact, avoiding any unnecessary calculations could potentially speed things up; it depends on the workbook, though. You can avoid calculations by setting IWorkbookSet.Calculation to Manual mode and not calling any of the IWorkbook "Calculate" methods (Calculate/CalculateFull/CalculateFullRebuild) until you are finished with this process.
If you don't have access to a profiler, maybe set some timers and Console.WriteLines and monitor the Task Manager to see how your CPU fluctuates during different parts of your routine. With any luck you might be able to isolate which part of the routine is taking the most time.

Reduce NSManagedObjectContext save time

I am trying to make my algorithm faster, as I am seeing long loading times. My app loads thousands of objects from an external database and then saves them on the device. I ran Time Profiler with my iPod touch and saw that 17 seconds (26%) of my loading time is spent executing NSManagedObjectContext's save: method.
I am using one private NSManagedObjectContext and one persistent store coordinator.
What affects the time spent saving? If I save 1/10th of the data 10 times instead of all of the data at once, which is faster? What can I try in order to optimize the time spent saving?
You are on the right track. The best lever I have found for optimizing Core Data save times is indeed finding the right batch size.
As suggested by one commenter, the best way is to test it out. There are no definite rules because it really depends on the nature of your data, the size and number of the records etc.
Other strategies include using @autoreleasepool block(s), which will optimize memory usage and might improve performance as well.
Let us know your results!

SQLite: ON disk Vs Memory Database

We are trying to integrate SQLite into our application and populate it as a cache. We are planning to use it as an in-memory database, and we are using it for the first time. Our application is C++ based.
Our application interacts with the master database to fetch data and performs numerous operations. These operations generally concern one table, which is quite huge in size.
We replicated this table in SQLite, and the following are our observations:
Number of fields: 60
Number of records: 100,000
As the data population starts, the memory of the application shoots up drastically, from 120 MB to ~1.4 GB. At this point our application is idle and not doing any major operations, but normally, once the operations start, memory utilization shoots up further. With SQLite as an in-memory DB and this high memory usage, we don't think we will be able to support this many records.
When I create the DB on disk, the DB size comes to ~40 MB, but the memory usage of the application still remains very high.
Q. Is there a reason for this high usage? All buffers have been cleared and, as said before, the DB is not in memory.
Any help would be deeply appreciated.
You can use the VACUUM command to free up space by reducing the size of the SQLite database. If you are doing a lot of insert/update operations, the database may keep growing; VACUUM rebuilds it and reclaims that space.
SQLite uses memory for things other than the data itself. It holds not only the data, but also the connections, prepared statements, query cache, query results, etc. You can read more on SQLite Memory Allocation and tweak it. Make sure you are properly destroying your objects too (sqlite3_finalize(), etc.).
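For the tuning part, a small sketch of the kind of knobs involved, here issued through the (assumed) xerial sqlite-jdbc driver for brevity since the pragmas are the same from any binding; the C-API equivalents are sqlite3_exec() for the statements and sqlite3_finalize()/sqlite3_close() for the cleanup that the try-with-resources blocks do here:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class SqliteMemoryTuning {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:cache.db");
             Statement stmt = conn.createStatement()) {
            // Cap the page cache at roughly 2 MiB (a negative value means size in KiB).
            stmt.execute("PRAGMA cache_size = -2048");
            // Reclaim space left behind by heavy insert/update/delete churn.
            stmt.execute("VACUUM");
            try (ResultSet rs = stmt.executeQuery("PRAGMA cache_size")) {
                if (rs.next()) {
                    System.out.println("cache_size = " + rs.getLong(1));
                }
            }
        } // closing statements and the connection promptly releases their memory
    }
}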
