I’m trying to read data generated by another application and stored in a Microsoft Office Access .mdb file. The number of records in some tables varies from a few thousand to over 10 million, depending on the size of the model (in the other application). Opening the whole table in one query can cause an OutOfMemory exception with large files, so I split the table on some criterion and read each part in a separate query. But the problem is with medium-sized files, which could be read significantly faster in a single query without any exception.
So, am I on the right track? Can I solve the OutOfMemory problem another way? Is it OK to choose between the two strategies (1 query or N queries) based on the number of records?
By the way, I’m using Delphi XE5 and Delphi’s standard ADO components. I need the whole table’s data, and no joins to other tables are needed. I create the ADO components in code and they are not connected to any visual controls.
Edit:
Well, it seems that my question is not clear enough. Here are some more details, which are actually answers to questions or suggestions posed in comments:
This .mdb file doesn’t hold a real database; it’s just structured data, so there’s no writing of new data, no transactions, no user interaction, no server, nothing. A third-party application uses Access files to export its calculation results. The total size of these files is usually a few hundred MB, but it can grow up to 2 GB. I need to load this data into a Delphi data structure before starting my own calculations, since there’s no room for waiting on I/O during those calculations.
I can’t compile this project for x64; it depends heavily on some old DLLs that share the same memory manager as the main executable, and their authors will never release x64 versions. The company hasn’t decided to replace them, and that won’t change in the near future.
And, you know, the support guys prefer to tell us “fix this” rather than ask two thousand customers to “buy more memory”. So I have to be really stingy with memory usage.
Now my question is: does TADODataSet provide any better memory management for fetching this amount of data? Is there a property that prevents the DataSet from fetching all the data at once?
When I call ADOTable1.Open, it starts allocating memory and waits to fetch the entire table, as expected. But reading all those records in a for loop takes a while, and there’s no need to have all that data in memory at once; on the other hand, there’s no need to keep a record in memory after reading it, since there’s no seeking back through rows. That’s why I split the table with several queries. Now I want to know whether TADODataSet can handle this, or whether what I’m doing is the only solution.
I did some trial and error and improved the performance of reading the data, in both memory usage and elapsed time. My test case is a table with more than 5,000,000 records; each record has 3 string fields and 8 doubles. No index, no primary key. I used the GetProcessMemoryInfo API to measure memory usage.
Initial State
Table.Open: 33.0 s | 1,254,584 kB
Scrolling : +INF s | unknown; allocated memory doesn't increase in Task Manager.
Sum : - | -
DataSet.DisableControls;
Table.Open: 33.0 s | 1,254,584 kB
Scrolling : 13.7 s | 0 kB
Sum : 46.7 s | 1,254,584 kB
DataSet.CursorLocation := clUseServer;
Table.Open: 0.0 s | -136 kB
Scrolling : 19.4 s | 56 kB
Sum : 19.4 s | -80 kB
DataSet.LockType := ltReadOnly;
Table.Open: 0.0 s | -144 kB
Scrolling : 18.4 s | 0 kB
Sum : 18.5 s | -144 kB
DataSet.CacheSize := 100;
Table.Open: 0.0 s | 432 kB
Scrolling : 11.4 s | 0 kB
Sum : 11.5 s | 432 kB
I also checked Connection.CursorLocation, Connection.IsolationLevel, Connection.Mode, DataSet.CursorType and DataSet.BlockReadSize, but none of them made an appreciable difference.
I also tried TADOTable, TADOQuery and TADODataSet, and unlike what Jerry said in the comments, both ADOTable and ADOQuery performed better than ADODataSet.
The value assigned to CacheSize should be decided case by case; greater values don't necessarily lead to better results.
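Putting the measurements above together, here is a minimal sketch of the final reading loop. The connection, table name and ProcessRecord procedure are placeholders, not from the real project:

var
  DS: TADODataSet;
begin
  DS := TADODataSet.Create(nil);
  try
    DS.Connection := ADOConnection1;            // an already open TADOConnection
    DS.CursorLocation := clUseServer;           // don't fetch the whole table client-side
    DS.LockType := ltReadOnly;                  // no editing needed
    DS.CacheSize := 100;                        // rows fetched per batch; tune per case
    DS.CommandText := 'SELECT * FROM Results';  // hypothetical table name
    DS.DisableControls;
    DS.Open;
    while not DS.Eof do
    begin
      ProcessRecord(DS);  // hypothetical: copy the record into my own structure
      DS.Next;            // the previous record can now be discarded
    end;
  finally
    DS.Free;
  end;
end;

With a read-only, server-side cursor, only about CacheSize rows are buffered at a time, which matches the memory numbers measured above.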
Related
We have a 1 GB List that was created using the View.asList() method on Beam SDK 2.0. We are trying to iterate through every member of the list and, for now, do nothing significant with it (we just sum up a value). Just reading this 1 GB list takes about 8 minutes (and that was after we set workerCacheMb=8000, which we think means the worker cache is 8 GB; if we don't set workerCacheMb to 8000, it takes over 50 minutes before we just kill the job). We're using an n1-standard-32 instance, which should have more than enough RAM. There is ONLY a single thread reading this 1 GB list. We know this because we create a dummy PCollection of one integer and use it to then read the ViewList side input.
It should not take 8 minutes to read a 1 GB list, especially when there's enough RAM. Even if the list were materialized to disk (which it shouldn't be), a normal single non-SSD disk can read data at 100 MB/s, so this absolute worst-case scenario should take about 10 seconds.
What are we doing wrong? Did we discover a Dataflow bug? Or maybe workerCacheMb is really in KB instead of MB? We're tearing our hair out here.
Try using setWorkerCacheMb(1000); 1000 MB is around 1 GB. The side input will then be served from each worker node's cache, which is fast.
DataflowWorkerHarnessOptions options = PipelineOptionsFactory.create().cloneAs(DataflowWorkerHarnessOptions.class);
options.setWorkerCacheMb(1000);
Is it really required to iterate over the whole 1 GB of side input data every time, or do you only need some specific data during iteration?
If you need specific data, you should get it by passing a specific index into the list. Getting data at a specific index is a much faster operation than iterating over the whole 1 GB.
After checking with the Dataflow team, the rate of 1GB in 8 minutes sounds about right.
Side inputs in Dataflow are always serialized. This is because, to have a side input, a view of a PCollection must be generated, and Dataflow does this by serializing it to a special indexed file.
If you give more information about your use case, we can help you think of ways of doing it in a faster manner.
I am using Delphi 7 Enterprise under Windows 7 64-bit.
My computer has 16 GB of RAM.
I am trying to use kbmMemTable 7.70.00 Professional Edition (http://news.components4developers.com/products_kbmMemTable.html).
My table has 150,000 records, but when I try to copy the data from the dataset to the kbmMemTable, it copies only 29,000 records and then I get this error: EOutOfMemory
I saw this message:
https://groups.yahoo.com/neo/groups/memtable/conversations/topics/5769
but it didn't solve my problem.
An out-of-memory error can happen for various reasons:
Your application uses too much memory in general. A 32-bit application typically runs out of memory when it has allocated about 1.4 GB using the FastMM memory manager. Other memory managers may have better or worse limits.
Memory fragmentation. There may not be enough contiguous space in memory for a single large allocation that is requested. kbmMemTable will attempt to allocate roughly 200,000 x 4 bytes as its own largest single allocation, which shouldn't be a problem here.
Too many small allocations, leading to the fragmentation mentioned above. kbmMemTable will allocate from 1 to n blocks of memory per record, depending on the setting of the Performance property (see the sketch below).
If Performance is set to fast, then 1 block is allocated per record (unless blob fields exist, in which case an additional allocation is made per non-null blob field).
If Performance is balanced or small, then each string field allocates another block of memory per record.
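For illustration, a minimal sketch of copying record by record using only standard TDataSet calls, so the whole source never has to be cloned in one go. Two assumptions to note: MemTable is assumed to already have the same structure as Source, and mtpfFast is an assumed identifier for the "fast" setting described above; verify it against your kbmMemTable version.

var
  i: Integer;
begin
  Source.DisableControls;
  MemTable.DisableControls;
  MemTable.Performance := mtpfFast;  // assumed name: one allocation per record
  Source.First;
  while not Source.Eof do
  begin
    MemTable.Append;
    for i := 0 to Source.FieldCount - 1 do
      MemTable.Fields[i].Assign(Source.Fields[i]);  // copy field values
    MemTable.Post;
    Source.Next;
  end;
end;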
best regards
Kim/C4D
In an embedded environment (using an MSP430), I have seen data corruption caused by partial writes to non-volatile memory. This seems to be caused by power loss during a write (to either FRAM or the info segments).
I am validating data stored in these locations with a CRC.
My question is: what is the correct way to prevent this "partial write" corruption? Currently, I have modified my code to write to two separate FRAM locations, so if one write is interrupted, causing an invalid CRC, the other location should remain valid. Is this a common practice? Do I need to implement this double-write behavior for all non-volatile memory?
A simple solution is to maintain two versions of the data (in separate pages for flash memory): the current version and the previous version. Each version has a header comprising a sequence number and a word that validates the sequence number - simply the 1's complement of the sequence number, for example:
---------
| seq |
---------
| ~seq |
---------
| |
| data |
| |
---------
The critical thing is that, when the data is written, the seq and ~seq words are written last.
On start-up, you read the data that has the highest valid sequence number (accounting for wrap-around, especially with short sequence words). When you write the data, you overwrite and validate the oldest block.
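For illustration, a minimal sketch of the start-up selection logic. On a real MSP430 this would be C; Pascal is used here only to match the other examples in this thread, and H0, H1, UseBlock and HandleNoValidData are hypothetical names:

type
  TBlockHeader = record
    Seq: Word;     // sequence number
    InvSeq: Word;  // 1's complement of Seq, written last together with Seq
  end;

function BlockValid(const H: TBlockHeader): Boolean;
begin
  // valid only if the complement matches, i.e. the final header write completed
  Result := H.InvSeq = Word(not H.Seq);
end;

function IsNewer(A, B: Word): Boolean;
begin
  // wraparound-aware: A is newer than B if the signed 16-bit distance is positive
  Result := SmallInt(A - B) > 0;
end;

// On start-up: use the block with the highest valid sequence number.
if BlockValid(H0) and (not BlockValid(H1) or IsNewer(H0.Seq, H1.Seq)) then
  UseBlock(0)
else if BlockValid(H1) then
  UseBlock(1)
else
  HandleNoValidData;  // first boot, or both copies corrupted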
The solution you are already using is valid as long as the CRC is written last, but it lacks this simplicity and imposes a CRC calculation overhead that may not be necessary or desirable.
On FRAM you have no concern about endurance, but this is an issue for flash memory and EEPROM. In that case I use a write-back cache method: the data is maintained in RAM, and when it is modified a timer is started, or restarted if it is already running; when the timer expires, the data is written (sketched below). This prevents burst writes from thrashing the memory, and it is useful even on FRAM since it minimises the software overhead of data writes.
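A minimal sketch of that write-back idea, using a VCL TTimer for brevity (on an MSP430 this would be a hardware timer, but the pattern is identical; TStore, TMyData, FCache and WriteToNvm are hypothetical names):

procedure TStore.Modify(const NewData: TMyData);
begin
  FCache := NewData;               // update only the RAM copy
  FDirty := True;
  FFlushTimer.Enabled := False;    // restart the timer on every modification,
  FFlushTimer.Enabled := True;     // so a burst of changes collapses into one write
end;

procedure TStore.FlushTimerTimer(Sender: TObject);
begin
  FFlushTimer.Enabled := False;    // one-shot behaviour
  if FDirty then
  begin
    WriteToNvm(FCache);            // the actual non-volatile write
    FDirty := False;
  end;
end;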
Our engineering team takes a two-pronged approach to these problems: solve it in hardware and in software!
The first is a diode-and-capacitor arrangement that provides a few milliseconds of power during a brown-out. If we notice we've lost external power, we prevent the code from entering any non-volatile writes.
Second, our data is particularly critical for operation, it updates often, and we don't want to wear out our non-volatile flash storage (it only supports so many writes), so we actually store the data 16 times in flash and protect each record with a CRC code. On boot, we find the newest valid write and then start our erase/write cycles from there.
We've never seen data corruption since implementing our frankly paranoid system.
Update:
I should note that our flash is external to our CPU, so the CRC also helps validate the data if there is a communication glitch between the CPU and the flash chip. Furthermore, if we experience several glitches in a row, the multiple copies protect against data loss.
We've used something similar to Clifford's answer, but written in one write operation. You need two copies of the data and you alternate between them. Use an incrementing sequence number so that effectively one location has even sequence numbers and the other has odd ones.
Write the data like this (in one write command if you can):
---------
| seq |
---------
| |
| data |
| |
---------
| seq |
---------
When you read it back, make sure both sequence numbers are the same; if they are not, the data is invalid. At startup, read both locations and work out which one is more recent (taking into account the sequence number rolling over).
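A minimal sketch of that read-back check, following the layout above:

function RecordValid(LeadingSeq, TrailingSeq: Word): Boolean;
begin
  // an interrupted write leaves the two copies of seq different, because
  // the block is written front to back in a single write operation
  Result := LeadingSeq = TrailingSeq;
end;

Since one location only ever holds even sequence numbers and the other odd ones, the more recent copy is simply the valid one with the higher sequence number (wraparound aside).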
Always store data with some kind of framing protocol, like START_BYTE, total bytes to write, data, END_BYTE.
Before writing to external/internal memory, always check the power-monitor registers/ADC.
If your data gets corrupted anyway, the END byte will be corrupted too, so the entry will not be valid when the whole frame is validated.
A plain checksum is not a good idea; choose CRC16 instead if you want to include an integrity check in your protocol.
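If you do add a CRC16 field to the frame, here is a minimal sketch of the common CRC-16/CCITT variant (polynomial $1021, initial value $FFFF), one of several valid choices; run it over everything between START_BYTE and the CRC field:

function Crc16CCITT(const Data: array of Byte): Word;
var
  i, Bit: Integer;
begin
  Result := $FFFF;  // initial value
  for i := 0 to High(Data) do
  begin
    Result := Result xor (Word(Data[i]) shl 8);
    for Bit := 1 to 8 do
      if (Result and $8000) <> 0 then
        Result := Word((Result shl 1) xor $1021)  // x^16 + x^12 + x^5 + 1
      else
        Result := Word(Result shl 1);
  end;
end;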
I am working as a software engineer. As far as I know, the data stored in memory (either hard disk or RAM) is 0s and 1s. I am sure that beyond 0s and 1s there are different ways data is physically stored, depending on the type of memory device.
Please share your ideas about it, or tell me where I can study how data is stored in memory devices.
Logically, a bit is the smallest piece of data that can be stored. A bit is either 0 or 1; nothing else. It's like asking "what is between these 2 quantum states?"
In electronics, 0 and 1 can be distinguished by separate voltage levels, separate directions of magnetization, etc. Some implementations may use multiple levels to store more than one bit in a given space. But logically, 0 and 1 are the only values.
I'm trying to get a better idea of how my SQL Server 2000 instance is using its memory. I've run DBCC MEMORYSTATUS, and I'm hoping someone can give me a better idea of how to interpret the output.
My main concern is the 'Other' section of the buffer distribution. It is currently using by far the most pages, at 166,065. Considering that SQL Server has only about 2 GB of available RAM, the fact that most of it is being used by 'Other' worries me.
Below is the full output.
I appreciate any help you can offer.
Buffer Distribution Buffers
Stolen 30595
Free 966
Procedures 208
Inram 0
Dirty 8424
Kept 0
I/O 137
Latched 437
Other 166065
It's your buffer pool, aka data cache. From MS KB 271624:
Other. These are committed pages that do not meet any of the criteria mentioned earlier. Typically, the majority of buffers that meet this criteria are hashed data and index pages in the buffer cache.
This looks good: you have about 1,300 MB of cached data and indexes (166,065 buffers x 8 KB per page ≈ 1,300 MB), which means your queries are hitting RAM, not disk.