I have some scientific measurement data which should be permanently stored in a data store of some sort.
I am looking for a way to store measurements from 100 000 sensors with measurement data accumulating over years to around 1 000 000 measurements per sensor. Each sensor produces a reading once every minute or less frequently. Thus the data flow is not very large (around 200 measurements per second in the complete system). The sensors are not synchronized.
The data itself comes as a stream of triplets: [timestamp] [sensor #] [value], where everything can be represented as a 32-bit value.
In the simplest form this stream would be stored as-is into a single three-column table. Then the query would be:
SELECT timestamp,value
FROM Data
WHERE sensor=12345 AND timestamp BETWEEN '2013-04-15' AND '2013-05-12'
ORDER BY timestamp
Unfortunately, with row-based DBMSs this will give a very poor performance, as the data mass is large, and the data we want is dispersed almost evenly into it. (Trying to pick a few hundred thousand records from billions of records.) What I need performance-wise is a reasonable response time for human consumption (the data will be graphed for a user), i.e. a few seconds plus data transfer.
Another approach would be to store the data from one sensor into one table. Then the query would become:
SELECT timestamp,value
FROM Data12345
WHERE timestamp BETWEEN '2013-04-15' AND '2013-05-12'
ORDER BY timestamp
This would give a good read performance, as the result would be a number of consecutive rows from a relatively small (usually less than a million rows) table.
However, the RDBMS should have 100 000 tables which are used within a few minutes. This does not seem to be possible with the common systems. On the other hand, RDBMS does not seem to be the right tool, as there are no relations in the data.
I have been able to demonstrate that a single server can cope with the load by using the following mickeymouse system:
Each sensor has its own file in the file system.
When a piece of data arrives, its file is opened, the data is appended, and the file is closed.
Queries open the respective file, find the starting and ending points of the data, and read everything in between.
Very few lines of code. The performance depends on the system (storage type, file system, OS), but there do not seem to be any big obstacles.
However, if I go down this road, I end up writing my own code for partitioning, backing up, moving older data deeper down in the storage (cloud), etc. Then it sounds like rolling my own DBMS, which sounds like reinventing the wheel (again).
Is there a standard way of storing the type of data I have? Some clever NoSQL trick?
Seems like a pretty easy problem really. 100 billion records, 12 bytes per record -> 1.2TB this isn't even a large volume for modern HDDs. In LMDB I would consider using a subDB per sensor. Then your key/value is just 32 bit timestamp/32 bit sensor reading, and all of your data retrievals will be simple range scans on the key. You can easily retrieve on the order of 50M records/sec with LMDB. (See the SkyDB guys doing just that https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1X35alxcJ)
Try VictoriaMetrics as a time series database for big amounts of data.
It is optimized for storing and querying big amounts of time series data.
It uses low disk iops and bandwidth thanks to the storage design based on LSM trees, so it can work quite well on HDD instead of SSD.
It has good compression ratio, so 100 billion typical data points would require less than 100 GB of HDD storage. Read technical details on data compression.
Related
I am having issues using dask. It is very slow compared to pandas especially when reading large datasets of up to 40gig. The data set grows to about 100+ columns which are mainly float64 after some additional processing(This is quite slow especially when I call compute like so: output = df[["date", "permno"]].compute(scheduler='threading'))
I think I could live with delay even if frustrating however, when I try to save the data to parquet: df.to_parquet('my data frame', engine="fastparquet") it runs out of memory in a server with about 110gig ram. I notice that the buff/cache memory when I do free -h goes up from about 40megabytes to 40+gig.
I am confused how this is possible given that dask does not load everything into memory. I use 100 partitions for the dataset in dask.
Dask computations are executed lazily. The underlying operations aren't actually executed until the last possible moment. Here's what I can gather from your question / comment:
you read a 40GB dataset
you run grouping / sorting
you join with other datasets
you try to write to Parquet
The computation bottleneck isn't necessarily related to the Parquet writing part. Your bottleneck may be with the grouping, sorting, or joining.
You may need to perform a broadcast join, strategically persist, or repartition, it's hard to say given the information provided.
I've got several hundred pandas data frames, each of which has a column of very long strings that need to be processed/sentencized and finally tokenized before modeling with word2vec.
I can store them in any format on the disk, before I build a stream to pass them to gensim's word2vec function.
What format would be best, and why? The most important criterion would be performance vis-a-vis training (which will take many days), but coherent structure to the filesystem would also be nice.
Would it be crazy to store several million or maybe even a few billion text files containing one sentence each? Or perhaps some sort of database? If this was numerical data I'd use hdf5. But it's text. The cleanest would be to store them in the original data frames, but that seems less ideal from an i/o perspective, because I'd have to load each data frame (largish) every epoch.
What makes the most sense here?
As you do your preprocessing/tokenization of all the source data that you want to be part of a single training session, append the results to a single plain-text file.
Use space-separated words, and end each 'sentence' (or any other useful text-chunk that's less than 10,000 words long) with a newline.
Then you can use the corpus_file option for specifying your pre-tokenized training data, and will get the maximum possible multithreading benefit. (That mode will direct each thread to open its own view into a range of the single file, so there's no blocking on any distributor thread.)
I am developing an iOS app which stores its data in an sqlite3 database. Every insert, update or delete operation is logged locally and then pushed up to iCloud, and other devices running the app can download these transaction logs and execute the SQL commands within them to keep all devices running the app synchronised. This is working extremely well.
I am now looking into optimising the process and it occurs to me that logging the whole SQL command results in a lot of redundant data being pushed to and pulled from the cloud, which will ultimately result in longer sync times and increased data usage.
The SQL queries are very predictable (there is only one format each of insert, update and delete used in the app) so I am considering using an encoding/decoding routine which will compress the SQL command for storage in the transaction log, and then decompress it from the log for execution.
The string compression methods I have found don't seem to do too well with SQL queries, so I've devised my own:
Single byte to identify the SQL command type
Table and column names indexed in arrays in the app, and the names are encoded using their index position in the array
String of tab separated digits to represent groups of columns, and tab separated values (e.g. in a VALUES() clause)
Encoded check column and value (for the WHERE clause in an update or delete command)
Using this format I have compressed one example query of 186 bytes down to just 78 bytes. This has clear advantages for speed of data transmission and amount of data usage.
The disadvantage I foresee is that it will require more processing on the client end to encode and decode the commands. I am wondering whether anybody has done anything similar and has any advice to offer.
To make is clearer what I am asking: in general is it better to minimise the amount of data being synced and increase the burden on the client to interprete those data, or is it preferable to just sync the data as-is and leave the client to use it as-is?
I'm posting an answer to my own question because I have some information which may provide advice to others who are looking in to the same thing.
I spent some time yesterday writing SQL query compression and decompression functions in Objective C. These functions use the method detailed in my original question and in so doing reduce all non-data parts of the queries (SQL command [insert/update/delete], table names and column names) to a single digit to represent each, and remove all of the remaining SQL syntax (key words like "FROM", spaces, commas, brackets...).
I have done some testing by creating five records both with and without the query compression enabled, and here are the results for the file sizes:
Full SQL queries (logged unmodified to the transactions file)
Zip compressed: 747 bytes
Uncompressed: 1,515 bytes
Compressed SQL queries (compressed using my custom format to the transactions file)
Zip compressed: 673 bytes
Uncompressed: 785 bytes
As you can see, the greatest benefit was to using both types of compression. Encoding the queries achieved ~50% compression. Encoding and then zipping the queries achieved about 10% compression compared to zipping the unencoded queries.
The question I really need to ask myself now is whether it's worth the additional overhead to encode and then zip the queries, and unzip and decode them after downloading the transaction logs from other devices. In this example I only saved 74 bytes overall by encoding and then compressing the transaction log. With five queries this averages at 14.8 bytes saved per query (compared to zipping alone). This is just 14.4kb per 1000 records, which does not seem a lot.
In the past I had to work with big files, somewhere about in the 0.1-3GB range. Not all the 'columns' were needed so it was ok to fit the remaining data in RAM.
Now I have to work with files in 1-20GB range, and they will probably grow as the time will pass. That is totally different because you cannot fit the data in RAM anymore.
My file contains several millions of 'entries' (I have found one with 30 mil entries). On entry consists in about 10 'columns': one string (50-1000 unicode chars) and several numbers. I have to sort the data by 'column' and show it. For the user only the top entries (1-30%) are relevant, the rest is low quality data.
So, I need some suggestions about in which direction to head out. I definitively don't want to put data in a DB because they are hard to install and configure for non computer savvy persons. I like to deliver a monolithic program.
Showing the data is not difficult at all. But sorting... without loading the data in RAM, on regular PCs (2-6GB RAM)... will kill some good hours.
I was looking a bit into MMF (memory mapped files) but this article from Danny Thorpe shows that it may not be suitable: http://dannythorpe.com/2004/03/19/the-hidden-costs-of-memory-mapped-files/
So, I was thinking about loading only the data from the column that has to be sorted in ram AND a pointer to the address (into the disk file) of the 'entry'. I sort the 'column' then I use the pointer to find the entry corresponding to each column cell and restore the entry. The 'restoration' will be written directly to disk so no additional RAM will be required.
PS: I am looking for a solution that will work both on Lazarus and Delphi because Lazarus (actually FPC) has 64 bit support for Mac. 64 bit means more RAM available = faster sorting.
I think a way to go is Mergesort, it's a great algorithm for sorting a
large amount of fixed records with limited memory.
General idea:
read N lines from the input file (a value that allows you to keep the lines in memory)
sort these lines and write the sorted lines to file 1
repeat with the next N lines to obtain file 2
...
you reach the end of the input file and you now have M files (each of which is sorted)
merge these files into a single file (you'll have to do this in steps as well)
You could also consider a solution based on an embedded database, e.g. Firebird embedded: it works well with Delphi/Windows and you only have to add some DLL in your program folder (I'm not sure about Lazarus/OSX).
If you only need a fraction of the whole data, scan the file sequentially and keep only the entries needed for display. F.I. lets say you need only 300 entries from 1 million. Scan the first first 300 entries in the file and sort them in memory. Then for each remaining entry check if it is lower than the lowest in memory and skip it. If it is higher as the lowest entry in memory, insert it into the correct place inside the 300 and throw away the lowest. This will make the second lowest the lowest. Repeat until end of file.
Really, there are no sorting algorithms that can make moving 30gb of randomly sorted data fast.
If you need to sort in multiple ways, the trick is not to move the data itself at all, but instead to create an index for each column that you need to sort.
I do it like that with files that are also tens of gigabytes long, and users can sort, scroll and search the data without noticing that it's a huge dataset they're working with.
Please finde here a class which sorts a file using a slightly optimized merge sort. I wrote that a couple of years ago for fun. It uses a skip list for sorting files in-memory.
Edit: The forum is german and you have to register (for free). It's safe but requires a bit of german knowledge.
If you cannot fit the data into main memory then you are into the realms of external sorting. Typically this involves external merge sort. Sort smaller chunks of the data in memory, one by one, and write back to disk. And then merge these chunks.
I would like to store millions of data lines that looks like this:
key, value
key is an integer in the range of (0 to 5,000,000); all values are unique;
value is an unsigned int16 value (0 to 65535)
the key is to store the data while taking the LEAST AMOUNT OF DISK SPACE, and yet, be able to query the data. can you think of any algorithms / smart schemes for data storage that would be helpful?
just in case it matters, I use Linux.
One option would be, if the key values are not important data but rather just index data to utilize a flat file of bits ( with a descriptive header ). Every 16 bits is a value and the nth value would then be (n - 1) * 16 bits from the end of the header.
Additionally, if the key value does matter, a set flat file of about 10MB would allow for the entire key space to be stored without storing actual keys. The 16 bits that are at the (n - 1) * 16 offset would be that key's value.
That would probably be the least space intensive method for storage, as it would be only the data that is literally required. ( Though, if you are only interested in say 100k values and one has a key of 5 million you do end up with a lot of wasted space, which wouldn't be there with an actual key,value addressing system. So this methodology only achieves a minimum disk storage for sets of tightly grouped values or many many numbers (over about the 2 million mark ).
how do you plan to use stored data? with random or sequential access? for sequential access you can use any archiving algorithm, e.g. LZMA. Random access doesn't leave you a lot of space for improvements.
can you see any patterns of this data? e.g. if the difference between adjacent keys/values are often small you can store only packed differences. and million of other possible approaches.
[EDIT] also you can check techniques used for data compression in network communication
[EDIT1] and you can check this Google Code Integer Array Compression project
This depend upon the operation and data. I would first recommend "just using a database" (a simple key-value store such as BDB/EhCache [read: Key Value store], for instance :-)
Mimisbrunnr also has a good answer if all the keys are used.
If the keys are near constant/read-only and only a relatively small percent of the keys are used, consider the use of a (disk-based) Heap data-structure (very similar to an Array-based Heap; Heaps need not be Array-based). Robert Sedgewick had a good book from the late 80's that had a very lean implementation, but I forget the name. A Heap will be more beneficial when compared to a flat index with a smaller proportion of used keys and at full-load will have worse storage requirements.
(If abstracted, the used method could be switched and/or a hybrid heap with indexed/sequenced leaf-node values could be used [along with Huffman encoding or whatnot], but that is just adding far more complications. Keep it simple ... hence first suggestion of an existing key/value store ;-)
Happy coding.
Have you considered using a database designed for mobile devices such as SQL Server Compact, or another similar database? These will have a small footprint on the disk, while still providing the full search power you need.
Another example of a compact database engine is KeyDB for linux:
http://3d2f.com/programs/11-989-keydb-download.shtml