My question is about comparing the speed of reading seismic data from random locations on disk when the data formats differ.
In one case I have SU (Seismic Unix) formatted data, which I read in Fortran with access='direct'.
In the other case I have the same file converted to SEG-Y format. I have to read this one with access='stream', since the record size is not fixed in SEG-Y data sets.
The trace locations I read from are very random in both cases. The data length per trace is about 10 kilobytes.
I am observing a speed ratio of about four between the two data sets (i.e. the two reading modes).
Is this expected? If not, what could be wrong?
I have a large set of short time series (average length of a short series = 20 points). The total size of the data is around 6 GB.
The current system works in the following way:
1) Load the 6 GB of data into RAM.
2) Process the data.
3) Put the forecast value corresponding to each time series into Excel.
The problem is that every time I run this system it takes nearly an hour on my 8 GB RAM PC.
Please suggest a better way to reduce the run time.
Use a faster programming language. For example, prefer Julia or C++ over MATLAB or Python.
Make your code more efficient. For example, instead of passing data to your functions by copying it (pass by value), pass it by reference (see: What's the difference between passing by reference vs. passing by value?). Use more efficient data structures.
Divide your dataset into smaller parts, work on each part separately, and then merge the outputs at the end, as in the sketch below.
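A minimal Python sketch of that chunked workflow, assuming the 6 GB of data lives in a single CSV sorted by series id; the file name, column names, and the forecast_one_series function are placeholders for your own pipeline:

import pandas as pd

CHUNK_ROWS = 1_000_000          # tune so one chunk fits comfortably in RAM
results = []
leftover = None                 # carries a series that straddles a chunk boundary

for chunk in pd.read_csv("series.csv", chunksize=CHUNK_ROWS):
    if leftover is not None:
        chunk = pd.concat([leftover, chunk], ignore_index=True)
    # hold back the last (possibly incomplete) series for the next chunk
    last_id = chunk["series_id"].iloc[-1]
    leftover = chunk[chunk["series_id"] == last_id]
    for sid, group in chunk[chunk["series_id"] != last_id].groupby("series_id"):
        results.append((sid, forecast_one_series(group["value"].to_numpy())))

if leftover is not None and len(leftover):
    results.append((leftover["series_id"].iloc[0],
                    forecast_one_series(leftover["value"].to_numpy())))

# write all forecasts once at the end instead of row by row
pd.DataFrame(results, columns=["series_id", "forecast"]).to_excel("forecasts.xlsx", index=False)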
I've got several hundred pandas data frames, each of which has a column of very long strings that need to be processed/sentencized and finally tokenized before modeling with word2vec.
I can store them in any format on the disk, before I build a stream to pass them to gensim's word2vec function.
What format would be best, and why? The most important criterion is performance with respect to training (which will take many days), but a coherent structure on the filesystem would also be nice.
Would it be crazy to store several million or maybe even a few billion text files containing one sentence each? Or perhaps some sort of database? If this were numerical data I'd use HDF5. But it's text. The cleanest option would be to keep them in the original data frames, but that seems less ideal from an I/O perspective, because I'd have to load each (largish) data frame every epoch.
What makes the most sense here?
As you do your preprocessing/tokenization of all the source data that you want to be part of a single training session, append the results to a single plain-text file.
Use space-separated words, and end each 'sentence' (or any other useful text-chunk that's less than 10,000 words long) with a newline.
Then you can use the corpus_file option to specify your pre-tokenized training data, and you will get the maximum possible multithreading benefit. (That mode directs each thread to open its own view into a range of the single file, so there is no blocking on a distributor thread.)
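A minimal sketch of that pipeline, assuming gensim 4.x and NLTK for sentence splitting; frames and the "text" column are placeholders for your data frames:

from nltk.tokenize import sent_tokenize      # needs the NLTK 'punkt' data downloaded
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

# 1) Flatten everything into one pre-tokenized corpus file:
#    space-separated words, one sentence per line.
with open("corpus.txt", "w", encoding="utf-8") as out:
    for df in frames:                        # your several hundred data frames
        for doc in df["text"]:
            for sentence in sent_tokenize(doc):
                tokens = simple_preprocess(sentence)
                if tokens:
                    out.write(" ".join(tokens) + "\n")

# 2) Train straight from the file; corpus_file lets every worker thread
#    read its own slice of the file in parallel.
model = Word2Vec(corpus_file="corpus.txt", vector_size=100, window=5,
                 min_count=5, workers=8, epochs=5)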
I am trying to process a large set of text files which are delimited by new lines. The files are gzipped, and I've split them into small chunks that are ~100 MB uncompressed. I have a total of 296 individual compressed files with a total uncompressed size of ~30 GB.
The rows are NQuads, and I'm using a Dask Bag to map the rows into a format which I can import into a database. The rows are folded by key so that I can combine rows related to a single page.
This is the code I'm using to read the files and fold them.
with dask.config.set(num_workers=2):
    n_quads_bag = dask.bag.\
        read_text(files)

    uri_nquads_bag = n_quads_bag.\
        map(parser.parse).\
        filter(lambda x: x is not None).\
        map(nquad_tuple_to_page_dict).\
        foldby('uri', binop=binop).\
        pluck(1).\
        map(lang_extract)
Then I'm normalizing the data into pages and entities. I do this with a map function which splits things into a tuple of (page, entities). I pluck the data and then write it to two separate sets of Avro files.
pages_entities_bag = uri_nquads_bag.\
    map(map_page_entities)

pages_bag = pages_entities_bag.\
    pluck(0).\
    map(page_extractor).\
    map(extract_uri_details).\
    map(ntriples_to_dict)

entities_bag = pages_entities_bag.\
    pluck(1).\
    flatten().\
    map(entity_extractor).\
    map(ntriples_to_dict)

with ProgressBar():
    pages_bag.to_avro(
        os.path.join(output_folder, 'pages.*.avro'),
        schema=page_avro_scheme,
        codec='snappy',
        compute=True)

    entities_bag.to_avro(
        os.path.join(output_folder, 'entities.*.avro'),
        schema=entities_avro_schema,
        codec='snappy',
        compute=True)
The code is failing on pages_bag.to_avro(... compute=True) with Killed/MemoryError. I've played around with reducing the partition sizes and reduced the worker count to 2.
Am I wrong to set compute=True? Is that why the whole dataset is being brought into memory? If so, how else can I get the files written?
Or is it possible that the partitions of the pages or entities are simply too big for the machine?
Another question: am I using Bags incorrectly, and is this the right approach for the problem I want to solve?
The specs of the machine I'm running this on:
4 CPUs
16 GB of RAM
375 GB scratch disk
The way to keep this from running out of memory is to keep the files at ~100 MB uncompressed and to use a groupby. As the Dask documentation states, you can force it to shuffle on disk, and groupby supports setting the number of partitions of the output.
with dask.config.set(num_workers=2):
    n_quads_bag = dask.bag.\
        read_text(files)

    uri_nquads_bag = n_quads_bag.\
        map(parser.parse).\
        filter(lambda x: x is not None).\
        map(nquad_tuple_to_page_dict).\
        groupby(lambda x: x[3], shuffle='disk', npartitions=n_quads_bag.npartitions).\
        map(grouped_nquads_to_dict).\
        map(lang_extract)
I have some scientific measurement data which should be permanently stored in a data store of some sort.
I am looking for a way to store measurements from 100 000 sensors, with measurement data accumulating over years to around 1 000 000 measurements per sensor. Each sensor produces a reading once a minute or less frequently, so the data flow is not very large (around 200 measurements per second across the complete system). The sensors are not synchronized.
The data itself comes as a stream of triplets: [timestamp] [sensor #] [value], where everything can be represented as a 32-bit value.
In the simplest form this stream would be stored as-is into a single three-column table. Then the query would be:
SELECT timestamp,value
FROM Data
WHERE sensor=12345 AND timestamp BETWEEN '2013-04-15' AND '2013-05-12'
ORDER BY timestamp
Unfortunately, with row-based DBMSs this gives very poor performance: the data mass is large, and the rows we want are dispersed almost evenly throughout it. (We are trying to pick a few hundred thousand records out of billions.) What I need performance-wise is a reasonable response time for human consumption (the data will be graphed for a user), i.e. a few seconds plus data-transfer time.
Another approach would be to store the data from one sensor into one table. Then the query would become:
SELECT timestamp,value
FROM Data12345
WHERE timestamp BETWEEN '2013-04-15' AND '2013-05-12'
ORDER BY timestamp
This would give good read performance, as the result would be a number of consecutive rows from a relatively small table (usually less than a million rows).
However, the RDBMS would then have 100 000 tables, all of which are touched within the span of a few minutes. This does not seem feasible with common systems. On the other hand, an RDBMS does not seem to be the right tool anyway, as there are no relations in the data.
I have been able to demonstrate that a single server can cope with the load by using the following Mickey Mouse system:
Each sensor has its own file in the file system.
When a piece of data arrives, its file is opened, the data is appended, and the file is closed.
Queries open the respective file, find the starting and ending points of the data, and read everything in between.
Very few lines of code (see the sketch below). The performance depends on the system (storage type, file system, OS), but there do not seem to be any big obstacles.
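A minimal Python sketch of that file-per-sensor scheme, assuming 32-bit Unix timestamps, readings appended in time order, and fixed 8-byte records; the paths and names are illustrative:

import os
import struct

RECORD = struct.Struct("<II")                      # 32-bit timestamp, 32-bit value

def append_reading(data_dir, sensor_id, timestamp, value):
    # one append-only binary file per sensor
    with open(os.path.join(data_dir, f"{sensor_id}.dat"), "ab") as f:
        f.write(RECORD.pack(timestamp, value))

def query_range(data_dir, sensor_id, t_start, t_end):
    path = os.path.join(data_dir, f"{sensor_id}.dat")
    n = os.path.getsize(path) // RECORD.size
    with open(path, "rb") as f:
        # binary search for the first record with timestamp >= t_start
        lo, hi = 0, n
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD.size)
            ts, _ = RECORD.unpack(f.read(RECORD.size))
            if ts < t_start:
                lo = mid + 1
            else:
                hi = mid
        # sequential read forward until the end timestamp is passed
        f.seek(lo * RECORD.size)
        readings = []
        for _ in range(lo, n):
            ts, value = RECORD.unpack(f.read(RECORD.size))
            if ts > t_end:
                break
            readings.append((ts, value))
        return readings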
However, if I go down this road, I end up writing my own code for partitioning, backups, moving older data deeper into storage (cloud), etc. Then it starts to sound like rolling my own DBMS, which sounds like reinventing the wheel (again).
Is there a standard way of storing the type of data I have? Some clever NoSQL trick?
Seems like a pretty easy problem, really: 100 billion records at 12 bytes per record is about 1.2 TB, which isn't even a large volume for modern HDDs. In LMDB I would consider using a sub-DB per sensor. Then your key/value is just a 32-bit timestamp / 32-bit sensor reading, and all of your data retrievals are simple range scans on the key. You can easily retrieve on the order of 50M records/sec with LMDB. (See the SkyDB guys doing just that: https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1X35alxcJ)
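A minimal sketch of that layout with the Python lmdb binding, assuming big-endian timestamp keys so lexicographic order matches time order; the path, map size, and sensor name are illustrative:

import struct
import lmdb

env = lmdb.open("sensors.lmdb", max_dbs=100_000, map_size=2 * 1024**4)   # ~2 TiB map
sensor_db = env.open_db(b"sensor-12345")           # one named sub-DB per sensor

# write one reading: big-endian key keeps keys sorted by time
with env.begin(write=True, db=sensor_db) as txn:
    txn.put(struct.pack(">I", 1365984000), struct.pack("<I", 42))

# range scan: everything between two timestamps
t_start, t_end = 1365984000, 1368316800
with env.begin(db=sensor_db) as txn:
    cur = txn.cursor()
    if cur.set_range(struct.pack(">I", t_start)):  # jump to first key >= t_start
        for key, raw in cur:
            ts = struct.unpack(">I", key)[0]
            if ts > t_end:
                break
            value = struct.unpack("<I", raw)[0]
            # ...collect or graph (ts, value)...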
Try VictoriaMetrics as a time series database for large amounts of data.
It is optimized for storing and querying large amounts of time series data.
It needs low disk IOPS and bandwidth thanks to a storage design based on LSM trees, so it can work quite well on HDD instead of SSD.
It has a good compression ratio, so 100 billion typical data points would require less than 100 GB of HDD storage. Read the technical details on its data compression.
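A rough sketch of how the sensor stream could be pushed in and read back, assuming a single-node VictoriaMetrics listening on its default port 8428 and using its InfluxDB line protocol and Prometheus-compatible query endpoints; the metric and label names are illustrative:

import time
import requests

VM = "http://localhost:8428"

# ingest: one Influx line per reading, sensor id as a tag, timestamp in nanoseconds
def write_reading(sensor_id, value, ts_seconds):
    line = f"reading,sensor={sensor_id} value={value} {int(ts_seconds * 1e9)}"
    requests.post(f"{VM}/write", data=line)

# query: Prometheus-style range query over one sensor and time window
def query_sensor(sensor_id, t_start, t_end, step="60s"):
    resp = requests.get(f"{VM}/api/v1/query_range", params={
        "query": f'reading_value{{sensor="{sensor_id}"}}',
        "start": t_start,
        "end": t_end,
        "step": step,
    })
    return resp.json()

write_reading(12345, 42, time.time())
print(query_sensor(12345, "2013-04-15T00:00:00Z", "2013-05-12T00:00:00Z"))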
If I have a 32^3 array of 64-bit integers, but it contains only a dozen different values, can you tell HDF5 to use an "internal mapping" to save memory and/or disk space? What I mean is that the array would be accessed normally as 64-bit ints, but each value would internally be stored as a byte (?) index into a table of 64-bit ints, potentially saving about 7/8 of the memory and/or disk space. If this is possible, does it actually save memory, disk space, or both?
I don't believe that HDF5 provides this functionality right out of the box, but there is no reason why you couldn't implement routines to write your data to an HDF5 file, and read it back again, in the way that you seem to want. I suppose you could write your look-up table and your index array into different datasets, as in the sketch below.
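A minimal h5py sketch of that two-dataset idea, assuming the array fits in memory; the file and dataset names are arbitrary:

import numpy as np
import h5py

# example 32^3 array of int64 with only a dozen distinct values
data = np.random.choice(np.arange(12, dtype=np.int64) * 1000, size=(32, 32, 32))

# split the array into a small lookup table and a byte-sized index array
values, indices = np.unique(data, return_inverse=True)

with h5py.File("compact.h5", "w") as f:
    f.create_dataset("lookup", data=values)                                     # ~a dozen int64s
    f.create_dataset("indices", data=indices.reshape(data.shape).astype(np.uint8))

# reading back: expand the indices through the lookup table
with h5py.File("compact.h5", "r") as f:
    restored = f["lookup"][...][f["indices"][...]]

assert (restored == data).all()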
It's possible, though I have no evidence for it, that HDF5's compression facility would compress your integer dataset enough to save a useful amount of space.
Then again, for the HDF5 files I work with (tens of GBs) I wouldn't bother devising my own encoding scheme to save the modest amount of space that a 32768-element array of 64-bit numbers might be able to dispense with. Sure, you could transform a dataset of 2097152 bits into one of 131072 bits, but disk space (even RAM) just isn't that tight these days.
I'm beginning to form the impression that you are trying to use HDF5 on, perhaps, a smartphone :-)