Suppose I have an array of integers. Each integer represents the number of requests against a database within a 10-minute window. The array contains 20,000 numbers, representing a 4-month period of continuous 10-minute windows. Now suppose I have 10 such arrays, which come from 10 different real-world databases.
Now I want to generate 400 arrays out of the existing 10 arrays. I don't have much background in statistics, but I read that simple bootstrapping is not applicable to time-series data (or dependent data in general). Is there a simple method based on bootstrapping that works for time-series data, and what software (stand-alone or libraries, preferably free) is available to get this done without putting in more than 2 days of development?
Thanks for your help,
Jan
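(For reference: the usual fix for dependent data is a block bootstrap, which resamples contiguous blocks instead of individual points; R's boot package (tsboot) and Python's arch package (arch.bootstrap) ship implementations. A minimal sketch of a moving-block bootstrap in Python/NumPy follows; the block length of one day and the placeholder input data are assumptions, not part of the question.)

import numpy as np

def block_bootstrap(series, block_len, rng):
    # Moving-block bootstrap: concatenate randomly placed contiguous blocks
    # until the resampled series reaches the original length, then trim.
    n = len(series)
    n_blocks = -(-n // block_len)                      # ceil(n / block_len)
    starts = rng.integers(0, n - block_len + 1, size=n_blocks)
    return np.concatenate([series[s:s + block_len] for s in starts])[:n]

rng = np.random.default_rng(0)
originals = [rng.poisson(100, size=20_000) for _ in range(10)]   # placeholder for the 10 real series
block_len = 144                                                  # one day of 10-minute windows (assumption)
synthetic = [block_bootstrap(originals[rng.integers(10)], block_len, rng)
             for _ in range(400)]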
Related
I have 93 arrays. Each array has about 18 values on average.
I need to compute the (Cartesian) product of these arrays.
So I have a two-dimensional array that stores these 93 arrays.
Here is what I try to do:
DATASET.first.product(*DATASET[1..-1])
Ruby returns
RangeError: too big to product
Does anyone know a workaround for this?
Some way to chunk them?
What you want is impossible.
The product of 93 arrays with ~18 elements each is an array with approximately 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 elements, each of which is a 93-element array.
This means you need 549975033204266172374216967425209467080301768557741749051999338598022831065169332830885722071173603516904554174087168 * 93 * 64bit of memory to store it, which is roughly 409181424703974032246417423764355843507744515806959861294687507916928986312485983626178977220953161016576988305520852992 bytes. That is about 40 orders of magnitude more than the number of particles in the universe. In other words, even if you were to convert the entire universe into RAM, you would still need to find a way to store on the order of 827180612553027 yobibyte on each and every particle in the universe; that is about 6000000000000000000000000 times the information content of the World Wide Web and 10000000000000000000000 times the information content of the dark web.
Does anyone know some workaround to figure out of it? Some ways to chunk them?
Even if you process them in chunks, that doesn't change the fact that you still need to process 51147678087996754030802177970544480438468064475869982661835938489616123289060747953272372152619145127072123538190106624 elements. Even if you were able to process one element per CPU instruction (which is unrealistic, you will probably need dozens if not hundreds of instructions), and even if each instruction only takes one clock cycle (which is unrealistic, on current mainstream CPUs, each instruction takes multiple clock cycles), and even if you had a terahertz CPU (which is unrealistic, the fastest current CPUs top out at 5 GHz), and even if your CPU had a million cores (which is unrealistic, even GPUs only have a couple of thousand extremely simple cores), and even if your motherboard had a million sockets (which is unrealistic, mainstream motherboards only have a maximum of 4 sockets, and even the biggest supercomputers only have 10 million cores in total), and even if you had a million of those computers in a cluster, and even if you had a million of those clusters in a supercluster, and even if you had a million friends that also have a supercluster like this, it would still take you about 1621000000000000000000000000000000000000000000000000000000000000000000 years to iterate through them.
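(For the record, the figures above are easy to reproduce; a quick sanity check, shown here in Python for brevity, using 8 bytes per 64-bit value and the same hypothetical 10^42 instructions per second, gives the same orders of magnitude.)

combos = 18 ** 93                        # ~5.5e116 possible 93-element combinations
cells = combos * 93                      # individual values across all combinations
ram_bytes = cells * 8                    # 64 bits per value
ops_per_sec = 10 ** 12 * 10 ** 30        # 1 THz x 1e6 cores x 1e6 sockets x 1e6 machines x 1e6 clusters x 1e6 friends
years = cells / ops_per_sec / (3600 * 24 * 365)
print(f"{combos:.2e} combinations, {ram_bytes:.2e} bytes, {years:.2e} years")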
Right, so now that it is hopefully clear that this should not be attempted, I'll take a risk and attempt to solve your actual problem.
You've mentioned in the comments that you need this array for property testing - I'll take a massive leap of faith here and assume you want to test that every possible combination satisfies some conditions - and this is the mistake here, as the number of possible combinations is just... large...
Instead, you can test that some of the combinations work. You can easily generate a short, randomized list of combinations using:
Array.new(num) { DATASET.map(&:sample) }
Where num is the number of combinations you want to test. Note that there is a chance that some of the entries will be duplicated - but given your dataset size the chances are comparable to colliding UUIDs and can be safely ignored.
Generating such a subset of the possible combinations is much easier, faster and, most importantly, possible. Since the output is randomized, it will test slightly different combinations on each run, so remember to have some randomization/seed setup in your test suite if you want to be able to reproduce failures.
I have a large set of short time series (the average length of a series is about 20 points). The total size of the data is around 6 GB.
The current system works in the following way:
1) Load 6 GB data into RAM.
2) Process the data.
3) Put the forecast value corresponding to each time series into Excel.
The problem is that every time I run the above system it takes nearly 1 hour on my 8 GB RAM PC.
Please suggest a better way to reduce the runtime.
Use a faster programming language. For example, prefer Julia or C++ over MATLAB or Python.
Try to make your code more efficient. For example, instead of passing data to your functions by copying it (pass by value), pass it by reference (What's the difference between passing by reference vs. passing by value?). Use more efficient data structures.
Divide your dataset into smaller parts, work on each small part separately, and then merge the outputs at the end (see the sketch below).
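A rough sketch of that divide-and-merge idea in Python, assuming the data lives in a CSV sorted by series id; the file names, column names and the trivial forecast function are placeholders for your own:

import csv
from itertools import groupby

def forecast_one(values):
    return sum(values) / len(values)      # placeholder for the real forecasting routine

# Stream the file instead of loading all 6 GB at once: rows are assumed to be
# sorted by series id, with columns series_id,timestamp,value.
with open("timeseries.csv", newline="") as src, open("forecasts.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for series_id, group in groupby(csv.DictReader(src), key=lambda r: r["series_id"]):
        values = [float(r["value"]) for r in group]       # one short series (~20 points)
        writer.writerow([series_id, forecast_one(values)])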
We have a 1 GB List that was created using the View.asList() method in Beam SDK 2.0. We are trying to iterate through every member of the list and, for now, do nothing significant with it (we just sum up a value). Just reading this 1 GB list is taking about 8 minutes (and that was after we set workerCacheMb=8000, which we think means the worker cache is 8 GB). (If we don't set workerCacheMb to 8000, it takes over 50 minutes before we just kill the job.) We're using an n1-standard-32 instance, which should have more than enough RAM. There is ONLY a single thread reading this 1 GB list. We know this because we create a dummy PCollection of one integer and use it to then read this 1 GB ViewList side input.
It should not take 6 minutes to read in a 1 GB list, especially if there's enough RAM. EVEN if the list were materialized to disk (which it shouldn't be), a normal single NON-ssd disk can read data at 100 MB/s, so it should take ~10 seconds to read in this absolute worst case scenario....
What are we doing wrong? Did we discover a Dataflow bug? Or maybe the workerCacheMb is really in KB instead of MB? We're tearing our hair out here....
Try setWorkerCacheMb(1000); 1000 MB is around 1 GB. The side input will then be served from each worker node's cache, which will be fast.
DataflowWorkerHarnessOptions options = PipelineOptionsFactory.create().cloneAs(DataflowWorkerHarnessOptions.class);
options.setWorkerCacheMb(1000); // size of the per-worker in-memory cache used for side inputs, in MB
Do you really need to iterate over the whole 1 GB of side input data every time, or do you only need some specific data during the iteration?
If you only need specific data, get it by passing a specific index to the list. Getting data by index is a much faster operation than iterating over the whole 1 GB.
After checking with the Dataflow team, the rate of 1GB in 8 minutes sounds about right.
Side inputs in Dataflow are always serialized. This is because to have a side input, a view of a PCollection must be generated. Dataflow does this by serializing it to a special indexed file.
If you give more information about your use case, we can help you think of ways of doing it in a faster manner.
I have some scientific measurement data which should be permanently stored in a data store of some sort.
I am looking for a way to store measurements from 100 000 sensors with measurement data accumulating over years to around 1 000 000 measurements per sensor. Each sensor produces a reading once every minute or less frequently. Thus the data flow is not very large (around 200 measurements per second in the complete system). The sensors are not synchronized.
The data itself comes as a stream of triplets: [timestamp] [sensor #] [value], where everything can be represented as a 32-bit value.
In the simplest form this stream would be stored as-is into a single three-column table. Then the query would be:
SELECT timestamp,value
FROM Data
WHERE sensor=12345 AND timestamp BETWEEN '2013-04-15' AND '2013-05-12'
ORDER BY timestamp
Unfortunately, with row-based DBMSs this will give very poor performance, as the data mass is large and the data we want is dispersed almost evenly throughout it. (Trying to pick a few hundred thousand records from billions of records.) What I need performance-wise is a reasonable response time for human consumption (the data will be graphed for a user), i.e. a few seconds plus data transfer.
Another approach would be to store the data from one sensor into one table. Then the query would become:
SELECT timestamp,value
FROM Data12345
WHERE timestamp BETWEEN '2013-04-15' AND '2013-05-12'
ORDER BY timestamp
This would give a good read performance, as the result would be a number of consecutive rows from a relatively small (usually less than a million rows) table.
However, the RDBMS would then have 100 000 tables, all of which are used within a few minutes. This does not seem to be feasible with common systems. On the other hand, an RDBMS does not seem to be the right tool anyway, as there are no relations in the data.
I have been able to demonstrate that a single server can cope with the load by using the following mickeymouse system:
Each sensor has its own file in the file system.
When a piece of data arrives, its file is opened, the data is appended, and the file is closed.
Queries open the respective file, find the starting and ending points of the data, and read everything in between.
Very few lines of code. The performance depends on the system (storage type, file system, OS), but there do not seem to be any big obstacles.
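For illustration, the file-per-sensor approach described above really does fit in a few dozen lines of Python; the 8-byte fixed-size record and the binary search over the file follow directly from the description (paths, field widths and endianness are assumptions):

import os, struct

RECORD = struct.Struct("<II")            # 4-byte timestamp + 4-byte value = 8 bytes per reading

def append_reading(root, sensor_id, timestamp, value):
    # readings arrive in time order, so appending keeps each file sorted by timestamp
    with open(os.path.join(root, f"{sensor_id}.bin"), "ab") as f:
        f.write(RECORD.pack(timestamp, value))

def query(root, sensor_id, t_start, t_end):
    path = os.path.join(root, f"{sensor_id}.bin")
    n = os.path.getsize(path) // RECORD.size
    with open(path, "rb") as f:
        def ts_at(i):                                    # timestamp of the i-th record
            f.seek(i * RECORD.size)
            return RECORD.unpack(f.read(RECORD.size))[0]
        lo, hi = 0, n                                    # binary search for the first record >= t_start
        while lo < hi:
            mid = (lo + hi) // 2
            if ts_at(mid) < t_start:
                lo = mid + 1
            else:
                hi = mid
        f.seek(lo * RECORD.size)
        out = []
        for _ in range(lo, n):
            ts, value = RECORD.unpack(f.read(RECORD.size))
            if ts > t_end:
                break
            out.append((ts, value))
        return out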
However, if I go down this road, I end up writing my own code for partitioning, backing up, moving older data deeper down in the storage (cloud), etc. Then it sounds like rolling my own DBMS, which sounds like reinventing the wheel (again).
Is there a standard way of storing the type of data I have? Some clever NoSQL trick?
Seems like a pretty easy problem really. 100 billion records at 12 bytes per record -> 1.2 TB; this isn't even a large volume for modern HDDs. In LMDB I would consider using a subDB per sensor. Then your key/value is just a 32-bit timestamp / 32-bit sensor reading, and all of your data retrievals will be simple range scans on the key. You can easily retrieve on the order of 50M records/sec with LMDB. (See the SkyDB guys doing just that: https://groups.google.com/forum/#!msg/skydb/CMKQSLf2WAw/zBO1X35alxcJ)
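Not an official recipe, just a sketch of that subDB-per-sensor layout using the Python lmdb binding; the map size, DB names and big-endian key encoding (so that keys sort chronologically) are assumptions:

import lmdb, struct

env = lmdb.open("sensors.lmdb", map_size=2 * 1024**4, max_dbs=100_000)   # room for 100k sub-databases

def store(sensor_id, timestamp, value):
    db = env.open_db(f"sensor-{sensor_id}".encode())     # one named sub-database per sensor
    with env.begin(write=True, db=db) as txn:
        txn.put(struct.pack(">I", timestamp), struct.pack(">I", value))

def query(sensor_id, t_start, t_end):
    db = env.open_db(f"sensor-{sensor_id}".encode())
    out = []
    with env.begin(db=db) as txn:
        cur = txn.cursor()
        if cur.set_range(struct.pack(">I", t_start)):    # jump to the first key >= t_start
            for key, val in cur:                         # then range-scan forward
                ts = struct.unpack(">I", key)[0]
                if ts > t_end:
                    break
                out.append((ts, struct.unpack(">I", val)[0]))
    return out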
Try VictoriaMetrics as a time series database for large amounts of data.
It is optimized for storing and querying large amounts of time series data.
It needs low disk IOPS and bandwidth thanks to a storage design based on LSM trees, so it can work quite well on HDD instead of SSD.
It has a good compression ratio, so 100 billion typical data points would require less than 100 GB of HDD storage, i.e. under 1 byte per data point on average. Read the technical details on data compression.
We want to use Redis for one of our data stores. We have a hard time "guessing" what the size of that Redis store will be, and we're hoping someone can come up with the right help.
This store will be built exclusively using sorted sets. Each set will have a key that is an integer between 1 and 10^10. We currently have about 8M keys, but we expect to reach 30M 'quickly'.
Each set will have a variable number of elements, but the average is 17 elements, with a max of 135 and a min of 0. (Let me know if we need to provide other numbers, like st. dev.).
The elements in the sorted set will be strings. Now we want them to be the shortest string possible (5 or 6 chars?), but still avoid collisions. The scores will be timestamps.
We currently have about 500 writes/sec, but expect to grow that 10 times, and we currently have 3000 reads/sec and expect to grow that also 10 times.
We will also use the "dump" strategy rather than AOF.
Our goal is to use a single (yet big) Redis master store (and maybe some slave stores). How much RAM should we allocate to our Redis instance?
If you use Redis 2.6, you can benefit from the ziplist memory optimization applied to sorted sets, because most of your zsets have a small number of items (this is governed by the zset-max-ziplist-entries and zset-max-ziplist-value settings, which default to 128 and 64).
To calculate the memory you need, you can simply fill an instance with a small number of keys corresponding to your requirements and extrapolate. For this use case, memory consumption will grow linearly with the number of keys.
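For instance, a throwaway script along these lines (shown with redis-py against a local test instance; the fixed 17 members per set and 6-character members are taken from the averages above) fills a sample of keys and extrapolates from Redis's own memory stats:

import random, string, time
import redis

r = redis.Redis()                        # local test instance, measurement only
r.flushdb()
baseline = r.info("memory")["used_memory"]

N = 100_000                              # sample size to extrapolate from
for _ in range(N):
    key = random.randint(1, 10**10)      # keys are integers between 1 and 10^10
    members = {"".join(random.choices(string.ascii_lowercase, k=6)): time.time()
               for _ in range(17)}       # ~17 members per zset, 6-char strings, timestamp scores
    r.zadd(key, members)                 # redis-py 3.x style: zadd(name, {member: score})

used = r.info("memory")["used_memory"] - baseline
print(f"{used / N:.0f} bytes per key -> {used / N * 30_000_000 / 2**30:.1f} GiB for 30M keys")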
I have just tried it on my system: I get 30 MB per 100,000 keys (following your specifications), which extrapolates to 9 GB of memory for 30M keys. You need to take some margin, and include some space for the copy-on-write (COW) memory spent at save time.
A 12 GB server would probably work if you are careful.
A 16 GB server will be just fine.