Find the last item but N in a stream without storing N items

Suppose there is a stream of data arriving, D(0), D(1), D(2), .... When D(i) comes, I want to know D(i - N). The most straightforward way is to store the most recent N items and keep updating them as new data arrives. But the problem is that N can be so large that there is not enough memory to store them all. Is there any way to achieve this by storing far fewer items than N? A constant M << N of space would be preferred. Thanks in advance.
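For reference, here is a minimal Python sketch of the straightforward approach described above, a ring buffer of the last N items; the answer below explains why, for arbitrary data, you cannot do fundamentally better than this O(N) storage. All names are illustrative.

from collections import deque

class Lookback:
    """Keeps the most recent N items; push(D(i)) returns D(i - N) once it exists."""
    def __init__(self, n):
        self.buf = deque(maxlen=n)   # the deque drops the oldest item automatically

    def push(self, value):
        # When the buffer is full, its oldest entry is exactly D(i - N).
        old = self.buf[0] if len(self.buf) == self.buf.maxlen else None
        self.buf.append(value)
        return old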

Not as far as I can see, unless there is some regularity in the data that you can exploit. If the data are completely random (such that no element can be inferred from the others), then a choice of not saving element k will make it impossible to reproduce that element in iteration k + N.
Instead, consider:
Can you reduce N?
Can you store information on disk or (if you are in an embedded environment) on a slower, cheaper form of memory?
Is there some pattern in the data? If there is e.g. a repeating pattern, you can utilize that, or if there is some mathematical relationship between the numbers, perhaps some formula can aid in reconstructing one number from others. Even if there is no perceptible pattern, perhaps you could use some compression algorithm to reduce the data size?
Is there some limitation to the data, e.g. every number is between 0 and 255? If so, you could perhaps reduce the storage requirements.
(What is the application of this, by the way?)

Related

What is the time complexity of sorting in the Stream API with the sorted() and thenComparing() methods (sorting by multiple fields/conditions)?

I have read that the sorted() method in the Stream API may use merge sort (mergesort).
The time complexity for that kind of sort would then be:
Θ(n log n) - best
Ω(n log n) - average
O(n log n) - worst
space complexity - O(n) worst
And what is the time complexity if we sort by multiple fields of a custom object, using thenComparing() to build a chain of comparators?
How would you calculate the time complexity in such a case?
While the actual algorithm used in Stream.sorted is intentionally unspecified, there are obvious reasons not to implement yet another sorting algorithm but to use the existing implementation of Arrays.sort.
The current implementation uses TimSort, a variation of merge sort that can exploit runs of pre-sorted elements within the input. Its best case is entirely linear, when the input is already sorted, and this also covers the case that the input is sorted backwards; in these cases, no additional memory is needed. The average case lies somewhere between that best case and the unchanged worst case of O(n log n).
As explained in this answer, general statements about the algorithms used in Arrays.sort are tricky, because all of them are hybrid sort algorithms and are constantly improved.
Normally, a comparison function does not depend on the size of the input (the array or collection to sort), and this does not change when using Comparator.comparing(…).thenComparing(…): a more expensive comparison function only adds a constant factor that does not affect the overall time complexity, as long as the comparator still does not depend on the input size.
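To illustrate the last point, here is a small Python sketch (the same reasoning carries over to Java's Comparator.comparing(…).thenComparing(…)): the chained, multi-field comparison does a fixed amount of work per call, independent of n, so the sort still performs O(n log n) comparisons overall. The data and function names are made up for the example.

from functools import cmp_to_key

people = [("bob", 30), ("alice", 25), ("alice", 22)]   # illustrative data

def by_name_then_age(a, b):
    # Two-field comparison chain; cost per call does not grow with len(people).
    if a[0] != b[0]:
        return -1 if a[0] < b[0] else 1
    return (a[1] > b[1]) - (a[1] < b[1])

# Python's list.sort is also a TimSort; the chained comparator only changes the constant factor.
people.sort(key=cmp_to_key(by_name_then_age))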

Understanding ETL processes

ETL seems to be a pretty common task. I am reading about mistakes that ETL designers make with very large data at http://it.toolbox.com/blogs/infosphere/17-mistakes-that-etl-designers-make-with-very-large-data-19264
I need some practical insight on the following points:
a) Incorporating inserts, updates, and deletes into the same data flow / same process. How is that a problem?
b) Sourcing multiple systems at the same time, depending on heterogeneous systems of data.
c) Not producing the correct indexes on the sources/ lookups that need to be accessed.
d) Believing that ‘ I need to process all the data in one pass because it’s the fastest way to do it ‘
Any help?
a) Data integrity issues.
b) Data quality will increase, and there is less chance of failure with smaller chunks.
c) It will take more time to complete.
d) Wrong indexes can cost more time; it is better to have indexes based on the query you are executing,
i.e. what comes in the WHERE clause of the statement.
e) Splitting the data into smaller data sets and processing them separately would be an efficient solution.
You're a BITS Pilani (WILP) student, right?
A) It's a problem if you find the task takes too long to complete (due to increased data volumes) and it then becomes too difficult to split the steps apart afterwards. But splitting the tasks out can increase the possibility of inconsistent data loads (i.e. your DELETE works but your INSERT fails, meaning you are missing a load of data).
B) I don't understand 'at the same time' here - Do you mean simultaneously? You could max out bandwidth (network, disk etc.) if you simultaneously try to load data from many systems. Sometimes you don't have a choice if you need to load that data at offline times.
C) Yes incorrect indexes will slow down access. But often vendors don't like you creating indexes in the source database.
D) Performance tuning (the fastest way to do it) is a complex topic. In some cases it might be faster to do it in one pass. In other cases it may not.

"Smart" / Economical Data Storage Techniques?

I would like to store millions of data lines that look like this:
key, value
key is an integer in the range of (0 to 5,000,000); all values are unique;
value is an unsigned int16 value (0 to 65535)
The goal is to store the data while taking the LEAST AMOUNT OF DISK SPACE, and yet be able to query it. Can you think of any algorithms / smart schemes for data storage that would be helpful?
Just in case it matters, I use Linux.
One option, if the key values are not important data but rather just index data, would be to use a flat file of bits (with a descriptive header). Every 16 bits is a value, and the nth value is then (n - 1) * 16 bits from the end of the header.
Additionally, if the key value does matter, a fixed flat file of about 10 MB would allow the entire key space to be stored without storing actual keys: the 16 bits at the (n - 1) * 16 offset are that key's value.
That is probably the least space-intensive method of storage, as only the data that is literally required is stored. (Though if you are only interested in, say, 100k values and one of them has a key of 5 million, you do end up with a lot of wasted space, which wouldn't be there with an actual key/value addressing system. So this methodology only achieves minimal disk storage for sets of tightly grouped values or for very many values, over about the 2 million mark.)
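A minimal Python sketch of that flat-file layout (the header size and the function names are illustrative): the value for key n sits at a fixed offset, so no keys are stored at all.

import struct

HEADER_SIZE = 16                    # illustrative; whatever the descriptive header occupies
RECORD = struct.Struct("<H")        # one unsigned 16-bit value per key, little-endian

def read_value(path, n):
    """Return the uint16 stored for key n, at the (n - 1) * 16-bit offset described above."""
    with open(path, "rb") as f:
        f.seek(HEADER_SIZE + (n - 1) * RECORD.size)
        return RECORD.unpack(f.read(RECORD.size))[0]

def write_value(path, n, value):
    """Overwrite the uint16 stored for key n."""
    with open(path, "r+b") as f:
        f.seek(HEADER_SIZE + (n - 1) * RECORD.size)
        f.write(RECORD.pack(value))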
How do you plan to use the stored data, with random or sequential access? For sequential access you can use any archiving algorithm, e.g. LZMA. Random access doesn't leave you a lot of room for improvement.
Can you see any patterns in this data? E.g. if the differences between adjacent keys/values are often small, you can store only the packed differences - and there are a million other possible approaches.
[EDIT] You can also check the techniques used for data compression in network communication.
[EDIT1] And you can check this Google Code Integer Array Compression project.
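As a concrete example of the packed-differences idea mentioned above, here is a rough Python sketch that varint-encodes the gaps between sorted keys. It assumes the keys are sorted and non-negative; the encoding scheme is just one simple choice, not something prescribed by the answer.

def encode_deltas(sorted_keys):
    """Varint-encode the differences between consecutive sorted keys."""
    out, prev = bytearray(), 0
    for k in sorted_keys:
        d = k - prev
        prev = k
        while d >= 0x80:                   # 7 payload bits per byte
            out.append((d & 0x7F) | 0x80)  # high bit set = more bytes follow
            d >>= 7
        out.append(d)
    return bytes(out)

def decode_deltas(data):
    keys, cur, d, shift = [], 0, 0, 0
    for b in data:
        d |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            cur += d
            keys.append(cur)
            d, shift = 0, 0
    return keys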
This depends upon the operations and data. I would first recommend "just using a database" (a simple key-value store such as BDB/EhCache, for instance :-)
Mimisbrunnr also has a good answer if all the keys are used.
If the keys are near constant/read-only and only a relatively small percentage of them are used, consider a (disk-based) heap data structure (very similar to an array-based heap; heaps need not be array-based). Robert Sedgewick had a good book from the late 80s with a very lean implementation, but I forget the name. A heap is more beneficial than a flat index when a smaller proportion of the keys is used; at full load it will have worse storage requirements.
(If abstracted, the used method could be switched and/or a hybrid heap with indexed/sequenced leaf-node values could be used [along with Huffman encoding or whatnot], but that is just adding far more complications. Keep it simple ... hence first suggestion of an existing key/value store ;-)
Happy coding.
Have you considered using a database designed for mobile devices such as SQL Server Compact, or another similar database? These will have a small footprint on the disk, while still providing the full search power you need.
Another example of a compact database engine is KeyDB for linux:
http://3d2f.com/programs/11-989-keydb-download.shtml

how to remove duplicates from an array without sorting

I have an array which might contain duplicate objects.
I wonder if it's possible to find and remove the duplicates in the array:
- without sorting (strict requirement)
- without using a temporary secondary array
- possibly in O(N), with N the number of elements in the array
In my case the array is a Lua array, which contains tables:
t = {
  {"a", 1},
  {"a", 2},
  {"b", 1},
  {"b", 3},
  {"a", 2},
}
In my case, t[5] is a duplicate of t[2], while t[1] is not.
To summarize, you have the following options:
time: O(n^2), no extra memory - for each element in the array look for an equal one linearly
time: O(n*log n), no extra memory - sort first, walk over the array linearly after
time: O(n), memory: O(n) - use a lookup table (note: Lua tables hash by identity when used as keys, so you would need to build a comparable key, e.g. a string, from each element's fields; see the sketch after this list)
Pick one. There's no way to do what you want in O(n) time with no extra memory.
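Here is a Python sketch of the third option, O(n) time with O(n) extra memory for the lookup table; in Lua the same idea works by concatenating each element's fields into a string key (e.g. item[1] .. "|" .. item[2]). The helper name is made up for the example.

def dedupe_in_place(items):
    """Remove duplicate [x, y] pairs in one pass, keeping first occurrences.
    O(n) time; the only extra memory is the 'seen' lookup table."""
    seen = set()
    write = 0
    for item in items:
        key = (item[0], item[1])      # tuple key stands in for Lua's string key
        if key not in seen:
            seen.add(key)
            items[write] = item
            write += 1
    del items[write:]                 # truncate the tail in place
    return items

# example mirroring the question's data
t = [["a", 1], ["a", 2], ["b", 1], ["b", 3], ["a", 2]]
dedupe_in_place(t)                    # -> [['a', 1], ['a', 2], ['b', 1], ['b', 3]]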
It can't be done in O(n), but ...
what you can do is:
Iterate through the array.
For each member, search forward for repetitions and remove them.
The worst-case complexity is O(n^2).
Iterate the array and stick every value in a hash, checking whether it exists first. If it does, remove it from the original array (or don't write it to the new one). Not very memory-efficient, but only O(n) since you are only iterating the array once.
Yes, depending on how you look at it.
You can override the object insertion to prevent insertion of duplicate items. This is O(n) per object insertion and may feel faster for smaller arrays.
If you provide sorted object insertion and deletion, then finding an element is O(log n): essentially you always keep the list sorted as you insert and delete, so that finding elements is a binary search. The cost here is that element retrieval is now O(log n) instead of O(1).
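A small Python sketch of the sorted-insertion idea (it assumes the elements are directly comparable; for the tables in the question you would sort by a derived key). The class name is illustrative.

import bisect

class SortedUniqueList:
    """Keeps items sorted on insertion so duplicate checks are a binary search."""
    def __init__(self):
        self.items = []

    def insert(self, item):
        i = bisect.bisect_left(self.items, item)
        if i < len(self.items) and self.items[i] == item:
            return False             # duplicate: not inserted
        self.items.insert(i, item)   # the search is O(log n); the shift is O(n)
        return True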
This can also be implemented efficiently using things like red-black trees and multi-way trees, at the cost of additional memory. Such implementations offer several benefits for certain problems. For example, we can get O(log n)-type behavior even for very, very large tables with a small memory footprint by using nested trees (this is essentially the idea behind B-trees). The top-level tree provides a pared-down overview of the dataset, while subtrees provide more refined access when needed.
For example, suppose we have N elements. We could partition them into n1 groups, partition each of those groups into n2 further groups, and so on. After m such levels, each group holds N/(n1*n2*...*nm) elements.
As you can see, the product of the n's becomes huge very quickly even for small n's. If N = 1 trillion elements and n1 = n2 = n3 = 1000, each leaf group holds about 1000 elements, so a lookup scans roughly 1000 + 1000 + 1000 + 1000 = 4000 elements per access, and the structure needs only on the order of 10^9 nodes.
Compare this to the roughly 500 billion comparisons required on average by a direct linear search: this is over 100 million times faster, and it uses about 1000 times less memory than a binary tree while being only about 100 times slower than one. (Of course there is some overhead for keeping the trees consistent, but even that can be reduced.)
If we were to use a binary tree, it would have a depth of about 40, but it would need about 1 trillion nodes, which is a huge amount of additional memory. By storing multiple values per node (and, in the scheme above, each node holds partial values and further subtrees), we can significantly reduce the memory footprint and still get impressive performance.
Essentially, linear access prevails at small sizes and trees prevail at large sizes, but trees consume more memory. Multi-way trees combine the best of both worlds by using linear access over small groups while holding a larger number of elements per node (compared to binary trees).
Such trees are not trivial to create, but they essentially follow the same algorithmic nature as standard binary trees, red-black trees, AVL trees, etc.
So if you are dealing with large datasets, it is not a huge issue for performance and memory: essentially, as you probably know, as one goes up the other goes down, and multi-way trees find a sort of optimal middle ground (assuming you choose your node sizes correctly).
The number of elements in each leaf group of such a multi-way tree is N/∏(n_k, k=1..m), and the memory footprint is the number of nodes, which is about ∏(n_k, k=1..m) (and can generally be reduced by an order of magnitude, or by roughly a factor of n_m).

Lookup table size reduction

I have an application in which I have to store a couple of million integers in a lookup table. Obviously I cannot store that amount of data in memory, and my requirements are very limited: this is an embedded system, so I am very constrained in space. I would like to ask about recommended methods for reducing the size of the lookup table. I cannot use function approximation such as neural networks; the values need to be in a table. The range of the integers is not known at the moment. When I say integers I mean 32-bit values.
Basically the idea is to use some compression method to reduce the amount of memory without losing much precision. This needs to run in hardware, so the computational overhead cannot be very high.
In my algorithm I have to access one value of the table, do some operations with it, and afterwards update the value. In the end I should have one function to which I pass an index and get a value back, and another function to write a value into the table.
I found one method called tile coding, which is based on several lookup tables. Does anyone know any other methods?
Thanks.
I'd look at the types of numbers you need to store and pull out the information that's common for many of them. For example, if they're tightly clustered, you can take the mean, store it, and store the offsets. The offsets will have fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then store the offset to the next number.
It would help to know what your key is to look up the numbers.
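For the tightly-clustered case described above, a minimal Python sketch (it uses the minimum as the base instead of the mean so the offsets stay unsigned, and assumes the spread fits in 16 bits; names are illustrative):

from array import array

def pack_offsets(values):
    """Store one shared base plus a small offset per value (2 bytes instead of 4)."""
    base = min(values)
    return base, array("H", (v - base for v in values))

def unpack_offsets(base, offsets):
    return [base + o for o in offsets]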
I need more detail on the problem. If you cannot store the real value of the integers but instead an approximation, that means you are going to throw away some of the data (detail), correct? I think you are looking for a hash, which can be an art form in itself. For example, say you have 32-bit values: one hash would be to take the 4 bytes and XOR them together, resulting in a single 8-bit value, reducing your storage by a factor of 4 but also losing information from the original data. Typically you could go further and use only a few of those 8 bits, say the lower 4, to reduce the value even more.
I think the real problem is that either you need the data or you don't. If you need the data, you need to compress it or find more memory to store it. If you don't, then use a hash of some sort to reduce the number of bits until you reach the amount of memory you have for storage.
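The XOR-folding hash described above, as a quick Python sketch (lossy by design, as the answer says; the function names are made up):

def xor_fold_8(x):
    """Fold a 32-bit value to 8 bits by XOR-ing its four bytes together."""
    return (x ^ (x >> 8) ^ (x >> 16) ^ (x >> 24)) & 0xFF

def xor_fold_4(x):
    """Go further, as suggested: keep only the lower 4 bits of the folded value."""
    return xor_fold_8(x) & 0x0F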
Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the
use of a parameterized functional form
to represent the value function
(and/or the policy), as opposed to a
simple table."
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If you have numbers in the range of 1 to 32 million, you'll require 4 MB of memory for a big table covering all 32M distinct numbers.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
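A minimal bytearray-based sketch of such a bit array (keys assumed to be in 1..max_key; 32 million keys fit in 4 MB, as described above):

class BitArray:
    """One membership bit per key in 1..max_key."""
    def __init__(self, max_key):
        self.bits = bytearray((max_key + 7) // 8)

    def add(self, k):
        self.bits[(k - 1) >> 3] |= 1 << ((k - 1) & 7)

    def __contains__(self, k):
        return bool(self.bits[(k - 1) >> 3] & (1 << ((k - 1) & 7)))

present = BitArray(32_000_000)   # about 4 MB of bits
present.add(123_456)
print(123_456 in present)        # True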
If you are merely looking for the presence of the number in question, a Bloom filter might be what you are looking for. Honestly, though, your question is fairly vague and confusing. It would help to explain what the Q values are and what you do with them once you find them in the table.
If your set of integers is homogeneous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers in half.
Assume the integer n itself can serve as the hash, since the set is homogeneous. Assume you have 0x10000 (65,536) buckets, with each bucket index iBucket = n & 0xFFFF. Each item in a bucket then only needs to store the remaining 16 bits, since the low 16 bits are implied by the bucket index. The other thing you have to do to keep the data small is to store the count of items in each bucket and use an array to hold the bucket's items; a linked list would be too large and slow. When you iterate the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assume a bucket is a pointer to the array plus a count; on a 32-bit system this is 64 bits at most. If the number of ints were small enough, we might be able to do some fancy things and use 32 bits per bucket. 65,536 buckets * 8 bytes = 512 KB, and 2 million 16-bit values = 4 MB. So this gives you a method to look up the ints with about 40% compression.
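A Python sketch of that bucket layout (the low 16 bits select the bucket, only the high 16 bits are stored inside it). Note that per-bucket object overhead in Python is far larger than the 8 bytes quoted above, which assumes a C-style implementation on the embedded target; the sketch only illustrates the layout.

from array import array

NUM_BUCKETS = 0x10000                 # 65,536 buckets, addressed by n & 0xFFFF

class SplitIntSet:
    """Stores 32-bit unsigned integers as 16-bit remainders inside per-bucket arrays."""
    def __init__(self):
        self.buckets = [array("H") for _ in range(NUM_BUCKETS)]

    def add(self, n):
        # assumes 0 <= n < 2**32; the low 16 bits are implied by the bucket index
        self.buckets[n & 0xFFFF].append(n >> 16)

    def __contains__(self, n):
        return (n >> 16) in self.buckets[n & 0xFFFF]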

Resources