How to remove duplicates from an array without sorting - Lua

I have an array which might contain duplicate objects.
I wonder if it's possible to find and remove the duplicates in the array:
- without sorting (strict requirement)
- without using a temporary secondary array
- possibly in O(N), with N the number of elements in the array
In my case the array is a Lua array, which contains tables:
t = {
  {"a", 1},
  {"a", 2},
  {"b", 1},
  {"b", 3},
  {"a", 2},  -- same contents as t[2]
}
In my case, t[5] is a duplicate of t[2], while t[1] is not.

To summarize, you have the following options:
time: O(n^2), no extra memory - for each element in the array look for an equal one linearly
time: O(n*log n), no extra memory - sort first, walk over the array linearly after
time: O(n), memory: O(n) - use a lookup table (edit: in Lua, tables can be used as keys, but they are compared by identity rather than by contents, so you would need to derive a content-based key for each record, e.g. by serializing it; see the sketch below)
Pick one. There's no way to do what you want in O(n) time with no extra memory.
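A minimal sketch of that third option for the records in the question, assuming each record is a pair of strings/numbers; the seen table is the O(n) extra memory, but no secondary output array is needed:

-- Removes value-equal duplicates in place in a single pass.
local function dedup_in_place(t)
  local n = #t
  local seen, w = {}, 0
  for r = 1, n do
    local rec = t[r]
    -- content-based key: assumes both fields are strings or numbers
    -- that do not themselves contain "\0"
    local key = tostring(rec[1]) .. "\0" .. tostring(rec[2])
    if not seen[key] then
      seen[key] = true
      w = w + 1
      t[w] = rec            -- compact kept records towards the front
    end
  end
  for i = w + 1, n do t[i] = nil end   -- drop the leftover tail
  return t
end

dedup_in_place(t)   -- with the example above, t keeps 4 records; the second {"a", 2} is gone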

Can't be done in O(n), but...
What you can do is:
Iterate through the array.
For each member, search forward for repetitions and remove them.
Worst-case complexity is O(n^2).
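A sketch of that approach, assuming "equal" means the two-field records from the question compare field by field; no extra memory beyond a couple of locals:

-- O(n^2) in-place duplicate removal: for each element, delete any
-- equal element that appears later in the array.
local function same(x, y)
  return x[1] == y[1] and x[2] == y[2]
end

local function remove_duplicates(t)
  local i = 1
  while i <= #t do
    local j = i + 1
    while j <= #t do
      if same(t[i], t[j]) then
        table.remove(t, j)   -- shifts the tail left, keeps the array contiguous
      else
        j = j + 1
      end
    end
    i = i + 1
  end
  return t
end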

Iterate the array, stick every value in a hash table, checking whether it already exists first. If it does, remove it from the original array (or don't write it to the new one). Not very memory efficient, but O(n) since you only iterate the array once.

Yes, depending on how you look at it.
You can override the object insertion to prevent insertion of duplicate items. This is O(n) per object insertion and may feel faster for smaller arrays.
If you provide sorted insertion and deletion then duplicate detection drops to O(log n): the list is kept sorted at all times, so finding an element is a binary search. The trade-off is that insertions and deletions now have to maintain that order instead of being simple O(1) appends.
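A minimal sketch of duplicate-rejecting sorted insertion, assuming each record from the question can be reduced to a single comparable key (here a string built from its two fields):

local function key_of(rec)
  return tostring(rec[1]) .. "\0" .. tostring(rec[2])
end

-- Insert rec into the sorted array arr unless an equal record is already there.
-- The binary search is O(log n); the insert itself is O(n) because elements shift.
local function insert_unique(arr, rec)
  local k = key_of(rec)
  local lo, hi = 1, #arr + 1
  while lo < hi do
    local mid = (lo + hi) // 2   -- Lua 5.3+; use math.floor((lo + hi) / 2) on older versions
    local mk = key_of(arr[mid])
    if mk == k then return false end       -- duplicate: reject
    if mk < k then lo = mid + 1 else hi = mid end
  end
  table.insert(arr, lo, rec)
  return true
end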
This can also be implemented efficiently using structures such as red-black trees and multiway trees, at the cost of additional memory. Such implementations offer several benefits for certain problems. For example, we can get O(log n)-like behavior even for very, very large tables, with a small memory footprint, by nesting trees: the top level provides a pared-down overview of the dataset, while the subtrees provide more refined access when needed.
To see this, suppose we have N elements. We can partition them into n1 groups, partition each of those groups into n2 subgroups, those into n3, and so on; after m levels each leaf bucket holds N/(n1*n2*...*nm) elements.
The product of the n's grows very quickly even for small n's. If N = 1 trillion elements and n1 = n2 = n3 = 1000, each leaf bucket holds 1000 elements, so a lookup that scans linearly at each level costs about 1000 + 1000 + 1000 + 1000 = 4000 comparisons, and there are only about 10^9 internal nodes to keep in memory.
Compare this to the roughly 500 billion comparisons an average direct linear search over 1 trillion elements requires: the nested structure is over 100 million times faster. It also needs about 1000 times less memory than a binary tree over the same data, while being only about 100 times slower than that tree (and there is some overhead for keeping the trees consistent, but even that can be reduced).
A binary tree over the same data would only be about 40 levels deep, but it would need about 1 trillion nodes, which is a huge amount of additional memory. By storing multiple values per node (in the scheme above, each node holds a slice of values plus subtrees) we can significantly reduce the memory footprint and still get impressive performance.
Essentially, linear access wins at small sizes and trees win at large sizes, but trees consume more memory. Multiway trees combine the best of both worlds: linear access over small runs of elements, with many more elements per node than a binary tree.
Such trees are not trivial to build, but they follow essentially the same algorithmic ideas as standard binary trees, red-black trees, AVL trees, etc.
So if you are dealing with large datasets, this need not be a huge issue for performance or memory. As you probably know, as one goes up the other goes down; multiway trees find something close to the optimal middle ground (assuming you choose your node sizes correctly).
With m levels of fan-out n_1, ..., n_m, each leaf bucket holds N / (n_1 * n_2 * ... * n_m) elements, a lookup costs roughly n_1 + n_2 + ... + n_m comparisons plus the leaf scan, and the memory footprint is the number of internal nodes, on the order of n_1 * n_2 * ... * n_m (which can generally be reduced by another order of magnitude, or by roughly a factor of n_m).
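A toy two-level illustration of the nesting idea, with an assumed fan-out of 1024 top-level buckets and a sorted array inside each bucket (this sketches the partitioning only, not a balanced production structure; Lua 5.3+ for //):

-- Keys are spread over N1 buckets by a cheap hash; inside a bucket,
-- membership is a binary search over a sorted Lua array.
local N1 = 1024   -- assumed top-level fan-out

local function new_index()
  local buckets = {}
  for i = 1, N1 do buckets[i] = {} end
  return buckets
end

local function bucket_of(key)
  return key % N1 + 1            -- assumes non-negative integer keys; any hash would do
end

local function insert(index, key)
  local b = index[bucket_of(key)]
  local lo, hi = 1, #b + 1
  while lo < hi do               -- binary search for the insertion point
    local mid = (lo + hi) // 2
    if b[mid] == key then return false end   -- already present
    if b[mid] < key then lo = mid + 1 else hi = mid end
  end
  table.insert(b, lo, key)
  return true
end

local function contains(index, key)
  local b = index[bucket_of(key)]
  local lo, hi = 1, #b
  while lo <= hi do
    local mid = (lo + hi) // 2
    if b[mid] == key then return true end
    if b[mid] < key then lo = mid + 1 else hi = mid - 1 end
  end
  return false
end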

Related

Lua table, if the key starts from 1000, will there be a performance loss?

a={}
a[1000]=1
I don't define anything else. Will storage space be occupied for keys 1 to 999?
Also, if I set a[1] = nil or a[999] = nil, will traversal still step through 1 to 1000 in sequence?
No, there is not going to be space allocated for other (1-999) elements (unless you create 1000 elements and then delete 1-999). Lua supports "sparse" arrays and will use the hash part of the table to store those key/value pairs.
If you are asking whether a[1000] is going to be slower if 1-999 elements are not present, then it's possible, as in this case the hash part of the table is going to be used (instead of the array part), but you'll have to benchmark to see if there is any observable difference that matters in your case.
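One rough way to check is a micro-benchmark along these lines (a sketch only; absolute numbers depend on the Lua or LuaJIT version, and the difference may well be negligible):

-- Compare reading index 1000 from a dense table (lives in the array part)
-- and from a sparse table (lives in the hash part).
local ITER = 1e7

local dense = {}
for i = 1, 1000 do dense[i] = i end    -- keys 1..1000: array part

local sparse = {}
sparse[1000] = 1000                    -- only key 1000: hash part

local function bench(t)
  local x
  local start = os.clock()
  for i = 1, ITER do x = t[1000] end
  return os.clock() - start
end

print("dense:  ", bench(dense))
print("sparse: ", bench(sparse))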

Converting an apriori object to a list takes too much time even for a small number of records

I am working on a data set of more than 22,000 records, and when I tried it with the apriori model it takes way too much time even for a small number of records, like 20. Is there a problem in my code, or is there a faster way to convert the associations into a list? The code I used is below.
transactions = []
for i in range(0, 20):
    transactions.append([str(dataset.values[i, j]) for j in range(0, 543)])
from apyori import apriori
associations = apriori(transactions, min_support=0.004, min_confidence=0.3, min_lift=3, min_length=2)
result = list(associations)
It's difficult to assess without your data, but the complexity of Apriori is based on a number of factors, including your support threshold, number of transactions, number of items, average/max transaction length, etc.
In cases where even a small number of transactions takes a long time to run, it's often a matter of too low a minimum support. When support is very low (near 0) the algorithm is effectively still brute-forcing, since it has to look at all possible combinations of items, of every length. This is the equivalent of a mathematical power set, which is exponential. For just 41 items you're actually trying 2^41 - 1 possible combinations, which is about 2.2 TRILLION possibilities.
I recommend starting with a "high" min_support at first (e.g. 0.20) and then working your way down slowly. It's easier to test things that take seconds at first than ones that'll take a long time.
Other important note: There is no min_length parameter in Apyori. I'm not sure where everyone's getting that from (you're not alone in thinking there is one), unless it's this one random blog post I found. The parameters are as follows (straight from the code):
Keyword arguments:
min_support -- The minimum support of relations (float).
min_confidence -- The minimum confidence of relations (float).
min_lift -- The minimum lift of relations (float).
max_length -- The maximum length of the relation (integer).
For what it's worth, I wrote unofficial docs for Apyori that can be found here.

Redis Data Structure Space Requirements

What is the difference in space between sorted sets and lists in Redis? My guess is that sorted sets are some kind of balanced binary tree and lists are linked lists. On top of the three values I'm encoding for each entry (key, score, value - though I'll munge score and value together for the linked list), the overhead is that the linked list needs to keep track of one other node while the binary tree needs to keep track of two, so the extra space for using a sorted set is O(N).
If my value and score are both longs, and the pointers to the other nodes are also longs, it seems like the space per node goes from 3 longs to 4 longs on a 64-bit computer, which is a 33% increase in space.
Is this true?
It is much more than your estimation. Let's suppose ziplists are not used (i.e. you have a significant number of items).
A Redis list is a classical double-linked list: 3 pointers (prev,next,value) per item.
A sorted set is a dictionary plus a skip list. In the dictionary, items will be stored with 3 pointers as well (key,value,next). The skip list memory footprint is more complex to evaluate: each node takes 1 double (score), 2 pointers (obj,backward), plus n couples (pointer,span value) with n between 1 and 32. Most items will take only 1 or 2 couples.
In other words, when it is not represented as a ziplist, a sorted set is by far the Redis data structure with the most overhead. Compared to a list, the memory overhead is more than 200% (i.e. 3 times).
Note: the best way to evaluate memory consumption with Redis is to try to build a big list or sorted set with pseudo-data and use INFO to get the memory footprint.
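For example, a small server-side Lua script along those lines (a sketch: the key names and item count are arbitrary, it should only be run against a test instance since the loop blocks the server, and per-key measurement with MEMORY USAGE needs Redis 4.0+):

-- Fill a list and a sorted set with the same pseudo-data so their
-- footprints can be compared afterwards.
local n = tonumber(ARGV[1]) or 100000
redis.call('DEL', 'test:list', 'test:zset')
for i = 1, n do
  local member = 'member:' .. i
  redis.call('RPUSH', 'test:list', member)
  redis.call('ZADD', 'test:zset', i, member)
end
return n

Run it with redis-cli --eval (e.g. redis-cli --eval populate.lua , 100000, where the file name is arbitrary), then compare MEMORY USAGE test:list against MEMORY USAGE test:zset, or the before/after numbers from INFO memory. Use enough items that the compact ziplist/listpack encodings are abandoned, as assumed above.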

Find the last-but-N item in a stream without storing N items

Suppose there is a stream of data arriving: D(0), D(1), D(2), .... When D(i) comes, I want to know D(i - N). The most straightforward way is to store the most recent N items and keep updating them as new data arrives. But the problem is that N can be large, so there is not enough memory to store them. Is there any way to achieve this by storing far fewer items than N? A constant M << N of space would be preferred. Thanks in advance.
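(For concreteness, the "straightforward way" mentioned above is a ring buffer of size N; a minimal sketch in Lua, assuming the items fit in a plain array:)

local RingBuffer = {}
RingBuffer.__index = RingBuffer

function RingBuffer.new(n)
  return setmetatable({ n = n, i = 0, items = {} }, RingBuffer)
end

-- push D(i); returns D(i - n) once n items have gone by, nil before that
function RingBuffer:push(value)
  local slot = self.i % self.n + 1
  local old = self.items[slot]     -- the value pushed n steps ago
  self.items[slot] = value
  self.i = self.i + 1
  return old
end

It still stores N items, which is exactly the memory cost in question.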
Not as far as I can see, unless there is some regularity in the data that you can exploit. If the data are completely random (such that no element can be inferred from the others), then a choice of not saving element k will make it impossible to reproduce that element in iteration k + N.
Instead, consider:
Can you reduce N?
Can you store information on disk or (if you are in an embedded environment) on a slower, cheaper form of memory?
Is there some pattern in the data? If there is e.g. a repeating pattern, you can utilize that, or if there is some mathematical relationship between the numbers, perhaps some formula can aid in reconstructing one number from others. Even if there is no perceptible pattern, perhaps you could use some compression algorithm to reduce the data size?
Is there some limitation to the data, e.g. every number is between 0 and 255? If so, you could perhaps reduce the storage requirements.
(What is the application of this, by the way?)

Lookup table size reduction

I have an application in which I have to store a couple of million integers in a lookup table. Obviously I cannot keep that much data in memory: this has to run on an embedded system, so I am very limited in space. I would therefore like to ask about recommended methods for reducing the size of the lookup table. I cannot use function approximation such as neural networks; the values need to be in a table. The range of the integers is not known at the moment. When I say integers I mean 32-bit values.
Basically the idea is to use some compression method to reduce the amount of memory without losing too much precision. This needs to run in hardware, so the computational overhead cannot be very high.
In my algorithm I have to access one value of the table, do some operations with it, and then update the value. In the end I should have one function that takes an index and returns a value, and another function to write a value back into the table.
I found one method called tile coding, which is based on several lookup tables. Does anyone know of any other methods?
Thanks.
I'd look at the types of numbers you need to store and pull out the information that's common for many of them. For example, if they're tightly clustered, you can take the mean, store it, and store the offsets. The offsets will have fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then store the offset to the next number.
It would help to know what your key is to look up the numbers.
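A sketch of the offset idea (Lua 5.3+ for string.pack/unpack; the sample values and the assumption that every offset fits in a signed 16-bit integer are mine):

-- Store clustered 32-bit values as 16-bit offsets from their mean.
local values = {100012, 100087, 99995, 100104, 100033}

local sum = 0
for _, v in ipairs(values) do sum = sum + v end
local base = math.floor(sum / #values)

local packed = {}
for i, v in ipairs(values) do
  packed[i] = string.pack("i2", v - base)   -- 2 bytes per value instead of 4
end
local blob = table.concat(packed)

-- decode entry i back to its original value
local function get(i)
  local off = string.unpack("i2", blob, (i - 1) * 2 + 1)
  return base + off
end

print(get(3))  --> 99995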
I need more detail on the problem. If you cannot store the real value of the integers but instead an approximation, that means you are going to throw away some of the data (detail), correct? I think you are looking for a hash, which can be an art form in itself. For example, say you have 32-bit values: one hash would be to take the 4 bytes and xor them together, resulting in a single 8-bit value, reducing your storage by a factor of 4 but also losing the real value of the original data. Typically you would go further and use only a few of those 8 bits, say the lower 4, reducing the value further.
I think the real question is whether you need the data or not. If you need it, you have to compress it or find more memory to store it; if you don't, use a hash of some sort to reduce the number of bits until you reach the amount of memory you have available for storage.
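A concrete version of that byte-folding hash (Lua 5.3+ bit operators, where binary ~ is xor), with the caveat above that it deliberately throws information away:

-- Fold a 32-bit value down to 8 bits by xor-ing its four bytes,
-- then optionally keep only the low 4 bits.
local function fold8(x)
  return (x & 0xFF) ~ ((x >> 8) & 0xFF) ~ ((x >> 16) & 0xFF) ~ ((x >> 24) & 0xFF)
end

local function fold4(x)
  return fold8(x) & 0x0F
end

print(fold8(0x12345678))  --> 8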
Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the
use of a parameterized functional form
to represent the value function
(and/or the policy), as opposed to a
simple table."
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If your numbers are in the range of 1 to 32 million, you'll need 4 MB of memory for a bit array covering all 32M possible values.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
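For illustration, a minimal bit-array sketch in Lua (5.3+ integer bit operators), packing 64 flags per integer; this is a sketch of the idea, not the Python implementation linked above:

local BitSet = {}
BitSet.__index = BitSet

function BitSet.new()
  return setmetatable({ words = {} }, BitSet)
end

function BitSet:set(n)
  local w, b = n // 64, n % 64
  self.words[w] = (self.words[w] or 0) | (1 << b)
end

function BitSet:test(n)
  local w, b = n // 64, n % 64
  return ((self.words[w] or 0) >> b) & 1 == 1
end

local present = BitSet.new()
present:set(1000000)
print(present:test(1000000), present:test(42))  --> true   false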
If you are merely looking for the presence of the number in question, a bloom filter might be what you are looking for. Honestly though, your question is fairly vague and confusing. It would help to explain what the Q values are and what you do with them once you find them in the table.
If your set of integers is homogeneous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers, in your case, in half.
Assume the integer n itself can serve as the hash, since the set is homogeneous. Assume you have 0x10000 (64K) buckets. Each bucket index is iBucket = n & 0xFFFF. Each item in a bucket need only store the remaining 16 bits (the upper half of n), since the lower 16 bits are implied by the bucket index. The other thing you have to do to keep the data small is to keep a count of items in each bucket and use an array to hold them; a linked list would be too large and slow. When you iterate the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assume a bucket is a pointer to its array plus a count: on a 32-bit system that is 64 bits max (if the number of ints were small enough we might be able to do some fancy things and use 32 bits per bucket). 64K buckets * 8 bytes = 512 KB, and 2 million 16-bit shorts = 4 MB. So this gives you a way to look up the ints with about 40% space savings.
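A toy Lua sketch of that bucket layout (it shows the indexing scheme only; Lua itself would not actually save the 16 bits per entry unless you packed the buckets with string.pack or dropped to C, so the storage arithmetic above is the real argument):

-- 0x10000 buckets keyed by the low 16 bits of n;
-- each bucket stores only the high 16 bits of its members.
local buckets = {}

local function add(n)
  local lo, hi = n & 0xFFFF, n >> 16
  local b = buckets[lo]
  if not b then b = {}; buckets[lo] = b end
  b[#b + 1] = hi
end

local function contains(n)
  local lo, hi = n & 0xFFFF, n >> 16
  local b = buckets[lo]
  if not b then return false end
  for i = 1, #b do
    if b[i] == hi then return true end   -- compare only the stored 16 bits
  end
  return false
end

add(0x12345678)
print(contains(0x12345678), contains(0x12340000))  --> true   false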
