Redis Data Structure Space Requirements - linked-list

What is the difference in space between sorted sets and lists in Redis? My guess is that sorted sets are some kind of balanced binary tree, and lists are a linked list. On top of the three values I'm encoding for each of them (key, score, value; although I'll munge together score and value for the linked list), the overhead is that the linked list needs to keep track of one other node and the binary tree needs to keep track of two, so the space overhead of using a sorted set is O(N).
If my value and score are both longs, and the pointers to the other nodes are also longs, it seems like the space overhead of a single node goes from 3 longs to 4 longs on a 64-bit computer, which is a 33% increase in space.
Is this true?

It is much more than your estimate. Let's suppose ziplists are not used (i.e. you have a significant number of items).
A Redis list is a classical doubly-linked list: 3 pointers (prev, next, value) per item.
A sorted set is a dictionary plus a skip list. In the dictionary, items will be stored with 3 pointers as well (key, value, next). The skip list memory footprint is more complex to evaluate: each node takes 1 double (score), 2 pointers (obj, backward), plus n (pointer, span) pairs with n between 1 and 32. Most items will take only 1 or 2 such pairs.
In other words, when it is not represented as a ziplist, a sorted set is by far the Redis data structure with the most overhead. Compared to a list, the memory overhead is more than 200% (i.e. 3 times).
Note: the best way to evaluate memory consumption with Redis is to try to build a big list or sorted set with pseudo-data and use INFO to get the memory footprint.
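For instance, here is a rough sketch of that measurement from Python, assuming a local, disposable Redis instance you can fill with throwaway keys and the redis-py client (both assumptions of mine, not part of the answer):

import redis

r = redis.Redis()                      # assumes Redis running on localhost:6379
N = 100_000                            # enough members that the zset is not ziplist/listpack encoded
CHUNK = 10_000

def used_memory():
    return r.info("memory")["used_memory"]

before = used_memory()
for start in range(0, N, CHUNK):
    r.rpush("test:list", *range(start, start + CHUNK))
list_bytes = used_memory() - before

before = used_memory()
for start in range(0, N, CHUNK):
    r.zadd("test:zset", {str(i): i for i in range(start, start + CHUNK)})
zset_bytes = used_memory() - before

print("list: ~%.1f bytes/item" % (list_bytes / N))
print("zset: ~%.1f bytes/item" % (zset_bytes / N))
r.delete("test:list", "test:zset")

The used_memory delta is noisy (allocator fragmentation, other clients), so treat the per-item numbers as ballpark figures rather than exact overheads; also note that recent Redis versions implement lists as quicklists rather than the classical linked list described above, which lowers the list's per-item cost further.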


Lua table: if the keys start from 1000, will there be a performance loss?

a={}
a[1000]=1
I don't define anything else. Will storage space be allocated for indices 1 to 999?
Also, given that a[1] through a[999] are nil, will a traversal still step through indices 1 to 1000 in sequence?
No, there is not going to be space allocated for the other (1-999) elements (unless you create 1000 elements and then delete 1-999). Lua supports "sparse" arrays and will use the hash part of the table to store those key/value pairs.
If you are asking whether accessing a[1000] is going to be slower when elements 1-999 are not present, then it's possible, as in this case the hash part of the table is used (instead of the array part), but you'll have to benchmark to see whether there is any observable difference that matters in your case.

Interview: System/API design

This question was asked at one of the big software companies. I have come up with a simple solution and I want to know what others think of it.
You are supposed to design an API and a backend for a system that can allot phone numbers to people living in a city. The phone numbers will start from 111-111-1111 and end at 999-999-9999. The API should enable the clients (people in the city) to do the following:
When a client requests a phone number, the system allots one of the available numbers to them.
Some clients may want fancy numbers, so they can specifically ask for a number to be allotted to them. If the requested number is available then the system allots it to them, otherwise the system allots any available number.
The system need not know which number is allotted to which client. The same client may make successive requests and get multiple phone numbers for himself, but the system is not bothered. At any point of time, the system only knows which phone numbers are allotted and which phone numbers are free.
The numbers from 111-111-1111 to 999-999-9999 correspond to roughly 8 billion numbers. Assuming that memory is not a constraint, I can think of the following two approaches (which are quite similar).
Maintain a huge boolean array of length 8 billion and have a next pointer that points to an array index (next is initialized to zero). If the value pointed to by next is not free, then advance next until a free number is found. When fancy numbers are requested, just check whether the corresponding index position is free and return the number. The downside of this approach is that, when allocating numbers in the regular way, if there is a huge chunk (say 1 billion) of numbers in the middle that was allocated by fancy allocation, then the next pointer has to be moved 1 billion times.
To overcome the difficulty mentioned in the previous design, we can use some sort of linked hashmap. We maintain a doubly linked list (this replaces the array in the previous design) and another array of the same length as the list, where each element of the array points to a corresponding element in the list. When allocating numbers in the regular way, we advance a pointer in the linked list and mark nodes as we allocate them (same as the previous method). When allocating fancy numbers, we can directly find the node in the list that corresponds to the requested number by first indexing into the array and then following the pointer. Once the node is identified, we short-circuit the previous node and the next node so that we do not have to skip the used numbers one by one (which was the problem with the previous approach) when doing a regular allocation.
Let me know whether I am on the right track. Please enlighten me with any important details that I am missing.
You can do significantly better in your answer to this question.
First you should design your API. The one recommended by Icarus3 is perfectly good:
string acquireNextAvailableNumber();
boolean acquireRequestedNumber(string special);
The second one returns true (and reserves the number) if it is available, otherwise returns false.
The question doesn't specify how you allocate phone numbers, so allocate them to suit yourself. Make the first 'next available' request return "111-111-1111", the next "111-111-1112", etc. This means you can record all the numbers allocated through 'next' by just remembering the last one allocated. (You'll need to ask whether "111-111-1119" is followed by "111-111-1120" or "111-111-1121", but that's the sort of thing you should be asking anyway. In any case, the important thing is that you can work out the next number knowing the last allocated one.)
Special requests you will need to store individually. A hash table works, but so does a BST or simply an ordered list. It depends on what trade-offs you want between space and speed, and how often special numbers are likely to be requested. I'll use a BST (ordered by the number) in the rest of this, for reasons I'll come to.
So, how do you code this? For the next allocated number:
Look at the last allocated number, and find the next in sequence.
Check that number hasn't been allocated as a special number. You can do this very quickly with a BST because if it's there, it will be the lowest entry in the BST.
If the number was in the 'special numbers' database, increment the 'allocated numbers' value (to include that number) and remove the entry from the special numbers. Then repeat this process until you get a number that isn't in the special numbers.
Note that this process ensures that all 'special numbers' lower than the last one allocated by 'next' do not appear in the special numbers database. As the 'last normal number allocated' increases, it absorbs any special numbers allocated that were less than that, removing them from the table. This is what ensures that when we ask whether the next number in sequence is in the special numbers database, we only have to look at the lowest entry.
Checking for a special number is easy. If it is lower than the last 'normal' number allocated it isn't available. Otherwise you check to see if it exists in the BST. If it doesn't, you add it to the BST.
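A minimal sketch of this scheme in Python (my own naming; I use a plain set for the special numbers instead of a BST for brevity, and treat phone numbers as plain integers, ignoring the dash formatting):

class PhoneNumberAllocator:
    FIRST = 111_111_1111
    LAST = 999_999_9999

    def __init__(self):
        self.last_sequential = self.FIRST - 1   # highest number handed out by 'next'
        self.specials = set()                   # fancy numbers above last_sequential

    def acquire_next_available_number(self):
        n = self.last_sequential + 1
        # Absorb any fancy numbers sitting just above the sequential watermark.
        while n in self.specials:
            self.specials.remove(n)
            n += 1
        if n > self.LAST:
            raise RuntimeError("number space exhausted")
        self.last_sequential = n
        return n

    def acquire_requested_number(self, special):
        if not (self.FIRST <= special <= self.LAST):
            return False
        if special <= self.last_sequential or special in self.specials:
            return False                        # already allocated
        self.specials.add(special)
        return True

With an ordered container (the BST of the answer) in place of the set, the membership test in the loop only ever needs to look at the smallest entry.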
You can optimize this process by storing not just single numbers in the BST, but ranges of numbers. If the allocated special numbers are dense, this reduces the amount of space in the tree and the number of accesses needed to find whether one is in there. If the test for the 'next' number discovers a range of size n, you can immediately increment the highest normal number by n, instead of having to go round the loop n times.
First, you did not prototype your APIs. For example, if I had to design these APIs I would publish two:
string acquireNextAvailableNumber();
string acquireRequestedNumber(string special);
Second, you need to decide how you are going to implement it: code driven or data driven?
You can maintain a hash of all these numbers (it will consume memory) and quickly query the availability of a number. Or
you could maintain a single list storing only the numbers that have been handed out (less memory). So, whenever a request comes, you search numbers 1 to n against that list (increased time complexity); if the first free (or the requested) number isn't there, you allocate it to the client and add that entry to the list.
As there are billions of numbers, you will need to consider the trade-off between space and time.
You could also take advantage of a database.
To enhance the previous answers: a plain BST may not be good enough, as insertions or deletions can make it unbalanced. A balanced BST, e.g. a Red-Black Tree, should be a good choice.
So, a Red-Black Tree can be created and filled at the beginning to represent the available numbers, and each allocation should remove an element from it.
init(from, to) - can be done in O(n) time; a straightforward implementation would be O(n log n). But that is a one-time initialization at your server's start.
acquireNextAvailableNumber() - should remove the smallest element; in a balanced tree this costs O(log n).
acquireRequestedNumber(special) - should search for and remove the element if found, guaranteed time cost O(log n).
In Java, a TreeSet<String> or TreeSet<Integer> could be used, since it is implemented with a Red-Black Tree.
The next question would probably have been that several request-processing threads will access your API; since Java's TreeSet is not thread-safe, you should wrap it at initialization like so:
TreeSet numbers = init(...);
SortedSet availableNumbers = Collections.synchronizedSortedSet(numbers);

how to remove duplicates from an array without sorting

I have an array which might contain duplicate objects.
I wonder if it's possible to find and remove the duplicates in the array:
- without sorting (strict requirement)
- without using a temporary secondary array
- possibly in O(N), with N the number of elements in the array
In my case the array is a Lua array, which contains tables:
t={
{a,1},
{a,2},
{b,1},
{b,3},
{a,2}
}
In my case, t[5] is a duplicate of t[2], while t[1] is not.
To summarize, you have the following options:
time: O(n^2), no extra memory - for each element in the array look for an equal one linearly
time: O(n*log n), no extra memory - sort first, walk over the array linearly after
time: O(n), memory: O(n) - use a lookup table (edit: that's trickier than it sounds here, because tables used as keys in Lua are compared by reference, not by content, so you would need to derive a comparable key from the table contents)
Pick one. There's no way to do what you want in O(n) time with no extra memory.
Can't be done in O(n) but ...
what you can do is
Iterate through the array
For each member, search forward for repetitions and remove those.
Worst-case complexity is O(n^2)
Iterate the array, stick every value in a hash table, checking whether it exists first. If it does, remove it from the original array (or don't write it to the new one). Not very memory efficient, but only O(n) since you are only iterating the array once.
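A sketch of that single pass in Python, using tuples as the items so that the 'seen' hash compares by content (order is preserved; O(n) time, O(n) extra memory for the set):

def dedupe_in_place(items):
    seen = set()
    write = 0
    for value in items:
        if value not in seen:
            seen.add(value)
            items[write] = value       # compact kept items towards the front
            write += 1
    del items[write:]                  # drop the leftover tail
    return items

t = [("a", 1), ("a", 2), ("b", 1), ("b", 3), ("a", 2)]
print(dedupe_in_place(t))              # [('a', 1), ('a', 2), ('b', 1), ('b', 3)]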
Yes, depending on how you look at it.
You can override the object insertion to prevent insertion of duplicate items. This is O(n) per object insertion and may feel faster for smaller arrays.
If you provide sorted object insertion and deletion then it is O(log n). Essentially you always keep the list sorted as you insert and delete so that finding elements is a binary search. The cost here is that element retrieval is now O(log n) instead of O(1).
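For illustration, a small Python sketch of duplicate-rejecting sorted insertion with binary search (note that while the search is O(log n), inserting into an array still shifts elements, which is O(n)):

import bisect

def insert_unique_sorted(sorted_list, value):
    i = bisect.bisect_left(sorted_list, value)
    if i < len(sorted_list) and sorted_list[i] == value:
        return False                   # duplicate, reject
    sorted_list.insert(i, value)       # O(log n) search, O(n) shift
    return True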
This can also be implemented efficiently using things like red-black trees and multitrees, at the cost of additional memory. Such implementations offer several benefits for certain problems. For example, we can get O(log n)-type behavior even for very large tables with a small memory footprint by using nested trees. The top-level tree provides a sort of pared-down overview of the dataset, while subtrees provide more refined access when needed.
For example, suppose we have N elements. We could partition them into n1 groups. Each of those groups could then be further partitioned into n2 subgroups, those into n3 groups, and so on. After m levels, each leaf group holds N/(n1*n2*...*nm) elements.
As you can see, the product of the n's becomes huge very quickly even for small n's. If N = 1 trillion elements and n1 = 1000, n2 = 1000, n3 = 1000, a lookup scans at most 1000 + 1000 + 1000 + 1000 = 4000 entries, and the structure only needs on the order of 10^9 nodes.
Compare this to the roughly 500 billion comparisons required on average for a direct linear search: it is over 100 million times faster. Compared to a binary tree it uses about 1000 times less memory, but is about 100 times slower. (Of course there is some overhead for keeping the trees consistent, but even that can be reduced.)
If we were to use a binary tree, it would have a depth of about 40. The problem is that there would be about 1 trillion nodes, which is a huge amount of additional memory. By storing multiple values per node (and, in the above case, each node actually holding partial values and further subtrees) we can significantly reduce the memory footprint but still have impressive performance.
Essentially, linear access prevails at small sizes and trees prevail at large sizes, but trees consume more memory. By using multitrees we can combine the best of both worlds: linear access over small groups, with a larger number of elements per node (compared to binary trees).
Such trees are not trivial to create but essentially follow the same algorithmic nature as standard binary trees, red-black trees, AVL trees, etc.
So if you are dealing with large datasets, performance and memory need not be a huge issue. As you probably know, as one goes up the other goes down; multitrees find a sort of optimal middle ground (assuming you choose your node sizes correctly).
With m levels of sizes n_1..n_m, each leaf group holds N/prod(n_k, k=1..m) elements, and the memory footprint is the number of internal nodes, roughly prod(n_k, k=1..m) (which can generally be reduced by an order of magnitude, or by roughly a factor of n_m).

TStringList, Dynamic Array or Linked List in Delphi?

I have a choice.
I have a number of already ordered strings that I need to store and access. It looks like I can choose between using:
A TStringList
A Dynamic Array of strings, and
A Linked List of strings (singly linked)
and Alan in his comment suggested I also add to the choices:
TList<string>
In what circumstances is each of these better than the others?
Which is best for small lists (under 10 items)?
Which is best for large lists (over 1000 items)?
Which is best for huge lists (over 1,000,000 items)?
Which is best to minimize memory use?
Which is best to minimize loading time to add extra items on the end?
Which is best to minimize access time for accessing the entire list from first to last?
On this basis (or any others), which data structure would be preferable?
For reference, I am using Delphi 2009.
Dimitry in a comment said:
Describe your task and data access pattern, then it will be possible to give you an exact answer
Okay. I've got a genealogy program with lots of data.
For each person I have a number of events and attributes. I am storing them as short text strings but there are many of them for each person, ranging from 0 to a few hundred. And I've got thousands of people. I don't need random access to them. I only need them associated as a number of strings in a known order attached to each person. This is my case of thousands of "small lists". They take time to load and use memory, and take time to access if I need them all (e.g. to export the entire generated report).
Then I have a few larger lists, e.g. all the names of the sections of my "virtual" treeview, which can have hundreds of thousands of names. Again I only need a list that I can access by index. These are stored separately from the treeview for efficiency, and the treeview retrieves them only as needed. This takes a while to load and is very expensive memory-wise for my program. But I don't have to worry about access time, because only a few are accessed at a time.
Hopefully this gives you an idea of what I'm trying to accomplish.
p.s. I've posted a lot of questions about optimizing Delphi here at StackOverflow. My program reads 25 MB files with 100,000 people and creates data structures and a report and treeview for them in 8 seconds but uses 175 MB of RAM to do so. I'm working to reduce that because I'm aiming to load files with several million people in 32-bit Windows.
I've just found some excellent suggestions for optimizing a TList at this StackOverflow question:
Is there a faster TList implementation?
Unless you have special needs, a TStringList is hard to beat because it provides the TStrings interface that many components can use directly. With TStringList.Sorted := True, binary search will be used which means that search will be very quick. You also get object mapping for free, each item can also be associated with a pointer, and you get all the existing methods for marshalling, stream interfaces, comma-text, delimited-text, and so on.
On the other hand, for special needs purposes, if you need to do many inserts and deletions, then something more approaching a linked list would be better. But then search becomes slower, and it is a rare collection of strings indeed that never needs searching. In such situations, some type of hash is often used where a hash is created out of, say, the first 2 bytes of a string (preallocate an array with length 65536, and the first 2 bytes of a string is converted directly into a hash index within that range), and then at that hash location, a linked list is stored with each item key consisting of the remaining bytes in the strings (to save space---the hash index already contains the first two bytes). Then, the initial hash lookup is O(1), and the subsequent insertions and deletions are linked-list-fast. This is a trade-off that can be manipulated, and the levers should be clear.
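A hedged sketch of that two-byte-prefix hash in Python (names are mine; Python lists stand in for the linked lists, and only string membership is shown, not the associated objects):

class TwoByteHash:
    def __init__(self):
        self.buckets = [None] * 65536          # one slot per possible 2-byte prefix

    def _index(self, s):
        prefix = s[:2].ljust(2, b"\x00")       # pad strings shorter than 2 bytes
        return (prefix[0] << 8) | prefix[1]

    def add(self, s):
        i = self._index(s)
        if self.buckets[i] is None:
            self.buckets[i] = []
        rest = s[2:]                           # the prefix is implied by the bucket
        if rest not in self.buckets[i]:
            self.buckets[i].append(rest)

    def contains(self, s):
        bucket = self.buckets[self._index(s)]
        return bucket is not None and s[2:] in bucket

h = TwoByteHash()
h.add(b"Johnson")
print(h.contains(b"Johnson"), h.contains(b"Jones"))   # True False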
A TStringList. Pros: extended functionality, allowing it to grow dynamically, sort, save, load, search, etc. Cons: with heavy access to items by index, Strings[Index] introduces a noticeable performance loss (a few percent) compared to array access, plus memory overhead for each item cell.
A dynamic array of strings. Pros: combines the ability to grow dynamically, as TStrings does, with the fastest access by index and the lowest memory usage of the options. Cons: limited standard "string list" functionality.
A linked list of strings (singly linked). Pros: fast addition of an item at the list end. Cons: slowest access by index and searching, limited standard "string list" functionality, memory overhead for the "next item" pointer, speed overhead for allocating each item's memory.
TList<string>. As above.
TStringBuilder. I do not have a good idea of how to use TStringBuilder as storage for multiple strings.
Actually, there are much more approaches:
linked list of dynamic arrays
hash tables
databases
binary trees
etc
The best approach will depend on the task.
Which is best for small lists (under 10 items)?
Any of them; maybe even a static array with a variable holding the total item count.
Which is best for large lists (over 1000 items)?
Which is best for huge lists (over 1,000,000 items)?
For large lists I will choose:
- dynamic array, if I need a lot of access by the index or search for specific item
- hash table, if I need to search by the key
- linked list of dynamic arrays, if I need many item appends and no access by the index
Which is best to minimize memory use?
A dynamic array will use the least memory. But the question is not about the overhead itself; it is about at what number of items this overhead becomes noticeable, and then how to handle that number of items properly.
Which is best to minimize loading time to add extra items on the end?
A dynamic array may grow dynamically, but with a really large number of items the memory manager may not find a contiguous memory area, while a linked list will keep working as long as there is memory for at least one cell, at the cost of a memory allocation for each item. The mixed approach - a linked list of dynamic arrays - should work.
Which is best to minimize access time for accessing the entire list from first to last?
dynamic array.
On this basis (or any others), which data structure would be preferable?
For which task?
If your stated goal is to improve your program to the point that it can load genealogy files with millions of persons in it, then deciding between the four data structures in your question isn't really going to get you there.
Do the math - you are currently loading a 25 MB file with about 100,000 persons in it, which causes your application to consume 175 MB of memory. If you wish to load files with several million persons in them, you can estimate that without drastic changes to your program you will need to multiply your memory needs by a factor of 10 or more as well. There's no way to do that in a 32-bit process while keeping everything in memory the way you currently do.
You basically have two options:
Not keeping everything in memory at once, instead using a database, or a file-based solution which you load data from when you need it. I remember you had other questions about this already, and probably decided against it, so I'll leave it at that.
Keep everything in memory, but in the most space-efficient way possible. As long as there is no 64 bit Delphi this should allow for a few million persons, depending on how much data there will be for each person. Recompiling this for 64 bit will do away with that limit as well.
If you go for the second option then you need to minimize memory consumption much more aggressively:
Use string interning. Every loaded data element in your program that contains the same data but is contained in different strings is basically wasted memory. I understand that your program is a viewer, not an editor, so you can probably get away with only ever adding strings to your pool of interned strings. Doing string interning with millions of strings is still difficult; the "Optimizing Memory Consumption with String Pools" blog postings on the SmartInspect blog may give you some good ideas. These guys deal regularly with huge data files and had to make it work with the same constraints you are facing.
This should also connect this answer to your question - if you use string interning you would not need to keep lists of strings in your data structures, but lists of string pool indexes (a small sketch follows below).
It may also be beneficial to use multiple string pools, like one for names, but a different one for locations like cities or countries. This should speed up insertion into the pools.
Use the string encoding that gives the smallest in-memory representation. Storing everything as a native Windows Unicode string will probably consume much more space than storing strings in UTF-8, unless you deal regularly with strings that contain mostly characters which need three or more bytes in the UTF-8 encoding.
Due to the necessary character set conversion your program will need more CPU cycles for displaying strings, but with that amount of data it's a worthy trade-off, as memory access will be the bottleneck, and smaller data size helps with decreasing memory access load.
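A minimal sketch of such a string pool in Python (hypothetical names; a Delphi version would follow the same shape with a hash map from string to index):

class StringPool:
    def __init__(self):
        self._strings = []             # index -> string
        self._index = {}               # string -> index

    def intern(self, s):
        i = self._index.get(s)
        if i is None:
            i = len(self._strings)
            self._strings.append(s)
            self._index[s] = i
        return i

    def get(self, i):
        return self._strings[i]

names = StringPool()
places = StringPool()                  # separate pools per category, as suggested above
person = {"name": names.intern("John Smith"), "birthplace": places.intern("London")}
print(names.get(person["name"]))       # "John Smith", stored once however many people share it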
One question: How do you query: do you match the strings or query on an ID or position in the list?
Best for small # strings:
Whatever makes your program easy to understand. Program readability is very important and you should only sacrifice it in real hotspots in your application for speed.
Best for memory (if that is the tightest constraint) and load times:
Keep all strings in a single memory buffer (or memory-mapped file) and only keep pointers to the strings (or offsets). Whenever you need a string you can clip out a string using two pointers and return it as a Delphi string. This way you avoid the overhead of the string structure itself (refcount, length int, codepage int) and the memory manager structures for each string allocation.
This only works fine if the strings are static and don't change.
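A hedged illustration of the single-buffer layout in Python (Python strings carry their own overhead, so this only shows the idea; a Delphi version would keep raw bytes or a memory-mapped file and build strings on demand):

class PackedStrings:
    def __init__(self, strings):
        encoded = [s.encode("utf-8") for s in strings]
        self._buffer = b"".join(encoded)       # one big buffer, no per-string header
        self._offsets = []                     # (offset, length) per string
        pos = 0
        for b in encoded:
            self._offsets.append((pos, len(b)))
            pos += len(b)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        off, length = self._offsets[i]
        return self._buffer[off:off + length].decode("utf-8")

packed = PackedStrings(["John", "Mary", "London"])
print(packed[2])                               # London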
TList, TList<>, array of string and the solution above have a "list" overhead of one pointer per string. A linked list has an overhead of at least 2 pointers (singly linked list) or 3 pointers (doubly linked list). The linked-list solution does not have fast random access, but allows for O(1) resizes where the other options have O(lg N) (using a growth factor for resize) or O(N) (using a fixed resize increment).
What I would do:
If < 1000 items and performance is not of utmost importance: use TStringList or a dynamic array, whichever is easiest for you.
Else, if the data is static: use the trick above. This will give you O(lg N) query time, the least memory used and very fast load times (just gulp it in or use a memory-mapped file).
All the structures mentioned in your question will fail with large amounts of data (1M+ strings) that need to be dynamically changed in code. At that point I would use a balanced binary tree or a hash table, depending on the type of queries I need to make.
From your description, I'm not entirely sure if it could fit in your design but one way you could improve on memory usage without suffering a huge performance penalty is by using a trie.
Advantages relative to binary search tree
The following are the main advantages of tries over binary search trees (BSTs):
Looking up keys is faster. Looking up a key of length m takes worst case O(m) time. A BST performs O(log(n)) comparisons of keys, where n is the number of elements in the tree, because lookups depend on the depth of the tree, which is logarithmic in the number of keys if the tree is balanced. Hence in the worst case, a BST takes O(m log n) time. Moreover, in the worst case log(n) will approach m. Also, the simple operations tries use during lookup, such as array indexing using a character, are fast on real machines.
Tries can require less space when they contain a large number of short strings, because the keys are not stored explicitly and nodes are shared between keys with common initial subsequences.
Tries facilitate longest-prefix matching, helping to find the key sharing the longest possible prefix of characters all unique.
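For a feel of why prefixes are shared, here is a minimal, hedged trie sketch in Python (not Delphi, and not tuned for memory the way a production trie would be):

class TrieNode:
    __slots__ = ("children", "is_end")
    def __init__(self):
        self.children = {}
        self.is_end = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

t = Trie()
t.add("London")
t.add("Londonderry")                    # shares the "London" prefix nodes
print("London" in t, "Lond" in t)       # True False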
Possible alternative:
I've recently discovered SynBigTable (http://blog.synopse.info/post/2010/03/16/Synopse-Big-Table) which has a TSynBigTableString class for storing large amounts of data using a string index.
Very simple, single-layer bigtable implementation, and it mainly uses disk storage, so it consumes a lot less memory than expected when storing hundreds of thousands of records.
As simple as:
aId := UTF8String(Format('%s.%s', [name, surname]));
bigtable.Add(data, aId)
and
bigtable.Get(aId, data)
One catch: indexes must be unique, and the cost of an update is a bit high (first delete, then re-insert).
TStringList stores an array of pointer to (string, TObject) records.
TList stores an array of pointers.
TStringBuilder cannot store a collection of strings. It is similar to .NET's StringBuilder and should only be used to concatenate (many) strings.
Resizing dynamic arrays is slow, so do not even consider it as an option.
I would use Delphi's generic TList<string> in all your scenarios. It stores an array of strings (not string pointers). It should have faster access in all cases due to no (un)boxing.
You may be able to find or implement a slightly better linked-list solution if you only want sequential access. See Delphi Algorithms and Data Structures.
Delphi promotes its TList and TList<>. The internal array implementation is highly optimized and I have never experienced performance/memory issues when using it. See Efficiency of TList and TStringList

Lookup table size reduction

I have an application in which I have to store a couple of million integers in a lookup table. Obviously I cannot store that amount of data in memory: I have to store the data in an embedded system, so I am very limited in space. I would like to ask about recommended methods for reducing the size of the lookup table. I cannot use function approximation such as neural networks; the values need to be in a table. The range of the integers is not known at the moment. When I say integers I mean 32-bit values.
Basically the idea is to use some compression method to reduce the amount of memory without losing much precision. This needs to run in hardware, so the computation overhead cannot be very high.
In my algorithm I have to access one value of the table, do some operations with it, and afterwards update the value. In the end what I should have is a function to which I pass an index and get a value back, and another function to write a value into the table.
I found one method called tile coding, which is based on several lookup tables. Does anyone know any other methods?
Thanks.
I'd look at the types of numbers you need to store and pull out the information that's common for many of them. For example, if they're tightly clustered, you can take the mean, store it, and store the offsets. The offsets will have fewer bits than the original numbers. Or, if they're more or less uniformly distributed, you can store the first number and then store the offset to the next number.
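For instance, a hedged sketch of that offset idea with Python's standard array module, using the first value as the base (the answer suggests the mean; either works as long as the offsets stay small enough for the chosen width):

from array import array

def pack_offsets(values):
    base = values[0]
    # 'h' = signed 16-bit; assumes every value is within +/-32767 of the base.
    # A real implementation would need an escape mechanism for outliers.
    offsets = array("h", (v - base for v in values))
    return base, offsets

def unpack(base, offsets, i):
    return base + offsets[i]

base, offs = pack_offsets([100_000, 100_017, 99_998, 100_250])
print(unpack(base, offs, 3))            # 100250, stored in 2 bytes instead of 4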
It would help to know what your key is to look up the numbers.
I need more detail on the problem. If you cannot store the real value of the integers but instead an approximation, that means you are going to throw away some of the data (detail), correct? I think you are looking for a hash, which can be an art form in itself. For example, say you have 32-bit values: one hash would be to take the 4 bytes and xor them together, resulting in a single 8-bit value, reducing your storage by a factor of 4 but also losing information from the original data. Typically you could go further and only use a few of those 8 bits, say the lower 4, and reduce the value further.
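A tiny sketch of that xor fold (the function names are my own):

def xor_fold_32_to_8(n):
    # Fold a 32-bit value to 8 bits by xor-ing its four bytes together.
    return (n ^ (n >> 8) ^ (n >> 16) ^ (n >> 24)) & 0xFF

def xor_fold_32_to_4(n):
    # Go further and keep only the low 4 bits of the 8-bit fold.
    return xor_fold_32_to_8(n) & 0x0F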
I think the real problem is: either you need the data or you don't. If you need the data, you need to compress it or find more memory to store it. If you don't, then use a hash of some sort to reduce the number of bits until you reach the amount of memory you have available for storage.
Read http://www.cs.ualberta.ca/~sutton/RL-FAQ.html
"Function approximation" refers to the
use of a parameterized functional form
to represent the value function
(and/or the policy), as opposed to a
simple table."
Perhaps that applies. Also, update your question with additional facts -- don't merely answer in the comments.
Edit.
A bit array can easily store a bit for each of your millions of numbers. Let's say you have numbers in the range of 1 to 8 million. In a single megabyte of storage you can have a 1 bit for each number in your set and a 0 for each number not in your set.
If you have numbers in the range of 1 to 32 million, you'll require 4 MB of memory for a big table covering all 32M distinct numbers.
See my answer to Modern, high performance bloom filter in Python? for a Python implementation of a bit array of unlimited size.
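A minimal bit-array sketch in Python with a bytearray, assuming the numbers fall in a known range starting at 0:

class BitArray:
    def __init__(self, size):
        self._bits = bytearray((size + 7) // 8)    # 1 bit per possible number

    def set(self, n):
        self._bits[n >> 3] |= 1 << (n & 7)

    def clear(self, n):
        self._bits[n >> 3] &= ~(1 << (n & 7)) & 0xFF

    def __contains__(self, n):
        return bool(self._bits[n >> 3] & (1 << (n & 7)))

present = BitArray(8_000_000)           # ~1 MB covers 8 million numbers
present.set(1_234_567)
print(1_234_567 in present, 42 in present)   # True False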
If you are merely looking for the presence of a number, a Bloom filter might be what you are looking for. Honestly, though, your question is fairly vague and confusing. It would help to explain what the Q values are and what you do with them once you find them in the table.
If your set of integers is homogeneous, then you could try a hash table, because there is a trick you can use to cut the size of the stored integers in half.
Assume the integer n can serve as its own hash because the set is homogeneous. Assume you have 0x10000 (64k) buckets; the bucket index is iBucket = n & 0xFFFF. Each item in a bucket then only needs to store 16 bits, since the low 16 bits are implied by the bucket index. The other thing you have to do to keep the data small is to store the count of items in the bucket and use an array to hold the items; a linked list would be too large and slow. When you iterate the array looking for a match, remember you only need to compare the 16 bits that are stored.
So assume a bucket is a pointer to the array plus a count; on a 32-bit system that is 64 bits max. If the number of ints were small enough we might be able to do some fancy things and use 32 bits per bucket. 64k buckets * 8 bytes = 512 KB, and 2 million 16-bit values = 4 MB. So this gets you a method to look up the ints and about 40% compression.
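A hedged Python sketch of that bucket layout (Python's per-object overhead hides the savings, so this only illustrates the indexing; on the embedded side each bucket would be a raw array of 16-bit values plus a count):

from array import array

class SplitHashSet:
    NUM_BUCKETS = 0x10000                    # 65536 buckets, indexed by the low 16 bits

    def __init__(self):
        self._buckets = [None] * self.NUM_BUCKETS

    def add(self, n):
        b = n & 0xFFFF                       # bucket index = low 16 bits
        hi = (n >> 16) & 0xFFFF              # only the high 16 bits get stored
        if self._buckets[b] is None:
            self._buckets[b] = array("H")    # unsigned 16-bit entries
        if hi not in self._buckets[b]:
            self._buckets[b].append(hi)

    def __contains__(self, n):
        bucket = self._buckets[n & 0xFFFF]
        return bucket is not None and ((n >> 16) & 0xFFFF) in bucket

s = SplitHashSet()
s.add(0xDEADBEEF)
print(0xDEADBEEF in s, 0xDEADBEEE in s)      # True False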
