Data structure for "limited memory map" - memory

I want to implement an associative collection, mapping keys K to values V. I further want to associate each value V with a weight, so we have Map[K → (V, Double)].
The idea now is to have a limited-memory version of this collection, which only stores elements whose weight is above a threshold t. With each insertion of an element e into this collection, we increase the weight of e by some amount, e.g. 1, and decay the weights of all other elements e' != e by some decay factor, say 0.0001.
This ensures that elements present in this collection are either (a) recent or (b) frequent.
Of course, one could implement a naive version of such a data structure, which actively performs decay operations and checks for violations of threshold t for all elements. That would be terribly inefficient.
I am wondering if there already is a data structure out there which does exactly that. Maybe there is a related data structure which I could use to implement my requirements. Input is appreciated.

That sounds like a variation on an LFU (Least Frequently Used) cache eviction policy, where the decision to evict takes into account an additional threshold based on the relative proportion of updates.
You would not necessarily need to actively decay all other entries on each insert. Instead, keep a running total of inserts across all elements; when you consider evicting a particular candidate, you can compute its decayed weight lazily from that total and check whether it still exceeds the threshold at that point.
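For illustration, here is a minimal Python sketch of that lazy-decay idea (the class and parameter names, and the multiplicative form of the decay, are assumptions for this example, not something from the question). Each entry stores the weight it had when it was last touched together with the global insert count at that time, so the current weight can be reconstructed on demand and entries below the threshold can be evicted lazily when they are read.

class DecayingMap:
    """Map from key -> (value, weight) with lazily applied decay."""

    def __init__(self, decay=1e-4, threshold=0.5):
        self.r = 1.0 - decay      # per-insert retention factor applied to all other entries
        self.threshold = threshold
        self.inserts = 0          # global insert counter
        self._data = {}           # key -> (value, weight_at_last_touch, inserts_at_last_touch)

    def _current_weight(self, key):
        _value, w, n_last = self._data[key]
        # every insert since this entry was last touched belonged to some other key,
        # so the entry owes exactly one decay step per insert
        return w * (self.r ** (self.inserts - n_last))

    def insert(self, key, value):
        w = self._current_weight(key) if key in self._data else 0.0
        self.inserts += 1                       # this insert decays everyone else from now on
        self._data[key] = (value, w + 1.0, self.inserts)

    def get(self, key, default=None):
        if key not in self._data:
            return default
        if self._current_weight(key) < self.threshold:
            del self._data[key]                 # lazy eviction of entries that fell below t
            return default
        return self._data[key][0]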
A slightly different approach, without the use of weights, would be to store the keys in buckets by time, where the key-to-value lookup is done within the context of a specific bucket.
Each bucket would have an associated time range within which values could be inserted into the bucket.
Each bucket would internally have a K -> V map.
An additional map of K -> Bucket would be retained.
Buckets would be inserted into an ordered list.
Insertion into the data structure would:
Allocate a new current bucket if now is outside the valid time range for the most recent bucket.
Check whether K already exists in any bucket.
If it does not, insert K -> V into the current bucket and K -> current into the key to bucket map.
If it exists in the current (most recent) bucket, do nothing.
If it exists in an older bucket, move it to the current bucket and update the bucket map.
Things that are used recently move up into the more recent buckets; things that are not end up in the older buckets. Periodically you purge the oldest buckets.
A downside is that decay is driven by time rather than by weight, although you could probably apply the weight-based approach you describe above at the bucket level; since there are far fewer buckets than items in the collection, that should reduce the cost to an acceptable level.
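A minimal Python sketch of that bucket scheme, following the steps above; the bucket width and the number of buckets retained (bucket_seconds, max_buckets) are illustrative parameters rather than part of the description:

import time
from collections import OrderedDict

class BucketedMap:
    def __init__(self, bucket_seconds=60, max_buckets=10):
        self.bucket_seconds = bucket_seconds
        self.max_buckets = max_buckets
        self.buckets = OrderedDict()    # bucket start time -> {K: V}, oldest first
        self.key_to_bucket = {}         # K -> bucket start time

    def _current_bucket(self):
        start = int(time.time()) // self.bucket_seconds * self.bucket_seconds
        if start not in self.buckets:
            self.buckets[start] = {}    # allocate a new current bucket
            while len(self.buckets) > self.max_buckets:
                _, old = self.buckets.popitem(last=False)   # purge the oldest bucket
                for key in old:
                    self.key_to_bucket.pop(key, None)
        return start

    def insert(self, key, value):
        current = self._current_bucket()
        previous = self.key_to_bucket.get(key)
        if previous == current:
            return                      # already in the most recent bucket: do nothing
        if previous is not None:
            value = self.buckets[previous].pop(key)   # exists in an older bucket: move it
        self.buckets[current][key] = value
        self.key_to_bucket[key] = current

    def get(self, key, default=None):
        bucket = self.key_to_bucket.get(key)
        return self.buckets[bucket].get(key, default) if bucket is not None else default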

Related

Will an additional, rarely changing tag have a relevant "cost"?

In InfluxDB (Cloud) I store measurements from many weather sensors.
So, each InfluxDB data point looks like this:
weather,deviceId=1234 temperature=21.3,humidity=66 1559260800000000000
To the existing single deviceId tag I'd like to add a second tag position, resulting in points such as...
weather,deviceId=1234,position=243 temperature=21.3,humidity=66 1559260800000000000
For a given deviceId the position would change very rarely, but it can happen.
When querying sensor data, deviceId and position would always be filtered together.
Will this additional tag cause a relevant increase in billed storage (GB-hr) or negatively affect performance, or is InfluxDB able to optimize/compress this additional tag?
Some more context: some sensors might be reused and placed at a different location. It would not make much sense to analyze data of a single sensor across different positions, hence the data would always be filtered like "sensor 1234 at position 243". As this adds a fifth value to an otherwise relatively small data point, I'm worried that it might "cost" too much.
The short answer is no for both storage size and write/read performance in your case. That means you should be fine adding the additional tag.
Tag values are stored only once and always as strings.
Since the values of position are limited and far fewer than those of deviceId, the storage overhead will be small.
Regarding the write/read performance, it all comes down to series cardinality, which is simply
The number of unique measurement, tag set, and field key combinations in an InfluxDB bucket.
Again, since values of position are limited and change very rarely, it won't add much cardinality to the database. Therefore, the write/read performance won't be impacted that much.
You could check the series cardinality of the old schema and new schema as follows to make the comparison:
Query series cardinality in the weather measurement:
import "influxdata/influxdb"
influxdb.cardinality(
bucket: "yourBucket",
start: -1y,
predicate: (r) => r._measurement == "weather",
)
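If you prefer to run that comparison from a client rather than in the UI, a sketch with the influxdb-client Python package could look like the following (the URL, token, org, and bucket names are placeholders):

from influxdb_client import InfluxDBClient

FLUX = '''
import "influxdata/influxdb"

influxdb.cardinality(
    bucket: "{bucket}",
    start: -1y,
    predicate: (r) => r._measurement == "weather",
)
'''

# Compare a bucket holding the old schema against one holding the new schema.
with InfluxDBClient(url="https://example-cloud-url", token="my-token", org="my-org") as client:
    query_api = client.query_api()
    for bucket in ("weather_old_schema", "weather_new_schema"):
        for table in query_api.query(FLUX.format(bucket=bucket)):
            for record in table.records:
                print(bucket, "series cardinality:", record.get_value())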

Detect common features in multidimensional data

I am designing a system for anomaly detection.
There are multiple approaches for building such a system. I chose to implement one facet of it by detecting features shared by the majority of samples. I acknowledge the possible insufficiencies of such a method, but for my specific use case: (1) it suffices to know that a new sample contains (or lacks) features shared by the majority of past data to make a quick decision; (2) I'm interested in the insights such a method will offer into the data.
So, here is the problem:
Consider a large data set with M data points, where each data point may include any number of {key:value} features. I choose to model the training dataset by collecting all the features observed in the data (the set of all unique keys) and setting that as the model's feature space. I represent each sample by its values for the keys it contains and None for features it does not include.
Given this training data set I want to determine which features reoccur in the data; and for such reoccurring features, do they mostly share a single value.
My question:
A simple solution would be to count everything: for each of the N features, calculate the distribution of values. However, as M and N are potentially large, I wonder if there is a more compact way to represent the data, or a more sophisticated method for making claims about feature frequencies.
Am I reinventing an existing wheel? If there's an online approach for accomplishing such a task, it would be even better.
If I understand your question correctly,
you need to go over all the data anyway, so why not use hashing?
Actually, two hash tables:
An outer hash table keyed by feature, recording feature existence.
An inner hash table per feature, holding the distribution of its values.
In this way, the counts accumulated in a feature's inner hash table indicate how common that feature is in your data, and the per-value counts indicate how its values differ from one another. Another thing to notice is that you go over your data only once, and the time complexity of (almost) every hash-table operation (if you allocate enough space from the beginning) is O(1).
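A minimal Python sketch of that counting scheme, using a dict of Counters; the helper names and the min_support/min_agreement thresholds are made up for this example:

from collections import defaultdict, Counter

feature_stats = defaultdict(Counter)   # outer table: feature key -> inner Counter of values
samples_seen = 0

def observe(sample):
    """Update the statistics with one sample, given as a dict of {key: value} features."""
    global samples_seen
    samples_seen += 1
    for key, value in sample.items():
        feature_stats[key][value] += 1   # single pass, O(1) amortized per feature

def common_features(min_support=0.5, min_agreement=0.9):
    """Features seen in at least min_support of samples, and whether one value dominates."""
    report = {}
    for key, counts in feature_stats.items():
        occurrences = sum(counts.values())
        if occurrences / samples_seen < min_support:
            continue
        value, top = counts.most_common(1)[0]
        report[key] = value if top / occurrences >= min_agreement else None
    return report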
Hope it helps

C linked list or hash table for matrix operations

I have a matrix in C of size m x n. The size isn't known in advance. I need operations on the matrix such as: delete the first element and find the i-th element (the size wouldn't be too big, from 10 to 50 columns). What is more efficient to use, a linked list or a hash table? And how can I map a column of the matrix to one element of the linked list or hash table, depending on what I choose?
Thanks
Linked lists don't provide very good random access, so from that perspective you might not want to look into using them to represent a matrix, since your lookup time will take a hit for each element you attempt to find.
Hash tables are very good for looking up elements, as they can provide near constant-time lookup for any given key, assuming the hash function is decent (using a well-established hash table implementation would be wise).
Provided with the constraints that you have given though, a hashtable of linked lists might be a suitable solution, though it would still present you with the problem of finding the ith element, as you'd still need to iterate through each linked list to find the element you want. This would give you O(1) lookup for the row, but O(n) for the column, where n is the column count.
Furthermore, this is difficult because you'd have to make sure EVERY list in your hashtable is updated with the appropriate number of nodes as the number of columns grows/shrinks, so you're not buying yourself much in terms of space complexity.
A 2D array is probably best suited for representing a matrix, where you provide some capability of allowing the matrix to grow by efficiently managing memory allocation and copying.
An alternate method would be to look at something like the std::vector in lieu of the linked list, which acts like an array in that it's contiguous in memory, but will allow you the flexibility of dynamically growing in size.
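To illustrate the 2D-array suggestion, here is a small sketch (in Python for brevity, though the question is about C; the same idea maps onto a flat C array plus an offset index). "Delete first element" just advances a head offset instead of shifting memory, and finding the i-th element is plain index arithmetic:

class FlatMatrix:
    def __init__(self, n_cols):
        self.n_cols = n_cols
        self.buf = []       # flat, row-major storage
        self.head = 0       # index of the first live element

    def append_row(self, row):
        assert len(row) == self.n_cols
        self.buf.extend(row)

    def delete_first(self):
        """Drop the first live element in O(1) by advancing the head offset."""
        if self.head < len(self.buf):
            self.head += 1

    def get(self, i):
        """Return the i-th live element (row-major order) in O(1)."""
        return self.buf[self.head + i]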
If it's for lookups by key, then use a hash table; the average runtime would be O(1).
For delete/get/set at given indices in O(1), a 2D array would be optimal.

Why does a hash table take up more memory than other data-structures?

I've been doing some reading about hash tables, dictionaries, etc. All the literature and videos I have read or watched describe hash tables as having a space/time trade-off.
I am struggling to understand why a hash table takes up more space than, say, an array or a list with the same number of total elements (values)? Does it have something to do with actually storing the hashed keys?
As far as I understand and in basic terms, a hash table takes a key identifier (say some string), passes it through some hashing function, which spits out an index to an array or some other data-structure. Apart from the obvious memory usage to store your objects (values) in the array or table, why does a hash table use up more space? I feel like I am missing something obvious...
Like you say, it's all about the trade-off between lookup time and space. The larger the number of spaces (buckets) the underlying data structure has, the greater the number of locations the hash function has where it can potentially store each item, and so the chance of a collision (and therefore worse than constant-time performance) is reduced. However, having more buckets obviously means more space is required. The ratio of number of items to number of buckets is known as the load factor, and is explained in more detail in this question: What is the significance of load factor in HashMap?
In the case of a minimal perfect hash function, you can achieve O(1) performance storing n items in n buckets (a load factor of 1).
As you mentioned, the underlying structure of a hash table is an array, which is the most basic type in the data structure world.
In order to make a hash table fast, i.e. to support O(1) operations, the underlying array's capacity must be more than enough. The term load factor is used to evaluate this: the load factor is the ratio of the number of elements in the hash table to the total number of cells. It measures how full (or empty) the hash table is.
To make the hash table run fast, the load factor can't be greater than some threshold value. For example, with the quadratic probing collision-resolution method, the load factor should not be greater than 0.5. When the load factor approaches 0.5 while inserting new elements, we need to rehash the table into a larger array to keep meeting the requirement.
So a hash table's high run-time performance is bought with extra space usage. This is the time/space trade-off.
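As an illustration of where the extra space goes, here is a minimal Python sketch of an open-addressing table that rehashes whenever the load factor would exceed 0.5 (linear probing is used here for simplicity instead of the quadratic probing mentioned above); at least half of the underlying array is deliberately kept empty to keep lookups fast:

class ProbingTable:
    _EMPTY = object()

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.size = 0
        self.slots = [self._EMPTY] * capacity   # the space overhead lives here

    def _index(self, key):
        i = hash(key) % self.capacity
        while self.slots[i] is not self._EMPTY and self.slots[i][0] != key:
            i = (i + 1) % self.capacity          # linear probing
        return i

    def put(self, key, value):
        if (self.size + 1) / self.capacity > 0.5:
            self._rehash(self.capacity * 2)      # keep the load factor at or below 0.5
        i = self._index(key)
        if self.slots[i] is self._EMPTY:
            self.size += 1
        self.slots[i] = (key, value)

    def get(self, key, default=None):
        slot = self.slots[self._index(key)]
        return slot[1] if slot is not self._EMPTY else default

    def _rehash(self, new_capacity):
        live = [s for s in self.slots if s is not self._EMPTY]
        self.capacity, self.size = new_capacity, 0
        self.slots = [self._EMPTY] * new_capacity
        for key, value in live:
            self.put(key, value)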

Will Redis's sorted sets scale?

This may be more of a theoretical question but I'm looking for a pragmatic answer.
I plan to use Redis's Sorted Sets to store the ranking of a model in my database based on a calculated value. Currently my data set is small (250 members in the set). I'm wondering if the sorted sets would scale to say, 5,000 members or larger. Redis claims a 1GB maximum value and my values are the ID of my model so I'm not really concerned about the scalability of the value of the sorted set.
ZRANGE has a time complexity of O(log(N)+M). If I'm most frequently trying to get the top 5 ranked items from the set, log(N) of N set items might be a concern.
I also plan to use ZINTERSTORE which has a time complexity of O(N*K)+O(M*log(M)). I plan to use ZINTERSTORE frequently and retrieve the results using ZRANGE 0 -1
I guess my question is two fold.
Will Redis sorted sets scale to 5,000 members without issues? 10,000? 50,000?
Will ZRANGE and ZINTERSTORE (in conjunction with ZRANGE) begin to show performance issues when applied to a large set?
I have had no issues with hundreds of thousands of keys in sorted sets. Sure, getting the entire set will take longer the larger the set is, but that is expected, even from just an I/O standpoint.
One such instance was on a server with several DBs in use and several sorted sets with 50k to >150k keys in them. High writes were the norm, as these used a lot of ZINCRBY commands coming from real-time webserver log analysis peaking at over 150M records per day. And I'd store a week at a time.
Given my experience, I'd say go for it and see; it will likely be fine unless your server hardware is really low end.
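For reference, the access pattern described in the question looks roughly like this with the redis-py client (key and member names are placeholders for this example):

import redis

r = redis.Redis(host="localhost", port=6379)

# Score updates: O(log(N)) per call.
r.zincrby("ranking:models", 1.0, "model:42")
r.zincrby("ranking:recent", 3.5, "model:42")

# Top 5 by score: O(log(N) + 5), so the M term stays tiny.
top_five = r.zrevrange("ranking:models", 0, 4, withscores=True)

# Intersection of two rankings, then read everything back with ZRANGE 0 -1.
r.zinterstore("ranking:combined", ["ranking:models", "ranking:recent"])
combined = r.zrange("ranking:combined", 0, -1, withscores=True)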
In Redis, sorted sets do have a scaling limitation: a single sorted set cannot be partitioned. As a result, if the size of a sorted set exceeds the size of a partition, there is nothing you can do (without modifying Redis).
Quote from article:
The partitioning granularity is the key, so it is not possible to shard a dataset with a single huge key like a very big sorted set[1].
Reference:
[1] http://redis.io/topics/partitioning
