Limit the growth of ETS storage - erlang

I'm considering using Erlang's ETS as a cache for user searches in a new Elixir project. Based on user input, the system will do lookups using an expensive third-party API.
In order to avoid making duplicate calls for the same user input, I intend to put a cache layer in front of the external API, and ETS seems like a good option for this. However, since there is no limit to the variations of user input, I'm concerned that the storage space required for the ETS table will grow without bound.
In my reading about ETS, I haven't seen anyone else discuss concern about the size of tables in ETS. Is that because this would be an abnormal use case for ETS?
At first blush, my preference would be to limit the number of entries in the ETS table and evict (i.e. delete) the oldest entries once the limit is reached…
Is there a common strategy for dealing with an unbounded number of entries in ETS?

I use ETS tables in production as a 'smart invalidated cache' with a Redis-style API (it also has master-master replication, similar to a SQL WAL log).
The largest tables are about 200-300 MB and hold more than 1 million items. There have been no problems in the last 2 years. I know about the ERL_MAX_ETS_TABLES limit on the number of tables, but I have no information about a limit on table size.
I maintain special 'smart indexes' for these tables. ETS select/match/etc. are slow because these functions traverse all the elements in the table (unless the key is bound in the pattern).

Use the ets:tab2list(TableId) function to convert the ETS table to an ordinary list. After doing that, you can check the size of the list with the well-known BIF length(List).
Last but not least, you can now enforce a limit (just check the size of the list with pattern matching, or with an if or case expression). Note that tab2list copies the whole table, so for a large table ets:info(TableId, size) is a cheaper way to get the number of entries.
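The size-capped cache described in the question (limit the number of entries, delete the oldest once the limit is reached) can be sketched as follows. This is a Python stand-in for the idea, not ETS code; the class name BoundedCache and its methods are made up for illustration, and "oldest" here means oldest inserted, not least recently used.

```python
from collections import OrderedDict


class BoundedCache:
    """Keeps at most max_entries items; evicts the oldest inserted entry when full."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self.max_entries = max_entries
        self._entries = OrderedDict()  # preserves insertion order

    def get(self, key):
        # Returns None on a miss; the caller would then query the external API.
        return self._entries.get(key)

    def put(self, key, value):
        if key in self._entries:
            # Overwrite in place; the entry keeps its original age.
            self._entries[key] = value
            return
        if len(self._entries) >= self.max_entries:
            self._entries.popitem(last=False)  # drop the oldest entry
        self._entries[key] = value

    def __len__(self):
        return len(self._entries)
```

In an ETS version the same check would presumably be driven by the table's entry count, with a separate structure (or an ordered key) to find the oldest entry.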

Related

Does bucketing two *large* tables in Hive *in the same way* help perform much more efficient joins?

Imagine the following situation I am planning:
1. Have two rather large tables stored in Hive, both containing different types of customer-related information (say, although this is not exactly the case, a record of customer transactions in one and customer-owned data in the other). Let's call the tables A and B.
2. Tables are large in the sense that neither of them fits completely in memory. (There are 10 million customers, and there are a few kilobytes of info associated with each of them in each of the two tables.)
3. Be careful enough to bucket both tables in exactly the same way, by a field present in both tables (customer_id, which is a bigint), and using the same number of buckets (100).
I wonder whether this setup will in any way guarantee that a join (by customer_id) between the two tables will be efficient, in the sense that very little shuffling of information between nodes will be required. I imagine this could be the case if, for instance, there were a guarantee that the physical files corresponding to the same bucket in both tables are stored on the same set of nodes, i.e. if for every bucket i (in [0,99]) the file A/part_0_000i and the file B/part_0_000i were physically stored on the same nodes, and the same held for their replicas.
Notes:
I am aware that partitioning and bucketing are different, and that the first essentially determines the structure of subdirectories, whereas the second only determines which file each record goes to. This question is about bucketing only.
Also, by point 2, map-side joins are not an option here since, as far as my understanding goes, they require loading one of the tables completely within each mapper and doing the join entirely there.
Bucketing is used when the column you want to partition by has too many distinct values, or when there are no good candidate partitions.
A concrete example would be partitioning on customerID in sales data. You may have 20 thousand customers. Each partition would contain a small amount of data, which is inefficient, and having too many partitions is also inefficient. However, you can hash the customerID and partition into, for example, 50 buckets. Then, when you are merging on customerID, the job only has to scan what is in one bucket rather than the entire sum of all your data.
With ideal bucketing, each bucket should contain some multiple of the file system block size. Remember also that too many buckets, or buckets built over variables that are not used as keys, can be detrimental for other queries.
I have used them when I need to execute large jobs repeatedly. My query times have been reduced significantly. I tend to use them only when my data is very big, and big is relative to cluster size and capacity.
One great thing about buckets is that they help ensure the bucketed partitions are of similar size. If you partition over State, for example, California will have huge partitions while other states' partitions are very small.
Bucketing is tactical and not appropriate for all use cases. Happy bucketing!
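To make the "only scan what is in a bucket" point concrete, here is a small Python sketch. It is not Hive code: the modulo in bucket_of is a stand-in for whatever hash function the engine actually uses, and all function names are hypothetical. It only shows that when both tables are bucketed the same way, bucket i of one table needs to be compared with bucket i of the other and nothing else.

```python
from collections import defaultdict

NUM_BUCKETS = 50  # illustrative figure taken from the example above


def bucket_of(customer_id, num_buckets=NUM_BUCKETS):
    # Simple modulo on an integer key; a stand-in for the engine's hash function.
    return customer_id % num_buckets


def bucketize(rows, num_buckets=NUM_BUCKETS):
    """Group (customer_id, payload) rows by bucket number."""
    buckets = defaultdict(list)
    for customer_id, payload in rows:
        buckets[bucket_of(customer_id, num_buckets)].append((customer_id, payload))
    return buckets


def join_bucketed(buckets_a, buckets_b):
    """Join two identically bucketed tables, comparing only matching buckets."""
    joined = []
    for bucket_id, rows_a in buckets_a.items():
        # Build a lookup from the matching bucket of B only.
        lookup = defaultdict(list)
        for customer_id, payload in buckets_b.get(bucket_id, []):
            lookup[customer_id].append(payload)
        for customer_id, payload_a in rows_a:
            for payload_b in lookup.get(customer_id, []):
                joined.append((customer_id, payload_a, payload_b))
    return joined
```

Each bucket pair can be processed independently, which is what makes the work easy to spread across tasks.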
Yes, it will definitely help.
Bucketed tables are partitioned and sorted the same way, so they can be joined with a merge join, which works in linear time, O(n). Otherwise the tables have to be sorted the same way first, which is usually O(n log n).
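The linear-time claim can be illustrated with a plain merge join over two inputs that are already sorted by key. This is a generic Python sketch of the technique, not what Hive executes internally; merge_join is a hypothetical name.

```python
def merge_join(left, right):
    """Join two lists of (key, value) pairs that are already sorted by key.

    Each input is walked once, so the work is linear in the input sizes
    plus the number of output rows.
    """
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lkey, rkey = left[i][0], right[j][0]
        if lkey < rkey:
            i += 1
        elif lkey > rkey:
            j += 1
        else:
            # Find the run of equal keys on the right side.
            j_end = j
            while j_end < len(right) and right[j_end][0] == lkey:
                j_end += 1
            # Pair every left row with this key against that run.
            while i < len(left) and left[i][0] == lkey:
                for k in range(j, j_end):
                    out.append((lkey, left[i][1], right[k][1]))
                i += 1
            j = j_end
    return out
```

If the inputs were not sorted, a sort step would have to come first, and that step is where the O(n log n) cost comes from.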

How should I auto-expire entries in an ETS table, while also limiting its total size?

I have a lot of analytics data which I'm looking to aggregate every so often (let's say one minute.) The data is being sent to a process which stores it in an ETS table, and every so often a timer sends it a message to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp. Still, I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-increment field with each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of memory the table occupies using ets:info(Tab, memory). The result is a number of words. But there is a catch: if you are storing binaries, only heap binaries are included. So if you are storing mostly normal Erlang terms, you can use it, and together with a timestamp as you described, it is the way to go. For the size in bytes, just multiply by erlang:system_info(wordsize).
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
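The one-table-per-hour idea can be sketched like this in Python, with one dict standing in for each table. The class name HourlyBuckets, the retention parameter, and timestamps in seconds are all assumptions made for the example.

```python
class HourlyBuckets:
    """Stores entries in one dict per hour and drops whole hours at once."""

    def __init__(self, hours_to_keep=24):
        self.hours_to_keep = hours_to_keep
        self._buckets = {}  # hour number -> {key: value}

    @staticmethod
    def _hour(timestamp):
        return int(timestamp // 3600)  # timestamp is in seconds

    def put(self, key, value, timestamp):
        self._buckets.setdefault(self._hour(timestamp), {})[key] = value

    def expire(self, now):
        """Drop every bucket that falls outside the retention window."""
        oldest_kept = self._hour(now) - self.hours_to_keep + 1
        for hour in [h for h in self._buckets if h < oldest_kept]:
            del self._buckets[hour]  # analogous to dropping one whole table

    def items(self):
        # Iterate from the oldest bucket to the newest.
        for hour in sorted(self._buckets):
            yield from self._buckets[hour].items()
```

The appeal of this layout is that expiry never has to look at individual entries; the cost is that lookups may need to check more than one bucket.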
I would do the following: Create a server responsible for:
receiving all the data storage messages. These messages should be timestamped by the client process (so it doesn't matter if they wait a little in the message queue). The server then stores them in an ETS table configured as ordered_set, using the timestamp, converted to an integer, as the key (if the timestamps come from erlang:now on a single VM they will all be different; if you are using several nodes, you will need to add some information such as the node name to guarantee uniqueness).
receiving a tick (using, for example, timer:send_interval) and then processing the messages received in the last N µsec, starting from Key = current time - N, looking up ets:next(Table, Key), and continuing to the last message. Finally, you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as a node name, it is still possible to use the next function: for example, if the keys are {TimeStamp:int(), Node:atom()}, you can compare against {Time:int(), 0}, since a number is smaller than any atom.
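As a rough illustration of the ordered-key approach combined with the two limits from the question (a cap on the number of entries and a maximum age), here is a Python sketch. A sorted list of (timestamp, source) keys stands in for the ordered_set table; the class name RecentWindow and its parameters are hypothetical. The one-element tuple used in expire plays the same role as the {Time, 0} comparison key: it sorts before every real key that has the same timestamp.

```python
import bisect


class RecentWindow:
    """Entries keyed by (timestamp, source), bounded by both count and age."""

    def __init__(self, max_items, max_age):
        self.max_items = max_items
        self.max_age = max_age
        self._keys = []    # kept sorted, like the keys of an ordered_set table
        self._values = {}  # key -> payload

    def add(self, timestamp, source, payload):
        key = (timestamp, source)
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = payload
        # Size limit: drop the oldest keys until we are back under the cap.
        while len(self._keys) > self.max_items:
            del self._values[self._keys.pop(0)]

    def expire(self, now):
        # Age limit: remove every entry whose timestamp is below now - max_age.
        cutoff = bisect.bisect_left(self._keys, (now - self.max_age,))
        for key in self._keys[:cutoff]:
            del self._values[key]
        del self._keys[:cutoff]

    def entries(self):
        return [(key, self._values[key]) for key in self._keys]
```

Because the keys are ordered by time, both trims only ever touch the front of the structure, which is the property the ordered_set suggestion relies on.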

How to apportion between BatchInserterIndex cache and MMIO?

In a batch insertion using Lucene indexes, given a large set of nodes and relationships such that the node and relationship stores cannot fit completely in mapped memory (hence the need for Lucene index caching), how should one divide memory between MMIO and the Lucene index caches to achieve optimal performance? Having read the documentation, I am already somewhat familiar with how to divide memory within the mapped-memory schema. I am interested in the overall allotment of memory between MMIO and the Lucene caches. Since I am working on a prototype with whatever hardware happens to be available, and the future resources and data volume are undetermined, I would prefer the answer to be in general terms (I think this would also make the answer more useful to the rest of the Neo4j community). So it would be good if I could pose the question like this:
Given
rwN nodes and rwR relationships that are written and must be read later in the batch insertion,
woN nodes and woR relationships that are only written,
G gigabytes of RAM (not including what is required for the operating system)
What is the optimal division of G between lucene index caches and MMIO?
If more details are needed I can supply them for my particular case.
All these considerations are only relevant when importing (multiple) billions of nodes and relationships.
Usually, when you do lookups, it depends on the "hot dataset size" of your index lookups.
By default that's all nodes, but if you know your domain better, you can probably devise some paging that results in smaller caches (e.g. by pre-sorting your input data for relationship creation by the start- and end-node lookup property). Then you have a kind of moving window over your node data, during which each node is accessed frequently.
I usually even sort by min(start,end).
Usually you try to use most of the RAM for mmio mapping of the rel-store and node store. The property stores are only written to but the others have to be updated as well.
The index cache lookup is only a HashMap behind the scenes, so quite wasteful. What I found to work better is to use a different approach, e.g. a multi-pass one.
Use a string array: put all your lookup properties in there, sort it, and use the array index (via Arrays.binarySearch) as the node id. Lookups in that array alone are then quite efficient.
Another way is to make multiple passes over the source data, so that you create the node ids needed for the rels as part of the source. Friso and Kris from Xebia did something like that in their Hadoop-based solution, especially the monotonically increasing parallel ids.
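The sorted-array lookup mentioned above can be sketched as follows. The answer refers to Java's Arrays.binarySearch; this Python version uses the standard bisect module for the same purpose, and the function names are made up for the example. It assumes the lookup property is unique per node, so the position in the sorted array can serve as the node id.

```python
from bisect import bisect_left


def build_index(lookup_values):
    """Return the sorted, de-duplicated lookup values; position = node id."""
    return sorted(set(lookup_values))


def node_id(index, value):
    """Binary-search the sorted array; return the position or None if absent."""
    pos = bisect_left(index, value)
    if pos < len(index) and index[pos] == value:
        return pos
    return None
```

Compared with a hash map, the array stores only the values themselves, which is why it tends to be more compact for very large imports.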

How to improve the performance in big table join?

Please help me out with this big data problem.
I have a very large table (500G) that stores cookie information collected from one website, and I am trying to provide a service to many other clients. Each client has its own cookies, so in the end I need to query 500G + 300G (client_data).
Since some queries use both my cookie data and the client's cookie data, I may need to do a join between my table and their table, and the performance is bad. To solve this problem, I put the entire 800GB of data into one giant table. Since there is no join, the performance is good. But when I expand my service to multiple clients, it takes too much storage.
Currently I am using Vertica as my data source, and I use bitmaps to store my information.
Any suggestions that can maintain my current performance but also support something like 40 clients? My storage is about 12 TB and each client in the current solution takes 1.5T.
What I want is either a replacement for Vertica that supports bitmap operations and quick table joins, or a better way to represent my data.
"My storage is about 12 TB and each client in the current solution takes 1.5T."
If you have 40 * 1.5TB worth of non-duplicated cookie data to store, there's no magic to make that fit into 12TB.
This will be an imprecise answer due to the lack of details about definitions, etc. But I would add the following about performance:
Look at your projection definitions. You may be able to get performance gains depending on what you put in the order by clause of the projection.
You have a few ways forward, depending on the specifics of your case. Points 1 and 3 are the easiest to deal with:
1. You can set projections properly to make sure that both tables are identically segmented: https://my.vertica.com/docs/6.1.x/HTML/index.htm#12549.htm
2. You can set up pre-join projections, where the join cost is paid during data load, not during data retrieval; see https://my.vertica.com/docs/6.1.x/HTML/index.htm#1299.htm
3. Make sure that your data types are the best possible. Matching on ints is faster than matching on strings, and matching columns with low cardinality is faster than matching columns with high cardinality.
If 1 and 3 are well set, Vertica can actually apply filters before decompression, speeding up your query considerably and using a lot less memory.

Using multiple key value stores

I am using Ruby on Rails and have a situation where I am wondering whether some sort of key-value store would be more appropriate than MySQL. I have users that have_many lists, and each list has_many words. Some lists have hundreds of words, and I want users to be able to copy a list. This is a heavy MySQL task because it has to create these hundreds of word objects at one time.
As an alternative, I am considering using some sort of key-value store where the key would just be the word. A list of words could be stored in a text field in MySQL. Each list could be a new key-value db? It seems like it would be faster to copy a key-value db this way rather than have to go through the database. It also seems like this might be faster in general. Thoughts?
The general way to solve this using a relational database would be to have a list table, a word table, and a list-words table relating the two. You are correct that there would be some overhead, but don't overestimate it; because the table structure is defined, there is very little actual storage overhead for each record, and records can be inserted very quickly.
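A minimal sketch of that three-table layout, using Python's built-in sqlite3 module instead of MySQL so it can be run anywhere. The table and column names (lists, words, list_words) and the helper functions are invented for the example. The point it shows is that copying a list only duplicates the join rows, in a single INSERT ... SELECT, and creates no new word records.

```python
import sqlite3


def create_schema(conn):
    # Hypothetical table names; the join table relates lists to words.
    conn.executescript("""
        CREATE TABLE lists (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
        CREATE TABLE words (id INTEGER PRIMARY KEY, text TEXT UNIQUE);
        CREATE TABLE list_words (
            list_id INTEGER REFERENCES lists(id),
            word_id INTEGER REFERENCES words(id),
            PRIMARY KEY (list_id, word_id)
        );
    """)


def copy_list(conn, source_list_id, new_user_id):
    """Copy a list for another user by duplicating only its join rows."""
    row = conn.execute(
        "SELECT name FROM lists WHERE id = ?", (source_list_id,)
    ).fetchone()
    if row is None:
        raise KeyError(source_list_id)
    cur = conn.execute(
        "INSERT INTO lists (user_id, name) VALUES (?, ?)", (new_user_id, row[0])
    )
    new_list_id = cur.lastrowid
    # One statement copies all word references, however many there are.
    conn.execute(
        "INSERT INTO list_words (list_id, word_id) "
        "SELECT ?, word_id FROM list_words WHERE list_id = ?",
        (new_list_id, source_list_id),
    )
    return new_list_id
```

With this layout, a list of several hundred words costs several hundred small join rows per copy, all written by the database in one statement.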
If you want very fast copies, you could allow lists to be copied-on-write. Meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when the user tries to add, remove, or change an entry. Of course, this is premature optimization, start simple and only add complications like this if you find they are necessary.
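For what the copy-on-write idea might look like, here is a small in-memory Python sketch. It is only an illustration of the technique: WordListStore and its methods are hypothetical, and a real Rails version would keep the reference counts in the database instead.

```python
class WordListStore:
    """Users hold references to shared word lists; a list is duplicated
    only when a user who shares it with others modifies it."""

    def __init__(self):
        self._lists = {}      # list id -> tuple of words (treated as immutable)
        self._refcount = {}   # list id -> number of references
        self._next_id = 1

    def create(self, words):
        list_id = self._next_id
        self._next_id += 1
        self._lists[list_id] = tuple(words)
        self._refcount[list_id] = 1
        return list_id

    def copy(self, list_id):
        # Copying is cheap: just one more reference to the same data.
        self._refcount[list_id] += 1
        return list_id

    def words(self, list_id):
        return self._lists[list_id]

    def add_word(self, list_id, word):
        """Return the id to use afterwards (new if the list was shared)."""
        new_words = self._lists[list_id] + (word,)
        if self._refcount[list_id] > 1:
            # Shared: leave the original untouched and split off a private copy.
            self._refcount[list_id] -= 1
            return self.create(new_words)
        self._lists[list_id] = new_words
        return list_id
```

The real duplication is deferred until the first edit, so users who copy a list and never change it cost almost nothing.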
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason; it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined and each word has to be recorded separately in each list). The only dimension of performance I would expect to be better is massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.
