We know the data cache holds row data and the index cache holds group values. Does the Aggregator process all the data into its caches before performing its operation? - data-warehouse

Can you please help me understand this with the example below.
Group by cust_id, item_id.
Which records are processed into the caches (index/data) in the two scenarios, sorted input and unsorted input?
What happens if the cache memory runs out? Which algorithm does it use internally to perform the aggregate calculations?

I don't know the internal algorithm, but in unsorted mode it is normal for the Aggregator to store all rows in cache and wait for the last row, because that row could belong to the first group that must be returned according to the Aggregator's rules! The Aggregator will never complain about the order of incoming rows. When using cache, it stores rows in memory first; when the allocated memory is full, it pushes the cache to disk. If it runs out of disk space, the session will fail (and maybe other sessions too, because of that full disk), and you will have to clean those cache files manually.
In sorted mode there is no such problem: rows arrive in groups ready to be aggregated, and the aggregated row goes out as soon as all rows of a group have been received, which is detected when one of the key values changes. The Aggregator will complain and stop if rows are not in the expected order. However, this pushes the problem upstream to the sorting step, which could be a Sorter transformation (which can itself use a lot of cache) or the database with an ORDER BY clause in the SQL query (which takes resources on the database side).
Be careful also that the database's ORDER BY may use a different locale (collation) than Informatica, so the two sort orders may not agree.
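Not Informatica's actual implementation, of course, but a hedged sketch of the two strategies described above, with rows as {Key, Value} pairs and SUM as the aggregate (module and names are illustrative):

    -module(agg_sketch).
    -export([sorted_agg/1, unsorted_agg/1]).

    %% Sorted mode: keep a single running group and emit it as soon as the
    %% group key changes, so at most one group is held in memory at a time.
    sorted_agg([]) -> [];
    sorted_agg([{K, V} | Rest]) -> sorted_agg(Rest, K, V, []).

    sorted_agg([], K, Sum, Out) ->
        lists:reverse([{K, Sum} | Out]);
    sorted_agg([{K, V} | Rest], K, Sum, Out) ->
        sorted_agg(Rest, K, Sum + V, Out);             % same group: accumulate
    sorted_agg([{NewK, V} | Rest], K, Sum, Out) ->
        sorted_agg(Rest, NewK, V, [{K, Sum} | Out]).   % key changed: emit group

    %% Unsorted mode: every group must stay in the cache (a map here)
    %% until the very last row has been seen.
    unsorted_agg(Rows) ->
        maps:to_list(lists:foldl(
            fun({K, V}, Cache) ->
                maps:update_with(K, fun(S) -> S + V end, V, Cache)
            end,
            #{}, Rows)).

The sorted version also shows why out-of-order input must be rejected: once {K, Sum} has been emitted, a late row for K would have nowhere to go.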

Related

Limit the growth of ETS storage

I'm considering using Erlang's ETS as a cache for user searches in a new Elixir project. Based on user input, the system will do lookups using an expensive third-party API.
In order to avoid making duplicate calls for the same user input, I intend to put a cache layer in front of the external API, and ETS seems like a good option for this. However, since there is no limit to the variations of user input, I'm concerned that the storage space required for the ETS table will grow without bound.
In my reading about ETS, I haven't seen anyone else discuss concern about the size of tables in ETS. Is that because this would be an abnormal use case for ETS?
At first blush, my preference would be to limit the number of entries in the ETS table, and reject (i.e. delete) the oldest entries once the limit is reached…
Is there a common strategy for dealing with unbounded number of entries in ETS?
I use ETS tables in production as a 'smart invalidating cache' behind a Redis-like API (it also has master-master replication, similar to a SQL WAL log).
The biggest tables are ~200-300 MB and hold more than 1 million items each; there have been no problems in the last 2 years. I know about the ERL_MAX_ETS_TABLES limit on the number of tables, but I haven't found any documented limit on table size.
I keep special 'smart indexes' for these tables, because ets:select/ets:match and friends are slow: those functions walk every element in the table.
You can use the ets:tab2list(TableId) function to convert the ETS table to an ordinary list, and then check the size of that list with the well-known BIF length(List). Last but not least, you can then enforce a limit (just compare the size against your threshold with pattern matching, an if, or a case expression). Note that ets:info(TableId, size) gives you the same count without copying the whole table.
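ETS itself has no size cap and no per-table LRU, so the eviction has to be done by hand. A minimal sketch of the 'evict the oldest entry once the limit is reached' idea from the question, using a second ordered_set table as an insertion-order index (module and names invented; a true LRU would also have to move an entry to the back of the index on every read):

    -module(capped_cache).
    -export([new/0, put/3, get/2]).

    -define(MAX_ENTRIES, 20000).

    new() ->
        Data  = ets:new(cache_data,  [set, public]),
        Order = ets:new(cache_order, [ordered_set, public]),  % insertion order
        {Data, Order}.

    put({Data, Order} = Cache, Key, Value) ->
        case ets:info(Data, size) >= ?MAX_ENTRIES of
            true  -> evict_oldest(Cache);
            false -> ok
        end,
        %% (Re-inserting an existing key leaves a stale index row behind;
        %% good enough for a sketch, not for production.)
        ets:insert(Order, {erlang:unique_integer([monotonic]), Key}),
        ets:insert(Data, {Key, Value}).

    evict_oldest({Data, Order}) ->
        Seq = ets:first(Order),             % smallest counter = oldest insert
        [{Seq, OldKey}] = ets:lookup(Order, Seq),
        ets:delete(Order, Seq),
        ets:delete(Data, OldKey).

    get({Data, _Order}, Key) ->
        case ets:lookup(Data, Key) of
            [{Key, Value}] -> {ok, Value};
            []             -> not_found
        end.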

How should I auto-expire entries in an ETS table, while also limiting its total size?

I have a lot of analytics data which I'm looking to aggregate every so often (let's say every minute). The data is sent to a process which stores it in an ETS table, and every so often a timer sends that process a message telling it to process the table and remove old data.
The problem is that the amount of data that comes in varies wildly, and I basically need to do two things to it:
If the amount of data coming in is too big, drop the oldest data and push the new data in. This could be viewed as a fixed size queue, where if the amount of data hits the limit, the queue would start dropping things from the front as new data comes to the back.
If the queue isn't full, but the data has been sitting there for a while, automatically discard it (after a fixed timeout.)
If these two conditions are kept, I could basically assume the table has a constant size, and everything in it is newer than X.
The problem is that I haven't found an efficient way to do these two things together. I know I could use match specs to delete all entries older than X, which should be pretty fast if the index is the timestamp. Though I'm not sure if this is the best way to periodically trim the table.
The second problem is keeping the total table size under a certain limit, which I'm not really sure how to do. One solution that comes to mind is to use an auto-incrementing field with each insert, and when the table is being trimmed, look at the first and the last index, calculate the difference, and again use match specs to delete everything below the threshold.
Having said all this, it feels that I might be using the ETS table for something it wasn't designed to do. Is there a better way to store data like this, or am I approaching the problem correctly?
You can determine the amount of memory occupied using ets:info(Tab, memory). The result is in words. But there is a catch: if you are storing binaries, only heap binaries (the small ones, stored inline) are included; large reference-counted binaries live outside the table and are not counted. So if you are storing mostly normal Erlang terms you can use it, and combined with a timestamp key as you described, it is a way to go. For the size in bytes, just multiply by erlang:system_info(wordsize).
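Put together, the byte count is just (shown here as a shell fun for brevity):

    Bytes = fun(Tab) -> ets:info(Tab, memory) * erlang:system_info(wordsize) end.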
I haven't used ETS for anything like this, but in other NoSQL DBs (DynamoDB) an easy solution is to use multiple tables: If you're keeping 24 hours of data, then keep 24 tables, one for each hour of the day. When you want to drop data, drop one whole table.
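The same rotation trick works with ETS, since dropping a whole table is a single cheap call. A hedged sketch (module and names invented): keep a short list of hourly tables, newest first, and roll it once per hour.

    -module(hourly_buckets).
    -export([rotate/1]).

    -define(HOURS_KEPT, 24).

    %% Call once per hour with the current list of tables (newest first):
    %% prepend a fresh bucket and, once full, drop the oldest whole table.
    rotate(Tables) when length(Tables) >= ?HOURS_KEPT ->
        ets:delete(lists:last(Tables)),      % an entire hour gone in one call
        [ets:new(hourly_bucket, [set, public]) | lists:droplast(Tables)];
    rotate(Tables) ->
        [ets:new(hourly_bucket, [set, public]) | Tables].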
I would do the following: create a server responsible for
receiving all the data storage messages. These messages should be timestamped by the client process (so it doesn't matter if a message waits a little in the server's queue). The server then stores them in the ETS table, configured as ordered_set and using the timestamp, converted to an integer, as the key (if the timestamps are delivered by erlang:now in one single VM they will be unique; if you are using several nodes, you will need to add some information such as the node name to guarantee uniqueness).
receiving a tick (using for example timer:send_interval) and then processing the messages received in the last N µsec (computing Key = current time - N, looking up ets:next(Table, Key), and continuing to the last message). Finally you can discard all the messages via ets:delete_all_objects(Table). If you had to add information such as a node name, it is still possible to use the next function (for example, if the keys are {TimeStamp :: integer(), Node :: atom()} you can compare against {Time, 0}, since a number is smaller than any atom).
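A hedged sketch of that design (module and names invented; erlang:now/0 is deprecated nowadays, so this uses erlang:system_time/1 plus erlang:unique_integer/1 to build unique, chronologically ordered keys):

    -module(ts_buffer).
    -export([init/0, store/2, prune/1]).

    -define(WINDOW_US, 60 * 1000 * 1000).        % keep the last 60 seconds

    init() ->
        Tab = ets:new(events, [ordered_set, public]),
        {ok, _} = timer:send_interval(60000, self(), tick),  % prune on 'tick'
        Tab.

    store(Tab, Data) ->
        %% Timestamp first so keys sort chronologically; the unique integer
        %% guarantees no two keys collide within this VM.
        Key = {erlang:system_time(microsecond),
               erlang:unique_integer([monotonic])},
        ets:insert(Tab, {Key, Data}).

    %% Call on every tick: walk the table in key order, deleting entries
    %% older than the window, and stop at the first key that is new enough.
    prune(Tab) ->
        Cutoff = erlang:system_time(microsecond) - ?WINDOW_US,
        prune(Tab, ets:first(Tab), Cutoff).

    prune(_Tab, '$end_of_table', _Cutoff) ->
        ok;
    prune(Tab, {Ts, _} = Key, Cutoff) when Ts < Cutoff ->
        Next = ets:next(Tab, Key),
        ets:delete(Tab, Key),
        prune(Tab, Next, Cutoff);
    prune(_Tab, _Key, _Cutoff) ->
        ok.

Because the table is an ordered_set, the prune walk touches only the expired prefix of the keys, which is exactly the property the answer above relies on.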

Is it possible in Ruby to set a specific Active Record call to read dirty

I am looking at a rather large database. Let's say I have an exported flag on the product records.
If I want an estimate of how many products have the flag set to false, I can make a call something like this:
Product.where(:exported => false).count
The problem I have is that even the count takes a long time, because the table of 1 million products is constantly being written to. More specifically, exports are happening, so the value I'm interested in counting is ever-changing.
So I'd like to do a dirty read on the table... Not a dirty read always. And I 100% don't want all subsequent calls to the database on this connection to be dirty.
But for this one call, dirty is what I'd like.
Oh.. I should mention: Ruby 1.9.3, Heroku and PostgreSQL.
Now.. if I'm missing another way to get the count, I'd be excited to try that.
OH SNOT one last thing.. this example is contrived.
PostgreSQL doesn't support dirty reads.
You might want to use triggers to maintain a materialized view of the count - but doing so will mean that only one transaction at a time can insert a product, because they'll contend for the lock on the product count in the summary table.
Alternately, use system statistics to get a fast approximation (see the sketch at the end of this answer).
Or, on PostgreSQL 9.2 and above, ensure there's a primary key (and thus a unique index) and make sure vacuum runs regularly. Then you should be able to do quite a fast count, as PostgreSQL should choose an index-only scan on the primary key.
Note that even if Pg did support dirty reads, the read would still not return perfectly up-to-date results, because rows would sometimes be inserted behind the read pointer in a sequential scan. The only way to get a perfectly up-to-date count is to prevent concurrent inserts: LOCK TABLE thetable IN EXCLUSIVE MODE.
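For the statistics-based approximation, PostgreSQL's planner estimate can be read from pg_class.reltuples; note it estimates the whole table rather than the filtered count, and it is only refreshed by VACUUM/ANALYZE. A hedged sketch using the community epgsql driver, with invented connection details (the same SQL works from any client, including ActiveRecord):

    -module(row_estimate).
    -export([estimate_product_rows/0]).

    %% Fast, approximate row count from the planner statistics.
    estimate_product_rows() ->
        {ok, C} = epgsql:connect("localhost", "app_user", "secret",
                                 [{database, "shop"}]),
        {ok, _Cols, [{Estimate}]} =
            epgsql:squery(C, "SELECT reltuples::bigint FROM pg_class"
                             " WHERE relname = 'products'"),
        ok = epgsql:close(C),
        Estimate.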
As soon as a query begins to execute, it runs against a frozen read-only state, because that's what MVCC is all about. The values are not changing in that snapshot, only in subsequent amendments to that state. It doesn't matter if your query takes an hour to run; it is operating on data that's locked in time.
If your queries are taking a very long time, it sounds like you need an index on your exported column, or on whatever values you use in your conditions, as a COUNT against an indexed column is usually very fast.

Deleting specific keys from memcached hash

I am trying to cache a table t_abc using memcached for my Rails application. The table has 50,000 entries, and I finally want to have 20,000 keys (which will be of the form "abc_"+id) in the cache. When the 20,001st entry is about to be inserted into the cache, I want the least recently used key among these 20,000 (of the above form, and not some other key in memcached) to be deleted from the cache. How do I achieve that?
NOTE: I am keeping an expiry = 0 for all the entries.
No, unfortunately you cannot efficiently do what you want to do with memcached.
It has a single LRU which works across all the data you are storing. So if you are storing data from multiple tables then the fraction of the memcached entries taken up by each table depends on the relative patterns of access of the data from the different tables.
So to control how many rows of the table are cached, really all you can do is adjust how big your memcached is and vary what other data gets stored alongside it.
(Also, to evaluate the memcached configuration you might consider using a different metric, such as the hit rate or the median response time, rather than simply the number of rows cached.)

What's the effect of the cache on NSFetchedResultsController

I read the doc of course, but I don't quite get the meaning of "setting up any sections and ordering the contents".
Doesn't this kind of information come from the database?
Does it mean NSFetchedResultsController needs some other kind of index itself, besides the database indices?
What's really going on when NSFetchedResultsController is setting up a cache?
Is the cache useful for static data only? If my data updates frequently, should I use the cache or not?
How can I profile the performance of the cache? I tried the cache but couldn't see any performance improvement: I timed -performFetch: and saw a time increase from 0.018 s (without cache) to 0.023 s (with cache). I also timed -objectAtIndexPath: and saw only a tiny decrease, from 0.000030 s (without cache) to 0.000029 s (with cache).
In other words, I want to know when cache does(or does not) improves performance and why.
As Marcus points out below, "500 Entries is tiny. Core Data can handle that without a human noticeable lag. Caching is used when you have tens of thousands of records." So I think there are few apps that would benefit from using the cache.
The cache for the NSFetchedResultsController is a kind of shortcut. It is a cache of the last results from the NSFetchRequest. It is not the entire data, but enough data for the NSFetchedResultsController to display its results quickly; very quickly.
It is a "copy" of the data from the database that is serialized to disk in a format that is easily consumed by the NSFetchedResultsController on its next instantiation.
To look at it another way, it is the last results flash frozen to disk.
From the documentation of NSFetchedResultsController:
Where possible, a controller uses a cache to avoid the need to repeat work performed in setting up any sections and ordering the contents
To take advantage of the cache you should use sectioning or ordering of your data.
So if in initWithFetchRequest:managedObjectContext:sectionNameKeyPath:cacheName: you set the sectionNameKeyPath to nil you probably won't notice any performance gain.
From the documentation:
The Cache
Where possible, a controller uses a cache to avoid the need to repeat work performed in setting up any sections and ordering the contents. The cache is maintained across launches of your application.
When you initialize an instance of NSFetchedResultsController, you typically specify a cache name. (If you do not specify a cache name, the controller does not cache data.) When you create a controller, it looks for an existing cache with the given name:
If the controller can’t find an appropriate cache, it calculates the required sections and the order of objects within sections. It then writes this information to disk.
If it finds a cache with the same name, the controller tests the cache to determine whether its contents are still valid. The controller compares the current entity name, entity version hash, sort descriptors, and section key-path with those stored in the cache, as well as the modification date of the cached information file and the persistent store file.
If the cache is consistent with the current information, the controller reuses the previously-computed information.
If the cache is not consistent with the current information, then the required information is recomputed, and the cache updated.
Any time the section and ordering information change, the cache is updated.
If you have multiple fetched results controllers with different configurations (different sort descriptors and so on), you must give each a different cache name.
You can purge a cache using deleteCache(withName:).
