Does anyone know how to configure a durable queue on a Terracotta server properly?
Terracotta stores clustered objects on the server in files and writes data to them in an append-only fashion. I want some control over how this internal data gets cleared. I have multiple intensive applications that use a common Ehcache instance clustered by Terracotta; some threads put data, others read and remove it. Our hard disks are not made of rubber, after all. Does Terracotta clear removed cache items from the disk? What is the default behaviour, and which configuration options control it? Thanks in advance.
Any object clustered by Terracotta is durable.
So once you cluster a queue, it is durable. Every object referenced by a clustered data structure is also durable, so any message you place on the queue will be durable.
If you mean persisting it to disk, that behaviour is controlled by the persistence mode. For more details, see the configuration reference guide.
Objects placed into clustered memory are garbage collected, on the same principle as objects in the Java heap. Once all cluster references to an object are gone, it can be reclaimed by the distributed garbage collector; the process is referred to as distributed garbage collection (DGC).
You can monitor the number of objects in the clustered heap and the details of each DGC run, as well as trigger a DGC run manually, using the Developer Console.
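Both of these knobs live in tc-config.xml on the server. The snippet below is only a sketch based on the classic DSO configuration; element names and defaults vary between Terracotta versions, and the host, server name, and interval are placeholders:

    <servers>
      <server host="localhost" name="server1">
        <dso>
          <persistence>
            <!-- permanent-store keeps clustered data on disk across restarts;
                 temporary-swap-only is the non-durable alternative -->
            <mode>permanent-store</mode>
          </persistence>
          <garbage-collection>
            <!-- whether and how often the distributed garbage collector runs -->
            <enabled>true</enabled>
            <verbose>false</verbose>
            <interval>3600</interval>
          </garbage-collection>
        </dso>
      </server>
    </servers>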
The documentation for the ActiveSupport::Cache::RedisCacheStore states:
Take care to use a dedicated Redis cache rather than pointing this at your existing Redis server. It won't cope well with mixed usage patterns and it won't expire cache entries by default.
Is this advice still true in general, especially when talking about custom data caches, not page (fragment) caches?
Or, more specifically: if I'm building a custom cache for specific "costly" backend calls to a slow third-party API, and I set an explicit expires_in value on my cache (or on all my cached values), does this advice apply to me at all?
TL;DR: yes, as long as you set an eviction policy. Which one? Read on.
On the same page, the docs for #new state that:
No expiry is set on cache entries by default. Redis is expected to be configured with an eviction policy that automatically deletes least-recently or -frequently used keys when it reaches max memory. See redis.io/topics/lru-cache for cache server setup.
This is more about memory management and access patterns than about what's being cached. The Redis eviction policy documentation has a detailed section on policy choice and mixed usage (whether to use a single instance or not):
Picking the right eviction policy is important depending on the access pattern of your application; however, you can reconfigure the policy at runtime while the application is running, and monitor the number of cache misses and hits using the Redis INFO output to tune your setup.
In general as a rule of thumb:
Use the allkeys-lru policy when you expect a power-law distribution in the popularity of your requests. That is, you expect a subset of elements will be accessed far more often than the rest. This is a good pick if you are unsure.
Use allkeys-random if you have cyclic access where all the keys are scanned continuously, or when you expect the distribution to be uniform.
Use volatile-ttl if you want to be able to provide hints to Redis about what are good candidates for expiration by using different TTL values when you create your cache objects.
The volatile-lru and volatile-random policies are mainly useful when you want to use a single instance for both caching and to have a set of persistent keys. However it is usually a better idea to run two Redis instances to solve such a problem.
It is also worth noting that setting an expire value on a key costs memory, so using a policy like allkeys-lru is more memory efficient, since there is no need for an expire configuration for the key to be evicted under memory pressure.
In your case there is no mixed usage: for example, you are not persisting Sidekiq jobs (which have no TTL/expiry by default) in the same Redis instance, so you can treat your Redis instance as cache-only.
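To make that concrete, a cache-only instance needs little more than a memory cap and an eviction policy. The values below are illustrative, not recommendations:

    # redis.conf
    maxmemory 2gb
    maxmemory-policy allkeys-lru

    # or adjust at runtime and watch the hit/miss counters:
    #   redis-cli CONFIG SET maxmemory-policy allkeys-lru
    #   redis-cli INFO stats    # keyspace_hits / keyspace_misses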
I have an S3 bucket with a lot of small files: over 100K objects that add up to about 700 GB. When I load the objects into a Dask bag and then persist it, the client always runs out of memory, consuming gigabytes very quickly.
Limiting the scope to a few hundred objects allows the job to run, but the client still uses a lot of memory.
Shouldn't only futures be tracked by the client? How much memory do they take?
Martin Durant's answer on Gitter:
The client needs to do a glob on the remote file-system, i.e., download the full definition of all the files, in order to be able to make each of the bag partitions. You may want to structure the files into sub-directories, and make separate bags out of each of those.
The original code was using a glob (*, **) to load objects from S3.
With this knowledge, listing all of the objects first with boto and then building the bag from that explicit list of objects, with no globs, resulted in minimal memory use by the client and a significant speed improvement.
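A sketch of that approach, using boto3 to enumerate the keys and dask.bag to build the bag from an explicit list of paths; the bucket name, prefix, and files_per_partition value are placeholders:

    import boto3
    import dask.bag as db

    # Enumerate the keys up front instead of letting dask glob the bucket.
    s3 = boto3.client("s3")
    keys = []
    for page in s3.get_paginator("list_objects_v2").paginate(
            Bucket="my-bucket", Prefix="data/"):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))

    # Build the bag from the explicit list, packing many small files into
    # each partition so the task graph stays small.
    paths = [f"s3://my-bucket/{key}" for key in keys]
    bag = db.read_text(paths, files_per_partition=1000)

Grouping many small files per partition also keeps the number of tasks far below the number of objects, which helps both the scheduler and the client.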
We perform joins over a few PCollections using Combine.PerKey with a custom KeyedCombineFn. PCollections are assigned to a GlobalWindow with a Repeatedly.forever trigger on AfterProcessingTime.pastFirstElementInPane.
The PCollections contain around 1M keys, but for a given key only a few hundred elements. The KeyedCombineFn retains a few KB (sometimes up to 5 MB) of data in its accumulator.
Now that we have increased the amount of data we process in our pipeline, we are seeing java.lang.OutOfMemoryError: Java heap space errors. The pipeline runs on n1-highmem-4 machines on Google Cloud Dataflow.
Our assumption is that Dataflow workers manage the state for each key independently, and have heuristics to write/load it to/from disk depending on how much RAM it has available. Hence, the goal is to have individual state fit in one worker's memory.
Is this assumption correct? If so, why could we be seeing OOM errors? If not, do you mind elaborating on how Dataflow workers manage state in memory?
The Dataflow workers do behave roughly as you assume, but there is some estimation involved, and it's possible something about your data is breaking that. Do you have a very large discrepancy between the serialized size of your accumulators and the in-memory size of the objects?
The easiest fix to try would be to run on fewer, larger machines, such as n1-highmem-8.
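For example, when launching the pipeline you would only change the standard worker pool options; the flag names below follow the Dataflow pipeline options, the exact runner name depends on your SDK version, and the worker count is a placeholder:

    --workerMachineType=n1-highmem-8 \
    --numWorkers=<fewer than before>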
TokuMX has benefits, but we are running into issues. We recently migrated to this engine, and in the process our clean-up scripts became useless. We have transient data that we used to clean up every night, reclaiming the disk space via db.repairDatabase. However, that command is not supported by TokuMX, so we are no longer able to reclaim the disk space.
Is there an alternative way?
It sounds like partitioned collections are the right abstraction for your application. Normal collections will suffer from the accumulation of MVCC garbage if you have a pattern of deleting large swaths of old data. With partitioned collections, you can drop a partition and reclaim all the space instantaneously.
A month ago I tried to use F# agents to process and record Twitter Streaming API data here. As a little exercise, I am trying to transfer the code to Windows Azure.
So far I have two roles:
One worker role (Publisher) that puts messages (a message being the JSON of a tweet) onto a queue.
One worker role (Processor) that reads messages from the queue, decodes the JSON, and dumps the data into a cloud table.
Which leads to lots of questions:
Is it okay to think of a worker role as an agent?
In practice the message can be larger than 8 KB, so I am going to need to use blob storage and pass a reference to the blob as the message (or is there another way?). Will that impact performance?
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?
Sorry for bombarding you with all these questions; hope you don't mind.
Thanks a lot!
There is an open-source library named Lokad.Cloud which can handle big messages transparently; you can check it out at http://code.google.com/p/lokad-cloud/
Is it okay to think of a worker role as an agent?
Yes, definitely.
In practice the message can be larger than 8 KB, so I am going to need to use blob storage and pass a reference to the blob as the message (or is there another way?). Will that impact performance?
Yes, the technique you're describing (saving the JSON to blob storage under a name like "JSONMessage-1" and then sending a queue message whose contents are "JSONMessage-1") seems to be the standard way of passing messages bigger than 8 KB in Azure. Because you're making four calls to Azure storage rather than two (one to get the queue message, one to get the blob contents, one to delete the queue message, one to delete the blob), it will be slower. Will it be noticeably slower? Probably not.
If a good number of your messages will be smaller than 8 KB when Base64 encoded (the encoding is a gotcha in the StorageClient library), then you can put in some logic to decide how to send each one.
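The original code is F# against the old StorageClient library; purely to illustrate the pattern (inline when small, blob reference when large), here is a sketch using the current Python Azure SDKs, with the connection string, queue name, container name, size threshold, and naming scheme all as placeholders:

    import base64
    import uuid
    from azure.storage.blob import BlobServiceClient
    from azure.storage.queue import QueueClient

    conn_str = "<storage-connection-string>"
    MAX_INLINE = 8 * 1024  # illustrative inline-size threshold

    blobs = BlobServiceClient.from_connection_string(conn_str)
    queue = QueueClient.from_connection_string(conn_str, "tweets")

    def publish(tweet_json: str) -> None:
        encoded = base64.b64encode(tweet_json.encode("utf-8"))
        if len(encoded) <= MAX_INLINE:
            # Small enough: put the tweet JSON directly on the queue.
            queue.send_message(encoded.decode("ascii"))
        else:
            # Too big: write the JSON to a blob and enqueue only its name.
            name = f"JSONMessage-{uuid.uuid4()}"
            blobs.get_blob_client("tweet-blobs", name).upload_blob(tweet_json)
            queue.send_message(f"blob:{name}")

The Processor side does the mirror image: read the queue message, fetch and delete the blob if the message is a reference, then delete the queue message.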
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?
As long as you've written your worker role so that it's self-contained and the instances don't get in each other's way, then yes, increasing the instance count will increase the throughput.
If your role is mainly just reading and writing to storage, you might benefit from multi-threading the worker role first, before increasing the instance count, which will save money.
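As a rough sketch of that idea in Python (the queue-draining and message-processing functions are placeholders), a single instance can keep several storage calls in flight at once with a small thread pool:

    from concurrent.futures import ThreadPoolExecutor

    def process_message(message):
        # decode the JSON and write it to table storage (placeholder)
        ...

    def run_worker(receive_messages):
        # For IO-bound work, a handful of threads per instance usually
        # pays off before adding more instances does.
        with ThreadPoolExecutor(max_workers=8) as pool:
            for message in receive_messages():
                pool.submit(process_message, message)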
Is it okay to think of a worker role as an agent?
This is the perfect way to think of it. Imagine the workers at McDonald's. Each worker has certain tasks and they communicate with each other via messages (spoken).
In practice the message can be larger than 8 KB, so I am going to need to use blob storage and pass a reference to the blob as the message (or is there another way?). Will that impact performance?
As long as the message is immutable, this is the best way to do it. Strings can be very large and are therefore allocated on the heap; since they are immutable, passing references around is not an issue.
Is it correct to say that if needed I can increase the number of instances of the Processor worker role, and the queue will magically be processed faster?
You need to look at what your process is doing and decide whether it is IO-bound or CPU-bound. Typically, IO-bound processes see a performance increase when you add more agents. If you are using the ThreadPool for your agents, the work will be balanced quite well even for CPU-bound processes, but you will hit a limit. That said, don't be afraid to experiment with your architecture and MEASURE the results of each run; that is the best way to settle on the right number of agents.