Isolation Scheduler in Apache Storm - stream

I have defined two topology and use Isolation Scheduler in Nimbus. I have allocated below configuration to my topology.
isolation.scheduler.machines:
"Topology-Test1": 2
"Topology-Test2": 3
Now , I want if there is no work coming for Topology-Test2. Then, all 3 nodes will be assigned to Topology-Test1. But when traffic comes for Topology-Test2. Then, all 3 nodes should be reassigned to Topology-Test2.
is it possible in Storm to achieve this?

While a straight forward implementation is not supported by Storm directly imho, there are two pointers here that might help you:
T-3 Scheduler: In this paper, we propose a heuristic scheduling algorithm – T3-Scheduler – for a heterogeneous fog or cloud cluster that can efficiently identify the tasks that communicate with each other and assign them to the same node, up to a specified level of utilisation for that node.
Resource Aware Scheduler: Maybe you can hijack that somehow. According to the docs: Resource Aware Scheduler can allocate resources on a per user basis. Each user can be guaranteed a certain amount of resources to run his or her topologies and the Resource Aware Scheduler will meet those guarantees when possible. When the Storm cluster has extra free resources, Resource Aware Scheduler will to be able allocate additional resources to user in a fair manner. The importance of topologies can also vary. Topologies can be used for actual production or just experimentation, thus Resource Aware Scheduler will take into account the importance of a topology when determining the order in which to schedule topologies or when to evict topologies
Good luck with finding your strategy.

Related

Auto-scale: run everywhere or on-demand?

I am involved in developing of a set of microservices with distributed processing capabilities with the help of Akka.NET.
Typically they consist of some dispatcher and some workers. Dispatcher by default assign work to his local worker, but when it [somehow] determines that current host is overloaded then assigns work to remote workers.
Say we have 10 hosts (VMs) and 30 such services (semantically different).
The question is: how to properly scale them?
The first solution is to run 3-services-per-host with capability to auto-scale each service on-demand on other 9 machines. And scale-down when not needed after some time.
The second solution is to run all 30 services on all 10 hosts always.
At a high level, you need to consider fault tolerance, localisation, recovery, general distributed computing issues such as CAP etc..
Unless you have different scaling needs for different services, I'd probably go for the second approach of running them on all hosts. This gives greater fault tolerance and seems conceptually simpler than having auto-scaling. However, this presumes you have similar needs for each type of service and means all services will be affect by an outage or failure of a host. If one particular service has different needs (I.e. Different SLA, non functional requirements, more powerful machine needed etc.) then there is more of an argument for more specialised deployments per service.

Storm process increasing memory

I am implementing a distributed algorithm for pagerank estimation using Storm. I have been having memory problems, so I decided to create a dummy implementation that does not explicitly save anything in memory, to determine whether the problem lies in my algorithm or my Storm structure.
Indeed, while the only thing the dummy implementation does is message-passing (a lot of it), the memory of each worker process keeps rising until the pipeline is clogged. I do not understand why this might be happening.
My cluster has 18 machines (some with 8g, some 16g and some 32g of memory). I have set the worker heap size to 6g (-Xmx6g).
My topology is very very simple:
One spout
One bolt (with parallelism).
The bolt receives data from the spout (fieldsGrouping) and also from other tasks of itself.
My message-passing pattern is based on random walks with a certain stopping probability. More specifically:
The spout generates a tuple.
One specific task from the bolt receives this tuple.
Based on a certain probability, this task generates another tuple and emits it again to another task of the same bolt.
I am stuck at this problem for quite a while, so it would be very helpful if someone could help.
Best Regards,
Nick
It seems you have a bottleneck in your topology, ie, a bolt receivers more data than in can process. Thus, the bolt's input queue grows over time consuming more and more memory.
You can either increase the parallelism for the "bottleneck bolt" or enable fault-tolerance mechanism which also enables flow-control via limited number of in-flight tuples (https://storm.apache.org/documentation/Guaranteeing-message-processing.html). For this, you also need to set "max spout pending" parameter.

Does Apache Mesos recognize GPU cores?

In slide 25 of this talk by Twitter's Head of Open Source office, the presenter says that Mesos allows one to track and manage even GPU (I assume he meant GPGPU) resources. But I cant find any information on this anywhere else. Can someone please help? Besides Mesos, are there other cluster managers that support GPGPU?
Mesos does not yet provide direct support for (GP)GPUs, but does support custom resource types. If you specify --resources="gpu(*):8" when starting the mesos-slave, then this will become part of the resource offer to frameworks, which can launch tasks that claim to use these resources. Once some of the gpu resources are in use by a task, only the remaining resources will be offered again, until that task completes and the gpu resources become available again. In this way, the Mesos resource allocator can actually schedule the gpu resources you declared, and ensure that only the amount declared are offered/allocated to frameworks.
Mesos does not yet have support for gpu isolation, but with "pluggable isolator modules", you could build your own gpu isolator to enforce gpu resource limits.
Alternately, if you don't want to allocate individual gpu resources, but only want to declare some nodes as having gpus while others do not, you can just use --attributes="hasGpu:true" or something similar to differentiate the nodes that do/do not have gpus. This information is also passed onto the frameworks in resource offers, but these attributes cannot be "consumed" by a running task, so they will always be offered for that node.
For more information, see https://mesos.apache.org/documentation/attributes-resources/

Defining Scaling Threshold for Azure Web Roles

Azure embraces the notion of elastic scaling and I've been able to acheive this with my Worker Roles. However, when it comes to my Web Roles (e.g. MVC Apps) I am not sure what to monitor (or how) to determine when its a good time to increase (or descrease) the number of running instances. I'm assuming I need to monitor one or many Performance Counters but not sure where to start.
Can anyone recommend a best practice for assessing an MVC Web Role instances load relative to scaling decisions?
This question is a bit open-ended, as monitoring is typically app-specific. Having said that:
Start with simple measurements that you'd look at on a local server, representing KPIs for your app. For instance: Maybe look at network utilization. This TechNet article describes performance counters collected by System Center for Windows Azure. For instance:
ASP.NET Applications Requests/sec
Network Interface Bytes
Received/sec
Network Interface Bytes Sent/sec
Processor % Processor Time Total
LogicalDisk Free Megabytes
LogicalDisk % Free Space
Memory Available Megabytes
You may also want to watch # of requests queued and request wait time.
Network utilization is interesting, since your NIC provides approx. 100Mbps per core and could end up being a bottleneck even when CPU and other resources are underutilized. You may need to scale out to more instances to handle high-bandwidth scenarios.
Also: I tend to give less importance to CPU utilization, even though it's so easy to measure (and shows up so frequently in examples). Running a CPU at near capacity is a good thing usually, since you're paying for it and might as well use as much as possible.
As far as decreasing: This needs to be handled a bit more carefully. Windows Azure compute is billed by the hour. If, say, you scale out to an extra instance at 11:50 and scale in again at 12:10, you've just incurred two cpu-hours. Also: You don't want to scale out, then take new measurements and deciding you can now scale back again (effectively creating a constant pulse of adding and decreasing instances). To make things easier, consider the Autoscaling Application Block (WASABi), found in the Enterprise Library. This has all the scale rules baked in (such as the ones I just mentioned) and is very straightforward to use.

Erlang Documentation/SMP: single-node and multi-node per machine or per application, and the confusion that may follow

I'm studying Erlang's process model at the moment. I have hit a snag in a tech report (section 3, paragraph 2) on Erlang:
This explains why it in some cases can be more efficient to run several SMP VM's
with one scheduler each instead on one SMP VM with several schedulers. Of course
the running of several VM's require that the application can run in many parallel tasks
which has no or very little communication with each other.
Now this paragraph is confusing me; I can see the uni-process multiple scheduler scenario, but I am failing to see multiple processes with a single scheduler; Presumably each process would have a different node name, and this would mean a certain application, without modification, cannot be used with this model; the virtue of not requiring modification has been mentioned as a key feature of SMP in the report. If the multiple processes have the same node names, than performance would be disastrous due to inter-Erlang-process messaging storms -- this assume the use of in-memory amnesia. Is there some process model that is not introduced in the article and that I am missing here ?
What is the author trying say here ? is he trying to suggest that an application would have to be rewritten (to take multiple unique node-names into account) for the multi-process single-scheduler case ?
-- edit 1: Clarification of Source of Problem --
The question has been answered through discussion; the following is an outline of the trouble I had.
The issue for this question has been that the documentation, as I recall, does not touch on a scenario of running multiple Erlang emulators per physical machine -- it has always been shown that the emulator represents your physical machine (in industrial usage); also, the scenario of having to explicitly partition a program for computational efficiency has never been considered. This sudden introduction has been the source of my woe.
The convention is still biased towards creating LOTS of processes and that the future holds many improvements for the SMP emulator for Erlang, and this means that single node per machine is still a very viable option assuming favourable application design.
Rewrite after reading article:
This explains why it in some cases can
be more efficient to run several SMP
VM's with one scheduler each instead
on one SMP VM with several schedulers.
Non-SMP VM has no-lock so runs fast.
Single scheduler SMP VM 10% slower, due to cost of checking locks
Multiple scheduler SMP VM slower again due to using/waiting for locks
Of course the running of several VM's
require that the application can run
in many parallel tasks which has no or
very little communication with each
other.
I think: Nodes on the same server have to have different names.
Inter process messaging while by slower due to the inter-process nature verse intra process messaging of a VM node.
If you have multiple schedulers in a single VM, they will inevitably contend over various resources (e.g. ets meta table, atom-table, scheduler run-queue during migration, etc.) because of the inner architecture. If you have a single scheduler, contention will obviously not occur. Lock checking and acquiring will still be done though, so running a non SMP VM instead shall yield even better performance (but requires a rebuilding of the VM from source).
Take a four-core machine for example. Option one means that you run four instances of the Erlang VM, each with a single scheduler, affinity set to different processor cores. Option two means running a single Erlang VM with four schedulers, each scheduler's affinity set to different processor cores.
If you have a whole lot of independent processes to run, option two will result in better performance, because the four cores will be fully utilized (theoretically). In contrast, in option one, this won't be possible, because the lock contention will make execution on cores wait for each other every now and then.
On the other hand if your processes need to chatter a lot, option one is the way to go because the inter-process communication is way cheaper than communication between different VMs. You gain more with this than you lose with lock contention.
I believe the answer is in the preceding paragraph:
The SMP VM with only one scheduler is slightly slower (10%) than the non
SMP VM.
This is because the SMP VM need to use locks for all shared
datastructures. But as
long as there are no lock-conflicts the overhead caused by
locking is not that high (it
is the lock conflicts that takes time).
Scheduler's reliance on locks for shared data structures can impose an overhead on a given system. It seems to follow that having multiple schedulers on one SMP VM imposes a collectively greater overhead.
There are some advatanges with several nodes on one physical machine.
1) Resource locking overhead as mentioned.
2) Fail-over. In telecom products you really don't want to have the beam come crashing down on you. If you have NIFs or linked-in drivers in your system this might occur.
3) Memory locality. Few nodes gives you a poor-mans way to force processes to a few cores. This could be a big boost for NUMA archs typically but also for SMP. The scheduler don't take NUMA into account (yet). You can spawn a process to a specific scheduler and lock it to it, it won't migrate but that is an undocumented feature ... or it was removed all together. I forget.
With several nodes you will need a load balancer between the nodes of course but that is the usual way to do it anyways. Some logic that supervises the nodes.
However, the numbers from the EUC papers are over a year old [#] and I wouldn't recommend a multi-node approach if you don't really need it. The runtime system is much better at handling these types of problems today. A lot of lock overhead has been removed and the mrq-scheduler has been improved.
# 2009's numbers look like this.
Edit:
Regarding 3) the spawn feature i mentioned is,
spawn_opt(fun() -> ... end, [{scheduler, Id}]) -> pid(),
where Id is an integer and refers to a specific scheduler.
I wouldn't recommend using it since it undocumented.

Resources