Does Apache Mesos recognize GPU cores?

In slide 25 of this talk by Twitter's Head of Open Source office, the presenter says that Mesos lets you track and manage even GPU (I assume he meant GPGPU) resources. But I can't find any information on this anywhere else. Can someone please help? Besides Mesos, are there other cluster managers that support GPGPU?

Mesos does not yet provide direct support for (GP)GPUs, but it does support custom resource types. If you specify --resources="gpu(*):8" when starting the mesos-slave, this will become part of the resource offer to frameworks, which can launch tasks that claim to use these resources. Once some of the gpu resources are in use by a task, only the remaining resources will be offered again, until that task completes and the gpu resources become available once more. In this way, the Mesos resource allocator can actually schedule the gpu resources you declared, and ensure that only the declared amount is offered/allocated to frameworks.
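As a minimal sketch (the master address is a placeholder, and the count of 8 is whatever the node actually has):

    mesos-slave --master=<master-ip>:5050 \
                --resources="gpu(*):8"

A task that claims gpu(*):2 from an offer would then leave gpu(*):6 to be offered to other frameworks until it completes.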
Mesos does not yet have support for gpu isolation, but with "pluggable isolator modules", you could build your own gpu isolator to enforce gpu resource limits.
Alternately, if you don't want to allocate individual gpu resources, but only want to declare that some nodes have gpus while others do not, you can just use --attributes="hasGpu:true" or something similar to differentiate the nodes that do and do not have gpus. This information is also passed on to the frameworks in resource offers, but these attributes cannot be "consumed" by a running task, so they will always be offered for that node.
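For example, tagging a node this way is just another slave flag (the master address is again a placeholder):

    mesos-slave --master=<master-ip>:5050 \
                --attributes="hasGpu:true"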
For more information, see https://mesos.apache.org/documentation/attributes-resources/

Related

Isolation Scheduler in Apache Storm

I have defined two topologies and use the Isolation Scheduler in Nimbus. I have allocated the configuration below to my topologies.
isolation.scheduler.machines:
    "Topology-Test1": 2
    "Topology-Test2": 3
Now I want the following: if no work is coming in for Topology-Test2, all 3 of its nodes should be assigned to Topology-Test1; but when traffic comes for Topology-Test2, all 3 nodes should be reassigned back to Topology-Test2.
Is it possible to achieve this in Storm?
While a straightforward implementation is not supported by Storm directly imho, there are two pointers here that might help you:
T-3 Scheduler: In this paper, we propose a heuristic scheduling algorithm – T3-Scheduler – for a heterogeneous fog or cloud cluster that can efficiently identify the tasks that communicate with each other and assign them to the same node, up to a specified level of utilisation for that node.
Resource Aware Scheduler: Maybe you can hijack that somehow. According to the docs: the Resource Aware Scheduler can allocate resources on a per-user basis. Each user can be guaranteed a certain amount of resources to run his or her topologies, and the Resource Aware Scheduler will meet those guarantees when possible. When the Storm cluster has extra free resources, the Resource Aware Scheduler will be able to allocate additional resources to users in a fair manner. The importance of topologies can also vary: topologies can be used for actual production or just experimentation, so the Resource Aware Scheduler will take the importance of a topology into account when determining the order in which to schedule topologies or when to evict them.
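For reference, a hedged sketch of what enabling it in storm.yaml can look like, based on the Storm docs (the user name and numbers are illustrative, and in older Storm versions the scheduler lives under the backtype.storm package instead):

    storm.scheduler: "org.apache.storm.scheduler.resource.ResourceAwareScheduler"
    resource.aware.scheduler.user.pools:
        alice:
            cpu: 1000
            memory: 8192.0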
Good luck with finding your strategy.

Can I write a file to a specific cluster location?

You know, when an application opens a file and writes to it, the system chooses which clusters it will be stored in. I want to choose them myself! Let me tell you what I really want to do... In fact, I don't necessarily want to write anything. I have an HDD with a BAD range of clusters in the middle, and I want to mark that space as occupied by a file, and eventually set it as a hidden, unmovable system file (like the page file in Windows) so that it won't be accessed anymore. Any ideas on how to do that?
Later Edit:
I think FSCTL_MOVE_FILE is my last hope. I just found it, but I need to investigate... Maybe a file could be created anywhere and then relocated to the desired cluster. But that requires writing, and the function may fail if that cluster is bad.
I believe the answer to your specific question: "Can I write a file to a specific cluster location" is, in general, "No".
The reason is that the architecture of modern operating systems is layered, so the underlying disk store is accessed at a lower level than application code can reach, and of course disks can be formatted in different ways, so there will be different kernel-mode drivers supporting different formats. Even so, an intelligent disk controller can remap the addresses used by the kernel-mode driver anyway. In short, there are too many levels of possible redirection for you to be sure that your intervention is happening at the correct level.
If you are talking about Windows - which you haven't stated, but which appears to be assumed - then you need to be looking at storage drivers in the kernel (see https://learn.microsoft.com/en-us/windows-hardware/drivers/storage/). I think the closest you could reasonably come would be to write your own Installable File System driver (see https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/_ifsk/). This is really a 'filter', as it sits in the IO request chain and can intercept and change IO Request Packets (IRPs). Of course this would run in the kernel, not in userspace, and normally it would be written in C; I note your question is tagged for Delphi.
Your IFS driver can sit at different levels in the request chain. I have used this technique to intercept calls to specific file system locations (paths / file names) and alter the IRP so as to virtualise the request - even calling back to user space from the kernel to resolve how the request should be handled. Using the provided examples, implementing basic functionality with an IFS driver is not too involved, because it's a filter and not a complete storage system.
However the very nature of this approach means that another filter can also alter what you are doing in your driver.
You could look at replacing the file system driver that interfaces to the hardware, but I think that's likely to be an excessive task under the circumstances ... and, as pointed out already by @fpiette, the disk controller hardware can remap your request anyway.
In the days of MSDOS the access to the hardware was simpler and provided by the BIOS which could be hooked to allow the requests to be intercepted. Modern environments aren't that simple anymore. The IFS approach does allow IO to be hooked, but it does not provide the level of control you need.
EDIT regarding suggestion by the OP of using FSCTL_MOVE_FILE
For a simple environment this may well do what you want; it is designed to support a defragmentation process.
However I still think there's no guarantee that this actually will do what you want.
You will note from the page you have linked to that it states it moves "one or more virtual clusters of a file from one logical cluster to another within the same volume".
This is a control code that's passed to the underlying storage drivers I have referred to above. What the storage layer does is up to the storage layer and will depend on the underlying technology. With more advanced storage, there's no guarantee this actually addresses the physical locations which I believe your question is asking about.
However, that's entirely dependent on the underlying storage system. For some types of storage, relocation by the OS may not be honoured in the same way. As an example, consider an enterprise storage array with a built-in data-tiering function: without the OS being aware, data will be relocated within the storage based on the tiering algorithms. Also consider that there are technologies which allow data to be accessed directly (like NVMe), and that you are working with 'virtual' and 'logical' clusters, not physical locations.
However, you may well find that in a simple case, with support in the underlying drivers and no remapping done outside the OS and kernel, this does what you need.
Since your problem is to mark bad clusters, you don't need to write any program. Use the command-line utility CHKDSK that Windows provides.
In an elevated command prompt (Run as administrator), run the command:
    chkdsk /r c:
The check will be done on the next reboot.
Don't forget to read the documentation.

Is it possible to run Metal code on two or more GPUs at the same time?

I have some parallel computing tasks written in Metal. I am wondering whether I can run the Metal kernel on two or more GPUs at the same time.
Yes.
If, for example, you're on a Mac with a discrete GPU and an integrated GPU, there will be multiple elements in the array returned by a call to MTLCopyAllDevices(). Same if you have one or more external GPUs connected to your Mac.
In order to run the same compute kernel on each GPU, you'll need to create separate resources and pipeline state objects, since these objects are all affiliated with a single MTLDevice. Everything else about encoding and enqueueing work remains the same.
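In Swift, a minimal sketch of that setup might look like the following (the kernel name "myKernel" and the use of the default library are assumptions; error handling is elided):

    import Metal

    // Enumerate every GPU visible to the system (macOS).
    let devices = MTLCopyAllDevices()

    // Pipeline state objects are tied to a single MTLDevice, so build a
    // separate queue, library, and compute pipeline for each device.
    for device in devices {
        guard let queue = device.makeCommandQueue(),
              let library = device.makeDefaultLibrary(),
              let function = library.makeFunction(name: "myKernel"), // hypothetical kernel name
              let pipeline = try? device.makeComputePipelineState(function: function)
        else { continue }
        // Allocate this device's buffers with device.makeBuffer(...), then
        // encode and commit work to `queue` exactly as for a single GPU.
        _ = pipeline
    }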
Except in limited cases (i.e., when GPUs occupy the same peer group), you can't copy resources directly between GPUs. You can, however, use a MTLBlitCommandEncoder to copy shared or managed resources via the system bus.
If there are dependencies among the compute commands across devices, you may need to use events to explicitly synchronize them.
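For example, a hedged sketch of cross-device synchronization with a shared event (deviceA/queueA and deviceB/queueB are assumed to come from a loop like the one above):

    // A shared event can be signaled on one device and waited on by another.
    let event = deviceA.makeSharedEvent()!

    let bufferA = queueA.makeCommandBuffer()!
    // ... encode device A's compute work here ...
    bufferA.encodeSignalEvent(event, value: 1)  // signal when A's work completes
    bufferA.commit()

    let bufferB = queueB.makeCommandBuffer()!
    bufferB.encodeWaitForEvent(event, value: 1) // B waits for A's signal
    // ... encode device B's dependent work here ...
    bufferB.commit()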

Limitations in Mesos and Marathon Regarding Docker

We have this scenario.
We have a 3-master/3-slave architecture for Mesos.
Each slave is identical: 4GB RAM and 4 CPU cores.
We have started 10 Marathon apps, each with 1 CPU core and 1GB RAM. The containers are started but we are not utilizing them; according to the system, 97% of the CPU is free.
Now we are trying to start another container with 3 CPU cores and 2GB RAM.
Unfortunately, we are not able to start the container. According to the Mesos logs, Marathon has declined the offer, even though none of the slave nodes are doing anything. The Marathon app stays in the Deploying state.
If Mesos is not able to allocate resources to the Marathon app when the existing containers are not utilizing theirs, then what's the use of the Docker integration here?
As per my understanding:
Once an offer is accepted by a Marathon app, even if Docker is not using those resources, Mesos considers them already utilized by the app. But if the container is not utilizing any resources, Mesos should reclaim the available resources and allocate them to the next Marathon application.
Instead, once an offer is assigned to a Marathon app, Mesos subtracts the allocated resources from the total resources.
So we are not fully utilizing the Docker features in Mesos/Marathon.
Let me know any suggestions and answers.
Thank you
Mesos tracks "allocation", not actual usage. If your app is not doing anything now, that doesn't mean it won't do anything the next moment. That means that if your app requested 1 CPU, this CPU is reserved for the app.
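To make the arithmetic concrete for your cluster: the three slaves provide 3 × 4 = 12 CPU cores in total, and the 10 one-core tasks leave only 12 − 10 = 2 cores unallocated. Assuming those tasks are spread something like 4/3/3 across the slaves, no single slave has 3 free cores, so the 3-core task cannot be placed anywhere even though measured utilization is near zero.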
Now, if you don't want to precisely estimate the resources your app is using, you may want to look at oversubscription in Mesos. Keep in mind, though, that once the app for which those resources were originally allocated requests them back, apps running on oversubscribed resources may be terminated.
Mesos/Marathon actually accounts for the allocated 10 × (1GB + 1 CPU), because that is the maximum your apps are allowed to use.
And so, yes, your understanding is correct.
In my opinion you have at least 2 options:
Assign fewer resources to your tasks.
Use an interesting new feature which seems to fit your use case: oversubscription, which basically tries to utilize this difference between allocated and actually used resources (see the sketch below).
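As a hedged sketch based on the Mesos oversubscription documentation, enabling the bundled fixed resource estimator on an agent looks roughly like this (the library path and the extra "cpus:14" value are illustrative):

    mesos-slave --resource_estimator="org_apache_mesos_FixedResourceEstimator" \
                --modules='{
                  "libraries": [{
                    "file": "/usr/local/lib/libfixed_resource_estimator.so",
                    "modules": [{
                      "name": "org_apache_mesos_FixedResourceEstimator",
                      "parameters": [{"key": "resources", "value": "cpus:14"}]
                    }]
                  }]
                }'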

Sandboxing user code with Erlang

As far as I know, Erlang provides advanced features for error handling and isolation of processes.
I'm building a system that allows users to submit their code to be executed in a shared server environment, and I need to make it safe.
Requirements are:
limit CPU and memory usage individually for each user process;
forbid user processes to communicate with other processes (except some processes specially designed for this purpose);
forbid access to all system resources (shell, file system, ...);
terminate user processes in case of errors or high resource consumption.
Is it possible to do all this with Erlang and keep it performance-efficient?
In general, Erlang doesn't provide means to sandbox code which a user can inject. You can try writing your own piece of protection code, but it is rather hard.
A better choice would probably be a language like Safe Haskell:
http://www.haskell.org/ghc/docs/7.4.2/html/users_guide/safe-haskell.html
which is specifically built to do this kind of thing.
The isolation provided by Erlang is not intended to protect against malicious modules being injected. In fact, there is no such protection in the distributed case either. As soon as two machines are connected, there is no limit to what you can do to the other machine.
There has been work done on Safe Erlang in the past, and you can find several papers about it.
The ErlHive project addresses the problem in an interesting way.
