Which cores are the performance cores in Instruments? - ios

I'm profiling in Instruments on an iPhone X with an A11 CPU, which has two performance cores and four efficiency cores.
Is there a way to tell which ones are the performance cores? And as for the main thread, will GCD put main-thread tasks on the performance cores rather than the efficiency ones?
I'm very interested to understand how this actually works.

GCD doesn't know anything about the different kinds of cores, and GCD doesn't decide which code runs on which core.
GCD decides which queue gets a thread from which thread pool, and which code is scheduled to run next on that queue's thread.
Deciding when a thread runs and on which core is the job of the kernel's thread scheduler. The kernel also decides how many threads are available in each GCD thread pool.
The main thread is a thread like any other. How much CPU time a thread gets depends on its own priority level, the number of other threads, their priority levels, and the amount of work scheduled for each of them.
As the A11 allows all 6 cores to be active at the same time, the kernel decides which thread gets a high-performance core and which gets a low-performance one. High-priority threads and threads with a heavy computational workload (those that want to run very often and usually use up their full runtime quantum) are preferred for the high-performance cores. Low-priority threads and threads with little computational workload (those that want to run infrequently and often yield or block before their quantum is used up) are preferred for the low-performance cores. In theory, though, any thread can run on any core, since it would be wasteful to leave cores unused while threads are waiting to run; all else being equal, the low-power cores are generally preferred because that reduces power consumption and extends battery life.

Related

Can hardware threads access main memory at the same time?

I am trying to understand microarchitecture.
When an operating system schedules code to run on a CPU hardware thread (as in Intel HyperThreading), can each execution context issue memory reads in parallel or is the pipeline shared?
I am trying to do some rough calculations and complexity analysis, and I want to know whether memory bandwidth is shared: should I divide my calculation by the number of cores (assuming the pipeline is shared) or by the number of hardware threads (assuming the memory bandwidth is parallel)?
Yes, the pipeline is shared, so it's possible for each of the two load execution units in a physical core to be running a uop from a different logical core, accessing L1d in parallel. (e.g. https://www.realworldtech.com/haswell-cpu/5/ / https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram)
Off-core (L2 miss) bandwidth doesn't scale with the number of logical cores, and one thread per core can fairly easily saturate it, especially with SIMD, if your code has high throughput (not bottlenecking on latency or branch misses) and low computational intensity (ALU work per load of data into registers, or into L1d or L2 cache, whichever you're cache-blocking for), e.g. a dot product.
Well-tuned high-throughput (instructions per cycle) code like linear algebra stuff (especially matmul) often doesn't benefit from more than 1 thread per physical core, instead suffering more cache misses when two threads are competing for the same L1d / L2 cache.
Cache-blocking aka loop tiling can help a lot, if you can loop again over a smaller chunk of data while it's still hot in cache. See How much of ‘What Every Programmer Should Know About Memory’ is still valid? (most of it).
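As a rough illustration of what loop tiling looks like, here is a minimal Python sketch of a tiled matrix multiply next to the plain triple loop. The function names and the tile size of 4 are assumptions made for this example. Pure Python will not show the cache benefit itself; the point is only the loop structure, where each small block is fully reused before moving on.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over k of a[i][k] * b[k][j]."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=4):
    """Same arithmetic, but iterating over tile x tile blocks of the index space."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    # Outer loops pick a block; inner loops finish all work inside that block.
    for i0 in range(0, n, tile):
        for k0 in range(0, m, tile):
            for j0 in range(0, p, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_c = c[i]
                    for k in range(k0, min(k0 + tile, m)):
                        a_ik = a[i][k]
                        row_b = b[k]
                        for j in range(j0, min(j0 + tile, p)):
                            row_c[j] += a_ik * row_b[j]
    return c


# Integer inputs, so both versions must agree exactly.
a = [[i + j for j in range(6)] for i in range(5)]
b = [[i * j + 1 for j in range(7)] for i in range(6)]
assert matmul_tiled(a, b, tile=4) == matmul_naive(a, b)
```

In a compiled language the tile size would be chosen so that the blocks being worked on fit in the cache level you are targeting (L1d or L2).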

Grand Central Dispatch: What happens when queues get overloaded?

Pretty simple question that I haven't found answered anywhere in the documentation or tutorials on GCD: What happens if I'm submitting work to queues faster than it's being processed and removed? I'm aware that GCD queues have no size limit, so would work just pile up until the program runs out of memory? Is there any way to properly handle this situation?
What happens if I'm submitting work to queues faster than it's being processed and removed?
It depends.
If dispatching tasks to a single/shared serial queue, they will just be added to the queue and it will process them in a FIFO manner. No problem. Memory is your only constraint.
If dispatching tasks to a concurrent queue, though, you end up with “thread explosion”, and you will quickly exhaust the limited number of worker threads available for that quality-of-service (QoS). This can result in unpredictable behaviors should the OS need to avail itself of a queue of the same QoS. Thus, you must be very careful to avoid this thread explosion.
See the discussion of thread explosion in WWDC 2015 Building Responsive and Efficient Apps with GCD and again in WWDC 2016 Concurrent Programming With GCD in Swift 3.
Is there any way to properly handle this situation?
It is hard to answer that in the abstract. Different situations call for different solutions.
In the case of thread explosion, the solution is to constrain the degree of concurrency using concurrentPerform (limiting the concurrency to the number of cores on your device). Or we use operation queues and their maxConcurrentOperationCount to limit the degree of concurrency to something reasonable. There are other patterns, too, but the idea is to constrain concurrency to something suitable for the device in question.
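The idea of capping concurrency can be sketched outside of GCD. The Python snippet below is an analogy, not GCD or OperationQueue code: a fixed-size worker pool plays the role of maxConcurrentOperationCount, and the helper name run_bounded and its peak counter are assumptions made for this illustration.

```python
import os
import threading
from concurrent.futures import ThreadPoolExecutor


def run_bounded(tasks, max_workers=None):
    """Run callables with at most max_workers in flight.

    Returns (results in submission order, peak number of tasks running at once).
    """
    limit = max_workers or os.cpu_count() or 1
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def tracked(task):
        # Count how many tasks are running right now, to show the cap holds.
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        try:
            return task()
        finally:
            with lock:
                state["active"] -= 1

    # The pool never creates more than `limit` threads, however many tasks arrive.
    with ThreadPoolExecutor(max_workers=limit) as pool:
        results = list(pool.map(tracked, tasks))
    return results, state["peak"]


jobs = [lambda n=n: n * n for n in range(100)]
results, peak = run_bounded(jobs, max_workers=4)
```

However many jobs are submitted, no more than four run at once; the rest wait in the pool's queue instead of each demanding a thread.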
But if you're just dispatching a large number of tasks to a serial queue, there's not much you can do (other than looking for parallelism opportunities, to make efficient use of all of the CPU’s cores). But that's OK, as that is the whole purpose of a queue, to let it perform tasks in the order they were submitted, even if the queue can't keep up. It wouldn’t be a “queue” if it didn’t follow this FIFO sort of pattern.
Now, if you're dealing with real-time data that cannot be processed quickly enough, you have a different problem. In that case, you might want to decouple the capture of the input from the processing and decide how you want to handle it. E.g. if you can't keep up with real-time processing of a video, you have a choice: either you start dropping frames or you process the data asynchronously/later. You just have to decide what is right for your use case. We cannot answer this question in the abstract.
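One way to picture the "drop frames" option is a bounded buffer between capture and processing. The Python sketch below is only an illustration under assumptions: the FrameBuffer class, its capacity, and the policy of dropping the newest frame when full are all invented for this example, not taken from any framework.

```python
import queue


class FrameBuffer:
    """Bounded hand-off between capture and processing (hypothetical helper)."""

    def __init__(self, capacity):
        self._frames = queue.Queue(maxsize=capacity)
        self.dropped = 0

    def capture(self, frame):
        """Called by the producer; never blocks. Returns False if the frame was dropped."""
        try:
            self._frames.put_nowait(frame)
            return True
        except queue.Full:
            # The consumer is behind: drop this frame rather than grow without bound.
            self.dropped += 1
            return False

    def next_frame(self):
        """Called by the consumer; returns None when nothing is waiting."""
        try:
            return self._frames.get_nowait()
        except queue.Empty:
            return None


buf = FrameBuffer(capacity=3)
accepted = [buf.capture(i) for i in range(5)]  # frames 3 and 4 arrive while full
```

Dropping the oldest frame instead, or spilling frames to storage for later processing, are equally valid policies; which one is right depends on the use case.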

Is Javonet a thread-safe library, and more importantly, does it allow usage of all threads?

Is Javonet thread-safe? I couldn't find any documentation one way or the other. Even if it is thread-safe, is there some sort of "mutex" that's preventing full usage of all threads?
When I tried to run Javonet in parallel, it did work, but the CPU usage did not increase significantly above the sequential load (i.e. on a 10-CPU system, the CPU usage hovered around 20% for the parallel load, which was merely double the sequential CPU load of 10%). However, if I ran 10 instances of the exact same sequential code (that used Javonet), I achieved 100% CPU usage... so it "feels" like Javonet must have some built-in mutexes that are preventing full parallel usage.
Javonet is thread safe. You just need to follow standard practices for writing multi-threaded applications and Javonet will take care of executing your code properly.
Javonet creates a new corresponding .NET thread for each calling Java thread. The same applies in the other direction: for callbacks, events, and delegates called from another thread, Javonet creates the corresponding thread on the Java side. Once the calling thread completes, Javonet closes the thread on the other side.
If the corresponding thread already exists, Javonet will rejoin that existing thread.
Javonet does use internal mutexes / read-write locks while accessing object instances, some caching collections, and types, which, depending on your Java code, might affect how well the work parallelizes.
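For a very rough sense of how much serialization the numbers in the question would imply, one can apply Amdahl's law. This is a back-of-the-envelope sketch, not a measurement of Javonet: it assumes that going from 10% to 20% CPU on 10 CPUs corresponds to a speedup of about 2, and the function names are made up for this example.

```python
def amdahl_speedup(serial_fraction, workers):
    """Amdahl's law: speedup = 1 / (s + (1 - s) / N)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


def implied_serial_fraction(speedup, workers):
    """Solve Amdahl's law for s, given an observed speedup on N workers."""
    return (1.0 / speedup - 1.0 / workers) / (1.0 - 1.0 / workers)


# Assumed reading of the question: about 2x speedup on 10 CPUs.
s = implied_serial_fraction(speedup=2.0, workers=10)
print(f"implied serial fraction: {s:.3f}")
```

Under those assumptions, roughly 44% of the work would be executing one thread at a time, which is consistent with time spent behind shared locks, though other causes are possible.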

Thread pools and context switching (tasks)?

This is quite a general computer science question and not specific to any OS or framework.
So I am a little confused by the overhead associated with switching tasks on a thread pool. In many cases it doesn't make sense to give every job its own specific thread (we don't want to create too many hardware threads), so instead we put these jobs into tasks which can be scheduled to run on a thread. We set up a pool of threads and then dynamically allocate the tasks to run on a thread taken from the thread pool.
I am just a little confused (I can't find an in-depth answer) about the overhead associated with switching tasks on a specific thread (in the thread pool). A DrDobbs article (sourced below) states that there is some, but I need a more in-depth answer about what is actually happening (a citable source would be fantastic :)).
By definition, SomeWork must be queued up in the pool and then run on
a different thread than the original thread. This means we necessarily
incur queuing overhead plus a context switch just to move the work to
the pool. If we need to communicate an answer back to the original
thread, such as through a message or Future or similar, we will incur
another context switch for that.
Source: http://www.drdobbs.com/parallel/use-thread-pools-correctly-keep-tasks-sh/216500409?pgno=1
What components of the thread are actually switching? The thread itself isn't actually switching, just the data that is specific to the thread. What is the overhead associated with this (more, less or the same)?
Let's first clarify 5 key concepts here and then discuss how they relate to each other in a thread pool context:
thread:
Briefly, it can be described as a program execution context, given by the code that is being run, the data in the CPU registers, and the stack. When a thread is created, it is assigned the code that should be executed in that thread's context. At any point in time the thread has a next instruction to execute, and the data in the CPU registers and the stack are in a given state.
task:
Represents a unit of work. It's the code that is assigned to a thread to be executed.
context switch (from wikipedia):
The process of storing and restoring the state (context) of a thread so that execution can be resumed from the same point at a later time. This enables multiple processes to share a single CPU and is an essential feature of a multitasking operating system. As explained above, the context consists of the code that is being executed, the CPU registers, and the stack.
What gets context switched is the thread. A task represents only a piece of work that can be assigned to a thread to be executed. At any given moment a thread can be executing a task.
Thread Pool (from wikipedia):
In computer programming, the thread pool is where a number of threads are created to perform a number of tasks, which are usually organized in a queue.
Thread Pool Queue:
Where tasks are placed to be executed by threads in the pool. This data structure is a shared piece of memory where threads compete to enqueue/dequeue, which may lead to contention under high load.
Illustrating a thread pool usage scenario:
In your program (possibly running in the main thread), you create a task and schedule it to be executed in the thread pool.
The task is queued in the thread pool queue.
When a thread from the pool runs, it dequeues a task from the queue and starts executing it.
If there are no free CPUs to run the pool's thread, the operating system will at some point (depending on the thread scheduler policy and thread priorities) stop a running thread and context switch to another thread.
The operating system can stop the execution of a thread at any time, context switch to another thread, and return later to continue where it stopped.
The overhead of context switching grows as the number of active threads competing for CPUs grows. Thus, ideally, a thread pool uses the minimum number of threads needed to occupy all available CPUs in a machine.
If your tasks contain no code that blocks, context switching is minimized because no more threads are used than there are CPUs available on the machine.
Of course, if you have only one core, your main thread and the thread pool will compete for the same CPU.
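The scenario above can be sketched as a tiny hand-rolled pool in Python. This is a minimal illustration, not how any particular framework implements its pool: the TinyPool class and its stop sentinel are assumptions made for the example, and real code would normally use a library pool such as concurrent.futures.

```python
import queue
import threading


class TinyPool:
    """Hand-rolled pool for illustration only."""

    _STOP = object()  # sentinel telling a worker to exit

    def __init__(self, workers):
        self._tasks = queue.Queue()  # the shared thread pool queue
        self._threads = [threading.Thread(target=self._run) for _ in range(workers)]
        for t in self._threads:
            t.start()

    def submit(self, fn, *args):
        # The caller only enqueues; it does not run the task itself.
        self._tasks.put((fn, args))

    def _run(self):
        # Each worker loops: dequeue a task, execute it, repeat.
        while True:
            item = self._tasks.get()
            if item is TinyPool._STOP:
                break
            fn, args = item
            fn(*args)

    def shutdown(self):
        # One sentinel per worker, queued behind all pending tasks.
        for _ in self._threads:
            self._tasks.put(TinyPool._STOP)
        for t in self._threads:
            t.join()


results = []
results_lock = threading.Lock()


def square(n):
    with results_lock:
        results.append(n * n)


pool = TinyPool(workers=2)
for n in range(10):
    pool.submit(square, n)
pool.shutdown()
```

Moving from one task to the next is just another iteration of the worker's loop on the same thread. The operating system may still preempt the worker at any point, but that is independent of task boundaries.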
The article probably talks about the case in which work is posted to the pool and the result of it is being waited for. Running a task on the thread-pool in general does not incur any context switching overhead.
Imagine queueing 1000 work items. A thread-pool thread will execute them one after the other, all without a single context switch in between.
Switching happens due to waiting/blocking.
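A small Python check of the "1000 work items" picture, under the assumption of a pool with exactly one worker: every item reports the identifier of the thread it ran on. The helper name worker_ident is made up for this example. Note what this does and does not show: it confirms that all items ran back to back on the same pool thread, not that the OS never preempted that thread.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def worker_ident(_):
    """Return the identifier of whichever thread executes this work item."""
    return threading.get_ident()


# One worker thread, 1000 queued work items.
with ThreadPoolExecutor(max_workers=1) as pool:
    idents = list(pool.map(worker_ident, range(1000)))

distinct_workers = set(idents)
ran_off_caller = threading.get_ident() not in distinct_workers
```

The only thread hand-offs the program itself asks for are the initial submission to the pool and the wait for the results.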

Why does Firemonkey application use no more than 20% of CPU?

I have a large binary file (approximately 700 MB) which I load into a TMemoryStream. After that I read it with TMemoryStream.Read() and make some simple calculations, but the application never takes more than 20% of the CPU. My PC has an i7 processor.
Is there any way to increase the CPU usage and speed up the reading process without using threads?
As far as I know, the only way to utilise the power of multiple cpu cores with Delphi is to use threads.
If you do choose to use threads in your application, there are a couple libraries that may ease development. How Do I Choose Between the Various Ways to do Threading in Delphi?
Adding on to Shannon's answer: on an i7 processor with multiple cores, one thread will only ever utilize one core at a time. One thread cannot run on more than one processor core simultaneously. Therefore, if you wish to utilize multiple cores, you need to create multiple threads to handle various tasks. Creating a thread isn't necessarily as simple as saying "do this in that thread"; there's a lot to know about multi-threading. For example, your application has one main GUI thread, then one thread might be dedicated to performing some long calculation, another thread might be updating a caption with real-time data, and so on.
Windows automatically decides which core to assign a thread to, and usually divides the load fairly. So, if you have 8 processor cores and 16 threads, each core would (presumably) get 2 threads, and since each core ticks independently of the others, more than one thread can literally be running at the same time (as opposed to a single core, which divides its time between the threads).
So to answer your question: if a single thread accounts for about 20% of the CPU, then 5 threads each doing heavy work at the same time would bring you to roughly 100% processor usage.
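The arithmetic behind that ceiling can be written down in a couple of lines. This is a simplified model, assuming fully CPU-bound threads, no other load on the machine, and a hypothetical core count; the function name max_cpu_percent is made up for this example.

```python
def max_cpu_percent(busy_threads, logical_cores):
    """Upper bound on total CPU usage when each busy thread can occupy one core."""
    running = min(busy_threads, logical_cores)  # extra threads have to wait
    return 100.0 * running / logical_cores


# Hypothetical machine with 8 logical cores.
for threads in (1, 2, 4, 8, 16):
    print(threads, "thread(s):", max_cpu_percent(threads, 8), "%")
```

In this model a single-threaded program on 8 logical cores cannot show more than 12.5% total usage, and adding threads beyond the number of cores does not raise the ceiling further.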

Resources