Parse list 3 threads a time, when 5 completed works, server signal to do something - pthreads

Hy I am curious does anyone know a tutorial example where semaphores are used for more than 1 process /thread. I'm looking forward to fix this problem. I have an array, of elements and an x number of threads. This threads work over the array, only 3 at a moment. After 5 works have been completed, the server is signelised and it clean those 5 nodes. But I'm having problems with the designing this problem. (node contains worker value which contains the 'name' of the thread that is allowed to work on it, respectivly nrNodes % nrThreads)
In order to make changes on the list a mutex is neccesarly in order not to overwrite / make false evaluations.
But i have no clue how to limit 3 threads to parse, at a given point, the list, and how to signal the main for cleaning session. I have been thinking aboutusing a semafor and a global constant. When the costant reaches 5, the server to be signaled(which probably would eb another thread.)
Sorry for lack of code but this is a conceptual question, what i have written so far doesn't affect the question in any way.

Related

Neo4j In memory configurations, multithreading, and slow writes

How do I improve performance when writing to neo4j. I currently have neo4j set up on a server and I am currently running it in embedded more. I believe my configurations are storing all the content of my graph database in memory based upon configurations I've found online
neostore.nodestore.db.mapped_memory=0
neostore.relationship.db.mapped_memory=0
neostore.propertystore.db.mapped_memory=0
neostore.propertystore.db.strings.mapped_memory=0
neostore.propertystore.db.arrays.mapped_memory=0
neostore.propertystore.db.index.keys.mapped_memory=0
neostore.propertystore.db.index.mapped_memory=0
node_auto_indexing=true
node_keys_indexable=type,id
cache_type=strong
use_memory_mapped_buffers=false
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
Please let me know if this is incorrect. The problem that I am encountering is that when I try to persist information to the graph database. It appears that those times are not very quick in comparison to our MYSQL times of the samething(ex. to add 250 items would take about 3sec and in MYSQL it takes 1sec) . I read online that when you have multiple indexes that that can slow down performance on persisting data so I am working on that right now to see if that is my culprit. But, I just wanted to make sure that my configurations seem to be inline when it comes to running your graph database in memory.
Second question to this topic. Okay, if my configurations are good and my database is indeed in memory, then is there a way to optimize persisting data just in case this isn't the silver bullet. If we ran one thread against our test that executes this functionality, oppose to 10 threads, its seems like the times for execution bubbles up
ex.( thread 1 finishes 1s, thread 2 finishes 2s, thread 3 finishes 3s,etc). Is there some special multithreaded configuration that I am missing to improve the performance when mulitple threads are hitting it at one time.
Neo4J version
1.9.1-enterprise
My Jvm configs are
-Xms25G -Xmx25G -XX:+UseNUMA -XX:+UseSerialGC
My Machine Specs:
File system type ext3
You cache arguments are invalid.
node_cache_size=12G
relationship_cache_size=12G
node_cache_array_fraction=10
relationship_cache_array_fraction=10
These can only be used with the GCR cache. Setting the cache isn't going to put everything in memory for you at start up, you will have to write code to do this for you. Something like this:
GlobalGraphOperations ggo = GlobalGraphOperations.at(graphDatabaseFactory);
for (Node n : ggo.getAllNodes()) {
for (String propertyKey : n.getPropertyKeys()) {
n.getProperty(propertyKey);
}
for (Relationship relationship : n.getRelationships()) {
}
}
Beware with the strong cache, if you have a lot of nodes/relationships, eventually your cache will become large and performing GC against it will cause long pauses in your system.
My recommendation would be to use the memory mapped files, as this is an OS handled and will be outside of heap space. It doesn't provide near the speed of caching, but it will provide a speed up if you have to read from the neo store.

F# Array.Parallel hanging

I have been struggling with parallel and async constructs in F# for the last couple days and not sure where to go at this point. I have been programming with F# for about 4 months - certainly no expert - and I currently have a series of calculations that are implemented in F# (asp.net 4.5) and are working correctly when executed sequentially. I am running the calculations on a multi-core server and since there are millions of inputs to perform the same calculation on, I am hoping to take advantage of parallelism to speed it up.
The calculations are extremely data parallel - basically the exact calculation on different input data. I have tried a number of different avenues and I continually run into the same issue - it seems as if the parallel looping never gets to the end of the input data set. I have tried TPL, ConcurrentQueues, Parallel.Array.map/iter and all the same result: the program starts out fine and then somewhere in the middle (indeterminate) it just hangs and never completes. For simplicity I actually removed the calculation from the program and I am just calling a print method, and Here is where the code is currently at:
let runParallel =
let ids = query {for c in db.CustTable do select c.id} |> Seq.take(5)
let customerInputArray= getAllObservations ids
Array.Parallel.iter(fun c -> testParallel c) customerInputArray
let key = System.Console.ReadKey()
0
A few points...
I limited the results above to only 5 just for debugging. The actual program does not apply the Take(5).
The testParallel method is just a printfn "test".
The customerInputArray is a complex data type. It is a tuple of lists that contain records. So I am pretty sure my problem must be there...but I added exception handling and no exception is getting raised, so have no idea how to go about finding the problem.
Any help is appreciated. Thanks in advance.
EDIT: Thanks for the advice...I think it is definitely deadlock. When I remove all of the printfn, sprintfn, and string concat operations, it completes. (of course, I need those things in there.)
Is printfn, sprintfn, and string ops not thread-safe?
Another EDIT: Iteration always stops on the last item..So if my input array has 15 items, the processing stops on item 14, or seems to never get to item 15. Then everything just hangs. Does not matter what the size of the input array is..Any ideas what can be causing this? I even switched over to Parallel.ForEach (instead of Array.Parallel) and same behavior.
Update on the situation and how I resolved this issue.
I was unable to upload code from my example due to my company's firewall policy, so in the end my question did not have enough details. I failed to mention that I was using a type provider which was important information in this situation. But here is what I figured out.
I am using the F# type provider for SQL Server and was passing around its Service Types which I suspect are not thread-safe. When I replaced the ServiceTypes with plain old F# Records, the code worked fine - no more deadlocks and everything completed without error.

Using random() function on multiple threads

I'm working on an app where I need reproducible random numbers. I use srandom() with a seed to initialize the random number sequence. Then I use random() to generate the random numbers from this seed. If this is the only thread generating random numbers, everything works fine. However, if there are multiple threads generating random numbers, they interfere with each other.
Apparently, the sequence of random numbers is not thread safe. There must be a central random number generator that is called by all threads.
My app generates hundreds of objects, each one of which has four sequences of 14 random numbers generated this way. Each of these 4 sequences has its own non-random seed. This way, the random numbers should be reproducible. The problem is, because of the thread interference I just described, sometimes the sequence of 14 numbers being generated will be interrupted by a random number request by another thread.
After thinking about this for a while, I've decided to call
dispatch_sync(dispatch_get_main_queue(), ^{//generate the 14 numbers});
to get each sequence. This should force them to get generated in the proper sequence. In reading the documentation, it says there could be a deadlock if dispatch_sync is called on the queue it's running in. How can I tell if I'm already on the main queue? If I am, I don't need to dispatch anything, right?
Is there a better way to do this?
I suspect another way to do this is similar to this but using a dedicated queue instead of the main queue. I've never tried making my own queue before. Also, the method that needs to call the queue is an ephemeral one, so I'd need to somehow pass the custom queue around if I'm going to go that route. How does one pass a queue as an argument?
For now, I'm running with my idea, above, dispatching synchronously to the main queue, and the app seems to work fine. Worst case scenario, this snippet of code would be run about 4800 times (4 for each of 1200 objects, which is currently the max.).
I assume you want computationally random numbers, rather than cryptographic random numbers.
My suggestion would be to have separate RNGs for each thread, with each thread RNG seeded centrally from a master RNG. Since the system RNG is not thread safe, then create your own small RNG method -- a good LCG should work -- for use exclusively within one thread.
Use the built-in random() to produce only the initial seeds for each of your sub-threads. Setting the overall initial seed with srandom() will ensure that the thread local my_random() methods will all get a consistent initial reseed as long as the threads are started in the same order each time.
Effectively you are building a hierarchy of RNGs to match your hierarchy of threads.
Another option would be to have a singleton do the computation. The object needing the set of random numbers would ask the singleton for them in a batch.

The memory consistency model CUDA 4.0 and global memory?

Update: The while() condition below gets optimized out by the compiler, so both threads just skip the condition and enter the C.S. even with -O0 flag. Does anyone know why the compiler is doing this? By the way, declaring the global variables volatile causes the program to hang for some odd reason...
I read the CUDA programming guide but I'm still a bit unclear on how CUDA handles memory consistency with respect to global memory. (This is different from the memory hierarchy) Basically, I am running tests trying to break sequential consistency. The algorithm I am using is Peterson's algorithm for mutual exclusion between two threads inside the kernel function:
flag[threadIdx.x] = 1; // both these are global
turn = 1-threadIdx.x;
while(flag[1-threadIdx.x] == 1 && turn == (1- threadIdx.x));
shared_gloabl_variable_x ++;
flag[threadIdx.x] = 0;
This is fairly straightforward. Each thread asks for the critical section by setting its flag to one and by being nice by giving the turn to the other thread. At the evaluation of the while(), if the other thread did not set its flag, the requesting thread can then enter the critical section safely. Now a subtle problem with this approach is that if the compiler re-orders the writes so that the write to turn executes before the write to flag. If this happens both threads will end up in the C.S. at the same time. This fairly easy to prove with normal Pthreads, since most processors don't implement sequential consistency. But what about GPUs?
Both of these threads will be in the same warp. And they will execute their statements in lock-step mode. But when they reach the turn variable they are writing to the same variable so the intra-warp execution becomes serialized (doesn't matter what the order is). Now at this point, does the thread that wins proceed onto the while condition, or does it wait for the other thread to finish its write, so that both can then evaluate the while() at the same time? The paths again will diverge at the while(), because only one of them will win while the other waits.
After running the code, I am getting it to consistently break SC. The value I read is ALWAYS 1, which means that both threads somehow are entering the C.S. every single time. How is this possible (GPUs execute instructions in order)? (Note: I have compiled it with -O0, so no compiler optimization, and hence no use of volatile).
Edit: since you have only two threads and 1-threadIdx.x works, then you must be using thread IDs 0 and 1. Threads 0 and 1 will always be part of the same warp on all current NVIDIA GPUs. Warps execute instructions SIMD fashion, with a thread execution mask for divergent conditions. Your while loop is a divergent condition.
When turn and flags are not volatile, the compiler probably reorders the instructions and you see the behavior of both threads entering the C.S.
When turn and flags are volatile, you see a hang. The reason is that one of the threads will succeed at writing turn, so turn will be either 0 or 1. Let's assume turn==0: If the hardware chooses to execute thread 0's part of the divergent branch, then all is OK. But if it chooses to execute thread 1's part of the divergent branch, then it will spin on the while loop and thread 0 will never get its turn, hence the hang.
You can probably avoid the hang by ensuring that your two threads are in different warps, but I think that the warps must be concurrently resident on the SM so that instructions can issue from both and progress can be made. (Might work with concurrent warps on different SMs, since this is global memory; but that might require __threadfence() and not just __threadfence_block().)
In general this is a great example of why code like this is unsafe on GPUs and should not be used. I realize though that this is just an investigative experiment. In general CUDA GPUs do not—as you mention most processors do not—implement sequential consistency.
Original Answer
the variables turn and flag need to be volatile, otherwise the load of flag will not be repeated and the condition turn == 1-threadIdx.X will not be re-evaluated but instead will be taken as true.
There should be a __threadfence_block() between the store to flag and store to turn to get the right ordering.
There should be a __threadfence_block() before the shared variable increment (which should also be declared volatile). You may also want a __syncthreads() or at least __threadfence_block() after the increment to ensure it is visible to other threads.
I have a hunch that even after making these fixes you may still run into trouble, though. Let us know how it goes.
BTW, you have a syntax error in this line, so it's clear this isn't exactly your real code:
while(flag[1-threadIdx.x] == 1 and turn==[1- threadIdx.x]);
In the absence of extra memory barriers such as __threadfence(), sequential consistency of global memory is enforced only within a given thread.

External memory merge sort

Can anyone point me to a good reference on External Memory Mergesort? I've read the wiki page but am having trouble understanding it exactly. An animation might help but I can't seem to find one.
Basically, I know that you have a certain number of blocks on disk, and you can fit a certain number of blocks in memory. Lets say you have 32 blocks on disk and 4 blocks in memory. In the first pass you read 4 blocks into memory at a time, sort them in memory, and write them back out do disk. So at this point you have 8 sorted runs of 4 blocks. How does the merging work? Since I have 4 blocks in memory (assume I have one more for output) I think I should be able to merge 4 of those 8 runs at a time, and then merge the next 4 runs. And then in the last pass I want to merge the whole thing. But don't you have to read each block from disk each time? So how does this not become a n^2 solution?
I think I get it now. When you merge you still have to read all of the blocks on disk (let's call that B) but you don't have to read them B times. You only read them B log B times.

Resources