I'm implementing various kinds of data processing pipelines using dask.distributed. Usually the original data is read from S3, and in the end the processed (large) collection is written to CSV on S3 as well.
I can run the processing asynchronously and monitor progress, but I've noticed that all to_xxx() methods that store collections to file(s) seem to be synchronous calls. One downside is that the call blocks, potentially for a very long time. Another is that I cannot easily construct a complete graph to be executed later.
Is there a way to run e.g. to_csv() asynchronously and get a future object instead of blocking?
PS: I'm pretty sure that I can implement async storage myself, e.g. by converting the collection to delayed() and storing each partition. But this seems like a common case; unless I missed an already existing feature, it would be nice to have something like this included in the framework.
Most to_* functions take a compute keyword argument that defaults to True; pass compute=False instead. In that case the call returns a sequence of delayed values that you can then compute asynchronously:
# returns a list of dask.delayed values, one per output partition; nothing runs yet
values = df.to_csv('s3://...', compute=False)
# submit the delayed writes to the scheduler; returns futures without blocking
futures = client.compute(values)
I am new to the Data Movement SDK. How can I use it to remove a collection from documents that match a specific condition, in real time, in MarkLogic?
Yes, the DMSDK can reprocess documents in the database, including modifying the collections on the documents.
The most efficient way to change document collections on the server might be to take an approach similar to the out-of-the-box ApplyTransformListener (as summarized by
https://docs.marklogic.com/guide/java/data-movement#id_51555) but to execute a custom module instead of a transform.
Summarizing the main points (a rough client-side sketch follows the list):
Write an SJS (Server-Side JavaScript) module that declares a variable (using the JavaScript var statement) to receive the document URIs sent by the client and modifies the collections on those documents using a function such as
https://docs.marklogic.com/xdmp.documentSetCollections
Install the SJS module in the modules database as described here
https://docs.marklogic.com/guide/java/resourceservices#id_13008
Create a QueryBatcher to get the document URIs either from a query on the database or from a client iterator as described here:
https://docs.marklogic.com/guide/java/data-movement#id_46947
Supply a lambda function for the QueryBatcher.onUrisReady() method - see
https://docs.marklogic.com/javadoc/client/com/marklogic/client/datamovement/QueryBatcher.html#onUrisReady-com.marklogic.client.datamovement.QueryBatchListener-
In the lambda function, construct and execute a ServerEvaluationCall to the SJS module, assigning the variable to the URIs passed to the lambda function - see:
https://docs.marklogic.com/guide/java/resourceservices#id_84134
Be sure to register failure listeners using the QueryBatcher.onQueryFailure() and ApplyTransformListener.onFailure() methods to log errors or otherwise respond to the unexpected.
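To make the steps above concrete, here is a rough, hedged sketch of the client side in Java. It assumes the SJS module from the first step is already installed at the hypothetical path /ext/remove-collection.sjs, that the target documents are selected by a collection query on "to-clean", and that the host, port, credentials, batch size, thread count, and the comma-separated URI encoding are all placeholders to adapt:

import com.marklogic.client.DatabaseClient;
import com.marklogic.client.DatabaseClientFactory;
import com.marklogic.client.datamovement.DataMovementManager;
import com.marklogic.client.datamovement.QueryBatcher;
import com.marklogic.client.query.StructuredQueryBuilder;

public class RemoveCollectionJob {
    public static void main(String[] args) {
        DatabaseClient client = DatabaseClientFactory.newClient(
                "localhost", 8000,
                new DatabaseClientFactory.DigestAuthContext("admin", "admin"));
        DataMovementManager dmm = client.newDataMovementManager();

        // Query for the URIs of the documents whose collections should change.
        StructuredQueryBuilder sqb = new StructuredQueryBuilder();
        QueryBatcher batcher = dmm.newQueryBatcher(sqb.collection("to-clean"))
                .withBatchSize(100)
                .withThreadCount(4)
                // For each batch of URIs, evaluate the installed SJS module,
                // passing the URIs through its external variable (assumed here
                // to be a comma-separated list named "uris").
                .onUrisReady(batch -> batch.getClient().newServerEval()
                        .modulePath("/ext/remove-collection.sjs")
                        .addVariable("uris", String.join(",", batch.getItems()))
                        .evalAs(String.class))
                // Log failures rather than silently dropping batches.
                .onQueryFailure(failure -> failure.printStackTrace());

        dmm.startJob(batcher);
        batcher.awaitCompletion();
        dmm.stopJob(batcher);
        client.release();
    }
}

The SJS module itself (not shown) would declare var uris, split the value back into individual URIs, and call xdmp.documentSetCollections on each one, as described in the first step.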
Hoping that helps,
I'm trying to use SnakeYAML for stream processing of (big) YAML documents.
Currently, I'm stuck at the »present« step. It seems that the »present« process is not available in SnakeYAML, or at least I'm unable to find it, i.e. I can parse a string into Events, but I cannot put Events back together into a string.
Have I overlooked a »present« process in SnakeYAML? Or is there some third party code out there that can perform the »present« step?
I don't have enough memory to hold a full Node graph.
You can perform the Present process using the org.yaml.snakeyaml.emitter.Emitter class. I overlooked it the first few times because the void emit(Event) method signature confused me.
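For illustration, here is a minimal round-trip sketch: it parses a string into Events with Yaml.parse() and feeds each Event straight back into an Emitter, so no Node graph is ever built. The input document and the default DumperOptions are just placeholders:

import java.io.StringReader;
import java.io.StringWriter;
import org.yaml.snakeyaml.DumperOptions;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.emitter.Emitter;
import org.yaml.snakeyaml.events.Event;

public class PresentRoundTrip {
    public static void main(String[] args) throws Exception {
        String input = "greeting: hello\nnumbers: [1, 2, 3]\n";

        StringWriter out = new StringWriter();
        Emitter emitter = new Emitter(out, new DumperOptions());

        // Parse produces a stream of Events; emit() "presents" each one
        // immediately, writing YAML text to the Writer as it goes.
        for (Event event : new Yaml().parse(new StringReader(input))) {
            emitter.emit(event);
        }
        System.out.println(out);
    }
}

Because the Emitter writes each event as it arrives, the same loop works when the events come from a large streamed document rather than an in-memory string.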
When I read the section on
NSDataReadingOptions
Options for methods used to read NSData objects.
enum {
    NSDataReadingMappedIfSafe = 1UL << 0,
    NSDataReadingUncached = 1UL << 1,
    NSDataReadingMappedAlways = 1UL << 3,
};
typedef NSUInteger NSDataReadingOptions;
It says that
NSDataReadingUncached
A hint indicating the file should not be stored in the file-system caches.
For data being read once and discarded, this option can improve performance.
Available in OS X v10.6 and later.
Declared in NSData.h.
So I am assuming that by default these URL requests are cached, and there is no need to implement NSURLRequest to cache data if I want to use the shared global cache? Is this understanding correct?
Let me start off by saying that dataWithContentsOfURL:options:error: and its ilk are probably the worst APIs for getting something from network. They are very alluring to developers because they can get a resource from network in a single line of code, but they come with some very pernicious side-effects:
First, they block the thread on which they are called. This means that if you execute this on the main thread (the only thread on which your UI can be updated), then your application will appear frozen to the user. This is a really big 'no no' from a user experience perspective.
Second, you cannot cancel these requests, so even if you put this request on a background thread, it will continue to download even though the data may no longer be useful. For example, if your user arrives on a view controller and you execute this request and the user subsequently decides to hit a back button, that data will continue to download, even though it is no longer relevant.
Bottom line: DO NOT USE THESE APIs.
Please use async networking like NSURLConnection or AFNetworking. These classes were designed to get data efficiently and in a way that doesn't impact the user experience. What's even better is that they handle the specific use case you originally asked about: how do I stop it from caching on disk?
There is no definitive answer to your specific question, which refers to the caches managed by the URL loading system.
The reading options in the method dataWithContentsOfFile:options:error: refer solely to reading from the file system (please consult the official documentation of NSDataReadingOptions).
Unfortunately, the documentation gives no hint about the behavior when reading from remote resources. Whether the response is cached in the URL cache is unspecified - that is, we don't know, and the internal behavior and implementation can change with each new OS version.
IMHO, we should generally avoid using the "convenience methods" to read from remote resources. It appears these methods work only by "accident" when accessing remote resources. So, I strongly agree with @Wayne Hartman ;)
I've been poring over this subject for the past 12 hours, and I simply cannot seem to get anywhere. I do not even know if this is possible, but I'm hoping it is because it would go a long way to continuing my project.
What I am attempting to do is create coroutines so the particular program I use does not freeze up due to its inability to perform asynchronous http requests. I've figured out how to do that part, even though my understanding of coroutines is still in the "Huh? How does that work?" phase. My issue now is being able to respond to multiple requests with the correct information. For instance, the following should produce three separate responses:
foo(a)
foo(b)
foo(c)
where foo initiates a coroutine with the given parameter. If requested separately, each call returns the proper result. However, if requested as a block, it will only return foo(c)'s result. Now, I understand the reasoning behind this, but I cannot find a way to make it return all three results when requested as a block. To help understand this problem a bit, here's the actual code:
function background_weather()
    local loc = url.escape(querystring)
    weatherpage = http.request("http://api.wunderground.com/api/004678614f27ceae/conditions/q/" .. loc .. ".json")
    wresults = json.decode(weatherpage)
    -- process some stuff here, mainly datamining
    -- send datamined information as a response
    coroutine.yield()
end
And the creation of the coroutine:
function getweather ()
    -- see if backgrounder running
    if background_task == nil or
       coroutine.status (background_task) == "dead" then
        -- not running, create it
        background_task = coroutine.create (background_weather)
        -- make timer to keep it going
        AddTimer ("tickler", 0, 0, 1, "",
                  timer_flag.Enabled + timer_flag.Replace,
                  "tickle_it")
    end -- if
end -- function
The querystring variable is set with the initial request. I didn't include it here, but for the sake of testing, use 12345 as the querystring variable. The timer is something that the original author of the script initialized to check if the coroutine was still running or not, poking the background every second until done. To be honest, I'm not even sure if I've done this correctly, though it seems to run asynchronously in the program.
So, is it possible to receive multiple requests in one block and return multiple responses, correctly? Or is this far too much a task for Lua to handle?
Coroutines don't work like that. They are, in fact, blocking.
The problem coroutines resolve is "I want to have a function I can execute for a while, then go back to do other thing, and then come back and have the same state I had when I left it".
Notice that I didn't say "I want it to keep running while I do other things"; the flow of code "stops" on the coroutine, and only continues on it when you go back to it.
Using coroutines you can modify (and in some cases facilitate) how the code behaves, to make it more evident or legible. But it is still strictly single-threaded.
Remember that what Lua implements must be specifiable in standard C99. Since that standard doesn't come with a thread implementation, Lua is strictly single-threaded by default. If you want multi-threading, you need to hook in an external library. For example, luvit combines LuaJIT with the libuv library to achieve this.
A couple good references:
http://lua-users.org/wiki/CoroutinesTutorial
http://lua-users.org/wiki/ThreadsTutorial
http://lua-users.org/wiki/MultiTasking
http://kotisivu.dnainternet.net/askok/bin/lanes/comparison.html
Chapter 9.4 of Programming in Lua contains a fairly good example of how to deal with this exact problem, using coroutines and LuaSocket's socket.select() function to prevent busy-looping.
Unfortunately I don't believe there's any way to use the socket.http functions with socket.select; the code in the PiL example is often all you'll need, but it doesn't handle some fairly common cases such as the requested URL sending a redirect.
I've a "best practice" question on CouchDB (actually I'm using TouchDB a CouchDB port to iOS), when using CouchCocoa framework.
I need to delete a bunch of documents that I get via a query.
I know 3 ways to do this:
1) put all the documents into an NSArray, then use [CouchDatabase deleteDocuments:]
2) for each query row, call the delete method, like:
for (CouchQueryRow* row in query.rows)
[row.document DELETE];
3) create a query that emits the _id and _rev properties, add the _deleted property, then use the bulk update, like:
[couchDatabase putChanges:]
Which is better performance-wise? Is there a better way to do it?
At the HTTP API level, the fastest way to achieve this is to run a single batch request that provides the _id and current _rev of all documents to be removed.
Your job is to make sure that CouchCocoa actually does this. I know that CouchCocoa will try to cache the _rev of documents it reads, so if you are deleting documents that have just been read, [CouchDatabase deleteDocuments:] should be enough; otherwise you will have to call [CouchDatabase getDocumentsWithIDs:] first.
If your documents are very large, it might become better to get the _rev using a view instead of a bulk fetch. This forces you to use [CouchDatabase putChanges:] to perform the bulk deletion. I don't know where the document size threshold lies, so you will have to benchmark this one.
Of course, you also need to decide what happens when a conflict occurs.