I'm trying to use SnakeYAML for stream processing of (big) YAML documents.
Currently, I'm stuck with the »present« step. It seems that the »present« process is not available in SnakeYAML, or at least I'm unable to find it: I can parse a string into Events, but I cannot put Events back together into a string.
Have I overlooked a »present« process in SnakeYAML? Or is there some third party code out there that can perform the »present« step?
I don't have enough memory to hold a full Node graph.
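You can perform the Present process using the org.yaml.snakeyaml.emitter.Emitter class.
I overlooked it the first few times because the void emit(Event) method signature confused me: it returns nothing, because the output is written to the Writer that was passed to the Emitter constructor.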
I'm implementing various kinds of data processing pipelines using dask.distributed. Usually the original data is read from S3, and in the end the processed (large) collection is written to CSV on S3 as well.
I can run the processing asynchronously and monitor progress, but I've noticed that all to_xxx() methods that store collections to file(s) seem to be synchronous calls. One downside is that the call blocks, potentially for a very long time. Another is that I cannot easily construct a complete graph to be executed later.
Is there a way to run e.g. to_csv() asynchronously and get a future object instead of blocking?
PS: I'm pretty sure that I can implement async storage myself, e.g. by converting the collection to delayed() and storing each partition. But it seems like a common case. Unless I missed an already existing feature, it would be nice to have something like this included in the framework.
Most to_* functions have a compute=True keyword argument that can be replaced with compute=False. In these cases it will return a sequence of delayed values that you can then compute asynchronously:
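# compute=False only builds the write tasks; nothing is written yet
values = df.to_csv('s3://...', compute=False)
# client is an existing dask.distributed Client; compute() returns futures without blocking
futures = client.compute(values)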
I want to save the final data from the console output to a file, without the intermediate output.
How can I do that?
The report module embeds all of its data in the HTML report as JSON. You can get some info from there (cumulative percentiles, for example). You don't even have to modify the Python code in that case; just add some JS to the page that generates a table.
On the other hand, if you want more than the info included there, you should implement it in the report module.
What particular pieces of last screen data are you interested in?
P.S. By the way, you can create a couple of templates and then set the template parameter in the report section of load.ini to specify which one you want to use.
This screen is a good report only for "const" benchmarking. With "line" and "step" ramping, the last screen always shows the worst timings and resource usage. But we are thinking about this feature request.
I recently wrote a mailing platform for one of our employees to use. The system runs great, scales great, and is fun to use. However, it is currently inoperable due to a bug that I can't figure out how to fix (fairly inexperienced developer).
The process goes something like this...
Upload a CSV file to a specific FTP directory.
Go to the import_mailing_list page.
Choose a CSV file within the FTP directory.
Name and describe what the list contains.
Associate file headings with database columns.
Then, the back-end loops over each line of the file, associating the values with a heading, and importing these values into a database.
This all works wonderfully, except in a specific case, when a raw CSV is not correctly formatted. For example...
fname, lname, email
Bob, Schlumberger, bob#bob.com
Bobbette, Schlumberger
Another, Record, goeshere#email.com
As you can see, the second record is missing a comma (and with it the email value). This would cause an error when attempting to read "valArray[3]" (or valArray[2], in the case of every language but mine).
I am looking for the most efficient solution to keep this error from happening. Perhaps I should check the array length, and compare it to the index we're going to attempt to pull, before pulling it. But to do this for each and every value seems inefficient. Anybody have another idea?
Our stack is ColdFusion 8/9 and MySQL 5.1. ColdFusion arrays are 1-based, which is why I refer to the array index as [3].
There's ArrayIsDefined(array, elementIndex), or ArrayLen(array)
seems inefficient?
You gotta code what you need to code, forget about inefficiency. Get it right before you get it fast (when needed).
I suppose if you are looking for another way of doing this (instead of checking the array length each time, although that really doesn't sound that bad to me), you could wrap each line insert attempt in a try/catch block. If it fails, then stuff the failed row in a buffer (including the line number and error message) that you could then display to the user after the batch has completed, so they could see each of the failed lines and why they failed. This has the advantages of 1) not having to explicitly check the array length each time and 2) catching other errors that you might not have anticipated beforehand (maybe a value is too long for your field, for example).
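The two suggestions above (one length check per row, plus a try/catch that collects failed rows) can be combined. The following is only a sketch in Python rather than ColdFusion; import_rows, the insert callback, and SAMPLE are names made up for this illustration, and the real insert would be your database call.

```python
import csv
import io

# Sample data from the question; the second record is missing a value.
SAMPLE = "\n".join([
    "fname, lname, email",
    "Bob, Schlumberger, bob#bob.com",
    "Bobbette, Schlumberger",
    "Another, Record, goeshere#email.com",
])


def import_rows(csv_text, insert):
    """Import every well-formed row and collect the failures.

    insert is a hypothetical callback standing in for the database insert.
    Returns (imported, failed); failed holds (line_number, values, message).
    """
    imported, failed = [], []
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    header = next(reader)
    # The header is line 1, so the first data row is line 2.
    for line_number, values in enumerate(reader, start=2):
        try:
            # One length check per row, not one per value.
            if len(values) != len(header):
                raise ValueError(
                    f"expected {len(header)} values, got {len(values)}"
                )
            record = dict(zip(header, values))
            insert(record)  # may raise for reasons we did not anticipate
            imported.append(record)
        except Exception as exc:
            # Keep the row and the reason so they can be shown to the user.
            failed.append((line_number, values, str(exc)))
    return imported, failed


if __name__ == "__main__":
    stored = []
    ok, bad = import_rows(SAMPLE, stored.append)
    for line_number, values, message in bad:
        print(f"line {line_number}: {values} -> {message}")
```

Checking the length once per row keeps the cost negligible, and the except branch still catches errors raised by the insert itself.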
Since I can't find any material that doesn't just point back to the original Microsoft documents on this matter, or source code that answers the questions I have, I thought I might ask a few things here. (The Delphi tag is there because that's the development environment for the code I'm writing.)
That said, I have a few questions the API document doesn't answer. First one: the fdi_notify messages. What is "my responsibility" in handling fdintCABINET_INFO, fdintPARTIAL_FILE, fdintNEXT_CABINET, and fdintENUMERATE? I'll illustrate what I mean with an example. For fdintCLOSE_FILE_INFO, "my responsibility" is to close the file for the handle given to me, and to set the file's date and time according to the data passed in fdi_notify.
I figure I'm missing something since my code isn't handling extracting spanned CAB files...any thoughts on how to do this?
What you're more than likely running into is that FDICopy only reads the cab you passed in. It will use fdintNEXT_CABINET to get spanned data for any files you extract in response to fdintCOPY_FILE, but it only calls fdintCOPY_FILE for files that start on that first cab.
To get a directory listing for the entire set, you need to call FDICopy in a loop. Every time you get an fdintCABINET_INFO event, save off the psz1 parameter (next cab name). When FDICopy returns, check that. If it's an empty string, you're done; if not, call FDICopy again with the next cab as the new path.
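The shape of that loop, sketched in Python rather than Delphi and without touching the real API: fdi_copy below is a hypothetical stand-in for one FDICopy call, and fake_fdi_copy and FAKE_SET exist only so the sketch can run. In real code the callback would be your fdi_notify function and the event would be the fdint value.

```python
def walk_cabinet_set(first_cab, fdi_copy):
    """Call fdi_copy once per cabinet, following the next-cabinet names.

    fdi_copy(cab_name, notify) is a hypothetical stand-in for FDICopy; it is
    expected to call notify(event, psz1) the way FDICopy calls the callback.
    Returns the cabinet names in the order they were processed.
    """
    visited = []
    cab = first_cab
    while cab:
        state = {"next_cab": ""}

        def notify(event, psz1):
            if event == "fdintCABINET_INFO":
                # psz1 carries the next cabinet name; empty on the last one.
                state["next_cab"] = psz1
            return 0  # continue processing

        fdi_copy(cab, notify)
        visited.append(cab)
        cab = state["next_cab"]  # empty string ends the loop
    return visited


# Made-up three-part set: each cabinet names its successor.
FAKE_SET = {"disk1.cab": "disk2.cab", "disk2.cab": "disk3.cab", "disk3.cab": ""}


def fake_fdi_copy(cab_name, notify):
    """Pretend to process one cabinet and report its successor."""
    notify("fdintCABINET_INFO", FAKE_SET[cab_name])


if __name__ == "__main__":
    print(walk_cabinet_set("disk1.cab", fake_fdi_copy))
```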
fdintCABINET_INFO: The only responsibility for this is returning 0 to continue processing. You can use the information provided (the path of the next cabinet, next disk, path name, and set ID), but you don't need to.
fdintPARTIAL_FILE: Depending on how you're processing your cabs, you can probably ignore this. You'll only see it for the second and later images in a set, and it's to tell you that the particular entry is continued from a previous cab. If you started at the first cab in the set you'll have already seen an fdintCOPY_FILE for the file. If you're processing random .cabs, you won't really be able to use it either, since you won't have the start of the file to extract.
fdintNEXT_CABINET: You can use this to prompt the user for a new directory for the next cabinet, but for simple spanning support just return 0 if the passed-in filename is valid or -1 if it isn't. If you return 0 and the cab isn't valid, or is the wrong one, this will get called again. The easiest approach (if you don't request a new disk/directory) is just to check pfdin^.fdie. If it's FDIError_None, this is the first call for the requested cab, so you can return 0. If it's anything else, FDI has already tried to open the requested cab at least once, so you can return -1 as an error.
fdintENUMERATE: I think you can ignore this. It isn't covered in the documentation, and the two cab libraries I've looked at don't use it. It may be a leftover from a previous API version.
I want to get the PEB from the "notepad.exe" process. Does someone know how to do it?
I tried the GetModuleHandle API, but it doesn't return a valid pointer (it returns zero every time), because it only finds modules loaded in the calling process.
For that reason, I want to know how to get it to work with EnumProcessModules or CreateToolhelp32Snapshot.
Matt Pietrek described how to do that in a 1994 Under the Hood column. It was about how to get the environment variables of another process, where the first step is to get a pointer to the PEB. To do that, he says, call NtQueryInformationProcess. The PROCESS_BASIC_INFORMATION structure it fills contains the base address of the PEB structure. (You'll need to use ReadProcessMemory to read it since the address will be in the context of the external process's address space, not yours.)
To call NtQueryInformationProcess, you'll need a handle to the process. If you started the process yourself (by calling CreateProcess), then you already have a handle. Otherwise, you'll need to find the process ID and then call OpenProcess. To get the process ID, search for the process you want with EnumProcesses or Process32First/Process32Next. (I prefer the latter because it provides more information with less work.)
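Put together, the steps look roughly like this. It is a sketch using Python's ctypes, not tested against every Windows version; it assumes the script and the target process have the same bitness and that you have enough rights to open the process. find_pid and peb_address are names invented here. Reading the PEB contents with ReadProcessMemory is left out; the sketch stops at the base address.

```python
import ctypes
import sys

PROCESS_QUERY_INFORMATION = 0x0400
PROCESS_VM_READ = 0x0010
TH32CS_SNAPPROCESS = 0x00000002
PROCESS_BASIC_INFORMATION_CLASS = 0  # ProcessBasicInformation
MAX_PATH = 260


class PROCESSENTRY32W(ctypes.Structure):
    _fields_ = [
        ("dwSize", ctypes.c_uint32),
        ("cntUsage", ctypes.c_uint32),
        ("th32ProcessID", ctypes.c_uint32),
        ("th32DefaultHeapID", ctypes.c_size_t),
        ("th32ModuleID", ctypes.c_uint32),
        ("cntThreads", ctypes.c_uint32),
        ("th32ParentProcessID", ctypes.c_uint32),
        ("pcPriClassBase", ctypes.c_int32),
        ("dwFlags", ctypes.c_uint32),
        ("szExeFile", ctypes.c_wchar * MAX_PATH),
    ]


class PROCESS_BASIC_INFORMATION(ctypes.Structure):
    _fields_ = [
        ("ExitStatus", ctypes.c_int32),
        ("PebBaseAddress", ctypes.c_void_p),
        ("AffinityMask", ctypes.c_size_t),
        ("BasePriority", ctypes.c_int32),
        ("UniqueProcessId", ctypes.c_size_t),
        ("InheritedFromUniqueProcessId", ctypes.c_size_t),
    ]


def find_pid(exe_name):
    """Return the PID of the first process named exe_name, or None."""
    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.CreateToolhelp32Snapshot.restype = ctypes.c_void_p
    k32.CreateToolhelp32Snapshot.argtypes = [ctypes.c_uint32, ctypes.c_uint32]
    entry_ptr = ctypes.POINTER(PROCESSENTRY32W)
    k32.Process32FirstW.argtypes = [ctypes.c_void_p, entry_ptr]
    k32.Process32NextW.argtypes = [ctypes.c_void_p, entry_ptr]
    k32.CloseHandle.argtypes = [ctypes.c_void_p]
    snapshot = k32.CreateToolhelp32Snapshot(TH32CS_SNAPPROCESS, 0)
    if snapshot in (None, ctypes.c_void_p(-1).value):  # INVALID_HANDLE_VALUE
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        entry = PROCESSENTRY32W()
        entry.dwSize = ctypes.sizeof(PROCESSENTRY32W)
        more = k32.Process32FirstW(snapshot, ctypes.byref(entry))
        while more:
            if entry.szExeFile.lower() == exe_name.lower():
                return entry.th32ProcessID
            more = k32.Process32NextW(snapshot, ctypes.byref(entry))
        return None
    finally:
        k32.CloseHandle(snapshot)


def peb_address(pid):
    """Return the PEB base address, valid only inside the target process."""
    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    ntdll = ctypes.WinDLL("ntdll")
    k32.OpenProcess.restype = ctypes.c_void_p
    k32.OpenProcess.argtypes = [ctypes.c_uint32, ctypes.c_int, ctypes.c_uint32]
    k32.CloseHandle.argtypes = [ctypes.c_void_p]
    ntdll.NtQueryInformationProcess.restype = ctypes.c_int32
    ntdll.NtQueryInformationProcess.argtypes = [
        ctypes.c_void_p, ctypes.c_int, ctypes.c_void_p,
        ctypes.c_uint32, ctypes.POINTER(ctypes.c_uint32),
    ]
    access = PROCESS_QUERY_INFORMATION | PROCESS_VM_READ
    handle = k32.OpenProcess(access, False, pid)
    if not handle:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        info = PROCESS_BASIC_INFORMATION()
        returned = ctypes.c_uint32(0)
        status = ntdll.NtQueryInformationProcess(
            handle, PROCESS_BASIC_INFORMATION_CLASS,
            ctypes.byref(info), ctypes.sizeof(info), ctypes.byref(returned),
        )
        if status != 0:  # any non-zero NTSTATUS is treated as failure here
            raise OSError(f"NtQueryInformationProcess: 0x{status & 0xFFFFFFFF:08X}")
        return info.PebBaseAddress
    finally:
        k32.CloseHandle(handle)


if __name__ == "__main__" and sys.platform == "win32":
    pid = find_pid("notepad.exe")
    if pid is not None:
        print(hex(peb_address(pid)))
```

The address returned is only meaningful in the other process's address space, so any field of the PEB still has to be fetched with ReadProcessMemory using a handle opened with PROCESS_VM_READ.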