Parsing file in parallel

Parsing file in parallel - parsing

I am thinking about a way to parse a fasta-file in parallel. For those of you not knowing fasta-format an example:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
So lines starting with an '>' are header lines containing an identifier for the sequence following the identifier.
I suppose you load the entire file to memory but after this i am having trouble finding a way to process these data.
The problem is: Threads can not start at an arbitrary position because they could cut sequences this way.
Does someone has any experience in parsing files in parallel when the lines depend on each other? Any idea is appreciated.

Should be easy enough, since the dependence of lines on each other is very simple in this case: just make the threads start in an arbitrary position and then just skip the lines until they get to one that starts with a '>' (i.e. starts a new sequence).
To make sure no sequence gets processed twice, keep a set of all sequence IDs that have been processed (or you could do it by line number if the sequence IDs aren't unique, but they really should be!).

Do a preprocessing step, walk through the data once, and determine all valid start points. Let's call these tasks. Then you can simply use a worker-crew model, where each worker repeatedly asks for a task (a starting point), and parses it.

Related

How to create a Save/Load function on Scratch?

Im trying to make a game on Scratch that will use a feature to generate a special code, and when that code is input into a certain area it will load the stats that were there when the code was generated. I've run into a problem however, I don't know how to make it and I couldn't find a clear cut answer for how to make it.
I would prefer that the solution be:
Able to save information for as long as needed (from 1 second to however long until it's input again.)
Doesn't take too many blocks to make, so that the project won't take forever to load it.
Of course i'm willing to take any solution in order to get my game up and running, those are just preferences.

You can put all of the programs in a custom block with "Run without screen refresh" on so that the program runs instantly.
If you save the stats using variables, you could combine those variable values into one string divided by /s. i.e. join([highscore]) (join("/") (join([kills]) (/))
NOTE: Don't add any "/" in your stats, you can probably guess why.
Now "bear" (pun) with me, this is going to take a while to read
Then you need the variables:
[read] for reading the inputted code
[input] for storing the numbers
Then you could make another function that reads the code like so: letter ([read]) of (code) and stores that information to the [input] variable like this: set [input] to (letter ([read]) of (code)). Then change [read] by (1) so the function can read the next character of the code. Once it letter ([read]) of (code) equals "/", this tells the program to set [*stat variable*] to (input) (in our example, this would be [highscore] since it was the first variable we saved) and set [input] to (0), and repeat again until all of the stats variables are filled (In this case, it repeats 2 times because we saved two variables: [highscore] and [kills]).
This is the least amount of code that it takes. Jumbling it up takes more code. I will later edit this answer with a screenshot showcasing whatever I just said before, hopefully clearing up the mess of words above.

The technique you mentioned is used in many scratch games but there is two option for you when making the save/load system. You can either do it the simpler way which makes the code SUPER long(not joking). The other way is most scratchers use, encoding the data into a string as short as possible so it's easy to transfer.
If you want to do the second way, you can have a look at griffpatch's video on the mario platformer remake where he used a encode system to save levels.https://www.youtube.com/watch?v=IRtlrBnX-dY The tips is to encode your data (maybe score/items name/progress) into numbers and letters for example converting repeated letters to a shorter string which the game can still decode and read without errors
If you are worried it took too long to load, I am pretty sure it won't be a problem unless you really save a big load of data. The common compress method used by everyone works pretty well. If you want more data stored you may have to think of some other method. There is not an actual way to do that as different data have different unique methods for things working the best. Good luck.

Apache-camel Xpathbuilder performance

I have following question. I set up an camel -project to parse certain xml files. I have to selecting take out certain nodes from a file.
I have two files 246kb and 347kb in size. I am extracting a parent-child pair of 250 nodes in the above given example.
With the default factory here are the times. For the 246kb file respt 77secs and 106 secs. I wanted to improve the performance so switched to saxon and the times are as follows 47secs and 54secs. I was able to cut the time down by at least half.
Is it possible to cut the time further, any other factory or optimizations I can use will be appreciated.
I am using XpathBuilder to cut the xpaths out. here is an example. Is it possible to not to have to create XpathBuilder repeatedly, it seems like it has to be constructed for every xpath, I would have one instance and keep pumping the xpaths into it, maybe it will improve performance further.
return XPathBuilder.xpath(nodeXpath)
.saxon()
.namespace(Consts.XPATH_PREFIX, nameSpace)
.evaluate(exchange.getContext(), exchange.getIn().getBody(String.class), String.class);
Adding more details based on Michael's comments. So I am kind of joining them, will become clear with my example below. I am combining them into a json.
So here we go, Lets say we have following mappings for first and second path.
pData.tinf.rexd: bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:ReqdExctnDt/text()
pData.tinf.pIdentifi.instId://bm:Document/bm:xxxxx/bm:PmtInf[{0}]/bm:CdtTrfTxInf[{1}]/bm:PmtId/bm:InstrId/text()
This would result in a json as below
pData:{
tinf: {
rexd: <value_from_xml>
}
pIdentifi:{
instId: <value_from_xml>
}
}

Hard to say without seeing your actual XPath expression, but given the file sizes and execution time my guess would be that you're doing a join which is being executed naively as a cartesian product, i.e. with O(n*m) performance. There is probably some way of reorganizing it to have logarithmic performance, but the devil is in the detail. Saxon-EE is quite good at optimizing join queries automatically; if not, there are often ways of doing it manually -- though XSLT gives you more options (e.g. using xsl:key or xsl:merge) than XPath does.

Actually I was able to bring the time down to 10 secs. I am using apache-camel. So I added threads there so that multiple files can be read in separate threads. Once the file was being read, it had serial operation to based on the length of the nodes that had to be traversed. I realized that it was not necessary to be serial here so introduced parrallelStream and that now gave it enough power. One thing to guard agains is not to have a proliferation of threads since that can degrade the performance. So I try to restrict the number of threads to twice or thrice the number of cores on the operating machine.

Dask - Understanding diagnostics - memory:list

I am working on some fairly complex application that is making use of Dask framework, trying to increase the performance. To that end I am looking at the diagnostics dashboard. I have two use-cases. On first I have a 1GB parquet file split in 50 parts, and on second use case I have the first part of the above file, split over 5 parts, which is what used for the following charts:
The red node is called "memory:list" and I do not understand what it is.
When running the bigger input this seems to block the whole operation.
Finally this is what I see when I go inside those nodes:
I am not sure where I should start looking to understand what is generating this memory:list node, especially given how there is no stack button inside the task as it often happens. Any suggestions ?

Red nodes are in memory. So this computation has occurred, and the result is sitting in memory on some machine.
It looks like the type of the piece of data is a Python list object. Also, the name of the task is list-159..., so probably this is the result of calling the list Python function.

Control the tie-breaking choice for loading chunks in the scheduler?

I have some large files in a local binary format, which contains many 3D (or 4D) arrays as a series of 2D chunks. The order of the chunks in the files is random (could have chunk 17 of variable A, followed by chunk 6 of variable B, etc.). I don't have control over the file generation, I'm just using the results. Fortunately the files contain a table of contents, so I know where all the chunks are without having to read the entire file.
I have a simple interface to lazily load this data into dask, and re-construct the chunks as Array objects. This works fine - I can slice and dice the array, do calculations on them, and when I finally compute() the final result the chunks get loaded from file appropriately.
However, the order that the chunks are loaded is not optimal for these files. If I understand correctly, for tasks where there is no difference of cost (in terms of # of dependencies?), the local threaded scheduler will use the task keynames as a tie-breaker. This seems to cause the chunks to be loaded in their logical order within the Array. Unfortunately my files do not follow the logical order, so this results in many seeks through the data (e.g. seek halfway through the file to get chunk (0,0,0) of variable A, then go back near the beginning to get chunk (0,0,1) of variable A, etc.). What I would like to do is somehow control the order that these chunks get read, so they follow the order in the file.
I found a kludge that works for simple cases, by creating a callback function on the start_state. It scans through the tasks in the 'ready' state, looking for any references to these data chunks, then re-orders those tasks based on the order of the data on disk. Using this kludge, I was able to speed up my processing by a factor of 3. I'm guessing the OS is doing some kind of read-ahead when the file is being read sequentially, and the chunks are small enough that several get picked up in a single disk read. This kludge is sufficient for my current usage, however, it's ugly and brittle. It will probably work against dask's optimization algorithm for complex calculations. Is there a better way in dask to control which tasks win in a tie-breaker, in particular for loading chunks from disk? I.e., is there a way to tell dask, "all things being equal, here's the relative order I'd like you to process this group of chunks?"

Your assessment is correct. As of 2018-06-16 there is not currently any way to add in a final tie breaker. In the distributed scheduler (which works fine on a single machine) you can provide explicit priorities with the priority= keyword, but these take precedence over all other considerations.

Comparing using Map Reduce(Cloudera Hadoop 0.20.2) two text files of size of almost 3GB

I'm trying to do the following in hadoop map/reduce( written in java, linux kernel OS)
Text files 'rules-1' and 'rules-2' (total 3GB in size) contains some rules, each rule are separated by endline character, so the files can be read using readLine() function.
These files 'rules-1' and 'rules-2' needs to be imported as a whole from hdfs in every map function in my cluster i.e. these file are not splittable across different map function.
Input to the mapper's map function is a text file called 'record' (each line is terminated by endline character), so from the 'record' file we get the (key, value) pair. The file is splittable and can be given as input to different map function used in the whole map/reduce process.
What needs to be done is compare each value(i.e. lines from record file) with the rules inside 'rules-1' and 'rules-2'
Problem is, if I pull out each line of rules-1 and rules-2 files to a static arraylist only once, so that each mapper can share the same arraylint and try to compare elements in the arraylist with the each input value from the record file, I get a memory overflow error, since 3GB cannot be stored at a time in the arraylist.
Alternatively, if I import only few lines from the rules-1 and rules-2 files at a time and compare them to each value, map/reduce is taking a lot time to finish its job.
Could you guys provide me any other alternative ideas how can this be done without the memory overflow error? Will it help if I put those file-1 and file-2 inside a hdfs supporting database or something? I'm going out of ideas actually.Would really appreciate if some of you guys could provide me your valuable suggestions.

Iif you input files are small - you can load them into static variables and use rules as an input.
If above is not a case I can suggest the following ways:
a) To give rule-1 and rule-2 high replication factor close to the number of nodes you have. Then you can read from HDFS rule=1 and rule-2 for each record in the input relatively efficient - because it will be sequential read from the local datanode.
b) If you can consider some hash function which, when applied to the rule and to the input string will predict without false negatives that they can match - then you can emit this hash for rules, input record and resolve all possible matches in the reducer. It will be very similar to the way how a join is done using MR
c) I would consider some other optimization techniques like building search trees, or sorting since otherwise the problem looks computationally expensive and will took forever...

On this page find Real-World Cluster Configurations
it will cover file size configuration

You could use the param "mapred.child.java.opts" in conf/mapred-site.xml to increase the memory for your mappers. You might not be able to run as many map slots per server but with more servers in your cluster you could still parallelize your job.

Read the content text file from the MapReduce function and read the keyword text file from the mapper function (for reading your HDFS) and split using StringTokenizer value.toString reading from MapReduce and in your mapper function write HDFS read text file code it will read line-by-line so use two while loops here you compare. Whenever you want data send it to reducer.

Split the 3gb text file into several text files and apply that all text files as usual MapReduce your previous program.
For splitting text file I written Java program and you decide how many lines you want write in each text file.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart