how to get chunks in diff3 algorithm implementation - diff3

I'm trying to implement diff3 algorithm and currently stuck at chunks creation stage. I already know how to get LCS between original file and "other" file and LCS between original file and "my" file. Which steps need to do to get chunks?

I don't think this really answers your question, but Subversion implements this layering exactly as you describe here. It follows the theory quite closely, so you might be able to re-use some pieces.
See http://svn.apache.org/viewvc/subversion/trunk/subversion/libsvn_diff/

Related

Parse batch of SequenceExample

There is function to parse SequenceExample --> tf.parse_single_sequence_example().
But it parses only single SequenceExample, which is not effective.
Is there any possibility to parse a batch of SequenceExamples?
tf.parse_example can parse many Examples.
Documentation for tf.parse_example contain a little info about SequenceExample:
Each FixedLenSequenceFeature df maps to a Tensor of the specified type (or tf.float32 if not specified) and shape (serialized.size(), None) + df.shape. All examples in serialized will be padded with default_value along the second dimension.
But it is not clear, how to do that. Have not found any examples in google.
Is it possible to parse many SequenceExamples using parse_example() or may be other function exists?
Edit:
Where can I ask question to tensorflow developers: does they plan to implement parse function for multiple SequenceExample -s?
Any help ll be appreciated.
If you have many small sequences where batching at this stage is important, I would recommend VarLenFeatures or FixedLenSequenceFeatures with regular Example protos (which, as you note, can be parsed in batches with parse_example). For examples of this, see the unit tests associated with example parsing (testSerializedContainingSparse parses Examples with FixedLenSequenceFeatures).
SequenceExamples are more geared toward cases where there is significant amounts of preprocessing work to be done for each SequenceExample (which can be done in parallel with queues). parse_example does does not support SequenceExamples.

Deap: Want to know the generation that created the best individual

I'm running a genetic algorithm program and can find the best individual at the end of the run (hof[0]), but i want to know which generation produced it. Is there any attributes of hof[0] that will help print the individual and the generation that created it.
I tried looking at the manuals and Google for answers but could not find it anywhere.
I also couldn't find a list of the attributes of individuals that I could print. Can someone point to the right link and documentation to that.
Thanks
This deap post suggest tracking the logbook, or explicitly adding the generation to the individual along with fitness:
https://groups.google.com/g/deap-users/c/r7fZbMwHg3I/m/BAzHh2ogBAAJ
For the latter:
If you are working with the algo locally(recommended if working beyond a tutorial as something always comes up like adding plotting or this very questions) then you can modify the fitness update line to resemble:
fitnesses = toolbox.map(toolbox.evaluate, invalid_ind)
for ind, fit in zip(invalid_ind, fitnesses):
ind.fitness.values = fit
ind.generation = gen # now we can: print(hof[0].gen)
if halloffame is not None:
halloffame.update(population)
There is no built in way to do this (yet/to the best of my knowledge), and implementing this so would probably be quite a large task. The simplest of which (simplest in thought, not in implementation) would be to change the individual to be a tuple, where tup[0] is the individual and tup[1] is the generation it was produced in, or something similar.
If you're looking for a hacky way, you could maybe try writing the children of each generation to a text file and cross-checking your final solution with the text file; but other than that I'm not sure.
You could always try posting on their Google Group, though it can take a couple of days for a reply.
Good luck!

Parsing haskell preserving comments / formatting

I want to do some source code transformation (automatic import list cleanup) and I'd like to preserve comments and formatting. I heard some stuff on and off about parsers that do this, I think for the ghc parser.
It looks like I might be able to do this with hs-src-exts Language.Haskell.Exts.Annotate and its SrcSpans by pulling things out of the file. I think the SrcsSpanInfo only covers the parsed parts, but I could theoretically figure out the comments by looking at what's in between. But it's not documented in much detail, and there are no helper functions I can find, and it looks like a hassle, e.g. there's no easy way to print out a parsed expression including formatting and comments. So I think it's not meant to be used in this way, it's just so you can highlight code in the file or something. My impression is that the author meant to use annotations to support this, but never got around to it.
It looks like neither yi nor leksah do this. I feel like HaRe might, but it's not super documented. Is there a haskell parser out there that does this?
The haskell-src-exts recently got support for preserving comments, and it already records src spans. I'm not sure if pretty printing is supported, but you could probably get that working.
The GHC parser also does similar things.

Will ANTLR Help? Different Suggestion?

Before I dive into ANTLR (because it is apparently not for the faint of heart), I just want to make sure I have made the right decision regarding its usage.
I want to create a grammar that will parse in a text file with predefined tags so that I can populate values within my application. (The text file is generated by another application.) So, essentially, I want to be able to parse something like this:
Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
In my application, TheFileName is stored as a String, and both sets of values are stored to an array. (This is just a sample, the file is much more complicated.) Anyway, am I at least going down the right path with ANTLR? Any other suggestions?
Edit
The files are created by the user and they define the areas via tags. So, it might look something like this.
Name: <string>TheFileName</string>
Values: <array>5 3 1 6 1 3</array>
Important Value: <double>3.45</double>
Something along those lines.
The basic question is how is the file more complicated? Is it basically more of the same, with a tag, a colon and one or more values, or is the basic structure of the other lines more complex? If it's basically just more of the same, code to recognize and read the data is pretty trivial, and a parser generator isn't likely to gain much. If the other lines have substantially different structure, it'll depend primarily on how they differ.
Edit: Based on what you've added, I'd go one (tiny) step further, and format your file as XML. You can then use existing XML parsers (and such) to read the files, extract data, verify that they fit a specified format, etc.
It depends on what control you have over the format of the file you are parsing. If you have no control then a parser-generator such as ANTLR may be valuable. (We do this ourselves for FORTRAN output files over which we have no control). It's quite a bit of work but we have now mastered the basic ANTLR lexer/parser strategy and it's starting to work well.
If, however, you have some or complete control over the format then create it with as much markup as necessary. I would always create such a file in XML as there are so many tools for processing it (not only the parsing, but also XPath, databases, etc.) In general we use ANTLR to parse semi-structured information into XML.
If you don't need for the format to be custom-built, then you should look into using an existing format such as JSON or XML, for which there are parsers available.
Even if you do need a custom format, you may be better off designing one that is dirt simple so that you don't need a full-blown grammar to parse it. Designing your own scripting grammar from scratch and doing a good job of it is a lot of work.
Writing grammar parsers can also be really fun, so if you're curious then you should go for it. But I don't recommend carelessly mixing learning exercises with practical work code.
Well, if it's "much more complicated", then, yes, a parser generator would be helpful. But, since you don't show the actual format of your file, how could anybody know what might be the right tool for the job?
I use the free GOLD Parser Builder, which is incredibly easy to use, and can generate the parser itself in many different languages. There are samples for parsing such expressions also.
If the format of the file is up to the user can you even define a grammar for it?
Seems like you just want a lexer at best. Using ANTLR just for the lexer part is possible, but would seem like overkill.

How can I use NLP to parse recipe ingredients?

I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as 1 cup flour, the peel of 2 lemons and 1 cup packed brown sugar etc. What would be the best way of doing this? I am interested in using python for the project so I am assuming using the nltk is the best bet but I am open to other languages.
I actually do this for my website, which is now part of an open source project for others to use.
I wrote a blog post on my techniques, enjoy!
http://blog.kitchenpc.com/2011/07/06/chef-watson/
The New York Times faced this problem when they were parsing their recipe archive. They used an NLP technique called linear-chain condition random field (CRF). This blog post provides a good overview:
"Extracting Structured Data From Recipes Using Conditional Random Fields"
They open-sourced their code, but quickly abandoned it. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
If you're looking for a ready-made solution, several companies offer ingredient parsing as a service:
Zestful (full disclosure: I'm the author)
Spoonacular
Edamam
I guess this is a few years out, but I was thinking of doing something similar myself and came across this, so thought I might have a stab at it in case it is useful to anyone else in f
Even though you say you want to parse free test, most recipes have a pretty standard format for their recipe lists: each ingredient is on a separate line, exact sentence structure is rarely all that important. The range of vocab is relatively small as well.
One way might be to check each line for words which might be nouns and words/symbols which express quantities. I think WordNet may help with seeing if a word is likely to be a noun or not, but I've not used it before myself. Alternatively, you could use http://en.wikibooks.org/wiki/Cookbook:Ingredients as a word list, though again, I wouldn't know exactly how comprehensive it is.
The other part is to recognise quantities. These come in a few different forms, but few enough that you could probably create a list of keywords. In particular, make sure you have good error reporting. If the program can't fully parse a line, get it to report back to you what that line is, along with what it has/hasn't recognised so you can adjust your keyword lists accordingly.
Aaanyway, I'm not guaranteeing any of this will work (and it's almost certain not to be 100% reliable) but that's how I'd start to approach the problem
This is an incomplete answer, but you're looking at writing up a free-text parser, which as you know, is non-trivial :)
Some ways to cheat, using knowledge specific to cooking:
Construct lists of words for the "adjectives" and "verbs", and filter against them
measurement units form a closed set, using words and abbreviations like {L., c, cup, t, dash}
instructions -- cut, dice, cook, peel. Things that come after this are almost certain to be ingredients
Remember that you're mostly looking for nouns, and you can take a labeled list of non-nouns (from WordNet, for example) and filter against them.
If you're more ambitious, you can look in the NLTK Book at the chapter on parsers.
Good luck! This sounds like a mostly doable project!
Can you be more specific what your input is? If you just have input like this:
1 cup flour
2 lemon peels
1 cup packed brown sugar
It won't be too hard to parse it without using any NLP at all.

Resources