How does one merge texts together? like a version control system would? - ruby-on-rails

Wikipedia says this is pretty good: http://en.wikipedia.org/wiki/Merge_(revision_control)#Three-way_merge
But how does one implement that? or are there any gems / plugins for Ruby on Rails that will handle that for me?
My situation:
• I have base text
• changes from person A
• changes from person B
• both changes should be included and not overriding the other
any directions I could be pointed in? thanks!

I think you should look at the merge3 gem again [source].
This small example explains it:
require 'rubygems'
require 'merge3'
start = <<TEXT
This is the baseline.
The start.
The end.
TEXT
changed_A = <<TEXT
This is the baseline.
The start (changed by A).
The end.
TEXT
changed_B = <<TEXT
This is the baseline.
The start.
B added this line.
The end.
TEXT
result = Merge3::three_way(start, changed_A, changed_B)
puts result
The output it generates is:
This is the baseline.
The start (changed by A).
B added this line.
The end.
I am not sure how it handles merge conflicts, and since it is supposed to handle 3-way merges of files, it seems to be line-based. If that is a problem (as your example tries to compare simple string), you could add a newline between every character.
Hope this helps.

If you're storing versioned text that you want to be able to merge, then it sounds like you have a perfect use case for calling a version control system. Store the text in files and call the VCS for version control operations (perhaps the Git or Grit gems would be helpful).

Related

Biopython PDBIO assembly chain IDs

I am using Bio.PDB to parse structures in mmCIF and PDB format. I realised that PDBIO does not deal well with two-character chain identifiers (like ‘AA’ or ‘AB’) found in assembly structures. I have made a slight change to the code that fits me. Attached you will find the modified PDBIO module. What it does basically is checking the length of the chain identifier string and adds a space in front of it, if is a single character. The formatting string is modified accordingly.
These are my changes in Bio.PDB.PDBIO module. Please consider it putting it in a future update.
Modified:
_ATOM_FORMAT_STRING = "%s%5i %-4s%c%3s%s%4i%c %8.3f%8.3f%8.3f%s%6.2f %4s%2s%2s\n"
Modified:
for chain in model.get_list():
if not select.accept_chain(chain):
continue
chain_id = chain.get_id()
if len(chain_id)==1: #Added line
chain_id = ' {}'.format(chain_id) #Added line
Modified:
fp.write("TER %5i %3s %s%4i%c \n
Stackoverflow is a site to ask questions. What you are proposing is a change to BioPython software. Luckily, BioPython is open-source, so you create a pull request so that your change can be added to the software.
Go to https://github.com/biopython/biopython/blob/master/Bio/PDB/PDBIO.py
Click on the arrow icon in the top right corner. This will create a fork of the BioPython repository
Make the changes that you mentioned above in your fork and a title and a description:
Click on propose file change. You can now visually compare your modifications side by side.
If everything looks OK, click on create pull request. This will send a pull request to the master branch of the BioPython repository. There it will be reviewed. If the authors of the BioPython software agree that this is a usefull change, they will merge it into the sofware.

Options for MeCab Japanese tokenizer on iOS?

I'm using the iPhone library for MeCab found at https://github.com/FLCLjp/iPhone-libmecab . I'm having some trouble getting it to tokenize all possible words. Specifically, I cannot tokenize "吉本興業" into two pieces "吉本" and "興業". Are there any options that I could use to fix this? The iPhone library does not expose anything, but it uses C++ underneath the objective-c wrapper. I assume there must be some sort of setting I could change to give more fine-grained control, but I have no idea where to start.
By the way, if anyone wants to tag this 'mecab' that would probably be appropriate. I'm not allowed to create new tags yet.
UPDATE: The iOS library is calling mecab_sparse_tonode2() defined in libmecab.cpp. If anyone could point me to some English documentation on that file it might be enough.
There is nothing iOS-specific in this. The dictionary you are using with mecab (probably ipadic) contains an entry for the company name 吉本興業. Although both parts of the name are listed as separate nouns as well, mecab has a strong preference to tag the compound name as one word.
Mecab lacks a feature that allows the user to choose whether or not compounds should be split into parts. Note that such a feature is generally hard to implement because not everyone agrees on which compounds can be split and which ones can't. E.g. is 容疑者 a compound made up of 容疑 and 者? From a purely morphological point of view perhaps yes, but for most practical applications probably no.
If you have a list of compounds you'd like to get segmented, a quick fix is to create a user dictionary for the parts they consist of, and make mecab use this in addition to the main dictionary.
There is Japanese documentation on how to do this here. For your particular example, it would involve the steps below.
Make a user dictionary with two entries, one for 吉本 and one for 興業:
吉本,,,100,名詞,固有名詞,人名,名,*,*,よしもと,ヨシモト,ヨシモト
興業,,,100,名詞,一般,*,*,*,*,こうぎょう,コウギョウ,コウギョウ
I suspect that both entries exist in the default dictionary already, but by adding them to a user dictionary and specifying a relatively low specificness indicator (I've used 100 for both -- the lower, the more likely to be split), you can get mecab to tend to prefer the parts over the whole.
Compile the user dictionary:
$> $MECAB/libexec/mecab/mecab-dict-index -d /usr/lib64/mecab/dic/ipadic -u mydic.dic -f utf-8 -t utf-8 ./mydic
You may have to adjust the command. The above assumes:
Mecab was installed from source in $MECAB. If you use mecab installed by a package manager, you might have difficulties finding the mecab-dict-index tool. Best install from source.
The default dictionary is in /usr/lib64/mecab/dict/ipadic. This is not part of the mecab package; it comes as a separate package (e.g. this) and you may have difficulties finding this, too.
mydic is the name of the user dictionary created in step 1. mydic.dic is the name of the compiled dictionary you'll get as output (needs not exist).
Both the system dictionary (-t option) and the user dictionary (-f option) are encoded in UTF-8. This may be wrong, in which case you'll get an error message later when you use mecab.
Modify the mecab configuration. In a system-wide installation, this is a file named /usr/lib64/mecab/dic/ipadic/dicrc or similar. In your case it may be located somewhere else. Add the following line to the end of the configuration file:
userdic = home/myhome/mydic.dic
Make sure the absolute path to the dictionary compiled above is correct.
If you then run mecab against your input, it will split the compound into its parts (I tested it, using mecab 0.994 on a Linux system).
A more thorough fix would be to get the source of the default dictionary and manually remove all compoun nouns you want to get split, then recompile the dictionary. As a general remark, using a CJK tokenizer for a serious application in production mode over a longer period of time usually involves a certain amount of dictionary maintenance (adding/removing entries) regularly.

Rails: Hash#to_json - split over multiple lines?

I'm having a slight annoyance with the javascript_I18n plugin, which generates js-friendly versions of the I18n translation tables, so that you can localise your javascript. It all works fine, but it works by calling to_json on each locale's translations hash and outputting the results into a file. When you call to_json on a hash the resulting string is all a single lne, which means you end up with files with an enormous line at the end.
This in turn is stopping git from being able to merge any changes, because the merge works on a line by line basis and can't deal with a single massive line with changes in the middle somewhere. It's not a massive problem because i can always just re-generate the js-friendly translation file with a rake task that comes with the plugin (which replaces the mid-merge file with a brand new one which i can just commit in), but it's just a bit annoying. It occurred to me that if the json was output on different lines, instead of all the same line, then it wouldn't be a problem, and it wouldn't even make the file that much bigger, just inserting two characters (\n) per line.
Before i try and hack the resulting string with a gsub to split it onto seperate lines, is there a nicer way to call to_json on a hash and output the results onto seperate lines? Or a nicer way to solve this problem generally? (i can't find much useful in the documentation for javascript_I18n).
Grateful for any advice - max
Not to answer you question, but to give a suggestion:
It would probably be easier to just ignore all the generated js translation files.
This is because even if you separate the translation into multiple lines, there are still chances of merge conflicts, which you probably already resolved once in your yaml file.
I would set it up this way:
1). gitignore all the js translation files.
2). In ActionController, add a before filter to generate the js files automatically on each load in development mode only.
3). Tweak your deploy.rb file to generate the js files after code update.
No more merge conflicts! (at least for js translation files) :D
Aaron Qian

Pretty Print for (Informix-)4gl code

i'm searching for a pretty print program (script, code, whatever) for Informix-4GL sources.
Do you know any ? Than you, Peter.
Have you looked at the IIUG (International Informix User Group) software archive? There are two pretty printers there (of indeterminate quality).
The other place to look would be the Aubit4GL site - an open source variant of I4GL. Again, I'm not sure that they have a pretty-printer, but it might be something they have (though a casual check doesn't show one).
I don't know if anyone is reading this post anymore, but the easiest way to get some kind of nice "pretty print" of 4gl code is to view it in the Openedge Developer Studio, then use ctrl-I to set indention. You can adjust indention in the editor settings by saying the length of "tabs". (default is 4, I use 3)
Then do a ctrl-shift-f to make all command words uppercase.
Next, you can condense the code a few lines by moving all the "DO:" statements up a line next to the "THEN" statement with this regular expression search and replace.
ctrl-f:
search "\s*\n\s*DO[:]"
replace " DO:"
make sure you click the checkbox marked regular expressions.
At this point the code is nice and tidy.
Do a ctrl-a and ctrl-c to copy it to the clipboard.
paste it in Outlook as an email without sending. Print it in color.

Adding MS-Word-like comments in LaTeX

I need a way to add text comments in "Word style" to a Latex document. I don't mean to comment the source code of the document. What I want is a way to add corrections, suggestions, etc. to the document, so that they don't interrupt the text flow, but that would still make it easy for everyone to know, which part of the sentence they are related to. They should also "disappear" when compiling the document for printing.
At first, I thought about writing a new command, that would just forward the input to \marginpar{}, and when compiling for printing would just make the definition empty. The problem is you have no guarantee where the comments will appear and you will not be able to distinguish them from the other marginpars.
Any idea?
todonotes is another package that makes nice looking callouts. You can see a number of examples in the documentation.
Since LaTeX is a text format, if you want to show someone the differences in a way that they can use them (and cherry pick from them) use the standard diff tool (e.g., diff -u orig.tex new.tex > docdiffs). This is the best way to annotate something like LaTeX documents, and can be easily used by anyone involved in the production of a document from LaTeX sources. You can then use standard LaTeX comments in your patch to explain the changes, and they can be very easily integrated. If the document lives in a version control system of some sort, just use the VCS to generate a patch file that can be reviewed.
I have used changes.sty, which gives basic change colouring:
\added{new text}
\deleted{old text}
\replaced{new text}{old text}
All of these take an optional parameter with the initials of the author who did this change. This results in different colours used, and these initials are displayed superscripted after the changed text.
\replaced[MI]{new text}{old text}
You can hide the change marks by giving the option final to the changes package.
This is very basic, and comments are not supported, but it might help.
My little home-rolled "fixme" tool uses \marginpar where possible and goes inline in places (like captions) where that is hard to arrange. This works out because I don't often use margin paragraphs for other things. This does mean you can't finalize the layout until everything is fixed, but I don't feel much pain from that...
Other than that I heartily agree with Michael about using standard tools and version control.
See also:
Tips for collaboratively editing a LaTeX document (which addresses you main question...)
https://stackoverflow.com/questions/193298/best-practices-in-latex
and a self-plug:
How do I get Emacs to fill sentences, but not paragraphs?
You could also try the trackchanges package.
You can use the changebar package to highlight areas of text that have been affected.
If you don't want to do the markup manually (which can be tedious and interrupt the flow of editing) the neat latexdiff utility will take a diff of your document and produce a version of it with markup added to visually display the changes between the two versions in the typeset output.
This would be my preferred solution, although I haven't tested it out on large, multi-file documents.
The best package I know is Easy Review that provides the commenting functionality into LaTeX environment. For example, you can use the following simple commands such as \add{NEW TEXT}, \remove{OLD TEXT}, \replace{OLD TEXT}{NEW TEXT}, \comment{TEXT}{COMMENT}, \highlight{TEXT}, and \alert{TEXT}.
Some examples can be found here.
The todonotes package looks great, but if that proves too cumbersome to use, a simple solution is just to use footnotes (e.g. in red to separate them from regular footnotes).
Package trackchanges.sty works exactly the way changes.sty. See #Svante's reply.
It has easy to remember commands and you can change how edits will appear after compiling the document. You can also hide the edits for printing.

Resources