I want to do some translations on a project that consists of multiple files for different apps. However, to keep things consistent across all files, it would be useful to have a translation tool that can load a bunch of .po files and, e.g., cross-check them for identical or similar reference msgid strings to make sure the translations stay consistent. Perhaps it could also allow translating multiple files/strings in one go when the reference is the same.
Does something like this exist..?
I had to do the same thing for a project (CiviCRM). One suggestion I received was to check with OpenRefine, which presumably has a few tools to find similar strings, but I wanted to automate the process with something simple, so I wrote a short script.
Fair warning, this is not the most efficient, and it can take a while to run on big projects (we have around 16 000 strings in CiviCRM).
For reference:
https://github.com/civicrm/l10n/blob/master/bin/find-similar-strings.php
And since SO doesn't like links as answers, more details here:
#!/usr/bin/php
<?php

/**
 * Reads from STDIN and finds similar-looking strings.
 *
 * Usage:
 *   cat *.pot | ../bin/find-similar-strings.php
 *
 * Context:
 *   http://forum.civicrm.org/index.php/topic,34805.0.html
 */

// Default match threshold is a 90% match.
$threshold = (! empty($argv[1]) ? $argv[1] : 90);

// Read all input from stdin.
$src = file_get_contents("php://stdin");

// http://stackoverflow.com/a/1070937/2387700
// Extract all "msgid" strings (they can be multi-line).
preg_match_all('/msgid\s+\"([^\"]*)\"/', $src, $matches);
$msgids = $matches[1];

// Sort the strings alphabetically, to make them easier to compare.
// sort($msgids);

foreach ($msgids as $key1 => $msgid1) {
  foreach ($msgids as $key2 => $msgid2) {
    $percent = 0;

    if ($msgid1 && $msgid2 && $msgid1 != $msgid2) {
      if (similar_text($msgid1, $msgid2, $percent)) {
        if ($percent > $threshold) {
          $percent = (int) $percent;
          echo "$msgid1 [$percent %]\n";
          echo "$msgid2 \n\n";
        }
      }
    }
  }

  // To avoid comparing each pair of strings twice, unset the current
  // string so that the inner loop gets shorter on every iteration.
  unset($msgids[$key1]);
}
This reads the .pot file from stdin (source strings, but I guess you could run it on a .po file as well) and loops through all the strings one by one.
I hesitated to sort the strings alphabetically, but I found more than a few instances where strings had an incorrect space prepended to them, a typo, etc.
Another possible improvement would be to first check the lengths of the strings and skip pairs whose lengths are very different.
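For illustration, here is roughly what that length pre-check could look like. This is only a sketch in Python (not PHP, and not part of the script above); it uses difflib for the similarity score, and the 50% cut-off and file name are assumptions:
#!/usr/bin/env python3
# Sketch only: find similar msgid strings, skipping pairs whose lengths differ a lot.
import re
import sys
from difflib import SequenceMatcher

THRESHOLD = 0.90     # assumed: the same 90% idea as the PHP script
MAX_LEN_DIFF = 0.5   # assumed: skip pairs whose lengths differ by more than 50%

src = sys.stdin.read()
msgids = [m for m in re.findall(r'msgid\s+"([^"]*)"', src) if m]

while msgids:
    a = msgids.pop()                # like the unset() trick: never compare a pair twice
    for b in msgids:
        # Cheap pre-filter: strings with very different lengths cannot be >90% similar.
        if abs(len(a) - len(b)) > MAX_LEN_DIFF * max(len(a), len(b)):
            continue
        ratio = SequenceMatcher(None, a, b).ratio()
        if ratio > THRESHOLD:
            print(f"{a} [{int(ratio * 100)} %]")
            print(f"{b}\n")
You would run it the same way, e.g. cat *.pot | python3 find-similar-strings.py (file name is illustrative).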
Related
I have created a DXL script that goes through every row of a couple of modules. I am printing out certain rows and their information. I do this with a for loop that goes through the rows; if it hits a row I am interested in, I save the elements in that row's columns to different string variables and print those variables. The script does not take too long to run if the module does not have many rows I am interested in, but if I want to run multiple modules at the same time, or if a module has a lot of rows I am interested in, the script can take hours. I can show the code that I have if this is not enough to come up with solutions. Any help would be appreciated!
I have tried using a skip list to store the print statements and then going through the skip list to print each value, but that did not make the script run any faster.
string sep=","
for o in m do
{
string ver1= o."column1"
if (checkIf(o) && (!(isDeleted(o))))
{
string ver2= o."column2"
string onum=number(o)
""
string otext = o."Object Text"
print ver1 sep ver2 sep onum
}
}
Initial optimization:
for o in m do
{
    if (checkIf(o) && (!(isDeleted(o)))) {
        // This doesn't appear to be used?
        // string otext = o."Object Text"
        print o."column1" "," o."column2" "," number(o) "\n"
    }
}
Reasoning: DOORS has a system in place called the string table that holds declared strings in memory, and it doesn't necessarily do the best job of clearing it out when appropriate. By constantly declaring strings in your loop, you might be bumping into the memory limits of that system.
The problem with this is that the results all end up in that little 'DXL editor' window and then have to be copied and pasted somewhere else to actually be used.
Secondary optimization:
// Turn off the run limit for timing
pragma runLim, 0

// Set file location - CHANGE FOR YOUR COMPUTER
string csv_location = "C:/Users/Username/Desktop/Info_Collection.csv"

// Open stream
Stream out = append csv_location

// Set headers
out << "Module,Column 1,Column 2,Object Number" "\n"

// Define your loop constraints
Module m = current
Object o

// Run your loop
for o in m do
{
    if (checkIf(o) && (!(isDeleted(o)))) {
        // This doesn't appear to be used?
        // string otext = o."Object Text"
        out << fullName(m) "," o."column1" "," o."column2" "," number(o) "\n"
    }
}

close out
This will let you run the same script in different modules, all outputting to the same CSV file, which you can then load into Excel or your data manipulation engine of choice.
This keeps the data collection happening outside of DOORS, so if something goes awry, you can track down where it occurred.
My third optimization would be to use a list of modules in, say, Excel as an input and run this analysis over all of them, but that might be going too far.
If this doesn't help, then we can start examining other issues.
Note: I still would like to know what 'checkIf' is/does.
If your objective is to speed up the execution of the script, then since most of the objects are of no interest to you, the most effective way I know of is to filter out the objects that are not interesting. For example, a filter of obj."Object Text" != "" would filter out headings; if you are just interested in requirements, use obj."Object Text" contains "[Ss]hall", and so on. Save it as a view for later use.
for o in m do { respects the display set, so if you don't touch most of the objects, it will speed things up a lot!
Hope this helps.
Don
I'm new to DXL and I want to extract the variables containing
_I_, _O_ and _IO_
from a module and export them to a CSV file. Please help me with this.
E.g.: ADCD_WE_I_DFGJDGFJ_12_QWE and CVFDFH_AWEQ_I_EHH[X] are set to some value.
This question has two parts.
1. You want to find variables that contain those parts in their name.
2. You want to export them to a .csv file.
Another person may be able to suggest a better way, but the only way that comes to mind right now for 1. is this:
Loop over the attributes in the module (for ad in m do {}) and get the attribute names as strings.
I am assuming that your attributes are valued at _I_, _O_ or _OI_? Like alpha = "_I_"? Are these enumerated values?
If they are, then you only need to check the value of each object's attribute. If one of those is the attribute's value, then add it to something like a skip list. Having a counter here would be useful, maybe one for each attribute, like countI, countO and countOI; you can then use the counters as keys for the put() function of the skip list.
Once you have found all the attributes, you can move on to writing the CSV:
Stream out = write("filepathname/filename.csv") // to declare the stream
out << "_I_, _O_, _OI_\n"
Then you could loop over your skip lists at the same time:
int ijk = 0; bool finito = false
while (!finito) {
    if (ijk < countI) {
        at = get(skipListI, ijk)
        out << at.name ","
    }
    else out << ","

    if (ijk < countO) {
        at = get(skipListO, ijk)
        out << at.name ","
    }
    else out << ","

    if (ijk < countOI) {
        at = get(skipListOI, ijk)
        out << at.name "\n"
    }
    else out << "\n"

    ijk++
    // check if the next iteration would be outside of bounds on all lists
    if (ijk >= countI && ijk >= countO && ijk >= countOI) finito = true
}
Or instead of at.name, you could print out whatever part of the attribute you wanted. The name, the value, "name:value" or whatever.
I have not run this, so you will be left to do any troubleshooting.
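To make the intended output shape concrete, here is the same padding idea sketched in Python rather than DXL (purely illustrative, with made-up names): collect the matching names into three lists, then write them side by side so each CSV row has an _I_, _O_ and _OI_ column, padding the shorter lists with blanks.
import csv

# Purely illustrative input: attribute/variable names pulled from a module.
names = ["ADCD_WE_I_DFGJDGFJ_12_QWE", "CVFDFH_AWEQ_O_EHH", "FOO_OI_BAR"]

columns = {"_I_": [], "_O_": [], "_OI_": []}
for name in names:
    for marker, bucket in columns.items():
        if marker in name:
            bucket.append(name)

with open("variables.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["_I_", "_O_", "_OI_"])
    longest = max(len(bucket) for bucket in columns.values())
    for i in range(longest):
        # Pad the shorter columns with blanks, like the 'else out << ","' branches above.
        writer.writerow([columns[m][i] if i < len(columns[m]) else ""
                         for m in ("_I_", "_O_", "_OI_")])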
--
I hope this idea gets you started, write out what you want on paper first and then follow that plan. The key things I have mentioned that would be useful here are Skip lists, and Stream write (or append, if you want to keep adding).
In the future, please consider making your question clearer. Are you looking for those search terms in the name of the attribute, or in the value of the attribute? Are you looking to print out the names, the values, or what? What kind of format will your .csv have? Any information will help your question get answered.
I have a large dataset where subsets of variables have been entered with the same prefix, followed by an underscore and some details. They are all binary YN and the variables are all doubles. For example, I have the variables onsite_healthclinic and onsite_CBO where values can only be 1 or 0.
I want to rename them all according to the question they correspond to on the survey I'm working from (so the above variables would become q0052_healthclinic and q0052_CBO), but if I use the code below with substr I (obviously) get a type mismatch:
foreach var in onsite_healthclinic onsite_CBO {
    local new = substr(`var', 8, .)
    rename `new' q0052_`new'
}
My question is: is there another command besides substr that I can use so that I don't have to either a) convert all of the variables to strings first, or b) rename them all manually (there are ~20 in each subset, so while doable, it's a waste of time)?
There is no need for a loop here at all. Although the essential answer is one line long I give here a complete, self-contained answer.
clear
set obs 1

foreach v in onsite_healthclinic onsite_CBO {
    gen `v' = 1
}

rename onsite_* q0052_*

describe, fullnames
This answer implies that you've not studied the help under rename groups.
Will this work?
foreach var in onsite_healthclinic onsite_CBO {
    local new = substr("`var'", 8, .)
    rename onsite_`new' q0052_`new'
}
I added quotes around the local macro inside the substr() function and added onsite_ to the rename, and that seemed to work.
I noticed that Boost Spirit has some limits. In a question here on SO, a user asked for help with Boost Spirit, and the user who answered said that Boost Spirit works well with statements and not with "generic text" (I'm sorry if I don't recall it correctly).
Now I would like to think about PostScript and PDF in terms of tokens and simplify my approach to these formats that way. The problem is that PDF is a kind of mix between a markup language and a programming language, with jumps and tables in it, and I can't think of anything similar among the most popular file formats and languages, like XML or C++ code.
There is also another fact: I can't really find anyone with experience using boost::spirit to write a PDF parser or writer, so I'm asking: is boost::spirit capable of parsing a PDF file and outputting the elements as tokens?
Although this has nothing to do with Boost, let me assure you that parsing PDF (and PostScript) is about as trivial as you could want. Let's say that you have a scanner object that returns a series of tokens. The token types you will get from the scanner are:
String
Dict begin (<<)
Dict End (>>)
Name (/whatever)
Number
Hex array
Left Angle (<)
Right Angle (>)
Array begin ([)
Array end (])
Procedure begin ({)
Procedure end (})
Comment (%foo)
Word
My scanner is a finite-state automaton with states for Start, Comment, String, HexArray, Token, DictEnd, and Done.
The way you parse PDF is not by parsing it, but by executing it. Given these tokens, my "parser" looks like this (in C#):
while (true) {
    MLPdfToken token = scanner.GetToken();
    if (token == null)
        return MachineExit.EndOfFile;
    PdfObject obj = PdfObject.FromToken(token);
    PdfProcedure proc = obj as PdfProcedure;
    if (proc != null)
    {
        if (IsExecuting())
        {
            if (token.Type == PdfTokenType.RBrace)
                proc.Execute(this);
            else
                Push(obj);
        }
        else {
            proc.Execute(this);
        }
        if (proc.IsTerminal)
            return Machine.ParseComplete;
    }
    else {
        Push(obj);
    }
}
I'll also add that if you give every PdfObject an Execute() method, such that the base class implementation is machine.Push(this), and an IsTerminal that returns false, the REPL gets easier:
while (true) {
    MLPdfToken token = scanner.GetToken();
    if (token == null)
        return MachineExit.EndOfFile;
    PdfObject obj = PdfObject.FromToken(token);
    if (IsExecuting())
    {
        if (token.Type == PdfTokenType.RBrace)
            obj.Execute(this);
        else
            Push(obj);
    }
    else {
        obj.Execute(this);
        if (obj.IsTerminal)
            return Machine.ParseComplete;
    }
}
There's more support in Machine - Machine has a Stack of PdfObject and a few methods for accessing it (Push, Pop, Mark, CountToMark, Index, Dup, Swap), as well as ExecProcBegin and ExecProcEnd.
Beyond that, it's very light. The only thing that is slightly odd is that PdfObject.FromToken takes a token and, if it is a primitive type (number, string, name, hex, bool), returns a corresponding PdfObject. Otherwise, it takes the given token and looks in a "proc set" dictionary of procedure names associated with PdfProcedure objects. So when you encounter the token <<, it gets looked up in the proc set and comes up with this code:
void DictBegin(PdfMachine machine)
{
    machine.Push(new PdfMark(PdfMarkType.Dictionary));
}
So << really means "mark the stack as the start of a dictionary." >> gets more interesting:
void DictEnd(PdfMachine machine)
{
    PdfDict dict = new PdfDict();
    // PopThroughMark pops the entire stack up to the first matching mark,
    // throws an exception if it fails.
    PdfObject[] arr = machine.PopThroughMark(PdfMarkType.Dictionary);
    if ((arr.Length & 1) != 0)
        throw new PdfException("dictionaries need an even number of objects.");
    for (int i = 0; i < arr.Length; i += 2)
    {
        PdfObject key = arr[i], val = arr[i + 1];
        if (key.Type != PdfObjectType.Name)
            throw new PdfException("dictionaries need a /name for the key.");
        dict.put((PdfName)key, val);
    }
    machine.Push(dict);
}
So >> pops everything up to the nearest dictionary mark into an array, then puts each pair into the dictionary. Now, I could have done this without allocating the array: I could just pop pairs, putting them into the dictionary, until I either hit the mark, fail to get a name, or underflow the stack.
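To illustrate that pop-pairs alternative, here is a small sketch in Python rather than the answer's C#, with made-up names; the point is just that key/value pairs come straight off the stack until the dictionary mark is reached, so no intermediate array is needed:
class DictMark:
    """Hypothetical sentinel pushed onto the stack by the '<<' procedure."""
    pass

def dict_end(stack):
    """Build a dict by popping value/key pairs until the mark; no intermediate array."""
    d = {}
    while True:
        if not stack:
            raise ValueError("stack underflow: no dictionary mark found")
        val = stack.pop()
        if isinstance(val, DictMark):
            break
        if not stack or isinstance(stack[-1], DictMark):
            raise ValueError("dictionaries need an even number of objects")
        key = stack.pop()
        if not isinstance(key, str):  # str stands in for a PDF /name here
            raise ValueError("dictionaries need a /name for the key")
        d[key] = val
    stack.append(d)
    return d

# Example: the stack after "<< /Type /Annot /SubType /Square" (mark at the bottom).
stack = [DictMark(), "Type", "Annot", "SubType", "Square"]
dict_end(stack)  # stack is now [{'Type': 'Annot', 'SubType': 'Square'}]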
The important takeaway is that there really isn't any syntax in PDF, nor is there any in PostScript. At least not so much as you'd notice. The only real syntax (and the read-eval-(push) loop shows it) is '}'.
So when you see this in a PDF:
14 0 obj << /Type /Annot /SubType /Square >> endobj
what you're really seeing is a series of procedures:
Push 14
Push 0
Execute obj (Pop two numbers and push a "definition" object).
Execute dictionary begin
Push /Type
Push /Annot
Push /SubType
Push /Square
Execute dictionary end
Execute endobj (pop the top object and then get (not pop) the next one. If the second is a definition, set its "value" to the first object, else throw).
Since "endobj" is terminal, parsing ends and the top of the stack is the result.
So when you are asked to look up object 14 in the PDF, the cross-reference table tells you where to seek to; you make a new Machine with the stream pointer at that location and run it. If the top of the stack is a "definition" object, you've succeeded.
About now you should be nodding but not trusting me, since you're thinking about PDF streams, which look like this:
<< [/key value]* >> stream ...raw data... endstream endobj
Again, there is no syntax. The proc stream looks at the top of the stack, which should be a PdfDict. If it is, it consumes characters until the next newline (scanner does this), stores the current file position in the stream as data start, reads the stream length from the dict (which may cause another Machine to get newed up), and skips past the end of stream and pushes the new stream object on the stack. endstream is a no-op. The only difference between a PdfDict and a PdfStream is that a PdfStream has a start position and a bool saying that it's a stream, otherwise I dual-purpose the object.
PostScript is almost identical except that the execution environment is a little more complex. For example, you need several stacks in your machine: a parameter stack, a dictionary stack, and an execution stack. From there, you more or less just bind your tokenizer into the set of primitive procedures as well as the word exec, and then most of your interpreter is written in PS itself.
If you're talking about Boost, you're looking at C++, which means that you can't be as fast and loose with memory as I am, so you'll want to either use smart pointers or figure out where your scope is and be careful to dispose of objects instead of blithely throwing them away, but that's just the normal C++ stuff.
Currently, I make PDF tools for my company in .NET, but in a former life I worked on Acrobat versions 1-4, and most of what I described is exactly what Acrobat did under the hood (well, more or less - it was C, not C++, but it's the same approach).
With respect to the xref table (or xref stream), you read that first: the spec tells you that if you jump to EOF and scan back, you find the start of the xref table. You parse that (which is a CS 101 assignment), parse the trailer, seek to the /Prev entry if there is one, and repeat until there are no more /Prev entries. That gives you a complete xref for looking up objects.
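Here is what that walk might look like, as a rough sketch in Python rather than the answer's C#. It assumes classic, well-formed xref tables with 20-byte entries (no PDF 1.5 xref streams, no hybrid files), which is exactly the kind of spec deviation the end of this answer warns about:
import re

def read_xref_offsets(path):
    """Collect {object number: byte offset} by walking the classic xref table chain."""
    with open(path, "rb") as f:
        data = f.read()

    # Jump to EOF and scan back for "startxref" to find the newest table.
    start = int(re.search(rb"startxref\s+(\d+)", data[-1024:]).group(1))

    offsets = {}
    while True:
        # A classic xref section: the keyword "xref", then one or more subsections
        # "<first> <count>" followed by 20-byte entries "<offset:10> <gen:5> <n|f>".
        pos = re.compile(rb"xref\s+").match(data, start).end()
        while True:
            sub = re.compile(rb"(\d+)\s+(\d+)\s*").match(data, pos)
            if not sub:
                break
            first, count = int(sub.group(1)), int(sub.group(2))
            pos = sub.end()
            for i in range(count):
                off, gen, kind = data[pos:pos + 20].split()[:3]
                if kind == b"n":
                    # We walk newest to oldest, so the first offset seen wins.
                    offsets.setdefault(first + i, int(off))
                pos += 20
        # The trailer dictionary follows the entries; if it has /Prev, repeat from there.
        prev = re.search(rb"/Prev\s+(\d+)", data[pos:pos + 2048])
        if not prev:
            return offsets
        start = int(prev.group(1))
Because the chain is walked from newest table to oldest, the first offset recorded for an object number is the one that counts, which is why setdefault is used.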
As for writing: there are a number of approaches you can take. The most obvious one is that when an object is meant to be referenced, you create a new reference object by assigning the newest available xref entry to it. Whenever objects refer to other objects for writing, they ask if those objects are referenced. If they are, they write the reference (i.e., 14 0 R). When it comes time to write a referenced object, you get the current stream pointer and store it in the xref, then write <objnum> <generation> obj <object contents> endobj. For example, my code to write a dictionary looks like this:
public override void ToStream(PdfStreamingContext context)
{
    if (context.HasReference(this)) // is this object referenced in the xref?
    {
        PdfUtils.WriteObjectDefinitionBegin(this, context);
    }
    context.Writer.Indent();
    context.Writer.WriteLine("<<");
    WriteContents(context);
    context.Writer.Exdent();
    context.Writer.WriteLine(">>");
    if (context.HasReference(this))
    {
        PdfUtils.WriteObjectDefinitionEnd(this, context);
    }
}
I've chopped out some chaff so you can see the wheat underneath. The context is an object that holds a new xref table as well as an object for writing to streams that automagically handles appropriate newline discipline, indentation, line wrapping, and so on.
What you should see is that the basics here are straightforward, if not trivial. And now is when you should be asking yourself the question, "if it's trivial, how come there isn't more (serious) competition for Acrobat in the market?" The answer is that even though it's trivial, it's still easy to write PDFs that aren't spec compliant, and Acrobat handles most of those. The real challenge is to honor the spec and make sure that you include all required values in a dictionary and that they are in range and semantically correct. Hell, even the date-time format (which is pretty well specified) is a mound of special-case code in my library to manage where other people have screwed it up royally. Being able to generate consistently correct PDF is hard, and consuming the garbage in the sea of PDFs in the world is harder.
I could (and probably should) write a book about how to do this. While a lot of the fringe code is grubby, the overall structure can be very pretty.
tl;dr - If you're thinking of a recursive descent parser for PDF, you're thinking too hard. All you need is a tokenizer and a simple REPL.
Excuse the n00bness of this question, but I have a web application where I want to send a potentially large file to the server and have it parse the format. I'm using the Play20 framework and I'm new to Scala.
For example, if I have a csv, I'd like to split each row by "," and ultimately create a List[List[String]] with each field.
Currently, I'm thinking the best way to do this is with a BodyParser (but I could be wrong). My code looks something like:
Iteratee.fold[String, List[List[String]]]() {
  (result, chunk) =>
    result = chunk.splitByNewLine.splitByDelimiter // Pseudocode
}
My first question is, how do I deal with a situation like the one below where a chunk has been split in the middle of a line:
Chunk 1:
1,2,3,4\n
5,6
Chunk 2:
7,8\n
9,10,11,12\n
My second question is, is writing my own BodyParser the right way to go about this? Are there better ways of parsing this file? My main concern is that I want to allow the files to be very large so I can flush a buffer at some point and not keep the entire file in memory.
If your CSV doesn't contain escaped newlines, then it is pretty easy to do progressive parsing without putting the whole file into memory. The iteratee library comes with a method search inside play.api.libs.iteratee.Parsing:
def search (needle: Array[Byte]): Enumeratee[Array[Byte], MatchInfo[Array[Byte]]]
which will partition your stream into Matched[Array[Byte]] and Unmatched[Array[Byte]]
Then you can combine a first iteratee that takes the header and another that will fold over the unmatched results. This should look like the following code:
// break at each match and concat unmatches and drop the last received element (the match)
val concatLine: Iteratee[Parsing.MatchInfo[Array[Byte]], String] =
  ( Enumeratee.breakE[Parsing.MatchInfo[Array[Byte]]](_.isMatch) ><>
    Enumeratee.collect { case Parsing.Unmatched(bytes) => new String(bytes) } &>>
    Iteratee.consume() ).flatMap(r => Iteratee.head.map(_ => r))

// group chunks using the above iteratee and do simple csv parsing
val csvParser: Iteratee[Array[Byte], List[List[String]]] =
  Parsing.search("\n".getBytes) ><>
  Enumeratee.grouped( concatLine ) ><>
  Enumeratee.map(_.split(',').toList) &>>
  Iteratee.head.flatMap( header => Iteratee.getChunks.map(header.toList ++ _) )
// an example of a chunked simple csv file
val chunkedCsv: Enumerator[Array[Byte]] = Enumerator("""a,b,c
""","1,2,3","""
4,5,6
7,8,""","""9
""") &> Enumeratee.map(_.getBytes)
// get the result
val csvPromise: Promise[List[List[String]]] = chunkedCsv |>>> csvParser
// eventually returns List(List(a, b, c),List(1, 2, 3), List(4, 5, 6), List(7, 8, 9))
Of course you can improve the parsing. If you do, I would appreciate if you share it with the community.
So your Play2 controller would be something like:
val requestCsvBodyParser = BodyParser(rh => csvParser.map(Right(_)))
// progressively parse the big uploaded csv like file
def postCsv = Action(requestCsvBodyParser) { rq: Request[List[List[String]]] =>
  // do something with data
}
If you don't mind holding twice the size of List[List[String]] in memory then you could use a body parser like play.api.mvc.BodyParsers.parse.tolerantText:
def toCsv = Action(parse.tolerantText) { request =>
  val data = request.body
  val reader = new java.io.StringReader(data)
  // use a Java CSV parsing library like http://opencsv.sourceforge.net/
  // to transform the text into CSV data
  Ok("Done")
}
Note that if you want to reduce memory consumption, I recommend using Array[Array[String]] or Vector[Vector[String]], depending on whether you want to deal with mutable or immutable data.
If you are dealing with a truly large amount of data (or lots of requests with medium-sized data) and your processing can be done incrementally, then you can look at rolling your own body parser. That body parser would not generate a List[List[String]] but would instead parse the lines as they come and fold each line into the incremental result. But this is quite a bit more complex to do, in particular if your CSV uses double quotes to support fields containing commas, newlines or double quotes.
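The line-buffering part of that idea is the same in any language. Here is a minimal sketch in Python rather than Scala (and ignoring quoted fields, which is exactly the complication mentioned above): keep the trailing partial line between chunks and only fold complete lines.
def fold_csv_chunks(chunks, fold, state):
    """Feed byte chunks; call fold(state, fields) once per complete CSV line."""
    leftover = ""
    for chunk in chunks:
        text = leftover + chunk.decode("utf-8")
        lines = text.split("\n")
        leftover = lines.pop()  # the last element may be an incomplete line
        for line in lines:
            if line:
                state = fold(state, line.split(","))
    if leftover:  # flush the final line if the input didn't end with a newline
        state = fold(state, leftover.split(","))
    return state

# Chunks split mid-line, similar to the question's example:
chunks = [b"1,2,3,4\n5,6,", b"7,8\n9,10,11,12\n"]
rows = fold_csv_chunks(chunks, lambda acc, fields: acc + [fields], [])
# rows == [['1', '2', '3', '4'], ['5', '6', '7', '8'], ['9', '10', '11', '12']]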