How to tokenize/parse/search&replace a document by font AND font style in LibreOffice Writer?

I need to update a bilingual dictionary written in Writer by first parsing all entries into their parts e.g.
main word (font 1, bold)
foreign equivalent transliterated (font 1, italic)
foreign equivalent (font 2, bold)
part of speech (font 1, italic)
Each line of the document is the main word followed by the parts listed above, each separated by a space or punctuation.
I need to automate the process of walking through the whole file, line by line, and placing a delimiter between each part, ignoring spaces and punctuation, so I can mass import it into a Calc file. In other words, "each part" is a sequence of characters (ignoring spaces and punctuation) that have the same font AND font style.
I have tried the standard Search&Replace feature and the AltSearch extension, but neither is able to complete the task. The main problem is that I am not able to write a search query that says:
Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation
Replace: term found above + "delimiter"
Any suggestions how I can write a script for this, or if an existing tool can solve the problem?
Thanks!
Pseudocode for the desired effect:
var $delimiter = "|"
Go to beginning of document
While not end of document do:
    var $currLine = get line from doc
    var $currChar = get next character which is not space or punctuation
    var $font = $currChar.font
    var $font_style = $currChar.font_style  // e.g. bold, italic, normal
    While not end of line do:
        $currChar = next character which is not space or punctuation
        if ($currChar.font != $font || $currChar.font_style != $font_style) {  // font or style has changed
            print $delimiter
            $font = $currChar.font
            $font_style = $currChar.font_style
        }
    end While
end While

Here are tips for each of the things your pseudocode does.
First, the easiest way to move line by line is with the TextViewCursor, although it is slow. Notice the XLineCursor section. For the while loop, oVC.goDown(1, False) returns false once the end of the document is reached (oVC is our variable for the TextViewCursor).
Get each character by calling oVC.goRight(0, False) to deselect, followed by oVC.goRight(1, True) to select. The selected text is then obtained with oVC.getString(). To ignore spaces and punctuation, perhaps use Python's isalnum() or the re module.
To determine the font of the character, call oVC.getPropertyValue(attr). Values for attr could simply be CharAutoStyleName and CharStyleName, to check for any change in formatting.
Or grab a list of specific properties such as 'CharFontFamily', 'CharFontFamilyAsian', 'CharFontFamilyComplex', 'CharFontPitch', 'CharFontPitchAsian' etc. Character properties are described at https://wiki.openoffice.org/wiki/Documentation/DevGuide/Text/Formatting.
To insert the delimiter into the text: oVC.getText().insertString(oVC, "|", 0).
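The grouping logic itself can be tried out separately from the UNO calls. The following is only a minimal sketch in plain Python: it assumes the characters of one line have already been read (for example with the view cursor as described above) into (character, font, style) tuples. The function name insert_delimiters and the tuple layout are made up for illustration and are not part of the LibreOffice API.

```python
def insert_delimiters(chars, delimiter="|"):
    """Join runs of identically formatted characters with `delimiter`.

    chars: iterable of (character, font, style) tuples for one line
    (a hypothetical layout chosen for this sketch).
    Spaces and punctuation never start a new run, whatever their formatting.
    """
    parts = []      # finished runs
    current = []    # characters of the run being built
    pending = []    # spaces/punctuation seen since the last letter or digit
    fmt = None      # (font, style) of the current run
    for ch, font, style in chars:
        if not ch.isalnum():
            pending.append(ch)  # hold until we know whether the run continues
            continue
        if fmt is not None and (font, style) != fmt:
            # Formatting changed: close the run and drop the separators.
            parts.append("".join(current))
            current = []
        elif current:
            # Same run continues: keep the inner spaces/punctuation.
            current.extend(pending)
        pending = []
        fmt = (font, style)
        current.append(ch)
    if current:
        parts.append("".join(current))
    return delimiter.join(parts)
```

With this shape, a bold main word followed by an italic transliteration comes out as two fields, while a multi-word entry in a single font and style stays together as one field.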
This python code from github shows how to do most of these things, although you'll need to read through it to find the relevant parts.
Alternatively, instead of using the LibreOffice API, unzip the .odt file and parse content.xml with a script.
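For the content.xml route, here is a rough sketch using only the Python standard library. It assumes the formatting was applied directly, so that Writer stored each differently formatted piece as a text:span with an automatic style name such as T1 or T2; those names are defined under office:automatic-styles in the same file, where the font name, weight and posture can be looked up. The function names are made up for illustration, and nested spans and special elements such as tabs are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF namespace for text elements (text:p, text:span, ...)
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def read_content_xml(odt_path):
    """Return the raw content.xml from an .odt file (which is a zip archive)."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read("content.xml")


def paragraph_runs(content_xml):
    """Yield, for each paragraph, a list of (span_style_name, text) runs.

    Text that sits directly in the paragraph (outside any span) is
    reported with a style name of None.
    """
    root = ET.fromstring(content_xml)
    for para in root.iter(f"{{{TEXT_NS}}}p"):
        runs = []
        if para.text:
            runs.append((None, para.text))
        for child in para:
            if child.tag == f"{{{TEXT_NS}}}span":
                style = child.get(f"{{{TEXT_NS}}}style-name")
                runs.append((style, "".join(child.itertext())))
            if child.tail:
                runs.append((None, child.tail))
        yield runs
```

Calling paragraph_runs(read_content_xml("dictionary.odt")) (the file name is just an example) gives one list of runs per dictionary line, which can then be joined with a delimiter and written out for Calc.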

Related

How to make a variable non delimited file to be a delimited one

Hello, I want to convert my non-delimited file into a delimited file.
Example of the file is as follows.
Name. CIF Address line 1 State Phn Address line 2 Country Billing Address line 3
Alex. 44A. Biston NJ 25478163 4th,floor XY USA 55/2018 kenning
And so on; all the data are in this format.
The first three lines are metadata, followed by the data.
How can I turn it into a properly delimited format with a program?
There are two parts in the problem:
how to find the column widths
how to split each line into fields and output a new line with delimiters
I could not propose an automated solution for the first one, because (not knowing anything about the metadata format) there is no clear way to find where one column ends and the next one begins. Some of the column headings contain multiple space-separated words, and space is also used as a separator between the headings (and apparently one cannot use the rule "more than one space means the end of a heading name", because there's only one space between "Address line 2" and "Country" - and they're clearly separate columns). Finding the correct column widths requires understanding English, and this is not something that you can easily write a program for.
For the second problem, things are much easier - once you have the column positions. If you figure the column positions manually (or programmatically, if you know something about the metadata that I don't - and you have a simple method for finding what's a column heading), then a program written in AWK can do this, for example:
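cols="8,15,32,40,53,66,83,105"
awk_prog='BEGIN {
    nt = split(cols, tabs, ",")   # 1-based start position of each column after the first
    delim = ","
    ORS = ""
}
{
    o = 1
    # loop by index: "for (i in tabs)" does not guarantee the order
    for (i = 1; i <= nt; i++) {
        t = tabs[i]
        f = substr($0, o, t - o)
        sub(/ *$/, "", f)         # strip trailing padding
        print f delim
        o = t
    }
    print substr($0, o) "\n"
}'
awk -v cols="$cols" "$awk_prog" input_file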
NOTE that the above program does not deal correctly with the case when the separator character (e.g. ",") appears inside the data. If you decide to use this as-is, be sure to use a separator that is not present in the input data. It may be better to modify the code to escape any separator characters found in the input data (there are different ways to do this - depends on what you plan to feed the output file to).
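As one possible way to handle the escaping, here is a sketch of the same splitting in Python, where the standard csv module quotes any field that contains the separator. The function names are made up for illustration, and the column positions are the same kind of 1-based start positions as in cols above. Unlike the AWK version, it also strips trailing whitespace from the last field.

```python
import csv
import io


def split_fixed_width(line, starts):
    """Cut `line` at the 1-based column start positions in `starts`."""
    bounds = [1] + list(starts) + [None]
    fields = []
    for begin, end in zip(bounds, bounds[1:]):
        stop = None if end is None else end - 1
        fields.append(line[begin - 1:stop].rstrip())  # drop trailing padding
    return fields


def to_csv(lines, starts):
    """Return the lines as CSV text; csv.writer quotes fields containing ','."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        writer.writerow(split_fixed_width(line.rstrip("\n"), starts))
    return out.getvalue()
```

A field such as "4th,floor" is then written inside double quotes, so a CSV-aware importer reads it back as a single value.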

Copy a table from iPython notebook into Word?

I want to copy a table from iPython notebook into a Word doc. I'm using Word for Mac 2011. The table is a standard pandas output and looks like this:
If I use Apple+C to copy the table, and then paste it into a Word doc, I get this:
Surely there must be an easier way?
Creating a table with the same number of rows/columns in Word and then trying to paste the cells there doesn't work either.
I guess I could screenshot the table, but I'd like to include the raw data in the document if possible.
The problem in this case (from the Word perspective) is not the table layout - it's the paragraph layout. Each paragraph has a substantial indent on right and left, and more space before/after than you would normally want.
I don't think any of the Paste options (e.g. Paste Special) in Word is going to help, unless you paste as unformatted text, then select the text, convert to a table, then proceed from there.
But, even a simple Word VBA macro such as this one will leave you with something a bit more manageable. (Select a table you copied in, then run the macro). A little bit more work on the code would probably allow you to get most of the formatting you want, most of the time.
Sub fixupSelectedTable()
With Selection.Tables(1).Range.ParagraphFormat
.LeftIndent = 0
.RightIndent = 0
.SpaceBefore = 0
.SpaceAfter = 0
.LineSpacingRule = wdLineSpaceSingle
End With
End Sub
If you are more familiar with Applescript, the equivalent looks something like this:
-- you may need to fix up the application name
-- (I use this to ensure that the script uses the Open Word 2011 doc
-- and does not try to start Word for Mac 15 (2016))
tell application "/Applications/Microsoft Office 2011/Microsoft Word.app"
tell the paragraph format of the text object of table 1 of the text object of the selection
set paragraph format left indent to 0
set paragraph format right indent to 0
set space before to 0
set space after to 0
set line spacing rule to line space single
end tell
end tell

Parse a Word Document By Font?

I'm currently trying to write a script which would run through a word document and output to a text file all the lines that are written in a certain font.
So if I had the document:
"This is the first line of the document.
This is the second line of the document.
This is the third line of the document."
And say normal lines are Times New Roman, bold is Arial, and italics is Sans Serif.
Then, ideally, I could parse the document for all lines in Arial and the text file output would have the line:
This is the second line of the document.
Any idea on how to do this from a script? I was thinking about first converting the doc into xml, but I do not think this is possible within a script.
You'll want to use the FIND object, and the FONT property of the FIND object.
So, something like this:
Public Sub FindTest()
    Dim r As Range
    Set r = ActiveDocument.Content
    With r.Find
        .ClearFormatting
        .Style = "SomeStyleName"   ' or, to match by font: .Font.Name = "Arial"
        Do While .Execute(Forward:=True, Format:=True) = True
            '---- r now covers the range that was found
            Debug.Print r.Text
        Loop
    End With
End Sub
Note that where I've specified Style, you could specify font formatting via the FIND.FONT object, or various other formatting options. Just browse around the FIND object to see what's available.
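On the XML idea raised in the question: if the document is saved as .docx, it is a zip archive whose word/document.xml can be read from a script. The following is only a sketch with the Python standard library, not a replacement for the Find approach. It assumes the font was applied directly to the runs (so each run carries a w:rFonts element); fonts inherited from a style would need a lookup in styles.xml, which is not shown. The function name is made up for illustration.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the usual "w:" prefix)
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def paragraphs_in_font(document_xml, font_name):
    """Return the text of paragraphs whose non-blank runs all use `font_name`.

    Only directly applied fonts (w:rPr/w:rFonts with a w:ascii attribute)
    are considered; that is an assumption of this sketch.
    """
    root = ET.fromstring(document_xml)
    found = []
    for para in root.iter(f"{{{W}}}p"):
        pieces, all_match = [], True
        for run in para.iter(f"{{{W}}}r"):
            text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
            fonts = run.find(f"{{{W}}}rPr/{{{W}}}rFonts")
            name = fonts.get(f"{{{W}}}ascii") if fonts is not None else None
            if text.strip() and name != font_name:
                all_match = False
            pieces.append(text)
        line = "".join(pieces)
        if all_match and line.strip():
            found.append(line)
    return found
```

The XML itself can be obtained with zipfile.ZipFile(path).read("word/document.xml"), and the returned lines written to a text file.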

Delphi 2009: Search skipping diacritics in unicode utf-8

I have a UTF-8 encoded file containing Arabic text and I have to search it.
My problem is diacritics: how do I search while skipping them?
For example, if you load that text in Internet Explorer (converting the text to HTML, of course), IE skips those diacritics.
Any help?
Edit1: Search is simply performed by following code:
var m1 : TMemo; // contains utf-8 data
m2 : TMemo; // contains results
...
m2.lines.BeginUpdate;
for s in m1.Lines do
begin
if pos(eSearch.Text,s)>0 then
begin
m2.Lines.Add(s);
end;
end;
m2.Lines.EndUpdate;
Edit2: Example of unicode data:
قُلْ هُوَ اللَّهُ أَحَدٌ
If you search for only the letters without diacritics, قل, the word قُلْ won't be found.
On Vista+ you can probably (I have no experience with Arabic) use CompareString with option LINGUISTIC_IGNOREDIACRITIC.
NORM_IGNORENONSPACE may also help. Then again, it may not.
Alternatively (but I'm just guessing) you may be able to parse your strings with GetStringTypeEx and manually remove diacritics. Probably you'd have to call FoldString or MultiByteToWideChar with flag MAP_COMPOSITE first.
I find that diacritics are not the only problem.
I would do character replacements, replacing the diacritics with empty strings. I would also normalize the text: 'أ', 'إ' and 'آ' are all converted to 'ا', and likewise for ى ئ ي ؤ و ة ه ...
For search I'd also use a light stemmer like the "khoja stemmer" (Java source here)
A more advanced way is to do it like TREC:
Remove punctuation
Remove diacritics (mainly weak vowels). Most of the corpus did not contain weak vowels, but some of the dictionary entries did; removing them made everything consistent.
Remove non-letters
Replace initial إ or أ with bare alif ا
Replace آ with ا
Replace the sequence ىء with ئ
Replace final ى with ي
Replace final ة with ه
Strip 6 prefixes from the beginnings of normalized words: the definite articles (ال، وال، بال، كال، فال) and و (and)
Strip 10 suffixes from the ends of words: ها، ان، ات، ون، ين، يه، ية، ه، ة، ي
I would index the text by this modified text (for memos I'd store the index of the word in the original text), and do the same thing for the search query.
I would also search in Memo1.Text and not the lines one by one, the search could be for multiple words that may be at the end of a line and wrapped to the next line.
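To illustrate the core of this idea (normalize both the text and the query, then compare), here is a small sketch in Python rather than Delphi. It covers only the diacritic removal and the alif folding, not the stemming steps. It treats every Unicode nonspacing mark (category Mn) as a diacritic, which is an assumption of this sketch; the function names are made up for illustration.

```python
import unicodedata

# Alif variants folded to bare alif (a subset of the replacements above).
FOLD = str.maketrans({
    "\u0623": "\u0627",  # alif with hamza above -> alif
    "\u0625": "\u0627",  # alif with hamza below -> alif
    "\u0622": "\u0627",  # alif with madda above -> alif
})


def normalize(text):
    """Fold alif variants and drop diacritics (nonspacing marks)."""
    # NFC first, so that alif + combining hamza/madda becomes one code point.
    text = unicodedata.normalize("NFC", text)
    text = text.translate(FOLD)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def find_lines(lines, query):
    """Return the original lines whose normalized form contains the query."""
    needle = normalize(query)
    return [line for line in lines if needle in normalize(line)]
```

Because the match is done on the normalized copies while the original lines are returned, the results can still be shown with their diacritics.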

removing whitespaces in ActionScript 2 variables

let's say that I have an XML file containing this :
<description><![CDATA[
<h2>lorem ipsum</h2>
<p>some text</p>
]]></description>
that I want to get and parse in ActionScript 2 as HTML text, setting some CSS before displaying it. The problem is that Flash takes that whitespace (line feeds and tabs) and displays it as is.
<some whitespace here>
lorem ipsum
some text
where the output I want is
lorem ipsum
some text
I know that I could remove the whitespaces directly from the XML file (the Flash developer at my workplace also suggests this. I guess that he doesn't have any idea on how to do this [sigh]). But by doing this, it would be difficult to read the section in the XML file, especially when lots of tags are involved and that makes editing more difficult.
So now, I'm looking for a way to strip those whitespaces in ActionScript. I've tried to use PHP's str_replace equivalent (got it from here). But what should I use as a needle (string to search) ? (I've tried to put in "\t" and "\r", don't seem to be able to detect those whitespaces).
edit :
now that I've tried to throw in newline as a needle, it works (meaning that newline successfully got stripped).
mystring = str_replace(newline, '', mystring);
But newlines only get stripped once: where newlines are consecutive (e.g. a newline followed by another newline), only one of them is stripped.
Now, I don't see this as a problem in the str_replace function, since consecutive characters other than newline get stripped just fine.
Pretty much confused about how stuff like this is handled in ActionScript. :-s
edit 2:
I've tried str_replace -ing everything I know of, \n, \r, \t, newline, and tab (by pressing tab key). Replacing \n, \r, and \t seem to have no effect whatsoever.
I know that by successfully doing this, my content can never have real line breaks. That's exactly my intention. I could format the XML the way I want without Flash displaying any of the formatting stuff. :)
Several ways to approach this. Perhaps the simplest answer is, in one sense your Flash developer is probably right, and you should move your whitespace outside of the CDATA container. The reason being, many people (me at least) tend to assume that everything inside a CDATA is "real data", as opposed to markup. On the other hand, whitespace outside a CDATA is normally assumed to be irrelevant, so data like this:
<description>
<![CDATA[<h2>lorem ipsum</h2>
<p>some text</p>]]>
</description>
would be easier to understand and to work with. (The flash developer can use the XML.ignoreWhite property to ignore the whitespace outside the CDATA.)
With that said, if you're editing the XML by hand, then I can see why it would be easier to use the formatting you describe. However, if the extra whitespace is inside the CDATA, then it will inevitably be included in the String data you extract, so your only option is to grab the content of the CDATA and remove the whitespace afterwards.
Then your question reduces to "how do I strip leading/trailing whitespace from a String in AS2?". And unfortunately, since AS2 doesn't support RegEx there's no simple way to do this. I think your best option would be to parse through from the beginning and end to find the first/last non-white character. Something along these lines (untested pseudocode):
myString = stuffFromXML;
whitespace = " " + "\t" + "\n" + "\r" + newline;
start = 0;
end = myString.length;
// the start/end comparisons stop the loops when the string is all whitespace
while ( start < end && testString( myString.substr(start,1), whitespace ) ) { start++; }
while ( end > start && testString( myString.substr(end-1,1), whitespace ) ) { end--; }
trimmedString = myString.substring( start, end );
function testString( needle, haystack ) {
return ( haystack.indexOf( needle ) > -1 );
}
Hope that helps!
Edit: I notice that in your example you'd also need to remove tabs and whitespace within your text data. This would be tricky, unless you can guarantee that your data will never include "real" tabs in addition to the ones for formatting. No matter what you do with the CDATA tags, it would probably be wiser not to insert extraneous formatting inside your real content and then remove it programmatically afterward. That's just making your own life difficult.
Second edit: As for what character to remove to get rid of newlines, it depends partially on what characters are actually in the XML to begin with (which probably depends on what OS is running where the file is generated), and partially on what character the client machine (that's showing the flash) considers a newline. Lots of gory details here. In practice though, if you remove \r, \n, and \r\n, that usually does the trick. That's why I added both \r and \n to the "whitespace" string in my example code.
It's been a while since I've tinkered with AS2.
someXML = new XML();
someXML.ignoreWhite = true;
If you wanted to use str_replace, try '\n'.
Is there a reason that you are using CDATA? Admittedly I have no idea what the best practice for this sort of thing is, but I tend to leave it out and just have the HTML sit there inside the node.
var foo = node.childNodes.join("") parses it out just fine and I never seem to come across these whitespace problems.
I'm reading this over and over again, and if I'm interpreting you right, all you want to know how to do is strip certain characters (tabs and newlines) from a string in AS2, right? I cannot believe no one has given you the simple one line answer yet:
myString = myString.split("\n").join("");
That's it. Repeat that for \r, \n, and \t and all newlines and tabs will be gone. If you want it as an easy function, then do this:
function stripWhiteSpace(str: String) : String
{
return str.split("\r").join("").split("\n").join("").split("\t").join("");
}
That function won't modify your old string, it will return a new one without \r, \n, or \t. To actually modify the old string use that function like this:
myString = stripWhiteSpace(myString);
