Google Translate messes up the markup in my file - localization

I am trying to use Google Translate to localize an XML file. It has nearly 350K lines, and some of them contain markup for in-game font size and color, like so:
<replacement><p horizontalalignment="center"><br/><image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/><br/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Six_Superior" scalerate="1.5"/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/><br/><image enablescale="true" imagesetpath="00009499.Field_Boss" scalerate="1.4"/>Хмельной лик<br/><br/></p>Уничтожить зараженных насекомых<br/>возле мест обитания их королевы。<br/></replacement>
Now, for some reason, Google Translate alters that markup during translation and turns it into something unusable, like so:
<replacement> <p horizontalalignment="center"> <br/> <image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/> <br/> <image enablescale = "true "imagesetpath =" 00015590.Tag_Dungeon_Six_Superior "scalerate =" 1.5 "/> <image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/> <br/> <image enablescale = "true" imagesetpath = "00009499.Field_Boss" scalerate = "1.4" /> Intoxicated face <br/> <br/> </ p> Destroy infected insects <br/> habitats near their queen. <br/> </ replacement>
Is there any way to avoid that, and why exactly is it happening? Any help on this is appreciated, thanks.
EDIT: I am also looking for a way to feed in my text and get it back in exactly the same language, with only the markup mishaps changed. That way I could isolate those, build a comparison table, and use it to fix the errors after the actual translation is done. But I don't see a way to select the same language as both input and output in Google Translate; it always forces me to choose a different one for input or output. That kind of makes sense, but if there is a way to do it, I might be able to work around the problem.

Do not feed Google Translate your XML file; as far as I know, it doesn't understand XML.
Extract the text from the XML file.
Feed the text to Google Translate.
Transform the translated text back into XML.
You could simply transform the XML into a text document with a single line per XML element, so it would be easier to turn it back into XML.
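The extract / translate / re-insert steps above can be sketched roughly as follows. This is only an illustration using Python's standard xml.etree.ElementTree, not something the answer itself prescribes; the helper names (text_slots, extract_lines, inject_lines) are hypothetical, the sample is a shortened version of the question's snippet, and the hard-coded translated list stands in for whatever the translator returns.

```python
import xml.etree.ElementTree as ET

def text_slots(elem, is_root=True):
    """Yield (element, 'text' or 'tail') for each non-blank text run, in document order."""
    if elem.text and elem.text.strip():
        yield elem, "text"
    for child in elem:
        yield from text_slots(child, is_root=False)
    if not is_root and elem.tail and elem.tail.strip():
        yield elem, "tail"

def extract_lines(root):
    # One line per text run; internal whitespace is collapsed so each run stays on one line.
    return [" ".join(getattr(e, slot).split()) for e, slot in text_slots(root)]

def inject_lines(root, translated):
    slots = list(text_slots(root))
    if len(slots) != len(translated):
        raise ValueError("line count changed during translation")
    for (elem, slot), line in zip(slots, translated):
        setattr(elem, slot, line)

# Shortened sample based on the snippet in the question.
xml_source = (
    '<replacement><p horizontalalignment="center">'
    '<image enablescale="true" imagesetpath="00009499.Field_Boss" scalerate="1.4"/>'
    'Хмельной лик<br/></p>Уничтожить зараженных насекомых<br/>'
    'возле мест обитания их королевы<br/></replacement>'
)

root = ET.fromstring(xml_source)
lines = extract_lines(root)   # these plain-text lines are what goes to the translator

# Stand-in for the translator's output (assumption: same number of lines, same order).
translated = ["Intoxicated face", "Destroy the infected insects", "near the habitat of their queen"]

inject_lines(root, translated)
result = ET.tostring(root, encoding="unicode")
print(result)
```

Because the tags and attributes never leave the program, the translator cannot touch them. The main risk is the translator merging or splitting lines, which is why inject_lines refuses to continue when the line count differs.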
More detail
According to the Toolkit you can upload:
HTML (.HTML)
Microsoft Word (.DOC/.DOCX)
OpenDocument Text (.ODT)
Plain Text (.TXT)
Rich Text (.RTF)
Wikipedia URLs
And a couple of extras such as JSON. So no XML.
The best way I see is to transform your XML document into one of these types (I would probably use JSON), and to do it in such a way that it can easily be transformed back again, using either position (line 1 in the text file is the first element in the XML document) or an ID (add the ID or position of the element in the XML hierarchy to the JSON element).
My guess is that the Toolkit recognizes the HTML tags in the XML and escapes them. So another option might be to un-escape &gt; to > and &lt; to <.


Carrying style IDs/names from HTML to .docx?

Is it possible to somehow tell pandoc to carry the names of styles from original HTML to .docx?
I understand that in order to tune the actual styles, I should be using the reference.docx file generated by pandoc. However, reference.docx is limited to a fixed set of styles: headings, body text, block text, etc.
I'd like to:
specify "myStyle" style in the input HTML (via a "class" attribute, via any other HTML attribute or even via a filter code written in Lua),
<html>
<body>
<p>Hello</p>
<p class="myStyle">World!</p>
</body>
</html>
add a custom "myStyle" to reference.docx using Word,
run an HTML-to-DOCX conversion and expect pandoc to generate a paragraph element with "myStyle" (instead of BodyText, which I believe it sets by default), so the end result looks like this (contents of word/document.xml inside the resulting output.docx, cut for brevity):
<w:p>
<w:pPr>
<w:pStyle w:val="BodyText" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">Hello</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="myStyle" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">World!</w:t>
</w:r>
</w:p>
There's some evidence styleId can be passed around, but I don't really understand it and am unable to find any documentation about it.
Doc on filtering in Lua states you can access attrs when manipulating a pandoc.div, but it says nothing about whether any of the attrs will be interpreted by pandoc in any meaningful way.
I finally found what I needed: custom styles. It's limited, but better than what I had arrived at earlier, and of course much better than nothing at all :)
I'll leave a step-by-step guide here in case anyone stumbles upon a similar question.
First, generate a reference.docx file like this:
pandoc --print-default-data-file reference.docx > styles.docx
Then open the file in MS Word (I was using the macOS version). You'll see this:
Click the "New style..." button on the right and create a style to your liking. In my case I made the text bold and blue:
Since I am converting from HTML to DOCX, here's my input.html:
<html>
<body>
<div>Page 1</div>
<div custom-style="eugene-is-testing">Page 2</div>
<div>Page 3</div>
</body>
</html>
Run:
pandoc --standalone --reference-doc styles.docx --output output.docx input.html
Finally, enjoy the result:

SEC company filings: Is the <SEC-HEADER> tag valid SGML? If so, how to parse it?

I tried to parse SEC company filings from sec.gov. Starting from the FB 10-Q index.htm, let's look at a complete submission text filing. It has a structure like:
<SEC-DOCUMENT>
<SEC-HEADER>
<ACCEPTANCE-DATETIME>"some content" This tag is not closed.
"some lines resembling yaml markup"
These are indented lines with a
"key": "value" structure.
</SEC-HEADER>
<DOCUMENT>
.
.
some content
.
.
</DOCUMENT>
"several DOCUMENT tags" ...
</SEC-DOCUMENT>
I tried to figure out the structure of the <SEC-HEADER> tag, found some information in the Public Dissemination Service (PDS) Technical Specification (pdf), and concluded that the content of the header should be SGML.
Nevertheless, I am clueless about the formatting, since there are no angle brackets and the key-value pairs are separated by colons, like key: value instead of <key>value</key>. In the linked PDF I could not find anything about colons.
Question: Is the <SEC-HEADER> tag valid SGML? If it is, how to parse it?
I'd be glad at any help.
The short answer is no. The <SEC-HEADER> section in the raw filing is not valid SGML.
However, it is my understanding that this section in the raw filing is parsed automatically from the header file <accession_num>.hdr.sgml, which does follow SGML. This header file can be found in the same directory as the raw filing (i.e., the <accession_num>.txt file).
I use a REGEX of the form: ^<(.+?)>(.+?)$ (with re.MULTILINE option) to capture each (tag, value) tuple and get the results directly in a dict().
I believe the only tag in that file that has a closing tag is the </FILER> tag, where there could be multiple filers in each filing. You can first extract those using a REGEX of the form: <FILER>(.+?)</FILER> and then employ the same REGEX as above to get the inner tags for each filer.
Note that other than 'FILER', there could be other tags, representing different relations of the entities to the filing. Those are 'ISSUER', 'SUBJECT COMPANY', 'FILED BY', 'FILED FOR', 'SERIAL COMPANY', 'REPORTING OWNER'.
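The regex approach described above might look roughly like this in Python. It is a sketch under assumptions: the sample text is made up for illustration (its tag names and values are placeholders, not copied from a real .hdr.sgml file), parse_header is a hypothetical helper name, and re.DOTALL is added to the <FILER> pattern because "." does not match newlines by default.

```python
import re

# Made-up sample in the style of a .hdr.sgml header (placeholder tags and values).
sample = """<SUBMISSION>
<ACCESSION-NUMBER>0000000000-00-000000
<TYPE>10-Q
<FILER>
<COMPANY-DATA>
<CONFORMED-NAME>EXAMPLE CORP
<CIK>0000000000
</COMPANY-DATA>
</FILER>
"""

# The answer's pattern: a tag followed by a value on the same line.
TAG_VALUE = re.compile(r"^<(.+?)>(.+?)$", re.MULTILINE)
# FILER blocks span several lines, hence DOTALL (an addition to the answer's pattern).
FILER_BLOCK = re.compile(r"<FILER>(.+?)</FILER>", re.DOTALL)

def parse_header(text):
    """Return (top-level tags as a dict, list of dicts with one dict per FILER block)."""
    filers = [dict(TAG_VALUE.findall(block)) for block in FILER_BLOCK.findall(text)]
    # Strip the FILER blocks first so their inner tags do not leak into the top level.
    top_level = dict(TAG_VALUE.findall(FILER_BLOCK.sub("", text)))
    return top_level, filers

top_level, filers = parse_header(sample)
print(top_level)
print(filers)
```

Lines that hold only a tag, such as <FILER> or </COMPANY-DATA>, are skipped automatically because the pattern requires at least one character after the closing angle bracket. The same block extraction could be repeated for the other relation tags listed above.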

Opening a Word document at a particular bookmark

Using MVC 5 Razor Views.
I currently have a link in my About view that opens a document sitting on the server, as follows:
Basic Training <img src="~/Content/Images/Word.jpg" height="24" width="24" />
What I'd like is a link that opens this document at a particular bookmark.
From what I have read so far, it would seem that the bookmark is specified after a # symbol. Unfortunately it doesn't seem to work, and the document just opens from the start.
I've tried opening it via an action using the # notation as well, but as yet, no joy.
FileStream fs = new FileStream(Server.MapPath(@"~\Content\My Doc - Basic Training.docx"), FileMode.OpenOrCreate, FileAccess.ReadWrite);
return File(fs, "My Doc - Basic Training.docx");
I've simply been appending #BookmarkName to the filename. No joy as of yet.
Is it possible?
If so could someone please point me in the right direction.
It seems to work just by appending #bookmark_name, as explained in How to create a hyperlink from an HTML page to a bookmark in Word.
So:
Basic Training <img src="~/Content/Images/Word.jpg" height="24" width="24" />

html2pdf and local (latvian) language characters

I am using Html2Pdf to convert HTML to PDF.
But I cannot get it to show local (Latvian) letters. It shows ? instead.
I understand that I should somehow add appropriate fonts, but I do not know where to get those fonts (which ones support the Latvian language) or how to add them to Html2Pdf.
Html2Pdf is based on TCPDF, and it already has a fonts folder.
This seems like a trivial question, but I have searched Google and have not found an answer that works for me.
require_once('inc/html2pdf/html2pdf.class.php');
$html2pdf = new HTML2PDF('P','A4','en');
//$html2pdf->pdf->setDefaultFont('times');
// HEADER
$pdf_output .= '<page style="font-size: 11px;">';
$pdf_output .= '<img src="images/raka_pdf_logo.png" alt="logo"/><br><br><br><br>';
...
You may find a suitable font family in the html2pdf > tcpdf > fonts folder.

Retrieve XML file using NSXMLParser in iOS

I have a problem reading XML files with NSXMLParser in iOS:
<PRODUCTS>
<PRODUCTSLIST>
<PRODUCTDETAILS>
<headertext> test header </headertext>
<description><b style="font-size: x-small;">product, advantages</b></description>
</PRODUCTDETAILS>
</PRODUCTSLIST>
</PRODUCTS>
When I read the file using NSXMLParser, I can get the value (test header) for headertext, but the description element's value contains HTML tags, so instead of the expected result (<b style="font-size: x-small;">product, advantages</b>) I get an empty result.
How can I get (<b style="font-size: x-small;">product, advantages</b>) as the result for the description element?
Speaking from a developer's perspective, I would not recommend using NSXMLParser because of its laborious way of parsing XML files. There is a great write-up about choosing the right XML parser.
I use KissXML quite often.
You can find a quick tutorial on using it here.
Hope this helps.
Your problem is that the "b" tag is considered part of the XML structure. Try escaping the '<' and '>' characters of the 'b' tag:
@"&lt;b style=\"font-size: x-small;\"&gt;product, advantages&lt;/b&gt;"
see here
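The effect of escaping is not specific to NSXMLParser, so it can be illustrated with a small Python sketch using only the standard library. This is an illustration of the principle rather than iOS code, and the variable names are made up: with raw tags the parser sees a child element and no text, while the escaped version comes back as a plain string.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

html_fragment = '<b style="font-size: x-small;">product, advantages</b>'

# Unescaped: <b> becomes a child element of <description>.
raw_doc = f"<description>{html_fragment}</description>"
# Escaped: '<' and '>' become &lt; and &gt;, so the fragment is ordinary character data.
escaped_doc = f"<description>{escape(html_fragment)}</description>"

raw_text = ET.fromstring(raw_doc).text          # no text directly inside <description>
escaped_text = ET.fromstring(escaped_doc).text  # the parser un-escapes it back to the fragment

print(repr(raw_text))
print(escaped_text)
```

Wrapping the fragment in a CDATA section in the XML file is another common way to get the same result without escaping each character.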
