Kanji character looking different in different contexts

I first noticed it when parsing an epub with a rudimentary Java program, but it does not seem to be caused by my program: within the same application, Calibre (an ebook reader), the same character looks different in the ebook HTML editing window and in the preview window. [screenshot: the kanji rendered differently in the Calibre ebook editor]
The correct rendering is the one in the preview window.
If I unpack the epub and open the HTML file for that page, I get a character looking, more or less, like an I (or an I itself).
If I paste the correct-looking kanji into MS Word or LibreOffice, I still get the character looking like an I. Extracting the font used in the epub and using it in Word, I still get the I-like character.
Can someone explain why this happens, and how I can get the correct kanji when cutting and pasting, or in the .html files resulting from unpacking the epub?
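One way to start diagnosing this is to dump the Unicode code points of the character as it appears in the HTML, since visually similar CJK characters, half-width forms, and variation selectors all have distinct code points. A minimal sketch (the string here is only a placeholder; paste the actual character from the file instead):

    # Diagnostic sketch: dump the code points of a pasted string.
    # "\u4e28" is a placeholder -- substitute the character from your epub.
    s = "\u4e28"
    for ch in s:
        print(f"U+{ord(ch):04X}  {ch}")

If the code point differs between the editing window and the preview window, the two are showing different characters; if it is the same, the difference is coming from font substitution or language-dependent rendering.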

Related

OpenXml: get the page number to which each paragraph in a .docx file belongs

I have a Word docx file and I want to retrieve all the paragraphs with OpenXml in C#.
I need to know:
1. The number of pages in the document.
2. The page number to which each paragraph belongs.
Can you show an example where the paragraphs of the document are read?
Unfortunately, as the answers to "Why are only some page numbers stored in XML of docx file?" explain, docx does not contain a reliable page-number facility. The XML files carry no page numbers; pages only exist once Microsoft Word opens the document and renders it dynamically. This holds even if you read the OpenXml documentation, e.g. https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.pagenumber?view=openxml-2.8.1 .
You can unzip a few docx files yourself and search for "page" or "pg"; then you will see it. I have done this on many different kinds of docx files, and they all tell the same story.
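If you want to do that check programmatically rather than by hand, a minimal sketch (the file name is a placeholder) is to open the docx as the zip archive it is and scan its XML parts:

    import zipfile

    # A docx file is just a zip archive of XML parts; "template.docx" is a placeholder.
    with zipfile.ZipFile("template.docx") as z:
        for name in z.namelist():
            if name.endswith(".xml"):
                xml = z.read(name).decode("utf-8", errors="replace")
                if "pg" in xml or "page" in xml.lower():
                    print(name, "mentions page-related markup")

You will find page size and margin settings, but no mapping from paragraphs to page numbers.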
A few months ago, I reworked a Python package called docx2python to do a similar thing: I reproduced a structured (levelled) XML-format file from a docx file. As far as I know, a paragraph contains several runs, and each run contains exactly one text element. Plain paragraphs are not hard to read; you can read this document to see how to do it: https://learn.microsoft.com/en-us/dotnet/api/documentformat.openxml.wordprocessing.paragraph?view=openxml-2.8.1 . Glad if this helps.
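For plain paragraph reading, here is a minimal sketch using the python-docx package (a different package from docx2python, used here only for illustration; the file name is a placeholder):

    from docx import Document  # pip install python-docx

    doc = Document("sample.docx")
    for i, para in enumerate(doc.paragraphs):
        # Each paragraph is a sequence of runs; each run carries one piece of text.
        text = "".join(run.text for run in para.runs)
        print(i, text)

This gives you the paragraphs in document order, but, as explained above, no page numbers.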

How to do a directory-wide search for Chinese characters in a text editor?

I have Atom and Sublime, and I'm working with a series of documents containing some Chinese comments. It seems that both text editors decode the characters before displaying them (it happens in a fraction of a second when opening each file; you can see the text turn from gibberish into Chinese). After those files are opened, I am able to include them in my directory-wide search.
However, without opening these documents, I am unable to search them.
I'm currently resorting to opening hundreds of tabs to work, which is highly inefficient.
I'm guessing both editors wait until each file is opened before processing it. How do I ensure that all files in my project are decoded as Chinese in the background?
I'm open to using other text editors.
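As a workaround outside the editor, a minimal sketch that searches a whole tree without opening any tabs (assuming the files are UTF-8 or GB18030 encoded; the directory, file pattern, and search term are placeholders):

    import pathlib

    needle = "中文"              # placeholder search term
    root = pathlib.Path("src")   # placeholder project directory

    for path in root.rglob("*.py"):  # placeholder file pattern
        for enc in ("utf-8", "gb18030"):  # try common Chinese encodings in order
            try:
                text = path.read_text(encoding=enc)
            except (UnicodeDecodeError, OSError):
                continue
            for lineno, line in enumerate(text.splitlines(), 1):
                if needle in line:
                    print(f"{path}:{lineno}: {line.strip()}")
            break  # stop after the first encoding that decodes cleanly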

LaTeX-generated PDF unreadable

Of late, I have observed that PDFs generated from LaTeX files are unreadable in certain email clients (when previewing the attachment in Outlook) as well as in printed hard copy: math symbols like inner products, integrals, etc. overlap with each other, making the file ugly and unreadable. Surprisingly, the same file looks perfectly fine when viewed using the ShareLaTeX built-in PDF viewer as well as the desktop version of Adobe Reader.
The ShareLaTeX documentation suggests switching the PDF viewer from built-in to native. Upon changing to native, even the browser version had unreadable characters.
[https://www.sharelatex.com/learn/Kb/Changing_PDF_viewer]
So, I would like to know if there is a better way to compile the tex file in ShareLaTeX so that it is readable across platforms and in print.
Most of the "PDF generation from TeX" issues posted on Stack Overflow point out problems with viewing images, but the PDF files I am generating don't contain any images.
Thanks in advance!
AFAIK there's not a single built-in PDF viewer (browser, e-mail client, ...) that works well. But what you could test is whether \usepackage{lmodern} makes things better ...
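For reference, a minimal preamble to test the lmodern suggestion. The idea is that lmodern (with T1 font encoding) gives you vector Latin Modern fonts instead of the bitmap Computer Modern fonts that some viewers render badly; this is a sketch to experiment with, not a guaranteed fix:

    \documentclass{article}
    \usepackage[T1]{fontenc}  % use 8-bit font encoding with vector glyphs
    \usepackage{lmodern}      % Latin Modern: vector replacement for Computer Modern
    \begin{document}
    $\langle u, v \rangle = \int_0^1 u(x)\,v(x)\,dx$
    \end{document}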

Search words in PDF files

Is it possible to search for words in PDF files with Delphi?
I have code with which I can search in many other file types (exe, dll, txt), but it doesn't work with PDF files.
It depends on the structure of the specific PDF.
If the PDF is made of images (scanned pages), then you have to OCR each image and build a full-text index inside the PDF. (To see if it's image-based, open it with Notepad and look for obj tags full of random characters.) There are a few utilities and apps that do this kind of work for you; CVision PDF Compressor is one that I have used before.
If the PDF is a standard PDF, then you should be able to open it like any other text file and search for the words.
Here is a page that will detail some of the structure of a PDF. This is an SO post on the same topic.
The components/libraries mentioned in the answer to this question should do what you need.
I'm just working on a project that does this. The method I use is to convert the PDF file to plain text (with pdftotext.exe) and create an index on the resulting text. We do the same with Word and other Office files; it works pretty well!
Searching inside PDF files directly from Delphi (without an external app) is more difficult, I think. If you find anything, please update here, as I would also be very interested in that!
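To illustrate the extract-then-search step of that approach, here is a minimal sketch (assuming pdftotext is on the PATH; the file name and search word are placeholders):

    import subprocess

    # Extract plain text from the PDF with pdftotext (part of Xpdf/Poppler),
    # writing to stdout ("-"), then search the result line by line.
    result = subprocess.run(
        ["pdftotext", "report.pdf", "-"],   # "report.pdf" is a placeholder
        capture_output=True, text=True, check=True,
    )
    for lineno, line in enumerate(result.stdout.splitlines(), 1):
        if "invoice" in line.lower():       # placeholder search word
            print(lineno, line.strip())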
One option I have used is Microsoft's IFilter technology, which is used by Windows Desktop Search and many other products such as SharePoint and SQL Server full-text search.
It supports almost any office/office-like file format, even dwg, msg, pdf, and files in zip/rar archives.
The easiest way to use it is to run FiltDump.exe on any files you have and index the text output.
To find out which filters are installed on your PC, you can use IFilter Explorer.
Wikipedia has some links on its IFilters page.
Quick PDF Library's GetPageText function can give you the words from a PDF as well as the page number and the coordinates of those words, which is sometimes useful for highlighting.
PDF is not just a binary representation. Think of it as a tree of objects, where an object node has some metadata and some content information. Some of these objects have string data, some don't. Some of these are even encrypted, and some are compressed. So, there's very little chance your string finder will work on any arbitrary PDF.

RTF editor

I have templates written in RTF (with some tags that are replaced by data from the DB in my app), but when I edit them in MS Word, Word adds invisible tags to the templates, which break my tags (so I have to open the template in Notepad and edit the code by hand).
Do you know of an RTF editor that strictly follows the RTF specification?
Thanks
On Windows, the included app WordPad is pretty decent, in my opinion.
The RTF spec allows an RTF editor such as Word or a third-party control to sprinkle tags in between the RTF text, provided that the actual RTF display is maintained. For this reason, there is no way to guarantee that your original template text will not be disturbed, and that is why I recommend using an RTF editor API to do any search/replacement within your template. The RTF editor knows to put aside the RTF tags and access the original text as intended.
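To make the failure mode concrete, here is a naive replacement over the raw RTF (a sketch; the %%NAME%% placeholder syntax is made up for illustration). It only works as long as Word hasn't split the tag across control words, which is exactly what the spec permits:

    # Naive RTF placeholder substitution -- fragile by design of the RTF spec.
    # Word may rewrite "%%NAME%%" as e.g. "%%NA{\rtlch}ME%%", breaking this match,
    # which is why an RTF editor API that works on the text layer is safer.
    with open("template.rtf", encoding="cp1252") as f:  # placeholder file name
        rtf = f.read()

    rtf = rtf.replace("%%NAME%%", "Alice")  # works only on an undisturbed template
    with open("letter.rtf", "w", encoding="cp1252") as f:
        f.write(rtf)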
OK, I know that Google finds a bunch of editors, but I don't have time to try each of them to find the best one, so I am asking for advice on which is good, not which is available.
EDIT: I found this solution, TE Edit, and have been using it for weeks. It is very good; I recommend it.
