got wrong characters encoding using pdfbox to extract text from pdf


Recently I had to index PDFs into Elasticsearch, using PDFBox to extract the text from each PDF. However, the extracted text comes out with the wrong character encoding, like this:
Ýëĭ2ĈjŬj§ė¥
1 ŋ?nĳ"2$ 2016£ 2Ú 5Õ,”Òªj§?ně#ĳ"2ě
^ë2ļŘœ A$j§?n 2016£ě#ëÖĭ2Ĉļê
2 èÅŋ?n$ 2016£ 2Ú 6ÕöĿS¿ ĿS¿ ĿS
Õ¿ ĿSÖ¿ eöĿS&ØºĨĘ
http://www.sse.com.cnLćĈ
A$j§Ýëĭ2ĈŘĐ
My code is exactly the same as the example on the linked page. I tried PDFBox versions from 0.8.x to 2.0.x, but none of them worked.
Any help or advice would be appreciated!

I got the answer from @Tilman's comment.
See pdfbox.apache.org/1.8/faq.html#notext, and the answer below as well.

Related

Inp(Abaqus) mesh file conversion into .msh file via meshio

I'm using meshio to convert an .inp mesh file into a form compatible with Gmsh (the .msh file type). Although the conversion (meshio convert input.inp output.msh) generates a .msh file, Gmsh can't read it because it contains unusual characters. Please check the attached .png. Does anyone have an idea, or even better a solution, to fix it?
Thanks in advance.
output.msh file
I tried to convert the .msh output file to ASCII, but it did not work.

importxml of url with Hebrew returns in encoding other than UTF-8 that chrome doesn't recognize

For example, in the dummy spreadsheet (tab 'desired outcome'), under "Link 1" you will see this URL:
http://www.promotion-il.co.il/service/%5DE%5E4%5D9%5E5-%5E8%5D9%5D7-%5D7%5E9%5DE%5DC%5D9-%5DC%5E2%5E1%5E7%5D9%5DD/
However, the actual URL in UTF-8 is:
http://www.promotion-il.co.il/service/%D7%9E%D7%A4%D7%99%D7%A5-%D7%A8%D7%99%D7%97-%D7%97%D7%A9%D7%9E%D7%9C%D7%99-%D7%9C%D7%A2%D7%A1%D7%A7%D7%99%D7%9D/
The actual URL string that contains Hebrew is:
http://www.promotion-il.co.il/service/מפיץ-ריח-חשמלי-לעסקים/
I will also add that the same URL has returned with a proper UTF-8 encoding for other blog posts. (See second example in the same tab).
Why is it happening?
How can it be fixed?
Thanks in advance!

This is the solution I came up with eventually:
I saw that, for the imported URLs, two replacements were needed to fix a broken URL:
5D --> D7%9
5E --> D7%A
I used this formula in a separate column to achieve it:
=ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE(<COLUMN WITH IMPORTED URLS HERE>,"5D","D7%9"),"5E","D7%A"))
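The same two substitutions can be sketched outside the spreadsheet. This is a minimal Python sketch, assuming the damage always has the shape seen in the example URL (`%5D` or `%5E` followed by one hex digit). `repair_url` and `PREFIXES` are hypothetical names, and a regular expression is used in place of a plain text replace so that only those escapes are touched.

```python
import re
from urllib.parse import unquote

# Prefixes taken from the two replacements described above.
PREFIXES = {"5D": "D7%9", "5E": "D7%A"}

def repair_url(url):
    """Rewrite broken escapes such as %5DE into UTF-8 escapes such as %D7%9E."""
    def fix(match):
        # group(1) is "5D" or "5E"; group(2) is the hex digit that follows it.
        return "%" + PREFIXES[match.group(1).upper()] + match.group(2)
    return re.sub(r"%(5[DEde])([0-9A-Fa-f])", fix, url)

broken = (
    "http://www.promotion-il.co.il/service/"
    "%5DE%5E4%5D9%5E5-%5E8%5D9%5D7-%5D7%5E9%5DE%5DC%5D9-%5DC%5E2%5E1%5E7%5D9%5DD/"
)
fixed = repair_url(broken)
print(fixed)           # the percent-encoded UTF-8 URL
print(unquote(fixed))  # the same URL with the Hebrew path decoded
```

One caveat: a URL that legitimately contains `%5D` (an encoded `]`) followed by a hex digit would be rewritten as well, so this is only safe on URLs known to be broken in this way.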

How to decode windows-874 imap subject?

I'm having a serious problem with IMAP decoding. I received an email that appears to be encoded in windows-874, and this makes the whole message unreadable. I tried iconv('tis-620','utf-8',$txt), but had no luck.
I've searched everywhere for an answer, but it seems nobody has had this problem before (or maybe I'm not searching for the right words?).
The subject is :
Charset : ASCII
=?windows-874?Q?=CB=E9=CD=A7=BE=D1=A1=C3=D2=A4=D2=BE=D4=E0=C8=C9=CA=D3=CB=C3=D1=BA=A7=D2=B9=E4=B7=C2=E0=B7=D5=E8=C2=C7=E4=B7=C2=A4=C3=D1=E9=A7=B7=D5=E8
30
=E2=C3=A7=E1=C3=C1=CA=C7=D1=CA=B4=D5=CA=D8=A2=D8=C1=C7=D4=B7=AB=CD=C2 8?=
So, please tell me what the encoding is, if it's not tis-620. How can I decode this into human-readable text?

I finally found a way. First, I created functions that detect which encoding name appears in a given text.
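// Returns the position of "windows-874" in $str, or false if it is absent.
function win874($str){
    return strpos($str, "windows-874");
}
// Returns the position of "UTF-8" in $str, or false if it is absent.
function utf8($str){
    return strpos($str, "UTF-8");
}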
Then I convert with php functions:
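$subject = $headers->subject;
if (win874($subject) === false && utf8($subject) === false) {
    // Neither charset name is present: print the subject unchanged.
    echo $subject;
}
if (win874($subject) !== false) {
    // "=?windows-874?Q?text?=" splits on "?" into ["=", "windows-874", "Q", "text", "="].
    $subj0 = explode("?", $subject);
    echo $subj0[3];
}
if (utf8($subject) !== false) {
    echo imap_utf8($subject);
}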
Because a windows-874 subject always begins with "=?windows-874?Q?", I used a simple function, explode(), to separate the actual text from the surrounding markers. As I said, the text always comes after the third question mark, so that gives me the subject.
But a problem remains: I still have to change the browser encoding to Thai to make the text readable (Settings > Tools > Encoding > Thai in Chrome). Any suggestions?
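The explode() approach only isolates the payload of the encoded word; the bytes it yields are presumably still windows-874, which would explain why the browser has to be switched to Thai. Decoding the header to Unicode and emitting UTF-8 avoids that. The following is a minimal Python sketch of the idea, not a drop-in replacement for the PHP above. `decode_subject` and `CHARSET_ALIASES` are hypothetical names, the alias table assumes the charset should map to Python's `cp874` codec, and the sample header is just the first four bytes of the subject shown in the question.

```python
from email.header import decode_header

# Assumption: map the header's charset label to the codec name Python ships.
CHARSET_ALIASES = {"windows-874": "cp874"}

def decode_subject(raw):
    """Decode an RFC 2047 encoded subject into a Unicode string."""
    parts = []
    for payload, charset in decode_header(raw):
        if isinstance(payload, bytes):
            name = (charset or "ascii").lower()
            codec = CHARSET_ALIASES.get(name, name)
            parts.append(payload.decode(codec, errors="replace"))
        else:
            # Headers without encoded words come back as plain str.
            parts.append(payload)
    return "".join(parts)

sample = "=?windows-874?Q?=CB=E9=CD=A7?="
subject = decode_subject(sample)
print(subject)                  # Thai text as a Unicode string
print(subject.encode("utf-8"))  # bytes suitable for a page served as UTF-8
```

With the text held as Unicode, the page can declare UTF-8 as its charset and no manual browser setting should be needed.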

Convert sxw to rml error

When I try to convert an .sxw file to an .rml file using OpenOffice, this error occurs:
Exception: 'ascii' codec can't encode character u'\xe9'
What does this error mean, and how can I fix it?

Please check the linked question, "UnicodeEncodeError when trying to convert Django models to XML". It is the same issue as the one here.
You can use yourfield.encode("utf-8"), or use format() in OpenERP: [[format(obj.your_str_field or '')]]
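The error means that a string containing a non-ASCII character (here U+00E9, "é") was passed through the ASCII codec, which only covers code points below 128. A minimal sketch that reproduces the failure and the suggested fix follows; it is written in Python 3 syntax as an assumption, whereas the original error message comes from Python 2, and the sample string is made up for illustration.

```python
# Hypothetical sample value containing U+00E9, the character named in the error.
text = "caf\u00e9"

try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    # The ASCII codec cannot represent code points of 128 and above.
    print("ascii failed:", exc.reason)

# Encoding to UTF-8 succeeds because UTF-8 can represent every code point.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)
```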

A similar issue was reported on Launchpad: https://bugs.launchpad.net/openobject-server/+bug/956798
It has been fixed on the linked branch. You can take the patch and apply it, which will make your report tolerate Unicode characters.
Thank you.

convert jruby 1.8 string to windows encoding?

I want to export some data from my JRuby on Rails web app to Excel, so I create a CSV string and send it to the client as a download using
send_data(text, :filename => "file.csv", :type => "text/csv; charset=CP1252", :encoding => "CP1252")
The file seems to be in UTF-8, which Excel cannot read correctly. I googled the problem and found that iconv can convert encodings. I tried to do that with:
ic = Iconv.new('CP1252', 'UTF-8')
text = ic.iconv(text)
but sending the converted text does not make any difference: it is still UTF-8 and Excel cannot read the special characters. There are several solutions using iconv, so this seems to work for others. When I convert the file manually with iconv in the Linux shell, it works.
What am I doing wrong? Is there a better way?
I'm using:
- jruby 1.3.1 (ruby 1.8.6p287) (2009-06-15 2fd6c3d) (Java HotSpot(TM) Client VM 1.6.0_19) [i386-java]
- Debian Lenny
- Glassfish app server
- Iceweasel 3.0.6
Edit:
Do I have to include some gem to use iconv?
Solution:
S.Mark pointed out this solution:
You have to use UTF-16LE encoding to make Excel understand it, like this:
text= Iconv.iconv('UTF-16LE', 'UTF-8', text)
Thanks, S.Mark for that answer.

In my experience, Excel cannot handle UTF-8 CSV files properly. Try UTF-16 instead.
Note: Excel's Text Import Wizard appears to work with UTF-8 too.
Edit: A search on Stack Overflow turned up this page; please take a look at it.
According to that page, adding a BOM (byte order mark) signature to the CSV makes Excel open its Text Import Wizard, so you could use that as a workaround.
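The two encodings discussed in these answers can be sketched as follows. This is a minimal Python sketch rather than the Ruby used in the question; the helper names and the sample CSV text are hypothetical, and prepending a BOM to the UTF-16LE output is an assumption that combines both suggestions. Whether a given Excel version opens the result directly is not something this sketch can confirm.

```python
import codecs

def csv_bytes_utf16le(text):
    """Encode CSV text as UTF-16LE, prefixed with a byte order mark."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")

def csv_bytes_utf8_bom(text):
    """Encode CSV text as UTF-8 with a BOM signature ("utf-8-sig")."""
    return text.encode("utf-8-sig")

# Hypothetical sample data with non-ASCII characters.
sample = "name,city\nZo\u00eb,K\u00f6ln\n"

utf16_payload = csv_bytes_utf16le(sample)
utf8_payload = csv_bytes_utf8_bom(sample)
print(utf16_payload[:2])  # the UTF-16LE byte order mark
print(utf8_payload[:3])   # the UTF-8 signature bytes
```

Either payload would then be sent as the download body, with the Content-Type charset set to match the encoding actually used.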

Do you get the same result with the following?
cp1252= Iconv.conv("CP1252", "UTF8", text)
