How to use Tika to extract the content from a PPT file? - apache-tika

Fellow programmers! I am extracting a PPT file with Tika. The file contains only plain text. However, the content type that Tika reports is a JPG format! How can I handle this? I want this case to be detected as plain text.

I changed some of the Tika source code so that I get the content I want. With that change, extracting the PPT file gives the right result.

Related

objective-c, PDF, How to solve the "failed to parse embedded CMap." issue in PDF searching?

I am trying to search for text in PDFs. My project works fine on most PDFs, but it fails to find text in some of them, and Xcode shows this message in the console:
"failed to parse embedded CMap." How can I solve this issue so that I can search text in all PDFs? Any suggestion would be great. Thanks in advance.
In general, it is impossible to search for text in all PDFs. There are two main reasons:
PDFs use character codes that do not necessarily correspond to Unicode. A CMap (such as a ToUnicode CMap) can associate PDF character codes with Unicode values, but it is not required to be present in the PDF document.
Even if a CMap is included, the characters of the text are not guaranteed to appear in reading order in the PDF document. A PDF places the glyph for each character code by geometry, not by text order.
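To make the first point concrete, here is a minimal Python sketch of how a ToUnicode CMap associates character codes with Unicode. The CMap fragment is made up for illustration, and `parse_to_unicode` is a hypothetical helper, not part of any PDF library. It handles only the simple `bfchar` and `bfrange` forms (not the array form of `bfrange`), so treat it as a sketch rather than a complete parser.

```python
import re

# Made-up ToUnicode CMap fragment, for illustration only.
SAMPLE_CMAP = """
2 beginbfchar
<0003> <0020>
<0024> <0041>
endbfchar
1 beginbfrange
<0025> <0027> <0042>
endbfrange
"""

HEX = r"<([0-9A-Fa-f]+)>"


def parse_to_unicode(cmap_text):
    """Build a {character code: Unicode string} table from bfchar/bfrange blocks."""
    mapping = {}
    # bfchar: one source code maps to one UTF-16BE encoded destination string.
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    # bfrange: codes lo..hi map to consecutive code points starting at dst.
    # Assumption: dst is a single BMP code point (no surrogate pairs).
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            start = int(dst, 16)
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(start + offset)
    return mapping


table = parse_to_unicode(SAMPLE_CMAP)
print(table[0x24])  # A
print(table[0x26])  # C
```

If a font has no such table, or the table cannot be parsed, the character codes in the page content cannot be turned back into searchable text this way.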

Is there a way to convert text stored in a textview text storage as HTML characters?

For example, I have a mini RTF editor that consists of a textview, and I change the sizes of the text in the text storage. Is there a way I can get these values as HTML? Or would I have to parse it manually?
There is no one-to-one conversion from the RTF spec to the HTML spec. You will either need to parse and convert it yourself, or use a third-party RTF-to-HTML converter.
Since your ultimate goal is to convert the RTF content to PDF, you might like to consider an RTF to PDF Converter.

RTF file to TXT/CSV file in objective-c?

I have RTF files containing that sort of content:
long_text_description_1 number1a number1b number1c
long_text_description_2 number2a number2b number2c
long_text_description_3 number3c
long_text_description_4 number4a number4b number4c
…
I need to extract the plain raw text without the colours, fonts and other formatting.
The only thing I need to keep is the most basic row/column information; ideally I would like a CSV file.
The files I get contain all the formatting:
{\cs18\lang1033\langfe1033\f0\b\i0\ul0\strike0\scaps0\fs15\afs15\charscalex100\expndtw0\cf1\dn0 number1a}
What is the best way to remove all the RTF information while keeping only the row information?
Trying to work out many regular expressions myself sounds dangerous without a complete understanding of the RTF format.
What I could find on the Internet mostly focused on Windows languages and libraries that are unavailable on iOS.
All RTF control words have the form \xxx.
Try a regular expression that matches a backslash followed by any run of non-whitespace characters, \\\S+ (written "\\\\\\S+" in an Objective-C string literal), and replace every match with an empty string.
For your example, you'll end up with { number1a}.
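As a rough illustration of this approach, here is a Python sketch run against the sample line from the question. The `CONTROL_WORD` pattern and the `strip_rtf_group` helper are assumptions made for this example; they do not handle escaped characters such as `\{` or `\'hh`, nor destinations like font or colour tables, so this is not a complete RTF parser.

```python
import re

# The answer's idea: a backslash followed by a run of non-whitespace characters.
ANSWER_PATTERN = re.compile(r"\\\S+")

# A slightly narrower pattern (an assumption for this sketch): a control word is
# a backslash, letters, an optional signed number, and one optional delimiter space.
CONTROL_WORD = re.compile(r"\\[a-zA-Z]+-?\d* ?")


def strip_rtf_group(text):
    """Drop control words and group braces, keeping only the visible text."""
    without_controls = CONTROL_WORD.sub("", text)
    return without_controls.replace("{", "").replace("}", "").strip()


sample = (
    r"{\cs18\lang1033\langfe1033\f0\b\i0\ul0\strike0\scaps0"
    r"\fs15\afs15\charscalex100\expndtw0\cf1\dn0 number1a}"
)

print(ANSWER_PATTERN.sub("", sample))  # { number1a}
print(strip_rtf_group(sample))         # number1a
```

The narrower pattern leaves ordinary text that happens to follow a control word without a space untouched, which the non-whitespace version would delete.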

iOS: how to decode PDF CIDFontType2 text

I want to search for Chinese text in a PDF. I am using CGPDFScanner, but I can't get the correct text from a CIDFontType2 font.
My font object has a ToUnicode entry.
The font name is HFKAAO+LinGothic-Bold.
It has a CIDToGIDMap entry with the name Identity (the PDF documentation says this means a TrueType font program is embedded).
CIDSystemInfo:
Registry is Adobe
Ordering is Identity
The FontDescriptor has a FontFile2 entry with the filter FlateDecode.
I read that I should just inflate the text I get from Tj, but that does not work. I used zlib to inflate the text, and it does not seem to produce correct data.
Is there any sample code that I can study?
I found https://github.com/KurtCode/PDFKitten, but it does not work with Chinese.
I found the problem.
I just use the CMap to translate the string from Tj.
There was a bug in the code that decodes the CMap.
After I fixed the bug, everything is OK.
Thanks~
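The original code was not posted, so the following is only a Python sketch of the idea of translating a Tj string through a CMap table. It assumes every character code in the string is two bytes wide, which is an assumption about the font's encoding and not something stated above. `TO_UNICODE` is a made-up table and `decode_cid_string` is a hypothetical helper.

```python
# Made-up code-to-Unicode table, standing in for a parsed ToUnicode CMap.
TO_UNICODE = {
    0x0001: "中",
    0x0002: "文",
}


def decode_cid_string(raw, to_unicode, missing="\ufffd"):
    """Translate the raw bytes of a Tj operand into text, two bytes per code."""
    if len(raw) % 2:
        raise ValueError("expected an even number of bytes for 2-byte codes")
    chars = []
    for i in range(0, len(raw), 2):
        code = int.from_bytes(raw[i:i + 2], "big")
        # Codes without a mapping become U+FFFD instead of raising an error.
        chars.append(to_unicode.get(code, missing))
    return "".join(chars)


print(decode_cid_string(bytes.fromhex("00010002"), TO_UNICODE))  # 中文
```

No zlib step appears here: FlateDecode is a filter on streams such as the embedded font file, so the string operand of Tj is looked up in the table as it is.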

What character encoding is this?

When I back up my BlackBerry using BlackBerry Desktop Manager, it saves the backup as an .ipd file.
It's in hex, and I'm not sure if it's any particular type. I used software called ABC Amber Text Converter to convert this .ipd file into plain text. Some of it comes out as plain text, like all the messages saved in the backup file. But some of the text in the file looks like this:
qÖ²u_+;¢õ¿B[[¤†D`Ø,>p
|Cñ:ÌQ†nÁä¼sÒ®sKDv©{(]
)++³É«.gsn>
z
'‚51o4Kq
8Ütâ¯cí¿þ2´Õ|5kl$S,H
dbiIjz
*!~k$|
&*OÝ>0ðî­wã
+zno%q
2k;
YnÁÅŸ5|Xñ7Ú<}y2
A
V܉lO5‰<œtÅRI-I
Does anybody have any idea what the hell this is, or if there is any way I can decode it?
Thanks
It's just binary data. You may have been able to extract some text from the file where strings of text were stored, but the rest will be just bytes of data.
You'll need a specific program that understands these backup files. A quick google reveals a few choices, such as MagicBerry.
One of the Blackberry developers has helpfully blogged a bit of information about the binary format, so you could try using that to write your own program to parse it:
http://us.blackberry.com/devjournals/resources/journals/jan_2006/ipd_file_format.jsp
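To illustrate why only part of the file came out readable, here is a small Python sketch that pulls runs of printable ASCII out of arbitrary binary data, much like the Unix `strings` tool. The sample bytes are made up and do not follow the real .ipd layout, and `printable_runs` is a hypothetical helper.

```python
import re


def printable_runs(data, min_len=4):
    """Return every run of at least min_len printable ASCII bytes as text."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


# Made-up binary blob: readable strings embedded between non-text bytes.
blob = b"\x00\x01Hello world\xff\xfeab\x00Inbox\x10"

print(printable_runs(blob))  # ['Hello world', 'Inbox']
```

Everything between those runs is structural data such as lengths, record types and field identifiers, which only makes sense to a parser that knows the file format.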
