Hi, I am trying to parse a PDF file. I can extract the text from the PDF, but if the PDF is compressed (using FlateDecode) I get junk characters. How do I decompress the text, and how do I find out which filter was used?
If you are working in C++, you can use the zlib library to decompress the bytes of a page's content stream.
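As for the second part of the question: the filter is named in the `/Filter` entry of the stream's dictionary, and `FlateDecode` data is ordinary zlib-compressed data. Below is a minimal sketch of both steps, written in Python (whose standard `zlib` module wraps the same library) rather than C++. The function names and the regular expressions are illustrative assumptions, not a full PDF parser: the sketch assumes unencrypted files, handles only a single `FlateDecode` filter, ignores `/DecodeParms` predictors, and does not look inside object streams.

```python
import re
import zlib

# An indirect object whose dictionary is followed by the "stream" keyword.
# The (?!endobj) guard keeps the match inside a single object.
OBJ_STREAM_RE = re.compile(
    rb"\d+\s+\d+\s+obj\s*<<((?:(?!endobj).)*?)>>\s*stream\r?\n", re.DOTALL
)


def stream_filters(dictionary):
    """Return the names in the /Filter entry, e.g. [b"FlateDecode"]."""
    m = re.search(rb"/Filter\s*(\[[^\]]*\]|/\w+)", dictionary)
    return re.findall(rb"/(\w+)", m.group(1)) if m else []


def direct_length(dictionary):
    """Return /Length if it is a plain integer, or None if absent or indirect."""
    m = re.search(rb"/Length\s+(\d+)(\s+\d+\s+R)?", dictionary)
    if m is None or m.group(2):
        return None
    return int(m.group(1))


def decode_streams(pdf_bytes):
    """Yield (filters, data) per stream; single-FlateDecode data is inflated."""
    pos = 0
    while True:
        m = OBJ_STREAM_RE.search(pdf_bytes, pos)
        if m is None:
            return
        dictionary, start = m.group(1), m.end()
        length = direct_length(dictionary)
        if length is None:
            # Fallback: scan for the keyword (can misfire on binary data).
            end = pdf_bytes.find(b"endstream", start)
            if end == -1:
                return
        else:
            end = start + length
        data = pdf_bytes[start:end]
        filters = stream_filters(dictionary)
        if filters == [b"FlateDecode"]:
            # decompressobj tolerates trailing end-of-line bytes after the data.
            data = zlib.decompressobj().decompress(data)
        yield filters, data
        pos = end
```

Streams that use any other filter, or a chain of filters, are yielded undecoded together with the filter names, so the caller can see what still needs handling.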
Related
I am converting PDF files to images with ImageMagick, but this particular file comes out as gibberish.
To keep things simple, I am just running:
convert file.pdf out.jpg
My guess is that the file mixes text pages and scanned-image pages, and that this could be causing the trouble. Can you help?
The text pages of the document are converted to this gibberish; the last page, which is actually a scan, is fine.
This is the link to the original file.
EDIT: I found that files without a mix of text and scans also cause the issue; in fact it is any file that contains text data rather than a scanned image. So the question is how to set up ImageMagick to convert a pure-text PDF to an image without getting this output.
The problem was with Ghostscript 9.22, which ImageMagick uses to render PDFs. Updating to 9.23 fixes it.
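To check whether a machine has the version reported as bad, something like the following sketch could be used. It assumes the Ghostscript executable is called `gs` and is on the PATH (on Windows it usually has a different name); the helper names are made up for illustration, and only 9.22 is flagged because that is the only version the answer mentions.

```python
import re
import subprocess

REPORTED_BAD = (9, 22)  # the version the answer above reports as broken


def parse_version(text):
    """Turn a string such as "9.22" into a comparable tuple like (9, 22)."""
    m = re.search(r"(\d+)\.(\d+)", text)
    if m is None:
        raise ValueError(f"no version number found in {text!r}")
    return int(m.group(1)), int(m.group(2))


def is_reported_bad(version_text):
    """True only for the 9.22 release named in the answer."""
    return parse_version(version_text) == REPORTED_BAD


def installed_ghostscript_version(executable="gs"):
    """Ask Ghostscript for its version string (assumes `gs` is on PATH)."""
    result = subprocess.run(
        [executable, "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling `is_reported_bad(installed_ghostscript_version())` before running `convert` would then tell you whether an update is worth trying first.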
I am reading a PDF file stored locally (using NSBundle) and converting it to text.
But when I try to read the PDF over HTTP, i.e. from a URL, and pass that path to my PDF-to-text converter, it returns nil.
Any solutions would be appreciated.
My basic question is: how do I read a PDF file from a URL path?
There are many restrictions on converting a PDF file to plain text that way. If you just want to display the PDF in your app, use PDF Reader Core.
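If text extraction is still needed, one common workaround is to download the file first and hand the converter a local path, since a converter that expects a file path will generally not fetch a remote URL itself. Whether that is the cause of the nil here is an assumption. The sketch below shows the idea in Python rather than the asker's iOS code; `fetch_to_local_file` is a hypothetical helper name.

```python
import shutil
import tempfile
import urllib.request


def fetch_to_local_file(url, suffix=".pdf"):
    """Download `url` and return the path of a local temporary copy.

    The caller is responsible for deleting the file when done.
    """
    with urllib.request.urlopen(url) as response, \
            tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as out:
        # Stream the response body to disk without loading it all in memory.
        shutil.copyfileobj(response, out)
        return out.name
```

The returned path can then be passed to the same converter that already works for the bundled file.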
I am trying to convert Office files to PDF using POI and iText. I can do the basic conversion, where I read the Word file using WordExtractor and write the contents to a PDF file using the PDF writer.
However, this does not retain the structure (tables, styles, etc.). I have read on a forum that you can retain the formatting using Tika. Are there any working examples of this?
I know that Apache Tika is a text extractor. It can extract text from doc, pdf, ppt, and many other file formats. I now need this functionality on iOS, so I want to know: is there an alternative to Apache Tika for iOS?
If there is no such library for iOS, please point me to tools that can extract text from specific file formats.
Thank you in advance.
libopc for extracting text from docx, xlsx, and pptx.
Antiword for older MS formats.
You can also extract strings from a PDF using CoreGraphics, or
using PDFiPhone.
If you also need to extract text from an HTML document, have a look at NSXMLParser.
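For the docx case it may help to see why a small library is enough: a .docx file is a ZIP package, and the body text conventionally lives in `word/document.xml` inside `w:t` elements. The sketch below illustrates that in Python rather than with libopc or any iOS API; `docx_paragraphs` is a hypothetical helper, and it assumes the main part is at the conventional path instead of resolving it through the package relationships.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML main namespace, in ElementTree's {uri}tag notation.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path_or_file):
    """Return the text of each paragraph in a .docx file, in document order."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(W + "p"):
        # A paragraph's text is split across runs; join the w:t pieces.
        text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
        paragraphs.append(text)
    return paragraphs
```

Paragraphs nested inside other paragraphs (text boxes, for example) would be reported twice by this simple walk, and headers, footers, and footnotes live in other parts of the package.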
I have a bunch of Word documents (.doc) stored in my SQL database. I need to open each one, clear properties such as Title, Subject, etc., and then save the file back to the database.
Is it even possible to open a ".doc" file from a stream?
Word is not able to open .doc files from a stream in memory. To open the file you would have to save the document to a temporary location first.
However, Word's little-known RTF converter interface can be used to load documents from streams in RTF format. If using RTF instead of the binary format is an option for you [1], you might want to have a look at the WinWord Converter SDK:
How to Obtain the WinWord Converter SDK (GC1039)
For an import converter you would have to implement the ForeignToRtf method that will be called by Word to receive the RTF input.
[1] Actually you can still save the files in the .doc format; however, you would have to convert the .doc file to RTF first using the SDK and then open the RTF stream in Word. The conversion from the binary format to RTF and vice versa should be mostly lossless as the RTF format has been developed in sync with the binary format. However, it should be borne in mind that using the RTF converter interface will not allow you to use any of the new features introduced with OpenXML/Office 2010.
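The temporary-file route mentioned at the start of this answer can be sketched as follows. This is Python rather than the asker's environment, and it deliberately leaves out both the database access and the Word automation; `temporary_document` is a hypothetical helper that only covers writing the stored bytes to disk and cleaning up afterwards.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_document(blob, suffix=".doc"):
    """Write `blob` (bytes) to a temporary file and yield its path.

    The file is removed when the `with` block exits, so read any
    modified contents back before leaving the block.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(blob)
        yield path
    finally:
        os.remove(path)
```

Inside the `with` block you would open the yielded path in Word, clean the properties, save, and read the bytes back for storage. Word has to have closed the file by then, otherwise the cleanup step may fail on Windows.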
I'm pretty sure the Word Document object implements IPersistStream (the COM interface). I know for certain that it implements IPersistFile.
It's not the easiest thing to work with, and since it's COM, it doesn't interoperate well with .NET streams, but I believe it would be doable using IPersistStream.