How to extract inline images from PDF using Apache Tika Server and save them as files? - apache-tika

Is there a way to do this? I'm using the following headers in my PUT request to http://localhost:9998/tika
"Content-Type", "application/pdf"
"X-Tika-OCRLanguage", "eng"
"X-Tika-PDFextractInlineImages", "true"
"X-Tika-PDFOcrStrategy", "no_ocr"
Will the response contain the images? And if so, how do I save them?
Using Apache Tika server 1.26

the response will be string not the image(s)
the flag: PDFOcrStrategy tells the tika to use ocr (tesseract) or only try to extract the text from document without ocr - useful for native pdfs
the flag: PDFextractInlineImages tells the tika to ignore/include the embedded images
so when You have the scanned pdf you should use
"X-Tika-PDFextractInlineImages", "true"
"X-Tika-PDFOcrStrategy", "ocr_only"
for native pdfs
"X-Tika-PDFextractInlineImages", "false"
"X-Tika-PDFOcrStrategy", "no_ocr"
but in both scenarios tika will return text
If you want to take the images from pdf document IMO you should use the pdf box, or similar library. The goal of tika is to return text from input

Related

Apache Tika detecting docx mimetype wrong

I'm using Apache tika to see if the file extension matches the actual mimetype.For example, if a file is named .pdf but is actually an .exe, it would return false.
I send in a .docx with the mimetype
application/vnd.openxmlformats-officedocument.wordprocessingml.document
Tika does a .detect on the content and says that it is a
application/x-tika-ooxml
Any idea what could be wrong? Obviously the method returns false in this case cause they're not a match.

Filename not assigned to csv export in HighCharts on Mac

I am using the EXPORT-CSV plugin for Highcharts to export data to csv. (Thank you to the developers of this plugin!) When testing in Safari on a Mac, however, the exported csv file does not take the filename as expected from
exporting: {
filename: "FancyFileName"
}
and instead just uses the default Highcharts name "chart". All the built-in export types do use the desired filename from Safari, and the csv also gets the desired filename from all the other standard browsers I have tested.
Here is a fiddle.
How can Safari be convinced to use the filename I give it? Thanks for your help.
The hard coded filename is coming from the php script written on the server. You can find that script on php script for download of file with hardcoded file name
You can use that same code on your local machine server. To do that
create a php file on your local server.
copy the above code in it.
change url www.highcharts.com/studies/csv-export/csv.php in export.csv library to your local server url.
change the name whatever you wish to using $_POST['variablename']
Variable name should be pass when making post call from export.csv library and use in php file using $_POST
Highcharts.post(url, {
data: content,
name: name,
type: MIME,
extension: extension
});
This is the code used in export.csv library for making post call. I have added extra parameter name to be used in my php script for dynamic filename.

Rails/Ruby save image as base64 and access it in the views

I would like to know can we convert a image into base64 and save it in a database and access it in the views.
I have searched google and stackoverflow, all of them starts from middle like encoding or displaying the image.
I need to know how can we convert a image url/path(lets say i store image inside my app and its url stored in column)
How to encode it as base64 before saving(should we convert to base64 first and save in db?).
How to display it in the views
You can split this task to three or four steps:
getting the image
encoding to base64
storing it in database (optionaly)
display it in views
Getting the image
From Assets pipeline
If you are using Rails asset pipeline for that, you can use Rails.application.assets hash to get to image: Rails.application.assets['image_name.png'].to_s will give you the content of image_name.png image.
from file - local or by url
Here is the question about that on StackOverflow.
encode
Base64 Ruby module docs tells how to use Base64 encoding in Ruby:
Base64.strict_encode64(your_content_here)
NOTE: in this case strict_encode64 is preferrable over just encode64 because it doesn't add any newlines. (credit goes to Sergey Mell for pointing that out)
From docs:
encode64 - ... Line feeds are added to every 60 encoded characters.
strict_encode64 - ... No line feeds are added.
Store it in database (optionaly)
I suggest you to create a separate ActiveRecord model for that, with field of type text to keep base64 representation of image.
Display it in views
You can provide data-url to src attribute of img tag, so, the browser will decode image from base64 and display it just like regular image:
<img src="_BASE64_HERE"/>
Don't forget to change image format to whatever format you are using in data:image/png section.
UPDATE (2018-08-22): I have tried to use urlsafe_encode64, as suggested by Xornand, and for me it produces the output that is not recognized as image by the browser.
Tried in both Firefox 61.0.2 and Chromium 68.0.3440.106.
For the sake of reference and to enable experimentation, here are results themselves.
Image used as "original" (resized it to be even more small to reduce the size of base64 output):
encode64:
/9j/4AAQSkZJRgABAQEAYABhAAD/4QBARXhpZgAASUkqAAgAAAABAGmHBAAB
AAAAGgAAAAAAAAACAAKgCQABAAAAZAAAAAOgCQABAAAAfwAAAAAAAAD/2wBD
AAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwc
KDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIy
MjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL/
/gA0T3B0aW1pemVkIGJ5IEpQRUdtaW5pIDMuMTQuMTQuNzI2NzA4NjAgMHhm
ZjIzZjM3OQD/wAARCAB/AGQDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAA
AAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIh
MUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3
ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKT
lJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi
4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQF
BgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMi
MoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZH
SElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJma
oqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq
8vP09fb3+Pn6/9oADAMBAAIRAxEAPwDyNj85+tBbFEvE7j0Y/wA6YxrA2HIx
Jqwp4rS0LwvfayyspWKFj99z1+grvrf4TCSEMLqVzjqu3+VTKSRUYtnmQrS0
WTy9Tib0DfyNdNq/wz1WwUvbAyp6MMH/AArl7eKaw1JUuImjdcgq4x2NS2mt
CopqSuZkLf6Vdn/ZevR/ADgaHAv/AE9E/qteaQn9/de6NXoHw/c/YLVSetye
PxFKrsXT/wAzoPFigaovH/LIfzNYHet/xW2dVHtGP61gfxVqcKLMNXojjFUI
zVyI0AaSN8tFRKw2iigDxi9+S8mX0c1b0mxa6nDNGXUnCr/eNPNgbu5aduIz
g4HU1oaZdvp16JBErIo2hCe1M3SO20uKGwgVZdW0+G9Yjy4XfcR7EjgfSuv8
M6pqc91LZzWitLD1aORQCOxGOoI9K8t1W90vU9k0cM1vMgySuOvpVrwvrM1l
qguXnYkptLE9QOgqJQW5tGfQ92Sa4+YXMIEZ/vNkiud8Q+GdM1C4U3EAKn7r
rwy/jV3Tdfi1G1yrBiv3sH2qw7xOgjBJ5PfNc8tNUax8zx7xb8OJtAt5tU09
3uLIofMB+9HnufUe9R+BDstrTP8Az8/4V7KGla3aEhJIWUgqwzkHqD7Vxh0z
StHuGitrRYvn3hdxIBP16UOfMrMpQszO8TtnVT7ItYWfmrvP7PttQgDSJGc8
cZyKzbvwaWQvZTZcfwMev41pGtF7nJLDSitNTnY2xVqFqptFLbzNDMjJIhwV
YYIqaNua2Odmkr8UVCrfLRQI8fkuZAkTq7KSuCQfSrumTmdzE7Ev1BJ5PtVc
2h8pR6E01LaRHDoSGByCO1O6Ntbm6I8KSfujnJNLp5Ny2YSMZPzH/CobK9md
TBOvDcbgOPxrS0q3FrMzMg2hSMdATSRdzodD1M6ULhgcuy8Z9feuz8K6m7Rs
8nzDOcV5vZ2sjYeT+Lk5712emSER4XACjtUTjc0gzupbw+SZVXI6kKOgrh9V
kj1LV5o7dlMipn3Fbdtq4is5uN+xGZlxyR3rza21Rv7QkugzbgeOe1Z8hpzn
daTPIlqI5wWKjhgBW7pV1NNOqgDHGD2+hFcPp+tQFCJxt3Hsa6fTr2OJCImK
c8NwSKiUS1I1fFehW+sWMl1arjULZMsg6yIO2PUdq82j4Nes6JdFplaQDzF4
LDoR/hXA+KdM/szxDcIi4hlPmx49D2/A5Fa05aWZx14Wd0ZyniimA8UVqYHI
f2YxUAKT9BT10iU/8smrtFt+igKDj0xUgszj+EAjpnvWWp1XRx8eiTHny8fU
1ZXSJY0JbBXqQCa61bMocMB+dP8AsIOFwSPXrRewHPQSIqtnAA/StbSbyJWd
JABjpmsnVITp95sK5VxxVCZ7hIfNAIReorQSN241rZp14DgE5xjqPpXG2MFz
OGkWKVgxyQqkk0GZ76QDkAnkDvWnYtJB8ylyxGAFOD/Opehe4W1o8TgzFkXI
3BgRj2Nd3oFpaRyrKLndkgDL8EHsa4GUzzXn7wSIrDOWFamlXX2Z1UuXj6Z6
YOf8DWctTRHpVsklndgRO5CneoI6gfeXjrxn8q534mzXE8lk9kkxePdv8sEg
A4xnH0NaFnrCSTRqXDsjI69iCDj/AB/OrmmXMRu78yuMRuEXP4nH6ipS1CTS
VzyIavfLwWGR6rRXsk2q6akhDFWPrRV6kc0exzYt125bJ5x171ZjRMBgBn37
f54pflI+RgWxywPQn61Irp8rE9T9AB/XtTuZFTUL2PT41eU5JbCr/WqTa7HK
h+zwOHI6ucD9OtXdW0watbgCURyq2UbHHTvzWNB4ZvlYia6iSNc5cAnFVFxt
qGpl3yfanNxPK3m5zurIvbtlQxlgQRxiu1j0S0Q4Y/aD3yeled6skkGry26c
DzMADsKaak9B7Ghp8IRN38Zq4kRwcdR2FV7dvL24BNaMO1U8xjggfnQ0NMdF
H5u1pZgoTkD+lSGzDMyxyDAP3QOTVAk4AUkY5GBUwvXRspuLduO/+FQ0aJmi
pW3uLeRWAYdQO/HWqUurEu7hyGdiSB64qPVLosgdk2yDAB/z7YrC8w4OR9DT
SIm+hpSamxcnOfq1FZhznjB+uKKqxlc7mbW4wxUEB3PY9PWs648RyqC0UmRj
GCDz9M1z7Ssw6E+h/wDr0EyDCuyKTwBn/wCtT5A5i/J4mvy52yMoI5Kt/wDr
qnJ4jvSmDO6jOcfhVSQBztMnA7VnzWzgMVOfenyhcsy6nqFyCyzOV6fexVBp
5klRnVt3c9ah8+WDCyLj+tWYbxfNUkgnPemlYNzctb1EjVpGwBST6ukjbUbC
9u1Vbm5glRfkAJ61HDFaMwWWIbCeqnBFLQpXNGOZnXhjnHQ1t2WlyahbkrKq
Mv8AfbFQWekaWsSMNQuMH+BcEj8xXQ28XhuBQLiW6fI5Es23H4Lis5M1Rzl/
avCfIlGHBznOQfpWFMJY3MYHOevtXWavf6NLtis4eQNibXJOe3Wsz7EJYVJ+
WTJ4Y8/lTiRM5vzH7LkfQ0VsGylPKqCP9oYNFWZ2LqaRMedwQcA5GBnPT+dT
x+G55QjuoUYJyT64/wDr126m1FuXWDdEyB2O3qehz9MdqmSKBAxODGg4DKRk
k5+o/Gq5hWOCl8MOoARCcn7w7+1V5fDtwgYlDkLnaRyRz/8AW/OvRzCm5I5F
xtIXDAnaep5/EYqZbeP7SGLFcIV5XIwecE9e4HXHFJTHynlf/CO3kilhbrIh
74yaiTwZNMob7MQu4/cznOcYwP5V69BaLIXO5d5GMrk4JPcdccgjJ4H51KsE
ciufOWNVj+Vwo477h+J7envRzhynkcHw91K4d/LMsSocev6fiK04fhlqu8Aa
hApyAQ69M9Mnp+HuK9KmDKy+RC+TglyflHJwQfX6/wD16mgikmMZSaJdoJYk
gsWHGMgf/XqHJlKKPOovhnqzXBR9aiUA4yik8YznH4VswfCiwkPmXWq3kxUE
lUCqT6DGPr+Vd5GrWrKIwXUqxycFwccf1/Sk3MUiXa6MxCsfKJUYIHfvz19v
yV2By8XgvTLBFitLWJSyk72OW47kfhWTc6TdW7yqU/dISCSpPy8dwTnHNegv
GoAizH5oAJBXIkx6jt25z/KmSxGUqhQLGwJYseoBHyj06nii4Hmi2CwqEMg+
uDz+VFehrpNm67sbc9uf6Yopgc2LlYrUSRmQxbifkB6AjBGMce3epLaJkYzf
vGMq/IrEkD2OO+fTPaq9vP5ixIYkMkyEyehVeP6irULLGsZLO4UGRUXAbAA6
9jyfXvVEk6P587xzLtTftUq3PbB9c85okCLbkvviQkBGwxb1HqfQ/nxUT2zz
xqY3O+QCRMSbQo/75PNKpULDtDP5rELliAcj3z2J/OkMekUtvK0cYRkEahGV
dmck9cn0AOetSLd7IxNK25WRdqqBg5ByTnjoCTUkJkkgl+0IglRsAZ47ZGR6
f0qAXm+KGJCFIUAEOxYDjuVPP40ASIjGTz4RDJLLj5mUAEj7v69xU1tciV18
9CSSzHBGBjscY9/z7GkDTm5t5ZZFbBMYC9/XnjuKhUyyWsaxIsJRmXaVVgMk
jHoec+nA96Vh3L3kjc0gmQdAXTGRjv3HrxxxmnXELXksZlOYo1Mu5xtKk579
CMDseuKy7Uu1vb/ZFZImZm25AJHX6etWiyNdC3mQqzyDYoY4Kjk9OnBzz60h
lkyxzweXMpmGMBk3bDkAgkg9MVdF5Daw5TAUL/q26qOg69ecDnFUCUhLkRvJ
wzZUj5SBgcHHBANKpeUCcp5kY3AggHIHPTgYoAtOZWCGWxt7klQVkZl5Htx0
opvn2CRxLDL5ShB8ihgB+GKKBH//2Q==
strict_encode64:
/9j/4AAQSkZJRgABAQEAYABhAAD/4QBARXhpZgAASUkqAAgAAAABAGmHBAABAAAAGgAAAAAAAAACAAKgCQABAAAAZAAAAAOgCQABAAAAfwAAAAAAAAD/2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL/2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL//gA0T3B0aW1pemVkIGJ5IEpQRUdtaW5pIDMuMTQuMTQuNzI2NzA4NjAgMHhmZjIzZjM3OQD/wAARCAB/AGQDASIAAhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwDyNj85+tBbFEvE7j0Y/wA6YxrA2HIxJqwp4rS0LwvfayyspWKFj99z1+grvrf4TCSEMLqVzjqu3+VTKSRUYtnmQrS0WTy9Tib0DfyNdNq/wz1WwUvbAyp6MMH/AArl7eKaw1JUuImjdcgq4x2NS2mtCopqSuZkLf6Vdn/ZevR/ADgaHAv/AE9E/qteaQn9/de6NXoHw/c/YLVSetyePxFKrsXT/wAzoPFigaovH/LIfzNYHet/xW2dVHtGP61gfxVqcKLMNXojjFUIzVyI0AaSN8tFRKw2iigDxi9+S8mX0c1b0mxa6nDNGXUnCr/eNPNgbu5aduIzg4HU1oaZdvp16JBErIo2hCe1M3SO20uKGwgVZdW0+G9Yjy4XfcR7EjgfSuv8M6pqc91LZzWitLD1aORQCOxGOoI9K8t1W90vU9k0cM1vMgySuOvpVrwvrM1lqguXnYkptLE9QOgqJQW5tGfQ92Sa4+YXMIEZ/vNkiud8Q+GdM1C4U3EAKn7rrwy/jV3Tdfi1G1yrBiv3sH2qw7xOgjBJ5PfNc8tNUax8zx7xb8OJtAt5tU093uLIofMB+9HnufUe9R+BDstrTP8Az8/4V7KGla3aEhJIWUgqwzkHqD7Vxh0zStHuGitrRYvn3hdxIBP16UOfMrMpQszO8TtnVT7ItYWfmrvP7PttQgDSJGc8cZyKzbvwaWQvZTZcfwMev41pGtF7nJLDSitNTnY2xVqFqptFLbzNDMjJIhwVYYIqaNua2Odmkr8UVCrfLRQI8fkuZAkTq7KSuCQfSrumTmdzE7Ev1BJ5PtVc2h8pR6E01LaRHDoSGByCO1O6Ntbm6I8KSfujnJNLp5Ny2YSMZPzH/CobK9mdTBOvDcbgOPxrS0q3FrMzMg2hSMdATSRdzodD1M6ULhgcuy8Z9feuz8K6m7Rs8nzDOcV5vZ2sjYeT+Lk5712emSER4XACjtUTjc0gzupbw+SZVXI6kKOgrh9Vkj1LV5o7dlMipn3Fbdtq4is5uN+xGZlxyR3rza21Rv7QkugzbgeOe1Z8hpzndaTPIlqI5wWKjhgBW7pV1NNOqgDHGD2+hFcPp+tQFCJxt3Hsa6fTr2OJCImKc8NwSKiUS1I1fFehW+sWMl1arjULZMsg6yIO2PUdq82j4Nes6JdFplaQDzF4LDoR/hXA+KdM/szxDcIi4hlPmx49D2/A5Fa05aWZx14Wd0ZyniimA8UVqYHIf2YxUAKT9BT10iU/8smrtFt+igKDj0xUgszj+EAjpnvWWp1XRx8eiTHny8fU1ZXSJY0JbBXqQCa61bMocMB+dP8AsIOFwSPXrRewHPQSIqtnAA/StbSbyJWdJABjpmsnVITp95sK5VxxVCZ7hIfNAIReorQSN241rZp14DgE5xjqPpXG2MFzOGkWKVgxyQqkk0GZ76QDkAnkDvWnYtJB8ylyxGAFOD/Opehe4W1o8TgzFkXI3BgRj2Nd3oFpaRyrKLndkgDL8EHsa4GUzzXn7wSIrDOWFamlXX2Z1UuXj6Z6YOf8DWctTRHpVsklndgRO5CneoI6gfeXjrxn8q534mzXE8lk9kkxePdv8sEgA4xnH0NaFnrCSTRqXDsjI69iCDj/AB/OrmmXMRu78yuMRuEXP4nH6ipS1CTSVzyIavfLwWGR6rRXsk2q6akhDFWPrRV6kc0exzYt125bJ5x171ZjRMBgBn37f54pflI+RgWxywPQn61Irp8rE9T9AB/XtTuZFTUL2PT41eU5JbCr/WqTa7HKh+zwOHI6ucD9OtXdW0watbgCURyq2UbHHTvzWNB4ZvlYia6iSNc5cAnFVFxtqGpl3yfanNxPK3m5zurIvbtlQxlgQRxiu1j0S0Q4Y/aD3yeled6skkGry26cDzMADsKaak9B7Ghp8IRN38Zq4kRwcdR2FV7dvL24BNaMO1U8xjggfnQ0NMdFH5u1pZgoTkD+lSGzDMyxyDAP3QOTVAk4AUkY5GBUwvXRspuLduO/+FQ0aJmipW3uLeRWAYdQO/HWqUurEu7hyGdiSB64qPVLosgdk2yDAB/z7YrC8w4OR9DTSIm+hpSamxcnOfq1FZhznjB+uKKqxlc7mbW4wxUEB3PY9PWs648RyqC0UmRjGCDz9M1z7Ssw6E+h/wDr0EyDCuyKTwBn/wCtT5A5i/J4mvy52yMoI5Kt/wDrqnJ4jvSmDO6jOcfhVSQBztMnA7VnzWzgMVOfenyhcsy6nqFyCyzOV6fexVBp5klRnVt3c9ah8+WDCyLj+tWYbxfNUkgnPemlYNzctb1EjVpGwBST6ukjbUbC9u1Vbm5glRfkAJ61HDFaMwWWIbCeqnBFLQpXNGOZnXhjnHQ1t2WlyahbkrKqMv8AfbFQWekaWsSMNQuMH+BcEj8xXQ28XhuBQLiW6fI5Es23H4Lis5M1Rzl/avCfIlGHBznOQfpWFMJY3MYHOevtXWavf6NLtis4eQNibXJOe3Wsz7EJYVJ+WTJ4Y8/lTiRM5vzH7LkfQ0VsGylPKqCP9oYNFWZ2LqaRMedwQcA5GBnPT+dTx+G55QjuoUYJyT64/wDr126m1FuXWDdEyB2O3qehz9MdqmSKBAxODGg4DKRkk5+o/Gq5hWOCl8MOoARCcn7w7+1V5fDtwgYlDkLnaRyRz/8AW/OvRzCm5I5FxtIXDAnaep5/EYqZbeP7SGLFcIV5XIwecE9e4HXHFJTHynlf/CO3kilhbrIh74yaiTwZNMob7MQu4/cznOcYwP5V69BaLIXO5d5GMrk4JPcdccgjJ4H51KsEciufOWNVj+Vwo477h+J7envRzhynkcHw91K4d/LMsSocev6fiK04fhlqu8AahApyAQ69M9Mnp+HuK9KmDKy+RC+TglyflHJwQfX6/wD16mgikmMZSaJdoJYkgsWHGMgf/XqHJlKKPOovhnqzXBR9aiUA4yik8YznH4VswfCiwkPmXWq3kxUElUCqT6DGPr+Vd5GrWrKIwXUqxycFwccf1/Sk3MUiXa6MxCsfKJUYIHfvz19vyV2By8XgvTLBFitLWJSyk72OW47kfhWTc6TdW7yqU/dISCSpPy8dwTnHNegvGoAizH5oAJBXIkx6jt25z/KmSxGUqhQLGwJYseoBHyj06nii4Hmi2CwqEMg+uDz+VFehrpNm67sbc9uf6Yopgc2LlYrUSRmQxbifkB6AjBGMce3epLaJkYzfvGMq/IrEkD2OO+fTPaq9vP5ixIYkMkyEyehVeP6irULLGsZLO4UGRUXAbAA69jyfXvVEk6P587xzLtTftUq3PbB9c85okCLbkvviQkBGwxb1HqfQ/nxUT2zzxqY3O+QCRMSbQo/75PNKpULDtDP5rELliAcj3z2J/OkMekUtvK0cYRkEahGVdmck9cn0AOetSLd7IxNK25WRdqqBg5ByTnjoCTUkJkkgl+0IglRsAZ47ZGR6f0qAXm+KGJCFIUAEOxYDjuVPP40ASIjGTz4RDJLLj5mUAEj7v69xU1tciV189CSSzHBGBjscY9/z7GkDTm5t5ZZFbBMYC9/XnjuKhUyyWsaxIsJRmXaVVgMkjHoec+nA96Vh3L3kjc0gmQdAXTGRjv3HrxxxmnXELXksZlOYo1Mu5xtKk579CMDseuKy7Uu1vb/ZFZImZm25AJHX6etWiyNdC3mQqzyDYoY4Kjk9OnBzz60hlkyxzweXMpmGMBk3bDkAgkg9MVdF5Daw5TAUL/q26qOg69ecDnFUCUhLkRvJwzZUj5SBgcHHBANKpeUCcp5kY3AggHIHPTgYoAtOZWCGWxt7klQVkZl5Htx0opvn2CRxLDL5ShB8ihgB+GKKBH//2Q==
urlsafe_encode64:
_9j_4AAQSkZJRgABAQEAYABhAAD_4QBARXhpZgAASUkqAAgAAAABAGmHBAABAAAAGgAAAAAAAAACAAKgCQABAAAAZAAAAAOgCQABAAAAfwAAAAAAAAD_2wBDAAgGBgcGBQgHBwcJCQgKDBQNDAsLDBkSEw8UHRofHh0aHBwgJC4nICIsIxwcKDcpLDAxNDQ0Hyc5PTgyPC4zNDL_2wBDAQkJCQwLDBgNDRgyIRwhMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjL__gA0T3B0aW1pemVkIGJ5IEpQRUdtaW5pIDMuMTQuMTQuNzI2NzA4NjAgMHhmZjIzZjM3OQD_wAARCAB_AGQDASIAAhEBAxEB_8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL_8QAtRAAAgEDAwIEAwUFBAQAAAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4-Tl5ufo6erx8vP09fb3-Pn6_8QAHwEAAwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL_8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSExBhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3-Pn6_9oADAMBAAIRAxEAPwDyNj85-tBbFEvE7j0Y_wA6YxrA2HIxJqwp4rS0LwvfayyspWKFj99z1-grvrf4TCSEMLqVzjqu3-VTKSRUYtnmQrS0WTy9Tib0DfyNdNq_wz1WwUvbAyp6MMH_AArl7eKaw1JUuImjdcgq4x2NS2mtCopqSuZkLf6Vdn_ZevR_ADgaHAv_AE9E_qteaQn9_de6NXoHw_c_YLVSetyePxFKrsXT_wAzoPFigaovH_LIfzNYHet_xW2dVHtGP61gfxVqcKLMNXojjFUIzVyI0AaSN8tFRKw2iigDxi9-S8mX0c1b0mxa6nDNGXUnCr_eNPNgbu5aduIzg4HU1oaZdvp16JBErIo2hCe1M3SO20uKGwgVZdW0-G9Yjy4XfcR7EjgfSuv8M6pqc91LZzWitLD1aORQCOxGOoI9K8t1W90vU9k0cM1vMgySuOvpVrwvrM1lqguXnYkptLE9QOgqJQW5tGfQ92Sa4-YXMIEZ_vNkiud8Q-GdM1C4U3EAKn7rrwy_jV3Tdfi1G1yrBiv3sH2qw7xOgjBJ5PfNc8tNUax8zx7xb8OJtAt5tU093uLIofMB-9HnufUe9R-BDstrTP8Az8_4V7KGla3aEhJIWUgqwzkHqD7Vxh0zStHuGitrRYvn3hdxIBP16UOfMrMpQszO8TtnVT7ItYWfmrvP7PttQgDSJGc8cZyKzbvwaWQvZTZcfwMev41pGtF7nJLDSitNTnY2xVqFqptFLbzNDMjJIhwVYYIqaNua2Odmkr8UVCrfLRQI8fkuZAkTq7KSuCQfSrumTmdzE7Ev1BJ5PtVc2h8pR6E01LaRHDoSGByCO1O6Ntbm6I8KSfujnJNLp5Ny2YSMZPzH_CobK9mdTBOvDcbgOPxrS0q3FrMzMg2hSMdATSRdzodD1M6ULhgcuy8Z9feuz8K6m7Rs8nzDOcV5vZ2sjYeT-Lk5712emSER4XACjtUTjc0gzupbw-SZVXI6kKOgrh9Vkj1LV5o7dlMipn3Fbdtq4is5uN-xGZlxyR3rza21Rv7QkugzbgeOe1Z8hpzndaTPIlqI5wWKjhgBW7pV1NNOqgDHGD2-hFcPp-tQFCJxt3Hsa6fTr2OJCImKc8NwSKiUS1I1fFehW-sWMl1arjULZMsg6yIO2PUdq82j4Nes6JdFplaQDzF4LDoR_hXA-KdM_szxDcIi4hlPmx49D2_A5Fa05aWZx14Wd0ZyniimA8UVqYHIf2YxUAKT9BT10iU_8smrtFt-igKDj0xUgszj-EAjpnvWWp1XRx8eiTHny8fU1ZXSJY0JbBXqQCa61bMocMB-dP8AsIOFwSPXrRewHPQSIqtnAA_StbSbyJWdJABjpmsnVITp95sK5VxxVCZ7hIfNAIReorQSN241rZp14DgE5xjqPpXG2MFzOGkWKVgxyQqkk0GZ76QDkAnkDvWnYtJB8ylyxGAFOD_Opehe4W1o8TgzFkXI3BgRj2Nd3oFpaRyrKLndkgDL8EHsa4GUzzXn7wSIrDOWFamlXX2Z1UuXj6Z6YOf8DWctTRHpVsklndgRO5CneoI6gfeXjrxn8q534mzXE8lk9kkxePdv8sEgA4xnH0NaFnrCSTRqXDsjI69iCDj_AB_OrmmXMRu78yuMRuEXP4nH6ipS1CTSVzyIavfLwWGR6rRXsk2q6akhDFWPrRV6kc0exzYt125bJ5x171ZjRMBgBn37f54pflI-RgWxywPQn61Irp8rE9T9AB_XtTuZFTUL2PT41eU5JbCr_WqTa7HKh-zwOHI6ucD9OtXdW0watbgCURyq2UbHHTvzWNB4ZvlYia6iSNc5cAnFVFxtqGpl3yfanNxPK3m5zurIvbtlQxlgQRxiu1j0S0Q4Y_aD3yeled6skkGry26cDzMADsKaak9B7Ghp8IRN38Zq4kRwcdR2FV7dvL24BNaMO1U8xjggfnQ0NMdFH5u1pZgoTkD-lSGzDMyxyDAP3QOTVAk4AUkY5GBUwvXRspuLduO_-FQ0aJmipW3uLeRWAYdQO_HWqUurEu7hyGdiSB64qPVLosgdk2yDAB_z7YrC8w4OR9DTSIm-hpSamxcnOfq1FZhznjB-uKKqxlc7mbW4wxUEB3PY9PWs648RyqC0UmRjGCDz9M1z7Ssw6E-h_wDr0EyDCuyKTwBn_wCtT5A5i_J4mvy52yMoI5Kt_wDrqnJ4jvSmDO6jOcfhVSQBztMnA7VnzWzgMVOfenyhcsy6nqFyCyzOV6fexVBp5klRnVt3c9ah8-WDCyLj-tWYbxfNUkgnPemlYNzctb1EjVpGwBST6ukjbUbC9u1Vbm5glRfkAJ61HDFaMwWWIbCeqnBFLQpXNGOZnXhjnHQ1t2WlyahbkrKqMv8AfbFQWekaWsSMNQuMH-BcEj8xXQ28XhuBQLiW6fI5Es23H4Lis5M1Rzl_avCfIlGHBznOQfpWFMJY3MYHOevtXWavf6NLtis4eQNibXJOe3Wsz7EJYVJ-WTJ4Y8_lTiRM5vzH7LkfQ0VsGylPKqCP9oYNFWZ2LqaRMedwQcA5GBnPT-dTx-G55QjuoUYJyT64_wDr126m1FuXWDdEyB2O3qehz9MdqmSKBAxODGg4DKRkk5-o_Gq5hWOCl8MOoARCcn7w7-1V5fDtwgYlDkLnaRyRz_8AW_OvRzCm5I5FxtIXDAnaep5_EYqZbeP7SGLFcIV5XIwecE9e4HXHFJTHynlf_CO3kilhbrIh74yaiTwZNMob7MQu4_cznOcYwP5V69BaLIXO5d5GMrk4JPcdccgjJ4H51KsEciufOWNVj-Vwo477h-J7envRzhynkcHw91K4d_LMsSocev6fiK04fhlqu8AahApyAQ69M9Mnp-HuK9KmDKy-RC-TglyflHJwQfX6_wD16mgikmMZSaJdoJYkgsWHGMgf_XqHJlKKPOovhnqzXBR9aiUA4yik8YznH4VswfCiwkPmXWq3kxUElUCqT6DGPr-Vd5GrWrKIwXUqxycFwccf1_Sk3MUiXa6MxCsfKJUYIHfvz19vyV2By8XgvTLBFitLWJSyk72OW47kfhWTc6TdW7yqU_dISCSpPy8dwTnHNegvGoAizH5oAJBXIkx6jt25z_KmSxGUqhQLGwJYseoBHyj06nii4Hmi2CwqEMg-uDz-VFehrpNm67sbc9uf6Yopgc2LlYrUSRmQxbifkB6AjBGMce3epLaJkYzfvGMq_IrEkD2OO-fTPaq9vP5ixIYkMkyEyehVeP6irULLGsZLO4UGRUXAbAA69jyfXvVEk6P587xzLtTftUq3PbB9c85okCLbkvviQkBGwxb1HqfQ_nxUT2zzxqY3O-QCRMSbQo_75PNKpULDtDP5rELliAcj3z2J_OkMekUtvK0cYRkEahGVdmck9cn0AOetSLd7IxNK25WRdqqBg5ByTnjoCTUkJkkgl-0IglRsAZ47ZGR6f0qAXm-KGJCFIUAEOxYDjuVPP40ASIjGTz4RDJLLj5mUAEj7v69xU1tciV189CSSzHBGBjscY9_z7GkDTm5t5ZZFbBMYC9_XnjuKhUyyWsaxIsJRmXaVVgMkjHoec-nA96Vh3L3kjc0gmQdAXTGRjv3HrxxxmnXELXksZlOYo1Mu5xtKk579CMDseuKy7Uu1vb_ZFZImZm25AJHX6etWiyNdC3mQqzyDYoY4Kjk9OnBzz60hlkyxzweXMpmGMBk3bDkAgkg9MVdF5Daw5TAUL_q26qOg69ecDnFUCUhLkRvJwzZUj5SBgcHHBANKpeUCcp5kY3AggHIHPTgYoAtOZWCGWxt7klQVkZl5Htx0opvn2CRxLDL5ShB8ihgB-GKKBH__2Q==

open office crashes after some time giving garbled font in converted PDF

We are converting word to pdf using the openoffice(3.4.1 version) in java with JODConverter.
below is the code used.
OpenOfficeConnection connection =
new SocketOpenOfficeConnection(2100);
try
{
connection.connect();
DocumentConverter converter =
new OpenOfficeDocumentConverter(connection);
converter.convert(inputFile, outputFile);
connection.disconnect();
return "Sucess " + DestinationPath + DestinationFileName;
}
catch (Exception localException1) {
}
The problem is that after random no of days the converted PDF contains the garbled fonts.
like # # ! $ $ " % &
The only solution we have so far is to restart the server. System guys are saying the the problem is with Open Office.
We are using open office to convert the document since it converts the doc files exactly including all the formatting and table structure.
what could be the solution to this.
So OpenOffice can be a little temperamental when running on a server, especially as it isn't multi-threaded and you end up having to run a pool of OpenOffice processes - see How can I use OpenOffice in server mode as a multithreaded service?.
Added to that often the rendering is off when converting to PDF - see https://forum.openoffice.org/en/forum/viewtopic.php?f=7&t=68865 which is why you may want to consider using a conversion service to automate the conversion tasks for you ?
For complete transparency I work for Zamzar (an online file conversion service), we have recently released a developer API - https://developers.zamzar.com/ that allows you to convert between a multitude of file types, specifically applicable to you here in that we support both doc and docx to pdf with little or not loss in the way the PDF is rendered. It maybe worth a look to see if this is a better alternative to trying to run your own solution through OpenOffice on a server.

tika returning incorrect line of text for pdf with lots of tables

I am using tika to extract text from a pdf file that has lot of tables.
java -jar tika-app-0.9.jar -t https://s3.amazonaws.com/centraldoc/alg1.pdf
It is returning some invalid text and sometimes it is trimming white space between 2 words; for example it returns
"qu inakli fmyathematical ideas to the real world" instead of "Link mathematical ideas to the real world".
Is there a way to minimize this kind of error? or is there another library that I can use? Does it make sense to use OCR to process these kind of pdf.
Try to control order when using PDFBox parser: PDFTextStripper has a flag that controls the order of lines in the document. By default (in PDFBox) it's set to false for performance reasons (no order preserved), but Tika changed its behavior between releases switching this flag on and off.
More details exactly on this problem in my blog Extracting text from PDF files with Apache Tika 0.9 (and PDFBox under the hood).
To get text from PDF to display in the right order, I had to set the SortByPosition flag to true... (tika-app-1.19.jar)
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
ParseContext context = new ParseContext();
PDFParser pdfParser = new PDFParser();
PDFParserConfig config = pdfParser.getPDFParserConfig();
config.setSortByPosition(true); // needed for text in correct order
pdfParser.setPDFParserConfig(config);
pdfParser.parse(is, handler, metadata, context);

Resources