Apache Tika - Document to XHTML - Not getting the images

I am trying to convert a Word document to an XHTML document. I am using Tika version 1.24. The code I am using is exactly as shown on the landing page.
Reproducing code for reference:
org.xml.sax.ContentHandler handler = new ToXMLContentHandler();
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
File initialFile = new File("C:\\TikaHTML\\Doc3.docx");
InputStream targetStream =
new DataInputStream(new FileInputStream(initialFile));
parser.parse(targetStream, handler, metadata);
BufferedWriter writer = new BufferedWriter(new FileWriter("C:\\TikaHTML\\Doc3.XHTML"));
writer.write(handler.toString());
writer.close();
targetStream.close();
When I try to open the XHTML file in the browser, I get an encoding-related error. Thanks to the comment below, this was resolved.
Also, why am I not getting the images? The file has tags like the one below:
<p><img src="embedded:image6.png" alt="image6.png" /></p>
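On the missing images: the embedded: prefix in the src attribute indicates that the XHTML only refers to each picture by name. The image bytes are not part of the handler's output and have to be extracted separately. Within Tika this is normally done through its embedded-document handling (for example an EmbeddedDocumentExtractor registered in a ParseContext). The sketch below shows a different, Tika-free workaround under stated assumptions: the input is a .docx, which is a ZIP archive whose pictures conventionally sit under word/media/, and the entry names match the names Tika reports. DocxImageExtractor and its methods are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: copies the pictures out of a .docx.
final class DocxImageExtractor {

    // Assumption: pictures in a .docx are stored under this ZIP prefix.
    private static final String MEDIA_PREFIX = "word/media/";

    // Copies every entry under word/media/ into outputDir and returns the
    // file names that were written (for example "image6.png").
    static List<String> extractImages(Path docx, Path outputDir) throws IOException {
        List<String> written = new ArrayList<String>();
        Files.createDirectories(outputDir);
        try (InputStream in = Files.newInputStream(docx);
             ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (entry.isDirectory() || !name.startsWith(MEDIA_PREFIX)) {
                    continue;
                }
                // Keep only the last path segment so an entry cannot escape outputDir.
                String fileName = Paths.get(name).getFileName().toString();
                // Files.copy reads up to the end of the current ZIP entry only.
                Files.copy(zip, outputDir.resolve(fileName), StandardCopyOption.REPLACE_EXISTING);
                written.add(fileName);
            }
        }
        return written;
    }

    // Rewrites src="embedded:NAME" to src="NAME" so that a browser resolves
    // the picture relative to the XHTML file.
    static String relinkImages(String xhtml) {
        return xhtml.replace("src=\"embedded:", "src=\"");
    }
}
```

With the pictures extracted into the same folder as the XHTML file, passing handler.toString() through relinkImages before writing it out should let the browser find them. This only covers .docx input; other formats would need Tika's own embedded-resource extraction.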

Related

Not able to save manually edited data after filling pdf using iTextSharp

I succeeded in filling out a PDF form with database data using the iTextSharp DLL. But my code breaks Adobe's extended features. Once I've filled the form using iTextSharp, the resulting document is a flat form and we can't fill it out manually again.
I already resolved the flattening problem using the following line of code.
pdfStamper.FormFlattening = false;
Now when I open the PDF file filled with the database data using the following code, I am able to edit the form manually:
public ActionResult ViewFile()
{
    string fileName = "I9 Form.pdf";
    string filenames = string.Concat(Guid.NewGuid().ToString(), ".pdf");
    PdfReader pdfReader = new PdfReader(Server.MapPath(String.Format
        ("~/App_Data/TempletePDF/") + fileName));
    MemoryStream stream = new MemoryStream();
    PdfStamper pdfStamper = new PdfStamper(pdfReader, stream);
    AcroFields formFields = pdfStamper.AcroFields;
    formFields.SetField("LastName", "John");
    pdfStamper.FormFlattening = false;
    pdfStamper.Writer.CloseStream = false;
    pdfStamper.Close();
    byte[] file = stream.ToArray();
    MemoryStream output = new MemoryStream();
    output.Write(file, 0, file.Length);
    output.Position = 0;
    HttpContext.Response.AddHeader
        ("content-disposition", "inline; filename=form.pdf");
    // Return the output stream
    return File(output, "application/pdf");
}
I am able to print the file with the manually entered data using the PDF print button, but I'm no longer able to save the file with the manually entered data.
When I try to open the saved file normally, it gives me the following error message:
"This document enabled extended features in Adobe Acrobat Reader DC. The
document has been changed since it was created and use of extended features
is no longer available. Please contact the author for the original version
of this document."
It sounds as if you're filling out a Reader-enabled form. In the comments, I referred to the concept of Reader-enabling:
Can I create a Reader-enabled PDF using iText? (The answer is: no, of course not!)
How can I create a Reader enabled PDF that can be signed in Adobe Reader? (The answer is: this can only be done with Adobe software.)
From these answers, you know that Reader-enabling is achieved by introducing a digital signature that uses a private key owned by Adobe.
You fill out the form using a PdfStamper that is created like this:
PdfStamper pdfStamper = new PdfStamper(pdfReader, stream);
This alters the file and breaks the digital signature. As a result, the Reader-enabling is lost and if usage rights are defined (such as saving the file manually), then these usage rights are no longer valid.
You can work around this by creating the PdfStamper in append mode:
PdfStamper stamper = new PdfStamper(pdfReader, stream, '\0', true);
Now the original file (the bytes that are signed using Adobe's private key) remains unaltered. You just add some extra bytes. This will preserve Reader-enabling.

Export SSRS report directly without rendering it on ReportViewer

I have a set of RDL reports hosted on a report server instance. Some of the reports render more than 100,000 records in the ReportViewer, so they take a long time to render. We therefore decided to export the content directly from the server, based on the user's input parameters for the report and the chosen export file format.
The main point is that I do not want the user to wait until the export file is available for download. Instead, the user should be able to submit the request and carry on with other work. In the background, the program has to export the file to some physical location. When the download is available, the user will be notified about the exported file.
I found an approach in this link. I need to know the ways to achieve the functionality described above, as well as how to pass the input parameters for the report. Please advise.
Note: I was using XML as the data source for the RDL reports.
EDIT
I found something useful and wrote the code below:
string path = ServerURL +"?" + _reportFolder + "ReportName&rs:Command=Render&rs:Format=PDF";
WebRequest req = WebRequest.Create(path);
string reportParametersQT = String.Empty;
req.Credentials = CredentialCache.DefaultNetworkCredentials;
WebResponse response = req.GetResponse();
Stream stream = response.GetResponseStream();
//screen.Response.Clear();
string enCodeFileName = HttpUtility.UrlEncode("fileName.pdf", System.Text.Encoding.UTF8);
// The word attachment in Addheader is used to directly show the save dialog box in browser
Response.AddHeader("content-disposition", "attachment; filename=" + enCodeFileName);
Response.BufferOutput = false; // to prevent buffering
Response.ContentType = response.ContentType;
byte[] buffer = new byte[1024];
int bytesRead = 0;
while ((bytesRead = stream.Read(buffer, 0, buffer.Length)) > 0)
{
    Response.OutputStream.Write(buffer, 0, bytesRead);
}
Response.End();
I am able to download the exported file, but I need to save the file to a physical location instead of downloading it. I don't know how to do that.
Both of these are very easy to do. You essentially just pass the parameters in the URL that you're calling; for example, for a parameter called "LearnerList" you add &LearnerList=12345 to the URL. For exporting, add an additional parameter, rs:Format=PDF (or whatever format you want the file in), to get the report exported as a PDF instead of being rendered in the Report Viewer.
Here's an example URL:
https://reporting.MySite.net/ReportServer/Pages/ReportViewer.aspx?/Users+Folders/User/My+Reports/Learner+Details&rs:Format=PDF&LearnerList=202307
Read these two pages, and you should be golden:
https://msdn.microsoft.com/en-us/library/ms155391.aspx
https://msdn.microsoft.com/en-us/library/ms154040.aspx
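The URL convention described above, together with the "save to a physical location" part of the question, can be sketched as follows. This is a minimal illustration in Java rather than the C# used in the question, and it rests on assumptions: ReportExport, buildExportUrl and saveTo are hypothetical names, the URL shape mirrors the one built in the question's code (server URL, "?", report path, then rs:Command and rs:Format), the report path is assumed to be already encoded by the caller, and the server URL in the usage note is a placeholder.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Map;

// Hypothetical helper illustrating URL-based export; not an SSRS client library.
final class ReportExport {

    // Builds serverUrl?reportPath&rs:Command=Render&rs:Format=FORMAT&name=value...
    // The report path is appended as given (assumed already encoded by the caller).
    static String buildExportUrl(String serverUrl, String reportPath, String format,
                                 Map<String, String> parameters) {
        StringBuilder url = new StringBuilder(serverUrl)
                .append('?').append(reportPath)
                .append("&rs:Command=Render")
                .append("&rs:Format=").append(encode(format));
        for (Map.Entry<String, String> p : parameters.entrySet()) {
            url.append('&').append(encode(p.getKey()))
               .append('=').append(encode(p.getValue()));
        }
        return url.toString();
    }

    // Streams the response body to a file instead of the HTTP response and
    // returns the number of bytes written.
    static long saveTo(InputStream body, Path target) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.copy(body, target, StandardCopyOption.REPLACE_EXISTING);
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM.
            throw new IllegalStateException(e);
        }
    }
}
```

A background job would open the HTTP response for the URL returned by buildExportUrl (with the appropriate credentials) and hand its stream to saveTo, then notify the user once the file exists. In the question's C# code the equivalent change would be to copy the response stream into a file stream rather than into Response.OutputStream.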

pdf.js to display output of file created with tcpdf

I really hope you will be able to help me out on this one.
I am new to pdf.js so for the moment, I am playing around with the pre-built version to see if I can integrate this into my web app.
My problem:
I am using tcpdf to generate a pdf file which I would like to visualize using pdf.js without having to save it to a file on the server.
I have a php file (generate_document.php) that I use to generate the pdf. The file ends with the following:
$pdf->Output('test.pdf', 'I');
According to the tcpdf documentation, the second parameter selects one of the following output destinations:
I: send the file inline to the browser (default). The plug-in is used if available. The name given by name is used when one selects the "Save as" option on the link generating the PDF.
D: send to the browser and force a file download with the name given by name.
F: save to a local server file with the name given by name.
S: return the document as a string (name is ignored).
FI: equivalent to F + I option
FD: equivalent to F + D option
E: return the document as base64 mime multi-part email attachment (RFC 2045)
Then, I would like to view the pdf using pdf.js without creating a file on the server (= not using 'F' as a second parameter and passing the file name to pdf.js).
So, I thought I could simply create an iframe and call the pdf.js viewer pointing to the php file:
<iframe width="100%" height="100%" src="/pdf.js_folder/web/viewer.html?file=get_document.php"></iframe>
However, this is not working at all. Do you have any idea what I am overlooking? Or is this option not available in pdf.js?
I have done some research and I have seen some posts here on converting a base64 stream to a typed array but I do not see how this would be a solution to this problem.
Many thanks for your help!!!
EDIT
@async, thanks for your answer.
I got it figured out in the meantime, so I thought I'd share my solution with you guys.
1) In my get_document.php, I changed the output statement to convert it directly to base64 using
$pdf_output = base64_encode($pdf->Output('test_file.pdf', 'S'));
2) In viewer.js, I use an XHR to call the get_document.php and put the return in a variable (pdf_from_XHR)
3) Next, I convert what came in from the XHR request using the solution that was already mentioned in a few other posts (e.g. Pdf.js and viewer.js. Pass a stream or blob to the viewer)
pdf_converted = convertDataURIToBinary(pdf_from_XHR)
// Assumes the input is a data URI such as "data:application/pdf;base64,...."
var BASE64_MARKER = ';base64,';
function convertDataURIToBinary(dataURI) {
    var base64Index = dataURI.indexOf(BASE64_MARKER) + BASE64_MARKER.length;
    var base64 = dataURI.substring(base64Index);
    var raw = window.atob(base64);
    var rawLength = raw.length;
    var array = new Uint8Array(new ArrayBuffer(rawLength));
    for (var i = 0; i < rawLength; i++) {
        array[i] = raw.charCodeAt(i);
    }
    return array;
}
et voilà ;-)
Now I can inject what is coming from that function into the getDocument statement:
PDFJS.getDocument(pdf_converted).then(function (pdf) {
pdfDocument = pdf;
var url = URL.createObjectURL(blob);
PDFView.load(pdfDocument, 1.5)
})

iTextSharp preserve html formatting on pdf

I am using some basic styles in CKEditor (bold, italic, etc.) to allow my users to style their text for report writing.
When this string is passed to iTextSharp I am removing the HTML, otherwise the HTML is printed on the PDF. I am removing it with
Regex.Replace(item.DevelopmentPractice.ToString(), @"<[^>]*>| ", String.Empty)
Is there a way to format the text on the PDF to preserve the bold, but without displaying the literal tags?
<strong></strong>
UPDATE
I have provided full code below as requested.
public FileStreamResult pdf(int id)
{
    // Set up the document and the Memory Stream to write it to and create the PDF writer instance
    MemoryStream workStream = new MemoryStream();
    Document document = new Document(PageSize.A4, 30, 30, 30, 30);
    PdfWriter.GetInstance(document, workStream).CloseStream = false;
    // Open the pdf Document
    document.Open();
    // Set up fonts used in the document
    Font font_body = FontFactory.GetFont(FontFactory.HELVETICA, 10);
    Font font_body_bold = FontFactory.GetFont(FontFactory.HELVETICA, 10, Font.BOLD);
    Chunk cAreasDevelopmentHeading = new Chunk("Areas identified for development of practice", font_body_bold);
    Chunk cAreasDevelopmentComment = new Chunk(item.DevelopmentPractice != null ? Regex.Replace(item.DevelopmentPractice.ToString(), @"<[^>]*>| ", String.Empty) : "", font_body);
    Paragraph paraAreasDevelopmentHeading = new Paragraph();
    paraAreasDevelopmentHeading.SpacingBefore = 5f;
    paraAreasDevelopmentHeading.SpacingAfter = 5f;
    paraAreasDevelopmentHeading.Add(cAreasDevelopmentHeading);
    document.Add(paraAreasDevelopmentHeading);
    Paragraph paraAreasDevelopmentComment = new Paragraph();
    paraAreasDevelopmentComment.SpacingBefore = 5f;
    paraAreasDevelopmentComment.SpacingAfter = 15f;
    paraAreasDevelopmentComment.Add(cAreasDevelopmentComment);
    document.Add(paraAreasDevelopmentComment);
    document.Close();
    byte[] byteInfo = workStream.ToArray();
    workStream.Write(byteInfo, 0, byteInfo.Length);
    workStream.Position = 0;
    // Setup to Download
    HttpContext.Response.AddHeader("content-disposition", "attachment; filename=supportform.pdf");
    return File(workStream, "application/pdf");
}
This really is not the best way to do HTML to PDF, iText or no iText. Try to look for a different method: you are not actually converting HTML to PDF, you are inserting scraped text into the PDF using Chunks.
The most common way to do HTML to PDF with iText seems to be HTMLWorker (I think it might be XMLWorker in newer versions), but people complain about that too; see this. It looks like you are building the PDF from plain iText elements without HTML and want to use HTML within those elements, and I'm guessing that it will be very, very hard.
In the linked HTMLWorker example, have a look at the structure of the program. They do an HTML-to-PDF conversion, but if that fails, they create the PDF using the other iText methods, like Paragraph and Chunk. There they set some styling on the Chunk as well.
I guess that you would have to parse the incoming HTML, divide it into pieces yourself, convert the tagged spans (such as <strong>) to Chunks with styling and only then vomit them onto the PDF. Now imagine doing that with a data source like CKE: even with a very strict ACF it would be a nightmare. If anyone knows of any other way than this, I want to know too (I basically do CKE to PDF for a living)!
Do you have any other options, such as creating your own editor or using some other PDF technique? I use wkhtmltopdf, but my situation is very different. I would use PrinceXML, but it's too expensive.

Create and download word file from template in MVC

I have kept a Word document (.docx) in one of the project folders, which I want to use as a template.
This template contains custom header and footer lines for the user. I want to let the user download their own data in Word format. For this, I want to write a function that accepts the user data and, using the template, creates a new Word file with the placeholders in the template replaced, and then returns the new file for download (without saving it to the server). That means the template itself must remain intact.
The following is what I am trying. I was able to replace the placeholder. However, I don't know how to offer the created content to the user as a downloadable file. I do not want to save the new content on the server as another Word file.
public void GenerateWord(string userData)
{
    string templateDoc = HttpContext.Current.Server.MapPath("~/App_Data/Template.docx");
    // Open the new Package
    Package pkg = Package.Open(templateDoc, FileMode.Open, FileAccess.ReadWrite);
    // Specify the URI of the part to be read
    Uri uri = new Uri("/word/document.xml", UriKind.Relative);
    PackagePart part = pkg.GetPart(uri);
    XmlDocument xmlMainXMLDoc = new XmlDocument();
    xmlMainXMLDoc.Load(part.GetStream(FileMode.Open, FileAccess.Read));
    xmlMainXMLDoc.InnerXml = ReplacePlaceHoldersInTemplate(userData, xmlMainXMLDoc.InnerXml);
    // Open the stream to write document
    StreamWriter partWrt = new StreamWriter(part.GetStream(FileMode.Open, FileAccess.Write));
    xmlMainXMLDoc.Save(partWrt);
    partWrt.Flush();
    partWrt.Close();
    pkg.Close();
}
private string ReplacePlaceHoldersInTemplate(string toReplace, string templateBody)
{
    templateBody = templateBody.Replace("#myPlaceHolder#", toReplace);
    return templateBody;
}
I believe that the below line is saving the contents in the template file itself, which I don't want.
xmlMainXMLDoc.Save(partWrt);
How should I modify this code so that it returns the new content to the user as a downloadable Word file?
I found the solution here!
This code allows me to read the template file and modify it as I want and then to send response as downloadable attachment.
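The linked solution is not reproduced here, but the general idea of keeping the template intact can be sketched: never open the template for writing, read its bytes, build the modified package in memory, and send those bytes as the response. The example below is a swapped-in illustration in Java using only java.util.zip, not the System.IO.Packaging code from the question. TemplateFiller, fill and escapeXml are hypothetical names, and it assumes that word/document.xml is UTF-8 encoded and that the placeholder appears unbroken in the XML (Word sometimes splits text across runs).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper: fills a .docx template entirely in memory.
final class TemplateFiller {

    // Returns a new .docx as bytes; templateBytes is only read, never modified.
    static byte[] fill(byte[] templateBytes, String placeholder, String value) throws IOException {
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(templateBytes));
             ZipOutputStream out = new ZipOutputStream(result)) {
            byte[] buffer = new byte[8192];
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                // Read the whole part into memory.
                ByteArrayOutputStream part = new ByteArrayOutputStream();
                int n;
                while ((n = in.read(buffer)) != -1) {
                    part.write(buffer, 0, n);
                }
                byte[] data = part.toByteArray();
                // Only the main document part is rewritten (assumed to be UTF-8).
                if ("word/document.xml".equals(entry.getName())) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    data = xml.replace(placeholder, escapeXml(value))
                              .getBytes(StandardCharsets.UTF_8);
                }
                out.putNextEntry(new ZipEntry(entry.getName()));
                out.write(data);
                out.closeEntry();
            }
        }
        // The ZIP stream is closed at this point, so the archive is complete.
        return result.toByteArray();
    }

    // Minimal escaping so user data cannot break the XML.
    private static String escapeXml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }
}
```

The returned byte array can be written straight to the HTTP response with a content-disposition of attachment, so nothing is saved on the server. In the C# code from the question, the same effect would come from copying the template into a MemoryStream first and opening the package on that stream instead of on the file.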
