Extracting Apache POI HWPF Hyperlinks

HYPERLINK "target"label
How can I extract hyperlinks from an HWPF document? I can get paragraphs from the doc file and extract the correct styling if necessary, i.e. bold, italic, etc. But how would I identify and extract hyperlinks from a paragraph?

The .doc format doesn't store hyperlinks in the simplest of ways, as you've noticed...
A Hyperlink will be a single CharacterRun, with special markers on it. Once you have detected it, just split up the text based on the quotes.
Apache Tika has a good example of this: look at the handleSpecialCharacterRuns method of WordExtractor to see it done.
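The "split on the quotes" step does not depend on POI itself, so here is a minimal sketch of it in Python. It assumes the run text has the shape shown above, HYPERLINK "target"label, and that Word's field marker characters (0x13, 0x14, 0x15) may still be mixed into the text. The function name split_hyperlink is made up for illustration; it is not part of POI or Tika.

```python
import re

# Assumed shape of the run text: HYPERLINK "target"label
_FIELD = re.compile(r'^\s*HYPERLINK\s+"([^"]*)"(.*)$', re.DOTALL)

# Assumption: field begin / separator / end markers may appear in the text.
_FIELD_MARKS = {0x13: None, 0x14: None, 0x15: None}


def split_hyperlink(run_text):
    """Return (target, label), or None if the text is not a hyperlink field."""
    text = run_text.translate(_FIELD_MARKS)  # drop the marker characters
    match = _FIELD.match(text)
    if match is None:
        return None
    target, label = match.groups()
    return target, label.strip()


if __name__ == "__main__":
    print(split_hyperlink('HYPERLINK "http://example.com"Example'))
```

In Java the same pattern could be applied to the string returned by the character run; real documents may carry extra field switches after the target, which this sketch does not handle.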

Related

How to scan words, lines and their properties on the text?

I need to scan a document. It's not OCR; let me show you:
--Example--
Table of Contents
Some Italic Words
Sentence 23
--End--
Suppose that is a ".doc"-formatted text. I need to scan it line by line and recognize that the first line is bold, the second is italic, and the third has a space after the first word followed by a number. I want to recognize these so that I can categorize the lines in a table view: bold lines, italic lines, numbered lines, and so on.
I'm comfortable in both Swift and Objective-C but totally clueless about document scanning. I would be grateful for any reference, framework, or approach.
Variant 1: your doc is really a docx. A docx is XML, so parse the XML. The format defines XML tags that mark text as bold, italic, and so on; in that respect a docx is a bit like HTML.
Variant 2: your doc is really a doc. Then it is not XML but a binary format. The format is documented and you can parse it yourself, but I don't think it will be easy.
BUT
There is a library I know, doc2text, that can parse a lot of this (http://www.textlib.com/doc2text.html).
We used it in past projects and it did an okay job. Using it saves you a lot of effort compared with writing your own parser.
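For the docx variant, the idea can be sketched in a few lines. This is an illustration in Python rather than Swift, and it rests on assumptions: the body of a docx lives in word/document.xml inside the zip container, and only formatting set directly on a run (w:b, w:i) is inspected, so bold or italic inherited from a style is missed. The function names are invented for the example.

```python
import xml.etree.ElementTree as ET
import zipfile

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def read_document_xml(path):
    """Read the main document part out of a .docx file (a zip archive)."""
    with zipfile.ZipFile(path) as archive:
        return archive.read("word/document.xml")


def _is_on(run, tag):
    """True if the run has <w:TAG/> in its properties and it is not switched off."""
    flag = run.find(f"w:rPr/w:{tag}", NS)
    if flag is None:
        return False
    return flag.get(f"{{{W}}}val", "true") not in ("false", "0", "off")


def classify_paragraphs(document_xml):
    """Return a list of (text, all_bold, all_italic), one entry per paragraph."""
    root = ET.fromstring(document_xml)
    results = []
    for para in root.iter(f"{{{W}}}p"):
        # Only runs that actually carry text are considered.
        runs = [r for r in para.iter(f"{{{W}}}r") if r.find("w:t", NS) is not None]
        if not runs:
            continue
        text = "".join(t.text or "" for r in runs for t in r.findall("w:t", NS))
        bold = all(_is_on(r, "b") for r in runs)
        italic = all(_is_on(r, "i") for r in runs)
        results.append((text, bold, italic))
    return results
```

A call such as classify_paragraphs(read_document_xml("report.docx")) would then give one tuple per line of the example; detecting "word followed by a number" is a plain string check on the text.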

iOS search and replace PDF string

Is it possible to search and replace a known string from a PDF with Objective-C/Quartz 2D?
I have some nicely formatted PDFs with tabular data, created with LaTeX (and generated with pdflatex). Every PDF will have a placeholder string, something like XXXXXX, that I would like to change programmatically.
These strings will only ever be replaced by numbers.
I'm aware that the PDF could be an editable form, but I don't want that because I prefer to leave all the fonts and formatting as they're typeset by LaTeX.
It is not possible to search and replace text in PDF files using Quartz 2D. Quartz 2D offers only a low-level, read-only interface to PDF files. Searching can be implemented on top of it, with considerable effort, but modifying files and replacing text cannot.

PasteSpecial using Ole,PowerPoint,Delphi

How do you use PasteSpecial in Delphi to paste into PowerPoint via OLE? I have RTF data I want to paste into PowerPoint, and I need to use PasteSpecial. However, I cannot find documentation on how to fill out the parameters it needs.
PasteSpecial is just going to favor one format over the other. So you can prioritize the formats, or eliminate formats, to influence the pasting. For example, if you have RTF and TEXT on the clipboard, and PowerPoint always pastes TEXT by default even if RTF is listed first, then you could eliminate TEXT and provide ONLY RTF. Then it has to paste as RTF.
MSDN has documentation for the 2003 and 2007 versions. In both cases, the first parameter should be ppPasteRTF if you want to choose the clipboard contents with RTF format. You can use EmptyParam for the remaining five parameters.

Including full LaTeX documents within others

I'm currently finishing off my dissertation, and would like to be able to include some documents within my LaTeX document.
The files I'd like to include are weekly reports to my supervisor, written in LaTeX. Obviously, each document has its own separate page numbering.
I would like them to be included in the final document.
I could concatenate all the final PDFs using GhostScript or some other tool, but I would like to have consistent numbering throughout the document.
I have tried including the LaTeX from each document in the main document, but the preambles cause problems and the small title I have in each report takes up a whole page...
In summary, I'm looking for a way of including a number of one- or two-page self-contained LaTeX files in a large report, keeping their original layouts but changing the page numbering.
If you want to \input the original LaTeX files while skipping their preambles, the newclude package might help.
Otherwise, you can use pdfpages for inserting pre-existing PDFs into your dissertation. I seem to recall that it has a feature of "suppressing" the original page numbers by covering them up with white boxes.
The suggestion from @Will Robertson works great. I'd just like to add an example for all the lazy people:
\usepackage{pdfpages}
...
% Insert _all_ pages from some_pdf.pdf:
\includepdf[pages=-]{some_pdf} % the .pdf extension may be omitted
From the documentation of the package:
To include a specific range of pages, you could use pages={4-9}. If the start is omitted, it defaults to the first page; if the end is omitted, it defaults to the last page.
To include the pages in landscape mode, use landscape=true.
Maintaining the original formatting per document will be difficult if they're using different formats. For example, concatenating different document classes will be near impossible.
I would suggest you go with the GhostScript solution with a slight twist. LaTeX allows you to set the starting page number using \setcounter{page}{13}, for example. If you can find an application that can count the pages of a PDF document (pdfinfo, from the poppler-utils package on Ubuntu, is one example), then you can do the following:
Compile the next document to PDF
Concatenate the latest PDF with the current full PDF
Find the page count of the full PDF
Use sed to pluck in a \setcounter{page}{N} command into the next latex file
Go back to the beginning
If you need to do any other processing, again use sed. You should (assuming you fix the infinite loop in the above algorithm ;-) ) end up with a final PDF document with all the original PDFs concatenated and continuous page numbers.
Have a look at the combine package, which seems to be exactly what you're searching for.
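The bookkeeping in the loop above comes down to a running sum: each report starts on the page after the last page of everything before it. A minimal sketch in Python, assuming the page counts of the compiled reports are already known (for example, read off from pdfinfo); the helper names are hypothetical.

```python
def starting_pages(page_counts, first_page=1):
    """Return the page number each document should start on."""
    starts = []
    next_page = first_page
    for count in page_counts:
        starts.append(next_page)
        next_page += count  # the next document begins right after this one
    return starts


def setcounter_line(start):
    """Build the LaTeX command to insert into a report's source."""
    return "\\setcounter{page}{%d}" % start


if __name__ == "__main__":
    # Three reports of 2, 1 and 2 pages, placed after 12 pages of dissertation.
    for start in starting_pages([2, 1, 2], first_page=13):
        print(setcounter_line(start))
```

Iterating over a finite list of reports also takes care of the loop's missing exit condition. One caveat: if a different starting number changes a report's page count, the counts have to be measured again after recompiling.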
Since it merges documents at the source level, I guess the page numbers will be correct.

Best Way to Automate Adding Text to an Image and formatting for Printing?

Here's what I have:
Quarter Sheet Flyer (4 per page) as a PSD or JPG
Text file with one entry of text per line.
What I want to do:
Print out 100 flyers (on 25 pieces of paper)
Somehow automate the process of adding the text to the image, either via some scripting language or a Photoshop automated task. Then format the pages to print, either to generate a 25 page PDF file or generate four at a time and send them to the printer page by page.
Anyone have any experience with something like this or have any recommendations on how I should go about doing this?
Thanks for your help!
You can use Microsoft Word automation to generate a Word file with the correct text and image, and then just print it.
This would be one of the simpler solutions; you can implement the entire thing as a Word macro (VBA).
A more complex solution would be to use VB6 or .NET to draw the text and the image onto a form and then print the form.
You can write a script that will generate an HTML page with the image and the text, and then print out the HTML using a browser.
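The HTML route can be sketched as follows. This is only an illustration under assumptions: the flyer exists as a JPG the browser can load, the text sits at a fixed spot on the image (the CSS offsets here are placeholders to adjust), and four flyers per sheet are laid out in a two-column grid with a page break after each group. The function name is invented for the example.

```python
import html

# Placeholder layout: two columns, text centred near the bottom of each flyer.
PAGE_CSS = """
.page { display: grid; grid-template-columns: 1fr 1fr; page-break-after: always; }
.flyer { position: relative; }
.flyer img { width: 100%; display: block; }
.flyer p { position: absolute; bottom: 10%; left: 0; right: 0;
           margin: 0; text-align: center; }
"""


def build_flyer_html(entries, image_src, per_page=4):
    """Return one HTML document with per_page flyers on each printed page."""
    pages = []
    for start in range(0, len(entries), per_page):
        flyers = "".join(
            '<div class="flyer"><img src="{}" alt=""><p>{}</p></div>'.format(
                html.escape(image_src, quote=True), html.escape(text)
            )
            for text in entries[start:start + per_page]
        )
        pages.append('<div class="page">{}</div>'.format(flyers))
    return (
        '<!DOCTYPE html><html><head><meta charset="utf-8">'
        "<style>{}</style></head><body>{}</body></html>"
    ).format(PAGE_CSS, "".join(pages))
```

Reading the text file with one entry per line, passing the list to build_flyer_html, and writing the result to a .html file gives something a browser can print, or save as a 25-page PDF for 100 entries.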
