How do you use PasteSpecial in Delphi to paste into PowerPoint via OLE? I have RTF data I want to paste into PowerPoint, and I need to use PasteSpecial. However, I cannot find documentation on how to fill out the parameters it needs.
PasteSpecial is just going to favor one format over the others. So you can prioritize the formats, or eliminate formats, to influence the pasting. For example, if you have RTF and TEXT on the clipboard, and PP always pastes TEXT by default, even if RTF is listed first, then you could just eliminate TEXT and provide ONLY RTF. Then it has to paste as RTF.
MSDN has documentation for the 2003 and 2007 versions. In both cases, the first parameter (DataType) should be ppPasteRTF if you want to choose the RTF-format clipboard contents. You can use EmptyParam for the remaining five parameters.
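For reference, here is a minimal sketch of the call, shown in Python with pywin32 since the COM call shape is the same from Delphi through an OleVariant; the presentation and slide setup are illustrative only:

```python
import win32com.client

ppPasteRTF = 9      # PpPasteDataType: choose the RTF clipboard format
ppLayoutBlank = 12  # PpSlideLayout: a blank slide layout

ppt = win32com.client.Dispatch("PowerPoint.Application")
ppt.Visible = True
slide = ppt.Presentations.Add().Slides.Add(1, ppLayoutBlank)

# Assumes the RTF data is already on the clipboard. Forcing the first
# parameter (DataType) to ppPasteRTF makes PowerPoint take the RTF
# flavor even when a plain-text flavor is also available.
slide.Shapes.PasteSpecial(ppPasteRTF)
```

In Delphi the equivalent would be Slide.Shapes.PasteSpecial(ppPasteRTF, EmptyParam, EmptyParam, EmptyParam, EmptyParam, EmptyParam).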
I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no error/warning messages. It does seem that the content of the PDF is processed, though.
When using the standalone Tabula tool, the characters are encoded properly.
I searched the tabula-py and tabula-java documentation online; below are the suggestions I could find, but they don't change the output:
Setting -Dfile.encoding=utf8 (in the Java call made by tabula-py or tabula-java)
Setting chcp 65001 (in the Windows command prompt)
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
The text, as in any PDF, is written in the author's arbitrary order; for example, the first line of the PDF body (港区内認可保育園等一覧) is the 1262nd block of text, added long after the table was started. To hear the written order we can use Read Aloud to verify character and language recognition, but unless the PDF was correctly tagged it will also jump from text block to text block.
So internally the text is rarely tabular; the first 8 lines are:
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need text extractors that work in a grid-like manner or that convert the text layout into row-by-row output.
This is where all extractors will be confounded as to how to output such a jumbled dense layout and generally ALL will struggle with this page.
Hence it's best to use a good generic solution. It will still need data cleaning, but at least you will have something to work on.
If you only need a zone from the page it is best to set the boundary of interest to avoid extraneous parsing.
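With tabula-py that boundary is the area option. A hedged sketch (the file name and coordinates are placeholders; measure your own table's bounding box first):

```python
import tabula  # pip install tabula-py

tables = tabula.read_pdf(
    "hoikuen.pdf",            # placeholder file name
    pages=1,
    area=[95, 20, 780, 580],  # top, left, bottom, right in points (example values)
    guess=False,              # don't let tabula re-guess the region
)
```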
Your "standalone Tabula tool" output is very good but could possibly be better by use pdftotext -layout and adjust some options to produce amore regular order.
Your Question
the difference in encoding output?
The Answer
The output from a PDF is not its internal encoding. The desired text output is UTF-8, but a PDF does not store text as UTF-8 or Unicode; it simply uses numbers from a font character map. If the map is poor, everything comes out as gibberish. In this case the map is good, so where does the gibberish arise? It arises because the output side is not using UTF-8, and console output is rarely Unicode.
You correctly note that the console needs to be set to Unicode mode; then the output should match (except for the density problem).
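One way to take the console out of the picture entirely, sketched with tabula-py (file names are placeholders): keep the pipeline UTF-8 end to end and write to a file.

```python
import tabula

tables = tabula.read_pdf(
    "hoikuen.pdf",                          # placeholder file name
    pages="all",
    java_options=["-Dfile.encoding=UTF8"],  # the suggestion from the question
)

# Writing to an explicitly UTF-8 file sidesteps the Windows console's
# default (non-Unicode) code page altogether.
with open("out.csv", "w", encoding="utf-8", newline="") as f:
    for df in tables:
        df.to_csv(f, index=False)
```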
The density issue would be easier to handle if the content were preprocessed into a flowing format such as HTML, or by using a different language.
I need to scan a document. It's not OCR; let me show you:
--Example--
Table of Contents
Some Italic Words
Sentence 23
--End--
Suppose that is a ".doc" formatted document. I need to scan it line by line and understand that the first line is bold, the second is italic, and the third includes a space after the first word, followed by a number. The reason I want to recognize them is that I need to categorize them in a table view: bold lines, italics, numbered lines, etc.
I'm okay in both Swift and Objective-C but totally clueless about document scanning. If you can offer any reference, framework, or approach, I would be grateful.
Variant 1: your doc is really a .docx (a docx is XML). Parse the XML. The format defines the XML tags it uses to mark text as bold or italic or whatever -- a docx is kind of like HTML.
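A quick sketch of that idea in Python with python-docx, just to show what the XML carries (the file name is a placeholder); in Swift you would unzip the .docx and read the same <w:b/> and <w:i/> run properties from word/document.xml with an XML parser:

```python
from docx import Document  # pip install python-docx

doc = Document("example.docx")  # placeholder file name
for para in doc.paragraphs:
    for run in para.runs:  # a run is a span of text with one set of properties
        if run.bold:
            print("bold:", run.text)
        elif run.italic:
            print("italic:", run.text)
        else:
            print("plain:", run.text)
```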
Variant 2: if your doc is really a .doc, then we are not talking about XML but a binary format. It is also documented, and you can go parse it, but I don't think it will be easy.
BUT
There is a library I know: doc2text that can parse a lot of stuff. (http://www.textlib.com/doc2text.html)
We used it in past projects and it did an okay job, and using it saves you A LOT of effort writing your own parsers.
Is it possible to search and replace a known string from a PDF with Objective-C/Quartz 2D?
I have some nicely formatted PDFs with tabular data, created with LaTeX (and generated with pdflatex). Every PDF will have a placeholder string, something like XXXXXX, that I would like to change programmatically.
These strings will be replaced only with other numbers.
I'm aware that the PDF could be an editable form, but I don't want that because I prefer to leave all the fonts and formatting exactly as they're typeset by LaTeX.
It is not possible to search and replace text in PDF files using Quartz 2D. Quartz 2D offers a read-only, low-level interface for reading PDF files. While searching can be implemented on top of it, albeit with much effort, modifying the files and replacing text is not possible.
I'm finding it difficult to parse a PDF file that was created in a non-English language. I used PDFBox and iText but couldn't find anything in them that could help parse this file. Here's the PDF file I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf. The PDF says it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252) with a "Differences" array. This Differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced; each name that follows it is the name of the character that replaces the original value at that code point, with consecutive names applying to consecutive code points.
There are no such character names as BB, BP, BQ, C6, and so on. So when you copy and paste that text, you get the garbage above.
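To make the mechanics concrete, here is a small sketch of how a /Differences array is interpreted: each number sets the next code point, and each following name consumes one consecutive code point. Run against the array above, it shows exactly which bytes get the bogus names:

```python
differences = [47, "BB", 61, "BP", "BQ", 81, "C6"]

encoding = {}
code = 0
for item in differences:
    if isinstance(item, int):
        code = item            # a number resets the current code point
    else:
        encoding[code] = item  # a name is assigned, then the code advances
        code += 1

print(encoding)  # {47: 'BB', 61: 'BP', 62: 'BQ', 81: 'C6'}
```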
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
render it to PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array.
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
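A rough sketch of that matching step in Python; read_charprocs() is a stand-in for whatever PDF library you use to pull the CharProc streams out of a Type 3 font -- it is hypothetical, not a real API, and real streams may need normalization before hashing:

```python
import hashlib

def key(stream: bytes) -> str:
    return hashlib.sha256(stream).hexdigest()

# Step 1: reference table built from PDFs rendered one known character
# at a time, e.g. {"అ": b"<charproc bytes>", ...}
known = {}
reference = {key(stream): char for char, stream in known.items()}

# Step 2: map the target font's glyph names to real characters by
# matching their CharProc streams against the reference table.
# read_charprocs() is the hypothetical extraction helper noted above.
translation = {}
for glyph_name, stream in read_charprocs("vishnusahasranaamam.pdf"):
    char = reference.get(key(stream))
    if char is not None:
        translation[glyph_name] = char
```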
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
Another clipboard question:
When text is put onto the clipboard, it frequently goes on in multiple ways, usually with and without formatting information. What I want to know is this -- how do you change the text on the clipboard without altering the formatting? In other words, I want to change the text side of things, but keep the formatting exactly the same.
This is again for my "TextScrubber" application where I want to remove line breaks from the text on the clipboard, but I don't want to alter the format info about that text.
I'm hoping that I don't have to "brute force" it by iterating over all the formats present, storing each, and then reinserting them after the text has been scrubbed.
I think the "brute force" is precisely what you'll have to do - according to MSDN Win32 API there is no way to do otherwise.
Yep, Nick. I think in this case you're going to be stuck with the solution already suggested. The clipboard is one area that hasn't really gotten much attention in the enhancement department throughout the years. That is probably because it does need to be simple, ubiquitous, and functional.
Why not simply load from the clipboard, change the text, and write back to the clipboard?
Maybe something simple like Sergey Tkachenko's TBin Clipboard: http://delphi32.org/vcl/2889/
Eric Rosenberger's answer to "Can not round trip html format to clipboard" might also be of use.