This is more of a lesson learned than a question. I was recently struggling to match strings parsed out of a PDF using PDFBox. My solution might be helpful to others.
A list of text was obtained from the PDF using PDFBox like this (Exceptions omitted for brevity):
List<String> lines = new ArrayList<String>();
PDDocument document = PDDocument.load(f);
PDFTextStripper pdfStripper = new PDFTextStripper();
String text = pdfStripper.getText(document);
String[] pageText = text.trim().split(pdfStripper.getLineSeparator());
for (String line : pageText) {
    lines.add(line);
}
The List now contains all the lines from the file in order.
However, String.contains and String.equals fail on lines that are seemingly identical in the logs (e.g. 'EMERA INCORPORATED'). Converting each character to hex made it clear that the space character was the issue:
Line (Parsed from PDF with PDF Box): EMERA INCORPORATED
45 4d 45 52 41 a0 49 4e 43 4f 52 50 4f 52 41 54 45 44
CompanyName (Set In Java): EMERA INCORPORATED
45 4d 45 52 41 20 49 4e 43 4f 52 50 4f 52 41 54 45 44
Note the 'a0' in the PDFBox String where in Java there is the space ('20').
The solution was to use a regex to identify the line: EMERA\S+INCORPORATED. This gives better control over the matching, so it's not bad. But it was a bit annoying to figure out, because the strings being compared looked identical when reviewing the logs, yet both contains and equals returned false.
My conclusion: use a regex to identify text patterns coming out of a PDF (obtained with PDFBox), and be sure to use '\S' to allow for unusual space characters. Maybe this post can save someone some pain. Also, perhaps someone more familiar with PDFBox could provide tips on using the API better if this is user error on my part.
perhaps someone more familiar with PDFBox could provide tips on using the API better if this is user error on my part
It is not an error in PDFBox API usage. It is not even specific to PDFBox at all. It is more a matter of wrong expectations.
Different kinds of space characters
First of all, there are different kinds of space characters. There of course is the most often used Unicode Character 'SPACE' (U+0020) but there also are others, in particular the Unicode Character 'NO-BREAK SPACE' (U+00A0).
Thus, if you don't know that only one particular space character is used in a given text, it is completely normal to use regular expressions with '\S' instead of ' '.
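A quick demonstration of the difference (sketched in Python purely for illustration; the question's code is Java):

```python
import re

# The line as extracted from the PDF contains U+00A0 (NO-BREAK SPACE);
# the comparison string typed in source code contains U+0020 (SPACE).
pdf_line = "EMERA\u00a0INCORPORATED"
expected = "EMERA INCORPORATED"

print(pdf_line == expected)  # False: the strings look identical but differ at one character
print(" ".join(f"{ord(c):02x}" for c in pdf_line))  # ...a0... instead of ...20...

# An explicit character class that accepts either kind of space.
# (In Python, \s already matches U+00A0; in default Java regex it does not,
# which is exactly why the question's \S, i.e. "not whitespace", matched it there.)
pattern = re.compile(r"EMERA[\s\u00a0]+INCORPORATED")
print(bool(pattern.search(pdf_line)))  # True
print(bool(pattern.search(expected)))  # True
```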
What does PDFBox extract?
In the case at hand, the non-breaking space was not even chosen by PDFBox. Instead, it was ingrained in the PDF.
When extracting text from a PDF, PDFBox (just like other PDF libraries) uses the information inside the PDF about which glyph represents which Unicode character. This information can be given by an Encoding entry or a ToUnicode entry of the respective font declaration in the PDF.
Only if there is a gap between two text chunks (a free space not created by drawing a space character but by moving the text insertion point without a text character), PDF text extractors add a space character of their respective choice, usually the regular space.
As PDFBox does use the regular space in the latter case, the issue at hand is a situation of the first case: the PDF itself indicates that the space there is a non-breaking one.
Related
I have a client who is migrating from Samsung printer to Sharp. One "critical" print job they need to do involves printing barcodes on sticky labels for shipping. The print job is being generated by a piece of VERY old software whose origin is in the pre-Smartphone era (and probably no source code).
The barcode font is in a recognizable PCL file and can be uploaded to the Sharp. The actual print job, on the other hand, I sort of see pieces of PCL escape sequences but not in any form the Sharp (or even a HP) could do anything useful with.
For example, I see the sequence "&l1O" (hex 26 6c 31 4f) which, if it were preceded by an ESC, would select landscape mode. What I see instead is (in hex) 1b 15 36 before the "&l1O". In other places, wherever I expect a PCL escape sequence I am seeing similar 3-octet groupings: 1b 15 37, 1b 14 21, etc. instead of a single 1b preceding the PCL command.
So my question: can anybody please point me to documentation or a site that would help me interpret these sequences? I am thinking that if I know what the original file is doing, I can write a filter to run the file through to produce something useful.
BTW, I have gone through the PCL5 Technical Ref manual and these are not mentioned anywhere in it.
Thank you.
These commands are documented under "Primary and Secondary Fonts" in the HP PCL5 Technical Reference:
The printer maintains two independent font select tables for use in
selecting a primary font and a secondary font. All of the
characteristics previously described apply to both tables. This
provides access to two distinct fonts, only one of which is selected
at a given time. To alternate between the primary and the secondary
font, the control codes ‘‘SI’’ (Shift In; ASCII 15) is used to
designate primary and ‘‘SO’’ (Shift Out; ASCII 14) is used to
designate secondary.
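As a starting point for the filter mentioned in the question, a first analysis pass could simply locate and label the control bytes the documentation names. The sketch below (Python, purely illustrative; the function name and the handling are assumptions to adapt once the 3-octet groupings are fully understood) classifies ESC, SI and SO:

```python
# A sketch of a first analysis pass over raw print data: walk the bytes and
# report the control codes of interest. The byte values for SI/SO follow the
# quoted PCL documentation (ASCII 15 and 14 decimal); everything else is an
# assumption to adapt to the actual job.
ESC, SO, SI = 0x1B, 0x0E, 0x0F

def classify(data: bytes):
    """Yield (offset, label) for each control byte of interest."""
    for i, b in enumerate(data):
        if b == ESC:
            yield i, "ESC (introduces a PCL command)"
        elif b == SI:
            yield i, "SI (shift in: select primary font)"
        elif b == SO:
            yield i, "SO (shift out: select secondary font)"

sample = b"\x1b&l1O" + bytes([SO]) + b"*BARCODE*" + bytes([SI])
for offset, label in classify(sample):
    print(offset, label)
```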
I know in Apple's PDFKit I can get 'string' which returns an NSString object representing the text on the page.
https://developer.apple.com/documentation/pdfkit/pdfpage?language=objc
Is there a way to change text that's in the PDF? If not, how do you recommend I go about figuring out how to edit text in a PDF? Thank you!
To understand your real problem, you need to know more about how a PDF works.
First, a PDF is more like a container of (drawing, rendering) instructions than a container of content.
There are two flavors of PDF: tagged and untagged. A tagged PDF is essentially a normal PDF document plus a tree-like data structure that tells you which parts of the document make up which logical elements.
Comparable to HTML, which contains a logical structure, the tags mark paragraphs, bullet points in lists, rows in tables, etc.
If you have an untagged document, you are essentially left with nothing but the bare rendering instructions:
go to position 50, 50
set font to Arial
set font color to 0, color-space to grayshades
draw the glyph for 'H'
go to position 60, 50
draw the glyph for 'e'
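In actual PDF syntax, such instructions appear as operators in the page's content stream. The fragment below uses real PDF operators (BT, Tf, Td, Tj, ET); the font name and coordinates are illustrative:

```
BT                % begin a text object
/F1 12 Tf         % select font resource /F1 at 12 points
50 50 Td          % move the text position to (50, 50)
(He) Tj           % draw the glyphs for 'H' and 'e'
ET                % end the text object
```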
Instructions like this are gathered into objects. Objects can be gathered into streams. Streams can be compressed. Instructions and objects do not need to appear in any logical order.
Having objects means that you can re-use certain things. Like drawing an image on every page of a company letterhead. Or instructions like 'use the font in object 456'.
In order to be able to work with these objects, every object is given a number. And a mapping of objects, their number, and their byte-offset in the file is stored at the back of the document. This is known as the XREF table.
xref
152 42
0000000016 00000 n
0000001240 00000 n
0000002133 00000 n
0000002296 00000 n
0000002344 00000 n
0000002380 00000 n
0000002551 00000 n
Now, back to your problem.
Suppose that you replace the word 'dog' with the word 'cats'.
You'd run into several problems:
every byte offset in the document is suddenly wrong, since 'cats' contains 4 bytes, and 'dog' contains 3 bytes.
no object can be found, all instructions go wrong
if at any point your substitution causes the text to go too far out of alignment, you would need to perform layout again.
Why is layout such a problem?
Remember what I said earlier about the PDF containing only the rendering instructions. It's insanely hard to reconstruct things like paragraph-boundaries, or tables, lists, etc from the raw instructions.
Especially so if you want to do this for other scripts than just Latin script (imagine Hebrew, or Arabic). Or if your page layout is non-standard (like a scientific article, which appears in columns rather than lines that take up an entire page.)
Structure recognition is in fact the topic of ongoing research.
I just finished reading the article "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" by Joel Spolsky.
I'd really appreciate clarification on this part of the article.
OK, so say we have a string: Hello which, in Unicode, corresponds to these five code points:
U+0048 U+0065 U+006C U+006C U+006F...That’s where encodings come in.
The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let’s just store those numbers in two bytes each. So Hello becomes
00 48 00 65 00 6C 00 6C 00 6F
Right? Not so fast! Couldn’t it also be:
48 00 65 00 6C 00 6C 00 6F 00 ?
Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
My questions:
Why could the two zeros at the beginning of 0048 be moved to the end?
What is FE FF and FF FE, what's the difference between them and how were they used? (Yes I tried googling those terms, but I'm still confused)
Why did he then say "Phew. Not every Unicode string in the wild has a byte order mark at the beginning."?
Also, I'd appreciate any recommended resources to learn more about this stuff.
Summary: the 0xFEFF (byte-order mark) character is used to solve the endianness problem for some character encodings. However, most of today's character encodings are not prone to the endianness problem, and thus the byte-order mark is not really relevant for today.
Why could the two zeros at the beginning of 0048 be moved to the end?
If two bytes are used for all characters, then each character is saved in a 2-byte data structure in the memory of the computer. Bytes (groups of 8 bits) are the basic addressable units in most computer memories, and each byte has its own address. On systems that use the big-endian format, the character 0x0048 would be saved in two 1-byte memory cells in the following way:
n n+1
+----+----+
| 00 | 48 |
+----+----+
Here, n and n+1 are the addresses of the memory cells. Thus, on big-endian systems, the most significant byte is stored in the lowest memory address of the data structure.
On a little-endian system, on the other hand, the character 0x0048 would be stored in the following way:
n n+1
+----+----+
| 48 | 00 |
+----+----+
Thus, on little-endian systems, the least significant byte is stored in the lowest memory address of the data structure.
So, if a big-endian system sends you the character 0x0048 (for example, over the network), it sends you the byte sequence 00 48. On the other hand, if a little-endian system sends you the character 0x0048, it sends you the byte sequence 48 00.
So, if you receive a byte sequence like 00 48, and you know that it represents a 16-bit character, you need to know whether the sender was a big-endian or little-endian system. In the first case, 00 48 would mean the character 0x0048, in the second case, 00 48 would mean the totally different character 0x4800.
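This can be reproduced with, for example, Python's struct module (used here purely for illustration; any language with explicit byte-order control would do):

```python
import struct

# The same 16-bit value 0x0048 serialized in both byte orders:
# ">" = big-endian, "<" = little-endian.
big = struct.pack(">H", 0x0048)
little = struct.pack("<H", 0x0048)
print(big.hex(" "))     # 00 48
print(little.hex(" "))  # 48 00

# Interpreting the same two bytes under each assumption gives two
# completely different characters:
print(hex(struct.unpack(">H", b"\x00\x48")[0]))  # 0x48   -> 'H'
print(hex(struct.unpack("<H", b"\x00\x48")[0]))  # 0x4800
```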
This is where the FE FF sequence comes in.
What is FE FF and FF FE, what's the difference between them and how were they used?
U+FEFF is the Unicode byte-order mark (BOM), and in our example of a 2-byte encoding, this would be the 16-bit character 0xFEFF.
The convention is that all systems (big-endian and little-endian) save the character 0xFEFF as the first character of any text stream. Now, on a big-endian system, this character is represented as the byte sequence FE FF (assume memory addresses increasing from left to right), whereas on a little-endian system, it is represented as FF FE.
Now, if you read a text stream, that has been created by following this convention, you know that the first character must be 0xFEFF. So, if the first two bytes of the text stream are FE FF, you know that this text stream has been created by a big-endian system. On the other hand, if the first two bytes are FF FE, you know that the text stream has been created by a little-endian system. In either case, you can now correctly interpret all the 2-byte characters of the stream.
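This convention is exactly what, for example, Python's generic utf-16 codec implements (shown purely as an illustration):

```python
# Encoding with an explicit BOM in both byte orders, then decoding with the
# generic "utf-16" codec, which inspects the BOM to pick the byte order.
be = "\ufeffH".encode("utf-16-be")  # FE FF 00 48
le = "\ufeffH".encode("utf-16-le")  # FF FE 48 00
print(be.hex(" "), "/", le.hex(" "))

# Both byte streams decode to the same text once the BOM is honoured:
print(be.decode("utf-16"), le.decode("utf-16"))  # H H
```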
Why did he then say "Phew. Not every Unicode string in the wild has a byte order mark at the beginning."?
Placing the byte-order mark (BOM) character 0xFEFF at the beginning of each text stream is just a convention, and not all systems may follow it. So, if the BOM is missing, you have the problem of not knowing whether to interpret the 2-byte characters as big-endian or little-endian.
Also, I'd appreciate any recommended resources to learn more about this stuff.
https://en.wikipedia.org/wiki/Endianness
https://en.wikipedia.org/wiki/Unicode
https://en.wikibooks.org/wiki/Unicode/Character_reference
https://en.wikipedia.org/wiki/Byte_order_mark
https://en.wikipedia.org/wiki/UTF-16#Byte_order_encoding_schemes
Notes:
Today, the most widely used Unicode-compatible encoding is UTF-8. UTF-8 has been designed to avoid the endianness problem, thus, the entire byte-order mark 0xFEFF stuff is not relevant for UTF-8 (see here).
The byte-order mark is however relevant to the other Unicode-compatible encodings UTF-16 and UTF-32, which are prone to the endianness problem. If you browse through the list of available encodings, for example in the settings of a text editor or terminal, you see that there are big-endian and little-endian versions of UTF-16 and UTF-32, typically called UTF-16BE and UTF-16LE, or UTF-32BE and UTF-32LE, respectively. However, UTF-16 and UTF-32 are rarely used in practice.
Other popular encodings used today include the encodings from the ISO 8859 series, such as ISO 8859-1 (and the derived Windows-1252), known as Latin-1, or also the pure ASCII encoding. However, all these are single-byte encodings, that is, each character is encoded to 1 byte and saved in a 1-byte data structure. Thus, the endianness problem doesn't apply here, and the byte-order mark story is also not relevant for these cases.
All in all, the endianness problem for character encodings that you struggled to understand is thus mostly of historical value, and is not really relevant in today's world anymore.
This is all to do with the internal storage of data in the computer's memory. In this example (00 48), some computers will store the most significant byte first and the least significant byte second (known as big-endian), and others will store the least significant byte first (little-endian). So, depending on your computer, when you read the bytes out of memory you'll get either the 00 first or the 48 first. And you need to know which way round it's going to be to make sure you interpret the bytes correctly. For a more in-depth introduction to the topic, see Endianness on Wikipedia (https://en.wikipedia.org/wiki/Endianness).
These days, most compilers and interpreters will take care of this low-level stuff for you, so you will rarely (if ever) need to worry about it.
I am writing a program that converts national and international account numbers into IBAN numbers. To start, I need to form a string: Bank ID + Branch ID + Account Number + ISO Country Code without the trailing spaces that may be present in these fields. But not every account number has the same length, some account numbers have branch identifiers while others don't, so I will always end up with trailing spaces from these fields.
My working storage looks something like this:
01 Input-IBAN.
05 BANK-ID PIC N(10) VALUE "LOYD".
05 BRANCH-ID PIC N(10) VALUE " ".
05 ACCOUNT-NR PIC N(28) VALUE "012345678912 ".
05 COUNTRY-CODE PIC N(02) VALUE "GB".
01 Output-IBAN PIC N(34).
I've put some values in there for the example; in reality it would depend on the input. The branch code is optional, hence me leaving it empty in the example.
I basically want to go from this input strung together:
"LOYD 012345678912 GB"
to this:
"LOYD012345678912GB"
Does anyone know a way to do this that does not result in performance issues? I have thought of using the FUNCTION REVERSE and then using an INSPECT for tallying leading spaces. But I've heard that's a slow way to do it. Does anyone have any ideas? And maybe an example on how to use said idea?
EDIT:
I've been informed that the elementary fields may contain embedded spaces.
I see now that you have embedded blanks in the data. Neither answer you have so far works, then. Gilbert's "squeezes out" the embedded blanks, mine would lose any data after the first blank in each field.
However, just to point out, I don't really believe you can have embedded blanks if you are in any way generating an "IBAN". For instance, https://en.wikipedia.org/wiki/International_Bank_Account_Number#Structure,
specifically:
The IBAN should not contain spaces when transmitted electronically.
When printed it is expressed in groups of four characters separated by
a single space, the last group being of variable length
If your source-data has embedded blanks, at the field level, then you need to refer that back up the line for a decision on what to do. Presuming that you receive the correct answer (no embedded blanks at the field level) then both existing answers are back on the table. You amend Gilbert's by (logically) changing LENGTH OF to FUNCTION LENGTH and dealing with any possibility of overflowing the output.
With the STRING you again have to deal with the possibility of overflowing the output.
Original answer based on the assumption of no embedded blanks.
I'll assume you don't have embedded blanks in the elementary items which make up your structure, as they are sourced by standard values which do not contain embedded blanks.
MOVE SPACE TO OUTPUT-IBAN
STRING BANK-ID
BRANCH-ID
ACCOUNT-NR
COUNTRY-CODE
DELIMITED BY SPACE
INTO OUTPUT-IBAN
STRING only copies the values until it runs out of data to copy, so it is necessary to clear the OUTPUT-IBAN before the STRING.
Copying of the data from each source field will end when the first SPACE is encountered in each source field. If a field is entirely space, no data will be copied from it.
STRING will almost certainly cause a run-time routine to be executed and there will be some overhead for that. Gilbert LeBlanc's example may be slightly faster, but with STRING the compiler deals automatically with all the lengths of all the fields. Because you have National fields, ensure you use the figurative-constant SPACE (or SPACES, they are identical) not a literal value which you think contains a space " ". It does, but it doesn't contain a National space.
If the result of the STRING is greater than 34 characters, the excess characters will be quietly truncated. If you want to deal with that, STRING has an ON OVERFLOW phrase, where you specify what you want done in that case. If using ON OVERFLOW, or indeed NOT ON OVERFLOW you should use the END-STRING scope-terminator. A full-stop/period will terminate the STRING statement as well, but when used like that it can never, with ON/NOT ON, be used within a conditional statement of any type.
Don't use full-stops/periods to terminate scopes.
COBOL doesn't have "strings". You cannot get rid of trailing spaces in fixed-length fields, unless the data fills the field. Your output IBAN will always contain trailing spaces when the data is short.
If you were to actually have embedded blanks at the field level:
Firstly, if you want to "squeeze out" embedded blanks so that they don't appear in the output, I can't think of a simpler way (using COBOL) than Gilbert's.
Otherwise, if you want to preserve embedded blanks, you have no reasonable choice other than to count the trailing blanks so that you can calculate the length of the actual data in each field.
COBOL implementations do have Language Extensions. It is unclear which COBOL compiler you are using. If it happens to be AcuCOBOL (now from Micro Focus) then INSPECT supports TRAILING, and you can count trailing blanks that way. GnuCOBOL also supports TRAILING on INSPECT and in addition has a useful intrinsic FUNCTION, TRIM, which you could use to do exactly what you want (trimming trailing blanks) in a STRING statement.
move space to your-output-field
string function trim ( your-first-national-source trailing )
       function trim ( your-second-national-source trailing )
       function trim ( your-third-national-source trailing )
       ...
       delimited by size
       into your-output-field
Note that other than the PIC N in your definitions, the code is the same as if using alphanumeric fields.
However, for Standard COBOL 85 code...
You mentioned using FUNCTION REVERSE followed by INSPECT. INSPECT can count leading spaces, but not, by Standard, trailing spaces. So you can reverse the bytes in a field, and then count the leading spaces.
You have National data (PIC N). A difference with that is that it is not bytes you need to count, but characters, which are made up of two bytes. Since the compiler knows you are using PIC N fields, there is only one thing to trip you - the Special Register, LENGTH OF, counts bytes, you need FUNCTION LENGTH to count characters.
National data is UTF-16, which happens to mean that when a character is a displayable ASCII one, one of its two bytes happens to match the ASCII code. That doesn't matter either when running on z/OS, an EBCDIC machine, as the compiler will do the necessary conversions automatically for literals or alphanumeric data-items.
MOVE ZERO TO a-count-for-each-field
INSPECT FUNCTION REVERSE ( each-source-field )
    TALLYING a-count-for-each-field
        FOR LEADING SPACE
After doing one of those for each field, you could use reference-modification.
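The reverse-then-count-leading idea itself is language-neutral; purely to illustrate the logic (this is not a substitute for the COBOL, and the function name is made up), it can be sketched like this:

```python
# Counting the leading spaces of the reversed field gives the number of
# trailing spaces, and hence the length of the real data in the field.
def data_length(field: str) -> int:
    reversed_field = field[::-1]
    trailing_spaces = len(reversed_field) - len(reversed_field.lstrip(" "))
    return len(field) - trailing_spaces

print(data_length("LOYD      "))               # 4
print(data_length("012345678912" + 16 * " "))  # 12
```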
How to use reference-modification for this?
Firstly, you have to be careful. Secondly you don't.
Secondly first:
MOVE SPACE TO output-field
STRING field-1 ( 1 : length-1 )
field-2 ( 1 : length-2 )
DELIMITED BY SIZE
INTO output-field
Again deal with overflow if possible/necessary.
It is also possible with plain MOVEs and reference-modification, as in this answer, https://stackoverflow.com/a/31941665/1927206, whose question is close to a duplicate of your question.
Why do you have to be careful? Again, from the answer linked previously, theoretically a reference-modification can't have a zero length.
In practice, it will probably work. COBOL programmers generally seem to be so keen on reference-modification that they don't bother to read about it fully, so don't worry about a zero-length not being Standard, and don't notice that it is non-Standard, because it "works". For now. Until the compiler changes.
If you are using Enterprise COBOL V5.2 or above (possibly V5.1 as well, I just haven't checked) then you can be sure, by compiler option, if you want, that a zero-length reference-modification works as expected.
Some other ways to achieve your task, if embedded blanks can exist and can be significant in the output, are covered in that answer. With National, just always watch to use FUNCTION LENGTH (which counts characters), not LENGTH OF (which counts bytes). Usually LENGTH OF and FUNCTION LENGTH give the same answer. For multi-byte characters, they do not.
I have no way to verify this COBOL. Let me know if this works.
77 SUB1 PIC S9(4) COMP.
77 SUB2 PIC S9(4) COMP.
MOVE 1 TO SUB2
PERFORM VARYING SUB1 FROM 1 BY 1
UNTIL SUB1 > LENGTH OF INPUT-IBAN
IF INPUT-IBAN(SUB1:1) IS NOT EQUAL TO SPACE
MOVE INPUT-IBAN(SUB1:1) TO OUTPUT-IBAN(SUB2:1)
ADD +1 TO SUB2
END-IF
END-PERFORM.
We are doing Natural Language Processing on a range of English language documents (mainly scientific) and run into problems in carrying non-ANSI characters through the various components. The documents may be "ASCII", UNICODE, PDF, or HTML. We cannot predict at this stage what tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols including mathematical and Greek. We would like to "flatten" these into a text string which will survive multistep processing (including XML and regex tools) and then possibly reconstitute it in the last step (although it is the semantics rather than the typography we are concerned with so this is a minor concern).
I appreciate that there is no absolute answer - any escaping can clash in some cases - but I am looking for something along the lines of XML's <![CDATA[ ...]]>, which will survive most non-recursive XML operations. Characters such as [ are bad, as they are common in regexes. So I'm wondering if there is a generally adopted approach rather than inventing our own.
A typical example is the "degrees" symbol:
HTML Entity (decimal) &#176;
HTML Entity (hex) &#xb0;
HTML Entity (named) &deg;
How to type in Microsoft Windows Alt +00B0
Alt 0176
Alt 248
UTF-8 (hex) 0xC2 0xB0 (c2b0)
UTF-8 (binary) 11000010:10110000
UTF-16 (hex) 0x00B0 (00b0)
UTF-16 (decimal) 176
UTF-32 (hex) 0x000000B0 (00b0)
UTF-32 (decimal) 176
C/C++/Java source code "\u00B0"
Python source code u"\u00B0"
We are also likely to encounter TeX
$10\,^{\circ}{\rm C}$
or
\degree
so backslashes, curlies and dollars are a poor idea.
We could for example use markup like:
__deg__
__#176__
and this will probably work but I'd appreciate advice from those who have similar problems.
update I accept @MichaelB's insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform, and if so I'll revisit this. Note that my original question is not well worded - read his answer and the link in it.
Get someone to do this who really understands character encodings. It looks like you don't, because you're not using the terminology correctly. Alternatively, read this.
Do not brew up your own escape scheme - it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly. Perhaps use UTF-7 if you're really that scared of high bits.
In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn't, abandon it - it is most likely very bad quality code in many other ways as well and not worth the hassle using.
Maybe I don't get the problem correctly, but I would create a very unique escape marker which is unlikely to be touched, and then use it to enclose the entity encoded as a base32 string.
Eventually, you can transmit the unique markers and their number along the chain through a separate channel, and check their presence and number at the end.
Example, something like
the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee
your marker is a uuid, and the entity ("&deg", for the degree sign) is encoded in base32. You then pass along the marker cd48d8c50d7f40aeb6a164181b17feee. It cannot be corrupted (if it gets corrupted, your filters will probably corrupt anything made of letters and numbers anyway, but at least you can exclude them because they are fixed length) and you can always recover the content by looking inside the two markers.
Of course, if you have uuids in your documents, this could represent a problem, but since you are not transmitting them as authorized markers along the lateral channel, they won't be recognized as such (and in any case, what's in between won't validate as a base32 string anyway).
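An end-to-end sketch of the scheme (Python; the wrap/unwrap helpers and their names are made up for illustration):

```python
import base64
import re
import uuid

def wrap(entity: str, marker: str) -> str:
    """Enclose the base32-encoded entity between two copies of the marker."""
    payload = base64.b32encode(entity.encode("utf-8")).decode("ascii")
    return f"{marker} {payload} {marker}"

def unwrap(text: str, marker: str) -> str:
    """Recover the entity from between the two markers."""
    m = re.search(re.escape(marker) + r"\s+(\S+)\s+" + re.escape(marker), text)
    return base64.b32decode(m.group(1)).decode("utf-8")

marker = str(uuid.uuid4())
wrapped = wrap("\u00b0", marker)  # the degree sign
print(wrapped)
print(unwrap("the temperature was 18 " + wrapped, marker))
```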
If you need to search for them, then you can keep the uuid subdivision, and then use a proper regexp to spot these occurrences. Example:
>>> re.search(r"(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')
>>>
If you really need a specific "token" to test, you can use a uuid1, with a very defined specification of a node:
>>> uuid.uuid1(node=0x1234567890)
UUID('bdcce554-e95d-11de-bd0f-001234567890')
>>> uuid.uuid1(node=0x1234567890)
UUID('c4c57a91-e95d-11de-90ca-001234567890')
>>>
You can use anything you prefer as a node, the uuid will be unique, but you can still test for presence (although you can get false positives).