PDFJS losing check marks on pdf forms that are converted to text - pdf.js

I have been using an adaptation of code from these posts:
PDF to Text extractor in nodejs without OS dependencies
pdfjs: get raw text from pdf with correct newline/withespace
to convert pdfs to text:
import pdfjsLib from 'pdfjs-dist/legacy/build/pdf.js';
import {
  TextItem,
  DocumentInitParameters,
} from 'pdfjs-dist/types/src/display/api';

const getPageText = async (pdf: pdfjsLib.PDFDocumentProxy, pageNo: number) => {
  const page = await pdf.getPage(pageNo);
  const tokenizedText = await page.getTextContent();
  var textItems = tokenizedText.items;
  var finalString = '';
  var line = 0;
  // Concatenate the string of the item to the final string
  for (var i = 0; i < textItems.length; i++) {
    // transform[5] is the item's y position; a change means a new line
    if (line != (textItems[i] as TextItem).transform[5]) {
      if (line != 0) {
        finalString += '\r\n';
      }
      line = (textItems[i] as TextItem).transform[5];
    }
    var item = textItems[i];
    finalString += (item as TextItem).str;
  }
  return finalString;
};

export const getPDFText = async (
  data: string,
  password: string | undefined = undefined
) => {
  const initParams: DocumentInitParameters = {
    data: Buffer.from(data, 'base64'),
    //useSystemFonts: true,
    //disableFontFace: false,
    standardFontDataUrl: 'standard_fonts/'
  };
  if (password !== undefined) {
    initParams.password = password;
  }
  const pdf = await pdfjsLib.getDocument(initParams).promise;
  const maxPages = pdf.numPages;
  const pageTextPromises = [];
  for (let pageNo = 1; pageNo <= maxPages; pageNo += 1) {
    pageTextPromises.push(getPageText(pdf, pageNo));
  }
  const pageTexts = await Promise.all(pageTextPromises);
  const joined = pageTexts.join(' ');
  return joined;
};
With version 3.1.81 of pdfjs-dist this looks pretty good, but checkboxes on form fields are lost and text fields' values show up at the end of each page instead of remaining in context. I suspect this page: https://pdftotext.com/ uses pdfjs, based on similarities with my output, but they get the checks on the boxes and their text field "answers" appear next to the question.
Run with:
import { join } from 'path';
import { readFileSync } from 'fs';
const rawContents = readFileSync(join('directory', 'file.pdf'), 'base64');
const pdfText = await getPDFText(rawContents as string);
Anyone have an idea why I am losing the checks (the boxes are there)?
Sample of what I get:
22. when something something?
☐ 0-3 months ago
☐ 4-6 months ago
☐ 7-12 months ago
☐ 13-18 months ago
☐ 19-24 months ago
☐ 25-60 months ago
☐ I don't know
here is what that webpage gets:
22. when something something?
✔ 0-3 months ago
☐
☐ 4-6 months ago
☐ 7-12 months ago
☐ 13-18 months ago
☐ 19-24 months ago
☐ 25-60 months ago
☐ I don’t know
Again, my output looks like theirs but has lost these checks. I don't know for sure that they use pdfjs, but I think they do.
Note that I have downloaded and put a couple of fonts in the standard_fonts directory. Should I copy them all even if I see no warning message?

In forms, check boxes are a field boundary, not part of any nearby text (this is true of all fields: they are not directly connected to their description). They simply have a name and a value. Here Check Box1 and Box2 are placed, and Box3 is awaiting its surface appearance.
NOTE especially that they do not have a fixed appearance: they morph when displayed, so they only look as though they are present.
In these AcroForm cases they have no native plain-text equivalent. There is nothing to detect; the index simply points to page coordinates.
PDF.js is a PDF-to-HTML converter, so it can easily display those indexed areas as HTML fields.
NOTE IT'S AN X
In terms of the PDF's extractable surface there is no text, and for the boxes above and below there is only a description, as seen alongside those radio boxes.
NOTE IT'S A TICK; nothing differs except the displayer (viewer).
If we try to extract text using PDF.js (here in the browser) we get just the text.
In some cases, where the native Symbol or ZapfDingbats fonts, or another TTF with those code points, have been embedded and adapted for state, it may be possible to get a fonted checkmark symbol, but it is rare except when specially designed.
☐ as you see in your case then to replace with one
☑ is picking the correct one from font and add as
☒ replacement its not very easy but doable.
So the above symbols, printed from HTML to PDF, can be extracted again, as here, using simple pdftotext or Python:
☐ as you see in your case then to replace with one
☑ is picking the correct one from font and add as
☒ replacement its not very easy but doable.
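
Since the check state lives in the form field rather than in the page text, one option is to read the widget annotations separately and merge them with the text items by position. The sketch below is only an illustration in TypeScript, not how PDF.js or pdftotext.com does it. It assumes the annotation objects come from `page.getAnnotations()` and carry `subtype`, `fieldType`, `checkBox`, `fieldValue`, and `rect`; the names `TextLike`, `WidgetLike`, and `mergePageText` are hypothetical, and treating any value other than "Off" as checked is an assumption.

```typescript
// Minimal shapes for the data we need; assumed to match what
// page.getTextContent() and page.getAnnotations() return.
interface TextLike {
  str: string;
  transform: number[]; // transform[4] = x, transform[5] = y
}

interface WidgetLike {
  subtype: string;      // 'Widget' for form fields
  fieldType?: string;   // 'Btn' for buttons, 'Tx' for text fields
  checkBox?: boolean;
  fieldValue?: unknown;
  rect: number[];       // [x1, y1, x2, y2] in page coordinates
}

interface Placed {
  x: number;
  y: number;
  text: string;
}

// Turn a form widget into a positioned piece of text, or null if it has nothing to show.
function widgetToPlaced(w: WidgetLike): Placed | null {
  if (w.subtype !== 'Widget') return null;
  const [x1, y1] = w.rect;
  if (w.fieldType === 'Btn' && w.checkBox) {
    // Assumption: an unchecked box reports "Off" (or no value at all).
    const checked = typeof w.fieldValue === 'string' && w.fieldValue !== 'Off';
    return { x: x1, y: y1, text: checked ? '☑' : '☐' };
  }
  if (w.fieldType === 'Tx' && typeof w.fieldValue === 'string' && w.fieldValue !== '') {
    return { x: x1, y: y1, text: w.fieldValue };
  }
  return null;
}

// Merge text items and widgets into lines: top to bottom, then left to right.
function mergePageText(items: TextLike[], widgets: WidgetLike[], yTolerance = 2): string {
  const placed: Placed[] = items.map((i) => ({
    x: i.transform[4],
    y: i.transform[5],
    text: i.str,
  }));
  for (const w of widgets) {
    const p = widgetToPlaced(w);
    if (p !== null) placed.push(p);
  }
  // PDF y coordinates grow upward, so a larger y is higher on the page.
  placed.sort((a, b) => b.y - a.y);
  const lines: Placed[][] = [];
  for (const p of placed) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last[0].y - p.y) <= yTolerance) {
      last.push(p);
    } else {
      lines.push([p]);
    }
  }
  return lines
    .map((line) => line.sort((a, b) => a.x - b.x).map((p) => p.text).join(' '))
    .join('\n');
}
```

Because the merge is by position, text field values also land next to their question instead of at the end of the page. If the PDF already draws an empty ☐ glyph as text under each widget, that glyph would need to be dropped or replaced as a further step.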

For anyone else out there looking:
https://formulae.brew.sh/formula/poppler
this includes pdftotext command which gets checkmarks
EDIT
After digging in further, I definitely like pdftotext from poppler. It does have one oddity where a line that wraps on a dash will be unwrapped minus the dash. I think it's trying to be smart and assume the dash is there to indicate a wrap. Pretty much an edge case, but worth noting.
There is also a node wrapper which saves you from having to deal with temporary files: https://www.npmjs.com/package/node-pdftotext


Google sheet : export in PDF script not working [duplicate]

This question already has answers here:
Generate PDF of only one sheet of my spreadsheet
(1 answer)
Export Single Sheet to PDF in Apps Script
(3 answers)
Closed 7 months ago.
I'm trying to export my spreadsheet in PDF and send it by email. I followed this tutorial : https://spreadsheet.dev/comprehensive-guide-export-google-sheets-to-pdf-excel-csv-apps-script
The only problem is that it exports as a html file even if I specify in the Url the format, which is PDF. Here's the code I'm using.
function exportSheetAsPDF() {
let blob = getFileAsBlob("https://docs.google.com/spreadsheets/d/1DQ6F8NlB0IiUHYh1ILFoOqWrZrPVEXS8XciJJtc1yAY//export?format=pdf&portrait=true&gridlines=false&size=legal&scale=4&top_margin=0.50&bottom_margin=0.50&left_margin=1&right_margin=0");
Logger.log("Content type: " + blob.getContentType());
Logger.log("File size in MB: " + blob.getBytes().length / 1000000);
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getSheets()[1];
var cell = sheet.getRange('H2').activate();
var value = cell.getValue();
blob.setName(value);
Additionally, is there a way to export only the first page? I don't want to select a specific page. (I thought the gid method would work, but the ID seems to be unique to each page.) I will always add one more page at the start, so that's the one I always want exported.
Thanks. (Sorry I'm kind of a beginner at this)
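
For readers with the same question, here is a hedged sketch of building the export URL from parts rather than as one long literal. It is written in TypeScript for illustration; `buildPdfExportUrl` and `PdfExportOptions` are hypothetical names, and the use of a `gid` query parameter to restrict the export to one sheet is an assumption based on how these export URLs are commonly written. Note also that the URL in the question contains a double slash before `export`, which may be worth ruling out.

```typescript
// Hypothetical options; only a few of the parameters from the question are modelled.
interface PdfExportOptions {
  portrait?: boolean;
  gridlines?: boolean;
  size?: string;
}

// Build a PDF export URL for a single sheet (tab) of a spreadsheet.
// Assumption: `gid` selects the sheet to export.
function buildPdfExportUrl(
  spreadsheetId: string,
  gid: number,
  opts: PdfExportOptions = {}
): string {
  const params: Array<[string, string]> = [
    ['format', 'pdf'],
    ['gid', String(gid)],
  ];
  if (opts.portrait !== undefined) params.push(['portrait', String(opts.portrait)]);
  if (opts.gridlines !== undefined) params.push(['gridlines', String(opts.gridlines)]);
  if (opts.size !== undefined) params.push(['size', opts.size]);

  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join('&');
  // Exactly one slash between the spreadsheet ID and "export".
  return `https://docs.google.com/spreadsheets/d/${encodeURIComponent(spreadsheetId)}/export?${query}`;
}
```

In Apps Script, the ID of whichever sheet is currently first can be read with `SpreadsheetApp.getActiveSpreadsheet().getSheets()[0].getSheetId()`, so the first page is always the one exported even after a new sheet is added at the start.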

Differentiating Mode 2 Form 1 from Mode 2 Form 2 on XA CD-ROMs?

I'm developing a library for reading CD-ROMs and the ISO9660 file system.
Long story short, pretty much everything is working except for one thing I'm having a hard time figuring out how it's done:
Where does the XA standard define how to differentiate Mode 2 Form 1 from Mode 2 Form 2?
Currently, I am using the following pseudo-code to differentiate between the two forms. It is a naive heuristic: it does work, but it is far from ideal:
var buffer = ... // this is a raw sector of 2352 bytes
var m2F1 = ISector.Cast<SectorMode2Form1>(buffer);
var edc1 = EdcHelper.ComputeBlock(0, buffer, 16, 2056);
var edc2 = BitConverter.ToUInt32(m2F1.Edc, 0);
var isM2F1 = edc1 == edc2;
if (isM2F1) return CdRomSectorMode.Mode2Form1;
// NOTE we cannot reliably check EDC of M2F2 since it's optional
var isForm2 =
m2F1.SubHeaderCopy1.SubMode.HasFlag(SectorMode2Form1SubHeaderSubMode.Form2) &&
m2F1.SubHeaderCopy2.SubMode.HasFlag(SectorMode2Form1SubHeaderSubMode.Form2);
if (isForm2) return CdRomSectorMode.Mode2Form2;
return CdRomSectorMode.Mode2Formless;
If you look at some software like IsoBuster, it appears to be a track-level property, however, I'm failing to understand where the value would be read from within the track.
I'm actually doing something similar in TypeScript for my PS1 mod tools. It seems like you probably have it correct here, since I'm going to assume your HasFlag check is testing the Form 2 flag (mask 0x20, the sixth bit) of the subheader's submode byte. If that flag is set, you are in Form 2.
So you probably want something like:
const sectorBytes = new Uint8Array(buffer);
// Byte 0x12 of a raw 2352-byte sector is the submode byte of the first subheader copy
if ((sectorBytes[0x12] & 0x20) === 0x20) {
  return CdRomSectorMode.Mode2Form2;
} else {
  return CdRomSectorMode.Mode2Form1;
}
You could of course use the flag code you already have, but that would require you to use one of the types first to get that. This just keeps it generic bytes and checks the flag, then returns the relevant mode.
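
A slightly fuller sketch of the same idea, under stated assumptions: a raw sector is 2352 bytes, the mode byte sits at offset 0x0F, and the 8-byte XA subheader starts at 0x10 as two identical 4-byte copies (file number, channel, submode, coding information). The names `classifyMode2Sector` and `SectorForm` are hypothetical. Comparing the two copies is used here only as a cheap sanity check before trusting the flag; it is not something the answer above does.

```typescript
type SectorForm = 'not-mode2' | 'form1' | 'form2' | 'inconsistent';

const RAW_SECTOR_SIZE = 2352;
const MODE_OFFSET = 0x0f;      // last byte of the 4-byte header
const SUBHEADER_OFFSET = 0x10; // first subheader copy; second copy follows at 0x14
const SUBMODE_INDEX = 2;       // submode is the third byte of each copy
const FORM2_MASK = 0x20;       // Form 2 flag within the submode byte

// Classify a raw sector by inspecting the mode byte and the subheader's submode byte.
function classifyMode2Sector(sector: Uint8Array): SectorForm {
  if (sector.length !== RAW_SECTOR_SIZE) {
    throw new RangeError(`expected ${RAW_SECTOR_SIZE} bytes, got ${sector.length}`);
  }
  if (sector[MODE_OFFSET] !== 2) return 'not-mode2';

  // The two subheader copies should match; if not, the flag cannot be trusted.
  for (let i = 0; i < 4; i++) {
    if (sector[SUBHEADER_OFFSET + i] !== sector[SUBHEADER_OFFSET + 4 + i]) {
      return 'inconsistent';
    }
  }

  const submode = sector[SUBHEADER_OFFSET + SUBMODE_INDEX];
  return (submode & FORM2_MASK) !== 0 ? 'form2' : 'form1';
}
```

A plain (non-XA) Mode 2 sector has user data where the subheader would be, so the two "copies" will usually differ and the sketch reports `inconsistent`; that case still needs whatever fallback the caller prefers, such as the EDC check from the question.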

App Script - Exporting Sheets Hyperlinks to Docs

I have a google sheet - and when a new row appears I am writing the output into a Google Document using a predefined template via a merge.
All is working but as I could only work out how to use the .replaceText() function to achieve the merge, the hyperlinks in some of the sheet columns get exported as plain text.
After much fiddling and cribbing of code (thanks all) I managed to cobble together the following function:
function makeLinksClickable(document) {
  const URL_PATTERN = "https://[-A-Za-z0-9+&@#/%?=~_|!:,.;]+[-A-Za-z0-9+&@#/%=~_|]"
  const URL_PATTERN_LENGTH_CORECTION = "".length
  const body = document.getBody()
  var foundElement = body.findText(URL_PATTERN);
  while (foundElement != null) {
    var foundText = foundElement.getElement().asText();
    const start = foundElement.getStartOffset();
    const end = foundElement.getEndOffsetInclusive() - URL_PATTERN_LENGTH_CORECTION;
    const url = foundText.getText().substring(start,end+1)
    foundText.setLinkUrl(url)
    foundElement = body.findText(URL_PATTERN, foundElement);
  }
}
After writing out all the columns to the document I call this function on the created document to look for a hyperlink and make it hyper :)
As long as each cell only contains one hyperlink my function works.
It also works where there are multiple hyperlinks in the document.
However, some cells can have multiple hyperlinks and writes them out to the document with a new line for each one.
Although the function finds the multiple URLs correctly and makes them clickable in the document there is a problem.
For example, if there are 2 hyperlinks in the cell they get exported to 2 lines in the document, but after running them through the function - both hyperlinks will now link to the same image (the first) even though each hyperlink itself is the unique link from the original cell.
2 converted hyperlinks that link to the same image
(Note - If I don't run my function and leave the exported hyperlinks as text. Then go into the created document and manually add a space to the ends of the exported hyperlinks then they turn blue and become clickable and link to the correct image, I did try to add a space programmatically before this but couldn't work that out either)
I have exhausted my limited coding ability and can't see why my function which "seems" to work its way through each hyperlink correctly doesn't make it then link to the right image in the document.
Any help would be most appreciated.
Thanks
// ----------------------------------------------------------------------
Thank you for taking the time to look at this, I will try to explain the issues further. It is hard to show here as the links actually work properly when copied here they only misbehave in the google document.
A cell in the exported row has multiple hyperlinks separated by a comma.
they get exported from the cell to the document as text strings like this:
Links in single Sheets Cell for exporting:
"hyperlink-1-as-a-string", (links to image 1)
"hyperlink-2-as-a-string", (links to image 2)
"hyperlink-3-as-a-string", (links to image 3)
"hyperlink-4-as-a-string", (links to image 4)
"hyperlink-5-as-a-string" (links to image 5)
I then run my function to make them clickable again.
If there are two or more hyperlinks in the same cell when exported then I get the following issue after running the function.
Exported text links converted to clickable hyperlinks:
"hyperlink-1-as-a-string", (links to image 5)
"hyperlink-2-as-a-string", (links to image 5)
"hyperlink-3-as-a-string", (links to image 5)
"hyperlink-4-as-a-string", (links to image 5)
"hyperlink-5-as-a-string" (links to image 5)
I "think" what happens is that my function makes all 5 hyperlinks one big hyperlink that happens to use the last hyperlinks image.
If I copy and paste the URLs into a separate document like an email then they appear as one large hyperlink, not 5 separate ones.
// ---------------------------------------------------------------
The function searches for text patterns that are in fact google hyperlinks.
(starting https:// etc)
When it finds one it works out the length to the end of the text string and then uses setLinkUrl() to make the hyperlink - into a clickable hyperlink.
If there is only one text hyperlink then it works.
If there is more than one text hyperlink, separated by commas then it does not.
I worked something out. This is what I ended up with, it is basically put together from a few other questions & answers - It's not very clever but it works.
Thanks to the various posters who enabled me to figure this out.
function sortLinks(colId, mapPoint, myBody) {
  var urls = [];
  if (colId.includes(",")) { // i.e. there is more than one URL
    var tmp = colId.split(",");
    urls = urls.concat(tmp);
  }
  else {
    urls[0] = colId; // 1 URL, no ",": add to array[0]
  }
  if (urls.length > 0) {
    var tag = mapPoint;
    var newLine = "\n";
    var element = myBody.findText(tag);
    if (element) {
      var start = element.getStartOffset();
      var text = element.getElement().asText();
      text.deleteText(start, start + tag.length - 1);
      urls.forEach((url, index) => {
        url = url.trim();
        var name = "Image-Video" + (index + 1);
        // Link only the characters of this name, using the loop's own url
        text.appendText(name).setLinkUrl(start, start + name.length - 1, url);
        text.appendText(newLine);
        start = start + name.length + newLine.length;
      });
    }
  }
}
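
One possible reading of the original problem is that the single-argument `setLinkUrl(url)` applies the link to the whole text element, so each pass overwrites the previous one, whereas the three-argument form used in the answer limits the link to a character range. The TypeScript sketch below only shows the range-finding half as a pure function; `findUrlRanges` and `UrlRange` are hypothetical names, and the pattern is the one from the question translated to a JavaScript regular expression.

```typescript
interface UrlRange {
  start: number;        // offset of the first character of the URL
  endInclusive: number; // offset of the last character of the URL
  url: string;
}

// Find every https URL in a block of text and report where each one sits,
// so that each range can be linked individually.
function findUrlRanges(text: string): UrlRange[] {
  const pattern = /https:\/\/[-A-Za-z0-9+&@#\/%?=~_|!:,.;]+[-A-Za-z0-9+&@#\/%=~_|]/g;
  const ranges: UrlRange[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    ranges.push({
      start: match.index,
      endInclusive: match.index + match[0].length - 1,
      url: match[0],
    });
  }
  return ranges;
}
```

Each returned range could then be passed to a call such as `text.setLinkUrl(range.start, range.endInclusive, range.url)`. Because the pattern allows commas inside a URL, two links separated by a comma with no whitespace would still be matched as one, so splitting on the separator first (as the answer does) remains the safer route.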

readByteSync - is this behavior correct?

stdin.readByteSync has recently been added to Dart.
Using stdin.readByteSync for data entry, I am attempting to allow a default value and if an entry is made by the operator, to clear the default value. If no entry is made and just enter is pressed, then the default is used.
What appears to be happening however is that no terminal output is sent to the terminal until a newline character is entered. Therefore when I do a print() or a stdout.write(), it is delayed until newline is entered.
Therefore, when operator enters first character to override default, the default is not cleared. IE. The default is "abc", data entered is "xx", however "xxc" is showing on screen after entry of "xx". The "problem" appears to be that no "writes" to the terminal are sent until newline is entered.
While I can find an alternative way of doing this, I would like to know if this is the way readByteSync should or must work. If so, I’ll find an alternative way of doing what I want.
// Example program //
import 'dart:io';

void main () {
  int iInput;
  List<int> lCharCodes = [];
  print(""); print("");
  String sDefault = "abc";
  stdout.write ("Enter data : $sDefault\b\b\b");
  while (iInput != 10) { // wait for newline
    iInput = stdin.readByteSync();
    if (iInput == 8 && lCharCodes.length > 0) { // bs
      lCharCodes.removeLast();
    } else if (iInput > 31) { // ascii printable char
      lCharCodes.add(iInput);
      if (lCharCodes.length == 1)
        stdout.write (" \b\b\b\b chars cleared"); // clear line
      print ("\nlCharCodes length = ${lCharCodes.length}");
    }
  }
  print ("\nData entered = ${new String.fromCharCodes(lCharCodes).trim()}");
}
Results on Command screen are :
c:\Users\Brian\dart-dev1\test\bin>dart testsync001.dart
Enter data : xxc
chars cleared
lCharCodes length = 1
lCharCodes length = 2
Data entered = xx
c:\Users\Brian\dart-dev1\test\bin>
I recently added stdin.readByteSync and readLineSync to make it easier to create small scripts that read stdin. However, two things are still missing for this to be feature-complete.
1) Line mode vs raw mode. This is basically what you are asking for: a way to get a char as soon as it is typed.
2) Echo on/off. This mode is useful for e.g. typing in passwords, so you can disable the default echo of the characters.
I hope to be able to implement and land these features rather soon.
You can star this bug to track the development of it!
This is common behavior for consoles. Try to flush the output with stdout.flush().
Edit: my mistake. I looked at a very old revision (dartlang-test). The current API does not provide any means to flush stdout. Feel free to file a bug.
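
Whatever mode the terminal is in, the default-or-override rule described in the question can be kept separate from the I/O. The sketch below, in TypeScript rather than Dart, is only an illustration of that rule as a pure function over the bytes read; `resolveEntry` is a hypothetical name, and falling back to the default when everything typed has been deleted again is an assumption.

```typescript
const NEWLINE = 10;
const BACKSPACE = 8;

// Apply the bytes an operator typed to a field that starts with a default value.
// Any printable character replaces the default; Enter alone keeps it.
function resolveEntry(defaultValue: string, bytes: number[]): string {
  const typed: number[] = [];
  for (const b of bytes) {
    if (b === NEWLINE) break;       // Enter ends the entry
    if (b === BACKSPACE) {
      typed.pop();                  // no-op when nothing has been typed
    } else if (b > 31) {
      typed.push(b);                // printable ASCII
    }
  }
  if (typed.length === 0) return defaultValue;
  return String.fromCharCode(...typed).trim();
}
```

With the logic isolated like this, the only part that depends on line mode versus raw mode is when the screen can be redrawn, not what the final value is.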

Int32.ParseInt throws FormatException after web post

Update
I've found the problem, the exception came from a 2nd field on the same form which indeed should have prompted it (because it was empty)... I was looking at an error which I thought came from trying to parse one string, when in fact it was from trying to parse another string... Sorry for wasting your time.
Original Question
I'm completely dumbfounded by this problem. I am basically running int.Parse("32") and it throws a FormatException. Here's the code in question:
private double BindGeo(string value)
{
    Regex r = new Regex(@"\D*(?<deg>\d+)\D*(?<min>\d+)\D*(?<sec>\d+(\.\d*))");
    Regex d = new Regex(@"(?<dir>[NSEW])");
    var numbers = r.Match(value);
    string degStr = numbers.Groups["deg"].ToString();
    string minStr = numbers.Groups["min"].ToString();
    string secStr = numbers.Groups["sec"].ToString();
    Debug.Assert(degStr == "32");
    var deg = int.Parse(degStr);
    var min = int.Parse(minStr);
    var sec = double.Parse(secStr);
    var direction = d.Match(value).Groups["dir"].ToString();
    var result = deg + (min / 60.0) + (sec / 3600.0);
    if (direction == "S" || direction == "W") result = -result;
    return result;
}
My input string is "32 19 17.25 N"
The above code runs on a .NET 4 web hosting service (aspspider) on an ASP.NET MVC 3 web application (with Razor as its view engine).
Note the assertion of degStr == "32" is valid! Also when I take the above code and run it in a console application it works just fine. I've scoured the web for an answer, nothing...
Any ideas?
UPDATE (stack trace)
[FormatException: Input string was not in a correct format.]
System.Number.StringToNumber(String str, NumberStyles options, NumberBuffer& number, NumberFormatInfo info, Boolean parseDecimal) +9586043
System.Number.ParseInt32(String s, NumberStyles style, NumberFormatInfo info) +119
System.Int32.Parse(String s) +23
ParkIt.GeoModelBinder.BindGeo(String value) in C:\MyProjects\ParkIt\ParkIt\GeoBinder.cs:42
Line 42 is var deg = int.Parse(degStr); and note that the exception is in System.Int32.Parse (not in System.Double as was suggested).
You are wrongly thinking that it is the following line that is throwing the exception:
int.Parse("32")
This line is unlikely to ever throw an exception.
In fact it is the following line:
var sec = double.Parse(secStr);
In this case secStr = "17.25";.
The reason for that is that your hosting provider uses a different culture in which the . is not a decimal separator.
You have the possibility to specify the culture in your web.config file:
<globalization culture="en-US" uiCulture="en-US" />
If you don't do that, then auto is used. This means that the culture could be set based on the client browser preferences (which are sent with each request using the Accept-Language HTTP header).
Another possibility is to specify the culture when parsing:
var sec = double.Parse(secStr, CultureInfo.InvariantCulture);
This way you know for sure that . is the decimal separator for the invariant culture.
Testing this (via PowerShell):
PS [64] E:\dev #43> '32 19 17.25 N' -match "\D*(?<deg>\d+)\D*(?<min>\d+)\D*(?<sec>\d+(\.\d*))"
True
PS [64] E:\dev #44> $Matches
Name Value
---- -----
sec 17.25
deg 32
min 19
1 .25
0 32 19 17.25
So the regex is working with all three named captures getting a value, all of which will parse OK (ie. it isn't something like \d matching something like U+0660: ARABIC-INDIC DIGIT ZERO that Int32.Parse doesn't handle).
But you do not check that the regex actually makes a match.
Therefore I suspect that the value passed to the function is not the input you expect. Put a breakpoint (or logging) at the start of the function and get the actual value of value.
I think what is happening is:
Value isn't what you think it is.
The regex fails to match.
The captures are empty
Int32.Parse("") is throwing (just confirmed: it throws a FormatException "Input string was not in a correct format.")
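
The chain above (unexpected input, failed match, empty captures, parse error) can be cut short by checking the match before parsing. The TypeScript sketch below is an illustration of that check, not a port of the question's C# code; `parseDms` is a hypothetical name, and making the fractional part of the seconds optional is a deliberate difference from the original pattern.

```typescript
// Parse a "degrees minutes seconds direction" string such as "32 19 17.25 N"
// into signed decimal degrees, failing loudly when the input does not match.
function parseDms(value: string): number {
  const m = /(\d+)\D+(\d+)\D+(\d+(?:\.\d*)?)\s*([NSEW])?/.exec(value);
  if (m === null) {
    // Report the actual input so an empty or unexpected field is obvious.
    throw new Error(`Cannot parse coordinate: ${JSON.stringify(value)}`);
  }
  const deg = parseInt(m[1], 10);
  const min = parseInt(m[2], 10);
  const sec = parseFloat(m[3]); // always uses "." as the decimal separator
  const result = deg + min / 60 + sec / 3600;
  return m[4] === 'S' || m[4] === 'W' ? -result : result;
}
```

Including the offending value in the error message is the design choice that matters here: an empty second field on the same form would then identify itself instead of looking like a failure on "32".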
Addendum: Just noted your comment on the assertion.
If things seem contradictory, go back to basics: at least one of your assumptions is wrong, e.g. there could be an off-by-one in the exception's line number (an edit to the file before going to that line number: very easy to do).
Stepping through with a debugger in this case is by far the easiest approach. On every expression check everything.
If you cannot use a debugger then try to remove that restriction; if not, how about IntelliTrace? Otherwise use some kind of logging (if your app doesn't have it, add it, as you'll need it in the future for things like this).
Try removing any non-ASCII (possibly non-visible) characters from the string:
string s = "søme string";
s = Regex.Replace(s, @"[^\u0000-\u007F]", string.Empty);
edit
Also, try looking at its hex values to see where the exception occurs:
BitConverter.ToString(buffer);
this will show you the hex values so you can verify...
also paste its value so we can see it.
It turns out that this is a non-question. The problem was that the exception came from a 2nd field on the same form which indeed should have prompted it (because it was empty)... I was looking at an error which I thought came from trying to parse one string, when in fact it was from trying to parse another string...
Sorry for wasting your time.
