Encoding issue with wkhtml2pdf footer-html/header-html parameters

Encoding issue with wkhtml2pdf footer-html/header-html parameters - character-encoding

I'm generating PDF from HTML file using utf-8 encoding. As a matter of fact, some of my titles contain non-ascii characters. In main PDF, everything is fine, encoding is right, rendering is OK.
When setting my footer.html file, I retreive the current section which does not display right whenever there is a non-ascii character in section name.
This section name is passed as a "section" URL parameter to footer.html call ; its encoding is already wrong. For example, I changed my ToC titel to the french "Table des matières" in my xsl file (that's the only change made to default xsl).
ToC page is fine in PDF, the "è" is correct. But in my footer, I retreive URI via JS which give me before unencoding : mati%C3%A8res ( when I expected it to be mati%E8res), which of course displays in my PDF footer as matiÃ¨res
I've found numerous pages about encoding issues in templates, main html file or header/footer html file, none of them was related to these parameters. How can I make it right in PDF ?
Sample run :
F:\test>wkhtmltopdf.exe --encoding "UTF-8" --footer-html footer.html \
--print-media-type toc --xsl-style-sheet default.xsl test.html test.pdf
Loading pages (1/6)
Counting pages (2/6)
Loading TOC (3/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done

The problem was from my JS code. On parameters I used unencode() instead of using the correct function decodeURIcomponent()

Related

How to set font in Texinfo

I'm using the GNU texinfo package to generate both PDF and .info files from a .texi file.
I'm trying to update an old .texi file (not changed since 2001) and generate the same PDF output. I've resolved a number of issues, but there are a couple outstanding. In the old PDF, the title was in Helvetica and body text is Liberation Serif. In the new PDF, both are Computer Modern.
I've read everything that I can find about fonts, but I'm not able to change the fonts. Nothing that I do seems to work. Everything I have tried generates errors.
In my .texi file, before any \setfont directives, I have \def\fontprefix{uh} which, if I read the pdftex.map file correctly, should select the NimbusSanL font set (e.g. uhvr8a.pfb). I get the following errors:
mktexnam: Could not map source abbreviation for uhss10.
kpathsea: Running mktexmf uhss10
! I can't find file uhss10'.
<*> ...ljfour; mag:=1; ; nonstopmode; input uhss10`
Does anyone have a example .texi file which sets the font family to use? Or an explanation of what I'm doing wrong?

TCPDF Embedded Base64 Encoded Images in HTML String

I have read several posts on this subject but didn't want to piggy-back on any of them with additional questions.
Specifically this post: TCPDF and insert an image base64 encoded
I am generating a PDF from within a custom theme in Wordpress. I'm using TCPDF 6.2.3 (latest stable release, I believe).
I am building this PDF from the same HTML I am using to display on the page. If I embed the full base64 encoded string, it works correctly in the browser, but the image is missing from the PDF.
If I use the "#" method described in the linked post, I get a broken image in the browser (expectedly) but still nothing in the PDF.
All the rest of my HTML markup is rendering in the PDF, images are just not showing.
Is there some other setting or option I need to set in order to get the images to appear in the PDF, and/or can you spot anything I'm doing wrong here? No errors, the images are just not visible in the PDF.
This is how I set the image up:
$imageLocation = $img_root.$imgsrc;
$ext = end(explode(".", $imageLocation));
$image = base64_encode(file_get_contents($imageLocation));
//$response .= "<img src='data:image/$ext;base64,$image'>"; //works in browser but not in PDF
$response .= "<img src='#$image' class='socf_image'>"; //does not work in browser or PDF
And here is the method to create the PDF:
function createPDF($response)
{
// Include the main TCPDF library (search for installation path).
require_once('tcpdf_6_3_2/tcpdf/tcpdf.php');
// create new PDF document
$pdf = new TCPDF(PDF_PAGE_ORIENTATION, PDF_UNIT, PDF_PAGE_FORMAT, true, 'UTF-8', false);
// set document information
$pdf->SetCreator(PDF_CREATOR);
$pdf->SetAuthor('test');
$pdf->SetTitle('test');
$pdf->SetSubject('test');
$pdf->SetKeywords('test');
// set default header data
$pdf->SetHeaderData(PDF_HEADER_LOGO, PDF_HEADER_LOGO_WIDTH, PDF_HEADER_TITLE.' 001', PDF_HEADER_STRING, array(0,64,255), array(0,64,128));
$pdf->setFooterData(array(0,64,0), array(0,64,128));
// set header and footer fonts
$pdf->setHeaderFont(Array(PDF_FONT_NAME_MAIN, '', PDF_FONT_SIZE_MAIN));
$pdf->setFooterFont(Array(PDF_FONT_NAME_DATA, '', PDF_FONT_SIZE_DATA));
// set default monospaced font
$pdf->SetDefaultMonospacedFont(PDF_FONT_MONOSPACED);
// set margins
$pdf->SetMargins(PDF_MARGIN_LEFT, PDF_MARGIN_TOP, PDF_MARGIN_RIGHT);
$pdf->SetHeaderMargin(PDF_MARGIN_HEADER);
$pdf->SetFooterMargin(PDF_MARGIN_FOOTER);
// set auto page breaks
$pdf->SetAutoPageBreak(TRUE, PDF_MARGIN_BOTTOM);
// set image scale factor
$pdf->setImageScale(PDF_IMAGE_SCALE_RATIO);
// set default font subsetting mode
$pdf->setFontSubsetting(true);
// Set font
$pdf->SetFont('helvetica', '', 14, '', true);
// Add a page
$pdf->AddPage();
$html = $response;
$pdf->writeHTML($response, true, false, true, false, '');
return $pdf;
}

Well, fortunately, I was able to figure it out on my own. Perhaps this isn't the best forum for seeking help with this library? If anyone can suggest a better place to get help, I'd appreciate the direction.
Ultimately, the issue was two-fold:
The "#" notation is required for the PDf while the approach is what works for displaying the HTML in browser. So a string replace before creating the PDF solves that.
This is the tricky part. The HTML needs to use double-quotes around the properties, not single quotes. My code was using double quotes for the PHP strings, so the HTML properties were surrounded with single quotes and that was the issue. Swapping the two quote types was the last piece of the puzzle to get the images to appear in the PDF.
Hopefully this will help someone else who is pulling their hair out trying to blindly find their way through this library like me.

importxml of url with Hebrew returns in encoding other than UTF-8 that chrome doesn't recognize

For example, in the dummy spreadsheet (tab 'desired outcome'), under "Link 1" you will see this URL:
http://www.promotion-il.co.il/service/%5DE%5E4%5D9%5E5-%5E8%5D9%5D7-%5D7%5E9%5DE%5DC%5D9-%5DC%5E2%5E1%5E7%5D9%5DD/
However, the actual URL in UTF-8 is:
http://www.promotion-il.co.il/service/%D7%9E%D7%A4%D7%99%D7%A5-%D7%A8%D7%99%D7%97-%D7%97%D7%A9%D7%9E%D7%9C%D7%99-%D7%9C%D7%A2%D7%A1%D7%A7%D7%99%D7%9D/
The actual URL string that contains Hebrew is:
http://www.promotion-il.co.il/service/מפיץ-ריח-חשמלי-לעסקים/
I will also add that the same URL has returned with a proper UTF-8 encoding for other blog posts. (See second example in the same tab).
Why is it happening?
How can it be fixed?
Thanks in advance!

This is the solution I came up with eventually:
I saw that for the imported urls - in order to fix a broken url 2 repalcements were needed:
5D --> D7%9
5E --> D7%A
I used this formula in a separate column to achieve it:
==ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE((<COLUMN WITH IMPORTED URLS HERE>),"5D","D7%9"),"5E","D7%A"))

Why FormFeed Character in a txt File is rendered as Page Break by MS Word / WordPad but not by NotePad?

I have a .txt file and it has FormFeed character between para1 and para 2.
Para2 needs to be shown in next page on printing hence FormFeed is placed here.
sample txt file layout:
para1
formFeedCharacter
para2
expected layout on printing:
para1 is shown in 1st page and para 2 is shown in 2nd page as formFeed acts as page break.
When opened and printed with MS Word/WordPad:
expected layout is coming in 2 pages as expected.
When opened and printed with NotePad:
1)FormFeed is not acting as Page Break and all content is printed in 1 page only
2)FormFeed is displayed as unreadable Symbol
Final Printed layout when used Notepad:
para1
Unreadable symbol (caused by FormFeed)
para2
Why Notepad is unable to render FormFeed as pageBreak ?
Is it because NotePad is a text Editor While WordPad/MS Word is Word processor ?
Is there any way how we can make this work with NotePad ?

Notepad:
1)It is a text Editor program and cannot interpret Form Feed character as Page Break.
2) Hence there is no way we can make formFeed work as page break and print it by using NotePad.
WordPad/MS Word:
1) Both are Word Processor softwares and can interpret FormFeed correctly as Page Break.
Hence Unreadable symbol is not shown on opening txt file with them
2) We can also see the page Break by Print Preview feature in wordpad/NotePad.
This hyperLink provides additional information on this topic:
Additional Info
Also below hyperlink shows similar topic post asking for a universal solution for pageBreak Feature using txt file.
page Break in txt file Universal Solution

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
for example this: סלקום
gets gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks

Edit: Ok, so what you’re seeing is UTF-8 data being decoded as Windows-1252 (so the numeric character references were a red herring). Here’s a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Encoding issue with wkhtml2pdf footer-html/header-html parameters - character-encoding

The problem was from my JS code. On parameters I used unencode() instead of using the correct function decodeURIcomponent()

Related

How to set font in Texinfo

TCPDF Embedded Base64 Encoded Images in HTML String

importxml of url with Hebrew returns in encoding other than UTF-8 that chrome doesn't recognize

Why FormFeed Character in a txt File is rendered as Page Break by MS Word / WordPad but not by NotePad?

HTML decoding in C/C++

Categories

Resources