Ghostscript PDF/A conversion problem on Ubuntu 18.10 and Docker

I am using Ghostscript to convert PDF 1.3 to PDF/A-1b using this command:
gs -dPDFA -dBATCH -dNOPAUSE -dNOOUTERSAVE -sColorConversionStrategy=sRGB -sDEVICE=pdfwrite -sOutputFile=output.pdf PDFA_def.ps input.pdf
The PDFA_def.ps is customized to use the sRGB ICC profile. Apart from that change, it is the standard def file that ships with GS 9.26.
Now comes the tricky part:
1- running this conversion locally on Ubuntu 18.10 with GS 9.26 works fine and I get a valid PDF/A
2- running the same command in a Docker container (Ubuntu 18.10, GS 9.26) creates a PDF/A as well, which is also reported as valid
However, in the first scenario I can process the file using mustang (https://github.com/ZUGFeRD/mustangproject) to create a valid electronic invoice. In the second scenario (Docker container) this fails, since mustang does not consider the file to be a valid PDF.
Checking both PDF files, I would have expected them to be identical since I am running the same conversion on the same input. However, they are not: the PDF created in the Docker container is 10 bytes smaller and shows some different meta information in the file itself.
I suspect that there must be some "hidden dependencies" that make GS act differently on my host system compared to a Docker container, but that feels entirely wrong and I am running out of means to debug further.
Does anyone know whether GS has some more dependencies that might cause the same command to produce different results?

The answer is 'maybe'. It depends on how Ghostscript was built for starters.
I assume you are using a package, not building from source yourself. In that case there are a number of dependencies, including FreeType, LibJPEG, JBIG2dec, LibTIFF, JPEG-XR, LCMS2, LibPNG, OpenJPEG, Expat, zlib, and potentially IJS, CUPS and X Windows, depending on which devices were built in.
Any or all of these could be system shared libraries instead of being built using the specific version shipped by Artifex. They could also be different versions on the two systems.
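If you want to rule that out quickly, here is a sketch (assuming the gs binary is dynamically linked and that ldd is available on both systems) for comparing what each environment actually loads:
ldd "$(which gs)"   # which shared libraries this gs build links against
gs --version        # confirm both environments really run the same release
gs -h               # lists the compiled-in devices and the search paths in use
Run these on the host and inside the container and diff the output; any difference in libraries, devices or search paths is a candidate for the "hidden dependency" you suspect.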
That said, I think it's unlikely that this is your problem. However, without seeing the PDF files I can't tell you why there is a difference. Differences in the metadata are to be expected, since that includes a date/time stamp.
I'd really need to see examples of the original and the two output PDF files to be able to comment further.
[Edit]
Looking at the files, they have been produced compressed (unsurprisingly), which can obviously lead to differences in size if there are small differences in the input streams. So the first task was to decompress the files.
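If you want to reproduce that comparison, here is a sketch (assuming qpdf is installed; host.pdf and docker.pdf are placeholder names for the two outputs):
qpdf --qdf --object-streams=disable host.pdf host_u.pdf       # rewrite with streams uncompressed
qpdf --qdf --object-streams=disable docker.pdf docker_u.pdf
diff host_u.pdf docker_u.pdf
The QDF form is mostly plain text, so the diff points straight at the objects that differ; if diff insists the files are binary, cmp -l shows the differing byte offsets instead.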
That done, I see there are, essentially, no differences between them. One of the operating systems is using a time zone 7 hours behind UTC, the other is in UTC, so where one of the systems is time stamping with (e.g.)
2019-04-26T19:49:01Z
The other is using
2019-04-26T12:51:35-07:00
So instead of Z (for UTC) you get -07:00, which is where the extra 10 bytes are coming from. Other than that, the unique IDs are (as you would imagine) different, the Length values are different for the streams containing dates, and the startxref is different because the streams have different lengths.
Both files claim to be compatible with PDF/A-1b. In short, I can see no real differences between them. Since you are using a tool to further process the files, I'd suggest you take the PDF file from your working system and try processing it on the non-working system, and vice versa. It seems to me that the problem may be the later processing rather than the PDF file itself; perhaps you have different versions of that tool on the two systems.
For what it may be worth, Ghostscript can be induced into creating a ZUGFeRD file directly, see this bug report and this commit to the repository.

Related

Printing postscript with GSView 5.0

I've been using GSView 5.0 and Ghostscript 9.52 to do PostScript printing on vellums. However, today GSView started throwing error codes on every .ps file I've attempted to print. I'm using Windows 10 Pro and the printer is an Epson Artisan 1430.
The error is as follows:
GPL Ghostscript 9.52: **** Could not open file 00000e60.
Unrecoverable error: invalidfileaccess in showpage
Operand stack:
--nostringval-- 1 true
gsapi_execute_cont returns -9
gsapi_exit returns 0
I've tried changing permissions for the files and different printer drivers to no avail. I'm sorry I can't be more descriptive on this issue as it's hard to articulate.
OK... You must have recently updated to a new version of Ghostscript. I can reproduce your problem, and it comes down to a recent (documented) change in behaviour for Ghostscript.
Due to the well-documented public disclosure of security exploits using Ghostscript a couple of years ago, the current version (and any version since 9.50) now defaults to running in SAFER mode.
When running in SAFER, Ghostscript prevents access by the PostScript interpreter to the file system. For those unaware of the problem; PostScript is a full-blown programming language and, by design, permits programs to access the underlying file system. SAFER mode prevents this so that malicious PostScript programs cannot, for example, run arbitrary code on your computer.
It seems that GSview is using Ghostscript in a way which requires the PostScript interpreter itself to open and read the program to be printed, instead of the more normal practice of specifying the input file as one of the command-line arguments. Files passed as arguments are automatically granted read access by the Ghostscript executable; I suspect that GSview is using the DLL directly and not supplying that extra information.
Now there are ways to permit access to specific files or folders, so that existing PostScript programs can continue to work, but obviously this requires some changes in the calling application. GSview has not changed in, literally, years so obviously it does not take any such action.
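For reference, the finer-grained controls added in 9.50 look like this (a sketch with placeholder paths, and pdfwrite standing in for whatever device the caller really uses); --permit-file-read lets the interpreter open files under a given folder without giving up SAFER entirely:
gs --permit-file-read=C:/print/jobs/ -dBATCH -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=C:/print/out.pdf -c "(C:/print/jobs/input.ps) run"
A caller like GSview would have to pass something equivalent for the files it asks the interpreter to read, which it does not do.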
You can, however, get GSview to work as before. Under Options select Advanced Configure. In the resulting dialog look for the 'Ghostscript options' text box. In there add -dNOSAFER; that should get it working again, though you may have to reboot the computer if the OS print subsystem has stalled.
Yes, this does open you up to the sorts of exploits I alluded to above, so you should only do this with PostScript programs that you trust.

Can a LyX/LaTeX file be too large to create a PDF?

I've been working on my bachelor's thesis in LyX for about a month without encountering any problems, and today, all of a sudden, when creating a PDF, LyX just loads indefinitely and at some point even asks me if I want to stop the PDF creation since it takes such a long time. Am I doing something wrong? I have about 100 pages, and the PDFs I created lately have been around 100 MB since they hold a lot of very high-res images.
In case anyone is struggling with the "convert" functionality in LyX, here is some additional info:
Initially I struggled to get EPS files to load and be displayed on screen, as well as to get them exported to a PDF file. I saw that the latest LyX install already had all the "convert blah-blah $$ii $$o" commands predefined, and it was still not working.
Here is what worked for me:
sudo mv /etc/ImageMagick-6/policy.xml /etc/ImageMagick-6/policy.xmlout
There are two parts to this:
a) ImageMagick needs to be installed on the machine, as it provides most of the converters. The following command in a terminal checks whether ImageMagick is installed on your system:
identify -version
b) The ImageMagick tools must be "allowed" to run, "convert" being one of them. You need to relax some default security policies for that; that is what the renaming of the policy file above does. Detailed information is given in the answer to this question on the Ubuntu forum.
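If you prefer a narrower change than renaming the whole file, a sketch (paths and patterns vary with the ImageMagick version and distribution) is to look at which coder policies currently block the Ghostscript-handled formats and relax only those:
identify -list policy                                    # shows which policy.xml is active and what it permits
grep -E 'PS|EPS|PDF|XPS' /etc/ImageMagick-6/policy.xml   # the coder entries that block EPS/PDF conversion
Changing rights="none" to rights="read|write" for just those patterns, and leaving the rest of the policy untouched, is usually enough for LyX's EPS-to-PDF conversion.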
Note: this security policy relaxation is not recommended for web-server machines; only desktop users should take the risk.

Difference between dcm2pnm, dcmj2pnm and dcml2pnm

The title says it all. What is the difference between dcm2pnm (http://support.dcmtk.org/docs/dcm2pnm.html), dcmj2pnm (http://support.dcmtk.org/docs/dcmj2pnm.html) and dcml2pnm (http://support.dcmtk.org/docs/dcml2pnm.html) commands of dcmtk toolkit (http://support.dcmtk.org/docs/pages.html)? They all seem to convert dicom images to other formats. Are there any special situations where one should be preferred over others?
Edit: It seems dcml2pnm supports more formats. Why not use that for all purposes? What are the advantages of other commands?
I am the DCMTK developer.
The difference between the three DCMTK command line tools is: support for compressed DICOM images and output formats.
dcm2pnm was the original tool, developed more than 20 years ago, and it originally only supported the PNM/PGM image format for output (that's also why the tool is called "dcm2pnm" and not "dcm2img" or the like). And because at that time the DCMTK did not support any encapsulated transfer syntaxes (i.e. compression), only uncompressed DICOM images can be read.
dcmj2pnm is located in DCMTK's submodule "dcmjpeg" and adds support for JPEG-compressed DICOM images (based on the IJG library) as well as the JPEG image format for output.
dcml2pnm is located in DCMTK's submodule "dcmjpls" and adds support for JPEG-LS-compressed DICOM images (based on the CharLS library). It does not support conventional JPEG.
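In practice the basic invocation is the same for all three tools; which one you need depends only on how the input is compressed. A sketch with placeholder file names (the output defaults to PNM/PGM, as described in the documentation linked above):
dcm2pnm uncompressed.dcm out.pnm          # uncompressed transfer syntaxes only
dcmj2pnm jpeg_compressed.dcm out.pnm      # additionally reads JPEG-compressed DICOM images
dcml2pnm jpegls_compressed.dcm out.pnm    # additionally reads JPEG-LS-compressed DICOM images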
All this is probably more obvious from the source code package than from the binary package but it is also mentioned in the above referenced documentation (see "Description" section).
If you ask why there are three different tools (in fact, there is also a fourth one for JPEG-2000 support, but that one is not part of the public DCMTK), my answer would be: this is mainly for historical reasons, but also to keep the dependencies between the various DCMTK modules as simple as possible.
Furthermore, we consider the command line tools as a kind of sample applications of the underlying C++ class library. So, if you'd need a tool that supports all image compression schemes available in the DCMTK, it should be easy to write such a tool.
dcmj2pnm adds JPEG codecs to the dcm2pnm functionality. Thus, it is capable of handling JPEG-compressed DICOM data and producing JPEG output images, so it is a superset of dcm2pnm's functionality.
I think this is because dcmtk offers different compile options which allow libjpeg to be included or excluded; the accompanying command line tools just reflect the toolkit's options. This is confirmed by the list of file formats shown when you start a tool with the -h option.
For dcml2pnm I am not sure, but this is a good guess: the same as for JPEG, but it includes the JPEG-LS codec, which is another optional 3rd-party toolkit for dcmtk.

How do compilers create the executable file at the end of the compilation process?

I've been reading on the compilation process, I understand some of the earlier concepts like parsing but I stop short of understanding how the executable file is created at the end.
In the examples I've seen around, the "compiler" takes input in the form of a language defined by a BNF grammar and then, upon parsing it, outputs assembly.
Is the executable file literally just that assembly in binary form? I feel like this can't be the case, given that there are applications for making executables from assembly.
If this isn't answerable (i.e. it's too complex for the Stack Overflow format) I'd totally be happy with links/books so I can educate myself.
The compiler (or more specifically, the linker) creates the executable.
The format of the file generally varies depending on the operating system.
There are currently two main formats: ELF and COFF.
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format
http://en.wikipedia.org/wiki/COFF
If you understand the concept of a structure, this is the same, only within a file. Each file has a first structure called a header, and from there you can access the other structures as required.
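On a Linux system you can look at that header structure directly; a sketch (any ELF executable works as the argument):
readelf -h /bin/ls   # the ELF file header: class, machine, entry point, offsets of the section/program header tables
readelf -S /bin/ls   # the section headers reached from it: .text, .data, .symtab and so on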
In most cases, only the resulting binary code is saved in these files, although you often find debug information as well. Some formats could save the source along with the code, but nowadays only the necessary references to the source are saved.
With dynamic linking, you also find symbol tables that include the actual symbol name. Otherwise, only relocation tables would be required.
Under the Amiga we also had the possibility to define code in a "segment". Only one segment could be loaded at a time. Once you were done with the segment, you could unload it and load another. Yet, in the end the concepts were similar. Structures in a file.
Microsoft offers a PDF about the COFF format. I could not find it on their website just now, but it looks like others have it. ELF has many links in the Wikipedia page so you should be able to find a PDF to get started.
Not all, but some compilers (gcc, etc.) go from the high-level language to assembly language and then spawn the assembler. The assembler reads the assembly language, generates machine code and produces an object file which, as you have guessed, contains more than just the machine code bits.
Think about it for a second: a variable or function defined in another source file lives in another object file, so until link time one object doesn't know how to reach that external function. This means 1) the machine code is not finished, since patching up external addresses is not done until link time, and 2) there needs to be some information in the object file that defines what public items this object file provides and what external items are missing, for example names of functions, which are obviously not embedded in the machine code itself. So the objects contain machine code in various states of completion as well as other data needed by the linker.
The linker then...links...the objects together into one program with everything resolved: it basically completes all the machine code and puts the fragments of machine code (from the separate objects) into one place. Then it has to save all that on disk in some format, and typically that format is not just raw machine code. It has extra stuff in the file, starting with a header, and a way to define each binary blob and where it needs to live in memory before executing. When you run a program from the command line of your operating system, or by double clicking in a file manager GUI, the operating system knows how to read that file format, extract the blobs of binary, place those blobs in RAM as defined by the file format, and then start executing at the place the file format defines.
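You can watch those stages with the GNU toolchain; a sketch, assuming a small C file hello.c that calls printf, on a Linux system:
gcc -S hello.c -o hello.s     # compiler proper: C source to assembly language
as hello.s -o hello.o         # assembler: assembly to an object file (machine code plus symbols and relocations)
objdump -dr hello.o           # the call to printf appears as an unresolved relocation, not a finished address
gcc hello.o -o hello          # gcc drives the linker (ld), which resolves the symbols and writes the ELF executable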
a.out, ELF, COFF, Intel HEX and Motorola S-record are all popular formats, as well as raw binary, which some toolchains can produce. The GNU tools will default to one (COFF or ELF or EXE or a.out), and then objcopy is used to convert from one to another, or at least from the default one to the others; there is built-in help to show what your possible choices are. Then simply Google those, or look them up on Wikipedia, and find the definitions of the file formats. Intel HEX or Motorola S-record are good ones to start with at Wikipedia, then perhaps ELF.
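For example (a sketch with placeholder file names):
objcopy -O ihex prog.elf prog.hex     # ELF to Intel HEX
objcopy -O binary prog.elf prog.bin   # ELF to raw binary; headers, symbols and relocations are stripped away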
If you want to produce a native executable file you have two options: you can assemble the binary form yourself, or you can translate your program into another language and use its compiler to produce the executable.

Is there a way to programmatically export files using Wireshark's facilities?

I am trying to automate a repetitive manual process for which I use WireShark:
1) Load a given pcap file
2) Apply a simple filter for a given protocol
3) Use the export dialog box to export the displayed packets to CSV file
4) Use the export dialog box to export the displayed packets in XML PDML form.
This is tedious, and requires human involvement in the middle of a process that is mostly automated (including the analysis of the files to produce reports).
Is there some way to either automate Wireshark, or to somehow access the underlying libraries it uses for export?
UPDATE: As several people here indicated, TShark turns out to be the way to go.
The exact command line I ended up using is:
tshark -r MyDataFile.pcap -T pdml -R MyProtocol > MyOutputFile.xml
I then use an event-based XML parser (Python's expat) to parse the generated 2 GB file.
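The CSV half of the original workflow (step 3) can be scripted the same way; this is a sketch where the -e field names are only examples to be replaced with the fields you actually need (newer tshark versions use -Y instead of -R for the display filter):
tshark -r MyDataFile.pcap -R MyProtocol -T fields -E header=y -E separator=, -e frame.number -e frame.time -e ip.src -e ip.dst > MyOutputFile.csv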
I looked at the dependency list of Wireshark on my Debian system, and I found TShark: it's the command line version of Wireshark.
It seems interesting, but I haven't read the manual yet; however, it's certainly more script-friendly.
Also, I'll stay tuned on this thread and post more info when I start using it.
I think what you should do is look into tshark. That is the linux command line version, which will allow exactly what you ask for (assuming you have access to it). And of course, this assumes that it is acceptable to run tshark and then review the outputs manually.
I haven't ever tried to automate Wireshark before, though I have had to do something similar to what you describe. I ended up reducing the handful of human (and thus error-prone) steps to one step that was automated.
Autohotkey is my solution for lots of repetitive GUI-based tasks. You can very easily write a keystroke playback script that will do all of the above steps. You'll probably have to have it increment the filename for you automatically. You could also have your other automated tool kick off the Autohotkey script with a keystroke.
