JDOM2 Outputter inserting 4 errant bytes at start of XML file - jdom-2

I'm using Java 8 and JDOM 2.0.6 (Mac-Yosemite + Eclipse) to generate an XML file.
The file comes out with these four bytes preceding the <?xml prolog:
C2 A8 C3 8C
I'm using XMLOutputter.output() to write out the Document. When I direct the output to the console, it comes out correctly. When I direct the output to a file, the errant bytes are inserted.
Relevant code:
`
private Document outputDoc = new Document();
outputDoc.setRootElement(new Element("GraphicalAlgorithm_" + challengeID, DFG2D_NAMESPACE));
outputDoc.getRootElement().addContent(..my Element...);
XMLOutputter xmlOutputter = new XMLOutputter(Format.getPrettyFormat());
// TEST ONLY: writing to the console produces correct output
xmlOutputter.output(outputDoc, System.out);
// Writing to the file stream is where the errant bytes appear
xmlOutputter.output(outputDoc, fileStream);
`
I'm stumped on this one.

I stepped into this minefield by copying and pasting the "file output" method I had previously been using for serialization (.ser) file output.
The errant 4 bytes are a "magic ID" that Java serialization stamps into a FileOutputStream as soon as it is attached to an ObjectOutputStream (the specialized stream you use for serialized output calls, e.g. "writeObject(obj)"). One might assume that you have to actually invoke "writeObject(obj)" for the serializer to mark the file this way, but no: the header is written when the ObjectOutputStream is constructed.
A complete analysis/write-up and repair can be found here:
https://github.com/hunterhacker/jdom/issues/193
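To see the effect in isolation, here is a minimal sketch. The class and method names are made up for illustration and are not part of JDOM or the original code. It wraps an in-memory stream in an ObjectOutputStream, never calls writeObject, and then inspects what landed in the underlying stream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

// Hypothetical demo class, not taken from the original project.
class SerializationHeaderDemo {

    // Returns the bytes present in the underlying stream after an
    // ObjectOutputStream has merely been constructed over it and flushed.
    public static byte[] headerBytes() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(sink);
        oos.flush(); // note: writeObject is never called
        return sink.toByteArray();
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(toHex(headerBytes())); // prints "AC ED 00 05"
    }
}
```

The raw header is AC ED 00 05 (stream magic plus version). The bytes quoted in the question, C2 A8 C3 8C, are different; they may be the result of the first header bytes passing through a character-encoding conversion somewhere along the way, but that is a guess and not something the write-up confirms. Either way, the practical fix is the same: hand XMLOutputter a plain FileOutputStream that has never been wrapped in an ObjectOutputStream.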

Related

OleDb Connection string for tab-delimited files

I need to read a variety of data file types, such as xlsx, csv, txt, and mdb, and I want to use an OleDB connection so that the process of reading the files is the same, just with a different connection string. However, OleDB is ignoring the delimiter in connection strings such as the following and only reads comma-delimited.
Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties='Text;HDR=Yes;Delimited(\t)';
Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties='Text;HDR=Yes;FMT=TabDelimited';
I would prefer to have the OleDB engine do the work rather than parse the tab-delimited files myself.
There are several StackOverflow questions concerning this, and the solution is usually to create an .ini file in the same directory, but sometimes my users do not have write access to the folder. Seeing as all of the StackOverflow questions similar to mine are at least a couple years old, does anybody have any updated information on this issue?
This is how I've used the | delimiter to read |-delimited .csv or .txt files through OleDB. Note that I was using the ACE engine and building the connection string in C#:
connectionString = "Provider=Microsoft.ACE.OLEDB.12.0;Data Source=" + Path.GetDirectoryName(catalogFile) + ";Extended Properties='text;HDR=YES;FMT=Delimited(" + (char)124 + ")'";
(char)124 is the ASCII code for |. Since the ASCII code for TAB is 9, you could try this in your connection string:
...;Extended Properties='text;HDR=YES;FMT=Delimited(" + (char)9 + ")'";
Try the snippet above, and also try your code with the MS Access Database Engine (ACE) driver. Since it's newer, it may handle the delimiter setting better.

Using Jena to convert an owl file to N-Triples from terminal returns an empty file

I have generated an owl file using this generator http://swat.cse.lehigh.edu/projects/lubm/
I want to transform the file into N-Triples and have done it before using
$ riot -out N-TRIPLE ~/lubm20/*.owl > lubm20.nt
For some reason I now get an empty file (lubm20.nt),
and when I use
$ rdfcat -out N-TRIPLE ~/lubm20/*.owl > lubm20.nt
I get this error:
Exception in thread "main" org.apache.jena.riot.RiotException: <file:///root/lubm20/classes\University0_0.owl> Code: 4/UNWISE_CHARACTER in PATH: The character matches no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
at org.apache.jena.riot.system.IRIResolver.exceptions(IRIResolver.java:371)
at org.apache.jena.riot.system.IRIResolver.resolve(IRIResolver.java:328)
at org.apache.jena.riot.system.IRIResolver$IRIResolverSync.resolve(IRIResolver.java:489)
at org.apache.jena.riot.system.IRIResolver.resolveIRI(IRIResolver.java:254)
at org.apache.jena.riot.system.IRIResolver.resolveString(IRIResolver.java:233)
at org.apache.jena.riot.SysRIOT.chooseBaseIRI(SysRIOT.java:109)
at org.apache.jena.riot.adapters.AdapterFileManager.readModelWorker(AdapterFileManager.java:286)
at org.apache.jena.util.FileManager.readModel(FileManager.java:341)
at jena.rdfcat.readInput(rdfcat.java:328)
at jena.rdfcat$ReadAction.run(rdfcat.java:473)
at jena.rdfcat.go(rdfcat.java:231)
at jena.rdfcat.main(rdfcat.java:206)
The generator produces a well-known Semantic Web benchmark dataset, so how can it contain UNWISE_CHARACTERs?
Edit, in response to the question asked:
I used this line to generate the *.owl files
java edu.lehigh.swat.bench.uba.Generator -onto http://swat.cse.lehigh.edu/onto/univ-bench.owl univ 20
then moved the *.owl files to the lubm20 folder.
I used rdf2rdf instead of Jena:
java -jar rdf2rdf-1.0.1-2.3.1.jar /lubmData/lubm100/*.owl lubm100.nt
It worked like a charm.
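One thing worth checking, as a hedged aside: the IRI in the stack trace contains a backslash (classes\University0_0.owl), and backslash is one of the characters that URI grammars have traditionally classed as "unwise". Whether that is the actual cause here is an assumption on my part. The sketch below uses the JDK's java.net.URI, not Jena's own IRI checker, simply to show that such a string is rejected and where. The class and method names are illustrative.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, unrelated to Jena's IRIResolver.
class UnwiseCharacterCheck {

    // Returns the index at which java.net.URI reports a syntax error,
    // or -1 if the string parses. The JDK parser and Jena's IRI
    // checker may disagree on edge cases.
    public static int firstIllegalIndex(String candidate) {
        try {
            new URI(candidate);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        // The IRI as it appears in the stack trace (backslash escaped for Java).
        String fromTrace = "file:///root/lubm20/classes\\University0_0.owl";
        int idx = firstIllegalIndex(fromTrace);
        if (idx >= 0) {
            System.out.println("Rejected at index " + idx
                    + ": '" + fromTrace.charAt(idx) + "'");
        } else {
            System.out.println("Parsed without complaint");
        }
    }
}
```

If the backslash does turn out to be the culprit, renaming the generated files so that their names contain no backslash before running rdfcat would be the obvious thing to try.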

PHPEXCEL weird characters on form inputs

I need some help with the PHPExcel library. Everything works well: I'm successfully exporting my SQL query to an Excel5 file, which I need to give to a transport company so they can automatically collect information about packages. Unfortunately, the generated Excel file has some ASCII characters between each letter of the cell text, and when the file is imported, these characters have to be deleted manually.
If I open the file in Excel, everything looks fine and I see: COMPANY NAME. If I open it with Notepad++, I see the cell values this way: C(NUL)O(NUL)M(NUL)P(NUL)A(NUL)N(NUL)Y N(NUL)A(NUL)M(NUL)E
If I open the file in Excel again, save it, and reopen it with Notepad++, I see COMPANY NAME.
So I do not understand why, every time I create an Excel file using PHPExcel, every letter of every word is followed by (nul).
How do I prevent the generated Excel file from including (nul) between the letters?
The Excel files generated by the PHPExcel samples are also filled with (nul), and if you open and save them in Excel, the (nul) is gone.
Any help would be appreciated, thanks.
What is the (nul)? 0x00? char(0)?
OK, here is the example:
error_reporting(E_ALL);
ini_set('display_errors', TRUE);
ini_set('display_startup_errors', TRUE);
date_default_timezone_set('Europe/London');
if (PHP_SAPI == 'cli')
    die('Available only in a browser');
require_once dirname(__FILE__) . '/Classes/PHPExcel.php';
$objPHPExcel = new PHPExcel();
$objPHPExcel->getProperties()->setCreator("Solidus")
->setLastModifiedBy("Solidus")
->setTitle("Import web")
->setSubject("Import File")
->setDescription("n.a")
->setKeywords("n.a")
->setCategory("n.a");
$objPHPExcel->setActiveSheetIndex(0)
->setCellValueExplicit("A1", "COMPANY")
->setCellValue('A2', 'SAMSUNG');
$objPHPExcel->getActiveSheet()->setTitle('DDT');
$objPHPExcel->setActiveSheetIndex(0);
header('Content-Type: application/vnd.ms-excel');
header('Content-Disposition: attachment;filename="TEST.xls"');
header('Cache-Control: max-age=0');
header('Cache-Control: max-age=1');
header('Cache-Control: private',false);
$objWriter = PHPExcel_IOFactory::createWriter($objPHPExcel, 'Excel5');
ob_end_clean();
$objWriter->save('php://output');
As you can see from this little example, the script creates an Excel5 file with 2 cells, A1 = COMPANY and A2 = SAMSUNG.
When I send this file to the transport company, they import it into their system, but as you can see from the picture, there is a weird character between each letter.
I noticed that every time I open the generated Excel5 file with Notepad++ I get:
S(nul)A(nul)M(nul)S(nul)U(nul)N(nul)G
If I save the file with Excel and then open it again with Notepad++ I get:
SAMSUNG
and this file is OK for the transport company.
So my question is: how do I keep the generated file from containing this (nul) character between each letter?
Some help?
I found the solution myself; I'll explain it in case anyone else has this problem.
There is no way to change how PHPExcel encodes the Excel file,
so I figured the problem was in reading the file. I ran some simulations and reproduced the problem: every time I read the file and put the result into inputs, I get weird characters:
C�O�M�P�A�N�Y�
If I set the output encoding as follows:
$excel->setOutputEncoding('UTF-8');
the file loads fine, so the problem was not in creating the Excel file but in reading it.
If I print the variable with echo I get: "COMPANY";
if I put the variable into an input as its value I get: "C�O�M�P�A�N�Y�".
Setting the output encoding solves the problem, but I would like to know why there is a difference when I put the variable into an input as its value. Thanks.
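A likely explanation for the (nul) bytes, offered as an assumption rather than something confirmed in this thread: the text in the .xls file is stored as UTF-16LE, in which every ASCII letter is followed by a 0x00 byte. A reader that treats those bytes as a single-byte encoding keeps the NULs as characters, while a reader told to convert to UTF-8 drops them. The sketch below is in Java rather than PHP and only illustrates the encoding behaviour; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use PHPExcel or any Excel reader.
class Utf16NulDemo {

    // Renders bytes roughly the way a byte-oriented viewer might:
    // printable ASCII as-is, 0x00 as "(nul)", anything else as a hex escape.
    public static String visualize(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            int v = b & 0xFF;
            if (v == 0) {
                sb.append("(nul)");
            } else if (v >= 0x20 && v < 0x7F) {
                sb.append((char) v);
            } else {
                sb.append(String.format("\\x%02X", v));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the cell text is stored as UTF-16LE.
        byte[] stored = "SAMSUNG".getBytes(StandardCharsets.UTF_16LE);

        // What a byte-oriented viewer would show.
        System.out.println(visualize(stored));

        // Decoding with the matching charset recovers the text.
        System.out.println(new String(stored, StandardCharsets.UTF_16LE));

        // Decoding as a single-byte charset keeps each NUL as a character,
        // so the string is twice as long as expected.
        String wrong = new String(stored, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.length());
    }
}
```

Under that assumption, the file PHPExcel produces is not damaged; the consumer simply has to decode the cell text with the right encoding, which is what setOutputEncoding('UTF-8') appears to arrange on the reading side.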

Identifying source of parser errors in Apache Fuseki

I am getting the following error in trying to load a large RDF/XML document into Fuseki:
> Code: 4/UNWISE_CHARACTER in PATH: The character matches no grammar rules of URIs/IRIs. These characters are permitted in RDF URI References, XML system identifiers, and XML Schema anyURIs.
How do I find out what line contains the offending error?
I have tried turning up the output in Log4j.properties, and I also tried validating the RDF/XML file using the Jena command-line rdfxml tool (as well as utf8 and riot); it validates with no errors reported. But I'm new to this toolset.
(version?)
Check the ""-strings in your RDF/XML data for undesirable URIs, especially spaces in URIs.
It's best to validate before loading: try riot YourFile and send stderr and stdout to a file. The errors should appear at approximately the position the parser output (N-Triples) had reached at the time.
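As a rough way to act on the advice about checking the quoted strings, the following sketch scans RDF/XML text line by line and reports line numbers where an rdf:about or rdf:resource value contains whitespace or a few characters commonly treated as "unwise". It is regex-based, not a real XML parser, and the character set is my own assumption rather than Jena's exact rule set, so treat hits as leads and misses as inconclusive. All names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; it does not use Jena or Fuseki.
class SuspectIriScanner {

    // Matches rdf:about="..." and rdf:resource="..." (double-quoted only).
    private static final Pattern IRI_ATTR =
            Pattern.compile("rdf:(?:about|resource)=\"([^\"]*)\"");

    // Assumed set of suspicious characters: whitespace plus { } | \ ^ `
    private static final Pattern SUSPECT = Pattern.compile("[\\s{}|\\\\^`]");

    // Returns one "line N: value" entry per suspicious attribute value.
    public static List<String> scan(List<String> lines) {
        List<String> findings = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = IRI_ATTR.matcher(lines.get(i));
            while (m.find()) {
                String value = m.group(1);
                if (SUSPECT.matcher(value).find()) {
                    findings.add("line " + (i + 1) + ": " + value);
                }
            }
        }
        return findings;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java SuspectIriScanner <file.rdf>");
            return;
        }
        // Reads the whole file into memory; very large files would need streaming.
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String finding : scan(lines)) {
            System.out.println(finding);
        }
    }
}
```

Attribute values that span several lines or use single quotes are not caught by this pattern, so it complements the riot run rather than replacing it.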

Offending Command error while Printing EPS

I am printing an EPS file generated with the following header:
%-12345X#PJL JOB
#PJL ENTER LANGUAGE = POSTSCRIPT
%!PS-Adobe-3.0
%%Title: InvoiceDetail_combine
%%Creator: PScript5.dll Version 5.2.2
%%CreationDate: 10/7/2011 4:46:59
%%For: Administrator
%%BoundingBox: (atend)
%%Pages: (atend)
%%Orientation: Portrait
%%PageOrder: Special
%%DocumentNeededResources: (atend)
%%DocumentSuppliedResources: (atend)
%%DocumentData: Clean7Bit
%%TargetDevice: (HP Color LaserJet 4500) (2014.200) 0
%%LanguageLevel: 2
%%EndComments
When doing selection printing on a Ricoh Aficio 2090, or with any other driver/printer, I get the following error printed on the sheets:
ERROR: undefined
OFFENDING COMMAND: F4S47
Stack:
.
Please review and suggest a workaround, as I am already stuck on this. I have tried to convert/extract to PS, but all in vain. I am using GSview to print and view these files.
This is the problem:
%%PageOrder: Special
A ps document with "Special" page order can NOT be re-ordered. You cannot do a selection or range with this file because it is broken for this use. You must reprocess the file using Distiller or ghostscript (ps2ps or ps2pdf) in order to print selected or re-ordered pages from the document.
You can avoid this by generating your postscript files with a real Postscript™ driver (one not created by Microsoft).
The GSView Documentation has more about this.
Previously:
This line ...
%%TargetDevice: (HP Color LaserJet 4500) (2014.200) 0
... tells us that the file was generated with HP printers as a target. So this really is not an EPS file, because it is not encapsulatable. To generate output on a printer, the file has to execute the showpage operator, which is a no-no for EPS files.
So uncheck the EPS box (it's a big fat lie, anyway), and select (install) a Generic Postscript driver. If you need to send it to multiple makes of printer, the file needs to make as few assumptions about the printer as possible.
The first thing is that this is not a valid EPS file, as it has PJL attached at the front. Many PostScript printers will strip this off, but by no means all.
This probably is not the source of the problem.
There is no way to 'review' the problem as you have not supplied the complete PostScript program. Without that there is no way to tell what is actually wrong. The error message tells you that the interpreter encountered 'F4547' while trying to parse a token, and that this has not been defined as a routine.
Most likely the file is corrupt: either damaged in some way, or possibly it is a binary file that has been transmitted by some process which has done some kind of conversion (CR/LF is common). The offending command looks like it's ASCIIHex encoded, so that may be a red herring.
If you want additional help, you are going to have to make the whole program available somewhere.

Resources