Data missing in AVRO file

I'm attempting to convert some CSV files into AVRO files.
The code I've written runs fine on many of the CSV files I've tested, but for some files I find that data is missing from the resulting AVRO file.
Here is the outline of the code for the CSV->AVRO conversion. I'm using version 1.7.5 of the C library.
// initialize line counter
lineno = 0;
// make a schema first
avro_schema_from_json_length (...);
// make a generic class from the schema
iface = avro_generic_class_from_schema (schema);
// get the record size and verify that it is 109
avro_schema_record_size (schema);
// get a generic value
avro_generic_value_new (iface, &tuple);
// open the output file
fp = fopen (outputfile, "wb");
// make a file writer on top of it
avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);
// now for the code that emits the data
while (...)
{
    avro_value_reset (&tuple);
    // get the CSV record into the tuple
    ...
    // write that tuple
    avro_file_writer_append_value (db, &tuple);
    lineno ++;
    // flush the file
    avro_file_writer_flush (db);
}
// close the writer
avro_file_writer_close (db);
// other cleanup
avro_value_iface_decref (iface);
avro_value_decref (&tuple);
// flush and close the output file
fflush (fp);
fclose (fp);
When I run this program on a CSV file with 448621 rows of data plus one header row, it correctly reports that it processed 448621 rows of data.
The reader of this file is a modified avrocat.c. Here is the relevant code:
wschema = avro_file_reader_get_writer_schema(reader);
iface = avro_generic_class_from_schema(wschema);
avro_generic_value_new(iface, &value);

int rval;
lineno = 0;
while ((rval = avro_file_reader_read_value(reader, &value)) == 0) {
    lineno ++;
    avro_value_reset(&value);
}

// If it was not an EOF that caused the loop to stop, print the error.
if (rval != EOF)
{
    fprintf(stderr, "Error: %s\n", avro_strerror());
}
else
{
    printf ( "%s %lld\n", filename, lineno );
}
When I run this against the avro file I just created, I find that it only has 448609 rows of data.
Not sure what happened to the rest ...
What am I missing or doing wrong?
What additional information would someone need to help debug this?
I've tried a few things. Adding the flush call shown above is one of them.
I've also dumped the AVRO file (using avrocat) to find out what is missing, and it tends to be rows at the end.
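For reference, a stripped-down, self-contained writer in the same style, with every return value checked, would look roughly like the sketch below. This is only a sketch: the one-field schema and the out.avro file name are placeholders for illustration, not my real schema or output file.

#include <avro.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* Placeholder one-field schema, just for the sketch. */
    const char *json_schema =
        "{\"type\":\"record\",\"name\":\"Row\","
        "\"fields\":[{\"name\":\"value\",\"type\":\"long\"}]}";

    avro_schema_t schema;
    if (avro_schema_from_json_length(json_schema, strlen(json_schema), &schema) != 0) {
        fprintf(stderr, "schema error: %s\n", avro_strerror());
        return EXIT_FAILURE;
    }

    avro_value_iface_t *iface = avro_generic_class_from_schema(schema);
    avro_value_t tuple;
    avro_generic_value_new(iface, &tuple);

    avro_file_writer_t db;
    if (avro_file_writer_create("out.avro", schema, &db) != 0) {
        fprintf(stderr, "create error: %s\n", avro_strerror());
        return EXIT_FAILURE;
    }

    long lineno;
    for (lineno = 0; lineno < 10; lineno++) {
        avro_value_t field;
        avro_value_reset(&tuple);
        /* Fill the single field; the real code fills the tuple from the CSV line. */
        avro_value_get_by_index(&tuple, 0, &field, NULL);
        avro_value_set_long(&field, lineno);

        if (avro_file_writer_append_value(db, &tuple) != 0) {
            fprintf(stderr, "append failed at line %ld: %s\n", lineno, avro_strerror());
            break;
        }
        if (avro_file_writer_flush(db) != 0) {
            fprintf(stderr, "flush failed at line %ld: %s\n", lineno, avro_strerror());
            break;
        }
    }

    avro_file_writer_close(db);
    avro_value_decref(&tuple);
    avro_value_iface_decref(iface);
    avro_schema_decref(schema);
    return EXIT_SUCCESS;
}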

It appears that this is a bug in the C library 1.7.5 that was just fixed in C 1.7.6.
The bug in question is
https://issues.apache.org/jira/browse/AVRO-1364
Solution: upgrade to 1.7.6, where I verified that this problem does not exist.

Related

Saxonica - .NET API - XQuery - XPDY0002: The context item for axis step root/descendant::xxx is absent

I'm getting the same error as this question, but with XQuery:
SaxonApiException: The context item for axis step ./CLIENT is absent
When running from the command line, all is good. So I don't think there is a syntax problem with the XQuery itself. I won't post the input file unless needed.
The XQuery is displayed with a Console.WriteLine before the error appears:
----- Start: XQUERY:
(: FLWOR = For Let Where Order-by Return :)
<MyFlightLegs>
{
for $flightLeg in //FlightLeg
where $flightLeg/DepartureAirport = 'OKC' or $flightLeg/ArrivalAirport = 'OKC'
order by $flightLeg/ArrivalDate[1] descending
return $flightLeg
}
</MyFlightLegs>
----- End : XQUERY:
Error evaluating (<MyFlightLegs {for $flightLeg in root/descendant::FlightLeg[DepartureAirport = "OKC" or ArrivalAirport = "OKC"] ... return $flightLeg}/>) on line 4 column 20
XPDY0002: The context item for axis step root/descendant::FlightLeg is absent
I think that, like the other question, my input XML file may not be properly specified.
I took the run method of the XQueryToStream class from samples/cs/ExamplesHE.cs.
The code there, for easy reference, is:
public class XQueryToStream : Example
{
    public override string testName
    {
        get { return "XQueryToStream"; }
    }

    public override void run(Uri samplesDir)
    {
        Processor processor = new Processor();
        XQueryCompiler compiler = processor.NewXQueryCompiler();
        compiler.BaseUri = samplesDir.ToString();
        compiler.DeclareNamespace("saxon", "http://saxon.sf.net/");
        XQueryExecutable exp = compiler.Compile("<saxon:example>{static-base-uri()}</saxon:example>");
        XQueryEvaluator eval = exp.Load();
        Serializer qout = processor.NewSerializer();
        qout.SetOutputProperty(Serializer.METHOD, "xml");
        qout.SetOutputProperty(Serializer.INDENT, "yes");
        qout.SetOutputStream(new FileStream("testoutput.xml", FileMode.Create, FileAccess.Write));
        Console.WriteLine("Output written to testoutput.xml");
        eval.Run(qout);
    }
}
I changed it to pass the XQuery file name, the XML file name, and the output file name, and tried to make a static method out of it. (I had success doing the same with the XSLT processor.)
static void DemoXQuery(string xmlInputFilename, string xqueryInputFilename, string outFilename)
{
    // Create a Processor instance.
    Processor processor = new Processor();

    // Load the source document
    DocumentBuilder loader = processor.NewDocumentBuilder();
    loader.BaseUri = new Uri(xmlInputFilename);
    XdmNode indoc = loader.Build(loader.BaseUri);

    XQueryCompiler compiler = processor.NewXQueryCompiler();
    //BaseUri is inconsistent with Transform= Processor?
    //compiler.BaseUri = new Uri(xqueryInputFilename);
    //compiler.DeclareNamespace("saxon", "http://saxon.sf.net/");

    string xqueryFileContents = File.ReadAllText(xqueryInputFilename);
    Console.WriteLine("----- Start: XQUERY:");
    Console.WriteLine(xqueryFileContents);
    Console.WriteLine("----- End : XQUERY:");

    XQueryExecutable exp = compiler.Compile(xqueryFileContents);
    XQueryEvaluator eval = exp.Load();
    Serializer qout = processor.NewSerializer();
    qout.SetOutputProperty(Serializer.METHOD, "xml");
    qout.SetOutputProperty(Serializer.INDENT, "yes");
    qout.SetOutputStream(new FileStream(outFilename, FileMode.Create, FileAccess.Write));
    eval.Run(qout);
}
Also, two questions regarding "BaseURI":
1. Should it be a directory name, or can it be the same as the XQuery file name?
2. I get the compile error "Cannot implicitly convert type 'System.Uri' to 'string'" on the line
compiler.BaseUri = new Uri(xqueryInputFilename);
It's exactly the same thing I did for XSLT, which worked. But it looks like BaseUri is a string for XQuery and a real Uri object for XSLT? Is there any reason for the difference?
You seem to be asking a whole series of separate questions, which are hard to disentangle.
Your C# code appears to be compiling the query
<saxon:example>{static-base-uri()}</saxon:example>
which bears no relationship to the XQuery code you supplied that involves MyFlightLegs.
The MyFlightLegs query uses //FlightLeg and is clearly designed to run against a source document containing a FlightLeg element, but your C# code makes no attempt to supply such a document. You need to add an eval.ContextItem = value statement.
Your second C# fragment creates an input document in the line
XdmNode indoc = loader.Build(loader.BaseUri);
but it doesn't supply it to the query evaluator.
A base URI can be either a directory or a file; resolving relative.xml against file:///my/dir/ gives exactly the same result as resolving it against file:///my/dir/query.xq. By convention, though, the static base URI of the query is the URI of the resource (eg file) containing the source query text.
Yes, there's a lot of inconsistency in the use of strings versus URI objects in the API design. (There's also inconsistency about the spelling of BaseURI versus BaseUri.) Sorry about that; you're just going to have to live with it.
Bottom-line solution, based on Michael Kay's response: I added this line of code after the exp.Load() call:
eval.ContextItem = indoc;
The indoc object created earlier is what ties the XML input file to the XQuery being evaluated.

Unexpected error "incorrect use of identifier" when running DXL code

I have a script which calls functions from this script:
https://www.baselinesinc.com/dxl-repository/?filerepoaction=showpost&filepost_id=13
which is distributed under the LGPL. I have found several references to this script online, indicating that it works. For my own purposes, I removed the part of the code which executes on load (some 20 lines at the bottom of the file) and converted the file to an .inc, to serve as a library. I haven't made any other changes.
My own script has worked for a few months at this point. But now, all of a sudden, when I try to run it, I get these errors:
-E- DXL: <DOORSImporter\Import_From_Excel.inc:291> incorrect use of identifier (rtfValue)
-E- DXL: <DOORSImporter\Import_From_Excel.inc:292> incorrect use of identifier (nonRTFValue)
-E- DXL: <DOORSImporter\Import_From_Excel.inc:346> incorrect use of identifier (temp)
-E- DXL: <DOORSImporter\Import_From_Excel.inc:464> incorrect use of identifier (columnHeading)
-E- DXL: <DOORSImporter\Import_From_Excel.inc:1127> incorrect use of identifier (cellValue)
-I- DXL: All done. Errors reported: 5. Warnings reported: 0.
Below is a snippet of the code which produces this error. I have marked lines 291 and 292.
bool richTextTruncated(int rowNumber, int columnNumber) {
    Buffer rtfValue = create
    Buffer nonRTFValue = create
    rtfValue = ""
    nonRTFValue = ""
    rtfValue = copyRichTextFromCell(""intToCol(columnNumber) "" rowNumber"") // get the rich text
    rtfValue = trimWhitespace(stripRichText(tempStringOf(rtfValue))) // strip off RTF formatting. <----- 291
    nonRTFValue = trimWhitespace(getCellValue(rowNumber, columnNumber)) // <----- 292
    if(tempStringOf(rtfValue) != tempStringOf(nonRTFValue)) { // if the two strings aren't equal
        delete(rtfValue)
        delete(nonRTFValue)
        return true // return that text was truncated
    }
    delete(rtfValue)
    delete(nonRTFValue)
    return false
}
The function trimWhitespace itself is defined as string trimWhitespace(string s), so it's OK to assign its output to a Buffer.
So this is all very confusing because, like I said, this has worked and the code has not been changed (the last check-in of the file DOORSImporter\Import_From_Excel.inc is from last November, whereas I have used the script as recently as January).
Our IT department might have updated DOORS recently, I don't know. I checked the latest version of the DXL reference manual for any recent changes which might have caused this, but I couldn't find any change to the way buffers are handled.
To be fair, I have found working with buffers quite confusing. In my own code, I ended up discarding them altogether. However, the script Import_From_Excel has worked unchanged for a long time now.
Any support with this will be greatly appreciated.
********** Edit 27.02.2020 **************
The function copyRichTextFromCell is defined as below. Its source is the link below.
https://www.baselinesinc.com/dxl-repository/?filerepoaction=showpost&filepost_id=10
string copyRichTextFromCell(string loc) {
    if(!getCell(loc)) {
        return("error");
    }
    if(!checkResult(oleMethod(objCell, cMethodSelect))) { // selects the cell
        return("error");
    }
    if(!checkResult(oleMethod(objCell, cMethodCopy))) { // copies the contents of the cell to the clipboard
        return("error");
    }
    return stringOf(richClip)
}
********** Edit 28.02.2020 **************
The function trimWhitespace is as below.
string trimWhitespace(string s) {
    int first = 0;
    int last = length(s) - 1;
    while(last > 0 && (isspace(s[last]) || s[last] == '\n' || s[last] == '\r' || s[last] == '\t')) {
        last--;
    }
    while(isspace(s[first]) && first < last) {
        first++;
    }
    if(s[first:last] == " ") {
        return("");
    }
    return(s[first:last]);
}
While playing around, I discovered that if I modify the code as below, one of the errors disappears. It is beyond me why that happens, because the code is equivalent as far as I can tell...
bool richTextTruncated(int rowNumber, int columnNumber) {
    Buffer rtfValue = create
    Buffer nonRTFValue = create
    string fooString
    rtfValue = ""
    nonRTFValue = ""
    rtfValue = copyRichTextFromCell(""intToCol(columnNumber) "" rowNumber"") // get the rich text
    fooString = trimWhitespace(stripRichText(tempStringOf(rtfValue)))
    rtfValue = fooString // strip off RTF formatting.
    fooString = trimWhitespace(getCellValue(rowNumber, columnNumber))
    nonRTFValue = fooString
    if(tempStringOf(rtfValue) != tempStringOf(nonRTFValue)) { // if the two strings aren't equal
        delete(rtfValue)
        delete(nonRTFValue)
        return true // return that text was truncated
    }
    delete(rtfValue)
    delete(nonRTFValue)
    return false
}
So the lines with fooString do not throw an error anymore. In the initial version, these lines were just e.g. rtfValue = trimWhitespace(stripRichText(tempStringOf(rtfValue))) and this would throw an error.

How can I efficiently parse formatted text from a file in Qt?

I would like an efficient way of working with strings in Qt, since I am new to the Qt environment.
What I am doing:
I am loading a text file and reading it line by line.
Each line contains comma-separated entries.
Line schema:
Fname{limit:list:option}, Lname{limit:list:option} ... etc.
Example:
John{0:0:0}, Lname{0:0:0}
Note: limit can be 1 or 0, and the same goes for list and option.
I would like to get Fname and the limit, list, and option values from inside the {}.
I was thinking of writing code that finds { and takes what is inside, reading symbol by symbol.
What is an efficient way to parse this?
Thanks.
The following snippet will give you Fname and limit,list,option from the first set of brackets. It could be easily updated if you are interested in the Lname set as well.
QFile file("input.txt");
if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
    qDebug() << "Failed to open input file.";

QRegularExpression re("(?<name>\\w+)\\{(?<limit>[0-1]):(?<list>[0-1]):(?<option>[0-1])}");

while (!file.atEnd())
{
    QString line = file.readLine();
    QRegularExpressionMatch match = re.match(line);
    QString name = match.captured("name");
    int limit = match.captured("limit").toInt();
    int list = match.captured("list").toInt();
    int option = match.captured("option").toInt();
    // Do something with the values ...
}
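If you also need the Lname group (or any number of name{...} groups per line), the same regular expression can be applied repeatedly with QRegularExpression::globalMatch(). A small sketch, reusing re and line from the snippet above inside the same read loop:

// Iterate over every name{limit:list:option} group on the line, not just the first.
QRegularExpressionMatchIterator it = re.globalMatch(line);
while (it.hasNext())
{
    QRegularExpressionMatch m = it.next();
    QString name = m.captured("name");
    int limit = m.captured("limit").toInt();
    int list = m.captured("list").toInt();
    int option = m.captured("option").toInt();
    // Do something with each group ...
}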

Invalid argument supplied for foreach()

The code below deletes files inside the "Images" folder once they are more than 60 seconds old. It works, but when the folder is empty it says: Warning: Invalid argument supplied for foreach()
How can that be fixed so that when there are no files it says "folder empty" instead of showing that warning?
<?php
$expiretime = 1;
$tmpFolder = "Images/";
$fileTypes = "*.*";
foreach (glob($tmpFolder . $fileTypes) as $Filename) {
    // Read file creation time
    $FileCreationTime = filectime($Filename);
    // Calculate file age in seconds
    $FileAge = time() - $FileCreationTime;
    // Is the file older than the given time span?
    if ($FileAge > ($expiretime * 60)) {
        // Now do something with the older files...
        echo "The file $Filename is older than $expiretime minutes\r\n";
        // delete files:
        unlink($Filename);
    }
}
?>
Since glob() may not reliably return an empty array for an empty match (see the note in the Return Values section of the docs), you just need an if statement protecting your loop, like so:
$files = glob($tmpFolder . $fileTypes);
if (is_array($files) && count($files) > 0) {
    foreach ($files as $Filename) {
        // Read file creation time
        $FileCreationTime = filectime($Filename);
        // Calculate file age in seconds
        $FileAge = time() - $FileCreationTime;
        // Is the file older than the given time span?
        if ($FileAge > ($expiretime * 60)) {
            // Now do something with the older files...
            echo "The file $Filename is older than $expiretime minutes\r\n";
            // delete files:
            unlink($Filename);
        }
    }
} else {
    echo 'Your error here...';
}

Extract document name from print job

Is there a reliable way to extract a document name or job name from a PostScript print job if you have the raw PostScript data?
I've seen print release station software that labels each job with the document name or URL it was printed from, so it seems possible.
There is no reliable way to do this, as there is no such (metadata) information in the PostScript language. If your files are DSC (Document Structuring Convention) compliant then you can look for comments. These are documented in the DSC reference manual. Valid PostScript files need not be DSC compliant.
Other than that, there is no information there to extract, at least as far as PostScript is concerned.
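For example, a DSC-conforming job usually carries a %%Title: comment in its header section. A rough sketch of pulling that out is below; the function name is made up for illustration, and it simply returns an empty string when no such comment is present.

#include <fstream>
#include <string>

// Scan the DSC header comments of a PostScript file for %%Title:.
std::string ExtractDscTitle(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::string line;
    while (std::getline(in, line))
    {
        if (line.compare(0, 8, "%%Title:") == 0)
            return line.substr(8); // the rest of the line is the title
        if (line.compare(0, 13, "%%EndComments") == 0)
            break; // end of the DSC header section
    }
    return std::string();
}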
To extract the document name from a print job using C++ (via the Windows print spooler):
#include <Windows.h>
#include <WinSpool.h>
#include <string>
using std::wstring;

wstring GetDocumentName(wstring m_strFriendlyName)
{
    wstring strDocName = L"";
    HANDLE hPrinter = NULL;
    if ( OpenPrinter(const_cast<LPWSTR>(m_strFriendlyName.c_str()), &hPrinter, NULL) == 0 )
    {
        /* OpenPrinter call failed */
        return strDocName;
    }

    // First call gets the required buffer size, second call fills the buffer.
    DWORD dwBufsize = 0;
    GetPrinter(hPrinter, 2, NULL, 0, &dwBufsize);                 // Get dwBufsize
    PRINTER_INFO_2* pinfo2 = (PRINTER_INFO_2*)malloc(dwBufsize);  // Allocate with dwBufsize
    GetPrinter(hPrinter, 2, (LPBYTE)pinfo2, dwBufsize, &dwBufsize);
    DWORD numJobs = pinfo2->cJobs;
    free(pinfo2);

    // Same two-call pattern to get info about the jobs in the queue.
    JOB_INFO_1 *pJobInfo = NULL;
    DWORD bytesNeeded = 0, jobsReturned = 0;
    EnumJobs(hPrinter, 0, numJobs, 1, NULL, 0, &bytesNeeded, &jobsReturned);
    pJobInfo = (JOB_INFO_1*)malloc(bytesNeeded);
    EnumJobs(hPrinter, 0, numJobs, 1, (LPBYTE)pJobInfo, bytesNeeded, &bytesNeeded, &jobsReturned);

    JOB_INFO_1 *pJobInfoInitial = pJobInfo;
    for (unsigned short count = 0; count < jobsReturned; count++)
    {
        if (pJobInfo != NULL)
        {
            strDocName = pJobInfo->pDocument;  // Document name (last job wins)
            DWORD dw = pJobInfo->Status;       // Job status (not used here)
        }
        pJobInfo++;
    }
    free(pJobInfoInitial);
    ClosePrinter(hPrinter);
    return strDocName;
}
What you might be seeing is the document name that the application submitted to the print spooler. Also, it may not be reliable, but most print drivers put the document name in PJL or XML at the top of the print job. With some flexible rules you might be able to pull this data with some confidence.
This, of course, assumes that the PS data was generated by a printer driver.
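For example, many driver-generated jobs begin with a PJL prologue containing a line such as @PJL JOB NAME = "My Document" before the PostScript data. A rough sketch of scraping that is below; the function name is made up for illustration, and real code would need more tolerant matching, since the exact PJL statements and spacing vary by driver.

#include <fstream>
#include <string>

// Look for a job name in the PJL header that many drivers prepend to the job,
// e.g.  @PJL JOB NAME = "My Document"
std::string ExtractPjlJobName(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::string line;
    const std::string key = "@PJL JOB NAME = \"";
    while (std::getline(in, line))
    {
        std::size_t pos = line.find(key);
        if (pos != std::string::npos)
        {
            std::size_t start = pos + key.size();
            std::size_t end = line.find('"', start);
            if (end != std::string::npos)
                return line.substr(start, end - start);
        }
        if (line.find("ENTER LANGUAGE") != std::string::npos)
            break; // the PJL prologue hands off to PostScript here
    }
    return std::string();
}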
