Unspecified error when truncating large dbf - oledb

I'm using an OleDbCommand to truncate DBF files. It works fine for most files, but if the file size is e.g. 400 MB I get an "Unspecified error". I've read somewhere that the size limit of a dbf file is 2 GB, so I hope there is a way to work with files that large...
System.Data.OleDb.OleDbException: Unspecified error
at System.Data.OleDb.OleDbCommand.ExecuteCommandTextErrorHandling(OleDbHResult hr)
at System.Data.OleDb.OleDbCommand.ExecuteCommandTextForSingleResult(tagDBPARAMS dbParams, Object& executeResult)
at System.Data.OleDb.OleDbCommand.ExecuteCommandText(Object& executeResult)
at System.Data.OleDb.OleDbCommand.ExecuteCommand(CommandBehavior behavior, Object& executeResult)
at System.Data.OleDb.OleDbCommand.ExecuteReaderInternal(CommandBehavior behavior, String method)
at System.Data.OleDb.OleDbCommand.ExecuteNonQuery()
at OleDbTruncateTest.Program.Main(String[] args) in C:\Users\henjoh\Visual Studio 2008\Projects\OleDbTruncateTest\OleDbTruncateTest\Program.cs:line 22
Below is the essential code for the operation:
using System;
using System.Data.OleDb;
using System.IO;

namespace OleDbTruncateTest
{
    class Program
    {
        static void Main(string[] args)
        {
            try
            {
                string file = @"C:\Temp\largefile.DBF";
                string pathName = Path.GetDirectoryName(file);
                string fileName = Path.GetFileName(file);
                using (OleDbConnection connection = new OleDbConnection("Provider=Microsoft.Jet.OLEDB.4.0; Data Source=" + pathName + "; Extended Properties=dBase III"))
                {
                    connection.Open();
                    using (OleDbCommand comm = new OleDbCommand("DELETE FROM " + fileName, connection))
                    {
                        comm.ExecuteNonQuery();
                    }
                }
                Console.WriteLine("Done");
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.ToString());
            }
            Console.WriteLine("ENTER to exit...");
            Console.ReadLine();
        }
    }
}
Any ideas on how to be able to truncate large dbf-files?

.dbf files, which typically originate with dBASE, Clipper and FoxPro (and Visual FoxPro), were all designed as 32-bit formats, so there is a cap of 2 GB for any single file. No choice, that's it. If the file is OVER the 2 GB limit, it must be a .DBF file handled by some other product that can read .dbf files, such as Sybase's Advantage Database Server, which can directly read/write/support .DBF files beyond the 2 GB limit.
If you want to truly truncate (i.e. remove all records), note that DELETE FROM will only mark the records for deletion and leave them in the file until you "PACK" the table. That said, I don't use the Microsoft Jet OLE DB provider; I use the Microsoft Visual FoxPro OLE DB provider (available as a download).
Then, I would build a string containing VFP commands to explicitly open the table exclusively and ZAP it (which deletes all records and packs and rebuilds indexes too)... something like
string VFPScript = "ExecScript( "
    + "[USE " + fileName + " EXCLUSIVE] + chr(13)+chr(10) + "
    + "[IF USED( '" + fileName + "')] + chr(13)+chr(10) + "
    + "[ZAP] + chr(13)+chr(10) + "
    + "[ENDIF] + chr(13)+chr(10) + "
    + "[USE] )";

// put this script into a command object, then execute it...
using (OleDbCommand comm = new OleDbCommand(VFPScript, connection))
{
    comm.ExecuteNonQuery();
}
NOTE: the only command here that I'm not sure Jet recognizes is the ExecScript() function, which in VFP allows you to pass a string as a block of commands and execute it as if it were a .prg, so you can do things like loops and IF/ENDIF blocks (with some limitations). However, this example builds the string as
USE YourFile EXCLUSIVE
if used( "YourFile" )
ZAP
ENDIF
USE
FINAL NOTE: when dealing with table names, the .DBF extension is IMPLIED when going through the OLE DB provider, so you'll want to NOT use the .dbf extension as part of the string. BOTH OLE DB providers will still find the table as long as it is in the path pointed to by the connection string.
Good luck.

Related

How to add column name as header when using dataflow to export data to csv

I am exporting some data to CSV with Dataflow, but in addition to the data I want to add the column names as the first line of the output file, such as
col_name1, col_name2, col_name3, col_name4 ...
data1.1, data1.2, data1.3, data1.4 ...
data2.1 ...
Is there any way to do this with the current API? (I searched around TextIO.Write but didn't find anything that seems relevant.) Or is there any way I could sort of "insert" the column names at the head of the to-be-exported PCollection and enforce the data to be written in order?
There is no built-in way to do that using TextIO.Write. PCollections are unordered, so it isn't possible to add an element to the front. You could write a custom BoundedSink which does this.
Custom sink APIs are now available if you want to be the brave one to craft a CSV sink. In the meantime, here is the current workaround, which builds up the output as a single string and outputs it all in finishBundle:
PCollection<String> output = data.apply(ParDo.of(new DoFn<String, String>() {
    private static final long serialVersionUID = 0;
    String new_line = System.getProperty("line.separator");
    String csv_header = "id, stuff1, stuff2, stuff3" + new_line;
    StringBuilder csv_body = new StringBuilder().append(csv_header);

    @Override
    public void processElement(ProcessContext c) {
        csv_body.append(c.element()).append(new_line);
    }

    @Override
    public void finishBundle(Context c) throws Exception {
        c.output(csv_body.toString());
    }
})).apply(TextIO.Write.named("WriteData").to(options.getOutput()));
This will only work if your big output string fits in memory.
As of Dataflow SDK version 1.7.0, you have the withHeader function in TextIO.Write.
So you can do this:
TextIO.Write.named("WriteToText")
    .to("/path/to/the/file")
    .withHeader("col_name1,col_name2,col_name3,col_name4")
    .withSuffix(".csv");
A new line character is automatically added to the end of the header.

How to Get Filename when using file pattern match in google-cloud-dataflow

Does anyone know how to get the filename when using file pattern matching in google-cloud-dataflow?
I'm a newbie to Dataflow. How do I get the filename when using file pattern matching like this?
p.apply(TextIO.Read.from("gs://dataflow-samples/shakespeare/*.txt"))
I'd like to know how I can detect filenames such as kinglear.txt, Hamlet.txt, etc.
If you would like to simply expand the filepattern and get a list of filenames matching it, you can use GcsIoChannelFactory.match("gs://dataflow-samples/shakespeare/*.txt") (see GcsIoChannelFactory).
If you would like to access the "current filename" from inside one of the DoFns downstream in your pipeline, that is currently not supported (though there are some workarounds - see below). It is a common feature request, and we are still thinking about how best to fit it into the framework in a natural, generic and high-performance way.
Some workarounds include:
Writing a pipeline like this (the tf-idf example uses this approach; a fuller Java sketch is given after the second workaround below):
DoFn readFile = ...(takes a filename, reads the file and produces records)...
p.apply(Create.of(filenames))
.apply(ParDo.of(readFile))
.apply(the rest of your pipeline)
This has the downside that dynamic work rebalancing features won't work particularly well, because they currently apply only at the level of Read PTransforms, not at the level of ParDos with high fan-out (like the one here, which would read a file and produce all records); and parallelization will only work at the level of files, so files will not be split into sub-ranges. At the scale of reading Shakespeare this is not an issue, but if you are reading a set of files of wildly different sizes, some extremely large, then it may become an issue.
Implementing your own FileBasedSource (javadoc, general documentation) which would return records of type something like Pair<String, T> where the String is the filename and the T is the record you're reading. In this case the framework would handle the filepattern matching for you, dynamic work rebalancing would work just fine, however it is up to you to write the reading logic in your FileBasedReader.
Both of these work-arounds are non-ideal, but depending on your requirements, one of them may do the trick for you.
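For concreteness, here is a minimal sketch of the first workaround. It is not from the original answer: it uses the pre-2.x DoFn style seen elsewhere in this answer, a hypothetical ReadFileFn name, and it assumes the filenames point to plain-text files the workers can open with standard Java I/O (for gs:// paths you would substitute a GCS utility for the file read).
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical DoFn: takes a filename, reads the file with standard Java I/O,
// and emits one record per line, tagged with the filename it came from.
class ReadFileFn extends DoFn<String, String> {
    private static final long serialVersionUID = 0;

    @Override
    public void processElement(ProcessContext c) throws Exception {
        String fileName = c.element();
        for (String line : Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8)) {
            c.output(fileName + "\t" + line);
        }
    }
}

// Wiring it up, as in the pseudocode above:
// p.apply(Create.of(filenames))
//  .apply(ParDo.of(new ReadFileFn()))
//  .apply(...the rest of your pipeline...);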
Update based on latest SDK
Java (sdk 2.9.0):
Beam's TextIO readers do not give access to the filename itself. For these use cases we need to make use of FileIO to match the files and gain access to the information stored in the file name. Unlike TextIO, the reading of the file needs to be taken care of by the user in transforms downstream of the FileIO read. The result of a FileIO read is a PCollection of ReadableFile; the ReadableFile class contains the file name as metadata, which can be used along with the contents of the file.
FileIO does have a convenience method readFullyAsUTF8String(), which will read the entire file into a String object; note that this reads the whole file into memory first. If memory is a concern, you can work directly with the file using utility classes like FileSystems.
From the Beam documentation:
PCollection<KV<String, String>> filesAndContents = p
    .apply(FileIO.match().filepattern("hdfs://path/to/*.gz"))
    // withCompression can be omitted - by default compression is detected from the filename.
    .apply(FileIO.readMatches().withCompression(GZIP))
    .apply(MapElements
        // uses imports from TypeDescriptors
        .into(kvs(strings(), strings()))
        .via((ReadableFile f) -> KV.of(
            f.getMetadata().resourceId().toString(), f.readFullyAsUTF8String())));
Python (sdk 2.9.0):
For 2.9.0 of the Python SDK you will need to collect the list of URIs outside of the Dataflow pipeline and feed them in as a parameter to the pipeline, for example by making use of FileSystems to read in the list of files via a glob pattern and then passing that to a PCollection for processing.
Once fileio (see PR https://github.com/apache/beam/pull/7791/) is available, the following code would also be an option for Python.
import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    readable_files = (p
                      | fileio.MatchFiles('hdfs://path/to/*.txt')
                      | fileio.ReadMatches()
                      | beam.Reshuffle())
    files_and_contents = (readable_files
                          | beam.Map(lambda x: (x.metadata.path,
                                                x.read_utf8())))
One approach is to build a List<PCollection> where each entry corresponds to an input file, then use Flatten. For example, if you want to parse each line of a collection of files into a Foo object, you might do something like this:
public static class FooParserFn extends DoFn<String, Foo> {
    private String fileName;

    public FooParserFn(String fileName) {
        this.fileName = fileName;
    }

    @Override
    public void processElement(ProcessContext processContext) throws Exception {
        String line = processContext.element();
        // here you have access to both the line of text and the name of the file
        // from which it came.
    }
}
public static void main(String[] args) {
    ...
    List<String> inputFiles = ...;
    List<PCollection<Foo>> foosByFile =
        Lists.transform(inputFiles,
            new Function<String, PCollection<Foo>>() {
                @Override
                public PCollection<Foo> apply(String fileName) {
                    return p.apply(TextIO.Read.from(fileName))
                        .apply(ParDo.of(new FooParserFn(fileName)));
                }
            });
    PCollection<Foo> foos = PCollectionList.<Foo>empty(p).and(foosByFile).apply(Flatten.<Foo>pCollections());
    ...
}
One downside of this approach is that, if you have 100 input files, you'll also have 100 nodes in the Cloud Dataflow monitoring console. This makes it hard to tell what's going on. I'd be interested in hearing from the Google Cloud Dataflow people whether this approach is efficient.
I also hit the 100-input-files = 100-nodes situation on the Dataflow diagram when using code similar to @danvk's. I switched to the approach below, which resulted in all the reads being combined into a single block that you can expand to drill down into each file/directory that was read. The job also ran faster using this approach rather than the Lists.transform approach in our use case.
GcsOptions gcsOptions = options.as(GcsOptions.class);
List<GcsPath> paths = gcsOptions.getGcsUtil().expand(GcsPath.fromUri(options.getInputFile()));
List<String> filesToProcess = paths.stream().map(item -> item.toString()).collect(Collectors.toList());

PCollectionList<SomeClass> pcl = PCollectionList.empty(p);
for (String fileName : filesToProcess) {
    pcl = pcl.and(
        p.apply("ReadAvroFile" + fileName, AvroIO.Read.named("ReadFromAvro")
            .from(fileName)
            .withSchema(SomeClass.class)
        )
        .apply(ParDo.of(new MyDoFn(fileName)))
    );
}

// flatten the PCollectionList, combining all the PCollections together
PCollection<SomeClass> flattenedPCollection = pcl.apply(Flatten.pCollections());
This might be a very late post for the above question, but I wanted to add an answer using Beam's bundled classes.
This could also be seen as code extracted from the solution provided by @Reza Rokni.
PCollection<String> listOfFilenames =
    pipe.apply(FileIO.match().filepattern("gs://apache-beam-samples/shakespeare/*"))
        .apply(FileIO.readMatches())
        .apply(
            MapElements.into(TypeDescriptors.strings())
                .via(
                    (FileIO.ReadableFile file) -> {
                        String f = file.getMetadata().resourceId().getFilename();
                        System.out.println(f);
                        return f;
                    }));

pipe.run().waitUntilFinish();
The above PCollection<String> will contain the list of files available in the provided directory.
I was struggling with the same use case while using a wildcard to read files from GCS, but I also needed to modify the collection based on the file name. The key is to use ReadFromTextWithFilename instead of ReadFromText. In Java you already have a way out and can use:
String filename =context.element().getMetadata().resourceId().getCurrentDirectory().toString()
inside your processElement method.
But for Python the technique below will work:
-> Use beam.io.ReadFromTextWithFilename for reading the wildcard path from GCS.
-> As per the documentation, ReadFromTextWithFilename returns the file's name along with the file's content.
Below is the code snippet:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


class GetFileNameFromWildcard(beam.DoFn):
    def process(self, element, *args, **kwargs):
        file_path, content = element
        schema = ["id", "name", "mob", "email", "dept", "store"]
        store_name = file_path.split("/")[-2]
        content_list = content.split(",")
        content_list.append(store_name)
        out_dict = dict(zip(schema, content_list))
        print(out_dict)
        yield out_dict


def run():
    pipeline_options = PipelineOptions()
    with beam.Pipeline(options=pipeline_options) as p:
        # saving main session so that it can load the global namespace on the Cloud Dataflow worker
        init = (p
                | 'Begin Pipeline With Initiator' >> beam.Create(["pcollection initializer"])
                | 'Read From GCS' >> beam.io.ReadFromTextWithFilename(
                    "gs://<bkt-name>/20220826/*/dlp*", skip_header_lines=1)
                | beam.ParDo(GetFileNameFromWildcard())
                | beam.io.WriteToText('df_out.csv'))

External files with locale messages with a page in tapestry 5

We are using Tapestry 5.4-beta-4. My problem is:
I need to keep files with locale data in an external location and under a different file name than Tapestry's usual app.properties or pageName_locale.properties. Those files pool messages that should then be used on all pages as required (so not Tapestry's usual one-page/one-message-file approach). The files are retrieved and loaded into Tapestry during application startup. Currently I am doing it like this:
@Contribute(ComponentMessagesSource.class)
public void contributeComponentMessagesSource(OrderedConfiguration<Resource> configuration, List<String> localeFiles, List<String> languages) {
    for (String language : languages) {
        for (String fileName : localeFiles) {
            String localeFileName = fileName + "_" + language + ".properties";
            Resource resource = new Resource(localeFileName);
            configuration.add(localeFileName, resource, "before:AppCatalog");
        }
    }
}
The above code works in that the Messages object injected into pages is populated with all the messages. Unfortunately these are only the messages in the default locale (the first on the tapestry.supported-locales list). This never changes.
We want the locale to be set to the browser locale, sent to the service in the header. This works for those messages passed to Tapestry in the traditional way (through app.properties) but not for those set in the above code. Actually, if the browser language changes, the Messages object changes too, but only the keys that were in app.properties are assigned new values. Keys from the external files always keep the default values.
My guess is that Tapestry doesn't know which keys from the Messages object it should refresh (the keys from external files are not being linked to any page).
Is there some way that this could be solved with us keeping the current file structure?
I think the problem is that you add the language (locale) to the file name that you contribute to ComponentMessagesSource.
For example if you contribute
example_de.properties
Tapestry tries to load
example_de_<locale>.properties
If that file does not exist, it will fall back to the original file (i.e. example_de.properties).
Instead you should contribute
example.properties
and Tapestry will add the language to the file name automatically (see MessagesSourceImpl.findBundleProperties() for actual implementation).
@Contribute(ComponentMessagesSource.class)
public void contributeComponentMessagesSource(OrderedConfiguration<Resource> configuration, List<String> localeFiles, List<String> languages) {
    for (String language : languages) {
        for (String fileName : localeFiles) {
            String localeFileName = fileName + ".properties";
            Resource resource = new Resource(localeFileName);
            configuration.add(localeFileName, resource, "before:AppCatalog");
        }
    }
}

How to add Apache Any23 RDF Statements to Apache Jena?

Basically, I use the Any23 distiller to extract RDF statements from files embedded with RDFa (the actual files were created by DBpedia Spotlight using the xhtml+xml output option). By using the Any23 RDFa distiller I can extract the RDF statements (I also tried using Java-RDFa but I could only extract the prefixes!). However, when I try to pass the statements to a Jena model and print the results to the console, nothing happens!
This is the code I am using :
File myFile = new File("T1");
Any23 runner = new Any23();
DocumentSource source = new FileDocumentSource(myFile);
ByteArrayOutputStream outA = new ByteArrayOutputStream();
InputStream decodedInput = new ByteArrayInputStream(outA.toByteArray()); // convert the output stream to input so I can pass it to the Jena model
TripleHandler writer = new NTriplesWriter(outA);
try {
    runner.extract(source, writer);
} finally {
    writer.close();
}
String ttl = outA.toString("UTF-8");
System.out.println(ttl);
System.out.println();
System.out.println();
Model model = ModelFactory.createDefaultModel();
model.read(decodedInput, null, "N-TRIPLE");
model.write(System.out, "TURTLE"); // prints nothing!
Can anyone tell me what I have done wrong? Probably multiple things!
Is there any easy way I can extract the subjects of the RDF statements directly from Any23 (bypassing Jena)?
As I am quite inexperienced in programming any help would be really appreciated!
You are calling
InputStream decodedInput=new ByteArrayInputStream(outA.toByteArray()) ;
before calling Any23 to extract the triples. At the point of that call, outA is still empty, so the input stream contains no data.
Move this line to after the try/finally block, once the extraction has written into outA.
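To make the fix concrete, here is the question's own code reordered accordingly (a sketch only, assuming the same imports and the same "T1" file); the ByteArrayInputStream is created only after the extraction has written the N-Triples into outA:
File myFile = new File("T1");
Any23 runner = new Any23();
DocumentSource source = new FileDocumentSource(myFile);
ByteArrayOutputStream outA = new ByteArrayOutputStream();
TripleHandler writer = new NTriplesWriter(outA);
try {
    runner.extract(source, writer);
} finally {
    writer.close();
}

// Only now does outA contain the extracted N-Triples.
InputStream decodedInput = new ByteArrayInputStream(outA.toByteArray());

Model model = ModelFactory.createDefaultModel();
model.read(decodedInput, null, "N-TRIPLE");
model.write(System.out, "TURTLE");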

wicket: how to stream a resource from a database

I'm trying to generate a sitemap dynamically for a large web site with thousands of pages.
Yes, I have considered generating the sitemap file offline and simply serving it statically, and I might end up doing exactly that. But I think this is a generally useful question:
How can I stream large data from a DB in Wicket?
I followed the instructions at the Wicket SEO page, and was able to get a dynamic sitemap implementation working using a DataProvider. But it doesn't scale- it runs out of memory when it calls my DataProvider's iterator() method with a count arg equal to the total number of objects I'm returning, rather than iterating over them in chunks.
I think the solution lies somewhere with WebResource/ResourceStreamRequestTarget. But those classes expect an IResourceStream, which ultimately boils down to providing an InputStream implementation, which deals in bytes rather than DB records. I wouldn't know how to implement the length() method in such a case, as that would require visiting every record ahead of time to compute the overall length.
From the doc of the IResourceStream.length() method:
/**
* Gets the size of this resource in bytes
*
* TODO 1.5: rename to lengthInBytes() or let it return some sort of size object
*
* @return The size of this resource in the number of bytes, or -1 if unknown
*/
long length();
So I think it would be OK if your IResourceStream implementation reports that the length is unknown and streams the data directly as you get the records from the database.
You could return -1, indicating an unknown length, or you could write the result to a memory buffer or to disk before rendering it to the client.
You could also use this file as a cache, so that you don't need to regenerate it every time this resource is requested (remember you have to handle concurrent requests, though). Dedicated caching solutions (e.g. memcache, ehcache, etc.) can also be considered.
It may be cleaner than publishing a static file, although static files are probably better if performance is critical.
I ended up using an AbstractResourceStreamWriter subclass:
public class SitemapStreamWriter extends AbstractResourceStreamWriter
{
    @Override
    public void write(OutputStream output)
    {
        String HEAD = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" +
                      "<urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\"\n" +
                      "        xmlns:wicket=\"http://wicket.apache.org/dtds.data/wicket-xhtml1.4-strict.dtd\">\n";
        try
        {
            output.write(HEAD.getBytes());
            // write out a <loc> entry for each of my pages here
            output.write("</urlset>\n".getBytes());
        }
        catch (IOException e)
        {
            throw new RuntimeException(e.getMessage(), e);
        }
    }
}
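For completeness, one way such a writer might be served in Wicket 1.4 (not part of the original answer, so treat the class and constructor choices as assumptions) is to hand it to a ResourceStreamRequestTarget, for example from a Link's onClick:
// Hypothetical usage sketch: serve the dynamically generated sitemap
// by pointing the current request at a resource stream target.
Link<Void> sitemapLink = new Link<Void>("sitemap")
{
    @Override
    public void onClick()
    {
        getRequestCycle().setRequestTarget(
            new ResourceStreamRequestTarget(new SitemapStreamWriter(), "sitemap.xml"));
    }
};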
