In PySpark, reading CSV files fails if even one path does not exist. How can we avoid this?

In PySpark, reading CSV files from different paths fails if even one path does not exist.
Logs = spark.read.load(Logpaths, format="csv", schema=logsSchema, header="true", mode="DROPMALFORMED");
Here Logpaths is an array that contains multiple paths, and these paths are created dynamically depending on the given startDate and endDate range. If Logpaths contains 5 paths and the first 3 exist but the 4th does not, the whole extraction fails. How can I avoid this in PySpark, or how can I check their existence before reading?
In Scala I did this by checking file existence and filtering out the non-existent paths using the Hadoop HDFS FileSystem globStatus function.
val Path = "/bilal/2018.12.16/logs.csv"
val hadoopConf = new org.apache.hadoop.conf.Configuration()
val fs = org.apache.hadoop.fs.FileSystem.get(hadoopConf)
val fileStatus = fs.globStatus(new org.apache.hadoop.fs.Path(Path))

So I found what I was looking for. Like the code I posted in the question, which can be used in Scala for the file existence check, the code below can be used in PySpark.
fs = sc._jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
fs.exists(sc._jvm.org.apache.hadoop.fs.Path("bilal/logs/log.csv"))
This is essentially the same code used in Scala: in both cases we are calling the Hadoop Java library, and that Java code runs on the JVM that Spark itself runs on.
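Putting the two together, here is a minimal sketch (assuming sc, spark, Logpaths and logsSchema from the question) that filters out the missing paths before reading:
# Keep only the paths that exist, then read those.
jvm = sc._jvm
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
existing_paths = [p for p in Logpaths if fs.exists(jvm.org.apache.hadoop.fs.Path(p))]
if existing_paths:
    Logs = spark.read.load(existing_paths, format="csv", schema=logsSchema, header="true", mode="DROPMALFORMED")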

Related

Apache Beam: Wait for the AvroIO write step to finish before starting the ImportTransform Dataflow template

I'm using Apache Beam to create a pipeline that basically reads an input file, converts it to Avro, writes the Avro files to a bucket, and then imports those Avro files into Spanner using a Dataflow template.
The problem I'm facing is that the last step (importing the Avro files into the database) starts before the previous one (writing the Avro files to the bucket) is done.
I tried to add the Wait.on function, but that only works if the step returns a PCollection, and writing the files with AvroIO returns PDone.
Example of the Code:
// Step 1: Read files
PCollection<String> lines = pipeline.apply("Reading Input Data exported from Cassandra",
        TextIO.read().from(options.getInputFile()));
// Step 2: Convert to Avro
lines.apply("Write Item Avro File",
        AvroIO.writeGenericRecords(spannerItemAvroSchema).to(options.getOutput()).withSuffix(".avro"));
// Step 3: Import to the database
pipeline.apply(new ImportTransform(
        spannerConfig,
        options.getInputDir(),
        options.getWaitForIndexes(),
        options.getWaitForForeignKeys(),
        options.getEarlyIndexCreateFlag()));
Again, the problem is that Step 3 starts before Step 2 is done.
Any ideas?
This is a flaw in the API; see, for example, a recent discussion of this on the Beam dev list. The only solutions for now are either to fork AvroIO so that it returns a PCollection, or to run two pipelines sequentially.
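For the second option, here is a rough sketch of the two-pipelines approach, shown with the Beam Python SDK for brevity (the Java pipeline in the question follows the same shape; the bucket paths are placeholders and the Spanner import step is only indicated):
import apache_beam as beam

input_file = "gs://my-bucket/input/export.txt"   # placeholder
avro_prefix = "gs://my-bucket/avro/items"        # placeholder

# Pipeline 1: read the export and write the intermediate files, then block until it finishes.
p1 = beam.Pipeline()
p1 | "Read input" >> beam.io.ReadFromText(input_file) | "Write files" >> beam.io.WriteToText(avro_prefix)
p1.run().wait_until_finish()

# Pipeline 2: starts only after pipeline 1 has finished, so the files already exist on the bucket.
p2 = beam.Pipeline()
p2 | "Files to import" >> beam.Create([avro_prefix + "*"])  # ... apply the import transform here ...
p2.run().wait_until_finish()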

How to find the file paths of all namespaces loaded in an application using Tcl/Tk under Unix?

In the existing flow, a whole bunch of namespaces get loaded when running a script job.
However, if I want to check and trace the usage of some command in a namespace, I need to find the script path of that particular namespace.
Is there some way to get that? Specifically, I'm talking about PrimeTime scripts.
Technically, namespaces don't have script paths. But we can do something close enough:
proc report_current_file {call code result op} {
    if {$code == 0} {
        # If the [proc] call was successful...
        set cmd [lindex $call 1]
        set qualified_cmd [uplevel 1 [list namespace which $cmd]]
        set file [file normalize [info script]]
        puts "Defined $qualified_cmd in $file"
    }
}
trace add execution proc leave report_current_file
It's not perfect if you've got procedures creating procedures dynamically — the current file might be wrong — but that's fortunately not what most code does.
Another option that might work for you is to use tcl::unsupported::getbytecode, which produces a lot of information in machine-readable format (a dictionary). One of the pieces of information is the sourcefile key. Here's an example running interactively on my machine:
% parray tcl_platform
tcl_platform(byteOrder) = littleEndian
tcl_platform(engine) = Tcl
tcl_platform(machine) = x86_64
tcl_platform(os) = Darwin
tcl_platform(osVersion) = 20.2.0
tcl_platform(pathSeparator) = :
tcl_platform(platform) = unix
tcl_platform(pointerSize) = 8
tcl_platform(threaded) = 1
tcl_platform(user) = dkf
tcl_platform(wordSize) = 8
% dict get [tcl::unsupported::getbytecode proc parray] sourcefile
/opt/local/lib/tcl8.6/parray.tcl
Note that the procedure has to be already defined for this to work. And if Tcl's become confused about what file the code was in (because of dynamic programming trickery) then that key is absent.

What does "Building tree for null using TinyBuilder" mean with Saxon extension function and using -t option?

With my Saxon extension function code I get the following log messages:
> java -cp ./saxon-he-10.2.jar:./ net.sf.saxon.Transform -t -init:MyInitializer -xsl:./exttest.xsl -o:./out.xml -it:initialtemplate
Saxon-HE 10.2J from Saxonica
Java version 14.0.2
Stylesheet compilation time: 305.437652ms
Processing (no source document) initial template = initialtemplate
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for null using class net.sf.saxon.tree.tiny.TinyBuilder
Tree built in 0.850325ms
Tree size: 3 nodes, 0 characters, 0 attributes
Execution time: 29.965658ms
Memory used: 14Mb
It is not clear to me whether Building tree for null using class net.sf.saxon.tree.tiny.TinyBuilder means that there is something wrong with my code (https://gitlab.com/ms452206/socode20200915) and, if so, how to avoid it.
It's a poor message and I will improve it; the "null" would be the base URI (or systemId) of the document if it had one. The fact that the document has no known base URI could be a predictor of trouble downstream, since some things rely on a document having a known base URI; but it's not an error in itself.
It's most likely to happen if you build a document using a JAXP Source object whose systemId property is null, which is what you have done when you wrote:
new StreamSource(new StringReader("<foo/>"))
This is likely to cause failures only if the document contains relative URIs (for example in external entity references or href or xml:base attributes), which is not the case with your simple XML document.

Different checksum results for jar files compiled on subsequent builds?

I am working on verifying the jar files present on remote Unix boxes against those built on my local machine (Windows & Cygwin) with the same JVM.
As a POC, I am trying to verify whether the same checksum is produced for jar files generated on my machine by consecutive builds. I tried the following:
Generated the jar file the first time using an Ant script
Calculated the checksum (e.g. "xyz abc")
Generated the jar file again with the same Ant script without changing anything
Got a different checksum but the same byte count (e.g. "xvw abc")
I am not sure how Java internally produces the class files and then the jar files. Can someone please help me understand the points below?
Does the cksum utility of Unix/Cygwin consider the timestamp of the file when computing the value?
Will the checksum be different for the compiled class files / jar file if we keep everything else the same [compiler version + source code + machine + environment]?
Answer to question 1: cksum doesn't consider the timestamp of the archive (e.g. the jar file) itself, but it does consider the timestamps of the files inside the jar file.
Answer to question 2: The checksums of the individual class files will be the same if all other things are the same (source code, compiler, etc.), but the checksums of the jar files will be different. Causes of the differences can be the timestamps of the files inside the jar file, or files being put into the archive in a different order (e.g. caused by parallel builds).
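To see this on your own builds, you can compare the checksums of the individual entries inside the two jars instead of the whole archives; here is a small Python sketch (the two jar paths are placeholders):
import hashlib
import zipfile

def entry_digests(jar_path):
    # Map each entry name inside the jar to the SHA-256 of its uncompressed bytes.
    with zipfile.ZipFile(jar_path) as jar:
        return {name: hashlib.sha256(jar.read(name)).hexdigest() for name in jar.namelist()}

build1 = entry_digests("build1/app.jar")   # placeholder: first build
build2 = entry_digests("build2/app.jar")   # placeholder: second build

# Entries whose bytes differ between the builds; for .class files this should be empty.
changed = [name for name in build1 if name in build2 and build1[name] != build2[name]]
print("entries with different content:", changed or "none")

# The archives as a whole still differ, because of entry timestamps and ordering.
for path in ("build1/app.jar", "build2/app.jar"):
    with open(path, "rb") as f:
        print(path, hashlib.sha256(f.read()).hexdigest())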
If you want to create a reproducible build with Gradle, you can do so with the config below:
tasks.withType(AbstractArchiveTask) {
    preserveFileTimestamps = false
    reproducibleFileOrder = true
}
Maven allows something similar; sorry, I don't know how to do this with Ant.
More info here:
https://dzone.com/articles/reproducible-builds-in-java
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74682318

Neo4j: Inserting 7k nodes is slow (Spring Data Neo4j / SpringRestGraphDatabase)

I'm building an application where my users can manage dictionaries. One feature is uploading a file to initialize or update the dictionary's content.
The part of the structure I'm focusing on for a start is Dictionary -[:CONTAINS]->Word.
Starting from an empty database (Neo4j 1.9.4, but I also tried 2.0.0M5), accessed via Spring Data Neo4j 2.3.1 in a distributed environment (therefore using SpringRestGraphDatabase, but testing with localhost), I'm trying to load 7k words into 1 dictionary. However, I can't get it done in less than 8-9 minutes on a Linux box with a Core i7, 8 GB of RAM and an SSD drive (ulimit raised to 40000).
I've read lots of posts about loading/inserting performance using REST and I've tried to apply the advice I found, but without any more luck. The BatchInserter tool doesn't seem to be a good option for me due to my application constraints.
Can I hope to load 10k nodes in a matter of seconds rather than minutes?
Here is the code I came up with after all my reading:
Map<String, Object> dicProps = new HashMap<String, Object>();
dicProps.put("locale", locale);
dicProps.put("category", category);
Dictionary dictionary = template.createNodeAs(Dictionary.class, dicProps);
Map<String, Object> wordProps = new HashMap<String, Object>();
Set<Word> words = readFile(filename);
for (Word gw : words) {
    wordProps.put("txt", gw.getTxt());
    Word w = template.createNodeAs(Word.class, wordProps);
    template.createRelationshipBetween(dictionary, w, Contains.class, "CONTAINS", true);
}
I solved this kind of problem by creating a CSV file and then having Neo4j read it. The following steps are needed:
Write a class which takes the input data and creates a CSV file from it (it can be one file per node kind, or you can even create a file which will be used to build relationships); see the sketch after the Cypher samples below.
In my case I also created a servlet which allows Neo4j to read that file over HTTP.
Create the proper Cypher statements to read and parse that CSV file. Here are some samples that I use (if you use Spring Data, also remember about labels):
A simple one:
load csv with headers from {fileUrl} as line
merge (:UserProfile:_UserProfile {email: line.email})
A more complicated one:
load csv with headers from {fileUrl} as line
match (c:Calendar {calendarId: line.calendarId})
merge (a:Activity:_Activity {eventId: line.eventId})
on create set a.eventSummary = line.eventSummary,
    a.eventDescription = line.eventDescription,
    a.eventStartDateTime = toInt(line.eventStartDateTime),
    a.eventEndDateTime = toInt(line.eventEndDateTime),
    a.eventCreated = toInt(line.eventCreated),
    a.recurringId = line.recurringId
merge (a)-[r:EXPORTED_FROM]->(c)
return count(r)
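For step 1, here is a small sketch of writing such a CSV with Python's csv module (the file name and the txt header are placeholders; in the question's case the rows would come from the uploaded word file):
import csv

words = ["first", "second", "third"]   # placeholder: the words read from the uploaded file

with open("words.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["txt"])
    writer.writeheader()
    for w in words:
        writer.writerow({"txt": w})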
Try the following:
Use the native Neo4j API rather than spring-data-neo4j when performing batch operations.
Commit in batches, e.g. every 500 words.
NOTE: There are certain properties (type) added by SDN which will be missing when using the native approach.
