Use sentence transformers models with Apache Beam - google-cloud-dataflow

I have an apache beam pipeline that works flawlessly with a DirectRunner, but not with a DataflowRunner :
When using a DataflowRunner I get a "Error 413 (Request entity too large)"
From what I understand, it is because the pipeline file is too large. (I get it with the following option : --dataflow_job_file=gs://...
And this is caused by the model I use :
embeding_model = SentenceTransformer('sentence-transformers/paraphrase-MiniLM-L3-v2')
Have anyone already experimented something similar ?

You are correct in the assumption that the pipeline file is too large--the direct runner doesn't have that limitation but I believe Dataflow limits the JSON to something like 20mb.
I'm guessing you're embedding the model into that JSON? You'll probably be better off loading it from an external source. For example, RunInference in the Python SDK allows for loading custom models.

Related

Automated ansible output parsing in jenkins pipeline

How do you automatically parse ansible warnings and errors in your jenkins pipeline jobs?
I greatly enjoy the power of leveraging in ansible in jenkins when it works. Upon a failure, the hunt to locate the actual error can be challenging.
I use WarningsNG which supports custom parsers (and allows their programmatic generation)
Do you know of any plugins or addons that already transform these logs into the kind charts similar to WarningsNG?
I figured I'd ask as I go off into deep regex land and make my own.
One good way to achieve this seems to be the following:
select an existing structured output ansible callback plugin (json, junit and yaml are all viable) . I selected junit as I can play with the format to get a really nice view into the playbook with errors reported in a very obvious way.
fork that GPL file (yes, so be careful with that license) to augment with the following:
store output as file
implement the missing callback methods (the three mentioned above do not implement the v2...item callbacks.
forward events to the default or debug callback to ensure operators see something when they execute the plan
add a secrets cleaner - if you use jenkins credentials-binding-plugin it will hide secrets from the console, it will not not hide secrets within stored files. You'll need to handle that in your playbook or via some groovy code (if groovy, try{...} finally { clean } seems a good pattern)
Snippet - forewarding to default callback
from ansible.plugins.callback.default import CallbackModule as CallbackModule_default
...
class CallbackModule(CallbackBase):
CALLBACK_VERSION = 2.0
CALLBACK_TYPE = 'stdout'
CALLBACK_NAME = 'json'
def __init__(self, display=None):
super(CallbackModule, self).__init__(display)
self.default_callback = CallbackModule_default()
...
def v2_on_file_diff(self, result):
self.default_callback.v2_on_file_diff(result)
... do whatever you'd want to ensure the content appears in the json file

Apache Beam and avro : Create a dataflow pipeline without schema

I am building a dataflow pipeline with Apache beam. Below is the pseudo code:
PCollection<GenericRecord> rows = pipeline.apply("Read Json from PubSub", <some reader>)
.apply("Convert Json to pojo", ParDo.of(new JsonToPojo()))
.apply("Convert pojo to GenericRecord", ParDo.of(new PojoToGenericRecord()))
.setCoder(AvroCoder.of(GenericRecord.class, schema));
I am trying to get rid of setting the coder in the pipeline as schema won't be known at pipeline creation time (it will be present in the message).
I commented out the line that sets the coder and got an Exception saying that default coder is not configured. I used one argument version of of method and got the following Exception:
Not a Specific class: interface org.apache.avro.generic.GenericRecord
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:285)
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:594)
at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:218)
at org.apache.avro.specific.SpecificData$2.load(SpecificData.java:215)
at avro.shaded.com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3568)
at avro.shaded.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2350)
at avro.shaded.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2313)
at avro.shaded.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2228)
... 9 more
Is there any way for us to supply the coder at runtime, without knowing the schema beforehand?
This is possible. I recommend the following approach:
Do not use an intermediate collection of type GenericRecord. Keep it as a collection of your POJOs.
Write some transform that extracts the schema of your data and makes it available as a PCollectionView<however you want to represent the schema>.
When writing to BigQuery, write your PCollection<YourPojo> via write().to(DynamicDestinations), and when writing to Avro, use FileIO.write() or writeDynamic() in combination with AvroIO.sinkViaGenericRecords(). Both of these can take a dynamically computed schema from a side input (that you computed above).

Dataflow GroupBy -> multiple outputs based on keys

Is there any simple way that I can redirect the output of GroupBy into multiple output files based on Group keys?
Bin.apply(GroupByKey.<String, KV<Long,Iterable<TableRow>>>create())
.apply(ParDo.named("Print Bins").of( ... )
.apply(TextIO.Write.to(*Output file based on key*))
If Sink is the solution, would you please share a sample code w/ me?
Thanks!
Beam 2.2 will include an API to do just that - TextIO.write().to(DynamicDestinations), see source. For now, if you'd like to use this API, you can use the 2.2.0-SNAPSHOT version. Note that this API is experimental and might change in Beam 2.3 or onwards.

Machine parseable error messages

(From https://groups.google.com/d/msg/bazel-discuss/cIBIP-Oyzzw/caesbhdEAAAJ)
What is the recommended way for rules to export information about failures such that downstream tools can include them in UIs.
Example use case:
I ran bazel test //my:target, and one of the actions for //my:target fails because there is an unknown variable "usrname" in my/target.foo at line 7 column 10. It would also like to report that "username" is a valid variable and this is a possible misspelling. And thus wants to suggest an addition of an "e" character.
One way I have thought to do this is to have a separate file that my action produces //my:target.errors that is in a separate output group and have it write machine parseable data there in addition to human readable data on stdout.
I can then find all of these files and parse the data in them in downstream tools.
Is there any prior work on this, or does everything just try to parse the human readable output?
I recommend running the error checkers as extra actions.
I don't think Bazel currently has hooks for custom error handlers like you describe. Please consider opening a feature request: https://github.com/bazelbuild/bazel/issues/new

Loading a .trig file with inference to Fuseki using the 'tdbloader" bulk loader?

I am currently writing some Java code extracting some data and writing them as Linked Data, using the TRIG syntax. I am now using Jena, and Fuseki to create a SPARQL endpoint to query and visualize this data.
The data is written so that each source dataset gives me a .trig file, containing one named graph. So I want to load thoses files in Fuseki. Except that it doesn't seem to understand the Trig syntax...
If I remove the named graphs, and rename the files as .ttl, everything loads perfectly in the default graphs. But if I try to import trig files :
using Fuseki's webapp uploader, it either crashes ("Can't make new graphs") or adds nothing except the prefixes, as if the graphs other than the default ones could not be added (the logs say nothing helpful except the error code and description).
using Java code, the process is too slow. I used this technique : " Loading a .trig file into TDB? " but my trig files are pretty big, so this solution is not very good for me.
So I tried to use the bulk loader, the console command 'tdbloader'. This time everything seems fine, but in the webapp, there is still no data.
You can see the process going fine here : Quads are added just fine
But the result still keeps only the default graph and its original data : Nothing is added
So, I don't know what to do. The guys behind Jena and Fuseki suggested not to use the bulk loader in the Java code (rather than the command line tool), so that's one solution I guess I'd like to avoid.
Did I miss something obvious about how to load TRIG files to Fuseki? Thanks.
UPDATE :
As it seemed to be a problem in my configuration (see the comments of this post for a link to my config file; I cannot post more than 2 links), I tried to add some kind of specification for some named graphs I would like to see added to the dataset on Fuseki.
I added code to link (with ja:namedgraph) external graphs that I added via tdbloader. This seems to work. Great!
Now another problem : there's no inference, even when my config file specifies an Inference model... I set that queries should be applied with named graphs merged as the default graph, but this does not seem to carry the OWL Inference rules...So simple queries work, but I have 1/ to specify the graph I query (with "FROM") and 2/ no inference in my data.
The two methods are to use the tdb bulkloader offline or you can POST data into the dataset directly. (i.e. HTTP POST operations to http://localhost:3030/ds).
You can test where your graph are there with a query like
SELECT (count(*) AS ?C) { GRAPH ?g { ?s ?p ?o } }
The named graphs will show up when the Fuseki server is started unless your configuration of the SPARQL services only exports one graph.

Resources