Quartz.net transactions? - quartz.net

Is there a way to control the transactions used by Quartz.net?
We use Quartz to schedule jobs, but we keep metadata about the jobs in a separate table in a more readable format for the UI. Obviously it would be bad if the network failed right after the Quartz job was scheduled but before we could write the metadata to the other table. I'd like to wrap everything in a single transaction, but I don't know how to get Quartz to use it. Is this supported?

Related

Creating a structured Jenkins Failing Test Report

The situation right now:
Every Monday morning I manually check the JUnit results of the Jenkins jobs that ran over the weekend. Using the Project Health plugin, I can filter on the timeboxed runs. I then copy and paste this table into Excel and go over each test case's output log to see what failed and note down the failure cause. Every weekend gets another tab in Excel. All this makes traceability a nightmare and causes time-consuming manual labor.
What I am looking for (and hoping that already exists to some degree):
A database that stores all failed tests for all jobs I specify. It parses the output log of a failed test case and based on some regex applies a 'tag' e.g. 'Audio' if a test regarding audio is failing. Since everything is in a database I could make or use a frontend that can apply filters at will.
For example, if I want to see all tests regarding audio failing over the weekend (over multiple jobs and multiple runs) I could run a query that returns all entries with the Audio tag.
I'm OK with manually tagging failed tests and their causes, as well as writing my own frontend. Is there a way (the Jenkins API, perhaps?) to grab the failed tests (JUnit format and Jenkins plugin) and create such a system myself if it does not exist?
A good question. Unfortunately, it is very difficult in Jenkins to get such "meta statistics" that span several jobs. There is no existing solution for that.
Basically, I see two options for getting what you want:
1. Post-processing Jenkins-internal data to get the statistics that you need.
2. Feeding a database on-the-fly with build execution data.
The first option basically means automating the tasks that you do manually right now.
you can use external scripting (Python, Perl,...) to process Jenkins-internal data (via REST or CLI APIs, or directly reading on-disk data)
or you run Groovy scripts internally (which will be faster and more powerful)
It's the most direct way to go. However, depending on the statistics that you need and on your requirements regarding data persistence, you may want to go for...
The second option: more flexible and completely decoupled from Jenkins' internal data storage. You could implement it by
introducing a Groovy post-build step for all your jobs
that script parses job results and puts data of interest in a custom, external database
You'd then get your statistics by querying that database.
Typically, you'd start with the first option. Once requirements grow, you'd slowly migrate to the second one (e.g., by collecting internal data via explicit post-processing scripts, putting that into a database, and then running queries on it). You'll want to cut this migration phase as short as possible, as it eventually requires the effort of implementing both options.
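A minimal sketch of the database idea, assuming the failed tests are available as a JUnit-style XML report. The sample report, the tag names, the regex rules, the job name, and the table layout are all hypothetical; a real setup would fetch the report from Jenkins and use a persistent database instead of an in-memory one.

```python
import re
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample report; a real one would come from a Jenkins build.
SAMPLE_REPORT = """<testsuite name="weekend">
  <testcase classname="media.AudioTest" name="test_playback">
    <failure message="assert failed">ALSA audio device not found</failure>
  </testcase>
  <testcase classname="net.HttpTest" name="test_get">
    <failure message="timeout">Connection timed out after 30s</failure>
  </testcase>
  <testcase classname="core.MathTest" name="test_add"/>
</testsuite>"""

# Hypothetical tagging rules: tag name -> regex applied to the failure log.
TAG_RULES = {
    "Audio": re.compile(r"audio|alsa", re.IGNORECASE),
    "Network": re.compile(r"timed out|connection", re.IGNORECASE),
}


def failed_tests(report_xml):
    """Yield (test name, failure log) for every failed test case."""
    root = ET.fromstring(report_xml)
    for case in root.iter("testcase"):
        failure = case.find("failure")
        if failure is not None:
            name = f"{case.get('classname')}.{case.get('name')}"
            yield name, (failure.text or "")


def tags_for(log):
    """Return every tag whose regex matches the log, or ['Untagged']."""
    matched = [tag for tag, rx in TAG_RULES.items() if rx.search(log)]
    return matched or ["Untagged"]


def store_failures(conn, job, build, report_xml):
    """Insert one row per (failed test, tag) into the failures table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failures "
        "(job TEXT, build INTEGER, test TEXT, tag TEXT, log TEXT)"
    )
    for test, log in failed_tests(report_xml):
        for tag in tags_for(log):
            conn.execute(
                "INSERT INTO failures VALUES (?, ?, ?, ?, ?)",
                (job, build, test, tag, log),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
store_failures(conn, "nightly-media", 17, SAMPLE_REPORT)

# All failures tagged 'Audio', across every stored job and build.
audio_failures = conn.execute(
    "SELECT job, build, test FROM failures WHERE tag = 'Audio'"
).fetchall()
```

Storing one row per tag keeps the "all Audio failures over the weekend" query a plain WHERE clause, at the cost of duplicating the log text when a failure matches several tags.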
You may want to have a look at couchdb-statistics. It is far from a perfect fit, but at least seems to do partially what you want to achieve.

Is google dataflow BQ/BT Write atomic per job?

Maybe I'm bad at searching, but I couldn't find an answer in the documentation, so I want to try my luck here.
Say I have a Dataflow job that writes to BigQuery or Bigtable, and the job fails. Will Dataflow be able to roll back to the state before it started, or might there simply be partial data in my table?
I know that writes to GCS do not seem to be atomic: partial output partitions are produced along the way while the job is running.
However, I have tried dumping data into BigQuery with Dataflow, and it seems that the output table is not exposed to users until the job reports success.
In Batch, Cloud Dataflow uses the following procedure for BigQueryIO.Write.to("some table"):
1. Write all data to a temporary directory on GCS.
2. Issue a BigQuery load job with an explicit list of all the temporary files containing the rows to be written.
If there are failures when the GCS writes are only partially complete, we will recreate the temp files on retry. Exactly one complete copy of the data will be produced by step 1 and used for loading in step 2, or the job will fail before step 2.
Each BigQuery load job, as in William V's answer, is atomic. The load job will succeed or fail, and if it fails there will be no data written to BigQuery.
For slightly more depth, Dataflow also uses a deterministic BigQuery job id (like dataflow_job_12423423) so that if the Dataflow code monitoring the load job fails and is retried we will still have exactly-once write semantics to BigQuery.
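The effect of a deterministic job id can be illustrated with a toy model. This is a sketch only: `FakeLoadService` is a hypothetical in-memory stand-in, not the BigQuery API, and the id format and step name are made up for illustration. The assumption it encodes is that a job id can be used only once, so a resubmission with the same id cannot load the data a second time.

```python
class FakeLoadService:
    """Toy stand-in for a load-job service (hypothetical, not the BigQuery API)."""

    def __init__(self):
        self.jobs = {}   # job id -> number of rows loaded by that job
        self.table = []  # rows visible to readers

    def insert_load_job(self, job_id, rows):
        # Assumption: a job id is single-use, so a repeat submission is rejected.
        if job_id in self.jobs:
            return "already_exists"
        self.jobs[job_id] = len(rows)
        self.table.extend(rows)  # all rows become visible together
        return "done"


def deterministic_job_id(dataflow_job_id, step_name):
    """Derive the same load-job id on every attempt of the same write step."""
    return f"dataflow_job_{dataflow_job_id}_{step_name}"


def write_with_retries(service, dataflow_job_id, step_name, rows, attempts):
    """Submit the load once per attempt, as a retried monitor would."""
    job_id = deterministic_job_id(dataflow_job_id, step_name)
    return [service.insert_load_job(job_id, rows) for _ in range(attempts)]


service = FakeLoadService()
outcomes = write_with_retries(service, 12423423, "write_events", ["a", "b"], attempts=2)
```

Here the second attempt is turned away because the id is already taken, so the table ends up with exactly one copy of the rows even though the write was submitted twice.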
Together, this design means that each BigQueryIO.Write transform in your pipeline is atomic. In a common case, you have only one such write in your job, and so if the job succeeds the data will be in BigQuery and if the job fails there will be no data written.
However: Note that if you have multiple BigQueryIO.Write transforms in a pipeline, some of the writes may have successfully completed before the Dataflow job fails. The completed writes will not be reverted when the Dataflow job fails.
This means that you may need to be careful when rerunning a Dataflow pipeline with multiple sinks, in order to ensure correctness in the presence of committed writes from the earlier failed job.
I can speak for Bigtable. Bigtable is atomic at the row level, not at the job level. A Dataflow job that fails part way will write partial data into Bigtable.
BigQuery jobs fail or succeed as a unit. From https://cloud.google.com/bigquery/docs/reference/v2/jobs
Each action is atomic and only occurs if BigQuery is able to complete the job successfully. Creation, truncation and append actions occur as one atomic update upon job completion.
Though, just to be clear, BigQuery is atomic at the level of the BigQuery job, not at the level of a Dataflow job that might have created the BigQuery job. E.g. if your Dataflow job fails but it has written to BigQuery before failing (and that BigQuery job is complete) then the data will remain in BigQuery.

how should I do the scheduling

I have never written a Windows service or any scheduler before, so I can't figure out what to do.
I need to write a Windows service. There is a Report table in my DB, and I need to check it every day to see if new reports have been added. Reports have receivers and time settings, such as the 15th of every month at 14:00, daily at 12:35, or weekly on Wednesdays at 13:00. I need to send emails with the reports at these times.
I have decided to use Quartz.NET, but there are a couple of things I don't understand. I think I will have two kinds of jobs. The first checks the DB every day to see if there are new reports that users want. When it finds some, do I create a new job for each report, with triggers based on the times in the DB? And do I create those triggers inside the job that does the daily check? I don't understand how this should work.
And when, for example, the time of a report is updated or the report is deleted, do I need to delete the job and the trigger from the scheduler? I'd appreciate the help. I am using VS 2015 with C#.
And when I write the Windows service, do I just start the Quartz scheduler that I have set up? Sorry, I couldn't understand what I have read so far.
I would recommend Hangfire over Quartz.NET:
http://hangfire.io/
It's a more modern approach to scheduled jobs. In the past I've used Quartz.NET as well. First of all, using Hangfire requires no service. The jobs are persistent, and retries are built in. The syntax is also easier.
I've used Hangfire and it's wonderful and simple.
However, Hangfire does not support Oracle DB so far. Also, Quartz provides more flexibility in terms of scheduling (calendars, end dates, etc.).
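For the Quartz route, the three example schedules from the question map onto Quartz cron expressions (seconds, minutes, hours, day-of-month, month, day-of-week), and the daily check can be reduced to comparing what the Report table says with what is currently scheduled. The sketch below is plain Python that only illustrates that logic; the function names, the report ids, and the dictionary shapes are hypothetical, and the actual add, reschedule, and delete calls would go through the Quartz.NET scheduler.

```python
def quartz_cron(kind, hour, minute, day_of_month=None, weekday=None):
    """Build a Quartz cron expression for a daily, monthly, or weekly schedule."""
    if kind == "daily":
        return f"0 {minute} {hour} * * ?"
    if kind == "monthly":
        return f"0 {minute} {hour} {day_of_month} * ?"
    if kind == "weekly":
        # Quartz requires '?' in day-of-month when day-of-week is given.
        return f"0 {minute} {hour} ? * {weekday}"
    raise ValueError(f"unknown schedule kind: {kind}")


def reconcile(db_reports, scheduled):
    """Compare report id -> cron from the DB with report id -> cron in the scheduler.

    Returns (to_add, to_update, to_remove) so the daily check job knows which
    triggers to create, reschedule, or delete.
    """
    to_add = {rid: cron for rid, cron in db_reports.items() if rid not in scheduled}
    to_update = {
        rid: cron
        for rid, cron in db_reports.items()
        if rid in scheduled and scheduled[rid] != cron
    }
    to_remove = sorted(rid for rid in scheduled if rid not in db_reports)
    return to_add, to_update, to_remove


# Hypothetical contents of the Report table, keyed by report id.
db_reports = {
    1: quartz_cron("monthly", 14, 0, day_of_month=15),  # 15th of every month at 14:00
    2: quartz_cron("daily", 12, 35),                    # daily at 12:35
    3: quartz_cron("weekly", 13, 0, weekday="WED"),     # Wednesdays at 13:00
}
# Hypothetical current scheduler state: report 2 has an old time, report 4 was deleted.
scheduled = {2: "0 0 12 * * ?", 4: "0 0 9 * * ?"}

to_add, to_update, to_remove = reconcile(db_reports, scheduled)
```

With this split, one recurring job does the comparison, and each report gets its own trigger keyed by report id, so an updated or deleted report shows up as a reschedule or a removal on the next check.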

Multiple export using google dataflow

I'm not sure whether this is the right place to ask, but I am currently trying to run a Dataflow job that partitions a data source into multiple chunks written to multiple places. However, I feel that if I try to write to too many tables at once in one job, the Dataflow job is more likely to fail with an HTTP transport exception. I assume there is some bound on how many sources and sinks I can wrap into one job?
To avoid this scenario, the best solution I can think of is to split this one job into multiple Dataflow jobs. However, that means I will need to process the same data source multiple times (once for each Dataflow job). That is okay for now, but ideally I want to avoid it in case my data source grows huge later.
Therefore I am wondering whether there is any rule of thumb for how many data sources and sinks I can group into one stable job. And is there any better solution for my use case?
From the Dataflow service description of structuring user code:
The Dataflow service is fault-tolerant, and may retry your code multiple times in the case of worker issues. The Dataflow service may create backup copies of your code, and can have issues with manual side effects (such as if your code relies upon or creates temporary files with non-unique names).
In general, Dataflow should be relatively resilient. You can Partition your data based on the location you would like it output. The writes to these output locations will be automatically divided into bundles, and any bundle which fails to get written will be retried.
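The idea of partitioning once and retrying failed bundles can be sketched in plain Python. This is an illustration of the concept only, not the Dataflow SDK: the record layout, the `flaky_write` writer, and the retry limit are hypothetical. Note that the writer overwrites the destination rather than appending, so a retried bundle does not produce duplicates.

```python
from collections import defaultdict


def partition_by_destination(records, destination_of):
    """Read the source once and route each record to its output bucket."""
    buckets = defaultdict(list)
    for record in records:
        buckets[destination_of(record)].append(record)
    return dict(buckets)


def write_with_retry(write_bundle, destination, bundle, max_attempts=3):
    """Write one bundle, retrying on IOError; return the attempt that succeeded."""
    for attempt in range(1, max_attempts + 1):
        try:
            write_bundle(destination, bundle)
            return attempt
        except IOError:
            if attempt == max_attempts:
                raise


written = {}
failures_left = {"eu": 1}  # hypothetical: the first write to "eu" fails once


def flaky_write(destination, bundle):
    """Simulated sink: fails while failures remain, then overwrites the output."""
    if failures_left.get(destination, 0) > 0:
        failures_left[destination] -= 1
        raise IOError("simulated transport error")
    written[destination] = list(bundle)  # overwrite, so retries stay idempotent


records = [
    {"region": "eu", "id": 1},
    {"region": "us", "id": 2},
    {"region": "eu", "id": 3},
]
buckets = partition_by_destination(records, lambda r: r["region"])
attempts = {
    dest: write_with_retry(flaky_write, dest, bundle)
    for dest, bundle in buckets.items()
}
```

The source is read a single time, and only the bundle that failed is written again, which is the property that makes one job with several outputs cheaper than several jobs over the same source.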
If the location you want to write to is not already supported you can look at writing a custom sink. The docs there describe how to do so in a way that is fault tolerant.
There is a bound on how many sources and sinks you can have in a single job. Do you have any details on how many you expect to use? If it exceeds the limit, there are also ways to use a single custom sink instead of several sinks, depending on your needs.
If you have more questions, feel free to comment. In addition to knowing more about what you're looking to do, it would help to know if you're planning on running this as a Batch or Streaming job.
Our solution to this was to write a custom GCS sink that supports partitions. Though with the responses I got, I'm unsure whether that was the right thing to do or not. See the question "Writing Output of a Dataflow Pipeline to a Partitioned Destination".

How do I trigger a job when another completes?

I have two jobs. Consider them to be super simple jobs that just print a line and have no triggers or timeouts defined. They work fine when I call them from a controller class through: <name of my class>Job.triggerNow()
What I want is to trigger one job and, as soon as it finishes, trigger a different, subsequent job.
I have tried using the quartzScheduler, but I can't seem to get a JobDetail from my job classes, so I'm not sure what the correct way of doing this is. I also want to pass some results from the first job on to the second one.
I know I can trigger the second job on the last line of my first job's execute method, but this is not desirable, since it's technically not part of the first job and couples things more than I would like.
Any help will be greatly appreciated. thanks
What it sounds like you are after is an asynchronous "pipeline" of work, where different workers are arranged in a line and pass the data to be worked on from one to the next. This sort of architecture is amazingly flexible and applies to a large number of very common applications.
The best way that I have found to get such an architecture in place with Grails is to use a message queue, like RabbitMQ for example, with a series of queues (one for each step in the pipeline), and then have the controller(s) put messages into the first step of the pipeline.
Then, you have a worker (just a service within the Grails app if you use the excellent RabbitMQ Grails plugin) listen to the queue that holds the jobs for it to work on. As work comes into the queue, the worker pops the job off, processes it, and then puts a message into the queue of the next step in the pipeline.
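The shape of such a pipeline can be sketched with in-process queues. This is a minimal illustration using Python's standard `queue` and `threading` modules instead of RabbitMQ and Grails; the step functions and the `STOP` sentinel are assumptions made for the example.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker that no more messages will arrive


def worker(inbox, outbox, step):
    """Pop messages from inbox, process them, and pass results to the next queue."""
    while True:
        message = inbox.get()
        if message is STOP:
            outbox.put(STOP)  # let the next stage shut down as well
            break
        outbox.put(step(message))


def run_pipeline(messages, steps):
    """Run one worker thread per step, with one queue in front of each step."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=worker, args=(queues[i], queues[i + 1], step))
        for i, step in enumerate(steps)
    ]
    for thread in threads:
        thread.start()

    # The "controller" only knows about the first queue.
    for message in messages:
        queues[0].put(message)
    queues[0].put(STOP)

    for thread in threads:
        thread.join()

    # Drain the final queue up to the sentinel.
    results = []
    while True:
        item = queues[-1].get()
        if item is STOP:
            break
        results.append(item)
    return results


results = run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 10])
```

Each step only knows its own inbox and the next queue, so the first job never calls the second one directly, and results travel between steps as messages.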
I've found this to be the best way to architect just about any asynchronous pipeline, since it allows you to scale each piece separately as needed and doesn't have too much overhead. There are also ways to decouple the jobs from having to know about the next step in the pipeline, but I've found that in most cases this isn't really needed and just adds useless complexity.
Quartz is great for jobs that need to happen on a schedule, but a pipeline is much better at processing things as they come in, in a scalable way.
Please have a look at Quartz's JobListener interface. You can use its jobWasExecuted callback, which is called after a job has finished executing:
public void jobWasExecuted(JobExecutionContext context,
JobExecutionException jobException);
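The listener idea can be sketched independently of Quartz. The toy scheduler below is hypothetical and only mirrors the shape of the callback: after a job runs, every registered listener is told which job finished, what it returned, and whether it failed, and one listener uses that to trigger the follow-up job with the first job's result.

```python
class MiniScheduler:
    """Toy scheduler (hypothetical); it only mirrors the listener callback idea."""

    def __init__(self):
        self.listeners = []
        self.log = []  # (job name, result) for every executed job

    def trigger(self, job, data=None):
        result, error = None, None
        try:
            result = job(data)
        except Exception as exc:  # report failures to listeners instead of raising
            error = exc
        self.log.append((job.__name__, result))
        for listener in self.listeners:
            listener(self, job, result, error)


def chain(first, second):
    """Return a listener that triggers `second` after `first` succeeds."""

    def job_was_executed(scheduler, job, result, error):
        if job is first and error is None:
            scheduler.trigger(second, result)  # pass the result along

    return job_was_executed


def first_job(data):
    return "report-42"


def second_job(data):
    return f"emailed {data}"


scheduler = MiniScheduler()
scheduler.listeners.append(chain(first_job, second_job))
scheduler.trigger(first_job)
```

The chaining lives in the listener rather than in the first job's body, which keeps the two jobs unaware of each other.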
I built something similar to this in my web application using a queue messaging technique with Redis. I simply define the dependency structure for all the jobs, and have a master job whose only purpose is to monitor and update the status of the other jobs and trigger dependent jobs if needed.
Each job has to report its status (running, finished, or cancelled) through the Redis queue. The master job pops each queue message and processes it accordingly.
