Windowing and Watermark in Apache beam : Google dataflow

Windowing and Watermark in Apache beam : Google dataflow - google-cloud-dataflow

I have a fixed window of 1 minute. I am considering event time.
beam.WindowInto(window.FixedWindows(300))
When I deploy this code ,is the window created instantly even if I have not published any message .suppose I deployed at 6:30 , is it like the windows are automatically created as 6:30 to 6:35, 6:35 to 6:40 and so on ?
If I publish a message to topic having
event timestamp = 6:31 (unix seconds i.e 10,176589653)
when system time = 6:36
..does it mean the watermark for that specific message is at 6:31 and it will miss the window as system time is at 6:36 and allowed lateness=0 and will be rejected.

Windows are always created using UNIX time 0 as a base, meaning, no matter if you start the pipeline at 6:31, 6:32 or 6:35, the windows would always be [6:30, 6:35), [6:35, 6:40).... Note that this also applies for days, the windows would start at 00:00 UTC.
If you want to change this, there's an offset parameter.

Related

Can I make flex template jobs take less than 10 minutes before they start to process data?

I am using terraform resource google_dataflow_flex_template_job to deploy a Dataflow flex template job.
resource "google_dataflow_flex_template_job" "streaming_beam" {
provider = google-beta
name = "streaming-beam"
container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
parameters = {
"input_subscription" = google_pubsub_subscription.ratings[0].id
"output_table" = "${var.project}:beam_samples.streaming_beam_sql"
"service_account_email" = data.terraform_remote_state.state.outputs.sa.email
"network" = google_compute_network.network.name
"subnetwork" = "regions/${google_compute_subnetwork.subnet.region}/subnetworks/${google_compute_subnetwork.subnet.name}"
}
}
Its all working fine however without my requesting it the job seems to be using flexible resource scheduling (flexRS) mode, I say this because the job takes about ten minutes to start and during that time has state=QUEUED which I think is only applicable to flexRS jobs.
Using flexRS mode is fine for production scenarios however I'm currently still developing my dataflow job and when doing so flexRS is massively inconvenient because it takes about 10 minutes to see the effect of any changes I might make, no matter how small.
In Enabling FlexRS it is stated
To enable a FlexRS job, use the following pipeline option:
--flexRSGoal=COST_OPTIMIZED, where the cost-optimized goal means that the Dataflow service chooses any available discounted resources or
--flexRSGoal=SPEED_OPTIMIZED, where it optimizes for lower execution time.
I then found the following statement:
To turn on FlexRS, you must specify the value COST_OPTIMIZED to allow the Dataflow service to choose any available discounted resources.
at Specifying pipeline execution parameters > Setting other Cloud Dataflow pipeline options
I interpret that to mean that flexrs_goal=SPEED_OPTIMIZED will turn off flexRS mode. However, I changed the definition of my google_dataflow_flex_template_job resource to:
resource "google_dataflow_flex_template_job" "streaming_beam" {
provider = google-beta
name = "streaming-beam"
container_spec_gcs_path = module.streaming_beam_flex_template_file[0].fully_qualified_path
parameters = {
"input_subscription" = google_pubsub_subscription.ratings[0].id
"output_table" = "${var.project}:beam_samples.streaming_beam_sql"
"service_account_email" = data.terraform_remote_state.state.outputs.sa.email
"network" = google_compute_network.network.name
"subnetwork" = "regions/${google_compute_subnetwork.subnet.region}/subnetworks/${google_compute_subnetwork.subnet.name}"
"flexrs_goal" = "SPEED_OPTIMIZED"
}
}
(note the addition of "flexrs_goal" = "SPEED_OPTIMIZED") but it doesn't seem to make any difference. The Dataflow UI confirms I have set SPEED_OPTIMIZED:
but it still takes too long (9 minutes 46 seconds) for the job to start processing data, and it was in state=QUEUED for all that time:
2021-01-17 19:49:19.021 GMTStarting GCE instance, launcher-2021011711491611239867327455334861, to launch the template.
...
...
2021-01-17 19:59:05.381 GMTStarting 1 workers in europe-west1-d...
2021-01-17 19:59:12.256 GMTVM, launcher-2021011711491611239867327455334861, stopped.
I then tried explictly setting flexrs_goal=COST_OPTIMIZED just to see if it made any difference, but this only caused an error:
"The workflow could not be created. Causes: The workflow could not be
created due to misconfiguration. The experimental feature
flexible_resource_scheduling is not supported for streaming jobs.
Contact Google Cloud Support for further help. "
This makes sense. My job is indeed a streaming job and the documentation does indeed state that flexRS is only for batch jobs.
This page explains how to enable Flexible Resource Scheduling (FlexRS) for autoscaled batch pipelines in Dataflow.
https://cloud.google.com/dataflow/docs/guides/flexrs
This doesn't solve my problem though. As I said above if I deploy with flexrs_goal=SPEED_OPTIMIZED then still state=QUEUED for almost ten minutes, yet as far as I know QUEUED is only applicable to flexRS jobs:
Therefore, after you submit a FlexRS job, your job displays an ID and a Status of Queued
https://cloud.google.com/dataflow/docs/guides/flexrs#delayed_scheduling
Hence I'm very confused:
Why is my job getting queued even though it is not a flexRS job?
Why does it take nearly ten minutes for my job to start processing any data?
How can I speed up the time it takes for my job to start processing data so that I can get quicker feedback during development/testing?
UPDATE, I dug a bit more into the logs to find out what was going on during those 9minutes 46 seconds. These two consecutive log messages are 7 minutes 23 seconds apart:
2021-01-17 19:51:03.381 GMT
"INFO:apache_beam.runners.portability.stager:Executing command: ['/usr/local/bin/python', '-m', 'pip', 'download', '--dest', '/tmp/dataflow-requirements-cache', '-r', '/dataflow/template/requirements.txt', '--exists-action', 'i', '--no-binary', ':all:']"
2021-01-17 19:58:26.459 GMT
"INFO:apache_beam.runners.portability.stager:Downloading source distribution of the SDK from PyPi"
Whatever is going on between those two log records is the main contributor to the long time spent in state=QUEUED. Anyone know what might be the cause?

As mentioned in the existing answer you need to extract the apache-beam modules inside your requirements.txt:
RUN pip install -U apache-beam==<version>
RUN pip install -U -r ./requirements.txt

While developing, I prefer to use DirectRunner, for the fastest feedback.

Unison fails because it cannot set time stamp

Trying to sync from a mac to a linux machine, I get multiple failures with a message of the following type:
100% 00:00 ETAFailed [www/sandbox/my-vue-buefy-project/node_modules/spdy-transport/lib/spdy-transport/protocol/spdy]:
Failed to set modification time of file /users/guerrini/www/sandbox/my-vue-buefy-project/node_modules/spdy-transport/lib/spdy-transport/protocol/.unison.spdy.1db0b477154fc6ddf40346e8e27082da.unison.tmp/constants.js
to 1970-01-01 at 1:00:00 (0.000000):
the time was set to 2018-04-12 at 8:49:57 (1523515797.000000) instead`
It seems that it cannot set the modification time and that it uses the current time instead. But, unfortunately, after this the synchronisation of all the files with the above modification date fails.
Moreover, I have tried to set modification date to the given time by hand with "touch" and it works.

How to do time comparison between client's system time and date time from internet server (actual local time)

I am trying to validate the system time of client’s computer with the actual time (internet time). If for some reason the client’s time settings are not correct or the time and timezone don’t match the local time, I want to notify them to sync the time with their local time in order to use the application. If my question is not clear then this is something that I am trying to mimic, https://knowledge.autodesk.com/support/autocad/troubleshooting/caas/sfdcarticles/sfdcarticles/Incorrect-System-time-warning-when-starting-an-Autodesk-360-application.html
How can I do this time comparison/validation in dart?

The main question is IMHO what accuracy you need.
You can just query a NTC and report if there is a discrepancy. If the server is synchronized with such a time server, there shouldn't be a problem.
You can also add an API to your server that returns the server time.
Then you read the time from the local system and from the server and check the difference
bool compareTime() {
var serverTime = await getTimeFromServer(); // not yet existing method to fetch the date and time from the server
var clientTime = new DateTime.now().toUtc();
var diff = serverTime.difference(clientTime).abs();
if(diff > const Duration(seconds: 5)) {
print('Time difference too big');
} else {
print('Time is fine');
}
}
Ensure that the time returned from the server is UTC as well, otherwise you're comparing apples with pears.

If you're running server-side, you can shell out to ntpdate -d pool.ntp.org and read the output, parsing the last line. If the time offset is small enough, you're good.
For browser apps, see the StackOverflow at Getting the current GMT world time for a few options.

Multiple times trigger generation in Zabbix

I am new to zabbix. I have a basic requirement of monitoring occurrence of different log messages using zabbix. Say, when there is a log message "server starting", zabbix should show that alert. The idea is that if the server (re)starts 10 times in last 10 minutes, the zabbix dashboard (or at any other place) should display that 10 times.
I have done the following for that :
Created an item under template MyTemplate:
Type : Zabbix Agent (Active)
key : log[/opt/mylog/logs/abc.log,server starting]
Type of information : Log
Update Interval (in sec) : 30
Created a trigger with expression :
{MyTemplate:log[/opt/mylog/logs/abc.log,server
starting].logeventid(1)}=0
With logeventid(1), I am seeing that the alert (trigger) is getting generated only once. It appears only once in the Dashboard --> Last 20 issues. If I go to Monitoring --> Trigger, I see the alert only once, although the log files have 10 entries of the message "server starting" (server restarted 10 times).
Then I set the trigger to following :
{MyTemplate:log[/opt/mylog/logs/abc.log,server
starting].nodata(300)}=0
Now, at Monitoring --> Trigger, I see the alert (trigger) 10 times, but, from the Dashboard --> Last 20 issues it vanishes just after 300 seconds.
My questions are :
What should be the trigger function, I should use? I want to see 10 alerts in zabbix if the same message appears 10 times in the log file within a period of time.
With nodata(300), why does the alert vanish after 300 sec?
Is it ok if I use 30 minutes instead of 300 seconds as an argument of nodata()?

Function logeventid() is normally used for Windows and VMware event logs. In this case, it should probably not be used and it is suspicious that it fires, which might indicate a bug in Zabbix.
Anyway, you can check "Multiple PROBLEM events generation" box in trigger configuration and the trigger will generate a new PROBLEM event every time the condition is true, regardless of its previous value. Instead of logeventid(), you can try using a function that is always true, for instance, strlen()>0.
If you wish the trigger to go into OK state after some time, say, 10 minutes, you can add nodata(10m). Then your trigger will look like this:
{MyTemplate:log[/opt/mylog/logs/abc.log,server starting].strlen()}>0 and
{MyTemplate:log[/opt/mylog/logs/abc.log,server starting].nodata(10m)}=0

php 5.2.17 date('T') gives incorrect timezone. Server time+phpinfo time are correct

I am having a bit of a weird issue:
the date function gives timezone=MST
the date function from the centOS prompt gives me EST
the phpinfo() function returns America/New_York
As Plesk was showing America/New_York but centOS was not, Techsupport did something to the
/usr/share/zoneinfo/ files, because they said that somehow the New_York file was showing MST (Mountain Time).
After that operation, centOS time and phpinfo() display EST correctly but the date function still display MST.
Any ideas?

Did you tried date_default_timezone_set()?
Since PHP 5.1.0 (when the date/time functions were rewritten), every call to a date/time function will generate a E_NOTICE if the timezone isn't valid, and/or a E_WARNING message if using the system settings or the TZ environment variable.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Windowing and Watermark in Apache beam : Google dataflow - google-cloud-dataflow

Related

Can I make flex template jobs take less than 10 minutes before they start to process data?

Unison fails because it cannot set time stamp

How to do time comparison between client's system time and date time from internet server (actual local time)

Multiple times trigger generation in Zabbix

php 5.2.17 date('T') gives incorrect timezone. Server time+phpinfo time are correct

Categories

Resources