Creating watermark tracking pubsub subscription failure after applying withTimestampAttribute to PubsubIO - google-cloud-dataflow

In Google Dataflow streaming, I want to use record timestamps provided in Pub/Sub message attributes rather than the published timestamp.
// withTimestampAttribute must be chained: PubsubIO.Read is immutable, so the returned transform is the one that carries the attribute.
PubsubIO.Read<T> pubsubIO = PubsubIO.readProtos(type)
        .fromSubscription(options.getPubSubSubscription())
        .withTimestampAttribute("eventTimestamp");
When I added withTimestampAttribute to the code, it worked fine with the local DirectRunner, but failed when using the Google Cloud Dataflow runner.
Workflow failed. Causes: Step setup_resource_/subscriptions/xxxxxx: Set up of resource /subscriptions/xxxxxx failed, Creating watermark tracking pubsub subscription projects/xxxxxx to topic projects/xxxxxx failed with error: User not authorized to perform this action.
The service account used to run Dataflow has admin roles on Dataflow, Pub/Sub, etc., so I assumed we could rule out an IAM issue.
I'm not sure if I missed any required configuration, e.g. extra setup when creating the Pub/Sub topic and subscription?

The problem was resolved after I added the Pub/Sub Subscriber role to the service account. So it was an IAM issue after all...
It looks like Dataflow creates the watermark tracking subscription on the fly when the job runs, which is why admin roles granted at the resource level (topic/subscription) were not enough; the role also had to be granted to the service account in project-level IAM.
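For completeness, here is a minimal sketch of the publishing side, under the assumption that the attribute carries a parseable timestamp: Beam's withTimestampAttribute expects the value to be either milliseconds since the epoch or an RFC 3339 string, and the project and topic names below are placeholders.

import com.google.cloud.pubsub.v1.Publisher;
import com.google.protobuf.ByteString;
import com.google.pubsub.v1.PubsubMessage;
import com.google.pubsub.v1.TopicName;

public class PublishWithEventTimestamp {
  public static void main(String[] args) throws Exception {
    // "my-project" and "my-topic" are placeholders for the real topic.
    Publisher publisher = Publisher.newBuilder(TopicName.of("my-project", "my-topic")).build();
    try {
      PubsubMessage message = PubsubMessage.newBuilder()
          .setData(ByteString.copyFromUtf8("payload"))
          // Millis since epoch; an RFC 3339 string also works for withTimestampAttribute.
          .putAttributes("eventTimestamp", String.valueOf(System.currentTimeMillis()))
          .build();
      publisher.publish(message).get();  // block until the publish succeeds
    } finally {
      publisher.shutdown();
    }
  }
}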


Additional Identities Plugin - how to configure?

I'm struggling with duplicated users in my Jenkins and with the "Not sending mail to unregistered user ..." problem.
I installed the plugin as per this answer but cannot configure it properly from the plugin documentation.
My Jenkins collects data from Active Directory and some users have duplicate entries, e.g.:
john.doe, john.doe@mycompany.com -> duplicated user, which is detected by Jenkins
doej, john.doe@mycompany.com -> correct user, which is used when logging in
After collecting the responsible people from the git changes in a job, Jenkins ends with
Not sending mail to unregistered user john.doe@mycompany.com
I tried adding an additional identity to the user doej by setting:
Identity: john.doe@mycompany.com
Realm: <empty>
but it doesn't work.
How should I correctly configure the Additional Identities Plugin?
It seems it's not possible to configure the Additional Identities Plugin in Jenkins to merge the duplicate users.
Jenkins is missing a way to ensure user unicity (uniqueness), since users are created from various sources: the authentication method (LDAP in my case) and code commits (Subversion, Mercurial, Git, ...).
Depending on the way a user is retrieved by Jenkins (from a commit on a given SCM or from authentication), multiple identities are created for the same real user.
As a consequence, some features do not work fully or work badly (login, notifications, user's builds, continuous integration game, ...), and configuring users is a pain, as it must be done multiple times for each real user.
Still, the required features are:
a merge feature: allow merging multiple Jenkins users into a single account.
a user pattern per SCM: allow choosing how to extract a username from a commit for each SCM, and how to optionally match an existing user instead of creating a new one.
an ID pattern per notification type: allow defining how the default ID used for notifications is generated from the user's data (their Jenkins ID, name, SCM ID, ...): for instance, their mail or Jabber ID.
Reference: [JENKINS-10258] Allow users unicity - Jenkins Jira
The solution is in Jenkins 1.480, but it is still flagged for vulnerabilities and has bugs as well.
Jenkins 1.480 introduces an extension point to resolve a Jenkins user's "canonical" ID when searching for a user in the database by ID or full name. This plugin uses that extension point to let users configure external identities as user properties.
You can reach out to the Jenkins community or support team to learn the status, or when a final release will ship.

Can Google Cloud Run be used to run a continuously-listening Python script?

I'd like to run a Python script in the cloud. It would use Tweepy Streaming to continuously listen for Tweets containing certain keywords. So it needs to run uninterrupted, 24/7.
Would Google Cloud Run be suitable for this use case?
The Quotas and Limits page mentions that requests time out after 60 minutes at most, but I don't know exactly what this means.
Thank you.
No, it would not be a good choice. Serverless infrastructure provided by products like Cloud Run and Cloud Functions is generally assumed to expand and contract server instances on demand, and server instances are never guaranteed a long uptime. If you absolutely require 24/7 uninterrupted operation of some background task not tied to an event or HTTP request, you should use a different cloud product, such as App Engine or Compute Engine.
"some background task not tied to an event or HTTP request"
Isn't what the OP wants? Merely to listen for tweets 24/7? Detecting a tweet is an event and an HTTP request. Cloud Run and Cloud Functions can be triggered 24/7 via its URL endpoints.
On the Cloud Functions page, if you scroll down there is a section called "Integration with third-party services and APIs".
Quoting that section:
"... capabilities such as sending a confirmation email after a successful Stripe payment or responding to Twilio text message events"
Listening for tweets counts too, so it seems Google Cloud Functions / Cloud Run can be used for the OP's use case.

Is it possible to execute some code like logging and writing result metrics to GCS at the end of a batch Dataflow job?

I am using Apache Beam 2.22.0 (Java SDK) and want to log metrics and write them to a GCS bucket after a batch pipeline finishes execution.
I have tried using result.waitUntilFinish() followed by the intended code:
DirectRunner - the GCS object is created as expected and the logs appear on the console.
DataflowRunner - the GCS object is created, but the logs (post pipeline execution) don't appear in Stackdriver.
Problem: when a Dataflow template is created on GCS for the same pipeline, neither the GCS object is created nor do the logs appear when running from the template.
What you are doing is the correct way of getting a signal for when the pipeline is done. There is no direct API in Apache Beam that allows getting that signal within the running pipeline aside from waitUntilFinish().
For your logging problem, you need to use the Cloud Logging API in your code. This is because the pipeline is submitted to the Dataflow service and runs on GCE VMs, which log to Cloud Logging, whereas the code outside of your pipeline runs locally.
See Perform action after Dataflow pipeline has processed all data for a little more information.
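As a rough illustration of that pattern, here is a minimal sketch; the bucket name, object name and log name are placeholders, and it assumes the google-cloud-storage and google-cloud-logging Java clients are on the launcher's classpath.

import com.google.cloud.MonitoredResource;
import com.google.cloud.logging.LogEntry;
import com.google.cloud.logging.Logging;
import com.google.cloud.logging.LoggingOptions;
import com.google.cloud.logging.Payload.StringPayload;
import com.google.cloud.logging.Severity;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.PipelineResult;
import org.apache.beam.sdk.metrics.MetricQueryResults;
import org.apache.beam.sdk.metrics.MetricResult;
import org.apache.beam.sdk.metrics.MetricsFilter;

public class RunAndReport {
  public static void runAndReport(Pipeline pipeline) {
    PipelineResult result = pipeline.run();
    result.waitUntilFinish();  // blocks the launching process until the job is done

    // Collect whatever counters the pipeline reported.
    MetricQueryResults metrics = result.metrics().queryMetrics(MetricsFilter.builder().build());
    StringBuilder report = new StringBuilder("Counters after run:\n");
    for (MetricResult<Long> counter : metrics.getCounters()) {
      report.append(counter.getName()).append(" = ").append(counter.getAttempted()).append("\n");
    }

    // Write the report to GCS ("my-metrics-bucket" is a placeholder).
    Storage storage = StorageOptions.getDefaultInstance().getService();
    storage.create(BlobInfo.newBuilder("my-metrics-bucket", "run-report.txt").build(),
        report.toString().getBytes(StandardCharsets.UTF_8));

    // Send the same report to Cloud Logging explicitly, because this code runs on the
    // launching machine, not on the Dataflow workers.
    Logging logging = LoggingOptions.getDefaultInstance().getService();
    LogEntry entry = LogEntry.newBuilder(StringPayload.of(report.toString()))
        .setSeverity(Severity.INFO)
        .setLogName("pipeline-post-run")
        .setResource(MonitoredResource.newBuilder("global").build())
        .build();
    logging.write(Collections.singleton(entry));
  }
}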
It is possible to export the logs from your Dataflow job to Google Cloud Storage, BigQuery or Pub/Sub. In order to do that, you can use the Cloud Logging console, the Cloud Logging API or gcloud logging to export the desired entries to a specific sink.
In summary, to use the log export:
Create a sink, selecting Google Cloud Storage as the sink service (or one of the other desired options).
Within the sink, create a query to filter your logs (optional).
Choose the export destination.
Afterwards, every time Cloud Logging receives new entries it adds them to the sink; only the new entries are exported. A sketch of creating such a sink programmatically follows below.
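For instance, a minimal sketch of creating such a sink with the google-cloud-logging Java client; the sink name, bucket and filter below are placeholders, and the console or gcloud achieve the same thing.

import com.google.cloud.logging.Logging;
import com.google.cloud.logging.LoggingOptions;
import com.google.cloud.logging.Sink;
import com.google.cloud.logging.SinkInfo;
import com.google.cloud.logging.SinkInfo.Destination.BucketDestination;

public class CreateLogSink {
  public static void main(String[] args) {
    Logging logging = LoggingOptions.getDefaultInstance().getService();

    // Export Dataflow job logs to the GCS bucket "my-log-export-bucket" (placeholder).
    SinkInfo sinkInfo = SinkInfo.newBuilder("dataflow-log-sink",
            BucketDestination.of("my-log-export-bucket"))
        .setFilter("resource.type=\"dataflow_step\"")
        .build();

    Sink sink = logging.create(sinkInfo);
    System.out.println("Created sink: " + sink.getName());
  }
}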
While you did not mention whether you are using custom metrics, I should point out that you need to follow the metrics naming rules, here. Otherwise, they won't show up in Stackdriver.
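For reference, a minimal sketch of a custom counter declared inside a DoFn; the namespace and counter name below are placeholders and must satisfy those naming rules to be exported.

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class CountNonEmpty extends DoFn<String, String> {
  // "my_namespace" and "non_empty_lines" are placeholder names.
  private final Counter nonEmptyLines = Metrics.counter("my_namespace", "non_empty_lines");

  @ProcessElement
  public void processElement(@Element String line, OutputReceiver<String> out) {
    if (!line.isEmpty()) {
      nonEmptyLines.inc();  // shows up in the job's metrics and, on Dataflow, in monitoring
    }
    out.output(line);
  }
}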

Gerrit/NoteDB User Management

I am in the process of switching the LDAP backend that we use to authenticate access to Gerrit.
When a user logs in via LDAP, a local account is created within Gerrit. We are running version 2.15 of Gerrit, and therefore our local user accounts have migrated from the SQL DB into NoteDB.
The changes in our infrastructure mean that once the LDAP backend has been switched, user logins will appear to Gerrit as new users and therefore new local accounts will be generated. As a result, we will need to perform a number of administrative tasks on the existing local accounts before and after the migration.
The REST API exposes some of the functionality that we need, however two key elements appear to be missing:
There appears to be no way to retrieve a list of all local accounts through the API (such that I could then iterate through to perform the administrative tasks I need to complete). The /accounts/ endpoint insists on a query filter being specified, which does not appear to include a way to simply specify 'all' or '*'. Instead I am having to try and think of a search filter that will reliably return all accounts - I haven't succeeded yet.
There appears to be no way to delete an account. Once the migration is complete, I need to remove the old accounts, but nothing is documented for the API or any other method to remove old accounts.
Has anybody found a solution to either of these tasks that they could share?
I came to the conclusion that the answers to my questions were as follows.
('/a/' in the examples below accesses the administrative endpoint, so basic auth is required and the user must have the appropriate permissions.)
Retrieving all accounts
There is no way to do this in a single query; however, combining the results of:
GET /a/accounts?q=is:active&n=<number larger than the number of users>
GET /a/accounts?q=is:inactive&n=<number larger than the number of users>
will give effectively the same thing.
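A rough sketch of that combination from a script, assuming Java 11's built-in HTTP client, a placeholder server URL and a Gerrit HTTP password for basic auth; note that Gerrit prefixes its JSON responses with a ")]}'" line that has to be stripped.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class ListAllGerritAccounts {
  // The URL and credentials below are placeholders for your own instance.
  private static final String GERRIT_URL = "https://gerrit.example.com";
  private static final String AUTH = "Basic "
      + Base64.getEncoder().encodeToString("admin:http-password".getBytes());

  private static String fetch(HttpClient client, String query) throws Exception {
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create(GERRIT_URL + "/a/accounts?q=" + query + "&n=100000"))
        .header("Authorization", AUTH)
        .build();
    String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    // Strip Gerrit's ")]}'" anti-XSSI prefix before parsing the JSON.
    return body.replaceFirst("^\\)\\]\\}'\n", "");
  }

  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    String active = fetch(client, "is:active");
    String inactive = fetch(client, "is:inactive");
    // Merge the two JSON arrays with your JSON library of choice to get every account.
    System.out.println(active);
    System.out.println(inactive);
  }
}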
Deleting an account
It seems that this simply is not supported. The only option appears to be to set an account inactive:
DELETE /a/accounts/<account_id>/active
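Following the same pattern as the listing sketch above, setting an account inactive would look roughly like this (the account ID is a placeholder, and GERRIT_URL, AUTH and client reuse the constants from that sketch):

// Mark account 1000096 (placeholder ID) as inactive via the REST API.
HttpRequest deactivate = HttpRequest.newBuilder()
    .uri(URI.create(GERRIT_URL + "/a/accounts/1000096/active"))
    .header("Authorization", AUTH)
    .DELETE()
    .build();
client.send(deactivate, HttpResponse.BodyHandlers.ofString());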

Emqtt - How to implement ACL for huge no. of clients

I am using the Emqtt (emqtt.io) broker for my next application. The scenario is:
I'll have many clients (tens of thousands) and each of them will be publishing or subscribing to topics, but I want to restrict every client to publish and subscribe only on topics containing its own client ID. For example:
Topics will be-
my_device/12345/update
my_device/99998/update
my_device/88888/update
If the middle segment is the client ID, how can I restrict clients to publish/subscribe only on that particular topic, so that nobody is able to subscribe to
my_device/# and thereby receive all my messages?
I saw the ACL plugin and this rule ( {allow, {user, "dashboard"}, subscribe, ["$SYS/#"]}. ), but there I would have to define every client manually, and if a new user is added, how would I add one more rule automatically? As I understand it, this file is loaded when the broker starts up, right? I want the ACL to be based on a database. Can you help me with that?
The Emqtt user guide lists a set of plugins that can be used to store the ACL in a database:
http://emqtt.io/docs/v2/guide.html
The links in that doc are broken, but the projects are hosted under the same GitHub organisation.
A. Auth plugins
1. Login: https://emqtt.io/docs/v2/guide.html#authentication
There are lots of ways to check logins: HTTP, Redis/MySQL, and so on.
2. ACL: the same plugins can also control ACL access (HTTP, Redis/MySQL, ...), but the internal acl.conf is more efficient.
B. Internal ACL
The magic variables available in a topic pattern are:
%c - clientid
%u - username
The operations are: subscribe, publish, pubsub.
acl.conf example:
Allow each client to subscribe only to clients/<its own client ID>:
{allow, all, subscribe, ["clients/%c"]}.
Allow each username to publish/subscribe only to clients/<its own username>:
{allow, all, pubsub, ["clients/%u"]}.
Deny everything else:
{deny, all}.
Reference: https://github.com/emqx/emqx/wiki/ACL-Design#examples (the example is from v4, but v2 also supports %c and %u).
To apply the change:
$ emqttd_ctl acl reload
NOTE: every node in the cluster must be configured.
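Applied to the topic layout from the question, here is a minimal sketch of the client side using the Eclipse Paho Java client; the broker address is a placeholder, and the ACL rule shown in the comment is an assumed analogue of the examples above, not something shipped with the broker.

import org.eclipse.paho.client.mqttv3.MqttClient;
import org.eclipse.paho.client.mqttv3.MqttConnectOptions;
import org.eclipse.paho.client.mqttv3.MqttMessage;
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence;

public class DevicePublisher {
  public static void main(String[] args) throws Exception {
    String clientId = "12345";  // the device's own ID
    MqttClient client = new MqttClient("tcp://broker.example.com:1883", clientId,
        new MemoryPersistence());
    client.connect(new MqttConnectOptions());
    // With an assumed ACL rule such as {allow, all, publish, ["my_device/%c/update"]}.
    // the broker substitutes %c with this client's ID, so the publish below is allowed,
    // while publishing or subscribing to another device's topic would be denied.
    client.publish("my_device/" + clientId + "/update", new MqttMessage("hello".getBytes()));
    client.disconnect();
  }
}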
The best option is to use a plugin for auth/ACL. I prefer the MongoDB plugin, but other plugins are provided as well.
From their docs on GitHub: MongoDB plugin setup for emqtt.
It works great for authentication, but I haven't yet been able to subscribe or publish using the plugin's ACL settings.
Also, if the plugins are giving you problems with authentication, try building emqtt from source.
