Is it possible to run Cloud Dataflow with custom packages?

Is it possible to provision Dataflow workers with custom packages?
I'd like to shell out to a Debian-packaged binary from inside a computation.
Edit: To be clear, the package configuration is sufficiently complex that it's not feasible to just bundle the files in --filesToStage.
The solution should involve installing the Debian package at some point.

This is not something Dataflow explicitly supports. However, below are some suggestions on how you could accomplish this. Please keep in mind that the service could change in ways that break this in the future.
There are two separate problems:
Getting the Debian package onto the worker.
Installing the Debian package.
For the first problem you can use --filesToStage and specify the path to your Debian package. This will cause the package to be uploaded to GCS and then downloaded to the worker on startup. If you use this option, you must also include all of your jars in the value of --filesToStage, since they will not be included by default once you set --filesToStage explicitly.
On the Java worker, any files passed in --filesToStage will be available in the following directories (or a subdirectory of them):
/var/opt/google/dataflow
or
/dataflow/packages
You would need to check both locations to be sure of finding the file.
We provide no guarantee that these directories won't change in the future. These are simply the locations used today.
To solve the second problem you can override startBundle in your DoFn. From there you could shell out to the command line and install your Debian package after finding it in /dataflow/packages.
There could be multiple instances of your DoFn running side by side, so you could get contention issues if two processes try to install your package simultaneously. I'm not sure whether the Debian package system can handle this or whether you need to do so explicitly in your code.
A slight variant of this approach is to not use --filesToStage to distribute the package to your workers but instead add code to your startBundle to fetch it from some location.
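For concreteness, here is a minimal sketch of that approach against the Dataflow Java SDK 1.x DoFn API. The two staging directories come from above; the recursive search for a .deb file, the use of sudo, and the class-level flag (which only serializes installs within a single JVM) are assumptions for illustration, not behaviour guaranteed by the service.

import com.google.cloud.dataflow.sdk.transforms.DoFn;

import java.io.File;

public class InstallDebianPackageFn extends DoFn<String, String> {

  // Guards against repeated installs by DoFn instances in the same JVM only;
  // dpkg's own lock is what arbitrates between separate worker processes.
  private static boolean installed = false;

  @Override
  public void startBundle(Context c) throws Exception {
    synchronized (InstallDebianPackageFn.class) {
      if (installed) {
        return;
      }
      // Check both staging locations, since either may be used.
      File deb = findDeb(new File("/var/opt/google/dataflow"));
      if (deb == null) {
        deb = findDeb(new File("/dataflow/packages"));
      }
      if (deb == null) {
        throw new RuntimeException("Staged .deb package not found on the worker");
      }
      // Assumes the worker user is allowed to run dpkg via sudo.
      Process p = new ProcessBuilder("sudo", "dpkg", "-i", deb.getAbsolutePath())
          .redirectErrorStream(true)
          .start();
      if (p.waitFor() != 0) {
        throw new RuntimeException("dpkg -i failed for " + deb);
      }
      installed = true;
    }
  }

  @Override
  public void processElement(ProcessContext c) throws Exception {
    // Shell out to the newly installed binary here as needed.
    c.output(c.element());
  }

  // Recursively returns the first .deb file found under dir, or null.
  private static File findDeb(File dir) {
    File[] children = dir.listFiles();
    if (children == null) {
      return null;
    }
    for (File f : children) {
      if (f.isDirectory()) {
        File found = findDeb(f);
        if (found != null) {
          return found;
        }
      } else if (f.getName().endsWith(".deb")) {
        return f;
      }
    }
    return null;
  }
}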

Related

Canonical way to make terraform use only local plugins

As far as I can see, there are three ways to make Terraform use prepopulated plugins (to prevent downloads from the web during terraform init):
the terraform providers mirror command + provider_installation in .terraformrc (or terraform.rc)
terraform init -plugin-dir command
warming up provider-plugin-cache
Are they all equivalent? Which one is recommended? My use case is building a "deployer" Docker image for a CI/CD pipeline, and I am also considering the possibility of using Terraform under Terraspace.
The first two of these are connected in that they both share the same underlying mechanism: the "filesystem mirror" plugin installation method.
Using terraform init -plugin-dir makes Terraform in effect construct a one-off provider_installation block which contains only a single filesystem_mirror block referring to the given directory. It allows you to get that effect for just one installation operation, rather than configuring it in a central place for all future commands.
Specifically, if you run terraform init -plugin-dir=/example then that's functionally equivalent to the following CLI configuration:
provider_installation {
  filesystem_mirror {
    path = "/example"
  }
}
The plugin cache directory is different because Terraform will still access the configured installation methods (by default, the origin registry for each provider) but will skip downloading the plugin package file (the file actually containing the plugin code, as opposed to the metadata about the release) if it's already in the cache. Similarly, it'll save any new plugin package it downloads into the cache for future use.
This therefore won't stop Terraform from trying to install any new plugins it encounters via network access to the origin registry. It is just an optimization to avoid re-downloading the same package repeatedly.
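For reference, the cache is enabled with a single setting in the CLI configuration file (.terraformrc or terraform.rc); the path below is just the conventional example, and the TF_PLUGIN_CACHE_DIR environment variable is an alternative way to set it. Terraform will not create this directory for you.

plugin_cache_dir = "$HOME/.terraform.d/plugin-cache"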
There is a final approach which is similar to the first one but with a slight difference: Implied Local Mirror Directories.
If you don't have a provider_installation block in your configuration then Terraform will construct one for itself by searching the implied mirror directories and treating any provider it finds there as a local-only one. For example, if /usr/share/terraform/plugins contains any version of registry.terraform.io/hashicorp/aws (the official AWS provider) then Terraform will behave as if it were configured as follows:
provider_installation {
  filesystem_mirror {
    path    = "/usr/share/terraform/plugins"
    include = ["registry.terraform.io/hashicorp/aws"]
  }
  direct {
    exclude = ["registry.terraform.io/hashicorp/aws"]
  }
}
This therefore makes Terraform treat the local directory as the only possible installation source for that particular provider, but still allows Terraform to fetch any other providers from upstream if requested.
If your requirement is for terraform init to not consult any remote services at all for the purposes of plugin installation, the approach directly intended for that case is to write a provider_installation block with only a filesystem_mirror block inside of it, which will therefore disable the direct {} installation method and thus prevent Terraform from trying to access the origin registry for any provider.
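A minimal sketch of that configuration (the mirror path here is arbitrary):

provider_installation {
  # No direct {} block, so the origin registries are never consulted.
  filesystem_mirror {
    path = "/usr/share/terraform/plugins"
  }
}

With this in place, terraform init will fail outright for any provider that is missing from the mirror directory rather than reaching out to the network.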
Not sure about Terraspace. Only regarding plugins:
the terraform providers mirror command + provider_installation in .terraformrc (or terraform.rc): this seems like the most secure option, but it requires you to update the local mirror whenever you change plugin versions. It's not very clear whether you can reuse the same mirror location for different configurations requiring different sets or versions of plugins.
terraform init -plugin-dir command: terraform commands will fail if the required plugins and specific versions are not preinstalled. This approach seems to be the most time-consuming and gives the tightest control over the available plugins.
When this option is used, only the plugins in the given directory are available for use.
warming up the provider plugin cache: this one can reuse pre-downloaded plugin versions and will also try to download new versions when you update the constraints. This approach will work if your cache path is writable; if it is not, terraform will probably fail as in the second option. This option seems to be the least time-consuming and the closest to local development. The cache is not automatically cleaned up and will need some cleanup automation.
Depending on whether you have many different configurations, what level of security is required, and whether you have the capacity to update caches/mirrors frequently enough to keep up with the required versions, the choice could differ as well.

Coldfusion Docker is uninstalling modules on build

I'm having issues with creating a usable Docker container for a ColdFusion 2021 app. I can create the container, but every time it is rebuilt I have to reinstall all of the modules (admin, search, etc.). This is an issue because the site that the container will be housed on will be rebuilding the container every day.
The container is being built with docker-compose. I have tried using the installModule and importModule environment variables, running the install command from the Dockerfile, building the container and creating a .car file to keep the settings, and disabling the secure mode using the environment variables.
I have looked at the log, and all of the different methods used to install/import the modules actually do download and install the modules. However, when the container first starts to spin up there's a section where the selected modules are installed (and the modules that are not installed are listed). That section is followed by the message that the ColdFusion services are available; then it starts services, security, etc. and uninstalls (and removes) the modules. It then says that no modules are going to be installed because they are not present, and it gives the "services available" message again.
Somehow, it seems that one of the services is uninstalling and removing the module files, and none of the environment variables (or even the setup script) are affecting that process. I thought it might be an issue with the secure setup, but even with that disabled the problem persists. My main question is: what could be causing the modules to be uninstalled?
I was also looking for clarification on a couple of items:
a) All of the documentation I could find said that the .CAR file would be automatically loaded if it was in the /data folder (and in one spot it's referred to as the image's /data folder). That would be at the top level with /opt and /app, right? I couldn't find an existing data folder anywhere.
b) Several of the logs and help functions mention a /docs folder, but I can't find it in the file directory. Would anyone happen to know where I can find them? It seems like that would be helpful for solving this.
Thank you in advance for any help you can give!
I don't know if the Adobe images provide a mechanism to automatically install modules every time the container is rebuilt, but I recommend you look into the Ortus CommandBox-based images. They have an environment variable for the cfpm packages you want installed, and CFConfig, which is much more robust than .car files.
https://hub.docker.com/r/ortussolutions/commandbox/
FYI, I work for Ortus Solutions.

DevOps Simple Setup

I'm looking to start creating proper isolated environments for django web apps. My first inclination is to use Docker. Also, it's usually recommended to use virtualenv with any python project to isolate dependencies.
Is virtualenv still necessary if I'm isolating projects via Docker images?
If your Docker container is relatively long-lived or your project dependencies change, there is still value in using a Python virtual environment. Beyond (relatively) isolating the dependencies of a codebase from other projects and the underlying system (and, notably, the project at a given state), it gives you a way of recording the state of your requirements at a given time.
For example, say that you make a Docker image for your Django app today, and end up using it for the following three weeks. Do you see your requirements.txt file being modified between now and then? Can you imagine a scenario in which you put out a hotpatch that comes with environmental changes?
As of Python 3.3, venv is in the standard library, which means it's very cheap to use, so I'd continue using it, just in case the Docker container isn't as disposable as you originally planned. Stated another way, even if your Docker-image pipeline is quite mature and the version of Python and its dependencies are "pre-baked", it's such low-hanging fruit that, while not strictly necessary, it's worth sticking with best practices.
No, not really, if each Python / Django app is going to live in its own container.

Does it make sense to bake your AMI if you use cfn-init in your cloudformation template?

I am starting to doubt whether I might be missing the whole point of cfn-init. I started thinking that I should bake the AMI used in my cfn template to save time, so it doesn't waste time reinstalling all the packages and I can quickly test the next bootstrapping steps. But my cfn-init commands download awslogs and start streaming my logs when cfn-init is executed from my user data; if I bake that in, my log group will be created, but doesn't the awslogs program need to run a fresh command to start streaming logs? It just does not make sense if that command is baked in. Which brings me to my next question: is cfn-init bootstrapping designed (or at least best practice) to run every time a new EC2 instance is spun up, i.e. you cannot or should not bake in the cfn-init part?
Your doubt is very valid, and it comes down purely to the design approach and working style of the DevOps engineer.
If your cfn-init just installs a few packages, that can very well be baked into the AMI. As you rightly pointed out, it would save time and ensure faster stack creation.
However, what if you would like to install the latest version of the packages? In that case you can just add the latest flag / keyword to the cfn-init packages section. I have used cfn-init to dynamically accept the NetBIOS name of the Active Directory Domain Controller; in that case I wouldn't be able to bake it into the AMI.
Another place where cfn-init helps: assume that you have configured 4 packages to be installed, and a requirement comes up for yet another package to be installed as well. With CloudFormation cfn-init, that is another single line of code to be added. With the AMI approach, a new AMI has to be baked.
This is purely a trade off.

Identifying files contained within a docker image (or Application dependencies)

I'm currently running a project with Docker, and I have this question: is there any way to know exactly which files are contained in a Docker image, without running that image?
Put another way: I'm scheduling Docker containers to run in the cloud based on the binaries/libraries they use, so that containers with common dependencies (common binaries and libraries) are scheduled to run on the same host and thus share these dependencies on the host OS. Is there a way to identify the dependencies / do that?
You could run docker diff on each layer to see what files were added. It is not very succinct, but it should be complete.
Alternatively, your distro and programming language may have tools that help identify which dependencies have been added. For example, Python users can check the output of pip freeze and Debian users can check the output of dpkg --get-selections to see which system packages have been installed.
