Something I’ve been pushing (and this is pretty much a truism amongst anyone who’s looked at “Cloud”) is the idea of automation. It doesn’t matter if you’re just treating the cloud as an outsourced datacenter or if you’re doing full 12-factor dynamically scalable apps. Automation is the key to consistency and control.
So, ideally, this means your automation system is the “single point of truth” for your estate. Whether you use ansible or chef or (saints preserve us) cfengine, your configuration file explicitly defines your target state. You can learn everything from that.
But is this true?
It’s nice in theory but, as is always the case, practice may be different.
Your source of truth may contradict itself.
Now with cfengine this is easy to see; one promise could say “X is true” and another promise could say “!X is true”. cfengine will complain that these rules don’t converge (assuming anyone reads the logs) and your server is in an unknown state. This is simple.
But there’s a more subtle failure mode.
Let’s say we use ansible to build our environment. The build process calls a sequence of playbooks to take your machine from raw state through to final configuration. So far, so good.
Now let’s say each playbook should be in its own git repo; after all, the playbook that installs and configures apache doesn’t really need to impact the playbook for postfix. It makes sense to separate out these playbooks into different areas; different teams may be responsible; different access controls can be applied (you don’t want the SMTP team to impact your web servers).
OK, that’s a contrived example, but you can see how it goes; the team building out your Postgres database automation shouldn’t necessarily have the ability to change the configuration of your OpenLDAP servers.
But here’s where things get complicated…
Sometimes there is overlap. Your apache automation may configure the addresses of your single sign on servers. Your nginx configuration may require the same data. If they’re in different repos, then how do you ensure consistency?
Your single point of truth (“this is the single signon server”) may not be consistent.
Iteration
There’s no simple answer to this. How you factor your code repositories, how you factor your automation, how you build systems will evolve over time. But be aware; if you define a variable (“single signon server”) for one playbook, maybe it’s also useful elsewhere? Define a global namespace?
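One way to at least notice the problem is to compare the variables each repo defines. Here is a minimal sketch in Python, assuming the variables from each playbook repo have already been loaded into plain dictionaries; the repo names, variable names and hostnames are all made up for illustration, and `find_conflicts` is a hypothetical helper, not part of ansible.

```python
from collections import defaultdict


def find_conflicts(repo_vars):
    """Return variables that have different values in different repos.

    repo_vars maps a repo name to a dict of that repo's variables.
    The result maps each conflicting variable name to {repo: value}.
    """
    seen = defaultdict(dict)  # variable name -> {repo name: value}
    for repo, variables in repo_vars.items():
        for name, value in variables.items():
            seen[name][repo] = value
    # Compare by repr() so unhashable values (lists, dicts) also work.
    return {
        name: by_repo
        for name, by_repo in seen.items()
        if len({repr(value) for value in by_repo.values()}) > 1
    }


# Hypothetical data, as if read from each repo's variables file.
repo_vars = {
    "apache-playbook": {"sso_server": "sso1.example.com", "listen_port": 443},
    "nginx-playbook": {"sso_server": "sso2.example.com"},
}

conflicts = find_conflicts(repo_vars)
for name, by_repo in sorted(conflicts.items()):
    print(name, by_repo)
```

This only detects a disagreement after the fact; it doesn't tell you which value is right. A shared, global definition avoids the duplicate in the first place.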
Laziness
I spotted this in my own tooling. I have a script that will build my DNS and DHCP configuration. Given an entry in a config file it will build A, AAAA and PTR records for the machine.
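To give a feel for what that kind of script does, here is a rough sketch in Python. This is not my actual script; the function name, the tuple layout and the example addresses are assumptions for illustration. It uses the standard library `ipaddress` module to pick the record type and to build the reverse (PTR) name.

```python
import ipaddress


def records_for(hostname, addresses):
    """Build (name, type, value) tuples for one host.

    Each address yields a forward record (A for IPv4, AAAA for IPv6)
    and a matching PTR record in the reverse zone.
    """
    records = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        rtype = "A" if ip.version == 4 else "AAAA"
        records.append((hostname, rtype, str(ip)))
        # reverse_pointer gives the in-addr.arpa / ip6.arpa name.
        records.append((ip.reverse_pointer, "PTR", hostname))
    return records


# Hypothetical host using documentation-only address ranges.
records = records_for("web1.example.com", ["192.0.2.10", "2001:db8::10"])
for name, rtype, value in records:
    print(name, rtype, value)
```

The point is that a single config entry drives every record, so the forward and reverse data can't drift apart.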
I noticed, today, that one of my domains isn’t controlled this way. It has A and AAAA records that are hard-coded. I’m sure this will bite me in the bum down the line (when the primary server fails and I need to fail over to a secondary). Will I remember this? Or should I fix my automation? The answer is obvious…