GitHub for server automation — SSH certificates
Recently, I came across a Hacker News post on SSH certificates. While SSH certificates probably don’t make sense for someone with only a handful of servers, I decided to try them anyway, just for fun.
SSH Certificate in a Nutshell
It’s 2022. Everybody has started to use at least public key authentication, which is sketched below. The server and the client each hold a private key. The server keeps a list of authorized client public keys in `~/.ssh/authorized_keys`. The client keeps a list of known server public keys in `~/.ssh/known_hosts`. During an SSH handshake, the two parties exchange their public keys and perform authentication.
All is good, except that given N clients and M servers, we have an N × M problem of public key delivery. Using SSH certificates, we can instead ask each client to trust some fixed certificate authority (CA), and each server some other CA, reducing the delivery complexity to N + M.
Certificate authentication is rather simple. A CA is nothing more than an OpenSSH private key. A client can trust a server-side CA (host CA) by adding its public key to its `known_hosts` file. Similarly, a server can trust client-side CAs (user CAs) by adding their public keys to its `sshd` configuration. We then use the host CA to sign a server’s public key, along with an expiration date, its domain name, etc. Likewise, we use the user CA to sign a client’s public key, along with an expiration date, its allowed usernames, and allowed SSH capabilities. At the beginning of an SSH connection, each side presents its certificate to the other, and the other side checks the signature against the public keys of its trusted CAs.
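For illustration, here is roughly what minting the CAs and signing keys looks like with `ssh-keygen` (the file names, identities, and validity periods are my own choices):

```sh
# Create the two CAs (each is just an OpenSSH keypair).
ssh-keygen -t ed25519 -f host_ca -C "host CA"
ssh-keygen -t ed25519 -f user_ca -C "user CA"

# Sign a server's host key: -h marks a host certificate,
# -n restricts it to this domain name, -V sets the validity window.
ssh-keygen -s host_ca -I web1 -h -n web1.example.com -V +4w \
    /etc/ssh/ssh_host_ed25519_key.pub

# Sign a client's key: -n lists the allowed usernames (principals).
ssh-keygen -s user_ca -I alice -n alice,root -V +1d ~/.ssh/id_ed25519.pub
```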
Notice:
- You can, and probably should, use different CAs for client and server certificates.
- You do not need to use SSH certificates for both sides. For instance, you can use certificate authentication only for the client, while server authentication is still done via `known_hosts` (see the sketch after this list).
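Concretely, each trust relationship is a single line of configuration (the paths and domain here are illustrative):

```
# Client side: one line in ~/.ssh/known_hosts, trusting the host CA
# for every server under this domain.
@cert-authority *.example.com ssh-ed25519 AAAA...contents-of-host_ca.pub...

# Server side: in /etc/ssh/sshd_config, trust the user CA;
# HostCertificate makes the server present its own certificate.
TrustedUserCAKeys /etc/ssh/user_ca.pub
HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub
```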
For a concrete setup, there is a fantastic tutorial by Teleport, which guided my little experiment.
Automation
I chose to automate only the renewal of server certificates. Client certificates are renewed manually, because I have just one regularly used Mac; there is little to be gained by automating that with `launchd`.
There are a few platforms that I can use to sign and deliver the certificates:
- My own router. It feels secure, but its reliability is not so good: both power outages and network issues occur quite frequently.
- One of the servers I own. It is much more reliable, but I have the bad habit of wiping servers without caution.
So I ended up with serverless solutions. My first thought was, of course, AWS Lambda. However, AWS Lambda has a few issues:
- AWS is just overly complicated. I would need to learn lots of things just to pass some credentials to my function.
- AWS Lambda has a really minimal runtime. The environment does not even include OpenSSH!
- The editing experience is horrible. I cannot even upload a single file; all I can do is replace my whole package with a zip.
- The code is not version controlled. It is as ad hoc as putting the scripts on one of my servers.
So I should probably put the code on GitHub and continuously deploy to AWS Lambda via GitHub Actions. But wait, why not simply use GitHub Actions itself?
It turns out GitHub Actions is the perfect candidate. Its Ubuntu image comes with batteries included; it even has Swift preinstalled! And the first 2,000 minutes of every month are free.
After hours of experimenting, I arrived at the following CI workflow. Both my host and user CAs are uploaded as encrypted secrets. The workflow’s first job is to create a temporary OpenSSH keypair, sign it, and set up permissions correctly, which gives the runner SSH access to all my servers. After that, it reads a version-controlled file containing the list of servers to renew certificates for. Finally, renewing a certificate is no more than downloading the public key from the server, signing it locally, and uploading it back, roughly as sketched below.
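Expressed as a shell loop (the file names, user, and key type here are assumptions on my part):

```sh
# servers.txt: one hostname per line, version controlled.
while read -r host; do
  # Download the server's host public key.
  scp "root@$host:/etc/ssh/ssh_host_ed25519_key.pub" .
  # Sign it locally with the host CA, valid for four weeks.
  ssh-keygen -s host_ca -I "$host" -h -n "$host" -V +4w \
      ssh_host_ed25519_key.pub
  # Upload the fresh certificate and reload sshd to pick it up.
  scp ssh_host_ed25519_key-cert.pub "root@$host:/etc/ssh/"
  ssh "root@$host" systemctl reload sshd
done < servers.txt
```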
The workflow is scheduled to run every day. I also set the `workflow_dispatch` trigger, so I can start it manually. While debugging, I additionally set it to run on push.
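The trigger section of such a workflow is short; something along these lines (the cron time is arbitrary):

```yaml
on:
  schedule:
    - cron: "0 4 * * *" # every day
  workflow_dispatch:    # manual trigger from the Actions tab
  # push:               # enabled temporarily while debugging
```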
One big mistake I made was thinking that I could directly reference all secrets of a repo through environment variables. I cannot, which seems obvious in hindsight: I need to declare an environment variable on the step that needs the secret, and reference the secret there. This use case is explicitly demoed at the very end of the encrypted secrets documentation, which I should have read more carefully.
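In other words, a step needs an explicit mapping like the following (the secret name `HOST_CA` is my example):

```yaml
- name: Write the host CA key to disk
  env:
    HOST_CA: ${{ secrets.HOST_CA }} # secrets must be mapped explicitly
  run: echo "$HOST_CA" > host_ca
```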
A more time-consuming mistake was related to `echo` and the shell. When writing an environment variable containing the OpenSSH private key to a file using

```sh
echo $SSH_KEY > ~/.ssh/id_rsa
```

one gets a file where all newline characters have become plain spaces, and OpenSSH will report the key as invalid. This cost me nearly an hour to debug, because GitHub kindly masks every occurrence of any secret, so I could not log the content of the dumped key. Little tip: you can log your secret in base64, and GitHub cannot mask that.
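For example (strictly a debugging aid; this defeats secret masking on purpose):

```sh
# GitHub masks the literal secret value in logs, but not its base64 form.
echo "$SSH_KEY" | base64
```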
The issue is due to the way the shell expands unquoted variables: the expansion is split on whitespace (including newlines) into a list of arguments, and `echo` joins its arguments with single spaces. To overcome this, simply quote the expansion:

```sh
echo "$SSH_KEY" > ~/.ssh/id_rsa
```
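The difference is easy to demonstrate locally (assuming an existing multi-line key file):

```sh
SSH_KEY="$(cat ~/.ssh/id_ed25519)"
echo $SSH_KEY | wc -l   # prints 1: newlines were collapsed into spaces
echo "$SSH_KEY" | wc -l # prints the real line count; the key stays intact
```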
One last thing. One of my servers is hosted on Aliyun, and Aliyun sends a warning whenever there is an SSH login from a new location. I have no idea where the GitHub-hosted runners are located, so if I logged into that server from runners directly, I might be spammed with false warnings and let the actually useful ones slip. To solve this, I use `ProxyJump` to route all access through one of my other servers, which has a fixed IP. That server essentially becomes a bastion host.
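In `~/.ssh/config` terms, the setup looks something like this (the host names and IP are made up):

```
Host bastion
    HostName 203.0.113.10 # the server with a fixed IP
    User ci

Host *.example.com
    ProxyJump bastion     # reach every other server via the bastion
```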