Building deterministic, reproducible assets with Sprockets
This is a story that begins with airplane wifi, and ends with the recognition that everything is related in web development.
While on slow airplane wifi, I was syncing this blog’s git repo, and it was taking forever. That was surprising because this blog is mostly text, which I expected shouldn’t require many bits to transfer for Git. Looking more deeply into it (I had a 4-hour flight), I discovered that the vast majority of the bits were in the git branch of built assets that gets deployed to GitHub Pages (gh-pages
) when I build my Rails app into a static site with Parklife. And the bits in that branch were assets (css, javascript, and a few icons and fonts) built by Sprockets, whose contents were changing every time the blog was built and published. What changed?
- Sprockets creates a file manifest that is randomly named
".sprockets-manifest-#{SecureRandom.hex(16)}.json"
. - Within the file manifest, there is an entry for every file built by Sprockets, that includes that original asset’s
mtime
—when the file on the filesystem was last touched, even if the contents didn’t change. - By default, Sprockets generates gzipped
.gz
copies of compressible assets, and it includes the uncompressed file’smtime
in the gzipped file’s header, producing different binary content even though the compressed payloads’ contents didn’t change.
Do I need that? Let’s go through it.
The Sprockets Manifest
The Sprockets Manifest is pretty cool (I mean public/assets/.sprockets-manifest-*.json
, not app/assets/config/manifest.js
which is different). The manifest is how Sprockets is able to add unique cache-breaking digests to each file while still remembering what the file was originally named. When building assets on a server with a persisted filesystem, Sprockets also uses the manifest to keep old versions of files around: bin/rails assets:clean
will keep the last 3 versions of built assets, which is helpful for blue-green deployments. Heroku also has a bunch of custom stuff powered by this too to make deployments seamless.
But none of that is applicable to me and this blog, which gets built from scratch and committed to Git. Or for that matter, when I build some of my other Rails apps with Docker; not unnecessarily busting my cached file layers would be nice 💅
The following is a monkeypatch, which works with Sprockets right now but I’m hoping to ultimately propose as a configuration option upstream (as others have proposed).
# config/initializers/sprockets.rb
module SprocketsManifestExt
def generate_manifest_path
# Always generate the same filename
".sprockets-manifest-#{'0' * 32}.json"
end
def save
# Use the epoch as the mtime for everything
zero_time = Time.at(0).utc
@data["files"].each do |(_path, asset)|
asset["mtime"] = zero_time
end
super
end
Sprockets::Manifest.prepend self
end
Now, if you’re like me (on a plane), you might be curious about why the obsessive tracking of mtime
. I have worked alongside several people in my career with content-addressable storage obsessions. The idea being: focus on the contents, not the container. And mtime
is very much a concern of the container. But Sprockets makes the case that “Compiling assets is slow” so I can see it’s useful to quickly check when the file was modified, in a lot of cases… but not mine.
Let’s move on.
GZip, but maybe you don’t need it
So… everything in web development is connected. While wondering why new copies of every .gz
file were being committed on every build, I remembered what my buddy Rob recently did in Rails: MakeActiveSupport::Gzip.compress
deterministic.
I have some tests of code that uses
ActiveSupport::Gzip.compress
that have been flaky for a long time, and recently discovered this is because the output of that method includes the timestamp of when it was compressed. If two calls with the same input happen during different seconds, then you get different output (so, in my flaky tests, they fail to compare correctly).
GZip takes a parameter called mtime
, which is stored and changes the timestamp of the compressed file(s) when they are uncompressed. It changes the content of the gzipped file, because it stores the timestamp in the contents of the file, but doesn’t affect the mtime of the gzipped file container.
So in the case of Sprockets, if the modification date of the uncompressed asset changes, regardless of whether its contents have changed, a new and different (according to git or Docker) gzipped file will be generated. This was really bloating up my git repo.
Props to Rack maintainer Richard Schneeman who dug further down this hole previously, admirably asking the zlib group themselves for advice. The commentary made a mention of nginx docs, which I assume is for ngx_http_gzip_static_module
which says:
The files can be compressed using the gzip command, or any other compatible one. It is recommended that the modification date and time of the original and compressed files be the same.
But that’s not GZip#mtime
value stored inside the contents of the gzip file, that’s the mtime of the .gz
file container. Sprockets also sets that, with File.utime
.
It’s easy enough to patch the mtime to the “unknown” value of 0
:
# config/initializers/sprockets.rb
module SprocketsGzipExt
def compress(file, _target)
archiver.call(file, source, 0)
nil
end
Sprockets::Utils::Gzip.prepend self
end
…though if you’re in my shoes, you might not even need these gzipped assets. afaict only Nginx makes use of them with the non-default ngx_http_gzip_static_module
module; Apache requires some complicated RewriteRules; Puma doesn’t serve them, CDNs don’t request them. Maybe turn them off? 🤷
# config/initializers/sprockets.rb
Rails.application.configure do
config.assets.gzip = false
end
Fun fact: that configuration was undocumented
Maybe please don’t even pass mtime
to gzip for web assets
All of this stuff about file modification dates reminded me of another thing I had once previously rabbit-holed on, which was poorly behaved conditional requests in RSS Readers. The bad behavior involved inappropriately caching web requests whose Last-Modified HTTP header changed, but their contents didn’t. And how do webservers generate their Last-Modified header value? That’s right, file mtime
, the one that can be set by File.utime
!
…but not the one set by GZip#mtime=
. I cannot find any evidence anywhere that value, in the contents of the gzip file matters. Nada. All it does is make the gzip file’s contents be different, because of that one tiny value being included. I can’t imagine anything cares about the original mtime when it’s unzipped, that wasn’t already transmitted via the Last-Modified HTTP header. What am I missing?
Of the evidence I have, it seems like developers set GZip#mtime=
… because it’s an option? I couldn’t find a reason in the Sprockets history. I noticed that Rack::Deflater does the same for reasons I haven’t figured out in their history either. This behavior probably is not busting a lot of content-based caches unnecessarily, but it probably does some. So maybe don’t do it unless you need to.