Farewell Brompt

I’m planning to shut down Brompt, which I previously wrote about in 2008, 2011, and 2022. I archived the code on GitHub.

Let’s do a final confetti drop 🎉 together.

Farewell 👋 Brompt is shutting down

I have some sad news to share: I’m planning to shut down this service, Brompt, at the end of the month (February 2024).

Shutting down Brompt means that you’ll no longer receive these automated reminders for your blog or writing. I’ve been running Brompt since 2008 and unfortunately, I haven’t been able to make it sustainable. You’re one of about 80 people who are still using the service, though I’m never sure if you or anyone ever opens the emails regularly.

Regardless of Brompt shutting down, I hope you’re doing well and I’d love to stay in touch. Send me an email at [email protected] or read my own blog at https://island94.org

All the best,

Ben (the person who made Brompt)

Screenshots from Brompt

Replacing Devise with Rails has_secure_password and friends

I love the Devise user-authentication gem. I’ve used it for years, and I recently moved off of it in one of my personal apps and replaced it with Rails’s built-in has_secure_password and generates_token_for and a whole bunch of custom controllers and helpers and code that I now maintain myself. I do not recommend this! User authentication is hard! Security is hard!

And… maybe you need to walk the same path too. So I want to share what I learned through the process.

Ok, so to back up, why did I do this?

  • Greater compatibility with Rails main. My day job runs Rails main, and I’m more frequently contributing to Rails development; I’d like to run my personal projects on Rails main too. When I looked back on upgrade-blocking gems, Devise (and its dependencies, like Responders) topped my list.
  • More creative onboarding flows. I’ve twisted Devise quite a bit (it’s great!) to handle the different ways I want users to be able to register (elaborate onboarding flows, email-only subscriptions, optional passwords, magic logins). I’ve already customized or overloaded nearly every Devise controller and many model methods, so it didn’t seem like such a big change anyway.
  • Hubris. I’ve built enterprise auth systems from scratch, managed the Bug Bounty program, and worked with security researchers. I have seen and caused and fixed some shit. (Fun fact: I have been paid for reporting auth vulnerabilities on the bug bounty platforms themselves.) I know that even if it’s not a bad idea for me, it’s not a great idea either. Go read all of the Devise-related CVEs; seriously, it’s a responsibility.

That last bit is why this blog post will not be like, “Here’s everything you need to know and do to ditch Devise.” Don’t do it! Instead, here’s some stuff I learned that I want to remember for the next app I work on.

A test regime

I went back through all of my system tests for auth, and here is a cleaned-up, though not exhaustive, list of my scenarios and assertions. It seems like a lot. It is! There are also unit tests for models and controllers and mailers, and separate API tests for the iOS and Android apps. Don’t take this lightly! (Remember, many of these are specific to my custom onboarding flows.)

  • When a new user signs up for an account
    • Their email is valid, present and stored; password is nil.
    • They are not confirmed
    • They receive a confirmation email
    • If not confirmed, registering again with the same email resends the confirmation email but does not leak account presence
  • If an account with that email already exists and is confirmed, a “you already have an account” email is sent and account presence is not leaked.
    • Following the link in the confirmation email confirms the new account and redirects to the account setup page.
  • When a user sets up their account
    • They can assign a username and password
    • A password cannot be assigned if a password already exists
    • A username cannot be assigned if a username already exists
    • If a username and password already exist, the setup page redirects to the account update page
    • The account update page redirects to the setup page if a username or password does not yet exist
    • Signing in with an unsetup account redirects to setup page
    • Resetting password with an unsetup account redirects to setup page
    • Adding a password invalidates reset-password links.
  • When a user updates their account
    • The current password is required to update email, username, or password.
    • When the email address is changed, a new confirmation email is sent out to that email address.
    • An email change confirmation can be confirmed with or without an active session.
    • If the email address is already confirmed by a different account, send the “you already have an account” email and do not leak account presence.
    • Multiple accounts can have the same unconfirmed email address.
  • When a user performs a password reset
    • Can’t be accessed with an active session
    • Link is invalidated after 20 minutes, or when the email or password changes.
    • Can be performed on an unsetup account
    • Confirms an email but not an email change
    • Signs in the user
    • Does not leak account presence
    • Is throttled to only send once a minute.
  • When a user performs or resends an email confirmation
    • Can be accessed with an active session.
    • Cannot resend confirmation of an email change without an active session.
    • Link is invalidated after 20 minutes, or when email, unconfirmed email, confirmed at, or password changes.
    • Signs in the user
    • Does not leak account presence
    • Is throttled to only send once a minute.
    • When user is already confirmed, send them an email with a link to reset their password
  • When a user signs into a session
    • Requires a valid email or username, and password
    • Cannot sign in with a nil, blank, or absent password param (unsetup account)
    • Session is invalidated when email or password changes.
    • Does not leak account presence with missing or invalid credentials
    • Redirects to the session[:return_to] path if present, otherwise the root path.

Using has_secure_password

This was a fairly simple change. I had to explicitly add bcrypt to the Gemfile, and then add to my User model:

# models/user.rb
alias_attribute :password_digest, :encrypted_password
has_secure_password :password

I’ll eventually rename the database column, but this was a zero-migration change.

Also, you might need to use validations: false on has_secure_password and implement your own validations if you have custom onboarding flows like me. Read the docs and the Rails code.

When authenticating on sign in, you’ll want to use User.authenticate_by(email:, password:), which is intended to avoid timing attacks.
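To illustrate why authenticate_by exists, here’s a framework-free sketch of the idea (this is not Rails’s implementation, which uses bcrypt; the SHA-256 digests and names here are purely illustrative): do the same amount of digest work whether or not the account exists, so response timing doesn’t reveal account presence.

```ruby
require "openssl"

USERS = { "ada@example.com" => OpenSSL::Digest::SHA256.hexdigest("correct-horse") }
DUMMY_DIGEST = OpenSSL::Digest::SHA256.hexdigest("dummy-password")

# Sketch of the idea behind User.authenticate_by: hash the submitted password
# and compare it against either the real digest or a dummy digest, so a
# missing account takes roughly the same time as a wrong password.
def authenticate_by(email:, password:)
  digest = USERS.fetch(email, DUMMY_DIGEST)
  match = OpenSSL.secure_compare(digest, OpenSSL::Digest::SHA256.hexdigest(password))
  match && USERS.key?(email) ? email : nil
end
```

The early-return version (look up the user; bail immediately if not found) is what leaks presence through timing.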

Using generates_token_for

The generates_token_for methods are new in Rails 7.1 and really nice. They create a signed token containing the user id and additional matching attributes, and nothing needs to be stored in the database:

# models/user.rb
generates_token_for :email_confirmation, expires_in: 30.minutes do
  [confirmed_at, email, unconfirmed_email, password_salt]
end

generates_token_for :password_reset, expires_in: 30.minutes do
  [email, password_salt]
end

I’ll explain that password_salt in a bit.

To verify the token, you want to use something like User.find_by_token_for(:email_confirmation, value_from_the_link).
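Under the hood, the mechanics look roughly like this. This is a simplified, framework-free sketch, not Rails’s actual implementation (Rails uses a MessageVerifier; the HMAC scheme and names below are illustrative): the token signs the user id plus the attributes that must still match when the token is redeemed, so changing the email or password salt invalidates it.

```ruby
require "openssl"
require "json"
require "base64"

SECRET = "application-secret" # stand-in for the app's secret key base

# Sign the id plus the attributes that must still match at redemption time
# (here: email and password salt, like the :password_reset token above).
def generate_token_for(user, purpose)
  payload = Base64.urlsafe_encode64(JSON.dump([user[:id], purpose, [user[:email], user[:password_salt]]]))
  "#{payload}.#{OpenSSL::HMAC.hexdigest("SHA256", SECRET, payload)}"
end

def find_by_token_for(users, purpose, token)
  payload, hmac = token.to_s.split(".")
  return nil unless hmac && OpenSSL.secure_compare(hmac, OpenSSL::HMAC.hexdigest("SHA256", SECRET, payload))

  id, token_purpose, attributes = JSON.parse(Base64.urlsafe_decode64(payload))
  user = users[id]
  # The token is only valid if the signed attributes still match the record.
  user if user && token_purpose == purpose.to_s && attributes == [user[:email], user[:password_salt]]
end
```

The payoff is the same as with the real API: the token self-expires when the underlying attributes change, with no database column required.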

btw, security: when you put a link in an email message, you can only use a GET, because emails can’t reliably submit web forms (some clients can, but it’s weird and unreliable). So your link is going to look like https://example.com/account/reset_password?token=blahblahblahblahblah. If there are any links to 3rd-party resources like script tags or off-domain images, you will leak the token through the Referer header when the page is loaded with the ?token= in the URL. Devise never fixed this (😱). What you should do is take the value out of the query param, put it in the session, redirect back to the same page without the query parameter, and use the session value instead. (Fun fact: this is a bug bounty that got me paid.)
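Here’s a framework-agnostic sketch of that pattern (the method and session-key names are mine, not from any library): the first GET stashes the token in the session and redirects to the same path without the query string; the follow-up request reads the token from the session.

```ruby
# Sketch of the token-stashing pattern: the first request moves ?token= into
# the session and redirects; the second request reads it from the session,
# so the token never sits in a URL that could leak via the Referer header.
def handle_reset_request(path:, params:, session:)
  if (token = params.delete("token"))
    session["password_reset_token"] = token
    { redirect_to: path } # same page, minus the query parameter
  else
    { render: path, token: session["password_reset_token"] }
  end
end
```

In a Rails controller this would be a before_action that redirects when params[:token] is present.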

Authenticatable salts

Here’s where I explain that password_salt value.

There’s several places I’ve mentioned where tokens and sessions should be invalidated when the account password changes. When bcrypt stores the password digest in the database, it also generates and includes a random “salt” value that changes every time the password changes. Comparing that salt is a proxy for “did the password change?” and it’s safer to embed that random salt in cookies and tokens instead of the user’s hashed password.

Devise uses the first 29 characters of the encrypted password (which is technically the algorithm, cost and salt):

# models/user.rb
def authenticatable_salt
  encrypted_password[0, 29] if encrypted_password
end

But it’s also possible to simply get the salt. I dunno if the difference matters (tell me!):

# models/user.rb
def password_salt
  BCrypt::Password.new(password_digest).salt[-10..] if password_digest
end
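For context on what those slices are grabbing: a bcrypt digest is a fixed-layout string, so the pieces can be pulled out with plain string slicing. Here’s a sketch using an example digest (the digest value below is illustrative, not from any real account):

```ruby
# A bcrypt digest has a fixed layout:
#   $2a$12$<22-char salt><31-char checksum>
#   \__7__/
digest = "$2a$12$R9h/cIPz0gi.URNNX3kh2OPST9/PgBkqquzi.Ss7KIUgO2t0jWMUW"

algorithm_cost_and_salt = digest[0, 29] # what Devise's authenticatable_salt compares
salt_only = digest[7, 22]               # just the 22 salt characters
```

Either slice changes whenever the password (and therefore the salt) changes, which is all the invalidation scheme needs.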

A nice session

There’s a lot to write about creating sessions and remember-me cookies that I won’t be writing here. The main thing to note is that I’m storing and verifying both the user id and their password salt in the session; that means all of their sessions are invalidated when they change their password:

# app/controllers/application_controller.rb
UNASSIGNED = Module.new
USER_SESSION_KEY = "_yoursite_user".freeze

def initialize
  @_current_user = UNASSIGNED
  super
end

def sign_in(user)
  session[USER_SESSION_KEY] = [user.id, user.password_salt]
end

def current_user
  return @_current_user unless @_current_user == UNASSIGNED

  # Check if the user was already loaded by route helpers
  @_current_user = if request.env.key?("current_user")
                     request.env["current_user"]
                   else
                     user_id, password_salt = session[USER_SESSION_KEY]
                     User.find_by_id_and_password_salt(user_id, password_salt) if user_id && password_salt
                   end
end

In doing this project I learned that Rails’s cookies will magically serialize/deserialize arrays and hashes. I’ve been manually and laboriously converting them into JSON strings for years 🥵

btw, if that UNASSIGNED stuff is new to you, go read my Writing Object Shape friendly code in Ruby.

Rotating active sessions

This is a little extra but I wanted the switchover to be transparent to users. To do so, I read from Devise’s active sessions and then create a session cookie using the new format. It looks something like this:

# controllers/application_controller.rb
before_action :upgrade_devise_session

def upgrade_devise_session
  # Devise session structure: [[USER_ID], "AUTHENTICATABLE_SALT"]
  if session["warden.user.user.key"].present?
    user_id = session["warden.user.user.key"].dig(0, 0)
    user_salt = session["warden.user.user.key"].dig(1)
  elsif cookies.signed["remember_user_token"].present?
    user_id = cookies.signed["remember_user_token"].dig(0, 0)
    user_salt = cookies.signed["remember_user_token"].dig(1)
  end
  return unless user_id.present? && user_salt.present?

  # Depending on your deploy/rollout strategy,
  # you may need to retain and dual-write both
  # Devise and new user session values instead of this.

  user = User.find_by(id: user_id)
  sign_in(user) if user && user.devise_authenticatable_salt == user_salt
end

Route helpers

Devise mixes some nice helper methods into Rails’s routing DSL, like authenticated; they’re even necessary if you need to authenticate Rails Engines that can’t easily access the app’s ApplicationController methods. Here’s how to recreate them using Route Constraints and monkeypatching ActionDispatch::Routing::Mapper (that’s how Devise does it):

# app/constraints/current_user_constraint.rb
class CurrentUserConstraint
  def self.matches?(request)
    new.matches?(request)
  end

  def initialize(&block)
    @block = block
  end

  def matches?(request)
    current_user = if request.env.key?("current_user")
                     request.env["current_user"]
                   else
                     user_id, password_salt = request.session[USER_SESSION_KEY]
                     request.env["current_user"] = User.find_by_id_and_password_salt(user_id, password_salt) if user_id && password_salt
                   end

    if @block
      @block.call(current_user, request)
    else
      current_user.present?
    end
  end
end

# config/routes.rb
module ActionDispatch
  module Routing
    class Mapper
      def authenticated(&)
        scope(constraints: CurrentUserConstraint, &)
      end

      def unauthenticated(&)
        scope(constraints: CurrentUserConstraint.new { |user| user.blank? }, &)
      end

      def admin_only(&)
        scope(constraints: CurrentUserConstraint.new { |user| user&.admin? }, &)
      end
    end
  end
end

Rails.application.routes.draw do
  # ...
  authenticated do
    resources :special_somethings
  end
end
Because routing happens before a controller is initialized, the current user is put into request.env so that the controller won’t have to query it a second time from the database. This could also be done in a custom Rack Middleware.

If you want to put stuff into not-the-session cookies, those cookies can be accessed via request.cookie_jar, e.g., request.cookie_jar.permanent.encrypted["_my_cookie"].

Closing thoughts

That was all the interesting bits for me. I also learned quite a bit poking around Dave Kimura’s ActionAuth (thank you!), and am thankful for the many years of service I’ve gotten from Devise.

Two stories about technical debt, I guess

One activity I don’t enjoy very much: griping about “technical debt”; that label just never seems descriptive enough. And the things people gripe about seem to mostly fall into:

  • Deferred maintenance: we haven’t updated to the latest version of X, or it’s written in A language/framework but now everyone is much more familiar with B, or we know M raises an exception at N when they O while P
  • Just not the quality we expect, because reasons. It’s ok, we’re all winging it.

…and those categories crowd out the real sweaty-palms stuff, the “we did do a good job but we know more now” that I think is the real deal. I can talk about that.

I’ve never found the particular post/video/talk? again (I’ve looked!), but it described technical debt as like: the distance between the current understanding of the business domain and the technical implementation’s modeling of the business domain. It had a chart that has stuck in my mind; it looked something like this:

A chart that shows the coming together and divergence of business and technical domain knowledge

That definition of “technical debt” clicked for me. For myself and the high performing teams I’ve worked with, we’re most productive, pushing value, when we’re asking ourselves: does this make sense? Given what we know now, and what we know about where we’re going, do the objects and their interactions conceptually map to how the business is being talked about by the non-technical people? Are we adapting at the same rate we’re learning? If yes, we’re golden; if sorta with some tweaks that’s good; when no… that’s bad, disordered, schismatic: carrying a burden of always translating between the language and models in the technical system and the language and concepts in the business domain. That sucks!

Aside: There’s a funny/terrifying/random thing this makes me think of: “We’ll Never Make That Kind of Movie Again”: An oral history of The Emperor’s New Groove, a raucous Disney animated film that almost never happened. One of the screenwriters describes the process of making an animated film:

In a normal four-year process, you’ve got meetings, you’ve got development people going, “What if the girl was a boy? What if the bird was a flower?” And then you have to run all those ideas.

The software projects I’ve worked on are a conveyor belt of “What if the bird was a flower?” decision making and idea running. Extend that object, repurpose this method, bolt on that, change this, swap that, rename this, better leave a comment on that. When it’s going well, it doesn’t matter that it was a bird yesterday and a flower today… as long as it’s not retaining so much birdliness that it compromises its flowerability. It’s when you’re having to remember and callback “hey, well, um, it was once a bird, fyi, that’s why it’s so much trouble to do this flower stuff”, then you’re sunk.

It’s a bad sign when forward development requires telling stories about the past.

Here’s two stories…

The isolated component

When I was working at a website hosting startup, we had one component that I could just never get the support to integrate into the larger application. It was the form that would create a new website project for the user. When the startup began, the original technical architecture was a bunch of interlinked but separated components: User Registration, User Account Management, Website Creation, Website Management, Organization Management, Agency/Reseller Management, etc. It made sense early as the business was figuring out what worked as a business, and then during my tenure we brought those components together into a unified application. Well, almost all of them.

There was a lot of give and take in the product integration; sometimes me and the other engineers would just do it and other times we’d defer it until there was a particular feature that necessitated it, and then we’d include that extra work in our estimates. It frequently took a couple passes of code cleanup and bringing it onto the design system, and that was ok. That’s the job!

That last, last, last component of Website Creation eluded us though, and it was outside our control. At that point, development was transitioning from “engineering led” to “product management and design led” and I had been instructed that engineering couldn’t make any product changes unless they were connected to an active PRD (Product Requirements Document) controlled by the PMs.

There was plenty of demand to make changes to Website Creation: smooth out the first-time account registration flow into creating a website; allow the user to do some activities in parallel while the website was spinning up like inviting team members or perusing billing levels; decouple the concept of “a website is a marketing strategy” from “upload your code and have some containers and a load balancer provisioned” so that non-developers could still plan a website without invoking all the technical bits.

But not enough appetite to get it done.

Of “technical debt”: no one except our little engineering team maintaining the frontend thought anything special of Website Creation. It wasn’t obvious unless you carefully watched for the hard browser refresh, or noticed the navigation bar change slightly. Conceptually it was a unified product (heck, I even remember a product launch called “One”), but the work hadn’t yet been done on the application side and we engineers carried the burden.

It was funny because every time a product change that touched Website Creation was discussed, the same thing happened:

PM: Your effort estimate seems really high. What’s going on?
Me: Well, this involves Website Creation and it’s still its own component and can’t access any of those other systems that are necessary to build the feature. We’d need to bring it into the rest of the application. It’s definitely possible! There’s a few open questions with product and design implications, so we’d need to work together on that.
PM: Oh, well, huh, I didn’t expect that. Hmm, we don’t have the bandwidth for all that. Let’s pass on it for now.

This happened multiple times! It was weird too because the particular project being planned would be spiked, and then the engineering team would have to wait around while a new project was spun up and that likely took just as long as it would have taken to do the work on the Website Creation component. If I hadn’t been explicitly told to sit on my hands otherwise, I would have probably just done this as off-the-books, shucks-I-didn’t-think-it-would-be-a-big-deal, shadow-work.

It never got done during my tenure; I think they later decided the problem was that the whole thing wasn’t written in React 🤷

The campaign message

When I was a developer on GetCalFresh, the functionality with perpetually unexpected estimations was “Campaign Messages”.

GetCalFresh would help people apply for their initial food stamp application, at which point it would be sent to the applicant’s county for processing. Over the next 14 to 28 days the county would schedule an in-person or telephone interview, and request various documents like pay stubs and rental leases, and the applicant would have to navigate all of that. (The administrative state suuuuucks!) To help, GetCalFresh would send a sequence of email and/or SMS messages to the applicant over this time period explaining what was needed and what to expect next. A messaging campaign, y’know, of “campaign messages.”

When GetCalFresh was first laid down in code, there were two “types” of campaign messages: General and Expedited. Under a couple of different circumstances, if an applicant is homeless or at risk of being homeless, or has no cash-on-hand to purchase food, their application is eligible for expedited processing and thus we’d send them a message sequence over ~14 days; everyone else would receive a message sequence over ~28 days. We were sending the same messages, just on a different schedule.

So when we engineers were then asked to customize a message, like “if the applicant is a student, add this description about their special eligibility”… we just if/elsed it on in there. Oh, now this county is piloting a special process, let’s make a little carve out for them too and swap this message. Still, same sequence, just tweaks, right? Well, all those small tweaks and carve-outs build up, and all of a sudden we’re having to ask “ok, so you want us to rewrite this one itty bitty message, well we also need you to specify what it should be for students, who do and don’t qualify for County X’s special process too”. It got twistier and twistier. And when requests like “don’t send that message in this special circumstance” or “add a totally new message but just as this one-off” came in, we’d be like “totally possible! and that’s gonna take more work than you think!”

GetCalFresh had the best Product Managers I have ever worked with in my life, and we still got locked into a similar loop as the last story: we’d do our estimation with the PMs, it exposed the fruit hung more high than low, and the change would be deprioritized. I think the PMs got it, but the challenge was that the other folks, client support and the folks coordinating with the counties and the datascience team, would be like “we heard that Engineering doesn’t want to build it.” So weird! Not engineering’s call! (Aside: I spent so much time coaching non-technical stakeholders on how to work in a PM-led system, but always more coaching to do.)

I remember making a Google Doc to analyze why and explain how the system we initially designed for (same sequence of messages with different schedules) didn’t match our understanding of the problem today. The doc listed out all of the different reasons we knew of why we might customize the message. It was at least 10 bullet points. And there were a lot of other learnings too: initially we designed around customizing for just 3 major county systems (State Automated Welfare Systems - SAWS), but later found ourselves doing county-by-county customizations (California has 58 counties). I advocated for configuring each county in its own object despite the scale brainworms demanding a singular and infinitely abstracted model (I call these things “spreadsheet problems” when you can simply list the entire domain in a spreadsheet).

Of “technical debt”, I still can replay in my brain the deliberate mental spatial shift of imagining the campaign model as a 2-column array (General and Expedited) with 10+ rows of logic shifts and then flopping it onto its side to make a change. All that mental context has a huge carrying cost that all of us had to budget for when making a change.

During my tenure, we never did the significant reworking to how campaign messages were implemented, though some bold colleagues did their best to make changes as safe and simple as possible with DSLs and test macros. Thank you! 🙏

That’s it

Sorry, no great lessons here. Just stories to share (“ideally you’d try to see it as a funny story you can tell friends rather than a hurtful snub that needs to bother you”). I mentioned coaching folks on working with PMs, and I think the frequent advice I gave non-technical folks probably holds true for engineers too:

Always have your top 3 ranked work items ready to go when talking to decision makers (the PM?). Don’t bring new stuff unless it changes those top 3.

(I mean sure, share context and adapt, but allow yourself no doubt that you’ve clearly and consistently communicated what your top priorities are before they get dumped in with everyone else’s.)

(But also, if you’re an engineer and you can and no one is breathing down your neck, simply get it done and celebrate. The PM doesn’t have to lead everything. You can do it! 👍)

The answer is in your heap: debugging a big memory increase in Ruby on Rails

I recently participated in an interesting series of debugging sessions tracking down the source of a large increase in memory when upgrading a Rails application. We ultimately tracked down the cause using John Hawthorn’s Sheap heap analyzer and successfully submitted a patch to Rails. I thought it was interesting enough to write up because maybe the general approach to debugging memory issues would be helpful (and this is the kind of stuff that I very quickly forget unless I write it down).

How it started: Reddit

Lots of people ask for help on r/rails, and it can be difficult to debug at a distance. This time it was a little different. I recognized the username’s owner, Jonathan Rochkind, because he’s been a familiar and helpful face in GoodJob’s discussions and I’ve sponsored his aggregator Rubyland News. The observed problem was that after upgrading from Rails 7.0 to Rails 7.1, their application’s memory footprint increased by about 25%. Weird!

Working the problem

We worked through a bunch of questions:

  • Was the memory increase at startup or over time? Not at boot, but memory increased very quickly.
  • Did anything change with Puma configuration? Nope.
  • Get set up with derailed_benchmarks, and create a bin/profile Rails binstub to make it easy to boot into a production-like configuration for profiling. Here’s what my very polished one looks like:

    #!/usr/bin/env ruby
    # This file is a wrapper around the rails and derailed executables
    # to make it easier to boot in PRODUCTION mode.
    # Usage: bin/profile [rails|derailed] [command]
    ENV["RAILS_ENV"] = ENV.fetch("RAILS_ENV", "production")
    ENV["RACK_ENV"] = "production"
    ENV["RAILS_LOG_TO_STDOUT"] = "true"
    ENV["FORCE_SSL"] = "false"
    ## ^^ Put ENV to boot in production mode here ^^
    executable = ARGV.shift
    if executable == "rails"
      load File.join(File.dirname(__FILE__), "rails")
    elsif executable == "derailed"
      require 'bundler/setup'
      load Gem.bin_path('derailed_benchmarks', 'derailed')
    else
      puts "ERROR: '#{executable}' is not a valid command."
      puts "Usage: bin/profile [rails|derailed]"
      exit 1
    end

We flailed around with Derailed Benchmarks, as well as John Hawthorn’s Vernier profiler’s memory mode (aside: John Hawthorn is doing amazing stuff with Ruby).

At this point, we had a general understanding of the application memory footprint, which involved a large number of model instances (Work), many of which contained a big blob of json. For some reason they were sticking around longer than a single web request, but we weren’t able to find any smoking guns of like, memoized class instance variables that were holding onto references. So we kept digging.

You can read along to all of this here: https://github.com/sciencehistory/scihist_digicoll/issues/2449

Analyzing memory with Sheap

I used Derailed Benchmark’s perf:heap to generate heap dumps (also possible using rbtrace --heapdump), and then plugged those into Sheap. Sheap is a relatively new tool, and where it shines is being interactive. Instead of outputting a static report, Sheap allows for exploring a heap dump (or a diff of two dumps, to identify retained objects) and asking questions of it. In our case: what objects are referencing this object and why is it being retained?

# $ irb
require './lib/sheap.rb'

diff = Sheap::Diff.new("/Users/bensheldon/Repositories/sciencehistory/scihist_digicoll/tmp/2023-12-07T13:24:15-08:00-heap-1.ndjson", "/Users/bensheldon/Repositories/sciencehistory/scihist_digicoll/tmp/2023-12-07T13:24:15-08:00-heap-2.ndjson")

# Find one of the Work records that's been retained
model = diff.after.class_named("Work").first.instances[200]
=> <OBJECT 0x117cf5c98 Work (4 refs)>

# Find the path to the (default) root
diff.after.find_path(model)
=> [<ROOT vm (2984 refs)>,
 <IMEMO 0x126c9ab68 callcache (1 refs)>,
 <IMEMO 0x126c9acf8 ment (4 refs)>,
 <CLASS 0x12197c080 (anonymous) (15 refs)>,
 <OBJECT 0x122ddba08 (0x12197c080) (3 refs)>,
 <OBJECT 0x117cfc458 WorksController (13 refs)>,
 <OBJECT 0x117cf7318 WorkImageShowComponent (15 refs)>,
 <OBJECT 0x117cf5c98 Work (4 refs)>]

# What is that initial callcache being referenced by the ROOT?
diff.after.at("0x126c9ab68").data
=> {...,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# And then a method entry
diff.after.at("0x126c9acf8").data
=> {...,
 "references"=>["0x12197c080", "0x12197c080", "0x126c9ade8", "0x126c9b4a0"],
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# Aha, and a singleton for RoutesProxy!
diff.after.at("0x12197c080").data
=> {...,
 "references"=> [...],
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# I expect the next object to be the RoutesProxy instance
=> "ActionDispatch::Routing::RoutesProxy"

Sheap is pretty great! In the above example, we were able to find a Work model instance in the heap, and then using find_path identify what was referencing it all the way back to the heap’s root, which is what causes the object to be “retained”; if there was no path to the root, the object would be garbage collected.

(I have the huge benefit of having John as a colleague at GitHub and he helped me out a lot with this. Thank you, John!)

What we’re looking at is something in Rails’ RoutesProxy holding onto a reference to that Work object, via a callcache, a method entry (ment), a singleton class, a RouteSet, and then a Controller. What the heck?!

The explanation

Using Rails’ git history, we were able to find that a change had been made to RoutesProxy’s behavior of dynamically creating a new method: a class_eval had been changed to an instance_eval.

Calling instance_eval "def method...." is what introduced a new singleton class, because that new method is only defined on that one object instance. Singleton classes can be cached by the Ruby VM (they’ll be purged when the cache fills up), and that’s what, through that chain of objects, was causing the model instances to stick around longer than expected and bloat up the memory! It’s not that instance_evaling new methods is itself inherently problematic, but when those singleton methods are defined on an object that references an instance of an Action Controller, which has many instance variables that contained big Active Record objects…. that’s a problem.
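The difference is easy to see in plain Ruby (a toy example, not the Rails code; the class and method names are made up):

```ruby
class Proxy; end

shared = Proxy.new
one_off = Proxy.new

# class_eval defines the method on the class itself: one shared method table.
Proxy.class_eval "def helper; :shared; end"

# instance_eval on an object defines the method on that object's singleton
# class: Ruby mints a new anonymous class just for this one instance, and the
# VM's method/call caches can end up holding references involving it.
one_off.instance_eval "def one_off_helper; :mine; end"

one_off.singleton_class.instance_methods(false) # includes :one_off_helper
shared.respond_to?(:one_off_helper)             # => false
```

Every RoutesProxy instance getting its own singleton class (and cached call sites pointing at it) is what kept the whole controller object graph alive.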

(Big props, again, to John Hawthorn who connected these dots.)

Having tracked down the problem, we submitted a patch to Rails to change the behavior and remove the instance_eval-defined methods. It was accepted and has been released as part of Rails 7.1.3; the project monkey-patched in the change temporarily until that release.

I realize that’s all a big technical mouthful, but the takeaway should be: Sheap is a really great tool, and exploring your Ruby heap can be very satisfying.

Update: Jean Boussier pointed me to the fix in Ruby 3.3 for "call cache for singleton methods can lead to memory leaks" 🎉 and suggested looking at harb, too, as a Sheap-like tool.

Trigger GitHub Actions workflows with inputs from Apple Shortcuts

I’ve been using Apple Shortcuts to invoke GitHub Actions workflows to create webpage bookmarks. It’s been great! (disclosure: I do work at GitHub)

My use case: I’ve been wanting to quit Pinboard.in, so I needed an alternative way to create and host my web bookmarks, some of which date back to ~2005 del.icio.us vintage. It’s been easy enough for me to export all of my bookmarks (settings -> backup -> JSON) and convert them to YAML files to be served by Jekyll and GitHub Pages. But I also needed an easy way to create new bookmarks that would work on all my Apple devices. I ended up with:

  1. Bookmarks are organized as individual yaml files, in this blog’s repository.
  2. A Ruby script to take some simple inputs (url, title, notes), generate a new yaml file, and commit it to the repo using Octokit.
  3. A GitHub Actions workflow that accepts those same inputs and can be manually triggered, that runs the script. One thing to note is that I echo the inputs to $GITHUB_STEP_SUMMARY early in the workflow in case a later step errors, so I won’t lose the bookmark details and can go back later and manually fix it up.
  4. An Apple Shortcut that asks for those inputs (either implicitly via the Share Sheet or via text inputs) and then manually triggers the GitHub Actions workflow via the GitHub API.
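The script in step 2 might look something like the following sketch. The repo name, directory layout, and file-naming scheme are illustrative assumptions, not my actual setup:

```ruby
require "yaml"
require "time"

# Pure helpers: build the file path and YAML body for a new bookmark
def bookmark_path(title, time: Time.now)
  slug = title.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-+|-+\z/, "")
  "_bookmarks/#{time.strftime('%Y-%m-%d')}-#{slug}.yaml"
end

def bookmark_yaml(url:, title:, notes: nil)
  { "url" => url, "title" => title, "notes" => notes }.compact.to_yaml
end

# Commit the new bookmark file to the repo via the GitHub API
# (requires the octokit gem; "USER/blog-repository" is a placeholder)
def commit_bookmark(url:, title:, notes: nil)
  require "octokit"
  client = Octokit::Client.new(access_token: ENV.fetch("GITHUB_TOKEN"))
  client.create_contents(
    "USER/blog-repository",
    bookmark_path(title),
    "Add bookmark: #{title}",
    bookmark_yaml(url: url, title: title, notes: notes)
  )
end
```

Octokit's create_contents conveniently creates the file and the commit in a single API call, so the script doesn't need a local checkout of the repository.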

The only difficult part for me was getting Apple Shortcuts to work nicely with the GitHub REST API. Here’s what worked for me:

Use Get Contents of URL Action:

  • URL: https://api.github.com/repos/USER/REPOSITORY/actions/workflows/WORKFLOW.yml/dispatches
  • Method: POST
  • Headers:
  • Accept: application/vnd.github.v3+json
    • Authorization: Bearer GITHUB_ACCESS_TOKEN
  • Request Body: JSON
    • ref: main (or whatever branch you’re using)
    • inputs (Dictionary):
      • INPUT: VALUE
      • … and your other GitHub Actions workflow inputs
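If you want to sanity-check the request outside of Shortcuts first, the same workflow_dispatch call is easy to make from Ruby (USER, REPOSITORY, WORKFLOW.yml, and the input names here are placeholders):

```ruby
require "json"
require "net/http"

# Build the workflow_dispatch URI and JSON body
def build_dispatch_request(repo:, workflow:, ref: "main", inputs: {})
  uri = URI("https://api.github.com/repos/#{repo}/actions/workflows/#{workflow}/dispatches")
  body = JSON.generate({ "ref" => ref, "inputs" => inputs })
  [uri, body]
end

# POST it with the same headers the Shortcut uses
def dispatch_workflow!(token:, **options)
  uri, body = build_dispatch_request(**options)
  Net::HTTP.post(
    uri, body,
    "Accept" => "application/vnd.github.v3+json",
    "Authorization" => "Bearer #{token}"
  )
end
```

A successful dispatch returns an empty 204 No Content response; the workflow run itself shows up in the repository's Actions tab.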

Here’s what it looks like all together (btw, Dictionary-type inputs were broken in iOS 16 / macOS 13 😨):

Screenshot of Apple Shortcut with the previous configuration

Recently, January 3, 2023

  • I’ve now watched the Taylor Swift Eras movie twice, once at home, and a second time over the holidays with niece (completely) and nephew (partly). My most burning question is whether Taylor menaces the same dancer at every show’s “Tolerate It”, or if they share rhetorical pain. My Apple Music Replay also ranked highly with Taylor Swift, though also apparently Andrew McMahon; unexpected.
  • I started playing Talos Principle 1 after beating 2, though it’s a lot more intense with guns and exploding things and so many timing-based puzzles. I’ve almost beaten it… but also took a break to play Super Mario Wonder which is much more fun (especially, again, with niece and nephew).
  • I finished reading “Babel”, started then stopped reading “Wolf Hall”, and am picking my way through “Less”. There’s several nonfiction books in there that I’ve read about a chapter of, but nothing so memorable to note here. Prior to that I read “Translation State”, and realizing I didn’t remember half of what was happening, re-read the previous 4 Imperial Radch books, which only now makes me think I should re-read Translation State again for the clean sweep. And I guess I finished the latest Bruno book, “A Chateau Under Siege”, and it was fine (there’s only so many times one can describe Kir Royale, and I think we’re past it).
  • I believe the Crumpler Soupansalad Son-o is the perfect jacket-and-book-and-water-bottle-and-also-half-gallon-of-milk-and-greek-yogurt bag. It also hasn’t been made in like 15 years so I bought another one on eBay (I attempted to do this via Poshmark 6 months ago but it failed to deliver). Now I own two: my longtime bag in brown, the new one in black.
  • After 7 years of service, I’ve stepped down from my church’s Property Committee. The buildings are being sold. As Bonhoeffer wrote, albeit about Germany’s embrace of fascism and not a congregational vote on the role of real-estate within a small church’s modest investment portfolio: “If you board the wrong train, it is no use running along the corridor in the other direction.” I’m currently serving on zero committees anywhere; a special moment.
  • Reflecting on my past year at GitHub, I’ll just directly quote Rachel Andrew’s blog; search and replace as necessary:

    The layoffs at Google at the beginning of 2023, didn’t impact me or my writing team directly, however they cast a shadow over the year. I try to look at difficult situations through a lens of what I can actually do to change things or improve the situation. At my level of management I’m not privy to layoff decisions, but I can be there to support my team and make space to talk about their concerns. I can strive to make sure our work and the impact of it is visible, and I can make sensible business decisions to make the most of resources in a more constrained environment. And so, after the initial shock of it all, that’s how I’ve approached this year.

  • I’m slowly trending towards GoodJob v4.0. It’s looking like it will be a noop: a chance for me to clean up all of the migrations and deprecation warnings and probably stop support for Ruby < 3.0, but nothing noticeable otherwise. I want to have everything ready for anyone to opt into FOR UPDATE SKIP LOCKED from Advisory Locks, but as a feature that won’t be part of four-point-zero. We’ve managed to get to 3.22 (that’s 22 minor releases!) without a breaking change.
  • I’m not big on setting yearly goals, but I liked this evergreen advice from Ask a Manager, so I’m gonna continue aspiring to that:

    ideally you’d try to see it as a funny story you can tell friends rather than a hurtful snub that needs to bother you

Solid Queue first impressions: Nice!

Solid Queue was released yesterday, a new relational-database backend for Active Job.

I’m the author of GoodJob and I’ve been following along and am very interested and excited about Solid Queue. These are some things I noticed when first going through it.

tl;dr: It’s nice! I learned some things. It makes a few different choices than GoodJob and I’m very curious how they turn out (in a good way!).

I admit, I didn’t run Solid Queue in production. I poked through the code, got the development environment set up (spent the majority of my time trying to get mysql2 to compile, which is no surprise for me; Trilogy is my day job), ran the test suite and tried TDDing a new feature for it: perform_all_later support. These are just my notes, something to refer back to.

Lots of database tables: Solid Queue has many database tables. There is a “hot” table (my terminology) in which job records are queued/dequeued/locked from by all the workers, and then several other tables where job records are staged (e.g. they’re scheduled in the future, so don’t insert them into the hot table yet) or archived after they complete/error. This seems smart because that hot table and its indexes stays compact and single purpose, which is good for performance. Compare that to GoodJob in which the jobs table has like 8 indexes to cover both queue/dequeue and Dashboard listings and everything else, which does slow down inserts and updates. I’ve had the impression with GoodJob that orchestrating across multiple tables would be more difficult (everything is tradeoffs!), so I’m very curious to see an alternative implementation in Solid Queue.

Note: I wasn’t successfully able to implement perform_all_later in my 1 hour timebox because it was more complicated than an insert_all because of the necessity of writing to multiple tables.

Aside: One of the very first comments I got when I launched GoodJob 3 years ago was like “your design assumptions are less than ideal” and then they never replied to any of my follow-ups. That sucked! This is not that. Nothing in Solid Queue is particularly concerning, just different (sometimes better!). Kudos to Rosa Gutiérrez and the Solid Queue developers; you’re doing great work! 💖

Again, lots of database tables: GoodJob is easy mode, just targeting Postgres, because there are Advisory Locks and lots of Postgres-only niceties. I do not envy Solid Queue being multi-database, because it has to implement a bunch of stuff with a coarser toolbox. For example, there is a semaphores table, which is used for the Concurrency Controls feature (🎉). I think the “SOLID” libraries (also Solid Cache) are interesting because they have to implement behavior in a relational database that comes for free in in-memory databases (example: TTL/record expiration/purging).

Puma Plugin: TIL. Looks nicer and more explicit than GoodJob trying to transparently detect that it’s running inside the webserver to run asynchronously.

Multiprocess. A nice surprise to me, Solid Queue has a multiprocess supervisor. It does seem like even the Puma plugin forks off another process though; that could have implications for memory constrained environments (e.g. Heroku dynos). I’m nearly 4 years into GoodJob and haven’t tackled multiprocess yet, so exciting to see this in Solid Queue’s first release.

Queue priority: Nice! I have opinions about how people set up their application’s queues, along the lines of: many people do it wrong, imo. Solid Queue looks like it provides a lot of good flexibility to let people easily migrate and configure their queues initially (though wrongly, by dependency, imo), but then reorient them more performantly (by latency, again imo). A single thread-pool/worker can pull from multiple queues.

Care. I notice lots of little things that are nice in Solid Queue. The code is clean. The indexes are named for their purpose/usage rather than just like index_table_column_column. The Puma Plugin is nice. There are things in GoodJob that I dream about what a clean-room, lessons-learned reimplementation would look like, but it’s never top of my priorities, and some things are never going back in the stable (table names are basically forever). Reading the Solid Queue code was a vicarious-nice! experience.

Differences. Do they even matter? I dunno:

  • No Dashboard yet. Waiting on Mission Control. GoodJob definitely got more twisty as I learned all of the things of “you want a button to do what now with those jobs? …oh, I guess that makes sense. hmm.”
  • No LISTEN/NOTIFY (yet?). Seems possible, but would be Postgres only so maybe not. That means latency will never be less than the polling frequency, though an example shows 0.1 seconds which seems good to me.
  • No cron-like functionality. It took me a minute to come around to the necessity of this; maybe Solid Queue will too. 🤦 I missed this on first read through: “Unique jobs and recurring, cron-like tasks are coming very soon.” 🙌

Final thoughts: Path dependency is hard, so I don’t imagine lots of people should swap out their job backend just because there is something new (please, don’t let me ever read a “3 job backends in 4 years” blog post). New projects and applications will be more likely making these choices (and they shouldn’t be valueless choices, hence my excitement for Solid Queue becoming first party to Rails) and I’m really excited to see how Solid Queue grows up with them, and alongside other options like GoodJob and Sidekiq and Delayed et al.


  • One of my coworkers said this week “You’ve been an engineering director and in leadership before, right? I appreciate your perspective; any advice and resources you’d recommend?” So that set my mind racing. I dunno. On one hand, it’s like, well, first, you grind out 10 years of 1-year of experience 10 times, but do it 50 times a year. On the other, keep a delta file and I also think about Secrets of Consulting quite a lot (content warning: I haven’t re-read it in a long time; I tried reading the same author’s The Psychology of Computer Programming more recently and couldn’t do it).
  • Work otherwise is in the final marathon of promo packets and performance reviews and quarterly planning and a reorg and oh, the next version of Ruby is released in 3 weeks and it’s go time. Then we do it all again. I love my team so much.
  • I finished reading The Final Architecture series. I didn’t enjoy it as much as Children of Time (“the octopus books”). After reading all the Gateway books and 3-body problem books, I’m a bit over the idea that there’s a malicious (or at least self-interested) group of people who are unhappy with the current value of the Planck constant and are doing something about it. I was into the subplot of alien criminality.
  • I finished Talos Principle 2, but I screwed up the golden gates thing, so I have to beat it again just to get the special special ending. I’ve been playing it in parallel with a friend and appreciate our back and forth:

    Me: I’m personally more worried about environmental catastrophe than AI, but i guess they’re intertwined. Material conditions that are unfit for life. Like some of the talos robots seem to touch on my philosophical question which is like: how do we maximize individual agency+satisfaction while also avoiding collective/systemic fucking-around-and-finding-out.

    Friend: we can see from the game the answer lies somewhere on the spectrum between having 1000 robits around a crumbling power source vs having a magic 3D printing pyramid for use to conquer the stars

    I also started playing Talos 1 and it’s much less chill than the 2nd game. I may not finish it.

  • For GoodJob, all of the things I want have been labeled “Help Wanted”. I do want to get the row-locking foundations in place myself, though I think the safe upgrade path for it might take a little while to straighten out. I think I have finally mastered advisory locks, so, of course, that means change it all up.
  • I ran bundle update on Day of the Shirt, which means I also upgraded to Shakapacker, which means that I have, once again, spent an entire weekend fumbling with Webpack configuration to get window.$ working. I also got a nice email from the owner of a t-shirt website that validated my thesis that no one visits websites anymore, let alone to buy t-shirts: the website owner got a (different) full-time job.

The Rails Executor: increasingly everywhere

The Rails Executor rules everything around ~~you~~ your code.

If you write multithreaded-Rails code—like me, author of GoodJob—you’re probably familiar with the Rails Executor which is described in the Rails Multithreading Guide.

If you’re new to the Rails Executor: it sets up and tears down a lot of Rails’ framework magic. Code wrapped with a Rails Executor or its sibling, the Reloader, picks up a lot of powerful behavior:

  • Constant autoloading and reloading
  • Database connection/connection-pool management and query retries
  • Query Cache
  • Query Logging
  • CurrentAttributes
  • Error reporting

You usually won’t think about it. The Rails framework already wraps every Controller Action and Active Job with an Executor. Recently, as of Rails v7.1, it’s showing up in many more places within the Rails codebase.

The effect of these small changes could be surprising:

  • I came to write this blog post because I saw a Rails Discussion asking how “Rails 7.1 uses query cache for runner scripts” and aha, I knew the answer: the Executor.
  • I recently fixed a bunch of flaky GoodJob unit tests by wrapping each RSpec example in a Rails Executor. This is a problem specific to GoodJob, which uses connection-based Advisory Locks, but I discovered that if an Executor context was passed through (for example, executing an Active Job inline), the current database connection would be returned to the pool, sometimes breaking the Advisory Locks when a different connection was checked back out to continue the test. This was only a fluke of the tests, but was a longtime annoyance. I’ve previously had to work around a similar reset of CurrentAttributes that occurs too.
  • At my day job, GitHub, we’ve also been double-checking that all of our Rails-invoking scripts and daemons are wrapped with Rails Executors. Doing so has fixed flukey constant lookups, reduced our database connection error rate and increased successful query retries, and necessitated updating a bunch of tests that counted queries that now hit the query cache.

The Rails Executor is great! Your code is probably already wrapped by the Rails framework, but anytime you start writing scripts or daemons that require_relative "./config/environment.rb" you should double-check, and definitely if you’re using Thread.new, Concurrent::Future or anything that runs in a background thread.

I used the following code in GoodJob to debug that database connection checkout occurs in a Rails Executor, maybe you could adopt something similar too:

# config/initializers/debug_executors.rb

ActiveSupport.on_load :active_record do
  ActiveRecord::ConnectionAdapters::AbstractAdapter.set_callback :checkout, :before, (lambda do |conn|
    unless ActiveSupport::Executor.active?
      $stdout.puts "WARNING: Connection pool checkout occurred outside of a Rails Executor"
    end
  end)
end

One last thing about Executors: you want to make sure that you’re wrapping individual units of work, so the execution context has a chance to reset itself (check-in database connections, unload and reload code, etc.):

# scripts/do_all_the_things.rb
# ...

# bad
Rails.application.executor.wrap do
  loop { MyModel.do_something }
end

# good
loop do
  Rails.application.executor.wrap { MyModel.do_something }
end

Update: I offered a Rails PR to make the script runner’s Executor conditional, because the introduction of an Executor around bin/rails runner script.rb could introduce problems if the script is long-running/looping/daemon-like; developers would still need to use an Executor, but to wrap individual units of work in their long-running script.

Reflections on GoodJob for Solid Queue

Rails World presents Solid Queue and Mission Control

GoodJob, via its introductory blog post, was highlighted last week at Rails World. A new Active Job queue backend, Solid Queue, was announced, and I’m excited to see where it goes!

I attended Rails World in Amsterdam this past week. During the conference, a new Active Job backend was announced: Solid Queue (video), which has the potential to become the first first-party backend in Rails. Solid Queue, like my GoodJob, is backed by a relational database. I’m very excited about this! I had a chance to talk to Rosa Gutiérrez, who is leading the effort at 37signals, and I’m hopeful that I’ll be able to contribute to Solid Queue and who knows, maybe it could even become a successor to GoodJob.

With that thought in mind, I reflected on some of the design and motivations that became GoodJob, and that I believe are important regardless of the Active Job backend under development. These are not intended to be design documents but more a list of things that I have learned or come across during my 3 years working on GoodJob. It would be nice to keep these in mind when designing a potential successor to GoodJob. And I hope they can be the seed to further conversations, rather than a fully realized proposal or argument. Let’s go:

  • SIGKILL Safety. Recovering from a SIGKILL (or someone unplugging the power cord) is always number one in my mind when thinking of GoodJob. That informed my desire to use Advisory Locks (which are automatically released on disconnect), and my future thinking about heartbeats if GoodJob switched over to using FOR UPDATE SKIP LOCKED instead of Advisory Locks. I do not think jobs should be limited to a specific timeout (as Delayed Job’s design uses) as that also creates significant retry latency when resumed, and jobs definitely shouldn’t be wrapped with a transaction either.
  • (Human) Exception and Retry Workflows. Everybody has a different workflow for how they deal with errors, and I believe that a backend needs to track, report (e.g. send to Sentry or Bugsnag) and expose the various reasons an error appears: retried, retry stopped, explicitly discarded, SIGKILLed/interrupted, unhandled error, etc. I still am dialing this in on GoodJob because there is wide variability of how people and teams manage their error workflows. I’m always learning something new. For example, there are very different answers on “when using retry_on SpecialError, attempts: 3 should the 4th error be reported to the exception tracker? What about an explicit discard_on? Should a discard_on error be reviewed and reenqueued or not?” If a job is SIGKILLed/interrupted, should it be automatically restarted or held for manual review? Everyone seems to do it differently! I haven’t cracked the code on what is “ideal” or reasonable to say “nope, don’t do it that way.” Active Job’s error handling isn’t clear cut either, so maybe we can make that better and come around to a more opinionated (but still inclusive) design. Maybe!
  • Process Harnesses. I think it’s interesting that Rails might ship with a 1st party queue backend before it ships with a 1st party webserver: there is a lot of operational overlap. Signal handling, timeouts, daemonization, liveness and healthcheck probes, monitoring and scaling instrumentation. There’s quite a lot of ground to cover, and a lot different systems and tooling: Kubernetes, systemd, rc.d, Heroku, Judoscale, to name just a few of the various operational targets that I’ve spent considerable time supporting.
  • Repeating Jobs / Clock Process. It took me a while to come around to this in GoodJob, but I believe that performing work repetitively on a schedule (“cron-like”) is very much in the same problem-domain as background jobs. There’s lots of different ways to design it that I don’t feel strongly about, for example GoodJob minimizes autoloading by keeping schedules separate from job classes, but I do think it is necessary to plan for scheduled jobs in a well-architected Rails application.
  • Unique Jobs, Throttles, Fuses and other Concurrency Controls. Similarly to Repeating Jobs, demand is high for everything I’d bucket under “concurrency controls”, which I’ll say covers both enqueue and dequeue complexity. And these features are tough because they sit in counterbalance to overall performance: do you want to run jobs faster or smarter? And these are the features that I think are legit, because there are other features below under Queue Design that I think are bunk. There’s a lot of discernment to do!
  • Queue design and multi-queue execution pools. I do think queue design is a place where lots of people do it wrong. I believe queues should be organized by maximum total latency SLO (latency_15s, latency_15m , latency_8h) and not by their purpose or dependencies (mailers, billing, api). Nate Berkopec believes similarly. And I think that informs that execution pools (e.g. thread pools) should be able to work from multiple queues and have independent concurrency configuration (e.g. number of threads), both to ease transition from the latter to the former, but also because it allows sharing resources as optimally as possible (having 3 separate pools that pull from "latency_15s", "latency_15m, latency_15s", and "latency_8h,*" in GoodJob’s syntax). I personally think concepts like priority or ordered-queues lead to bad queue design, so I wouldn’t sweat that. Any ordering regime more complex than first-in-first-out (FIFO) prioritizes capacity (or lack thereof) over latency. This might sound strange coming from me who champions running workloads in the web process on tiny dynos, but it’s different in my mind: I don’t think it’s possible to meet a latency target through prioritization when there is a fundamental lack of capacity.
  • Labels. Per the previous point, though I have yet to implement this in GoodJob (soon!), I think that giving developers the option to label their jobs might break their bad habit of using queue names as functional labels, instead of what I believe queues should be appropriately used for: latency and quality-of-service thresholds. I mention it here just in case that informs Solid Queue’s design.
  • Observability. GoodJob maintains a lot of bookkeeping, keeping job and granular execution data around after execution so it can be inspected. People seem to like that, and it’s necessary to keep them around for calculating accurate latency metrics, though it all is a trade-off against performance. It makes for a fun Web Dashboard too.
  • Performance Envelope. I dunno, I mention this just because I think people spend an inordinate amount of time comparing queue backend performance and asking “do the jobs go brrrrr?” GoodJob targets the small and medium end of projects (though some big ones use it too) and prioritizes operational simplicity over performance. That works for me (and a lot of others!) but also isn’t really reflective of the scale of companies leading Rails development. There’s a tension here.
  • Making better mistakes tomorrow. I’m really proud of having a reputation for being helpful and responsive and curious in the GoodJob issue queue and discussions and various support Slacks (like Rails Link). I think there is a lot to the queue backend domain that won’t be learned by upfront analysis, and that can’t be easily bucketed into either “the library is doing it wrong” or “the developer is doing it wrong.” There’s a lot of variation! (not to mention across JRuby, etc., and various database versions). I’m able to do things with GoodJob that I think is unlikely on a 1st party Rails queue backend (like cutting a new release after every patch and fix), and I’m able to stay oriented to the people and the problem they’re trying to solve over the technological solution itself. I hope all that can be preserved as these things move upstream.
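To make the latency-based queue design above concrete, here's what it might look like in GoodJob's configuration syntax (the queue names and thread counts are illustrative, not a recommendation for any particular app):

```ruby
# config/environments/production.rb
#
# Three execution pools pulling from latency-named queues:
# 5 threads dedicated to the 15-second-SLO queue, 3 threads shared across
# the 15-minute and 15-second queues, and 1 thread for the 8-hour queue
# plus anything else (*).
config.good_job.queues = "latency_15s:5;latency_15m,latency_15s:3;latency_8h,*:1"
```

The point is that a queue name states a promise ("work here starts within 15 seconds"), and the pools are sized to keep that promise, rather than the queue name describing what kind of work it contains.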

That’s it! I’m probably forgetting stuff, so I’ll reserve the right to keep adding to this list. I’d love to keep talking about this and hope that Solid Queue will be fantastic!

Oh, and Solid Queue isn’t released yet, so if this seems compelling, use GoodJob in the meantime.