Two stories about technical debt, I guess

One activity I don’t enjoy very much: griping about “technical debt”; that label just never seems descriptive enough. And the things people gripe about seem to mostly fall into:

  • Deferred maintenance: we haven’t updated to the latest version of X, or it’s written in A language/framework but now everyone is much more familiar with B, or we know M raises an exception at N when they O while P
  • Just not the quality we expect, because reasons. It’s ok, we’re all winging it.

…and those categories crowd out the real sweaty palms stuff, the “we did do a good job but we know more now” that I think is the real deal. I can talk about that.

I’ve never found the particular post/video/talk? again (I’ve looked!), but it described technical debt as like: the distance between the current understanding of the business domain and the technical implementation’s modeling of the business domain. It had a chart that has stuck in my mind; it looked something like this:

A chart that shows the coming together and divergence of business and technical domain knowledge

That definition of “technical debt” clicked for me. For myself and the high performing teams I’ve worked with, we’re most productive, pushing value, when we’re asking ourselves: does this make sense? Given what we know now, and what we know about where we’re going, do the objects and their interactions conceptually map to how the business is being talked about by the non-technical people? Are we adapting at the same rate we’re learning? If yes, we’re golden; if sorta with some tweaks that’s good; when no… that’s bad, disordered, schismatic: carrying a burden of always translating between the language and models in the technical system and the language and concepts in the business domain. That sucks!

Aside: There’s a funny/terrifying/random thing this makes me think of: “We’ll Never Make That Kind of Movie Again”: An oral history of The Emperor’s New Groove, a raucous Disney animated film that almost never happened. One of the screenwriters describes the process of making an animated film:

In a normal four-year process, you’ve got meetings, you’ve got development people going, “What if the girl was a boy? What if the bird was a flower?” And then you have to run all those ideas.

The software projects I’ve worked on are a conveyor belt of “What if the bird was a flower?” decision making and idea running. Extend that object, repurpose this method, bolt on that, change this, swap that, rename this, better leave a comment on that. When it’s going well, it doesn’t matter that it was a bird yesterday and a flower today… as long as it’s not retaining so much birdliness that it compromises its flowerability. It’s when you’re having to remember and call back “hey, well, um, it was once a bird, fyi, that’s why it’s so much trouble to do this flower stuff”, then you’re sunk.

It’s a bad sign when forward development requires telling stories about the past.

Here’s two stories…

The isolated component

When I was working at a website hosting startup, we had one component that I could just never get the support to integrate into the larger application. It was the form that would create a new website project for the user. When the startup began, the original technical architecture was a bunch of interlinked but separated components: User Registration, User Account Management, Website Creation, Website Management, Organization Management, Agency/Reseller Management, etc. It made sense early as the business was figuring out what worked as a business, and then during my tenure we brought those components together into a unified application. Well, almost all of them.

There was a lot of give and take in the product integration; sometimes me and the other engineers would just do it and other times we’d defer it until there was a particular feature that necessitated it, and then we’d include that extra work in our estimates. It frequently took a couple passes of code cleanup and bringing it onto the design system, and that was ok. That’s the job!

That last, last, last component of Website Creation eluded us though, and it was outside our control. At that point, development was transitioning from “engineering led” to “product management and design led” and I had been instructed that engineering couldn’t make any product changes unless they were connected to an active PRD (Product Requirements Document) controlled by the PMs.

There was plenty of demand to make changes to Website Creation: smooth out the first-time account registration flow into creating a website; allow the user to do some activities in parallel while the website was spinning up like inviting team members or perusing billing levels; decouple the concept of “a website is a marketing strategy” from “upload your code and have some containers and a load balancer provisioned” so that non-developers could still plan a website without invoking all the technical bits.

But not enough appetite to get it done.

Of “technical debt”: no one except our little engineering team maintaining the frontend thought anything special of Website Creation. It wasn’t obvious unless you carefully watched for the hard-browser refresh, or noticed the navigation bar change slightly. Conceptually it was a unified product (heck, I even remember a product launch called “One”), but the work hadn’t yet been done on the application side and we engineers carried the burden.

It was funny because every time a product change that touched Website Creation was discussed, the same thing happened:

PM: Your effort estimate seems really high. What’s going on?

Me: Well, this involves Website Creation and it’s still its own component and can’t access any of those other systems that are necessary to build the feature. We’d need to bring it into the rest of the application. It’s definitely possible! There’s a few open questions with product and design implications, so we’d need to work together on that.

PM: Oh, well, huh, I didn’t expect that. Hmm, we don’t have the bandwidth for all that. Let’s pass on it for now.

This happened multiple times! It was weird too because the particular project being planned would be spiked, and then the engineering team would have to wait around while a new project was spun up and that likely took just as long as it would have taken to do the work on the Website Creation component. If I hadn’t been explicitly told to sit on my hands otherwise, I would have probably just done this as off-the-books, shucks-I-didn’t-think-it-would-be-a-big-deal, shadow-work.

It never got done during my tenure; I think they later decided the problem was that the whole thing wasn’t written in React 🤷

The campaign message

When I was a developer on GetCalFresh, the functionality with perpetually unexpected estimations was “Campaign Messages”.

GetCalFresh would help people apply for their initial food stamp application, at which point it would be sent to the applicant’s county for processing. Over the next 14 to 28 days the county would schedule an in-person or telephone interview, and request various documents like pay stubs and rental leases, and the applicant would have to navigate all of that. (The administrative state suuuuucks!) To help, GetCalFresh would send a sequence of email and/or SMS messages to the applicant over this time period explaining what was needed and what to expect next. A messaging campaign, y’know, of “campaign messages”.

When GetCalFresh was first laid down in code, there were two “types” of campaign messages: General and Expedited. Under a couple of different circumstances, if an applicant is homeless or at risk of being homeless, or has no cash-on-hand to purchase food, their application is eligible for expedited processing and thus we’d send them a message sequence over ~14 days; everyone else would receive a message sequence over ~28 days. We were sending the same messages, just on a different schedule.

So when we engineers were then asked to customize a message, like “if the applicant is a student, add this description about their special eligibility”… we just if/elsed it on in there. Oh, now this county is piloting a special process, let’s make a little carve out for them too and swap this message. Still, same sequence, just tweaks, right? Well, all those small tweaks and carve-outs build up, and all of a sudden we’re having to ask “ok, so you want us to rewrite this one itty bitty message, well we also need you to specify what it should be for students, who do and don’t qualify for County X’s special process too”. It got twistier and twistier. And when requests like “don’t send that message in this special circumstance” or “add a totally new message but just as this one-off” came in, we’d be like “totally possible! and that’s gonna take more work than you think!”

GetCalFresh had the best Product Managers I have ever worked with in my life, and we still got locked into a similar loop as the last story: we’d do our estimation with the PMs, it exposed the fruit hung more high than low, and the change would be deprioritized. I think the PMs got it, but the challenge was that the other folks, client support and the folks coordinating with the counties and the datascience team, would be like “we heard that Engineering doesn’t want to build it.” So weird! Not engineering’s call! (Aside: I spent so much time coaching non-technical stakeholders on how to work in a PM-led system, but always more coaching to do.)

I remember making a Google Doc to analyze why and explain how the system we initially designed for (same sequence of messages with different schedules) didn’t match our understanding of the problem today. The doc listed out all of the different reasons we knew of why we might customize the message. It was at least 10 bullet points. And there were a lot of other learnings too: initially we designed around customizing for just 3 major county systems (Statewide Automated Welfare Systems - SAWS), but later found ourselves doing county-by-county customizations (California has 58 counties). I advocated for configuring each county in its own object despite the scale brainworms demanding a singular and infinitely abstracted model (I call these things “spreadsheet problems” when you can simply list the entire domain in a spreadsheet).
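To make the “one object per county” idea concrete, here is a minimal sketch. Everything in it is hypothetical (the `CountyCampaign` struct, the message keys, the message text, the placeholder county names); it is not how GetCalFresh was actually implemented, just one way a “spreadsheet problem” can be written down as plain, listable configuration instead of nested if/else branches.

```ruby
# Hypothetical sketch: each county gets its own explicit configuration object.
# None of these names or messages come from the real GetCalFresh codebase.
CountyCampaign = Struct.new(:county, :skipped, :extra_notes, keyword_init: true) do
  # Returns the messages this county sends to a given applicant:
  # skipped messages are dropped, and applicant-specific notes are appended.
  def messages_for(base_messages, applicant)
    base_messages
      .reject { |key, _text| skipped.include?(key) }
      .to_h do |key, text|
        note = extra_notes.dig(key, applicant[:kind])
        [key, note ? "#{text} #{note}" : text]
      end
  end
end

# Made-up base sequence shared by every county.
BASE_MESSAGES = {
  interview: "Your county will contact you to schedule an interview.",
  documents: "You may be asked for documents such as pay stubs."
}.freeze

# One row per county, like rows in a spreadsheet.
COUNTIES = {
  "County X" => CountyCampaign.new(
    county: "County X",
    skipped: [:interview], # e.g. a county piloting a special process
    extra_notes: { documents: { student: "Students may need proof of enrollment." } }
  ),
  "County Y" => CountyCampaign.new(county: "County Y", skipped: [], extra_notes: {})
}.freeze

student = { kind: :student }
COUNTIES.fetch("County X").messages_for(BASE_MESSAGES, student).each do |key, text|
  puts "#{key}: #{text}"
end
```

The tradeoff is repetition: 58 rows that mostly look alike. The benefit is that “what happens for students in County X?” becomes a lookup rather than a trace through branches.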

Of “technical debt”, I still can replay in my brain the deliberate mental spatial shift of imagining the campaign model as a 2-column array (General and Expedited) with 10+ rows of logic shifts and then flopping it onto its side to make a change. All that mental context has a huge carrying cost that all of us had to budget for when making a change.

During my tenure, we never did the significant reworking to how campaign messages were implemented, though some bold colleagues did their best to make changes as safe and simple as possible with DSLs and test macros. Thank you! 🙏

That’s it

Sorry, no great lessons here. Just stories to share (“ideally you’d try to see it as a funny story you can tell friends rather than a hurtful snub that needs to bother you”). I mentioned coaching folks on working with PMs, and I think the frequent advice I gave non-technical folks probably holds true for engineers too when asking:

Always have your top 3 ranked work items ready to go when talking to decision makers (the PM?). Don’t bring new stuff unless it changes those top 3.

(I mean sure, share context and adapt, but allow yourself no doubt that you’ve clearly and consistently communicated what your top priorities are before they get dumped in with everyone else’s.)

(But also, if you’re an engineer and you can and no one is breathing down your neck, simply get it done and celebrate. The PM doesn’t have to lead everything. You can do it! 👍)

The answer is in your heap: debugging a big memory increase in Ruby on Rails

I recently participated in an interesting series of debugging sessions tracking down the source of a large increase in memory when upgrading a Rails application. We ultimately tracked down the cause using John Hawthorn’s Sheap heap analyzer and successfully submitted a patch to Rails. I thought it was interesting enough to write up because maybe the general approach to debugging memory issues would be helpful (and this is the kind of stuff that I very quickly forget unless I write it down).

How it started: Reddit

Lots of people ask for help on r/rails, and it can be difficult to debug at a distance. This time it was a little different. I recognized the username’s owner, Jonathan Rochkind, because he’s been a familiar and helpful face in GoodJob’s discussions and I’ve sponsored his aggregator Rubyland News. The observed problem was that after upgrading from Rails 7.0 to Rails 7.1, their application’s memory footprint increased by about 25%. Weird!

Working the problem

We worked through a bunch of questions:

  • Was the memory increase at startup or over time? Not at boot, but memory increased very quickly.
  • Did anything change with Puma configuration? Nope.
  • Get set up with derailed_benchmarks, and create a bin/profile Rails binstub to make it easy to boot into a production-like configuration for profiling. Here’s what my very polished one looks like:

    #!/usr/bin/env ruby
    
    # This file is a wrapper around the rails and derailed executables
    # to make it easier to boot in PRODUCTION mode.
    #
    # Usage: bin/profile [rails|derailed] [command]
    
    ENV["RAILS_ENV"] = ENV.fetch("RAILS_ENV", "production")
    ENV["RACK_ENV"] = "production"
    ENV["RAILS_LOG_TO_STDOUT"] = "true"
    ENV["RAILS_SERVE_STATIC_FILES"] = "true"
    ENV["FORCE_SSL"] = "false"
    ## ^^ Put ENV to boot in production mode here ^^
    
    executable = ARGV.shift
    if executable == "rails"
      load File.join(File.dirname(__FILE__), "rails")
    elsif executable == "derailed"
      require 'bundler/setup'
      load Gem.bin_path('derailed_benchmarks', 'derailed')
    else
      puts "ERROR: '#{executable}' is not a valid command."
      puts "Usage: bin/profile [rails|derailed]"
      exit 1
    end
    

We flailed around with Derailed Benchmarks, as well as John Hawthorn’s Vernier profiler’s memory mode (aside: John Hawthorn is doing amazing stuff with Ruby).

At this point, we had a general understanding of the application memory footprint, which involved a large number of model instances (Work), many of which contained a big blob of json. For some reason they were sticking around longer than a single web request, but we weren’t able to find any smoking guns of like, memoized class instance variables that were holding onto references. So we kept digging.

You can read along to all of this here: https://github.com/sciencehistory/scihist_digicoll/issues/2449

Analyzing memory with Sheap

I used Derailed Benchmarks’ perf:heap to generate heap dumps (also possible using rbtrace --heapdump), and then plugged those into Sheap. Sheap is a relatively new tool, and where it shines is being interactive. Instead of outputting a static report, Sheap allows for exploring a heap dump (or a diff of two dumps, to identify retained objects) and asking questions of the dump. In our case: what objects are referencing this object and why is it being retained?

# $ irb
require './lib/sheap.rb'

diff = Sheap::Diff.new("/Users/bensheldon/Repositories/sciencehistory/scihist_digicoll/tmp/2023-12-07T13:24:15-08:00-heap-1.ndjson", "/Users/bensheldon/Repositories/sciencehistory/scihist_digicoll/tmp/2023-12-07T13:24:15-08:00-heap-2.ndjson")

# Find one of the Work records that's been retained
model = diff.after.class_named("Work").first.instances[200]
=> <OBJECT 0x117cf5c98 Work (4 refs)>

# Find the path to the (default) root
diff.after.find_path(model)
=>
[<ROOT vm (2984 refs)>,
 <IMEMO 0x126c9ab68 callcache (1 refs)>,
 <IMEMO 0x126c9acf8 ment (4 refs)>,
 <CLASS 0x12197c080 (anonymous) (15 refs)>,
 <OBJECT 0x122ddba08 (0x12197c080) (3 refs)>,
 <OBJECT 0x117cfc458 WorksController (13 refs)>,
 <OBJECT 0x117cf7318 WorkImageShowComponent (15 refs)>,
 <OBJECT 0x117cf5c98 Work (4 refs)>]

# What is that initial callcache being referenced by the ROOT?
diff.after.at("0x126c9ab68").data
=>
{"address"=>"0x126c9ab68",
 "type"=>"IMEMO",
 "shape_id"=>0,
 "slot_size"=>40,
 "imemo_type"=>"callcache",
 "references"=>["0x126c9acf8"],
 "file"=>"/Users/bensheldon/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/actionpack-7.1.2/lib/action_dispatch/routing/routes_proxy.rb",
 "line"=>48,
 "method"=>"public_send",
 "generation"=>288,
 "memsize"=>40,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# And then a method entry
diff.after.at("0x126c9acf8").data
=>
{"address"=>"0x126c9acf8",
 "type"=>"IMEMO",
 "shape_id"=>0,
 "slot_size"=>40,
 "imemo_type"=>"ment",
 "references"=>["0x12197c080", "0x12197c080", "0x126c9ade8", "0x126c9b4a0"],
 "file"=>
  "/Users/bensheldon/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/actionpack-7.1.2/lib/action_dispatch/routing/routes_proxy.rb",
 "line"=>33,
 "method"=>"method_missing",
 "generation"=>288,
 "memsize"=>48,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# Aha, and a singleton for RoutesProxy!
diff.after.at("0x12197c080").data
=>
{"address"=>"0x12197c080",
 "type"=>"CLASS",
 "shape_id"=>14,
 "slot_size"=>160,
 "class"=>"0x12211c308",
 "variation_count"=>0,
 "superclass"=>"0x12211c3a8",
 "real_class_name"=>"ActionDispatch::Routing::RoutesProxy",
 "singleton"=>true,
 "references"=> [...],
 "file"=>
  "/Users/bensheldon/.rbenv/versions/3.2.2/lib/ruby/gems/3.2.0/gems/actionpack-7.1.2/lib/action_dispatch/routing/routes_proxy.rb",
 "line"=>33,
 "method"=>"method_missing",
 "generation"=>288,
 "memsize"=>656,
 "flags"=>{"wb_protected"=>true, "old"=>true, "uncollectible"=>true, "marked"=>true}}

# I expect the next object to be the RoutesProxy instance
diff.after.at("0x122ddba08").klass.data["real_class_name"]
=> "ActionDispatch::Routing::RoutesProxy"

Sheap is pretty great! In the above example, we were able to find a Work model instance in the heap, and then using find_path identify what was referencing it all the way back to the heap’s root, which is what causes the object to be “retained”; if there was no path to the root, the object would be garbage collected.
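If you want a feel for what a heap dump contains without any extra tooling, here is a minimal sketch using only Ruby’s standard library (`objspace` and `json`). It is not Sheap’s API: `direct_referrers` is a hypothetical helper that answers just one hop of the “who is holding onto this?” question, whereas Sheap’s find_path walks the whole chain back to a root. The `Payload` struct and `holder` hash are stand-ins for a retained model and whatever retains it.

```ruby
require "objspace"
require "json"
require "tempfile"

# Hypothetical helper (not part of Sheap): returns the parsed heap-dump rows
# that list `address` among their references, i.e. the direct referrers.
def direct_referrers(dump_path, address)
  File.foreach(dump_path).filter_map do |line|
    row = JSON.parse(line)
    row if Array(row["references"]).include?(address)
  end
end

Payload = Struct.new(:blob)
payload = Payload.new("x" * 1024)
holder = { retained: payload } # something that keeps `payload` reachable

# ObjectSpace.dump describes a single object as JSON, including its address.
address = JSON.parse(ObjectSpace.dump(payload))["address"]

# ObjectSpace.dump_all writes one JSON document per line for every heap object.
# Tempfile.create yields a real File, which dump_all accepts as its output.
referrers = Tempfile.create(["heap", ".ndjson"]) do |file|
  ObjectSpace.dump_all(output: file)
  file.flush
  direct_referrers(file.path, address)
end

referrers.each { |row| puts "#{row["type"]} #{row["address"]}" }
```

Repeating that lookup on each referrer’s address is, roughly, the walk toward a root that a tool like Sheap automates.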

(I have the huge benefit of having John as a colleague at GitHub and he helped me out a lot with this. Thank you, John!)

What we’re looking at is something in Rails’ RoutesProxy holding onto a reference to that Work object, via a callcache, a method entry (ment), a singleton class, a RouteSet, and then a Controller. What the heck?!

The explanation

Using Rails’ git history, we were able to find that a change had been made to the RoutesProxy’s behavior of dynamically creating a new method: a class_eval had been changed to an instance_eval.

Calling instance_eval "def method...." is what introduced a new singleton class, because that new method is only defined on that one object instance. Singleton classes can be cached by the Ruby VM (they’ll be purged when the cache fills up), and that’s what, through that chain of objects, was causing the model instances to stick around longer than expected and bloat up the memory! It’s not that instance_evaling new methods is itself inherently problematic, but when those singleton methods are defined on an object that references an instance of an Action Controller, which has many instance variables that contained big Active Record objects…. that’s a problem.
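The difference between the two evals is easy to see in isolation. This is a minimal sketch with a hypothetical `RoutesProxyLike` class; it is not the Rails source, it only shows where the dynamically defined method ends up in each case.

```ruby
# Hypothetical stand-in class; not the actual Rails RoutesProxy.
class RoutesProxyLike
  # class_eval: the method is defined once on the class and shared by every instance.
  def self.define_shared(name)
    class_eval "def #{name}; :shared; end", __FILE__, __LINE__
  end

  # instance_eval: the method is defined on this one object's singleton class.
  def define_on_instance(name)
    instance_eval "def #{name}; :singleton; end", __FILE__, __LINE__
  end
end

a = RoutesProxyLike.new
b = RoutesProxyLike.new

RoutesProxyLike.define_shared(:users_path)
a.define_on_instance(:posts_path)

p a.singleton_methods        # => [:posts_path]
p b.singleton_methods        # => []
p b.respond_to?(:users_path) # => true
p b.respond_to?(:posts_path) # => false
```

Every object that goes through the instance_eval path gets its own singleton class, and that per-object class is the thing that showed up in the retention path above.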

(Big props, again, to John Hawthorn who connected these dots.)

Having tracked down the problem, we submitted a patch to Rails to change the behavior and remove the instance_eval-defined methods. It’s been accepted and has been released as part of Rails 7.1.3; before that release, the project temporarily monkey-patched in that change too.

I realize that’s all a big technical mouthful, but the takeaway should be: Sheap is a really great tool, and exploring your Ruby heap can be very satisfying.

Update: Jean Boussier pointed me to the fix in Ruby 3.3 for call Cache for singleton methods can lead to “memory leaks” 🎉 And suggested looking at harb too as a Sheap-like.

Trigger GitHub Actions workflows with inputs from Apple Shortcuts

I’ve been using Apple Shortcuts to invoke GitHub Actions workflows to create webpage bookmarks. It’s been great! (disclosure: I do work at GitHub)

My use case: I’ve been wanting to quit Pinboard.in, so I needed an alternative way to create and host my web bookmarks, some of which date back to ~2005 del.icio.us vintage. It’s been easy enough for me to export all of my bookmarks (settings -> backup -> JSON) and convert them to YAML files to be served by Jekyll and GitHub Pages. But I also needed an easy way to create new bookmarks that would work on all my Apple devices. I ended up with:

  1. Bookmarks are organized as individual yaml files, in this blog’s repository.
  2. A Ruby script to take some simple inputs (url, title, notes), generate a new yaml file, and commit it to the repo using Octokit.
  3. A GitHub Actions workflow that accepts those same inputs and can be manually triggered, that runs the script. One thing to note is that I echo the inputs to $GITHUB_STEP_SUMMARY early in the workflow in case a later step errors, so I won’t lose the bookmark details and can go back later and manually fix it up.
  4. An Apple Shortcut that asks for those inputs (either implicitly via the Share Sheet or via text inputs) and then manually triggers the GitHub Actions workflow via the GitHub API.
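As a rough illustration of step 2, here is a sketch of turning the three inputs into a YAML file. The `_bookmarks/` directory, the filename scheme, and the field names are all assumptions for the example, not the layout this blog’s repository actually uses, and the commit step is left as a comment.

```ruby
require "yaml"
require "time"

# Hypothetical helper: builds a file path and YAML body for one bookmark.
# The directory, filename scheme, and keys are assumptions for illustration.
def bookmark_file(url:, title:, notes: "", now: Time.now.utc)
  slug = title.downcase.gsub(/[^a-z0-9]+/, "-").gsub(/\A-|-\z/, "")
  path = "_bookmarks/#{now.strftime("%Y-%m-%d")}-#{slug}.yml"
  content = {
    "url" => url,
    "title" => title,
    "notes" => notes,
    "date" => now.iso8601
  }.to_yaml
  [path, content]
end

path, content = bookmark_file(
  url: "https://example.com",
  title: "Hello, World!",
  notes: "An example bookmark",
  now: Time.utc(2024, 1, 2, 3, 4, 5)
)

puts path
puts content

# Committing would be a separate step, for example with Octokit's
# create_contents(repo, path, message, content); omitted so this runs offline.
```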

The only difficult part for me was getting Apple Shortcuts to work nicely with the GitHub REST API. Here’s what worked for me:

Use Get Contents of URL Action:

  • URL: https://api.github.com/repos/USER/REPOSITORY/actions/workflows/WORKFLOW.yml/dispatches
  • Method: POST
  • Headers:
    • Accept: application/vnd.github.v3+json
    • Authorization: Bearer GITHUB_ACCESS_TOKEN
  • Request Body: JSON
    • ref: main (or whatever branch you’re using)
    • inputs (Dictionary):
      • INPUT: VALUE
      • … and your other GitHub Actions workflow inputs
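For debugging outside of Shortcuts, the same request can be built in Ruby. This is a sketch using only the standard library; `workflow_dispatch_request` is a hypothetical helper, and the owner, repository, workflow filename, token, and input names are placeholders to replace with your own. The example builds the request without sending it.

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical helper mirroring the Shortcut configuration above.
# Workflow inputs are sent as strings under the "inputs" key.
def workflow_dispatch_request(owner:, repo:, workflow:, token:, ref: "main", inputs: {})
  uri = URI("https://api.github.com/repos/#{owner}/#{repo}/actions/workflows/#{workflow}/dispatches")
  request = Net::HTTP::Post.new(uri)
  request["Accept"] = "application/vnd.github.v3+json"
  request["Authorization"] = "Bearer #{token}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(ref: ref, inputs: inputs)
  [uri, request]
end

uri, request = workflow_dispatch_request(
  owner: "USER",
  repo: "REPOSITORY",
  workflow: "WORKFLOW.yml",
  token: "GITHUB_ACCESS_TOKEN",
  inputs: { "url" => "https://example.com", "title" => "Example", "notes" => "" }
)

puts "#{request.method} #{uri}"
puts request.body

# To actually send it (requires a real token and repository):
# response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
# A 2xx status means GitHub accepted the dispatch.
```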

Here’s what it looks like all together (btw, Dictionary-type inputs were broken in iOS 16 / Mac 13 😨) :

Screenshot of Apple Shortcut with the previous configuration

Recently, January 3, 2023

  • I’ve now watched the Taylor Swift Eras movie twice, once at home, and a second time over the holidays with niece (completely) and nephew (partly). My most burning question is whether Taylor menaces the same dancer every show’s “Tolerate it”, or if they share rhetorical pain. My Apple Music Replay also ranked highly with Taylor Swift, though also apparently Andrew McMahon; unexpected.
  • I started playing Talos Principle 1 after beating 2, though it’s a lot more intense with guns and exploding things and so many timing-based puzzles. I’ve almost beaten it… but also took a break to play Super Mario Wonder which is much more fun (especially, again, with niece and nephew).
  • I finished reading “Babel”, started then stopped reading “Wolf Hall”, and am picking my way through “Less”. There’s several nonfiction books in there that I’ve read about a chapter of, but nothing so memorable to note here. Prior to that I read “Translation State”, and realizing I didn’t remember half of what was happening, re-read the previous 4 Imperial Radch books, which only now makes me think I should re-read Translation State again for the clean sweep. And I guess I finished the latest Bruno book, “A Chateau Under Siege”, and it was fine (there’s only so many times one can describe Kir Royale, and I think we’re past it).
  • I believe the Crumpler Soupansalad Son-o is the perfect jacket-and-book-and-water-bottle-and-also-half-gallon-of-milk-and-greek-yogurt bag. It also hasn’t been made in like 15 years so I bought another one on eBay (I attempted to do this via Poshmark 6 months ago but it failed to deliver). Now I own two: my longtime bag in brown, the new one in black.
  • After 7 years of service, I’ve stepped down from my church’s Property Committee. The buildings are being sold. As Bonhoeffer wrote, albeit about Germany’s embrace of fascism and not a congregational vote on the role of real-estate within a small church’s modest investment portfolio: “If you board the wrong train, it is no use running along the corridor in the other direction.” I’m currently serving on zero committees anywhere; a special moment.
  • Reflecting on my past year at GitHub, I’ll just directly quote Rachel Andrew’s blog; search and replace as necessary:

    The layoffs at Google at the beginning of 2023, didn’t impact me or my writing team directly, however they cast a shadow over the year. I try to look at difficult situations through a lens of what I can actually do to change things or improve the situation. At my level of management I’m not privy to layoff decisions, but I can be there to support my team and make space to talk about their concerns. I can strive to make sure our work and the impact of it is visible, and I can make sensible business decisions to make the most of resources in a more constrained environment. And so, after the initial shock of it all, that’s how I’ve approached this year.

  • I’m slowly trending towards GoodJob v4.0. It’s looking like it will be a noop: a chance for me to clean up all of the migrations and deprecation warnings and probably stop support for Ruby < 3.0, but nothing noticeable otherwise. I want to have everything ready for anyone to opt into FOR UPDATE SKIP LOCKED from Advisory Locks, but as a feature that won’t be part of four-point-zero. We’ve managed to get to 3.22 (that’s 22 minor releases!) without a breaking change.
  • I’m not big on setting yearly goals, but I liked this evergreen advice from Ask a Manager, so I’m gonna continue aspiring to that:

    ideally you’d try to see it as a funny story you can tell friends rather than a hurtful snub that needs to bother you

Solid Queue first impressions: Nice!

Solid Queue was released yesterday, a new relational-database backend for Active Job.

I’m the author of GoodJob and I’ve been following along and am very interested and excited about Solid Queue. These are some things I noticed when first going through it.

tl;dr; It’s nice! I learned some things. It makes a few different choices than GoodJob and I’m very curious how they turn out (in a good way!).

I admit, I didn’t run Solid Queue in production. I poked through the code, got the development environment set up (spent the majority of my time trying to get mysql2 to compile, which is no surprise for me; Trilogy is my day job), ran the test suite and tried TDDing a new feature for it: perform_all_later support. These are just my notes, something to refer back to.

Lots of database tables: Solid Queue has many database tables. There is a “hot” table (my terminology) in which job records are queued/dequeued/locked by all the workers, and then several other tables where job records are staged (e.g. they’re scheduled in the future, so don’t insert them into the hot table yet) or archived after they complete/error. This seems smart because that hot table and its indexes stay compact and single purpose, which is good for performance. Compare that to GoodJob in which the jobs table has like 8 indexes to cover both queue/dequeue and Dashboard listings and everything else, which does slow down inserts and updates. I’ve had the impression with GoodJob that orchestrating across multiple tables would be more difficult (everything is tradeoffs!), so I’m very curious to see an alternative implementation in Solid Queue.

Note: I wasn’t successfully able to implement perform_all_later in my 1 hour timebox because it was more complicated than an insert_all because of the necessity of writing to multiple tables.

Aside: One of the very first comments I got when I launched GoodJob 3 years ago was like “your design assumptions are less than ideal” and then they never replied to any of my follow-ups. That sucked! This is not that. Nothing in Solid Queue is particularly concerning, just different (sometimes better!). Kudos to Rosa Gutiérrez and the Solid Queue developers; you’re doing great work! 💖

Again, lots of database tables: GoodJob is easy mode just targeting Postgres, because there are Advisory Locks and lots of Postgres-only niceties. I do not envy Solid Queue being multi-database, because it has to implement a bunch of stuff with a coarser toolbox. For example, there is a semaphores table, which is used for the Concurrency Controls feature (🎉). I think the “SOLID” libraries (also Solid Cache) are interesting because they have to implement behavior in a relational database that comes for free in in-memory databases (example: TTL/record expiration/purging).

Puma Plugin: TIL. Looks nicer and more explicit than GoodJob trying to transparently detect it’s in the webserver to run asynchronously.

Multiprocess. A nice surprise to me, Solid Queue has a multiprocess supervisor. It does seem like even the Puma plugin forks off another process though; that could have implications for memory constrained environments (e.g. Heroku dynos). I’m nearly 4 years into GoodJob and haven’t tackled multiprocess yet, so exciting to see this in Solid Queue’s first release.

Queue priority: Nice! I have opinions about how people set up their application’s queues, along the lines of: many people do it wrong, imo. Solid Queue looks like it provides a lot of good flexibility to let people easily migrate and configure their queues initially (though wrongly, by dependency, imo), but then reorient them more performantly (by latency, again imo). A single thread-pool/worker can pull from multiple queues.

Care. I notice lots of little things that are nice in Solid Queue. The code is clean. The indexes are named for their purpose/usage rather than just like index_table_column_column. The Puma Plugin is nice. There are things in GoodJob that I dream about what a clean-room, lessons-learned reimplementation would look like, but it’s never top of my priorities, and some things are never going back in the stable (table names are basically forever). Reading the Solid Queue code was a vicarious-nice! experience.

Differences. Do they even matter? I dunno:

  • No Dashboard yet. Waiting on Mission Control. GoodJob definitely got more twisty as I learned all of the things of “you want a button to do what now with those jobs? …oh, I guess that makes sense. hmm.”
  • No LISTEN/NOTIFY (yet?). Seems possible, but would be Postgres only so maybe not. That means latency will never be less than the polling frequency, though an example shows 0.1 seconds which seems good to me.
  • No cron-like functionality. It took me a minute to come around to the necessity of this, maybe Solid Queue will too. 🤦 I missed this on first read through: “Unique jobs and recurring, cron-like tasks are coming very soon.” 🙌

Final thoughts: Path dependency is hard, so I don’t imagine lots of people should swap out their job backend just because there is something new (please, don’t let me ever read a “3 job backends in 4 years” blog post). New projects and applications will be more likely making these choices (and they shouldn’t be valueless choices, hence my excitement for Solid Queue becoming first party to Rails) and I’m really excited to see how Solid Queue grows up with them, and alongside other options like GoodJob and Sidekiq and Delayed et al.

Recently

  • One of my coworkers said this week “You’ve been an engineering director and in leadership before, right? I appreciate your perspective; any advice and resources you’d recommend?” So that set my mind racing. I dunno. On one hand, it’s like, well, first, you grind out 10 years of 1-year of experience 10 times, but do it 50 times a year. On the other, keep a delta file and I also think about Secrets of Consulting quite a lot (content warning: I haven’t re-read it in a long time; I tried reading the same author’s The Psychology of Computer Programming more recently and couldn’t do it).
  • Work otherwise is in the final marathon of promo packets and performance reviews and quarterly planning and a reorg and oh, the next version of Ruby is released in 3 weeks and it’s go time. Then we do it all again. I love my team so much.
  • I finished reading The Final Architecture series. I didn’t enjoy it as much as Children of Time (“the octopus books”). After reading all the Gateway books and 3-body problem books, I’m a bit over the idea that there’s a malicious (or at least self-interested) group of people who are unhappy with the current value of the Planck constant and are doing something about it. I was into the subplot of alien criminality.
  • I finished Talos Principle 2, but I screwed up the golden gates thing, so I have to beat it again just to get the special special ending. I’ve been playing it in parallel with a friend and appreciate our back and forth:

    Me: I’m personally more worried about environmental catastrophe than AI, but i guess they’re intertwined. Material conditions that are unfit for life. Like some of the talos robots seem to touch on my philosophical question which is like: how do we maximize individual agency+satisfaction while also avoiding collective/systemic fucking-around-and-finding-out.

    Friend: we can see from the game the answer lies somewhere on the spectrum between having 1000 robits around a crumbling power source vs having a magic 3D printing pyramid for use to conquer the stars

    I also started playing Talos 1 and it’s much less chill than the 2nd game. I may not finish it.

  • For GoodJob, all of the things I want have been labeled “Help Wanted”. I do want to get the row-locking foundations in place myself, though I think the safe upgrade path for it might take a little while to straighten out. I think I have finally mastered advisory locks, so, of course, that means change it all up.
  • I ran bundle update on Day of the Shirt, which means I also upgraded to Shakapacker, which means that I have, once again, spent an entire weekend fumbling with Webpack configuration to get window.$ working. I also got a nice email from the owner of a t-shirt website that validated my thesis that no one visits websites anymore, let alone to buy t-shirts: the website owner got a (different) full-time job.

The Rails Executor: increasingly everywhere

The Rails Executor rules everything around your code.

If you write multithreaded-Rails code—like me, author of GoodJob—you’re probably familiar with the Rails Executor which is described in the Rails Multithreading Guide.

If you’re new to the Rails Executor: it sets up and tears down a lot of Rails’ framework magic. Code wrapped with a Rails Executor or its sibling, the Reloader, picks up a lot of powerful behavior:

  • Constant autoloading and reloading
  • Database connection/connection-pool management and query retries
  • Query Cache
  • Query Logging
  • CurrentAttributes
  • Error reporting

You usually won’t think about it. The Rails framework already wraps every Controller Action and Active Job with an Executor. Recently, as of Rails v7.1, it’s showing up everywhere within the Rails codebase:

The effect of these small changes could be surprising:

  • I came to write this blog post because I saw a Rails Discussion asking how “Rails 7.1 uses query cache for runner scripts” and aha, I knew the answer: the Executor.
  • I recently fixed a bunch of flaky GoodJob unit tests by wrapping each RSpec example in a Rails Executor. This is a problem specific to GoodJob, which uses connection-based Advisory Locks, but I discovered that if an Executor context was passed through (for example, executing an Active Job inline), the current database connection would be returned to the pool, sometimes breaking the Advisory Locks when a different connection was checked back out to continue the test. This was only a fluke of the tests, but was a longtime annoyance. I’ve previously had to work around a similar reset of CurrentAttributes that occurs too.
  • At my day job, GitHub, we’ve also been double-checking that all of our Rails-invoking scripts and daemons are wrapped with Rails Executors. Doing so has fixed flukey constant lookups, reduced our database connection error rate and increased successful query retries, and necessitated updating a bunch of tests that counted queries that now hit the query cache.

The Rails Executor is great! Your code is probably already wrapped by the Rails framework, but anytime you start writing scripts or daemons that require_relative "./config/environment.rb" you should double-check, and definitely if you’re using Thread.new, Concurrent::Future or anything that runs in a background thread.

I used the following code in GoodJob to debug whether database connection checkout occurs in a Rails Executor; maybe you could adopt something similar too:

# config/initializers/debug_executors.rb

ActiveSupport.on_load :active_record do
  ActiveRecord::ConnectionAdapters::AbstractAdapter.set_callback :checkout, :before, (lambda do |conn|
    unless ActiveSupport::Executor.active?
      $stdout.puts "WARNING: Connection pool checkout occurred outside of a Rails Executor"
    end
  end)
end

One last thing about Executors: you want to make sure that you’re wrapping individual units of work, so the execution context has a chance to reset itself (check-in database connections, unload and reload code, etc.):

# scripts/do_all_the_things.rb
# ...

# bad
Rails.application.executor.wrap do
  loop { MyModel.do_something }
end

# good
loop do
  Rails.application.executor.wrap { MyModel.do_something }
end
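
To make the difference countable, here is a minimal pure-Ruby sketch. ToyExecutor is a hypothetical stand-in that only counts how often its reset step runs; it is not how Rails implements the Executor, and the work lambda is a placeholder for MyModel.do_something.

```ruby
# ToyExecutor is a hypothetical stand-in, NOT Rails' implementation.
# It only counts how often the per-unit "reset" step runs.
class ToyExecutor
  attr_reader :resets

  def initialize
    @resets = 0
  end

  # Run the block, then reset per-unit state. In Rails this is roughly where
  # connections get checked in, code gets reloaded, and so on.
  def wrap
    yield
  ensure
    @resets += 1
  end
end

work = ->(n) { n * 2 } # placeholder for MyModel.do_something

# bad: one wrap around the whole loop, so state resets once, at the very end
loop_wrapped = ToyExecutor.new
loop_wrapped.wrap { 3.times { |i| work.call(i) } }

# good: one wrap per unit of work, so state resets after every iteration
unit_wrapped = ToyExecutor.new
3.times { |i| unit_wrapped.wrap { work.call(i) } }

puts "loop-wrapped resets: #{loop_wrapped.resets}" # 1
puts "unit-wrapped resets: #{unit_wrapped.resets}" # 3
```

In a daemon that loops forever, the first form would never reach its reset at all, which is the situation the "bad" example above describes.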

Update: I offered a Rails PR to make the script runner’s Executor conditional because the introduction of an Executor around bin/rails runner script.rb could introduce problems if the script is long-running/looping/daemon-like; developers would still need to use an Executor, but to wrap individual units of work in their long-running script.

Reflections on GoodJob for Solid Queue

Rails World presents Solid Queue and Mission Control

GoodJob, via its introductory blog post, was highlighted last week at Rails World. A new Active Job queue backend, Solid Queue, was announced, and I’m excited to see where it goes!

I attended Rails World in Amsterdam this past week. During the conference, a new Active Job backend was announced: Solid Queue (video), which has the potential to become the first first-party backend in Rails. Solid Queue, like my GoodJob, is backed by a relational database. I’m very excited about this! I had a chance to talk to Rosa Gutierrez, who is leading the effort at 37signals, and I’m hopeful that I’ll be able to contribute to Solid Queue and who knows, maybe it could even become a successor to GoodJob.

With that thought in mind, I reflected on some of the design and motivations that became GoodJob, and that I believe are important regardless of the Active Job backend under development. These are not intended to be design documents but more a list of things that I have learned or come across during my 3 years working on GoodJob. It would be nice to keep these in mind when designing a potential successor to GoodJob. And I hope they can be the seed to further conversations, rather than a fully realized proposal or argument. Let’s go:

  • SIGKILL Safety. Recovering from a SIGKILL (or someone unplugging the power cord) is always number one in my mind when thinking of GoodJob. That informed my desire to use Advisory Locks (which are automatically released on disconnect), and my future thinking about heartbeats if GoodJob switched over to using FOR UPDATE SKIP LOCKED instead of Advisory Locks. I do not think jobs should be limited to a specific timeout (as Delayed Job’s design uses) as that also creates significant retry latency when resumed, and jobs definitely shouldn’t be wrapped with a transaction either.
  • (Human) Exception and Retry Workflows. Everybody has a different workflow for how they deal with errors, and I believe that a backend needs to track, report (e.g. send to Sentry or Bugsnag) and expose the various reasons an error appears: retried, retry stopped, explicitly discarded, SIGKILLed/interrupted, unhandled error, etc. I still am dialing this in on GoodJob because there is wide variability of how people and teams manage their error workflows. I’m always learning something new. For example, there are very different answers on “when using retry_on SpecialError, attempts: 3 should the 4th error be reported to the exception tracker? What about an explicit discard_on? Should a discard_on error be reviewed and reenqueued or not?” If a job is SIGKILLed/interrupted, should it be automatically restarted or held for manual review? Everyone seems to do it differently! I haven’t cracked the code on what is “ideal” or reasonable to say “nope, don’t do it that way.” Active Job’s error handling isn’t clear cut either, so maybe we can make that better and come around to a more opinionated (but still inclusive) design. Maybe!
  • Process Harnesses. I think it’s interesting that Rails might ship with a 1st party queue backend before it ships with a 1st party webserver: there is a lot of operational overlap. Signal handling, timeouts, daemonization, liveness and healthcheck probes, monitoring and scaling instrumentation. There’s quite a lot of ground to cover, and a lot of different systems and tooling: Kubernetes, systemd, rc.d, Heroku, Judoscale, to name just a few of the various operational targets that I’ve spent considerable time supporting.
  • Repeating Jobs / Clock Process. It took me a while to come around to this in GoodJob, but I believe that performing work repetitively on a schedule (“cron-like”) is very much in the same problem-domain as background jobs. There’s lots of different ways to design it that I don’t feel strongly about, for example GoodJob minimizes autoloading by keeping schedules separate from job classes, but I do think it is necessary to plan for scheduled jobs in a well-architected Rails application.
  • Unique Jobs, Throttles, Fuses and other Concurrency Controls. Similarly to Repeating Jobs, demand is high for everything I’d bucket under “concurrency controls”, which I’ll say covers both enqueue and dequeue complexity. And these features are tough because they sit in counterbalance to overall performance: do you want to run jobs faster or smarter? And these are the features that I think are legit because there are other features below under Queue Design that I think are bunk. There’s a lot of discernment to do!
  • Queue design and multi-queue execution pools. I do think queue design is a place where lots of people do it wrong. I believe queues should be organized by maximum total latency SLO (latency_15s, latency_15m, latency_8h) and not by their purpose or dependencies (mailers, billing, api). Nate Berkopec believes similarly. And I think that informs that execution pools (e.g. thread pools) should be able to work from multiple queues and have independent concurrency configuration (e.g. number of threads), both to ease transition from the latter to the former, but also because it allows sharing resources as optimally as possible (having 3 separate pools that pull from "latency_15s", "latency_15m, latency_15s", and "latency_8h,*" in GoodJob’s syntax). I personally think concepts like priority or ordered-queues lead to bad queue design, so I wouldn’t sweat that. Any ordering regime more complex than first-in-first-out (FIFO) prioritizes capacity (or lack thereof) over latency. This might sound strange coming from me who champions running workloads in the web server on tiny dynos, but it’s different in my mind: I don’t think it’s possible to meet a latency target through prioritization when there is a fundamental lack of capacity.
  • Labels. Per the previous point, though I have yet to implement this in GoodJob (soon!), I think that giving developers the option to label their jobs might break their bad habit of using queue names as functional labels, instead of what I believe queues should be appropriately used for: latency and quality-of-service thresholds. I mention it here just in case that informs Solid Queue’s design.
  • Observability. GoodJob maintains a lot of bookkeeping, keeping job and granular execution data around after execution so it can be inspected. People seem to like that, and it’s necessary to keep them around for calculating accurate latency metrics, though it all is a trade-off against performance. It makes for a fun Web Dashboard too.
  • Performance Envelope. I dunno, I mention this just because I think people spend an inordinate amount of time comparing queue backend performance and asking “do the jobs go brrrrr?” GoodJob targets the small and medium end of projects (though some big ones use it too) and prioritizes operational simplicity over performance. That works for me (and a lot of others!) but also isn’t really reflective of the scale of companies leading Rails development. There’s a tension here.
  • Making better mistakes tomorrow. I’m really proud of having a reputation for being helpful and responsive and curious in the GoodJob issue queue and discussions and various support Slacks (like Rails Link). I think there is a lot to the queue backend domain that won’t be learned by upfront analysis, and that can’t be easily bucketed into either “the library is doing it wrong” or “the developer is doing it wrong”. There’s a lot of variation! (Not to mention across JRuby, etc. and various database versions.) I’m able to do things with GoodJob that I think are unlikely on a 1st party Rails queue backend (like cutting a new release after every patch and fix), and I’m able to stay oriented to the people and the problem they’re trying to solve over the technological solution itself. I hope all that can be preserved as these things move upstream.
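
To make the latency-based queue idea from the list above a bit more concrete, here is a rough pure-Ruby sketch. The queue names mirror the ones mentioned above; the LATENCY_QUEUES table and the queue_for helper are hypothetical, made up for illustration, and not part of GoodJob or Active Job.

```ruby
# Hypothetical table of queue names and their latency SLOs, in seconds.
LATENCY_QUEUES = {
  latency_15s: 15,
  latency_15m: 15 * 60,
  latency_8h: 8 * 60 * 60
}.freeze

# Pick the loosest queue whose SLO still fits within what the job can tolerate.
# (Hypothetical helper: in a real app you would likely just hardcode the queue
# name on each job class.)
def queue_for(max_latency_seconds)
  name, _slo = LATENCY_QUEUES
    .select { |_name, slo| slo <= max_latency_seconds }
    .max_by { |_name, slo| slo }
  raise ArgumentError, "no queue is fast enough for #{max_latency_seconds}s" unless name

  name
end

puts queue_for(60)      # latency_15s
puts queue_for(60 * 60) # latency_15m
```

The point of the sketch is only that the question asked of each job is "how long can this wait?" rather than "what kind of work is this?".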

That’s it! I’m probably forgetting stuff, so I’ll reserve the right to keep adding to this list. I’d love to keep talking about this and hope that Solid Queue will be fantastic!

Oh, and Solid Queue isn’t released yet, so if this seems compelling, use GoodJob in the meantime.

Writing Object Shape friendly code in Ruby

Update: Jean Boussier wrote a deeper explanation of how Ruby Object Shapes are implemented (and more up-to-date for Ruby 3.3, unreleased as of October 23, 2023) and when and how to optimize for them.

My rule of thumb is that one or two memoized variables in a class are fine, but more than that likely deserve a quick refactor.

My original post is below…

Ruby 3.2 includes a performance optimization called Object Shapes that changes how the Ruby VM stores, looks up, and caches instance variables (the variables that look like @ivar). YJIT also takes advantage of Object Shapes, and the upcoming Ruby 3.3 has further improvements to the performance of Object Shapes.

This is a brief blog post about how to write your own Ruby application code that is optimized for Object Shapes. If instead you’d like to learn more about how Object Shapes is implemented in Ruby, watch Aaron Patterson’s RubyConf 2022 video or read this explanation from Ayush Poddar.

Big thank you to my colleagues John Hawthorn and Matthew Draper for feedback on the coding strategies described here. And John Bachir, Nate Matykiewicz, Josh Nichols, and Jean Boussier whose conversation in Rails Performance Slack inspired it.

The general rule: define your instance variables in the same order every time

To take advantage of Object Shape optimizations in your own Ruby Code, the goal is to minimize the number of different shapes of objects that are created and minimize the number of object shape transitions that occur while your application is running:

  • Ensure that instances of the same class share the same object shape
  • Ensure that objects do not frequently or unnecessarily transition or change their shape
  • Help objects that could share the same object shape (e.g. substitutable child classes) to do so, with reasonable effort and without compromising readability and maintainability.

This succinct explanation is from Ayush Poddar, and explains the conditions that allow objects to share a shape:

New objects with the same [instance variable] transitions will end up with the same shape. This is independent of the class of the object. This also includes the child classes since they, too, can re-use the shape transitions of the parent class. But, two objects can share the same shape only if the order in which their instance variables are set is the same.

That’s it, that’s what you have to do: if you want to ensure that two objects share the same shape, make sure they define their instance variables in the same order. Let’s start with a counterexample:

# Bad: Object Shape unfriendly
class GroceryStore
  def fruit
    @fruit = "apple"
  end

  def vegetable
    @vegetable = "broccoli"
  end
end

# The "Application"
alpha_store = GroceryStore.new
alpha_store.fruit # defines @fruit first
alpha_store.vegetable # defines @vegetable second

beta_store = GroceryStore.new
beta_store.vegetable # defines @vegetable first
beta_store.fruit # defines @fruit second

In this example, alpha_store and beta_store do not share the same object shape because the order in which their instance variables are defined depends on the order the application calls their methods. This code is not Object Shape friendly.
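
One rough way to see this for yourself is to print each object's instance_variables. This is only a proxy and not the VM's actual shape: on Ruby 3.2 and later I would expect the two lists to come out in different orders, while older Rubies may report the same order for both. A minimal sketch, repeating the class from above so it runs on its own:

```ruby
class GroceryStore
  def fruit
    @fruit = "apple"
  end

  def vegetable
    @vegetable = "broccoli"
  end
end

alpha_store = GroceryStore.new
alpha_store.fruit
alpha_store.vegetable

beta_store = GroceryStore.new
beta_store.vegetable
beta_store.fruit

# Same set of instance variables, but possibly listed in a different order.
p alpha_store.instance_variables
p beta_store.instance_variables
puts "same order? #{alpha_store.instance_variables == beta_store.instance_variables}"
```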

Pattern: Define your instance variables in initialize

The simplest way to ensure instance variables are defined in the same order every time is to define the instance variables in #initialize:

# Good: Object Shape friendly
class GroceryStore
  def initialize
    @fruit = "apple"
    @vegetable = nil # declare but assign later
  end

  def fruit
    @fruit
  end

  def vegetable
    @vegetable ||= "broccoli"
  end
end

It’s also ok to define instance variables implicitly with attr_* methods in the class body, which has the same outcome of always defining the instance variables in the same order. Update: Ufuk Kayserilioglu informed me that attr_* do not define the instance variable until they are first called, meaning that these methods or their associated instance variables should also be declared with a value in #initialize.
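
Here is a small sketch of that update. LazyStore and EagerStore are hypothetical names for illustration: the first relies on attr_accessor alone, the second also assigns both variables in #initialize.

```ruby
# Hypothetical class: attr_accessor only, nothing assigned up front.
class LazyStore
  attr_accessor :fruit, :vegetable
end

# Hypothetical class: same accessors, but both variables assigned in #initialize.
class EagerStore
  attr_accessor :fruit, :vegetable

  def initialize
    @fruit = nil
    @vegetable = nil
  end
end

lazy = LazyStore.new
lazy.fruit                   # reading through the accessor does not define @fruit
p lazy.instance_variables    # => []
lazy.vegetable = "broccoli"  # the writer defines @vegetable, and it ends up first
p lazy.instance_variables    # => [:@vegetable]

eager = EagerStore.new
p eager.instance_variables   # => [:@fruit, :@vegetable]
```

With LazyStore, the order the variables get defined in still depends on which writer the application happens to call first, which is the same problem as the counterexample.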

Now I realize this is a very simplistic example, but that’s really all there is to it. If it makes you feel better, at GitHub where I work, we have classes with upwards of 200 instance variables. In hot code, where we have profiled, we go to a negligible effort of making sure those instance variables are defined in the same order; it’s really not that bad!

Pattern: Null memoization

Using instance variables to memoize values in your code may present a challenge when nil is a valid memoized value. This is a common pattern in Ruby that is not Object Shape friendly:

# Bad: Object Shape unfriendly
class GroceryStore
  def fruit
    return @fruit if defined?(@fruit)
    @fruit = an_expensive_operation
  end
end

Rewrite this by creating a unique NULL constant and checking for its presence instead:

# Good: Object Shape friendly
class GroceryStore
  NULL = Object.new
  NULL.freeze # not strictly necessary, but makes it Ractor-safe

  def initialize
    @fruit = NULL
  end

  def fruit
    return @fruit unless @fruit == NULL
    @fruit = an_expensive_operation 
  end
end
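
To check that nil really does get memoized, here is a runnable variation of the class above. The an_expensive_operation stub and the lookups counter are hypothetical additions for the demo, and I've swapped == for NULL.equal? so the check compares object identity rather than calling == on whatever value was memoized.

```ruby
class GroceryStore
  NULL = Object.new.freeze

  attr_reader :lookups # hypothetical counter, only here for the demo

  def initialize
    @fruit = NULL
    @lookups = 0
  end

  def fruit
    return @fruit unless NULL.equal?(@fruit)
    @fruit = an_expensive_operation
  end

  private

  # Hypothetical stand-in: counts its calls and returns nil ("out of stock").
  def an_expensive_operation
    @lookups += 1
    nil
  end
end

store = GroceryStore.new
store.fruit
store.fruit
puts store.lookups # 1, the nil result was memoized after the first call
```

A plain @fruit ||= an_expensive_operation would have run the stub on both calls here, because nil is falsy.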

Alternatively, if you’re doing a lot of meta or variable programming and you need an arbitrary number of memoized values, use a hash and key check instead:

# Good: Object Shape friendly
class GroceryStore
  def initialize
    @produce = {}
  end

  def produce(type)
    return @produce[type] if @produce.key?(type)
    @produce[type] = an_expensive_operation(type) 
  end
end

That’s it

Creating Object Shape friendly code is not very complicated!

Please reach out if there are other patterns I’m missing: [email protected] / twitter.com/@bensheldon / ruby.social/@bensheldon

In defense of consensus

There’s a style of reactionary meme that takes a photo of like, empty store-shelves or a trash-strewn street, and applies the image macro “This is what Communism looks like”. But upon closer inspection (and social media lampooning), it’s a photo of America, capitalist America, very much not under communism. We’ll come back to this.

Let’s talk about “consensus”. Not a week goes by in my Bay Area tech worklife where I don’t read or hear someone dragging consensus. Consensus is pilloried: weak, indecisive, lowest-common denominator, unclear, drawn out… consensus is bad, they say.

Working in tech for a decade, I have to admit this struck me as strange the first time I heard a coworker complain about that bogeyman “consensus”. I’ve been a facilitator of consensus-based practice for 13 years. These practices, taught to me through the Institute for Cultural Affairs’ ToP (“Technology of Participation”) series, served me well when I was doing nonprofit and community work, serving on boards and facilitating offsites. And consensus-based practices have served me well in tech and business too: using its methods to do discovery, lead meetings, get feedback, and drive decision-making and action. I do strategic planning consultation too.

The consensus-based practices I’ve learned take a group through a process: beginning with a prompt or need, then collecting facts and inputs, understanding people’s reactions to them, their interpretations and implications, and ultimately describing a series of actions and commitments to take. This can be a simple conversation, or a multi-day event that builds fractally on itself: a preceding session’s final actions could be deciding on what will be the following session’s initial inputs. When I’m working with leaders to design the process, we’ll discuss what responsibilities we want to delegate to the group, and what decisions will be retained among leadership. Leadership remains accountable, in the sense that there is a legible decision-making process, which is a strong benefit of deliberative practice. That’s “consensus”.

Alternatively, the Bay Area tech process, not “consensus”, oh no, seems to follow these recipes:

  • Plan and socialize
  • Disagree and commit

I was introduced to “plan and socialize” in my second tech job, being mentored by the Director of Engineering. To “socialize” is more than informing people, it’s having conversations and helping them understand how a plan or proposal will affect their work, and getting feedback that might lead to adjustments or compensatory actions. It’s also somewhat vague: asking people to leave comments in a google doc, attend an office hours, or a loosely moderated feedback session. Decisions, once made, are also socialized: explained, defended, adjusted, or white-knuckled through.

Depending on their power level, leaders may then ask people to “disagree and commit” meaning that the (negative) feedback has been heard but those underlings must commit to carrying the plan out regardless. Suck it up, professionally, so to speak. Sometimes this is used as performance feedback: “I’m aware you’ve been sharing your dislike of the plan with coworkers. That lack of trust is undermining the business. I need you to disagree and commit”… and keep your thoughts to yourself.

Under the spotlight, these approaches look less like bold and steely decision-making, and more like mumbly plan shifting backed by blusterful threats. Like the “this is what communism looks like”-meme, the scary-othered threat is not “consensus” but simply the current reality: confused, inadequate, probationary, triangulating, embarrassing, shameful.

There’s a joke in civic tech: government tech projects may say they can’t do incremental development, but that’s exactly what happens after their big-bang waterfall launch crashes-and-burns and they end up having to fix it one piece at a time. Clay Shirky captures it in “How Willful Ignorance Doomed HealthCare.gov”:

It is hard for policy people to imagine that HealthCare.gov could have had a phased rollout, even while it is having one. At launch, on Oct. 1, only a tiny fraction of potential users could actually try the service. They generated errors. Those errors were handed to a team whose job was to improve the site, already public but only partially working. The resulting improvements are incremental and put in place over a period of months. That is a phased rollout, just one conducted in the worst possible way.

Bay Area tech has the same relationship to decisions and consensus: by “socializing” plans and decisions, leaders are trying to craft a deliberative process for information sharing, feedback gathering, and alignment building. They’re simply doing it after they’ve already written and decided on an insufficient course of action and are grasping for a fix. Ultimately they are reaching for consensus, just consensus conducted in the worst possible way.

Please think of this the next time you hear (or say) something bad about consensus. Consensus is pretty great, and even better when used from the start.

The Institute for Cultural Affairs has lots of trainings on consensus-based facilitation. The Center for Strategic Facilitation is the Bay Area’s local trainer and service provider, but there are trainers and service providers all over the globe.

There is a system known as “Formal Consensus” which gained some notability during the 1999 “Battle of Seattle” WTO protests as a means of empowering small groups, particularly indigenous representatives, by providing a limited and fixed number of “blocks” during deliberations to stop actions proposed by far larger groups. Also how my buddy organized FreeGeek Chicago. I have never heard anyone in Bay Area tech reference any of this in regards to what they mean by consensus.

