GoodJob v1.2: Multithreaded queue isolation and LISTEN/NOTIFY

GoodJob version 1.2 has been released. GoodJob is a multithreaded, Postgres-based, ActiveJob backend for Ruby on Rails. If you’re new to GoodJob, read the introductory blog post.

GoodJob’s v1.2 release adds multithreaded queue isolation for easier congestion management, and usage of Postgres LISTEN/NOTIFY to greatly reduce queue latency.

Version 1.2 comes out 2 weeks after GoodJob v1.1, and 5 weeks after GoodJob’s initial v1.0 release.

Multithreaded queue isolation

GoodJob v1.2 adds multithreaded queue isolation for easier congestion management. Queue isolation ensures that slow, long-running jobs do not block the execution of higher priority jobs.

Achieving queue isolation has always been possible by running multiple processes, but GoodJob v1.2 makes it easy to configure multiple isolated thread-pools within a single process.

For example, to create a pool of 2 threads working from the mice queue, and 1 thread working from the elephants queue:

$ bundle exec good_job --queues="mice:2;elephants:1"

Or via an environment variable:

$ GOOD_JOB_QUEUES="mice:2;elephants:1" bundle exec good_job

Additional examples and syntax:

  • --queues=*:2;mice,sparrows:1 will create two thread-pools, one running jobs on any queue, and another dedicated to mice and sparrows queued jobs.
  • --queues=-elephants,whales:2;elephants,whales:1 will create two thread-pools, one running jobs from any queue except the elephants or whales, and another dedicated to elephants and whales queued jobs.
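
To make the syntax above concrete, here is a minimal sketch of how such a string could be broken down into thread-pool definitions. The `PoolConfig` struct and `parse_queue_pools` method are hypothetical names for illustration, not GoodJob's actual parser, and the sketch assumes every segment carries an explicit thread count.

```ruby
# Hypothetical parser for a --queues string; not GoodJob's real implementation.
PoolConfig = Struct.new(:queues, :exclude, :threads)

def parse_queue_pools(queue_string)
  # Pools are separated by ";", and the thread count follows the last ":".
  queue_string.split(";").map do |segment|
    names, threads = segment.split(":", 2)
    exclude = names.start_with?("-")   # a leading "-" means "every queue except these"
    names = names.delete_prefix("-")
    PoolConfig.new(names.split(","), exclude, Integer(threads))
  end
end

parse_queue_pools("-elephants,whales:2;elephants,whales:1").each do |pool|
  scope = pool.exclude ? "all queues except" : "only"
  puts "#{pool.threads} thread(s) for #{scope} #{pool.queues.join(', ')}"
end
```

In this sketch the `*` wildcard is simply kept as a queue name; a real implementation would need to treat it as "any queue".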

LISTEN/NOTIFY

GoodJob now uses Postgres LISTEN/NOTIFY to push newly enqueued jobs for immediate execution. LISTEN/NOTIFY greatly reduces queue latency, the time between when a job is enqueued and execution begins.

LISTEN/NOTIFY works alongside GoodJob’s polling mechanism. Together, jobs queued for immediate execution (ExampleJob.perform_later) are executed immediately, while future scheduled jobs (ExampleJob.set(wait: 1.hour).perform_later) are executed at (or near) their set time.
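
The way a push notification and a polling interval can coexist is easier to see in a small in-process analogy. The `WakeSignal` class below is hypothetical and uses only Ruby's standard library; it stands in for a Postgres NOTIFY arriving on a LISTENing connection and is not how GoodJob itself is written.

```ruby
# In-process analogy only: a worker sleeps for up to poll_interval seconds,
# but wakes immediately when the enqueuing side signals new work.
class WakeSignal
  def initialize
    @mutex = Mutex.new
    @condition = ConditionVariable.new
    @pending = false
  end

  # Called by the enqueuing side, like a NOTIFY sent after inserting a job.
  def notify
    @mutex.synchronize do
      @pending = true
      @condition.signal
    end
  end

  # Called by the worker. Returns :notified as soon as a notification
  # arrives, or :poll_interval_elapsed once the interval has passed.
  def wait(poll_interval)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + poll_interval
    @mutex.synchronize do
      until @pending
        remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
        break if remaining <= 0
        @condition.wait(@mutex, remaining)
      end
      reason = @pending ? :notified : :poll_interval_elapsed
      @pending = false
      reason
    end
  end
end

signal = WakeSignal.new
worker = Thread.new { signal.wait(5) }  # would otherwise sleep the full 5 seconds
signal.notify                            # a job was just enqueued
puts worker.value                        # prints "notified"
```

The polling timeout remains useful in this model: a job scheduled for the future produces no notification at the moment it becomes due, so the periodic wake-up is what picks it up.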

Upcoming

In the next release, v1.3, I plan to include a simple web dashboard for inspecting job execution performance, and focus on improving GoodJob’s documentation.

Contribute

Code, documentation, and curiosity-based contributions are welcome! Check out the GoodJob Backlog, comment on or open a GitHub Issue, or make a Pull Request.

I’ve also set up a GitHub Sponsors Profile if you’re able to support me and GoodJob monetarily. It helps me stay in touch and send you project updates too.


GoodJob v1.1: async and improved documentation

GoodJob version 1.1 has been released. GoodJob is a multithreaded, Postgres-based, ActiveJob backend for Ruby on Rails. If you’re new to GoodJob, read the introductory blog post.

GoodJob’s v1.1 release contains a new, economical execution mode called “async” to execute jobs within the webserver process with the same reliability as a separate job worker process.

This release also contains more in-depth documentation based on feedback and questions I’ve received since the v1.0 release.

Version 1.1 comes out 3 weeks after GoodJob v1.0. The initial release of GoodJob was featured on Ruby Weekly, A Fresh Cup, Awesome Ruby, and was as high as #8 on Hacker News. GoodJob has since received nearly 500 stars on Github.

Async mode

In addition to the $ good_job executable, GoodJob now can execute jobs inside the webserver process itself. For light workloads and simple applications, combining web and worker into a single process is very economical, especially when running on Heroku’s free or hobby plans.

GoodJob’s async execution is compatible with Puma, in multithreaded (RAILS_MAX_THREADS), multi-process (WEB_CONCURRENCY), and memory efficient preload_app! configurations. GoodJob is built with Concurrent Ruby which offers excellent thread and process-forking safety guarantees. Read the GoodJob async documentation for more details.

On a personal level, I’m very excited to have this feature in GoodJob. Async execution was the compelling reason I had previously adopted Que, another Postgres-based backend, in multiple projects and I was heartbroken when Que dropped support for async execution.

Improved documentation

Since GoodJob was released 3 weeks ago, the documentation has been significantly expanded. It contains more code and examples for ensuring reliability and handling job errors. I’ve had dozens of people ask questions through Github Issues and Ruby on Rails Link Slack.

Upcoming

In the next release, v1.2, I plan to simplify the creation of multiple dedicated threadpools within a single process. The goal is to provide an economical solution to congestion: when a number of slow, low-priority jobs (elephants) are executing, no execution resources are available for newly enqueued fast, high-priority jobs (mice) until the currently executing elephants complete.

A proposed configuration, for example:

--queues=mice:2,elephants:4

…would allocate 2 dedicated threads for jobs enqueued on the mice queue, and 4 threads for the elephants queue. Learn more in the feature’s Github Issue.

Contribute

GoodJob continues to be enjoyable to develop and build upon Rails’ ActiveJob and Concurrent Ruby. Contributions are welcomed: check out the GoodJob Backlog, comment on or open a Github Issue, or make a Pull Request.


Introducing GoodJob 1.0, a new Postgres-based, multithreaded, ActiveJob backend for Ruby on Rails

GoodJob is a new Postgres-based, multithreaded, second-generation ActiveJob backend for Ruby on Rails.

Inspired by Delayed::Job and Que, GoodJob is designed for maximum compatibility with Ruby on Rails, ActiveJob, and Postgres to be simple and performant for most workloads.

  • Designed for ActiveJob. Complete support for async, queues, delays, priorities, timeouts, and retries with near-zero configuration.
  • Built for Rails. Fully adopts Ruby on Rails threading and code execution guidelines with Concurrent::Ruby.
  • Backed by Postgres. Relies upon Postgres integrity and session-level Advisory Locks to provide run-once safety and stay within the limits of schema.rb.
  • For most workloads. Targets full-stack teams, economy-minded solo developers, and applications that enqueue less than 1-million jobs/day.

Visit GitHub for instructions on adding GoodJob to your Rails application, or read on for the story behind GoodJob.

A “Second-generation” ActiveJob backend

Why “second-generation*”? GoodJob is designed from the beginning to be an ActiveJob-backend in a conventional Ruby on Rails application.

First-generation ActiveJob backends, like Delayed::Job and Que, all predate ActiveJob and support non-Rails applications. First-generation ActiveJob backends are significantly more complex than GoodJob because they separately maintain a lot of functionality that comes with a conventional Rails installation (ActiveRecord, ActiveSupport, Concurrent::Ruby) and re-implement job lifecycle hooks so they can work apart from ActiveJob. I’ve observed that this can make them slow to keep up with major Rails changes. An impetus for GoodJob was reviewing the number of outages, blocked upgrades, and forks of first-generation backends I’ve managed during both major and minor Rails upgrades over the years.

As a second-generation ActiveJob backend, GoodJob can draft off of all the advances and solved problems of ActiveJob and Ruby on Rails. For example rescue_from, retry_on, discard_on are all implemented already by ActiveJob.

GoodJob is significantly thinner than first-generation backends, and over the long run hopefully easier to maintain and keep up with changes to Ruby on Rails. For example, GoodJob is currently ~600 lines of code, whereas Que is ~1,200 lines, and Delayed::Job is ~2,300 lines (2,000 for delayed_job, and an additional 300 for delayed_job_active_record).

*“Second generation” was coined for me by Daniel Lopez on Ruby on Rails Link Slack.

Postgres-based

I love Postgres. Postgres offers a lot of features, has safety and integrity guarantees, and simply running fewer services (skipping Redis) means less complexity in development and production.

GoodJob builds atop ActiveRecord. It’s numbingly boring, in a good way.

GoodJob uses session-level Advisory Locks to provide run-once guarantees with relatively little performance implications for most workloads.

GoodJob’s session-level Advisory Lock implementation is perhaps the only “novel” aspect, that comes from my experience orchestrating complex web-driving of government systems (“the browser is the API”) for Code for America. GoodJob uses a Common Table Expression (CTE) to find, lock, and return the next workable job in a single query. Session-level Advisory Locks will gracefully relinquish that lock if interrupted, without having to maintain a transaction for the duration of the job.
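
Postgres advisory lock functions such as pg_try_advisory_lock take a 64-bit integer key, so a job identified by something other than an integer needs to be mapped to one. The sketch below shows one possible mapping, hashing the id and keeping the first 64 bits. This is an assumption for illustration: the post does not describe how GoodJob derives its keys, and `advisory_lock_key` and the sample id are made up.

```ruby
require "digest/md5"

# Hypothetical helper: maps any job id to a signed 64-bit integer that fits
# the bigint key accepted by Postgres advisory lock functions.
def advisory_lock_key(job_id)
  hex = Digest::MD5.hexdigest(job_id.to_s)[0, 16]  # first 64 bits of the hash
  [hex].pack("H*").unpack1("q>")                    # read as a big-endian signed int64
end

job_id = "2f1b7c3e-8d4a-4e1f-9b6a-0c5d7e9f1a2b"     # made-up UUID
puts advisory_lock_key(job_id)
```

Because the key depends only on the id, every worker computes the same key for the same job, which is what lets a session-level lock on that key act as a run-once guard.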

Multi-threaded

GoodJob uses Concurrent::Ruby to scale and manage jobs across multiple threads. “Concurrent Ruby makes one of the strongest thread-safety guarantees of any Ruby concurrency library”. Ruby on Rails has adopted Concurrent Ruby, and GoodJob follows its lead and thread-execution and safety guidelines.

In building GoodJob I leaned heavily on my positive experiences running Que, another multithreaded backend, on Heroku. Threads are great for balancing simplicity, economy, and performance for typical IO-bound workloads like heavy database queries, API requests, Selenium web-driving, or sending emails.

A feature that won’t be in GoodJob 1.0, but I hope to implement soon, is the ability to run the GoodJob scheduler inside the webserver process (“async mode”). This was a feature withdrawn from Que, but I believe it can be safely implemented with Concurrent Ruby. An async mode would offer even greater economy, for example, in Heroku’s constrained environment.

GoodJob is right for me

GoodJob’s design is based directly on my experience in 2-pizza, full-stack teams, and as an economy-minded solo developer. GoodJob already powers Day of the Shirt and Brompt performing tens-of-thousands of real-world jobs a day.

Is GoodJob right for you?

Try it out and let me know.


Retail politics

I will quote anything that reinforces the necessity of showing up. From SF Weekly’s “The Many Faces of Leland Yee: A Politician’s Calculated Rise and Dramatic Fall”:

Upon reflection, Yee’s principles may be ever-shifting and his policies may be decorative, but he found a way around this: by being omnipresent.

He knew the name of every neighborhood stalwart from every neighborhood club; he cleaned hundreds of plates at hundreds of Chinatown banquets; he sat through countless community meetings, gathering hundreds of converts at a time: “In local politics,” says one longtime player, “a cup of coffee and a handshake can win you a friend for life.”

Yee showed up at your kid’s bar mitzvah or high school graduation; he showed up at your community gathering; he showed up at your neighborhood bazaar — in short, he showed up. His staff returned your phone call. And he read your letters: A former associate says Yee never failed to leave the office at the end of a long day toting a thick stack of mail that he made a point of poring through. In insider jargon, this is known as “retail politics.” Few worked harder or did it better.


Engineering Operations is not the same as Development

I wrote this memo several years ago when I joined GetCalFresh as the first outside engineering hire. An early focus of mine was helping the team move more confidently into an operational mindset: this memo reframes the team’s existing values that drive development as values that also support operations. This also overlaps greatly with a talk I gave at Code for America’s 2018 Summit, “Keeping Users at the Forefront While Scaling Services”.

Over the past month GetCalFresh has tripled the number of food stamp applications we’re processing. We often talk about “build the right thing”, but I wanted to focus on what it means to “operate a thing safely”.

Understanding operational failure

GetCalFresh collects foodstamp applicants’ information via a series of webforms, and then submits that applicant information to the county to begin the foodstamp eligibility process.

The website and webforms being offline or unavailable is bad.

Failing to submit application information to the county in a timely manner is awful. Foodstamp benefits are prorated to the day that the client’s application arrives at the county before 5pm. Failing to deliver a client’s application in a timely manner literally means less food on the table for a hungry family.

Our system is operationally “safe” when it ensures that client information is transmitted to the county in a timely manner. Our system experiences an operational “failure” when information is not submitted in a timely manner. Our system has operational “risk” that degrades safety and is the potential for an operational failure.

Risks in complicated, complex and chaotic systems

Keeping a website online is complicated, but can be addressed with good practice. We use boring technologies: Ruby on Rails, SQL, AWS, that scale and respond predictably and are part of a mature ecosystem of monitoring tools and practice.

Submitting client information to the county is complex and sometimes chaotic. Because county systems often have no API, we have a queue of jobworkers that use Selenium Webdriver to click through and type into a “virtualized” headless Firefox browser. Automating this leads to emergent and novel problems. Client data must be transformed into a series of scripted actions to be performed across multiple county webpages, with dynamic forms and data fields. The county websites may be offline or degraded, and occasionally their structure and content changes. Additional client documents may need to be faxed, emailed or uploaded to the county, and those systems can be degraded as well.

Our applicants themselves can cause operational risks. As we target new populations and demographics (e.g. seniors, students, military families, homeless, low-literacy or non-English-speaking), we discover new usability issues and challenges in collecting and transforming data from our webforms into county systems. For example, different county systems have different optional and required fields and expect names and addresses to be sanitized and tokenized differently.

In this system, we cannot reliably (or affordably, with time and resources) predict how this system will respond as it scales to new users or integrates with new counties.

Creating safety with staff and time

We ensure that foodstamp applications are submitted in a timely manner through existing staff and dedicated time. Because we cannot reliably predict how our system scales or responds to changes, we have systems that alert us to the risk of operational failure and engineers who are available to respond, remediate, and harden against similar circumstances in the future.

Every day, engineers block out 4pm to 5pm as “Apps & Docs”. We use this time to review any food stamp applications that failed our automated submission process to ensure the applications are submitted to the county by the daily deadline. Problems are documented and potential improvements are added to or reprioritized within the team’s backlog. We create safety by sometimes reaching out to clients for clarification or correction. In the event of an operational failure (we are not able to submit their application that day), we try to make things right; sometimes offering a gift card the client can use to purchase food.

Examples of problems identified during our hour of Apps & Docs:

  • Services not allowing multiple parallel sessions using the same credentials.
  • Inconsistent address tokenization for college campuses, military bases, PO boxes, and Private Mail Boxes
  • Frequency of people uploading iexplore.exe and notes.app instead of their intended document
  • Forms that do not allow non-ASCII characters
  • Forever optimizing headless Firefox, writing flexible and reliable Selenium scripts, and managing an increasing fleet of specialized jobworkers

Trade operational risk for speed of learning

We can’t predict the exact operational issues we’ll experience during a given day, but by scheduling and protecting one hour per day for operational tasks, we can deliberately trade risk for flexibility. Flexibility comes because we can accept small risks by introducing incomplete or manual-intervention-required workflows into the system. We do not have to build for every edge case or automate every action. We can develop features faster and create more opportunities to learn with real users in a real operational environment. This is an operationalization of our engineering principle “don’t argue, ship”.

Takeaways

  • Define operational failure: Leaving failure ambiguous can lead to fire-drills on every bad experience and exception, even if they may not have a material impact on business process or metrics. Defining service level objectives helps everyone self-organize, prioritize and understand the impact of their work.
  • Operationalize operations: Unexpected things happen all the time, but merely saying “high priority interrupt” does not expose the actual cost of response and remediation. Blocking out explicit times and spaces helps measure, and thus manage, work that might otherwise be overlooked.
  • Protect Developers’ time only so much: “Any improvement not made at the constraint is an illusion.” Approaching automation as an iterative and forever-incomplete process enables our team to move quickly in optimizing the system as a whole. When manual remediation is at risk of overflowing our time block, we dedicate time to greater automation; when we have perceived sufficient tolerances, we can push product features faster by manually tasking edge-cases.
  • Operations is a practice: Product Design and Development principles and practice provide a strong foundation and an experienced team can greatly reduce the risk of technical and market failure… but they can’t eliminate it. Operations is a field and practice that can reinforce and elevate Product Design and Development.

Decade in Review 2010-2019

In loose category and no particular order, other than I think they warrant mentioning.

Personal

  • Communications and mental health. Two things that greatly influenced me were reading Nonviolent Communication and doing Mood Gym.
  • Inclusion (continuation). Compared to last decade I’ve practiced in larger groups and communities, from workplace to church. Two books that stick with me are White Fragility and Dear Church: A Love Letter From a Black Preacher to the Whitest Denomination in the US.
  • Business. I incorporated my own business, Day of the Shirt, for which I’ve been filing taxes, hiring contractors, and businessing since 2011.
  • Fiction. Malazan Book of the Fallen. Jemisin’s Inheritance and Broken Earth trilogies. Up to book 26 of The Cat Who…. Remembrance of Earth’s Past trilogy. The Dark Tower series. And the entirety of Discworld.
  • Many deaths. Dottie Stephens. Many folks from Church: Dale, Clifton, Sam, Kirsten.
  • Affluence and finance. The move to software engineering has had a four-fold increase on my income. As well as the matters of founders stock, options, shares, RSUs, etc. We bought a new car.

Family

  • Marriage. Angelina and I got married in 2014 in San Francisco. We’ve also been together for the entirety of the decade.
  • Membership organizations (continuation). I became a member of St. Francis Lutheran Church, the South End Rowing Club, Golden Gate Angling and Casting Club, and numerous museums.
  • Cat changes. We lost Jose Pierpont, but gained Sally Ride and Billie Jean King.
  • Extended family. Living near a lot of extended family has been a new experience and we’ve gained many new nephews and nieces around the country.
  • Spending time together. This decade has been marked by a ramping up of weekend trips and travel, from Calistoga to Australia.

Career

  • San Francisco. 8 years of this decade have been spent in San Francisco, longer even than my time in Boston where I spent the majority of the naughties.
  • Transition from community-based work to software/tech. Shutting down the Transmission Project and Digital Arts Service Corps was hard. Software/Tech is fine.
  • I have great appreciation for friends and colleagues who have introduced me to the body of work on ergonomics in software development. For example, DevOps, Extreme Programming, TDD, and Christopher Alexander.
  • Facilitation, Coaching and Sponsorship (continue). Still doing it.

The water never goes away

From Ronan Farrow’s Catch and Kill:

Perez said that she urged Sciorra to speak by describing her own experience of going public about her assault. “I told her, ‘I used to tread water for years. It’s fucking exhausting, and maybe speaking out, that’s your lifeboat. Grab on and get out,’” Perez recalled. “I said, ‘Honey, the water never goes away. But, after I went public, it became a puddle and I built a bridge over it, and one day you’re gonna get there, too.’”

From Thomas Page McBee’s Man Alive:

“Abandon all hope,” I’d written on a Post-it note, and I watched it move gently beneath the heat duct. I read it in some book. The idea was that hope misses the point: it’s either going to happen or not. You can’t make a new reality, only fashion something real from the one that you’ve got.


To care about AI

From the endnotes of Ted Chiang’s Exhalation on the short story “The Lifecycle of Software Objects”:

I’ve read stories in which people argue that AIs deserve legal rights, but in focusing on the big philosophical question, there’s a mundane reality that these stories gloss over. It’s similar to the way movies always depict love in terms of grand romantic gestures when, over the long term, love also means working through money problems and picking dirty laundry off the floor. So while achieving legal rights for AIs would be a major step, another milestone that would be just as important is people putting real effort into their individual relationships with AIs.

And even if we don’t care about them having legal rights, there’s still good reason to treat conscious machines with respect. You don’t have to believe that bomb-sniffing dogs deserve the right to vote to recognize that abusing them is a bad idea. Even if all you care about is how well they can detect bombs, it’s in your best interest that they be treated well. No matter whether we want AIs to fill the role of employees, lovers, or pets, I suspect they will do a better job if, during their development, there were people who cared about them.


Deterministic test data with Faker, FactoryBot, and RSpec

I get a lot of joy from using Faker and FactoryBot to efficiently generate real-world test data, but its randomness can be a liability when trying to debug complicated specs or when setting up systems that require repeatable data across RSpec test runs like Percy’s visual diffs.

Without deterministic test data, generating three new users with 3.times { puts Faker::Name.first_name } would result in Danny, Solomon, Fabian when run once, then Jordon, Shawn, Asa when run a second time, then Bruce, Leonor, Paulette when run a third time.

With deterministic test data, I expect to always generate the same set of names no matter how many times the code is run. Faker has documented how to configure and seed the random number generator and this can be achieved with:

3.times do |n| 
  Faker::Config.random = Random.new(n)
  puts Faker::Name.first_name 
end

This script outputs Zachery, Dawna, Desmond every single time it is run, meaning that it’s deterministic.
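
The underlying mechanism is plain Ruby: a Random instance created with the same seed always yields the same sequence. As a sketch that runs without Faker installed, the example below uses a small made-up NAMES list as a stand-in for Faker's name data and passes a seeded generator to Array#sample.

```ruby
# Hypothetical stand-in for Faker's list of first names.
NAMES = %w[Ada Grace Linus Margaret Dennis Barbara].freeze

# Picks one name per index, seeding a fresh generator with that index,
# mirroring Faker::Config.random = Random.new(n) in the loop above.
def pick_names(count)
  (0...count).map { |n| NAMES.sample(random: Random.new(n)) }
end

first_run = pick_names(3)
second_run = pick_names(3)
puts first_run.inspect
puts first_run == second_run  # prints "true"
```

The names chosen depend on the Ruby version's generator and the contents of the list, but two runs with the same seeds always agree with each other.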

Faker’s deterministic configuration can be combined with a FactoryBot sequence to always get the same data every time a new factory instance is created. For example, here’s what a deterministic User factory could look like:

# spec/factories/users.rb

FactoryBot.define do
  factory :user do
    sequence(:first_name) do |n|
      Faker::Config.random = Random.new(n)
      Faker::Name.first_name
    end

    sequence(:last_name) do |n|
      Faker::Config.random = Random.new(n)
      Faker::Name.last_name
    end

    email { "#{first_name.parameterize}.#{last_name.parameterize}@example.com" }
    password { 'password123' }
  end
end

Within every sequence, the Faker random number generator is seeded with Faker::Config.random = Random.new(n), where n is the integer generated by the sequence.

Unfortunately, just using a sequence isn’t completely sufficient when running tests in random order, or inserting new tests or rearranging the tests, as one would expect in an active codebase. FactoryBot sequences are global, meaning that they don’t reset by default between each and every test; a FactoryBot instance during one test run might use a different sequence number than a previous test run.

Therefore, it’s also necessary to rewind FactoryBot sequences after each RSpec example. Place this in your spec/rails_helper.rb or spec/support directory:

# spec/support/factory_bot.rb

RSpec.configure do |config|
  config.after do
    FactoryBot.rewind_sequences
  end
end
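
To see why the rewind matters, here is a plain-Ruby sketch of a global counter shared across examples. `GlobalSequence` and `run_examples` are hypothetical stand-ins, not FactoryBot's classes; each "example" simply draws one number, as if it built a single factory instance.

```ruby
# Hypothetical stand-in for a global factory sequence that starts at 1.
class GlobalSequence
  def initialize
    @value = 0
  end

  def next_value
    @value += 1
  end

  def rewind
    @value = 0
  end
end

# Runs the named examples in the given order and records which sequence
# number (and therefore which random seed) each one received.
def run_examples(order, rewind_after_each:)
  sequence = GlobalSequence.new
  order.map do |example|
    n = sequence.next_value
    sequence.rewind if rewind_after_each
    [example, n]
  end.to_h
end

puts run_examples(%i[a b], rewind_after_each: false).inspect  # a gets 1, b gets 2
puts run_examples(%i[b a], rewind_after_each: false).inspect  # b gets 1, a gets 2
puts run_examples(%i[b a], rewind_after_each: true).inspect   # both get 1
```

Without the rewind, the seed an example receives depends on what ran before it; with the rewind, every example starts from the same sequence number regardless of order.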

That’s all you need to combine Faker and FactoryBot to get deterministic test data in your RSpec tests. Have fun!


Because it helps them to release software

I have been thinking a lot about the framing of a sentence in this piece on Agile. If it’s given that software is the strategy, then it’s legitimate to focus on being better at releasing software.

From Graham Lee’s The value of the things on the left:

That software engineering department now has different management and is Agile. They have releases at least every month (they already released daily, though those releases were of minimal scope). They respond to change rather than follow a plan (they already did this, though through hefty “change control” procedures). They meet daily to discuss progress (they already did this).

But, importantly, they do the things they do because it helps them release software, not because it helps them hit project milestones. The revolution really did land there.