Thursday, March 11, 2010

Disabling internal services gracefully with resque, resque-scheduler, and redis_feature_control


We've got a data warehouse that is separate from our Rails app's database. We aggregate data there and then feed summaries back into our Rails app to power all kinds of statistics for our users (impression data, traffic, etc.). We use Resque heavily for our backend jobs, including pulling data from our warehouse. It works great: a user requests some data, and Resque serves it up. It also sends out periodic update emails that include data sourced from our warehouse.

We started running into problems when we wanted to run migrations on our warehouse that took several hours. Being down during this time is not really acceptable, and we didn't want to lose jobs that happened to run and depended on the warehouse being up and in a consistent state. We needed to be able to tell Resque to stop processing warehouse jobs (but still come back for them later). We also needed to be able to tell the user that a report was temporarily disabled while we upgraded (rather than timing out). The rest of the website should continue to run as usual.

Basically, we needed a central place for processes to check whether a service (in this case our warehouse) was available. This switch needed to be something we could turn off programmatically (e.g., during deployment of a migration) or manually (e.g., via an admin tool). It also needed to be lightweight so even the tiniest script could use it.

We also needed to be able to requeue jobs that had to wait until the warehouse was back up. But we didn't want to requeue them right away, because they would immediately get popped again and could potentially starve lower-priority jobs.

We solved the first problem by coming up with redis_feature_control: basically, a very simple on/off switch backed by Redis. Usage looks like this:
  # Check to see if the warehouse is supposed to be up...
  Redis::FeatureControl.enabled?(:warehouse) # => true

  # Disable the warehouse
  Redis::FeatureControl.disable!(:warehouse)
Pretty simple. We then wrapped the Capistrano task that migrates our warehouse with disable/enable. We updated our Rails app to show our users a nice "Please come back in a few minutes" message instantly, rather than timing out and detecting errors the hard, ugly way. And we updated our Resque jobs like so:

   def self.perform
     if Redis::FeatureControl.enabled?(:warehouse)
       # do stuff
     else
       # try again in a bit...
     end
   end
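
Stepping back to the migration side for a moment: the disable/enable wrapping around the migration can be sketched in plain Ruby. This is only an illustration under assumptions, not our actual Capistrano task and not the gem's implementation. `FeatureSwitch` and `while_disabled` are hypothetical names, and the switch is kept in memory so the example runs without a Redis server.

```ruby
require "set"

# Hypothetical in-memory stand-in for the Redis-backed switch, so the
# example runs without a Redis server. Not the gem's implementation.
module FeatureSwitch
  @disabled = Set.new

  class << self
    def enabled?(feature)
      !@disabled.include?(feature)
    end

    def disable!(feature)
      @disabled.add(feature)
    end

    def enable!(feature)
      @disabled.delete(feature)
    end

    # Turn the feature off for the duration of the block, and turn it back
    # on afterwards even if the block raises.
    def while_disabled(feature)
      disable!(feature)
      yield
    ensure
      enable!(feature)
    end
  end
end

# The block stands in for the real migration step.
FeatureSwitch.while_disabled(:warehouse) do
  puts "enabled during migration? #{FeatureSwitch.enabled?(:warehouse)}"
end
puts "enabled afterwards? #{FeatureSwitch.enabled?(:warehouse)}"
```

Whether to flip the switch back on after a failed migration is a judgment call. This sketch re-enables unconditionally; if a failed migration could leave the warehouse inconsistent, leaving it disabled until someone checks may be the safer choice.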

Now for the "try again in a bit" part. This was pretty easy with resque-scheduler (I've covered it in a couple of previous posts). Basically, replace the "try again in a bit" comment with:
  # 1.hour comes from ActiveSupport; the delay can also be given in seconds (3600)
  Resque.enqueue_in(1.hour, self)
Done. The job will be pushed back onto the Resque queue in an hour. If the warehouse still isn't available, it will wait another hour, and so on.

End result: when our warehouse is being migrated, it flags itself as being "off" and the dependent processes take the appropriate action, including delaying jobs to be processed in the future. So far, it's worked like a charm.
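
To make the whole cycle concrete, here is a small simulation of a job that keeps rescheduling itself until the switch comes back on. Everything in it is a stand-in so it runs without Redis or Resque: `FakeSwitch` plays the role of Redis::FeatureControl, `FakeScheduler.enqueue_in` mimics the argument order of `Resque.enqueue_in` (delay, job class, job arguments), and `WarehouseReportJob` with its `report_id` argument is a hypothetical job, not one of ours.

```ruby
# Hypothetical stand-in for the Redis-backed on/off switch.
module FakeSwitch
  @enabled = true

  class << self
    attr_writer :enabled

    def enabled?(_feature)
      @enabled
    end
  end
end

# Hypothetical stand-in for resque-scheduler's delayed queue.
module FakeScheduler
  @delayed = []

  class << self
    attr_reader :delayed

    # Same argument order as Resque.enqueue_in: delay, job class, job args.
    def enqueue_in(seconds, klass, *args)
      @delayed << [seconds, klass, args]
    end

    # Pretend the delay has elapsed: run everything that was scheduled.
    def run_due_jobs
      due = @delayed.dup
      @delayed.clear
      due.each { |_seconds, klass, args| klass.perform(*args) }
    end
  end
end

class WarehouseReportJob
  RETRY_DELAY = 60 * 60 # one hour, in seconds

  @completed = []

  class << self
    attr_reader :completed
  end

  def self.perform(report_id)
    if FakeSwitch.enabled?(:warehouse)
      @completed << report_id # stands in for the real warehouse query
    else
      # Pass the original arguments along so the retry is the same job.
      FakeScheduler.enqueue_in(RETRY_DELAY, self, report_id)
    end
  end
end

FakeSwitch.enabled = false
WarehouseReportJob.perform(42) # warehouse down: job is rescheduled
FakeScheduler.run_due_jobs     # still down: rescheduled again
FakeSwitch.enabled = true
FakeScheduler.run_due_jobs     # warehouse back: job finally runs
p WarehouseReportJob.completed # => [42]
```

One detail the simulation makes visible: if a job takes arguments, they need to be passed to `enqueue_in` again, otherwise the retry would run without them.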
