So, it’s healthy every once in a while to have an out-and-out fail.
The code deployment was never an issue; the procedural code was solid, and it always built and deployed successfully to production and staging environments. What burned us was the database migration: the new release entailed a major refactoring of an important subset of the schema.
The root cause of the fail was a database version incompatibility that we should have prepared for, but didn't. Even so, there were steps we could have taken that would have told us the migration was in trouble an hour or two sooner.
The migration was a series of SQL scripts. For most small updates (the app is usually updated on a 2-week scrum cycle), the dev team drops one or two short scripts (to create a new table, add some codes to a dictionary table, that sort of thing) into a canonically-named folder, and migration is a five-minute process that takes no longer than the code deployment itself. But in this case, we had haphazardly put together a series of about a dozen scripts, two of which ran resource-devouring stored procedures.
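Our actual tooling isn't described here, but the folder-of-scripts convention is simple enough to sketch. The following is a minimal, hypothetical illustration in Python (using SQLite only so it runs self-contained; the folder name and function are my invention, not our real runner): apply every script in the canonical folder in lexical order, so numbering the files controls execution order.

```python
import sqlite3
from pathlib import Path

def run_migration(db_path, scripts_dir):
    """Apply every .sql file in scripts_dir in lexical (numbered) order.

    Hypothetical sketch: a real runner would also record which scripts
    have already been applied and skip them on re-run.
    """
    conn = sqlite3.connect(db_path)
    applied = []
    try:
        for script in sorted(Path(scripts_dir).glob("*.sql")):
            conn.executescript(script.read_text())
            applied.append(script.name)
        conn.commit()
    finally:
        conn.close()
    return applied
```

With one or two short scripts per release this is trivially safe; the trouble starts, as we found, when a dozen interdependent scripts land in the folder at once with no explicit ordering discipline.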
What I would have done, in hindsight: organize the scripts so that all of the DDL (the schema changes) is executed first, then the dictionary tables are populated, and then finally the heavy lifting of moving data from one table to another happens. Add monitoring steps so that we can track the progress of the migration (we did do this for the second try). Seriously rethink the choice of stored procedures, and instead use a scripting language like Perl or Python, or a compiled program. To the extent possible, provide a log or audit trail of migrated entities (the team did this very successfully in a smaller-scale migration a couple of years ago).
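Those hindsight steps can be sketched together. The following is a toy illustration in Python against SQLite, with all table and column names invented for the example (our real schema and batch sizes were different): schema changes up front so structural failures surface immediately, dictionary data next, then the bulk data movement in batches, with a progress checkpoint per batch and an audit row per migrated entity.

```python
import sqlite3

# Phase 1 material: all schema changes, collected in one place.
DDL = [
    "CREATE TABLE customer_new (id INTEGER PRIMARY KEY, status_code TEXT)",
    "CREATE TABLE status_dict (code TEXT PRIMARY KEY, label TEXT)",
    "CREATE TABLE migration_audit (entity_id INTEGER, migrated_at TEXT)",
]
# Phase 2 material: dictionary rows.
DICTIONARY = [("A", "active"), ("I", "inactive")]

def migrate(conn, batch_size=1000):
    # Phase 1: run every schema change first, so a structural
    # incompatibility fails the migration in seconds, not hours.
    for stmt in DDL:
        conn.execute(stmt)
    # Phase 2: populate the dictionary tables.
    conn.executemany("INSERT INTO status_dict VALUES (?, ?)", DICTIONARY)
    # Phase 3: the heavy lifting, batched so progress is observable.
    total = conn.execute("SELECT COUNT(*) FROM customer_old").fetchone()[0]
    moved = 0
    while moved < total:
        rows = conn.execute(
            "SELECT id, status_code FROM customer_old LIMIT ? OFFSET ?",
            (batch_size, moved)).fetchall()
        conn.executemany("INSERT INTO customer_new VALUES (?, ?)", rows)
        # Audit trail: one row per migrated entity.
        conn.executemany(
            "INSERT INTO migration_audit VALUES (?, datetime('now'))",
            [(r[0],) for r in rows])
        moved += len(rows)
        print(f"migrated {moved}/{total}")  # progress checkpoint
    conn.commit()
    return moved
```

The point of the per-batch checkpoint is exactly the early warning we lacked: if the counter stalls or the batch rate collapses, you know within minutes, not hours, that the migration is in trouble.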