Pinboard (jm)
https://pinboard.in/u:jm/public/
recent bookmarks from jm

Automating safe, hands-off deployments
2020-06-23T22:53:17+00:00
https://aws.amazon.com/builders-library/automating-safe-hands-off-deployments/
jm
Each team needs to balance the safety of small-scoped deployments with the speed at which we can deliver changes to customers in all Regions. Deploying changes to 24 Regions or 76 Availability Zones through the pipeline one at a time has the lowest risk of causing broad impact, but it could take weeks for the pipeline to deliver a change to customers globally. We have found that grouping deployments into “waves” of increasing size, as seen in the previous sample prod pipeline, helps us achieve a good balance between deployment risk and speed. Each wave’s stage in the pipeline orchestrates deployments to a group of Regions, with changes being promoted from wave to wave. New changes can enter the production phase of the pipeline at any time. After a set of changes is promoted from the first step to the second step in wave 1, the next set of changes from gamma is promoted into the first step of wave 1, so we don’t end up with large bundles of changes waiting to be deployed to production.
The first two waves in the pipeline build the most confidence in the change: The first wave deploys to a Region with a low number of requests to limit the possible impact of the first production deployment of the new change. The wave deploys to only one Availability Zone (or cell) at a time within that Region to cautiously deploy the change across the Region. The second wave then deploys to one Availability Zone (or cell) at a time in a Region with a high number of requests where it is highly likely that customers will exercise all the new code paths and where we get good validation of the changes.
After we have higher confidence in the safety of the change from the initial pipeline waves’ deployments, we can deploy to more and more Regions in parallel in the same wave. For example, the previous sample prod pipeline deploys to three Regions in wave 3, then to up to 12 Regions in wave 4, then to the remaining Regions in wave 5. The exact number and choice of Regions in each of these waves and the number of waves in a service team’s pipeline depend on the individual service’s usage patterns and scale. The later waves in the pipeline still help us achieve our objective to prevent negative impact to multiple Availability Zones in the same Region. When a wave deploys to multiple Regions in parallel, it follows the same cautious rollout behavior for each Region that was used in the initial waves. Each step in the wave only deploys to a single Availability Zone or cell from each Region in the wave.
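The wave behavior quoted above can be sketched in a few lines. This is a minimal illustration, not AWS's pipeline: the `Wave` class, the `wave_steps` and `run_pipeline` functions, the Region and Availability Zone names, and the `deploy`/`healthy` callbacks are all hypothetical. It shows only the one rule the text states, that each step in a wave deploys to a single Availability Zone from each Region in that wave.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Wave:
    """A group of Regions deployed in parallel (hypothetical structure)."""
    name: str
    regions: Dict[str, List[str]]  # Region name -> its Availability Zones


def wave_steps(wave: Wave) -> List[List[Tuple[str, str]]]:
    """Split a wave into steps; each step touches at most one AZ per Region."""
    max_azs = max(len(azs) for azs in wave.regions.values())
    steps = []
    for i in range(max_azs):
        # Take the i-th AZ of every Region that still has one left.
        step = [(region, azs[i])
                for region, azs in wave.regions.items() if i < len(azs)]
        steps.append(step)
    return steps


def run_pipeline(waves: List[Wave],
                 deploy: Callable[[str, str], None],
                 healthy: Callable[[str, str], bool]):
    """Promote a change wave by wave, stopping at the first unhealthy step."""
    done: List[Tuple[str, str]] = []
    for wave in waves:
        for step in wave_steps(wave):
            for region, az in step:
                deploy(region, az)
            if not all(healthy(region, az) for region, az in step):
                return done, False  # halt promotion; later waves are untouched
            done.extend(step)
    return done, True


# Hypothetical layout: one low-traffic Region, one high-traffic Region,
# then two Regions in parallel.
waves = [
    Wave("wave-1", {"low-traffic-region": ["az1", "az2", "az3"]}),
    Wave("wave-2", {"high-traffic-region": ["az1", "az2", "az3"]}),
    Wave("wave-3", {"region-c": ["az1", "az2"],
                    "region-d": ["az1", "az2", "az3"]}),
]
```

Calling `run_pipeline(waves, deploy, healthy)` with real deployment and health-check functions would stop at the first step that fails its check, so a bad change in wave 2 never reaches the parallel Regions in wave 3.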
automation ops devops amazon aws deployment waves az multi-region ci cd
https://pinboard.in/
https://pinboard.in/u:jm/b:7ecaa0ea15ea/

Global Continuous Delivery with Spinnaker
2015-11-17T09:37:26+00:00
http://techblog.netflix.com/2015/11/global-continuous-delivery-with.html
jm
continuous-delivery aws netflix cd devops ops atlas spinnaker
https://pinboard.in/u:jm/b:5706bd6896f8/

Taming Complexity with Reversibility
2015-07-28T19:28:48+00:00
https://www.facebook.com/notes/kent-beck/taming-complexity-with-reversibility/1000330413333156
jm
Development servers. Each engineer has their own copy of the entire site. Engineers can make a change, see the consequences, and reverse the change in seconds without affecting anyone else.
Code review. Engineers can propose a change, get feedback, and improve or abandon it in minutes or hours, all before affecting any people using Facebook.
Internal usage. Engineers can make a change, get feedback from thousands of employees using the change, and roll it back in an hour.
Staged rollout. We can begin deploying a change to a billion people and, if the metrics tank, take it back before problems affect most people using Facebook.
Dynamic configuration. If an engineer has planned for it in the code, we can turn off an offending feature in production in seconds. Alternatively, we can dial features up and down in tiny increments (e.g. only 0.1% of people see the feature) to discover and avoid non-linear effects.
Correlation. Our correlation tools let us easily see the unexpected consequences of features so we know to turn them off even when those consequences aren't obvious.
IRC. We can roll out features potentially affecting our ability to communicate internally via Facebook because we have uncorrelated communication channels like IRC and phones.
Right hand side units. We can add a little bit of functionality to the website and turn it on and off in seconds, all without interfering with people's primary interaction with NewsFeed.
Shadow production. We can experiment with new services under real load, from a tiny trickle to the whole flood, without affecting production.
Frequent pushes. Reversing some changes requires a code change. On the website we are never more than eight hours from the next scheduled code push (minutes if a fix is urgent and you are willing to compensate Release Engineering). The time frame for code reversibility on the mobile applications is longer, but the downward trend is clear from six weeks to four to (currently) two.
Data-informed decisions. (Thanks to Dave Cleal) Data-informed decisions are inherently reversible (with the exceptions noted below). "We expect this feature to affect this metric. If it doesn't, it's gone."
Advance countries. We can roll a feature out to a whole country, generate accurate feedback, and roll it back without affecting most of the people using Facebook.
Soft launches. When we roll out a feature or application with a minimum of fanfare it can be pulled back with a minimum of public attention.
Double write/bulk migrate/double read. Even as fundamental a decision as storage format is reversible if we follow this format: start writing all new data to the new data store, migrate all the old data, then start reading from the new data store in parallel with the old.
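The last item in the list, double write/bulk migrate/double read, can be sketched as follows. This is a minimal illustration under stated assumptions, not Facebook's implementation: the `MigratingStore` class and its method names are hypothetical, plain dicts stand in for the two data stores, and writing to both stores, comparing reads, and falling back to the old value on a mismatch are this sketch's reading of "double write" and "in parallel with the old".

```python
class MigratingStore:
    """Wraps an old and a new key-value store during a storage migration."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.read_from_new = False  # flip on only after the bulk migration
        self.mismatches = []        # keys where the two stores disagreed

    def write(self, key, value):
        # Double write: new data goes to the new store, and (an assumption
        # of this sketch) also to the old one so the move stays reversible.
        self.old[key] = value
        self.new[key] = value

    def bulk_migrate(self):
        # Copy records that predate the double-write phase. setdefault
        # avoids overwriting anything already written to the new store.
        for key, value in list(self.old.items()):
            self.new.setdefault(key, value)

    def read(self, key):
        old_value = self.old.get(key)
        if not self.read_from_new:
            return old_value
        # Double read: consult both stores and record any disagreement.
        new_value = self.new.get(key)
        if new_value != old_value:
            self.mismatches.append(key)
            return old_value  # the old store remains the source of truth
        return new_value
```

A typical sequence would be to construct `MigratingStore(old, new)`, let writes flow for a while, call `bulk_migrate()`, then set `read_from_new = True` and watch `mismatches`. Abandoning the new format at any point only requires dropping the new store, because the old one never stopped receiving writes.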
We do a bunch of these in work, and the rest are on the to-do list. +1 to these!
software deployment complexity systems facebook reversibility dark-releases releases ops cd migration
https://pinboard.in/u:jm/b:6a3089426e1e/

'Continuous Deployment: The Dirty Details'
2015-04-22T10:08:00+00:00
http://www.slideshare.net/mikebrittain/mbrittain-continuous-deploymentalm3public
jm
cd deploy etsy slides migrations database schema ops ci version-control feature-flags
https://pinboard.in/u:jm/b:7b09e64e7d8f/