Monday, November 7, 2011

Build & Deploy Decoupling (and get some tests running in between!)


Pre-preface: We have about 50 heavily linked builds that make up our “core” environment. We’re trying to cut "major" release times from ~3 months to ~3 weeks. Check out the prior blog post for more background information.
---

In a continuous integration/continuous deployment shop, decoupling the deployments from the builds will improve tester productivity, decrease test environment outages throughout the day, help you find defects earlier, speed up the builds, encourage (and reward) investment in automated tests, and ultimately position you for continuous production deployments à la Push Engineering. (Scary, but woo!) That’s a lot of claims to make, so let’s get on with it.

Preface - this isn’t a step away from the Continuous Integration or Continuous Deployment methodology; it’s a refinement to keep it from becoming Continuous Interruption or Continuous Regression.

Productive testers

I’m willing to bet that if you build on every commit, 95% of the test environment stakeholders (development, QA, product owners, etc.) using the environment in question aren’t looking for that specific fix/feature/revision at that moment. They want it eventually – usually after they’re done testing the story they picked up after sending the first story back to development with a defect. And while they want that fix/feature/etc., they want a quality change: they want the commit to build successfully so the build isn’t dead on arrival, they want it to pass the unit/system test suite (hopefully you’re at 100% coverage) so there’s high confidence you aren’t regressing existing critical functionality, and they want their test environment to receive that revision at an expected time – either scheduled or on demand. The last thing testers want is to have the valuable time they spent setting up a test wasted when the environment is reset out from under them for a defect fix or change they could not care less about.

Outages

A moderately sized software development unit – say, two teams adding up to 10 developers, plus Quality Assurance, Business Analysts, Project Managers, Database Admins and Developer Operations people – can generate a lot of environmental churn on a daily basis. Your busiest environment will probably be your lowest environment; I frequently see 10 developers commit 20-30 revisions a day to the dev branches, and those revisions (once they pass whatever gateway test/merge qualifications are necessary) get selectively merged to the next more-mature environment – generally the count divides by 2 or 3 at each environment. So your 30 revisions in Dev turn into 10 in the next environment up, then 5, then 2, then maybe you’re up to Prod and it’s hopefully just one build ;) If you clump these deployment cycles into “a” number – without knowing your business or processes there is no magic number, but let’s assume for a moment we go to four deployments a day in non-production (9AM-Noon-3PM-6PM, and only when there are changes to push) – you’re now limited to 4 deployment interruptions in your lowest environment and one or two in every other non-prod environment, all on an expected timetable. From ~50 continuous-interruption deployments a day down to ~8 – that’s a slight improvement.
If you have frequent “hot” builds that need to be slammed in as soon as possible, your people environment is probably lacking in software development process maturity. People need to learn the old-fashioned way – through experience – that there is never going to be a silver-bullet fix that solves everything in one build & deploy. You can alleviate the need for “war-room ASAP” builds by working in as close to 100% test coverage as you can afford, level-setting the environment stakeholders by explaining how many people/groups are impacted by a rebuild, communicating the end-to-end outage length, and putting visibility on each and every rebuild & deployment. When the world is watching, that super-important fix can wait 45 minutes until the scheduled, automated deploy happens. And if you’re in a dynamic place where multiple groups are working on multiple stories simultaneously, it’s extremely rare for a critical group of people to be blocked and waiting on a deployment. They may say they are, but in reality they most likely have several other items ready to test, or items marked resolved that they need to sign off on. Don’t get me wrong – you will need to allow for manual deployment pushes – just make sure the button is pushable by the right stakeholders and there is clear visibility on the deployment occurring, why it’s occurring, and who pushed the button/authorized it. Listen to the users too – don’t let a pushy manager terrorize everyone by smashing their builds in. You need to listen, and push back if need be.

Defects & defect cost

If your deployments are coupled to your CI builds, even with stepped-down commit counts as you go up in environments, that’s a lot of interruptions in each environment. The closer an environment is to Production, the more expensive its outages are in the short term, due to the perceived instability of the higher environments by senior leadership and external clients and the interruption of client UAT, sales demos, etc. These outages/interruptions tend to chase the legitimate testers away, and because the lower environments (closest to Development) are rebuilt most often, you’re training your most valuable testers to stay out of the environments you actually need them in the most – and the sooner you find a defect, the cheaper it is to fix. When the people you are making these changes for finally test a few weeks before release, or in an environment right before Production, having the “you didn’t understand our requirements” card played on you is somewhere on the fun scale between playing Sorry! with your significant other (they show no mercy!) and having the Draw Four card played on you in Uno, except in a large release hundreds of thousands of dollars in time & effort are at stake. That is the long-term cost of low-environment outages, and over time it will greatly outweigh the short-term cost of your close-to-production environments recycling during the day. Too often we let the “rough” development environments be the wild, wild west, not realizing they’re the most valuable environments to test in.

Speeding up the build

If you have any tests baked into the build, you know they can take anywhere from a few seconds to several minutes, depending on whether they are pre-canned, self-contained NUnit tests or tests that set up data and commit transactions. When you have high test coverage, your tests can often exceed the compile time of the build itself; there’s nothing wrong with that – it’s where you can successfully argue for more build server horsepower. The time spent on deployments, however, is often out of your hands; you are constrained by network speeds, SAN or local disk speeds, and hard-timed delays waiting for services to stop and restart. The deployment tasks in your build quickly add up; even if they are only 10% of your build time, they’re increasing your environment outage by 10%. If you asked QA, “Hey, would it be worth trimming 10% off a complete environment rebuild?” they would jump on it. Time is money; a 10% decrease in time is literally dollars – often many dollars, added up over dozens of builds a day – being saved. That’s something senior leadership can definitely understand and get behind.
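
To make that concrete, here’s a rough sketch of the shape I’m talking about in MSBuild – the target names, MySolution.sln, MyProduct.Tests.dll and DeployToTest.cmd are made-up placeholders, not our actual scripts. The CI server’s default target stops at compile-and-test; the deployment steps live in their own target that only the scheduled deployment job invokes.
<!-- Sketch only: the CI build runs the default Build target on every commit;
     the scheduled deployment job runs "msbuild /t:Deploy" on its own timetable. -->
<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003">
  <PropertyGroup>
    <Configuration Condition="'$(Configuration)' == ''">Release</Configuration>
  </PropertyGroup>
  <!-- Compile + tests, no deployment tasks: shorter builds and no environment outage. -->
  <Target Name="Build" DependsOnTargets="Compile;RunTests" />
  <Target Name="Compile">
    <MSBuild Projects="MySolution.sln" Properties="Configuration=$(Configuration)" />
  </Target>
  <Target Name="RunTests" DependsOnTargets="Compile">
    <Exec Command="nunit-console.exe MyProduct.Tests.dll" />
  </Target>
  <!-- Deliberately does NOT depend on Compile - it pushes the output of the last
       successful build, so asking for a deployment never forces a rebuild. -->
  <Target Name="Deploy">
    <Exec Command="DeployToTest.cmd" />
  </Target>
</Project>
The specifics don’t matter; the point is that nothing in the default build path touches a test environment anymore.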

Test coverage

Not slamming the build hot into the environment provides an incentive to build up your automated test coverage. Nobody wants a DOA build, and if testers are now getting deployments four times a day at most (let’s say), they want them to count. Assuming a 9AM/Noon/3PM/6PM deployment cycle, a developer at 10AM has an easier time making sure their change is covered by unit tests when they know the build won’t be deployed until noon. I have seen project managers and directors literally standing behind a developer waiting for them to commit because “everyone is blocked!” or “the client is waiting to test!” That’s not a conducive work environment, nor one anyone would want to work in. Taking the deployments out of the build also gives you room to put longer, slower, deeper automated tests and post-build reports – like test coverage reports – into each and every build. It sounds counterintuitive to speed up the build by removing the deploy stage and then slow it down with more automated tests and post-compile reports, but those tests and reports will increase quality, which will decrease the time spent in the whole “code it, throw it over the wall, have it thrown back, fix it and throw it over the wall again” cycle.
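
As a sketch of what that deeper post-compile stage might look like – again, the assembly name and RunCoverageReport.cmd are hypothetical stand-ins, and the /xml switch shown is NUnit 2.x console syntax; substitute whatever runner and coverage tool your shop actually uses:
<!-- Sketch: a post-compile stage the build can now afford, since no deployment is
     waiting on it. Runs the full test suite and publishes reports with every build. -->
<Target Name="PostCompileChecks" DependsOnTargets="Compile">
  <!-- Full unit/system test pass, with results written out for the build report. -->
  <Exec Command="nunit-console.exe MyProduct.Tests.dll /xml:TestResults.xml" />
  <!-- Hypothetical wrapper that re-runs the tests under your coverage tool of
       choice and drops an HTML coverage report next to the build output. -->
  <Exec Command="RunCoverageReport.cmd MyProduct.Tests.dll" />
</Target>
A slower build is fine here; the deployments are already on their own clock.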

Continuous Production Deployments

That’s where I am trying to get to, and that’s why I’m removing deployments from our build cycle. I need two major results out of this: our automated tests need to increase in quality & importance, and the deployment stage – broken out on its own – can be perfected in the lower environments and then used in Production. The only reason I haven’t strung together our piecemeal deployment scripts from each build and run them against Production already is that we don’t have good automated tests running (so we can’t automatically trust the build), and we don’t always have good visibility on what needs to be released because of how much shared code we use. Yanking the deployments out, relying on and improving our test coverage, and perfecting our deployment stage gives me half of this; my next post will be on how I’m going to solve the release bill-of-materials question.

-Kelly Schoenhofen

Sunday, October 30, 2011

3AM Deployment? Why not?

Tip o' the hat to Bryan Gertonson for sending me this from Brian Crescimanno's blog - Why are you still deploying overnight?

It's a good article; I can see a lot of overlap between Brian Crescimanno's article and what Facebook is actually doing on a daily basis (see the Push Engineering video). Unfortunately, I'm going to bet that for most of us the article comes across as an ivory-tower, pie-in-the-sky, wishes-were-fishes manifesto. That doesn't mean Brian doesn't hit a lot of true notes, but pragmatically most of us are stuck doing midnight deployments to avoid customer impact. For me, midnight deployments have gotten really old, and as we expand outside North America, our global market doesn't appreciate downtime in the middle of their day. I want to change this.

A project manager where I work passed the Push Engineering video around to management, and the positive end result Facebook is reaping really resonated with them. But the problem with silver bullets is that what makes one a werewolf killer at one company doesn't slow down the vampires at another. One part in particular was FB's rich gatekeeper system for enabling features gradually across the customer base; that would be useful to us, but it would be a huge effort to implement for a limited payback in our specific e-commerce niche. It may have resonated with our management, but it didn't really resonate with the engineers.

In my opinion, the hardest challenge Facebook faced (no pun intended) and overcame is the divorce of database changes from code, via backwards compatibility. They have achieved one of the holy grails of release engineering by successfully decoupling the two: database changes and code changes are entirely independent of each other. It gives them a tremendous amount of flexibility and makes their top-to-bottom Push Engineering concept a reality.

So after watching the Push Engineering video, management has challenged us (with some funding) to do some of the same things. Our stated goal is to get our software release cycle from 3-4 months down to 3-4 weeks. Our current process & technology has been an evolution over the last few years; we last made major changes to our develop-build-deploy process/platform/methodology about two years ago and we've been coasting on it since then, reaping the rewards and writing, testing and releasing software as fast as we can.

So while I don't think we're going to achieve database & code separation any time soon, I think we can pull from the Push Engineering concept the achievable bits that best apply to how we work and accomplish a few notable things. A handful of us leads sat down and sketched out, at a high level, what it would take to accomplish this.

Here are a few of "my" issues that I intend to address.
  1. Our "core" book of sites and services comprises 50 linked builds in a very complex parent-child relationship, with a handful of pseudo-circular references thrown in for fun. A full core rebuild on a build server cluster made of virtual machines takes at least 60 long minutes. Big changes or small, it takes 60 minutes to see them.
  2. For those 60 minutes, the environment is intermittently unavailable, which testers just love. There is always something unavailable during the rebuild process; the environment may look like it's up, but often the core build cycle is single-threaded on building a back-end process or service that is largely transparent to end users, so just when testers think the environment is stable - whammo, they're kicked out. While testers have the ability to check build status, it can be difficult to spot pending builds on a build controller hosting 250+ automated builds. I like to tell them we're a CI shop - it just stands for Continuous Interruptions ;)  They don't laugh :(
  3. Sorta-falsely broken builds - builds are often "broken" when, after an atomic commit, a child of a child needs a change from a parent of a parent build; because each build is on a separate timer to check for source control changes, and I can only define a single layer of parent-child relationships, a grandchild build will often try to build before a needed grandparent change and the build goes red.
  4. We use a tremendous number of post-build events at the project level to keep developers working locally; these events can be fragile and they copy unnecessary binaries around constantly. We need them because developers build out of Visual Studio, not an msbuild script, so none of the msbuild-scripted events that push shared dll's into a common library folder occur. Why do we use library folders for these common build artifacts? Because the build server software isolates each build away from all other source to generate a clean build that will stand up to DoD-level regulations, you can't reference projects outside your top-level solution. You have to use static references and maintain a huge library of binary artifacts from your own codebase.
  5. Because we go 3-4 months between major releases, when we do a major code & db drop into Production we have days of intense firefighting, then a couple weeks of medium firefighting, then a few more weeks of putting out lingering brushfires and smoldering stumps. Then we gear up for our next major release ;)
  6. Our releases also end up being so large that they take 4 hours to deploy and 8 more hours to validate.
  7. Because we use all static references and no project references (outside the solution), I've seen developers keep eight instances of Visual Studio open simultaneously to make a change in one webservice. We don't have the funding for hot-rod workstations, so we're burning expensive developer time having them wait on their machines; there's only so much performance you can get out of a $600 PC from Dell or HP, even after you add another $400 of RAM and a $50 video card.
  8. We don't have unit tests running consistently due to the evolution of our build system. The developers have written more than 1000 unit tests, but the automated build server only runs about a quarter of them reliably.
  9. Our test coverage tools are manual-run only; they don't really fit in our current build structure.
So to sum it up: we have a long rebuild cycle and the environment is down virtually the entire time; Continuous Integration really means Continuous Interruptions; the build frequently breaks because of the build order itself; we spend a lot of time copying around large binaries and storing them in source control; we release too infrequently and our major releases take 12 hours to deploy; developers have to run several copies of Visual Studio and constantly trigger complex build orders to get changes to show up locally, on PCs that aren't equipped for any of this; only a quarter of our unit tests get run outside the developers' workstations; and test coverage reports aren't being generated.

There's more, but let's stop with these major ones. We have a lot of challenges! My next post after this will be my plans for the next 3 months to resolve them.

Think 1 stone, 20 birds. Unfortunately, I have about 100 birds to take out. But that's only 5 stones if I throw them just right!

-Kelly

Wednesday, October 12, 2011

Facebook "Push" system

Chuck Rossi, a head engineer at Facebook, gives a rough description of FB's build and release process as part of the developer onboarding session. It's about 52 minutes long, and well worth it.

Saturday, January 22, 2011

Anatomy Of A Continuous Integration System Part Two

No matter what generation you grew up in - Mary Poppins tried to entice you with a spoonful of sugar, Barney used out-of-copyright melodies and hypnotizing lyrics, and Dora just remixed what Barney came up with - you have to clean up, and it doesn't have to be painful.

Most people don't get around to writing cleanup routines until long after everything is set up and running (if at all!), but you, the Wise Reader, are going to start with a cleanup routine. Waiting until you have even a basic build compiling before you start cleaning up makes little sense: if you're going to fail, fail early and fail fast. You don't want an accidentally committed binary pre-loaded into a project's object folder giving you a false sense of build success, only to have QA report very odd and hard-to-reproduce errors. Hours or days later you finally track it down to an artifact that was inadvertently added to your source control system from a developer's PC.

While there are lots of ways to limit your exposure to a developer (or whoever) accidentally adding and committing artifacts you don't want or need into source control (and it's fine to spend a little time implementing walls and filters), they are pretty easily bypassed. A lot of times they have to be set at the project-folder level for each and every project, so the setting is ripe for being forgotten when the project is first set up (because the developer isn't going to remember to do it), or forgotten in the migration of code from one branch to another, or simply lost when someone commits a change that clears the properties of the parent folder. It also doesn't address rebuild scenarios in your Continuous Integration (CI) implementation. The word "Continuous" in CI literally says each build isn't a one-time event, so every ounce of robustness you put into your build strategy & implementation is going to pay off quickly. While most CI systems have the option of a completely clean build each time, I've never seen anyone use it; it adds a lot of time to your builds - especially on your flagship product - when your build servers have to pull everything down fresh each and every time. And if you're a virtualized shop, all that I/O on your build server farm is going to get you the stink-eye from your infrastructure engineers when the SAN starts to bog down.

To avoid this, go for the simplest routines: I remove specific folders and do a cleanup from the top of the project down to its furthest corners.
Here's a sample cleanup target block.
<Target Name="MasterClean">
  <!-- Scrub stray binaries before the compile starts (see BuildCleanup.cmd below). -->
  <Exec Command="BuildCleanup.cmd"/>
  <!-- Batches over the ProjectList item group and removes each project's obj and bin folders. -->
  <RemoveDir Directories="
    %(ProjectList.RelativeDir)obj;
    %(ProjectList.RelativeDir)bin" />
</Target>
BuildCleanup.cmd is a script that does a recursive delete of specific filetypes mere moments before the compile starts. I love simple - here's the entire contents of my basic BuildCleanup.cmd -
@echo off
del *.dll /s
del *.pdb /s
del *.cache /s
del *.exe /s
call CompanyName-LibraryUpdate.cmd
exit 0
It does three things:
  1. Deletes the bits you absolutely don't want (dll/pdb/cache/exe).
  2. Makes sure your local library of precompiled/3rd-party software is up to date.
  3. Exits with a return code of 0 so nothing calling it mistakes a successful run for an error level.
The other piece I try to accomplish in my cleanup blocks is removing specific subdirectories that "need" removing (such as obj and bin folders), if it's a quick hit. Some solutions call 12 project files, and scripting out the removal of each project's obj & bin folders has a low return on your effort; at that point, just rely on your cleanup cmd to remove the bits you absolutely need to get rid of.
Side note - your starting directory in an MSBuild script is, by default, the location of the .build file, so you don't even need to feed BuildCleanup.cmd any arguments or pre-load it with fixed directories. It's going to execute at the level of your .build file, which is where you're building out of. You don't need to worry about parent paths because you're not calling any other projects higher than your current level, so a simple script that cleans downhill from your starting point is just what the doctor ordered.

This is really good stuff - you want each subsequent build to regenerate all local code from scratch, so there's no doubt in QA's mind that a new build of XYZ has exactly what the build notes say it has. Letting a dll from a prior build or a different environment creep unintended into a compile is literally horrifying to me: QA's tests start showing odd results, people collectively spend a tremendous amount of time trying to reproduce and understand the issue, and when they track it down to an old dll that's been carried along build after build, I feel like I have personally wasted everyone's time, energy, thought processes, bits of their lives, etc., because I could have prevented it by being a little more careful or a little more thorough.

You can do this. And if you can do this consistently, eventually development, QA and project management/BAs won't jump to blaming the build when a defect that was reported as fixed isn't showing up as fixed (or with any change in its behavior) in the deployment it was reportedly fixed in. They will start to believe the builds are rock solid, because you've shown them they are (or at least, that they're as good as the input going into them - GIGO).

As a bonus, put all these utility scripts in a folder on your build server - I use C:\BuildTools. Then add C:\BuildTools to the system path and reboot the server. From then on, every build can call any .cmd file in C:\BuildTools without specifying the path, and without you having to make multiple copies of your cmd files and put them in the root of every project. That single folder of simple cleanup & deployment-assisting scripts is part of the short checklist we have for standing up a Production build server.

We'll leave the clean-up block behind us for a bit, but it will come back for a brief howdy-do when we go over automated deployments.

-Kelly Schoenhofen