Sunday, June 19, 2022

On-demand, developer-oriented continuous integration

Nowadays, expectations from software engineering (or development, depending on how you see it) are extremely high:

  1. MVP (Minimum Viable Product) delivered ASAP.
  2. Agility throughout the whole development process, in order to adapt to ever-changing priorities or client desires
  3. Optimized/reduced costs
  4. Early new problem/bug/error detection
  5. Team collaboration
  6. Security
  7. Early regression discovery

    ...and many others.

One of the practices that helps mitigate several problems arising during the software engineering process is continuous integration (CI), a concept credited to Grady Booch.

In small, short-term projects, CI is rarely set up because of the associated hardware or subscription plan costs (for the cloud-hosted offerings) and the need for dedicated people to set it up and maintain it. In medium to large projects, however, it is always a good idea and it pays off over time.

The CI I have used over the years was almost exclusively Jenkins. The one exception came last year, when I had to configure an Azure Pipeline for testing purposes. The process was pretty straightforward, but it exposed some core conceptual differences between the two tools, important enough for the clients (including myself) to complain: "But this works like that in Jenkins, why can't I do this here?"

Funnily enough, as with anything you develop on top of proprietary tools or frameworks, the answer you give to clients is: "This is not currently supported. It is a closed-source software product owned by company X, so we can't do much about this." The only thing you can do is wait until enough other clients need the same functionality and company X starts considering implementing it.

There are different ways to configure and use Jenkins, but over the last 5 years I was introduced to the following two big categories:

  • Main CI
  • Custom CI

Main CI is usually automatic. It gathers several commits together and builds them continuously throughout the day. Custom CI, however, is on-demand and developer-oriented. Imagine that you are working on some important task and, before you commit to master (or the now trending main), you want to make sure that nothing is broken by what you've done. Custom CI helps you here. You enter your branch into the Custom CI UI, and it builds the branch as if your new changes were already integrated into master.
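
To make "as if your changes were already integrated" more concrete, here is a minimal sketch of one common way to get that effect with plain git: check out the tip of master, then merge the custom branch on top of it without committing. This is an assumption about how such a job could work, not a description of any particular Jenkins setup. The function name `merge_preview_commands` and the remote name `origin` are hypothetical, and the function only assembles the commands so they can be inspected before anything is executed.

```python
from typing import List


def merge_preview_commands(
    branch: str, base: str = "master", remote: str = "origin"
) -> List[List[str]]:
    """Return git invocations that merge `branch` on top of `base` without committing.

    Hypothetical helper: the resulting working tree looks the way `base`
    would look after integrating `branch`, which is what gets built.
    """
    return [
        # Refresh both branches from the remote.
        ["git", "fetch", remote, base, branch],
        # Start from the current tip of the base branch, detached so that
        # no local branch is moved by the preview.
        ["git", "checkout", "--detach", f"{remote}/{base}"],
        # Merge the custom branch but stop before creating a commit.
        ["git", "merge", "--no-ff", "--no-commit", f"{remote}/{branch}"],
    ]


if __name__ == "__main__":
    for command in merge_preview_commands("feature/my-task"):
        print(" ".join(command))
```

Each inner list can be passed to `subprocess.run(command, check=True)` inside a clean clone. If the merge step fails with conflicts, the branch cannot be integrated as it is, and that is useful to know before any test runs.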

Now, while this sounds pretty cool and helpful, it is important to understand that Custom CI must be treated as a First Class Citizen (FCC). Custom CI must be maintained, evolved and worked upon, sometimes, in my opinion, even more than Main CI.

Below I list several problems that I personally encountered, or functionality that I needed but was missing, while using Custom CI over the years, precisely because it wasn't treated as an FCC.

  1. Bugs in the software itself. For example, this one: Jenkins ZWSP char (ZWSP being the zero-width space). It drove me nuts: after building my branch for the first time, I copied the branch name for subsequent builds from the CI UI itself while monitoring the build's progress, and the copied name included a ZWSP. The defect was fixed in a more recent version of Jenkins, but no one upgraded it, and it cost me a lot of days and nerves until I finally discovered the problem myself (after an architect from our team looked into the failure, somehow noticed the ZWSP in the branch name and suggested that I investigate how it got there).
  2. Docker not starting, due to bad physical machines, corrupted images etc. You just get errors like: Unable to build image, failed to export image, failed to set parent: sha256: ... unknown parent image ID. If Docker does not start, you obviously do not get any integration tests, REST tests or any other test types that require a live machine executed. Another cause for Docker not starting might be that your fellow developers created a git branch with special characters in its name and your DevOps colleagues used the branch name to create the Docker container. In that case, it will fail with errors like: "Invalid container name..., only [a-zA-Z0-9][a-zA-Z0-9_.-] are allowed".
  3. Limited build history. When you build your code, the Custom CI build is kept only for a limited number of builds due to resource limitations. If lots of people are using Custom CI, your build might already be deleted within several hours, and you would probably not remember anything from it. You need this history saved somewhere, since what you care about here is the history rather than the artifacts, which matter in Main CI.
  4. Random, flaky tests. Zero predictability. You need to figure out if those tests that failed are actually related to what you did or not. And this is hard. There are lots of reasons why tests might be flaky: parallel execution issues, network dependency, temporal coupling, bad implementation etc.
  5. Missing functionality to compare custom builds with each other. Because Custom CI builds are executed by different people for different branches, the Jenkins functionality that shows you the difference from the last build becomes useless, to say the least. As such, you need a way to compare the results of your build to some reference build.
  6. Guarantee of two consecutive builds for one user, in case the functionality from point 5 is missing. To compare anything without it, you actually need two consecutive builds: one before your changes and one after. You might need to run them back to back and very quickly, in order to make sure no one else builds any other branch in between.
  7. Custom branch builds must be seeded from the custom branch's code, not from master. If your CI configuration source code is located in the same repository as your project and you seed it from your master branch, building a custom branch might fail. The reason is simple: the configuration in master might have changed and no longer be compatible with what you have in the custom branch. This makes it completely impossible to build any old branch.
  8. Aborted builds, either because of test failure rules (abort after 1% of tests failing) or anything else.
  9. Slow builds due to an ever-growing test count or simply badly implemented or non-optimized tests.
  10. Because of most of the points described above, engineers can spam Custom CI with several build requests, hoping that at least one of them succeeds. This increases the build queue size and wastes resources needlessly. On top of that, depending on the available resources, your build might not get executed and finished for hours.
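
Points 1 and 2 share a root cause: a branch name typed or pasted by a human is used verbatim further down the pipeline. Below is a minimal sketch of how such input could be cleaned before use. It is an assumption about a possible fix, not what Jenkins or our DevOps colleagues actually did. The function names and the `ci` prefix are hypothetical; the allowed character set is the one quoted in the Docker error message above.

```python
import re
import unicodedata

# Characters Docker rejects in a container name, per the error message
# "only [a-zA-Z0-9][a-zA-Z0-9_.-] are allowed".
_DISALLOWED = re.compile(r"[^a-zA-Z0-9_.-]")


def clean_branch_name(raw: str) -> str:
    """Drop invisible format characters and surrounding whitespace.

    Unicode category "Cf" covers the zero-width space (U+200B) as well as
    similar characters that sneak in when copying text from a web UI.
    """
    visible = "".join(ch for ch in raw if unicodedata.category(ch) != "Cf")
    return visible.strip()


def container_name_for(branch: str, prefix: str = "ci") -> str:
    """Map a branch name to a string Docker accepts as a container name.

    The prefix (hypothetical, must start with a letter or digit) guarantees
    a valid first character; every disallowed character becomes "-".
    """
    safe = _DISALLOWED.sub("-", clean_branch_name(branch))
    return f"{prefix}-{safe}"


if __name__ == "__main__":
    pasted = "feature/JIRA-123\u200b"  # trailing ZWSP copied from the UI
    print(repr(clean_branch_name(pasted)))
    print(container_name_for(pasted))
```

Note that the mapping is lossy: `feature/x` and `feature-x` end up with the same container name, so a real setup would probably append something unique, such as the build number.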

Continuous integration, be it Main or Custom, sounds cool when you talk about it at a high level, and it might be perceived as a one-time task that can easily be ignored once it is made to work, but actually it is not.

CI maintenance and configuration is a day-to-day job. You need full-time people for that, and more than one, at the very least to avoid the bus factor.

CI must be:

  1. Timely upgraded
  2. Properly maintained
  3. Carefully evolved

 ... and obviously tailored to your organizational needs, and especially to your engineers' needs.

I honestly believe that the most important task of CI is to protect your engineers' creativity, productivity and proactivity, while giving them fearless freedom so they can use their time as well as they can to bring success to your organization. Indirectly, this safeguards everything else.

If you do not find these reasons among the ones that made you use CI, then, in my opinion, you are probably doing it wrong.

As for the above problems, the general recommendations are:

  • Monitor your CI tool's change log and keep the tool upgraded.
  • Invest time in understanding your virtualization software specifics and known problems. Understand and improve your network.
  • Extract build history and maintain it for a longer period of time, like 30 days or so.
  • Invest in fixing your tests. There is no worse scenario than trying to figure out what is going on, except for the case when you have wrong tests testing wrong stuff in a wrong way.
  • Provide functionality for build comparison, either through a third-party tool or by implementing it yourself, if possible.
  • Provide an interface to build 2 branches at once, or think of solutions to avoid too many builds. A potential solution might be creating checkpoints for master, each with its build provided, that serve as the starting point for custom branches. I've never done this, but I believe it might actually work. The toughest part would be to educate the engineers.
  • Seed your branch build logic from the custom branch's source code. Breaking this rule results in failed builds due to crazy reasons that will be hard for you to understand and debug. After countless hours you might find out that some DevOps colleague pushed breaking changes and you need to rebase your branch onto master. Again.
  • Investigate aborted builds and choose your abort threshold carefully.
  • Split tests into priority and criticality categories. Those that have been passing for a long time can be delegated to nightly builds and skipped in custom builds, while those that fail often or are very important for the health of the product should always be executed.
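
The build comparison recommended above does not have to be sophisticated to be useful. Here is a minimal sketch, assuming the test results of each build have already been exported as a mapping from test name to pass/fail (how you export them depends on your CI tool and is not shown). The names `BuildDiff` and `compare_builds` are hypothetical.

```python
from typing import Dict, List, NamedTuple


class BuildDiff(NamedTuple):
    new_failures: List[str]   # failing now, passing or absent in the reference
    fixed: List[str]          # passing now, failing in the reference
    still_failing: List[str]  # failing in both builds


def compare_builds(reference: Dict[str, bool], candidate: Dict[str, bool]) -> BuildDiff:
    """Classify the tests of `candidate` against a reference build.

    Both arguments map a test name to True (passed) or False (failed).
    """
    new_failures: List[str] = []
    fixed: List[str] = []
    still_failing: List[str] = []
    for test, passed in sorted(candidate.items()):
        ref_passed = reference.get(test)  # None when the test is new
        if passed:
            if ref_passed is False:
                fixed.append(test)
        elif ref_passed is False:
            still_failing.append(test)
        else:
            new_failures.append(test)
    return BuildDiff(new_failures, fixed, still_failing)


if __name__ == "__main__":
    master_build = {"test_login": True, "test_export": False, "test_search": False}
    my_build = {"test_login": False, "test_export": False, "test_search": True}
    print(compare_builds(master_build, my_build))
```

Only `new_failures` needs your immediate attention. Tests in `still_failing` were already broken in the reference build, which makes them candidates for the flaky or neglected tests mentioned earlier rather than something your branch caused.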

I am an avid advocate for using Custom CI as much as possible. 

Many of these recommendations came from my own experience and conclusions, while others I learned from my wonderful colleagues who gladly shared their own experience with me, for which I am extremely grateful.