Sunday, June 19, 2022

On-demand, developer-oriented continuous integration

Nowadays, expectations from software engineering (or development, depending on how you see it) are extremely high:

  1. An MVP (Minimum Viable Product) delivered ASAP
  2. Agility throughout the development process, in order to adapt to ever-changing priorities or client desires
  3. Optimized/reduced costs
  4. Early detection of new problems/bugs/errors
  5. Team collaboration
  6. Security
  7. Early regression discovery

    ...and many others.

One of the practices that helps mitigate several problems that arise during the software engineering process is continuous integration (CI), a concept credited to Grady Booch.

In small, short-term projects CI is rarely set up, because of the associated hardware or plan costs (for cloud-backed offerings) and the need for a dedicated workforce to set it up and maintain it. In medium to large projects, however, it is always a good idea and pays off over time.

The CI tooling I have used over the years was almost exclusively dominated by Jenkins, with one exception last year, when I had to configure an Azure Pipeline for testing purposes. While pretty straightforward, that process demonstrated some core conceptual differences between the two tools, important enough that clients (including myself) complained: "But this works like that in Jenkins, why can't I do this here?"

Funnily enough, just like when you are developing something using proprietary tools/frameworks, the answer you give to clients is: "This is not currently supported; it is a closed-source software product owned by company X, so we can't do much about it." The only thing you can do is wait until enough other clients need the same functionality and company X starts considering implementing it.

There are different ways to configure and use Jenkins, but over the last 5 years I was introduced to the following big categories:

  • Main CI
  • Custom CI

Main CI is usually automatic: it gathers several commits together and builds continuously throughout the day. Custom CI, however, is on-demand and developer-oriented. Imagine that you are working on some important task and, before you commit to master (or the now-trending main), you want to make sure that nothing is broken by what you've done. Custom CI helps you here: you input your branch into the Custom CI UI, and it will build it as if your new changes were already integrated into master.
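To make the "as if already integrated" idea concrete, here is a minimal sketch of what a Custom CI job conceptually does with the branch you submit: start from current master, merge the branch onto it, and build that result. The throwaway repo, file names and branch name below are purely illustrative; nothing here is Jenkins-specific.

```python
# Sketch: simulate the post-merge state that a Custom CI build operates on.
import os
import subprocess
import tempfile
from pathlib import Path

def git(*args, cwd):
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

repo = tempfile.mkdtemp()
git("init", "-b", "master", cwd=repo)
git("config", "user.email", "ci@example.com", cwd=repo)
git("config", "user.name", "ci", cwd=repo)

# master holds the current integrated state
Path(repo, "app.txt").write_text("v1\n")
git("add", ".", cwd=repo)
git("commit", "-m", "master state", cwd=repo)

# a developer branch with new changes
git("checkout", "-b", "feature/widget", cwd=repo)
Path(repo, "widget.txt").write_text("new\n")
git("add", ".", cwd=repo)
git("commit", "-m", "feature work", cwd=repo)

# the Custom CI step: check out master and merge the branch into it,
# producing exactly the state master would have after integration
git("checkout", "master", cwd=repo)
git("merge", "--no-ff", "feature/widget", "-m", "trial merge", cwd=repo)
print(sorted(f for f in os.listdir(repo) if not f.startswith(".")))
# both app.txt and widget.txt are present: the simulated post-merge state
```

The build, unit tests and integration tests would then run against that merged tree, so a green result means your branch is safe to integrate (at least against the master snapshot that was used).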

Now, while this sounds pretty cool and helpful, it is important to understand that Custom CI must be treated as a First Class Citizen (FCC). Custom CI must be maintained, evolved and invested in, sometimes, in my opinion, even more than Main CI.

I will list below several problems that I personally encountered, or functionality that I needed but was missing, while using Custom CI over the years, precisely because it wasn't treated as an FCC.

  1. Bugs in the software itself. For example, the Jenkins ZWSP (zero-width space) bug, which drove me nuts for a simple reason: after building my branch for the first time, in subsequent builds I would reuse the branch name copied from the CI UI itself while monitoring its progress, and it included a ZWSP. The defect was fixed in a more recent version of Jenkins, but no one upgraded ours, and it ate a lot of days and nerves until I finally discovered the problem myself (after an architect from our team looked into the failure, somehow spotted the ZWSP in the branch name and suggested I investigate how it got there).
  2. Docker not starting, due to bad physical machines, corrupted images etc. You just get errors like: "Unable to build image", "failed to export image", "failed to set parent: sha256: ... unknown parent image ID". If Docker does not start, you obviously get no integration tests, REST tests or any other test type that requires a live machine. Another cause for Docker not starting might be that a fellow developer created a git branch with special characters and your DevOps colleagues used the branch name to create the Docker container. In that case, it will fail with errors like: "Invalid container name ..., only [a-zA-Z0-9][a-zA-Z0-9_.-] are allowed".
  3. Limited build history. When you build your code, the Custom CI build is kept for only a limited number of builds due to resource limitations. If lots of people use Custom CI, within several hours your build might already be deleted, and you would probably not remember anything from it. You need this history saved somewhere, as in this case you don't care about artifacts the way you do in Main CI.
  4. Random, flaky tests. Zero predictability. You need to figure out if those tests that failed are actually related to what you did or not. And this is hard. There are lots of reasons why tests might be flaky: parallel execution issues, network dependency, temporal coupling, bad implementation etc.
  5. Missing functionality to compare custom builds with each other. Because Custom CI builds are executed by different people for different branches, the Jenkins functionality of showing the difference from the last build becomes useless, to say the least. You therefore need a way to compare the results of your build against some reference build.
  6. No guarantee of two consecutive builds for one user, in case the functionality from point 5 is missing. To achieve a comparison without it, you actually need two consecutive builds: one before your changes and one after. You might need to execute them sequentially and very fast, to make sure no one else builds any other branch in between.
  7. Custom branch builds must be seeded from custom branch code, not from master. If your CI configuration source code lives in the same repository as your project and you seed it from your master branch, building a custom branch might fail. The reason is simple: the configuration in master may have changed and no longer be compatible with what you have in your custom branch. This makes it completely impossible to build any old branch.
  8. Aborted builds, either because of test failure rules (e.g. abort after 1% of tests failing) or anything else.
  9. Slow builds due to an ever-growing test count, or simply badly implemented or non-optimized tests.
  10. Because of most of the points described above, engineers can spam Custom CI with several build requests, hoping that at least one of them succeeds. This increases the build queue size and wastes resources; besides, depending on the available resources, your build might not be executed and finished for hours.
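The container-name failure in point 2 is easy to prevent at the point where the branch name is turned into a container name. Below is a hypothetical sanitizer (the function name and example branch are invented for illustration); Docker container names must match [a-zA-Z0-9][a-zA-Z0-9_.-], which branch names like "feature/JIRA-123#urgent" violate.

```python
# Hypothetical helper: derive a valid Docker container name from a git branch.
import re

def container_name(branch: str) -> str:
    # Replace every character Docker disallows with '-'
    name = re.sub(r"[^a-zA-Z0-9_.-]", "-", branch)
    # The first character must be alphanumeric
    if not re.match(r"[a-zA-Z0-9]", name):
        name = "b" + name
    return name

print(container_name("feature/JIRA-123#urgent"))  # feature-JIRA-123-urgent
```

A one-line sanitizer like this in the DevOps scripts is much cheaper than debugging a failed Custom CI build hours later.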

Continuous integration, be it Main or Custom, sounds cool when you talk about it at a high level, and it might be perceived as a one-time task that can easily be ignored once it works, but it is not.

CI maintenance and configuration is a day-to-day job. You need full-time people for that, and more than one, in order, at least, to avoid the bus factor.

CI must be:

  1. Timely upgraded
  2. Properly maintained
  3. Carefully evolved

 ... and obviously tailored to your organizational needs, and especially to your engineers' needs.

I honestly believe that the most important task of CI is to protect your engineers' creativity, productivity and proactivity, while giving them fearless freedom, so they can use their time as best they can to bring success to your organization. Indirectly, this safeguards everything else.

If you do not find these reasons among the ones that made you adopt CI, then, in my opinion, you are probably doing it wrong.

As for the above problems, the general recommendations are:

  • Monitor the CI tool's changelog and upgrade it in a timely manner.
  • Invest time in understanding your virtualization software specifics and known problems. Understand and improve your network.
  • Extract build history and keep it for a longer period of time, such as 30 days.
  • Invest in fixing your tests. There is no worse scenario than trying to figure out what is going on, except for the case when you have wrong tests testing the wrong stuff in the wrong way.
  • Provide functionality for build comparison, either with a third-party tool or by implementing it yourself, if possible.
  • Provide an interface to build two branches at once, or think of solutions to avoid too many builds. A potential solution might be creating checkpoints of master to be used as starting points for custom branches, while providing the build results for them. I've never done this, but I believe it might actually work. The toughest part would be educating the engineers.
  • Seed your branch build logic from the custom branch's source code. Breaking this rule results in failed builds for crazy reasons that will be hard to understand and debug. After countless hours you might find out that some DevOps colleague pushed breaking changes and you need to rebase master into your branch. Again.
  • Investigate aborted builds and choose your abort threshold carefully.
  • Split tests into priority and criticality categories. Those that have been passing for a long time can be delegated to nightly builds and skipped in custom builds, while those that fail often or are very important for the health of the product should always be executed.
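The build-comparison recommendation can be sketched in a few lines: given the test results of a reference build (say, the last green master build) and of your custom build, what you actually want to see is which failures are new. Everything below (function and test names, the result format) is invented for illustration.

```python
# Sketch: diff a candidate build's test results against a reference build.
def diff_builds(reference: dict, candidate: dict) -> dict:
    """Each dict maps test name -> 'pass' or 'fail'."""
    new_failures = [t for t, r in candidate.items()
                    if r == "fail" and reference.get(t) == "pass"]
    fixed = [t for t, r in candidate.items()
             if r == "pass" and reference.get(t) == "fail"]
    still_failing = [t for t, r in candidate.items()
                     if r == "fail" and reference.get(t) == "fail"]
    return {"new_failures": new_failures, "fixed": fixed,
            "still_failing": still_failing}

ref = {"t_login": "pass", "t_cart": "fail", "t_pay": "pass"}
mine = {"t_login": "pass", "t_cart": "fail", "t_pay": "fail"}
print(diff_builds(ref, mine))
# {'new_failures': ['t_pay'], 'fixed': [], 'still_failing': ['t_cart']}
```

Only the "new_failures" bucket demands your attention; "still_failing" is a pre-existing (possibly flaky) problem that your branch did not introduce.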

I am an avid advocate for using Custom CI as much as possible. 

Many of these recommendations came from my own experience and conclusions, while others I learned from my wonderful colleagues who gladly shared their own experience with me, for which I am extremely grateful.

 

Saturday, April 9, 2022

Demo Data Nightmare, the beginning

- So what do you think about me taking this task and implementing it before anything else on that topic gets to the team's backlog?

- Sounds cool! 

- This means that I can help everyone, once I'm done, on similar tasks. 

- Yeah, that would be great. I think we should do that. 

This is the quick summary of the discussion that started it all, 7 months ago, with my manager, with nothing, even slightly, pointing to the months of endless work, rework, overwork and ultimately unhealthy stress that were about to hit me.

The task was pretty simple. Based on some existing custom framework, implement a new widget that is going to show some data as a graph. In theory, I shouldn't have had to do anything even remotely close to rocket science, as my job was to understand how the existing framework worked and how it could be reused for different scenarios.

I started investigating, and while the existing framework wasn't as good as I would have expected, especially for something meant to be reused so many times and in so many places, it wasn't horrible. I had to copy some generic JSON defining the overall widget configuration, change some parameters, tweak its configuration types to accept my scenario, put it in a specific file, fiddle with the new to-be-displayed UI parameters and voilĂ , task done. Best case, a week of work; worst case, 2 to 3 weeks, depending on the feedback I would get from the Functional Architects (FAs), my load helping the team with all they needed, and anything else that might come along.

After several hours of wandering through project code and short calls with code owners and people who had already used the framework, I found myself face to face with the usual situation: I actually needed the data my widget would be based upon. The bummer was that the data was provided by a fairly new, still-in-development feature, and you had to spend many hours or even days to generate it, as it depended on some entity state within the system that you had to actually change from time to time. The more time you spent on it, the more data you got. There was no way to speed this up. Everyone was waiting to get the data, and, it seemed, so should I. Many complained about this, to no effect.

The whole thing was aggravated by the fact that we had 2 splits per week, and usually they were not compatible. If you git pull the main branch and then try to rebase it into your feature branch, you need to recreate the whole database, and obviously everything you generated is gone.

The prospect of wasting half of my day on generating data, two times a week, didn't sound very pleasing, so the lazy programmer within me said:

- I cannot work like that. I need DEMO DATA!

So, let's snapshot the situation: an experienced engineer gets a task with a worst-case estimation of 3 weeks of work and probably 24 hours wasted on generating demo data. In enterprise terms, 24 hours is basically nothing. But still, laziness in software engineering is usually not about time, but about doing the same repetitive task again and again, knowing that it can be automated.

What can be automated, eventually will be automated. The only question is: "Who will do it and when?"

In this specific case, it was me who thought, and really believed, that it was better to automate the data generation now rather than "waste" those 24 hours, especially because I had heard several colleagues complain about the same problem.

On occasions like that, you imagine yourself, at the end of the task, as Tony Stark presenting some new tech in front of a cheering crowd, bringing something to the table no one else could, something that will improve everyone's professional day-to-day life.

(photo Copyright Marvel, Iron Man 2)

And just like that, it all began.