An Extra Mile

Last week, most of my time went into debugging an issue whose cause was not at all obvious when we hit it. A teammate of mine opened a pull request that failed the CI check every single time. He updated the PR about four times, but CI was never satisfied. Finally, when there was no more code to add to the PR and CI remained stubborn, we needed to dig deeper.

A few points (disclaimers, maybe):

I had played almost no role in writing the tests. So when I had to figure out why the CI checks were failing, I had to learn from scratch how the current setup worked.

Our codebase takes a good deal of time to deploy in a test environment. We use Ansible playbooks for deploying to the dev, pre-prod, and prod environments. A deployment on freshly provisioned nodes takes up to 30 minutes, while a deployment on existing VMs takes about 15.

After the deployment, I had to open an IPython shell and create a few dictionaries that were then fed to the test suite. These dictionaries mainly contained the IP addresses of the VMs on which the tests were to be executed. Finally, the test suite, written with nosetests, was run.
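To make that step concrete, here is a minimal sketch of the kind of dictionaries built in the shell before kicking off the tests. The function and key names are my own illustrations, not our actual code:

```python
def build_test_config(app_ips, worker_ips):
    """Collect the VM IP addresses the test suite needs into one dict.

    Hypothetical helper: in practice we typed dictionaries like this
    out by hand in the IPython shell after each deployment.
    """
    return {
        "app_nodes": list(app_ips),
        "worker_nodes": list(worker_ips),
    }

# In the shell, something along these lines:
config = build_test_config(["10.0.0.5"], ["10.0.0.7", "10.0.0.8"])
```

The test suite then reads the IPs out of this dict to decide which VMs to exercise.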

Sure enough, running the test suite in the dev environment gave me the same results as CI. The logs made no sense! Assertions failed, and it looked like they were being executed before the tasks had finished. All I could wonder was: how are assertions supposed to succeed when they check the result of a task that hasn't finished yet? It looked that way to my teammates as well.
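The usual cure for this symptom, whether or not it was the real cause here, is to poll for task completion before asserting on the result. A minimal sketch, with `predicate` standing in for whatever "is this task done?" check the suite has:

```python
import time

def wait_for(predicate, timeout=30.0, interval=0.5):
    """Poll `predicate` until it returns True or `timeout` seconds pass.

    Returns True if the predicate succeeded in time, False otherwise.
    Asserting only after this returns True avoids the race where the
    assertion runs before the task has finished.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False

# Usage in a test, with `job` as a hypothetical async task handle:
#   assert wait_for(job.is_finished), "task never finished"
#   assert job.result == expected
```

This keeps the timeout explicit, so a genuinely stuck task still fails the test instead of hanging it forever.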

I was about to tell the team that the process was becoming frustrating and getting us nowhere when a strange voice in my head asked me to look one more time: the extra mile!

It took me about 15 to 20 minutes to find an error in one of the worker services. Surprisingly, the error didn't appear in the CI logs! Nor did it appear in the log file we generate for every job that passes through the service. It might seem obvious that I should have checked that particular service from the beginning. I didn't, because our team was living under the assumption that we surface logs for every service in the CI tests, especially when something fails.
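Finding that error ultimately came down to reading the service's own log output directly instead of trusting what CI surfaced. A small sketch of that kind of scan, with the marker strings as illustrative assumptions:

```python
def find_errors(log_lines, markers=("ERROR", "Traceback")):
    """Return the lines from a service log that look like failures.

    Hypothetical helper: `markers` are example failure signatures,
    not the actual strings our services emit.
    """
    return [line for line in log_lines
            if any(marker in line for marker in markers)]

sample = [
    "INFO  job 42 started",
    "ERROR job 42: connection refused",
    "INFO  job 43 started",
]
suspects = find_errors(sample)  # only the ERROR line survives
```

The point is less the code than the habit: when CI output and reality disagree, go read each service's logs at the source.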

Key lessons that I took away:

- Don't assume your logging is complete. When CI output doesn't add up, go read each service's own logs directly.
- When you're about to give up, one more careful look, the extra mile, often pays off.

I wrote this blog to get my thoughts out. If, in the process, it helps someone else, I'd be pleasantly surprised! But if you have any suggestions or comments about how we can improve, please let me know in the comments.

Until next time. 😄