Tuesday 16 August 2016

And The Clock Struck 10 ....

This is not an Agatha Christie novel, but the case was really queer and strange, and it did need some Hercule Poiroting to solve!

For professional reasons, let me keep the names anonymous and let us suppose that this happened in an organization named XYZ. 

But let me tell you this. 

Everything that I have written is 100% true, and I have personally experienced it. So that is it.

 Without any more introduction, here I go ....

Company XYZ has only one big customer, and it is of utmost importance that they be kept happy. XYZ has developed a proprietary software application that has evolved over the last 5-6 years, and this software is the backbone of the company's operations for its customer. It is used by about 5000 data entry operators through a web-based front end. The heaviest usage is from 6 pm to 10.30 pm.

One fine evening, a support engineer reported that the application was running slow. The core team did not pay much attention to this, and there were some deviations in the turnaround time of the operations.

The next day, again at about 7 pm, the support engineers reported slowness in the system. The core developers had left the office. The tech support team held the fort and tried some minor tweaks to the application server configuration. It was OK for an hour or so, but then it became slow again.

For the next 2 days, the same problem appeared at 7.30 pm, and the disruption continued till 9.45 pm.

Things became bad. 

The core team went into a huddle that weekend.

Developer A: What should we do?

Project Manager: Let us roll back the last application change that we made.

The Service Level Agreement on turnaround time went for a toss that week. The customer was livid. The senior management was worried.

"What is wrong? Why are we not able to fix this???"

"It should not happen again, Ma'am! We have rolled back the changes."

It was Monday. The day of the peak load.

The application again went slow. From 7 pm.

The core team looked into the server logs, the timeout functions and the memory leaks, and fixed certain things which seemed to be the problem.

After all the changes were made and the server was rebooted, the application worked fine from 10 pm.

"It should be fine now!" The technical team heaved a sigh of relief.

But again the next day, at 7.20 pm, the software crawled! By that time, the senior management had been called up several times by the top management of the customer. They were very unhappy and wanted to revoke the contract.

The core team then focused on the network and the ISP.

"Ah! The bandwidth is getting choked! We have to increase the bandwidth."

The bandwidth was increased immediately, and at about 10.10 pm the application started working smoothly.

Everybody heaved a sigh of relief. 

But .... again the next evening, at 6.50 pm, the application just stopped working!

"You all are a bunch of fools sitting here! It has already been 8 days and we have not been able to figure out what is wrong!"

"It must be something wrong with the firewall then," the Project Manager mumbled. "We have looked at everything! The application, the servers, the network. We even checked for memory leaks and adjusted the server. Everything is eliminated ....!"

It was Friday. A high-level emergency meeting was called. The IT and operations teams were silent.... No one knew how this puzzle would be solved.

"We just have this weekend to do whatever we have to do. On Monday, I have been called by the customer. They might cancel the contract. And with that, the jobs of about 4500 employees are at stake ..." the MD said in a tight voice.

Everyone was silent. After some time, everybody apart from the core IT team dispersed.

The Project Manager was going through a nightmare which did not seem to end. He sat there, his head bowed over the desk, held in his hands.....

There was one young, bright engineer in the team. He seemed to be deep in thought....

"Just wondering ..."

"What ...?"

"Every day, the application slows down between 6.30 and 7 pm. Then we do some fixes and it is up and running again from 10 pm ... And we are happy .... but again the next day, it slows down at the same time ...."

"We all know this ... why are you repeating it ... rubbing salt into our wounds ... think of something which we can do ... but what????"

"There is a pattern ..." the young engineer mumbled.

The PM ignored his muttering.

"Let us get down to work. Try to look at all the aspects .... let us not leave any stone unturned ..."

"But Sir, we have already done whatever we could do! I really cannot think of anything more ..." the team wailed.

"None of you will go home. Let us try again ..." the PM said, though he knew that his team was right. They did not have a clue what was going wrong.

After everyone went back to their seats from the meeting room, the young engineer still sat there. Thinking. 

There was a pattern. And the pattern was the clue to this mystery. He had to unravel it.

He would investigate and get to the bottom of it, he promised himself.

And he started writing the chronology of events from day 1 of the fiasco. 

He created an Excel sheet with columns for date, time, event, action taken and result, and started filling it up vigorously.

And as he finished entering the data, his roving eyes and clear mind started seeing the pattern. And yes! He was right! Every evening, when things went slow, the team found some solution and fixed things. But by the time they were done, it was already 10 pm and things became normal. And they thought that the application was working well because of the fix they had made.

His eyes shone.

That meant that the remedial actions or fixes did NOT work! The software recovered by itself after 10 pm ......
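For readers who like to see it in code, here is a minimal sketch of what the engineer's sheet revealed. The three rows use only the times mentioned in this story; the field layout and the function names are my own assumptions for illustration, not what the team actually used. The idea is simple: if very different fixes are always followed by recovery at almost the same clock time, the clock is a better suspect than the fixes.

```python
from datetime import time

# Rows transcribed from the story above. The tuple layout
# (slowdown_start, action_taken, recovered_at) is an assumption.
INCIDENT_LOG = [
    (time(19, 30), "not stated in the story", time(21, 45)),
    (time(19, 0), "server fixes and reboot", time(22, 0)),
    (time(19, 20), "bandwidth increase", time(22, 10)),
]


def minutes_since_midnight(t):
    """Convert a clock time to minutes so that times can be compared."""
    return t.hour * 60 + t.minute


def recovery_spread_minutes(log):
    """Gap in minutes between the earliest and latest recovery times."""
    recoveries = [minutes_since_midnight(recovered) for _, _, recovered in log]
    return max(recoveries) - min(recoveries)


def distinct_actions(log):
    """Set of the different remedial actions recorded in the log."""
    return {action for _, action, _ in log}


if __name__ == "__main__":
    print("Different actions tried:", len(distinct_actions(INCIDENT_LOG)))
    print("Recovery spread (minutes):", recovery_spread_minutes(INCIDENT_LOG))
```

Three unrelated actions, yet every recovery lands within the same half hour around 10 pm. That is the pattern the sheet made visible.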

But WHY?? That was the million dollar question. And the weird part was that it had been working flawlessly 2 weeks back ...

He closed his eyes. He must think. Think hard.

Facts and figures went past his brain like slides of a PowerPoint presentation.

Everything seemed blurred.

There has to be a light at the end of the tunnel ...

All the application changes going back 3 weeks have been rolled back. The entire database has been cleaned up, indexed, tested on the staging server...

Then what? What else happened 2 weeks back that was out of the ordinary?

He started browsing through his emails. He had to get some clue ... some clue ...

Innocent emails stared at him. Some complaints about some scanners not working, an email about data not being fetched .... 3-5 complaints about previous data not being made available for data churn in the warehouse.

Ah...! All innocent emails ... Everything had been fixed. People just keep on complaining; a smirk came to his face. The data had been provided a few days back. They had been after the IT team for this. He saw his PM's note about the data being made available. It was a huge piece of data, he had replied, but it was all done now and the people could process it in the warehouses.

Suddenly something clicked in his mind. His eyes were wide open now.

He opened the email. Looked hard at the date.

It was 2 weeks back. But ... this data had nothing to do with the data entry application ... but .... and why 7 pm to 10 pm? What happened during that time?

He remembered his encounter with the warehouse manager a couple of times. 

"The warehouse application is extremely slow after you have put in the data ... do look into this..."

But the core team did not have any time to look into this. They had bigger problems on their hands.

Yes. It was falling into place. 

He closed his eyes again.

The warehouse tables had been loaded with millions of rows of data two weeks back. The warehouse application was running very slow. The warehouse tables were in the same database schema as the data entry application .... that meant they were using the same memory space ... The maximum load of concurrent users was between 6 pm and 10 pm ... that was the time the memory was totally eaten up by the warehousing application. Though the warehouse application was not used after 7.30 pm, the database memory did not get released till 10 pm. And invariably, after doing some fixes, the team rebooted the database server around 10 pm.

That was the time when the memory got released and things worked smoothly. 

So, they had to do something about the warehouse tables. Maybe performance tuning of the SQL queries, and some structural changes as well.

But from tomorrow, till the time the problem gets fixed, if the warehouse application is stopped at 6.30 pm and the database server is restarted, the problem will not occur.

He smiled. He had to talk to the Project Manager.

The next day, the warehouse application was stopped at 6.30 pm and the database server was restarted. 

The application ran smoothly! Everyone smiled and heaved a sigh of relief.

After that, the load was balanced. Now it is again running as smoothly as before.

So, dear readers, what do you think?

What accolade should the young Hercule Poirot be given?

Cheers!

3 comments:

  1. Wonderful. Sometimes a weird idea does work. That is the power of brainstorming. In this case a young engineer did a relentless root cause analysis, asking "Why" at least 5 times, and found a "mistake proof" solution. Would like to share a joke with you on this: it was found that in a hospital there were repeated deaths reported day after day. The time of death was 11 a.m. To their utter shock, they found that the housekeeping guy removed the plug of the ventilator to insert the plug of the vacuum cleaner, which caused the deaths. Sad, but true. Studying patterns always helps.

  2. In the first half of your writing, it seemed that something uncanny and mysterious was happening, just like in some ghost stories. And then, though I don't understand the minute technicalities of IT applications, it became clear to me that the fault was rectified excellently by an intelligent engineer. And we should never give up and surrender to problems under any circumstances. There is a solution to every problem. Only, one should take heart and be patient.
