Thursday, February 5, 2015

When introducing new technology breaks stuff - it's not necessarily a bad thing

We all know that GO Transit has been working diligently to bring real time service updates to its website and applications. Bus passengers were supposed to get a bus tracking application back in January 2014, similar to the application that allows us to track trains, but it has yet to see the light of day.

Remember how I've joked about the hamster being dead when applications (like the fare calculator) found on the GO Transit website misbehaved or malfunctioned? Turns out I may not have been that far off. I've always suspected that service updates and the accompanying emails and text alerts were entered manually, and not tied to any kind of database. I still believe this is a manual process - someone physically enters the train and bus status into some kind of web-enabled backend.

Over the weekend, GO Transit went ahead and upgraded their website so that service updates would be distributed by a new, automated app, with updates delivered through a feed coming from a third-party source. I received an anonymous email from someone who wanted to explain what happened because they were frustrated with how the situation was explained in the media.

Because of the weather conditions, the third-party source vomited out feed after feed of service disruptions at a pace that the new app built into the website couldn't handle. In an article in the Toronto Star, GO Transit President Greg Percy blamed the website failure on a search function that couldn't handle the demand. He also admitted to updates being done manually. I read that paragraph over and over while trying to tie it to what he discusses in the next paragraph: that GO is prototyping new technology.

End users don't search for service updates. That's not how it works. Metrolinx PR spokeswoman Anne Marie Aikins said on Twitter that the site was brought down by a problem "deep in the backend". Nothing made sense. No wonder my anonymous friend was frustrated.

So, to explain it like we're all five:

GO Transit upgraded their website to include a new application (let's call it "new code") to display service updates in real time using information pulled in from an outside source, rather than someone keying in the information manually. The outside source overwhelmed the "new code" and the "new code" caused the entire website to crash.
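For the code-inclined, what happened is a classic load-shedding problem. Here's a minimal, hypothetical sketch (the buffer size, names and drop policy are all my own illustration, not GO's actual design) of how an app can cap how much of a flood it accepts instead of letting the backlog take the whole site down:

```python
import queue

def ingest(updates, buffer):
    """Push feed updates into a bounded buffer; count what gets shed."""
    dropped = 0
    for update in updates:
        try:
            buffer.put_nowait(update)
        except queue.Full:
            dropped += 1  # shed excess load instead of falling over
    return dropped

buffer = queue.Queue(maxsize=100)           # cap how much can pile up
flood = [f"alert-{i}" for i in range(500)]  # a storm-day burst of alerts
dropped = ingest(flood, buffer)             # accepts 100, sheds 400
```

Dropping (or coalescing) stale alerts is ugly, but a service-update page showing slightly fewer bulletins beats a website showing nothing at all.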

How hard is it to just say, "We thought this upgrade would work and we're sorry it didn't. It is important to us that we try to introduce new technology that will help our customers plan their commute better. We want customers to be informed about service disruptions when they happen, as opposed to receiving information hours later. Unfortunately, the upgrade failed due to circumstances we didn't anticipate. We plan on fixing the problem and taking measures to prevent this from happening again".

I could get into the whole, "Well you should have tested it first" lecture, but I won't. I spent eight years in IT. You can debut all kinds of "new code" with the best intentions. You can test, and test, and test, and still, something can go wrong. You can run into a situation you never thought about - it doesn't make it right, but it happens.

Personally, I don't care that GO Transit ran into a problem while trying to deliver new technology. The fact they are trying is all I care about. Hopefully they learned some lessons from this exercise, and when they finally roll out that bus tracking app, it works like a charm.


Nahid said...

Yep, I've screwed up many times in my IT job, but as long as I transparently explained to the client why things went wrong, both the client and my manager were understanding and fine with it.

Anonymous said...

The real failure was that they didn't properly back up the system before pushing the update live and were unable to easily revert when something went terribly wrong.

C.J. Smith said...

Then there was that ...
But as Percy pointed out, the site is made up of many moving parts, so it would appear that the restore meant more than just the website data itself, but also all the other databases it's possibly connected to. And it wasn't something that happened quickly.
In my previous job, restoring some client sites from backups would take hours, especially those that connected to a myriad of external apps. It wasn't instant.

Anonymous said...

I liked how Del Duca put the blame on an increased ridership. Huh?

C.J. Smith said...

Please oh please point me to where he said that? I need it for my files.

George said...

@Anon. They have had that update service live for months with very few issues.
How long should a backup be kept? Do you realize how serious a job it is restoring a real time website?

Tyler said...

This doesn't make sense. What's Metrolinx's operating IT budget? It's probably the largest in the entire organization. Why not build the system on a second server? Tie in all the apps. Tie in all the code. Run it, and if it fails, switch the IP and point the domain to server #1.
Honestly, I think it's a bunch of 40 to 60 year old upper management dinosaurs that run the tech group at Metrolinx. Guys too scared to be progressive and too stubborn and possessive to try it differently.

C.J. Smith said...

^ You mean like a mirror site?

Anonymous said...

I would think the salaries would be the largest part of the company's budget, followed by transit operations.

George said...

From what I've learned, the incredible amount of data being fed by the external services worked like a DDoS attack on their server farm. The website was trying to update while it was updating service bulletins.

That would affect all Metrolinx sites whether mirrored or not.

I guess it's a situation where Murphy's Law applies. You just can't test every scenario. You just can't anticipate this volume of data being sent in such a short time.

Michael Suddard said...

Perhaps the "new app" was tested just as much and perhaps even by the same people who brought us the awesome green fare card! Wouldn't put it past them.

Speaking of the fare card, on Twitter, Presto is previewing a new Presto Card.

Someone, not me, even tweeted asking if it would crack less!

Gemma said...

I howled with laughter yet again today when I realized I had missed my 7:34 from Whitby, only to find it hadn't even left yet and what I was staring at was the 7:54 :) This week, the commute won: Monday - snowstorm, traffic, delayed trains. Tuesday - I think I was just delayed. Wednesday - highway accident - I left super late from Oshawa instead of trying for Whitby...and then couldn't get out of the lot at the end of the day. Thursday - yesterday's snafu on trains going, coming, not moving, on the way, cancelled and all stops but not and finally today's "switch issue" - all 20 mins delayed. You win, Commute, I wanna work in Durham.

Anonymous said...

I thought the reason for taking GO transit was to eliminate weather related slowdowns?

TomW said...

Bus tracker app... I chatted with someone from GO about this at a public meeting. The fundamental issue is that many GO bus services don't follow fixed routes. (So the Hamilton QEW bus will sometimes go along Lakeshore instead.) This makes it a much harder problem to predict what the arrival time will be. (The algorithm has to figure out the arrival time based on the bus driver adapting their route according to the predicted arrival time.)

They were in the testing-and-refining stage a few months ago, and were aiming for early 2015 then.
Certainly it's better to have no app than an inaccurate app for now.

Peter said...

Per GO Transit CSR, M.D., on 2013/09/08 regarding failed email alerts for a late bus:

"In the meantime, we will continue to work with our Operations team to ensure all delays are reported."

Per the same CSR, the new delayed bus system was supposed to be "in place by January 2014".

I would rather have a slightly inaccurate app than lies.

In the meantime, we'll have to rely on the old school method:

(i) 1-888-438-6646
(ii) Hello. Can you tell me where my bus is, please? I'll hold while you check with Operations. Thank you.

Tal Hartsfeld said...

The same old story: They get an ambitious grandiose idea or concept, implement it, then are taken by surprise by all "the devils in the works" they overlooked and didn't anticipate.

Robert Wightman said...

"I could get into the whole, "Well you should have tested it first" lecture, but I won't. I spent eight years in IT. You can debut all kinds of "new code" with the best intentions. You can test, and test, and test, and still, something can go wrong. You can run into a situation you never thought about - it doesn't make it right, but it happens."

That is why you should build a parallel test site and turn a bunch of grade 5 to 10 students loose on it, with prizes for whoever can cause the first disruption, the most disruptions and the longest disruption. When it can survive that, it is probably ready for use.

It is never possible to overestimate the incompatibility of different programs and operating systems.