This is the first of those later posts I promised.
I got up this morning in the usual kind of way, and discovered that there was a new Beta of Lightroom 3 to play with. Well, I got as far as installing it when that quite unusual thing happened: a text message on my work mobile. At 7:20am. Now I was half expecting it to be my staff[1] telling me he couldn’t come in to work, but no. It was from a rather keen person who likes to start work early, telling me that the office phones were dead and that the network seemed to be down too. I tried to do a bit of remote connection, but nothing in the office wanted to talk to me, so I knew I’d have to check this out. Quick shower, slow bus[2] and I got to the office. I told the few people who were in that I’d have to check what was wrong before I could tell them how bad things were and went down to the server room. I could hear lots of fan-type noises from inside, so apparently not everything was down. Maybe it wouldn’t be too bad?
I opened the door to be greeted by one of the worst things you can get from a room full of expensive and business-critical equipment: heat. Both of the air-conditioning units were off, which is a Very Bad Thing Indeed. I asked someone to get hold of whoever deals with such things, then phoned a colleague in another office to tell them what was happening. It seems there had been a fairly major power cut in Newcastle overnight, which was enough to knock everything off and trip out the aircon. Fun.
Before too long, someone who deals with such things got my aircon back on, so I could start working out what was broken. It looked like some kit was down, including the thingy[3] that connects all the computers and phones together, and the wossname that connects the office to the rest of the company and the internet.
After staring at it for a while, I realised the problem: one of the power strips from the very big backup power thingy was off. Rather than try to sort that out, I found another power strip and connected it to an unused output, and got the networky[4] bits started.
At that point, Aaron[5] arrived, and I handed over the business of starting things up to him. Once he’d sorted out the power, everything began to work. The phones came to life, and the servers (including all the shiny new virtual ones) came on, and people could restart their computers and get some work done.
But there was still a problem. Email was not reaching people, or indeed escaping from the office. A quick look at the “front end” mail server, which is in a different location, showed lots of messages queued up ready for delivery, but nothing actually getting through to the mailbox server in Newcastle.
My first thought (after basic things like restarting a few services[6]) was that the mailbox server had started up before the domain controllers, so didn’t know what was what, where it was, or what century it was in. I restarted it, which didn’t help as the mail server services decided not to start.
I went through the event logs trying to see what the problem was. The specific error was “Topology Discovery Failed”, which pointed towards the server not being able to find the global catalog, which was odd, because there are three live GC servers on the same subnet[7]. More digging and a bit of web research on the particular error being logged suggested something interesting. Various people had hit the same problem when a security group containing the Exchange Servers did not have the right permissions on certain logs. But this usually came up when people were installing Exchange, rather than just after a few servers shut down a bit unexpectedly.
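For anyone who wants to rule out plain connectivity before digging into permissions, here is a rough sketch of the sort of check I mean. It only tests whether a TCP connection can be opened to the global catalog's standard LDAP port (3268), so it would not have caught a permissions problem like this one. The host names in the usage note are made up for illustration, and the function names are my own, not part of any Exchange or Active Directory tooling.

```python
import socket

GC_PORT = 3268  # standard LDAP port for the Active Directory global catalog


def can_connect(host, port=GC_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def reachable_servers(hosts, port=GC_PORT, timeout=3.0):
    """Map each host name to whether its port accepted a connection."""
    return {host: can_connect(host, port, timeout) for host in hosts}
```

Calling something like `reachable_servers(["gc1.example.local", "gc2.example.local"])` (hypothetical names) from the mail server would show at a glance whether the GC servers can be reached at all. If they all answer, as mine would have, the fault lies somewhere other than the network.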
So I looked, and found that the permissions had indeed gone missing. How, I don’t know. Could somebody have changed something, and it only came to light when the mailbox server was restarted? Could the Group Policy have decided to mangle itself? The event logs are silent on the matter, so I don’t suppose I’ll ever know why I had this problem. But once I set the permissions, and restarted the mailbox server one more time, normal service was resumed.
Just in case anyone comes across this and has a similar problem, this forum post has the relevant details of what permissions to set and where:
Ars Technica Forum
I’d rather not have another day like that for a while…
[1] All of him
[2] The traffic was a wee bit heavy
[3] I’m omitting the technical details for the benefit of the general audience
[4] OK, that’s a bit technical
[5] My staff[1]
[6] Sorry, it might get technical from here on. Feel free to fall asleep, or move to another post…
[7] There goes the rest of the audience…