VOLUME 30, NUMBER 21 THURSDAY, February 18, 1999
Reporter: Front Page

UB's central email service back online
Tentative analysis shows failure caused by size of total mail file system, way files are distributed

By CHRISTINE VIDAL
News Services Editor

For the third weekend in a row, UB's central email server was down. But unlike previous weekends, this time the shutdown was deliberate.

The university's central email server went off line at midnight Friday to allow Computing and Information Technology (CIT) to dismantle the temporary email server and restore to the central email server all the computer hardware that was scavenged to create it.

As of Monday evening, after a nearly two-week absence, UB's central email service is back online, albeit with a limited number of connections, but with all folders and inbox messages dated Feb. 6 and earlier restored.

Weekend mail, interrupted by the decision to take the server down for three days, has been delivered, and the nearly 500,000 messages that were sent to the temporary system and queued (mail received between 12:01 a.m. Feb. 7 and midnight Feb. 12) have begun to be delivered, a process that is expected to take at least a week.

The main email server, which stored more than 8.5 million files, has been reconfigured into 12 smaller segments, a move that is expected to prevent future problems.

And despite the crash, which Voldemar Innus, senior associate vice president for university services and UB's chief information officer, called "catastrophic," the vast majority of the nearly 500,000 pieces of email received between Feb. 3, when the problem first erupted, and Monday, when the central email server finally went back on line, has been received or is recoverable.

There's no question that the crash was an extremely serious event, Innus said. In fact, he added, the only scenario that could have been worse would have been if there had been a fire and the university had lost all of its central email server hardware.

But as of Tuesday afternoon, all indications "look like we're coming back fine," he said.

So what happened?

The failure of the central email server was caused by the size of the total mail file system, which exceeded 8.5 million files, combined with the way the files are distributed over the disk arrays, said Hinrich R. Martens, associate vice president for computing and information technology.

But that analysis is tentative, he emphasized. The software vendor, Veritas, is assembling a duplicate system the size of UB's in an effort to reproduce the failure and confirm the analysis.

The crash came as a complete surprise, both to UB and Veritas, according to Innus.

"We were reviewing our strategy on the growth of that file with the vendor all along, and the vendor gave no indication that the growth would cause a problem," he said.

Throughout the crisis, CIT has been in contact with Veritas, and last week the software company "gave us the indication that they agree with our tentative analysis of the problem," Martens said.

Confirmation notwithstanding, CIT has taken a number of steps to prevent the central email server from crashing again.

"What we've done as part of the restoration process is broken the file system into 12 smaller systems, and we're absolutely sure that we're not going to run into the same problem as before," said Martens. "If the (total of) 8.5 million files was the source of the crash, we won't experience it again. This also gives us a more manageable size if we experience a problem again."

Breaking the server into smaller systems also decreases the chances that all of them could be affected at the same time, so that if another system failure were to occur, there would be a smaller number of files to recover, and restoration could occur in "a matter of hours, not days," Martens said.
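As a rough illustration of why smaller segments limit the damage, the sketch below spreads mailboxes across 12 segments and estimates how many files each would hold. Only the figures 12 and 8.5 million come from the article; the hashing scheme, the function names and the even-spread assumption are hypothetical and do not describe how CIT actually laid out the file systems.

```python
import hashlib

NUM_SEGMENTS = 12        # segment count reported in the article
TOTAL_FILES = 8_500_000  # approximate file count reported in the article


def segment_for(mailbox: str, num_segments: int = NUM_SEGMENTS) -> int:
    """Map a mailbox name to a segment index in a stable, repeatable way.

    Hashing is an assumption made for this sketch, not CIT's method.
    """
    digest = hashlib.sha256(mailbox.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_segments


def files_per_segment(total_files: int, num_segments: int = NUM_SEGMENTS) -> int:
    """Rough per-segment file count, assuming files spread evenly."""
    return -(-total_files // num_segments)  # ceiling division


if __name__ == "__main__":
    # "jdoe" is a made-up mailbox name used only for illustration.
    print("segment for jdoe:", segment_for("jdoe"))
    print("files per segment:", files_per_segment(TOTAL_FILES))
```

Under these assumptions, a failure confined to one segment would leave roughly one-twelfth of the files to recover, which is the intuition behind the "hours, not days" estimate.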

The decision to shut down the server for the weekend was made Friday morning, taking into consideration the limits of both the equipment and the people working to fix the problem.

"We took it (the central email server) down to give people time and sleep and the resources and the wherewithal to bring the system fully back up," said Martens.

The bulk of the work that went into cobbling together a temporary server and restoring central email service has been performed by eight members of the CIT staff: Gretchen Phillips, Paul Graham, Lisa Maira, Matthew Stock, Stephen Comings, Steven Roder, Leonardo Miceli and Patricia Dennis. All are part of the UNIX system support group and have expert knowledge of the mail system, how it is constructed and how it operates. They also are the individuals who are most familiar with the software company, Martens said.

"People were working literally night and day to restore service. We know how important email is to the university community," Innus said.

Martens added: "It's not an exaggeration to say they've been working close to 18-hour days. Some people spent entire nights here monitoring the system to make sure the vital signs were there to assure continued progress of the recovery plan.

"On Friday, we decided to cut our losses short, take the system off line, rebuild it and give people a chance to sleep a little bit....When you're tired, you make mistakes," he said.

Those people "are owed a tremendous 'thank you' and acknowledgment for their effort in sustaining this restoration."

By last Friday, CIT personnel weren't the only ones reaching critical mass.

Weekdays, the university receives new, incoming email at a rate of 160,000 to 180,000 pieces each day. Weekends, roughly 60,000 pieces are received daily.

The temporary email server that was in place last week was configured to allow users to read their incoming mail and respond, but because the server was not connected with the permanent central email server, mail received could not be filed. So CIT set up a hold queue to make a copy of each piece of email received by the temporary server. Once the central email server was restored, the hold queue would resend all messages it received so users could save them if desired.

By Feb. 12, the system had accumulated between 400,000 and 450,000 pieces of mail in the hold queue, Martens said.

"If we'd let the interim system continue, we'd have accumulated between 800,000 and 1 million pieces of mail," he said, and there was concern about the server's ability to handle that quantity.

When mail stored in the hold queue (mail received between 12:01 a.m. Feb. 7 and midnight Feb. 12) is released, it will be delivered in reverse order, with the most-recently received mail sent out first and the oldest, "stale" mail delivered last.
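The hold-and-release behavior described here can be pictured with a minimal sketch: each incoming message is copied into a queue, and on release the newest copies go out first. The class and field names are hypothetical, and nothing below reflects the actual mail software CIT used.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterator, List


@dataclass(frozen=True)
class HeldMessage:
    """A copy of one message kept while the main server is unavailable."""
    received: datetime
    recipient: str
    body: str


class HoldQueue:
    """Hypothetical hold queue that releases the most recent mail first."""

    def __init__(self) -> None:
        self._held: List[HeldMessage] = []

    def hold(self, message: HeldMessage) -> None:
        self._held.append(message)

    def release(self) -> Iterator[HeldMessage]:
        """Yield held messages newest first, emptying the queue as it goes."""
        self._held.sort(key=lambda m: m.received)  # oldest ... newest
        while self._held:
            yield self._held.pop()  # take from the newest end


if __name__ == "__main__":
    queue = HoldQueue()
    # Recipients and bodies are invented for illustration.
    queue.hold(HeldMessage(datetime(1999, 2, 7, 0, 1), "user-a", "oldest"))
    queue.hold(HeldMessage(datetime(1999, 2, 12, 23, 59), "user-b", "newest"))
    queue.hold(HeldMessage(datetime(1999, 2, 9, 12, 0), "user-c", "middle"))
    for message in queue.release():
        print(message.received, message.body)
```

Releasing newest first means the mail most likely to still matter reaches users soonest, while older mail arrives later.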

Mail in the hold queue will be streamed into the system during off hours, a process that was expected to begin yesterday, Innus said.

Unfortunately, not every piece of email received since Feb. 6 will be recoverable, according to Martens. He estimated that as many as one out of four email messages received on Feb. 6 has been lost, with the most critical period occurring between noon and midnight. During that period, the log file also was lost, so there is no way for CIT to trace which messages were affected.

"Chances are about 15,000 pieces of mail were lost, and unfortunately, there's no way to tell how or to whom it happened," Martens said. "Everyone must be aware of that and try to deal with that."
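The 15,000 estimate lines up with the volumes quoted earlier in the article, assuming the weekend rate applies to Feb. 6, which fell on a Saturday. The pairing of those two figures is an inference made for this sketch rather than something Martens stated, and the function name is hypothetical.

```python
WEEKEND_DAILY_VOLUME = 60_000  # pieces per weekend day, as quoted in the article
LOST_NUMERATOR = 1             # "as many as one out of four"
LOST_DENOMINATOR = 4


def estimated_lost(daily_volume: int, numerator: int, denominator: int) -> int:
    """Upper-bound estimate of lost messages for one day's mail."""
    return daily_volume * numerator // denominator


if __name__ == "__main__":
    # 60,000 * 1/4 = 15,000
    print(estimated_lost(WEEKEND_DAILY_VOLUME, LOST_NUMERATOR, LOST_DENOMINATOR))
```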

He recommended that members of the university community who believe they should have been in communication with someone during that period of time contact senders and ask that they resend any message that may have been transmitted at that point.

"Individuals have to decide in their own way if that's something they need to do," he said.

Even when the central email server is up, running and fully functional, CIT will be working to make sure that something like this never happens again.

CIT is in the process of forming a campus-wide committee that will be asked to review the approach that was taken in the restoration of the university's central email service, examine the central email system and recommend changes, including longer-term changes, such as replacement of the system, if necessary. However, UB is not considering changing its central email system in the short run, Martens said.

"Our goal is to recover to where we were two weeks ago, but with the appropriate changes to safeguard against another system failure." CIT also plans to take further steps to continue to improve the system, he added.

Those steps include having the committee look at other large institutions to provide some sort of benchmark of where they, and UB, are in terms of email.

"Email is something most institutions at this point are working on hard, grappling with," Innus said.

"The big thing here is, with this experience, we need to review what the institutional strategy for email needs to be."

Earlier, UB looked at the general approach other institutions are taking to their information-technology-infrastructure strategy and budget.

"We knew where we were compared to other places, but we didn't look at specific systems like email and administrative systems. As we move forward with IT planning, we'll be looking at those areas," Innus said.

Should UB have brought in outside help to solve the computer crash?

"When you listen fully to the details of what went on and steps that were taken to restore service, everyone will be in unanimous support of the expertise of our staff. That's not to say we can't benefit from the input and wisdom of the review committee," Martens said.

"We believe we're doing our best and our staff is highly qualified, but we welcome recommendations for the future."

On Monday, Martens said he had "a very high level of confidence that we have a solution to a fully restored server."

But he asked that the university community be patient once messages that have been stored begin to make their way from the mail server to individual email accounts.

"It will be at least a week before all queued-up mail is delivered," Martens said. During that time, CIT will continue to post updates as necessary on its Web site at http://wings.buffalo.edu/computing/alert/.



