Friday, February 2, 2007

Panel Comm errors -> C-Cure crash

One goal for this blog is to help people learn from the experience of others. This tip is based on a recent painful incident at a high-profile site. The site was down for several hours even though it has redundant servers that were functioning properly.

When C-Cure person records are edited, imported, or purged, the changes are downloaded to online panels. If a panel is online, but not communicating due to a hardware or line failure, C-Cure stores the changes for that panel in a download table so they can be sent when comm is restored.

Over time, these records can take up a lot of space in the database, and ultimately kill the driver. If you have a redundant system, the same database will exist on the backup system, so the failure will occur there as well. As far as I know, there is no clear indication of why the system won't work, and to recover, you need to restore a backup of a good database or have SH TSG do some database magic.

Moral of the story: communication failures should be dealt with immediately, and panels (or comm ports) should be set offline if the fault cannot be repaired promptly.
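To make the failure mode concrete, here is a toy Python model of the behavior described above. It is only a sketch: `Panel`, `ToyHost`, `edit_person`, and `backlog` are made-up names, not C-Cure's actual tables or code. It also assumes that changes are not queued for a panel once it is set offline. That is my reading of the advice above, not something confirmed from the product.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Panel:
    name: str
    online: bool = True          # operator setting: panel is marked online
    communicating: bool = True   # physical state: the link actually works


@dataclass
class ToyHost:
    """Hypothetical stand-in for the host and its download table."""
    panels: list
    pending: dict = field(default_factory=lambda: defaultdict(list))

    def edit_person(self, change):
        for panel in self.panels:
            if not panel.online:
                continue  # assumption: nothing is queued for offline panels
            if panel.communicating:
                continue  # change is delivered right away, nothing stored
            # Online but not communicating: store the change for later.
            self.pending[panel.name].append(change)

    def backlog(self):
        # Total number of stored changes across all panels.
        return sum(len(changes) for changes in self.pending.values())


# One healthy panel, one with a dead line that is still marked online.
host = ToyHost([Panel("lobby"), Panel("dock", communicating=False)])
for i in range(1000):
    host.edit_person(f"change-{i}")
backlog_while_online = host.backlog()

# Set the faulty panel offline, then keep editing person records.
host.panels[1].online = False
for i in range(1000):
    host.edit_person(f"later-change-{i}")
backlog_after_offline = host.backlog()

print(backlog_while_online, backlog_after_offline)  # prints "1000 1000"
```

In this model the backlog grows with every person edit for as long as the broken panel stays online, and stops growing as soon as it is set offline. Multiply that by a large import or purge and several dead panels, and it is easy to see how the stored records pile up.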