Friday, February 2, 2007

Panel Comm errors -> C-Cure crash

One goal for this blog is to help people learn from the experience of others. This tip is based on a recent painful incident at a high-profile site. The site was down for several hours even though it had redundant servers that were functioning properly.

When C-Cure person records are edited, imported, or purged, the changes are downloaded to online panels. If a panel is online, but not communicating due to a hardware or line failure, C-Cure stores the changes for that panel in a download table so they can be sent when comm is restored.

Over time, these records can take up a lot of space in the database, and ultimately kill the driver. If you have a redundant system, the same database will exist on the backup system, so the failure will occur there as well. As far as I know, there is no clear indication of why the system won't work, and to recover, you need to restore a backup of a good database or have SH TSG do some database magic.

Moral of the story: communication failures should be dealt with immediately, and panels (or comm ports) should be set offline if the fault cannot be repaired promptly.
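To make the mechanism concrete, here is a minimal Python sketch of how a pending-download table can grow. It is a toy model only: the class, method, and panel names are made up for illustration and do not reflect C-Cure's actual schema or driver logic. It assumes that changes are queued only for panels that are set online but not communicating, and that nothing is queued for panels set offline.

```python
from collections import defaultdict


class DownloadQueue:
    """Toy model of a pending-download table (hypothetical, not C-Cure's schema)."""

    def __init__(self):
        self.state = {}                   # panel name -> "online" or "offline"
        self.comm_ok = {}                 # panel name -> True if comm is working
        self.pending = defaultdict(list)  # panel name -> changes waiting to be sent

    def add_panel(self, name, state="online", comm_ok=True):
        self.state[name] = state
        self.comm_ok[name] = comm_ok

    def record_change(self, change):
        """Handle one person-record change (edit, import, or purge)."""
        for name, state in self.state.items():
            if state != "online":
                continue  # assumption: nothing is stored for offline panels
            if self.comm_ok[name]:
                continue  # healthy panel: change is downloaded right away
            self.pending[name].append(change)  # comm fail: store for later

    def pending_rows(self):
        return sum(len(changes) for changes in self.pending.values())


queue = DownloadQueue()
queue.add_panel("apC-1")                                  # healthy
queue.add_panel("apC-2", comm_ok=False)                   # online, in comm fail
queue.add_panel("apC-3", state="offline", comm_ok=False)  # set offline

# A 1000-card import touches every online panel.
for card in range(1000):
    queue.record_change(("update", card))

print(queue.pending_rows())
```

In this model only apC-2 accumulates rows, one per changed card, and the count keeps climbing with every import until comm is restored or the panel is set offline.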

2 comments:

Anonymous said...

Jeff,

How many apCs need to be in comm fail, and for how long, for this to be a concern? What was the case for the incident in your example? How much activity does that system see?

Thanks,
Craig Delgado

Jeff Bennett said...

It is related to both the number of apCs and the number of cards being downloaded, and I think the retry frequency as well. I don't recall the exact numbers, but usually the problem crops up because of large or frequent imports. I'm not sure about current C-Cure versions, but importing from a text file would trigger an import for each card, even if the records were identical.
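One way to keep imports small is to strip unchanged records out of the text file before importing it. The following is a rough Python sketch, assuming the import file is a CSV with a header row and that you keep a copy of the previous import to compare against. The function name, the card_number column, and the sample data are hypothetical, not part of any C-Cure import format.

```python
import csv
import io


def changed_rows(previous_csv, new_csv, key="card_number"):
    """Return only the rows of new_csv that are new or differ from previous_csv."""
    previous = {row[key]: row for row in csv.DictReader(io.StringIO(previous_csv))}
    return [
        row
        for row in csv.DictReader(io.StringIO(new_csv))
        if previous.get(row[key]) != row  # missing before, or any field changed
    ]


# Hypothetical sample data: one unchanged card, one edited, one new.
last_import = "card_number,name\n1001,Ann\n1002,Bob\n"
this_import = "card_number,name\n1001,Ann\n1002,Robert\n1003,Cy\n"

to_import = changed_rows(last_import, this_import)
print([row["card_number"] for row in to_import])
```

Only cards 1002 and 1003 would be passed on to the import here, so the identical record for 1001 never generates a download.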