Recently two of our Asterisk servers were segfaulting, and of course every time that happened, all calls in progress were lost. This was very frustrating for our customers, our customer service reps, and of course for us developers. It was also very hard to explain: there was no smoking gun saying “X” is responsible.
So my boss Kevin and I spent time going through log files, analyzing timelines, researching on the Internet, and so on.
Our conclusion is still dumbfounding.
We recently added a new outbound SIP provider. At first we gave them only a small percentage of our overall traffic. During this time we saw 0-1 segfaults each day. We didn’t even know about some of them, because Asterisk started right back up, and unless someone complained, we had no way of knowing.
Of course we got the occasional complaint, but we didn’t have an answer. Then, all of a sudden, the last couple of days showed 4-5 segfaults per day, some within 20 minutes of each other. This was unacceptable, and we spent a lot of time trying to narrow down the issue. It so happened that 2 days ago we had ramped up the amount of traffic this new provider was receiving, which increased the odds of a segfault.
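A minimal sketch of the kind of tally that makes a pattern like this visible: count crashes per day and flag the ones that land close together, then line the counts up against the days traffic was changed. The timestamps below are made-up sample data, and `crashes_per_day` and `close_pairs` are hypothetical helper names, not anything from our actual tooling.

```python
from collections import Counter
from datetime import datetime, timedelta


def crashes_per_day(timestamps):
    """Count how many crashes fell on each calendar day."""
    return Counter(ts.date() for ts in timestamps)


def close_pairs(timestamps, window=timedelta(minutes=20)):
    """Return consecutive crashes that happened within `window` of each other."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a <= window]


# Made-up sample data: in practice these would be pulled from wherever
# the restarts are recorded (syslog, monitoring, etc.).
crashes = [
    datetime(2024, 1, 1, 9, 15),
    datetime(2024, 1, 2, 14, 2),
    datetime(2024, 1, 3, 10, 0),
    datetime(2024, 1, 3, 10, 18),
    datetime(2024, 1, 3, 16, 40),
]

for day, count in sorted(crashes_per_day(crashes).items()):
    print(day, count)

for first, second in close_pairs(crashes):
    print("close together:", first, second)
```

A jump in the daily count right after a routing change is only a correlation, but it is a cheap way to decide which change to roll back first.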
At this point, we still don’t know what actually caused the segfaults. We just know that removing that carrier has so far stopped them from happening.
Ugggg… what a PITA.