Where can I find information on product outages - Checkpoint
The responses from Checkpoint related to delays or outages are summarized or attached here:
August 27, 2024
August 28, 2024
Below are the responses and RCAs from Checkpoint regarding their reported delays in email processing and sending.
28AUG24
Summary
While we were hardening against a domain-related vulnerability, email deliverability issues occurred, resulting in extremely delayed delivery of outbound email for customers that had a large number of domains. The delays would have been noticed in the timeframe listed above. No email and no data were lost during this event.
Root Cause Analysis
A new threat had been identified that allowed bad actors to spoof a domain, so that malicious email was disguised and appeared to have been sent from a legitimate, known domain. This allowed the email to get past preventative email security measures. In the process of hardening HEC, we deployed code that did not account for customers with a very large number of domains. Affected customers would have experienced this as a delay in outbound email.
Actions Taken
The fix for this vulnerability involved comparing header information against the customer's known domains. Our initial deployment had a previously unknown limit of 100 domains, chosen in no particular order, to compare against the headers of email. Example: Acme Rocket has 400 domains, of which 1 is the primary domain of the tenant. Microsoft injects the primary domain of the tenant into the headers of outgoing email, and we compare that header against the known domains. If there is a match, we release the email; if there is no match, we do not release it. The problem was that our initial deployment pulled only 100 domains, in no particular order. If the primary domain was not in this list of 100, all outbound email was withheld.
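The effect of the 100-domain limit can be illustrated with a minimal Python sketch. This is not Checkpoint's code: the function should_release, the limit parameter, the example domain names, and the position of the primary domain in the list are all hypothetical and chosen only to reproduce the behavior described above.

```python
from typing import Iterable, Optional


def should_release(header_domain: str,
                   tenant_domains: Iterable[str],
                   limit: Optional[int] = None) -> bool:
    """Return True if the header domain matches a known tenant domain.

    Hypothetical helper: `limit` mimics the cap described in the RCA.
    When set, only an arbitrary subset of the domains is compared.
    """
    domains = list(tenant_domains)
    if limit is not None:
        # Only the first `limit` entries are kept. The list order is
        # arbitrary, so the primary domain may or may not be included.
        domains = domains[:limit]
    known = {d.lower() for d in domains}
    return header_domain.lower() in known


# Assumed example data: a tenant with 400 domains, one of them primary.
tenant_domains = [f"brand{i}.example" for i in range(399)]
tenant_domains.insert(250, "acmerocket.example")  # primary, position is arbitrary

# With the 100-domain cap, the primary domain is missed and mail is withheld.
limited = should_release("acmerocket.example", tenant_domains, limit=100)

# With the cap removed, every domain is compared and mail is released.
full = should_release("acmerocket.example", tenant_domains)

print(limited, full)
```

In this sketch the capped check withholds the email only because the primary domain falls outside the subset that was pulled. A tenant with 100 or fewer domains would not be affected, which is consistent with the delays being limited to customers with many domains.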
Corrective Actions
We resolved this situation by removing the 100-domain limit so that all domains are included, and deployed this change to our environment. Once this change had been made, all email that had been withheld was validated as legitimate and then released. No data was lost; however, the previously withheld email was not scanned.
27AUG24
Summary
Visible periodic email delays for the customer: We hit a software capacity issue that we had not encountered before. Part of the architecture around the ‘milter’ handling had considerable room for a more efficient design. We viewed this as a ‘growing pains’ scenario.
Root Cause Analysis
Part of the way we scale up our infrastructure is to temporarily create new virtual machines to match incoming load. These machines are created pre-configured and immediately go into service. This helps distribute load and traffic, with the end goal of maintaining service to our customers. The way we used certain micro-databases in this scaling process had become obsolete and inefficient, and resulted in the milter being overwhelmed with traffic.
Actions Taken
As a temporary measure, we have increased resources to give these processes a great deal of headroom. This will allow for uninterrupted service while we implement a change.
Corrective Actions
Our R&D Team is in the process of re-designing the architecture so that the scaling process of creating new machines will be more efficient and data will be accessed in a more linear manner, eliminating the bottlenecks that had been experienced. The need for this action was due to the growing customer base coupled with growing demand. This re-design should be complete and implemented by the end of September 2024.