Recovering From an Incident
Recovery is often regarded as an urgent step because it reduces the customer's business losses due to downtime. It involves returning the system to normal by using a predetermined checklist. The checklist is not a set of hard-and-fast rules. Human judgment is a key element of recovery, so the policy should treat the checklist as a flexible process tool: a set of guidelines for the VCSIRT members to follow while executing the policy.
Returning the System to Normal
Bringing business back to normal operations with minimal user inconvenience is critical, particularly when customer systems are involved. All of the organization's personnel, under the supervision of the organization's security officer for the geographic area in question, must keep this in mind as the highest priority after an incident.
One of the surest ways to recover is to perform a full-system restore from trusted media, but the main question is whether the media used for restoration were adequately safeguarded at all times. A full restore is also time-consuming and difficult if multiple systems were compromised. Note that a full restore, including a change of every password, is mandatory if an intruder gained superuser access to systems or networks.
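One way to gain some assurance about restoration media is to compare each backup image with a digest that was recorded when the backup was made and stored separately. The following is a minimal Python sketch of that idea, not a procedure taken from this article. The function names and the manifest layout (a mapping of file name to SHA-256 hex digest) are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup_set(backup_dir, manifest):
    """Compare each file named in the manifest with its recorded digest.

    manifest is a dict mapping file name -> expected SHA-256 hex digest
    (a hypothetical layout, recorded at backup time and kept separately).
    Returns a list of (file name, reason) pairs for every file that fails.
    """
    failures = []
    for name, expected in sorted(manifest.items()):
        candidate = Path(backup_dir) / name
        if not candidate.is_file():
            failures.append((name, "missing"))
        elif sha256_of(candidate) != expected.lower():
            failures.append((name, "digest mismatch"))
    return failures
```

A matching digest shows only that the image has not changed since the digest was recorded. It does not show that the backup was taken before the compromise, so the physical safeguarding of the media still matters.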
Recovering data is critical to your constituent's business, but it is also very tricky. You should keep the following in mind:
You can restore data from the last full backup, even if this is not the ideal solution. If data was modified after the last full backup, you can also apply the incremental backups.
You can use fault-tolerant storage hardware, such as RAID, to recover data that is mirrored or parity-protected on the redundant hard drives. Striping alone provides no redundancy.
You might have to use offsite, safeguarded storage if all of the equipment was compromised. The data might not be current if the last backup was several days or weeks old, which would impact the business; however, this is the real cost of not having redundancy in the storage design, secured offline storage, or highly secured, locked-up storage on site.
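The first point above can be illustrated by selecting a restore chain: the most recent full backup taken before the suspected time of compromise, followed by the incremental backups taken after it and before that time. The Python sketch below is illustrative only. The Backup record and the restore_chain function are hypothetical, and the sketch assumes that each incremental backup contains the changes since the previous backup of any kind.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Backup:
    taken_at: datetime
    kind: str  # "full" or "incremental"
    label: str


def restore_chain(backups, compromised_at):
    """Return the backups to restore, oldest first.

    Only backups taken before compromised_at are trusted. The chain starts
    at the last trusted full backup and includes every trusted incremental
    taken after it. An empty list means no trusted full backup exists.
    """
    trusted = sorted(
        (b for b in backups if b.taken_at < compromised_at),
        key=lambda b: b.taken_at,
    )
    last_full = None
    for index, backup in enumerate(trusted):
        if backup.kind == "full":
            last_full = index
    if last_full is None:
        return []
    return trusted[last_full:]
```

The time of compromise is often uncertain, so a cautious team would choose an earlier cutoff and accept the loss of more recent data.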
Recovery of classified computing systems (such as those of the U.S. Government) is outside the scope of this article. In the U.S., however, the government agency that has jurisdiction over the geographic area in which the constituent resides must be contacted. For example, the New York area has the NYCTF (New York Electronic Crimes Task Force), which is part of a group of regional task forces known as Electronic Crimes Task Forces (http://www.ectaskforce.org).
In the northeastern states of New England, there is the New England Electronic Crimes Task Force (NET). There are also equivalent organizations in the San Francisco Bay area, Chicago, Las Vegas, Los Angeles, Charlotte, and Miami.
In the case of a disaster, such as a highly destructive network intrusion, priority schemes and escalation procedures (for example, what to do first and whom to warn) should be followed before attempting to bring the customer site back to normal. The VCSIRT must take proper precautions to provide backups. Key sets of processes and guidelines for disaster recovery will be presented in a future article.
Much of the process development must take place in the incident preparation stage, as described earlier. The process design usually matures after the worldwide security team has been established. This will be discussed in a future article in this series.
The recovery process, which occurs after the evidence of an attack has been preserved and a secured, clean backup has been restored, needs the involvement and guidance of a forensic expert, who must be engaged by the organization's geo-based security officer. This person will be able to confirm that the eradication, and all of the necessary post-attack data gathering, were successful. The determination of success could also be a team decision involving the forensic expert.
A predetermined checklist (developed and distributed by the organization's worldwide security team for verification) should be used to confirm the return to normalcy before bringing the system online or connecting it to the Internet. The checklist should serve only as a guideline because it cannot anticipate all situations. Security experts in the geo-based security officer's VCSIRT should judge the validity of each step before executing it. At a minimum, the following broad areas of responsibility must be considered, beyond what has already been stated.
The following responsibilities are only a sample. They cannot be considered a complete checklist.
Formally recording all actions
All actions must be recorded, including the dates and the total time required of each person on the VCSIRT.
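A simple structure for such a record might look like the following Python sketch. It is an illustration under stated assumptions, not a prescribed format: the ActionRecord and ActionLog names, and the choice of fields, are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ActionRecord:
    member: str
    action: str
    started: datetime
    finished: datetime


class ActionLog:
    """Append-only log of recovery actions with per-member time totals."""

    def __init__(self):
        self._records = []

    def record(self, member, action, started, finished):
        """Add one action; reject an entry that ends before it starts."""
        if finished < started:
            raise ValueError("finished must not precede started")
        entry = ActionRecord(member, action, started, finished)
        self._records.append(entry)
        return entry

    def total_time_by_member(self):
        """Return a dict mapping each member to the total time logged."""
        totals = defaultdict(timedelta)
        for entry in self._records:
            totals[entry.member] += entry.finished - entry.started
        return dict(totals)
```

Keeping the start and finish times of each action, rather than only a running total, preserves the sequence of events for the lessons-learned documentation.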
Periodically notifying users of status
You must keep all of the users of the constituent's customer site informed of the status of the recovery. Based on the status reports, the customers can address business-critical issues in parallel with the recovery.
Advising on major breakthroughs, setbacks, or developments
If any major breakthrough or development takes place during the course of recovery, it is the responsibility of the organization's designated security officer to communicate it clearly, yet cautiously, to the constituent's business operations management on behalf of the VCSIRT. Not all recoveries are successful. Setbacks should be expected and must be recorded for the lessons-learned documentation. They should also be communicated to the constituent and the VCSIRT.
Adhering to security incident response policy (SIRP) guidelines
As much as possible, the VCSIRT must follow the SIRP guidelines. For example, it must seek legal and PR guidance to protect itself from any liability or undesirable media coverage that can adversely affect the constituent's business.
Patching vulnerabilities and minimizing and/or hardening the system
Patching must be done thoroughly at all levels of software, from operating systems to middleware to applications. Vulnerabilities must be patched on compromised systems, as well as on systems that were unaffected by the attack, especially those that are not up-to-date. The latter is important because those systems could be the targets of the next attack.
Minimization is the process of removing unnecessary components. (Patching could happen during this process.) The process reduces the number of components to be hardened, patched, configured, or reconfigured. Although determining the minimized configuration is time-consuming, it is worth the effort on the systems that are generally the most susceptible targets (for example, external Web servers, firewalls, directory servers, and domain name servers). For guidelines on minimization, refer to: http://www.sun.com/blueprints
Hardening the system against past, and possible future, attacks must be considered. This might mean disabling certain services and modifying configuration files. Vendor packages and scripts might be available (for instance, Sun's Solaris™ Security Toolkit at http://www.sun.com/software/security/jass/).
Removing any interim measures
Administrators use stopgap measures for short-term containment. These measures must be removed before bringing the customer site or system back online. Examples of such measures are turning off Telnet on TCP port 23 or FTP on TCP ports 20 and 21.
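Interim measures are easy to forget once the pressure of containment has passed. One possible safeguard, sketched below in Python, is to register each measure when it is applied and to check that none is outstanding before the system returns to service. The InterimMeasure and InterimMeasureRegister names are hypothetical and are used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class InterimMeasure:
    description: str
    applied_by: str
    removed: bool = False


class InterimMeasureRegister:
    """Tracks stopgap containment measures until each one is removed."""

    def __init__(self):
        self._measures = []

    def apply(self, description, applied_by):
        """Register a measure at the time it is put in place."""
        measure = InterimMeasure(description, applied_by)
        self._measures.append(measure)
        return measure

    def outstanding(self):
        """Return the measures that have not yet been removed."""
        return [m for m in self._measures if not m.removed]

    def ready_for_service(self):
        """True only when every registered measure has been removed."""
        return not self.outstanding()
```

For example, a team might register "Telnet disabled on TCP port 23" during containment, set its removed flag once a permanent decision about the service has been made, and call ready_for_service() as one item on the recovery checklist.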
Announcing the completed recovery
At this point, before the systems are returned to service, there must be an overall determination (even if a detailed forensic analysis has not been completed) of how the security was breached and whether the required reinforcements have been made. It must be the responsibility of the designated geo-based security officer to perform the final check before announcing to the constituent that the system or site is back to normal operation.