Abstract:
Today's cloud systems are composed of geographically distributed datacenters interconnected by high-speed optical networks. Disaster failures can severely affect both the communication network and the datacenter infrastructure, and prevent users from accessing cloud services. After large-scale disasters, recovery efforts on both the network and the datacenters may take days and, in some cases, weeks or months. Traditionally, the repair of the communication network has been treated as a separate problem from the repair of datacenters. While past research has mostly focused on network recovery, how to efficiently recover a cloud system while jointly considering the limited computing and networking resources remains an important and open research problem. In this work, we investigate the problem of progressive datacenter recovery after a large-scale disaster failure, given that a network-recovery plan has been made. We explore an efficient recovery plan that determines which datacenters should be recovered at each recovery stage to maximize cumulative content reachability from any source, subject to the limited available network resources. We devise an Integer Linear Program (ILP) formulation to model the associated optimization problem. Our numerical examples using the ILP show that an efficient progressive datacenter-recovery plan can significantly increase the reachability of contents during the network-recovery phase. Compared to a random-recovery strategy, our plan increases the number of important contents reachable in the early stages of recovery, with only a slight increase in resource consumption.