Why do data centers never go down?

How do data centers work 24 X 7? Why do they never fail? It is not that devices in a data center run for years without failing. One device or another, or some part of it, does fail. So how do data centers stay available throughout the year?

This is achieved by implementing redundancy in the data center. Redundancy is implemented at various levels: power, cooling, IT equipment, parts of equipment, internet links and even the whole data center itself. Although redundancy improves the availability of data center services, it also adds cost. However, the availability of data center services is considered more important than the cost involved.

Redundancy (or resiliency) means duplicating the critical components of a system to improve its reliability and availability, so that the function performed by the system suffers minimal outage. As noted above, a data center hosts critical devices that are expected to run 24 X 7, 365 days a year, so that users do not encounter service outages. Redundancy is therefore a very important part of data center design.

Redundancy in a data center is achieved through the following implementations inside the data center infrastructure:

  • Device level redundancy
  • Part level redundancy
  • Power level redundancy
  • Redundant cooling
  • Redundant internet links
  • Device monitoring
  • Data center level redundancy

Let us dive deep into each of these options to understand them better.

Device Level Redundancy:

Figure: Server cluster

This image shows a simple example of redundancy implemented at device level. Two sets of servers are interconnected on a network. If one device fails, the other continues to provide services, and outside users never know that any data center equipment has failed. This type of setup is also called clustering (grouping) of devices. The requirement here is for only one server, but we have spent additional money to purchase a second set. The same services run on both sets of devices, which are interconnected on the internal network. Each server has its own internal IP address, but the cluster presents itself to external users as a single virtual device with a single IP address. Outside users reach only this virtual IP of the cluster, so they never find out if one device has gone down or its own IP is unreachable.

This level of redundancy is called "N+1" redundancy, where "N" denotes "need", i.e. the required number of servers, and "+1" denotes one additional server that acts as backup. Similarly we may have N+2 or N+3, depending on requirement and budget. The devices can be configured in "Active-Standby" mode or "Active-Active" mode. All components of a cluster can be in one place, or they can be spread across data centers in another city, state or country, connected over the internet.

Active - Active mode: In this mode both sets of devices in a cluster are active and both keep serving users.

Active - Standby mode: In this mode one set of devices is active and keeps providing services, while the other set stays in standby until the first device fails. The standby device keeps checking the health of the active device over the network. The moment the active device fails, the standby device takes over the task and continues providing services to the outside world.
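
As a rough illustration of the Active-Standby behaviour described above, here is a minimal Python sketch. The names (Node, ActiveStandbyCluster, max_missed), the address 10.0.0.100 and the rule of failing over after three missed health checks are all assumptions made for this example; real cluster software differs in detail.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool = True


class ActiveStandbyCluster:
    """Two nodes behind one virtual IP; only the active node serves requests."""

    def __init__(self, virtual_ip, active, standby, max_missed=3):
        self.virtual_ip = virtual_ip
        self.active = active
        self.standby = standby
        self.max_missed = max_missed  # missed health checks tolerated before failover
        self.missed = 0

    def heartbeat(self):
        """One health check of the active node, as run by the standby node."""
        if self.active.healthy:
            self.missed = 0
            return
        self.missed += 1
        if self.missed >= self.max_missed and self.standby.healthy:
            # Failover: the standby becomes active and answers on the virtual IP.
            self.active, self.standby = self.standby, self.active
            self.missed = 0

    def serve(self, request):
        if not self.active.healthy:
            raise RuntimeError("service outage: active node is down")
        return f"{self.virtual_ip} ({self.active.name}) handled {request}"


cluster = ActiveStandbyCluster("10.0.0.100", Node("server-a"), Node("server-b"))
print(cluster.serve("GET /"))      # served by server-a

cluster.active.healthy = False     # server-a fails
for _ in range(3):
    cluster.heartbeat()            # standby notices three missed checks

print(cluster.serve("GET /"))      # served by server-b on the same virtual IP
```

The user always talks to the same virtual IP; only the name in brackets changes after the failover.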

Part Level Redundancy: 

Servers, storage, network switches and almost all IT equipment used in a data center also have part level redundancy, i.e. they contain duplicate components that perform the same function. For example, servers have more than one SMPS (switched-mode power supply), typically two, four or six depending on the type of server. They have more than one processor: 2, 4, 8, 16 or even more. They have multiple hard drives, multiple network ports, multiple management ports and multiple cooling fans. These additional components share the load while all of them are in good condition, which means they are designed to work in Active-Active mode. The moment any component fails, the remaining good components take over its load, each carrying a larger share. For example, if one SMPS fails, the server continues to run on the other SMPS without interruption. At the same time, an error is indicated on the server and an alert is sent to the monitoring tool, which keeps checking the health of every device on the network. Such alerts help the data center engineer take corrective action, replace the defective part and bring the server back to a healthy condition. As long as a server has a failed component but is still working on its redundant components, it is said to be working in "Degraded Mode"; when all parts are in good condition, the server is said to be "Healthy".

Important: Except for hard disks, redundancy of all other server parts is managed automatically by the BIOS of the motherboard without any external configuration. To get redundancy at hard disk level, one needs to configure RAID (Redundant Array of Independent/Inexpensive Disks) before loading the operating system; otherwise, even if a server has multiple hard disks, they will not work in redundant mode.

Similarly, switches, storage and other devices have more than one SMPS, so that they keep running when one fails. They too have multiple network and uplink ports, so that they continue to work if any one port fails.

There are still a few small devices that come with a single power supply. Such devices are either deployed in pairs in active-standby mode or powered through another power device, i.e. an ATS (Automatic Transfer Switch) or STS (Static Transfer Switch). Details of this will be shared in upcoming posts.
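
The difference between "Healthy" and "Degraded Mode" can be sketched in a few lines of Python. The function name server_state, the part names and the counts below are hypothetical and chosen only for illustration; the extra "Down" state is an assumption for the case where too few parts are left to run at all.

```python
def server_state(components):
    """Classify a server from its part counts.

    components maps a part name to (installed, working, minimum_required).
    """
    state = "Healthy"
    for installed, working, required in components.values():
        if working < required:
            return "Down"          # not enough working parts left to run
        if working < installed:
            state = "Degraded"     # still running, but a redundant part has failed
    return state


server = {
    "smps": (2, 2, 1),
    "fans": (6, 6, 4),
    "network_ports": (4, 4, 1),
}
print(server_state(server))        # Healthy

server["smps"] = (2, 1, 1)         # one power supply has failed
print(server_state(server))        # Degraded

server["smps"] = (2, 0, 1)         # both power supplies have failed
print(server_state(server))        # Down
```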

Power Level Redundancy: 

Figure: Power redundancy

We saw that devices have two or more SMPS so that the device keeps running even if one SMPS fails. These SMPS are fed from two different power sources (PDUs, power distribution units) deployed in the server racks. As shown in the diagram, the PDUs on the left and right side at the back of the rack are powered from two different power rails. The rails come from two entirely separate UPS rooms, each with multiple UPS sets and its own set of rechargeable batteries. If the required load of the data center is 800 kVA, then two sets of 800 kVA UPS are used, one in each source. Each source may in turn have another 800 kVA UPS to provide N+1 redundancy at source level, i.e. Source-1 has 800 kVA + 800 kVA of UPS and Source-2 has 800 kVA + 800 kVA of UPS.

If one UPS in source-1 fails, the other UPS in source-1 keeps supporting the data center. If both UPS of source-1 fail, or the supply rail from source-1 itself fails, source-2 keeps supplying power to the data center equipment, which then keeps working on power from one SMPS. The same applies in reverse if source-2 fails while source-1 is active.

Again, power redundancy means spending additional money on multiple UPS. Most of the time either each UPS works at 50% of its capacity while the units share the load, or one UPS is active and the other sits in standby waiting for the first to fail. That depends on how they are configured.

These UPS sources are backed by the AC grid supply and a set of diesel generators (DG). If the grid supply fails, the DG set kicks in to power the UPS.
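
The 800 kVA example can be checked with a little arithmetic. The Python sketch below uses hypothetical function names and makes one simplifying assumption: the data center stays powered as long as at least one source still has enough working UPS capacity to carry the full load by itself.

```python
def data_center_powered(load_kva, sources):
    """sources maps a source name to the ratings (kVA) of its UPS units still working."""
    return any(sum(units) >= load_kva for units in sources.values())


def ups_utilization(load_kva, working_units_kva):
    """Fraction of capacity used when the working units share the load equally."""
    return load_kva / sum(working_units_kva)


LOAD = 800  # kVA, as in the example above

all_healthy = {"source-1": [800, 800], "source-2": [800, 800]}
one_ups_failed = {"source-1": [800], "source-2": [800, 800]}
source_1_lost = {"source-1": [], "source-2": [800, 800]}
everything_lost = {"source-1": [], "source-2": []}

print(data_center_powered(LOAD, all_healthy))      # True
print(data_center_powered(LOAD, one_ups_failed))   # True
print(data_center_powered(LOAD, source_1_lost))    # True
print(data_center_powered(LOAD, everything_lost))  # False

# Two 800 kVA units sharing an 800 kVA load each run at half capacity.
print(ups_utilization(LOAD, [800, 800]))           # 0.5
```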

Redundant Cooling: 

Just as we deploy multiple UPS in one source for redundancy, multiple PACs (Precision ACs) are deployed as a set so that cooling continues if one PAC goes down. Here too we follow the same concept of N+1 or N+2 redundancy. If 100 Ton of AC capacity is required to cool the data center hall, we deploy two sets of ACs (for example 5 units x 20 Ton, or as per available capacity), each set able to deliver the full 100 Ton. Each set runs for 12 hours of the 24-hour cycle, giving the other set a rest. A few more ACs are kept to kick in if one or two ACs in a set fail and that set can no longer deliver 100 Ton of cooling.

There are also PACs that can connect to the central chiller of the tower or campus. Even if the compressors of all PACs fail, they can maintain the data center temperature using chilled water supplied from the chiller. In this way we can keep adding redundancy depending on the budget in hand.
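
The cooling numbers above can be worked through in the same way. This is only a sketch: the function names, the labels set-A and set-B and the choice of midnight and noon as changeover times are assumptions for illustration.

```python
import math


def units_needed(required_tons, unit_tons, spares=1):
    """Units to install in one set for N+spares redundancy."""
    return math.ceil(required_tons / unit_tons) + spares


def cooling_ok(required_tons, unit_tons, working_units):
    """True if the working units together still deliver the required cooling."""
    return working_units * unit_tons >= required_tons


def active_set(hour):
    """12-hour rotation: one set runs in the first half of the day, the other in the second."""
    return "set-A" if hour % 24 < 12 else "set-B"


print(units_needed(100, 20, spares=1))   # 6, i.e. 5 units for the load plus 1 spare
print(cooling_ok(100, 20, 5))            # True: 5 x 20 Ton covers 100 Ton
print(cooling_ok(100, 20, 4))            # False: a spare unit must kick in
print(active_set(9), active_set(15))     # set-A set-B
```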

Redundant Internet Link:

So far we have achieved part level, device level, power and cooling redundancy to keep devices running 24 X 7 X 365. To keep the data center connected to the outside world and to other data centers, the internet link, i.e. the ISP (Internet Service Provider) link, must also be available without interruption. So, again, we invest in more than one ISP link, each with the bandwidth required for the data center. These links should come from different service providers with different backbones and local hubs, so that a disturbance on one line does not affect the other ISP. The ISP fiber links should enter the data center campus at points far away from each other, and their paths towards the data center buildings should be kept as far apart as possible, so that civil work inside the campus, such as digging, does not cut both.

Again, neither link is utilized to 100% of its capacity unless the other one fails. Data traffic is either shared across both ISP links or the links are configured in active-standby mode. Even if the standby link carries no traffic, we still have to pay the ISP for it, just to maintain data center redundancy.

Important Note: None of the IT equipment, power equipment, cooling equipment or ISP links is loaded to more than 80% of its full capacity. This ensures there is no overload while they run 24 X 7 X 365.
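
The 80% rule is easy to express as a check. In this Python sketch the equipment names and load figures are made up for illustration, and the function name overloaded is hypothetical.

```python
def overloaded(readings, limit=0.80):
    """Return the names whose load exceeds `limit` of their rated capacity.

    readings maps a name to (current_load, rated_capacity) in the same unit.
    """
    return [name for name, (load, capacity) in readings.items()
            if load > capacity * limit]


readings = {
    "ups-1 (kVA)": (400, 800),          # 50% loaded
    "pac-3 (Ton)": (15, 20),            # 75% loaded
    "isp-link-a (Mbps)": (900, 1000),   # 90% loaded
}
print(overloaded(readings))             # ['isp-link-a (Mbps)']
```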

Device Monitoring:

One important aspect of maintaining a data center's high availability is to monitor all equipment over the network and to configure appropriate alerts on the monitoring system. These alerts trigger when set thresholds are crossed, when any part of an equipment fails or when the equipment itself fails. Monitoring covers power equipment (UPS, PDU, ATS), cooling equipment (PAC) and of course all IT equipment. All these devices have a management port, which is configured with an IP address and connected through a network switch to a central monitoring server. The monitoring server is managed by a command center, which can be in the same building, in another building or at another location altogether. Health status data is fetched from all devices via protocols such as SNMPv1 or SNMPv3 and a few others. Any error is displayed at the command center, emails are sent to the team members who manage the data center, and even SMS messages are sent to alert the team for immediate corrective action. Today, in the age of IoT (Internet of Things), it is easier to keep watch on these health parameters on the go through apps on smartphones.
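
The alerting logic described above might look roughly like the Python sketch below. It does not talk to real devices: the readings are a plain dictionary standing in for data that would be fetched over SNMP or a similar protocol, and the metric names, thresholds, severities and channel lists are all assumptions for this example.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    device: str
    message: str
    severity: str   # "warning" or "critical"


# Assumed routing: every alert reaches the dashboard and email, critical ones also SMS.
CHANNELS = {
    "warning": ["dashboard", "email"],
    "critical": ["dashboard", "email", "sms"],
}


def evaluate(device, reading, thresholds):
    """Turn one device reading into alerts.

    reading = {"metrics": {name: value}, "failed_parts": [part names]}
    thresholds = {metric name: maximum allowed value}
    """
    alerts = []
    for metric, value in reading["metrics"].items():
        limit = thresholds.get(metric)
        if limit is not None and value > limit:
            alerts.append(Alert(device, f"{metric}={value} exceeds threshold {limit}", "warning"))
    for part in reading["failed_parts"]:
        alerts.append(Alert(device, f"part failed: {part}", "critical"))
    return alerts


thresholds = {"inlet_temp_c": 27, "load_pct": 80}
reading = {"metrics": {"inlet_temp_c": 31, "load_pct": 62}, "failed_parts": ["smps-2"]}

for alert in evaluate("server-17", reading, thresholds):
    print(alert.severity, alert.message, "->", CHANNELS[alert.severity])
```

Here the high inlet temperature produces a warning, while the failed SMPS produces a critical alert that is also sent by SMS.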

Data Center Level Redundancy:

This is the highest level of redundancy that can be implemented, and it involves the highest cost: building another data center that replicates the information from the first one and keeps services available even if a complete data center fails or is cut off from the network or from a particular geography by a natural calamity or a major fault (such as accidental damage to an undersea link). For example, Google has around 20+ data centers across the globe. These data centers not only provide redundancy but also give users in different parts of the world a shorter internet path. For example, if someone in India searches for something on Google, the request is processed by a Google data center in India, as long as the required information is available locally, rather than being sent across the internet to a data center in the USA. This reduces the load on the internet links and also the time needed to fetch the data for the user. Such local data centers are also called edge data centers, as they sit at the edge of the network near the users.
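
The idea of serving a user from the nearest data center, and falling back to a distant one when the local site is down, can be sketched as follows. The data center names, the latency figures and the function pick_data_center are invented for this example and do not describe how any real provider routes traffic.

```python
def pick_data_center(data_centers, user_region):
    """Choose the reachable data center with the lowest latency for the user's region."""
    available = [dc for dc in data_centers if dc["up"]]
    if not available:
        raise RuntimeError("no data center available")
    return min(available, key=lambda dc: dc["latency_ms"][user_region])["name"]


data_centers = [
    {"name": "dc-india", "up": True, "latency_ms": {"india": 20, "usa": 220}},
    {"name": "dc-usa", "up": True, "latency_ms": {"india": 230, "usa": 15}},
]

print(pick_data_center(data_centers, "india"))   # dc-india

data_centers[0]["up"] = False                    # the local data center goes down
print(pick_data_center(data_centers, "india"))   # dc-usa
```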
