This is the last of a 3-part series, "DCIM for High Availability", presented by GreenField Software. It first defines "high availability" and then gives examples of recent high-profile Data Center failures that occurred despite robust designs and extensive built-in redundancy. It also highlights the business impact of Data Center failures.
Data Center topology has changed over the last two decades as a result of the High Availability Mantra, and new tools are required to manage the Modern Data Center effectively. DCIM Software has matured to the point where it is no longer optional. Data Centers of all sizes need to implement DCIM, not just to reduce the risk of Data Center failures but also to contain rising capital costs and operating expenses.
GFS Crane DCIM Software is a good example, as the two DCIM Case Studies in this presentation show.
The following GFS Crane capabilities have been included in this presentation:
- Improved Availability through Predictability, Visibility and Change Tracking.
- Controlling Capex Costs through better visibility of under-utilized capacities, thereby deferring expensive capital expenditures and minimizing stranded capacities.
- Reducing Operating Expenses: real-time monitoring and multi-level PUE help to reduce power costs; automation of processes improves productivity; and rationalization of assets reduces AMC and space rentals.
The presentation concludes with two GFS Crane DCIM Case Studies: in Financial Services and Telecom verticals.
GreenField Software’s Mission is to help Data Centers control capital expenditures, reduce operating expenses and mitigate the risks of Data Center failures. Besides DCIM Software, GFS offers Data Center Advisory Services in the areas of best practices, capacity planning, energy efficiency and business continuity of data centers.
2. 2
Today’s Topics
• High Availability Mantra Revisited
• Anatomy of a DCIM Software: GFS Crane
• How GFS Crane DCIM Delivers Higher Availability
• How GFS Crane DCIM Helps to Reduce Costs
• GFS Crane DCIM Case Studies
3. 3
The High Availability Mantra Revisited
Amazon Data Centers (built to Tier 4 standards and with an expected availability of 99.995%) had two
outages in 2012 – each over 3 hours!
• Tier 3/Tier 4 just defined by hardware redundancies
• Glaring gaps in operating procedures to prevent fatal human errors
• Lack of purpose-built BCP software to predict failures
• Lack of chain of custody to detect root cause
Availability %               Downtime per year   Downtime per month*   Downtime per week
99% ("two nines")            3.65 days           7.20 hours            1.68 hours
99.5%                        1.83 days           3.60 hours            50.4 minutes
99.8%                        17.52 hours         86.23 minutes         20.16 minutes
99.9% ("three nines")        8.76 hours          43.8 minutes          10.1 minutes
99.95%                       4.38 hours          21.56 minutes         5.04 minutes
99.99% ("four nines")        52.56 minutes       4.32 minutes          1.01 minutes
99.999% ("five nines")       5.26 minutes        25.9 seconds          6.05 seconds
99.9999% ("six nines")       31.5 seconds        2.59 seconds          0.605 seconds
99.99999% ("seven nines")    3.15 seconds        0.259 seconds         0.0605 seconds
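The downtime figures in the table follow from simple arithmetic: the unavailable fraction multiplied by the length of the period. The sketch below is illustrative only. The function names are hypothetical, and it assumes a 365-day year, a 30-day month and a 7-day week, so monthly results may differ slightly from the table depending on the month length and rounding used there.

```python
# Illustrative sketch: convert an availability percentage into a downtime budget.
# Assumptions (not from the presentation): 365-day year, 30-day month, 7-day week.

SECONDS_PER_DAY = 24 * 60 * 60


def downtime_seconds(availability_pct: float, period_days: float) -> float:
    """Return the allowed downtime, in seconds, over a period of period_days."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * period_days * SECONDS_PER_DAY


def downtime_row(availability_pct: float) -> dict:
    """Return downtime per year, month and week, in seconds."""
    return {
        "year": downtime_seconds(availability_pct, 365),
        "month": downtime_seconds(availability_pct, 30),
        "week": downtime_seconds(availability_pct, 7),
    }


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        row = downtime_row(pct)
        print(
            f"{pct}%: {row['year'] / 3600:.2f} h/year, "
            f"{row['month'] / 60:.2f} min/month, "
            f"{row['week'] / 60:.2f} min/week"
        )
```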
4. 4
Did You Know?
90% of DC Failures Are From Common Preventable Causes
5. 5
Did You Know?
Average Downtime of an Online System: 36 hours per annum.
That’s only 99.6% Uptime
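The uptime figure can be checked with the inverse calculation: downtime hours divided by the hours in a year. This is a minimal sketch with a hypothetical function name, assuming a non-leap year of 8,760 hours.

```python
# Illustrative sketch: derive an uptime percentage from yearly downtime.
# Assumption (not from the presentation): a non-leap year of 365 * 24 hours.

HOURS_PER_YEAR = 365 * 24  # 8760


def uptime_pct(downtime_hours_per_year: float) -> float:
    """Return uptime as a percentage of the year."""
    if not 0 <= downtime_hours_per_year <= HOURS_PER_YEAR:
        raise ValueError("downtime must be between 0 and 8760 hours")
    return 100.0 * (1.0 - downtime_hours_per_year / HOURS_PER_YEAR)


if __name__ == "__main__":
    # 36 hours of downtime per year rounds to 99.6% uptime.
    print(f"{uptime_pct(36):.2f}%")
```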
6. 6
Did You Know?
75% of Businesses Without a BC Plan Fail Within 3 Years after a Major
Disruption in their IT Systems
8. 8
Improves Availability: Predictability, Visibility & Change Tracking
Predictive Analytics: Advanced Alarm Management and analytics help in failure predictability, faster turn-around-time, improved availability and SLA compliance. Consolidation of alarms from different facilities helps in centralized monitoring.
Visibility from Power Chain: Improved visibility of the power chain and of the relationships among critical components of the infrastructure helps in better impact analysis of a device malfunction or failure, and in doing RCA.
Change Tracking: Change Tracking in the data center environment helps in impact analysis of any change and in root cause analysis of any outage caused by a change.
9. 9
Improves Availability: Predictability from Proactive Alarms
Proactive Real-time Alarms
• Alarms on power, PUE and environmental conditions such as temperature, humidity, smoke, fire, WLD, door-open and motion
• Alarms can be sent by e-mail and SMS
Alarm Dashboard
• Alarms from multiple data centers are consolidated on a dashboard
• Alarms can be analyzed by severity, type, source, duration, etc.
Advanced Alarm Management helps in failure predictability, faster turn-around-time, improved availability and SLA compliance
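Threshold-based alarms and a consolidated summary across sites can be sketched as follows. This is not GFS Crane's implementation or API. The class, function names, site names and threshold values are all hypothetical and chosen only to illustrate the idea.

```python
# Illustrative sketch of proactive alarms and a consolidated dashboard summary.
# All names and thresholds are hypothetical; this is not the GFS Crane API.
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Alarm:
    site: str      # data center that raised the alarm
    source: str    # sensor or device name
    kind: str      # e.g. "temperature", "humidity", "power"
    severity: str  # "warning" or "critical"


def check_reading(site: str, source: str, kind: str, value: float,
                  warn_at: float, critical_at: float) -> Optional[Alarm]:
    """Return an alarm when a reading crosses a threshold (higher is worse)."""
    if value >= critical_at:
        return Alarm(site, source, kind, "critical")
    if value >= warn_at:
        return Alarm(site, source, kind, "warning")
    return None


def summarize(alarms: Iterable[Alarm]) -> dict:
    """Count alarms from all sites by severity, kind and site."""
    alarms = list(alarms)
    return {
        "by_severity": Counter(a.severity for a in alarms),
        "by_kind": Counter(a.kind for a in alarms),
        "by_site": Counter(a.site for a in alarms),
    }


# (site, source, kind, value, warn_at, critical_at)
readings = [
    ("DC-1", "rack-07-temp", "temperature", 31.0, 27.0, 32.0),
    ("DC-1", "rack-09-temp", "temperature", 33.5, 27.0, 32.0),
    ("DC-2", "room-a-humidity", "humidity", 45.0, 60.0, 70.0),
    ("DC-2", "ups-1-load", "power", 82.0, 80.0, 90.0),
]
alarms = [a for a in (check_reading(*r) for r in readings) if a is not None]
summary = summarize(alarms)
```

Raising a warning before the critical threshold is reached is what makes the alarm proactive: operators see the trend while there is still time to act.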
10. 10
Improves Availability: Visibility from Power Chain
Maps relationships among critical components of the electrical infrastructure
• Creates a power chain for the electrical infrastructure
• Maps asset relationships and redundancies, from the power source through to customers and applications
Asset Relationship Mapping
Improved visibility of the power chain and of the relationships among critical components of the infrastructure helps in better impact analysis of a device malfunction or failure, and in doing root cause analysis
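Impact analysis over a power chain can be thought of as a walk over a dependency map that respects redundancy: an asset goes down only when every one of its feeds is down. The sketch below assumes a hypothetical `feeds` mapping and made-up asset names. It is not how GFS Crane models the power chain.

```python
# Illustrative sketch of impact analysis on a power chain with redundancy.
# The feeds mapping and asset names are hypothetical.
from typing import Dict, List, Set


def affected_assets(feeds: Dict[str, List[str]], failed: Set[str]) -> Set[str]:
    """Return assets that lose power once the failed devices are down.

    feeds maps each asset to the upstream devices that can power it.
    An asset with no listed feeds is treated as a power source.
    """
    down = set(failed)
    changed = True
    while changed:
        changed = False
        for asset, upstream in feeds.items():
            if asset in down or not upstream:
                continue
            # Redundant feeds: the asset fails only if all feeds are down.
            if all(u in down for u in upstream):
                down.add(asset)
                changed = True
    return down - set(failed)


feeds = {
    "utility": [],
    "generator": [],
    "ups-a": ["utility", "generator"],
    "ups-b": ["utility", "generator"],
    "pdu-a": ["ups-a"],
    "pdu-b": ["ups-b"],
    "rack-1": ["pdu-a", "pdu-b"],  # dual-corded rack
    "rack-2": ["pdu-a"],           # single-corded rack
}

single_ups_failure = affected_assets(feeds, {"ups-a"})
double_ups_failure = affected_assets(feeds, {"ups-a", "ups-b"})
utility_failure = affected_assets(feeds, {"utility"})
```

In this example the loss of one UPS takes down only the single-corded rack, which is the kind of hidden single point of failure that relationship mapping is meant to expose.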
11. 11
Improves Availability: Change Tracking
Maintains an audit trail of all Installation/Move/Add/Change activity in the data center
Integration with an existing ITSM tool enables the tracked changes to be run through a workflow system for change approvals
Audit Trail of DC Configuration Changes
Tracking changes in the data center environment helps in impact analysis of any change and in root cause analysis of any outage caused by a change
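One way an audit trail supports root cause analysis is by listing the changes made to affected assets shortly before an outage. The following is a minimal sketch under stated assumptions: the record fields, the 24-hour lookback window and the sample data are hypothetical and do not come from GFS Crane.

```python
# Illustrative sketch: use a change audit trail to shortlist suspect changes.
# Record fields, the lookback window and the sample data are hypothetical.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Set


@dataclass(frozen=True)
class ChangeRecord:
    when: datetime
    asset: str
    action: str       # "install", "move", "add" or "change"
    approved_by: str


def suspect_changes(audit_trail: Iterable[ChangeRecord],
                    outage_start: datetime,
                    affected: Set[str],
                    lookback: timedelta = timedelta(hours=24)) -> List[ChangeRecord]:
    """Return changes to affected assets within the lookback window, newest first."""
    window_start = outage_start - lookback
    hits = [
        c for c in audit_trail
        if window_start <= c.when <= outage_start and c.asset in affected
    ]
    return sorted(hits, key=lambda c: c.when, reverse=True)


trail = [
    ChangeRecord(datetime(2024, 6, 10, 9, 0), "pdu-a", "move", "ops-lead"),
    ChangeRecord(datetime(2024, 6, 14, 22, 0), "pdu-a", "change", "ops-lead"),
    ChangeRecord(datetime(2024, 6, 14, 23, 30), "rack-9", "add", "ops-lead"),
]
suspects = suspect_changes(trail, datetime(2024, 6, 15, 1, 0), {"pdu-a", "rack-2"})
```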
12. 12
Reduces Cost: Capex & Opex
Reduces Capex
• Better visibility helps discover under-utilized computing capacities, which defers capex purchases
• Better visibility helps avoid stranded capacities in rack space and power use, maximizing utilization of available capacities
Reduces Opex
• Better monitoring and analytics reduce operating cost on power
• Automation of processes like Asset Tracking, Provisioning and Monitoring improves productivity
• Rationalizing the asset base lowers maintenance costs such as equipment AMC
13. 13
Reduces CapEx: Monitoring IT Utilization
Visibility of hidden compute capacity
• Calculates the average utilization of all computing devices in the data center
• Identifies the unused compute capacity
Under-utilized servers can be repurposed
• ‘Repurpose Candidates’ are identified from power consumption and utilization patterns, hardware specs and age, which helps in deferring new server hardware purchases
Hidden Computing Capacity
Repurpose Hardware
Discovery of hidden compute capacity defers capital investment in new server hardware and software licenses
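The selection of repurpose candidates can be sketched as a filter over utilization, power draw and age. The criteria and thresholds below are assumptions made for illustration. The presentation does not state the rules GFS Crane actually applies, and the server data is invented.

```python
# Illustrative sketch: find under-utilized servers that could be repurposed.
# Thresholds, field names and sample data are hypothetical.
from dataclasses import dataclass
from statistics import mean
from typing import List, Sequence


@dataclass(frozen=True)
class Server:
    name: str
    avg_cpu_pct: float    # average CPU utilization over the observation period
    avg_power_w: float    # average measured power draw
    rated_power_w: float  # nameplate power from the hardware specs
    age_years: float


def average_utilization(servers: Sequence[Server]) -> float:
    """Return the mean CPU utilization across all servers."""
    return mean(s.avg_cpu_pct for s in servers)


def repurpose_candidates(servers: Sequence[Server],
                         max_cpu_pct: float = 10.0,
                         max_power_ratio: float = 0.4,
                         max_age_years: float = 5.0) -> List[str]:
    """Return names of lightly loaded servers that are young enough to reuse."""
    return [
        s.name for s in servers
        if s.avg_cpu_pct <= max_cpu_pct
        and s.avg_power_w / s.rated_power_w <= max_power_ratio
        and s.age_years <= max_age_years
    ]


servers = [
    Server("app-01", 6.0, 120.0, 400.0, 2.0),
    Server("db-01", 65.0, 310.0, 400.0, 3.0),
    Server("legacy-01", 4.0, 110.0, 350.0, 8.0),
]
fleet_average = average_utilization(servers)
candidates = repurpose_candidates(servers)
```

Here `legacy-01` is idle but too old under the assumed age limit, so it would be a replacement candidate rather than a repurpose candidate.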
14. 14
Reduces Capex: Minimizing Stranded Capacities
Visibility of consumed power against max capacity in a rack
• Provides real-time information on actual IT load in a rack
• Provides maximum power capacity
• Provides available power capacity
Visibility of occupied rack space against max available space
• Provides real-time information on occupied space in the rack in RU
• Provides maximum space capacity
• Provides available space capacity
Hidden Power Capacity
Hidden Space Capacity
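Available capacity per rack is the maximum minus what is in use, and capacity becomes stranded when one resource runs out while the other is still free. The sketch below uses hypothetical field names, thresholds and rack data to show the calculation.

```python
# Illustrative sketch: available power and space per rack, plus a stranded check.
# Field names, thresholds and sample racks are hypothetical.
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rack:
    name: str
    max_power_kw: float  # maximum power capacity of the rack
    it_load_kw: float    # actual IT load measured in real time
    total_ru: int        # maximum space capacity in rack units
    occupied_ru: int     # occupied space in rack units

    @property
    def available_power_kw(self) -> float:
        return self.max_power_kw - self.it_load_kw

    @property
    def available_ru(self) -> int:
        return self.total_ru - self.occupied_ru


def stranded_resource(rack: Rack, min_power_kw: float = 0.5,
                      min_ru: int = 1) -> Optional[str]:
    """Return which resource is stranded in the rack, if any.

    "space": free rack units exist but there is too little power to use them.
    "power": spare power exists but there is no room for more equipment.
    """
    has_power = rack.available_power_kw >= min_power_kw
    has_space = rack.available_ru >= min_ru
    if has_space and not has_power:
        return "space"
    if has_power and not has_space:
        return "power"
    return None


racks = [
    Rack("R1", 6.0, 5.75, 42, 20),
    Rack("R2", 6.0, 2.0, 42, 42),
    Rack("R3", 6.0, 3.0, 42, 21),
]
report = {r.name: stranded_resource(r) for r in racks}
```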
15. 15
Reduces OpEx: Power Costs
Multi-level PUE Comparison
Compares PUE calculated at multiple levels and identifies power distribution losses that can be rectified to improve efficiency and reduce OpEx on power
Detect Power Distribution Loss
• L1 PUE: UPS output
• L2 PUE: PDU output
• L3 PUE: device-level reading
Detection of power distribution losses in the electrical infrastructure helps in improving the energy efficiency of the data center and reducing operating cost on power
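PUE is total facility power divided by IT load, so measuring the IT load at the UPS output, the PDU output and the devices gives three PUE values, and the differences between the measurement points are the distribution losses. The sketch below assumes hypothetical function names and invented readings in kW.

```python
# Illustrative sketch: PUE at three measurement levels and distribution losses.
# Function names and the sample readings (in kW) are hypothetical.


def pue(total_facility_kw: float, it_load_kw: float) -> float:
    """Power Usage Effectiveness: total facility power / IT load."""
    if it_load_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_load_kw


def multi_level_pue(total_facility_kw: float, ups_output_kw: float,
                    pdu_output_kw: float, device_kw: float) -> dict:
    """Return PUE with the IT load measured at each level."""
    return {
        "L1": pue(total_facility_kw, ups_output_kw),  # IT load at UPS output
        "L2": pue(total_facility_kw, pdu_output_kw),  # IT load at PDU output
        "L3": pue(total_facility_kw, device_kw),      # IT load at the devices
    }


def distribution_losses(ups_output_kw: float, pdu_output_kw: float,
                        device_kw: float) -> dict:
    """Return power lost between consecutive measurement points, in kW."""
    return {
        "ups_to_pdu_kw": ups_output_kw - pdu_output_kw,
        "pdu_to_device_kw": pdu_output_kw - device_kw,
    }


levels = multi_level_pue(180.0, 100.0, 96.0, 90.0)
losses = distribution_losses(100.0, 96.0, 90.0)
```

Because some power is lost at each stage, the PUE rises from L1 to L3. A large gap between two levels points to the segment of the power chain where losses are concentrated.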
16. 16
Reduces Opex: Process Automation & Improved Productivity
• Automated discovery and inventory of both IT and infrastructure assets
• Intelligent assets are discovered automatically using SNMP/IPMI
• Manufacturer Repository contains information on static attributes of assets
• Asset data can be imported from spreadsheets or an asset management tool
• Single management console to manage IT and non-IT assets
• Maintenance management for assets is done using plug-ins that send scheduler-based proactive alerts
• Workflow-based auto-provisioning improves speed and reduces errors
Advanced Asset Management
17. 17
Reduces Opex: Asset Rationalization
• Asset Management module tracks and maintains an inventory of all assets (IT and non-IT) in the data center
• Helps identify legacy servers and replacement candidates
• Reduces AMC and space rentals
[Diagram labels: Asset Rationalization, Server Virtualization, Capacity Planning, Data Center Consolidation, GFS Crane DCIM, Legacy Data Center, Server & Rack Consolidation, Multiple Data Centers]
18. 18
How GFS Crane DCIM Helps
• Helps the Data Center Manager avoid unnecessary over-provisioning
• Helps plan investments and new capacity
• Helps reduce capital costs
• Helps reduce power use and other operating costs
• Helps reduce the risk of failures through critical alerts
• Helps adapt to technical and business change more easily
• Helps drive improvement plans through real-time metrics and dashboards
19. 19
GFS Crane DCIM Case Study 1: Financial Services
Industry: Project Financing & Mutual Funds
Data Center Location: India
Data Center Details: Tier III certified by 451 Research; Energy Efficient ‘green’ Data Center certified by TÜV Rheinland
DCIM Implementation Date: January 2012
Business requirements driving DCIM implementation:
• Improve energy efficiency through better energy management
• Comply with Green Grid recommendations and adopt best practices in data center operations
• Improve data center availability and meet business SLA through better monitoring, failure prediction and faster turn-around-time
Integration Touch Points:
• Power Systems: LT transformer panels, UPS, PDUs and Distribution Panels, BUSBAR panels, Multifunction Energy Meters
• Environmental Systems: PAC units, temperature and humidity probes
• Servers, Network devices, Storage devices
• Siemens Building Management System
20. 20
GFS Crane DCIM Case Study 2: Telecom
Industry: Mobile Operator
Data Center Location: South Asia
Data Center Details: Multiple data centers spread across 4 locations, covering 8,500 sq.ft. of whitespace and housing 320 racks
DCIM Implementation Date: Ongoing
Business requirements driving DCIM implementation:
• Improve data center efficiency through better energy management
• Improve operational efficiency through better asset management, capacity planning and converged infrastructure monitoring capability
• Improve data center availability and meet business SLA through better monitoring, failure prediction and faster turn-around-time
Integration Touch Points:
• Power Systems: LT transformer panels, UPS, AC & DC PDUs and Distribution Panels, BUSBAR panels, Multifunction Energy Meters
• Environmental Systems: PAC units, temperature and humidity probes
• Diesel generator, flow and level sensors
• IBM Netcool (ITSM), VESDA, ACS and IP Surveillance