Archiving has changed significantly since the early days of computing, when it meant moving data on tape to a remote facility for long-term storage. With all of today's archiving options, it can mean something as easy as automatically archiving email messages or as burdensome as the traditional physical drop-off and pick-up of tapes at an off-site storage facility.
While the word “archive” suggests data will be stored for a very long time, that timeframe can vary by industry. For example, most financial data requires seven years of archives, pharmaceutical research can require 20 years, and some medical and nuclear records have to be kept for 50 years. In general, keeping data on spinning or spin-down disk for 10 years or more is very expensive. It's also difficult to predict what kind of archiving technology will exist 10 years from now, so for our purposes a “long time” for cloud archiving is between one and seven years.
Price vs. performance
Cloud-based archiving opens the possibility of a “just right” balance between cost and accessibility. Tape has been, and remains, far and away the lowest cost method of storing data for years. A typical LTO tape holding approximately 1 TB of data costs roughly $35 with monthly off-site storage in the range of 25 cents per month. There’s no way for even the cheapest cloud disk to compete with this price. On the downside, the normal retrieval time for a tape from archive is next-day delivery plus the time needed to mount and restore it. This means users will wait about a business day before being able to access the information requested.
Archive vs. backup
While many IT shops still consider their old backup tapes to be “archives,” there are specific use cases and access requirements that distinguish archives from backup data. Backups are done to protect data that’s currently in use; if data has to be restored from a backup, it generally happens shortly after that backup was made. Backup data typically has a short shelf life.
Archives are sets of data that will be retained for a long period of time for regulatory compliance, corporate governance or use as intellectual property. Archives are accessed infrequently, but are searchable so specific data can be recovered relatively quickly and easily.
The Storage Networking Industry Association makes a distinction between cloud backups and cloud archiving services: “Whereas with Cloud Backup the cloud is simply a repository of backup data, with Cloud Archive and Preservation, the Cloud is where the active processes occur that ensure long term retention, preservation and viability of data.”
Cloud storage, on the other hand, starts at approximately 10 cents per GB per month (depending on volume). This adds up when contemplating hundreds of TBs, but it’s still often less than the cost to procure, deploy and manage arrays in a central data center. Whereas tape retrieval is measured in business days, data hosted on cloud-based storage can be accessed in seconds. For some apps, this may be the ideal tradeoff between price and performance.
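To make the price gap concrete, here’s a back-of-the-envelope comparison using the illustrative figures above. The 100 TB volume and seven-year retention period are assumptions chosen for the example, not vendor quotes:

```python
# Rough archiving cost sketch using the article's illustrative figures:
#   - LTO tape: ~$35 per 1 TB cartridge, ~$0.25/month off-site vaulting
#   - Cloud:    ~$0.10 per GB per month
# Assumed scenario: 100 TB retained for seven years.
TB = 100
MONTHS = 7 * 12

tape_cost = TB * 35 + TB * 0.25 * MONTHS    # media purchase + vault fees
cloud_cost = TB * 1000 * 0.10 * MONTHS      # per-GB metered hosting

print(f"Tape over 7 years:  ${tape_cost:,.0f}")
print(f"Cloud over 7 years: ${cloud_cost:,.0f}")
```

At these rates the tape archive costs a few thousand dollars while the cloud hosting runs into the hundreds of thousands, which is why raw storage price alone will never favor cloud; the tradeoff only makes sense once retrieval speed is factored in.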
Cloud advantages, disadvantages
Before going all-in on cloud archiving, however, IT needs to weigh the virtues of cloud against in-house archiving. Technologically, cloud providers can’t offer anything that can’t be implemented in-house. So a company may, for example, choose to implement a tiered storage infrastructure with tier 3 high-capacity SATA disk to achieve a lower average cost per GB stored. Generally, organizations will lean toward an in-house solution if they can’t risk the loss of connectivity to a remote location, have regulatory requirements that demand strict data security oversight or have data retrieval requirements where remote latency would be unacceptable. This is a fairly restrictive list, but there are still many applications that are candidates for cloud-based archiving.
IT organizations can quantify the logistical effort to migrate to the cloud, but shouldn’t overlook a predictable yet often underestimated challenge: a mind shift from a technology-centric perspective to a service-level management perspective. IT staff used to making technology choices and deployments often want to delve into the cloud vendor’s architecture and “suggest” product- or technology-specific implementations. Rarely are such requests warranted, as the vendor maintains full responsibility for managing the cloud infrastructure. IT departments really shouldn’t be concerned with the underlying technology, provided contractual service levels are met. With experience, staff attention will gradually shift from low-level details to higher-level governance.
Service is a critical factor in cloud-based archiving
Service-level management, then, is critical to the initial decision for cloud-based archiving as well as ongoing operations. When shopping for a cloud archival vendor, consider the following service-level issues:
Uptime. For most applications, three nines or four nines of availability are sufficient to meet business requirements. If you need five nines, you probably have data access requirements that aren’t conducive to an archive tier. Data hosted in an archive tier is, by definition, non-critical. The uptime requirement largely determines how much infrastructure the vendor must provision, so it has a big impact on the hosting cost. Don’t guess; determine the actual hours when data will be accessed, access patterns and cost of downtime. These calculations can be compared to the cost of various uptime guarantees, and easily justified or rejected based on the comparison. Vendors often offer hosting-fee rebates or other performance penalties for missing cloud storage service-level agreements (cloud SLAs). However, the caveats are contained in the fine print, so read them.
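One way to avoid guessing is to translate each availability guarantee into the downtime it actually permits per year. The short sketch below does that arithmetic for three, four and five nines:

```python
# Convert an availability guarantee ("nines") into allowed downtime.
# A year is taken as 365 days for simplicity.
HOURS_PER_YEAR = 365 * 24

for nines, availability in [(3, 0.999), (4, 0.9999), (5, 0.99999)]:
    downtime_minutes = HOURS_PER_YEAR * (1 - availability) * 60
    print(f"{nines} nines allows ~{downtime_minutes:.0f} minutes of downtime/year")
```

Three nines permits roughly eight and three-quarter hours of downtime a year; five nines permits only about five minutes. Comparing those figures against the real cost of archive-tier downtime usually shows how rarely the five-nines premium is justified.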
Accessibility. Accessibility and uptime aren’t necessarily the same. The storage may be humming along, but a failed subcomponent can still render an application unavailable. If you need redundancy or multiple redundancy of data links, for example, you’ll have to pay for it, but the alternative may be unacceptable application outages. Make sure service levels encompass end-to-end data availability.
Performance. Quantify how many IOPS your applications require and ensure this number is part of the SLA. IOPS can be measured either as an average or during peak activity. If you demand IOPS guarantees at peak, then you’ll have to pay for the vendor to provision them. Some vendors may offer metered billing, but many organizations don’t like the potential uncertainty of such billing should demand suddenly spike. Most organizations will absorb a certain amount of constrained operation (especially for an archive tier) in return for cost certainty. In this case, the SLA is for guaranteed IOPS, not absolute performance experienced by the end user. If application demands exceed contracted IOPS capacity, it’s rightly the IT organization’s problem; additional IOPS can always be purchased.
Data recoverability. As they do for in-house applications, IT organizations need to specify recovery point objective (RPO) and recovery time objective (RTO) requirements for cloud-based archives. This is related to uptime, but also covers contingencies such as data corruption or a component failure that doesn’t affect overall uptime but impacts individual applications. The vendor should have default values for RPO and RTO, which may be sufficient for an archive tier. Again, don’t guess. Know what kind of data loss and application unavailability the business units can financially tolerate. In many cases, it’s much more than is intuitive.
Disaster recovery (DR). If the cloud archive is used as off-site replicated storage to satisfy data redundancy requirements, it may not be necessary to consider a DR strategy for this tier. But buyer beware: Most hosted storage doesn’t include any DR contingency. If the hosted data is “live” data provisioned as hybrid cloud storage, then a DR plan may be necessary. Hosting providers may regularly back up the data, but they generally don’t rotate the data off-site, and if they do, they do so infrequently (e.g., monthly). Although a disaster at a SAS-70 compliant data center is unlikely, it’s not impossible. DR capability from a hosting company is often a significant additional expense and can change the economics of hosting in a hurry. Make sure data isn’t left in a vulnerable state.
Backup and recovery. Even if the hosting vendor backs up the data regularly and rotates it off-site frequently, IT organizations may not be out of the woods. Hosting companies usually have a limited number of backup software options and tape technologies. This means their backup format (hardware, software or both) may be incompatible with your IT systems. If an IT organization is forced to do a recovery from the vendor’s tapes, there could be a substantial delay in acquiring the necessary infrastructure. Ensure there’s a way out in a worst-case scenario.
Compliance. Archived data that requires special compliance treatment may still be a candidate for cloud hosting. You’ll need to ensure the data is retained on immutable media, if required. You’ll probably also need assurance that strict access guidelines are followed and auditable; SAS-70 providers should have such processes in place.
Cost certainty and granularity. One of the key benefits of cloud storage hosting for archiving, rather than using in-house infrastructure, is that you pay only for the storage consumed. Metering should scale up or down with use, though it may be subject to a contractual minimum.
Turn tapes into cloud archives
It’s clear that cloud-based archiving may be attractive to companies with aging data stored on relatively expensive in-house arrays. More questionable is whether or not converting from tape-based archives to cloud archives makes sense. Larger organizations may have tens of thousands of tapes in off-site archives. The process of retrieving all those tapes and reading them onto a cloud archive infrastructure is daunting. It also assumes the provider has the necessary hardware to read all the tapes, some of which may be in obsolete formats. Moreover, there’s no way a cloud provider could host such a data volume at anything close to the cost of tapes sitting in a glorified warehouse. Disk compression and data deduplication can help significantly, but the difference in cost is still likely to amount to a substantial premium.
Key cloud-based archiving considerations
- Cloud archiving is a tradeoff between accessibility and cost. It may yield the lowest cost while delivering acceptable data access performance.
- Using a cloud provider requires the IT organization to shift from managing machines to managing service levels.
- Clearly defined service levels are the key to successful cloud archive hosting.
- Organizations should have an exit strategy in case things go wrong.
Even though the hurdles for converting tape to cloud archiving are high, it may still be a consideration. Tapes more than seven years old are likely to be very expensive -- and possibly problematic -- to restore. Best practices dictate that organizations retrieve and rewrite tapes every five years to ensure the data is readable and the format is current. It’s a task to be reckoned with. For example, with a 10,000-tape archive and a five-year refresh cycle, a company would have to refresh 2,000 tapes each year. That comes to approximately eight tapes per workday, which is doable, but requires a year-round effort for what’s fundamentally a nonproduction exercise. Here again, the crux of the matter lies in the probability of retrieval. Some organizations choose to allow tapes to become obsolete in the vault with the knowledge that a recovery would be painful, but the probability of needing to restore the data is low enough to be worth the risk. On the other hand, if you know a recovery is all but inevitable, you may opt to incur the time and expense of moving from tape to cloud now, thus saving significant time and effort later, perhaps under urgent conditions.
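The refresh-cycle arithmetic in that example can be checked in a few lines; the 250-workday year is an assumption for the example:

```python
# Tape refresh workload for the 10,000-tape example in the text.
TAPES = 10_000
REFRESH_CYCLE_YEARS = 5
WORKDAYS_PER_YEAR = 250   # assumption: ~250 working days in a year

tapes_per_year = TAPES / REFRESH_CYCLE_YEARS      # 2,000 tapes/year
tapes_per_workday = tapes_per_year / WORKDAYS_PER_YEAR

print(f"{tapes_per_year:.0f} tapes per year, "
      f"{tapes_per_workday:.0f} per workday")     # 2000 per year, 8 per workday
```

Eight tapes a day sounds modest until it’s multiplied across five years of continuous, nonproduction effort, which is exactly the point the example makes.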
That’s not to suggest that tape is losing its role in archiving. It’s still the lowest cost choice for most situations. In addition, LTO’s Linear Tape File System (LTFS) is enabling tape to take on a new role as “tier 4” storage, so it can act as another tier in the cloud (or data center) that’s provisioned along with tiers 0, 1, 2 and 3. In a cloud archive environment, this would effectively enable a hybrid cloud that offers relatively fast access (e.g., minutes) but at the price point of tape for rarely accessed data. The tapes will also have built-in compression, and the options of encryption and WORM. Using automated tiering software, data can be moved automatically to the archive tier.
The inevitable “what if”
So far, we’ve painted a fairly positive picture of cloud-based archiving services. Usually the effort yields the desired result, but not always. Organizations should consider what would happen if they transferred tens of TBs of data to a provider and then failed to realize the desired or contracted results. Sure, penalties might kick in, but small monetary penalties wouldn’t fully compensate for the true cost, aggravation or damage to the IT organization’s reputation for delivery. Contingencies begin with a contract that may be terminated without penalty for failure to meet specific performance levels. It should also include a plan for alternative hosting capabilities, either back in-house or with another provider. Cloud-based archiving is fairly low on the list of risky endeavors, but smart organizations will be prepared for anything.
BIO: Phil Goodwin is a storage consultant and freelance writer.
This was first published in June 2012