
Using AWS S3 Storage Service as On-premises NAS

One of the fastest ways to start utilizing storage services in the cloud, such as AWS S3, is by using AWS File Storage Gateway.

File Storage Gateway is a hybrid cloud storage service that provides on-premises access to virtually unlimited cloud storage in AWS S3. It presents one or more AWS S3 buckets and their objects as a mountable NFS or SMB share to one or more on-premises clients. In effect, you get an on-premises NAS that keeps hot data locally, while the backend connects to AWS S3, where the data ultimately resides. The main advantages of using File Storage Gateway are:

  1. Data on AWS S3 can be tiered and lifecycled into cost-effective storage classes (a sample lifecycle rule follows this list)
  2. Data can be processed both on-prem and in AWS, using on-prem legacy applications and Amazon EC2-based applications
  3. Data can be shared by users located in multiple geographic locations
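
For the first advantage, tiering comes down to standard S3 lifecycle management on the bucket behind the share. A minimal sketch using the AWS CLI, with a hypothetical bucket name and illustrative thresholds:

    # Hypothetical bucket "corp-data": move objects to Standard-IA after
    # 30 days and to Glacier after 90 days
    aws s3api put-bucket-lifecycle-configuration \
      --bucket corp-data \
      --lifecycle-configuration '{
        "Rules": [{
          "ID": "tier-cold-data",
          "Status": "Enabled",
          "Filter": {"Prefix": ""},
          "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},
            {"Days": 90, "StorageClass": "GLACIER"}
          ]
        }]
      }'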

One disadvantage of using File Storage Gateway is that it is not optimized for a large number of users or connections. It is designed for a small number of users (about 100 connections per gateway) but a high volume of data (TB to PB scale).
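
To a Linux client, the gateway share behaves like an ordinary NFS mount. A minimal sketch, assuming a hypothetical gateway at 10.0.0.5 exporting a share named after the bucket corp-data (verify the recommended mount options against the current AWS documentation):

    # Mount the gateway's NFS export; nolock and hard are the options
    # AWS has recommended for Linux clients
    sudo mkdir -p /mnt/corp-data
    sudo mount -t nfs -o nolock,hard 10.0.0.5:/corp-data /mnt/corp-data

    # Files written here surface as objects in the corp-data bucket
    cp report.csv /mnt/corp-data/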

Improving Security of Backup Data

One of the best defenses against ransomware is to back up data and verify its integrity regularly. If your data has been encrypted by ransomware, you can always restore it from backup. However, hackers using ransomware are increasingly targeting primary backups. Adding an air gap to a secondary copy of the backup can mitigate this.

An air gap is a security measure that protects backup data from intrusion, malicious software, and direct cyber attacks. The idea is to place a secondary copy of backups behind a private network that is not physically connected to the wider network (i.e., behind an air gap). These air-gapped secondary backups preserve clean copies from which data attacked by ransomware can be restored.

One example of an air gap implementation is from DellEMC. In the figure below, the Data Domain primary backup storage (Source) is replicated to a Data Domain secondary backup storage (Target) inside a vault. The vault is self-contained and self-secured; it is air-gapped except during periodic replication cycles. It also has encryption and data protection controls, including mutual authentication of source and target, data-at-rest encryption, data-in-motion encryption, replication channel encryption, Data Domain hardening, and immutable data (using retention lock). In addition, it contains applications that scan for security issues and test critical apps.

[Figure: DellEMC Cyber Recovery]


VMware Instant Recovery

When a virtual machine crashes, there are two ways to quickly recover it: the first is by using a VMware snapshot copy, and the second is by restoring an image-level backup. Most VMware environments, though, do not routinely keep snapshots of virtual machines (VMs) because snapshots increase usage of primary storage, which can be costly. On the other hand, the traditional method of restoring an image-level backup can take longer, since the image has to be copied back from protection storage to primary storage.

However, most backup solutions nowadays – including NetBackup, Avamar/Data Domain, and Veeam – support VMware instant recovery, where you can immediately restore VMs by running them directly from backup files. The way it works is that the virtual machine image backup is staged to a temporary NFS share on the protection storage system (e.g., Data Domain). You can then use the vSphere Client to power on the virtual machine (which is NFS-mounted on the ESXi host), then initiate a vMotion of the virtual machine to the primary datastore within vCenter.
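
Backup products automate these steps, but conceptually the flow looks something like the following sketch using VMware's open-source govc CLI; the Data Domain export, ESXi host, datastore, and VM names are hypothetical:

    # 1. Expose the protection storage's NFS export as a temporary datastore
    govc datastore.create -type nfs -name dd-restore \
      -remote-host dd01.example.com -remote-path /instant-restore esxi01

    # 2. Register the VM straight from the staged backup image and power it on
    govc vm.register -ds dd-restore -host esxi01 app01/app01.vmx
    govc vm.power -on app01

    # 3. Storage-vMotion the running VM back to the primary datastore
    govc vm.migrate -ds primary-ds app01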

Since there is no need to extract the virtual machine from the backup file and copy it to production storage, you can perform a restore from any restore point in a matter of minutes. VMware instant recovery helps improve recovery time objectives (RTO) and minimizes disruption and downtime of critical workloads.

There are also other uses for instant recovery. You can use it to verify a backup image, verify an application, test a patch on a restored virtual machine before applying the patch to production systems, and perform granular restores of individual files and folders.

Unlike primary storage, protection storage such as Data Domain is usually slow. However, newer Data Domain releases have improved random I/O (due to additional flash SSDs), higher IOPS, and lower latency, enabling faster instant access and restore of VMs.

Using BoostFS to Backup Databases

If your company uses a DellEMC Data Domain appliance to back up your databases, you are probably familiar with DD Boost technology. DD Boost increases backup speed while decreasing network bandwidth utilization. In the case of Oracle, it has a plugin that integrates directly into RMAN: RMAN backs up via the DD Boost plugin to the Data Domain. It is the fastest and most efficient method to back up Oracle databases.
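
As a hedged sketch of what that integration looks like from the RMAN side – the SBT library path, storage unit, and Data Domain hostname below are illustrative and vary with the plugin version and environment:

    rman target /

    RMAN> RUN {
      ALLOCATE CHANNEL c1 DEVICE TYPE SBT_TAPE PARMS
        'SBT_LIBRARY=/opt/dpsapps/dbappagent/lib/lib64/libddboostora.so,
         ENV=(STORAGE_UNIT=su_oracle, BACKUP_HOST=dd01.example.com)';
      BACKUP DATABASE PLUS ARCHIVELOG;
      RELEASE CHANNEL c1;
    }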

However, some database administrators are still more comfortable performing cold backups. These backups are usually dumped to the Data Domain via an NFS mount. This is not the most efficient way to back up large databases, since the data is not deduplicated before being sent over the network, consuming a lot of bandwidth.

Luckily, DellEMC created BoostFS (Data Domain Boost Filesystem), which provides a general file-system interface to the DD Boost library, allowing standard backup applications to take advantage of DD Boost features. For database cold backups, instead of using NFS to mount the Data Domain, you can use BoostFS to stream the cold backups to the Data Domain, increasing backup speed and decreasing network bandwidth utilization. In addition, you can take advantage of its load-balancing feature as well as in-flight encryption.

To implement BoostFS, follow these steps (a consolidated command sketch follows the list):

1. BoostFS depends on FUSE, so install the fuse and fuse-libs packages before installing the BoostFS package.

2. Edit the configuration file /opt/emc/boostfs/etc/boostfs.conf, specifying the Data Domain hostname, the storage unit, the username, the security option, and whether to allow users other than the owner of the mount to access the mount. This last option is useful if you are using the same storage unit for multiple machines.

3. Create the lockbox file if you specified lockbox as the security option, which is the most popular choice.

4. Verify that the host has access to the storage unit using the command /opt/emc/boostfs/bin/boostfs lockbox show-hosts

5. Mount the BoostFS storage unit using the command /opt/emc/boostfs/bin/boostfs mount

6. To retain the mount after reboots, add a boostfs entry in /etc/fstab
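
Pulled together, the flow looks roughly like this. The Data Domain hostname, storage unit, and username below are hypothetical, and flags may differ slightly between BoostFS versions (check the BoostFS configuration guide):

    # Install the FUSE prerequisites, then the BoostFS package
    yum install -y fuse fuse-libs
    rpm -ivh DDBoostFS-*.rpm

    # After editing /opt/emc/boostfs/etc/boostfs.conf, register the
    # credentials in the lockbox and verify access
    /opt/emc/boostfs/bin/boostfs lockbox set -d dd01.example.com -s su_oracle -u boostuser
    /opt/emc/boostfs/bin/boostfs lockbox show-hosts

    # Mount the storage unit and point database dumps at it
    mkdir -p /mnt/boostfs
    /opt/emc/boostfs/bin/boostfs mount /mnt/boostfs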

For more information, visit the DellEMC support site.

Hyper-converged Infrastructure: Hype or For Real?

One of the hottest emerging technologies in IT is hyper-converged infrastructure (HCI). What is the hype all about? Is it here to stay?

As defined by TechTarget, hyper-converged infrastructure (HCI) is a system with a software-centric architecture that tightly integrates compute, storage, networking, virtualization resources (hypervisor, virtual storage, virtual networking), and other technologies (such as data protection and deduplication) in a commodity hardware box (usually x86) supported by a single vendor.

Hyper-convergence grew out of the concept of converged infrastructure; engineers took it a little further with a very small hardware footprint, tight integration of components, and simplified management. It is a relatively new technology. On the technology adoption curve, it is still at the early adopters stage.

Nutanix was the first vendor to offer a hyper-converged solution, followed by SimpliVity and Scale Computing. Not to be outdone, VMware developed EVO:RAIL, then opened it up for hardware vendors to OEM the product. Major vendors, including EMC, NetApp, Dell, HP, and Hitachi, began selling EVO:RAIL products.

One of the best HCI products that I’ve seen is VxRail. Jointly engineered by VMware and EMC, the “VxRail appliance family takes full advantage of VMware Hyper-Converged Software capabilities and provides additional hardware and lifecycle management features and rich EMC data services, delivered in a turnkey appliance with integrated support.”

What are the advantages of HCI and where can it be used? Customers who are looking to start small and scale out over time will find an HCI solution very attractive. It is a perfect fit for small to medium-sized companies that want to build their own data center without spending a huge amount of money. It is simple (because it eliminates a lot of hardware clutter) and highly scalable (because it can grow easily by adding small, standardized x86 nodes). Since it is scalable, it eases the burden of growth. Finally, its performance is comparable to big infrastructures because leveraging SSD storage and bringing data close to the compute enables high IOPS at very low latencies.

References:

1. TechTarget
2. VMware Hyper-Converged Infrastructure: What’s All the Fuss About?

Replicating Massive NAS Data to a Disaster Recovery Site

Replicating Network Attached Storage (NAS) data to a Disaster Recovery (DR) site is quite easy when using big-name NAS appliances such as NetApp or Isilon. Replication software is already built into these appliances – SnapMirror for NetApp and SyncIQ for Isilon. They just need to be licensed to work.

But how do you replicate terabytes, even petabytes, of data to a DR site when Wide Area Network (WAN) bandwidth is a limiting factor? Replicating a petabyte of data may take weeks, if not months, to complete even on a 622 Mbps WAN link, leaving the company’s DR plan vulnerable.

One way to accomplish this is to use a temporary swing array: (1) replicate data from the source array to the swing array locally, (2) ship the swing array to the DR site, (3) copy the data to the destination array, and finally (4) resync the source array with the destination array.

On NetApp, this is accomplished using the SnapMirror resync command. On Isilon, this is accomplished by turning on the “target-compare-initial” option in SyncIQ, which compares the files between the source and destination arrays and sends only the data that differs over the wire.
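
A hedged sketch of both, with hypothetical volume, vserver, and policy names (exact syntax varies by ONTAP and OneFS version):

    # NetApp ONTAP: after the swing-array data lands on the DR array,
    # re-establish the mirror; only differences cross the WAN
    snapmirror resync -source-path svm1:vol_nas -destination-path svm_dr:vol_nas_dr

    # Isilon SyncIQ: compare source and target before the initial sync
    # so only changed files are sent, then kick off the job
    isi sync policies modify nas-to-dr --target-compare-initial-sync true
    isi sync jobs start nas-to-dr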

When this technique is used, huge amounts of company data sitting on NAS devices can be protected at the DR site right away.

Protecting Data Located at Remote Sites

One of the challenges at remote offices with limited bandwidth and plenty of data is how to protect that data. Building a local backup infrastructure can be cost-prohibitive, so the best option is usually to back up the data to the company’s data center or to a cloud provider.

But how do you initially bring the data to the backup server without impacting the business users who share the wide area network (WAN)?

There are three options (a generic rsync-based sketch follows the list):

1. The first option is to “seed” the initial backup: start the backup locally to a USB drive, ship the drive to the data center, copy the data, then perform subsequent backups over the WAN to the data center.

2. Use the WAN to back up the data but throttle the bandwidth until the backup completes. WAN utilization will be low, but it may take some time to finish.

3. Use the WAN to back up the data, but divvy it up into smaller chunks. So that users will not be affected during business hours, run the backup jobs only during off-hours and weekends. This may also take some time to complete.
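
Backup products implement these options in their own ways; as a generic illustration, here is what each might look like with plain rsync and a hypothetical backup server:

    # Option 1: seed locally to a USB drive, then ship the drive
    rsync -a /data/ /mnt/usb/data/

    # Option 2: throttle the WAN transfer (--bwlimit is in KiB/s; ~5 MB/s here)
    rsync -a --bwlimit=5120 /data/ backupserver:/backups/remote-office/

    # Option 3: send one chunk per off-hours window, e.g. one top-level
    # directory at a time
    rsync -a /data/projects/ backupserver:/backups/remote-office/projects/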

Object Storage

A couple of days ago, a business user asked me if our enterprise IT provides object-based storage. I had heard the term object storage before, but I had little knowledge about it. I only knew it’s a type of storage that is data aware. I replied, “No, we don’t offer it yet.” But in the back of my mind, I was asking myself: should we be offering object storage to our users? Are we so behind that we haven’t implemented this cool technology? Is our business losing its competitive advantage because we haven’t been using it?

As I researched the topic further, I came to understand what it entails, along with its advantages and disadvantages.

Object storage is one of the hot technologies expected to see growing adoption this year. As defined by Wikipedia, object storage “is a storage architecture that manages data as objects, as opposed to other storage architectures like file systems which manage data as a file hierarchy and block storage which manages data as blocks within sectors and tracks. Each object typically includes the data itself, a variable amount of metadata, and a globally unique identifier.”

Its extended metadata allows for some intelligence in the data. For example, a user or application can tag a data object with what type of file it is, how it should be used, who will use it, what it contains, how long it should live, and so on. That metadata could, in turn, inform a backup application that the object is classified or that it should be deleted on a certain date. This makes tasks like automation and management simpler for the administrator.
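
As an illustration using Amazon S3 (the bucket, key, and metadata below are hypothetical), custom metadata is attached when the object is stored, and other tools can read and act on it later:

    # Store an object with user-defined metadata that a backup or
    # lifecycle tool could later act on
    aws s3api put-object \
      --bucket corp-objects \
      --key reports/q3.pdf \
      --body q3.pdf \
      --metadata classification=confidential,delete-after=2026-12-31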

The globally unique identifier allows a server or end user to retrieve the data without needing to know its physical or hierarchical location. This makes object storage useful for long-term data retention, backup, file sharing, and cloud applications. In fact, Facebook uses object storage when you upload a picture.

One drawback of object storage is performance – low throughput and high latency, partly due to the amount of metadata. Another drawback is that data consistency is achieved slowly: whenever an object is updated, the change has to be propagated to all of the replicas, and it takes time before the latest version becomes available. With these properties, object storage is well suited for data that doesn’t change much, like backups, archives, video, and audio files. That’s why it is heavily used by Facebook, Spotify, and other cloud companies: once you upload a picture or music file, it doesn’t change much and it stays forever.

Object storage may be one of the hottest technologies in the storage space, but for now, I don’t see compelling use cases in enterprise IT. Object storage is unsuitable for data that changes frequently, and file systems and block storage do just fine in storing both data that rarely changes and data that changes frequently. Enterprise backup systems are versatile as well for long-term data retention and backups. Object storage may provide more information about the data, but storage administrators’ primary concerns are to deliver the data faster and more efficiently, and to protect its integrity.

Object storage’s distributed nature enables IT shops to use low-cost storage, but in reality, NAS and SAN remain prevalent in enterprise IT because they are reliable and easier to manage.

We need well-defined use cases and compelling advantages before object-based storage is widely used in enterprise IT.

Data Protection Best Practices

Data protection is the process of safeguarding information from threats to data integrity and availability.  These threats include hardware errors, software bugs, operator errors, hardware loss, user errors, security breaches, and acts of God.

Data protection is crucial to the operation of any company, and a sound data protection strategy must be in place. Following is my checklist for a good data protection strategy, covering implementation and operation:

1. Backup and disaster recovery (DR) should be part of the overall design of the IT infrastructure. Network, storage, and compute resources must be allocated in the planning process. Small and inexperienced companies often treat backup and DR as an afterthought.

2. Classify data and applications according to importance. It is more cost-effective and easier to apply the necessary protection when data is classified properly.

3. As for which backup technology to use – tape, disk, or cloud – the answer depends on several factors, including the size of the company and the budget. For companies with budget constraints, tape backup with off-site storage generally provides the most affordable option for general data protection. For medium-sized companies, a cloud backup service can provide a disk-based backup target via an Internet connection or can be used as a replication target. For large companies with multiple sites, on-premises disk-based backup with WAN-based replication to another company site or a cloud service may be the best option.

4. Use snapshot technology that comes with the storage array. Snapshots are the fastest way to restore data.

5. Use disk mirroring, array mirroring, and WAN-based array replication technology that come with the storage array to protect against hardware and site failures.

6. Use continuous data protection (CDP) when granular rollback is required.

7.  Perform disaster recovery tests at least once a year to make sure the data can be restored within planned time frames and that the right data is being protected and replicated.

8. Document backup and restore policies – including how often backups occur (e.g. daily), the backup method (e.g. full, incremental, synthetic full, etc.), and the retention period (e.g. 3 months). Policies must be approved by upper management and communicated to users. Document all disaster recovery procedures and processes as well.

9. Monitor all backup and replication jobs on a daily basis and address the ones that failed right away.

10.  Processes must be in place to ensure that newly provisioned machines are being backed up.  Too often, users assume that data and applications are backed up automatically.

11. Encrypt data at rest and data in motion.

12. Employ third party auditors to check data integrity and to check if the technology and processes work as advertised.

A good data protection strategy consists of using the right tools, well trained personnel to do the job, and effective processes and techniques to safeguard data.

Enterprise File Sync and Share

Due to the increased use of mobile devices (iPhone, iPad, Android, tablets, etc.) in the enterprise, a platform where employees can synchronize files across their various devices is becoming a necessity. In addition, they need a platform where they can easily share files both inside and outside the organization. Some employees have been using this technology unbeknownst to the IT department: the cloud-based file sync-and-share app Dropbox has been especially popular in this area. The issue with these cloud-based sync-and-share apps is that, for corporate data that is sensitive and regulated, they can pose a serious problem for the company.

Enterprises must have a solution in their own internal data center that the IT department can control, secure, protect, back up, and manage. IT vendors have been offering these products over the last several years. Some examples of enterprise file sync and share are: EMC Syncplicity, Egnyte Enterprise File Sharing, Citrix ShareFile, and Accellion Kiteworks.

A good enterprise file sync and share application must have the following characteristics:

1. Security. Data must be protected from malware, and it must be encrypted in transit and at rest. The application must integrate with Active Directory for authentication, and there must be a mechanism to remotely lock and/or wipe devices.
2. The application and data must be supported via WAN acceleration, so users do not perceive slowness.
3. Interoperability with Microsoft Office, SharePoint, and other document management systems.
4. Support for major endpoint devices (Android, Apple, Windows).
5. Ability to house data internally and in the cloud.
6. Finally, the app should be easy to use. Users’ files should be easy to access, edit, share, and restore, or else people will revert to the cloud-based apps they find super easy to use.