IBM - Data Reduction Estimator Version 1.03 Build 153

 

Abstract

The Data Reduction Estimator tool is a command-line, host-based utility for estimating data reduction savings on block devices.

 

To help with the profiling and analysis of existing user workloads that need to be migrated to a new system, IBM provides a highly accurate data reduction estimation tool that supports both deduplication and compression. The tool scans target workloads on any legacy storage array (IBM or third-party) and then merges all scan results to provide an integrated, system-level data reduction estimate.

 

 

Overview
The Data Reduction Estimator utility uses advanced mathematical and statistical algorithms to perform the analysis with a very low memory footprint. The utility runs on a host that has access to the devices to be analyzed. It performs only read operations, so it has no effect on the data stored on the device. The following sections provide information on installing Data Reduction Estimator on a host and using it to analyze the devices on that host. Depending on the environment configuration, Data Reduction Estimator is in many cases used on more than one host, in order to analyze additional data types.

It is important to understand block device behavior when analyzing traditional (fully allocated) volumes. Traditional volumes that were created without initially zeroing the device may contain traces of old data at the block device level. Such data is not accessible or viewable at the file system level. When Data Reduction Estimator analyzes such volumes, the expected reduction results reflect the saving rate to be achieved for all the data at the block device level, including the traces of old data.

Regardless of the block device type being scanned, it is also important to understand a few principles of common file system space management. When files are deleted from a file system, the space they occupied prior to the deletion becomes free and available to the file system. This happens even though the data on disk was not actually removed, but rather the file system index and pointers were updated to reflect this change. When using Data Reduction Estimator to analyze a block device used by a file system, all underlying data in the device is analyzed, regardless of whether this data belongs to files that were already deleted from the file system. For example, you can fill a 100GB file system and make it 100% used, then delete all the files in the file system making it 0% used. When scanning the block device used for storing the file system in this example, Data Reduction Estimator (or any other utility for that matter) will access the data that belongs to the files that are already deleted.

To reduce the impact of the block device and file system behavior described above, it is recommended to use Data Reduction Estimator on volumes that contain as much active data as possible, rather than on volumes that are mostly empty. This increases accuracy and reduces the risk of analyzing old data that is already deleted but may still have traces on the device.

Data Reduction Estimator adds support for analyzing expected reduction saving on FlashSystem A9000 and A9000R storage systems, running software version 12.0.2.

 

Minimum hardware requirements

·         HP-UX and ESXi – 100 MB of free RAM

·         Windows, Red Hat Linux, Ubuntu, AIX, Solaris – 500 MB of free RAM

 

 

  

Installing Data Reduction Estimator

 

Data Reduction Estimator can be installed only on supported Windows operating systems (see list below). After installation, the binary files for other supported operating systems become available in the Windows installation folder.

By default, the files are copied to:

Windows 64-bit:
C:\Program Files (x86)\IBM\Data Reduction Estimation Tool

 

Windows 32-bit:
C:\Program Files\IBM\Data Reduction Estimation Tool

 

Data Reduction Estimator can be used on the following client operating systems:

·         Windows 2008 Server, Windows 2012

·         Red Hat Enterprise Linux Version 5.x, 6.x, 7.x (64-bit)

·         Ubuntu 12.04

·         ESX 5.0, 5.5, 6.0

·         AIX 6.1, 7.1

·         Solaris 10

·         HP-UX 11.31

 

Note:

·         If a device does not contain enough unique data, the results will not be accurate. The scan is therefore limited to devices with >= 400M of unique data. If a device does not have enough unique data, the scan fails with the “Not enough unique data to estimate savings” error.

 

Using the Data Reduction Estimator tool

Three steps are needed to get the reduction ratio.

Step 1: Obtain the device list.

Step 2: Estimate the reduction ratio on each of the devices independently.

Step 3: If two or more devices were scanned, merge the results from all independent volume scans to view the data reduction estimation across all scanned volumes.

 

The following sections describe how to perform each of the steps.

See the Syntax section for a detailed description of the syntax and command line options.

 

Step 1: Obtain the device list.

This step is performed differently on different platforms.

Linux, ESX, AIX, Solaris and HP-UX server:

1.  Log into the host using the root account.

2.  Obtain the list of device names, using the following commands:

   Linux:   "fdisk -l"

   ESX:     "esxcli storage core device list | grep Dev"

   AIX:     "lsdev -Cc disk"

   Solaris: "format"

   HP-UX:   "ioscan -kfnC disk"

 

Windows server:

1.  Log into the server, using an account with Administrator privileges.

2.  Open an elevated command prompt with Administrator rights (Run as Administrator).

3.  Run “wmic DISKDRIVE list brief”

C:\>wmic DISKDRIVE list brief

Caption                            DeviceID            Model                              Partition  Size
IBM 2145  Multi-Path Disk Device   \\.\PHYSICALDRIVE0  IBM 2145  Multi-Path Disk Device   1          256052966400
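On hosts with many disks, the DeviceID values can be picked out of the listing programmatically. The following is a minimal Python sketch, not part of the tool. It assumes that the listing has been saved as text and that every DeviceID has the \\.\PHYSICALDRIVE<n> form shown above. The names SAMPLE and device_ids are illustrative only.

```python
import re
from typing import List

# Sample text copied from the wmic listing above. In practice the text
# would be captured from the command output (assumption of this sketch).
SAMPLE = r"""Caption                         DeviceID                    Model                  Partition  Size
IBM 2145  Multi-Path Disk Device \\.\PHYSICALDRIVE0 IBM 2145  Multi-Path Disk Device  1          256052966400
"""

# A DeviceID is assumed to look like \\.\PHYSICALDRIVE<n>.
DEVICE_ID = re.compile(r"\\\\\.\\PHYSICALDRIVE\d+")


def device_ids(wmic_output: str) -> List[str]:
    """Return the DeviceID values found in the listing, in order of appearance."""
    return DEVICE_ID.findall(wmic_output)


if __name__ == "__main__":
    for dev in device_ids(SAMPLE):
        print(dev)
```

Each value returned can be passed as the device argument in step 2.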

 

Step 2: Estimate the reduction ratio on each of the devices independently.

 

Run the Data Reduction Estimator tool with the following parameters:

 

1.  -d <device> – The device to analyze, according to the DeviceID output.

2.  --command scan|partialscan – The scan command performs a full scan and reports the total data reduction saving estimate (deduplication and compression). The partialscan command performs a short scan that estimates the compression saving only, without deduplication. The default is “scan”.

3.  -o outputFileName – Keeps the output results for the next step.

For example:

 

Linux:

Data-Reduction-Estimator -d /dev/sda1 -o scan_Linux_RHEL7

 

Windows:

Data-Reduction-Estimator.exe -d \\.\PHYSICALDRIVE0 -o scan_Win7

·         The Data Reduction Estimator tool scans only one device per CLI command. To handle several devices in parallel, use the --batchfile parameter with a batch file in which each line represents a different device.

 

o    An example of several devices in a single batch file:

-d /dev/sda

-d /dev/sdb

-d /dev/sdc --command partialscan

 

In this example, devices /dev/sda and /dev/sdb will be fully scanned, while device /dev/sdc will be partially scanned.
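When the device list is long, the batch file lines can be generated rather than typed by hand. The following Python sketch is only an illustration under stated assumptions: the function name batch_lines is hypothetical, and the line format is taken solely from the batch file example above (one "-d <device>" per line, with "--command partialscan" appended where a partial scan is wanted).

```python
from typing import Dict, List


def batch_lines(devices: Dict[str, str]) -> List[str]:
    """Build one batch-file line per device.

    `devices` maps a device path to "scan" or "partialscan". Because "scan"
    is the default command, it is left implicit, as in the example above.
    """
    lines = []
    for path, command in devices.items():
        if command not in ("scan", "partialscan"):
            raise ValueError(f"unsupported command for {path}: {command}")
        line = f"-d {path}"
        if command == "partialscan":
            line += " --command partialscan"
        lines.append(line)
    return lines


if __name__ == "__main__":
    plan = {"/dev/sda": "scan", "/dev/sdb": "scan", "/dev/sdc": "partialscan"}
    # Print the batch file content; saving it to a file is left to the caller.
    print("\n".join(batch_lines(plan)))
```

The printed text can be saved to a file and passed to the tool with the --batchfile parameter.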

 

During the scan process, checkpoints are stored in a file with the .rdb extension.

·         .dat sketch files are created for devices whose scan has completed.

·         If a scan is terminated before completion, the next execution continues from the last checkpoint.

 

 

Step 3: Merge the results from the independent scans together.

 

1.  Collect the output files of all scanned devices from step 2 and place them in the directory where the Data-Reduction-Estimator tool is located.

2.  To calculate the total data reduction saving, use the --command merge option. Separate the output files with commas (“,”):

 

For example:

 

Data-Reduction-Estimator --command merge --mergefiles scan_freebsd91_1024,scan_vendlist_1024,scan_win2008_1024,scan_Win7,scan_Linux_RHEL7

 

After the data reduction analysis is completed, the overall data reduction estimation for all devices is displayed.

 

 

·         By default, only volumes with an overall data reduction ratio of 90% or below (equivalent to savings of 10% or higher) are merged.

·         --mergeall – Overrides the data reduction threshold. All volumes will be merged.

 

Note:

 

·         Merge can be applied only to data files generated by the same binary build. Otherwise, the “Build mismatch” error is generated.

·         Merge cannot be applied to .rdb sketch files. .rdb files are intermediate files created by unfinished scans.
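The merge rules above can be checked before the tool is invoked. The following Python sketch assembles the argument list for a merge; it is an illustration, not part of the tool. The function name merge_command is hypothetical, and only options that appear in this document (--command merge, --mergefiles, --mergeall) are used. The minimum of two files follows the description of the merge command in the Syntax section. The build check is left to the tool itself, because the sketch has no way to read the build number from a data file.

```python
from typing import List, Sequence


def merge_command(result_files: Sequence[str], merge_all: bool = False,
                  binary: str = "Data-Reduction-Estimator") -> List[str]:
    """Return the argument list for merging completed scan results."""
    if len(result_files) < 2:
        raise ValueError("merge needs at least two result files")
    for name in result_files:
        if name.endswith(".rdb"):
            # .rdb files are checkpoints of unfinished scans and cannot be merged.
            raise ValueError(f"{name} is an unfinished scan checkpoint")
        if "," in name:
            # The file list is comma-separated, so a comma in a name is ambiguous.
            raise ValueError(f"{name} contains a comma")
    args = [binary, "--command", "merge", "--mergefiles", ",".join(result_files)]
    if merge_all:
        args.append("--mergeall")
    return args


if __name__ == "__main__":
    print(" ".join(merge_command(["scan_Win7", "scan_Linux_RHEL7"])))
```

The returned list is suitable for passing to a process launcher; on Windows the binary name would be Data-Reduction-Estimator.exe.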

 

 

Syntax

Linux, ESX, AIX, Solaris and HP-UX:

 

Data-Reduction-Estimator -d <device> [-x Max MBps] [-o result data filename] [-s Update interval] [--command scan|merge|load|partialscan] [--mergefiles Files to merge] [--loglevel Log Level] [--batchfile batch file to process] [-h]

 

Windows:

Data-Reduction-Estimator.exe -d <device> [-x Max MBps] [-o result data filename] [-s Update interval] [--command scan|merge|load|partialscan] [--mergefiles Files to merge] [--loglevel Log Level] [--batchfile batch file to process] [-h]

 


-d                The device name.

                  Linux: Path of the device to analyze (e.g. /dev/sda).

                  Windows: DeviceID. To get the DeviceID, use the wmic Windows utility.

                  See the previous section for instructions on how to obtain the device list.


-x                Throughput limit, in MBps. Default is 0 (no limit).


-o                The name of the output file, which is the data file that contains the information on the analyzed device. It can be used later with the merge option. If no name is provided, the output file is created with a default name.


-s                The progress update interval. Default is 10 seconds.


--command         The operation mode.

                  scan – Full scan (default). Estimates the total data reduction saving.

                  merge – Can be used after the scan of all the devices is completed, in order to get the average statistics for all scanned devices. At least two files (volumes) are required.

                  load – Can be used after the scan of a device is completed, in order to load the device statistics from the .dat sketch file.

                  partialscan – Estimates the compression saving only. It is used for a quick sample scan.

--mergefiles      The result files to merge, used to calculate the total data saving for more than one scanned device.

                  The file list is separated by commas (“,”).

                  By default, devices with a data reduction ratio above 90% (savings below 10%) are ignored. Every such instance is reported.


--mergeall        Override the 90% data reduction ratio threshold, so that all volumes are merged.


--loglevel        Log level to run. Values are 3 – 7 (default is 3).


--batchfile       Batch file to process. The batch file can contain several devices, with each line referring to a different device.


Examples:

Data Reduction Estimator output examples

 

 root@swfc120:/tmp# ./Data-Reduction-Estimator -d /dev/dm-10

Result data filename not given, auto-generating: file_C8F50050.dat

200.00 GB | 55.60 MBps: 0% [####################] 100%%

Estimated Dedupe Savings:         11.659%

Estimated Compression Savings:    65.797%

Data Reduction Savings:           69.784%

---------------------------------------

Zeroes Detected Savings:          0.227%

Total Data efficiency Savings:    69.853%

Time Consumed:                    00:11:01

 

Analyzing the results of the above example:

Volume size: 200GB

Auto generated output file: file_C8F50050.dat

 

 

 

To get the total data on disk after reduction, apply the complement of the data reduction saving to the disk size:

(100% - 69.784%) of 200GB = 60.432GB

Data on disk after reduction (dedup and compression) is 60.432GB

 

The Zeroes Detected saving refers to large sequences of zeros that were detected on the device. It is not an inherent part of the data reduction saving, as some systems consider this as thin provisioning. The total data efficiency is the total savings, including deduplication, compression and the large zero sequences combined.

 

The data is deduplicated first and then compressed.

Dedup saving:

0.11659 * 200GB = 23.318GB

Compression saving (after dedup):

0.65797 * (200GB - 23.318GB) = 116.251GB

 

Total saving: 23.318GB + 116.251GB = 139.569GB (~70% of 200GB)
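The calculation above can be repeated for any volume with a few lines of Python. This is a minimal sketch, assuming only the order stated above (deduplication first, then compression of what remains); the function name reduction_breakdown and the dictionary keys are illustrative.

```python
from typing import Dict


def reduction_breakdown(size_gb: float, dedup_pct: float,
                        compression_pct: float) -> Dict[str, float]:
    """Split the total saving into its dedup and compression parts.

    Dedup is applied to the whole volume; compression is applied to the
    data that remains after dedup.
    """
    dedup_saved = size_gb * dedup_pct / 100.0
    compression_saved = (size_gb - dedup_saved) * compression_pct / 100.0
    total_saved = dedup_saved + compression_saved
    return {
        "dedup_saved_gb": dedup_saved,
        "compression_saved_gb": compression_saved,
        "total_saved_gb": total_saved,
        "remaining_gb": size_gb - total_saved,
        "combined_pct": 100.0 * total_saved / size_gb,
    }


if __name__ == "__main__":
    # Values from the 200GB example: 11.659% dedup, 65.797% compression.
    for key, value in reduction_breakdown(200, 11.659, 65.797).items():
        print(f"{key}: {value:.3f}")
```

The combined percentage computed this way can differ from the reported Data Reduction Savings figure in the last decimal place, because the tool prints rounded percentages.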

 

The next example illustrates the total data saving on two volumes: RHEL7 and win2008.

 

Data-Reduction-Estimator --command merge --mergefiles scan_Linux_RHEL7,scan_win2008_1024

Result data filename not given, auto-generating: merge_out

Estimated Dedup Savings:         97.8%

Estimated Compression Savings:    16.3%

Data Reduction Savings:           98.2%

---------------------------------------

Zeroes Detected Savings:          4.11%

Total Data Efficiency Savings:    98.2%

Time Consumed:                    00:00:00
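When many scans are collected, the savings figures can be pulled out of the report text for tabulation. The following Python sketch is not part of the tool. It assumes that the report keeps the "label: value%" layout shown in the examples above, and the names SAMPLE, SAVINGS_LINE and parse_savings are illustrative.

```python
import re
from typing import Dict

# Report text copied from the merge example above.
SAMPLE = """Result data filename not given, auto-generating: merge_out
Estimated Dedup Savings:         97.8%
Estimated Compression Savings:    16.3%
Data Reduction Savings:           98.2%
---------------------------------------
Zeroes Detected Savings:          4.11%
Total Data Efficiency Savings:    98.2%
Time Consumed:                    00:00:00
"""

# Matches lines of the form "<label>:   <number>%" and nothing else, so the
# progress bar, the separator and the time line are skipped.
SAVINGS_LINE = re.compile(r"^\s*(?P<label>[^:]+?):\s+(?P<value>\d+(?:\.\d+)?)%\s*$")


def parse_savings(report: str) -> Dict[str, float]:
    """Map each savings label in the report to its percentage."""
    savings = {}
    for line in report.splitlines():
        match = SAVINGS_LINE.match(line)
        if match:
            savings[match.group("label")] = float(match.group("value"))
    return savings


if __name__ == "__main__":
    for label, pct in parse_savings(SAMPLE).items():
        print(f"{label}: {pct}")
```

Labels are kept exactly as printed, so the spelling difference between the two examples ("Dedupe" in the scan report, "Dedup" in the merge report) carries through to the keys.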

 

Mergeall CLI syntax example:

Data-Reduction-Estimator --command merge --mergefiles scan_Linux_RHEL7,scan_win2008_1024 --mergeall

 

Load CLI syntax example:

 

Data-Reduction-Estimator --command load -o file_770F7F23.dat

 

Best practices

As a rule, ESXi operations require larger amounts of RAM. If the amount of RAM is not sufficient, the data reduction estimation tool performance may be degraded.

 

 

On Windows, Red Hat Linux, Ubuntu, AIX and Solaris, the default number of concurrent threads performing the scan is 10.

If the scanning task is taking too long, decrease the number of threads. You can try to reduce the number of threads to 5, and then down to a single thread.

 

Thread reduction example:

DEDUPEL3=1 ./Data-Reduction-Estimator -p 5 -d XXXX

 

On HP-UX and ESXi, the default number of threads is 1. If more than 100MB of RAM is available for the tool, increase the number of threads, using the same syntax as in the example above.

 

In some cases, especially with ESXi, changing the number of threads may not result in a significant performance improvement. In this case, run a partial scan. It estimates compression savings only and requires a smaller amount of RAM.

 

 

 

Enhancements introduced in V1.03

·         New “load” command. Reloads the statistics of previously scanned devices (see the Syntax section for details).

·         Resuming scans. Checkpoints are stored in .rdb files. If scanning has been stopped, the next scan continues from the last checkpoint.

·         Smart merges. Only volumes whose overall data reduction ratio meets a defined threshold (by default, 90% or below) are merged.

 

Download the Data Reduction Estimator from IBM Fix Central.

When asked for a Machine Serial Number during the download, enter any combination of seven digits in the XXX-XXXX format to continue.