[Bugme-new] [Bug 8588] New: The automatic partition scan upon block device registration cause I/O trouble problems for other cluster nodes

bugme-daemon at bugzilla.kernel.org bugme-daemon at bugzilla.kernel.org
Tue Jun 5 11:56:51 PDT 2007


http://bugzilla.kernel.org/show_bug.cgi?id=8588

           Summary: The automatic partition scan upon block device
                    registration cause I/O trouble problems for other
                    cluster nodes
    Kernel Version: 2.6.20
            Status: NEW
          Severity: low
             Owner: axboe at kernel.dk
         Submitter: tore at fud.no


Most recent kernel where this bug did *NOT* occur: None
Distribution: Debian, Ubuntu, RHEL (it applies to any distro I guess)
Hardware Environment: Most kinds of low or midrange networked storage arrays
Software Environment: Clusters using dm-multipath
Problem Description:
When a cluster uses a storage array with active/passive controllers that moves
a volume whenever it sees I/O arriving at the passive controller, the automatic
partition table scan that happens when a block device is registered often
causes a volume transfer.  That in turn makes I/O to the formerly active
controller fail for some time (probably until the volume is completely
transferred away, after which I/O will just move it back again).  This applies
to arrays from several vendors, such as Engenio (OEM'ed by Sun/StorageTek, IBM,
and probably others) and EMC.  In Engenio's case this "Automatic Volume
Transfer" mode is the only way to use the array with Linux and dm-multipath.
Its alternative mode ("RDAC mode"), in which a volume is only transferred after
a specific SCSI command explicitly requests the transfer, is one that
dm-multipath does not yet have a hardware handler for.

This is problematic for clusters where several machines have access to the same
volume.  Consider the case where machine Foo and machine Bar are happily using
a volume on such an array.  They're both using controller 2 for all I/O, due
to dm-multipath's nice path grouping and prioritising features, and all is
well.  At a later time, though, Foo is rebooted, and when it is starting up it
loads the fibre channel HBA driver, which finds, say, eight paths to each of
the two controllers.  It then proceeds to register these in the kernel as
block devices, and when the block layer goes on to scan each one for a
partition table, it generates I/O to controller 1, prompting the array to move
the volume there and making Bar experience I/O errors.  When the array has
completed the volume movement and Bar's dm-multipath has been able to redirect
I/O there, you can be certain that Foo is well on its way to moving it back by
registering another block device that represents a path through controller 2.
This can go on for
quite some time, depending on the complexity of the SAN and the number of
volumes and hosts.  In any case it's bad: until it settles down there is very
poor service, if any, and if dm-multipath has marked all paths as failed, the
VFS may have seen I/O errors and remounted filesystems read-only, and so on.
Pain.
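
The ping-pong described above can be illustrated with a toy model.  This is
only a sketch under simplifying assumptions: the `Array` class, the `boot_scan`
helper, and the path orderings are hypothetical, the array is assumed to move
the volume on any media I/O that reaches the non-owning controller, and Bar's
own traffic is assumed to follow the volume rather than cause transfers itself.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Array:
    """Toy active/passive array in automatic-volume-transfer mode."""
    owner: int = 2       # controller that currently owns the volume
    transfers: int = 0   # number of ownership moves so far

    def io(self, controller: int) -> None:
        # Media I/O arriving at the non-owning controller moves the volume.
        if controller != self.owner:
            self.owner = controller
            self.transfers += 1


def boot_scan(array: Array, path_order: Iterable[int]) -> int:
    """Simulate the partition scan issued for each newly registered path.

    path_order lists, in discovery order, the controller each path goes through.
    Returns the total number of volume transfers the scans provoked.
    """
    for controller in path_order:
        array.io(controller)  # reading the partition table is ordinary read I/O
    return array.transfers


# Eight paths per controller, discovered grouped by controller.
grouped = boot_scan(Array(), [1] * 8 + [2] * 8)
# The same sixteen paths, discovered alternating between controllers.
interleaved = boot_scan(Array(), [1, 2] * 8)
```

In this model the grouped discovery order provokes two transfers and the fully
interleaved order provokes sixteen, and every transfer is a window in which Bar
sees failing I/O.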

Steps to reproduce: Reload the SCSI driver on a cluster node (or boot it)

The solution, it seems to me, is to make it possible somehow to tell the kernel
NOT to scan newly registered block devices for partition tables, and instead
delegate this task to udev (which can then be used to selectively pick which
devices to scan).
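
As a rough sketch of what the userspace side of that could look like: a helper
invoked by udev for each new device decides whether to scan it and, if so,
asks the kernel to re-read the partition table through the existing BLKRRPART
ioctl.  The way to switch off the kernel's automatic scan is the part that
does not exist and is assumed here; the function names and the pattern-based
policy are hypothetical as well.

```python
import fcntl
import os
from fnmatch import fnmatch
from typing import Iterable

# BLKRRPART from <linux/fs.h>, defined as _IO(0x12, 95): re-read partition table.
BLKRRPART = 0x125F


def should_scan(devname: str, allowed_patterns: Iterable[str]) -> bool:
    """Hypothetical policy: scan only devices matching an admin-supplied pattern."""
    return any(fnmatch(devname, pattern) for pattern in allowed_patterns)


def rescan_partitions(devpath: str) -> None:
    """Ask the kernel to re-read the partition table of one block device.

    Needs sufficient privileges (typically root) and generates read I/O to
    the device, so it should only be called for devices the policy allows.
    """
    fd = os.open(devpath, os.O_RDONLY)
    try:
        fcntl.ioctl(fd, BLKRRPART)
    finally:
        os.close(fd)


def handle_new_device(devname: str, allowed_patterns: Iterable[str]) -> bool:
    """Scan the device if the policy allows it; return whether a scan was issued."""
    if not should_scan(devname, allowed_patterns):
        return False
    rescan_partitions(os.path.join("/dev", devname))
    return True
```

With a policy such as `["sda", "hd*"]`, local disks would still be scanned
while the individual paths to the array are left alone until dm-multipath has
grouped them.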

There might be other things the block layer sends when a device is being
registered too, I don't know.  But at least the INQUIRY, TEST UNIT READY, and
READ CAPACITY commands do not cause any volume transfers, so these should be
safe, at least on my EMC and Engenio gear.
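
The observation above can be written down as a small lookup.  The operation
codes are the standard SCSI ones; treating every other command as one that may
move the volume is a conservative assumption made for this sketch, not
something verified against any array, and the names used are hypothetical.

```python
# SCSI operation codes as defined by the SPC/SBC standards.
TEST_UNIT_READY = 0x00
INQUIRY = 0x12
READ_CAPACITY_10 = 0x25
READ_10 = 0x28  # the kind of command a partition table scan ends up issuing

# Commands reported above as NOT causing a volume transfer.
OBSERVED_SAFE = frozenset({TEST_UNIT_READY, INQUIRY, READ_CAPACITY_10})


def may_trigger_transfer(opcode: int) -> bool:
    """Assume anything not observed to be safe may move the volume."""
    return opcode not in OBSERVED_SAFE
```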

Regards,
Tore


