[Ksummit-2008-discuss] Suggested topic: possible

James Bottomley James.Bottomley at HansenPartnership.com
Wed Aug 6 13:47:32 PDT 2008


On Wed, 2008-08-06 at 13:23 -0700, Andrew Morton wrote:
> On Wed, 06 Aug 2008 15:35:02 -0400
> Chris Mason <chris.mason at oracle.com> wrote:
> 
> > On Wed, 2008-08-06 at 18:48 +0100, David Woodhouse wrote:
> > > On Wed, 2008-08-06 at 20:22 +0300, Eyal Shani wrote:
> > > > Sectors will end up stored according to their context, and expected
> > > > life cycle.
> > > 
> > > All of which information is plucked out of the ether, presumably, since
> > > the file system isn't allowed to be involved? And when we move stuff
> > > from one to the other, we don't consult that file system about that
> > > either, or let it do its own defragmentation or whatever other
> > > housekeeping it might want to do at the same time?
> > > 
> > > > The innovation curve in SSDs will soar, I hope, 
> > > 
> > > I believe that innovation will always be easier, cheaper and more
> > > reliable when the software can see what's going on and get involved. 
> > > 
> > 
> > To me, this is a pretty simple factor of complexity vs reward vs
> > throughput.
> > 
> > Even with spinning media, surely the filesystems could do better some of
> > the time if we controlled decisions all the way down to the disk head.
> > 
> > Some of the time, on some of the devices, with some of the filesystems.
> > With full specs about each device, updated every time they release, and
> > every time the firmware is updated.  And people think the XFS allocator
> > is complex today....
> 
> Yes.  We've spent huge amounts of effort minimising seeks and
> maximising request sizes (and we still don't do it very well).
> 
> I assume that the returns from those optimisations are much smaller
> with SSD.
> 
> I assume that the return from request-size-maximisation on SSD is
> better than the return on seek-avoidance.

Not necessarily ... larger transfers are still better because of erase
block considerations, and they do help flash a lot.

The problem is that we need to parametrise boundaries in the elevator
(as in, once you have exactly one erase block's worth of data, adding an
extra sector is actually detrimental; you should hold the request back
until the next erase block's worth arrives).
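
Purely as illustration (the hook and the erase_block_sectors parameter
are hypothetical; nothing in today's block layer reports this), the
kind of check I mean would look something like:

/*
 * Hypothetical elevator check: dispatch a write batch only when it
 * fills whole erase blocks.  Sizes in 512-byte sectors; the erase
 * block size would have to come from the device somehow.
 */
static int batch_fills_erase_blocks(unsigned int batch_sectors,
				    unsigned int erase_block_sectors)
{
	/*
	 * A batch ending just past a boundary is the worst case: the
	 * tail forces a read-modify-write of one more erase block,
	 * so hold it until the next erase block's worth arrives.
	 */
	return batch_sectors && batch_sectors % erase_block_sectors == 0;
}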

> I assume that we can beneficially throw away (ie: bypass) a large part
> of the filesystem and block layer seek-minimisation and
> request-size-maximisation code when the backing device is SSD.

One of the sad facts of flash is that writes are much more expensive
than reads, so quite a lot of the work done on the ioschedulers is
still valid (even if it was originally aimed at rotating media).
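
To put an assumed (not measured) number on that asymmetry: if a write
costs roughly 10x a read, a scheduler that spends a few extra reads to
avoid scattering writes still comes out ahead.  A toy model:

/*
 * Toy cost model; the 10x write/read ratio is an assumption, not a
 * datasheet figure.  It only shows why write-avoidance heuristics in
 * the ioschedulers keep their value on flash.
 */
enum { FLASH_READ_COST = 1, FLASH_WRITE_COST = 10 };

static unsigned int flash_batch_cost(unsigned int reads,
				     unsigned int writes)
{
	return reads * FLASH_READ_COST + writes * FLASH_WRITE_COST;
}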

> I assume that you guys know a lot more than me about my assumptions ;)
> I would be interested in hearing your opinions on all the above.  How
> many person-years worth of code did we just ditch?
> 
> Approximate numbers would be interesting.  On a spinning disk, the
> difference between ten-randomly-splattered-4k-writes and one-40k-write
> is about an order of magnitude.  How much of a benefit is it on SSD,
> present and future?

The "future" piece is debatable.  Currently SSDs have a large erase
block, which kills performance if you get writes not aligned to it and
they have writes taking 10x more than reads anyway because of the cell
erase time.
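
Back-of-envelope arithmetic (with assumed sizes, since the real
figures vary by device): a misaligned 4k write inside a 256k erase
block makes the device rewrite the whole block, i.e. 64x write
amplification:

#include <stdio.h>

/* Assumed geometry, purely for illustration. */
int main(void)
{
	unsigned int erase_block = 256 * 1024;	/* assumed erase block */
	unsigned int io_size = 4 * 1024;	/* one misaligned write */

	/* Whole block rewritten for 4k of payload: prints 64x here. */
	printf("write amplification: %ux\n", erase_block / io_size);
	return 0;
}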

The flash manufacturers portray them as superfast nonrotating media,
but forget about the above.  They then show slideware of Gen X devices
which *really (honestly this time)* are superfast nonrotating media
with no write penalties, and no-one really knows whether to believe
them, since we haven't even seen the first generation of SSDs yet.  I
suspect that until we start playing with the real things, we won't
have any idea what needs tuning or replacing in the elevators.

James



