[Openais] Corosync netmalloc TODO item
sdake at redhat.com
Wed Mar 2 09:50:20 PST 2011
On 03/01/2011 05:50 PM, Zane Bitter wrote:
> Once more, to the list this time. It seems the Reply-To header is now missing again.
> On 2011/03/01, at 12:48, Steven Dake wrote:
>> One more note: totemsrp.c also uses free on these frames (which should
>> have a corresponding free call down through the
>> totemrrp/totemnet/totemiba+totemudp+totemudpu layers).
>> A bit more on this point as I was thinking about it. An IBA frame is
>> limited to 2048 bytes or 4096 bytes depending on the kernel driver. In
>> order to use a buffer to send packets, the buffer must be posted to the
>> send queue (ibv_post_send). Once a buffer has been posted, it may not
>> be posted again until it is processed by the hardware. ibverbs delivers
>> an event when a posted buffer is processed by the hardware via a
>> completion queue (see mcast_cq_send_event_fn).
> Interesting... the man page for ibv_post_send() says that "The buffers used by a WR can only be safely reused after WR the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ)", which is open to interpretation of the word "reuse". Obviously you can't change the data and reuse the buffer for a different frame before the original one has been sent. But can you enqueue it again with the _same_ data?
With netmalloc I hadn't thought about the rrp case.
I believe the buffer can be posted to multiple queues. The reason it
can't be "reused" is because what the RDMA hardware is actually doing is
a remote dma operation on the hardware. If you were to queue the frame
in the hardware, then make changes before getting the transmitted event,
the hardware may end up transmitting a partially changed buffer.
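
To make the hazard concrete, here is a minimal sketch in plain C (no
libibverbs calls) of bookkeeping that would allow one frame to be posted
to several queues without being modified too early. The names frame,
frame_post, frame_complete, frame_writable and frame_fill are made up
for illustration and are not in corosync. It assumes every post produces
exactly one completion event. Whether totemiba would need to track this
explicitly is a separate question; the sketch only illustrates the rule.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical frame: not a corosync or ibverbs structure. */
struct frame {
	unsigned char data[2048];
	unsigned int in_flight;	/* posts that have no completion yet */
};

/* Record one post of the frame to some send queue. */
static void frame_post(struct frame *f)
{
	f->in_flight++;
}

/* Record one completion event for the frame. */
static void frame_complete(struct frame *f)
{
	assert(f->in_flight > 0);
	f->in_flight--;
}

/* The frame may be modified only once every post has completed. */
static bool frame_writable(const struct frame *f)
{
	return f->in_flight == 0;
}

/* Copy new contents in, refusing while any post is outstanding. */
static bool frame_fill(struct frame *f, const void *msg, size_t len)
{
	if (!frame_writable(f) || len > sizeof(f->data)) {
		return false;
	}
	memcpy(f->data, msg, len);
	return true;
}
```

Posting the same frame once per ring would call frame_post twice, and
frame_fill would keep refusing until both completions have arrived.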
This does create special problems for the rrp case - because rrp must
allocate one set of frames in iba which act as one global pool (vs the
current model where there are two separate pools per ring).
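
As a rough sketch of what one global pool could look like, again in
plain C with made-up names (frame_pool, pool_frame, pool_init, pool_get,
pool_put) and an arbitrary pool size. The real thing would have to
register the memory with the protection domain, which is left out here.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_FRAMES 4		/* arbitrary size for illustration */
#define FRAME_SIZE 2048

/* Hypothetical preallocated frame; 'next' links free frames together. */
struct pool_frame {
	struct pool_frame *next;
	unsigned char data[FRAME_SIZE];
};

/* One pool shared by every ring instead of separate pools per ring. */
struct frame_pool {
	struct pool_frame frames[POOL_FRAMES];
	struct pool_frame *free_head;
};

/* Thread every preallocated frame onto the free list. */
static void pool_init(struct frame_pool *p)
{
	size_t i;

	p->free_head = NULL;
	for (i = 0; i < POOL_FRAMES; i++) {
		p->frames[i].next = p->free_head;
		p->free_head = &p->frames[i];
	}
}

/* Take a frame off the free list; NULL when the pool is exhausted. */
static struct pool_frame *pool_get(struct frame_pool *p)
{
	struct pool_frame *f = p->free_head;

	if (f != NULL) {
		p->free_head = f->next;
		f->next = NULL;
	}
	return f;
}

/* Put a frame back once the hardware is done with it. */
static void pool_put(struct frame_pool *p, struct pool_frame *f)
{
	assert(f != NULL);
	f->next = p->free_head;
	p->free_head = f;
}
```

Nothing is allocated or released after pool_init, so a free call at the
upper layers only has to hand the frame back with pool_put.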
> The reason I ask is that the active rrp algorithm sends the token to all non-faulty interfaces. At the moment, the iba driver is doing a memcpy() for each of these; if it still requires a separate buffer for each outgoing frame then the best we can do is reduce the number of memcpy() calls by 1 (for the n=1 case that's still a 100% reduction, which is not nothing). I think it would also require a different interface to the totemnet_malloc() function.
>> A reference count is not needed for totemiba frames because all buffers
>> are "preallocated" (required by RDMA design), so a totemrrp_free (X)
>> operation, which would call totemnet_free (X), which would call
>> totemiba_free (X), would be a no-op.
>> One area I went wrong when I wrote the iba code originally is I
>> separated the send and receive buffer data structures into two separate
>> free lists with two separate data structures. This results in needless
>> complication and will have to be merged into one "free list" from which
>> prepared buffers can be retrieved and posted and then put back to. The
>> reason is that the way memory protection domains work (a technical
>> detail of rdma) would limit the ability of the software to work
>> properly with the current setup and a netmallocing feature. But before
>> heading down this road, I'd focus instead on keeping the current
>> totemiba behavior (of the memcpy) and get the rest of the interfaces in
> I'm happy to go ahead and implement the first patch, but I'm also trying to get my head around the iba stuff because it seems like how that works could potentially affect what the interface to the totemnet_malloc() function needs to be.