[Openais] Corosync netmalloc TODO item

Steven Dake sdake at redhat.com
Mon Mar 21 11:26:40 PDT 2011


On 03/20/2011 02:00 PM, Zane Bitter wrote:
> 
> On 2011/03/02, at 12:50, Steven Dake wrote:
> 
>> On 03/01/2011 05:50 PM, Zane Bitter wrote:
>>>
>>> On 2011/03/01, at 12:48, Steven Dake wrote:
>>>
>>>> One more note: totemsrp.c also uses free on these frames (which should
>>>> have a corresponding free call down through the
>>>> totemrrp/totemnet/totemiba+totemudp+totemudpu layers).
>>>>
>>>> A bit more on this point as I was thinking about it.  An IBA frame is
>>>> limited to 2048 bytes or 4096 bytes depending on the kernel driver.  In
>>>> order to use a buffer to send packets, the buffer must be posted to the
>>>> send queue (ibv_post_send).  Once a buffer has been posted, it may not
>>>> be posted again until it is processed by the hardware.  ibverbs delivers
>>>> an event when a posted buffer is processed by the hardware via a
>>>> completion queue (see mcast_cq_send_event_fn).
>>>
>>> Interesting... the man page for ibv_post_send() says that "The buffers used by a WR can only be safely reused after WR the request is fully executed and a work completion has been retrieved from the corresponding completion queue (CQ)", which is open to interpretation of the word "reuse". Obviously you can't change the data and reuse the buffer for a different frame before the original one has been sent. But can you enqueue it again with the _same_ data?
>>>
>>
>> With netmalloc I hadn't thought about the rrp case.
>>
>> I believe the buffer can be posted to multiple queues.  The reason it
>> can't be "reused" is because what the RDMA hardware is actually doing is
>> a remote dma operation on the hardware.  If you were to queue the frame
>> in the hardware, then make changes before getting the transmitted event,
>> the hardware may end up transmitting a partially changed buffer.
>>
>> This does create special problems for the rrp case - because rrp must
>> allocate one set of frames in iba which act as one global pool (vs the
>> current model where there are two separate pools per ring).
> 
> After some more investigation, it seems like the obstacle in the rrp case is the fact that the protection domains are tied to a particular instance (separate ibv_context). Is it possible to register the same memory in multiple protection domains at once? (That certainly *sounds* dodgy.)
> 
> Unless I am missing something, it's not clear to me that we can avoid the copy in the rrp case without resorting to something that's probably even slower. Any ideas?
> 

One option is to have a non-instanced pd registration/malloc/free
operation in the implementation of the iba network driver.  By
globalizing the buffers, you should be able to post to separate queue pairs.

It may be possible to register the same memory region with multiple
protection domains - I'll give it a test on the hardware today and let you
know later tonight.
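
A minimal sketch of the globalized-buffer idea, assuming one frame pool
shared by all rings and a per-frame count of outstanding send completions.
A frame goes back to the free list only once every ring it was posted to
has reported completion, so it is never modified while the hardware may
still be reading it. This is plain C with no ibverbs calls; frame_alloc,
frame_post, frame_send_complete, POOL_FRAMES and MAX_RINGS are hypothetical
names, not taken from the corosync tree, and whether one buffer can really
be registered with several protection domains is left open here.

```c
#include <assert.h>
#include <stddef.h>

#define FRAME_SIZE  2048 /* smaller of the two IBA frame limits quoted above */
#define POOL_FRAMES 4    /* hypothetical pool size */
#define MAX_RINGS   2    /* hypothetical: two redundant rings */

struct frame {
	unsigned char data[FRAME_SIZE];
	int pending;         /* send completions still outstanding */
	struct frame *next;  /* free-list link */
};

static struct frame pool[POOL_FRAMES];
static struct frame *free_head;

/* Build the single global (non-instanced) free list. */
static void pool_init(void)
{
	free_head = NULL;
	for (size_t i = 0; i < POOL_FRAMES; i++) {
		pool[i].pending = 0;
		pool[i].next = free_head;
		free_head = &pool[i];
	}
}

/* netmalloc-style allocation: take a frame off the global list. */
static struct frame *frame_alloc(void)
{
	struct frame *f = free_head;

	if (f != NULL) {
		free_head = f->next;
		f->next = NULL;
	}
	return f;
}

/* Record that the frame was handed to nrings send queues.
 * In a real driver this is where one post per ring would happen. */
static void frame_post(struct frame *f, int nrings)
{
	assert(f->pending == 0 && nrings > 0 && nrings <= MAX_RINGS);
	f->pending = nrings;
}

/* Called from each ring's send completion handler.
 * Returns 1 when the frame went back to the free list, 0 otherwise. */
static int frame_send_complete(struct frame *f)
{
	assert(f->pending > 0);
	if (--f->pending > 0)
		return 0;
	f->next = free_head;
	free_head = f;
	return 1;
}
```

With two rings, the first completion leaves the frame owned by the
hardware and only the second one makes it available to frame_alloc again.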

> Also:
> 
> On 2011/03/01, at 12:48, Steven Dake wrote:
> 
>> One area where I went wrong when I wrote the iba code originally is that I
>> separated the send and receive buffers into two separate free lists with
>> two separate data structures.  This results in needless complication, and
>> they will have to be merged into one "free list" from which prepared
>> buffers can be retrieved, posted, and then put back.  The reason is that
>> the way memory protection domains work (a technical detail of rdma) would
>> limit the ability of the software to work properly with the current setup
>> and a netmalloc feature.
> 
> Can you elaborate on what is driving that? It seems better to have a single list, but I'm not sure how the netmalloc makes it compulsory.
> 

Because of protection domains, as you point out above.
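
For illustration, a single free list serving both directions could look
roughly like the sketch below. It assumes every buffer is prepared the same
way up front, so a buffer last used for a send can be handed out for a
receive and vice versa. The names (net_buf, buf_get, buf_put, BUF_COUNT)
are hypothetical and the code does not touch ibverbs at all.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_SIZE  2048 /* one IBA frame, per the limit quoted earlier */
#define BUF_COUNT 8    /* hypothetical number of prepared buffers */

enum buf_use { BUF_FREE, BUF_SEND, BUF_RECV };

struct net_buf {
	unsigned char data[BUF_SIZE];
	enum buf_use use;     /* what the buffer is currently posted for */
	struct net_buf *next; /* free-list link */
};

static struct net_buf bufs[BUF_COUNT];
static struct net_buf *buf_free_list;
static int buf_free_count;

/* Put every prepared buffer on the one shared free list. */
static void buf_list_init(void)
{
	buf_free_list = NULL;
	buf_free_count = 0;
	for (size_t i = 0; i < BUF_COUNT; i++) {
		bufs[i].use = BUF_FREE;
		bufs[i].next = buf_free_list;
		buf_free_list = &bufs[i];
		buf_free_count++;
	}
}

/* Retrieve a buffer for either direction; NULL when none are left. */
static struct net_buf *buf_get(enum buf_use use)
{
	struct net_buf *b = buf_free_list;

	assert(use == BUF_SEND || use == BUF_RECV);
	if (b == NULL)
		return NULL;
	buf_free_list = b->next;
	b->next = NULL;
	b->use = use;
	buf_free_count--;
	return b;
}

/* Put a buffer back once its completion has been seen. */
static void buf_put(struct net_buf *b)
{
	assert(b->use != BUF_FREE);
	b->use = BUF_FREE;
	b->next = buf_free_list;
	buf_free_list = b;
	buf_free_count++;
}
```

The point of the single list is that there is only one place where
buffers are prepared and accounted for, whichever queue they end up on.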

> thanks,
> Zane.


