[Openais] Corosync netmalloc TODO item

Steven Dake sdake at redhat.com
Tue Mar 1 09:48:20 PST 2011


On 03/01/2011 12:28 AM, Steven Dake wrote:
> On 02/28/2011 09:54 PM, Zane Bitter wrote:
>> I'm looking at the topic-netmalloc item in the TODO file. Here is my understanding of the problem, if somebody (Steve/Andrew) could confirm or correct what follows, that would be much appreciated:
>>
>> For the APIs token_send, mcast_flush_send and mcast_noflush_send the message data is allocated on the stack in totemsrp.c. For the UDP Multicast and UDP Unicast drivers this data is simply transmitted using sendmsg(), but for the Infiniband driver it must be copied into a separate buffer that has been registered with libibverbs. In the case of the active rrp algorithm this copy potentially happens multiple times. We want to eliminate the copy by creating the message in a buffer supplied by the driver instead of on the stack. The driver would be responsible for freeing the buffer once the packet was transmitted.
>>
> 
> In current master, totemsrp.c mallocs a message data frame at lines
> 2076, 2149, 3842.  What should happen instead is totemsrp.c should call
> totemsrp_malloc which should call totemnet_malloc for its instance,
> which should call either totemudp_malloc, totemudpu_malloc, or
> totemiba_malloc.  These functions should malloc the memory to be used by
> totemsrp.
> 
> Note that these interfaces do not exist in any of the layers yet and
> need to be created.  I'd suggest making this the first part of the
> patch, with totemiba_malloc simply doing a malloc operation and
> otherwise behaving as it currently does.
> 
>> Assuming that I'm on essentially the right track here, a couple of questions arise:
>>  * If the rrp algorithm is active or passive, the totemrrp_instance contains multiple totemnet_instances. Is it valid to assume that all net instances within a given rrp instance are of the same type (i.e. all Infiniband or all UDP), or can they be mixed?
>>  * I'm not familiar with how libibverbs works; is it legitimate to use the same buffer to send multiple packets (with the same content) that are 'in-flight' at the same time? I'm assuming yes, since all of the headers that differ between packets are outside of that buffer and I can't think of a reason any lower layers would need to modify it. There would need to be some sort of reference count on the buffer, but that should not be too hard to implement.
>>

One more note: totemsrp.c also calls free on these frames, so there
should be a corresponding free call down through the
totemrrp/totemnet/totemiba+totemudp+totemudpu layers.
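
A minimal sketch of how the malloc/free calls could be passed down
through the layers.  None of this is existing Corosync code: every
sketch_* name, the ops struct, and all signatures are assumptions made
for illustration.  Since all RRP networks must be the same interface
type, the sketch assumes the rrp layer can allocate from its first net
instance.  Both drivers simply use malloc/free here, matching the
suggested first patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-driver hooks; the real Corosync structs differ. */
struct sketch_net_ops {
	void *(*buffer_alloc) (size_t size);
	void (*buffer_free) (void *ptr);
};

/* UDP multicast/unicast drivers: a plain heap frame is enough. */
static void *sketch_udp_alloc (size_t size) { return malloc (size); }
static void sketch_udp_free (void *ptr) { free (ptr); }

/* First patch: the IBA driver also just mallocs and keeps its memcpy. */
static void *sketch_iba_alloc (size_t size) { return malloc (size); }
static void sketch_iba_free (void *ptr) { free (ptr); }

const struct sketch_net_ops sketch_udp_ops = { sketch_udp_alloc, sketch_udp_free };
const struct sketch_net_ops sketch_iba_ops = { sketch_iba_alloc, sketch_iba_free };

#define SKETCH_MAX_NETS 2

struct sketch_net_instance { const struct sketch_net_ops *ops; };
struct sketch_rrp_instance {
	struct sketch_net_instance *nets[SKETCH_MAX_NETS];
	size_t net_count;
};
struct sketch_srp_instance { struct sketch_rrp_instance *rrp; };

/* net layer: dispatch to whichever driver backs this instance. */
void *sketch_net_malloc (struct sketch_net_instance *net, size_t size)
{
	return net->ops->buffer_alloc (size);
}

void sketch_net_free (struct sketch_net_instance *net, void *ptr)
{
	net->ops->buffer_free (ptr);
}

/* rrp layer: all nets share one type, so net 0 is used (assumption). */
void *sketch_rrp_malloc (struct sketch_rrp_instance *rrp, size_t size)
{
	assert (rrp->net_count > 0);
	return sketch_net_malloc (rrp->nets[0], size);
}

void sketch_rrp_free (struct sketch_rrp_instance *rrp, void *ptr)
{
	assert (rrp->net_count > 0);
	sketch_net_free (rrp->nets[0], ptr);
}

/* srp layer: what totemsrp.c would call instead of malloc/free. */
void *sketch_srp_malloc (struct sketch_srp_instance *srp, size_t size)
{
	return sketch_rrp_malloc (srp->rrp, size);
}

void sketch_srp_free (struct sketch_srp_instance *srp, void *ptr)
{
	sketch_rrp_free (srp->rrp, ptr);
}
```

With this shape, changing how the IBA driver allocates later would only
touch the two sketch_iba_* hooks; the callers in the upper layers stay
the same.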

A bit more on this point as I was thinking about it.  An IBA frame is
limited to 2048 bytes or 4096 bytes depending on the kernel driver.  In
order to use a buffer to send packets, the buffer must be posted to the
send queue (ibv_post_send).  Once a buffer has been posted, it may not
be posted again until it is processed by the hardware.  ibverbs delivers
an event when a posted buffer is processed by the hardware via a
completion queue (see mcast_cq_send_event_fn).
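
The posting rule can be modelled with a toy state flag.  This is only
an illustration and does not call libibverbs: sketch_post_send stands
in for ibv_post_send, sketch_send_completion stands in for the
completion queue handler, and all sketch_* names are hypothetical.  The
2048 byte frame size is one of the two limits mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_FRAME_SIZE 2048

struct sketch_send_buf {
	unsigned char data[SKETCH_FRAME_SIZE];
	bool posted;	/* true while the "hardware" still owns the buffer */
};

/*
 * Stand-in for posting to the send queue.  Returns 0 on success and
 * -1 if the buffer is still in flight from an earlier post.
 */
int sketch_post_send (struct sketch_send_buf *buf)
{
	assert (buf != NULL);
	if (buf->posted) {
		return -1;
	}
	buf->posted = true;
	return 0;
}

/* Stand-in for the completion event: the buffer may be posted again. */
void sketch_send_completion (struct sketch_send_buf *buf)
{
	assert (buf != NULL);
	buf->posted = false;
}
```

In this model a second post of the same buffer is rejected until the
completion has been seen, which is why one buffer cannot simply be
shared by several in-flight sends.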

A reference count is not needed for totemiba frames because all buffers
are "preallocated" (required by RDMA design), so a totemrrp_free (X)
operation, which would call totemnet_free (X), which would call
totemiba_free (X), would be a no-op.

One area where I went wrong when I originally wrote the iba code is that
I separated the send and receive buffers into two separate free lists
with two separate data structures.  This results in needless
complication, and the two will have to be merged into one "free list"
from which prepared buffers can be retrieved, posted, and then put back.
The reason is that the way memory protection domains work (a technical
detail of rdma) would limit the ability of the software to work properly
with the current setup and a netmallocing feature.  But before heading
down this road, I'd focus instead on keeping the current totemiba
behavior (the memcpy) and getting the rest of the interfaces in shape.
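
For reference, a single free list of preallocated buffers might look
roughly like the sketch below.  It is not taken from totemiba.c; the
sketch_* names, the pool size, and the frame size are all assumptions,
and registering the memory with libibverbs is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_POOL_BUFS 4
#define SKETCH_FRAME_SIZE 2048

struct sketch_buf {
	struct sketch_buf *next;	/* link used while on the free list */
	unsigned char data[SKETCH_FRAME_SIZE];
};

/* One pool and one free list shared by send and receive paths. */
struct sketch_pool {
	struct sketch_buf bufs[SKETCH_POOL_BUFS];
	struct sketch_buf *free_head;
};

/* Put every preallocated buffer on the free list. */
void sketch_pool_init (struct sketch_pool *pool)
{
	size_t i;

	assert (pool != NULL);
	pool->free_head = NULL;
	for (i = 0; i < SKETCH_POOL_BUFS; i++) {
		pool->bufs[i].next = pool->free_head;
		pool->free_head = &pool->bufs[i];
	}
}

/* Take a prepared buffer, or NULL if all of them are in use. */
struct sketch_buf *sketch_pool_get (struct sketch_pool *pool)
{
	struct sketch_buf *buf = pool->free_head;

	if (buf != NULL) {
		pool->free_head = buf->next;
	}
	return buf;
}

/* Return a buffer once the hardware is done with it. */
void sketch_pool_put (struct sketch_pool *pool, struct sketch_buf *buf)
{
	assert (buf != NULL);
	buf->next = pool->free_head;
	pool->free_head = buf;
}
```

In this arrangement the put would be driven by the completion event
rather than by the caller, which fits the point above that the free
call from the upper layers can be a no-op for totemiba.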

Regards
-steve

> 
> Yes we require all RRP networks to be the same interface type.
> libibverbs is somewhat complicated - I hardly understand it myself :).
> I'd focus on the interfaces for mallocing through the stack at first,
> and then attack the totemiba case as a separate patch.
> 
> Once you get the first patch in place, I can guide you on the totemiba.c
> changes that are necessary.
> 
> Regards
> -steve
> 
>> thanks!
>> - Zane.
>> _______________________________________________
>> Openais mailing list
>> Openais at lists.linux-foundation.org
>> https://lists.linux-foundation.org/mailman/listinfo/openais
> 


