<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>

<head>

  <title></title>

</head>

<body>

<small>See below.<br>

<br>

/Jon</small><br>

<br>

Andy Pfiffer wrote:<br>

<blockquote type="cite"

 cite="mid1045168134.18250.34.camel@andyp.pdx.osdl.net">

  <pre wrap="">On Thu, 2003-02-13 at 11:13, Rod Van Meter wrote:

  </pre>

  <blockquote type="cite">

    <pre wrap="">First off, this looks like very useful functionality.  I'm happy to

see it.  And it comes with documentation, too!

    </pre>

  </blockquote>

  <pre wrap="">

I've made a first pass through the PDF file that describes what it is

and how it does it.

Cold-medicine-induced rambling follows: ;^)

The first thing that strikes me is that it is similar in routing and

neighbor discovery to several distributed memory message-passing systems

developed during the mid-'80's.  They were characterized as systems

composed of nodes connected only by point-to-point networks, and all

routing was performed by store-and-forward of messages by the nodes

within the system.  Several platforms were built this way, including

those based upon INMOS Transputers, the early nCUBE, the Intel iPSC 1,

and a few-other early hypercubes.</pre>

</blockquote>

<small>You are right. The protocol is focused on intra-cluster communication<br>

with full connectivity, and when configuring otherwise one has to be<br>

aware of the limitations. It is not described in the document yet, but the<br>

way inter-cluster links are set up now means that each processor has at least<br>

two links to any neighbouring cluster. Hence there will be at most one routing<br>

step at TIPC level per message. Even this can be avoided if the most heavily<br>

trafficked nodes can be identified. One can manually set up links between a

given<br>

pair of processors.<br>

Having said this, the inter-cluster links have not been used within our own<br>

products, so we do not really know their limitations. Until recently<br>

we have recommended the use of TCP for inter-cluster communication.<br>

<br>

On the other hand, "full connectivity" may mean setting up a geographically<br>

distributed cluster using an IP-based protocol (UDP, SCTP, TCP...), and passing

<br>

through one or more routers without TIPC being aware of this.</small><br>

<blockquote type="cite"

 cite="mid1045168134.18250.34.camel@andyp.pdx.osdl.net">

  <pre wrap="">

If I understand the document correctly, the "Hello" mechanism and

routing table discovery/maintenance used by TIPC can have scaling

complications on very large systems (100's to 1000's of communicating

agents) when configured with insufficient internal connectivity.  A

spanning tree-based algorithm for maintaining the routing tables looks

like the ideal solution (nearly all of the needed adjacency information

is present) to apply here, rather than hardware-based broadcast on a

subnet or software-based "replicast."</pre>

</blockquote>

<small>I cannot see that the "hello" mechanism is a limitation; it is only

a broadcast/<br>

multicast sent out over a limited period of time, using (now) an exponential<br>

backoff algorithm to determine frequency.<br>

<br>

A potentially bigger problem is the background supervision of links when<br>

we have hundreds of them in each processor. I have made some calculations<br>

on this, and with modern processor speed, memory amounts and bandwidth<br>

available we should be able to handle clusters with ~1000 nodes without<br>

any significant background load. (Remember that while the total number of<br>

links grows as (nodes^2) for the whole cluster, the number of links to maintain

<br>

*per node* still only grows at a linear rate.) This would have been a problem

<br>

in the '90s, but as long as processor speed keeps evolving (a lot) faster

than <br>

cluster sizes this does not pose any serious problem. Moore's law is still

valid.<br>

<br>

About keeping naming and routing tables up to date, there are certainly better<br>

ways of doing this, but it has served us well so far, with clusters of ~50

processors.<br>

Maybe something for the TODO list...</small><br>
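<small>The exponential backoff mentioned above for the "hello" broadcasts can be
sketched roughly as follows. This is only an illustration in Python, not code
from TIPC: the function name, the initial interval, the doubling factor, the
cap and the number of broadcasts are all assumptions chosen for the example.</small><br>

<pre wrap="">
```python
def hello_intervals(initial_ms=125, factor=2, cap_ms=2000, count=8):
    """Return the delays (in ms) between successive discovery broadcasts.

    All parameter values are hypothetical; they are not TIPC's real timers.
    'count' models the limited period during which broadcasts are sent.
    """
    intervals = []
    delay = initial_ms
    for _ in range(count):
        intervals.append(delay)
        # Back off exponentially, but never beyond the cap.
        delay = min(delay * factor, cap_ms)
    return intervals


if __name__ == "__main__":
    print(hello_intervals())
```
</pre>

<small>With these assumed values the broadcasts quickly become infrequent, so the
discovery traffic stays small no matter how long a node waits for its peers.</small><br>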

<blockquote type="cite"

 cite="mid1045168134.18250.34.camel@andyp.pdx.osdl.net">

  <pre wrap="">

The "zone" abstraction is also similar to techniques developed for

buffer management and flow-control in the high-performance

message-passing present on systems like the Intel iPSC2 and the Intel

Paragon.  In those systems, all-to-all communication needed to be

supported, but the O(N^2) time and space requirements rapidly became

prohibitive with 100's of nodes.  </pre>

</blockquote>

<small>As already said, this was an insurmountable problem some years ago,

but <br>

not with today's processors and switches.</small> <small>1000 nodes means

1998 links<br>

to maintain per node, meaning one supervision timer expiring every 0.5 ms

<br>

with today's supervision rate. And in most cases the timer will wake up to

do nothing,<br>

given the way the protocol works.</small><br>

<small>Certainly a challenge, but not impossible, and we are talking about

an extreme case<br>

which I don't think is very relevant right now. The day we see such clusters,

I am <br>

certain that we will have processors to deal with it as well.</small><br>
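<small>The figures above can be reproduced with a little arithmetic. The sketch
below is only an illustration: the function names are made up, and both the
two links per peer and the supervision interval of roughly one second per link
are assumptions inferred from the quoted numbers (1998 links, one timer every
0.5 ms), not values taken from the TIPC document.</small><br>

<pre wrap="">
```python
def links_per_node(nodes, links_per_peer=2):
    """Links one node maintains: grows linearly with cluster size."""
    # Assumption: every node keeps 'links_per_peer' links to each other node.
    return (nodes - 1) * links_per_peer


def total_links(nodes, links_per_peer=2):
    """Links in the whole cluster: grows as nodes^2."""
    # Each link joins two nodes, so halve the sum over all nodes.
    return nodes * links_per_node(nodes, links_per_peer) // 2


def timer_spacing_ms(nodes, supervision_interval_ms=1000.0, links_per_peer=2):
    """Average time between supervision timer expirations on one node."""
    # Assumption: each link's timer fires once per supervision interval.
    return supervision_interval_ms / links_per_node(nodes, links_per_peer)


if __name__ == "__main__":
    for n in (50, 1000):
        print(n, links_per_node(n), total_links(n), timer_spacing_ms(n))
```
</pre>

<small>For 1000 nodes this gives 1998 links per node and a timer expiring about
every 0.5 ms, while the cluster-wide total is close to a million links.</small><br>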

<blockquote type="cite"

 cite="mid1045168134.18250.34.camel@andyp.pdx.osdl.net">

  <pre wrap="">Internally, NX message passing

maintained an LRU of "nearest logical neighbors", and transparently

handled the attach/detach dynamically between one node and a set of

other nodes.  TIPC appears to be similar, at least in the description,

of that kind of behavior.

I'm curious as to the behavior of the protocol in some of the strange

boundary conditions, as in the case where the reroute counter of a

message has expired and the system is attempting to return it to the

sender, what happens if all routes to the original sender are cut or if

the sender has been removed?</pre>

</blockquote>

<small>It will be dropped. What else is there to do?</small><br>

<blockquote type="cite"

 cite="mid1045168134.18250.34.camel@andyp.pdx.osdl.net">

  <pre wrap="">

Andy

  </pre>

</blockquote>

<br>

</body>

</html>