<div class="gmail_quote">On Wed, Jan 20, 2010 at 1:04 PM, Brad Hudson <span dir="ltr">&lt;<a href="mailto:hudson@pythian.com">hudson@pythian.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">

Hi all;<br>

<br>

I have an odd problem that I have been dealing with for a week.  I was<br>

hoping someone could help, or point me in the right direction for clues.<br>

<br>

I have a standard bridge setup.  br0 is composed of eth0 and eth1.<br>

<br>

# brctl show br0<br>

bridge name     bridge id               STP enabled     interfaces<br>

br0             8000.000c292280b9       no              eth0<br>

                                                        eth1<br>

<br>

Eth0 and eth1 both have 0.0.0.0 (no) address assigned and are up.  br0<br>

is assigned the proper IP and the routing table is correct.  STP is off.<br>

<br>

I have been losing connectivity to hosts inside the local segment of the<br>

bridge.  Some investigation has revealed that the problem is related to<br>

arp not working correctly.  Arp packets going this way<br>

<br>

eth1-&gt;br0-&gt;eth0-&gt;network/internet<br>

<br>

have no problems at all.  The replies coming back the other way all get<br>

to br0, but only 33% (approx, it varies) make it to the eth1 side of the<br>

bridge.  I have verified this traffic pattern by tcpdump of arp packets<br>

through each of these devices while doing an nmap -sP of the /24 network<br>

to generate both arp and icmp.  We are not able to arp any host outside<br>

our local segment, including the default gateway (which is owned by the<br>

co-lo).  nmapping from the bridging server itself from interface br0<br>

gets the correct number of arp replies.<br>
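
<br>

A minimal sketch of how the br0-versus-eth1 comparison above could be<br>

scripted from saved tcpdump text.  It assumes captures taken with<br>

tcpdump -n and a line shape like &quot;ARP, Reply 10.0.0.5 is-at<br>

00:0c:29:aa:bb:cc&quot;, which varies between tcpdump versions; the function<br>

names are illustrative only.<br>

<pre>
```python
import re

# Assumed line shape (tcpdump -n); adjust the pattern for other versions:
#   "12:00:00.000001 ARP, Reply 10.0.0.5 is-at 00:0c:29:aa:bb:cc, length 46"
REPLY_RE = re.compile(r"ARP, Reply (\S+) is-at ([0-9a-fA-F:]+)")


def replying_hosts(lines):
    """Return the set of IP addresses that sent an ARP reply."""
    hosts = set()
    for line in lines:
        match = REPLY_RE.search(line)
        if match:
            hosts.add(match.group(1))
    return hosts


def delivery_report(br0_lines, eth1_lines):
    """Compare replies seen on br0 with those that also reached eth1.

    Returns (fraction of replying hosts seen on both, sorted list of
    hosts whose replies were seen on br0 but never on eth1).
    """
    on_br0 = replying_hosts(br0_lines)
    on_eth1 = replying_hosts(eth1_lines)
    delivered = on_br0.intersection(on_eth1)
    lost = sorted(on_br0.difference(on_eth1))
    ratio = len(delivered) / len(on_br0) if on_br0 else 0.0
    return ratio, lost
```
</pre>

Feeding it the two capture files as lists of lines gives the delivered<br>

fraction and the hosts whose replies were dropped, which makes it easier<br>

to check whether the same hosts are lost on each run.<br>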

<br>

ebtables and arp_tables are not running, and adding them has not<br>

changed the result.  There was a server with 2 NICs, each with an IP on<br>

the same subnet, that was causing some MAC flapping, but that has been<br>

fixed with no change to the described behaviour.  All items in<br>

/proc/sys/net/bridge are set to &#39;1&#39;, but setting them to &#39;0&#39; has no<br>

effect.  The server hosting the bridge has been rebooted several times<br>

with no effect.  proxy_arp does not help at all.  I also tried<br>

parprouted with no success.<br>

<br>

A couple other notes.<br>

<br>

- This behaviour suddenly appeared about a week ago.  I think it is<br>

probably related to an increase in network traffic, but it&#39;s hard to say<br>

and the client does not accept that explanation.  If it were a matter of<br>

all or nothing, there are places to look for that, but in this case<br>

the problem is intermittent and the lost arp replies are not the same<br>

every time.<br>

- In another test we found that if we ping the inside server from the<br>

firewall and also from an external machine the connectivity to the<br>

inside server dies.  Once the pings are stopped, the connectivity<br>

eventually returns.  If I ping out from the inside server while doing<br>

that test, the session keeps going through without hanging.<br>

- The firewall is a VM running under ESX.  The vmxnet driver has been<br>

reinstalled and the pcnet32 driver is not loaded.  Both NICs are virtual<br>

so there is no chance of failed hardware, though I suppose the problem<br>

could be in the ESX layer.  I have made some attempt to diagnose the ESX<br>

layer but nothing jumps out at me.<br>

<br>

I have been watching tcpdumps and do not see any sign of frags, dupes,<br>

or anything that would cause lost packets.  I have combed the<br>

newsgroups, google and even irc looking for clues or similar situations,<br>

but nothing I have found fits the profile.<br>

<br>

The workaround we currently have in place is to make a static arp entry<br>

for the gateway on all servers on the inside.  This is not ideal because<br>

the co-lo controls the router and it could fail over to another device<br>

which would kill our route again.<br>
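
<br>

A hedged sketch of the static-entry workaround described above, using the<br>

iproute2 form ip neigh replace ... nud permanent.  The gateway IP, MAC<br>

and device in the usage note are placeholders, not values from this<br>

network, and the helper names are hypothetical; by default the function<br>

only returns the command string instead of running it.<br>

<pre>
```python
import subprocess


def static_arp_command(gateway_ip, gateway_mac, device):
    """Build the iproute2 command that pins a permanent neighbour entry."""
    return [
        "ip", "neigh", "replace", gateway_ip,
        "lladdr", gateway_mac,
        "dev", device,
        "nud", "permanent",
    ]


def pin_gateway(gateway_ip, gateway_mac, device, dry_run=True):
    """Return the command line; run it too (needs root) when dry_run is False."""
    cmd = static_arp_command(gateway_ip, gateway_mac, device)
    if not dry_run:
        # check=True raises CalledProcessError if ip exits non-zero
        subprocess.run(cmd, check=True)
    return " ".join(cmd)
```
</pre>

For example, pin_gateway(&quot;192.0.2.1&quot;, &quot;00:11:22:33:44:55&quot;, &quot;eth0&quot;) returns<br>

the command line without touching the neighbour table.  The entry still<br>

has to be updated by hand if the router fails over to a device with a<br>

different MAC address.<br>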

<br>

Can anyone suggest where I can look for clues, or which settings I<br>

should check?  I am out of ideas at this point.<br>

<br>

Your help is very much appreciated.<br>

<br>

Regards;<br>

<br>

Brad<br>

<br>

<br>

<br>

--<br>

Brad Hudson<br>

SA Team Lead<br>

The Pythian Group - love your data<br>

Desk: 613-565-8696 x202<br>

IM: pythianhudson<br><br></blockquote><div class="gmail_quote"><br></div>I assume you have multiple physical NICs connected to your virtual switch. If so, I&#39;ve posted my findings on my web page <a href="http://robert.leblancnet.us">http://robert.leblancnet.us</a>, and two days ago I posted a message to this forum entitled &quot;Need help writing ebtables rules&quot;. I&#39;m not sure my messages are getting through, as I&#39;ve sent a few with no one responding. If we can work together to solve the problem, we can both benefit.</div>

<div class="gmail_quote"><br></div><div class="gmail_quote">Thanks,<br clear="all"><br>Robert LeBlanc<br>Life Sciences &amp; Undergraduate Education Computer Support<br>Brigham Young University </div><br>