<div class="gmail_quote">On Wed, Jan 20, 2010 at 1:04 PM, Brad Hudson <span dir="ltr">&lt;<a href="mailto:hudson@pythian.com">hudson@pythian.com</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
Hi all;<br>
<br>
I have an odd problem that I have been dealing with for a week. I was<br>
hoping someone could help, or point me in the right direction for clues.<br>
<br>
I have a standard bridge setup. br0 is composed of eth0 and eth1.<br>
<br>
# brctl show br0<br>
bridge name bridge id STP enabled interfaces<br>
br0 8000.000c292280b9 no eth0<br>
eth1<br>
<br>
Eth0 and eth1 both have 0.0.0.0 (no) address assigned and are up. br0<br>
is assigned the proper IP and the routing table is correct. STP is off.<br>
<br>
I have been losing connectivity to hosts inside the local segment of the<br>
bridge. Some investigation has revealed that the problem is related to<br>
arp not working correctly. Arp packets going this way<br>
<br>
eth1->br0->eth0->network/internet<br>
<br>
have no problems at all. The replies coming back the other way all get<br>
to br0, but only 33% (approx, it varies) make it to the eth1 side of the<br>
bridge. I have verified this traffic pattern by tcpdump of arp packets<br>
through each of these devices while doing an nmap -sP of the /24 network<br>
to generate both arp and icmp. We are not able to arp any host outside<br>
our local segment, including the default gateway (which is owned by the<br>
co-lo). nmapping from the bridging server itself from interface br0<br>
gets the correct number of arp replies.<br>
<br>
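The per-interface comparison described above can be sketched roughly as follows. This is a minimal sketch, assuming tcpdump text output has been saved per interface; the helper names count_arp_replies and reply_ratio are hypothetical, and the pattern is an assumption meant to match both the newer "ARP, Reply" and older "arp reply" spellings.<br>

```shell
# Count ARP reply lines in tcpdump text output read from stdin.
# The pattern is an assumption: it accepts "ARP, Reply" and "arp reply".
count_arp_replies() {
    grep -ciE 'arp,? reply' || true
}

# Print the share of replies seen on the inner side of the bridge
# relative to the outer side, as a whole-number percentage.
reply_ratio() {
    outer=$1
    inner=$2
    if [ "$outer" -eq 0 ]; then
        echo "n/a"
    else
        echo "$(( inner * 100 / outer ))%"
    fi
}

# Example use (capturing needs root; interface names are from the post):
#   tcpdump -n -l -i eth0 arp | tee eth0.txt
#   tcpdump -n -l -i eth1 arp | tee eth1.txt
#   reply_ratio "$(count_arp_replies < eth0.txt)" "$(count_arp_replies < eth1.txt)"
```

Running both captures during the same nmap -sP sweep keeps the two counts comparable.<br>
<br>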
ebtables and arp_tables are not running, and adding them in has had no<br>
change in result. There was a server with 2 NICs, each with an IP on<br>
the same subnet, that was causing some MAC flapping, but that has been<br>
fixed with no change to the described behaviour. All items in<br>
/proc/sys/net/bridge are set to '1', but setting them to '0' has no<br>
effect. The server hosting the bridge has been rebooted several times<br>
with no effect. proxy_arp does not help at all. I also tried<br>
parprouted with no success.<br>
<br>
A couple other notes.<br>
<br>
- This behaviour suddenly appeared about a week ago. I think it is<br>
probably related to an increase in network traffic, but it's hard to say,<br>
and the client does not accept that explanation. If it were a case of<br>
nothing working or everything working, there would be places to look, but here<br>
the problem is intermittent and the lost arp replies are not the same<br>
every time.<br>
- In another test we found that if we ping the inside server from the<br>
firewall and also from an external machine the connectivity to the<br>
inside server dies. Once the pings are stopped, the connectivity<br>
eventually returns. If I ping out from the inside server while doing<br>
that test, the session keeps going through without hanging.<br>
- The firewall is a VM running under ESX. The vmxnet driver has been<br>
reinstalled and the pcnet32 driver is not loaded. Both NICs are virtual<br>
so there is no chance of failed hardware, though I suppose the problem<br>
could be on the ESX layer. I have made some attempt to diagnose the ESX<br>
layer but nothing jumps out at me.<br>
<br>
I have been watching tcpdumps and do not see any sign of frags, dupes,<br>
or anything that would cause lost packets. I have combed the<br>
newsgroups, google and even irc looking for clues or similar situations,<br>
but nothing I have found fits the profile.<br>
<br>
The workaround we currently have in place is to make a static arp entry<br>
for the gateway on all servers on the inside. This is not ideal because<br>
the co-lo controls the router and it could fail over to another device<br>
which would kill our route again.<br>
<br>
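For reference, the static-entry workaround above might look something like the sketch below. The gateway address, MAC address and helper name static_arp_cmd are placeholders chosen for illustration, not values from this setup. The helper only prints the iproute2 command so it can be reviewed before being run as root.<br>

```shell
# Print (do not execute) an iproute2 command that pins a permanent
# neighbour entry for a gateway. All arguments are caller-supplied.
static_arp_cmd() {
    ip_addr=$1
    mac_addr=$2
    dev=$3
    printf 'ip neigh replace %s lladdr %s dev %s nud permanent\n' \
        "$ip_addr" "$mac_addr" "$dev"
}

# Placeholder values (documentation ranges), not the real gateway:
static_arp_cmd 192.0.2.1 00:00:5e:00:53:01 eth0
```

As noted above, an entry like this goes stale if the co-lo router fails over to a device with a different MAC address.<br>
<br>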
Can anyone suggest anywhere I can look for clues, or settings I should<br>
check? I am out of ideas at this point.<br>
<br>
Your help is very much appreciated.<br>
<br>
Regards;<br>
<br>
Brad<br>
<br>
<br>
<br>
--<br>
Brad Hudson<br>
SA Team Lead<br>
The Pythian Group - love your data<br>
Desk: 613-565-8696 x202<br>
IM: pythianhudson<br><br></blockquote><div class="gmail_quote"><br></div>I assume you have multiple physical NICs connected to your virtual switch. If so, I've posted my findings on my web page <a href="http://robert.leblancnet.us">http://robert.leblancnet.us</a>, and two days ago I posted a message to this forum entitled "Need help writing ebtables rules". I'm not sure my messages are getting through, as I've sent a few with no one responding. If we can work together to solve the problem, we can both benefit.</div>
<div class="gmail_quote"><br></div><div class="gmail_quote">Thanks,<br clear="all"><br>Robert LeBlanc<br>Life Sciences & Undergraduate Education Computer Support<br>Brigham Young University </div><br>