badercchmc's Posts

We have been looking into issues with NFS file copies during a node reboot, like with a rolling upgrade. EMC states:

"OneFS 8: NFSv4 failover - with the introduction of NFSv4 failover, when a client's virtual IP address moves, or a OneFS group change event occurs, the client application will continue without disruption. As such, no unexpected I/O error will propagate back up to the client application. In OneFS 8.0, both NFSv3 and NFSv4 clients can now use dynamic..."

In working with EMC, it looks like this can be affected by the Linux kernel, file size, and SmartConnect. To summarize with CentOS: it works with kernel 3.10.0-514.el7.x86_64 but fails with 3.10.0-862.el7.x86_64 when copying a 5 GB file and rebooting a node. Per EMC, we are going to verify that the issue happens with Red Hat and then open a call. Has anyone else seen this?

-------------------------
Detailed explanation from EMC
-------------------------

The NFS SME wanted to go back over everything before moving forward with getting Engineering involved. He found the smoking gun that explains the client behavior from our original packet capture a few weeks ago. The pcaps indicate that a client running the affected kernel isn't properly supplying the file handle during the PUTFH operation.

Here is the connection from node 2's perspective before failover is induced:

Isilon-dev-2.lagg1_09072018_134025.pcap

1029  12.559173 XX.XXX.XXX..16 → XX.XXX.XXX..30 NFS 270 V4 Call OPEN_CONFIRM
1030  12.559263 XX.XXX.XXX..30 → XX.XXX.XXX..16 NFS 142 V4 Reply (Call In 1029) OPEN_CONFIRM
1031  12.559376 XX.XXX.XXX..16 → XX.XXX.XXX..30 NFS 302 V4 Call SETATTR FH: 0xa577051b
1032  12.561461 XX.XXX.XXX..30 → XX.XXX.XXX..16 NFS 318 V4 Reply (Call In 1031) SETATTR
1122  12.931344 XX.XXX.XXX..16 → XX.XXX.XXX..30 NFS 31926 V4 Call WRITE StateID: 0xcddd Offset: 0 Len: 1048576 [TCP segment of a reassembled PDU]
1136  12.936254 XX.XXX.XXX..30 → XX.XXX.XXX..16 NFS 206 V4 Reply (Call In 1122) WRITE

tshark -r Isilon-dev-2.lagg1_09072018_134025.pcap -O nfs -Y "frame.number==1029"

Frame 1029: 270 bytes on wire (2160 bits), 270 bytes captured (2160 bits)
Ethernet II, Src: Vmware_84:2c:6f (00:50:56:84:2c:6f), Dst: Broadcom_77:c4:f0 (00:0a:f7:77:c4:f0)
802.1Q Virtual LAN, PRI: 0, CFI: 0, ID: 2215
Internet Protocol Version 4, Src: XX.XXX.XXX..16, Dst: XX.XXX.XXX..30
Transmission Control Protocol, Src Port: 872, Dst Port: 2049, Seq: 877, Ack: 765, Len: 200
Remote Procedure Call, Type: Call XID: 0x40a1d0c1
Network File System, Ops(2): PUTFH, OPEN_CONFIRM
    [Program Version: 4]
    [V4 Procedure: COMPOUND (1)]
    Tag: <EMPTY>
        length: 0
        contents: <EMPTY>
    minorversion: 0
    Operations (count: 2): PUTFH, OPEN_CONFIRM
        Opcode: PUTFH (22)
            filehandle
                length: 53
                [hash (CRC-32): 0xa577051b]
                filehandle: 011f0000000200c50201000000ffffffff00000000020000...
        Opcode: OPEN_CONFIRM (20)
            stateid
                [StateID Hash: 0xc32a]
                seqid: 0x00000001
                Data: 019842390100000000000000
                [Data hash (CRC-32): 0x57d33b9b]
            seqid: 0x00000001
    [Main Opcode: OPEN_CONFIRM (20)]
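A quick way to spot this signature across an entire capture is to list every PUTFH with its filehandle length in one pass. This is only a sketch; the field names (nfs.opcode, nfs.fh.length, nfs.fh.hash) are taken from a recent Wireshark build and may vary by version:

tshark -r Isilon-dev-2.lagg1_09072018_134025.pcap -Y "nfs.opcode == 22" \
  -T fields -e frame.number -e nfs.fh.length -e nfs.fh.hash

A healthy client shows a non-zero filehandle length (53 above) on every PUTFH; the failing case shows up as a length of 0, as in the node 3 capture below.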
Here is where the connection fails after moving over to node 3. We see that failover occurs and the client re-establishes the connection, but it fails to provide a file handle during that PUTFH operation. This is why the cluster is returning NFS4ERR_BADHANDLE at that point.

Isilon-dev-3.lagg1_09072018_134025.pcap

25390  28.650376 XX.XXX.XXX..16 → XX.XXX.XXX..30 NFS 214 V4 Call OPEN_CONFIRM
25391  28.650455 XX.XXX.XXX..30 → XX.XXX.XXX..16 NFS 118 V4 Reply (Call In 25390) PUTFH Status: NFS4ERR_BADHANDLE

tshark -r Isilon-dev-3.lagg1_09072018_134025.pcap -O nfs -Y "frame.number==25390"

Frame 25390: 214 bytes on wire (1712 bits), 214 bytes captured (1712 bits)
Ethernet II, Src: Vmware_84:2c:6f (00:50:56:84:2c:6f), Dst: QlogicCo_a5:54:00 (00:0e:1e:a5:54:00)
802.1Q Virtual LAN, PRI: 0, CFI: 0, ID: 2215
Internet Protocol Version 4, Src: XX.XXX.XXX..16, Dst: XX.XXX.XXX..30
Transmission Control Protocol, Src Port: 772, Dst Port: 2049, Seq: 782969877, Ack: 42129, Len: 144
Remote Procedure Call, Type: Call XID: 0xd0a5d0c1
Network File System, Ops(2): PUTFH, OPEN_CONFIRM
    [Program Version: 4]
    [V4 Procedure: COMPOUND (1)]
    Tag: <EMPTY>
        length: 0
        contents: <EMPTY>
    minorversion: 0
    Operations (count: 2): PUTFH, OPEN_CONFIRM
        Opcode: PUTFH (22)
            filehandle
                length: 0
        Opcode: OPEN_CONFIRM (20)
            stateid
                [StateID Hash: 0x5a23]
                seqid: 0x00000001
                Data: 013830d0ac03000000000000
                [Data hash (CRC-32): 0xc2270b06]
            seqid: 0x00000003
    [Main Opcode: OPEN_CONFIRM (20)]

They believe this to be fairly definitive evidence that the client kernel's behavior here is something Isilon likely has no control over. We can create a knowledge base article for awareness surrounding the issue, but this wouldn't be going up to Dev based on those findings, according to the L3.

Best regards,
Technical Support Engineer, Global Support Center

-------------------------
My test notes
-------------------------

I used VMware Player with a three-node OneFS 8.0.0.7 simulator and a CentOS 7 client. The networking was all isolated to VMware Player, with no external network access.

Copying small files (10 MB) mounted via NFSv4 to the SmartConnect IP and rebooting the node worked. The file copy would pause on one file, then pick up, and the copy would continue. All files looked good with MD5.

Copying small files (10 MB) mounted via a node's IP (we used the same IP we received in the previous example) and rebooting the node DID NOT work. The file copy would pause on one file, then pick up, and the copy would continue. All files looked good with MD5 except for the one that failed.

Copying a large file (5 GB) mounted via NFSv4 to the SmartConnect IP and rebooting the node did NOT work. We got an I/O error.

Copying a large file (5 GB) mounted via a node's IP (we used the same IP we received in the previous example) and rebooting the node did NOT work. We got an I/O error.
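For anyone who wants to recreate the large-file case, here is a rough sketch of the steps. The SmartConnect zone name (isilon-sc.example.com), export path (/ifs/data), mount point, and file name are placeholders; substitute your own, and reboot one node from the cluster side while the copy is running.

mkdir -p /mnt/isilon
mount -t nfs -o vers=4.0 isilon-sc.example.com:/ifs/data /mnt/isilon

# build a 5 GB test file and record its checksum
dd if=/dev/urandom of=/tmp/testfile.5g bs=1M count=5120
md5sum /tmp/testfile.5g

# start the copy, reboot a node while it runs, then compare checksums
cp /tmp/testfile.5g /mnt/isilon/testfile.5g
md5sum /mnt/isilon/testfile.5g

------------------------------------------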
So I have the port (12228) from the Isilon to the CEE server, but then the CEE server sends the data on to Varonis (in our case). From Varonis I got this info: "As you confirmed, port 135 is used for the initial call to the CEE/CEPA server. From there, any one of the ports within the standard dynamic port range is utilized. I may have mentioned before that I can't speak to why any particular port is selected from that range, but it will always be between 49152 and 65535. In this case we see that the port used is just a bit higher than the low end of the range."
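If you need to confirm exactly which ports the CEE server opens toward the collector, one option is to read a packet capture taken on either side and list the destination ports of new connections. This is only a sketch; the capture file name and the collector IP (192.0.2.50) are placeholders for your own values:

tshark -r cee_to_collector.pcap \
  -Y "ip.dst == 192.0.2.50 && tcp.flags.syn == 1 && tcp.flags.ack == 0" \
  -T fields -e tcp.dstport | sort -n | uniq -c

You should see 135 for the initial call plus whichever dynamic ports (49152-65535) get picked for the follow-on traffic.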
The Isilon uses port 12228 to talk to the CEE client (running on Windows in our setup). But what ports are needed for the Windows CEE client to talk to an agent (in our case, DatAdvantage)? Looking at our firewall, it tries port 135. Is that the only port needed, or do I need to open more?

Thanks,
Bob