SmartConnect, Network Pools and HDFS Racks for Hadoop Part 1



Note: This topic is part of the Using Hadoop with OneFS - PowerScale Info Hub.

This is the first of a few posts revisiting the options for designing and configuring network connectivity between Hadoop clusters and OneFS. The topic and its recommended best practices have changed significantly over the last few years as OneFS has evolved and changes were made to HDFS on PowerScale.

Because the compute services and clients ultimately connect to PowerScale through the DNS name defined in the file system URI, there are several options to consider when creating a SmartConnect network pool strategy for integration with a Hadoop compute cluster:

  • IP network pools – one or many IP address pools: separate NameNode and DataNode pools, or a single pool, behind SmartConnect
  • Dynamic or static pools
  • Whether HDFS racks are implemented

Because every node in a PowerScale cluster can act as both a NameNode and a DataNode, there are several ways to deploy the network configuration to make the best use of the PowerScale nodes and your clients. Before we get started, let’s recap a few concepts.

Remember: Pools are about segregating interfaces and traffic, while Allocation methods are about address failover behavior.


IP Address Pools

IP address pools are assigned within a subnet and consist of one or more IP address ranges. You can partition nodes and network interfaces into logical IP address pools. IP address pools are also utilized when configuring SmartConnect DNS zones and client connection management.

You can add network interfaces to IP address pools to associate address ranges with a node or a group of nodes. SmartConnect settings that manage DNS query responses and client connections are configured at the IP address pool level.


SmartConnect Zones

Clients can connect to the PowerScale clusters through a specific IP address or through a name that represents an IP address pool. You can configure a SmartConnect DNS zone name for each IP address pool. The zone name must be a fully qualified domain name. SmartConnect requires that you add a new name server (NS) record that references the SmartConnect service IP address in the existing authoritative DNS zone that contains the cluster. You must also provide a zone delegation to the fully qualified domain name (FQDN) of the SmartConnect zone in your DNS infrastructure.


Static IP Allocation

Assigns one IP address to each network interface added to the IP address pool, but does not guarantee that all IP addresses are assigned. IP addresses do not fail over if an interface becomes unavailable.


Dynamic IP Allocation

Assigns IP addresses to each network interface added to the IP address pool until all IP addresses are assigned. This guarantees a response when clients connect to any IP address in the pool. If a network interface becomes unavailable, its IP addresses are automatically moved to other available network interfaces in the pool, as determined by the IP address failover policy.
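
To make the difference between the two allocation methods concrete, here is a minimal Python sketch. It is a toy model, not how OneFS implements allocation: the interface names, the addresses, and the round-robin redistribution on failure are all assumptions made for illustration. On a real cluster the redistribution follows the configured IP address failover policy.

```python
def allocate_static(interfaces, addresses):
    """Static: each interface gets one address; extra addresses stay unassigned."""
    return {iface: [addr] for iface, addr in zip(interfaces, addresses)}


def allocate_dynamic(interfaces, addresses):
    """Dynamic: every address in the pool is assigned to some interface."""
    assignment = {iface: [] for iface in interfaces}
    for i, addr in enumerate(addresses):
        assignment[interfaces[i % len(interfaces)]].append(addr)
    return assignment


def fail_interface(assignment, failed, dynamic):
    """Return the assignment after the interface `failed` becomes unavailable."""
    remaining = {
        iface: list(addrs) for iface, addrs in assignment.items() if iface != failed
    }
    if dynamic and remaining:
        # Assumption: addresses are moved round-robin to the surviving interfaces.
        survivors = list(remaining)
        for i, addr in enumerate(assignment[failed]):
            remaining[survivors[i % len(survivors)]].append(addr)
    return remaining


def reachable(assignment):
    """All addresses that still answer, in sorted order."""
    return sorted(addr for addrs in assignment.values() for addr in addrs)


# Hypothetical pool: three interfaces, six addresses.
interfaces = ["node1:ext-1", "node2:ext-1", "node3:ext-1"]
addresses = ["10.0.0.%d" % n for n in range(1, 7)]

static = allocate_static(interfaces, addresses)
dynamic = allocate_dynamic(interfaces, addresses)

static_after = fail_interface(static, "node2:ext-1", dynamic=False)
dynamic_after = fail_interface(dynamic, "node2:ext-1", dynamic=True)

print("static, after failure: ", reachable(static_after))
print("dynamic, after failure:", reachable(dynamic_after))
```

In this model the static pool loses the failed interface's address, while the dynamic pool keeps all six addresses reachable on the two surviving interfaces.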


Virtual HDFS Racks

OneFS enables you to define a subset of node interfaces on the PowerScale cluster through a pool and an associated group of Hadoop compute clients as a virtual HDFS rack. Virtual HDFS racks allow you to fine-tune client connectivity by directing Hadoop compute clients to preferentially connect to a specific set of nodes; these could be located on the same switch or faster nodes classes, depending on your network and cluster topology.

In a simple topology, all PowerScale nodes act as both NameNodes and DataNodes, and this is implemented as a single IP pool and SmartConnect zone. A client requests access through the SmartConnect FQDN associated with the HDFS root. To determine which NameNode the client connects to, a DNS query is made against the SmartConnect zone name, and any node in the cluster can be returned, per normal SmartConnect behavior (1 - 4). The client then makes a NameNode request to that specific PowerScale node (5 & 6), and the node responds with the PowerScale node to connect to for access to the requested data blocks; this can be any node in the cluster that is in the IP pool assigned to the SmartConnect zone. The client then makes a DataNode connection to that PowerScale node (7 & 8).
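
The single-pool flow can be sketched in a few lines of Python. This is a simplified model under stated assumptions: the zone name `hdfs.example.com`, the node IPs, the round-robin DNS answer, and the `block_index` used to pick a DataNode are all hypothetical and stand in for behavior that the cluster decides internally.

```python
import itertools


class SmartConnectZone:
    """Toy stand-in for a SmartConnect zone: one FQDN backed by many node IPs."""

    def __init__(self, fqdn, node_ips):
        self.fqdn = fqdn
        self.node_ips = list(node_ips)
        # Assumption: answers are handed out round-robin.
        self._next = itertools.cycle(self.node_ips)

    def resolve(self, name):
        """Steps 1-4: the DNS query returns one node IP from the pool."""
        if name != self.fqdn:
            raise LookupError(f"{name} is not served by this zone")
        return next(self._next)


def namenode_request(zone, block_index):
    """Steps 5-6: the NameNode names a DataNode, which can be any node in the pool.

    `block_index` is a hypothetical selector used only to make the choice
    deterministic in this sketch.
    """
    return zone.node_ips[block_index % len(zone.node_ips)]


zone = SmartConnectZone("hdfs.example.com", ["10.0.0.1", "10.0.0.2", "10.0.0.3"])

namenode_ip = zone.resolve("hdfs.example.com")       # steps 1-4
datanode_ip = namenode_request(zone, block_index=4)  # steps 5-6
# Steps 7-8: the client would now open a DataNode connection to datanode_ip.
print("NameNode:", namenode_ip, "DataNode:", datanode_ip)
```

The point of the sketch is that the NameNode and DataNode answers are independent: both come from the same pool, and neither depends on where the client sits on the network.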

When a rack is introduced, all nodes still act as both NameNode and DataNode, but the choice of which DataNode the client is told to connect to can be managed. The process is similar, except that when the client queries the NameNode (5 & 6) for a DataNode, the cluster consults the defined rack to determine whether the requesting client IP should connect only to a specific set of nodes (7 & 8), namely the ones defined by the rack allocation. This benefits architectures in which the client and the PowerScale nodes are located within the same rack or on the same switch, because it limits cross-switch traffic. Note: NameNode traffic can still cross switches, because all nodes are in the same pool and SmartConnect can return any PowerScale node as the NameNode for the client to connect to. NameNode traffic is significantly smaller than DataNode traffic, so this should not be an issue.

A virtual HDFS rack is the association between a range of Hadoop client source IPs and a PowerScale IP pool. The base implementation of racks requires a minimum of two pools, but it may contain more.

  1. The NameNode pool: this pool likely contains all the nodes in the cluster that will provide HDFS protocol access. HDFS clients connect to this pool's SmartConnect name for NameNode requests.
  2. A DataNode pool: this contains all the PowerScale nodes that you want to provide DataNode access.

The rack definition can be explicit, covering a specific range of Hadoop client IPs:

# isi hdfs racks list --zone=zone1-cdh
Name    Client IP Ranges          IP Pools
------------------------------------------------------------------------------------
/rack1  10.99.36.1-10.99.36.124   subnet0:hadoop-pool-cdh1
------------------------------------------------------------------------------------
Total: 1

A rack can also be defined as a default rack, which means that all client source IPs fall within the rack definition.

# isi hdfs racks list --zone=zone1-cdh
Name    Client IP Ranges          IP Pools
------------------------------------------------------------------------------------
/rack1  0.0.0.0-255.255.255.255   subnet0:hadoop-pool-cdh1
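
The association between client IP ranges and IP pools can be modeled as a simple range lookup. The Python sketch below reuses the rack name, ranges, and pool name from the listings above. The helper names `parse_rack` and `find_rack` are hypothetical, and the code only illustrates the matching idea, not how OneFS evaluates racks internally.

```python
import ipaddress


def parse_rack(name, client_range, ip_pool):
    """Build a rack entry from a 'first-last' client IP range string."""
    first, last = (ipaddress.ip_address(part) for part in client_range.split("-"))
    return {"name": name, "first": first, "last": last, "ip_pool": ip_pool}


def find_rack(racks, client_ip):
    """Return the first rack whose client range contains client_ip, else None."""
    ip = ipaddress.ip_address(client_ip)
    for rack in racks:
        if rack["first"] <= ip <= rack["last"]:
            return rack
    return None


# Explicit rack: only clients in 10.99.36.1-10.99.36.124 match.
explicit = [parse_rack("/rack1", "10.99.36.1-10.99.36.124", "subnet0:hadoop-pool-cdh1")]

# Default rack: every client source IP matches.
default = [parse_rack("/rack1", "0.0.0.0-255.255.255.255", "subnet0:hadoop-pool-cdh1")]

for client in ("10.99.36.50", "10.99.37.1"):
    match = find_rack(explicit, client)
    print(client, "->", match["ip_pool"] if match else "no rack matched")
```

A `None` result here only means that no rack range matched in this model; how the cluster chooses a DataNode for such a client is outside the scope of the sketch.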

In earlier versions of OneFS, it was recommended to use multiple IP pools and racks for all HDFS configurations. As improvements and new features were introduced into OneFS, this best practice has evolved, and the recommendation now depends on the cluster architecture and how clients and nodes are racked.

In the next post we will look at how to implement IP pool strategies and racks, and whether racks are needed at all.

Continue to Part 2: SmartConnect, Network Pools and HDFS Racks for Hadoop Part 2



Article ID: SLN319145

Last Date Modified: 07/08/2020 06:03 PM
