PowerScale OneFS HDFS Configuration Guide

Cloudera Navigator

OneFS supports Cloudera's Navigator application in OneFS 8.1.1 and later versions.

OneFS supports the following data management tasks in Cloudera Navigator:

  • Browse and search data: Find the owner and the creation and modification dates of data, and understand its origins and history.
  • Lineage and provenance: Track data from its source and monitor downstream dependencies.
  • Discovery and exploration: Add, review, and update metadata on the objects contained in the Hadoop data store.
  • Custom metadata and tagging: Add custom tags and information to data and objects in HDFS.

The Cloudera Navigator Data Management component is a comprehensive data governance and stewardship tool available to supplement Cloudera's distribution including Apache Hadoop (CDH). Navigator recognizes HDFS, Yarn, Impala, and Hive as sources of data that it can manage. It extracts information from these services to provide additional insight into how data was created and managed—and when and by whom it was changed—using metadata and job history along with HDFS data feeds.

The primary use of Navigator is data governance: monitoring and tracking data in an HDFS workflow. One of the unique challenges with very large data sets is tracking how data moves through the data analytics workflow. A key Navigator feature is the ability to link input and output data through analytics jobs such as MapReduce, or through data transformations on table-based data in Hive or Impala databases. Navigator analyzes the metadata and job history and links them together to generate lineage.

Traditional HDFS metadata management

In a traditional Direct Attached Storage (DAS) Hadoop deployment of HDFS with a NameNode (NN), the NameNode's main role is to store all the metadata of the underlying data blocks: the HDFS namespace, directory structures, file permissions, and the mapping of block IDs to files. While this data is held in memory for operational use, it is critical that it is also persisted to disk for recovery and fault tolerance.

In traditional HDFS, this metadata is stored in two ways:

  • FSImage (a binary image file accessed through an HTTP end point)
  • INotify stream (an ordered JSON edit log retrieved through HDFS RPCs)

The FSImage file is a complete point-in-time representation of the HDFS file system metadata, and it is used on NameNode startup to load the metadata into memory. Because the FSImage file is inefficient at handling incremental updates, all modifications to the HDFS file system are recorded in a transaction log (the INotify stream) rather than by frequently rewriting the FSImage file. Modifications can therefore be tracked without constantly regenerating the FSImage file. In the event of a NameNode restart, the combination of the latest FSImage and the INotify log can be used to provide an accurate view of the file system at any point in time.

Eventually the HDFS cluster needs to construct a new FSImage that consolidates all INotify log entries with the old FSImage, providing an updated point-in-time representation of the file system. This is known as checkpointing and is a resource-intensive operation. During checkpointing, the NameNode would have to restrict user access to the system. Instead of restricting access to the active NameNode, HDFS offloads this operation to the Secondary NameNode (SN), or to a standby NameNode when operating in high availability (HA) mode. The Secondary NameNode merges the existing FSImage and INotify transaction logs and generates a new complete FSImage for the NameNode. From that point, the latest FSImage can be used in conjunction with the new INotify log files to provide the current file system. It is important that checkpoints occur; otherwise, on a restart, the NameNode has to reconstruct the entire HDFS metadata from the available FSImage and all INotify logs. This can take a significant amount of time, and the HDFS file system is unavailable while this occurs.
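
The relationship between a point-in-time image, an edit log, and a checkpoint can be illustrated with a toy model. The following Python sketch is not NameNode code. The dictionary-based image, the Edit tuple, and the replay and checkpoint helpers are hypothetical names chosen for illustration, and only two operations are modeled.

```python
from typing import NamedTuple, Optional


class Edit(NamedTuple):
    txid: int                     # monotonically increasing transaction ID
    op: str                       # "create" or "delete" (toy subset)
    path: str
    owner: Optional[str] = None   # only meaningful for "create"


def replay(image: dict, edits: list) -> dict:
    """Return a new namespace: the image with the edits applied in txid order."""
    namespace = dict(image)       # copy, so the original image stays untouched
    for edit in sorted(edits, key=lambda e: e.txid):
        if edit.op == "create":
            namespace[edit.path] = edit.owner
        elif edit.op == "delete":
            namespace.pop(edit.path, None)
    return namespace


def checkpoint(image: dict, edits: list) -> tuple:
    """Merge the log into a fresh image and start a new, empty log."""
    return replay(image, edits), []


# Hypothetical starting state: one file in the image, two logged changes.
fsimage = {"/data/a.csv": "alice"}
edit_log = [
    Edit(1, "create", "/data/b.csv", "bob"),
    Edit(2, "delete", "/data/a.csv"),
]

current_view = replay(fsimage, edit_log)            # what a restart would rebuild
new_image, new_log = checkpoint(fsimage, edit_log)  # what a checkpoint produces
print(current_view)
```

In this model, replaying the log on every restart grows more expensive as the log grows, which is why the merged image is worth producing periodically.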

Cloudera Navigator metadata management

The Navigator metadata service accesses data in a number of ways, such as Yarn application logs, Hive and Impala applications, and HDFS metadata through polling of the FSImage file and INotify transaction logs. It collects all of this information and stores it within Apache Solr databases on the Hadoop cluster. Navigator then runs additional extractions and analytics to create the data that you can view in Navigator. The ability to collect the underlying HDFS metadata from FSImage and INotify is critical to Navigator's ability to view the file system and is why, up until the release of OneFS 8.1.1, OneFS Hadoop clusters were unable to provide HDFS file system data to Navigator.

Navigator's primary function is to read an initial FSImage and then use the INotify logs to gain access to all file system updates that have occurred. It is possible under specific situations that Navigator is required to refresh its data from a full FSImage rather than leveraging the INotify log, but this does not occur normally.

It is important to recognize that Navigator data is not real-time; it periodically updates the data through polling and extraction to create the data reviews. This behavior is consistent with both DAS and OneFS deployments and is how Cloudera Navigator is designed to operate.

OneFS support for Cloudera Navigator

OneFS handles file system allocation, block location, and metadata management in a fundamentally different way from a traditional Apache-based HDFS file system. When OneFS is integrated into a Hadoop cluster, the storage file system it provides is based on OneFS, not on HDFS. Its layout and protection scheme are fundamentally different from those of HDFS, and so is its management of metadata and blocks. Because OneFS is not a NameNode-based HDFS file system, and no NameNode is present in the Hadoop cluster, OneFS presents NameNode-like and DataNode-like functionality to the remote Hadoop cluster through the HDFS service. OneFS does not rely on FSImage and INotify transaction log-based metadata management for HDFS data. To support native OneFS capabilities, enterprise features for Hadoop, and multiprotocol access, OneFS uses the underlying file system presented to the HDFS protocol for Hadoop access. Therefore, prior to OneFS 8.1.1, OneFS could not provide an FSImage and INotify log for consumption.

With the release of OneFS 8.1.1 and later versions, OneFS integrates with Cloudera Navigator by enabling an FSImage and INotify log file on OneFS in an HDFS access zone. By enabling an HDFS Hadoop access zone root for FSImage and INotify integration, you are, in effect, telling OneFS to create an FSImage file and start tracking HDFS file system events in an INotify log file, thereby making that data available for consumption by Navigator. Once enabled, OneFS effectively begins to mimic the behavior of a traditional NameNode deployment: OneFS generates an FSImage file, and all HDFS file system operations are logged into an INotify stream.

Periodically OneFS will regenerate a new FSImage, but this operation is not true checkpointing or merging of the INotify log as performed on an HDFS NameNode, because the actual file system and operations are still handled by the core OneFS file system. The FSImage and INotify logs are generated by OneFS to provide the required data to Cloudera Navigator in the required format.

The FSImage regeneration job runs daily to recreate a current FSImage which—combined with the current INotify logs—will represent the current state of data and metadata in the HDFS root from an HDFS perspective.

OneFS is a multi-protocol file system that provides unified access to its data through many protocols, including HDFS, NFS, SMB, and others. Because only HDFS file system operations are captured by the INotify log, Navigator initially sees only the metadata for those operations; metadata created in the HDFS data directories through NFS or SMB is not included in the INotify stream. However, when an FSImage is regenerated, these files are included in the new FSImage, and Navigator sees them the next time it uses a refreshed FSImage. Since Navigator's primary method of obtaining updated metadata is based on INotify logs, it may take some time before non-HDFS-originating data is included. This is expected behavior and should be taken into account if multiprotocol workflows are in use.
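
The visibility gap for non-HDFS writes can be illustrated with a small simulation. This is a minimal sketch, not OneFS code: the ToyAccessZone class, its methods, and the file paths are hypothetical, and consumer_view assumes a consumer that reads the most recent image together with the logged events.

```python
class ToyAccessZone:
    """Toy model of one access zone with FSImage and INotify enabled."""

    def __init__(self):
        self.files = set()     # what is actually stored in the file system
        self.fsimage = set()   # last generated point-in-time image
        self.inotify = []      # paths recorded as HDFS events

    def write(self, path, protocol):
        self.files.add(path)
        # Only operations arriving over HDFS are recorded as events.
        if protocol == "hdfs":
            self.inotify.append(path)

    def regenerate_fsimage(self):
        # The image is rebuilt from the file system itself, so it covers
        # files regardless of the protocol that created them.
        self.fsimage = set(self.files)

    def consumer_view(self):
        # What a consumer can reconstruct: last image plus logged events.
        return self.fsimage | set(self.inotify)


zone = ToyAccessZone()
zone.regenerate_fsimage()                      # initial, empty image
zone.write("/data/from_hdfs.txt", "hdfs")
zone.write("/data/from_nfs.txt", "nfs")

before = zone.consumer_view()                  # NFS-created file is missing
zone.regenerate_fsimage()
after = zone.consumer_view()                   # both files are now visible
print(sorted(before), sorted(after))
```

In the model, the NFS-created file becomes visible only after the image is regenerated and re-read, which mirrors the delay described above.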

Using Navigator with OneFS

In order to enable Navigator integration, both FSImage and INotify need to be enabled on the HDFS access zone within OneFS. Once enabled, they should not be disabled unless the use of Navigator is to be permanently discontinued.

Within OneFS, the FSImage and INotify features are access zone-aware. Enable them only on Hadoop-enabled access zones that will use Navigator; enabling them on a zone that Navigator does not monitor adds overhead to the cluster for a feature that is not being consumed.

No additional configuration changes are required within Cloudera Manager or Navigator to enable integration. When integration is initially enabled, it will take some time for the initial HDFS data to become visible within Navigator and additional time is needed to generate linkage. As new data is added, it will show up in Navigator and will be linked based on the polling and extraction period within Navigator.

Additionally, note the following:

  • You can enable FSImage and INotify either through the command line interface or through the web administration interface.
  • Once FSImage and INotify are enabled, you must deploy CDH 5.12 with Cloudera Navigator. Cloudera deployments prior to CDH 5.12 will not allow Navigator installation.
  • Wait approximately an hour until Navigator has gathered information from applications.
  • Clusters will need to be sized to accommodate the performance impact of INotify.
  • Events are logged to the /var/log/hdfs.log file and messages.
  • You should avoid disabling INotify, or toggling INotify and FSImage off and on, as these are destructive actions in Cloudera Navigator and can cause metadata loss.
  • Do not set the FSImage generation interval (the interval between successive FSImages) beyond the INotify retention period (the minimum duration edit logs will be retained). The INotify minimum retention period must be longer than the FSImage generation interval.
  • With INotify enabled there is an expected performance impact for all edit actions over HDFS.
  • FSImage generation takes approximately one hour for every three million files.
  • To view the data in Navigator, use Yarn, Hive, or another application.
  • OneFS 8.1.1 and later releases do not support Cloudera Navigator data audit capabilities.
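
Two of the notes above lend themselves to a quick calculation: the approximate FSImage generation time and the rule that INotify retention must exceed the FSImage generation interval. The following Python sketch applies both. The function names and the sample file count and retention values are hypothetical, and the estimate is only as accurate as the "one hour per three million files" approximation.

```python
FILES_PER_HOUR = 3_000_000  # from the note above: about 1 hour per 3 million files


def estimated_generation_hours(file_count: int) -> float:
    """Rough FSImage generation time for a zone holding file_count files."""
    return file_count / FILES_PER_HOUR


def retention_is_safe(generation_interval_hours: float,
                      inotify_retention_hours: float) -> bool:
    """True when the INotify retention period is longer than the FSImage interval."""
    return inotify_retention_hours > generation_interval_hours


# Illustrative values only; these are not OneFS defaults.
hours = estimated_generation_hours(12_000_000)   # 12 million files
ok = retention_is_safe(generation_interval_hours=24, inotify_retention_hours=48)
too_short = retention_is_safe(generation_interval_hours=24, inotify_retention_hours=12)

print(hours, ok, too_short)
```

With a daily FSImage job, a retention period shorter than 24 hours fails the check, because events could be discarded before the next image that covers them exists.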

For more information about Cloudera Navigator, see:

