Note: This topic is part of the Using Hadoop with OneFS - Isilon Info Hub.
Also review: OneFS 220.127.116.11 and Cloudera CDH 5.13+ Support for Cloudera Navigator
The release of the OneFS 18.104.22.168 MR adds OneFS support for Cloudera's Navigator application for Metadata Management with Isilon OneFS. The Cloudera Navigator Data Management component is a comprehensive data governance and data stewardship tool. This blog post highlights how Navigator works and how integration with Isilon OneFS is enabled. It is intended to provide a high-level overview of the basic capabilities and how Navigator can be used in conjunction with Hadoop clusters and HDFS data.
With the inclusion of Navigator support in OneFS we now support the following data management tasks:
As of this release of OneFS 8.1.0.x we do not support Navigator data audit capabilities.
Cloudera Navigator is a licensed feature that supplements CDH, Cloudera's Hadoop distribution, and provides additional data tools to administrators and end users of Hadoop data sets. Navigator currently recognizes HDFS, YARN, Impala and Hive as sources of information it can manage. It periodically extracts information from these services to provide additional insight into how data was created and managed, when it was manipulated, and by whom. Metadata and job history, along with HDFS data, feed into the various components of Navigator.
The main Search page of Cloudera Navigator allows you to search and filter on many criteria (source, type, owner and so on) to find the data you are looking for.
An example of the search interface based on one of the attributes of metadata - HDFS created in the last day
View detailed information on a File System Object
View additional Information on an Impala database
View Table information on Impala tables
One of the primary uses of Navigator is to monitor and track data in an HDFS workflow. A unique challenge with very large data sets is tracking how data moves through the data analytics workflow. A key Navigator feature is the ability to link input and output data through analytics jobs such as MapReduce, or through data transformations on table-based data in Hive or Impala databases.
Navigator internally analyzes metadata and job history and links them together to generate lineage.
An example of simple lineage is how jobs create, ingest and output data in a simple terasuite run: teragen-terasort-teravalidate
Review table based lineage data (Impala), how data was ingested and transformed by database workflows: A simple TPC-DS table ingest and query.
A very useful feature of Navigator is the ability to add custom metadata to objects such as HDFS files, YARN jobs and tables. This allows for easier searching and classification of data and can make it simpler to track and monitor your data sets and their usage.
Add a custom tag: select a piece of data in Navigator and select Actions
Search on that custom tag
From an administrative perspective, the ability to monitor and review who moved data in the system, and when and how, can be very useful. Navigator provides a number of interfaces to review this information.
HDFS Analytics - When, Who, Size of Data View
Review Data Stewardship - Review data trends within the HDFS data
Navigator can also enforce policies to manage data and auto tag new data within Navigator to facilitate data management. A number of the capabilities of Navigator including role-based access, purge management and custom properties are beyond the scope of this blog and Cloudera's documentation should be consulted for additional information.
Additional information on Cloudera Navigator can be found here:
In a Direct Attached Storage (DAS) Hadoop cluster with a NameNode (NN) integrated deployment of HDFS, the NN's main role is to store the HDFS namespace: directory structures, file permissions and the mapping of block IDs to files, that is, all the metadata of the underlying data blocks. While this data is held in memory for operational use, it is critical that it is persisted to disk for recovery and fault tolerance.
In normal HDFS this metadata is stored in two ways: the FSImage and an edit log, which HDFS exposes to clients as the INotify stream and which this post refers to as the INotify log. The FSImage is a complete point-in-time representation of the HDFS file system's metadata. The FSImage file is very efficient to read and is used on NN startup to load the metadata into memory, but it is very poor at handling incremental updates. So, rather than rewriting the FSImage all the time, all modifications to the HDFS file system are recorded in a transaction log (the INotify stream). This gives the NN two capabilities: modifications can be tracked without constantly regenerating the FSImage file, and in the event of a NN restart, the latest FSImage can be loaded and the INotify log replayed on top of it to provide an accurate view of the file system up to that point in time.
Eventually, the HDFS cluster needs to construct a new FSImage that consolidates all INotify log entries with the old FSImage, providing an updated point-in-time representation of the file system. This is known as checkpointing and is a resource-intensive operation during which user access would have to be restricted. So, instead of restricting access to the active NN, HDFS offloads this operation to the Secondary NameNode (SN), or to a standby NN when operating in HA mode. The SN merges the existing FSImage with the INotify transaction logs and generates a new complete FSImage for the NN. From then on, the latest FSImage can be used in conjunction with new INotify log files to provide the current file system. It is important that checkpoints occur; otherwise, on restart the NN has to reconstruct the entire HDFS metadata from the available FSImage and all INotify logs, which can take a long time, and the HDFS file system is unavailable while this occurs.
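The relationship between the FSImage, the transaction log and checkpointing can be illustrated with a small toy model. This is only a sketch of the idea described above, not HDFS code: the class `ToyNameNode`, its fields and the `create`/`delete` operation names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNameNode:
    """Toy model (hypothetical): a point-in-time snapshot plus a transaction log."""
    fsimage: dict = field(default_factory=dict)  # path -> size, as of fsimage_txid
    fsimage_txid: int = 0                        # last transaction folded into the snapshot
    edits: list = field(default_factory=list)    # (txid, op, path, size) since the snapshot

    def log(self, op, path, size=0):
        """Record a modification in the log instead of rewriting the snapshot."""
        txid = self.fsimage_txid + len(self.edits) + 1
        self.edits.append((txid, op, path, size))
        return txid

    def replay(self):
        """Rebuild the current namespace: load the snapshot, then replay the log."""
        namespace = dict(self.fsimage)
        for _txid, op, path, size in self.edits:
            if op == "create":
                namespace[path] = size
            elif op == "delete":
                namespace.pop(path, None)
        return namespace

    def checkpoint(self):
        """Merge the log into a new snapshot so later restarts replay fewer entries."""
        self.fsimage = self.replay()
        if self.edits:
            self.fsimage_txid = self.edits[-1][0]
        self.edits = []


if __name__ == "__main__":
    nn = ToyNameNode()
    nn.log("create", "/data/a", 10)
    nn.log("create", "/data/b", 20)
    nn.log("delete", "/data/a")
    print(nn.replay())      # namespace rebuilt from snapshot + log
    nn.checkpoint()
    print(len(nn.edits))    # the log is empty after the merge
```

The longer the log grows between checkpoints, the more entries `replay` has to walk, which mirrors why a NN restart without recent checkpoints takes a long time.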
The Navigator Metadata Service accesses information in a number of ways: YARN application logs, Hive and Impala applications, and HDFS metadata obtained by polling the FSImage file and INotify transaction logs. It collects all this information and stores it within a Solr database on the Hadoop cluster. Navigator then runs additional extractions and analytics on this data to create the data seen in Navigator. The ability to collect the underlying HDFS metadata from FSImage and INotify is critical to Navigator's ability to view the file system, and is why until now OneFS based Hadoop clusters were unable to provide HDFS file system data to Navigator. Navigator's primary behavior is to read an initial FSImage and then use the INotify logs to gain access to all file system updates that have occurred since. Under specific circumstances Navigator may be required to refresh its data from a full FSImage rather than from the INotify log, but this does not occur normally.
It is important to recognize that Navigator data is not real-time; it is updated periodically through polling and extraction to create the data views. This behavior is the same for both DAS and Isilon based deployments and is how Navigator is designed to operate.
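The polling behavior described above (one initial FSImage read, then incremental updates from the event stream, with an occasional full refresh) can be sketched as a single function. This is an illustrative model only and does not reflect Navigator's actual implementation; the function `sync_metadata`, the tuple layout of the events and the gap-detection rule are assumptions made for the example.

```python
def sync_metadata(view, last_txid, fsimage, fsimage_txid, events):
    """One hypothetical polling cycle; returns the updated (view, last_txid).

    view         -- set of known paths, or None before the first poll
    last_txid    -- last transaction id already applied to the view
    fsimage      -- set of paths in the latest snapshot
    fsimage_txid -- transaction id the snapshot is current to
    events       -- list of (txid, op, path) sorted by txid, as retained by the source
    """
    oldest = events[0][0] if events else fsimage_txid + 1
    if view is None or oldest > last_txid + 1:
        # First poll, or the stream no longer covers where we stopped:
        # fall back to a full refresh from the snapshot.
        view, last_txid = set(fsimage), fsimage_txid
    else:
        view = set(view)  # work on a copy; leave the caller's set untouched

    # Normal path: apply only the events not yet seen.
    for txid, op, path in events:
        if txid <= last_txid:
            continue
        if op == "create":
            view.add(path)
        elif op == "delete":
            view.discard(path)
        last_txid = txid
    return view, last_txid


if __name__ == "__main__":
    image = {"/data/a", "/data/b"}
    view, txid = sync_metadata(None, 0, image, 2, [(3, "create", "/data/c")])
    print(sorted(view), txid)
```

Because updates only arrive when a polling cycle runs, the view always lags the file system by up to one polling interval, which is the non-real-time behavior noted above.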
Isilon OneFS, when integrated into a Hadoop cluster, provides the cluster with a storage file system that is based on OneFS and not on HDFS. The layout and protection scheme is fundamentally different from HDFS, and so is its management of metadata and blocks. OneFS is not a NN based HDFS file system and no NN is present in the Hadoop cluster; instead, OneFS provides NN and DataNode (DN) like functionality on top of the native OneFS file system, which the remote Hadoop cluster accesses via the HDFS services and protocols. Our approach to handling file system allocation, block location and metadata management is fundamentally different from how a traditional Apache based HDFS file system manages its data and metadata.
The long and short of this is that we don't rely on FSImage and INotify transaction log based metadata management within OneFS for HDFS data. In order to support the native OneFS capabilities described in the Enterprise Features for Hadoop whitepaper and provide multiprotocol access, we use the underlying OneFS file system presented to the HDFS protocol for Hadoop access. Therefore we previously had no capability to provide a FSImage and INotify log for consumption by Navigator. With the release of MR 22.214.171.124, OneFS now includes the capability to integrate with Navigator by enabling a FSImage and INotify log stream on OneFS in a HDFS Access Zone. Enabling this feature in effect tells OneFS to create a FSImage file and start tracking HDFS file system events in an INotify log file, both of which are then available for consumption by Navigator.
Since these components are now accessible to Navigator, OneFS based Hadoop can provide the required HDFS metadata to Navigator for inclusion and analytics. Once we enable a HDFS Hadoop Access Zone root for FSImage and INotify integration, OneFS effectively begins to mimic the behavior of a traditional NN deployment: a FSImage file is generated by OneFS and all HDFS file system operations are logged into an INotify stream. Periodically OneFS regenerates a new FSImage. This is not true checkpointing and merging of the INotify log as a HDFS NN does it, because the actual file system and its operations are still handled by the core OneFS file system. The FSImage and INotify logs are generated only to provide the required data to Navigator in the required format.
The FSImage regeneration job runs daily to recreate a current FSImage which, combined with the current INotify logs, represents the current state of data and metadata in the HDFS root from a HDFS perspective. At its heart OneFS is a true multi-protocol file system which provides unified access to its data through many protocols: HDFS, NFS, SMB and others. Since only HDFS file system operations are captured by the INotify log, Navigator initially sees only this metadata; any metadata created in the HDFS data directories by NFS or SMB is not included in the INotify stream. However, when a FSImage is regenerated, these files are included in the new FSImage, and Navigator will see them the next time it uses a later, refreshed FSImage. Since Navigator's primary method of obtaining updated metadata is based on INotify logs, it may be some time before data that did not originate from HDFS is included. This is expected behavior and should be taken into account if multiprotocol workflows are in use.
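The visibility gap for multiprotocol writes can be modeled in a few lines. The sketch below is a simplified illustration of the behavior described above, not OneFS code: the class `ToyAccessZone`, its methods and the protocol strings are hypothetical names chosen for the example.

```python
class ToyAccessZone:
    """Toy model (hypothetical) of an Access Zone with FSImage and INotify enabled."""

    def __init__(self):
        self.files = set()     # what is really on the file system
        self.hdfs_events = []  # event stream: only HDFS operations are logged

    def write(self, path, protocol):
        """Create a file over the given protocol ("hdfs", "nfs", "smb", ...)."""
        self.files.add(path)
        if protocol == "hdfs":
            self.hdfs_events.append(("create", path))

    def regenerate_fsimage(self):
        """A regenerated image reflects every file, regardless of protocol."""
        return frozenset(self.files)

    def navigator_view(self, loaded_image):
        """What a consumer sees: the image it last loaded plus the HDFS events."""
        view = set(loaded_image)
        for op, path in self.hdfs_events:
            if op == "create":
                view.add(path)
        return view


if __name__ == "__main__":
    zone = ToyAccessZone()
    first_image = zone.regenerate_fsimage()
    zone.write("/data/from_hdfs", "hdfs")
    zone.write("/data/from_nfs", "nfs")
    print(sorted(zone.navigator_view(first_image)))   # NFS file not visible yet
    later_image = zone.regenerate_fsimage()
    print(sorted(zone.navigator_view(later_image)))   # visible after an image refresh
```

In this model the NFS-created file only appears once the consumer loads a regenerated image, which matches the delay to plan for in multiprotocol workflows.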
Within OneFS the FSImage and INotify features are Access Zone aware and should only be enabled on Hadoop enabled Access Zones that will use Navigator. There is no reason to enable them on a zone that is not being monitored by Navigator; doing so just adds overhead to the cluster for a feature that is not being consumed. To enable Navigator integration, both FSImage and INotify need to be enabled on the HDFS Access Zone. Once enabled, they should not be disabled unless the use of Navigator is to be permanently discontinued.
No additional configuration changes are required within Cloudera Manager or Navigator to enable this functionality. When integration is initially enabled, it will take some time for the initial HDFS data to show within Navigator and additional time to generate linkage. As new data is added, it will show and be linked based on the polling and extraction period within Navigator.
The following section outlines how to enable this feature within OneFS:
Enable FSImage on the HDFS Access Zone:
isi hdfs fsimage settings modify --enabled=true --zone=zone1-cdh --verbose
Review the status of FSImage:
isi hdfs fsimage settings view --zone=zone1-cdh
Review the status of the FSImage job:
isi hdfs fsimage job view --zone=zone1-cdh
Review the frequency of the FSImage job:
isi hdfs fsimage job settings view --zone=zone1-cdh
Review the latest FSImage:
isi hdfs fsimage latest view --zone=zone1-cdh
It may take some time for the initial FSImage to be generated.
Enable INotify on the HDFS Access Zone:
isi hdfs inotify settings modify --enabled=true --zone=zone1-cdh --verbose
Review the configuration of the INotify stream:
isi hdfs inotify settings view --zone=zone1-cdh
Review the INotify stream:
isi hdfs inotify stream view --zone=zone1-cdh
The Sync and Current IDs will update periodically as you run this command.
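When several Access Zones need the same treatment, the enable and review commands shown above can be generated per zone. The helper below is a hypothetical convenience sketch: it only builds command strings from the commands listed in this section and does not execute anything, and the function name `navigator_enable_commands` is an assumption for illustration.

```python
import shlex


def navigator_enable_commands(zone):
    """Build (not run) the enable and review commands for one HDFS Access Zone."""
    quoted = shlex.quote(zone)  # guard against unusual characters in the zone name
    commands = []
    # Both features are required for Navigator integration.
    for feature in ("fsimage", "inotify"):
        commands.append(
            f"isi hdfs {feature} settings modify --enabled=true --zone={quoted} --verbose"
        )
        commands.append(f"isi hdfs {feature} settings view --zone={quoted}")
    return commands


if __name__ == "__main__":
    for command in navigator_enable_commands("zone1-cdh"):
        print(command)
```

Printing the commands first and reviewing them before running them on the cluster keeps the helper safe to use as a checklist.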
When enabling integration of Cloudera Navigator and OneFS it can take a few hours for initial HDFS data to show up within Navigator based on the generation of a FSImage and INotify stream. This is expected behavior.
Once both features are enabled and the FSImage has been generated, HDFS objects appear in Navigator for browsing.
An overview of FSImage and INotify commands in OneFS:
rsteven-3u7xf1k-1# isi hdfs fsimage job settings modify -
In order to implement OneFS and Navigator integration, the minimum required versions are:
As of October 2nd:
FSImage and INotify functionality will be integrated into an upcoming Major Release of OneFS, removing the requirement for any DA patches to expose this functionality.
The integration of FSImage and INotify capabilities into OneFS allows Isilon OneFS based Hadoop cluster deployments to provide metadata management and data lineage through Cloudera Navigator. This integration extends the enterprise capabilities of OneFS based Hadoop deployments, providing parity with native DAS based HDFS file systems and data management options. Additional information on Cloudera Navigator integration with OneFS can be obtained from your account team.
Article ID: SLN319472
Last Date Modified: 01/31/2020 01:49 PM