I am seeing the error below when running an MR job in a customer's production environment, which is Kerberos-secured and uses Isilon.
Has anybody come across this error before in a secured Hadoop environment?
16/03/04 13:13:11 INFO impl.TimelineClientImpl: Timeline service address: dca1hdm01pp.med...ws/v1/timeline/
16/03/04 13:13:11 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 0 for ambari-qa on 10.20.37.222:8020
16/03/04 13:13:11 INFO security.TokenCache: Got dt for hdfs://isilon-dl1.medctr.ad.wfubmc.edu:8020; Kind: HDFS_DELEGATION_TOKEN, Service: 10.20.37.222:8020, Ident: (HDFS_DELEGATION_TOKEN token 0 for ambari-qa)
16/03/04 13:13:11 INFO input.FileInputFormat: Total input paths to process : 1
16/03/04 13:13:11 INFO mapreduce.JobSubmitter: number of splits:1
16/03/04 13:13:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1457115085399_0001
16/03/04 13:13:11 INFO mapreduce.JobSubmitter: Kind: HDFS_DELEGATION_TOKEN, Service: 10.20.37.222:8020, Ident: (HDFS_DELEGATION_TOKEN token 0 for ambari-qa)
16/03/04 13:13:13 INFO impl.YarnClientImpl: Submitted application application_1457115085399_0001
16/03/04 13:13:13 INFO mapreduce.JobSubmitter: Cleaning up the staging area /user/ambari-qa/.staging/job_1457115085399_0001
java.io.IOException: Failed to run job : Failed to renew token: Kind: HDFS_DELEGATION_TOKEN, Service: 10.20.37.222:8020, Ident: (HDFS_DELEGATION_TOKEN token 0 for ambari-qa)
at java.security.AccessController.doPrivileged(Native Method)
I have not seen this particular issue before; can you provide us with some more details about your environment?
1. How many HDFS clients are starting jobs at any given time?
2. Are you experiencing any HDFS restarts on the cluster? You may also see core files in /var/crash on your cluster nodes.
3. Was this setup ever working on a previous version of OneFS?
Let me know and I can look into this further for you!
This is a client bug that occurs in Kerberized clusters.
An explanation can be viewed here: [YARN-4632] Replacing _HOST in RM_PRINCIPAL should not be the responsibility of the client code - AS...
The fix looks like it is targeted for Hadoop 2.9.0 (see: [YARN-4629] Distributed shell breaks under strong security - ASF JIRA).
On OneFS 188.8.131.52, it is characterized by the presence of the following message in the logs:
RPC V9 renewDelegationToken user: yarn/[ResourceManager-FQDN-here]@[Realm-here] exception: java.io.IOException cause: User yarn/[ResourceManager-FQDN-here]@[Realm-here] tried to renew a token with renewer specified as yarn/_HOST@[Realm-here]
Assuming there is only one ResourceManager in the cluster, a workaround is available. Since the problem stems from the client failing to translate the magic _HOST placeholder in yarn.resourcemanager.principal (yarn-site.xml), manually replacing it with the ResourceManager's FQDN on all clients should prevent this error. In Ambari, this requires disabling and re-enabling Kerberos.
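As a sketch, the workaround is a one-line edit to yarn-site.xml on each client; rm1.example.com and EXAMPLE.COM below are placeholders for your actual ResourceManager FQDN and Kerberos realm:

```xml
<!-- yarn-site.xml on each client -->
<!-- Before: the client is expected to expand _HOST, but fails to
     (YARN-4632), so the renewer name reaches the server unexpanded. -->
<!--
<property>
  <name>yarn.resourcemanager.principal</name>
  <value>yarn/_HOST@EXAMPLE.COM</value>
</property>
-->
<!-- After: hard-code the ResourceManager's FQDN (placeholder values) -->
<property>
  <name>yarn.resourcemanager.principal</name>
  <value>yarn/rm1.example.com@EXAMPLE.COM</value>
</property>
```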
I was able to make progress by making the changes below.
I believe the problem was that when the delegation token was initially created, it included the NameNode's IP address in the token.
When the job was submitted to the ResourceManager, the ResourceManager resolved the NameNode hostname again, got a different IP address for the NameNode back from Isilon, and started complaining that there was no delegation token for that IP address.
After I added hadoop.security.token.service.use_ip=false to core-site.xml, my job still failed, but with a different Kerberos error.
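In config form, that change is a single property in core-site.xml, which makes clients key delegation tokens by hostname instead of by resolved IP (so round-robin DNS answers from Isilon all match the same token):

```xml
<!-- core-site.xml: identify token services by hostname, not IP -->
<property>
  <name>hadoop.security.token.service.use_ip</name>
  <value>false</value>
</property>
```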
After looking through the flow, it appears the resource localizer builds its own security context and does not use the conf files if they are not on the MR classpath. By adding /etc/hadoop/conf to the MR classpath, the resource localizer picks up the hadoop.security.token.service.use_ip=false property, and my job ran successfully.
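One way to add the conf directory, sketched below, is to prepend it to mapreduce.application.classpath in mapred-site.xml; the trailing entries are the stock defaults and your existing value may differ by distribution:

```xml
<!-- mapred-site.xml: prepend the client conf dir so the resource
     localizer reads the same core-site.xml settings -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>/etc/hadoop/conf,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*,$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
```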