PowerScale: Job engine database is reporting locked, or updates take a long time to succeed

Summary: On large clusters, the protection level of the job engine reports database (reports.db) may cause slow or failed access to the database.


Symptoms

Jobs are going into a waiting state multiple times.

Database updates are taking a long time to complete.

The isi_job_d.log and messages.log files show messages about long wait times, database locks, and possibly the job coordinator moving between nodes frequently.

Symptom 1:
isi_job_d.log reports a long wait on a succeeded update:
isi_job_d[45179]: Reports database update (job state) succeeded but took 11272 ms
Symptom 2:
isi_job_d.log reports database is locked:
isi_job_d[97274]: Failed to update Jobs (state): database is locked
Symptom 3: 
Job coordinator switches nodes on a frequent basis:
2018-05-02T02:00:48Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[31517]: Becoming job engine coordinator
2018-05-02T02:11:26Z <24.5> cluster01-33(id52) cluster01-33 isi_job_d[36865]: Becoming job engine coordinator
2018-05-02T02:25:39Z <24.5> cluster01-33(id52) cluster01-33 isi_job_d[37310]: Becoming job engine coordinator
2018-05-02T02:36:25Z <24.5> cluster01-37(id56) cluster01-37 isi_job_d[77098]: Becoming job engine coordinator
2018-05-02T02:38:24Z <24.5> cluster01-37(id56) cluster01-37 isi_job_d[77167]: Becoming job engine coordinator
2018-05-02T02:43:33Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[32917]: Becoming job engine coordinator
2018-05-02T02:59:58Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[33518]: Becoming job engine coordinator
2018-05-02T03:02:44Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[33782]: Becoming job engine coordinator
2018-05-02T03:08:02Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[33969]: Becoming job engine coordinator
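
The three symptoms above can be counted from log text with a short script. The following is a minimal sketch, assuming the log lines have the same shape as the samples in this article; the function name `summarize` and the result keys are hypothetical and not part of OneFS.

```python
import re
from collections import Counter

# Patterns modeled on the sample log lines in this article (an assumption:
# other OneFS releases may word these messages differently).
SLOW_UPDATE = re.compile(
    r"Reports database update \((?P<what>[^)]*)\) succeeded but took (?P<ms>\d+) ms"
)
LOCKED = re.compile(r"Failed to update .*: database is locked")
COORDINATOR = re.compile(
    r"^(?P<ts>\S+) \S+ (?P<node>\S+)\(id\d+\) .*Becoming job engine coordinator"
)


def summarize(lines):
    """Count slow updates, lock errors, and coordinator elections in log lines."""
    durations = []
    locked = 0
    coordinators = []
    for line in lines:
        slow = SLOW_UPDATE.search(line)
        if slow:
            durations.append(int(slow.group("ms")))
        if LOCKED.search(line):
            locked += 1
        elected = COORDINATOR.search(line)
        if elected:
            coordinators.append(elected.group("node"))
    # A "move" is an election won by a different node than the previous one.
    moves = sum(1 for a, b in zip(coordinators, coordinators[1:]) if a != b)
    return {
        "slow_updates": len(durations),
        "max_update_ms": max(durations, default=0),
        "locked_errors": locked,
        "coordinator_elections": len(coordinators),
        "coordinator_moves": moves,
        "elections_per_node": dict(Counter(coordinators)),
    }
```

Passing the lines of isi_job_d.log and messages.log to `summarize` gives a quick view of how often each symptom occurs. Repeated elections on the same node are counted separately from moves between nodes, because both appear in the Symptom 3 sample.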

Cause

  1. The job engine reports database (reports.db) is constantly updated because a long-running job is generating a large number of updates.
  2. Because the file is protected at 6x, every update to it is written six times across the cluster. Lowering the protection level appears to make the updates faster.
  3. Depending on the amount of time spent paused in isi_papi_d, the coordinator's write to the database may time out, or it may succeed but take longer than expected. Both cases are logged in isi_job_d.log.
To determine the current protection level of reports.db, run:
isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db
cluster01-16# isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db
POLICY   W   LEVEL PERFORMANCE COAL  ENCODING      FILE              IADDRS
8x        8     6x concurrency off   UTF-8         reports.db        <25,14,1575794508288:512>, <26,17,484528026624:512>, <26,31,924241684480:512>, <27,31,145164204544:512>, <28,30,2451893562880:512>, <29,29,6440579584:512> ct: 1506750730 rt: 0
*************************************************
* IFS inode: [ 25,14,1575794508288:512, 26,17,484528026624:512, 26,31,924241684480:512, 27,31,145164204544:512, 28,30,2451893562880:512, 29,29,6440579584:512 ]
*************************************************
...
*  Protection Policy:  8x
*  Target Protection:  6x                  <-- six times protection
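
When checking many clusters, the two protection fields can be pulled out of saved `isi get -DD` output. This is a minimal sketch that assumes the detail lines look like the sample above (`*  Protection Policy:  8x`); the helper names `protection_levels` and `mirror_count` are hypothetical.

```python
import re

# Matches detail lines such as "*  Target Protection:  6x" (format assumed
# from the sample output in this article).
FIELD = re.compile(
    r"^\*[ \t]+(Protection Policy|Target Protection):[ \t]+(\S+)", re.MULTILINE
)


def protection_levels(isi_get_output):
    """Return the protection fields found in `isi get -DD` output as a dict."""
    return {name: value for name, value in FIELD.findall(isi_get_output)}


def mirror_count(level):
    """Return N for a mirror level written as 'Nx', or None for other levels."""
    match = re.fullmatch(r"(\d+)x", level)
    return int(match.group(1)) if match else None
```

For the sample output above, `protection_levels` reports a policy of 8x and a target of 6x, and `mirror_count("6x")` gives the number of copies written on each update.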

Resolution

Dell engineering is investigating the issue. This article will be updated as more information becomes available.

To work around this issue, change the protection level of the job reports database as follows:
  1. Pause any running jobs. Verify that no jobs are running with the command:
isi job status
  2. Disable the job engine and verify that isi_job_d is no longer running on any node:
isi services -a isi_job_d disable
isi_for_array -sX 'ps auxww |grep -i isi_job_d |grep -v grep'
  3. Change the protection level of reports.db to 3x with a random access layout:
isi set -r -g reprotect -a random -p 3x -F /ifs/.ifsvar/modules/jobengine/reports.db
  4. Run the following command again to confirm that reports.db has changed:
isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db

cluster01-16# isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db
POLICY   W   LEVEL PERFORMANCE COAL  ENCODING      FILE              IADDRS
3x       3      3x random off   UTF-8         reports.db        <25,14,1575794508288:512>, <26,31,924241684480:512>, <29,16,157594713088:512> ct: 1506750730 rt: 0
*************************************************
* IFS inode: [ 25,14,1575794508288:512, 26,31,924241684480:512, 29,16,157594713088:512 ]
*************************************************
*...
*  Protection Policy:  3x
*  Target Protection:  3x
  5. Enable the job engine and check that isi_job_d has started:
isi services -a isi_job_d enable
isi_for_array -sX 'ps auxww |grep -i isi_job_d |grep -v grep'
  6. Resume jobs. The messages in isi_job_d.log should abate.
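
The confirmation step in the list above (the second `isi get -DD` run) can also be checked by script. This sketch only reads saved command output and assumes the summary row follows the column layout shown in the sample (POLICY, W, LEVEL, PERFORMANCE, ...); the function names are hypothetical.

```python
def parse_summary_row(isi_get_output):
    """Map the column headers of `isi get -DD` to the values on the next line."""
    lines = isi_get_output.splitlines()
    for index, line in enumerate(lines):
        header = line.split()
        if header[:2] == ["POLICY", "W"] and index + 1 < len(lines):
            values = lines[index + 1].split()
            # IADDRS (the last column) contains spaces, so it is left out.
            return dict(zip(header[:-1], values))
    return {}


def is_reprotected(isi_get_output, level="3x", layout="random"):
    """True if the summary row shows the expected level and access layout."""
    row = parse_summary_row(isi_get_output)
    return row.get("LEVEL") == level and row.get("PERFORMANCE") == layout
```

With the output captured before the change, `is_reprotected` returns False (level 6x, concurrency layout). With the output shown in the confirmation step, it returns True.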
The steps above should restripe reports.db to 3x immediately. However, some clusters have SmartPools configured to manage protection on 'all' files rather than only on files at the default setting.
  • In full log:
[xus25@elvis 2021-09-24-003]$ cat local/isi_storagepool_settings
     Automatically Manage Protection: all  <<<<
Automatically Manage Io Optimization: all 
  • On a live cluster
steven-8220-1# isi storagepool settings view
     Automatically Manage Protection: all  <<<<
Automatically Manage Io Optimization: all 

With 'Automatically Manage Protection' set to 'all', the next SmartPools job may restripe reports.db back to a 6x or 8x mirror.
To avoid this, set 'Automatically Manage Protection' to 'files_at_default'.

With this setting, SmartPools jobs skip manually managed files, so reports.db keeps the protection level that was set manually.
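
The check for this setting can be scripted against saved `isi storagepool settings view` output. This is a minimal sketch that assumes the `Key: value` layout shown above; the function names are hypothetical, and trailing annotations such as `<<<<` are ignored.

```python
def manage_protection_setting(settings_output):
    """Return the 'Automatically Manage Protection' value, or None if absent."""
    for line in settings_output.splitlines():
        key, separator, value = line.partition(":")
        if separator and key.strip() == "Automatically Manage Protection":
            fields = value.split()
            return fields[0] if fields else None
    return None


def may_override_manual_protection(settings_output):
    """True if SmartPools is set to manage protection on all files."""
    return manage_protection_setting(settings_output) == "all"
```

If `may_override_manual_protection` returns True, the next SmartPools job may undo the 3x setting on reports.db, as described above.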

Affected Products

PowerScale, Isilon

Products

Isilon, Isilon NL410
Article Properties
Article Number: 000066019
Article Type: Solution
Last Modified: 17 Sep 2025
Version:  16