PowerScale: Job engine database reports locked, or updates take a long time to succeed
Summary: On large clusters, the protection level of the job engine reports database (reports.db) may cause problems accessing the database.
Symptoms
Jobs are going into a waiting state multiple times.
Database updates are taking a long time to complete.
In isi_job_d.log and messages.log, messages are seen about long wait times, database locks, and the job coordinator possibly jumping between nodes on a frequent basis.
Symptom 1:
isi_job_d.log reports a long wait on a succeeded update:
isi_job_d[45179]: Reports database update (job state) succeeded but took 11272 ms
Symptom 2:
isi_job_d.log reports database is locked:
isi_job_d[97274]: Failed to update Jobs (state): database is locked
Symptom 3:
Job coordinator switches nodes on a frequent basis:
2018-05-02T02:00:48Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[31517]: Becoming job engine coordinator
2018-05-02T02:11:26Z <24.5> cluster01-33(id52) cluster01-33 isi_job_d[36865]: Becoming job engine coordinator
2018-05-02T02:25:39Z <24.5> cluster01-33(id52) cluster01-33 isi_job_d[37310]: Becoming job engine coordinator
2018-05-02T02:36:25Z <24.5> cluster01-37(id56) cluster01-37 isi_job_d[77098]: Becoming job engine coordinator
2018-05-02T02:38:24Z <24.5> cluster01-37(id56) cluster01-37 isi_job_d[77167]: Becoming job engine coordinator
2018-05-02T02:43:33Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[32917]: Becoming job engine coordinator
2018-05-02T02:59:58Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[33518]: Becoming job engine coordinator
2018-05-02T03:02:44Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[33782]: Becoming job engine coordinator
2018-05-02T03:08:02Z <24.5> cluster01-39(id58) cluster01-39 isi_job_d[33969]: Becoming job engine coordinator
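To gauge how often the three symptoms occur, the log can be summarized with a short filter. The following is a minimal sketch, not a supported tool: the function name summarize_job_log is hypothetical, and the patterns are based only on the sample messages shown in this article.

```shell
# Sketch: summarize the three symptoms from isi_job_d.log text read on stdin.
# summarize_job_log is a hypothetical helper; the message patterns are taken
# from the sample log lines in this article and may differ between releases.
summarize_job_log() {
    awk '
        BEGIN { slow = 0; max_ms = 0; locked = 0; elections = 0 }
        # Symptom 1: update succeeded but was slow; the duration precedes "ms".
        /Reports database update .* succeeded but took [0-9]+ ms/ {
            slow++
            for (i = 2; i <= NF; i++) {
                if ($i == "ms" && $(i - 1) + 0 > max_ms) max_ms = $(i - 1) + 0
            }
        }
        # Symptom 2: update failed because the database was locked.
        /database is locked/ { locked++ }
        # Symptom 3: a node announced that it is taking over as coordinator.
        /Becoming job engine coordinator/ { elections++ }
        END {
            printf "slow_updates=%d max_ms=%d locked=%d coordinator_elections=%d\n", slow, max_ms, locked, elections
        }
    '
}
```

Assuming the log is at /var/log/isi_job_d.log on the node (an assumption, adjust as needed), it could be run as: summarize_job_log < /var/log/isi_job_d.log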
Cause
- The Job engine reports database (reports.db) is constantly updated because a long-running job is generating a large number of updates.
- Every update to the file is written six times across the cluster. Lowering its protection level appears to make the updates faster.
- Depending on the amount of time paused in isi_papi_d, expect the coordinator to time out in its write to the database (logging). The write may also succeed but log that it took longer than wanted. In both cases, the message is logged to isi_job_d.log.
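As a rough illustration of why the protection level matters, the sketch below multiplies a number of logical updates by the mirror count. It assumes, as a simplification, that each update costs one write per mirror copy and ignores any other overhead; the function name mirror_writes is hypothetical.

```shell
# mirror_writes UPDATES COPIES
# Hypothetical helper: estimated writes when each of UPDATES logical updates
# is written once per mirror copy (simplifying assumption, overhead ignored).
mirror_writes() {
    echo $(( $1 * $2 ))
}
```

Under that assumption, mirror_writes 1000 6 gives 6000 writes at 6x, while mirror_writes 1000 3 gives 3000 at 3x, which is half the write volume for the same number of updates.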
To determine the current protection level of reports.db, run:
isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db
Example output:
cluster01-16# isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db
POLICY W LEVEL PERFORMANCE COAL ENCODING FILE IADDRS
8x 8 6x concurrency off UTF-8 reports.db <25,14,1575794508288:512>, <26,17,484528026624:512>, <26,31,924241684480:512>, <27,31,145164204544:512>, <28,30,2451893562880:512>, <29,29,6440579584:512> ct: 1506750730 rt: 0
*************************************************
* IFS inode: [ 25,14,1575794508288:512, 26,17,484528026624:512, 26,31,924241684480:512, 27,31,145164204544:512, 28,30,2451893562880:512, 29,29,6440579584:512 ]
*************************************************
...
* Protection Policy: 8x
* Target Protection: 6x <-- six times protection
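Because the isi get -DD output is long, it can help to pull out only the two protection fields. The following is a minimal sketch that assumes the "Protection Policy:" and "Target Protection:" labels appear as in the example output; the function name protection_levels is hypothetical.

```shell
# Sketch: read `isi get -DD` output on stdin and print only the protection
# policy and target protection. protection_levels is a hypothetical helper;
# the field labels are assumed to match the example output in this article.
protection_levels() {
    awk -F': *' '
        # Keep only the first word after the label, dropping any trailing notes.
        /Protection Policy:/ { split($2, a, " "); policy = a[1] }
        /Target Protection:/ { split($2, a, " "); target = a[1] }
        END { printf "policy=%s target=%s\n", policy, target }
    '
}
```

It could be used as: isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db | protection_levels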
Resolution
Dell engineering is investigating the issue. This article will be updated as more information becomes available.
To work around this issue, change the protection level of the job reports database as follows:
- Pause any running jobs. Verify that there are no jobs running with the command:
isi job status
- Disable the job engine and verify that isi_job_d is dead on all nodes:
isi services -a isi_job_d disable
isi_for_array -sX 'ps auxww |grep -i isi_job_d |grep -v grep'
- Change the protection level of reports.db to 3x with random access layout:
isi set -r -g reprotect -a random -p 3x -F /ifs/.ifsvar/modules/jobengine/reports.db
- Run the following command again to confirm that reports.db has changed:
isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db
Example output:
cluster01-16# isi get -DD /ifs/.ifsvar/modules/jobengine/reports.db
POLICY W LEVEL PERFORMANCE COAL ENCODING FILE IADDRS
3x 3 3x random off UTF-8 reports.db <25,14,1575794508288:512>, <26,31,924241684480:512>, <29,16,157594713088:512> ct: 1506750730 rt: 0
*************************************************
* IFS inode: [ 25,14,1575794508288:512, 26,31,924241684480:512, 29,16,157594713088:512 ]
*************************************************
*...
* Protection Policy: 3x
* Target Protection: 3x
- Enable the job engine:
isi services -a isi_job_d enable
isi_for_array -sX 'ps auxww |grep -i isi_job_d |grep -v grep'
- Resume jobs. The messages in isi_job_d.log should abate.
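The command sequence in the steps above can be rehearsed as a dry run before touching the cluster. The following is only a sketch: the helper names run and reprotect_reports_db and the APPLY variable are hypothetical, and the isi commands are exactly the ones listed in the steps. By default it prints each command without executing it.

```shell
# Sketch: print (and optionally execute) the workaround commands in order.
# run, reprotect_reports_db and APPLY are hypothetical names for illustration.
DB=/ifs/.ifsvar/modules/jobengine/reports.db

# run CMD...: print the command; execute it only when APPLY=1.
# Note: shell quoting of arguments is not reproduced in the printed line.
run() {
    printf '+ %s\n' "$*"
    if [ "${APPLY:-0}" = "1" ]; then
        "$@"
    fi
}

reprotect_reports_db() {
    # Manual checkpoint: confirm that no jobs are running before continuing.
    run isi job status
    run isi services -a isi_job_d disable
    # Manual checkpoint: this should list no isi_job_d processes on any node.
    run isi_for_array -sX 'ps auxww |grep -i isi_job_d |grep -v grep'
    run isi set -r -g reprotect -a random -p 3x -F "$DB"
    # Manual checkpoint: confirm the policy and target protection now read 3x.
    run isi get -DD "$DB"
    run isi services -a isi_job_d enable
    run isi_for_array -sX 'ps auxww |grep -i isi_job_d |grep -v grep'
}
```

Calling reprotect_reports_db with APPLY unset only prints the seven commands, which is the intended use. The function does not evaluate the output of the verification commands, so with APPLY=1 the checkpoints would still need to be reviewed by the operator; running the steps by hand is the safer choice.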
The isi set command above changes reports.db to 3x immediately. However, some users may configure their SmartPools job to work on 'all' files instead of 'default'.
- In full log:
[xus25@elvis 2021-09-24-003]$ cat local/isi_storagepool_settings
Automatically Manage Protection: all <<<<
Automatically Manage Io Optimization: all
- On a live cluster:
steven-8220-1# isi storagepool settings view
Automatically Manage Protection: all <<<<
Automatically Manage Io Optimization: all
With 'Automatically Manage Protection' set to 'all', the next SmartPools job may restripe reports.db back to 6x or 8x mirror.
To avoid this, set 'Automatically Manage Protection' to 'files_at_default'.
With it set to 'files_at_default', SmartPools jobs bypass manually managed files, which means they leave reports.db at whatever protection level was specified.
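The SmartPools setting can be checked with a small filter over the settings output. The following is a minimal sketch that assumes the "Automatically Manage Protection:" label appears as in the examples in this article; the function name check_manage_protection is hypothetical.

```shell
# Sketch: read `isi storagepool settings view` output (or the
# isi_storagepool_settings file from a log set) on stdin.
# check_manage_protection is a hypothetical helper. It prints the current
# value and returns 0 only when the value is files_at_default.
check_manage_protection() {
    value=$(awk -F': *' '
        # Keep only the first word after the label.
        /Automatically Manage Protection:/ { split($2, a, " "); v = a[1] }
        END { print v }
    ')
    echo "Automatically Manage Protection: ${value:-unknown}"
    [ "$value" = "files_at_default" ]
}
```

It could be used as: isi storagepool settings view | check_manage_protection || echo "SmartPools may restripe reports.db"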
Affected Products
PowerScale, Isilon
Products
Isilon, Isilon NL410
Article Properties
Article Number: 000066019
Article Type: Solution
Last Modified: 17 Sep 2025
Version: 16