
May 14th, 2015 15:00

Ask the Expert: The What, Why and How of Isilon Job Engine

YOU MAY ALSO BE INTERESTED IN THESE ATE EVENTS...

Ask The Expert – Isilon’s New Releases: IsilonSD Edge, OneFS.NEXT and CloudPools

Ask the Expert: Are you ready to manage deep archiving workloads with Isilon’s HD400 node and OneFS 7.2.0? Find out more…

https://community.emc.com/thread/195695

Welcome to the EMC Support Community Ask the Expert conversation. On this occasion we will be covering the EMC Isilon Job Engine. Among the many areas we'll be discussing, our experts will answer your questions on what the Isilon Job Engine is, why it's important to understand, how to tweak its jobs, and how to deal with job engine issues.


Meet Your Experts:


Rashmikant Vyas

Technical Account Manager - EMC

Rash works on providing solutions and support to Isilon customers via proactive account management. His customers are pushing storage boundaries every day in both capacity and performance, and that comes with challenges. Prior to Isilon, Rash worked on many different products in different roles and capacities: team lead designing new services; solution architect and installer for EMC block and file storage (VNX, CLARiiON), virtualization platforms (VMware, hypervisors), and Fibre Channel and network protocols; and system admin for Sun Microsystems, VERITAS products (Cluster, Volume Manager), Linux and other UNIX flavors, and shell scripting.


This discussion takes place from May 18th - June 5th. Get ready by bookmarking this page or signing up for e-mail notifications.


Share this event on Twitter or LinkedIn:

>> Join the Ask the Expert: The What, Why and How of Isilon Job Engine http://bit.ly/1EIHONl #EMCATE<<

1.2K Posts

June 3rd, 2015 08:00

Thank you! (in hope of MultiScanLin...)

287 Posts

June 4th, 2015 23:00

Rash,

When will SSDs be balanced within only the SSD disk pools?

As far as I understand, AutoBalance currently balances HDDs and SSDs together as a node total.

17 Posts

June 6th, 2015 04:00

Hi go.y,

There were some challenges in the 6.5.x code with respect to balancing metadata and inodes on SSDs. In 7.x Isilon changed the architecture and created disk pools, with SSDs in a separate disk pool. The idea was that this should fix the issue found in 6.5.x, and in most cases 7.x does. However, in a few instances SSDs were found to be unbalanced on the 7.0.2.4 code. Changes were made in the 7.1.1.x code, and I have not seen SSD imbalance on 7.1.1.x.

I would first recommend upgrading to 7.1.1.2 (the latest recommended OneFS code), or if you can wait a month or two, you could go with OneFS 7.2.0.2, which is our GA release and a candidate to become target code.


If you still see the SSDs imbalanced, let me know what OneFS version you're running and whether there is an SR open. I will follow up with the engineering team and update this thread.
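For reference, a quick way to confirm the release you are on and to check per-drive SSD usage (the same statistics command another poster uses later in this thread; the Used column is the fill percentage):

    # Show the OneFS release the cluster is running
    isi version

    # Per-drive SSD statistics; compare the Used column across drives
    isi statistics drive --nodes=all --type=ssd --long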


Rash

EMC | Isilon

Technical Account Manager

June 8th, 2015 11:00

This Ask the Expert event has officially ended, but don't let that deter you from asking more questions. At this point our SMEs are still welcome to answer and continue the discussion, though not required. This is where we ask our community members to chime in and assist other users if they're able to provide information.

Many thanks to our SMEs who selflessly made themselves available to answer questions. We also appreciate our users taking part in the discussion and asking so many interesting questions.

ATE events are made for your benefit as members of ECN. If you're interested in pitching a topic or Subject Matter Experts, we would be interested in hearing about it. To learn more about what it takes to start an event, please visit our Ask the Expert Program Space on ECN.

117 Posts

July 6th, 2015 06:00

To share my recent experience with balancing usage on SSD drives used for metadata acceleration:

The cluster is running OneFS version 7.1.0.5, and we had to smartfail and re-add some SSDs (for a separate reason); this obviously caused an imbalance in their usage. We started an AutoBalanceLin job and it did re-balance the usage on the SSDs.
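For anyone wanting to do the same, here is a minimal sketch of starting and monitoring the job; this assumes OneFS 7.x CLI syntax, so verify against your release:

    # Start AutoBalanceLin, the LIN-scan flavor of AutoBalance
    isi job jobs start AutoBalanceLin

    # Check job engine state and progress while it runs
    isi job status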

Before starting the job we had 5 nodes with this usage:

isi statistics drive --nodes=all --type=ssd --long

  Drive Type  OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued Busy Used Inodes
LNN:bay         N/s     B/s      B    N/s      B/s       B      ms  N/s      ms           %    %
    1:1  SSD  138.8    1.5M    11K  366.4     2.9M    8.0K     0.1  0.0     0.4    0.0  4.5 97.3    90M
    1:2  SSD  136.0    1.5M    11K  364.4     2.9M    8.0K     0.1  0.0     0.2    0.1  5.1 97.3    90M
    2:1  SSD  182.0    1.7M   9.6K  355.8     2.8M    8.0K     0.1  0.0     0.7    0.0  5.1 97.3    90M
    2:2  SSD  150.4    1.5M    10K  340.6     2.7M    8.0K     0.1  0.0     0.5    0.0  5.5 97.3    90M
    3:1  SSD  131.2    1.7M    13K  334.6     2.6M    7.9K     0.1  0.0     0.4    0.0  6.5 97.7    90M
    3:2  SSD  177.4    1.7M   9.6K  332.6     2.6M    7.9K     0.1  0.0     0.7    0.0  4.3 97.7    90M
    4:1  SSD   91.0    2.6M    29K  113.4     920K    8.1K     0.1  0.0     0.4    0.0  1.3  2.6   1.3M
    4:2  SSD   81.6    2.2M    26K  126.8     1.0M    8.1K     0.1  0.0     0.3    0.0  2.1  2.6   1.3M
    5:1  SSD  239.2    6.1M    26K  696.8     5.6M    8.0K     0.1  0.0     0.1    0.0  8.5 41.0    31M
    5:2  SSD  223.2    5.9M    26K  669.6     5.3M    8.0K     0.1  0.0     0.1    0.0 10.5 41.0    31M

After letting the job run for about 5 days, we were in this situation, with a much better balance:

isi statistics drive --nodes=all --type=ssd --long

Fri Jul  3 15:27:57 EDT 2015

  Drive Type  OpsIn BytesIn SizeIn OpsOut BytesOut SizeOut TimeAvg Slow TimeInQ Queued Busy Used Inodes
LNN:bay         N/s     B/s      B    N/s      B/s       B      ms  N/s      ms           %    %
    1:1  SSD    0.0     0.0    0.0   1.2K     8.6M    7.4K     0.1  0.0     0.0    0.0 14.7 69.2    61M
    1:2  SSD    0.0     0.0    0.0   1.2K     8.4M    7.3K     0.1  0.0     0.0    0.0 15.9 69.2    61M
    2:1  SSD    0.0     0.0    0.0   1.0K     7.6M    7.3K     0.1  0.0     0.0    0.0 11.1 69.1    61M
    2:2  SSD    0.0     0.0    0.0   1.0K     7.4M    7.1K     0.1  0.0     0.0    0.0 11.7 69.1    61M
    3:1  SSD    0.0     0.0    0.0   1.1K     8.0M    7.2K     0.1  0.0     0.0    0.0 15.7 69.7    61M
    3:2  SSD    0.0     0.0    0.0   1.1K     8.0M    7.2K     0.1  0.0     0.0    0.0 13.9 69.7    61M
    4:1  SSD    0.0     0.0    0.0    8.4      69K    8.2K     0.1  0.0     0.0    0.3  0.1 64.8    60M
    4:2  SSD    0.0     0.0    0.0   10.6      87K    8.2K     0.1  0.0     0.0    0.2  0.3 64.8    60M
    5:1  SSD    0.0     0.0    0.0  200.0     1.5M    7.5K     0.1  0.0     0.0    0.0  3.3 68.2    63M
    5:2  SSD    0.0     0.0    0.0  216.6     1.7M    7.6K     0.1  0.0     0.0    0.0  3.7 68.2    63M
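To watch just the Used column converge while the job runs, a simple illustrative loop like the following works; the awk field positions assume the --long output format shown above and may need adjusting on other releases:

    # Print drive ID and Used% every 60 seconds; $(NF-1) is the
    # second-to-last field (Used) in this output format
    while true; do
        isi statistics drive --nodes=all --type=ssd --long | awk '{print $1, $(NF-1)}'
        sleep 60
    done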

68 Posts

December 29th, 2016 15:00

What are the main causes of all jobs being paused at the same time, including FlexProtect?

And if you launch one more job, will it automatically be paused as well, including FlexProtect?

Why did the system pause all jobs?

What must I do to get out of that status (all jobs paused)?

How can you run a job in degraded mode?

Thanks

January 3rd, 2017 09:00

Hi Francisco, ideally Rash, if he's still around, may be able to address your questions; if not, I've moved this thread to the Isilon Product community in case there's a community member who can help.

Happy New Year!

17 Posts

January 3rd, 2017 09:00

What are the main causes of all jobs being paused at the same time, including FlexProtect?

FlexProtect pauses all other jobs, unless you have tweaked the job engine.

If the FlexProtect job is also paused, then something is wrong with the job engine: isi_job_d may not be running, one of the nodes may be in read-only mode or down, or the cluster may be unable to reach one of the nodes over the backend (InfiniBand) network. At this stage I would ask you to log a support case and have Support work on it. I could write out troubleshooting steps, but I don't know each user's experience level, so it is best for Support to fix it.
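That said, a few read-only checks along those lines are safe to run yourself before logging the case (OneFS 7.x commands; they change nothing on the cluster):

    # Overall cluster health; read-only or down nodes show up here
    isi status

    # Confirm the job engine daemon is running on every node
    isi_for_array -s 'ps auxw | grep [i]si_job_d'

    # Current job engine state and job list
    isi job status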

And if you launch one more job, will it automatically be paused as well, including FlexProtect?

The Isilon job engine is written to give the highest priority to data integrity, so when a drive or a node is in a smartfail state, OneFS will run FlexProtect to reprotect the data. You could pause the FlexProtect job and run another job by taking the job engine out of "Degraded" mode, but again, at this stage I would ask you to check with Support, because you need to know the protection level on the cluster, what is in a smartfail state, and the reason to pause FlexProtect.
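To see what is currently in a smartfail state, another read-only check (OneFS 7.x syntax):

    # List device status on every node; smartfailing drives are flagged
    isi_for_array -s 'isi devices'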

Why did the system pause all jobs?

See the answer to the first question.

What must I do to get out of that status (all jobs paused)?

See the answer to the first question.

How can you run a job in degraded mode?

You need to take the job engine out of degraded mode. Again, I can't share those commands on a public forum, as changes to the job engine without proper knowledge could cause other issues. Please log a support case, and if there is a good reason, Support will make those changes.

Hope this helps!

68 Posts

April 28th, 2017 10:00

Thanks for your help, Rash and Nestor

I have a follow-up question. I started the FlexProtect, MultiScan, and IntegrityScan jobs. What could be the reason that the jobs fail and then start again and again?

Thanks

68 Posts

December 30th, 2017 18:00

Thanks, Roberto and Rash, for your answers.

The issue was that I had 3 nodes down and my protection was N+2, so I couldn't run the FlexProtect job.

I needed to put the cluster in degraded mode to run the FlexProtect job. You were right.

Thanks
