August 4th, 2009 09:00

To cluster or not to cluster?

We are looking at implementing clustering (Solaris Cluster 3.2 on Solaris 10), but, as in all things, cost is an issue. We have 5 backup servers at present, with 2 more to go in before the year is out, across two datacentres: 4 servers in one and the 5th in the other. These need "higher availability" than they have as standalone servers at the moment, and the initial plan was for stretch clusters with a node in each datacentre. All metadata will be stored on ZFS on the same HDS storage array, and there will be no storage node functionality on the servers, with the exception of an AFTD which will back up indexes and stage to storage nodes over the network.

Clustering them all requires 10 (14) servers, which is obviously a lot of hardware that will spend most of its time unused. So the question now is: how can we achieve "highish availability" with fewer servers? We expect the failover process to require manual intervention, so a fully automatic high-availability failover solution is not expected.

One option appears to be to cluster the servers in an N+1 topology, where one failover server (the documentation does not say whether it can be more than one) is shared between all nodes. That seems a sensible option unless there is a multiple failure, which would probably mean there is a more serious problem to address anyhow...

Another option (I think, but I have found little on this, as it appears to be a little unconventional) would be to implement the servers as single-node clusters, again with one or more dedicated failover nodes which can access the disk and assume the virtual node.

Does anyone have any experience of, or advice on, clustering NetWorker in a similar configuration, or suggestions of any other options worth looking at?

186 Posts

August 5th, 2009 00:00

So why cluster the servers? You could just cluster storage nodes if you wanted, but make them hot, one in each DC. I use RSM to present my AVAs to the storage nodes as LUNs and then back up my data from there, as it is all over fibre and a load faster. The only failover box I have is the NetWorker server itself. All my libraries are on fabric, so if a storage node goes down I can just change the fabric group and present it to the other storage node within 5 minutes.

I run 4 scripts: present the LUN, back up the data, unpresent the LUN, and delete the replica of the data on the presented LUN.
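Roughly, the wrapper looks like this. Sketch only: the echo stubs stand in for the real RSM/array commands, which I've left out because they depend on the array CLI at each site.

```shell
#!/bin/sh
# Four-step replica backup cycle. The *_lun functions are placeholders:
# in a real script each body would call the array CLI (RSM in my case)
# and the backup would be kicked off with savegrp or similar.

present_lun()    { echo "presenting replica LUN $1 to storage node"; }
backup_lun()     { echo "backing up data on LUN $1"; }
unpresent_lun()  { echo "unpresenting LUN $1 from storage node"; }
delete_replica() { echo "deleting replica of LUN $1 on the array"; }

LUN="${1:-lun0}"    # LUN identifier, default for illustration
present_lun "$LUN"
backup_lun "$LUN"
unpresent_lun "$LUN"
delete_replica "$LUN"
```

You would run one copy per replica LUN, serially, so the replica is never left presented after the backup completes.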

Wicked, I know.

Rocks, man: low cost, heaps fast. A bit more management, but it impresses the bigwigs, mate.
Just wish I would see the cost saving in my pay packet...

1.1K Posts

August 5th, 2009 03:00

What we want to prepare for is a backup server going down, so that we can be back up and continuing to back up as soon as possible.

Most servers are working at maximum capacity, with 1000 or so clients, so shifting the clients around to other servers in the event of a server failure is not an option. Storage node capacity will not be an issue: we have 4 storage nodes per server at present, but these are low spec and are being upgraded, so they should be capable of driving multiple drives easily. Our STK8500 is maxed out with drives, so there will be no expansion there, and our EDL4300, shared between 4 servers, will be joined by a second EDL4300, to be shared between 6 servers eventually (the 7th server backs NAS data up directly to tape). All data already goes across the SAN...

244 Posts

August 5th, 2009 05:00

I'm always sceptical about the idea of "clustering the NW server". Usually you have backup windows which are the only time within which you can do a backup of a particular client. Let's say you have a client with 4TB of data (which should be configured as an SN, I think). When the backup server fails after backing up 3.9TB of data, you will have to restart the backup from zero on the standby node of the NW server, which will usually blow the backup window. To minimize the impact of an NW server failure, I usually try to implement more NW servers rather than cluster them. I know that it means more licenses (let's say: more expensive), but generally this solution gives me better granularity, so the failure of one NW server will impact only the subset of clients connected to it.

1.1K Posts

August 5th, 2009 07:00

Piotr, that's a sensible idea, and definitely one that should be helpful where overload is responsible for the failure. But where the failure is unrelated, for example a hardware failure, another server taking over would be beneficial, especially when it happens in the early evening; and we do have backups that run in the daytime, so even then it would be useful to avoid interruption. And in cases where you have, say, 4x4TB file systems and 3 complete before the failure, you only have to run one again...
We have a global site license, so I don't believe we have any additional licensing costs, but I may be wrong there...

244 Posts

August 5th, 2009 08:00

Yeah, there is always the problem of finding the best solution, one which balances all your needs: safe and comfortable to administer, fault tolerant yet cheap, uncomplicated and easy to learn, etc. A cluster solution will definitely help you in case of hardware failure. What I can add to the discussion is that I usually place the NW resources (the /nsr directory) on SAN storage. Then I get the ability to mount the NW resources on another host and start the NW server there. It is simpler to manage, but it will not work during the night when I'm sleeping :)
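In outline the manual move is something like this. A sketch only: the pool name and the init script path are examples and will differ per site, and with the default DRY_RUN=1 it only prints the commands rather than running them.

```shell
#!/bin/sh
# Manual relocation of a SAN-resident /nsr to a standby host.
# "nsrpool" and the NetWorker init script path are assumptions for
# illustration; review the echoed commands before setting DRY_RUN=0.

DRY_RUN="${DRY_RUN:-1}"

run() {
    echo "+ $*"                     # show what would be executed
    [ "$DRY_RUN" = "1" ] || "$@"    # execute only when DRY_RUN is off
}

# Force-import the pool: the failed host never exported it cleanly.
run zpool import -f nsrpool
run zfs mount nsrpool/nsr           # assumed mountpoint /nsr
run /etc/init.d/networker start     # start the NetWorker daemons
```

The hostname problem discussed below still applies, of course: the daemons come up, but clients only trust the new host if they already know its name.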

1.1K Posts

August 5th, 2009 08:00

Well, the /nsr resources are on SAN storage; it's just that (I assume) mounting them on a different server is going to change the name of the server and affect the backups (for one thing, the servers file on the client will not let the new server back it up). But I assume with a cluster (even a "single-node cluster") I can create a logical node corresponding to my backup server and mount the resources there...

186 Posts

August 5th, 2009 14:00

Right, mate.
This is the go.

Create a hot cluster of 2 nodes, right. The production node does its thing as normal. Then use RepliStor to replicate the /nsr directory to the second node; this is done continually, up to the minute.

Now, I use this solution for my DR system, although I don't have it clustered. I have the services stopped on the second node (my DR box), and if I have a failure on prod I just start the services on the DR box (node 2). All the clients, jobs and the rest of the config are there.

Now here is the kicker: the licenses go back to the 45-day eval because the host name does not match. But here is the second kicker: once the prod box is back up and running and RepliStor replicates again, it doesn't matter, because if prod goes down again you get another 45 days.

I have failover licenses for my DR box, but I can't add them because they get overwritten, and prod won't accept them.

Nothing else changes. You could then do the fabric switching if an SN goes down and present the drives, silo and library to another box. It would most likely be a good idea to have a cold DR SN box sitting around.

1.1K Posts

August 6th, 2009 01:00

But as the hostname does not match, authentication will fail for the backup (unless of course you update the name of the second server in the servers file on all clients). And there is no need to use RepliStor, as the /nsr data is on the SAN and should be resilient.

186 Posts

August 6th, 2009 14:00

100%, yes mate; in my case I use it for DR only, so I never really need to do backups from that box.

If I was in a situation where I did have a failure on the NW server, then yes, I would have to go into a DR situation and manually log on to the clients and amend the entries in the servers file.

Do you know if there can be 2 entries in this file? Just thought of it. So what do you do if you lose a LUN or an HBA goes belly up? If you get sync loss over the fabric you could be corrupting the media DB.

For mine, I would host it locally and replicate it; at least that way I am not 100% dependent on the network.

1.1K Posts

August 7th, 2009 02:00

We have 3000 clients, most of which we do not have access to, so an amendment is not practical... There should be no problem losing an HBA: all servers have 2 or 4 HBAs and are on a fabric, so unless we lose multiple HBAs we should not see issues.

116 Posts

August 7th, 2009 02:00

Hi Scott - we do the same for DR contingency.

We replicate the indexes and /nsr content using rsync every half hour to the remote site. Also the NMC stuff, as we have a number of reports set up that I wouldn't want to have to recreate.
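The cron side of it is simple; something along these lines, though the DR hostname and the NMC path here are examples rather than our real ones.

```shell
# Crontab on the production server (illustrative): mirror /nsr and the
# NMC data to the DR host every half hour. --delete keeps the DR copy
# an exact mirror, so stale index files are removed as well.
0,30 * * * * rsync -az --delete /nsr/ drserver:/nsr/
15,45 * * * * rsync -az --delete /opt/LGTOnmc/ drserver:/opt/LGTOnmc/
```

The two jobs are staggered so they are not fighting each other for bandwidth on the inter-site link.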

We might change the replication for the indexes to MirrorView/A or SAN Copy, as they are all on CLARiiON SAN, but we are waiting on the purchase of an FC/IP router for the other site.

We also have failover codes for our licences, but I have never been able to find any documentation on how they should be used - I assume you can't have the DR server up and running on the failover licences at the same time as the production server?

186 Posts

August 8th, 2009 15:00

Hi Nick.

Yeah, I think you're right. I have my DR server up all the time, but with the services stopped. If I add the failover codes they just get overwritten, so it is useless.

So I don't bother. Looks like wasted money to me. I am reviewing the cluster document to see if it suggests a better way.

Sorry if I missed this, but does anyone know if 2 entries in the servers file will work? It is something I am thinking of for an extended outage.

I was replicating my index store to DR prior to staging, but had a problem where RepliStor was not updating the DR side and the partition ran out of space. I need to look into this.

I think David's environment is scaled for HA, whereas most businesses have very little budget to scale backup and recovery as HA, or they don't view backups as highly as they should. Backup admins only get exposure when a DR is in operation, and then it is only for a short period. Then we are forgotten about again.

56 Posts

August 10th, 2009 02:00

You can add multiple hosts to the servers file, and it works if everything else is configured properly. I use this for restoring from another datazone, and on my NMC server to monitor all five backup servers.
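The file is just one server name per line; for example (made-up names, and I believe fully qualified names matching what the client resolves are the safe choice):

```text
backup1.example.com
backup1-dr.example.com
nmc.example.com
```

With both the production and DR server names listed on every client, a failover box taking over under its own name can still back them up.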

1.1K Posts

August 10th, 2009 03:00

You can add multiple entries to the servers file without problem - most places I've worked in with multiple backup servers have a standard file listing all the backup servers, to make things easier.

My environment is scaled for HA: I have been told budget is not a restriction, which may be good or may be bad!

116 Posts

August 10th, 2009 03:00

Is there such a thing as a clustered backup solution where a datastream that is writing to a device can survive a failover and continue writing? I looked into clustering a long time ago, but the expense didn't really justify what it seemed to achieve.

If the savegroup fails despite the clustering setup, then in our case, where we run weekly fulls of large monolithic Windows LUNs (badly designed storage, I know), the restart basically starts from square one again. In that case the backup window has probably been missed anyway, so I revert to using SAN snapshots if a file recovery is requested, and could dump the snap if really necessary.