Unsolved
1.1K Posts
0
1781
To cluster or not to cluster?
We are looking at implementing clustering (Solaris Cluster 3.2 on Solaris 10), but as in all things, cost is an issue. We have five backup servers at present, with two more to go in before the year is out; we have two datacentres, with four servers in one and the fifth in the other. These need a "higher availability" than they have as standalone servers at the moment, and the initial plan was for stretch clusters with a node in each datacentre. All metadata will be stored on ZFS on the same HDS storage array, and there will be no storage node functionality on the servers, with the exception of an AFTD which will back up indexes and stage to storage nodes via the network.
Clustering them all requires 10 (14) servers, which is obviously a lot of hardware that is going to spend most of its time unused. So we now have a question: how can we achieve a "highish availability" with fewer servers? We expect the failover process to require manual intervention, so a full high-availability automatic-failover solution is not expected.
One option appears to be to cluster the servers into an N+1 topology, where one failover server (the documentation does not say whether it can be more than one) is shared between all nodes. That seems a sensible option unless there is a multiple failure, which would probably mean there is a more serious problem to address anyhow...
Another option (I think, but I have found little on this as it appears to be a little unconventional) would be to implement the servers as single-node clusters, again with one or more dedicated failover nodes which can access the disk and assume the virtual node.
Does anyone have any experience or advice on clustering Networker in a similar configuration or suggestion of any other option worth looking at?
dugans1
186 Posts
0
August 5th, 2009 00:00
I run four scripts to present the LUN, back up the data, unpresent the LUN, and delete the replica of the data on the presented LUN.
Wicked, I know.
Rocks, man: low cost, heaps fast. A bit more management, but it impresses the big wigs, mate.
Just wish I would see the cost saving in my pay packet...
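The four-step cycle above can be sketched in shell. This is a minimal sketch only: the `*_lun` functions are echo stubs standing in for whatever array CLI and NetWorker `save` invocation are actually in use, and the function names and LUN IDs are hypothetical, not from the thread.

```shell
#!/bin/sh
# Sketch of the present / backup / unpresent / delete-replica cycle.
# All function names and LUN IDs are hypothetical placeholders; in a
# real script these would call the storage array CLI and `save`.

present_lun()    { echo "presenting replica LUN $1 to backup host"; }
backup_lun()     { echo "backing up LUN $1"; }   # in practice: mount + save
unpresent_lun()  { echo "unpresenting LUN $1"; }
delete_replica() { echo "deleting replica of LUN $1"; }

run_cycle() {
    for lun in "$@"; do
        present_lun "$lun"
        backup_lun "$lun"
        unpresent_lun "$lun"
        delete_replica "$lun"
    done
}

run_cycle 10 11 12 13    # hypothetical LUN IDs
```

In practice each step would need error checking, since leaving a LUN presented after a failed backup is exactly the sort of thing that bites later.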
DavidHampson
1.1K Posts
0
August 5th, 2009 03:00
Most servers are working at maximum capacity with 1000 or so clients, so shifting the clients around to other servers in the event of a server failure is not an option. Storage node capacity will not be an issue - we have 4 storage nodes per server at present, but these are low spec and are being upgraded, so they should be capable of driving multiple drives easily. Our STK8500 is maxed out with drives, so there will be no expansion there, and our EDL4300 shared between 4 servers will be joined by a second EDL4300 to be shared between 6 servers eventually (the 7th server backs NAS data up directly to tape) - all data already goes across the SAN...
benzino1
244 Posts
1
August 5th, 2009 05:00
DavidHampson
1.1K Posts
0
August 5th, 2009 07:00
We have a global site license so I don't believe we have any additional licensing costs but I may be wrong there...
benzino1
244 Posts
0
August 5th, 2009 08:00
DavidHampson
1.1K Posts
0
August 5th, 2009 08:00
dugans1
186 Posts
0
August 5th, 2009 14:00
This is the go.
Create a hot cluster with 2 nodes, right. The production node does its thing as normal. Then use RepliStor to replicate the \nsr directory to the second node; this is done continually, up to the minute.
Now I use this solution for my DR system, although I don't have it clustered. I have the services stopped on the second node (my DR box), and if I have a failure on prod I just start the services on the DR (node 2) box. All the clients, jobs and the rest of the config are there.
Now here is the kicker: the licenses go back to the 45-day eval because the host name does not match. But here is the second kicker: once the prod box is back up and running and RepliStor replicates again, it doesn't matter, because if prod goes down again you get another 45 days.
I have failover licenses for my DR box, but I can't add them because they get overwritten, and prod won't accept them.
Nothing else changes. You could then do the fabric switching if an SN goes down and present the drives, silo and library to another box. It would most likely be a good idea to have a cold DR SN box sitting around.
DavidHampson
1.1K Posts
0
August 6th, 2009 01:00
dugans1
186 Posts
0
August 6th, 2009 14:00
If I was in a situation where I did have a failure on the NW server, then yes, I would have to go into a DR situation and manually log on to the clients and amend the entries in the servers file.
Do you know if there can be 2 entries in this file? Just thought of it. So what do you do if you lose a LUN or an HBA goes belly up? If you get sync loss over the fabric you could be corrupting the media DB.
For mine, I would host it locally and replicate it; at least that way I am not 100% dependent on the network.
DavidHampson
1.1K Posts
0
August 7th, 2009 02:00
nicbone
116 Posts
0
August 7th, 2009 02:00
Replicate indexes & /nsr content using rsync every half hour to the remote site. Also the NMC stuff, as I have a number of reports set up that I wouldn't want to have to recreate.
Might change the replication for indexes to MirrorView/A or SAN Copy, as they are all on CLARiiON SAN, but waiting for purchase of an FC/IP router for the other site.
We also have failover codes for our licences, but I have never been able to find any documentation on how they should be used - I assume you can't have the DR server up & running on the failover licences at the same time the production server is up?
dugans1
186 Posts
0
August 8th, 2009 15:00
Yeah i think your right, I have my DR server up all the time but services stopped. If i amm the Failover Codes then it is useless.
So i don't bother. Looks like Wasted money to me. I am reviewing the Clust document to see if it suggests a better way.
Sorry if i missed this but does anyone know if 2X entries in the Server file will work? It is somthing i am thinking of for an extended outage.
I was replicating my index store to DR prior to staging but had problem where Replistore was not updating the DR side and the Partition ran out of space. "I need to look in to this"
I think Davids environment is scaled for HA where as most businesses have very little budget to scale backup & Recovery as HA or they don't view backups as highly as they should. Backup Admins only get exposure when a DR is in operation and then it is only for a short period. Then we are foggotten about again.
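On the two-entries question: to my understanding, the client-side /nsr/res/servers file takes one NetWorker server hostname per line, so listing both the production and DR servers should be possible - worth verifying against the admin guide for your release. Hostnames below are hypothetical:

```
# /nsr/res/servers on each client: NetWorker servers allowed to
# back this client up, one hostname per line
prodserver.example.com
drserver.example.com
```

nsrexecd reads this file at startup, so it would typically need a restart on each client after the change.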
jsallila
56 Posts
0
August 10th, 2009 02:00
DavidHampson
1.1K Posts
0
August 10th, 2009 03:00
My environment is scaled for HA and free from budget restrictions - I have been told budget is not an issue, which may be good or may be bad!
nicbone
116 Posts
0
August 10th, 2009 03:00
If the savegroup fails despite the clustering setup then, in our case, where weekly fulls of large monolithic Windows LUNs (badly designed storage, I know) are being backed up, the savegroup restart basically starts from square one again. In that case the backup window has probably been missed anyway, so I revert to using SAN snapshots if a file recovery is requested, and could dump the snap if really necessary.