We are looking at implementing clustering (Solaris Cluster 3.2 on Solaris 10), but as in all things cost is an issue. We have 5 backup servers at present, with 2 more to go in before the year is out; we have two datacentres, with 4 servers in one and the 5th in the other. These need higher availability than they have as standalone servers at the moment, and the initial plan was for stretch clusters with a node in each datacentre. All metadata will be stored on ZFS on the same HDS storage array, and there will be no storage node functionality on the servers, with the exception of an AFTD which will back up indexes and stage to storage nodes via the network.
Clustering them all requires 10 (eventually 14) servers, which is obviously a lot of hardware that is going to spend most of its time unused. So the question is now: how can we achieve a "highish availability" with fewer servers? We expect the failover process to require manual intervention, so a fully automatic high-availability failover solution is not expected.
One option appears to be to cluster the servers into an N+1 topology, where one failover server (the documentation does not say whether more than one is possible) is shared between all nodes. That seems a sensible option unless there is a multiple failure, which would probably mean there is a more serious problem to address anyhow...
Another option (I think, but I have found little on this as it appears to be a little unconventional) would be to implement the servers as single-node clusters, again with one or more dedicated failover nodes which can access the disk and assume the virtual node.
Does anyone have any experience of, or advice on, clustering NetWorker in a similar configuration, or suggestions of any other option worth looking at?
So why cluster the servers? You could just cluster storage nodes if you wanted, but make 'em hot, one in each DC. I use RSM to present my AVAs to the storage nodes as LUNs and then back up my data from there, as it is all over fibre and a shedload faster. The only failover box I have is the NetWorker server itself. All my libraries are on fabric, so if a storage node goes down I can just change the fabric group and present it to the other storage node within 5 minutes.
I run four scripts: present the LUN, back up the data, unpresent the LUN, and delete the replication of data on the presented LUN.
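For what it's worth, that cycle could be scripted along these lines. This is only a sketch: the actual present/unpresent commands are array- and RSM-specific, so each step below just echoes a placeholder, and the LUN and storage-node names are invented for illustration.

```shell
#!/bin/sh
# Hypothetical sketch of the four-step replica backup cycle.
# The real array/RSM masking commands are site-specific, so each
# step only echoes what it would do; replace the echoes with your
# actual commands. Names below are made up.

LUN="replica_lun01"        # hypothetical replica LUN
SN_HOST="storagenode1"     # hypothetical storage node

present_lun()    { echo "presenting $LUN to $SN_HOST"; }
backup_lun()     { echo "backing up data on $LUN"; }
unpresent_lun()  { echo "unpresenting $LUN from $SN_HOST"; }
delete_replica() { echo "deleting replicated data on $LUN"; }

# Run the steps in order, stopping if any one fails.
present_lun && backup_lun && unpresent_lun && delete_replica
```

Chaining with `&&` means a failed present or backup step leaves the replica in place for investigation rather than deleting it.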
Wicked, I know.
Rocks, man: low cost, heaps fast. A bit more management, but it impresses the bigwigs, mate. Just wish I would see the cost saving in my pay packet...
What we want to prepare for is the event of a backup server going down, so that we can be up and backing up again as soon as possible.
Most servers are working at maximum capacity with 1000 or so clients, so shifting the clients around to other servers in the event of a server failure is not an option. Storage node capacity will not be an issue: we have 4 storage nodes per server at present, but these are low spec and are being upgraded, so they should be capable of driving multiple drives easily. Our STK8500 is maxed out with drives, so there will be no expansion there, and our EDL4300, currently shared between 4 servers, will be joined by a second EDL4300 to be shared between 6 servers eventually (the 7th server backs NAS data up directly to tape). All data already goes across the SAN...
I'm always sceptical of the idea of "clustering the NW server". Usually you have backup windows which are the only time within which you can back up a particular client. Let's say you have a client with 4TB of data (which should be configured as an SN, I think). When the backup server fails after backing up 3.9TB of data, you will have to restart the backup from zero on the other, standby node of the NW server, and that will usually take you past the backup window limits. To minimise the impact of an NW server failure, I usually try to implement more NW servers rather than cluster them. I know that it means more licences and more expense (let's say: more expensive), but generally this solution gives me better granularity, so the failure of one NW server will impact only the subset of clients connected to it.
Piotr, that's a sensible idea, and definitely one that should be helpful in instances where overload is responsible for the failure. But where the failure is unrelated (for example a hardware failure), another server taking over would be beneficial, especially when it happens in the early evening; and since we do have backups that run in the daytime, even then it would be useful to avoid interruption to backups. And in cases where you have, say, 4x 4TB file systems and 3 complete before the failure, you only have to run one again... We have a global site licence, so I don't believe we have any additional licensing costs, but I may be wrong there...
Yeah, there is always the problem of finding the best solution, one which balances all your needs: safe and comfortable to administer, fault tolerant yet cheap, uncomplicated and easy to learn, etc. A cluster solution will definitely help you in the case of hardware failure. What I can add to the discussion is that I usually place the NW resources (the /nsr directory) on SAN storage. Then I have the ability to mount the NW resources on another host and start the NW server there. It is simpler to manage, but it will not work during the night when I'm sleeping.
Well, the /nsr resources are on SAN storage; it's just that (I assume) mounting them on a different server is going to change the name of that server and affect the backups (for one thing, the servers file on the client will not let the new host back it up). But I assume that with a cluster (even a "single-node cluster") I can create a logical node corresponding to my backup server and mount the resources there...
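The manual takeover described above could be sketched roughly as below, assuming the /nsr file system lives in a ZFS pool on the shared array. The pool name `nsrpool` and the dataset name are assumptions, and the block runs in dry-run mode, printing the commands rather than executing them.

```shell
#!/bin/sh
# Hypothetical manual-takeover sketch: import the shared ZFS pool
# holding /nsr on the standby host, then start NetWorker there.
# "nsrpool" and the dataset name are assumptions; DRY_RUN=1 (the
# default here) prints each command instead of executing it.

DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run zpool import -f nsrpool          # take over the SAN-resident pool
run zfs mount nsrpool/nsr            # mount the /nsr file system
run /etc/init.d/networker start      # start the NetWorker daemons
```

As the thread notes, though, starting NetWorker on a host with a different name will upset client authentication and licensing, so this only works cleanly if the standby assumes the original server's identity (the logical-node approach above).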
Create a hot cluster with 2 nodes, right? The production node does its thing as normal. Then use RepliStor to replicate the \nsr directory to the second node; this is done continually, up to the minute.
Now, I use this solution for my DR system, although I don't have it clustered. I have the services stopped on the second node (my DR box), and if I have a failure on prod I just start the services on the DR (node 2) box. All the clients, jobs and the rest of the config are there.
Now here is the kicker: the licences go back to the 45-day eval because the host name does not match. But here is the second kicker: once the prod box is back up and running and RepliStor replicates again, it doesn't matter, because if prod goes down again you get another 45 days.
I have failover licences for my DR box, but I can't add them because they get overwritten, and prod won't accept them.
Nothing else changes. You could then do the fabric switching if an SN goes down and present the drives, silo and library to another box. It would most likely be a good idea to have a cold DR SN box sitting around.
But as the hostname does not match, authentication will fail for the backup (unless, of course, you update the name of the second server within the servers file on all clients). There is no need to use RepliStor, as the /nsr data is on the SAN and should be resilient.
100%. Yes mate, in my case I use it for DR only, so I never really need to do backups from that box.
If I was in a situation where I did have a failure on the NW server, then yes, I would have to go into a DR situation and manually log on to the clients and amend the entries in the servers file.
Do you know if there can be 2 entries in this file? Just thought of it. And what do you do if you lose a LUN or an HBA goes belly up? If you get sync loss over the fabric, you could be corrupting the media DB.
For mine, I would host it locally and replicate it; at least that way I am not 100% dependent on the network.
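On the two-entries question: as far as I know, the client's servers file simply takes one NetWorker server hostname per line, so you can pre-authorise both the production and DR hostnames up front and avoid editing every client during a failover. A hypothetical example (the hostnames are made up):

```
# /nsr/res/servers on the client -- one NetWorker server per line.
# Both names pre-authorised so either host may back this client up.
nwprod.example.com
nwdr.example.com
```

Note this only covers client authorisation; the licensing and index-ownership issues discussed above still apply when the server name changes.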