The Mako cluster will not be accessible to users from Thursday, April 15th, 8:00 am PDT to Tuesday, April 20th, 5:00 pm PDT.
During this downtime, we will replace the InfiniBand switch in the ShaRCS Mako cluster to improve cluster performance by reducing communication latency between nodes.
This will require powering down all the cluster nodes for most of the downtime. We will also have to power down the login and management nodes for shorter periods, so there will be some unavoidable interruptions to user access to the cluster and to data on the cluster filesystems. Batch queues will be unavailable for the whole downtime, and there will be a reservation in place to prevent jobs from running. The reservation will start on April 15th at 8:00 am PDT.
After the switch replacement, cluster access will be restored and we will send you an email notification. There will not be any effect on user data in home and scratch directories.
We apologize for the late notification and for the interruption while we work on improving system performance.
The ShaRCS cluster Thresher will be unavailable on Monday, April 12, and Tuesday, April 13, while we complete the replacement of the InfiniBand switch. This will require removing power from the login and management nodes, so logins and batch queues will be unavailable.
You are responsible for logging yourself out and ensuring that any processes you are running exit cleanly. There is a reservation in place on the batch queues to prevent jobs from running during the downtime, so no jobs should be interrupted. The reservation begins at 6:00 a.m. PDT on Monday, and the machine will be shut down at 7:00 a.m. PDT. The batch queues will be purged during the outage, and jobs will need to be resubmitted.
We will also take advantage of the downtime to perform a software update. If you have compiled any software on Thresher, you may need to recompile it to ensure that it works correctly with the new environment. Your home directories and scratch space should not be affected.
Thanks for your continued patience while we improve the system. Thresher should return to service Tuesday afternoon, and we will keep you up to date if anything changes.
Over the next few weeks, we will be working to replace the InfiniBand switch in the ShaRCS cluster Thresher. During this time, there will be some unavoidable periods of reduced capability and downtime.
From now (March 31, 2010) until April 11, 2010, Thresher will run with a reduced number of nodes (136) while the other half of the nodes are isolated for testing the new switch.
Between Monday, April 12, and Wednesday, April 14, the new switch will be integrated into the cluster. This will require shutting down all of the management and login nodes, so batch queues and user access will be unavailable. The dates for this integration are tentative; we will notify you once we confirm the required outage period.
Thanks for your patience while we improve the system.
- SDSC Team