| Info->Outages: |
| 2007-02-12 06:00/120: (scheduled) PACS will be updating the operating systems for Physics and INT with the latest patches. In addition, some new servers will be installed in the racks. Finally, PACS will be moving PDUs to a new private network.

2007-02-05 12:00/60: **UNSCHEDULED** A switch in an office was plugged into two network ports leading to the 99 network switches. This caused a switching loop, which in turn started a packet storm and degraded performance for the entire Astronomy network.

2007-02-01 10:00/180: **UNSCHEDULED** A burning smell was detected in B232. We immediately shut down the power on a breaker that had previously tripped and then quickly shut down all of the clusters. The campus electrical support staff arrived on the scene and discovered that one of the screws holding the 208/30 circuit breaker for Rehr's cluster was not fully tightened. This caused the circuit breaker to trip. When the circuit breaker was reset, the loose electrical connection heated up (as resistance tends to generate heat) and created a burning-plastic smell in the room. Most systems were back up within an hour. This affected only the clusters in B232.

2007-01-29 06:00/120: (scheduled) PACS updated the BIOS on 5 Astronomy servers as well as the RAID controller firmware on two of them.

2007-01-17 16:00/30: **UNSCHEDULED** The file server for LSST in Astronomy had performance problems that required us to shut it down.

2007-01-15 06:00/120: (scheduled) PACS updated all of the Physics and INT Windows and Linux server operating systems. This required a reboot of all servers. In addition, we installed a Fibre Channel controller on the graduate student STF server.

2006-12-27 08:00/180: **UNSCHEDULED** A file server that provides significant disk space in Astronomy stopped responding. PACS did not hear about this outage until 11 AM.

2006-12-14 18:00/840: **UNSCHEDULED** PACS took down a series of servers to protect them against a winter storm, in particular systems that did not need to be up overnight and file servers attached to the SAN.

2006-12-12 06:00/360: (scheduled) Cooling and power upgrades are being finished in this outage. All of the non-critical servers in the department will be down. In this upgrade PACS will make the following improvements:

2006-12-11 06:00/480: (scheduled) Cooling is being upgraded and clusters are being serviced in this outage. Specifically, tigger has a bad fan and one of Robert's cluster nodes appears to have a RAM error. In addition, we're re-racking John Rehr's clusters for

2006-10-10 08:30/180: Serviced the deuteronomy head node UPS because of a "battery error" status. After diagnostics, the error message went away.

2006-10-03-0630-180: (scheduled) In this scheduled outage, the circuit breaker panel in B232 needed to be shut down in order to add three additional 208V services. This required a shutdown of the machines in the server room. During this outage PACS rewired a rack with a new PDU and updated all of the server operating systems for Physics, Astronomy, and INT. The outage was extended 1 hour due to the following reasons:

2006-09-21-1100-060: Astronomy's file server crypt2 is currently down due to a Local Loop error on the Infortrend RAID controller. It looks like the chassis needs to be swapped out. We'll be contacting the vendor today.

2006-09-21-0700-030: Took down Astronomy file server crypt1 (servers megas 1 and 2) due to a software RAID controller failure. We lost a software RAID disk, took it offline, rebuilt it, and then rebooted crypt1. We rebooted crypt1 because it has a history of being flaky after the software RAID fails and is recovered. |
| help
at phys dot washington dot edu "Are you ready for the Summer of Love and Research?" |