C:Amie (Administrator, H/PC Oracle) Posted 2018-07-26 5:11 PM
Posts: 17,979, Location: United Kingdom
The outage today was caused by a drive failure that managed to completely pollute my main SAN array, which is where the data for the HPC:Factor server lives.

There was more than enough RAID parity for everything to be OK, but whatever went wrong with the drive decimated the I/O on the SAN channel group for *mysterious* reasons. The fault seems to have occurred at ~6:22am; I was aware of it by 7:05am and started scrambling to rescue data. The RAID controller firmware only noticed at 10:45am, setting off an alarm which made me jump out of my skin, and finally identified the drive by forcibly evicting it from the group. Once the drive was evicted, the array correctly went over to parity mode and everything settled down.

The backups from last night were all in place and are OK but, through some luck, nothing has needed to be restored.

The HBA is damaged and I've definitely lost 1 drive. The drive I have RMA'd; the controller is out of warranty. Ouch.

It is 5pm now... that was a waste of a day.

Fingers crossed I can keep everything up until parts arrive.
Rich Hawley (Global Moderator, H/PC Guru) Posted 2018-07-26 7:05 PM
Posts: 7,188, Location: USA
And how many others reading this post are clueless about what Chris just wrote, except to know that something broke and he fixed it... If you are, then join the club with me, and there are now two of us...
stingraze (Subscribers, H/PC Vanguard) Posted 2018-07-26 11:41 PM
Posts: 3,678, Location: Japan
Yes, I noticed the outage. I am glad it has been rescued. Thank you for the hard work. I hate it when server-related issues happen. I once had to re-flash a BIOS for some reason some time ago.

Edited by stingraze 2018-07-26 11:42 PM
CE Geek (Global Moderator, H/PC Oracle) Posted 2018-07-27 6:58 AM
Posts: 12,668, Location: Southern California
I've been so busy today that this is the first chance I've had to check in. Gotta love how C:Amie swiftly fixes stuff when we're not looking.
C:Amie (Administrator, H/PC Oracle) Posted 2018-07-27 7:16 AM
Posts: 17,979, Location: United Kingdom
It is all skill and has nothing to do with timezones, I promise.

Looks like we survived the night at least!