Outage 2018-07-26
Moderators: C:Amie


C:Amie
Posted 2018-07-26 5:11 PM
Administrator, H/PC Oracle
Posts: 14321
Member Nº: 1
Location: Fields End, UK

The outage today was caused by a drive failure that managed to completely pollute my main SAN array, which is where the data for the HPC:Factor server lives.

There was more than enough RAID parity for everything to be OK, but whatever went wrong with the drive decimated the I/O on the SAN channel group for *mysterious* reasons. The fault seems to have occurred at ~6:22am; I was aware of it by 7:05am and started scrambling to rescue data. The RAID controller firmware only noticed at 10:45am, setting off an alarm which made me jump out of my skin, and finally identified the drive through its forced eviction from the group. Once the drive was evicted, the array correctly went over to parity mode and everything settled down.

The backups were all in place and are OK from last night, but through some luck, nothing has needed to be restored.

The HBA is damaged and I've definitely lost 1 drive. The drive I have RMA'd; the controller is out of warranty. Ouch.

It is 5pm now... that was a waste of a day.

Fingers crossed I can keep everything up until parts arrive.
Rich Hawley
Posted 2018-07-26 7:05 PM
Global Moderator, H/PC Guru
Posts: 6987
Member Nº: 122
Location: USA

And how many others reading this post are clueless as to what Chris just entered, except to know that something broke and he fixed it... if you are... then join the club with me and there are now two of us...
stingraze
Posted 2018-07-26 11:41 PM
Writing Team, H/PC Elder
Posts: 2048
Member Nº: 35
Location: Japan

Yes, I noticed the outage. I am glad it has been rescued. Thank you for the hard work. I hate it when server-related issues happen. I once had to re-flash a BIOS for some reason some time ago.

Edited by stingraze 2018-07-26 11:42 PM
CE Geek
Posted 2018-07-27 6:58 AM
Global Moderator, H/PC Oracle
Posts: 11948
Member Nº: 845
Location: Southern California

I've been so busy today that this is the first chance I've had to check in. Gotta love how C:Amie swiftly fixes stuff when we're not looking.
C:Amie
Posted 2018-07-27 7:16 AM
It is all skill and has nothing to do with timezones, I promise.

Looks like we survived the night at least!