Main Page



Clackamas ESD // NIS Outage Wiki



Current Outage Status

02/21/2012 8:48am: No known outages

Past Outage Status

Outage/Notice:

10/14/11 Cisco 7606 BGP and Internet Issue

  • It appears our edge device issue has stabilized and everyone should have full access to all services and the internet.

The root cause of today's issue was a previously unknown bug in the upgraded Cisco IOS loaded this morning at 5AM. The bug causes a memory leak in our BGP process, and while everything was operating normally right after the upgrade, conditions deteriorated throughout the morning, causing excessive traffic loss and retransmits. As more retransmits occurred, throughput steadily decreased. (A rough sketch for monitoring this kind of memory growth is included at the end of this entry.)

To recap how we got to this point: starting last Monday, 10/3/11, we began losing IPv6 connectivity. The root cause of that issue still has not been identified, and we continue to work with both Integra and Cisco TAC to find it. This morning at 5AM I (Brad) replaced our edge router supervisor module with one running the new IOS code.

We do still have the IPv6 issue, in addition to another memory issue with the original supervisor module.

Update: The supervisor modules went down again. New supervisor cards are being shipped to us and should be here by 4:15PM.
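
Given how the memory leak crept up on us over several hours, a simple poll of the BGP process memory would have flagged it earlier. Below is a rough sketch of that kind of check (Python with the third-party Netmiko library; the hostname, credentials, and the assumed column layout of "show processes memory" are placeholders, not our production setup):

  # Sketch: watch the memory held by the BGP process on the 7606 and warn if it
  # keeps growing. Uses the third-party Netmiko library; host, credentials, and
  # the column layout of "show processes memory" are assumptions.
  import time
  from netmiko import ConnectHandler

  DEVICE = {
      "device_type": "cisco_ios",
      "host": "edge-7606.example.net",   # placeholder, not the real hostname
      "username": "monitor",
      "password": "changeme",
  }

  def bgp_holding_bytes(conn):
      """Return the 'Holding' column (bytes) for the BGP Router process."""
      output = conn.send_command("show processes memory | include BGP Router")
      first = output.strip().splitlines()[0].split()
      # Assumed columns: PID TTY Allocated Freed Holding Getbufs Retbufs Process
      return int(first[4])

  def watch(interval=300, samples=12, growth=1.10):
      conn = ConnectHandler(**DEVICE)
      previous = None
      for _ in range(samples):
          holding = bgp_holding_bytes(conn)
          if previous and holding > previous * growth:
              print(f"WARNING: BGP memory grew from {previous} to {holding} bytes")
          previous = holding
          time.sleep(interval)
      conn.disconnect()

  if __name__ == "__main__":
      watch()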


10/1/2010 9:48am - 9:52am: Internet Disruption

  • At 9:48 our Ironport Webfilter #3 lost its ability to process web site requests. While the box appeared to be operating normally, links were not properly returned, causing an outage for everyone whose traffic traversed it. Once the problem was identified, the box was taken out of the active pool in an effort to determine the root cause; however, existing connections remained pinned to it, so a hard power down was required. The box will stay out of commission until a trouble ticket with Cisco can be initiated. The remaining filters are currently meeting our load requirements.


09/27/10 11:20am: DNS Issue

  • The cause of the Internet slowness that everyone experienced today was a corrupt ARP table on the 198.236.20.8 DNS server. Since that DNS server was unable to respond properly to client requests or to transfer requests between other servers, there was a large delay before hosts failed over to our secondary DNS servers.
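
For context on why a dead or misbehaving primary resolver shows up as general slowness, here is a minimal sketch (Python with the third-party dnspython library; the secondary address and lookup name are placeholders) that times a lookup against the primary and only falls back to a secondary once the full timeout has expired:

  # Sketch: illustrate the client-side delay when the primary DNS server stops
  # answering. Uses the third-party dnspython library; the secondary IP and the
  # test name are placeholders.
  import time
  import dns.exception
  import dns.resolver

  PRIMARY = "198.236.20.8"      # the server that had the corrupt ARP table
  SECONDARY = "198.236.20.9"    # placeholder secondary resolver
  NAME = "www.example.org"      # placeholder lookup

  def lookup_with_fallback(name, timeout=5.0):
      start = time.monotonic()
      for server in (PRIMARY, SECONDARY):
          resolver = dns.resolver.Resolver(configure=False)
          resolver.nameservers = [server]
          try:
              answer = resolver.resolve(name, "A", lifetime=timeout)
              elapsed = time.monotonic() - start
              print(f"{name} -> {answer[0]} via {server} after {elapsed:.1f}s")
              return answer
          except dns.exception.Timeout:
              # The whole timeout window is burned before the next server is
              # tried, which is the delay hosts saw before failing over.
              continue
      raise RuntimeError("no resolver answered")

  if __name__ == "__main__":
      lookup_with_fallback(NAME)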


06/3/10 7:15AM-8:20AM: North Clackamas Lost Email Access

  • At 7:15AM ESD Customer Support started receiving calls that North Clackamas was unable to access the Domino Notes email system. James, who was sitting in for Fred, monitored the errors and initiated a call to IBM. The system became operational again at 8:20AM. We are awaiting a root cause analysis report from IBM stating the cause of the outage.


05/17/10 10:40AM-11:07AM: CESD lost connectivity to the SIX

  • At 10:40AM our circuit provider to the SIX inadvertently created a loop in their network while testing for a new circuit and customer being brought up at the SIX. This was fixed at 11:07AM.


04/9/2010 5:10PM: CESD Edge ASA Emergency Maintenance

  • At 5:10PM I fixed the edge ASA code issues and brought the secondary unit back up, and it replicated perfectly. I tested the failover and it worked great, losing only 3 pings on a power-failure test and 2 pings on a link-failure test, which is exactly what it is supposed to do.
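
The ping-loss numbers above come from probing through the firewall pair while forcing the failover; a rough sketch of that kind of test is below (plain Python; the target address is a placeholder and the ping flags shown assume a Linux ping):

  # Sketch: count lost pings across a firewall failover test.
  # Target address and ping flags are placeholders; flags assume Linux ping.
  import subprocess
  import time

  TARGET = "203.0.113.1"   # placeholder address reachable through the edge ASA pair
  DURATION = 60            # seconds to keep probing while the failover is forced
  INTERVAL = 1.0

  def ping_once(host, timeout_s=1):
      """Return True if a single ping gets a reply within timeout_s seconds."""
      result = subprocess.run(
          ["ping", "-c", "1", "-W", str(timeout_s), host],
          stdout=subprocess.DEVNULL,
          stderr=subprocess.DEVNULL,
      )
      return result.returncode == 0

  def main():
      sent = lost = 0
      end = time.monotonic() + DURATION
      while time.monotonic() < end:
          sent += 1
          if not ping_once(TARGET):
              lost += 1
          time.sleep(INTERVAL)
      print(f"{lost} of {sent} pings lost during the failover window")

  if __name__ == "__main__":
      main()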


04/9/2010 2:54PM-3:05PM: CESD Core Router Routing Failure

  • At 2:54PM CESD experienced an 11-minute outage when the edge ASA firewalls both acted as primary devices instead of as a primary/secondary pair. We know what the issue is and plan to fix it tonight after 5PM. It was temporarily fixed by shutting down the secondary unit and rebooting the primary unit.

At 5 PM today we will repair the edge firewalls. You may experience a small outage at this time.


04/9/2010 9:30AM: CESD Core Router Routing Failure

  • Clackamas ESD had a partial outage of internet traffic leaving Clackamas ESD. Some users were affected by no more than a 2-minute outage, while others were affected for 2 hours. The cause was human error, a typo in the edge router configuration. The typo was fixed in less than 2 minutes, but the edge router was left with a lot of unusual routing to our internal network, and we had to physically reboot the edge device to clear up all the routing issues.


03/10/2010 12:05PM: Comcast Core Router Failure

  • Comcast sustained another unscheduled outage. Again it was the core router located at the Troutdale facility; a technician was on-site and the outage was limited to 11 minutes (11:29AM to 11:40AM). The root cause appears to be the supervisor module of the core Foundry switch. The system returned to service on the backup supervisor module, and the failed primary will be replaced. No further interruptions are anticipated.


03/10/2010 11:00AM: Comcast Core Router Failure

  • Comcast experienced a flapping router at the Troutdale facility. The outage this morning lasted from 2:40AM until 4:04AM, when most customers returned to service. OETC did not return and continued without service until 8:48AM, when the port could be manually taken out of error-disable mode. For a still undetermined reason, the Comcast router flapped, creating garbled data on both ring ports. A reboot was required, and Comcast continues researching the root cause of the issue.


03/04/2010 10:04AM: DNS and OAKS Admin Access

  • The root cause of the OAKS/DNS interruption was work being performed at the airast.org level. As shown by the still-low DNS TTL of 300 seconds, they apparently anticipate further modifications, so the frequent refreshes continue. Once they complete their work they should increase the TTL to a larger value. This issue is widespread: any DNS server that references the zone can be adversely affected when the DNS information is flushed. Since this issue can only be resolved from airast.org, there is nothing else for us to do, so we're closing this case.
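
If we want to watch for airast.org raising the TTL, a quick check along these lines would do it (Python with the third-party dnspython library; this is an illustrative sketch, not something currently deployed):

  # Sketch: report the current TTL on the airast.org A record so we can see
  # when the 300-second value is raised. Uses the third-party dnspython library.
  import dns.resolver

  def report_ttl(name="airast.org"):
      answer = dns.resolver.resolve(name, "A")
      ttl = answer.rrset.ttl
      status = "still low (frequent refreshes expected)" if ttl <= 300 else "raised"
      print(f"{name} A record TTL = {ttl}s ({status})")

  if __name__ == "__main__":
      report_ttl()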


01/20/2010 05:18PM: ASA Failover Issue - UPDATE 2

  • The updated IOS code has been uploaded to both ASA firewalls. Everything appears to be working fine, and we will continue to monitor the firewalls over the next 12 hours. We are still considering this an open outage case.


01/20/2010 02:35PM: ASA Failover Issue - UPDATE

  • After working with Cisco since 11am, we have found block-level corruption caused by the version of IOS code. Cisco has sent us an unreleased, stable version of code that will (we/they hope) resolve the issue. Jeremy and I will conduct the upgrade after 5pm unless this occurs again before then. (We did notice another firewall failure around 2:15pm.)


01/20/2010 10:57AM: ASA Failover Issue

  • We experienced a small network blip around 10:57am when our edge firewalls went into failover mode. Normally a failover happens in milliseconds; this one took quite a bit longer. We have opened a high-priority support case to determine why the firewall failed and what can be done to fix it.


11/9/2009 9:55AM: Integra ISP Glitch:

  • This morning at 8:56AM, our Integra services (both ISP and Seattle Peering) were unavailable for up to 2 minutes. The Integra Cisco device had a Route Service Processor (RSP) module malfunction, and the outage duration coincided with the switchover to the backup RSP module. Integra is opening a ticket with Cisco to further investigate the issue and anticipates further off-hours maintenance to repair the failed RSP.


10/7/2009 13:52: WCCP Redirection Failure

  • WCCP stopped redirecting traffic to the web filters at 13:52. Once it was brought to my attention, I restarted the service and traffic started being filtered again.
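
To catch this sooner than a user report next time, a periodic check along the lines of the sketch below could request a page the filters should block and alert if it comes back unfiltered (plain Python; the test URL and block-page text are placeholders and would need to match our actual Ironport block page):

  # Sketch: confirm web traffic is still passing through the filters by requesting
  # a page that should be blocked. The URL and block-page marker are placeholders.
  import sys
  import urllib.request

  TEST_URL = "http://filter-test.example.com/"   # a page the Ironports should block
  BLOCK_MARKER = "Access Denied"                 # text expected on the block page

  def filtering_active():
      try:
          with urllib.request.urlopen(TEST_URL, timeout=10) as resp:
              body = resp.read().decode("utf-8", errors="replace")
      except OSError:
          # Some filters drop blocked requests outright; treat that as "filtered".
          return True
      return BLOCK_MARKER in body

  if __name__ == "__main__":
      if filtering_active():
          print("OK: web filtering appears active")
      else:
          print("ALERT: test page was NOT blocked; WCCP redirection may be down")
          sys.exit(1)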


9/24/2009 11:48am - 9/25/2009 8:30am: Oregon Trail School District // Sandy High School outage:

  • Due to a UPS failure in the MDF, the Cisco 3750 stack failed. The stack contained a 3750-12S running the EMI version of the code and a 3750-48TS running the SMI version. When the 3750-12S lost power, the stack's code flipped from the EMI version to the SMI version, which could not handle the OSPF process.


  • Jeremy was out of cell coverage, so the repair process took longer than expected. Scott installed some static routes as a temporary workaround. Around 11:45pm Scott reloaded both switches remotely and brought the OSPF process back online. Around 8:30am Scott fixed one of the network announcements in the OSPF process so that the wireless LAN controller could route properly. Marie K. had to reboot the wireless LAN controller to make it work.
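
For future reference, a quick remote check like the sketch below (Python with the third-party Netmiko library; the hostname and credentials are placeholders) could confirm the OSPF adjacencies on the stack are back in FULL state before calling the repair done:

  # Sketch: verify OSPF adjacencies on the Sandy High School stack are FULL again.
  # Uses the third-party Netmiko library; host and credentials are placeholders.
  from netmiko import ConnectHandler

  DEVICE = {
      "device_type": "cisco_ios",
      "host": "sandy-mdf-3750.example.net",  # placeholder hostname
      "username": "monitor",
      "password": "changeme",
  }

  def ospf_neighbors_full():
      conn = ConnectHandler(**DEVICE)
      output = conn.send_command("show ip ospf neighbor")
      conn.disconnect()
      neighbors = [line for line in output.splitlines() if "FULL" in line]
      print(f"{len(neighbors)} neighbor(s) in FULL state")
      for line in neighbors:
          print("  " + line)
      return bool(neighbors)

  if __name__ == "__main__":
      ospf_neighbors_full()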


9/15/2009 7:07AM: TWTelecom ISP Glitch:

  • Per scheduled maintenance, TWTelecom increased CESD's committed ISP rate from 600Mbps to 800Mbps. The upgrade produced an ISP access hesitation that was not long enough to drop active TCP connections. No further interruptions are anticipated.


6/29/2009 9:27AM: Reporting Multiple Service Interruptions:

  • 6/28/2009 8:57AM: The Pittock router required a hard power cycle after a failed 1:00AM scheduled reload. The initial reload produced a boot loop, resulting in an ISP outage.
  • 6/28/2009 6:35PM: An intermittent problem was identified that was specific to Clackamas ESD, its component districts, and clients. The root cause was not identified, but the problem was resolved by rebooting the edge device. Consistent service returned at 6:50PM.


5-18-2009 PMR 00666,550 Amy Hoerle: Canby mail server went down at 10:30. Mailed in the NSD. Mail in Mail.box was causing the problem; renamed Mail.box to mailbox051809.old. The server came back up OK about 11:30. A hotfix for the problem will be received later today.


5/15/2009 9:03AM: Service Interruption; ISP and ESiS access experienced a few seconds of loss for reasons undetermined. The symptoms suggest an ASA transition from the primary to the secondary, though the logs don't support this assessment. Further research has not provided concrete evidence of the cause of the disruption. (Follow-up: We have high confidence the problem was caused by the Packetshaper rebooting. During the reboot the device fails open, which caused the first momentary disruption; the second disruption happened when it came back into service. We'll monitor and replace the device if needed.)


5-11-2009 PMR 00297,550 Darryl Sampson 610-885-9010: NClack mail server went down at 7:52 and came back up.

At 9:02 it went down again and has not rebooted. The HTTP job is still running.

Down again and rebooted at 10:11

Down again and rebooted at 11:00

Down again at 11:29 HTTP job is still running.

Down again at 12:19 HTTP job is still running.

Down again at 13:44 HTTP job is still running.

Upgraded NClack to 8.5 at 3:00 and the problem was fixed.


04/28/2009 9:52 AM: The router task on the NClack Domino server died but didn't take down the server. IBM worked on the server, collecting data to analyze the router crash, and rebooted the server at 1:30.


04/25/2009: Oregon Trail fiber outage. According to Wave Broadband, they were setting up a 10Gig fiber link from Sandy to Woodburn, and that work caused the outage on Oregon Trail's network.


04/21/2009 07:36: The NClack Domino process died due to a router process error. RESOLVED as of 7:46.


03/31/2009 15:46: No Internet or file access - RESOLVED as of 15:51 - a server was rebooted due to a security update.

Planned Outage Information

SLA
