Gibbard Operation Practices N45
Introduction
What are we covering?
How to maintain your network.
What to do when it breaks. How to manage changes. How to keep your network from breaking. Documentation. External Communication.
Follow your procedures. Don't panic. You don't need a permanent fix right away.
Prioritization
What services do you care most about? What sorts of customer requests get high priority? Does your night shift NOC person know that? Separate request-types into different priority levels.
Document the priority levels. Document your procedures for different priorities.
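One way to document priority levels so that the night shift can apply them without guessing is to keep them in a small lookup script. This is a minimal sketch: the request types, the priority labels, and the function name `priority_for` are all hypothetical and should be replaced with your own.

```shell
#!/bin/sh
# Hypothetical request types and priority levels; substitute your own.
# priority_for <request-type>
priority_for() {
    case "$1" in
        backbone-outage|customer-outage) echo "P1 page on-call immediately" ;;
        degraded-service)                echo "P2 fix within the shift" ;;
        new-peer|config-request)         echo "P3 next business day" ;;
        *)                               echo "P4 queue for review" ;;
    esac
}

priority_for customer-outage
priority_for new-peer
```

Anything not listed falls through to the lowest priority, so an unrecognized request never pages anyone by accident.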
Paging/Escalation
What happens when there's an alert?
Do you have a NOC with judgment, or an autopager? Can your NOC fix it? Do they have to page somebody else? If paged, do you fix it yourself or talk NOC through fixing it?
Generating too many alerts causes them to get ignored. Getting woken up about stuff that doesn't matter is bad.
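A rough sketch of cutting alert noise before it reaches a pager: pass along only critical alerts, and only the first one per service. The input format ("severity service message") and the function name `pageable_alerts` are assumptions made for illustration, not the output of any particular monitoring system.

```shell
#!/bin/sh
# Assumed input, one alert per line: "<severity> <service> <message...>"
# Prints only critical alerts, and only the first alert seen per service.
pageable_alerts() {
    awk '$1 == "critical" && !seen[$2]++ { print }'
}

printf '%s\n' \
    'critical bgp session down' \
    'warning disk 80 percent full' \
    'critical bgp session down again' | pageable_alerts
```

Warnings and repeats are dropped here; in practice they would more likely go to a log or a daytime queue than be discarded.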
Don't panic
It's the middle of the night. You're tired. It's tempting to start changing things.
You'll feel like you're doing something. Don't!
Planning
So, you turned something off or propped something up, and went back to sleep. Now it's daytime. It's time for a real fix. Your network is running. It's not an emergency. Your interim configuration is probably unstable.
Failure analysis
You've had a bad outage, and can't afford another one. You're having the same outage over and over again. Find out why.
Does the same component break repeatedly? Are there problems with the network architecture? Is it a mystery?
Mystery failures
Collect what information you can.
What does the network look like when it's broken? Is there other data that would point to a cause? Does it happen at the same time every day?
Problems you can see are easier to solve. Is there log data? What else happened at the same time? What could cause that sort of issue? Can you test hypotheses? Don't be afraid to ask for help.
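To check whether a mystery failure happens at the same time every day, it can help to count log entries by hour. This sketch assumes log lines that begin with a "YYYY-MM-DD HH:MM:SS" timestamp; the sample dates and the function name `outages_by_hour` are made up for illustration.

```shell
#!/bin/sh
# Assumed input, one event per line: "YYYY-MM-DD HH:MM:SS <message...>"
# Prints "<hour> <count>" for each hour of the day that has events.
outages_by_hour() {
    awk '{ split($2, t, ":"); count[t[1]]++ }
         END { for (h in count) print h, count[h] }' | sort
}

printf '%s\n' \
    '2024-01-10 03:02:11 link down' \
    '2024-01-11 03:01:47 link down' \
    '2024-01-11 14:30:05 link down' | outages_by_hour
```

A cluster in one hour points toward something scheduled, such as a backup, a cron job, or a daily traffic peak.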
Audience stories.
Managing changes
Sometimes you have to make changes.
Routine changes are changes you make regularly. Non-routine changes are special cases. These are Real changes.
Routine changes
Document procedures and follow them.
You know what worked last time. Don't make it up as you go along each time.
Why type:
    ssh user@router
    enable
    <password>
    conf t
    router bgp <local-as>
    neighbor 192.168.1.5 remote-as 65454
    neighbor 192.168.1.5 peer-group PEER
    neighbor 192.168.1.5 description peer.net #12345
    end
    write
    logout
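A documented procedure can be turned into a small generator so that nobody retypes the peer configuration by hand. This is only a sketch: `peer_commands` is a hypothetical helper that prints the three neighbor lines from the session above, and it leaves logging in and entering BGP configuration mode to whatever login tool you already use.

```shell
#!/bin/sh
# peer_commands <neighbor-ip> <remote-as> <description>
# Prints the neighbor lines for one new peer in the PEER peer-group.
peer_commands() {
    printf 'neighbor %s remote-as %s\n' "$1" "$2"
    printf 'neighbor %s peer-group PEER\n' "$1"
    printf 'neighbor %s description %s\n' "$1" "$3"
}

peer_commands 192.168.1.5 65454 "peer.net #12345"
```

Because the output is plain text, it can be reviewed by a second person before anything is sent to a router.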
Have you tested your procedure? What assumptions are you making? How will you test? Have somebody else review the plan.
Scheduling
How long a window do you need? What will be down during that window? When will customers accept downtime? Are your resources available? Do you have time to get stuck there? Will your co-workers be annoyed if you need their help?
UPS fails. Goes into bypass mode. UPS thought to be fixed. Turning UPS on causes explosion, and blows circuit breaker. Takes large number of customers offline. Utility power restored, but no back-up. UPS fixed again. Without UPS, risk of utility power failure. Cutover to UPS shown to be risky. What do you do?
Risk assessment
Sometimes, all your choices are risky. Sometimes, you don't know what will happen. Or, you think you know what will happen. Use judgment. Pick the option you're least uncomfortable with. Do cost analysis on potential failures and improvements.
"There are known knowns. These are things we know that we know. There are known unknowns. That is to say, there are things that we know we don't know. But there are also unknown unknowns. There are things we don't know we don't know." - Donald Rumsfeld
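A cost analysis can be as simple as multiplying a guessed probability by a guessed cost for each option. The figures below are invented purely to show the arithmetic; they are not estimates for the UPS scenario, and `expected_cost` is a hypothetical helper.

```shell
#!/bin/sh
# expected_cost <probability 0..1> <cost if the failure happens>
# Prints probability * cost, rounded to a whole number.
expected_cost() {
    awk -v p="$1" -v c="$2" 'BEGIN { printf "%.0f\n", p * c }'
}

# Hypothetical numbers, for illustration only.
echo "option A, do nothing:   $(expected_cost 0.02 50000)"
echo "option B, risky change: $(expected_cost 0.25 8000)"
```

The result is only as good as the guesses that go in, so it supports judgment rather than replacing it.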
Tools
Good tools make life much easier. If you've got more than a few routers, manual changes are a real pain. It's better to make a change once and have it happen everywhere. Tools don't have to be complex. RANCID clogin/jlogin makes tool development easy.
    #!/bin/sh
    # Arguments: new user password, new enable password.
    UPASS=$1
    ENABLEPASS=$2
    ROUTERLIST=/usr/local/rancid/tools/routerlist

    # Push the same password change to every router in the list.
    for router in `cat $ROUTERLIST`
    do
    /usr/local/rancid/bin/clogin -c \
    "conf t\r \
    username user pass $UPASS\r \
    enable secret $ENABLEPASS\r \
    end\r \
    write" \
    "$router"
    done
Documentation
When you change something, document it.
Otherwise, you get woken up when it breaks. If you don't remember the details, you're in real trouble. Or, you might not work there anymore.
Architecture
Avoid single points of failure.
Ideally, network failures are self-correcting. Otherwise, being able to turn off broken components is nice.
The KISS Principle says, "Keep it Simple, Stupid." Scaling: If you're successful, your network will need to grow.
You don't need to build the whole thing right away, but don't make growth require a redesign.
Limits of redundancy
Redundancy is a statistical game.
You can still have bad luck. More pieces are good, but diminishing returns hit quickly.
Interconnected devices can fail together. Redundancy protocols can introduce complexity and cause problems. Some vulnerabilities can take out both sides:
Software bugs. Load-related problems. Attacks.
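The diminishing returns can be seen with a little arithmetic. If each unit is up a fraction a of the time and failures are independent, at least one of n units is up with probability 1 - (1 - a)^n. The independence assumption is exactly what the shared vulnerabilities above break, so treat the result as a best case. The function name `combined_availability` and the 0.99 figure are illustrative assumptions.

```shell
#!/bin/sh
# combined_availability <per-unit availability 0..1> <number of units>
# Best case: assumes the units fail independently of each other.
combined_availability() {
    awk -v a="$1" -v n="$2" 'BEGIN { printf "%.6f\n", 1 - (1 - a) ^ n }'
}

for n in 1 2 3 4; do
    echo "$n unit(s): $(combined_availability 0.99 "$n")"
done
```

Going from one unit to two removes most of the expected downtime; each further unit adds far less, while adding cost and complexity.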
Scaling
What is scaling?
How big does your network need to be now? How big might it need to be eventually? How will you get from here to there?
[Diagrams: the network when somewhat bigger, and after scaling]
Standardization: Templates
Standard configurations make life much easier. You shouldn't keep reinventing things. Knowing how one device is configured should mean knowing how the others are configured. Changes can be standardized, too.
    interface FastEthernet0/1
     description <exch-name> switch
     ip address <exch-addr>
     no ip proxy-arp
     full-duplex
     no cdp enable
    !
    interface FastEthernet0/0
     description trunk to switch.<locname>
     no ip address
     no ip proxy-arp
     speed 100
     full-duplex
     no shutdown
    !
    interface FastEthernet0/0.1
     description <loc-name> subnet
     encapsulation dot1Q 1 native
     ip address <local-ip> 255.255.255.240
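A template like this can be filled in mechanically rather than by hand. The sketch below substitutes two of the placeholders with sed; `render_template` is a hypothetical helper, and the exchange name and address in the example are made-up values.

```shell
#!/bin/sh
# render_template <exch-name> <exch-addr>
# Reads a template on stdin and fills in the two placeholders.
# Values must not contain "|" or "&", which are special to sed here.
render_template() {
    sed -e "s|<exch-name>|$1|g" -e "s|<exch-addr>|$2|g"
}

printf '%s\n' \
    'interface FastEthernet0/1' \
    ' description <exch-name> switch' \
    ' ip address <exch-addr>' | render_template EXAMPLE-IX '192.0.2.10 255.255.255.0'
```

Generating every router's configuration from the same template is what makes "knowing one device means knowing the others" true in practice.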
Procedures
Think about services, not components. Repair components proactively. Monitor, but don't over-monitor. Prioritize alerts. Don't get woken up when you don't need to. Plan network changes carefully.
Network engineers are a leading cause of network outages.
Those components are going to break. What happens when they break?
Monitoring
Are both sides of redundant pairs working? How are you doing on capacity?
Circuits, CPU load, memory, disk space. Network and server performance.
Don't over-monitor. Prioritize your alerts. I say more about handling alerts under Paging/Escalation.
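A capacity check does not need to be elaborate. This sketch flags anything at or above a chosen utilization percentage. The input format ("name used capacity"), the resource names, and the function name `over_threshold` are assumptions; real numbers would come from whatever your monitoring already collects.

```shell
#!/bin/sh
# over_threshold <percent>
# Assumed input, one resource per line: "<name> <used> <capacity>"
# Prints "<name> <percent>%" for each resource at or above the limit.
over_threshold() {
    awk -v limit="$1" '
        $3 > 0 {
            pct = ($2 / $3) * 100
            if (pct >= limit) printf "%s %.0f%%\n", $1, pct
        }'
}

printf '%s\n' \
    'circuit-a 80 100' \
    'circuit-b 30 100' \
    'disk-c 450 500' | over_threshold 75
```

A report like this belongs in a daytime review, not on a pager: approaching capacity is something to schedule, not something to be woken up for.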
Be proactive
Do repairs proactively.
If you see a problem, schedule a time to fix it. Use your change management process. Dont cause an outage in the process.
Auditing
Are your configurations standardized? Are your redundant pairs really redundant?
Do your cables go where you think they do? Are all your routing protocol sessions up? Do you have enough capacity?
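Part of a configuration audit can be automated by checking each saved configuration for the lines your standard requires. This sketch assumes one file of required lines and one saved configuration file per device; `missing_lines` is a hypothetical helper, and it matches whole lines exactly, including any leading spaces.

```shell
#!/bin/sh
# missing_lines <required-lines-file> <config-file>
# Prints every required line that does not appear in the configuration.
missing_lines() {
    while IFS= read -r line; do
        # Skip blank lines in the requirements file.
        [ -n "$line" ] || continue
        grep -Fxq -- "$line" "$2" || printf '%s\n' "$line"
    done < "$1"
}
```

Running `missing_lines required.txt router1.cfg` for each device (file names here are examples) and getting no output means that device meets the standard; any output is a list of what to fix in a scheduled window.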
Testing: If you're confident and brave, schedule a window and turn components off.
But make sure you know what you're doing first.
Documentation
Have information you'll need before you need it:
Network diagrams. Service contract numbers. Useful phone numbers. Circuit IDs and end points. Why things were done.
Maintenance announcements
Let people know what's going on.
Ticket systems
Gathers customer communications in one place. Sets up an audit trail. You know what was done, what was said. Makes it easy to take over projects started by other people. Maintains to-do lists. RT is a good open-source ticket system.
Maintenance announcements
Tell your customers and peers before causing outages.
Avoid surprises. Don't make them waste time troubleshooting.