The Border Gateway Protocol
BGP is the industry standard routing protocol used for inter-organization and Internet routing. While it is based on the rather simple distance/path vector paradigm, it offers an extensive, highly customizable set of tools and attributes for policy control.
BGP policies are extremely powerful and expressive, also when used internally by organizations (transit service providers or large enterprises), thanks to BGP's continuity of policy enforcement, stability and scalability. Beyond its Internet-oriented origin, BGP has evolved further and has also become the main tool for inter-organization multicast topology propagation and, in particular, for MPLS VPN (VRF) route propagation.
Peering Communication
BGP uses TCP (port 179) as its transport, which ensures reliable delivery. Once the TCP session is established, BGP speakers (peer routers) exchange OPEN messages to determine capabilities and connection parameters and, if required, to authenticate each other. These messages communicate key values such as: the AS number of the sender, the BGP version, the hold time (from which the keepalive interval is derived) and the BGP router ID.
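As an illustration of the values carried in an OPEN message, the following Python sketch builds and parses a minimal OPEN without optional parameters. The field layout (16-octet marker, length, type, then version, AS number, hold time, router ID and optional-parameter length) follows the BGP-4 specification; the function names and the sample values are assumptions made for this example only.

```python
import socket
import struct

BGP_MARKER = b"\xff" * 16   # 16-octet marker, all bits set
MSG_OPEN = 1                # message type code of OPEN
HEADER_LEN = 19             # marker (16) + length (2) + type (1)

def build_open(my_as: int, hold_time: int, router_id: str, version: int = 4) -> bytes:
    """Build an OPEN message that carries no optional parameters."""
    body = struct.pack("!BHH4sB", version, my_as, hold_time,
                       socket.inet_aton(router_id), 0)
    header = BGP_MARKER + struct.pack("!HB", HEADER_LEN + len(body), MSG_OPEN)
    return header + body

def parse_open(message: bytes) -> dict:
    """Decode the fixed part of an OPEN message (optional parameters are ignored)."""
    length, msg_type = struct.unpack("!HB", message[16:19])
    if message[:16] != BGP_MARKER or msg_type != MSG_OPEN or length != len(message):
        raise ValueError("not a well-formed OPEN message")
    version, my_as, hold_time, ident, _opt_len = struct.unpack("!BHH4sB", message[19:29])
    return {
        "version": version,
        "as_number": my_as,
        "hold_time": hold_time,
        "router_id": socket.inet_ntoa(ident),
    }
```

For example, parse_open(build_open(65001, 180, "192.0.2.1")) gives back the AS number 65001, the hold time 180 and the router ID 192.0.2.1. Only 2-octet AS numbers are modelled here.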
Upon establishment or hard reset of a BGP session, all candidate BGP routes are exchanged between the peers. After the initial (full) route exchange has occurred, only incremental updates are sent as network information changes. Compared with periodically re-sending the full table, the incremental update approach greatly reduces CPU overhead.
Routes are advertised or withdrawn using UPDATE messages. The UPDATE message contains, among other things, a list of <length, prefix> tuples that indicate the destinations that can be reached through (or must be removed for) a BGP speaker, together with all the BGP attributes attached to those tuples.
As long as no routing change occurs, routers send only KEEPALIVE packets to keep the BGP session active. On Cisco routers the default keepalive interval is 60 seconds (the hold time is 180 seconds).
In addition, BGP provides a NOTIFICATION mechanism to gracefully close a connection with a peer. In the event of a disagreement between the peers, or of a manual shutdown or reset, a NOTIFICATION error message is sent to inform the counterpart (graceful session shutdown). The benefit of this mechanism is that both peers understand that the connection has been torn down and do not waste resources or blindly re-attempt to establish it. A NOTIFICATION message is also always sent whenever an error is detected.
Peering can be customized through the following (configuration is covered further on):
– Basic global BGP parameters (router ID, timers)
– Basic peering parameters (neighborship, AS number)
– BGP soft reconfiguration type
– MD5 authentication
– Peer groups, update groups, templates
BGP Attributes
BGP policies are built around attributes attached to the exchanged prefixes, which are modified or interpreted according to the administrators' configuration and the agreements between organizations. At its core BGP is a rather simple routing algorithm: apart from path selection, only loop avoidance and conditional advertisements influence the decision of which path to install and advertise.
The main attributes used in BGP policies are:
NAME                   Support     Presence       eBGP/iBGP
---------------------  ----------  -------------  ------------------------
WEIGHT                 Cisco       –              –
ORIGIN                 well-known  mandatory      (transitive)
AS_PATH                well-known  mandatory      (transitive)
NEXT_HOP               well-known  mandatory      (transitive)
LOCAL_PREF             well-known  discretionary  (transitive / iBGP only)
ATOMIC_AGGREGATE       well-known  discretionary  (transitive)
AGGREGATOR             optional    –              transitive
MULTI_EXIT_DISC (MED)  optional    –              non-transitive (eBGP)
COMMUNITY              optional    –              transitive
ORIGINATOR_ID          optional    –              non-transitive
CLUSTER_LIST           optional    –              non-transitive
Legend:
well-known = all implementations must support this attribute.
optional = implementations may support or ignore this attribute.
mandatory = it must be present in every announcement.
discretionary = it may or may not be present in announcements, depending on peer policy.
transitive = it is maintained when crossing an AS (organization) boundary.
non-transitive = it is removed when crossing an AS (organization) boundary (the receiving peer must/can use it, though).
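On the wire, part of this classification is encoded in the flags octet that precedes every path attribute. The sketch below decodes the two bits that correspond to the legend above, using the bit positions defined by the BGP-4 specification; the constant and function names are chosen for this example only. The mandatory/discretionary distinction is not encoded in the flags: it is a property of the attribute type itself.

```python
# Bit positions of the attribute flags octet (high-order bit first).
FLAG_OPTIONAL = 0x80         # 1 = optional, 0 = well-known
FLAG_TRANSITIVE = 0x40       # 1 = transitive, 0 = non-transitive
FLAG_PARTIAL = 0x20          # set when an optional transitive attribute was not understood en route
FLAG_EXTENDED_LENGTH = 0x10  # 1 = the attribute length field is two octets instead of one

def classify(flags: int) -> tuple:
    """Return (support, scope) in the vocabulary of the legend."""
    support = "optional" if flags & FLAG_OPTIONAL else "well-known"
    scope = "transitive" if flags & FLAG_TRANSITIVE else "non-transitive"
    return support, scope
```

With these definitions, a flags octet of 0x40 (as carried by ORIGIN or AS_PATH) classifies as well-known and transitive, while 0x80 (as carried by MED) classifies as optional and non-transitive.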
NEXT_HOP (Type Code 3): Checked as a condition to be met.
The NEXT_HOP is the next hop announced with the prefix. It is mandatory for each route and must always be verified against the router’s RIB. Apart from this validation it is not a comparison attribute: it simply MUST be reachable for the prefix to be selectable for the RIB. By default the next hop is left unchanged on iBGP sessions, while it is rewritten when a route is advertised over eBGP. An iBGP speaker can be configured to set itself as the NEXT_HOP towards its iBGP peers (“neighbor xxxx next-hop-self” command); this is typical of an iBGP speaker that is also the eBGP border router, or of NBMA networks with a hub-and-spoke topology.
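The reachability condition can be pictured with a small Python sketch. It treats the RIB as a plain list of prefixes and only checks whether any of them covers the next hop; the function name and the sample prefixes are assumptions for illustration, and a real router performs a recursive lookup rather than this simple membership test.

```python
import ipaddress

def next_hop_reachable(next_hop: str, rib_prefixes: list) -> bool:
    """True if at least one prefix in the (toy) RIB covers the next hop address."""
    address = ipaddress.ip_address(next_hop)
    return any(address in ipaddress.ip_network(prefix) for prefix in rib_prefixes)

# Toy RIB: a connected subnet and a loopback range learned from the IGP.
rib = ["10.0.0.0/24", "192.168.255.0/24"]
usable = next_hop_reachable("10.0.0.5", rib)      # covered by 10.0.0.0/24
unusable = next_hop_reachable("192.0.2.1", rib)   # no covering prefix
```

A BGP route whose next hop fails this kind of check stays in the BGP table but is not a candidate for the best path.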
WEIGHT Attribute: Highest, the better
This is a Cisco proprietary attribute that is used only LOCALLY on the receiving router (it is not even carried in BGP announcements). It allows the administrator to specify a preferred path when multiple paths to a destination exist out of a router. Since it is evaluated before all standard attributes, it effectively overrules the IETF standard BGP selection process. Weights can be applied to individual routes or to all routes received from a peer. The weight value ranges from 0 to 65,535. Weight can be used in iBGP, instead of local preference, or in eBGP, instead of AS_PATH or MED, to influence the selected path.
LOCAL_PREF (Type Code 5): Highest, the better
Well-known discretionary attribute. It is used by an iBGP speaker to inform other iBGP speakers about the degree of preference for a given route. If the receiving peer gets the same route from other sources, the local preference can be used to select the best route (according to the selection algorithm). As the local preference is the first tie-breaker used in iBGP, this attribute has a great influence within the borders of the organization's AS. Please note that the local preference is compared even when the routes were originated in different ASs, as long as the receiving peer learns them through iBGP. This makes it a way to force a single egress point on a pair of border routers: both have a direct eBGP session and peer with each other over iBGP, and the local preference is set very high on the preferred exit point. The default value is 100.
AS_PATH (Type Code 2): Shortest, the better
AS_PATH is a well-known mandatory attribute composed of a sequence of AS path segments. It represents the BGP “path vector” and is used as a loop prevention mechanism: when an eBGP peer receives a prefix whose AS_PATH already contains its own AS number, it rejects the path. The AS_PATH also acts as a metric for route selection: if routes must be compared on their AS_PATH (typical of eBGP, where the Local Preference is not exchanged), the shortest AS_PATH (the one listing the fewest AS numbers) is preferred. The AS_PATH is populated from right to left: the leftmost AS is the neighboring AS from which the route was received, the rightmost AS is the one that originated the route. Being mandatory, it can be used to influence the return path from another AS (unlike MED, which is only an optional attribute): AS_PATH prepending is often used to achieve this. If an agreement can be made with the external ISP, however, MED normally offers a conceptually cleaner approach.
Note: there is a private range of AS numbers that organizations can use internally without an official assignment, but that cannot be used for eBGP routing with “real” ASs (the private AS numbers are removed when interconnecting further to real ASs). For 2-octet AS numbers this range is 64512-65534 (65535 is reserved), and it is normally used in confederations or non-transit environments.
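The loop check and the prepending technique can be sketched in a few lines of Python. The AS_PATH is modelled as a plain list with the most recent AS on the left, as described above; the function names and AS numbers are assumptions for this illustration, and AS_SET segments are not modelled.

```python
def accept_from_ebgp(as_path: list, local_as: int) -> bool:
    """Loop prevention: reject a path that already contains our own AS number."""
    return local_as not in as_path

def advertise_to_ebgp(as_path: list, local_as: int, prepend: int = 0) -> list:
    """Add our AS on the left; 'prepend' extra copies make the path look longer."""
    return [local_as] * (1 + prepend) + list(as_path)

# AS 65001 advertises a route learned via 65002 <- 65003 on two links,
# prepending twice on the link it wants the neighbor to use only as backup.
primary = advertise_to_ebgp([65002, 65003], 65001)
backup = advertise_to_ebgp([65002, 65003], 65001, prepend=2)
```

Here the primary path has three AS numbers and the backup path five, so a neighbor that compares on AS_PATH length prefers the primary link. A path such as backup would be rejected by AS 65003 because its own number already appears in it.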
ORIGIN (Type Code 1): IGP << EGP << Incomplete
ORIGIN is a well-known mandatory attribute that defines the origin of the path information and is considered in the BGP decision process after the AS_PATH comparison. It is rarely used as an explicit discriminator, but it is evaluated by the BGP path selection algorithm and should be set properly:
0 IGP – Network Layer Reachability Information is interior to the originating AS (= injected with “network command”);
1 EGP – Network Layer Reachability Information learned via EGP;
2 INCOMPLETE – Network Layer Reachability Information learned by some other means (= injected via redistribution).
MULTI_EXIT_DISC (Type Code 4): Lowest, the better
An optional, non-transitive attribute, used at the boundary between different ASs (upon agreement). The lowest MED value is preferred, the default value being 0. As the MED attribute is non-transitive, a MED received from a neighboring AS may be propagated over iBGP inside the receiving AS, but it is not passed on to further ASs. Important: unexpected behaviour can occur if a peer does not set the MED attribute at all. By default a missing MED is treated as the best possible value, so a route without MED can win over routes that carry an explicit MED (!!). This can be overruled explicitly with the “bgp bestpath med missing-as-worst” command, which treats a missing MED as the highest possible value. Further, by default MED is compared only between routes received from the same neighboring AS; routes advertised by different ASs are not compared with each other using MED. This can be overruled with the “bgp always-compare-med” command. Finally, the outcome of the MED comparison can depend on the order in which the routes were received; the “bgp deterministic-med” command removes this dependency by first grouping and comparing the routes that come from the same AS.
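The attributes discussed so far can be combined into a pairwise comparison. The following Python sketch is a deliberately simplified model, not the complete decision process of any router: it only covers weight, local preference, AS_PATH length, ORIGIN and MED, and the two keyword arguments mimic the effect of the “bgp always-compare-med” and “bgp bestpath med missing-as-worst” commands. The class, function and parameter names are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

MED_WORST = 2**32 - 1   # highest value of the 4-octet MED field

@dataclass
class Path:
    as_path: List[int]           # leftmost entry = neighboring AS
    weight: int = 0              # higher is better (local to the router)
    local_pref: int = 100        # higher is better
    origin: int = 0              # 0 IGP, 1 EGP, 2 INCOMPLETE; lower is better
    med: Optional[int] = None    # lower is better; None = attribute absent

def effective_med(path: Path, missing_as_worst: bool) -> int:
    if path.med is None:
        return MED_WORST if missing_as_worst else 0
    return path.med

def prefer(a: Path, b: Path, always_compare_med: bool = False,
           missing_as_worst: bool = False) -> Optional[Path]:
    """Return the preferred path, or None if the modelled attributes tie."""
    checks = [
        (a.weight, b.weight, True),
        (a.local_pref, b.local_pref, True),
        (len(a.as_path), len(b.as_path), False),
        (a.origin, b.origin, False),
    ]
    same_neighbor_as = bool(a.as_path) and bool(b.as_path) and a.as_path[0] == b.as_path[0]
    if always_compare_med or same_neighbor_as:
        checks.append((effective_med(a, missing_as_worst),
                       effective_med(b, missing_as_worst), False))
    for value_a, value_b, higher_wins in checks:
        if value_a != value_b:
            a_wins = value_a > value_b if higher_wins else value_a < value_b
            return a if a_wins else b
    return None
```

Comparing Path([65010, 65020], med=50) with Path([65010, 65030]) (no MED) selects the second path by default and the first one when missing_as_worst=True, which is the pitfall described above. In a real implementation further tie-breakers (eBGP over iBGP, IGP metric to the next hop, router ID) follow when this function would return None.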
COMMUNITY (Type Code 8)
Optional, transitive attribute that is not used directly for path selection; instead it is used to tag routes so that policy decisions can be applied to them. It is typically used in iBGP scenarios for transit policies. It is especially useful to trace where a route originated (a router or set of routers), the protocol from which a route originated (when redistribution is used) or other custom conditions. It is basically the attribute to use when BGP must be customized and the standard attributes do not offer enough flexibility. As it is a transitive attribute, it is normally used within the organization for transit policy purposes and cleared at the egress point (when crossing AS boundaries).
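A standard community is a 32-bit value, conventionally written as two 16-bit halves in the form AS:value. The sketch below shows this encoding and a tag-based policy check; the two well-known values are those defined for NO_EXPORT and NO_ADVERTISE, while the helper names and the tag 65001:100 (meaning, say, "learned from a customer") are assumptions for this example.

```python
NO_EXPORT = 0xFFFFFF01      # well-known: do not advertise outside the local AS
NO_ADVERTISE = 0xFFFFFF02   # well-known: do not advertise to any peer

def encode_community(asn: int, value: int) -> int:
    """Pack the AS:value notation into the 32-bit community value."""
    return (asn << 16) | value

def format_community(community: int) -> str:
    """Render a 32-bit community in AS:value notation."""
    return f"{community >> 16}:{community & 0xFFFF}"

def may_export(communities: set) -> bool:
    """Policy sketch: suppress routes tagged NO_EXPORT or NO_ADVERTISE at the AS boundary."""
    return NO_EXPORT not in communities and NO_ADVERTISE not in communities

customer_tag = encode_community(65001, 100)   # hypothetical local tagging scheme
```

A route tagged only with customer_tag passes may_export, whereas one that also carries NO_EXPORT does not.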
ATOMIC_AGGREGATE (Type Code 6)
A well-known discretionary attribute, used purely to signal that a route has been aggregated and that, somewhere along the path, a more specific route will be used. Note: if the router is also the aggregator, the AGGREGATOR attribute (described next) is set as well. Used by BGP internal mechanisms and not for customized policies.
AGGREGATOR (Type Code 7)
The attribute contains the last AS number that formed the aggregate route (encoded as 2 octets), followed by the IP address of the BGP speaker that formed the aggregate route (encoded as 4 octets). Used by BGP internal mechanisms and not for customized policies.
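The 2 + 4 octet layout of the attribute value can be reproduced with Python's struct module. This is only a sketch of the value field described above (without the attribute flags, type and length that precede it on the wire); the function names and sample values are assumptions for illustration.

```python
import socket
import struct

def encode_aggregator(asn: int, speaker_ip: str) -> bytes:
    """2-octet AS number followed by the 4-octet IPv4 address of the aggregating speaker."""
    return struct.pack("!H4s", asn, socket.inet_aton(speaker_ip))

def decode_aggregator(value: bytes) -> tuple:
    asn, raw_ip = struct.unpack("!H4s", value)
    return asn, socket.inet_ntoa(raw_ip)
```

For instance, encode_aggregator(65001, "192.0.2.1") yields a 6-octet value that decode_aggregator turns back into the pair (65001, "192.0.2.1").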
ORIGINATOR_ID (Type Code 9)
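Optional, non-transitive. Set by a Route Reflector (RR) to the router ID of the router that originated the route in the local AS. A router that receives a route carrying its own router ID as ORIGINATOR_ID discards it, which prevents a reflected route from being accepted back by its originator. Used by BGP internal mechanisms and not for customized policies.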
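CLUSTER_LIST (Type Code 10)
Optional, non-transitive. Also used by Route Reflectors, for loop avoidance between multiple clusters: each RR adds its cluster ID to the list when reflecting a route and discards routes whose CLUSTER_LIST already contains its own cluster ID. Used by BGP internal mechanisms and not for customized policies.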
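The two route reflection checks can be sketched as follows. The logic mirrors the behaviour defined for route reflection (ORIGINATOR_ID is set once by the first reflector, each reflector prepends its cluster ID), but the function names, the use of dotted strings as identifiers and the sample values are assumptions for this illustration.

```python
from typing import List, Optional, Tuple

def reflect(source_router_id: str, cluster_id: str,
            originator_id: Optional[str], cluster_list: List[str]) -> Tuple[str, List[str]]:
    """Attributes a route reflector attaches when it reflects a route."""
    # ORIGINATOR_ID is created only if an earlier reflector has not set it already.
    new_originator = originator_id if originator_id is not None else source_router_id
    # The reflector's own cluster ID is prepended to the CLUSTER_LIST.
    return new_originator, [cluster_id] + list(cluster_list)

def accept_reflected(router_id: str, cluster_id: Optional[str],
                     originator_id: Optional[str], cluster_list: List[str]) -> bool:
    """Loop checks on receipt; cluster_id is None on routers that are not reflectors."""
    if originator_id == router_id:
        return False            # our own route came back to us
    if cluster_id is not None and cluster_id in cluster_list:
        return False            # the route already passed through our cluster
    return True
```

A client with router ID 10.0.0.1 whose route is reflected by cluster 1.1.1.1 sees the pair ("10.0.0.1", ["1.1.1.1"]) attached, and would reject that route if it were ever offered back to it.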
=> Extra community “attributes” that are supported on Cisco routers:
– BGP link bandwidth extended community (influences multipath load sharing over eBGP paths);
– BGP cost community (only for iBGP purposes).
More on BGP Peering Communication
BGP uses TCP as its transport protocol (port 179). This ensures that all the transport reliability (such as retransmission) is taken care of by TCP and does not need to be implemented in BGP, which keeps the protocol itself simpler.
Routers that run a BGP routing process are often referred to as BGP Speakers. Two BGP speakers that form a TCP connection between one another for the purpose of exchanging routing information are referred to as Neighbors or Peers.
Peer routers exchange OPEN messages to determine the connection parameters. If required, they authenticate each other during the OPEN message exchange. These messages are used to communicate values such as: the AS number, the BGP version running, the hold time and the BGP router ID (typically a loopback address).
BGP also provides a NOTIFICATION mechanism to gracefully close a connection with a peer. In other words, in the event of a disagreement between the peers, whether it results from configuration, incompatibility, operator intervention or other circumstances, a NOTIFICATION error message is sent, and the peer connection is either not established or torn down if it was already established. The benefit of this mechanism is that both peers understand that the connection could not be established or maintained, and do not waste resources maintaining it or blindly reattempting to establish it. A NOTIFICATION message is always sent whenever an error is detected.
Initially, when a BGP session is established between a pair of BGP speakers, all candidate BGP routes are exchanged. After the session has been established and the initial (full) route exchange has occurred, only incremental updates are sent as network information changes. Compared with periodically re-sending the full table, the incremental update approach greatly reduces CPU overhead.
Routes are advertised between a pair of BGP routers in UPDATE messages. The UPDATE message contains, among other things, a list of <length, prefix> tuples that indicate the list of destinations that can be reached/removed for a BGP speaker. The UPDATE message also contains the path attributes, which include such information as the degree of preference for a particular route and the list of ASs that the route has traversed.
The UPDATE message is used both for advertising and for withdrawing routes.
If no routing changes occur, the routers exchange only KEEPALIVE packets. The recommended KEEPALIVE interval is one-third of the Hold Timer value. If the Hold Timer value is 0, periodic KEEPALIVE messages are not sent. On Cisco routers the default keepalive interval is 60 seconds (the hold time is 180 seconds).
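The timer relationship can be written down in a few lines. The sketch below assumes the behaviour of the BGP-4 specification, in which each side proposes a hold time in its OPEN message, the smaller of the two is used, and a hold time must be either 0 or at least 3 seconds; the function name is an assumption for this example.

```python
def negotiate_timers(local_hold: int, peer_hold: int) -> tuple:
    """Return (hold_time, keepalive_interval) in seconds for a session."""
    for value in (local_hold, peer_hold):
        if value < 0 or value in (1, 2):
            raise ValueError("hold time must be 0 or at least 3 seconds")
    hold = min(local_hold, peer_hold)   # the smaller proposal wins
    keepalive = hold // 3               # one-third of the hold time; 0 = no keepalives
    return hold, keepalive
```

With the Cisco defaults on both sides this gives (180, 60); if the peer proposes 90 seconds the result is (90, 30), and a hold time of 0 on either side disables periodic KEEPALIVE messages.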
Summarising, BGP uses four message types:
– OPEN
– UPDATE
– KEEPALIVE
– NOTIFICATION
The following picture shows the BGP state machine.
Notes: In the OpenSent state, the BGP speaker determines, by comparing its own AS number with the AS number of its peer, whether the peer belongs to the same AS (Internal BGP) or to a different AS (External BGP).
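As a rough textual companion to the state machine, the following Python sketch models only the successful path through the six states named in the BGP-4 specification, plus the iBGP/eBGP decision from the note above. The event names are hypothetical labels invented for this example, and sending every unlisted event back to Idle is a simplification of the real error handling.

```python
from enum import Enum

class State(Enum):
    IDLE = "Idle"
    CONNECT = "Connect"
    ACTIVE = "Active"
    OPEN_SENT = "OpenSent"
    OPEN_CONFIRM = "OpenConfirm"
    ESTABLISHED = "Established"

# (current state, event) -> next state; event names are illustrative only.
TRANSITIONS = {
    (State.IDLE, "start"): State.CONNECT,
    (State.CONNECT, "tcp_connected"): State.OPEN_SENT,       # OPEN is sent here
    (State.CONNECT, "tcp_failed"): State.ACTIVE,
    (State.ACTIVE, "tcp_connected"): State.OPEN_SENT,
    (State.OPEN_SENT, "open_received"): State.OPEN_CONFIRM,  # KEEPALIVE is sent here
    (State.OPEN_CONFIRM, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "keepalive_received"): State.ESTABLISHED,
    (State.ESTABLISHED, "update_received"): State.ESTABLISHED,
}

def step(state: State, event: str) -> State:
    """Unlisted events (errors, NOTIFICATION, timer expiry) fall back to Idle."""
    return TRANSITIONS.get((state, event), State.IDLE)

def session_type(local_as: int, peer_as: int) -> str:
    """The comparison made on receipt of the peer's OPEN message."""
    return "iBGP" if local_as == peer_as else "eBGP"
```

Feeding the events start, tcp_connected, open_received and keepalive_received in sequence takes a session from Idle to Established.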
UPDATE Message Format
An UPDATE message can advertise at most ONE set of path attributes, but it can carry several destinations provided that they all share those attributes. All path attributes contained in a given UPDATE message apply to the destinations carried in the Network Layer Reachability Information field of that message.
An UPDATE message can list multiple routes to be withdrawn from service. Each such route is identified by its destination (expressed as an IP prefix), which unambiguously identifies the route in the context of the BGP speaker to BGP speaker connection on which it was previously advertised.
An UPDATE message may advertise only routes to be withdrawn from service, in which case it will not include path attributes or Network Layer Reachability Information.
Conversely, it may advertise only a feasible route, in which case the WITHDRAWN ROUTES field need not be present.
The UPDATE message always includes the fixed-size BGP header, and can optionally include the other fields as shown below:
Unfeasible Routes Length: indicates the total length of the “Withdrawn Routes” field in octets. A value of 0 indicates that no routes are being withdrawn from service, and that the WITHDRAWN ROUTES field is not present.
Withdrawn Routes: IP prefixes being withdrawn from service. Each IP address prefix is encoded as a 2-tuple of the form <length, prefix>. Example <14,172.16.0.0> means that route 172.16.0.0/14 should be removed from service.
Total Path Attribute Length: indicates the total length of the Path Attributes field in octets. A value of 0 indicates that no Network Layer Reachability Information field is present in this UPDATE message.
Path Attributes: a variable-length sequence of path attributes, present in every UPDATE except those that only withdraw routes. Each path attribute is a triple: <Attribute Type, Attribute Length, Attribute Value>.
Network Layer Reachability Information: contains a list of IP address prefixes. Reachability information is encoded as one or more 2-tuples of the form <length, prefix>. Example <14,172.16.0.0> means that route 172.16.0.0/14 is offered.
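The field layout described above can be walked through with a short Python sketch. It encodes each <length, prefix> tuple as one length octet (in bits) followed by the minimum number of prefix octets, as the BGP-4 specification defines for IPv4, and splits an UPDATE body (the part after the 19-octet header) into its three sections. The function names are assumptions for this example, and the path attributes are returned as opaque bytes rather than decoded.

```python
import ipaddress
import struct

def encode_prefix(prefix: str) -> bytes:
    """<length, prefix>: prefix length in bits, then only the significant octets."""
    network = ipaddress.ip_network(prefix)
    n_octets = (network.prefixlen + 7) // 8
    return bytes([network.prefixlen]) + network.network_address.packed[:n_octets]

def decode_prefixes(data: bytes) -> list:
    prefixes, i = [], 0
    while i < len(data):
        bits = data[i]
        n_octets = (bits + 7) // 8
        raw = data[i + 1:i + 1 + n_octets] + b"\x00" * (4 - n_octets)  # pad to 4 octets
        prefixes.append(f"{ipaddress.IPv4Address(raw)}/{bits}")
        i += 1 + n_octets
    return prefixes

def parse_update_body(body: bytes) -> tuple:
    """Split an UPDATE body into (withdrawn routes, raw path attributes, NLRI)."""
    (withdrawn_len,) = struct.unpack("!H", body[:2])
    withdrawn = decode_prefixes(body[2:2 + withdrawn_len])
    offset = 2 + withdrawn_len
    (attrs_len,) = struct.unpack("!H", body[offset:offset + 2])
    attrs = body[offset + 2:offset + 2 + attrs_len]
    nlri = decode_prefixes(body[offset + 2 + attrs_len:])   # whatever remains is NLRI
    return withdrawn, attrs, nlri
```

For the example used in the text, encode_prefix("172.16.0.0/14") produces three octets: 14, 172 and 16. The NLRI has no length field of its own; its size is whatever remains of the message after the other fields.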
Notification Message Format
Error Code and Subcode:
1 Message Header Error subcodes:
    1 – Connection Not Synchronized.
    2 – Bad Message Length.
    3 – Bad Message Type.
2 OPEN Message Error subcodes:
    1 – Unsupported Version Number.
    2 – Bad Peer AS.
    3 – Bad BGP Identifier.
    4 – Unsupported Optional Parameter.
    5 – Authentication Failure.
    6 – Unacceptable Hold Time.
3 UPDATE Message Error subcodes:
    1 – Malformed Attribute List.
    2 – Unrecognized Well-known Attribute.
    3 – Missing Well-known Attribute.
    4 – Attribute Flags Error.
    5 – Attribute Length Error.
    6 – Invalid ORIGIN Attribute.
    7 – AS Routing Loop.
    8 – Invalid NEXT_HOP Attribute.
    9 – Optional Attribute Error.
    10 – Invalid Network Field.
    11 – Malformed AS_PATH.
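The code tables above translate directly into a small lookup. The sketch below assumes the NOTIFICATION body layout of the BGP-4 specification (one octet of error code, one octet of subcode, then optional data) and covers only the three error codes listed here; the dictionary and function names are assumptions for this example.

```python
ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
}

SUBCODES = {
    1: {1: "Connection Not Synchronized", 2: "Bad Message Length", 3: "Bad Message Type"},
    2: {1: "Unsupported Version Number", 2: "Bad Peer AS", 3: "Bad BGP Identifier",
        4: "Unsupported Optional Parameter", 5: "Authentication Failure",
        6: "Unacceptable Hold Time"},
    3: {1: "Malformed Attribute List", 2: "Unrecognized Well-known Attribute",
        3: "Missing Well-known Attribute", 4: "Attribute Flags Error",
        5: "Attribute Length Error", 6: "Invalid ORIGIN Attribute",
        7: "AS Routing Loop", 8: "Invalid NEXT_HOP Attribute",
        9: "Optional Attribute Error", 10: "Invalid Network Field",
        11: "Malformed AS_PATH"},
}

def describe_notification(body: bytes) -> str:
    """Turn the first two octets of a NOTIFICATION body into readable text."""
    code, subcode = body[0], body[1]
    code_name = ERROR_CODES.get(code, f"Error code {code}")
    subcode_name = SUBCODES.get(code, {}).get(subcode, f"subcode {subcode}")
    return f"{code_name}: {subcode_name}"
```

A body starting with the octets 2 and 2, for example, is rendered as "OPEN Message Error: Bad Peer AS", the typical result of a mismatched AS number in the neighbor configuration.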