Kubernetes network model and CNI plugins

Four types of communication in a Kubernetes cluster

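Container-to-container communication inside a pod:
- Containers in the same pod share the network namespace held by the pause container and talk to each other over the lo interface.
Pod-to-pod communication:
- All pod IPs live in one large flat network, so every pod can reach every other pod directly by IP, as if they were on a single LAN.
- The address range (for example 10.244.0.0/16) is fixed when the cluster is set up; it is normally passed to kube-controller-manager as --cluster-cidr (or to kubeadm as --pod-network-cidr) rather than to the api-server.
- It is implemented mainly as an overlay network or a routed network, by network plugins such as flannel and calico.
Pod-to-service-IP communication:
- The service IP range (for example 10.96.0.0/12) is passed to the api-server as a startup parameter (--service-cluster-ip-range).
- Each service IP corresponds to ipvs or iptables rules on every node, so pod-to-service traffic is handled by the forwarding rules in the node's kernel.
Pod-to-external communication:
- Implemented through services of type NodePort or LoadBalancer.
- In essence it relies on a port opened on the nodes.
- Traffic is forwarded twice, first by the node port and then by the service IP, before it reaches the pod.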
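The distinction between the address ranges above can be made concrete with a small Python sketch. It only classifies a destination address against the two example ranges quoted in the text; the function name and the hard-coded ranges are illustrative assumptions, and a real cluster may use different CIDRs.

```python
import ipaddress

# Example ranges quoted in the text; real clusters may differ (assumption).
POD_CIDR = ipaddress.ip_network("10.244.0.0/16")
SERVICE_CIDR = ipaddress.ip_network("10.96.0.0/12")


def classify_destination(ip: str) -> str:
    """Return 'pod', 'service' or 'external' for a destination address."""
    addr = ipaddress.ip_address(ip)
    if addr in POD_CIDR:
        return "pod"
    if addr in SERVICE_CIDR:
        return "service"
    return "external"
```

For instance, `classify_destination("10.96.0.1")` falls in the service range, while an address outside both ranges is treated as external to the cluster.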

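Pod network implementation models

How a container's virtual network interface is implemented

Virtual bridge
- Implemented with a virtual bridge created by the kernel or by OVS.
- Each container gets a veth pair; one end sits in the container and the other is attached to the virtual bridge, which connects to the node's physical NIC.
Multiplexing
- macvlan gives every virtual interface its own MAC address and forwards at layer 2.
- ipvlan shares the parent interface's MAC and tells virtual interfaces apart by IP, which makes it a better fit for VMs.
Hardware switching
- Hardware assisted, using NICs that support SR-IOV.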


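CNI plugins and common implementations

CNI is a standard developed jointly by CoreOS and Google. It sits between a container orchestrator such as Kubernetes and a concrete network plugin such as flannel, and the two sides communicate through a JSON configuration file. CNI is highly extensible and flexible: extra parameters can be passed through the args field of the configuration and through the CNI_ARGS environment variable.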

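CNI plugins fall into three groups: main plugins, meta plugins and IPAM plugins.

main plugins: implement a specific kind of network such as bridge, macvlan or ipvlan
meta plugins: call other plugins
IPAM plugins: assign IP addresses to the containers in a pod, for example dhcp

Kubernetes defines the network model, but the implementation is delegated to CNI plugins. CNI (Container Network Interface) is only the specification; concrete implementations include: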

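flannel: an overlay network implementation based on Linux TUN/TAP
calico: a BGP-based network
canal: the combination of flannel and calico
kube-router: an all-in-one Kubernetes networking solution; it can replace kube-proxy with an ipvs-based service implementation, provides pod networking, supports network policy and is BGP compatible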
weave net
contiv
...

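CNI workflow

The kubelet on a node detects that a new pod has been scheduled to it.
It calls the local CNI plugin to set up networking for the new pod:
1. create a veth pair and attach it to the underlying network;
2. assign an IP address from the node's pod IP range;
3. set up routes and related information and inject them into the pod.
To find the plugin, the kubelet first looks for a JSON configuration file in the default directory /etc/cni/net.d/:
1. the type field of the configuration file names the CNI plugin binary to run;
2. the CNI plugin in turn calls an IPAM plugin (IP address management, such as dhcp or host-local) to set the interface address.

Example: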

[root@node1 ~]# cat /etc/cni/net.d/10-flannel.conflist 
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}
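The lookup from the type field to a plugin binary can be sketched in Python. This is only an illustration of the idea, not how the kubelet or the container runtime is implemented; the function name is hypothetical, and /opt/cni/bin is assumed as the plugin directory because it is the conventional default.

```python
import json
import posixpath

# The conflist shown above, inlined so the sketch is self-contained.
CONFLIST = """
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {"type": "flannel",
     "delegate": {"hairpinMode": true, "isDefaultGateway": true}},
    {"type": "portmap",
     "capabilities": {"portMappings": true}}
  ]
}
"""


def plugin_binaries(conflist_text, bin_dir="/opt/cni/bin"):
    """Map each plugin's type field to the binary expected in bin_dir.

    bin_dir is an assumption (the conventional default location).
    Plugins are returned in the order in which the chain would run.
    """
    conf = json.loads(conflist_text)
    return [posixpath.join(bin_dir, plugin["type"])
            for plugin in conf["plugins"]]
```

With the configuration above, the chain would run the flannel binary first and the portmap binary second.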

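The flannel plugin

Docker's default single-host network model has two problems when the hosts of a Kubernetes cluster must be interconnected:

Every docker node uses the 172.17.0.0/16 network by default, so putting all pods in one flat network, as Kubernetes requires, leads to IP address conflicts.
Docker defaults to bridge mode, so pods have no route to the pod networks on other nodes.

Each network plugin solves these two problems in its own way. Taking flannel as an example:

First problem: reserve one large network and automatically assign each node in the cluster its own subnet; the assignments are stored in etcd.
Second problem: flannel offers several approaches, each of which is called a network model, also known as a flannel backend.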

Common flannel backends

flannel supports several backends; the first three listed below are the most commonly used:

vxlan
host-gw
udp
alivpc
aws vpc
...

vxlan

The kernel has supported vxlan since 3.7.0. flannel uses the kernel's vxlan module to encapsulate packets, and this is the backend flannel recommends.

host-gw

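Host gateway. Each node gets a route to every other node's container subnet, with that node as the gateway. All nodes must sit in the same layer-2 network, so it does not suit large networks, but forwarding performance is good.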

udp

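Container-to-container packets are encapsulated in UDP packets and forwarded through a tunnel. It is an overlay network with relatively poor performance.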

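flannel configuration parameters

Pod IP allocation is controlled by the flannel process running on each node.

flannel stores the per-node allocation information in etcd under /coreos.com/network/config. The value of config is a JSON dictionary, for example: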

{
	"Network": "10.244.0.0/16",
	"SubnetLen": 24,
	"Backend": {
		"Type": "vxlan",
		"Port": 8472
	}
}
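How the large network is split into per-node subnets can be illustrated with a short Python sketch. It is a simplification: it hands out subnets in order, whereas flannel itself manages leases in etcd and need not allocate sequentially. The function name and the node names are hypothetical.

```python
import ipaddress
import json

# Same values as the example configuration above.
CONFIG = """
{
  "Network": "10.244.0.0/16",
  "SubnetLen": 24,
  "Backend": {"Type": "vxlan", "Port": 8472}
}
"""


def allocate_subnets(config_text, nodes):
    """Give each node one SubnetLen-sized slice of Network, in order.

    Sequential allocation is an assumption made for readability.
    """
    cfg = json.loads(config_text)
    network = ipaddress.ip_network(cfg["Network"])
    subnets = network.subnets(new_prefix=cfg["SubnetLen"])
    return {node: str(subnet) for node, subnet in zip(nodes, subnets)}


def subnet_capacity(config_text):
    """Number of node subnets the configured network can hold."""
    cfg = json.loads(config_text)
    network = ipaddress.ip_network(cfg["Network"])
    return 2 ** (cfg["SubnetLen"] - network.prefixlen)
```

With a /16 network and a subnet length of 24, the network holds 256 node subnets, which bounds the number of nodes the cluster can have under this configuration.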

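The vxlan backend and direct routing

vxlan stands for Virtual Extensible Local Area Network and uses MAC-in-UDP encapsulation. Before a packet between containers enters the nodes' physical network, it is wrapped in a UDP packet of that physical network, and the UDP packet is forwarded normally according to the routes between the physical nodes, so it can cross different network segments. On the destination node the UDP wrapper is removed and the inner packet is handed to the target container. The extra encapsulation and decapsulation costs performance.

flannel also supports a vxlan + direct routing mode. If the physical node hosting the target container is in a different physical network segment, vxlan is used and the traffic crosses segments inside UDP packets. If it is in the same physical segment, direct routing is used, which avoids the encapsulation overhead. Direct routing means that physical nodes on the same physical network add routes to each other's pod subnets, so pod traffic simply follows the nodes' routing tables (this works only within one physical segment).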
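The per-destination choice between the two paths can be sketched in Python. This approximates the decision by checking whether the remote node's address falls in the local node's subnet; the function name and the default prefix length are assumptions for illustration, and flannel's own check may differ in detail.

```python
import ipaddress


def choose_path(local_node_ip, remote_node_ip, node_prefix_len=24):
    """Pick 'direct-routing' for same-segment nodes, 'vxlan' otherwise.

    node_prefix_len is the prefix length of the nodes' physical
    network (hypothetical default of 24).
    """
    local_net = ipaddress.ip_interface(
        f"{local_node_ip}/{node_prefix_len}").network
    if ipaddress.ip_address(remote_node_ip) in local_net:
        return "direct-routing"
    return "vxlan"
```

Two nodes at 192.168.80.10 and 192.168.80.11 would talk through plain routes, while a node at 192.168.81.11 would be reached through the vxlan tunnel.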

1. Modify the flannel deployment manifest to enable vxlan + direct routing (the default is plain vxlan)

In the ConfigMap defined in the manifest, add DirectRouting to the backend settings:

  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "vxlan",
        "DirectRouting": true
      }
    }
---

2. Check the routes to the pod subnets generated on the node

[root@node1 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.80.2    0.0.0.0         UG    102    0        0 eth0
10.244.0.0      10.244.0.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.2.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.4.0      10.244.4.0      255.255.255.0   UG    0      0        0 flannel.1

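The host-gw backend

host-gw is similar to vxlan's direct routing: both implement pod-to-pod communication by adding routes to the pod subnets on the nodes. It only works within a single layer-2 network, though, and unlike vxlan it cannot span physical network segments.

Compared with an overlay such as vxlan, host-gw performs better, but it does not suit large clusters: when the network is large, maintaining the routes on every node becomes difficult.

1. Change the flannel manifest to host-gw mode and re-apply it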

  net-conf.json: |
    {
      "Network": "10.244.0.0/16",
      "Backend": {
        "Type": "host-gw"
      }
    }
---

2. Check the node's routes

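Routes to each pod subnet are still generated on the physical node. In host-gw mode these routes are expected to use the peer node's address as the gateway on the physical interface (eth0 here); if they still point at flannel.1, the flannel pods have probably not picked up the new configuration yet.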

[root@node1 ~]# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.80.2    0.0.0.0         UG    102    0        0 eth0
10.244.0.0      10.244.0.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.1.0      10.244.1.0      255.255.255.0   UG    0      0        0 flannel.1
10.244.2.0      0.0.0.0         255.255.255.0   U     0      0        0 cni0
10.244.4.0      10.244.4.0      255.255.255.0   UG    0      0        0 flannel.1

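Network policy

flannel itself only implements the connectivity part of the network model; it does not implement policies such as network isolation.

Introduction to network policy

Plugins that support network policy include calico, canal and kube-router.

NetworkPolicy is a standard Kubernetes resource. The policies you define are enforced by the network plugin, much like the relationship between Ingress resources and an ingress controller, so a plugin with network policy support is required; flannel does not support it.

Pod traffic is divided into outbound (egress) and inbound (ingress), and a policy either allows or denies it.

For pods selected by a network policy's selector, all traffic that is not explicitly allowed is denied; pods that are not selected can still send and receive traffic freely.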
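The default-allow versus whitelist behaviour described above can be modelled with a small Python sketch. The data layout (plain dicts with pod_selector and allowed_sources) and both function names are hypothetical simplifications; real NetworkPolicy objects also cover ports, namespaces and IP blocks.

```python
def selector_matches(selector, labels):
    """An empty selector matches every pod, as with podSelector: {}."""
    return all(labels.get(key) == value for key, value in selector.items())


def ingress_allowed(pod_labels, source_labels, policies):
    """Decide whether traffic from a source pod may reach a target pod.

    Each policy is a hypothetical dict of the form
      {"pod_selector": {...}, "allowed_sources": [{...}, ...]}
    """
    selecting = [p for p in policies
                 if selector_matches(p["pod_selector"], pod_labels)]
    if not selecting:
        # No policy selects the pod: traffic flows freely.
        return True
    # Selected pods accept only sources whitelisted by at least one policy.
    return any(selector_matches(src, source_labels)
               for p in selecting
               for src in p["allowed_sources"])
```

A policy with an empty pod_selector and no allowed_sources therefore behaves as a namespace-wide deny-all, and rules from several policies selecting the same pod add up.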

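How a network policy takes effect:

The network policy objects you define are interpreted and enforced by the network plugin.
When a new pod object is created, a corresponding endpoint is generated and reflected in etcd and the api-server.
The network plugin, which has registered a watch with the api-server, sees the new pod endpoint.
It pushes the corresponding rules to the agent on each node.
The agent on each node generates rules for that pod (for example iptables rules).
Changes to NetworkPolicy objects are also seen by the network plugin and pushed to every node to update the policy.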


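Deploying canal to provide network policy

calico can provide both pod networking and network policy for Kubernetes by itself. It can also be combined with flannel, with flannel handling connectivity and calico handling network policy; this combination is called canal.

Official installation guide: https://docs.projectcalico.org/getting-started/kubernetes/flannel/flannel

Possible problem: a version mismatch, where the manifest uses API versions newer than the cluster supports.

Configuring network policy

Introduction to NetworkPolicy:

The main fields in spec are the policy direction (policyTypes), the pod selector (podSelector), and the ingress and egress rules.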

[root@client canal]# kubectl explain networkPolicy
KIND:     NetworkPolicy
VERSION:  extensions/v1beta1

DESCRIPTION:
     DEPRECATED 1.9 - This group version of NetworkPolicy is deprecated by
     networking/v1/NetworkPolicy. NetworkPolicy describes what network traffic
     is allowed for a set of Pods
 
 ---
 [root@client canal]# kubectl explain networkPolicy.spec
KIND:     NetworkPolicy
VERSION:  extensions/v1beta1

RESOURCE: spec <Object>

DESCRIPTION:
     Specification of the desired behavior for this NetworkPolicy.

     DEPRECATED 1.9 - This group version of NetworkPolicySpec is deprecated by
     networking/v1/NetworkPolicySpec.

FIELDS:
   egress	<[]Object>
     List of egress rules to be applied to the selected pods. Outgoing traffic
     is allowed if there are no NetworkPolicies selecting the pod (and cluster
     policy otherwise allows the traffic), OR if the traffic matches at least
     one egress rule across all of the NetworkPolicy objects whose podSelector
     matches the pod. If this field is empty then this NetworkPolicy limits all
     outgoing traffic (and serves solely to ensure that the pods it selects are
     isolated by default). This field is beta-level in 1.8

   ingress	<[]Object>
     List of ingress rules to be applied to the selected pods. Traffic is
     allowed to a pod if there are no NetworkPolicies selecting the pod OR if
     the traffic source is the pod's local node, OR if the traffic matches at
     least one ingress rule across all of the NetworkPolicy objects whose
     podSelector matches the pod. If this field is empty then this NetworkPolicy
     does not allow any traffic (and serves solely to ensure that the pods it
     selects are isolated by default).

   podSelector	<Object> -required-
     Selects the pods to which this NetworkPolicy object applies. The array of
     ingress rules is applied to any pods selected by this field. Multiple
     network policies can select the same set of pods. In this case, the ingress
     rules for each are combined additively. This field is NOT optional and
     follows standard label selector semantics. An empty podSelector matches all
     pods in this namespace.

   policyTypes	<[]string>

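Common NetworkPolicy terms

Pod group: the pods a network policy applies to, chosen with the pod selector through matchLabels or matchExpressions.

Egress: outbound traffic sent by the selected pod group to other network endpoints.
The to and ports fields describe which destinations and which destination ports are allowed.

Ingress: inbound traffic from other network endpoints to the selected pod group.
The from field describes the allowed source endpoints, and ports describes which of the selected pods' ports they may reach.

Port: a TCP or UDP port number.

Endpoint: the source or destination of traffic. It can be selected by a CIDR block (ipBlock), a namespace selector (namespaceSelector, common in multi-tenant setups) or a pod selector (podSelector).

Note:

Once an ingress or egress field is defined, the endpoints named or selected in its from or to field form the whitelist; everything else is denied.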

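Controlling inbound traffic

ingress defines inbound traffic, that is, traffic to the pod group selected by the NetworkPolicy's label selector.

from defines which sources may connect and ports defines which of the selected pods' ports they may reach. What is listed is a whitelist and everything else is denied. If no ingress rule is defined at all, all inbound traffic is denied; an empty from inside a rule, by contrast, matches every source.

from can match sources with ipBlock, namespaceSelector and podSelector.

NetworkPolicy is a namespaced resource. You can give a namespace a default deny-all policy and then allow only what is needed.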

[root@client canal]# kubectl explain networkPolicy.spec.ingress.
KIND:     NetworkPolicy
VERSION:  extensions/v1beta1

RESOURCE: ingress <[]Object>

FIELDS:
   from	<[]Object>


   ports	<[]Object>

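1. Example: allowing traffic

This policy opens port 80 of the pods labeled app=myapp to the addresses in 10.244.0.0/16 except 10.244.3.0/24, and to the pods that carry the app=myapp label themselves.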

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
 name: allow-myapp-ingress
 namespace: default
spec:
 podSelector:
  matchLabels:
   app: myapp
 policyTypes: 
 - Ingress
 ingress:
 - from:
   - ipBlock:
      cidr: 10.244.0.0/16
      except:
      - 10.244.3.0/24
   - podSelector:
      matchLabels:
       app: myapp
   ports:
   - protocol: TCP
     port: 80
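The ipBlock part of such a rule, a CIDR with carved-out exceptions, can be checked with a few lines of Python. This is only a sketch of the matching logic with a hypothetical function name; the actual enforcement is done by the network plugin.

```python
import ipaddress


def ip_block_matches(source_ip, cidr, except_cidrs=()):
    """True if source_ip is inside cidr but in none of except_cidrs."""
    addr = ipaddress.ip_address(source_ip)
    if addr not in ipaddress.ip_network(cidr):
        return False
    return not any(addr in ipaddress.ip_network(excluded)
                   for excluded in except_cidrs)
```

With the values from the policy, a source at 10.244.1.7 matches the block, while 10.244.3.7 is excluded by the exception and 192.168.1.1 is outside the CIDR altogether.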

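Controlling outbound traffic

Outbound traffic is usually allowed. In stricter environments you can also define a default policy that denies all outbound traffic and then explicitly allow what is needed.

1. Define a default policy that denies everything

An empty podSelector selects every pod in the namespace. policyTypes is set to Egress, but no egress field is defined. Since the egress rules form the whitelist, defining none means nothing is whitelisted and all outbound traffic is denied.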

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
 name: deny-all-egress
spec:
 podSelector: {}
 policyTypes: ["Egress"]
 

2. Explicitly allow some outbound traffic

The pod selector selects the tomcat pods, and the policy lets them reach port 80 of nginx and port 3306 of mysql.

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
 name: allow-tomcat-egress
spec:
 podSelector:
  matchLabels:
   app: tomcat
 policyTypes: ["Egress"]
 egress:
 - to:
   - podSelector: 
      matchLabels:
       app: nginx
   ports:
   - protocol: TCP
     port: 80
 - to:
   - podSelector:
      matchLabels:
       app: mysql
   ports:
   - protocol: TCP
     port: 3306

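Isolating namespaces

In a multi-tenant environment each user's namespace should be strictly isolated, so all outbound and inbound traffic of a user's namespace should be denied, while traffic to and from the namespaces that host system applications should be allowed in both directions.

Example:

The network policy objects below allow inbound and outbound traffic between user1 and the kube-system namespace, as well as traffic inside the user1 namespace itself,

and deny all inbound and outbound traffic to and from other users' namespaces.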

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
 name: ns-deny-all
 namespace: user1
spec:
 policyTypes: ["Egress","Ingress"]
 podSelector: {}
 # Denies all inbound and outbound traffic of every pod in the user1 namespace, including traffic inside the namespace; every pod is now an "island"
---

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
 name: all-kubesystem-and-allow-user1
 namespace: user1
spec:
 policyTypes: ["Egress","Ingress"]
 podSelector: {}
 ingress:
 - from:
   - namespaceSelector:
      matchExpressions:
      # assumes every namespace carries a name=<namespace> label
      - key: name
        operator: In
        values: ["user1","kube-system"]
 egress:
 - to:
   - namespaceSelector:
      matchExpressions:
      - key: name
        operator: In
        values: ["user1","kube-system"]

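PS:

Other system add-ons are usually deployed in their own namespaces, for example prometheus in the prom namespace and nginx-ingress-controller in nginx-ingress. The namespaces of these management pods should, like kube-system, be allowed inbound and outbound traffic to and from ordinary user namespaces.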
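How a namespaceSelector with matchExpressions decides which namespaces are let through can be sketched in Python. The function names are hypothetical, and the sketch assumes, as the isolation policy does, that namespaces are identified by a name label; the operator semantics follow the standard Kubernetes label selector rules.

```python
def match_expression(labels, key, operator, values=()):
    """Evaluate one matchExpressions entry against a label dict."""
    if operator == "In":
        return labels.get(key) in values
    if operator == "NotIn":
        # A missing key also satisfies NotIn.
        return labels.get(key) not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unknown operator: {operator}")


def allowed_namespaces(namespaces, key, operator, values=()):
    """Return the names of namespaces whose labels satisfy the expression.

    namespaces maps a namespace name to its label dict (hypothetical data).
    """
    return sorted(name for name, labels in namespaces.items()
                  if match_expression(labels, key, operator, values))
```

Extending the values list with prom and nginx-ingress would let the add-on namespaces mentioned above through in the same way as kube-system.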

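Network policy example

Requirements:

The testing namespace contains two groups of pods, myapp and nginx.
myapp can reach every port of nginx, can be reached by nginx on port 80, and can communicate in both directions with every pod in kube-system.
nginx can be reached on port 80 from any source, can reach every pod, can reach port 80 of myapp, can be reached by myapp on every port, and can communicate in both directions with every pod in kube-system.

Defining the policies:

1. First deny all inbound and outbound traffic in testing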

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
 name: deny-all
 namespace: testing
spec:
 podSelector: {}
 policyTypes:
 - Ingress
 - Egress

2. Open port 80 of nginx to all sources

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
 name: nginx-allow-all
 namespace: testing
spec:
 podSelector:
  matchLabels:
   app: nginx
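 ingress:
 # port 80 is open to every source
 - ports:
   - port: 80
 # myapp and kube-system may reach every port
 - from:
   - podSelector:
      matchLabels:
       app: myapp
   - namespaceSelector:
      matchLabels:
       ns: kube-system
 egress:
 # an empty rule allows egress to every destination
 - {}
 policyTypes:
 - Ingress
 - Egress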

3. Open port 80 of myapp, and the traffic between it and kube-system

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
 name: myapp-allow
 namespace: testing
spec:
 podSelector:
  matchLabels:
   app: myapp
 ingress:
 - from:
   - podSelector:
      matchLabels:
       app: nginx
   ports:
   - port: 80
 - from:
   - namespaceSelector:
      matchLabels:
       ns: kube-system
 egress:
 - to:
   - podSelector:
      matchLabels:
       app: nginx
 - to:
   - namespaceSelector:
      matchLabels:
       ns: kube-system
 policyTypes:
 - Ingress
 - Egress

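The calico plugin

calico is a network plugin that provides both pod networking and network policy, and it can be integrated with orchestrators such as Kubernetes, OpenStack and Mesos.

calico is a layer-3 virtual networking solution. It treats every node as a router and the pods on a node as hosts attached to that router. The node routers use the BGP routing protocol to learn the routes of all nodes in the cluster dynamically, so each node (node router) can route traffic on behalf of its pods.

calico therefore does not require the physical nodes to share a layer-2 network; it can span layer 3, which allows larger cluster networks.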
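The "every node is a router" idea can be illustrated with a toy Python model: each node advertises its pod subnet, every other node records it with the advertiser as next hop, and forwarding uses the longest matching prefix. This is a sketch of the concept only, not of BGP or of calico's code; the function names, node addresses and subnet sizes are hypothetical.

```python
import ipaddress


def build_routes(local_node, advertisements):
    """Build a node's route table from what the other nodes advertise.

    advertisements maps node IP -> pod CIDR (hypothetical data).
    The local node's own subnet is reached directly, so it is skipped.
    """
    return {cidr: node_ip
            for node_ip, cidr in advertisements.items()
            if node_ip != local_node}


def next_hop(routes, dest_ip):
    """Return the next-hop node for dest_ip by longest-prefix match."""
    addr = ipaddress.ip_address(dest_ip)
    best = None
    for cidr, hop in routes.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, hop)
    return best[1] if best else None
```

Because the next hop is an ordinary routed address, the advertising node may sit in a different physical segment, which is what lets this model span layer 3.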

calico also supports an IP-in-IP mode, which is built into the kernel. It is an overlay network, but in theory it performs better than a vxlan overlay.

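calico features

calico uses the Linux kernel to implement a vRouter on every compute node for packet forwarding. The node agent felix maintains the routes for the node's pods, and each vRouter advertises them over BGP until the whole network has learned the routes. felix also supports ACLs for security policy.

Direct IP routing
- pod IP information is advertised to the whole network over BGP
- no packet encapsulation and no tunnels
Simple, efficient and easy to scale
- BGP is designed for large networks
- calico therefore also suits large clusters
Good security
- the kernel's iptables rules and ACL policies can provide network isolation between tenants
Simplicity
- no multi-layer encapsulation or tunnels
- what you see is what you get, which makes packet analysis easier for administrators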

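calico architecture

calico components:

felix: the calico agent; runs on every node and manages the node's ACLs and routes
etcd: stores calico's configuration, routes and other state
route reflector: a BGP route reflector, used in large networks
orchestrator plugin: integrates calico into an orchestrator, for example CNI for Kubernetes
BIRD: the BGP client that distributes routing information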
