MongoDB in Docker: numactl --interleave=all explained
I'm trying to create a Dockerfile for an in-memory MongoDB, based on the official repository at https://hub.docker.com/_/mongo/.

In docker-entrypoint.sh I came across:

numa='numactl --interleave=all'
if $numa true &> /dev/null; then
    set -- $numa "$@"
fi

Basically, when numactl is present, it prepends numactl --interleave=all to the original docker command.
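The probe-and-prefix logic above can be sketched as a standalone script. This is a hedged illustration, not the actual entrypoint: mongod stands in for whatever command the entrypoint would exec, and the final echo replaces the real exec so the sketch is safe to run anywhere.

```shell
#!/bin/sh
# Sketch of the entrypoint's NUMA probe. Assumption: mongod is the
# target command; on a machine without numactl the prefix is skipped.
numa='numactl --interleave=all'

# Run a harmless no-op ("true") under numactl. If it succeeds, numactl
# exists and this machine accepts the interleave policy.
if $numa true > /dev/null 2>&1; then
    set -- $numa mongod "$@"    # prefix the real command
else
    set -- mongod "$@"          # no numactl: run the command bare
fi

echo "would exec: $*"
```

Either way the command line ends with mongod and its arguments; the only difference is whether the numactl prefix is present.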

But I don't really understand this NUMA policy thing. Can you explain what NUMA actually means, and what --interleave=all stands for?

Why do we need to use it to create a MongoDB instance?

Best Answer
The numactl man page mentions:

The libnuma library offers a simple programming interface to the NUMA (Non Uniform Memory Access) policy supported by the Linux kernel. On a NUMA architecture some memory areas have different latency or bandwidth than others.

This doesn't apply to every architecture, which is why issue 14 makes sure numactl is only invoked on NUMA machines.
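A quick way to see whether a given machine is NUMA at all is to count the node directories the Linux kernel exposes under sysfs (a sketch; numactl --hardware reports the same layout when installed):

```shell
#!/bin/sh
# Count NUMA nodes via sysfs. More than one node means the kernel sees
# a NUMA topology; zero or one means interleaving would be a no-op.
nodes=$(ls -d /sys/devices/system/node/node[0-9]* 2>/dev/null | wc -l)

if [ "$nodes" -gt 1 ]; then
    echo "NUMA machine: $nodes nodes"
else
    echo "not NUMA: $nodes node(s) visible"
fi
```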

As described in “Set default numa policy to “interleave” system wide”:

It seems that most applications that recommend explicit numactl definition either make a libnuma call or incorporate numactl in a wrapper.

interleave=all mitigates the kind of problem seen in applications like Cassandra (a distributed database for managing large amounts of structured data across many commodity servers):

By default, Linux attempts to be smart about memory allocations such that data is close to the NUMA node on which it runs. For big database type of applications, this is not the best thing to do if the priority is to avoid disk I/O. In particular with Cassandra, we’re heavily multi-threaded anyway and there is no particular reason to believe that one NUMA node is “better” than another.

Consequences of allocating unevenly among NUMA nodes can include excessive page cache eviction when the kernel tries to allocate memory – such as when restarting the JVM.
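The per-node imbalance described in that quote can be observed directly with numastat, which ships with the numactl package and prints counters such as numa_hit and numa_miss for each node. A minimal sketch, guarded for machines where numastat isn't installed:

```shell
#!/bin/sh
# numastat prints per-node allocation counters; a large numa_miss on
# one node while others sit idle is the uneven-allocation symptom
# described above. Guarded so the sketch also runs without numastat.
if command -v numastat > /dev/null 2>&1; then
    numastat
else
    echo "numastat not installed; skipping per-node stats"
fi
```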

For more information, see “The MySQL “swap insanity” problem and the effects of the NUMA architecture”.

Without NUMA interleaving:

In a NUMA-based system, where the memory is divided into multiple nodes, how the system should handle this is not necessarily straightforward.
The default behavior of the system is to allocate memory in the same node as a thread is scheduled to run on, and this works well for small amounts of memory, but when you want to allocate more than half of the system memory it’s no longer physically possible to even do it in a single NUMA node: In a two-node system, only 50% of the memory is in each node.

http://jcole.us/blog/files/numa-imbalanced-allocation.png

With NUMA interleaving:

An easy solution to this is to interleave the allocated memory. It is possible to do this using numactl as described above:

# numactl --interleave all command

As I mentioned in the comments, numa enumerates the hardware to understand the physical layout, and then divides the processors (not cores) into “nodes”.
For modern PC processors, that means one node per physical processor, regardless of the number of cores present.
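On Linux you can see the result of this enumeration with lscpu: its “NUMA node(s)” line and the per-node CPU lists show exactly how the processors were divided into nodes. A guarded sketch for systems where lscpu is unavailable or reports no NUMA lines:

```shell
#!/bin/sh
# Show how the hardware was partitioned into NUMA nodes.
if command -v lscpu > /dev/null 2>&1; then
    lscpu | grep -i 'numa' || echo "no NUMA lines reported by lscpu"
else
    echo "lscpu not available"
fi
```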

That is a bit of an over-simplification, as Hristo Iliev points out:

AMD Opteron CPUs with larger number of cores are actually 2-way NUMA systems on their own with two HT-interconnected dies with own memory controllers in a single physical package.
Also, Intel Haswell-EP CPUs with 10 or more cores come with two cache-coherent ring networks and two memory controllers and can be operated in a cluster-on-die mode, which presents itself as a two-way NUMA system.

It is wiser to say that a NUMA node is some cores that can reach some memory directly without going through a HT, QPI, NUMAlink or some other interconnect.

http://jcole.us/blog/files/numa-balanced-allocation.png
