

Reliably allocating huge pages in Linux
source link: https://mazzo.li/posts/check-huge-page.html

2021-11-22 Reliably allocating huge pages in Linux
Let’s say that you have a program which relies on huge pages for performance. I couldn’t find a resource fully explaining how to allocate huge pages at runtime, making sure that the huge page allocation was successful, so here it is.
High level steps (or skip to the code):

1. Make sure that transparent huge pages are enabled [1]:

       % cat /sys/kernel/mm/transparent_hugepage/enabled
       always [madvise] never

   madvise or always are what we want.

2. Run the program where you want to perform this check as root. [2]

3. Allocate memory using aligned_alloc or posix_memalign, with a 2MiB alignment, the huge page size. Linux also supports 1GiB huge pages on some systems, but here we’ll be working with 2MiB pages [3][4]:

       void* buf = aligned_alloc(1 << 21, size);

4. Instruct the kernel to back the buffer with huge pages using madvise:

       madvise(buf, size, MADV_HUGEPAGE)

   It is important to issue this call before the pages are allocated (next step). Also, this step is not needed if transparent huge pages are set to always.

5. For each 2MiB chunk in your buffer:

   a. Allocate the page backing the chunk; setting its first byte is enough:

          memset(buf, 0, 1);

   b. Get the page frame number (PFN) by reading /proc/self/pagemap.

   c. See if the KPF_THP flag is set for the PFN retrieved above in /proc/kpageflags.
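Step 1 can also be checked from inside the program. The following is a minimal sketch, assuming the sysfs file keeps the single-line format shown above, with the active mode in square brackets. The helper names thp_active_mode and thp_usable are hypothetical and not part of the original article.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Copies the bracketed (active) token of a line such as
// "always [madvise] never" into `out`.
// Returns 0 on success, -1 if there is no bracketed token or it does not fit.
int thp_active_mode(const char* line, char* out, size_t out_len) {
    assert(line != NULL && out != NULL);
    const char* open = strchr(line, '[');
    if (!open) return -1;
    const char* close = strchr(open, ']');
    if (!close) return -1;
    size_t len = (size_t)(close - open - 1);
    if (len + 1 > out_len) return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}

// Returns 1 if the active mode is "always" or "madvise", 0 if it is
// anything else (e.g. "never"), and -1 if the file can't be read or parsed.
int thp_usable(void) {
    FILE* f = fopen("/sys/kernel/mm/transparent_hugepage/enabled", "r");
    if (!f) return -1;
    char line[128];
    char mode[32];
    int ok = fgets(line, sizeof(line), f) != NULL;
    fclose(f);
    if (!ok || thp_active_mode(line, mode, sizeof(mode)) < 0) return -1;
    return strcmp(mode, "always") == 0 || strcmp(mode, "madvise") == 0;
}
```

A program could call thp_usable() at startup and bail out early when it does not return 1, before allocating anything.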
The gory details:
#include <errno.h>
#include <fcntl.h>
#include <linux/kernel-page-flags.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define fail(...) do { fprintf(stderr, __VA_ARGS__); exit(EXIT_FAILURE); } while (0)

// normal page, 4KiB
#define PAGE_SIZE (1 << 12)
// huge page, 2MiB
#define HPAGE_SIZE (1 << 21)

// See <https://www.kernel.org/doc/Documentation/vm/pagemap.txt> for
// format which these bitmasks refer to
#define PAGEMAP_PRESENT(ent) (((ent) & (1ull << 63)) != 0)
#define PAGEMAP_PFN(ent) ((ent) & ((1ull << 55) - 1))

static void check_huge_page(void* ptr);

int main(void) {
    // allocate 10 huge pages
    size_t size = HPAGE_SIZE * 10;
    // char* rather than void*, so that the pointer arithmetic below is standard C
    char* buf = aligned_alloc(HPAGE_SIZE, size);
    if (!buf) {
        fail("could not allocate buffer: %s\n", strerror(errno));
    }
    madvise(buf, size, MADV_HUGEPAGE);
    // allocate and check each page
    for (char* end = buf + size; buf < end; buf += HPAGE_SIZE) {
        // allocate page
        memset(buf, 0, 1);
        // check the page is indeed huge
        check_huge_page(buf);
    }
    printf("all good, exiting\n");
    return 0;
}

// Checks if the page pointed at by `ptr` is huge. Assumes that `ptr` has already
// been allocated.
static void check_huge_page(void* ptr) {
    int pagemap_fd = open("/proc/self/pagemap", O_RDONLY);
    if (pagemap_fd < 0) {
        fail("could not open /proc/self/pagemap: %s\n", strerror(errno));
    }
    int kpageflags_fd = open("/proc/kpageflags", O_RDONLY);
    if (kpageflags_fd < 0) {
        fail("could not open /proc/kpageflags: %s\n", strerror(errno));
    }

    // each entry is 8 bytes long
    uint64_t ent;
    if (pread(pagemap_fd, &ent, sizeof(ent), ((uintptr_t) ptr) / PAGE_SIZE * 8) != sizeof(ent)) {
        fail("could not read from pagemap\n");
    }

    if (!PAGEMAP_PRESENT(ent)) {
        fail("page not present in /proc/self/pagemap, did you allocate it?\n");
    }
    if (!PAGEMAP_PFN(ent)) {
        fail("page frame number not present, run this program as root\n");
    }

    // kpageflags holds one 8-byte flags word per PFN
    uint64_t flags;
    if (pread(kpageflags_fd, &flags, sizeof(flags), PAGEMAP_PFN(ent) << 3) != sizeof(flags)) {
        fail("could not read from kpageflags\n");
    }

    if (!(flags & (1ull << KPF_THP))) {
        fail("could not allocate huge page\n");
    }

    if (close(pagemap_fd) < 0) {
        fail("could not close /proc/self/pagemap: %s\n", strerror(errno));
    }
    if (close(kpageflags_fd) < 0) {
        fail("could not close /proc/kpageflags: %s\n", strerror(errno));
    }
}
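The offset arithmetic in check_huge_page is easy to get wrong, so here is a small worked sketch of it as pure functions, runnable without root. It assumes 4KiB base pages (as on x86-64) and the pagemap entry layout used above (bit 63 for "present", bits 0-54 for the PFN). The function names and the sample entry are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

// Assumption: 4KiB base pages, as in the listing above.
#define SMALL_PAGE_SIZE 4096ull

// Byte offset in /proc/self/pagemap of the 8-byte entry for `vaddr`:
// one entry per small page, indexed by virtual page number.
uint64_t pagemap_offset(uintptr_t vaddr) {
    return (uint64_t)vaddr / SMALL_PAGE_SIZE * 8;
}

// Bit 63 of a pagemap entry: the page is present in RAM.
int pagemap_present(uint64_t ent) {
    return (int)((ent >> 63) & 1);
}

// Bits 0-54 of a pagemap entry: the page frame number, if present.
uint64_t pagemap_pfn(uint64_t ent) {
    return ent & ((1ull << 55) - 1);
}

// Byte offset in /proc/kpageflags of the 8-byte flags word for `pfn`.
uint64_t kpageflags_offset(uint64_t pfn) {
    return pfn * 8;
}

// Worked example with synthetic values (not read from a real system).
void pagemap_worked_example(void) {
    // Virtual address 0x200000 (2MiB) is small page number 512.
    assert(pagemap_offset(0x200000) == 512 * 8);
    // Synthetic entry: present bit set, PFN 0x1234.
    uint64_t ent = (1ull << 63) | 0x1234;
    assert(pagemap_present(ent));
    assert(pagemap_pfn(ent) == 0x1234);
    assert(kpageflags_offset(0x1234) == 0x91a0);
}
```

Note that a PFN of zero from pagemap_pfn is what the listing above treats as "not running as root", since the kernel hides PFNs from unprivileged readers.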
Some useful resources apart from what was already linked:
- page-info, a small library by Travis Downs to get most of the information out of /proc/[pid]/pagemap and /proc/kpageflags.
- This useful StackOverflow answer, also by Travis Downs, describing the method in this article, but without code details.
- transhuge-stress.c, a useful stress test for page tables found in the kernel tree.

[1] CONFIG_TRANSPARENT_HUGEPAGE needs to be enabled in the kernel config for things to work, but this has been the case for all the systems I’ve tried, and I didn’t bother checking what happens to /sys/kernel/mm/transparent_hugepage/enabled if it’s not enabled.

[2] Getting the page frame number (PFN) from /proc/self/pagemap requires the CAP_SYS_ADMIN capability, therefore it would be possible to read it as a normal user by issuing

    % sudo setcap cap_sys_admin+ep <executable>

And then enable dumping explicitly with

    prctl(PR_SET_DUMPABLE, 1, 0, 0)

The “dumpable” flag regulates whether the /proc/[pid] files are owned by the user or by root, as described in the man page for /proc/[pid]:

    The files inside each /proc/[pid] directory are normally owned by the effective user and effective group ID of the process. However, as a security measure, the ownership is made root:root if the process’s “dumpable” attribute is set to a value other than 1.

The dumpable flag is normally set, but if we set the capability as described above it is not, as described in this StackOverflow answer.

However, even after doing all this work, we still won’t be able to read from /proc/kpageflags, which is only readable by root 🙃.

[3] The man page for madvise states (emphasis mine):

    Enable Transparent Huge Pages (THP) for pages in the range specified by addr and length. Currently, Transparent Huge Pages work only with private anonymous pages (see mmap(2)). The kernel will regularly scan the areas marked as huge page candidates to replace them with huge pages. The kernel will also allocate huge pages directly when the region is naturally aligned to the huge page size (see posix_memalign(2)).

[4] Travis Downs pointed out that mmap might be a safer option, since aligned_alloc and friends might preemptively allocate pages. Moreover, Paul Khuong provided a way to easily get a huge page aligned area using mmap.
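
One common way to get a huge-page-aligned region out of mmap is to over-allocate by one huge page and unmap the unaligned head and tail. The sketch below illustrates that general technique; it is not necessarily the approach Paul Khuong suggested, and mmap_huge_aligned is a hypothetical name. It assumes 2MiB huge pages and a `size` that is a multiple of 2MiB.

```c
#define _DEFAULT_SOURCE  // for MAP_ANONYMOUS and MADV_HUGEPAGE under strict -std modes
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define HUGE_PAGE_SIZE ((size_t)1 << 21)  // assumption: 2MiB huge pages

// Returns a 2MiB-aligned private anonymous mapping of `size` bytes,
// or NULL if mmap fails. Release it with munmap(ptr, size).
void* mmap_huge_aligned(size_t size) {
    assert(size > 0 && size % HUGE_PAGE_SIZE == 0);

    // Over-allocate so that an aligned range of `size` bytes must fit inside.
    size_t padded = size + HUGE_PAGE_SIZE;
    void* raw = mmap(NULL, padded, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED) return NULL;

    // Round the start up to the next 2MiB boundary.
    uintptr_t addr = (uintptr_t)raw;
    uintptr_t aligned = (addr + HUGE_PAGE_SIZE - 1) & ~((uintptr_t)HUGE_PAGE_SIZE - 1);

    // Give back the slack before and after the aligned range.
    size_t head = (size_t)(aligned - addr);
    size_t tail = padded - head - size;
    if (head > 0) munmap(raw, head);
    if (tail > 0) munmap((char*)aligned + size, tail);

    // Ask for huge pages before the memory is first touched. The return
    // value is ignored here; the pagemap check is what confirms the result.
    madvise((void*)aligned, size, MADV_HUGEPAGE);
    return (void*)aligned;
}
```

Since mmap does not touch the pages it returns, nothing is faulted in before the madvise call, which is the property that aligned_alloc and friends may not give you.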