1

Docker对JVM一些限制的研究

 1 year ago
source link: https://club.perfma.com/article/2153816
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Docker对JVM一些限制的研究 | PerfMa应用性能技术社区
文章>Docker对JVM一些限制的研究

Docker对JVM一些限制的研究

豆大侠
docker
3天前

首先说一个老生常谈的限制:我们在对Docker中的Java应用使用诸如jmap等命令时常常会报错:

Can't attach to the process: ptrace(PTRACE_ATTACH, ..).

这个主要是因为像jstack、jmap等工具主要是通过两种方式来实现的:

  1. Attach机制(也可以叫做Vitural Machine.attach(),主要是用通过Socket 与目标JVM的Attach Listener线程进行交互,详情可看笨神文章《JVM源码分析之Attach机制实现完全解读》).
  2. Serviceability Agent(其实也是一种Attach,在Linux中要靠系统调用ptrace来实现).

而 Docker 自 1.10 版本开始,默认的 seccomp 配置文件中禁用了 ptrace,所以一些通过SA进行的操作如:jmap -heap就会报错,而Docker官方也给出了解决方法:

  1. 使用–cap-add=SYS_PTRACE明确添加指定功能[docker run --cap-add=SYS_PTRACE ...]
  2. 关闭 seccomp /将ptrace添加到允许的名单中docker run --security-opt seccomp:unconfined ...

除了这个限制,前一段时间我在翻JDK的JDK BUG SYSTEM的时候无意间发现了这么一个Bug:JDK-8140793

getAvailableProcessors may incorrectly report the number of cpus in Docker container

BUG大致描述的现象是,Java在Docker容器中运行时,获取到的CPU的数目可能是不正确的。

Docker大家都知道是依托于Cgroups和Namespace的,而Cgroups 是一种 Linux 内核功能,可以限制和隔离进程的资源使用情况(CPU、内存、磁盘 I/O、网络等),所以我猜可能是JVM在运行时并没有读取到Docker使用Cgroups进行的限制.

继续查看这个BUG,发现状态是RESOLVED,于是继续翻找,在官方的Blog中发现了这么一篇文章

《Java SE support for Docker CPU and memory limits》(文章关联了反应Docker中CPU计算出错的JDK-8140793、Docker中内存限制的增强JDK-8170888、容器检测和资源配置使用率增强的JDK-8146115).

文章中提到在JDK8u121之前的版本中(Java SE 8u121 and earlier),JVM读取的CPU数以及内存等都是不受到Cgroups限制的数据,那么这么做又会出现什么问题呢?据我所知,在我们不显式的指明一些参数的时候,往往会用到JVM读取的数据做一些默认的配置。比如如果不显式的指定 -XX:ParallelGCThreads and -XX:CICompilerCount,那么JVM就会根据读到的CPU数目进行计算来设置数值,如在计算Parallel GC的Threads数目的地方runtime\vm_version.cpp(以下基于openJDK1.8 b120):

if (FLAG_IS_DEFAULT(ParallelGCThreads)) {
    assert(ParallelGCThreads == 0, "Default ParallelGCThreads is not 0");
    // For very large machines, there are diminishing returns
    // for large numbers of worker threads.  Instead of
    // hogging the whole system, use a fraction of the workers for every
    // processor after the first 8.  For example, on a 72 cpu machine
    // and a chosen fraction of 5/8
    // use 8 + (72 - 8) * (5/8) == 48 worker threads.
    unsigned int ncpus = (unsigned int) os::active_processor_count();
    return (ncpus <= switch_pt) ?
           ncpus :
          (switch_pt + ((ncpus - switch_pt) * num) / den);
  } else {
    return ParallelGCThreads;
  }

进入到获取CPU数目的os::active_processor_count()(linux实现os_linux.cpp)

int os::active_processor_count() {
  // Linux doesn't yet have a (official) notion of processor sets,
  // so just return the number of online processors.
  int online_cpus = ::sysconf(_SC_NPROCESSORS_ONLN);
  assert(online_cpus > 0 && online_cpus <= processor_count(), "sanity check");
  return online_cpus;
}

我们发现确实是通过::sysconf(_SC_NPROCESSORS_ONLN)来读取的物理机的CPU,如此看来GC的线程数目的计算就会出现一定的问题,同理JIT compiler threads也会遇到同样的问题。

而除了CPU的读取会出错,内存也是如此,我们在不显式的指定一些参数时如-Xmx(MaxHeapSize)、-Xms(InitialHeapSize)时,JVM会根据它读取到的机器的内存大小做一些默认的设置如:

void Arguments::set_heap_size() {
  if (!FLAG_IS_DEFAULT(DefaultMaxRAMFraction)) {
    // Deprecated flag
    FLAG_SET_CMDLINE(uintx, MaxRAMFraction, DefaultMaxRAMFraction);
  }

  const julong phys_mem =
    FLAG_IS_DEFAULT(MaxRAM) ? MIN2(os::physical_memory(), (julong)MaxRAM)
                            : (julong)MaxRAM;

  // If the maximum heap size has not been set with -Xmx,
  // then set it as fraction of the size of physical memory,
  // respecting the maximum and minimum sizes of the heap.
  if (FLAG_IS_DEFAULT(MaxHeapSize)) {
    julong reasonable_max = phys_mem / MaxRAMFraction;

    if (phys_mem <= MaxHeapSize * MinRAMFraction) {
      // Small physical memory, so use a minimum fraction of it for the heap
      reasonable_max = phys_mem / MinRAMFraction;
    } 
    .
    .
    .
    .
  }
}

其中读取内存的os::physical_memory()读取也是physical memory,而这在Docker中运行可能引发一系列的错误比如被OOMKiller给杀掉(参考).

可见当我们使用一些比较老的JDK8版本时,如果我们没有显式指定一些参数可能会遇到一些稀奇古怪的问题,我在JDK-8146115中发现此对Docker支付的增强已经在JDK10中实现了,使用-XX:+UseContainerSupport可以开启容器支持,而且这一增强已经被backport到了JDK8的一些新版本中(JDK8u131之后的版本).


我下载了新版本的OpenJDK8,翻阅源码发现Oracle果然做了相应的处理.

原先os::active_processor_count()变成了:

// Determine the active processor count from one of
// three different sources:
//
// 1. User option -XX:ActiveProcessorCount
// 2. kernel os calls (sched_getaffinity or sysconf(_SC_NPROCESSORS_ONLN)
// 3. extracted from cgroup cpu subsystem (shares and quotas)
//
// Option 1, if specified, will always override.
// If the cgroup subsystem is active and configured, we
// will return the min of the cgroup and option 2 results.
// This is required since tools, such as numactl, that
// alter cpu affinity do not update cgroup subsystem
// cpuset configuration files.
int os::active_processor_count() {
  // User has overridden the number of active processors
  if (ActiveProcessorCount > 0) {
    if (PrintActiveCpus) {
      tty->print_cr("active_processor_count: "
                    "active processor count set by user : %d",
                    ActiveProcessorCount);
    }
    return ActiveProcessorCount;
  }

  int active_cpus;
  if (OSContainer::is_containerized()) {
    active_cpus = OSContainer::active_processor_count();
    if (PrintActiveCpus) {
      tty->print_cr("active_processor_count: determined by OSContainer: %d",
                     active_cpus);
    }
  } else {
    active_cpus = os::Linux::active_processor_count();
  }

  return active_cpus;
}

可以清晰的看到,如果有-XX:ActiveProcessorCount参数则使用参数,如果没有就会去OSContainer::is_containerized()判断是否容器化:

inline bool OSContainer::is_containerized() {
  assert(_is_initialized, "OSContainer not initialized");
  return _is_containerized;
}

而_is_containerized是由Threads::create_vm调用OSContainer::init()时检查虚拟机是否运行在容器中得来的(具体方法太长了):

/* init
 *
 * Initialize the container support and determine if
 * we are running under cgroup control.
 */
void OSContainer::init() {
  int mountid;
  int parentid;
  int major;
  int minor;
  FILE *mntinfo = NULL;
  FILE *cgroup = NULL;
  char buf[MAXPATHLEN+1];
  char tmproot[MAXPATHLEN+1];
  char tmpmount[MAXPATHLEN+1];
  char tmpbase[MAXPATHLEN+1];
  char *p;
  jlong mem_limit;

  assert(!_is_initialized, "Initializing OSContainer more than once");

  _is_initialized = true;
  _is_containerized = false;

  _unlimited_memory = (LONG_MAX / os::vm_page_size()) * os::vm_page_size();

  if (PrintContainerInfo) {
    tty->print_cr("OSContainer::init: Initializing Container Support");
  }
  if (!UseContainerSupport) {
    if (PrintContainerInfo) {
      tty->print_cr("Container Support not enabled");
    }
    return;
  }

  ...........

  _is_containerized = true;

}

方法就是对一些地方做了检查,如UseContainerSupport参数是否开启、/proc/self/mountinfo、/proc/self/cgroup是否可读等等,如果判断JVM运行在容器中,那么就会调用OSContainer::active_processor_count()获取容器限制的CPU数目:

/* active_processor_count
 *
 * Calculate an appropriate number of active processors for the
 * VM to use based on these three inputs.
 *
 * cpu affinity
 * cgroup cpu quota & cpu period
 * cgroup cpu shares
 *
 * Algorithm:
 *
 * Determine the number of available CPUs from sched_getaffinity
 *
 * If user specified a quota (quota != -1), calculate the number of
 * required CPUs by dividing quota by period.
 *
 * If shares are in effect (shares != -1), calculate the number
 * of CPUs required for the shares by dividing the share value
 * by PER_CPU_SHARES.
 *
 * All results of division are rounded up to the next whole number.
 *
 * If neither shares or quotas have been specified, return the
 * number of active processors in the system.
 *
 * If both shares and quotas have been specified, the results are
 * based on the flag PreferContainerQuotaForCPUCount.  If true,
 * return the quota value.  If false return the smallest value
 * between shares or quotas.
 *
 * If shares and/or quotas have been specified, the resulting number
 * returned will never exceed the number of active processors.
 *
 * return:
 *    number of CPUs
 */
int OSContainer::active_processor_count() {
  int quota_count = 0, share_count = 0;
  int cpu_count, limit_count;
  int result;

  cpu_count = limit_count = os::Linux::active_processor_count();
  int quota  = cpu_quota();
  int period = cpu_period();
  int share  = cpu_shares();
...........
}

通过注释发现,此时的计算是通过cgroup cpu quota & cpu period、cgroup cpu shares得来的,而Docker可以通过–cpu-period、–cpu-quota等来进行设置

同理,对于Memory的处理,如果不标明-Xmx,JVM可以开启*-XX:+UnlockExperimentalVMOptions*、 -XX:+UseCGroupMemoryLimitForHeap这两个参数,来使得JVM使用Linux cgroup的配置确定最大Java堆大小。

Arguments::set_heap_size()方法:

void Arguments::set_heap_size() {
  if (!FLAG_IS_DEFAULT(DefaultMaxRAMFraction)) {
    // Deprecated flag
    FLAG_SET_CMDLINE(uintx, MaxRAMFraction, DefaultMaxRAMFraction);
  }

  julong phys_mem =
    FLAG_IS_DEFAULT(MaxRAM) ? MIN2(os::physical_memory(), (julong)MaxRAM)
                            : (julong)MaxRAM;

  // Experimental support for CGroup memory limits
  if (UseCGroupMemoryLimitForHeap) {
    // This is a rough indicator that a CGroup limit may be in force
    // for this process
    const char* lim_file = "/sys/fs/cgroup/memory/memory.limit_in_bytes";
    FILE *fp = fopen(lim_file, "r");
    if (fp != NULL) {
      julong cgroup_max = 0;
      int ret = fscanf(fp, JULONG_FORMAT, &cgroup_max);
      if (ret == 1 && cgroup_max > 0) {
        // If unlimited, cgroup_max will be a very large, but unspecified
        // value, so use initial phys_mem as a limit
        if (PrintGCDetails && Verbose) {
          // Cannot use gclog_or_tty yet.
          tty->print_cr("Setting phys_mem to the min of cgroup limit ("
                        JULONG_FORMAT "MB) and initial phys_mem ("
                        JULONG_FORMAT "MB)", cgroup_max/M, phys_mem/M);
        }
        phys_mem = MIN2(cgroup_max, phys_mem);
      } else {
        warning("Unable to read/parse cgroup memory limit from %s: %s",
                lim_file, errno != 0 ? strerror(errno) : "unknown error");
      }
      fclose(fp);
    } else {
      warning("Unable to open cgroup memory limit file %s (%s)", lim_file, strerror(errno));
    }
  }
....................
}

JVM会通过使用cgroup文件系统中的memory_limit()值初始化os::physical_memory()中的值,不过我有注意到注释上有Experimental support的字样,估计不太成熟哈哈还只是实验性质的支持。


这么看Java在Docker中运行小坑与限制还不少呢,不知道哪里设置的不好就会出现一些莫名其妙的问题,我们最好还是根据Docker的配置来显式设置JVM的参数以避免大部分问题。如果还是有问题,可以考虑下升级较高版本的JDK8u,如果成本高不想升级请参考方案来外部加载一些库进行拦截修改。


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK