Python has been giving me a hard time lately, so here is the story of how I tracked down a segfault.

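Some background: my Python service (think of it as a container that runs machine learning models) hosts a number of machine learning and deep learning models. One day a colleague uploaded a new model; let's call it model A.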

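When the service tried to load and start that model, it quietly exited. What? Out of memory again? But memory usage looked perfectly normal. Surely a fluke. I restarted the service and, spooky, it walked out again without a word. Couldn't it at least stay for a cup of tea?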

Tracking it down, step by step

Having debugged silent exits of this service before, I went straight to /var/log/messages for clues and, sure enough, found this line:

Dec  4 17:38:36 centos kernel: python[31139]: segfault at a38 ip 00007f2ffb31e52a sp 00007ffe8b09f6e0 error 6 in _C.cpython-36m-x86_64-linux-gnu.so[7f2ffaf68000+8ae7000]

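What is a segfault? A quick Google search: a segmentation fault is raised when a program accesses memory it is not allowed to access. A typical cause is dereferencing a NULL (or otherwise invalid) pointer.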
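The kernel line itself already carries a few clues. As a rough sketch (the function name and the regular expression are my own, not part of any tool), it can be decoded into the faulting address, the x86 page-fault error bits, and the offset of the instruction pointer inside the shared object, which is the address that tools such as gdb or addr2line work with:

```python
import re

# Matches the "segfault at ..." message written by the x86 Linux kernel.
# All numbers in that message, including the error code, are hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing a kernel segfault line, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    return {
        "object": m.group("obj"),
        "fault_address": int(m.group("addr"), 16),
        # Instruction pointer relative to where the object was mapped.
        "offset_in_object": int(m.group("ip"), 16) - int(m.group("base"), 16),
        # x86 page-fault error code bits.
        "page_present": bool(err & 1),   # 0 means the page was not mapped
        "write_access": bool(err & 2),   # 1 means write, 0 means read
        "user_mode": bool(err & 4),      # 1 means the fault came from user space
    }


LINE = (
    "Dec  4 17:38:36 centos kernel: python[31139]: segfault at a38 "
    "ip 00007f2ffb31e52a sp 00007ffe8b09f6e0 error 6 in "
    "_C.cpython-36m-x86_64-linux-gnu.so[7f2ffaf68000+8ae7000]"
)

info = decode_segfault(LINE)
print(info["object"], hex(info["offset_in_object"]), info)
```

For the line above this reports a user-mode write to an unmapped page at address 0xa38, at offset 0x3b652a inside the _C extension module. A fault address that small is consistent with writing through a NULL pointer plus a field offset.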

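OK, so I roughly understood the concept. But how to find out where exactly the fault occurs? The Python Cookbook describes how to print a traceback when a segmentation fault happens. There are two ways:

  1. Enable the fault handler in the program code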

    import faulthandler
    faulthandler.enable()
    
  2. Pass an option to the interpreter at startup

    # python3 -Xfaulthandler <filename>.py
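For a long-running service it can be handy to send the crash traceback to a file rather than stderr. The following is a minimal sketch of the first method with that tweak; the helper name and the log file location are assumptions for illustration. Setting the PYTHONFAULTHANDLER environment variable is a third way to switch the handler on without touching code.

```python
import faulthandler
import os
import tempfile


def enable_crash_traces(log_path):
    """Write fatal-signal tracebacks of all threads to log_path (appending)."""
    # The file must stay open for the life of the process: faulthandler keeps
    # only the file descriptor and writes to it from inside the signal handler.
    log_file = open(log_path, "a")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Hypothetical location; a real service would use its own log directory.
crash_log = enable_crash_traces(os.path.join(tempfile.gettempdir(), "service-crash.log"))
print("faulthandler enabled:", faulthandler.is_enabled())
```

Call this as early as possible in the entry point, before any extension module is imported, so that a crash during import is still captured.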
    

I started the instance the second way and got the following traceback:

[I 181204 17:47:34 DynamicLoadByHooks:56] import module:Multipose and module prefix is Multipose_detection/0_0_3/package
Fatal Python error: Segmentation fault

Current thread 0x00007f23ba8f2740 (most recent call first):
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 922 in create_module
  File "<frozen importlib._bootstrap>", line 571 in module_from_spec
  File "<frozen importlib._bootstrap>", line 658 in _load_unlocked
  File "<frozen importlib._bootstrap>", line 955 in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/opt/anaconda3/lib/python3.6/site-packages/torch/__init__.py", line 78 in <module>
  File "<frozen importlib._bootstrap>", line 219 in _call_with_frames_removed
  File "<frozen importlib._bootstrap_external>", line 678 in exec_module
  .......
  File "<frozen importlib._bootstrap>", line 971 in _find_and_load
  File "/home/admin/daemon/daemonspace/deployerPath/fms-python3-tf18-56.123_9260/base/home/Multipose_detection/0_0_3/package/utils/posenet.py", line 4 in <module>
  ........
  ........
  File "/home/admin/daemon/daemonspace/deployerPath/fms-python3-tf18-56.123_9260/app/src/com/focustech/fmodel/core/ModelRegister.py", line 125 in reload
  File "/home/admin/daemon/daemonspace/deployerPath/fms-python3-tf18-56.123_9260/app/src/com/focustech/fmodel/core/SystemRecovery.py", line 21 in recovery
  File "/home/admin/daemon/daemonspace/deployerPath/fms-python3-tf18-56.123_9260/app/src/com/focustech/fmodel/FmsServer.py", line 55 in start
  File "Main.py", line 44 in <module>

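The traceback points at the import torch line. Was this torch build incompatible with the operating system? I opened a Python interpreter in the shell and ran import torch: no problem at all. What now? I tried installing gdb to debug it, but ran into a pile of environment issues and gave up on that route.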

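Then something clicked: if import torch works fine in a bare interpreter, could the model code be at fault? I reproduced the service's loading logic for model A in a standalone script and ran it. No problem there either!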

Wait, the service hosts several other models besides model A. So I disabled the others, loaded only model A, and restarted. It worked.

In other words, something the other models import probably conflicts with import torch. Here are all the imports used by those models:

from PIL import Image
import cv2
import math
import dlib
from keras.layers import Conv2D, Input, Flatten, Dense, TimeDistributed, Reshape, Lambda, Dropout, BatchNormalization
from keras.models import Model
from keras.layers import MaxPooling2D,concatenate,merge
from keras import optimizers
from keras.layers.recurrent import LSTM
import keras.backend as K
import numpy as np
import tensorflow as tf

The rest looked simple: try each of them in combination with import torch. Persistence paid off, and I finally found the culprit:

python> import dlib
python> import torch

The segfault followed immediately. I was delighted (although the real pit was still ahead).
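Trying the combinations by hand gets tedious, so the search can be scripted. This is a sketch rather than what I actually ran: each pair is imported in a fresh interpreter so that one crash cannot affect the next attempt, and a negative return code (the POSIX convention for "killed by a signal") marks a crash. The function names are made up for illustration.

```python
import subprocess
import sys


def crashes(code, python=sys.executable):
    """Run code in a fresh interpreter; return True if it was killed by a signal."""
    proc = subprocess.run(
        [python, "-X", "faulthandler", "-c", code],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # On POSIX a negative return code -N means "terminated by signal N",
    # for example -11 for SIGSEGV.
    return proc.returncode < 0


def find_conflicts(candidates, target, python=sys.executable):
    """Return the candidates that crash the interpreter when imported before target."""
    return [
        name
        for name in candidates
        if crashes("import {}; import {}".format(name, target), python)
    ]


# Stand-in module names so the sketch runs anywhere. In the scenario above the
# call would look like find_conflicts(["cv2", "dlib", "keras"], "torch").
print(find_conflicts(["math", "json"], "os"))
```

An ordinary ImportError gives a positive exit code and is deliberately not counted, since only signal-level crashes are of interest here.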

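Next, how to stop dlib from interfering with torch? I could not find any official statement on the issue, so I checked which dlib and torch versions the offline training machine uses: it had dlib==19.6.0, while the server had dlib==19.16.0.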
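When comparing versions between two machines, it helps to read them from the package metadata instead of importing the packages, since here the import itself can crash. A small sketch, assuming Python 3.8 or newer for importlib.metadata (on the Python 3.6 environment in this post, pip list gives the same information):

```python
from importlib import metadata  # in the standard library since Python 3.8


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["dlib", "torch"]).items():
    print("{}=={}".format(name, version))
```

Running the same snippet on both machines and diffing the output makes a version mismatch easy to spot.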

Installing dlib==19.6.0:

Installing apps.
Getting distribution for 'dlib==19.6.0'.
error: Setup script exited with error: cmake configuration failed!
An error occurred when trying to install /tmp/tmp25hqhosfget_dist/dlib-19.6.0.tar.gz. Look above this message for any errors that were output by easy_install.
While:
  Installing apps.
  Getting distribution for 'dlib==19.6.0'.

Importing dlib again:

python> import dlib
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.6/site-packages/dlib/__init__.py", line 1, in <module>
    from .dlib import *
ImportError: /opt/anaconda3/lib/python3.6/site-packages/dlib/dlib.so: undefined symbol: _ZN5boost6python6detail11init_moduleER11PyModuleDefPFvvE

Is a dependency missing?

# ldd /opt/anaconda3/lib/python3.6/site-packages/dlib/dlib.so
    linux-vdso.so.1 =>  (0x00007ffd2d384000)
    libboost_python-mt.so.1.53.0 => /lib64/libboost_python-mt.so.1.53.0 (0x00007f7a572c4000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f7a570a8000)
    libnsl.so.1 => /lib64/libnsl.so.1 (0x00007f7a56e8e000)
    libpng16.so.16 => not found
    libz.so.1 => /lib64/libz.so.1 (0x00007f7a56c78000)
    libjpeg.so.9 => not found
    libmkl_rt.so => not found
    libsqlite3.so.0 => /lib64/libsqlite3.so.0 (0x00007f7a569c2000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f7a566b9000)	
    libm.so.6 => /lib64/libm.so.6 (0x00007f7a563b7000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f7a561a1000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f7a55ddd000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007f7a55bda000)
    libpython2.7.so.1.0 => /lib64/libpython2.7.so.1.0 (0x00007f7a5580e000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007f7a55609000)
    librt.so.1 => /lib64/librt.so.1 (0x00007f7a55401000)
    /lib64/ld-linux-x86-64.so.2 (0x000056363296c000)
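Long ldd listings are easy to misread, so here is a sketch of a helper that pulls out the two things worth checking: libraries reported as "not found", and which libpython version the extension links against (worth comparing with the interpreter that will import it). The function name is hypothetical and the sample is a shortened copy of the output above; in practice the text would come from running ldd through subprocess.

```python
import re


def summarize_ldd(ldd_output):
    """Return (missing library names, libpython versions) found in ldd text."""
    missing = []
    pythons = set()
    for line in ldd_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        name = parts[0]
        if line.rstrip().endswith("not found"):
            missing.append(name)
        m = re.match(r"libpython(\d+\.\d+)", name)
        if m:
            pythons.add(m.group(1))
    return missing, sorted(pythons)


SAMPLE = """\
    libboost_python-mt.so.1.53.0 => /lib64/libboost_python-mt.so.1.53.0 (0x00007f7a572c4000)
    libpng16.so.16 => not found
    libjpeg.so.9 => not found
    libmkl_rt.so => not found
    libpython2.7.so.1.0 => /lib64/libpython2.7.so.1.0 (0x00007f7a5580e000)
"""

missing, pythons = summarize_ldd(SAMPLE)
print("missing:", missing)
print("libpython versions:", pythons)
```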

They look missing but are not: Anaconda ships all of them. It is enough to add the library path to the instance's startup script:

export LD_LIBRARY_PATH=/opt/anaconda3/lib:$LD_LIBRARY_PATH

With that set, ldd no longer reports any missing dependency. So what was wrong? I searched Google and found someone saying this:

Many users have this problem. It's because you have multiple versions of python and/or boost installed and are mixing them up. You need to compile both boost.python and dlib for the version of python you will be using. That fixes this problem.

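That fits the error: the undefined symbol demangles to boost::python::detail::init_module(PyModuleDef&, void (*)()), and PyModuleDef only exists in Python 3, so dlib.so expects a Boost.Python built for Python 3. Let's confirm that boost is the culprit by looking at its dependencies: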

# ldd /lib64/libboost_python-mt.so.1.53.0
    linux-vdso.so.1 =>  (0x00007ffe9a9b6000)
    libutil.so.1 => /lib64/libutil.so.1 (0x00007fa8a8665000)
    libpython2.7.so.1.0 => /lib64/libpython2.7.so.1.0 (0x00007fa8a8298000)
    libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa8a807c000)
    libdl.so.2 => /lib64/libdl.so.2 (0x00007fa8a7e78000)
    librt.so.1 => /lib64/librt.so.1 (0x00007fa8a7c6f000)
    libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007fa8a7967000)
    libm.so.6 => /lib64/libm.so.6 (0x00007fa8a7665000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fa8a744e000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fa8a708b000)
    /lib64/ld-linux-x86-64.so.2 (0x000055f79f8da000)

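What? Python 2.7 again? So boost really is involved. Next came the chore of installing a boost built for Python 3.6. The rough steps:

  1. Download the boost 1.65 source package
  2. Unpack it and enter the source directory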
# tar -xzvf boost_1_65_0.tar.gz
# cd boost_1_65_0
  3. Configure
# mkdir /opt/boost
# ./bootstrap.sh --prefix=/opt/boost --with-python-root=/opt/anaconda3 --with-python=/opt/anaconda3/bin/python  --with-python-version=3.6

Note that the Python settings in project-config.jam also have to be edited by hand, so that they read as follows:

# Python configuration
import python ;
if ! [ python.configured ]
{
    using python : 3.6 : /opt/anaconda3 : /opt/anaconda3/include/python3.6m : /opt/anaconda3/lib ;
}
  4. Build and install

Be patient, this takes a while!

# ./b2
# ./b2 install
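  5. Make boost discoverable by the system and by Python. There are many ways to do this; the bluntest one is to copy the lib and include directories from /opt/boost into /usr/local/lib64 and /usr/local/include respectively
# cp -rf /opt/boost/lib/* /usr/local/lib64
# cp -r /opt/boost/include/boost /usr/local/include/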
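The Python-related values passed to bootstrap.sh and written into project-config.jam are easy to get wrong when several interpreters are installed. As a sanity check, they can be read from the interpreter that will actually import dlib. This is only a sketch: the helper names are made up, and it should be run with the target interpreter (here /opt/anaconda3/bin/python).

```python
import sys
import sysconfig


def boost_python_settings():
    """Collect the interpreter facts that Boost.Build's Python setup needs."""
    paths = sysconfig.get_paths()
    return {
        "version": "{}.{}".format(*sys.version_info[:2]),
        "executable": sys.executable,
        "prefix": sys.prefix,
        "include": paths["include"],
        # LIBDIR can be unset on some platforms; fall back to <prefix>/lib.
        "libdir": sysconfig.get_config_var("LIBDIR") or sys.prefix + "/lib",
    }


def using_python_line(settings):
    """Format a 'using python' rule in the same shape as the one in project-config.jam."""
    return "using python : {version} : {prefix} : {include} : {libdir} ;".format(**settings)


print(using_python_line(boost_python_settings()))
```

If the printed line disagrees with what is in project-config.jam, boost is about to be built against a different Python than the one the service runs.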

Then I reinstalled dlib once more and, wow, after two days of struggle, the problem was finally solved. Time to go home for a drink!