nn-meter: Building a CNN Inference Latency Predictor
1 The nn-meter Build Pipeline
2 Building a TFLite Predictor
2.1 Environment Setup
Following the prompts in the project README, install nn-meter:
```bash
git clone https://github.com/microsoft/nn-Meter
cd nn-Meter
conda create -n nnmeter_tflite python=3.8
conda activate nnmeter_tflite
# As of nn-meter commit 8006ed6eaa62816c70737c9ff26a7445589bd36e, TensorFlow up to 2.11 is supported
pip install -r docs/requirements/requirements_builder.txt
# install nn-Meter
pip install .
```
Push the TFLite benchmark tool to the phone.
Download the benchmark package from nn-meter; I chose the tflite_benchmark_tools_v2.7.zip file.

```bash
# Create a few temporary folders for nn-Meter to store files
adb shell "mkdir -p /mnt/sdcard/tflite_model"
adb shell "mkdir -p /mnt/sdcard/tflite_kernel"
# Push the benchmark binary to the remote phone
adb push benchmark_model_cpu_gpu_v2.7 /data/local/tmp
# Make the benchmark binary executable
adb shell chmod +x /data/local/tmp/benchmark_model_cpu_gpu_v2.7
```
Create a workspace and prepare the backend.
```bash
nn-meter create --tflite-workspace /root/workspace/nn-Meter/workspace/RedmiK30Pro-sd865-tflite2.7cpu
```
After creation, several configs/*.yaml files appear. Only backend_config.yaml really needs editing; the other two can be left largely as they are.
- backend_config.yaml: sets the directories on the remote phone, the location of the benchmark binary, and the address of the remote phone (serial number or IP). These values correspond to step 2 of section 2.1.

```yaml
REMOTE_MODEL_DIR: /mnt/sdcard/tflite_bench
BENCHMARK_MODEL_PATH: /data/local/tmp/benchmark_model_cpu_gpu_v2.7
DEVICE_SERIAL: '3a9c4f5'
KERNEL_PATH: /mnt/sdcard/tflite_kernel
```

Make sure REMOTE_MODEL_DIR points at a directory that actually exists on the device, matching the folders created in step 2.
- predictorbuild_config.yaml: parameters for building the kernel predictors.
- ruletest_config.yaml: parameters for the op fusion rule tests.
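With the configs in place, the connection to the backend can be verified before running anything heavy. A minimal sketch following the `nn-meter connect` subcommand described in nn-Meter's backend preparation doc (treat the exact flags as an assumption):

```bash
# test connectivity between nn-Meter and the tflite_cpu backend;
# on success the tool prints a greeting from the backend
nn-meter connect --backend tflite_cpu --workspace /root/workspace/nn-Meter/workspace/RedmiK30Pro-sd865-tflite2.7cpu
```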
2.2 Testing Fusion Rules
With the environment and parameters configured, we can run a Python script that automates the op fusion tests and predictor building. nn-Meter ships both end-to-end demo code and step-by-step code.

```python
# Reference: https://github.com/microsoft/nn-Meter/blob/main/docs/builder/test_fusion_rules.md#end-to-end-demo
```
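The linked end-to-end demo looks roughly like the sketch below (API names follow the nn-Meter builder docs at the commit noted above and may differ across versions, so treat this as an outline rather than the verbatim demo):

```python
from nn_meter.builder import builder_config, profile_models
from nn_meter.builder.backends import connect_backend
from nn_meter.builder.backend_meta.fuse_rule_tester import (
    generate_testcases,
    detect_fusion_rule,
)

# point the builder at our workspace so it picks up the yaml configs
builder_config.init("/root/workspace/nn-Meter/workspace/RedmiK30Pro-sd865-tflite2.7cpu")

# generate the fusion-rule test cases (single ops and their fused combinations)
origin_testcases = generate_testcases()

# connect to the tflite_cpu backend configured in backend_config.yaml
backend = connect_backend(backend_name="tflite_cpu")

# profile all test cases on the device
profiled_results = profile_models(backend, origin_testcases, mode="ruletest")

# decide which op pairs the backend fuses, based on the measured latencies
detected_results = detect_fusion_rule(profiled_results)
```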
After it finishes, the test results appear in our {workspace}/fusion_rule_test/ folder.
2.3 Building Kernel Predictors

```python
# Reference: https://github.com/microsoft/nn-Meter/blob/main/docs/builder/build_kernel_latency_predictor.md#end-to-end-demo
```
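A rough sketch of that end-to-end demo, under the same caveats (the sampling parameters shown are the defaults from the doc):

```python
from nn_meter.builder import builder_config, build_predictor_for_kernel

# point the builder at our workspace
builder_config.init("/root/workspace/nn-Meter/workspace/RedmiK30Pro-sd865-tflite2.7cpu")

# adaptively sample kernel configurations, profile them on the backend,
# and fit a latency predictor for one kernel type; repeat per kernel type
predictor, data = build_predictor_for_kernel(
    kernel_type="conv-bn-relu",
    backend="tflite_cpu",
    init_sample_num=1000,       # initial random samples
    finegrained_sample_num=10,  # extra samples per adaptive round
    iteration=5,                # number of adaptive sampling rounds
    error_threshold=0.1,        # stop once ~10% prediction error is reached
)
```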
2.4 Building the Model Predictor
Likewise, following the doc, put the op fusion rules from 2.2 and the kernel predictors from 2.3 into one folder and add a yaml configuration file; that is enough to register a model latency predictor.
- Copy and rename the files; the resulting directory tree is shown below.
```bash
# 1. Copy the finegrained2.pkl files to the target directory, then rename them
cp workspace/RedmiK30Pro-sd865-tflite2.7cpu/predictor_build/results/predictors/*finegrained2.pkl /root/workspace/nn-Meter/workspace/predictor/redmik30p_sd865_tflite2.7cpu

#!/bin/bash
# run this from inside the target directory; it iterates over all files there
for file in *
do
    # check whether the file name ends with "_finegrained2.pkl"
    if [[ $file == *_finegrained2.pkl ]]
    then
        # replace "_finegrained2.pkl" in the file name with ".pkl"
        new_name=${file/_finegrained2.pkl/.pkl}
        # rename the file
        echo "$new_name"
        mv "$file" "$new_name"
    fi
done

# 2. Fusion rules
cp workspace/RedmiK30Pro-sd865-tflite2.7cpu/fusion_rule_test/results/detected_fusion_rule.json /root/workspace/nn-Meter/workspace/predictor/redmik30p_sd865_tflite2.7cpu/fusion_rules.json
```
```
redmik30p_sd865_tflite2.7cpu
|-- add.pkl
|-- addrelu.pkl
|-- avgpool.pkl
|-- bn.pkl
|-- bnrelu.pkl
|-- channelshuffle.pkl
|-- concat.pkl
|-- conv-bn-relu.pkl
|-- dwconv-bn-relu.pkl
|-- fc.pkl
|-- fusion_rules.json
|-- global-avgpool.pkl
|-- hswish.pkl
|-- maxpool.pkl
|-- relu.pkl
|-- se.pkl
|-- split.pkl
```
- Write a yaml file that points to these file locations.
```yaml
name: redmik30p_sd865_tflite2.7cpu
version: 1.0
category: cpu
package_location: /root/workspace/nn-Meter/workspace/predictor/redmik30p_sd865_tflite2.7cpu
kernel_predictors:
    - conv-bn-relu
    - dwconv-bn-relu
    - fc
    - global-avgpool
    - hswish
    - relu
    - se
    - split
    - add
    - addrelu
    - maxpool
    - avgpool
    - bn
    - bnrelu
    - channelshuffle
    - concat
```
- Register the predictor. A successful registration typically prints the output shown below.
```bash
# register the predictor
nn-meter register --predictor /root/workspace/nn-Meter/workspace/predictor/redmik30p_sd865_tflite2.7cpu.yaml
# list all registered predictors
nn-meter --list-predictors
```
```
(nn-Meter) Successfully register predictor: redmik30p_sd865_tflite2.7cpu
(nn-Meter) Supported latency predictors:
(nn-Meter) [Predictor] cortexA76cpu_tflite21: version=1.0
(nn-Meter) [Predictor] adreno640gpu_tflite21: version=1.0
(nn-Meter) [Predictor] adreno630gpu_tflite21: version=1.0
(nn-Meter) [Predictor] myriadvpu_openvino2019r2: version=1.0
(nn-Meter) [Predictor] redmik30p_sd865_tflite2.7cpu: version=1.0
```
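Once registered, the predictor can also be consumed from Python via the API shown in the nn-Meter README. A minimal sketch (the model path here is purely illustrative):

```python
import nn_meter

# load the newly registered predictor by name and version
predictor = nn_meter.load_latency_predictor("redmik30p_sd865_tflite2.7cpu", 1.0)

# predict latency (in ms) for a model file; model_type can be
# "onnx", "pb", "torch", etc., depending on the input format
latency = predictor.predict("/path/to/model.onnx", model_type="onnx")
print(f"predicted latency: {latency:.2f} ms")
```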
3 Testing
3.1 Difference Between Predicted and Measured Latency
Export a ResNet50 model with the TensorFlow 2 API:
```python
import tensorflow as tf
from tensorflow.keras.applications.resnet50 import ResNet50
from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

# load the pretrained model
model = ResNet50(weights='imagenet')
full_model = tf.function(lambda x: model(x))
shape = [1, 224, 224, 3]  # model.inputs[0].shape
full_model = full_model.get_concrete_function(
    tf.TensorSpec(shape, model.inputs[0].dtype))

# get the frozen ConcreteFunction
frozen_func = convert_variables_to_constants_v2(full_model)
frozen_func.graph.as_graph_def()
layers = [op.name for op in frozen_func.graph.get_operations()]
print("-" * 50)
print("Frozen model layers: ")
for layer in layers:
    print(layer)
print("-" * 50)
print("Frozen model inputs: ")
print(frozen_func.inputs)
print("Frozen model outputs: ")
print(frozen_func.outputs)

# save the frozen graph from the frozen ConcreteFunction to disk
tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                  logdir="./frozen_models",
                  name="frozen_graph.pb",
                  as_text=False)

# convert the model to TensorFlow Lite format and save it as a .tflite file
converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()
with open('resnet50.tflite', 'wb') as f:
    f.write(tflite_model)
```
Predict with nn-Meter.
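Per the nn-Meter README, the CLI form for a frozen TensorFlow graph is `nn-meter predict --predictor <name> --predictor-version <version> --tensorflow <pb-file>`. Applied to the frozen graph exported above, the invocation would look like this sketch:

```bash
nn-meter predict --predictor redmik30p_sd865_tflite2.7cpu --predictor-version 1.0 \
    --tensorflow ./frozen_models/frozen_graph.pb
```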
Run the benchmark on the device:

```bash
/data/local/tmp/benchmark_model_cpu_gpu_v2.7 --num_threads=4 \
    --graph=/mnt/sdcard/tflite_models/resnet50.tflite \
    --warmup_runs=30 \
    --num_runs=50
```

```
STARTING!
Log parameter values verbosely: [0]
Min num runs: [50]
Num threads: [4]
Min warmup runs: [30]
Graph: [/mnt/sdcard/tflite_models/resnet50.tflite]
#threads used for CPU inference: [4]
Loaded model /mnt/sdcard/tflite_models/resnet50.tflite
INFO: Initialized TensorFlow Lite runtime.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
INFO: Replacing 75 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions.
The input model file size (MB): 102.161
Initialized session in 98.471ms.
Running benchmark for at least 30 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=30 first=104448 curr=87126 min=86737 max=104448 avg=88622.5 std=3079
Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=87163 curr=89038 min=86939 max=93704 avg=88199.2 std=1353
Inference timings in us: Init: 98471, First inference: 104448, Warmup (avg): 88622.5, Inference (avg): 88199.2
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Memory footprint delta from the start of the tool (MB): init=134.562 overall=208.699
```

The measured average inference latency is thus about 88.2 ms.
3.2 Predicting Models with Untrained Kernels
Here I picked an SSD model; its operator types include the following:

```
Add
```

Run the prediction:

```bash
nn-meter predict --predictor redmik30p_sd865_tflite2.7cpu --predictor-version 1.0 --onnx /root/workspace/nn-Meter/workspace/models/ssd-12.onnx
```

```
(nn-Meter) Start latency prediction ...
```
The reported "Empty shape information" error is raised in the OnnxConverter.fetch_attrs function in nn_meter/ir_converter/onnx_converter/converter.py: the function returns an empty attr variable, which ultimately triggers the error.
When computing latency, nn-Meter needs each op's input and output shapes, but operators like Shape are not conventional compute ops, so the conversion errors out. The operator types that fail for this model are summarized here:

```
Add
```

The issue seems somewhat involved: some of the failing operators, such as Add and Concat, do have trained kernel predictors, yet they still raise errors.
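To see which tensors are the culprits, one can run standard ONNX shape inference and list the ops whose outputs end up without a shape. This is not nn-Meter's internal check, just a quick diagnostic sketch using the public onnx API on the model above:

```python
import onnx
from onnx import shape_inference

# load the SSD model used above and run ONNX shape inference on it
model = onnx.load("/root/workspace/nn-Meter/workspace/models/ssd-12.onnx")
inferred = shape_inference.infer_shapes(model)

# names of tensors that carry shape information after inference
shaped = {vi.name for vi in inferred.graph.value_info
          if vi.type.tensor_type.HasField("shape")}
shaped |= {vi.name for vi in list(inferred.graph.input) + list(inferred.graph.output)}
shaped |= {init.name for init in inferred.graph.initializer}

# report ops that produce an output without an inferred shape
for node in inferred.graph.node:
    if any(out and out not in shaped for out in node.output):
        print(node.op_type, node.name)
```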
3.3 Issues with nn-Meter
- For TensorFlow models, nn-Meter may run into problems during shape inference.
- nn-Meter currently supports only a set of operators commonly used in CNNs.
- The only model data types nn-Meter supports are float and int32.
4. References
5. Reference Code
- Print the ops of an ONNX model:
```python
import onnx

model_file = "/root/workspace/nn-Meter/workspace/models/mobilenetv3small_0.onnx"  # path to the ONNX model file
model = onnx.load(model_file)

# collect the distinct operator types used in the graph
op_types = set()
for node in model.graph.node:
    op_types.add(node.op_type)

for op_type in sorted(op_types):
    print(op_type)
```