nn-Meter: Prediction Pipeline Analysis

nn-Meter Prediction Pipeline

  1. Enter the prediction command
    nn-meter predict --predictor RedmiK30Pro_cpu_tflite27 --predictor-version 1.0 --onnx /root/workspace/nn-Meter/workspace/models/mobilenetv3small_0.onnx
    This command is received by the function nn_meter/utils/nn_meter_cli/predictor.py#apply_latency_predictor_cli, which parses the arguments, mainly --predictor, --predictor-version, and --onnx/--tensorflow.
    def apply_latency_predictor_cli(args):
        # specify model type
        if args.tensorflow:
            input_model, model_type, model_suffix = args.tensorflow, "pb", ".pb"
        elif args.onnx:
            input_model, model_type, model_suffix = args.onnx, "onnx", ".onnx"
        elif args.nn_meter_ir:
            input_model, model_type, model_suffix = args.nn_meter_ir, "nnmeter-ir", ".json"
        elif args.torchvision: # torch model name from torchvision model zoo
            input_model_list, model_type = args.torchvision, "torch"
        ...

        # load predictor
        predictor = load_latency_predictor(args.predictor, args.predictor_version)

        ...
        # predict latency
        result = {}
        for model in input_model_list:
            latency = predictor.predict(model, model_type) # in unit of ms
            result[os.path.basename(model)] = latency
            logging.result(f'[RESULT] predict latency for {os.path.basename(model)}: {latency} ms')

        return result

  2. Loading the predictor: load_latency_predictor
    After step 1 parses the --predictor and --predictor-version arguments, the corresponding predictor files are loaded. nn_meter/predictor/nn_meter_predictor.py#load_latency_predictor locates the predictor and fusion rules via the paths cached under the user directory; these files are either the official defaults or ones the user has customized. Once found, it returns an nnMeterPredictor object. (The cached predictor files themselves can be inspected; see the sketch after the code below.)
    def load_latency_predictor(predictor_name: str, predictor_version: float = None):
        user_data_folder = get_user_data_folder()
        pred_info = load_predictor_config(predictor_name, predictor_version)
        if "download" in pred_info:
            kernel_predictors, fusionrule = loading_to_local(pred_info, os.path.join(user_data_folder, 'predictor'))
        else:
            kernel_predictors, fusionrule = loading_customized_predictor(pred_info)

        return nnMeterPredictor(kernel_predictors, fusionrule)
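
    The kernel predictor files themselves are .pkl files. A minimal sketch for inspecting one, assuming (as the suffix and the nn-Meter paper suggest) that they are pickled scikit-learn regressors; the filename is illustrative:

    import pickle

    # one cached kernel predictor under the user data folder (illustrative path)
    with open('conv-bn-relu.pkl', 'rb') as f:
        kernel_predictor = pickle.load(f)
    print(type(kernel_predictor))  # expect a scikit-learn regressor, e.g. RandomForestRegressor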

  3. Predicting model latency with predictor.predict
    With the predictor obtained in step 2, we call nn_meter/predictor/nn_meter_predictor.py#nnMeterPredictor.predict to predict the model.
    1. self.kd.load_graph(graph) first converts the model into a graph and parses it into kernels; this is implemented in nn_meter/kernel_detector/kernel_detector.py#KernelDetector.load_graph. Fusion rules are taken into account here: if a combination of ops matches a fusion rule, those ops are grouped together and predicted as a single kernel.

    2. nn_predict(self.kernel_predictors, self.kd.get_kernels()) feeds the kernels detected in step 3.1 into the kernel predictors one by one (nn_meter/predictor/prediction/predict_by_kernel.py#nn_predict). It mainly extracts the kernels' features, i.e. the parameters of ops such as conv2d (input/output dimensions, kernel size, and so on), then for each kernel selects a predictor by op/kernel name, loads the features, and predicts the latency.

      def predict(
          self, model, model_type, input_shape=(1, 3, 224, 224), apply_nni=False
      ):
          logging.info("Start latency prediction ...")
          if isinstance(model, str):
              graph = model_file_to_graph(model, model_type, input_shape, apply_nni=apply_nni)
          else:
              graph = model_to_graph(model, model_type, input_shape=input_shape, apply_nni=apply_nni)

          # logging.info(graph)
          self.kd.load_graph(graph)

          py = nn_predict(self.kernel_predictors, self.kd.get_kernels()) # in unit of ms
          logging.info(f"Predict latency: {py} ms")
          return py
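
    Putting steps 1-3 together, the same flow can be driven from the Python API instead of the CLI; a minimal sketch using the predictor name and model from step 1, per the nn-Meter README:

      from nn_meter import load_latency_predictor

      predictor = load_latency_predictor("RedmiK30Pro_cpu_tflite27", 1.0)
      # model_type "onnx" corresponds to the --onnx CLI flag above
      latency = predictor.predict("/root/workspace/nn-Meter/workspace/models/mobilenetv3small_0.onnx", "onnx")  # in ms
      print(latency)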

nn-Meter: Building a CNN Inference Latency Predictor

1 nn-Meter Build Pipeline

2 Building a TFLite Predictor

2.1 Environment Setup

  1. Follow the README to install nn-Meter:

    git clone https://github.com/microsoft/nn-Meter
    cd nn-Meter
    conda create -n nnmeter_tflite python=3.8
    # as of nn-meter#8006ed6eaa62816c70737c9ff26a7445589bd36e, TensorFlow up to 2.11 is supported
    pip install -r docs/requirements/requirements_builder.txt
    # install nn-Meter
    pip install .
  2. Push the TFLite benchmark tool to the phone
    Download the benchmark files from nn-Meter; I chose tflite_benchmark_tools_v2.7.zip.

    # create a few temporary folders on the device for nn-Meter
    adb shell "mkdir -p /mnt/sdcard/tflite_model"
    adb shell "mkdir -p /mnt/sdcard/tflite_kernel"
    # push the benchmark binary to the remote phone
    adb push benchmark_model_cpu_gpu_v2.7 /data/local/tmp
    # make the benchmark executable
    adb shell chmod +x /data/local/tmp/benchmark_model_cpu_gpu_v2.7
  3. Create a workspace and prepare the backend

    nn-meter create --tflite-workspace /root/workspace/nn-Meter/workspace/RedmiK30Pro-sd865-tflite2.7cpu

    After creation, configs/*.yaml files appear. Only backend_config.yaml really needs editing; the other two can be left mostly unchanged.

    • backend_config.yaml: set the directories on the remote phone, the benchmark binary path, and the device address (serial number or IP); these values tie in with step 2 of section 2.1.
      REMOTE_MODEL_DIR: /mnt/sdcard/tflite_bench
      BENCHMARK_MODEL_PATH: /data/local/tmp/benchmark_model_cpu_gpu_v2.7
      DEVICE_SERIAL: '3a9c4f5'
      KERNEL_PATH: /mnt/sdcard/tflite_kernel
    • predictorbuild_config.yaml: parameters for building the predictors.
    • ruletest_config.yaml: parameters for the OP fusion rule tests.

2.2 Testing Fusion Rules

With the environment and configuration in place, we can run a .py script to execute the OP fusion tests and build the predictors automatically. nn-Meter provides both end-to-end demos and step-by-step test code.

# Reference doc: https://github.com/microsoft/nn-Meter/blob/main/docs/builder/test_fusion_rules.md#end-to-end-demo
workspace = "/root/workspace/nn-Meter/workspace/RedmiK30Pro-sd865-tflite2.7cpu"

from nn_meter.builder import profile_models, builder_config
builder_config.init(workspace) # initialize builder config with workspace
from nn_meter.builder.backends import connect_backend
from nn_meter.builder.backend_meta.fusion_rule_tester import generate_testcases, detect_fusion_rule

# generate testcases
origin_testcases = generate_testcases()

# connect to backend
backend = connect_backend(backend_name='tflite_cpu')

# run testcases and collect profiling results
profiled_results = profile_models(backend, origin_testcases, mode='ruletest')

# determine fusion rules from profiling results
detected_results = detect_fusion_rule(profiled_results)

After execution, the test results appear under {workspace}/fusion_rule_test/.
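
To sanity-check the outcome, the detected rules can also be inspected programmatically; a sketch, assuming the result format described in test_fusion_rules.md (one entry per test case with an 'obey' field):

import json

with open('/root/workspace/nn-Meter/workspace/RedmiK30Pro-sd865-tflite2.7cpu/fusion_rule_test/results/detected_fusion_rule.json') as f:
    rules = json.load(f)
for name, info in rules.items():
    print(name, info.get('obey'))  # e.g. whether a given op pair fused on this backend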

2.3 Building the Kernel Predictors

# Reference doc: https://github.com/microsoft/nn-Meter/blob/main/docs/builder/build_kernel_latency_predictor.md#end-to-end-demo
workspace = "/root/workspace/nn-Meter/workspace/RedmiK30Pro-sd865-tflite2.7cpu"

from nn_meter.builder import builder_config
builder_config.init(workspace)

# build latency predictor for kernel
from nn_meter.builder import build_latency_predictor
build_latency_predictor(backend="tflite_cpu")
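
build_latency_predictor builds predictors for all kernel types in one go. The same doc also describes a per-kernel entry point; a sketch, assuming the documented build_predictor_for_kernel API (the parameter values shown are the defaults given there):

from nn_meter.builder import build_predictor_for_kernel

# adaptively samples kernel configs, profiles them on the backend, and fits the predictor
build_predictor_for_kernel(
    "conv-bn-relu", "tflite_cpu",
    init_sample_num=1000,        # initial prior-based samples
    finegrained_sample_num=10,   # fine-grained samples around error-prone configs
    iteration=5,                 # adaptive sampling rounds
    error_threshold=0.1          # data points above this relative error get resampled
)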

2.4 Building the Model Predictor

Likewise, following the docs, put the OP fusion rules from 2.2 and the kernel predictors from 2.3 into one folder and add a yaml config file; this lets you register a model latency predictor.

  1. Copy and rename the files
    # 1. Copy the *finegrained2.pkl files to the target directory, then rename them
    cp workspace/RedmiK30Pro-sd865-tflite2.7cpu/predictor_build/results/predictors/*finegrained2.pkl /root/workspace/nn-Meter/workspace/predictor/redmik30p_sd865_tflite2.7cpu

    #!/bin/bash
    # iterate over all files in the current directory
    for file in *
    do
        # check whether the filename ends with "_finegrained2.pkl"
        if [[ $file == *_finegrained2.pkl ]]
        then
            # replace "_finegrained2.pkl" with ".pkl" in the filename
            new_name=${file/_finegrained2.pkl/.pkl}
            # rename the file
            echo "$new_name"
            mv "$file" "$new_name"
        fi
    done

    # 2. Copy the fusion rules
    cp workspace/RedmiK30Pro-sd865-tflite2.7cpu/fusion_rule_test/results/detected_fusion_rule.json /root/workspace/nn-Meter/workspace/predictor/redmik30p_sd865_tflite2.7cpu/fusion_rules.json
    The resulting directory tree:
    redmik30p_sd865_tflite2.7cpu
    |-- add.pkl
    |-- addrelu.pkl
    |-- avgpool.pkl
    |-- bn.pkl
    |-- bnrelu.pkl
    |-- channelshuffle.pkl
    |-- concat.pkl
    |-- conv-bn-relu.pkl
    |-- dwconv-bn-relu.pkl
    |-- fc.pkl
    |-- fusion_rules.json
    |-- global-avgpool.pkl
    |-- hswish.pkl
    |-- maxpool.pkl
    |-- relu.pkl
    |-- se.pkl
    |-- split.pkl
  2. Write a yaml file indexing the file locations
    name: redmik30p_sd865_tflite2.7cpu
    version: 1.0
    category: cpu
    package_location: /root/workspace/nn-Meter/workspace/predictor/redmik30p_sd865_tflite2.7cpu
    kernel_predictors:
    - conv-bn-relu
    - dwconv-bn-relu
    - fc
    - global-avgpool
    - hswish
    - relu
    - se
    - split
    - add
    - addrelu
    - maxpool
    - avgpool
    - bn
    - bnrelu
    - channelshuffle
    - concat
  3. Register the predictor
    # register
    nn-meter register --predictor /root/workspace/nn-Meter/workspace/predictor/redmik30p_sd865_tflite2.7cpu.yaml
    # list the supported predictors
    nn-meter --list-predictors
    On success, the output typically looks like:
    (nn-Meter) Successfully register predictor: redmik30p_sd865_tflite2.7cpu
    (nn-Meter) Supported latency predictors:
    (nn-Meter) [Predictor] cortexA76cpu_tflite21: version=1.0
    (nn-Meter) [Predictor] adreno640gpu_tflite21: version=1.0
    (nn-Meter) [Predictor] adreno630gpu_tflite21: version=1.0
    (nn-Meter) [Predictor] myriadvpu_openvino2019r2: version=1.0
    (nn-Meter) [Predictor] redmik30p_sd865_tflite2.7cpu: version=1.0

3 Testing

3.1 Predicted vs. Measured Latency

  1. Export a ResNet50 model with the TensorFlow 2 API

    import tensorflow as tf
    from tensorflow.keras.applications.resnet50 import ResNet50
    from tensorflow.python.framework.convert_to_constants import convert_variables_to_constants_v2

    # load the model
    model = ResNet50(weights='imagenet')

    full_model = tf.function(lambda x: model(x))
    shape = [1, 224, 224, 3] # model.inputs[0].shape
    full_model = full_model.get_concrete_function(
        tf.TensorSpec(shape, model.inputs[0].dtype))

    # get the frozen ConcreteFunction
    frozen_func = convert_variables_to_constants_v2(full_model)
    frozen_func.graph.as_graph_def()

    layers = [op.name for op in frozen_func.graph.get_operations()]
    print("-" * 50)
    print("Frozen model layers: ")
    for layer in layers:
        print(layer)

    print("-" * 50)
    print("Frozen model inputs: ")
    print(frozen_func.inputs)
    print("Frozen model outputs: ")
    print(frozen_func.outputs)

    # save the frozen graph from the frozen ConcreteFunction to disk
    tf.io.write_graph(graph_or_graph_def=frozen_func.graph,
                      logdir="./frozen_models",
                      name="frozen_graph.pb",
                      as_text=False)

    # convert the model to TensorFlow Lite format and save it as a .tflite file
    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    tflite_model = converter.convert()
    with open('resnet50.tflite', 'wb') as f:
        f.write(tflite_model)

  2. Predict with nn-Meter
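    The command itself is not recorded here; following the pattern of section 3.2, it would be along these lines (using the frozen graph exported in step 1):

    nn-meter predict --predictor redmik30p_sd865_tflite2.7cpu --predictor-version 1.0 --tensorflow ./frozen_models/frozen_graph.pb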

  3. Run the benchmark

    /data/local/tmp/benchmark_model_cpu_gpu_v2.7 --num_threads=4 \
    --graph=/mnt/sdcard/tflite_models/resnet50.tflite \
    --warmup_runs=30 \
    --num_runs=50
    STARTING!
    Log parameter values verbosely: [0]
    Min num runs: [50]
    Num threads: [4]
    Min warmup runs: [30]
    Graph: [/mnt/sdcard/tflite_models/resnet50.tflite]
    #threads used for CPU inference: [4]
    Loaded model /mnt/sdcard/tflite_models/resnet50.tflite
    INFO: Initialized TensorFlow Lite runtime.
    INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
    INFO: Replacing 75 node(s) with delegate (TfLiteXNNPackDelegate) node, yielding 1 partitions.
    The input model file size (MB): 102.161
    Initialized session in 98.471ms.
    Running benchmark for at least 30 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
    count=30 first=104448 curr=87126 min=86737 max=104448 avg=88622.5 std=3079

    Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
    count=50 first=87163 curr=89038 min=86939 max=93704 avg=88199.2 std=1353

    Inference timings in us: Init: 98471, First inference: 104448, Warmup (avg): 88622.5, Inference (avg): 88199.2
    Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
    Memory footprint delta from the start of the tool (MB): init=134.562 overall=208.699

3.2 Predicting a Model with Untrained Kernels

Here I chose an SSD model, whose operator types are as follows:

Add
BatchNormalization
Cast
Concat
Constant
ConstantOfShape
Conv
Exp
Gather
MaxPool
Mul
NonMaxSuppression
ReduceMin
Relu
Reshape
Shape
Slice
Softmax
Squeeze
Sub
TopK
Transpose
Unsqueeze

Running the prediction with nn-meter predict --predictor redmik30p_sd865_tflite2.7cpu --predictor-version 1.0 --onnx /root/workspace/nn-Meter/workspace/models/ssd-12.onnx produces:

(nn-Meter) Start latency prediction ...
(nn-Meter) Empty shape information with Constant_339
(nn-Meter) Empty shape information with Shape_340
(nn-Meter) Empty shape information with Gather_341
(nn-Meter) Empty shape information with Constant_342
(nn-Meter) Empty shape information with Constant_343
(nn-Meter) Empty shape information with Unsqueeze_344
(nn-Meter) Empty shape information with Unsqueeze_345
(nn-Meter) Empty shape information with Unsqueeze_346
(nn-Meter) Empty shape information with Concat_347
(nn-Meter) Empty shape information with Reshape_348
(nn-Meter) Empty shape information with Constant_350
(nn-Meter) Empty shape information with Shape_351
(nn-Meter) Empty shape information with Gather_352
(nn-Meter) Empty shape information with Constant_353
(nn-Meter) Empty shape information with Constant_354
...
(nn-Meter) Empty shape information with Unsqueeze_scores
Traceback (most recent call last):
File "/root/anaconda3/envs/nnmeter_tflite/bin/nn-meter", line 8, in <module>
sys.exit(nn_meter_cli())
File "/root/anaconda3/envs/nnmeter_tflite/lib/python3.8/site-packages/nn_meter/utils/nn_meter_cli/interface.py", line 266, in nn_meter_cli
args.func(args)
File "/root/anaconda3/envs/nnmeter_tflite/lib/python3.8/site-packages/nn_meter/utils/nn_meter_cli/predictor.py", line 56, in apply_latency_predictor_cli
latency = predictor.predict(model, model_type) # in unit of ms
File "/root/anaconda3/envs/nnmeter_tflite/lib/python3.8/site-packages/nn_meter/predictor/nn_meter_predictor.py", line 111, in predict
self.kd.load_graph(graph)
File "/root/anaconda3/envs/nnmeter_tflite/lib/python3.8/site-packages/nn_meter/kernel_detector/kernel_detector.py", line 19, in load_graph
new_graph = convert_nodes(graph)
File "/root/anaconda3/envs/nnmeter_tflite/lib/python3.8/site-packages/nn_meter/kernel_detector/utils/ir_tools.py", line 14, in convert_nodes
type = node["attr"]["type"]
KeyError: 'type'

The "Empty shape information" warnings are raised in nn_meter/ir_converter/onnx_converter/converter.py#OnnxConverter.fetch_attrs. They leave the returned attr variable empty, which ultimately causes the KeyError above.
To compute latency, nn-Meter needs each OP's input and output shapes, but operators such as Shape are not conventional compute OPs, so the error occurs. The operator types that trigger the error in this model are:

Add
Cast
Concat
Constant
ConstantOfShape
Exp
Gather
Mul
NonMaxSuppression
ReduceMin
Reshape
Shape
Slice
Softmax
Squeeze
Sub
TopK
Transpose
Unsqueeze

The problem looks somewhat complicated: some of these operators, such as Add and Concat, were trained by nn-Meter, yet they still fail.

3.3 Issues with nn-Meter

  1. For TensorFlow models, nn-Meter may run into problems during shape inference.
  2. nn-Meter currently supports only the operators commonly used in CNNs.
  3. nn-Meter supports only float and int32 model data types.

4 References

  1. https://github.com/microsoft/nn-Meter
  2. https://blog.csdn.net/ouening/article/details/104335552

5 Reference Code

  1. Print the OPs of an ONNX model
    import onnx

    model_file = "/root/workspace/nn-Meter/workspace/models/mobilenetv3small_0.onnx" # path to the ONNX model file
    model = onnx.load(model_file)
    op_types = set()
    for node in model.graph.node:
        op_types.add(node.op_type)
    for op_type in op_types:
        print(op_type)

A PyTorch Implementation of Focal Loss

1 Derivation

1.1 Cross Entropy

  1. In information theory, a low-probability outcome $x_i$ of an event $X$ carries more information when it occurs. Let $x_i$ denote an outcome and $P(X=x_i)=p(x_i)$ its probability; the information content is then defined as
    $$
    I(x_i) = \log{\frac{1}{p(x_i)}}
    $$

  2. Entropy is the expectation of the information content over all possibilities of the event, where $N$ is the number of possible outcomes:
    $$
    H(X) = -\sum_{i=1}^{N}{p(x_i)\log{p(x_i)}}
    $$
    For example, a fair coin has $H(X) = -2 \cdot \frac{1}{2}\log{\frac{1}{2}} = \log{2}$.

  3. Cross entropy gives us a way to express the difference between two probability distributions: the more the distributions of $X$ and $Y$ differ, the more the cross entropy of $X$ relative to $Y$ exceeds the entropy of $Y$.
    $$
    H_{Y}(X) = -\sum_{i=1}^{N}{p(y_i)\log{p(x_i)}}
    $$

  4. The multi-class cross-entropy loss
    Assume $N$ samples and $K$ classes, and write the indicator $I(y_i=k)$ as $y_{i,k}$; this is usually the ground truth, either 1 or 0:
    $$
    l(X)=-\frac{1}{N}\sum_{i=1}^{N}{\sum_{k=1}^{K}{y_{i,k}\log{x_{i,k}}}}
    $$
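
    As a quick check that this formula matches the usual implementation, the following sketch (illustrative values) compares it with PyTorch's built-in cross entropy:

    import torch
    import torch.nn.functional as F

    logits = torch.randn(4, 3)                 # N=4 samples, K=3 classes
    targets = torch.tensor([0, 2, 1, 2])
    x = F.softmax(logits, dim=1)               # x_{i,k}: predicted probabilities
    y = F.one_hot(targets, num_classes=3)      # y_{i,k}: 1 for the true class, else 0
    manual = -(y * x.log()).sum(dim=1).mean()  # the formula above
    builtin = F.cross_entropy(logits, targets)
    print(torch.allclose(manual, builtin))     # True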

1.2 Balanced Cross Entropy

$$
l(X)=-\frac{1}{N}\sum_{i=1}^{N}{\sum_{k=1}^{K}{\alpha_{k}y_{i,k}\log{x_{i,k}}}}
$$
where $\alpha_{k}$ is a weighting factor derived from the class distribution.

1.3 Focal Loss

If the class distribution of the dataset is imbalanced, the majority classes gain more weight in the loss, and learning parameters for the minority classes becomes difficult. Focal loss adds a modulating factor $(1-x_{i,k})^{\gamma}$ on top of balanced cross entropy:
$$
l(X)=-\frac{1}{N}\sum_{i=1}^{N}{\sum_{k=1}^{K}{\alpha_{k}(1-x_{i,k})^{\gamma}y_{i,k}\log{x_{i,k}}}}
$$

Both focal loss and balanced cross entropy try to address the training problems caused by class imbalance, but the latter weights the loss by the class distribution, while the former starts from how hard each sample is to classify, focusing the loss on hard samples.

2 Implementation

# PyTorch implementation
import torch
import torch.nn as nn

class FocalLoss(nn.Module):
    def __init__(self, gamma=2, alpha=1, size_average=True):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.alpha = alpha
        self.size_average = size_average
        self.epsilon = 0.000001

    def forward(self, outputs, labels):
        # compute the per-sample CE loss first
        ce_loss = torch.nn.functional.cross_entropy(outputs, labels, reduction='none')
        # invert the log to recover p_t, the predicted probability of the true class
        pt = torch.exp(-ce_loss)
        # apply the modulating factor, then mean over the batch
        focal_loss = (self.alpha * (1 - pt) ** self.gamma * ce_loss).mean()
        return focal_loss
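
A quick usage check of the module (shapes are illustrative):

logits = torch.randn(8, 5)              # batch of 8 samples, 5 classes
targets = torch.randint(0, 5, (8,))
criterion = FocalLoss(gamma=2, alpha=1)
loss = criterion(logits, targets)
print(loss.item())

With gamma=0 and alpha=1 this reduces to plain mean cross entropy, which is an easy way to sanity-check the implementation.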

3 References

  1. Loss functions: cross entropy explained (损失函数:交叉熵详解)
  2. The principle of cross entropy (交叉熵的原理)
  3. Focal loss and its multi-class implementation (Focal loss及多分类任务实现)