
OpenCV + YOLOv3 Object Detection (C++, Python)



    Object detection algorithms fall into two broad classes. The first is based on region proposals, e.g. the R-CNN family (R-CNN, Fast R-CNN, Faster R-CNN): these are two-stage methods that first generate region proposals with Selective Search or a CNN (the RPN), and then classify and regress boxes on those proposals. The second class comprises one-stage methods such as YOLO and SSD, which use a single CNN to predict object classes and locations directly. Two-stage methods tend to be more accurate but slower; one-stage methods are faster but somewhat less accurate.

    YOLO is an object detection network that is even faster than SSD; its authors report a frame rate on the order of 100× that of Fast R-CNN. This post first gives a brief overview of the YOLO network architecture, then loads the Darknet model through OpenCV's C++ API to run detection. OpenCV added official support for the Darknet framework in version 3.3.1, including import of YOLOv1, YOLOv2, and YOLO Tiny models; testing shows that OpenCV 3.4.2 also supports YOLOv3. The OpenCV dnn module additionally supports Caffe, TensorFlow, Torch/PyTorch, and other deep learning frameworks. For loading TensorFlow pretrained models with OpenCV, see my other post:
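    As a quick illustration, the dnn module exposes one loader function per framework. A minimal sketch follows (the model file names are hypothetical placeholders, not files shipped with this post):

import cv2 as cv

# Each supported framework has its own loader in cv2.dnn (OpenCV >= 3.4.2).
# The model files named here are placeholders for illustration only.
caffe_net = cv.dnn.readNetFromCaffe("deploy.prototxt", "model.caffemodel")
tf_net = cv.dnn.readNetFromTensorflow("frozen_graph.pb")
torch_net = cv.dnn.readNetFromTorch("model.t7")
dark_net = cv.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")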

https://blog.csdn.net/guyuealian/article/details/80570120

    For 《OpenCV + yolov2-tiny Object Detection (C++)》, see my other post: https://blog.csdn.net/guyuealian/article/details/82950283

    All source code for this post is on GitHub: https://github.com/PanJinquan/opencv-learning-tutorials/tree/master/dnn_tutorial. A "Star" would be much appreciated!

References:

《Deep Learning based Object Detection using YOLOv3 with OpenCV (Python / C++)》

《YOLOv3 + OpenCV Object Detection (Python / C++)》: https://blog.csdn.net/haoqimao_hard/article/details/82081285

 GitHub reference code: https://github.com/spmallick/learnopencv/tree/master/ObjectDetection-YOLO

 Darknet YOLO official site: https://pjreddie.com/darknet/yolo/


Contents

OpenCV + YOLOv3 Object Detection (C++, Python)

1. The YOLO Network

(1) YOLO Network Structure

2. Object Detection with YOLOv3 in OpenCV

2.1 C++ Code

2.2 Python Code

3. Limitations of YOLO

4. References


1. The YOLO Network

   YOLO stands for "You Only Look Once": one glance at an image is enough to detect the objects in it. The core idea of YOLO is to feed the entire image into the network and regress the bounding-box positions and class labels directly at the output layer.

(1) YOLO Network Structure

   Pipeline: the input image is first divided into an S×S grid, and each grid cell predicts B bounding boxes. Each bounding box carries five predictions: x, y, w, h, and a confidence score. Each cell additionally predicts probability scores for C classes; these class scores are independent of the boxes' confidence scores. Each cell therefore predicts B×5 + C values, and the full prediction over an S×S grid is a tensor of size S×S×(B×5+C). The structure of the model's predictions is shown in the figure below.

  • 1. Divide the image into an S×S grid (grid cells). If the center of an object falls inside a grid cell, that cell is responsible for detecting the object.
  • 2. Each grid cell predicts B bounding boxes. Besides regressing its own position (x, y, w, h), each bounding box also predicts a confidence value (so each box predicts five values in total: x, y, w, h, and confidence). This confidence encodes two pieces of information, how likely the box is to contain an object and how accurate the box is, and is defined as:

    confidence = Pr(Object) × IOU(pred, truth)

Note: if an object falls inside the grid cell, the first term is 1, otherwise 0; the second term is the IOU between the predicted bounding box and the ground truth.

  • 3. Each grid cell also predicts one set of class probabilities, for C classes; the class probabilities of all cells together form the class probability map.

Note: class information is per grid cell, while confidence information is per bounding box.

      An example: in PASCAL VOC, the input image is 448×448, S = 7 (the image is divided into a 7×7 grid of cells), B = 2 (each cell predicts 2 bounding boxes), and there are C = 20 classes (PASCAL VOC has 20 categories). The output is therefore a tensor of size S×S×(5×B+C) = 7×7×30. The full network structure is shown in the figure below:

  • 4. At test time, each grid cell's class probabilities are multiplied by each bounding box's confidence to give a class-specific confidence score per box:

    Pr(Class_i | Object) × Pr(Object) × IOU(pred, truth) = Pr(Class_i) × IOU(pred, truth)

    The first factor on the left is the class probability predicted by the grid cell, and the remaining two factors are the confidence predicted by the bounding box. The product therefore encodes both the probability that the box contains an object of a given class and how accurate the box is.

  • 5. After computing each box's class-specific confidence score, apply a threshold to discard low-scoring boxes and run NMS on the remaining ones to obtain the final detections (see the NumPy sketch below).

 For a more detailed walkthrough of this part, see: https://blog.csdn.net/tangwei2014/article/details/50915317
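
 To make the decoding concrete, here is a minimal NumPy sketch of steps 4 and 5 above. It assumes a YOLOv1-style per-cell layout of B boxes (x, y, w, h, confidence) followed by C class scores; real implementations order these values differently, so treat it as illustrative only.

import numpy as np

S, B, C = 7, 2, 20                          # grid size, boxes per cell, classes (PASCAL VOC)
pred = np.random.rand(S * S * (B * 5 + C))  # stand-in for a network output (7*7*30 = 1470 values)
pred = pred.reshape(S, S, B * 5 + C)

# Split each cell into its B boxes of (x, y, w, h, confidence) and its C class probabilities.
boxes = pred[..., :B * 5].reshape(S, S, B, 5)
class_probs = pred[..., B * 5:]             # shape (S, S, C)

# Step 4: class-specific confidence score = box confidence * class probability.
scores = boxes[..., 4:5] * class_probs[:, :, None, :]   # shape (S, S, B, C)

# Step 5: zero out low scores; a real pipeline would then run per-class NMS.
scores[scores < 0.2] = 0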


 

2. Object Detection with YOLOv3 in OpenCV

    YOLOv3 model download: https://pan.baidu.com/s/1TugNSWRZaJz1R6IejRtNiA (extraction code: 46mh)

2.1 C++ Code

// This code is written at BigVision LLC. It is based on the OpenCV project.
//It is subject to the license terms in the LICENSE file found in this distribution and at http://opencv.org/license.html

// Usage example:  ./object_detection_yolo.out --video=run.mp4
//                 ./object_detection_yolo.out --image=bird.jpg
#include <fstream>
#include <sstream>
#include <iostream>

#include <opencv2/dnn.hpp>
#include <opencv2/imgproc.hpp>
#include <opencv2/highgui.hpp>
using namespace cv;
using namespace dnn;
using namespace std;

string pro_dir = "E:/opencv-learning-tutorials/"; // project root directory

// Initialize the parameters
float confThreshold = 0.5; // Confidence threshold
float nmsThreshold = 0.4;  // Non-maximum suppression threshold
int inpWidth = 416;  // Width of network's input image
int inpHeight = 416; // Height of network's input image
vector<string> classes;

// Remove the bounding boxes with low confidence using non-maxima suppression
void postprocess(Mat& frame, const vector<Mat>& out);

// Draw the predicted bounding box
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame);

// Get the names of the output layers
vector<String> getOutputsNames(const Net& net);

void detect_image(string image_path, string modelWeights, string modelConfiguration, string classesFile);

void detect_video(string video_path, string modelWeights, string modelConfiguration, string classesFile);


int main(int argc, char** argv)
{
	// Give the configuration and weight files for the model
	String modelConfiguration = pro_dir + "data/models/yolov3/yolov3.cfg";
	String modelWeights = pro_dir + "data/models/yolov3/yolov3.weights";
	string image_path = pro_dir + "data/images/bird.jpg";
	string classesFile = pro_dir + "data/models/yolov3/coco.names";// "coco.names";
	//detect_image(image_path, modelWeights, modelConfiguration, classesFile);
	string video_path = pro_dir + "data/images/run.mp4";
	detect_video(video_path, modelWeights, modelConfiguration, classesFile);
	cv::waitKey(0);
	return 0;
}

void detect_image(string image_path, string modelWeights, string modelConfiguration, string classesFile) {
	// Load names of classes
	ifstream ifs(classesFile.c_str());
	string line;
	while (getline(ifs, line)) classes.push_back(line);

	// Load the network
	Net net = readNetFromDarknet(modelConfiguration, modelWeights);
	net.setPreferableBackend(DNN_BACKEND_OPENCV);
	net.setPreferableTarget(DNN_TARGET_OPENCL);

	// Open a video file or an image file or a camera stream.
	string str, outputFile;
	cv::Mat frame = cv::imread(image_path);
	// Create a window
	static const string kWinName = "Deep learning object detection in OpenCV";
	namedWindow(kWinName, WINDOW_NORMAL);

	// Create a 4D blob from the frame: scale pixels to [0,1], resize to the network
	// input size, no mean subtraction, swap BGR to RGB, no crop.
	Mat blob;
	blobFromImage(frame, blob, 1 / 255.0, Size(inpWidth, inpHeight), Scalar(0, 0, 0), true, false);

	//Sets the input to the network
	net.setInput(blob);

	// Runs the forward pass to get output of the output layers
	vector<Mat> outs;
	net.forward(outs, getOutputsNames(net));

	// Remove the bounding boxes with low confidence
	postprocess(frame, outs);
	// Put efficiency information. The function getPerfProfile returns the overall time for inference(t) and the timings for each of the layers(in layersTimes)
	vector<double> layersTimes;
	double freq = getTickFrequency() / 1000;
	double t = net.getPerfProfile(layersTimes) / freq;
	string label = format("Inference time for a frame : %.2f ms", t);
	putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 255));
	// Write the frame with the detection boxes
	imshow(kWinName, frame);
	cv::waitKey(30);
}

void detect_video(string video_path, string modelWeights, string modelConfiguration, string classesFile) {
	string outputFile = "./yolo_out_cpp.avi";
	// Load names of classes
	ifstream ifs(classesFile.c_str());
	string line;
	while (getline(ifs, line)) classes.push_back(line);

	// Load the network
	Net net = readNetFromDarknet(modelConfiguration, modelWeights);
	net.setPreferableBackend(DNN_BACKEND_OPENCV);
	net.setPreferableTarget(DNN_TARGET_CPU);


	// Open a video file or an image file or a camera stream.
	VideoCapture cap;
	//VideoWriter video;
	Mat frame, blob;

	try {
		// Open the video file
		ifstream ifile(video_path);
		if (!ifile) throw("error");
		cap.open(video_path);
	}
	catch (...) {
		cout << "Could not open the input image/video stream" << endl;
		return ;
	}

	// Get the video writer initialized to save the output video
	//video.open(outputFile, 
	//	VideoWriter::fourcc('M', 'J', 'P', 'G'), 
	//	28, 
	//	Size(cap.get(CAP_PROP_FRAME_WIDTH), cap.get(CAP_PROP_FRAME_HEIGHT)));

	// Create a window
	static const string kWinName = "Deep learning object detection in OpenCV";
	namedWindow(kWinName, WINDOW_NORMAL);

	// Process frames.
	while (waitKey(1) < 0)
	{
		// get frame from the video
		cap >> frame;

		// Stop the program if reached end of video
		if (frame.empty()) {
			cout << "Done processing !!!" << endl;
			cout << "Output file is stored as " << outputFile << endl;
			waitKey(3000);
			break;
		}
		// Create a 4D blob from a frame.
		blobFromImage(frame, blob, 1 / 255.0, Size(inpWidth, inpHeight), Scalar(0, 0, 0), true, false);

		//Sets the input to the network
		net.setInput(blob);

		// Runs the forward pass to get output of the output layers
		vector<Mat> outs;
		net.forward(outs, getOutputsNames(net));

		// Remove the bounding boxes with low confidence
		postprocess(frame, outs);

		// Put efficiency information. The function getPerfProfile returns the overall time for inference(t) and the timings for each of the layers(in layersTimes)
		vector<double> layersTimes;
		double freq = getTickFrequency() / 1000;
		double t = net.getPerfProfile(layersTimes) / freq;
		string label = format("Inference time for a frame : %.2f ms", t);
		putText(frame, label, Point(0, 15), FONT_HERSHEY_SIMPLEX, 0.5, Scalar(0, 0, 255));

		// Write the frame with the detection boxes
		Mat detectedFrame;
		frame.convertTo(detectedFrame, CV_8U);
		//video.write(detectedFrame);
		imshow(kWinName, frame);

	}

	cap.release();
	//video.release();

}

// Remove the bounding boxes with low confidence using non-maxima suppression
void postprocess(Mat& frame, const vector<Mat>& outs)
{
	vector<int> classIds;
	vector<float> confidences;
	vector<Rect> boxes;

	for (size_t i = 0; i < outs.size(); ++i)
	{
		// Scan through all the bounding boxes output from the network and keep only the
		// ones with high confidence scores. Assign the box's class label as the class
		// with the highest score for the box.
		float* data = (float*)outs[i].data;
		for (int j = 0; j < outs[i].rows; ++j, data += outs[i].cols)
		{
			Mat scores = outs[i].row(j).colRange(5, outs[i].cols);
			Point classIdPoint;
			double confidence;
			// Get the value and location of the maximum score
			minMaxLoc(scores, 0, &confidence, 0, &classIdPoint);
			if (confidence > confThreshold)
			{
				int centerX = (int)(data[0] * frame.cols);
				int centerY = (int)(data[1] * frame.rows);
				int width = (int)(data[2] * frame.cols);
				int height = (int)(data[3] * frame.rows);
				int left = centerX - width / 2;
				int top = centerY - height / 2;

				classIds.push_back(classIdPoint.x);
				confidences.push_back((float)confidence);
				boxes.push_back(Rect(left, top, width, height));
			}
		}
	}

	// Perform non maximum suppression to eliminate redundant overlapping boxes with
	// lower confidences
	vector<int> indices;
	NMSBoxes(boxes, confidences, confThreshold, nmsThreshold, indices);
	for (size_t i = 0; i < indices.size(); ++i)
	{
		int idx = indices[i];
		Rect box = boxes[idx];
		drawPred(classIds[idx], confidences[idx], box.x, box.y,
			box.x + box.width, box.y + box.height, frame);
	}
}

// Draw the predicted bounding box
void drawPred(int classId, float conf, int left, int top, int right, int bottom, Mat& frame)
{
	//Draw a rectangle displaying the bounding box
	rectangle(frame, Point(left, top), Point(right, bottom), Scalar(255, 178, 50), 3);

	//Get the label for the class name and its confidence
	string label = format("%.2f", conf);
	if (!classes.empty())
	{
		CV_Assert(classId < (int)classes.size());
		label = classes[classId] + ":" + label;
	}

	//Display the label at the top of the bounding box
	int baseLine;
	Size labelSize = getTextSize(label, FONT_HERSHEY_SIMPLEX, 0.5, 1, &baseLine);
	top = max(top, labelSize.height);
	rectangle(frame, Point(left, top - round(1.5*labelSize.height)), Point(left + round(1.5*labelSize.width), top + baseLine), Scalar(255, 255, 255), FILLED);
	putText(frame, label, Point(left, top), FONT_HERSHEY_SIMPLEX, 0.75, Scalar(0, 0, 0), 1);
}

// Get the names of the output layers
vector<String> getOutputsNames(const Net& net)
{
	static vector<String> names;
	if (names.empty())
	{
		//Get the indices of the output layers, i.e. the layers with unconnected outputs
		vector<int> outLayers = net.getUnconnectedOutLayers();

		//get the names of all the layers in the network
		vector<String> layersNames = net.getLayerNames();

		// Get the names of the output layers in names
		names.resize(outLayers.size());
		for (size_t i = 0; i < outLayers.size(); ++i)
			names[i] = layersNames[outLayers[i] - 1];
	}
	return names;
}

2.2 Python Code

# This code is written at BigVision LLC. It is based on the OpenCV project. It is subject to the license terms in the LICENSE file found in this distribution and at http://opencv.org/license.html

# Usage example:  python3 object_detection_yolo.py --video=run.mp4
#                 python3 object_detection_yolo.py --image=bird.jpg

import cv2 as cv
import argparse
import sys
import numpy as np
import os.path

# Initialize the parameters
confThreshold = 0.5  #Confidence threshold
nmsThreshold = 0.4   #Non-maximum suppression threshold
inpWidth = 416       #Width of network's input image
inpHeight = 416      #Height of network's input image

parser = argparse.ArgumentParser(description='Object Detection using YOLO in OPENCV')
parser.add_argument('--image', help='Path to image file.')
parser.add_argument('--video', help='Path to video file.')
args = parser.parse_args()
        
# Load names of classes
classesFile = "coco.names"
classes = None
with open(classesFile, 'rt') as f:
    classes = f.read().rstrip('\n').split('\n')

# Give the configuration and weight files for the model and load the network using them.
modelConfiguration = "yolov3.cfg"
modelWeights = "yolov3.weights"

net = cv.dnn.readNetFromDarknet(modelConfiguration, modelWeights)
net.setPreferableBackend(cv.dnn.DNN_BACKEND_OPENCV)
net.setPreferableTarget(cv.dnn.DNN_TARGET_CPU)

# Get the names of the output layers
def getOutputsNames(net):
    # Get the names of all the layers in the network
    layersNames = net.getLayerNames()
    # Get the names of the output layers, i.e. the layers with unconnected outputs
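    # Note: this indexing assumes OpenCV 3.x, where getUnconnectedOutLayers() returns
    # Nx1 arrays, hence i[0]; newer 4.x builds return a flat array, in which case this
    # becomes layersNames[i - 1]. The same applies to the i[0] indexing of the
    # NMSBoxes result in postprocess() below.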
    return [layersNames[i[0] - 1] for i in net.getUnconnectedOutLayers()]

# Draw the predicted bounding box
def drawPred(classId, conf, left, top, right, bottom):
    # Draw a bounding box.
    cv.rectangle(frame, (left, top), (right, bottom), (255, 178, 50), 3)
    
    label = '%.2f' % conf
        
    # Get the label for the class name and its confidence
    if classes:
        assert(classId < len(classes))
        label = '%s:%s' % (classes[classId], label)

    #Display the label at the top of the bounding box
    labelSize, baseLine = cv.getTextSize(label, cv.FONT_HERSHEY_SIMPLEX, 0.5, 1)
    top = max(top, labelSize[1])
    cv.rectangle(frame, (left, top - round(1.5*labelSize[1])), (left + round(1.5*labelSize[0]), top + baseLine), (255, 255, 255), cv.FILLED)
    cv.putText(frame, label, (left, top), cv.FONT_HERSHEY_SIMPLEX, 0.75, (0,0,0), 1)

# Remove the bounding boxes with low confidence using non-maxima suppression
def postprocess(frame, outs):
    frameHeight = frame.shape[0]
    frameWidth = frame.shape[1]

    # Scan through all the bounding boxes output from the network and keep only the
    # ones with high confidence scores. Assign the box's class label as the class with the highest score.
    classIds = []
    confidences = []
    boxes = []
    for out in outs:
        for detection in out:
            scores = detection[5:]
            classId = np.argmax(scores)
            confidence = scores[classId]
            if confidence > confThreshold:
                center_x = int(detection[0] * frameWidth)
                center_y = int(detection[1] * frameHeight)
                width = int(detection[2] * frameWidth)
                height = int(detection[3] * frameHeight)
                left = int(center_x - width / 2)
                top = int(center_y - height / 2)
                classIds.append(classId)
                confidences.append(float(confidence))
                boxes.append([left, top, width, height])

    # Perform non maximum suppression to eliminate redundant overlapping boxes with
    # lower confidences.
    indices = cv.dnn.NMSBoxes(boxes, confidences, confThreshold, nmsThreshold)
    for i in indices:
        i = i[0]
        box = boxes[i]
        left = box[0]
        top = box[1]
        width = box[2]
        height = box[3]
        drawPred(classIds[i], confidences[i], left, top, left + width, top + height)

# Process inputs
winName = 'Deep learning object detection in OpenCV'
cv.namedWindow(winName, cv.WINDOW_NORMAL)

outputFile = "yolo_out_py.avi"
if (args.image):
    # Open the image file
    if not os.path.isfile(args.image):
        print("Input image file ", args.image, " doesn't exist")
        sys.exit(1)
    cap = cv.VideoCapture(args.image)
    outputFile = args.image[:-4]+'_yolo_out_py.jpg'
elif (args.video):
    # Open the video file
    if not os.path.isfile(args.video):
        print("Input video file ", args.video, " doesn't exist")
        sys.exit(1)
    cap = cv.VideoCapture(args.video)
    outputFile = args.video[:-4]+'_yolo_out_py.avi'
else:
    # Webcam input
    cap = cv.VideoCapture(0)

# Get the video writer initialized to save the output video
if (not args.image):
    vid_writer = cv.VideoWriter(outputFile, cv.VideoWriter_fourcc('M','J','P','G'), 30, (round(cap.get(cv.CAP_PROP_FRAME_WIDTH)),round(cap.get(cv.CAP_PROP_FRAME_HEIGHT))))

while cv.waitKey(1) < 0:
    
    # get frame from the video
    hasFrame, frame = cap.read()
    
    # Stop the program if reached end of video
    if not hasFrame:
        print("Done processing !!!")
        print("Output file is stored as ", outputFile)
        cv.waitKey(3000)
        break

    # Create a 4D blob from a frame.
    blob = cv.dnn.blobFromImage(frame, 1/255, (inpWidth, inpHeight), [0,0,0], 1, crop=False)

    # Sets the input to the network
    net.setInput(blob)

    # Runs the forward pass to get output of the output layers
    outs = net.forward(getOutputsNames(net))

    # Remove the bounding boxes with low confidence
    postprocess(frame, outs)

    # Put efficiency information. The function getPerfProfile returns the overall time for inference(t) and the timings for each of the layers(in layersTimes)
    t, _ = net.getPerfProfile()
    label = 'Inference time: %.2f ms' % (t * 1000.0 / cv.getTickFrequency())
    cv.putText(frame, label, (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))

    # Write the frame with the detection boxes
    if (args.image):
        cv.imwrite(outputFile, frame.astype(np.uint8))
    else:
        vid_writer.write(frame.astype(np.uint8))

    cv.imshow(winName, frame)


3. Limitations of YOLO

  • YOLO performs poorly on objects that are very close together and on small objects that appear in groups, because each grid cell predicts only two boxes, and both belong to a single class.
  • Generalization is weak when objects of a known class appear in test images with new or unusual aspect ratios or other uncommon configurations.
  • Owing to the design of the loss function, localization error is the main factor limiting detection quality; in particular, the handling of small versus large objects still needs improvement.

4. References

[1]. Paper reading notes: "You Only Look Once: Unified, Real-Time Object Detection": https://blog.csdn.net/tangwei2014/article/details/50915317

[2]. https://blog.csdn.net/xiaohu2022/article/details/79211732 

[3]. https://blog.csdn.net/u014380165/article/details/72616238