
Semantic Segmentation with OpenCV + ENet

[Result image]

Introduction

In this tutorial, you will learn how to perform semantic segmentation with OpenCV, deep learning, and the ENet architecture. After reading it, you will be able to apply semantic segmentation to images and video using OpenCV. Deep learning has brought unprecedented accuracy to computer vision, including image classification, object detection, and now even segmentation. Traditional segmentation splits an image into regions (Normalized Cuts, Graph Cuts, Grab Cuts, superpixels, etc.), but those algorithms have no real understanding of what the regions represent.

Semantic segmentation algorithms, on the other hand, do the following:

  • 1. Partition the image into meaningful parts.
  • 2. At the same time, associate every pixel of the input image with a class label (person, road, car, bus, and so on).

Semantic segmentation algorithms are powerful and have many use cases, including self-driving cars. In today's post I will show you how to apply semantic segmentation to road-scene images and video!

Semantic segmentation with OpenCV and deep learning

In this post we will discuss the ENet deep learning architecture and demonstrate how to use ENet to perform semantic segmentation on images and video streams.

The ENet semantic segmentation architecture

Abstract: …In this paper, we propose a novel deep neural network architecture named ENet (efficient neural network), created specifically for tasks requiring low latency operation. ENet is up to 18× faster, requires 75× less FLOPs, has 79× less parameters, and provides similar or better accuracy to existing models. …(in short: up to 18× faster, 75× fewer FLOPs, and 79× fewer parameters, with similar or better accuracy than existing models).

A single forward pass takes about 0.5 s on my (modest) laptop CPU (an i5-6200); it would be much faster on a GPU. Paszke et al. trained their model on The Cityscapes Dataset; you can pick whichever dataset fits your needs and train on it. The dataset also comes with example images for urban scene understanding.
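
If your OpenCV build includes the CUDA DNN backend (OpenCV 4.2 or newer compiled with CUDA support), a minimal sketch of how you could move inference onto the GPU looks like this; the rest of this post assumes plain CPU inference:

# a minimal sketch, assuming an OpenCV build with the CUDA DNN backend
import cv2

net = cv2.dnn.readNet("enet-cityscapes/enet-model.net")
net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
# every subsequent net.forward() call now runs on the GPU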

The model we use was trained on 20 classes, including:

Unlabeled (i.e., background)
Road
Sidewalk
Building
Wall
Fence
Pole
TrafficLight
TrafficSign
Vegetation
Terrain
Sky
Person
Rider
Car
Truck
Bus
Train
Motorcycle
Bicycle

Next, you will learn how to apply semantic segmentation to extract the mapping between each class and the pixels of images and video streams. If you are interested in training your own ENet model to segment a custom dataset, the ENet author provides a tutorial on how to train it.

Project structure

If you need the project source code, leave your email address in the comments below or in a message to the public account.
Let's run tree in the project directory:

.
├── enet-cityscapes
│   ├── enet-classes.txt
│   ├── enet-colors.txt
│   └── enet-model.net
├── images
│   ├── example_01.png
│   ├── example_02.jpg
│   ├── example_03.jpg
│   └── example_04.png
├── output
│   └── massachusetts_output.avi
├── segment.py
├── segment.pyc
├── segment_video.py
└── videos
    ├── massachusetts.mp4
    └── toronto.mp4

4 directories, 13 files

The project contains four directories:

  • enet-cityscapes/: contains the trained deep learning model, the class-label list, and the color list.
  • images/: contains four test images.
  • output/: the generated output video.
  • videos/: contains two sample videos for testing the scripts.

Next, we will walk through two Python scripts:

  • segment.py: runs deep learning semantic segmentation on a single image. We will test on single images first and then move on to video.
  • segment_video.py: runs semantic segmentation on video.

Semantic segmentation of images with OpenCV:

# import the necessary packages

import numpy as np
import argparse
import imutils
import time
import cv2

# construct the argument parse and parse the arguments
ap = argparse.ArgumentParser()
ap.add_argument("-m", "--model", required=True, 
	help="path to deep learning segmentation model")
ap.add_argument("-c", "--classes", required=True, 
	help="path to .txt file containing class labels")
ap.add_argument("-i", "--image", required=True,
	help="path to input image")
ap.add_argument("-l", "--colors", type=str,
	help="path to .txt file containing colors for labels")
ap.add_argument("-w", "--width", type=int, default=500,
	help="desired width (in pixels) of input image")
args = vars(ap.parse_args())

First we import the required packages and construct the argument parser:

  • numpy: the fundamental package for scientific computing in Python.
  • argparse: Python's command-line argument parsing package.
  • imutils: a library of convenience functions for image processing in Python.
  • time: time access and conversions.
  • cv2: OpenCV; version 3.4 or newer is recommended.

Next, let's parse the class label file and the colors:

# load the class label names
CLASSES = open(args["classes"]).read().strip().split("\n")
 
# if a colors file was supplied, load it from disk
if args["colors"]:
	COLORS = open(args["colors"]).read().strip().split("\n")
	COLORS = [np.array(c.split(",")).astype("int") for c in COLORS]
	COLORS = np.array(COLORS, dtype="uint8")
 
# otherwise, we need to randomly generate RGB colors for each class
# label
else:
	# initialize a list of colors to represent each class label in
	# the mask (starting with 'black' for the background/unlabeled
	# regions)
	np.random.seed(42)
	COLORS = np.random.randint(0, 255, size=(len(CLASSES) - 1, 3),
		dtype="uint8")
	COLORS = np.vstack([[0, 0, 0], COLORS]).astype("uint8")

We first load CLASSES into memory. If a COLORS file with one color per class label was supplied, we load it as well; otherwise we randomly generate a color for each label.
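
As an aside, here is a minimal sketch of how you could sanity-check the two text files, assuming one class name per line in enet-classes.txt and one comma-separated R,G,B triple per line in enet-colors.txt (which is the format the loading code above expects):

# a minimal sketch, assuming one class name per line and one "R,G,B"
# triple per line in the two files shipped with the model
classes = open("enet-cityscapes/enet-classes.txt").read().strip().split("\n")
colors = open("enet-cityscapes/enet-colors.txt").read().strip().split("\n")
assert len(classes) == len(colors), "each class needs exactly one color"
for name, rgb in zip(classes, colors):
	r, g, b = (int(v) for v in rgb.split(","))
	print("{:15s} -> ({}, {}, {})".format(name, r, g, b))
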
For better visualization, we use OpenCV to draw a legend of the colors and class names:

# initialize the legend visualization
legend = np.zeros(((len(CLASSES) * 25) + 25, 300, 3), dtype="uint8")
 
# loop over the class names + colors
for (i, (className, color)) in enumerate(zip(CLASSES, COLORS)):
	# draw the class name + color on the legend
	color = [int(c) for c in color]
	cv2.putText(legend, className, (5, (i * 25) + 17),
		cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
	cv2.rectangle(legend, (100, (i * 25)), (300, (i * 25) + 25),
		tuple(color), -1)

The legend drawn by this code is shown on the left side of the result image:

[Result image]

Next, we apply deep learning segmentation to the image:

# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNet(args["model"])
 
# load the input image, resize it, and construct a blob from it,
# but keep in mind that the original input image dimensions
# ENet was trained on were 1024x512
image = cv2.imread(args["image"])
image = imutils.resize(image, width=args["width"])
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (1024, 512), 0,
	swapRB=True, crop=False)
 
# perform a forward pass using the segmentation model
net.setInput(blob)
start = time.time()
output = net.forward()
end = time.time()
 
# show the amount of time inference took
print("[INFO] inference took {:.4f} seconds".format(end - start))

The code above performs semantic segmentation on an image with Python and OpenCV:

  • cv2.dnn.readNet(): loads the model.
  • Construct a blob: the ENet model was trained on 1024x512 input images, so we use the same size here (see the shape sketch after this list).
  • Feed the blob into the network, run a forward pass through it, and print the inference time.
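
A minimal, self-contained sketch of the array shapes involved, which is handy when debugging the preprocessing (the exact output shape depends on the model file, but for this ENet model it should contain one score map per class):

# a minimal sketch of the shapes involved, assuming the paths used in this post
import cv2
import imutils

net = cv2.dnn.readNet("enet-cityscapes/enet-model.net")
image = imutils.resize(cv2.imread("images/example_03.jpg"), width=500)
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (1024, 512), 0,
	swapRB=True, crop=False)
print(blob.shape)    # (1, 3, 512, 1024): batch, channels, height, width
net.setInput(blob)
output = net.forward()
print(output.shape)  # should be (1, 20, 512, 1024): one score map per class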

Visualizing the results

Finally, we need to visualize the results:

In the remaining lines of the script we generate a color mask and overlay it on the original image. Every pixel has a corresponding class label index, which lets us see the semantic segmentation result on screen.

# infer the total number of classes along with the spatial dimensions
# of the mask image via the shape of the output array
(numClasses, height, width) = output.shape[1:4]

# our output class ID map will be num_classes x height x width in
# size, so we take the argmax to find the class label with the
# largest probability for each and every (x, y)-coordinate in the
# image

classMap = np.argmax(output[0], axis=0)

# given the class ID map, we can map each of the class IDs to its
# corresponding color

mask = COLORS[classMap]
cv2.imshow("mask", mask)

# resize the mask and class map such that its dimensions match the
# original size of the input image (we're not using the class map
# here for anything else but this is how you would resize it just in
# case you wanted to extract specific pixels/classes)
mask = cv2.resize(mask, (image.shape[1], image.shape[0]),
	interpolation=cv2.INTER_NEAREST)
classMap = cv2.resize(classMap, (image.shape[1], image.shape[0]),
	interpolation=cv2.INTER_NEAREST)

# perform a weighted combination of the input image with the mask to
# form an output visualization
output = ((0.4 * image) + (0.6 * mask)).astype("uint8")

# show the input and output images
cv2.imshow("Legend", legend)
cv2.imshow("Input", image)
cv2.imshow("Output", output)
cv2.waitKey(0)
cv2.destroyAllWindows()

We first extract numClasses, height, and width from output, then compute classMap and mask. classMap holds, for every (x, y) coordinate of output, the index of the class label with the largest probability, and using classMap as a NumPy array index gives the visualization color for each pixel.
The mask is then simply resized to match the input image and blended with it.
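
Here is a tiny self-contained illustration of the COLORS[classMap] fancy-indexing step, using made-up colors and a 2x2 class map:

import numpy as np

# made-up colors and a 2x2 "image" of class IDs, purely for illustration
COLORS = np.array([[0, 0, 0],     # class 0
                   [255, 0, 0],   # class 1
                   [0, 255, 0]],  # class 2
                  dtype="uint8")
classMap = np.array([[0, 1],
                     [2, 1]])
mask = COLORS[classMap]           # shape (2, 2, 3): one color per pixel
print(mask[0, 1])                 # [255   0   0]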

Results on a single image:

Pass the appropriate command-line arguments and run the script; here is an example:


python3 segment.py --model enet-cityscapes/enet-model.net --classes enet-cityscapes/enet-classes.txt --colors enet-cityscapes/enet-colors.txt --image images/example_03.jpg

The final result:

[Result images]

It is easy to see that the person and the bicycle are segmented and identified cleanly, and the road, sidewalk, and cars are picked out as well.
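
If you want to go beyond visualization and pull out the pixels of one particular class, a minimal sketch, continuing from the variables in segment.py above (classMap already resized to the input image size) and assuming "Road" is one of the labels in enet-classes.txt, could look like this:

# a minimal sketch, continuing from segment.py: isolate the "Road" pixels
import numpy as np
import cv2

roadID = CLASSES.index("Road")
roadPixels = (classMap == roadID)
print("road covers {:.1f}% of the image".format(100.0 * roadPixels.mean()))

# keep only the road pixels of the original image, black everywhere else
roadOnly = np.where(roadPixels[..., np.newaxis], image, 0).astype("uint8")
cv2.imshow("Road only", roadOnly)
cv2.waitKey(0)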

Running semantic segmentation on video:

The code for this part lives in segment_video.py. First we load the model and initialize the video stream:

# load our serialized model from disk
print("[INFO] loading model...")
net = cv2.dnn.readNet(args["model"])

# initialize the video stream and pointer to output video file
vs = cv2.VideoCapture(args["video"])
writer = None

# try to determine the total number of frames in the video file
try:
	prop =  cv2.cv.CV_CAP_PROP_FRAME_COUNT if imutils.is_cv2() \
		else cv2.CAP_PROP_FRAME_COUNT
	total = int(vs.get(prop))
	print("[INFO] {} total frames in video".format(total))

# an error occurred while trying to determine the total
# number of frames in the video file
except:
	print("[INFO] could not determine # of frames in video")
	total = -1

Next we read frames from the video stream and feed them into the network; this part is largely the same as segment.py:

# loop over frames from the video file stream
while True:
	# read the next frame from the file
	(grabbed, frame) = vs.read()

	# if the frame was not grabbed, then we have reached the end
	# of the stream
	if not grabbed:
		break

	# construct a blob from the frame and perform a forward pass
	# using the segmentation model
	frame = imutils.resize(frame, width=args["width"])
	blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (1024, 512), 0,
		swapRB=True, crop=False)
	net.setInput(blob)
	start = time.time()
	output = net.forward()
	end = time.time()

	# infer the total number of classes along with the spatial
	# dimensions of the mask image via the shape of the output array
	(numClasses, height, width) = output.shape[1:4]

	# our output class ID map will be num_classes x height x width in
	# size, so we take the argmax to find the class label with the
	# largest probability for each and every (x, y)-coordinate in the
	# image
	classMap = np.argmax(output[0], axis=0)

	# given the class ID map, we can map each of the class IDs to its
	# corresponding color
	mask = COLORS[classMap]

	# resize the mask such that its dimensions match the original size
	# of the input frame
	mask = cv2.resize(mask, (frame.shape[1], frame.shape[0]),
		interpolation=cv2.INTER_NEAREST)

	# perform a weighted combination of the input frame with the mask
	# to form an output visualization
	output = ((0.3 * frame) + (0.7 * mask)).astype("uint8")

Then we write the output frames to a video file:

	# check if the video writer is None
	if writer is None:
		# initialize our video writer
		fourcc = cv2.VideoWriter_fourcc(*"MJPG")
		writer = cv2.VideoWriter(args["output"], fourcc, 30,
			(output.shape[1], output.shape[0]), True)

		# some information on processing single frame
		if total > 0:
			elap = (end - start)
			print("[INFO] single frame took {:.4f} seconds".format(elap))
			print("[INFO] estimated total time: {:.4f}".format(
				elap * total))

	# write the output frame to disk
	writer.write(output)

	# check to see if we should display the output frame to our screen
	if args["show"] > 0:
		cv2.imshow("Frame", output)
		key = cv2.waitKey(1) & 0xFF
 
		# if the `q` key was pressed, break from the loop
		if key == ord("q"):
			break
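
The excerpt above stops inside the loop; once the loop finishes you would typically release the writer and the video stream. A minimal sketch of that cleanup (assumed, not shown in the excerpt) is:

# after the while-loop: release resources (assumed cleanup, not in the excerpt)
print("[INFO] cleaning up...")
writer.release()
vs.release()
cv2.destroyAllWindows()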

Run the script with the following command to generate the final video demo:


python3 segment_video.py --model enet-cityscapes/enet-model.net \
	--classes enet-cityscapes/enet-classes.txt \
	--colors enet-cityscapes/enet-colors.txt \
	--video videos/massachusetts.mp4 \
	--output output/massachusetts_output.avi

Finally, how to train your own model:

As mentioned above, if you want to train ENet on your own dataset, the author provides a training tutorial on the ENet project page.

Summary

In this post we learned how to apply semantic segmentation using OpenCV, deep learning, and the ENet architecture. Using the ENet model pre-trained on the Cityscapes dataset, we were able to segment images and video streams into 20 classes in the context of self-driving cars and road-scene segmentation, including people (pedestrians and cyclists), vehicles (cars, trucks, buses, motorcycles, etc.), construction (buildings, walls, fences, etc.), as well as vegetation, terrain, and the ground itself. If you enjoyed today's post, please share it!

If you need the source code, leave your email address in the comments below or in a message to the public account (SLAM 技術交流).


Corrections and feedback are welcome. If you found this article helpful, please follow the WeChat public account "SLAM 技術交流" to keep supporting us. :D