Stuart Weitzman Boots, Designer Bags, and Outfits with Mask R-CNN

I was chatting with some folks recently about how to build multi-stage models for deep learning pipelines. The particular problem was how to compare objects in an image against a known database of objects for matching. This is very similar to the pipeline Andrew Ng lays out for facial recognition in his deep learning course: there are two models, stage one and stage two. Stage one can be an object detection model that localizes faces in an image, and the stage two model can then do facial similarity comparison on those localized faces. Building the pipeline in two stages is useful because it lets the second stage model deal with less arbitrary noise, so it can focus on comparing whatever objects you care about.

In my very first Medium post I walked through how I built a facial recognition pipeline for the game Rainbow Six Siege. In this post, however, I want to build a multi-class image segmentation model as an example of a stage one model in that pipeline. My reasoning is that sometimes detecting the bounding box for an object is fine, but segmenting the image at a pixel-by-pixel level should give a cleaner input to the second stage, since there will be little to no background for the second stage model to deal with.

As the title alludes to, the subject of this post is a multi-class image segmentation model that I custom train to detect articles of clothing.

Besides that conversation acting as a primer, I thought this would be a good exercise because it is fairly hard to find tutorials on multi-class image segmentation projects. For this project I am going to use Matterport’s Mask R-CNN, a Keras implementation of Mask R-CNN, and it does perform quite well. In the future it might also be good to do it with the raw TensorFlow implementation, but I wanted to get this prototype done quickly. (By quickly I mean I decided to do this on a Saturday evening at 10pm, gathered data, annotated it, and had it trained by 4am, went to bed, and then wrote this blog the next day XD)

Brief background on Mask R-CNN

For many tasks it is important not just to know that there is an object and where it is in an image (object detection), but also to know which pixels in the image correspond to that object (image segmentation). Mask R-CNN is a modification of the previous object detection architectures R-CNN and Faster R-CNN to serve this purpose. Image segmentation is achieved by adding a fully convolutional network branch onto the existing architecture. So while the main branch generates bounding boxes and identifies the object’s class, the fully convolutional branch, which is fed features from the main branch, generates the image masks.

Another post where I learned a good bit about Mask R-CNN and image segmentation in general is (here). Finally, if you really want to, you can check out the original Mask R-CNN paper, which is worth a read (here).

Gathering Data and Annotating

Although I picked the topic of the post, I know relatively little about designer clothing brands. However, one of my good friends enjoys Stuart Weitzman boots, so I figured I could use that to help narrow down the images for this project. Doing some quick Google searches and pulling relevant images, I gathered around 130 images, mostly of models wearing Stuart Weitzman boots, but I would say the dataset as a whole is of people who seem quite fashionable.

For my annotations I used the VGG Image Annotator, a nice interface out of Oxford for object detection and image segmentation annotation.

screenshot of the VGG annotator

The major change here for me was making sure each polygon region is labeled with a class name using the Region Attributes in the annotator. I added a field called “name” which I could later reference in the output JSON file so the model would know the underlying class of each polygon region.

For my three classes I used boots (which includes shoes and sandals, as you can see below), bags (most of which are handbags of different sizes, but also shopping bags and one or two backpacks), and finally my most troublesome category, “top”. I was initially going to do shirts, coats, pants, etc., but that would have increased the annotation effort significantly, so I condensed most of those into a “top” category and decided not to annotate pants. All of this can be adjusted if I decide to build a stronger version of this model.

When I remembered that I had included this picture I immediately regretted it because of the amount of annotation it required. It also shows how my class system is probably not ideal… I have to label a one-piece outfit as a “top” but don’t include jeans/pants, so the system seems a bit convoluted…

Using the VGG annotator I generated JSON files that contain all of the polygon segmentation areas for each image along with the associated class names of the polygons.
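For reference, an exported entry looks roughly like the sketch below (shown here as a Python dict). The top-level keys and extra fields vary by VIA version, and the values here are made up, but the “name” field inside region_attributes is the part the training code reads later.

import json

# Illustrative shape of one VIA export entry; coordinates are placeholders.
via_annotations = {
    "image_001.jpg123456": {
        "filename": "image_001.jpg",
        "regions": {
            "0": {
                "shape_attributes": {
                    "name": "polygon",
                    "all_points_x": [110, 140, 135, 105],
                    "all_points_y": [300, 305, 380, 375],
                },
                # custom Region Attribute carrying the class label
                "region_attributes": {"name": "boots"},
            }
        },
    }
}
print(json.dumps(via_annotations, indent=2))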

Modifications to Code and Training

This part was actually quite enjoyable because it forced me to dig through the Matterport code to figure out how to modify it for multi-class segmentation.

See my fashion.py file for all of the code; I will just show some segments here.

The first modification occurs on line 46:

NUM_CLASSES = 1 + 3  # Background + bag + top + boots

instead of the basic:

NUM_CLASSES = 1 + 1  # Background + single_class
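For context, in the Matterport setup this constant lives in a Config subclass. A minimal sketch of what mine looks like is below; the class name and the STEPS_PER_EPOCH value are illustrative, while the other fields are standard Config attributes.

from mrcnn.config import Config

class FashionConfig(Config):
    NAME = "fashion"
    IMAGES_PER_GPU = 2       # two images per batch fit on my GPU
    NUM_CLASSES = 1 + 3      # background + top + boots + bag
    STEPS_PER_EPOCH = 100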

The next changes are on line 70 in the load_balloon function (in hindsight I should have renamed those functions since they are not descriptive for this project):

self.add_class("fashion", 1, "top")    # adjusted here
self.add_class("fashion", 2, "boots")  # adjusted here
self.add_class("fashion", 3, "bag")    # adjusted here

I took a look at how the backend dataset was generated for the COCO dataset and saw that the required information is the dataset/model name, in this case “fashion”, followed by the class number and its associated label for the model to use.

Now that the dataset and model could accept multiple classes, I had to make sure the class information was recorded when images were loaded.

To do this I inserted a few lines of code (94–102) to map class labels onto the region polygons for a given image.

class_names_str = [r['region_attributes']['name'] for r in a['regions'].values()]
class_name_nums = []
for i in class_names_str:
    if i == 'top':
        class_name_nums.append(1)
    if i == 'boots':
        class_name_nums.append(2)
    if i == 'bag':
        class_name_nums.append(3)

This section gets the names of the regions from the region_attributes section of the JSON file for a given image, then iterates over them to build an output list that converts the strings to the numbers 1, 2, and 3, matching the class label mapping above.

Next, on line 119, I added a new field to the add_image call so that I could store the class listing for a given image and then pull that listing when the masks are generated later in the script.

class_list = np.array(class_name_nums)
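The add_image call itself then looks roughly like this; Matterport’s Dataset.add_image stores any extra keyword arguments in image_info, which is what makes the class_list retrievable later. The variable names here follow the balloon example the file was adapted from.

self.add_image(
    "fashion",                  # dataset/source name registered above
    image_id=a['filename'],     # use the file name as a unique id
    path=image_path,
    width=width, height=height,
    polygons=polygons,
    class_list=class_list)      # new field: per-polygon class ids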

The most interesting part for me was realizing that the load_mask function handles the class labeling by generating a numpy array of all 1s, on the assumption that it is doing single-class classification, as is common with most custom dataset training. So all I had to do here was reference the class_list that I attached to each image the load_mask function touches (line 143).

In this case the function returns a set of image masks representing the N polygons to segment in the image, and with my addition it now also returns a numeric array of the polygon classes, also of length N.

class_array = info['class_list']
return mask.astype(np.bool), class_array
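Putting it together, the modified load_mask looks roughly like the sketch below. The polygon rasterization follows the balloon template the file was adapted from; the only real change is the last two lines.

import numpy as np
import skimage.draw

def load_mask(self, image_id):
    """Generate instance masks and per-instance class IDs for an image."""
    info = self.image_info[image_id]
    mask = np.zeros([info["height"], info["width"], len(info["polygons"])],
                    dtype=np.uint8)
    for i, p in enumerate(info["polygons"]):
        # Rasterize each annotated polygon into its own mask channel
        rr, cc = skimage.draw.polygon(p['all_points_y'], p['all_points_x'])
        mask[rr, cc, i] = 1
    # The template returned np.ones(...) here (single-class assumption);
    # instead, return the per-polygon class ids stored by add_image
    class_array = info['class_list']
    return mask.astype(np.bool), class_array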

Once these details were ironed out I was able to train the Mask R-CNN model using pretrained weights from the COCO challenge. I let the model train for 20 epochs on an Nvidia 1080 GPU, where it resized the images so the long side was 1024 pixels. I fed 2 images at a time through the GPU, but did not experiment heavily with how many images at that resolution it could handle.
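The training call follows the standard Matterport pattern; a sketch is below. The weight file path and the choice to train only the head layers are assumptions here, and dataset_train/dataset_val stand in for the prepared dataset objects; the rest mirrors the balloon example.

import mrcnn.model as modellib

config = FashionConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="logs")

# Start from COCO weights; the head layers are excluded because the
# number of classes differs from COCO's
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# dataset_train / dataset_val are prepared instances of the custom dataset
model.train(dataset_train, dataset_val,
            learning_rate=config.LEARNING_RATE,
            epochs=20, layers="heads")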

Current Results and Thoughts on Building a Stronger Model

Given that the dataset I used was only 100 training images with another 20–30 for evaluation, the results are actually quite good. Since this was a harder task than the single-class custom segmentation tasks I had done with Mask R-CNN before, I was pleasantly surprised.

Shows original and the predicted image masks

Obviously there is a lot of room for improvement. The model is still uncertain in areas like the edges of coats and outfits, in good part likely due to my inconsistent labeling and poor class choices for the clothing items that were not boots or bags.

If I were going to improve this quick prototype, there are two areas where I would focus effort. The first would be to define what exactly we want the model to do. For instance, the first stage model may not actually need to tell the difference between types of clothing if the second stage model or models are capable of doing so. That would mean you could use a single-class segmentation model with a generic category of “clothes” or something similar. If we went in the opposite direction and added more granularity, it would just require well-thought-out classes for what matters in the long run. We would have to figure out how to handle things like a person wearing a coat over a shirt: do we label both the coat and the shirt even though only part of the shirt is visible? Or pants that are mostly obstructed by tall boots? These are all areas that would need to be examined based on the task/problems at hand.

The second obvious area for improvement is to gather more images. This model was only trained on 100 images, and a lot of the model’s uncertainty, such as not always including the edges of coats or other clothes, looks like the model leaving regions out because it is still unsure given its limited exposure. This part of the performance should improve with more data.

Not Done Yet! Extracting Articles of Clothing

As I stated at the beginning, the goal isn't just to segment the articles of clothing in an image, but to use the segmentation as a stage one model that reduces the noise in the input to subsequent models.

The output of the Mask R-CNN model is an array of results with a number of different data fields.

The ones I found useful here are “rois”, which contains the bounding box coordinates for each object, and “masks”, which holds the binary mask arrays for each predicted instance.

So to isolate an article of clothing, all I had to do was use the binary image mask to show only that particular item, and then use the bounding box from the “rois” field to crop the image down to just that article. The result is a fairly closely cropped image of just that article of clothing rather than the noisy original! I referenced a 2017 post by Brad Sliz for inspiration on how to isolate the masks.
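A sketch of that isolation step is below, assuming the standard Matterport inference output (r = model.detect([image])[0]); the helper name is mine.

import numpy as np

def isolate_instance(image, r, i):
    """Black out everything except the i-th detected instance, then crop it.

    r is the Matterport result dict: r['masks'] is (H, W, N) booleans and
    r['rois'] is (N, 4) boxes in (y1, x1, y2, x2) order.
    """
    mask = r['masks'][:, :, i].astype(image.dtype)
    # Multiply each RGB channel by the binary mask; background goes to 0
    isolated = image * np.stack([mask, mask, mask], axis=-1)
    # Crop down to the predicted bounding box
    y1, x1, y2, x2 = r['rois'][i]
    return isolated[y1:y2, x1:x2]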

See the stages of input below:

First the original image.

Next generate the segmentation masks. The model identifies one “top”, one “bag” and two instances of the “boots” class.

predicted image segmentation and classifications

Finally, by using the segmentation polygons as masks we can isolate just the clothes. Essentially the mask works by multiplying each of the 3 channels of the RGB array by the binary image mask, so all areas outside the mask go to 0 and the areas inside the mask remain untouched. The raw output of this is here… However, I am not a fan of the black background.

I quickly hacked together some code that just replaces pure black, a value of 0 in the array, with a value of 255 to make it white. However, this also messes with the colors in other parts of black clothes… so it is less than ideal, but it works for demonstration purposes; I will adjust it later if needed.
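Given a cropped output from the isolate_instance sketch above (here called isolated), the hack is just a value replacement, which is why truly black pixels inside the clothes get flipped as well:

# Replace pure black (the zeroed background) with white for display.
# Any genuinely black pixels in the clothing are also hit, producing
# the odd outlines visible in the next image.
white_bg = isolated.copy()
white_bg[white_bg == 0] = 255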

Brute force changing values of 0 to 255 causes odd outlines and such.

Now that we can generate these cropped images that isolate just the articles of clothing, you could use this for a ton of things! For instance, you could find what is available in your online store that most closely matches a user-submitted photo, or use the cleaned image to generate tags for an article of clothing to make it searchable. The idea for both of these is that the second stage model is helped by having less input noise to deal with.

If you built a similarity model that did not first localize its input, then you could have two models wearing the same top, but if their skin tones are different the similarity model would need to account for that additional variation. If you instead use a segmentation model like this as the first stage, the models' skin tones wouldn't matter because you only have to compare the tops.

Final Thoughts

This quick project is an example of how to build a custom multi-class image segmentation model, and it shows how that model can be used to help other models do their job better. As stated above, depending on the use case, this type of model could help in a wide range of situations where it is useful to localize to specific objects/areas in images rather than feeding the raw images through the networks.

I am also quite happy since I had not trained a multi-class segmentation network before. Mostly this is because I have used object detection/segmentation models as a way to clean data for that second stage network rather than training them to do classification at a granular level. Part of the reason is that it is easier to retrain the second stage model and keep the first stage model flexible. For instance, say a company builds an AI greeter and uses an object detection model to identify its employees walking into the lobby and unlock the door for them. This would technically work fine, but it would be inconvenient in the long run because you would have to retrain the detector every time an employee joined or left the company. It is easier to keep the detector generalized and just detect faces; on the backend you could build a stage two network, like a siamese network, which gets good at telling people apart, extract features from the stage one detector’s output, and match them against your employees. This also solves the problem of adding or removing employees, since you only need to make the appropriate changes to the dataset of employees you maintain.

GitHub repo with the training and evaluation code