


Person Segmentation in Vision framework
Learn how to use person segmentation via the Vision framework.
Computer Vision has gained more prominence than ever before. Its applications include cancer detection, cell classification, traffic flow analysis, real-time sports analysis and many more. Apple introduced the Vision framework as part of iOS 11. It allows you to perform various tasks, such as face tracking, barcode detection and image registration. In iOS 15, Apple introduced an API in the Vision framework to perform person segmentation, which also powers the Portrait mode.
In this tutorial, you’ll learn:
- What image segmentation is and the different types of segmentation.
- How to create a person segmentation mask for a photo.
- The different quality levels and their performance trade-offs.
- How to create person segmentation for live video capture.
- Which other frameworks provide person segmentation.
- Best practices for person segmentation.
Getting Started
Download the project by clicking Download Materials at the top or bottom of this page. Open RayGreetings in the starter folder. Build and run on a physical device.
You’ll see two tabs: Photo Greeting and Video Greeting. The Photo Greeting tab will show you a nice background image and a family picture. In this tutorial, you’ll use person segmentation to overlay family members on the greeting background. Tap the Video Greeting tab and grant the camera permissions. You’ll see the camera feed displayed. The starter project is set up to capture and display the camera frames. You’ll update the live frames to generate a video greeting!
Before you dive into implementing these, you need to understand what person segmentation is. Get ready for a fun ride.
Introducing Image Segmentation
Image segmentation divides an image into segments and processes them. It gives a more granular understanding of the image. Object detection provides a bounding box of the desired object in an image, whereas image segmentation provides a pixel mask for the object.
There are two types of image segmentation: semantic segmentation and instance segmentation.
Semantic segmentation is the process of detecting and grouping together similar parts of the image that belong to the same class. Instance segmentation is the process of detecting a specific instance of the object. When you apply semantic segmentation to an image with people, it generates one mask that contains all the people. Instance segmentation generates an individual mask for each person in the image.
The person segmentation API provided in Apple’s Vision framework is a single-frame API. It uses semantic segmentation to provide a single mask for all people in a frame. It can be used for both streaming and offline processing.
The process of person segmentation has four steps:
- Creating a person segmentation request.
- Creating a request handler for that request.
- Processing the request.
- Handling the result.
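In code, those four steps map directly onto the Vision API. Here’s a minimal sketch, assuming image is a CGImage you already have; you’ll build the real version step by step below:

import Vision
import CoreImage

// 1. Create a person segmentation request.
let request = VNGeneratePersonSegmentationRequest()
// 2. Create a request handler for that request.
let handler = VNImageRequestHandler(cgImage: image, options: [:])
do {
  // 3. Process the request.
  try handler.perform([request])
  // 4. Handle the result: the mask arrives as a pixel buffer.
  if let mask = request.results?.first?.pixelBuffer {
    let maskImage = CIImage(cvPixelBuffer: mask)
    // Use maskImage, e.g. to blend a new background behind the people.
    print("Mask extent: \(maskImage.extent)")
  }
} catch {
  print("Segmentation failed: \(error)")
}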
Next, you’ll use the API and these steps to create a photo greeting!
Creating Photo Greeting
You have an image of a family and an image with a festive background. Your goal is to overlay the people in the family picture over the festive background to generate a fun greeting.
Open RayGreetings and open GreetingProcessor.swift.
Add the following below import Combine:
import Vision
This imports the Vision framework. Next, add the following to GreetingProcessor, below @Published var photoOutput = UIImage():
let request = VNGeneratePersonSegmentationRequest()
Here, you create an instance of the person segmentation request. This is a stateful request and can be reused for an entire sequence of frames. This is especially useful when processing videos offline and for live camera capture.
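For instance, when processing a video offline, you could create the request once and reuse it for each frame. A quick hypothetical sketch, where frames stands in for a [CGImage] sequence you’re iterating over:

let request = VNGeneratePersonSegmentationRequest()
for frame in frames { // `frames` is a hypothetical [CGImage]
  let handler = VNImageRequestHandler(cgImage: frame, options: [:])
  try? handler.perform([request])
  // request.results?.first?.pixelBuffer holds this frame's mask.
}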
Next, add the following to GreetingProcessor:
func generatePhotoGreeting(greeting: Greeting) {
  // 1
  guard
    let backgroundImage = greeting.backgroundImage.cgImage,
    let foregroundImage = greeting.foregroundImage.cgImage
  else {
    print("Missing required images")
    return
  }

  // 2
  // Create request handler
  let requestHandler = VNImageRequestHandler(
    cgImage: foregroundImage,
    options: [:])

  // TODO
}
Here’s what the code above is doing:
- Accesses cgImage from backgroundImage and foregroundImage. Then, it ensures both images are valid. You’ll be using them soon to blend the images using Core Image.
- Creates requestHandler as an instance of VNImageRequestHandler. It takes in an image along with an optional dictionary that specifies how to process the image.
Next, replace // TODO with the following:
do {
  // 1
  try requestHandler.perform([request])

  // 2
  guard let mask = request.results?.first else {
    print("Error generating person segmentation mask")
    return
  }

  // 3
  let foreground = CIImage(cgImage: foregroundImage)
  let maskImage = CIImage(cvPixelBuffer: mask.pixelBuffer)
  let background = CIImage(cgImage: backgroundImage)

  // TODO: Blend images
} catch {
  print("Error processing person segmentation request")
}
Here’s a breakdown of the code above:
- requestHandler processes the person segmentation request using perform(_:). If multiple requests are present, it returns after all the requests have either completed or failed. perform(_:) can throw an error while processing the request, so you handle it by enclosing it in a do-catch.
- You then retrieve the mask from the results. Because you submitted only one request, you retrieve the first object from the results.
- The pixelBuffer property of the returned result holds the mask. You then create CIImage versions of the foreground, background and mask. A CIImage is the representation of an image that Core Image filters can process. You’ll need these to blend the images.
Blending All the Images
Add the following in GreetingProcessor.swift below import Vision:
import CoreImage.CIFilterBuiltins
Core Image provides methods that give type-safe instances of CIFilter. Here, you import CIFilterBuiltins to access its type-safe APIs.
Next, add the following to GreetingProcessor:
func blendImages(
  background: CIImage,
  foreground: CIImage,
  mask: CIImage
) -> CIImage? {
  // 1
  let maskScaleX = foreground.extent.width / mask.extent.width
  let maskScaleY = foreground.extent.height / mask.extent.height
  let maskScaled = mask.transformed(
    by: CGAffineTransform(scaleX: maskScaleX, y: maskScaleY))

  // 2
  let backgroundScaleX = foreground.extent.width / background.extent.width
  let backgroundScaleY = foreground.extent.height / background.extent.height
  let backgroundScaled = background.transformed(
    by: CGAffineTransform(scaleX: backgroundScaleX, y: backgroundScaleY))

  // 3
  let blendFilter = CIFilter.blendWithMask()
  blendFilter.inputImage = foreground
  blendFilter.backgroundImage = backgroundScaled
  blendFilter.maskImage = maskScaled

  // 4
  return blendFilter.outputImage
}
The code above:
- Calculates the X and Y scales of the mask with respect to the foreground image. It then uses a CGAffineTransform to scale the mask to the size of the foreground image.
- Like the scaling of the mask, it calculates the X and Y scales of the background, then scales the background to the size of the foreground.
- Creates blendFilter, which is a Core Image filter. It then sets the inputImage of the filter to the foreground. The backgroundImage and the maskImage of the filter are set to the scaled versions of those images.
- outputImage contains the result of the blend.
The returned result is of type CIImage. You’ll need to convert it to a UIImage to display it in the UI.
In GreetingProcessor, add the following at the top, below let request = VNGeneratePersonSegmentationRequest():
let context = CIContext()
Here, you create an instance of CIContext. It’s used to create a Quartz 2D image from a CIImage object.
Add the following to GreetingProcessor:
private func renderAsUIImage(_ image: CIImage) -> UIImage? {
  guard let cgImage = context.createCGImage(image, from: image.extent) else {
    return nil
  }
  return UIImage(cgImage: cgImage)
}
Here, you use context to create a CGImage from the CIImage. Using that cgImage, you then create the UIImage the user will see.
Displaying the Photo Greeting
Replace // TODO: Blend images in generatePhotoGreeting(greeting:) with the following:
// 1
guard let output = blendImages(
  background: background,
  foreground: foreground,
  mask: maskImage) else {
    print("Error blending images")
    return
}

// 2
if let photoResult = renderAsUIImage(output) {
  self.photoOutput = photoResult
}
Here’s what’s happening:
- blendImages(background:foreground:mask:) blends the images, and you ensure the output isn’t nil.
- Then, you convert the output to an instance of UIImage and set it to photoOutput. photoOutput is a published property. It’s accessed to display the output in PhotoGreetingView.swift.
As a last step, open PhotoGreetingView.swift. Replace // TODO: Generate Photo Greeting in the action closure of Button with the following:
GreetingProcessor.shared.generatePhotoGreeting(greeting: greeting)
Here, you call generatePhotoGreeting(greeting:) to generate the greeting when the Button is tapped.
Build and run on a physical device. Tap Generate Photo Greeting.
Voila! You’ve now added a custom background to your family pic. It’s time to send that to your friends and family. :]
By default, you get the best quality person segmentation. It does have a high processing cost and might not be suitable for all real-time scenarios. It’s essential to know the different quality and performance options available. You’ll learn that next.
Quality and Performance Options
The person segmentation request you created earlier has a default quality level of VNGeneratePersonSegmentationRequest.QualityLevel.accurate
.
You can choose from three quality levels:
- accurate: Ideal when you want the highest quality and aren’t constrained by time.
- balanced: Ideal for processing frames of video.
- fast: Best suited for processing streaming content.
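You choose a level by setting the request’s qualityLevel property before performing it. For example, to favor speed when processing video:

let request = VNGeneratePersonSegmentationRequest()
// The default is .accurate; .balanced trades mask detail for speed.
request.qualityLevel = .balanced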
The quality of the generated mask depends on the quality level set.
As the quality level increases, the mask looks much better: the accurate level shows more granular detail in the mask. The frame size, memory and processing time also vary with the quality level. The frame size at the accurate level is a whopping 64x that of the fast level, and the memory and processing time for the accurate level are much higher than for the fast and balanced levels. This is the trade-off between the quality of the mask and the resources needed to generate it.
Now that you know the trade-off, it’s time to generate a fun video greeting! :]
Creating Video Greeting
Open CameraViewController.swift. It has all the functionality set up to capture camera frames and render them using Metal. To learn more about setting up a camera with AVFoundation and SwiftUI, check out this tutorial and this video series.
Check out the logic in CameraViewController, which conforms to AVCaptureVideoDataOutputSampleBufferDelegate:
extension CameraViewController: AVCaptureVideoDataOutputSampleBufferDelegate {
  func captureOutput(
    _ output: AVCaptureOutput,
    didOutput sampleBuffer: CMSampleBuffer,
    from connection: AVCaptureConnection
  ) {
    // Grab the pixel buffer frame from the camera output
    guard let pixelBuffer = sampleBuffer.imageBuffer else {
      return
    }
    self.currentCIImage = CIImage(cvPixelBuffer: pixelBuffer)
  }
}
Here, notice that pixelBuffer is retrieved from sampleBuffer. It’s then rendered by updating currentCIImage. Your goal is to use this pixelBuffer as the foreground image and create a video greeting.
Open GreetingProcessor.swift and add the following to GreetingProcessor:
func processVideoFrame(
  foreground: CVPixelBuffer,
  background: CGImage
) -> CIImage? {
  let ciForeground = CIImage(cvPixelBuffer: foreground)

  // TODO: person segmentation request

  return nil
}
Here, you create an instance of CIImage from the foreground CVPixelBuffer so you can blend the images using a Core Image filter.
So far, you’ve used the Vision framework to create, process and handle the person segmentation request. Although it’s easy to use, other frameworks offer similar functionality powered by the same technology. You’ll learn this next.
Alternatives for Generating Person Segmentation
You can use these frameworks as alternatives to Vision for generating a person segmentation mask:
- AVFoundation: Can generate a person segmentation mask on certain newer devices when capturing a photo. You can get the mask via the portraitEffectsMatte property of AVCapturePhoto (see the sketch after this list).
- ARKit: Generates the segmentation mask while processing the camera feed. You can get the mask using the segmentationBuffer property of ARFrame. It’s supported on devices with an A12 Bionic chip and later.
- Core Image: Provides a thin wrapper over the Vision framework. It exposes the qualityLevel property you set on VNGeneratePersonSegmentationRequest.
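To make the AVFoundation route concrete, here’s a rough sketch, not this project’s code, assuming you’ve already enabled portrait effects matte delivery on your AVCapturePhotoOutput and capture settings; PhotoCaptureDelegate is a hypothetical delegate class:

import AVFoundation
import CoreImage

extension PhotoCaptureDelegate: AVCapturePhotoCaptureDelegate {
  func photoOutput(
    _ output: AVCapturePhotoOutput,
    didFinishProcessingPhoto photo: AVCapturePhoto,
    error: Error?
  ) {
    // portraitEffectsMatte is nil unless matte delivery was enabled
    // and the device supports it.
    guard let matte = photo.portraitEffectsMatte else { return }
    // mattingImage is a CVPixelBuffer, usable just like the Vision mask.
    let maskImage = CIImage(cvPixelBuffer: matte.mattingImage)
    print("Received a person matte with extent \(maskImage.extent)")
  }
}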
Next, you’ll use Core Image to generate a person segmentation mask for the video greeting.
Using Core Image to Generate Person Segmentation Mask
Replace // TODO: person segmentation request in processVideoFrame(foreground:background:) with the following:
// 1
let personSegmentFilter = CIFilter.personSegmentation()
personSegmentFilter.inputImage = ciForeground
personSegmentFilter.qualityLevel = 1

// 2
if let mask = personSegmentFilter.outputImage {
  guard let output = blendImages(
    background: CIImage(cgImage: background),
    foreground: ciForeground,
    mask: mask) else {
      print("Error blending images")
      return nil
  }
  return output
}
Here’s what that does:
- Creates personSegmentFilter using Core Image’s CIFilter and sets its inputImage to the foreground image. qualityLevel takes a number. The different quality level options are:
  - 0: Accurate
  - 1: Balanced
  - 2: Fast
Here, you set qualityLevel to 1.
- Fetches the mask from outputImage of personSegmentFilter and ensures it’s not nil. Then, it uses blendImages(background:foreground:mask:) to blend the images and returns the result.
Open CameraViewController.swift. Replace the contents of captureOutput(_:didOutput:from:) in the CameraViewController extension with the following:
// 1
guard
  let pixelBuffer = sampleBuffer.imageBuffer,
  let backgroundImage = self.background?.cgImage
else {
  return
}

// 2
DispatchQueue.global().async {
  if let output = GreetingProcessor.shared.processVideoFrame(
    foreground: pixelBuffer,
    background: backgroundImage) {
    DispatchQueue.main.async {
      self.currentCIImage = output
    }
  }
}
Here’s a breakdown of the code above. It:
- Checks that pixelBuffer and backgroundImage are valid.
- Processes the video frame asynchronously by calling processVideoFrame(foreground:background:) defined in GreetingProcessor. Then, it updates currentCIImage with the output.
Build and run on a physical device. Tap the Video Greeting tab.
Oh no! There’s no visible camera stream. What happened?
Open GreetingProcessor.swift and set a breakpoint at guard let output = blendImages in processVideoFrame(foreground:background:). Inspect the generated mask using Quick Look in the debugger.
The mask is red! You’ll need to create the blend filter using the red mask instead of the regular white mask.
Update blendImages(background:foreground:mask:) to take a new Boolean parameter, as shown below:
func blendImages(
  background: CIImage,
  foreground: CIImage,
  mask: CIImage,
  isRedMask: Bool = false
) -> CIImage? {
This uses isRedMask to determine the type of blend filter to generate. By default, its value is false.
Replace let blendFilter = CIFilter.blendWithMask() in blendImages(background:foreground:mask:isRedMask:) as shown below:
let blendFilter = isRedMask ? CIFilter.blendWithRedMask() : CIFilter.blendWithMask()
Here, you generate blendFilter with a red mask if isRedMask is true. Otherwise, you create it with a white mask.
Next, replace:
guard let output = blendImages(
  background: CIImage(cgImage: background),
  foreground: ciForeground,
  mask: mask) else {
in processVideoFrame(foreground:background:) with the following:
guard let output = blendImages(
  background: CIImage(cgImage: background),
  foreground: ciForeground,
  mask: mask,
  isRedMask: true) else {
Here, you specify to generate the blend filter with a red mask.
Build and run on a physical device. Tap Video Greeting and point the front camera toward you.
You now see your image overlaid on a friendly greeting. Great job creating a video greeting!
You can now create a Zoom blur background filter. :]
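If you want to try that, here’s a hedged sketch of one approach: blur the frame itself and use it as the background, reusing the red mask from the person segmentation filter. blurBackground is a hypothetical helper, and the radius is just a starting point:

func blurBackground(frame: CIImage, mask: CIImage) -> CIImage? {
  // Blur the whole frame to use as the new background.
  let blurFilter = CIFilter.gaussianBlur()
  blurFilter.inputImage = frame
  blurFilter.radius = 20
  // Gaussian blur grows the image's extent, so crop it back.
  guard let blurred = blurFilter.outputImage?.cropped(to: frame.extent) else {
    return nil
  }
  // Keep the person sharp, replace everything else with the blur.
  let blendFilter = CIFilter.blendWithRedMask()
  blendFilter.inputImage = frame
  blendFilter.backgroundImage = blurred
  blendFilter.maskImage = mask
  return blendFilter.outputImage
}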
Understanding Best Practices
While person segmentation worked in photo and video greetings, here are some best practices to keep in mind:
- Try to segment a maximum of four people in a scene and ensure all are visible.
- A person's height should be at least half the image height.
- Avoid the following ambiguities in a frame:
  - Statues, which can be mistaken for people.
  - People at a long distance from the camera.
Where to Go From Here?
Download the completed version of the project using the Download Materials button at the top or bottom of this tutorial.
To learn more, check out this WWDC video: Detect people, faces, and poses using Vision.
I hope you enjoyed this tutorial. Please join the forum discussion below if you have any questions or comments.