Sandboxed – On-Device AI for iOS Developers

Episode 9

Apple's ML Stack Overview

Don't build a custom engine to drill a hole. We explain the hierarchy of Apple's ML frameworks and why 'Vision' usually beats custom Core ML models.

Apple's ML stack isn't a monolith—it's an Onion of Abstractions. We peel back the layers from the "magic" domain frameworks (Vision, NL, Sound) down through Core ML, Create ML, and the raw physics of Metal and Accelerate.

🧠 The Abstraction Hierarchy

Think of the Apple ML stack as an Automated Factory with four layers:

  • 1. Domain Frameworks (Vision, NL, Sound): The Specialized Robots. Purpose-built, highly optimized, and pre-programmed for common tasks.
  • 2. Core ML (The Engine): The Programmable Arm. Raw inference capability for custom models Apple didn't foresee.
  • 3. Create ML (The Teacher): Training tools to teach the Programmable Arm new movements with your own data.
  • 4. Accelerate / Metal (The Physics): The raw hydraulics and electricity. Only touch this if you're building a new framework from scratch.

The Golden Rule: Always use the highest-level framework that solves the problem. Only drop down a layer when you hit a hard constraint.

🔍 Vision: The Workhorse

Vision isn't just a model wrapper—it's a Pipeline. It bridges the Semantic Gap between your 4K camera feed and the model's 256x256 tensor input, handling buffer conversion, cropping/scaling, and device orientation.

Beyond plumbing, Vision offers "Solved Problems"—models shipped with iOS, pre-trained and optimized for the ANE:

  • VNDetectFaceRectanglesRequest — Face detection
  • VNRecognizeTextRequest — Full OCR
  • VNDetectBarcodesRequest — QR/Barcode scanning
  • VNDetectHumanBodyPoseRequest — Body pose estimation

Zero-Copy Performance

Vision avoids the "Copy-Convert-Copy" loop by leveraging IOSurface—a kernel-level construct that lets the ISP, GPU, ANE, and CPU share the same physical memory without copying bytes.

import Vision

let request = VNRecognizeTextRequest { request, error in
    guard let observations = request.results
        as? [VNRecognizedTextObservation] else { return }
    for observation in observations {
        let text = observation.topCandidates(1).first?.string ?? ""
        // Heuristic: 16 digits suggests a payment card number.
        if text.filter(\.isNumber).count == 16 {
            print("Found Card: \(text)")
        }
    }
}
request.recognitionLevel = .accurate

let handler = VNImageRequestHandler(
    cvPixelBuffer: pixelBuffer, options: [:]
)
try? handler.perform([request])

No model loading. No input shaping. No output decoding. You ask for text, you get text.

📝 Natural Language & Sound Analysis

NaturalLanguage: Smart Search for Free

Before you bundle a 500MB language model, check NLEmbedding. Apple ships massive language models within the OS. Convert text into numerical vectors to build "Smart Search" features—with zero storage cost.

import NaturalLanguage

guard let embedding = NLEmbedding.sentenceEmbedding(
    for: .english
) else { return }

let userQuery = "Groceries"
let noteTitle = "Milk, Eggs, Bread"

// These strings share NO words, but the OS maps them to nearby vectors.
let distance = embedding.distance(
    between: userQuery, and: noteTitle
)
// Cosine distance by default: lower means more closely related.
print("Distance: \(distance)")

SoundAnalysis: Real-Time Audio Monitoring

SoundAnalysis manages a Sliding Window so the model always sees complete audio events—critical for detecting sirens, car horns, or smoke detectors.

import SoundAnalysis

let request = try! SNClassifySoundRequest(
    classifierIdentifier: .version1
)

// In your SNResultsObserving conformance:
func request(_ request: SNRequest, didProduce result: SNResult) {
    guard let result = result as? SNClassificationResult,
          let top = result.classifications.first else { return }
    // e.g. "siren" at 0.99 confidence
    if top.identifier == "siren" && top.confidence > 0.9 {
        print("[Alert] Siren Detected!")
    }
}

⚙️ Core ML: The Custom Engine

When the Specialized Robots aren't enough—Apple doesn't ship a VNDetectHotDogRequest—you build your own tool with Core ML.

Use Core ML for domain-specific prediction problems: medical imaging, sports analysis, or content moderation that generic filters miss.

import CoreML

let config = MLModelConfiguration()
config.computeUnits = .all

let model = try MyCustomClassifier(configuration: config)
let input = MyCustomClassifierInput(image: pixelBuffer)
let output = try model.prediction(input: input)

print(output.classLabel) // "HotDog"

The trade-off: You leave Vision's pipeline and must manage data yourself, including the tricky MLMultiArray with its strided access patterns.
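To make that strided access concrete, here is a minimal sketch (the shape and values are illustrative, not from a real model) of indexing into an MLMultiArray by hand:

```swift
import CoreML

// Illustrative 1x3 output tensor (e.g. class probabilities).
let array = try MLMultiArray(shape: [1, 3], dataType: .float32)
array[0] = 0.1; array[1] = 0.7; array[2] = 0.2

// MLMultiArray is a flat buffer; element [i, j] maps to an
// offset computed from the array's strides.
func element(_ a: MLMultiArray, _ i: Int, _ j: Int) -> Float {
    let offset = i * a.strides[0].intValue
               + j * a.strides[1].intValue
    return a[offset].floatValue
}

print(element(array, 0, 1)) // 0.7
```

Vision hides exactly this bookkeeping from you; with raw Core ML outputs, you own it.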

🛠️ Create ML: Transfer Learning

Training from scratch requires millions of labeled images. Transfer Learning changes that equation: Create ML freezes a robust scene feature extractor shipped with the OS and only trains the final layer to map known patterns to your specific labels.

Think of it as buying a generic injection molding machine (that already knows how to melt and shape plastic) and just 3D printing a new mold for your toy. Create ML can produce a functional prototype with just tens of images per class in minutes.
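As a sketch of how little code this takes on a Mac (the directory layout and file names here are hypothetical), training an image classifier via transfer learning looks roughly like this:

```swift
import CreateML
import Foundation

// Hypothetical layout: Training/HotDog/*.jpg, Training/NotHotDog/*.jpg
let trainingDir = URL(fileURLWithPath: "Training")

// Create ML freezes the OS feature extractor and trains
// only the final classification layer on your labels.
let classifier = try MLImageClassifier(
    trainingData: .labeledDirectories(at: trainingDir)
)

// Export the result for bundling into your app.
try classifier.write(to: URL(fileURLWithPath: "HotDogClassifier.mlmodel"))
```

The folder names become the class labels, which is why tens of images per class can be enough for a working prototype.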

Create ML also supports On-Device Personalization via MLUpdateTask—updating a model on the user's phone with local data. Face ID and keyboard suggestions use exactly this pattern. The data never leaves the device.
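A hedged sketch of that personalization loop (the model name, URLs, and `localSamples` batch provider are hypothetical, and the bundled model must be compiled as updatable):

```swift
import CoreML

// Hypothetical updatable model compiled into the app bundle.
let modelURL = Bundle.main.url(
    forResource: "Personalizer", withExtension: "mlmodelc"
)!
let updatedURL = FileManager.default
    .urls(for: .applicationSupportDirectory, in: .userDomainMask)[0]
    .appendingPathComponent("Personalizer.mlmodelc")

// localSamples: an MLBatchProvider built from on-device data.
let task = try MLUpdateTask(
    forModelAt: modelURL,
    trainingData: localSamples,
    configuration: nil,
    completionHandler: { context in
        // Persist the personalized weights locally;
        // the training data never leaves the device.
        try? context.model.write(to: updatedURL)
    }
)
task.resume()
```

On the next launch you load the model from `updatedURL` if it exists, falling back to the bundled original.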

🎯 Key Takeaways

  • 1. Start at the top—Check Vision, NaturalLanguage, or SoundAnalysis before reaching for Core ML. The domain frameworks are free, optimized, and updated annually by Apple.
  • 2. Vision is a Pipeline, not a wrapper—It handles buffer conversion, cropping, scaling, and orientation using IOSurface Zero-Copy. Manual preprocessing burns the CPU.
  • 3. Core ML is for custom problems—Medical imaging, domain classifiers, or architectures Apple doesn't ship. You gain control but lose the pipeline convenience.
  • 4. Stay in the Onion—By using higher-level frameworks, your app automatically benefits from future ANE improvements without recompiling code.

About Sandboxed

Sandboxed is a podcast for iOS developers who want to add AI and machine learning features to their apps—without needing a PhD in ML.

Each episode, we take one practical ML topic—like Vision, Core ML, or Apple Intelligence—and walk through how it actually works on iOS, what you can build with it, and how to ship it this week.

If you want to build smarter iOS apps with on-device AI, subscribe to stay ahead of the curve.

Ready to dive deeper?

Next, we address the environment where these intelligent features live—the Sandbox, the Secure Enclave, and Private Cloud Compute.

